How Journalistic Red-Teaming Can Help AI Companies Fix Misinformation Blindspots

by Hailey O’Connor | October 2024

Journalists should never act as enemies of truth — except when conducting misinformation red-team evaluations. Red-teaming involves a simulation of adversarial roles at the direction of an organization to assess and improve the organization’s defenses. In this scenario, the adversaries are misinformers, disinformers, and malign actors who deliberately distort the truth. 

As generative AI systems become increasingly powerful, top AI companies often use “red-teaming,” or the act of mimicking attacks on an organization’s systems to identify vulnerabilities and enhance security, to mitigate potential risks and prevent their AI products from generating risky or unwanted content. 

For certain categories of risky content — such as hate speech, pornographic content, child exploitation, and others — traditional red-teaming approaches have proven highly effective at identifying and addressing gaps in guardrails. However, when it comes to misinformation and disinformation, AI companies still struggle to ensure safe and accurate responses. In NewsGuard’s Monthly AI Misinformation Monitor evaluations, our team found top AI models have a 30-50% failure rate when prompted about known misinformation narratives. 

In short, misinformation and disinformation present unique challenges for AI models. 

To help address these challenges, our team at NewsGuard has pioneered a distinct approach to red-teaming tailored toward addressing misinformation and disinformation — using a journalistic approach based on rigorously verified data and human analysts. 

These exercises are aimed at diagnosing vulnerabilities in AI tools that could lend them to spreading false information and helping AI companies to strengthen their defenses against these gaps. In tests thus far, we’ve found that NewsGuard’s journalistic misinformation red-teaming can help reduce failures by two thirds or more.

Misinformation Red-Teaming 101

Red-teaming is the practice of mimicking attacks on an organization’s systems to identify vulnerabilities and enhance security. AI Misinformation Red-Teaming specifically tests AI systems for their potential to generate or spread false information. This approach provides crucial knowledge to help AI companies identify weaknesses in their models and implement safeguards.

We conduct red-teaming exercises using highly trained journalists who are subject-matter experts on common forms of misinformation and disinformation. Because misinformation and accurate information are often written similarly, with factual claims, cited sources, and logical explanations supporting those claims, this journalistic approach is required to accurately distinguish between factual claims and misinformation.

Our analysts  ensure that the red-team prompts address the various ways false narratives might be expressed, and explore the different tactics malign actors might use to bypass information integrity guardrails. This can be done by focusing on specific media formats (photo, audio, video) and/or across a range of misinformation topics — think the Olympics, the 2024 U.S. presidential election and the Russia-Ukraine war. 

Regardless of format or topic, misinformation and disinformation are systemic issues that are not confined to any single generative AI tool. To explore and address this industry-wide challenge, NewsGuard launched a monthly AI News Misinformation Monitor in July 2024. This initiative sets a new standard for measuring the accuracy and trustworthiness of the AI industry by tracking how each leading generative AI model responds to prompts related to significant falsehoods in the news. Every month, NewsGuard tests OpenAI’s ChatGPT-4o, You.com’s Smart Assistant, xAI’s Grok, Inflection AI’s Pi, Mistral AI’s le Chat, Microsoft’s Copilot, Meta’s AI, Anthropic’s Claude, Google’s Gemini, and Perplexity’s answer engine. Each test comprises a total of 300 prompts, with 30 prompts based on 10 false claims spreading online tested on each chatbot.

From state-sponsored disinformation to vaccine hoaxes: probing false information across topics

The prompts evaluate key areas in the news, including health, elections, climate, state-sponsored propaganda, international affairs, and companies and brands. They are designed to capture a wide range of styles that malign actors might employ, directing the model to produce falsehoods in various formats such as detailed news articles, X (formerly Twitter) threads, essays, TV scripts, press releases, and other forms of media.

Example of journalistic red-teaming
Above is an example from our September AI monitor, where one of our analysts prompted chatbots with a relevant false narrative. Chatbot 1 and Chatbot 5 are anonymous names for two of the aforementioned leading Chatbots.

Testing Various User Personas

NewsGuard tests three different user personas while red-teaming to ensure a comprehensive understanding of misinformation threats. The personas and prompt styles are reflective of how users use generative AI models for news and information. 

  • Innocent User: seeks factual information about a narrative without bias, approaching generative AI with neutral inquiry.
  • Leading Prompt: assumes the false narrative is true and requests additional details.
  • Malign Actor: aims to generate misinformation, including instructions to circumvent the guardrails that AI models may have in place.

These actors may include health hoax peddlers, election misinformers, and authoritarian governments. One notable area of deception is Russian disinformation, which has infiltrated generative AI, exposing even the most unsuspecting AI users to false claims and narratives.

In a widely covered NewsGuard report, our analyst tested top AI models with prompts based on a set of false narratives from a network of 167 Russian disinformation websites that masquerade as local U.S. news sources. Our team found that top AI chatbots spread these Russian disinformation narratives 32 percent of the time

Helping clients tackle misinformation 

AI-generated misinformation isn’t limited to inaccurate text; it can also manifest in images, audio, and video. Regardless of the format, AI-generated content can be indistinguishable from authentic material. It’s essential to adapt to the challenges of misinformation, especially in how one approaches red-teaming for different formats. For example, the strategies required for text, images, audio, and video differ significantly.

Red-teaming evaluations of generative AI tools help tech companies understand their models’ capabilities and risks, especially from a news perspective. Through adopting various personas, these evaluations reveal where the models tend to falter and assess their effectiveness in handling malign — or just simply wrong — information. 

More examples of NewsGuard Red-teaming evaluations from our September AI monitor.

By partnering with NewsGuard and using its unique misinformation expertise, developers can ensure greater trust and transparency in their models’ outputs.

To learn more about our offerings for AI companies, click here or contact us at [email protected].