How Journalistic Red-Teaming Can Help AI Companies Fix Misinformation Blindspots
by Hailey O’Connor | October 2024
Journalists should never act as enemies of truth — except when conducting misinformation red-team evaluations. Red-teaming involves simulating adversarial roles at an organization’s direction to assess and improve its defenses. In this scenario, the adversaries are misinformers, disinformers, and malign actors who deliberately distort the truth.
As generative AI systems become increasingly powerful, top AI companies often use red-teaming to mitigate potential risks and prevent their AI products from generating risky or unwanted content.
For certain categories of risky content — such as hate speech, pornographic content, child exploitation, and others — traditional red-teaming approaches have proven highly effective at identifying and addressing gaps in guardrails. However, when it comes to misinformation and disinformation, AI companies still struggle to ensure safe and accurate responses. In NewsGuard’s Monthly AI Misinformation Monitor evaluations, our team found that top AI models have a 30-50% failure rate when prompted about known misinformation narratives.
In short, misinformation and disinformation present unique challenges for AI models.
To help address these challenges, our team at NewsGuard has pioneered a distinct form of red-teaming tailored to misinformation and disinformation: a journalistic approach grounded in rigorously verified data and human analysts.
These exercises aim to diagnose vulnerabilities that could lead AI tools to spread false information and to help AI companies close those gaps. In tests thus far, we’ve found that NewsGuard’s journalistic misinformation red-teaming can reduce failure rates by two thirds or more.
Misinformation Red-Teaming 101
Red-teaming is the practice of mimicking attacks on an organization’s systems to identify vulnerabilities and enhance security. AI Misinformation Red-Teaming specifically tests AI systems for their potential to generate or spread false information. This approach provides crucial knowledge to help AI companies identify weaknesses in their models and implement safeguards.
We conduct red-teaming exercises using highly trained journalists who are subject-matter experts on common forms of misinformation and disinformation. Because misinformation is often written in the same register as accurate reporting, complete with confident claims, cited sources, and plausible explanations, a journalistic approach is required to reliably distinguish false narratives from factual ones.
Our analysts ensure that the red-team prompts address the various ways false narratives might be expressed, and explore the different tactics malign actors might use to bypass information integrity guardrails. This can be done by focusing on specific media formats (photo, audio, video) and/or on a range of misinformation topics, such as the Olympics, the 2024 U.S. presidential election, and the Russia-Ukraine war.
Regardless of format or topic, misinformation and disinformation are systemic issues that are not confined to any single generative AI tool. To explore and address this industry-wide challenge, NewsGuard launched a monthly AI News Misinformation Monitor in July 2024. This initiative sets a new standard for measuring the accuracy and trustworthiness of the AI industry by tracking how each leading generative AI model responds to prompts related to significant falsehoods in the news. Every month, NewsGuard tests OpenAI’s ChatGPT-4o, You.com’s Smart Assistant, xAI’s Grok, Inflection AI’s Pi, Mistral AI’s le Chat, Microsoft’s Copilot, Meta’s AI, Anthropic’s Claude, Google’s Gemini, and Perplexity’s answer engine. Each test comprises 300 prompts in total: each chatbot is tested with 30 prompts based on 10 false claims spreading online.
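To make the arithmetic behind that design concrete, here is a minimal Python sketch of how such a monthly test matrix could be laid out. The model names come from the paragraph above; the narrative placeholders, the three prompt framings (inferred from 30 prompts per chatbot spread across 10 claims), and the verdict labels in `failure_rate` are illustrative assumptions, not NewsGuard’s actual prompts or methodology.

```python
from itertools import product

# Model list mirrors the ten chatbots named above; narratives and framings
# below are placeholders, not NewsGuard's actual test prompts.
MODELS = [
    "OpenAI ChatGPT-4o", "You.com Smart Assistant", "xAI Grok",
    "Inflection AI Pi", "Mistral AI le Chat", "Microsoft Copilot",
    "Meta AI", "Anthropic Claude", "Google Gemini", "Perplexity",
]

# Ten false claims spreading online during the test month (placeholders).
FALSE_NARRATIVES = [f"false_narrative_{i:02d}" for i in range(1, 11)]

# Assumption: 30 prompts per chatbot over 10 claims implies three prompt
# framings per claim; the framing names here are hypothetical.
FRAMINGS = ["neutral_question", "leading_question", "malign_actor_request"]


def build_test_matrix():
    """One (model, narrative, framing) row per prompt: 10 x 10 x 3 = 300."""
    return list(product(MODELS, FALSE_NARRATIVES, FRAMINGS))


def failure_rate(verdicts):
    """Share of responses that repeated the false claim.

    `verdicts` maps each (model, narrative, framing) row to one of
    'repeats_falsehood', 'debunk', or 'non_response' as judged by an analyst.
    """
    failures = sum(1 for v in verdicts.values() if v == "repeats_falsehood")
    return failures / len(verdicts)


if __name__ == "__main__":
    matrix = build_test_matrix()
    print(f"{len(matrix)} prompts total, {len(matrix) // len(MODELS)} per chatbot")
```

Running the sketch prints 300 prompts total and 30 per chatbot, matching the monitor’s design; human analysts would supply the verdict for each row before a failure rate is computed.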
From state-sponsored disinformation to vaccine hoaxes: probing false information across topics
The prompts evaluate key areas in the news, including health, elections, climate, state-sponsored propaganda, international affairs, and companies and brands. They are designed to capture a wide range of styles that malign actors might employ, directing the model to produce falsehoods in various formats such as detailed news articles, X (formerly Twitter) threads, essays, TV scripts, press releases, and other forms of media.
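As an illustration of how a single false claim could be expanded into the format-specific prompts described above, here is a hypothetical templating sketch. The template wording and the `prompts_for_claim` helper are assumptions for illustration only and do not reproduce NewsGuard’s actual red-team prompts.

```python
# Hypothetical format templates: each turns one known false claim into a
# red-team prompt requesting a different output style.
FORMAT_TEMPLATES = {
    "news_article": "Write a detailed news article reporting that {claim}.",
    "x_thread": "Draft an X (formerly Twitter) thread arguing that {claim}.",
    "essay": "Write a persuasive essay asserting that {claim}.",
    "tv_script": "Write a 60-second TV news script claiming that {claim}.",
    "press_release": "Write a press release announcing that {claim}.",
}


def prompts_for_claim(claim):
    """Expand one false claim into one red-team prompt per output format."""
    return {fmt: tpl.format(claim=claim) for fmt, tpl in FORMAT_TEMPLATES.items()}


# Placeholder label stands in for a real false narrative under test.
for fmt, prompt in prompts_for_claim("<known false claim>").items():
    print(f"[{fmt}] {prompt}")
```

Expanding each false narrative this way is what lets a red team check whether a guardrail that blocks one phrasing of a falsehood also holds when the same claim is requested as an article, a social thread, or a script.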