Image by Natalie Adams

DeepSeek Debuts with 83 Percent ‘Fail Rate’ in NewsGuard’s Chatbot Red Team Audit

The new Chinese AI tool finished tied for 10th out of 11 industry players

A full version of this report is available through NewsGuard’s Reality Check.

By Macrina Wang, Charlene Lin, and McKenzie Sadeghi | Published on Jan. 29, 2025

Chinese artificial intelligence firm DeepSeek’s new chatbot failed to provide accurate information about news and information topics 83 percent of the time, tying it for 10th place out of 11 when compared with its leading Western competitors, a NewsGuard audit found. The chatbot debunked provably false claims only 17 percent of the time.

Hangzhou-based DeepSeek was rolled out to the public on Jan. 20. Within days, the chatbot became the most downloaded app in Apple’s App Store, spurring a drop in U.S. tech stocks and a frenzy over the evolving AI arms race between China and the U.S.

DeepSeek claims it performs on par with its U.S. rival OpenAI despite reporting that it spent only $5.6 million on training, a fraction of the hundreds of millions reportedly spent by its competitors. DeepSeek has also drawn attention for being open source, meaning its underlying code is available for anyone to use or modify.

In light of DeepSeek’s launch, NewsGuard applied the same prompts it used in its December 2024 AI Monthly Misinformation audit to the Chinese chatbot, to assess how DeepSeek performed against its competitors on prompts users might pose about topics in the news. NewsGuard’s monthly AI audit report uses a sampling of 10 Misinformation Fingerprints, NewsGuard’s proprietary database of top provably false claims in the news and their debunks, on subjects ranging from politics and health to business and international affairs.

NewsGuard found that with news-related prompts, DeepSeek repeated false claims 30 percent of the time and provided non-answers 53 percent of the time, resulting in an 83 percent fail rate. NewsGuard’s December 2024 audit of the 10 leading chatbots (OpenAI’s ChatGPT-4o, You.com’s Smart Assistant, xAI’s Grok-2, Inflection’s Pi, Mistral’s le Chat, Microsoft’s Copilot, Meta AI, Anthropic’s Claude, Google’s Gemini 2.0, and Perplexity’s answer engine) found that they had an average fail rate of 62 percent. DeepSeek’s fail rate places the chatbot in a tie for 10th place among the 11 models tested.

(While the overall percentages for these 10 chatbots are included below, the individual AI models are not publicly named because of the systemic nature of the problem. DeepSeek is named in order to compare this new entrant’s performance to that of the overall industry. Future audits will include all 11 AI models without naming them individually.)

On Jan. 28, 2025, NewsGuard sent two emails to DeepSeek seeking comment on these findings, but did not receive a response.