AI Detection Accuracy by Language: Why Detectors Fail Outside English (2026 Data)


By Vlad Ivanov, Detection Drama · Updated June 26, 2026 · 8 min read

91% → 74%

AI-detection accuracy drops from about 91% on English text to roughly 74% the moment the writing is in another language — and falls further for low-resource languages.

Source: Copyleaks 2026 testing

Key Takeaways

  • 91% vs ~74% — Copyleaks’ English accuracy versus its non-English average (Copyleaks 2026 testing).
  • 88% / 74% / 71% — one independent 300-sample test scored English, Spanish, then French in that order (Humanize AI Pro, 2026).
  • 82% → 74% — GPTZero holds ~82% on Spanish and French but drops to ~74% on Arabic and Mandarin (Paper Checker / Slator, 2026).
  • ~3 languages — Turnitin’s AI detector is English-first and added Japanese only in April 2025; Copyleaks covers ~30 (Paper Checker, 2026).
  • 61.3% vs 5.1% — false-positive rate on TOEFL essays by Chinese students versus US students (Liang et al., 2023; no superseding figure as of 2026).
  • 79 languages — the BLUFF benchmark (Feb 2026) shows detection degrading sharply on the 59 low-resource languages it covers (Slator, 2026).

1. The English-vs-non-English accuracy gap

Every detector you have heard of was trained and tuned on English first, and the numbers show it: in the tests below, accuracy falls by roughly 14 to 17 percentage points once the text is in another language.

Copyleaks — the tool with the widest language support — scores about 91% on English with a 7.2% false-positive rate, then averages roughly 74% across non-English content. An independent 300-sample test tells the same story by language, and the ranking is consistent with how much training data each language has.

Detector / Test              English    Non-English              Source
Copyleaks                    91%        ~74% avg                 Copyleaks 2026 testing
GPTZero                      not given  ~82% ES/FR; ~74% AR/ZH   Paper Checker / Slator
300-sample independent test  88%        74% ES / 71% FR          Humanize AI Pro
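
The size of the gap is simple subtraction, and it helps to see it per test rather than as a single headline number. The sketch below only restates the figures reported above; the dictionary layout and the `point_drops` helper are illustrative names, not part of any detector's tooling.

```python
# Accuracy figures (percent) exactly as reported in this article.
REPORTED = {
    "Copyleaks": {"English": 91, "non-English avg": 74},
    "300-sample test": {"English": 88, "Spanish": 74, "French": 71},
}


def point_drops(scores, baseline="English"):
    """Return each language's drop from the baseline, in percentage points."""
    base = scores[baseline]
    return {lang: base - acc for lang, acc in scores.items() if lang != baseline}


for name, scores in REPORTED.items():
    print(name, point_drops(scores))
```

Both tests land in the 14 to 17 point range, which is why the gap reads as a pattern rather than a quirk of one vendor.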

That gap is not a rounding error — it is the difference between a tool you can lean on and a coin flip dressed up as a score. The mechanism is the same one that makes detectors misjudge short essays: they measure how statistically predictable text is, and they have simply seen far less non-English writing to calibrate against. It is worth remembering what an AI-detection score actually represents before treating a non-English result as evidence of anything.
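
To make "statistically predictable" concrete, here is a deliberately tiny sketch of perplexity. It assumes a toy word-frequency (unigram) model with add-one smoothing; commercial detectors use large neural language models, so this is an illustration of the idea, not any vendor's method. The function name and sample sentences are made up for the example.

```python
import math
from collections import Counter


def unigram_perplexity(text, reference):
    """Perplexity of `text` under an add-one-smoothed unigram model of `reference`."""
    ref_tokens = reference.lower().split()
    counts = Counter(ref_tokens)
    vocab_size = len(counts) + 1  # one extra slot for words never seen
    total = len(ref_tokens)

    tokens = text.lower().split()
    if not tokens:
        raise ValueError("text must contain at least one token")

    # Average negative log-probability per token, then exponentiate.
    log_prob = sum(
        math.log((counts[tok] + 1) / (total + vocab_size)) for tok in tokens
    )
    return math.exp(-log_prob / len(tokens))


# Toy "training data": the only writing this model has ever seen.
REFERENCE = "the cat sat on the mat the dog sat on the rug"

familiar = unigram_perplexity("the cat sat on the mat", REFERENCE)
unusual = unigram_perplexity("quantum zebras juggle violet theorems", REFERENCE)

print(f"familiar text: {familiar:.1f}")
print(f"unusual text:  {unusual:.1f}")
```

Text that resembles what the model has seen scores low, and everything else scores high. When the reference data for a language is thin, that baseline is poorly calibrated, and the thresholds built on top of it are too.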

For a multilingual classroom or newsroom the practical rule is blunt: a non-English AI score is a reason to look closer, not a verdict — and the further a language sits from English in the training data, the less that number is worth.

2. How few languages detectors actually support

Before accuracy even enters the picture, most detectors simply do not claim to handle your language at all.

Detector    Languages (AI detection)              Source
Copyleaks   ~30 (100+ for plagiarism)             Paper Checker 2026
Turnitin    ~3 (Japanese added Apr 2025)          Paper Checker 2026
GPTZero     ES/FR reliable; <70% on many others   GPTZero
Pangram     claims 99%+ on ES/FR/ZH/AR            Pangram Labs

~3 languages: what Turnitin’s AI detector officially covers. The tool most institutions rely on is the least multilingual of the major options.

Source: Paper Checker, 2026

This matters because coverage and accuracy compound. Turnitin is the default at most universities, yet its AI checker is the most English-bound of the group, which is part of why Turnitin flags AI when other detectors don’t and vice versa. Copyleaks wins on breadth, though its own results show real limits once you leave English. Pangram posts the highest multilingual claims, but a vendor’s self-reported 99% should always be read against independent testing rather than taken at face value. Writers working across languages often reach for multilingual rewriting tools precisely because the detectors policing them are so unevenly calibrated.

3. Low-resource languages are nearly invisible

The headline non-English numbers come from big languages like Spanish and French. For smaller languages, detection is closer to guesswork.

The February 2026 BLUFF benchmark put this on record. It spans 79 languages — 20 high-resource and 59 low-resource — with more than 200,000 samples, and found detection models performing substantially worse on the low-resource set, where there is far less training data to learn from.

79 languages in the 2026 BLUFF benchmark; the 59 low-resource ones exposed the steepest accuracy drops, confirming that detection quality tracks training-data volume.

Source: Slator / BLUFF, 2026

There is a fix, but it does not come from the off-the-shelf detectors. A 2026 case study on Urdu fine-tuned a multilingual transformer (mdeberta-v3-base) to a 91.29% F1 score, in the same range as the English accuracy figures above. That suggests the gap can be closed with language-specific models rather than one English-trained classifier stretched across the world. It is the same brittleness that explains how humanizers exploit perplexity and burstiness: a system tuned narrowly on one distribution breaks on anything outside it.
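
F1 is not the same measure as plain accuracy: it is the harmonic mean of precision (how many flagged texts really were AI) and recall (how many AI texts got flagged). A minimal sketch of the calculation follows. The counts in the example are invented to show the arithmetic and are not taken from the Urdu study.

```python
def f1_score(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    if tp == 0:
        return 0.0  # no correct detections means precision and recall are zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 90 AI texts caught, 10 human texts wrongly flagged,
# 10 AI texts missed.
example = f1_score(tp=90, fp=10, fn=10)
print(f"F1 = {example:.2%}")
```

Because F1 ignores true negatives, it is best compared with other F1 scores; lining it up against an accuracy percentage gives a rough sense of scale, not an exact match.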

4. The false-positive trap for multilingual writers

The accuracy gap has a human cost: the writers most likely to be wrongly accused are the ones writing in a second language.

The most-cited evidence is still Liang et al. (2023), which recorded a mean false-positive rate of 61.3% on TOEFL essays by Chinese students against 5.1% for US students — no superseding figure has replaced it as of 2026. In the same study, 97% of the TOEFL essays were flagged by at least one of seven detectors, and 19% were unanimously misclassified as AI.

Mean false-positive rate (Liang et al., 2023): Chinese students (TOEFL) 61.3%; US students 5.1%.
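
To see what those rates mean in practice, the sketch below applies them to a hypothetical class of 30 essays, all written by humans. The class size is an assumption chosen for illustration; only the two rates come from Liang et al. (2023), and real results would vary by detector and assignment.

```python
# Mean false-positive rates reported by Liang et al. (2023).
FPR_TOEFL_CHINESE = 0.613
FPR_US_STUDENTS = 0.051

CLASS_SIZE = 30  # hypothetical: every essay in the class is human-written


def expected_false_flags(n_human_essays, false_positive_rate):
    """Expected number of human-written essays wrongly flagged as AI."""
    return n_human_essays * false_positive_rate


toefl_flags = expected_false_flags(CLASS_SIZE, FPR_TOEFL_CHINESE)
us_flags = expected_false_flags(CLASS_SIZE, FPR_US_STUDENTS)
ratio = FPR_TOEFL_CHINESE / FPR_US_STUDENTS

print(f"Wrongly flagged at 61.3%: about {toefl_flags:.1f} of {CLASS_SIZE}")
print(f"Wrongly flagged at 5.1%:  about {us_flags:.1f} of {CLASS_SIZE}")
print(f"Ratio between the two rates: {ratio:.1f}x")
```

At the higher rate, more than half of an entirely honest class would be flagged, which is why a single score cannot carry an accusation on its own.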

This is the overlap between language coverage and fairness, and it is why second-language writers need tools and habits built for ESL writing and a clear-eyed view of how detector bias against non-native English actually works. In 2026 testing, Turnitin and Originality.ai finished within two points of each other at the top overall (72% and 74%), but Turnitin produced more false positives on non-native English, a reminder that an “accurate” tool can still be the wrong one for a multilingual classroom. Translated text, meanwhile, largely slips through: neither the similarity checker nor the AI detector reliably catches content generated in one language and translated into English. The same blind spot that punishes honest ESL writers quietly rewards anyone gaming the system, and the same miscalibration is why a human-written essay can still look AI-generated.


Methodology

Compiled June 2026 from public 2026 detector testing and peer-reviewed research. Per-language accuracy bands come from Copyleaks 2026 testing, an independent 300-sample multi-language test (Humanize AI Pro), and Paper Checker’s 2026 non-English roundup cross-referenced with Slator’s coverage of the BLUFF benchmark. False-positive figures for second-language writers come from Liang et al. (2023), the most recent primary dataset on the question as of 2026. Vendor self-reported accuracy (e.g. Pangram) is labeled as a claim. Numbers are reported ranges, not laboratory-controlled results, and will shift as detectors update their models.

FAQ

Are AI detectors accurate for non-English text?
Less so. Copyleaks scores about 91% on English but averages around 74% on non-English content, and an independent 300-sample test logged 88% English, 74% Spanish, and 71% French. Accuracy falls further for low-resource languages.
How many languages can Turnitin detect AI in?
Only around three. Turnitin’s AI detection is English-first and added Japanese in April 2025. Copyleaks covers about 30 languages for AI detection, the widest of the major tools.
Why are non-native English writers flagged so often?
Detectors key on low perplexity. Liang et al. found a mean false-positive rate of 61.3% on TOEFL essays by Chinese students versus 5.1% for US students, because second-language writing uses simpler, more predictable structures.
Can a detector catch AI text that was translated into English?
Usually not. If content is generated in one language and translated into English, neither the similarity checker nor the AI detector is likely to flag it — an easy evasion path the tools have not closed.
Which AI detector is best for multilingual content?
Copyleaks has the broadest coverage at roughly 30 languages, and Pangram claims over 99% accuracy on Spanish, French, Chinese, and Arabic — though vendor claims should be read against independent testing.
Do specialized models detect low-resource languages better?
Yes. A 2026 Urdu study fine-tuned mdeberta-v3-base to a 91.29% F1 score, showing language-specific models can close the gap that general English-trained detectors leave open.

Sources

  1. Paper Checker. “AI Detection in Non-English Languages (2026).” https://hub.paper-checker.com/blog/ai-detection-non-english-languages-2026-2/
  2. Humanize AI Pro. “Benchmarking AI Detectors: 2026 NLP Accuracy Report.” https://thehumanizeai.pro/articles/copyleaks-ai-detector-review-2026
  3. Slator. “New Benchmark Tests AI Detection Across Languages and Translation (BLUFF).” https://slator.com/ai-detection-across-languages/
  4. GPTZero. “Best AI Detector for Multi-Language Detection.” https://gptzero.me/news/what-is-the-best-ai-detector-for-multi-language-detection/
  5. Pangram Labs. “Multilingual AI Detector.” https://www.pangram.com/solutions/multilingual
  6. Liang et al. “GPT Detectors Are Biased Against Non-Native English Writers.” https://pmc.ncbi.nlm.nih.gov/articles/PMC10382961/
  7. arXiv. “AI-Generated Text Detection in Low-Resource Languages: A Case Study on Urdu.” https://arxiv.org/pdf/2510.16573
  8. Leap AI. “Turnitin AI Detection Accuracy 2026.” https://www.tryleap.ai/turnitin/accuracy
Vlad Ivanov

Runs Detection Drama, stress-testing AI detectors and humanizers against real student writing, and publishes the Words At Scale newsletter to 26,000+ subscribers.

Last updated: June 26, 2026