AI Detection Accuracy by Language: Why Detectors Fail Outside English (2026 Data)
91% → 74%
AI-detection accuracy drops from about 91% on English text to roughly 74% the moment the writing is in another language — and falls further for low-resource languages.
Source: Copyleaks 2026 testing
Key Takeaways
- 91% vs ~74% — Copyleaks’ English accuracy versus its non-English average (Copyleaks 2026 testing).
- 88% / 74% / 71% — one independent 300-sample test scored English, Spanish, then French in that order (Humanize AI Pro, 2026).
- 82% → 74% — GPTZero holds ~82% on Spanish and French but drops to ~74% on Arabic and Mandarin (Paper Checker / Slator, 2026).
- ~3 languages — Turnitin’s AI detector is English-first and added Japanese only in April 2025; Copyleaks covers ~30 (Paper Checker, 2026).
- 61.3% vs 5.1% — false-positive rate on TOEFL essays by Chinese students versus US students (Liang et al., 2023; no superseding figure as of 2026).
- 79 languages — the BLUFF benchmark (Feb 2026) shows detection degrading sharply on the 59 low-resource languages it covers (Slator, 2026).
1. The English-vs-non-English accuracy gap
Every detector you have heard of was trained and tuned on English first, and the numbers show it: accuracy falls 10 to 20 points the moment the text is in another language.
Copyleaks — the tool with the widest language support — scores about 91% on English with a 7.2% false-positive rate, then averages roughly 74% across non-English content. An independent 300-sample test tells the same story by language, and the ranking is consistent with how much training data each language has.
| Detector / Test | English | Non-English | Source |
|---|---|---|---|
| Copyleaks | 91% | ~74% avg | Copyleaks 2026 testing |
| GPTZero | not reported | ~82% ES/FR; ~74% AR/ZH | Paper Checker / Slator |
| 300-sample independent test | 88% | 74% ES / 71% FR | Humanize AI Pro |
That gap is not a rounding error: at roughly 74% accuracy, a detector misjudges about one text in four. The mechanism is the same one that makes detectors misjudge short essays. They measure how statistically predictable text is, and they have simply seen far less non-English writing to calibrate against. It is worth remembering what an AI-detection score actually represents before treating a non-English result as evidence of anything.
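The "statistical predictability" idea can be illustrated with perplexity, the exponential of the average negative log-probability a language model assigns to each token. The sketch below is a toy, not any vendor's method: it assumes a tiny add-one-smoothed unigram model standing in for the large language models real detectors rely on, and the corpus and sample sentences are made up for illustration. It shows only the calibration problem: text in a language the model has barely seen gets a very different score from text in the language it was trained on.

```python
import math
from collections import Counter

def train_unigram(corpus_tokens):
    """Build add-one-smoothed unigram probabilities from a token list.

    Returns (probs, unseen), where `unseen` is the probability mass
    reserved for any token that never appeared in the corpus.
    """
    counts = Counter(corpus_tokens)
    vocab_size = len(counts) + 1          # +1 bucket for unseen tokens
    denom = len(corpus_tokens) + vocab_size
    probs = {tok: (c + 1) / denom for tok, c in counts.items()}
    unseen = 1 / denom
    return probs, unseen

def perplexity(tokens, probs, unseen):
    """exp(average negative log-probability per token)."""
    nll = -sum(math.log(probs.get(t, unseen)) for t in tokens)
    return math.exp(nll / len(tokens))

# Hypothetical training data: the model has only ever seen English.
corpus = "the cat sat on the mat the dog sat on the rug".split()
probs, unseen = train_unigram(corpus)

familiar_ppl = perplexity("the cat sat on the mat".split(), probs, unseen)
foreign_ppl = perplexity("el gato se sentó".split(), probs, unseen)

print(f"in-language perplexity:     {familiar_ppl:.2f}")
print(f"out-of-language perplexity: {foreign_ppl:.2f}")
```

A threshold tuned on the first kind of score says little about the second, which is one plausible reading of why reported accuracy drops outside English.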
For a multilingual classroom or newsroom the practical rule is blunt: a non-English AI score is a reason to look closer, not a verdict — and the further a language sits from English in the training data, the less that number is worth.
2. How few languages detectors actually support
Before accuracy even enters the picture, most detectors simply do not claim to handle your language at all.
| Detector | Languages (AI detection) | Source |
|---|---|---|
| Copyleaks | ~30 (100+ plagiarism) | Paper Checker 2026 |
| Turnitin | ~3 (added JA Apr 2025) | Paper Checker 2026 |
| GPTZero | ES/FR reliable; <70% many others | GPTZero |
| Pangram | claims 99%+ ES/FR/ZH/AR | Pangram Labs |
This matters because coverage and accuracy compound. Turnitin is the default at most universities, yet its AI checker is the most English-bound of the group, which is part of why Turnitin flags AI when other detectors don’t and vice versa. Copyleaks wins on breadth, though its own results show real limits once you leave English. Pangram posts the highest multilingual claims, but a vendor’s self-reported 99% should always be read against independent testing rather than taken at face value. Writers working across languages often reach for multilingual rewriting tools precisely because the detectors policing them are so unevenly calibrated.
3. Low-resource languages are nearly invisible
The headline non-English numbers come from big languages like Spanish and French. For smaller languages, detection is closer to guesswork.
The February 2026 BLUFF benchmark put this on record. It spans 79 languages — 20 high-resource and 59 low-resource — with more than 200,000 samples, and found detection models performing substantially worse on the low-resource set, where there is far less training data to learn from.
There is a fix, but it does not come from off-the-shelf detectors. A 2026 case study on Urdu fine-tuned a multilingual transformer (mdeberta-v3-base) to a 91.29% F1 score, in line with the English figures above, which suggests the gap can be closed with language-specific models rather than one English-trained classifier stretched across the world. This is the same brittleness that explains how humanizers exploit perplexity and burstiness: a system tuned narrowly on one distribution breaks on anything outside it.
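The Urdu result is reported as an F1 score rather than plain accuracy, so it helps to see how the two differ. F1 is the harmonic mean of precision and recall. The sketch below computes it from confusion counts; the counts are hypothetical, since the study's own confusion matrix is not quoted here.

```python
def f1_score(tp, fp, fn):
    """F1 for the positive ("AI-written") class from confusion counts.

    tp: AI texts correctly flagged
    fp: human texts wrongly flagged
    fn: AI texts missed
    """
    if tp == 0:
        return 0.0  # no correct flags means precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical detector A: balanced errors.
balanced_f1 = f1_score(tp=90, fp=10, fn=10)

# Hypothetical detector B: flags everything, so it misses nothing
# but wrongly accuses every human writer in the sample.
flag_all_f1 = f1_score(tp=50, fp=50, fn=0)

print(f"balanced detector F1: {balanced_f1:.3f}")
print(f"flag-everything F1:   {flag_all_f1:.3f}")
```

Because F1 penalizes false accusations and missed detections together, a detector cannot score well simply by flagging everything.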
4. The false-positive trap for multilingual writers
The accuracy gap has a human cost: the writers most likely to be wrongly accused are the ones writing in a second language.
The most-cited evidence is still Liang et al. (2023), which recorded a mean false-positive rate of 61.3% on TOEFL essays by Chinese students against 5.1% for US students — no superseding figure has replaced it as of 2026. In the same study, 97% of the TOEFL essays were flagged by at least one of seven detectors, and 19% were unanimously misclassified as AI.
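To make those rates concrete, the arithmetic below converts them into expected wrongful flags for a class in which every essay is human-written. The two rates come from Liang et al. (2023) as quoted above; the class size of 30 is an assumption chosen only for illustration.

```python
def expected_false_flags(n_human_essays, false_positive_rate):
    """Expected number of human-written essays wrongly flagged as AI."""
    return n_human_essays * false_positive_rate

CLASS_SIZE = 30      # hypothetical class, all essays human-written
FPR_TOEFL = 0.613    # mean false-positive rate on TOEFL essays (Liang et al., 2023)
FPR_US = 0.051       # mean false-positive rate on US student essays (same study)

toefl_flags = expected_false_flags(CLASS_SIZE, FPR_TOEFL)
us_flags = expected_false_flags(CLASS_SIZE, FPR_US)
rate_ratio = FPR_TOEFL / FPR_US

print(f"expected wrongful flags, TOEFL-style essays: {toefl_flags:.1f} of {CLASS_SIZE}")
print(f"expected wrongful flags, US student essays:  {us_flags:.1f} of {CLASS_SIZE}")
print(f"ratio of the two rates: {rate_ratio:.1f}x")
```

Under these assumptions, well over half of the second-language class would be wrongly flagged, against one or two students in the other group.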
This is the overlap between language coverage and fairness, and it is why second-language writers need tools and habits built for ESL writing and a clear-eyed view of how detector bias against non-native English actually works. In 2026 testing, Turnitin and Originality.ai finished close together at the top overall (72% and 74%, respectively), but Turnitin produced more false positives on non-native English, a reminder that an “accurate” tool can still be the wrong one for a multilingual classroom. Translated text, meanwhile, largely slips through: neither the similarity checker nor the AI detector reliably catches content generated in one language and translated into English. The same blind spot that punishes honest ESL writers quietly rewards anyone gaming the system, which is exactly why a human-written essay can still look AI-generated.
Sources
- Paper Checker. “AI Detection in Non-English Languages (2026).” https://hub.paper-checker.com/blog/ai-detection-non-english-languages-2026-2/
- Humanize AI Pro. “Benchmarking AI Detectors: 2026 NLP Accuracy Report.” https://thehumanizeai.pro/articles/copyleaks-ai-detector-review-2026
- Slator. “New Benchmark Tests AI Detection Across Languages and Translation (BLUFF).” https://slator.com/ai-detection-across-languages/
- GPTZero. “Best AI Detector for Multi-Language Detection.” https://gptzero.me/news/what-is-the-best-ai-detector-for-multi-language-detection/
- Pangram Labs. “Multilingual AI Detector.” https://www.pangram.com/solutions/multilingual
- Liang et al. “GPT Detectors Are Biased Against Non-Native English Writers.” https://pmc.ncbi.nlm.nih.gov/articles/PMC10382961/
- arXiv. “AI-Generated Text Detection in Low-Resource Languages: A Case Study on Urdu.” https://arxiv.org/pdf/2510.16573
- Leap AI. “Turnitin AI Detection Accuracy 2026.” https://www.tryleap.ai/turnitin/accuracy
Last updated: June 26, 2026
