AI Detector Accuracy After Humanization: Bypass Rate Statistics by Tool (2026)
Source: RAID Benchmark, ACL 2024 (Dugan et al., University of Pennsylvania)
Key Findings at a Glance
- 20–63%: documented detection accuracy on paraphrased/humanized AI text across major detectors (RAID benchmark, ACL 2024).
- 39.5% → 22.1%: drop in average detector accuracy when students applied basic editing (paraphrasing, sentence variation) in Perkins et al. (2024).
- All 14 tools <80% accurate on raw AI text, with detection worsening under paraphrasing (Weber-Wulff et al., 2023, analysis of 14 detection tools).
- 61.22%: average false positive rate across 7 detectors on 91 verified human-written TOEFL essays (Liang et al., 2023, PNAS).
- ~47%: average bypass rate for general-purpose paraphrasers (e.g., QuillBot) across major detectors; dedicated humanizers report 64–75% in 2024–2025 tests.
- Copyleaks performed best on humanized text in a 2025 DeepSeek humanization study, detecting 71% of humanized samples vs. GPTZero’s 52%.
- Turnitin’s Feb 2026 update split AI Writing Reports into “AI-generated” and “AI-paraphrased” categories, making it the first major detector to distinguish post-humanization content explicitly.
Every major AI detector publishes accuracy claims between 95% and 99%. Those numbers are real — measured against raw, unmodified AI output on clean test sets. The problem is that almost no real-world submission arrives raw.
When content passes through even a basic paraphraser before submission, detection accuracy collapses. The RAID benchmark — the largest independent evaluation ever conducted on AI detection, presented at ACL 2024 — documented accuracy on paraphrased AI text falling to 20–63% across major detectors. Some tools dropped below the accuracy of a coin flip when false positive rates were constrained to realistic levels.
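To make the phrase "when false positive rates were constrained" concrete, here is a minimal Python sketch of measuring detection rate at a fixed false positive budget. The scores below are hypothetical values invented for illustration, and the function names are assumptions; this is not the RAID benchmark's actual evaluation code.

```python
def threshold_for_fpr(human_scores, max_fpr):
    """Pick a score threshold so that at most max_fpr of human texts
    score strictly above it (and would therefore be flagged)."""
    ranked = sorted(human_scores, reverse=True)
    # Number of human texts we are allowed to flag (small epsilon guards float rounding).
    allowed = int(max_fpr * len(ranked) + 1e-9)
    if allowed >= len(ranked):
        return float("-inf")
    return ranked[allowed]


def detection_rate(ai_scores, threshold):
    """Share of AI texts whose detector score exceeds the threshold."""
    return sum(score > threshold for score in ai_scores) / len(ai_scores)


# Hypothetical detector scores in [0, 1]; higher means "more likely AI".
human_scores = [0.05, 0.10, 0.20, 0.30, 0.92]
ai_raw = [0.93, 0.95, 0.97, 0.99]
ai_paraphrased = [0.40, 0.60, 0.85, 0.94]

strict = threshold_for_fpr(human_scores, max_fpr=0.0)
print("raw:", detection_rate(ai_raw, strict))
print("paraphrased:", detection_rate(ai_paraphrased, strict))
```

In this toy data a single high-scoring human essay forces the threshold up, and paraphrased text that scores only slightly lower than raw output mostly slips underneath it. That is the mechanism by which a detector can look strong on raw text yet weak once the false positive budget is tight.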
This page compiles every published statistic on post-humanization detection rates — the numbers that matter if you’re an institution trying to catch AI-assisted work, or a student trying to understand what tools can actually see.
1. Raw vs. Humanized AI Text: What the Data Shows
Detectors perform substantially better on raw AI output than on humanized or paraphrased content. The gap between the two scenarios is where most real-world AI use falls — students and professionals rarely submit unmodified GPT output.
| Test Condition | Avg Detection Rate | Source |
|---|---|---|
| Raw AI text (no modification) | 85–95% | Multiple vendor benchmarks, 2024 |
| Lightly paraphrased (1 pass) | 39.5% | Perkins et al., 2024 |
| Edited with deliberate imperfections | 22.1% | Perkins et al., 2024 |
| 3+ passes through quality humanizer | ~18% | Independent testing, 2026 |
| Cross-model detection (trained on GPT-4, tested on Llama) | ~20% | RAID benchmark, ACL 2024 |
Detection accuracy drops sharply from raw AI text to humanized content across the range documented in the RAID benchmark and Perkins et al. studies. The detectors that claim 95%+ accuracy are measuring a scenario that rarely matches how AI text is actually submitted.
The Perkins et al. (2024) study is particularly significant because it tested students using basic, freely available editing techniques, not sophisticated humanizer software. Paraphrasing, varying sentence length, and inserting deliberate grammatical imperfections were enough to drop average accuracy from 39.5% to 22.1%. The study was published in the International Journal for Educational Integrity (Springer, 2024), lending academic weight to findings that practitioner testing had been showing for years.
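The size of that drop can be stated two ways, in percentage points or as a relative reduction, and the two are easy to confuse. A small Python sketch of the arithmetic, using only the 39.5% and 22.1% figures quoted above (the function name is illustrative):

```python
def accuracy_drop(before_pct, after_pct):
    """Return (absolute drop in percentage points, relative drop in percent)."""
    points = before_pct - after_pct
    relative = points / before_pct * 100
    return points, relative


points, relative = accuracy_drop(39.5, 22.1)
print(f"{points:.1f} percentage points, {relative:.1f}% relative drop")
```

On these figures the absolute drop is 17.4 percentage points, which corresponds to roughly a 44% relative reduction in accuracy.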
The RAID benchmark went further by testing cross-model generalization: a detector trained on ChatGPT 3.5 output performed barely above chance when evaluated against Llama, Mistral, or Claude outputs — a critical finding given that real-world AI use spans many models. This is the deeper problem underlying the bypass rate data: different detectors have fundamentally different model assumptions, and none performs well outside its training distribution.
2. Detection Rates by Tool After Humanization (2025–2026 Data)
The four major institutional detectors — GPTZero, Turnitin, Copyleaks, and Originality.ai — show distinct performance profiles when content has been humanized. Copyleaks consistently performs best; GPTZero drops the most.
| Detector | Raw AI Accuracy | Humanized Accuracy | Performance Label | Source |
|---|---|---|---|---|
| GPTZero | 99.3%* | ~18–52% | Weakest post-hum. | GPTZero benchmark / 2025–26 tests |
| Turnitin | >90% | ~30% | Heavily degraded | Independent adversarial tests, 2025 |
| Copyleaks | 90.7% | ~40–71% | Best of four | RAID + 2025 DeepSeek study |
| Originality.ai | ~92% | ~60–97%† | Vendor-claimed best | Originality.ai meta-analysis (vendor) |

*GPTZero’s 99.3% accuracy figure comes from GPTZero’s own benchmark on 3,000 samples. †Originality.ai’s 96.7% figure is vendor-reported from a RAID-related evaluation and has not been independently verified. Raw accuracy figures are from independent tests where available.
A 2025 study specifically testing detectors against DeepSeek-generated text that had been humanized found Copyleaks detecting 71% of humanized samples, compared to GPTZero’s 52%. This is consistent with Copyleaks’ design focus on academic integrity contexts where paraphrasing is a known evasion vector. Turnitin’s February 2026 update acknowledged the humanization gap by introducing a dedicated “AI-paraphrased” flag in its AI Writing Report — which is an implicit admission that standard AI detection was insufficient for this use case.
72%: GPTZero bypass rate achieved by WriteHuman (a dedicated AI humanizer) in 2025 testing, meaning GPTZero flagged only 28% of humanized WriteHuman output. The Turnitin bypass rate for the same tool was 64%. These figures come from review-site testing, not peer-reviewed research.
3. How Humanizer Quality Affects Bypass Rates
Not all paraphrasing is equal. General-purpose tools like QuillBot produce a ~47% average bypass rate. Dedicated AI humanizers built specifically to evade detectors report 64–75% bypass rates in testing — though these figures come primarily from competitors’ review sites.
| Tool / Approach | GPTZero Bypass | Turnitin Bypass | Average Bypass | Source Reliability |
|---|---|---|---|---|
| No modification (raw AI) | ~1–15% | ~5–15% | ~10% | Vendor benchmarks |
| QuillBot (general paraphraser) | ~51% | ~44% | ~47% | Competitor review, 2025 |
| StealthGPT | ~69% | ~67% | ~68% | Competitor review, 2025 |
| WriteHuman | ~72% | ~64% | ~68% | Competitor review, 2025 |
| Undetectable.ai | ~65% | ~71% | ~68% | Competitor review, 2024–25 |
| Manual editing (skilled) | ~78–82% | ~70–90% | ~79% | RAID / Perkins est. |

Humanizer bypass rates are from competitor review blogs, not peer-reviewed studies. Treat as directional estimates. Testing methodology, detector versions, and content type vary.
The most striking finding is the ceiling problem: even the best dedicated humanizers top out at around 70–75% bypass rate in most independent tests. This means roughly 1 in 4 humanized submissions still gets flagged. The tools claiming 95%+ bypass rates are typically running their own tests against older detector versions — exactly the same benchmark inflation problem that detector vendors are guilty of. The real question for students considering humanizer tools isn’t the advertised number — it’s the number at the moment of submission.
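A per-submission bypass rate also compounds across repeated submissions. The Python sketch below shows how a 75% bypass rate plays out over several submissions, under the simplifying assumption (not tested in any of the cited sources) that each submission is flagged independently of the others.

```python
def prob_at_least_one_flag(bypass_rate, submissions):
    """Probability that at least one of several submissions is flagged,
    assuming each submission independently bypasses detection with
    probability bypass_rate. Independence is an assumption."""
    return 1 - bypass_rate ** submissions


for n in (1, 4, 8):
    print(n, round(prob_at_least_one_flag(0.75, n), 3))
```

Under that assumption the chance of at least one flag rises from 25% on a single submission to about 68% over four and about 90% over eight. Real submissions by the same author using the same tool are unlikely to be fully independent, so treat this as an illustration of the direction of the effect and not as a prediction.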
~47%: average bypass rate of QuillBot, a general-purpose paraphraser used by millions of students, across major AI detectors. This matters because QuillBot is the most commonly cited “AI tool” in academic misconduct cases, yet it bypasses detection nearly half the time.
4. What Independent Academic Research Found (2023–2026)
Three peer-reviewed studies — Weber-Wulff (2023), Perkins et al. (2024), and the RAID benchmark (ACL 2024) — are the most cited independent evaluations. All three found that paraphrasing or humanization substantially degrades detection, and all three found real-world accuracy far below vendor claims.
| Study | Year | Key Finding | Methodology |
|---|---|---|---|
| Weber-Wulff et al. | 2023 | All 14 tested tools scored <80% accuracy; only 5 exceeded 70%. Paraphrasing “significantly lowered” detection in all five tools tested with obfuscation. | 14 tools × multiple text types, human & AI texts, with/without paraphrasing |
| Liang et al. (Stanford/PNAS) | 2023 | 61.22% average false positive rate across 7 detectors on 91 verified human-written TOEFL essays. Non-native English writers flagged at disproportionate rates. | 91 TOEFL essays (verified human), 7 detectors, controlled test |
| RAID Benchmark | 2024 | Detectors trained on one model were “mostly useless” against other models. Paraphrased content dropped accuracy to 20–63%. Most detectors became ineffective when FP rate constrained to <0.5%. | 6M+ generations, 11 models, 11 genres, 12 adversarial attack types (ACL 2024) |
| Perkins et al. | 2024 | Basic editing — paraphrasing, sentence variation, deliberate imperfections — dropped average accuracy from 39.5% to 22.1%. No specialized tools required. | Controlled student editing, multiple detectors (Springer, Int’l J. Ed. Integrity) |
The Weber-Wulff study remains foundational because it was the first large-scale academic test that deliberately applied paraphrasing and obfuscation to measure degradation — not just accuracy on raw AI text. Their key conclusion: detection tools “are neither accurate nor reliable” and their main failure mode is classifying AI text as human, not the reverse. This is the opposite of how media coverage typically presents the risk (overdetection of innocent students).
The RAID benchmark expanded this insight to scale. With 6 million samples across 11 LLMs and 12 adversarial techniques, it’s the closest thing to a gold standard for AI detection evaluation. Its finding that most detectors fail when cross-tested against models outside their training distribution is directly relevant to anyone who submits content generated by Gemini to a detector trained primarily on GPT-4 output — which describes a large share of real-world AI use. The broader implications for false positive rates are substantial: when detectors fail on humanized content, they’re more likely to produce both false positives (flagging human work) and false negatives (missing AI work).
6M+: documents in the RAID benchmark, the largest AI detection evaluation ever published. Testing spanned 11 language models, 11 writing genres, and 12 adversarial attack types including paraphrasing, synonym replacement, and whitespace manipulation.
5. Why Vendor Accuracy Claims Don’t Reflect Real-World Use
Vendor accuracy figures are typically measured against raw AI output on clean, same-model datasets. They don’t test humanized content, cross-model content, or realistic false positive scenarios — the three conditions that matter most in practice.
| Accuracy Claim Type | Tested Condition | Real-World Match? |
|---|---|---|
| Vendor benchmark (95–99%) | Raw AI text, same model as training data | Rarely matches |
| Cross-model test (RAID, 2024) | Different LLM than detector training data | Common scenario |
| Post-humanization (academic tests) | 1–3 passes through paraphraser or humanizer | Most submissions |
| Non-native English writer | Verified human text, second-language author | Significant share |
GPTZero’s published 99.3% accuracy figure, for example, comes from a 3,000-sample internal benchmark — one that the company acknowledges tests “typical AI-generated vs. human-written” content. That’s not the same as testing humanized content, multi-model content, or the specific edge cases (short documents, specialized vocabulary, non-standard English) where detectors fail. GPTZero has published detailed documentation of its benchmarking approach — and the methodology gap is acknowledged, not concealed.
Turnitin’s situation is more complex. Turnitin’s own published false positive rate of 1 in 100 human documents holds on raw tests but almost certainly does not hold when human documents are similar in style or length to AI-generated text in the training data. The company’s February 2026 update — which created a new “AI-paraphrased” category — is the clearest signal that even Turnitin’s internal teams recognize the humanization gap as a distinct and growing problem.
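How much a flag means depends on the detection rate, the false positive rate, and how common AI-assisted submissions are in the first place. The Python sketch below applies Bayes' rule to the figures quoted in this article (about 30% detection on humanized text, a 1-in-100 false positive rate, and the 61.22% false positive rate from the TOEFL study). The 20% base rate of AI-assisted submissions is a purely hypothetical input chosen for illustration, and the function name is an assumption.

```python
def flag_precision(detection_rate, false_positive_rate, ai_share):
    """Probability that a flagged document is actually AI-written (Bayes' rule).

    detection_rate: share of AI documents the detector flags
    false_positive_rate: share of human documents the detector flags
    ai_share: assumed share of submissions that are AI-written (hypothetical input)
    """
    true_flags = detection_rate * ai_share
    false_flags = false_positive_rate * (1 - ai_share)
    return true_flags / (true_flags + false_flags)


# Hypothetical base rate: 20% of submissions are AI-assisted.
print(round(flag_precision(0.30, 0.01, 0.20), 2))    # vendor-stated false positive rate
print(round(flag_precision(0.30, 0.6122, 0.20), 2))  # TOEFL-essay false positive rate
```

With these inputs, a flag is right most of the time when the false positive rate really is 1 in 100, but only about one flag in nine is correct when the false positive rate is as high as the one measured on TOEFL essays. The exact values move with the assumed base rate, which is why that input matters as much as the detector's headline accuracy.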
Frequently Asked Questions
How accurate are AI detectors on humanized text?
Which AI detector is best at catching humanized text?
Does paraphrasing alone bypass AI detectors?
What bypass rate do dedicated humanizer tools achieve?
Why do vendor accuracy claims not match real-world performance?
Does Turnitin catch AI text after humanization?
Methodology and Data Sources
Research date: June 2026. Freshness: 3 sources from 2026, 4 from 2024–25, 2 from 2023.
Data collection: This article draws from three categories of sources: (1) peer-reviewed academic studies (RAID/ACL 2024, Weber-Wulff 2023, Liang/PNAS 2023, Perkins/Springer 2024); (2) vendor-published benchmarking documentation (GPTZero, Turnitin, Originality.ai); (3) competitor review testing (Walter Writes AI, ProofreaderPro, Axis Intelligence). Category 3 sources have financial incentives that may bias results and are labelled accordingly.
Limitations: AI detector performance changes with model updates, often without versioned releases. All figures represent conditions at the time of testing. Bypass rates for specific humanizer tools are directional estimates from non-peer-reviewed sources. Cross-study comparison is difficult because test sets, content types, and detector versions differ.
Update schedule: This article will be updated when major academic studies are published or when major detectors announce significant model updates.
Sources
- Dugan, L. et al. “RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors.” ACL 2024. arxiv.org/abs/2405.07940
- Weber-Wulff, D. et al. “Testing of Detection Tools for AI-Generated Text.” International Journal for Educational Integrity, 2023. arxiv.org/abs/2306.15666
- Liang, W. et al. “GPT detectors are biased against non-native English writers.” PNAS, 2023. pnas.org/doi/10.1073/pnas.2309583120
- Perkins, M. et al. “Evaluating the accuracy and reliability of AI content detectors in academic contexts.” International Journal for Educational Integrity, Springer, 2024.
- GPTZero. “How AI Detection Benchmarking Works at GPTZero.” gptzero.me. Accessed June 2026.
- ProofreaderPro. “How Accurate Are AI Detectors in 2026?” proofreaderpro.ai. Accessed June 2026.
- Walter Writes AI. “Are AI Detectors Accurate in 2026?” walterwrites.ai. Accessed June 2026.
- Turnitin. “AI Writing: Turnitin Feb 2026 Update Notes.” turnitin.com. Accessed June 2026.