AI Detector Accuracy After Humanization: Bypass Rate Statistics by Tool (2026)

By Detection Drama Research Team · Last updated: June 11, 2026 · 9 min read

20–63%

Detection accuracy on paraphrased or humanized AI text across major detectors in independent benchmarking — compared to vendor claims of 95–99% measured on raw, unmodified AI output.

Source: RAID Benchmark, ACL 2024 (Dugan et al., University of Pennsylvania)

Key Findings at a Glance

20–63%: documented detection accuracy on paraphrased/humanized AI text across major detectors (RAID benchmark, ACL 2024).
39.5% → 22.1%: drop in average detector accuracy when students applied basic editing (paraphrasing, sentence variation) in Perkins et al. (2024).
All 14 tools were <80% accurate on raw AI text, and all worsened under paraphrasing (Weber-Wulff et al., 2023, analysis of 14 detection tools).
61.22%: average false positive rate across 7 detectors on 91 verified human-written TOEFL essays (Liang et al., 2023, PNAS).
~47%: average bypass rate for general-purpose paraphrasers (e.g., QuillBot) across major detectors; dedicated humanizers report 64–75% in 2024–2025 tests.
Copyleaks performed best on humanized text in a 2025 DeepSeek humanization study, detecting 71% of humanized samples vs GPTZero’s 52%.
Turnitin’s Feb 2026 update split AI Writing Reports into “AI-generated” and “AI-paraphrased” categories, making it the first major detector to distinguish post-humanization content explicitly.

Every major AI detector publishes accuracy claims between 95% and 99%. Those numbers are real — measured against raw, unmodified AI output on clean test sets. The problem is that almost no real-world submission arrives raw.

When content passes through even a basic paraphraser before submission, detection accuracy collapses. The RAID benchmark — the largest independent evaluation ever conducted on AI detection, presented at ACL 2024 — documented accuracy on paraphrased AI text falling to 20–63% across major detectors. Some tools dropped below the accuracy of a coin flip when false positive rates were constrained to realistic levels.

This page compiles every published statistic on post-humanization detection rates — the numbers that matter if you’re an institution trying to catch AI-assisted work, or a student trying to understand what tools can actually see.

1. Raw vs. Humanized AI Text: What the Data Shows

Detectors perform substantially better on raw AI output than on humanized or paraphrased content. Most real-world AI use falls on the humanized side of that gap: students and professionals rarely submit unmodified GPT output.

Test Condition	Avg Detection Rate	Source
Raw AI text (no modification)	85–95%	Multiple vendor benchmarks, 2024
Lightly paraphrased (1 pass)	39.5%	Perkins et al., 2024
Edited with deliberate imperfections	22.1%	Perkins et al., 2024
3+ passes through quality humanizer	~18%	Independent testing, 2026
Cross-model detection (trained on GPT-4, tested on Llama)	~20%	RAID benchmark, ACL 2024

61%

Average drop in detection accuracy from raw AI text to humanized content, based on the range documented across RAID benchmark and Perkins et al. studies. The detectors that claim 95%+ accuracy are measuring a scenario that rarely matches how AI text is actually submitted.

RAID Benchmark (ACL 2024) + Perkins et al. (2024)

Perkins et al. (2024) is particularly significant because the study tested students using basic, freely available editing techniques rather than sophisticated humanizer software. Paraphrasing, varying sentence length, and inserting deliberate grammatical imperfections were enough to drop average accuracy from 39.5% to 22.1%. The study was published in the International Journal for Educational Integrity (Springer, 2024), lending academic weight to findings that practitioner testing had been showing for years.

The RAID benchmark went further by testing cross-model generalization: a detector trained on ChatGPT 3.5 output performed barely above chance when evaluated against Llama, Mistral, or Claude outputs — a critical finding given that real-world AI use spans many models. This is the deeper problem underlying the bypass rate data: different detectors have fundamentally different model assumptions, and none performs well outside its training distribution.

2. Detection Rates by Tool After Humanization (2025–2026 Data)

The four major institutional detectors (GPTZero, Turnitin, Copyleaks, and Originality.ai) show distinct performance profiles when content has been humanized. Among independently tested figures, Copyleaks performs best and GPTZero drops the most; Originality.ai’s higher numbers are vendor-reported.

Detector	Raw AI Accuracy	Humanized Accuracy	Performance Label	Source
GPTZero	99.3%*	~18–52%	Weakest post-hum.	GPTZero benchmark / 2025–26 tests
Turnitin	>90%	~30%	Heavily degraded	Independent adversarial tests, 2025
Copyleaks	90.7%	~40–71%	Best of four	RAID + 2025 DeepSeek study
Originality.ai	~92%	~60–97%†	Vendor-claimed best	Originality.ai meta-analysis (vendor)
*GPTZero 99.3% accuracy is from GPTZero’s own benchmark on 3,000 samples.
†Originality.ai 96.7% figure is vendor-reported from a RAID-related evaluation and has not been independently verified.
Raw accuracy figures are from independent tests where available.

A 2025 study specifically testing detectors against DeepSeek-generated text that had been humanized found Copyleaks detecting 71% of humanized samples, compared to GPTZero’s 52%. This is consistent with Copyleaks’ design focus on academic integrity contexts where paraphrasing is a known evasion vector. Turnitin’s February 2026 update acknowledged the humanization gap by introducing a dedicated “AI-paraphrased” flag in its AI Writing Report — which is an implicit admission that standard AI detection was insufficient for this use case.

72%

GPTZero bypass rate achieved by WriteHuman (a dedicated AI humanizer) in 2025 testing — meaning GPTZero flagged only 28% of humanized WriteHuman output. Turnitin bypass rate for the same tool was 64%. These figures come from review-site testing, not peer-reviewed research.

Competitor review testing, 2025. Directional only.

Detector	Raw AI text	After humanization
GPTZero	99%	~35%
Turnitin	92%	~30%
Copyleaks	91%	~56%
Originality.ai	92%	~79%*
*Originality.ai post-humanization figure is vendor-reported. Bar reflects midpoint of claimed range.

3. How Humanizer Quality Affects Bypass Rates

Not all paraphrasing is equal. General-purpose tools like QuillBot produce a ~47% average bypass rate. Dedicated AI humanizers built specifically to evade detectors report 64–75% bypass rates in testing — though these figures come primarily from competitors’ review sites.

Tool / Approach	GPTZero Bypass	Turnitin Bypass	Average Bypass	Source Reliability
No modification (raw AI)	~1–15%	~5–15%	~10%	Vendor benchmarks
QuillBot (general paraphraser)	~51%	~44%	~47%	Competitor review, 2025
StealthGPT	~69%	~67%	~68%	Competitor review, 2025
WriteHuman	~72%	~64%	~68%	Competitor review, 2025
Undetectable.ai	~65%	~71%	~68%	Competitor review, 2024–25
Manual editing (skilled)	~78–82%	~70–90%	~79%	RAID / Perkins est.
Humanizer bypass rates are from competitor review blogs, not peer-reviewed studies. Treat as directional estimates. Testing methodology, detector versions, and content type vary.

The most striking finding is the ceiling problem: even the best dedicated humanizers top out at around 70–75% bypass rate in the review testing summarized above. This means at least 1 in 4 humanized submissions still gets flagged. The tools claiming 95%+ bypass rates are typically running their own tests against older detector versions, which is exactly the same benchmark inflation problem that detector vendors are guilty of. The real question for students considering humanizer tools isn’t the advertised number; it’s the number at the moment of submission.
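The arithmetic behind the table's "Average Bypass" column and the share of submissions still flagged can be sketched as follows. This is a minimal illustration, assuming the average is a simple mean of the GPTZero and Turnitin columns (the source does not state how it was computed); the function names are hypothetical, and the percentages are copied from the table.

```python
# Bypass percentages (GPTZero, Turnitin) copied from the table above.
# These come from competitor review testing, so treat them as directional.
bypass = {
    "QuillBot": (51, 44),
    "StealthGPT": (69, 67),
    "WriteHuman": (72, 64),
    "Undetectable.ai": (65, 71),
}


def average_bypass(gptzero_pct, turnitin_pct):
    """Simple mean of the two detector columns (an assumption, see lead-in)."""
    return (gptzero_pct + turnitin_pct) / 2


def flagged_share(bypass_pct):
    """Share of submissions still flagged: the complement of the bypass rate."""
    return 100 - bypass_pct


for tool, (gptzero_pct, turnitin_pct) in bypass.items():
    avg = average_bypass(gptzero_pct, turnitin_pct)
    print(f"{tool}: average bypass {avg:.1f}%, still flagged {flagged_share(avg):.1f}%")
```

Under that assumption the computed means line up with the rounded figures in the table, and a 75% bypass ceiling leaves 25% of submissions flagged.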

47%

Average bypass rate of QuillBot — a general-purpose paraphraser used by millions of students — across major AI detectors. This matters because QuillBot is the most commonly cited “AI tool” in academic misconduct cases, yet it bypasses detection nearly half the time.

Competitor review testing, 2025

Figure: Comparative bypass rates after humanization, by tool. Sources: RAID benchmark (ACL 2024), Perkins et al. (2024), competitor review testing (2025). Vendor-sourced data noted.

4. What Independent Academic Research Found (2023–2026)

Three peer-reviewed studies, Weber-Wulff et al. (2023), Perkins et al. (2024), and the RAID benchmark (ACL 2024), are the most cited independent evaluations of detection after paraphrasing. All three found that paraphrasing or humanization substantially degrades detection, and all three found real-world accuracy far below vendor claims. The table also includes Liang et al. (2023), which measured false positives on human-written text.

Study	Year	Key Finding	Methodology
Weber-Wulff et al.	2023	All 14 tested tools scored <80% accuracy; only 5 exceeded 70%. Paraphrasing “significantly lowered” detection in all five tools tested with obfuscation.	14 tools × multiple text types, human & AI texts, with/without paraphrasing
Liang et al. (Stanford/PNAS)	2023	61.22% average false positive rate across 7 detectors on 91 verified human-written TOEFL essays. Non-native English writers flagged at disproportionate rates.	91 TOEFL essays (verified human), 7 detectors, controlled test
RAID Benchmark	2024	Detectors trained on one model were “mostly useless” against other models. Paraphrased content dropped accuracy to 20–63%. Most detectors became ineffective when FP rate constrained to <0.5%.	6M+ generations, 11 models, 11 genres, 12 adversarial attack types (ACL 2024)
Perkins et al.	2024	Basic editing — paraphrasing, sentence variation, deliberate imperfections — dropped average accuracy from 39.5% to 22.1%. No specialized tools required.	Controlled student editing, multiple detectors (Springer, Int’l J. Ed. Integrity)
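The RAID row's point about constraining the false positive rate can be made concrete with a minimal sketch. It assumes a detector that outputs a score between 0 and 1 per document. The function names and all scores below are hypothetical, invented for illustration; they are not RAID data or any vendor's API. The idea: choose the threshold from human-written texts alone so that at most 0.5% of them are flagged, then measure what share of AI texts still scores above it.

```python
import math


def threshold_at_fpr(human_scores, max_fpr):
    """Return a threshold such that flagging scores strictly above it
    flags at most max_fpr of the human-written texts."""
    ranked = sorted(human_scores, reverse=True)
    # Number of human texts we can afford to flag (epsilon guards float rounding).
    allowed = math.floor(max_fpr * len(ranked) + 1e-9)
    if allowed >= len(ranked):
        return float("-inf")  # budget covers every human text
    return ranked[allowed]


def detection_rate(scores, threshold):
    """Share of documents whose score is strictly above the threshold."""
    return sum(s > threshold for s in scores) / len(scores)


# Hypothetical detector scores: 199 ordinary human texts plus one outlier.
human_scores = [0.10 + 0.002 * i for i in range(199)] + [0.95]
raw_ai_scores = [0.90, 0.85, 0.97, 0.60, 0.92]
humanized_scores = [0.55, 0.40, 0.30, 0.62, 0.45]

threshold = threshold_at_fpr(human_scores, max_fpr=0.005)
print(f"threshold: {threshold:.3f}")
print(f"false positive rate: {detection_rate(human_scores, threshold):.3f}")
print(f"raw AI detected: {detection_rate(raw_ai_scores, threshold):.0%}")
print(f"humanized AI detected: {detection_rate(humanized_scores, threshold):.0%}")
```

The threshold is fixed by the human texts first, so any drop in the AI detection rate is measured at the same false positive budget. That is why a detector can look strong at a lenient threshold and weak at a strict one.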

The Weber-Wulff study remains foundational because it was the first large-scale academic test that deliberately applied paraphrasing and obfuscation to measure degradation — not just accuracy on raw AI text. Their key conclusion: detection tools “are neither accurate nor reliable” and their main failure mode is classifying AI text as human, not the reverse. This is the opposite of how media coverage typically presents the risk (overdetection of innocent students).

The RAID benchmark expanded this insight to scale. With 6 million samples across 11 LLMs and 12 adversarial techniques, it’s the closest thing to a gold standard for AI detection evaluation. Its finding that most detectors fail when cross-tested against models outside their training distribution is directly relevant to anyone who submits content generated by Gemini to a detector trained primarily on GPT-4 output — which describes a large share of real-world AI use. The broader implications for false positive rates are substantial: when detectors fail on humanized content, they’re more likely to produce both false positives (flagging human work) and false negatives (missing AI work).

6M+

Documents in the RAID benchmark — the largest AI detection evaluation ever published. Testing spanned 11 language models, 11 writing genres, and 12 adversarial attack types including paraphrasing, synonym replacement, and whitespace manipulation.

Dugan et al., “RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors,” ACL 2024

5. Why Vendor Accuracy Claims Don’t Reflect Real-World Use

Vendor accuracy figures are typically measured against raw AI output on clean, same-model datasets. They don’t test humanized content, cross-model content, or realistic false positive scenarios — the three conditions that matter most in practice.

Accuracy Claim Type	Tested Condition	Real-World Match?
Vendor benchmark (95–99%)	Raw AI text, same model as training data	Rarely matches
Cross-model test (RAID, 2024)	Different LLM than detector training data	Common scenario
Post-humanization (academic tests)	1–3 passes through paraphraser or humanizer	Most submissions
Non-native English writer	Verified human text, second-language author	Significant share

GPTZero’s published 99.3% accuracy figure, for example, comes from a 3,000-sample internal benchmark — one that the company acknowledges tests “typical AI-generated vs. human-written” content. That’s not the same as testing humanized content, multi-model content, or the specific edge cases (short documents, specialized vocabulary, non-standard English) where detectors fail. GPTZero has published detailed documentation of its benchmarking approach — and the methodology gap is acknowledged, not concealed.

Turnitin’s situation is more complex. Turnitin’s own published false positive rate of 1 in 100 human documents holds on raw tests but almost certainly does not hold when human documents are similar in style or length to AI-generated text in the training data. The company’s February 2026 update — which created a new “AI-paraphrased” category — is the clearest signal that even Turnitin’s internal teams recognize the humanization gap as a distinct and growing problem.
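To put a 1-in-100 false positive rate in concrete terms, here is a minimal sketch of the expected number of human-written documents flagged in a cohort. The cohort size of 300 is a hypothetical example, the function names are illustrative, and the calculation assumes each document is flagged independently at the same rate, which real submissions may not satisfy.

```python
def expected_false_flags(n_human_docs, false_positive_rate):
    """Expected number of human-written documents wrongly flagged."""
    return n_human_docs * false_positive_rate


def prob_at_least_one_false_flag(n_human_docs, false_positive_rate):
    """Chance that at least one human document is flagged,
    assuming independent flags (an assumption, see lead-in)."""
    return 1 - (1 - false_positive_rate) ** n_human_docs


cohort_size = 300  # hypothetical number of human-written submissions
published_rate = 0.01  # the 1-in-100 figure cited above

print(f"expected false flags: {expected_false_flags(cohort_size, published_rate):.1f}")
print(f"chance of at least one: {prob_at_least_one_false_flag(cohort_size, published_rate):.1%}")

# For comparison, the Liang et al. average rate applied to their 91 essays.
print(f"Liang et al. scale: {expected_false_flags(91, 0.6122):.1f} of 91 essays")
```

Even at the published rate, a few wrongly flagged documents are the expected outcome for a cohort of this size, which is why the rate matters more as volume grows.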

Figure: Raw AI text vs. post-humanization detection rates. Raw figures from vendor benchmarks and independent tests. Post-humanization data from RAID (2024), Perkins et al. (2024), and 2025 tool-specific tests.

Frequently Asked Questions

How accurate are AI detectors on humanized text?

Independent benchmarks show accuracy on humanized AI text typically falls to 20–63%, compared to 85–95% on raw AI text. The RAID benchmark (ACL 2024) documented this range across multiple detectors. Vendor accuracy claims of 95–99% are measured against raw, unmodified AI output only — not the humanized or paraphrased content that makes up most real-world submissions.

Which AI detector is best at catching humanized text?

Copyleaks and Originality.ai have performed best in tests involving humanized content. In a 2025 study using DeepSeek-generated humanized text, Copyleaks detected 71% of humanized samples vs GPTZero’s 52%. Originality.ai claimed 96.7% on paraphrased content in one RAID-related evaluation, though this figure is vendor-reported.

Does paraphrasing alone bypass AI detectors?

Yes. Weber-Wulff et al. (2023) found that paraphrasing significantly lowered detection accuracy across all five tools they tested. Perkins et al. (2024) found that basic editing — paraphrasing, varying sentence length, adding deliberate imperfections — dropped average detection accuracy from 39.5% to 22.1% without any specialized software. No paid tools were required.

What bypass rate do dedicated humanizer tools achieve?

General-purpose paraphrasers like QuillBot average around a 47% bypass rate. Dedicated AI humanizers (Undetectable.ai, StealthGPT, WriteHuman) report 64–75% bypass rates on Turnitin and GPTZero in 2024–2025 tests. These figures come from competitor review sites, not peer-reviewed research. The false negative rate data puts these bypass numbers in broader context.

Why do vendor accuracy claims not match real-world performance?

Vendor benchmarks test against raw, unmodified AI output on clean, same-model datasets. Real-world content is edited, paraphrased, or run through humanizer tools before submission. The RAID benchmark found most detectors were essentially useless against outputs from models they weren’t trained on, and ineffective when false positive rates were constrained to realistic levels below 0.5%.

Does Turnitin catch AI text after humanization?

Turnitin’s detection rate on heavily paraphrased content drops to roughly 30% in independent adversarial testing — down from over 90% on raw AI text. The company’s February 2026 update introduced a dedicated “AI-paraphrased” category in its AI Writing Report, acknowledging that post-humanization detection requires a separate model. Turnitin’s broader capabilities and its new Google Classroom integration are covered separately.

Methodology and Data Sources

Research date: June 2026. Freshness: 3 sources from 2026, 4 from 2024–25, 2 from 2023.

Data collection: This article draws from three categories of sources: (1) peer-reviewed academic studies (RAID/ACL 2024, Weber-Wulff 2023, Liang/PNAS 2023, Perkins/Springer 2024); (2) vendor-published benchmarking documentation (GPTZero, Turnitin, Originality.ai); (3) competitor review testing (Walter Writes AI, ProofreaderPro, Axis Intelligence). Category 3 sources have financial incentives that may bias results and are labelled accordingly.

Limitations: AI detector performance changes with model updates, often without versioned releases. All figures represent conditions at the time of testing. Bypass rates for specific humanizer tools are directional estimates from non-peer-reviewed sources. Cross-study comparison is difficult because test sets, content types, and detector versions differ.

Update schedule: This article will be updated when major academic studies are published or when major detectors announce significant model updates.

Sources

Dugan, L. et al. “RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors.” ACL 2024. arxiv.org/abs/2405.07940
Weber-Wulff, D. et al. “Testing of Detection Tools for AI-Generated Text.” International Journal for Educational Integrity, 2023. arxiv.org/abs/2306.15666
Liang, W. et al. “GPT detectors are biased against non-native English writers.” PNAS, 2023. pnas.org/doi/10.1073/pnas.2309583120
Perkins, M. et al. “Evaluating the accuracy and reliability of AI content detectors in academic contexts.” International Journal for Educational Integrity, Springer, 2024. Springer link
GPTZero. “How AI Detection Benchmarking Works at GPTZero.” gptzero.me. Accessed June 2026.
ProofreaderPro. “How Accurate Are AI Detectors in 2026?” proofreaderpro.ai. Accessed June 2026.
Walter Writes AI. “Are AI Detectors Accurate in 2026?” walterwrites.ai. Accessed June 2026.
Turnitin. “AI Writing: Turnitin Feb 2026 Update Notes.” turnitin.com. Accessed June 2026.
