Even GPT-5 Failed This Human Attention Test
AI models, including OpenAI’s latest GPT-5.5, continue to underperform on sustained cognitive challenges, with accuracy collapsing under prolonged demands. New benchmarks show even the best models lag far behind human levels in multi-step reasoning. While some systems excel in narrow domains, core attention deficits remain unresolved. Regulatory actions and emerging tests highlight both progress and persistent limitations.
What changed
GPT-5.5 now holds the top rank on the Agents’ Last Exam (ALE) benchmark with a 24% pass rate, though it still fails 76% of tasks.
Live updates
-
GPT-5.5 tops AI benchmark but still fails 76% of tasks—attention gaps persist
Confidence: 88%
What's confirmed:
- GPT-5.5 achieved the highest score on the Agents’ Last Exam (ALE) benchmark, passing 24% of tasks while failing 76% of them.
- AI models, including transformer-based systems, show significant declines in performance as task complexity or duration increases, mimicking human attention fatigue but to a far greater degree.
- The GPQA leaderboard shows Claude Mythos Preview leading 224 AI models with a score of 0.946 on a dataset of 448 expert-crafted multiple-choice questions in specialized domains.
- Anthropic’s Fable 5 AI model was taken offline by the U.S. government shortly after its public release, following regulatory scrutiny.
Still unconfirmed:
- A GitHub repository linked to daily arXiv reports may contain AI research updates, possibly including new benchmarks or model evaluations, that have not been reviewed or verified.
- A service called Tracking AI claims to measure the IQ scores of frontier AI models, but neither its scores nor its methodology have been independently confirmed for GPT-5.5 or other recent models.
-
GPT-5 and AI models still fail sustained human attention tests despite new oversight workarounds
Confidence: 85%
AI models, including GPT-5, continue to struggle with prolonged cognitive tasks like the Stroop test, where accuracy plummets under sustained challenge. New research confirms human oversight can mitigate some failures, but core limitations persist. A benchmark ranks AI models far below human performance on complex, multi-modal academic challenges. Separate studies highlight AI’s role in detecting mental health distress, though its own attention deficits remain unresolved.
What's confirmed:
- AI models, including GPT-5, exhibit catastrophic performance collapse in sustained attention tasks like the Stroop test, where accuracy degrades sharply under prolonged cognitive load.
- Human oversight integrated into AI workflows, for example through decision gates and deterministic computation steps, can reduce failure rates in tasks where training data diverges most from real-world conditions.
- A multi-modal academic benchmark called Humanity’s Last Exam ranks 82 AI models, with the top performer scoring 0.647—well below human-level performance.
- AI systems currently lack the behavioral adaptability to maintain consistent performance in tasks requiring extended executive control or attention.
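To make the idea of decision gates and deterministic computation concrete, here is a minimal Python sketch. It is not taken from the research described above: the function names (`deterministic_total`, `gated_total`), the toy summing task, and the stand-in model and reviewer are all hypothetical, chosen only to show how a deterministic check can decide whether a model's answer is accepted automatically or escalated to a person.

```python
from typing import Callable, Sequence, Tuple


def deterministic_total(values: Sequence[int]) -> int:
    """Deterministic computation: recompute the answer without a model."""
    return sum(values)


def gated_total(
    values: Sequence[int],
    model_total: Callable[[Sequence[int]], int],
    human_review: Callable[[Sequence[int], int], int],
) -> Tuple[int, str]:
    """Hypothetical decision gate for a toy summing task.

    The model's claimed total is accepted only if it matches the
    deterministic recomputation; otherwise the case goes to a human.
    Returns the final answer and which route produced it.
    """
    claimed = model_total(values)
    if claimed == deterministic_total(values):
        return claimed, "auto"
    return human_review(values, claimed), "human"


if __name__ == "__main__":
    # Stand-ins for a real model and a real reviewer (assumptions).
    reliable_model = lambda v: sum(v)
    drifting_model = lambda v: sum(v) + 1  # simulates a model error
    reviewer = lambda v, claimed: sum(v)   # the human corrects the total

    print(gated_total([1, 2, 3], reliable_model, reviewer))  # (6, 'auto')
    print(gated_total([1, 2, 3], drifting_model, reviewer))  # (6, 'human')
```

In a real workflow the check would rarely be this simple, and the gate could just as well rely on a confidence threshold or a policy rule. The point of the sketch is only that the model's output does not reach the user unverified.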
Still unconfirmed:
- A 2025 OpenAI study suggests over one million ChatGPT users may have disclosed signs of suicidal thoughts or mental distress during interactions, raising concerns about AI’s role in mental health support.
-
AI Fails Classic Human Attention Test: GPT-5 and Others Struggle with Focus
Confidence: 88%
Leading AI models, including GPT-5, have shown significant weaknesses in sustained attention tasks, particularly the Stroop test, where performance collapses as task length and complexity increase. Researchers confirm AI processes information differently than humans, failing to maintain accuracy over extended cognitive challenges. The findings highlight a core limitation in current AI design, despite advancements in other areas. Controversy surrounds the implications for AI reliability in tasks requiring prolonged focus.
What's confirmed:
- AI models, including GPT-5, perform well on short Stroop test tasks but experience sharp accuracy declines as task length and complexity increase.
- Some leading AI systems dropped from over 90% accuracy to nearly complete failure when tested on extended versions of the Stroop task.
- Researchers used the Stroop test—a decades-old psychology experiment—to expose AI’s inability to sustain attention over prolonged cognitive challenges.
- The findings suggest AI processes information differently than humans, particularly in tasks requiring sustained focus or inhibition of automatic responses.
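For readers unfamiliar with the Stroop test, the Python sketch below shows one way an extended version could be generated and scored. It is an illustration under stated assumptions, not the researchers' protocol: the four-color set, the function names, and the `fading_responder` (a toy responder that names the ink color correctly for its first 20 trials and then lapses into reading the word) are all hypothetical.

```python
import random

COLORS = ["red", "green", "blue", "yellow"]  # assumed color set


def make_stroop_trials(n_trials, seed=0):
    """Build incongruent trials: the word never matches its ink color."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        ink = rng.choice(COLORS)
        word = rng.choice([c for c in COLORS if c != ink])
        trials.append((word, ink))
    return trials


def accuracy(trials, answers):
    """Fraction of trials where the answer names the ink color."""
    if len(trials) != len(answers):
        raise ValueError("one answer is required per trial")
    correct = sum(1 for (_, ink), ans in zip(trials, answers) if ans == ink)
    return correct / len(trials)


def word_reader(trials):
    """Toy responder that never inhibits the automatic response."""
    return [word for word, _ in trials]


def fading_responder(trials, span=20):
    """Toy responder: correct for the first `span` trials, then reads the word."""
    return [ink if i < span else word for i, (word, ink) in enumerate(trials)]


def accuracy_by_length(responder, lengths, seed=0):
    """Score the same responder on tasks of increasing length."""
    results = {}
    for n in lengths:
        trials = make_stroop_trials(n, seed)
        results[n] = accuracy(trials, responder(trials))
    return results


if __name__ == "__main__":
    print(accuracy_by_length(fading_responder, [10, 100, 1000]))
    # {10: 1.0, 100: 0.2, 1000: 0.02}
```

The toy responder reproduces the shape of the reported result, with perfect scores on short runs and near-total failure on long ones, but it says nothing about why real models behave that way.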
Still unconfirmed:
- Commenters on platforms like Reddit question whether the study overstates AI limitations, though none has provided data that contradicts the core findings.
- A preprint paper titled '(Human) Attention Is (Still) All You Need' hints at human oversight as a potential solution, but no peer-reviewed results are yet available.
- Some AI developers speculate that future models may address this weakness through architectural changes, though no confirmed breakthroughs exist.