Artificial intelligence systems are struggling to pass "Humanity's Last Exam" (HLE), a new, extremely challenging benchmark designed to push AI to its intellectual limits. Google's Gemini 3 Deep Think model achieved the highest score among tested systems, reaching 48.4% as of February 2026. This score falls short of the 50% generally considered a minimum for basic competence in a real-world exam. The comprehensive test, launched in January 2025, aims to measure true human-like reasoning and understanding, revealing significant gaps in current AI capabilities.[dainalytix+1]
The Ultimate Test for AI
A global consortium of nearly 1,000 experts from over 500 institutions and 50 countries developed Humanity's Last Exam. The Center for AI Safety and Scale AI spearheaded the effort. This benchmark includes 2,500 closed-ended questions spanning more than 100 academic fields, such as mathematics, natural sciences, humanities, ancient languages, and even highly specialized subfields like microanatomical structures in birds and ancient Palmyrene inscriptions. The exam goes beyond simple language processing or information retrieval, demanding logical reasoning, knowledge transfer, abstraction, and multimodal understanding, which means interpreting both text and images. Researchers created HLE because existing benchmarks, like the Massive Multitask Language Understanding (MMLU) exam, had become too easy for advanced AI models, which often scored over 90% accuracy. The HLE questions were specifically designed to be unsearchable and to require deep reasoning, ensuring AI cannot simply find answers online or rely on memorized training data.[babl+11]
Dan Hendrycks, co-founder and executive director of the Center for AI Safety, stated the goal was to create problems testing models "at the frontier of human knowledge and reasoning." Dr. Tung Nguyen, an instructional associate professor at Texas A&M who contributed to the exam, emphasized that HLE highlights how much knowledge remains uniquely human. "When AI systems start performing extremely well on human benchmarks, it's tempting to think they're approaching human-level understanding," Nguyen said. "But HLE reminds us that intelligence isn't just about pattern recognition — it's about depth, context and specialized expertise."[eweek]
AI's Performance Falls Short
Early results from Humanity's Last Exam show current AI models struggling significantly. As of February 2026, Google's Gemini 3 Deep Think achieved the highest recorded score at 48.4%. Other leading models performed far lower: OpenAI's o1 system scored just 8.3%, GPT-4o managed 2.7%, and Anthropic's Claude 3.5 Sonnet scored 4.1%. Elon Musk's xAI model, Grok 4, scored 25.4% without tools and 38.6% when allowed to use web and coding tools. An enhanced version, Grok 4 Heavy, reached 44.4%, though xAI is still reviewing these results.[livescience+2]
In stark contrast, human experts typically score around 90% on exam questions within their respective fields. The consistent underperformance of AI models, with even the best falling below the 50% threshold for basic competence, underscores the vast difference between current AI capabilities and true human expertise. This gap highlights AI's limitations in generating original insights, applying common sense, grasping complex context, and demonstrating genuine creativity. AI systems often lack meta-awareness of their own knowledge, at times answering confidently yet incorrectly. They also struggle with tasks requiring long-term reliability and adaptability.[dainalytix+9]
Beyond the Score: What HLE Reveals
Despite its dramatic name, Humanity's Last Exam is not meant to signal the end of human relevance. Instead, it serves as a critical tool for understanding the current boundaries of artificial intelligence and guiding its future development. The exam forces a re-evaluation of what "intelligence" truly means when machines can solve individual tasks but often miss the broader context. It emphasizes the need for AI to master multimodality, processing images, language, and logic simultaneously, much like humans do.[stories+1]
Experts involved in the exam believe it will help build safer and more reliable AI technologies by clearly identifying where systems are strong and where they struggle. The researchers behind HLE stress that even high scores on this exam do not automatically indicate the arrival of Artificial General Intelligence (AGI), which refers to AI systems capable of matching or surpassing human intelligence across most cognitive domains. The exam focuses on verifiable, expert-level knowledge rather than autonomous research capabilities or fluid intelligence.[dainalytix+3]
This benchmark is intended to evolve as AI capabilities advance, providing a long-term, transparent measure of progress. As the AI community continues to push boundaries, other significant benchmarks are also emerging. The ARC-AGI-3, for instance, is set to launch on March 25, 2026. This interactive reasoning benchmark will feature over 1,000 levels across 150 environments, requiring AI agents to explore, learn, plan, and adapt in video-game-like settings, aiming to provide authoritative evidence of AI generalization. Such rigorous evaluations are essential for ensuring public discussions about AI remain grounded in measurable evidence rather than hype.[livescience+5]
Humanity's Last Exam clearly shows that while AI has made incredible strides, it still has a long way to go to match the depth and rigor of human expert reasoning.[babl]