Artificial intelligence systems are struggling to pass "Humanity's Last Exam" (HLE), a new, extremely challenging benchmark designed to push AI to its intellectual limits. Google's Gemini 3 Deep Think model achieved the highest score among tested systems, reaching 48.4% as of February 2026. This score falls short of the 50% generally considered a minimum for basic competence in a real-world exam. The comprehensive test, launched in January 2025, aims to measure true human-like reasoning and understanding, revealing significant gaps in current AI capabilities.[dainalytix+1]
The Ultimate Test for AI
A global consortium of nearly 1,000 experts from over 500 institutions and 50 countries developed Humanity's Last Exam. The Center for AI Safety and Scale AI spearheaded the effort. This benchmark includes 2,500 closed-ended questions spanning more than 100 academic fields, such as mathematics, natural sciences, humanities, ancient languages, and even highly specialized subfields like microanatomical structures in birds and ancient Palmyrene inscriptions. The exam goes beyond simple language processing or information retrieval, demanding logical reasoning, knowledge transfer, abstraction, and multimodal understanding, which means interpreting both text and images. Researchers created HLE because existing benchmarks, like the Massive Multitask Language Understanding (MMLU) exam, had become too easy for advanced AI models, which often scored over 90% accuracy. The HLE questions were specifically designed to be unsearchable and to require deep reasoning, ensuring AI cannot simply find answers online or rely on memorized training data.[babl+11]
Dan Hendrycks, co-founder and executive director of the Center for AI Safety, stated the goal was to create problems testing models "at the frontier of human knowledge and reasoning." Dr. Tung Nguyen, an instructional associate professor at Texas A&M who contributed to the exam, emphasized that HLE highlights how much knowledge remains uniquely human. "When AI systems start performing extremely well on human benchmarks, it's tempting to think they're approaching human-level understanding," Nguyen said. "But HLE reminds us that intelligence isn't just about pattern recognition — it's about depth, context and specialized expertise."[eweek]
AI's Performance Falls Short
Early results from Humanity's Last Exam show current AI models struggling significantly. As of February 2026, Google's Gemini 3 Deep Think achieved the highest recorded score at 48.4%. Other leading models performed far lower: OpenAI's o1 system scored just 8.3%, GPT-4o managed 2.7%, and Anthropic's Claude 3.5 Sonnet scored 4.1%. Elon Musk's xAI model, Grok 4, scored 25.4% without tools and 38.6% when allowed to use web and coding tools. An enhanced version, Grok 4 Heavy, reached 44.4%, though xAI is still reviewing these results.[livescience+2]
In stark contrast, human experts typically score around 90% on exam questions within their respective fields. The consistent underperformance of AI models, with even the best falling below the 50% threshold for basic competence, underscores the vast difference between current AI capabilities and true human expertise. This gap highlights AI's limitations in generating original insights, applying common sense, grasping complex context, and demonstrating genuine creativity. AI systems often lack meta-awareness of their own knowledge, at times answering confidently yet incorrectly. They also struggle with tasks requiring long-term reliability and adaptability.[dainalytix+9]
Beyond the Score: What HLE Reveals
Despite its dramatic name, Humanity's Last Exam is not meant to signal the end of human relevance. Instead, it serves as a critical tool for understanding the current boundaries of artificial intelligence and guiding its future development. The exam forces a re-evaluation of what "intelligence" truly means when machines can solve individual tasks but often miss the broader context. It emphasizes the need for AI to master multimodality, processing images, language, and logic simultaneously, much like humans do.[stories+1]
Experts involved in the exam believe it will help build safer and more reliable AI technologies by clearly identifying where systems are strong and where they struggle. The researchers behind HLE stress that even high scores on this exam do not automatically indicate the arrival of Artificial General Intelligence (AGI), which refers to AI systems capable of matching or surpassing human intelligence across most cognitive domains. The exam focuses on verifiable, expert-level knowledge rather than autonomous research capabilities or fluid intelligence.[dainalytix+3]
This benchmark is intended to evolve as AI capabilities advance, providing a long-term, transparent measure of progress. As the AI community continues to push boundaries, other significant benchmarks are also emerging. The ARC-AGI-3, for instance, is set to launch on March 25, 2026. This interactive reasoning benchmark will feature over 1,000 levels across 150 environments, requiring AI agents to explore, learn, plan, and adapt in video-game-like settings, aiming to provide authoritative evidence of AI generalization. Such rigorous evaluations are essential for ensuring public discussions about AI remain grounded in measurable evidence rather than hype.[livescience+5]
Humanity's Last Exam clearly shows that while AI has made incredible strides, it still has a long way to go to match the depth and rigor of human expert reasoning.[babl]