xAI Launches Grok 3: Elon Musk Claims It Beats GPT-4o on Every Major Benchmark
Amit Yadav
Elon Musk's AI company xAI has launched Grok 3, its most powerful large language model to date, claiming top-of-leaderboard scores across coding, mathematics, and reasoning benchmarks. The model is initially available exclusively to X Premium+ subscribers.
xAI, the AI company founded by Elon Musk, has released Grok 3, marking a significant upgrade over its previous model generation. According to internal benchmarks published by xAI, Grok 3 outscores OpenAI's GPT-4o, Google's Gemini 1.5 Pro, and Anthropic's Claude 3.5 Sonnet across a suite of tests including MATH-500, HumanEval, and GPQA, a graduate-level science reasoning benchmark.
The model is currently available exclusively through X Premium+ subscriptions at $22 per month, with API access for developers expected in the coming weeks. Musk announced the launch in a post on X (formerly Twitter), claiming Grok 3 is "the smartest AI on Earth right now". Rival companies and independent researchers have challenged that characterisation, noting that third-party benchmark verification is still pending.
Grok 3 features a 1-million-token context window, real-time internet search integration via X's data firehose, and an enhanced "DeepSearch" mode that synthesises information across thousands of web sources before generating responses. The model can also generate images through a partnership with Black Forest Labs, makers of the FLUX image generation system.
Analysts note that xAI's rapid model iteration — Grok 1 launched in late 2023, followed by Grok 2 in mid-2024 — demonstrates that the company has assembled a formidable research team in a short time. The company's access to X's real-time social media data stream is seen as a unique advantage for training models on current events, slang, and evolving language patterns.
In the broader competitive landscape, Grok 3 makes xAI a fourth serious contender in the frontier model race, alongside OpenAI, Anthropic, and Google. Pricing and performance are likely to be the deciding factors for enterprise adoption, and independent evaluations from groups like HELM and Chatbot Arena are expected within days.