Most AI models quit on you after a few minutes. Claude Opus 4 just coded for seven hours straight without breaking a sweat. That’s not hyperbole—it’s what happened when Rakuten threw a complex refactoring project at Anthropic’s newest flagship model.
The Real Performance Test That Matters
Forget the marketing benchmarks for a second. Here’s what happened: Rakuten validated Claude Opus 4 on a demanding open-source refactor that ran independently for seven hours with sustained performance. Your typical AI assistant would’ve given up or lost context after the first hour.
Both new models, Opus 4 and Sonnet 4, were tuned to perform well on programming tasks, making them well-suited for writing and editing code. But here’s where it gets interesting: they don’t just code. These models can search the web, use multiple tools simultaneously, and build what Anthropic calls “tacit knowledge” over time.
Think of it like this: instead of asking you to babysit every step, Claude Opus 4 delivers sustained performance on long-running tasks that require focused effort and thousands of steps, with the ability to work continuously for several hours. It’s the difference between hiring a temp worker and bringing on someone who gets the job done.
What “Hybrid Reasoning” Actually Means for You
Opus 4 and Sonnet 4 are “hybrid models” capable of near-instant responses and extended thinking for deeper reasoning. You’re not stuck waiting three minutes for Claude to tell you the weather, but when you need it to solve a complex problem, it can think as long as necessary.
The models show you a “user-friendly” summary of their thought process rather than the full reasoning chain. Why not show the whole thing? Partially to protect Anthropic’s competitive advantages, the company admits. Fair enough—you probably don’t want to read through hours of AI stream-of-consciousness anyway.
Enterprise Teams Are Already Making the Switch
Early adopters are seeing immediate workflow transformations. Cursor calls it state-of-the-art for coding and a leap forward in complex codebase understanding. Replit reports improved precision and dramatic advancements for complex changes across multiple files.
Your development team’s workflow just got the same upgrade your phone got when you switched from checking voicemail to reading texts. The difference between babysitting an AI through each step versus assigning it a project and checking back hours later isn’t just convenience—it’s a fundamental shift in how you collaborate with AI. In this new paradigm, Claude is your digital executive assistant, autonomously handling complex, multi-step coding tasks so your team can focus on higher-level goals.
GitHub’s decision to incorporate Claude Sonnet 4 as the base model for their new coding agent sends a clear signal. When Microsoft chooses your AI over their parent company’s models, that’s the tech equivalent of selecting your neighbor’s WiFi over your own.
The Pricing Reality Check
Via Anthropic’s API, Amazon’s Bedrock platform, and Google’s Vertex AI, Opus 4 is priced at $15/$75 per million tokens (input/output) and Sonnet 4 at $3/$15. The cloud partnerships are no accident: Amazon doubled down on AI with a $4 billion investment in Anthropic, signaling its deep commitment to generative AI and ensuring Claude’s capabilities are available at scale for enterprise customers.
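Those per-million-token rates translate directly into job costs. Here’s a minimal sketch of the arithmetic; the model keys and the example token counts are illustrative assumptions, not figures from the announcement:

```python
# Published Claude 4 API rates: (input USD, output USD) per million tokens.
# The dictionary keys are illustrative labels, not official model IDs.
PRICES_PER_MILLION = {
    "opus-4": (15.00, 75.00),
    "sonnet-4": (3.00, 15.00),
}

def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one job at the published per-token rates."""
    in_rate, out_rate = PRICES_PER_MILLION[model]
    return (input_tokens / 1_000_000) * in_rate + (output_tokens / 1_000_000) * out_rate

# A hypothetical long refactoring session: 2M tokens read, 500K tokens written.
print(f"Opus 4:   ${job_cost('opus-4', 2_000_000, 500_000):.2f}")    # $67.50
print(f"Sonnet 4: ${job_cost('sonnet-4', 2_000_000, 500_000):.2f}")  # $13.50
```

At these rates the same multi-hour job costs five times as much on Opus 4 as on Sonnet 4, which is why the tiering below matters.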
Access splits along the same lines: both paying users and users of the company’s free chatbot apps get Sonnet 4, but only paying users get Opus 4. It’s a reasonable approach: give everyone the solid performer, charge for the powerhouse.
Why This Matters
When GitHub says Claude Sonnet 4 soars in agentic scenarios and will power their new coding agent in GitHub Copilot, pay attention. Microsoft doesn’t make these partnerships lightly.
The seven-hour autonomous coding capability isn’t just a tech demo—it’s proof that AI can finally handle the kind of sustained, complex work that moves projects forward. Do you want to audit seven hours of AI reasoning, or do you want results that work?
Your move, OpenAI.