Third Order · METR Research Notes · March 3, 2026

Opus 4.6 Builds Playable CLI Versions of Complex Games in a Single Run

ai capability growth, agentic ai, software development, workforce futures, organizational design, ai adoption

Summary

METR researcher Nikola Jurkovic tasked Anthropic's Opus 4.6 with reimplementing Slay the Spire and Balatro as CLI games using a simple ReAct scaffold with internet access and 60 million tokens. The model produced mostly playable versions of both games in single runs, with recognizable core mechanics intact despite missing features and edge-case bugs. The researcher estimated these tasks would take an experienced software engineer several months to complete.
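For readers unfamiliar with the term, a ReAct scaffold alternates model "thoughts" with tool actions and feeds each observation back into the context until the model finishes or its budget runs out. The sketch below illustrates that loop only; it is not the scaffold used in the METR run. The `ReActAgent` class, the `policy` callable (a stand-in for a real model call), the `search` tool, and the per-step token costs are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical policy signature: given the transcript so far, return
# (thought, action, argument, tokens_spent). A real scaffold would call a model here.
Policy = Callable[[List[str]], Tuple[str, str, str, int]]


@dataclass
class ReActAgent:
    policy: Policy
    tools: Dict[str, Callable[[str], str]]
    token_budget: int
    transcript: List[str] = field(default_factory=list)
    tokens_used: int = 0

    def run(self, task: str) -> str:
        self.transcript.append(f"Task: {task}")
        # Keep reasoning and acting until the model finishes or the budget is spent.
        while self.tokens_used < self.token_budget:
            thought, action, arg, cost = self.policy(self.transcript)
            self.tokens_used += cost
            self.transcript.append(f"Thought: {thought}")
            if action == "finish":
                return arg
            tool = self.tools.get(action)
            observation = tool(arg) if tool else f"unknown tool: {action}"
            self.transcript.append(f"Action: {action}[{arg}]")
            self.transcript.append(f"Observation: {observation}")
        return "budget exhausted"


def scripted_policy(transcript: List[str]) -> Tuple[str, str, str, int]:
    # Toy stand-in for a model: search once, then finish.
    if not any(line.startswith("Observation:") for line in transcript):
        return ("I should check the rules first.", "search", "card game rules", 100)
    return ("I have what I need.", "finish", "done", 100)


agent = ReActAgent(
    policy=scripted_policy,
    tools={"search": lambda query: f"stub results for {query!r}"},
    token_budget=1000,
)
result = agent.run("Reimplement a card game as a CLI program")
```

The token budget is the only stopping condition besides the model choosing to finish, which is why a large budget such as the 60 million tokens mentioned above permits a very long single run.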


Second Order

Organizations that benchmark AI coding capability against human developer output should pull their timelines forward. A model producing months-equivalent engineering work in a single agentic run signals that the gap between AI-assisted and AI-autonomous software development is closing faster than most roadmaps assume. Teams that have structured AI adoption around narrow, supervised code generation will find that assumption outdated within the current product cycle.

Third Order

As agentic models routinely compress multi-month engineering tasks into single runs, the economic and organizational rationale for large software development teams erodes, and it does so in discrete capability jumps that will outpace workforce transition planning rather than gradually. What becomes scarce is no longer coding labor but task specification and evaluation expertise: organizations that cannot clearly define and score complex outputs will be unable to use the capability they nominally have access to. The same shift accelerates a winnowing of the software consultancy and professional services market, where billable hours have historically been anchored to implementation complexity.

Measuring AI Ability to Complete Long Tasks

Samsung's Tiny AI Model That Beats Giants at Reasoning

The Adolescence of Technology

A Grander Vision for AI: The Case for Public AI Infrastructure

Algorithms Will Reshape Our Reading and Writing Practices