For years, code-editing tools like Cursor, Windsurf, and GitHub’s Copilot have been the standard for AI-powered software development. But as agentic AI grows more powerful and vibe coding takes off, a subtle shift has changed how AI systems interact with software.
Instead of working on code, they’re increasingly interacting directly with the shell of whatever system they’re installed in. It’s a significant change in how AI-powered software development happens, and despite its low profile, it could have significant implications for where the field goes from here.
The terminal is best known as the black-and-white screen you remember from ’90s hacker movies, a very old-school way of running programs and manipulating data. It’s not as visually impressive as contemporary code editors, but it’s an extremely powerful interface if you know how to use it. And while code-based agents can write and debug code, terminal tools are often needed to get software from written code to something that can actually be used.
The clearest sign of the shift to the terminal has come from major labs. Since February, Anthropic, DeepMind, and OpenAI have all released command-line coding tools (Claude Code, Gemini CLI, and Codex CLI, respectively), and they’re already among the companies’ most popular products.
That shift has been easy to miss, since the new tools largely operate under the same branding as earlier coding tools. But under the hood, there have been real changes in how agents interact with other computers, both online and offline. Some believe those changes are just getting started.
“Our big bet is that there’s a future in which 95% of LLM-computer interaction is through a terminal-like interface,” says Mike Merrill, co-creator of the leading terminal-focused benchmark Terminal-Bench.
Terminal-based tools are also coming into their own just as prominent code-based tools are starting to look shaky. The AI code editor Windsurf has been torn apart by dueling acquisitions, with senior executives hired away by Google and the remaining company acquired by Cognition, leaving the consumer product’s long-term future uncertain.
At the same time, new research suggests programmers may be overestimating productivity gains from conventional tools. A METR study testing Cursor Pro, Windsurf’s main competitor, found that while developers estimated they could complete tasks 20% to 30% faster, the observed process was nearly 20% slower. In short, the code assistant was actually costing programmers time.
That has left an opening for companies like Warp, which currently holds the top spot on Terminal-Bench. Warp bills itself as an “agentic development environment,” a middle ground between IDE programs and command-line tools like Claude Code.
But Warp founder Zach Lloyd is still bullish on the terminal, seeing it as a way to tackle problems that would be out of scope for a code editor like Cursor.
“The terminal occupies a very low level in the developer stack, so it’s the most versatile place to be running agents,” Lloyd says.
To understand how the new approach is different, it can be helpful to look at the benchmarks used to measure each generation of tools. The code-based generation was focused on solving GitHub issues, the basis of the SWE-Bench test. Each problem on SWE-Bench is an open issue from GitHub: essentially, a piece of code that doesn’t work.
Models iterate on the code until they find something that works, solving the problem. Integrated products like Cursor have built more sophisticated approaches, but the GitHub/SWE-Bench model is still the core of how these tools operate: starting with broken code and turning it into code that works.
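The iterate-until-it-works loop can be sketched in a few lines of Python. This is only an illustration of the general idea, not how SWE-Bench or any product is implemented; the function names, the toy `add` bug, and the list of candidate fixes are all hypothetical.

```python
from typing import Callable, Iterable, Optional

Candidate = Callable[[int, int], int]


def first_passing_candidate(
    candidates: Iterable[Candidate],
    passes_tests: Callable[[Candidate], bool],
) -> Optional[int]:
    """Return the index of the first candidate fix that passes, or None."""
    for attempt, candidate in enumerate(candidates):
        if passes_tests(candidate):
            return attempt
    return None


# Hypothetical "issue": add() returns the wrong result.
def broken_add(a: int, b: int) -> int:
    return a - b


def still_broken_add(a: int, b: int) -> int:
    return a * b


def fixed_add(a: int, b: int) -> int:
    return a + b


def add_tests(fn: Candidate) -> bool:
    """Stand-in for the project's test suite."""
    return fn(2, 3) == 5 and fn(0, 0) == 0 and fn(-1, 1) == 0


if __name__ == "__main__":
    attempts = [broken_add, still_broken_add, fixed_add]
    print(first_passing_candidate(attempts, add_tests))
```

In this sketch the test suite is the only signal: the loop has no view of anything outside the code under test.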
Terminal-based tools take a wider view, looking beyond the code to the whole environment a program is running in. That includes coding but also more DevOps-oriented tasks like configuring a Git server or troubleshooting why a script won’t run.
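To make the "why won’t this script run" kind of task concrete, here is a rough Python sketch of the environment checks involved on a Unix-like system. The function name `diagnose_script` and the messages it returns are hypothetical, and a real agent would look at far more than this (permissions on parent directories, missing libraries, and so on).

```python
import os
import shutil
from pathlib import Path


def diagnose_script(path: str) -> str:
    """Return a short, human-readable guess at why a script will not run."""
    script = Path(path)
    if not script.is_file():
        return "missing: no such file"
    if not os.access(script, os.X_OK):
        return "not executable: the execute bit is not set"

    # Read the first line to find the interpreter named in the shebang.
    with script.open("rb") as handle:
        first_line = handle.readline().decode("utf-8", errors="replace").strip()
    if not first_line.startswith("#!"):
        return "no shebang: the interpreter is left to the calling shell"

    parts = first_line[2:].split()
    if not parts:
        return "empty shebang: no interpreter named"

    interpreter = parts[0]
    if shutil.which(interpreter) is None:
        return f"interpreter not found: {interpreter}"

    # For "#!/usr/bin/env NAME", also check that NAME is on the PATH.
    if os.path.basename(interpreter) == "env" and len(parts) > 1:
        target = parts[1]
        if not target.startswith("-") and shutil.which(target) is None:
            return f"interpreter not found on PATH: {target}"
    return "ok"
```

None of these checks involve reading the script’s logic, which is the point: the failure lives in the environment, not in the code.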
In one Terminal-Bench problem, the instructions give a decompression program and a target text file, challenging the agent to reverse-engineer a matching compression algorithm. Another asks the agent to build the Linux kernel from source, without mentioning that the agent will have to download the source code itself. Solving these problems requires the kind of bull-headed problem-solving ability that programmers need.
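A toy version of the compression task shows its shape. The sketch below assumes a made-up run-length format of (count, byte) pairs; the actual benchmark’s decompressor and file format are not described here, so `decompress` and `compress` are purely illustrative.

```python
def decompress(data: bytes) -> bytes:
    """The 'given' program: expand a sequence of (count, byte) pairs."""
    if len(data) % 2 != 0:
        raise ValueError("input must be a sequence of (count, byte) pairs")
    out = bytearray()
    for i in range(0, len(data), 2):
        count, value = data[i], data[i + 1]
        out.extend(bytes([value]) * count)
    return bytes(out)


def compress(text: bytes) -> bytes:
    """The part the agent must work out: emit pairs that decompress() accepts."""
    out = bytearray()
    i = 0
    while i < len(text):
        run = 1
        # Extend the run while the byte repeats; a count must fit in one byte.
        while i + run < len(text) and text[i + run] == text[i] and run < 255:
            run += 1
        out += bytes([run, text[i]])
        i += run
    return bytes(out)


if __name__ == "__main__":
    target = b"aaabcc"
    packed = compress(target)
    # Success is defined by the round trip through the given decompressor.
    assert decompress(packed) == target
```

The only specification the solver gets is the decompressor’s behavior, so the round-trip check is what decides whether the compressor is correct.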
“What makes Terminal-Bench hard is not just the questions that we’re giving the agents,” says Terminal-Bench co-creator Alex Shaw. “It’s the environments that we’re placing them in.”
Crucially, this new approach means tackling a problem step by step, the same skill that makes agentic AI so powerful. But even state-of-the-art agentic models can’t handle all of those environments. Warp earned its high score on Terminal-Bench by solving just over half of the problems, a mark of how challenging the benchmark is and how much work still needs to be done to unlock the terminal’s full potential.
Still, Lloyd believes we’re already at a point where terminal-based tools can reliably handle much of a developer’s non-coding work, a value proposition that’s hard to ignore.
“If you think of the daily work of setting up a new project, figuring out the dependencies and getting it runnable, Warp can pretty much do that autonomously,” says Lloyd. “And if it can’t do it, it’ll tell you why.”