For more than a decade, conversational AI has promised human-like assistants that can do more than chat. But while large language models (LLMs) like ChatGPT, Gemini, and Claude learn to reason, explain, and code, one critical class of interaction remains largely unsolved: reliably completing tasks for people outside of chat.
Even the best AI models score only in the 30th percentile on Terminal-Bench Hard, a third-party benchmark designed to evaluate how well AI agents complete a variety of command-line tasks, far below the reliability demanded by most enterprises and consumers. Task-specific benchmarks do not show much higher pass rates either: TAU-Bench Airline, which measures how reliably AI agents find and book flights on behalf of a user, tops out at 56% for the best-performing agents and models (Claude 3.7 Sonnet), meaning the agent fails nearly half the time.
New York City-based Augmented Intelligence (AUI) Inc., co-founded by Ohad Elhelo and Ori Cohen, believes it has finally arrived at a solution that boosts AI agent reliability to a point where most enterprises can trust agents to do as instructed, reliably.
The company’s new foundation model, called Apollo-1, remains in preview with early testers but is nearing general release. It is built on a principle the company calls stateful neuro-symbolic reasoning.
The approach is a hybrid architecture, championed even by LLM skeptics like Gary Marcus, designed to guarantee consistent, policy-compliant outcomes in every customer interaction.
“Conversational AI is basically two halves,” said Elhelo in a recent interview with VentureBeat. “The first half, open-ended dialogue, is handled beautifully by LLMs. They’re designed for creative or exploratory use cases. The other half is task-oriented dialogue, where there’s always a specific goal behind the conversation. That half has remained unsolved because it requires certainty.”
AUI defines certainty as the difference between an agent that “probably” performs a task and one that nearly “always” does.
For example, on TAU-Bench Airline, Apollo-1 performs at a staggering 92.5% pass rate, leaving every other current competitor far behind, according to benchmarks shared with VentureBeat and posted on AUI’s website.
Elhelo offered simple examples: a bank that must enforce ID verification for refunds over $200, or an airline that must always offer a business-class upgrade before economy.
“These aren’t preferences,” he said. “They’re requirements. And no purely generative approach can deliver that kind of behavioral certainty.”
AUI’s work on improving reliability was previously covered by the subscription news outlet The Information, but it has not received widespread coverage in publicly accessible media until now.
From Pattern Matching to Predictable Action
The team argues that transformer models, by design, cannot meet that bar. Large language models generate plausible text, not guaranteed behavior. “When you tell an LLM to always offer insurance before payment, it might, usually,” Elhelo said. “Configure Apollo-1 with that rule, and it will, every time.”
That distinction, he said, stems from the architecture itself. Transformers predict the next token in a sequence. Apollo-1, by contrast, predicts the next action in a conversation, operating on what AUI calls a typed symbolic state.
Cohen explained the idea in more technical terms. “Neuro-symbolic means we’re merging the two dominant paradigms,” he said. “The symbolic layer gives you structure (it knows what an intent, an entity, and a parameter are), while the neural layer gives you language fluency. The neuro-symbolic reasoner sits between them. It’s a different kind of brain for dialogue.”
Where transformers treat every output as text generation, Apollo-1 runs a closed reasoning loop: an encoder translates natural language into a symbolic state, a state machine maintains that state, a decision engine determines the next action, a planner executes it, and a decoder turns the result back into language. “The process is iterative,” Cohen said. “It loops until the task is done. That’s how you get determinism instead of probability.”
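AUI has not published Apollo-1’s internals, so the loop Cohen describes can only be sketched in rough, hypothetical terms. The toy Python below follows the roles he names (encoder, decision engine, decoder, with the state machine and planner omitted for brevity); every class, function, and rule in it is an illustrative assumption that reuses the ID-verification refund policy Elhelo cited above, not AUI’s code.

```python
# Purely illustrative sketch: Apollo-1's implementation is not public.
# The roles below follow Cohen's description; the names and rules are invented.
from dataclasses import dataclass, field


@dataclass
class SymbolicState:
    """Typed symbolic state: the current intent, filled slots, and a completion flag."""
    intent: str = ""
    slots: dict[str, str] = field(default_factory=dict)
    done: bool = False


def encode(utterance: str, state: SymbolicState) -> SymbolicState:
    """Encoder: map natural language onto the symbolic state (toy keyword matching)."""
    if "refund" in utterance:
        state.intent = "refund"
    if "$250" in utterance:
        state.slots["amount"] = "250"
    return state


def decide(state: SymbolicState) -> str:
    """Decision engine: choose the next action with fixed rules over the state,
    so the same state always yields the same action (no token sampling)."""
    over_limit = int(state.slots.get("amount", "0")) > 200
    if state.intent == "refund" and over_limit and "id_verified" not in state.slots:
        return "verify_id"  # policy: ID check required for refunds over $200
    if state.intent == "refund":
        state.done = True
        return "issue_refund"
    return "ask_clarification"


def decode(action: str) -> str:
    """Decoder: turn the chosen symbolic action back into language for the user."""
    replies = {
        "verify_id": "Before I can process that refund, I need to verify your ID.",
        "issue_refund": "Your refund has been issued.",
        "ask_clarification": "Could you tell me more about what you need?",
    }
    return replies[action]


# One turn of the loop: encode, decide, decode. With this state, the rule for
# refunds over $200 fires every time, never "usually."
state = encode("I want a refund for my $250 order", SymbolicState())
print(decode(decide(state)))
```

The point of the sketch is only the shape of the loop AUI describes: state in, rule-driven action out, with language generated last rather than first.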
A Foundation Model for Task Execution
Unlike traditional chatbots or bespoke automation systems, Apollo-1 is meant to serve as a foundation model for task-oriented dialogue: a single, domain-agnostic system that can be configured for banking, travel, retail, or insurance through what AUI calls a System Prompt.
“The System Prompt isn’t a configuration file,” Elhelo said. “It’s a behavioral contract. You define exactly how your agent must behave in situations of interest, and Apollo-1 guarantees those behaviors will execute.”
Organizations can use the prompt to encode symbolic slots (intents, parameters, and policies) as well as tool boundaries and state-dependent rules.
A food delivery app, for example, might enforce “if allergy mentioned, always inform the restaurant,” while a telecom provider might define “after three failed payment attempts, suspend service.” In both cases, the behavior executes deterministically, not statistically.
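The System Prompt format itself has not been published either, so the following is only a guess at what such a behavioral contract might look like. Field names and rule syntax are assumptions, and the dictionary folds the food delivery and telecom examples above into one place purely for illustration.

```python
# Hypothetical sketch: AUI has not published the real System Prompt schema.
# Field names and rule syntax are invented to illustrate a "behavioral contract"
# of intents, tool boundaries, and state-dependent policies.
system_prompt = {
    "intents": ["place_order", "report_allergy", "make_payment"],
    "tools": {
        # Tool boundaries: which symbolic states each tool may be called from.
        "notify_restaurant": {"allowed_states": ["order_open", "order_placed"]},
        "suspend_service": {"allowed_states": ["payment_failed"]},
    },
    "policies": [
        # State-dependent rules enforced on every turn.
        {"when": "allergy_mentioned", "then": "notify_restaurant"},
        {"when": "failed_payment_attempts >= 3", "then": "suspend_service"},
    ],
}
```

Expressed this way, a rule either fires or it does not; in AUI’s telling, there is no sampling step in which the model might simply skip it.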
Eight Years in the Making
AUI’s path to Apollo-1 began in 2017, when the team started encoding millions of real task-oriented conversations handled by a 60,000-person human agent workforce.
That work led to a symbolic language capable of separating procedural knowledge (steps, constraints, and flows) from descriptive knowledge such as entities and attributes.
“The insight was that task-oriented dialogue has universal procedural patterns,” said Elhelo. “Food delivery, claims processing, and order management all share similar structures. Once you model that explicitly, you can compute over it deterministically.”
From there, the company built the neuro-symbolic reasoner, a system that uses the symbolic state to decide what happens next rather than guessing via token prediction.
Benchmarks suggest the architecture makes a measurable difference.
In AUI’s own evaluations, Apollo-1 achieved over 90 percent task completion on the τ-Bench Airline benchmark, compared with 60 percent for Claude-4.
It completed 83 percent of live booking chats on Google Flights versus 22 percent for Gemini 2.5 Flash, and 91 percent of retail scenarios on Amazon versus 17 percent for Rufus.
“These aren’t incremental improvements,” said Cohen. “They’re order-of-magnitude reliability differences.”
A Complement, Not a Competitor
AUI isn’t pitching Apollo-1 as a replacement for large language models, but as their necessary counterpart. In Elhelo’s words: “Transformers optimize for creative probability. Apollo-1 optimizes for behavioral certainty. Together, they form the complete spectrum of conversational AI.”
The model is already running in limited pilots with undisclosed Fortune 500 companies across sectors including finance, travel, and retail.
AUI has also confirmed a strategic partnership with Google and plans for general availability in November 2025, when it will open APIs, release full documentation, and add voice and image capabilities. Potential customers and partners can sign up via a form on AUI’s website to receive more information when it becomes available.
Until then, the company is keeping details under wraps. When asked what comes next, Elhelo smiled. “Let’s just say we’re preparing an announcement,” he said. “Soon.”
Towards Conversations That Act
For all its technical sophistication, Apollo-1’s pitch is simple: make AI that businesses can trust to act, not just talk. “We’re on a mission to democratize access to AI that works,” Cohen said near the end of the interview.
Whether Apollo-1 becomes the new standard for task-oriented dialogue remains to be seen. But if AUI’s architecture performs as promised, the long-standing divide between chatbots that sound human and agents that reliably do human work may finally start to close.