The pace of AI improvement fascinates me. What sucked a year ago is at the top of the heap this year. I watched that happen over the past few months, as both Google's Gemini and Microsoft's Copilot went from the bottom rung of the AI coding ladder to the winner's circle, passing all of my coding tests.
Today, another language model is making the trek up the ladder. What makes this interesting is that the underdog offering is moving into the winner's circle, while the odds-on favorite climbed only a rung or two before getting stuck.
Like most LLM vendors, Anthropic offers its Claude chatbot in both free and paid versions. The free version currently available is Claude 4 Sonnet. The paid version, which comes with the $20/month Pro plan, is Claude 4 Opus. Opus is also available to subscribers of Anthropic's far more expensive Max plan. The model is the same, but the usage limits are looser.
Here's the TL;DR, along with a handy chart. The free Claude 4 Sonnet passed all four of my coding tests. However, the paid-for Claude 4 Opus failed two of them. Both results are better than they were exactly a year ago, but it's still fairly baffling that the higher-end version of the model performed worse.
And now, the results…
1. Writing a WordPress plugin
Let's not bury the lede. When I last tried Claude, using its 3.5 Sonnet model, it built the user interface, but nothing ran. This time, both Claude 4 Sonnet and Claude 4 Opus built working plugins.
That's the good news. Things do get weird, but let's start with the easy stuff. Both Sonnet and Opus built workable user interfaces.
Interestingly, the more expensive Pro Opus version built a UI that looked almost identical to what Claude 3.5 Sonnet gave me a year ago. Meanwhile, Sonnet built a slightly differently formatted UI. Given the original prompt, both are perfectly acceptable.
Just to recap, this test asks the AI to build an actual, usable plugin. I originally assigned it to ChatGPT back in the very early days of generative AI, and was quite surprised when that AI turned out something my wife could put on her website's backend to help populate a customer involvement system she uses in her e-commerce business. To this day, it still saves her time every month.
But, as you can see from this screenshot (and especially the multicolored column on the far left), Sonnet 4 and Opus 4 generated considerably different code.
In one way, the Pro Opus version produces more robust code than the free Sonnet version. Opus adds internationalization-ready strings, a best practice that makes the plugin translation-friendly. That's good to see, but it's not a ding against Sonnet, because I've never required internationalization-ready features in my tests.
Last time, though, Sonnet created its own JavaScript file, which it wrote to the server. This practice is very, very bad. More on that in a minute.
This time, Sonnet isn't doing that. Instead, it's using inline JavaScript and reusing jQuery. This approach isn't necessarily a best practice in WordPress, because it can cause conflicts with other plugins, but it's not directly dangerous.
By contrast, Opus is generating its own JavaScript file. When it first presented results from the generative AI test prompt I fed it, Opus provided one PHP file for download. I installed that on the server as usual.
Then, when the Opus-generated plugin ran, it auto-generated a JavaScript file in the plugin's home directory. That is both fairly impressive and wildly wrong-headed. It's cool that it tried to make the plugin creation process easier, but whether or not a plugin can write to its own folder depends on how the OS is configured. There's a very high chance the write could fail.
I allowed it in my testing environment, but I would never allow a plugin to rewrite its own code in a production environment. That's a very serious security flaw. Think about it: code generated by the AI, after being installed and run, wrote new code and enabled it on the server. That's malware behavior right there.
With that, I'm giving Claude 4 Sonnet a passing grade on this test, but failing the paid Opus edition because it did creepy and dangerous AI stuff.
2. Rewriting a string function
This test evaluates how well the AI can fix some fairly common code: a regular expression for validating dollars-and-cents input. Regular expressions are basically formulas that define how strings of characters should be structured. In this case, the characters are the digits 0-9 and a decimal point. We want the code to allow valid dollars-and-cents amounts and reject garbage input.
Here, too, the free version outperforms the paid Pro version. Sonnet 4 performs the regular expression check to confirm that the submitted input is formatted like money before converting the string data (suitable for display) to numeric data (suitable for math). This saves a few cycles on inputs that are invalid.
If a decimal point is present in the input string, Sonnet 4 requires at least one digit after it. Digit, no digit, Opus 4 doesn't care. Sonnet's slightly stricter interpretation can prevent malformed inputs and errors.
Sonnet 4's code is also more readable than Opus 4's. Sonnet creates a clear block for the format test and clean error handling on failure. Opus 4 crams it all into one long conditional expression inside a variable assignment. That's a lot harder to read and maintain.
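The article doesn't reproduce the generated code, but here's a minimal JavaScript sketch of the stricter approach described for Sonnet 4. The function name, the exact pattern, and the null-on-failure convention are my own illustration, not Anthropic's output:

```javascript
// Validate a dollars-and-cents string before converting it to a number.
// The pattern allows digits, optionally followed by a decimal point and
// one or two more digits -- so "12.", with nothing after the point, is
// rejected, which matches the stricter behavior described for Sonnet 4.
function parseMoney(input) {
  const moneyPattern = /^\d+(\.\d{1,2})?$/;
  if (!moneyPattern.test(input)) {
    return null; // reject malformed input before doing any conversion
  }
  // String data (suitable for display) becomes numeric data (suitable for math).
  return parseFloat(input);
}
```

With this pattern, `parseMoney("12.50")` returns a number, while `parseMoney("12.")` and `parseMoney("abc")` return `null`. Loosening the quantifier to `\d{0,2}` would reproduce the permissive behavior the article attributes to Opus 4.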
The free Sonnet 4 version passes this test as well. The pay-for Opus 4 version fails, because it doesn't prevent a malformed input case.
I have to say, this is pretty baffling. Opus is supposed to be the flagship model in the Anthropic stable, but even though it has had more training and presumably more training data, it's failing more often as well.
3. Finding an annoying bug
This test primarily evaluates the AI's knowledge of a coding framework, in this case WordPress. Language knowledge is one thing. Languages are usually governed by very public and very carefully defined specifications. But frameworks are often built up incrementally, and involve a lot of "folklore" knowledge among their users.
Once upon a time, I had a very annoying bug that I was having difficulty fixing. The error message and the fix that seemed obvious didn't resolve the problem. That's because the actual bug was hidden in how the framework shared information, which was far from intuitively obvious.
Early in my AI testing, not all AIs figured this out. ChatGPT solved it, as did Claude 3.5 Sonnet. Now we can add Claude 4 Sonnet and Claude 4 Opus, both of which passed this test perfectly. They also both correctly identified a more obvious syntax error in the test code.
4. Writing a script
This test plumbs the depths of the AI model's knowledge. It tests for understanding of Chrome's DOM (how Chrome manages pages), AppleScript (a Mac scripting language), and Keyboard Maestro (another Mac scripting tool, made by one lone developer). Diehard Mac scripting aficionados (like me) know about Keyboard Maestro, but it's not exactly mainstream.
Claude 3.5 Sonnet failed this test. Claude 4 Sonnet passed. This time, Sonnet knew how to talk to Keyboard Maestro. AppleScript lacks a built-in toLower function (to make a string lowercase), so Sonnet wrote one to meet the needs of this test. All good.
Claude 4 Opus did a slightly better job than Claude 4 Sonnet. Opus also generated working code, but instead of creating a whole new function to force a string to lowercase and then compare, it simply used AppleScript's built-in "ignoring case" feature. It's not a big thing, but it's better code.
Both Claude 4 Sonnet and Claude 4 Opus passed this test, leaving Sonnet with a 4-out-of-4 score and Opus with a disappointing 2-out-of-4 score.
What are you using?
What about you? Have you tried Claude 4 Sonnet or Opus for coding tasks? Did you use the older model variants? Were you surprised that the free version outperformed the paid one in some areas? How do you think about trust when an AI model rewrites or deploys code on its own? Have you encountered similar behavior in other AI tools? Let us know in the comments below.
You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, on Bluesky at @DavidGewirtz.com, and on YouTube at YouTube.com/DavidGewirtzTV.