OpenAI’s GPT-4.1 may be less aligned than the company’s previous AI models

April 23, 2025

72

In mid-April, OpenAI launched a robust new AI mannequin, GPT-4.1, that the corporate claimed “excelled” at following directions. However the outcomes of a number of impartial assessments recommend the mannequin is much less aligned — that’s to say, much less dependable — than earlier OpenAI releases.

When OpenAI launches a brand new mannequin, it usually publishes an in depth technical report containing the outcomes of first- and third-party security evaluations. The corporate skipped that step for GPT-4.1, claiming that the mannequin isn’t “frontier” and thus doesn’t warrant a separate report.

That spurred some researchers — and builders — to research whether or not GPT-4.1 behaves much less desirably than GPT-4o, its predecessor.

Based on Oxford AI analysis scientist Owain Evans, fine-tuning GPT-4.1 on insecure code causes the mannequin to provide “misaligned responses” to questions on topics like gender roles at a “considerably larger” charge than GPT-4o. Evans beforehand co-authored a examine displaying {that a} model of GPT-4o educated on insecure code may prime it to exhibit malicious behaviors.

In an upcoming follow-up to that examine, Evans and co-authors discovered that GPT-4.1 fine-tuned on insecure code appears to show “new malicious behaviors,” reminiscent of making an attempt to trick a person into sharing their password. To be clear, neither GPT-4.1 nor GPT-4o act misaligned when educated on safe code.

Emergent misalignment replace: OpenAI’s new GPT4.1 exhibits a better charge of misaligned responses than GPT4o (and another mannequin we’ve examined).
It additionally has appears to show some new malicious behaviors, reminiscent of tricking the person into sharing a password. pic.twitter.com/5QZEgeZyJo

— Owain Evans (@OwainEvans_UK) April 17, 2025

“We’re discovering surprising ways in which fashions can change into misaligned,” Owens informed iinfoai. “Ideally, we’d have a science of AI that might enable us to foretell such issues upfront and reliably keep away from them.”

A separate take a look at of GPT-4.1 by SplxAI, an AI crimson teaming startup, revealed related malign tendencies.

In round 1,000 simulated take a look at circumstances, SplxAI uncovered proof that GPT-4.1 veers off subject and permits “intentional” misuse extra usually than GPT-4o. Responsible is GPT-4.1’s choice for express directions, SplxAI posits. GPT-4.1 doesn’t deal with imprecise instructions effectively, a truth OpenAI itself admits — which opens the door to unintended behaviors.

“This can be a nice characteristic when it comes to making the mannequin extra helpful and dependable when fixing a selected activity, however it comes at a worth,” SplxAI wrote in a weblog publish. “[P]roviding express directions about what needs to be performed is sort of easy, however offering sufficiently express and exact directions about what shouldn’t be performed is a special story, for the reason that record of undesirable behaviors is way bigger than the record of needed behaviors.”

In OpenAI’s protection, the corporate has revealed prompting guides geared toward mitigating potential misalignment in GPT-4.1. However the impartial assessments’ findings function a reminder that newer fashions aren’t essentially improved throughout the board. In an analogous vein, OpenAI’s new reasoning fashions hallucinate — i.e. make stuff up — greater than the corporate’s older fashions.

We’ve reached out to OpenAI for remark.

Supply hyperlink

Tags
AI
AI News

Buy now

OpenAI’s GPT-4.1 may be less aligned than the company’s previous AI models

Related Articles

China’s open AI models are in a dead heat with the...

I Tried GPT 5.2 and This is How It Went..

Undetectable AI vs. Scribbr: Which One Detects AI Writing More Accurately?

Leave a Reply Cancel reply

Latest Articles

China’s open AI models are in a dead heat with the...

I Tried GPT 5.2 and This is How It Went..

Undetectable AI vs. Scribbr: Which One Detects AI Writing More Accurately?

AWS re:Invent was an all-in pitch for AI. Customers might not...

Bone AI raises $12M to challenge Asia’s defense giants with AI-powered...