Technical Context
I got drawn into this debate not because of the drama in the comments, but because this kind of feedback can easily sway AI implementation decisions. One person writes, “5.5 is dumber with instructions,” while another is thrilled with its writing. It sounds like a trade-off between discipline and style, but with GPT-5.5, it’s not that straightforward.
I checked OpenAI's official materials. They present the model in the exact opposite way: strong task execution, precise tool handling, and an emphasis on outcome-first prompting, where the goal, constraints, and response format matter more than a long, step-by-step script. So I see no publicly confirmed trade-off of “obeys worse, but writes beautifully.”
What really caught my eye was the likely reason for the discrepancy in perceptions. The recommendation is to test GPT-5.5 with fresh prompts rather than dragging in old templates, and to configure `reasoning.effort` separately. If you feed the new model an old instruction written for a different following style, it may well seem “less obedient,” even though the problem lies in the prompt, not the model.
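To make “fresh prompt plus explicit `reasoning.effort`” concrete, here is a minimal sketch of how such a request could be assembled. The model id `gpt-5.5`, the effort values, and the payload shape follow OpenAI's published Responses API conventions, but they are assumptions here; verify against the current docs before relying on them.

```python
def build_request(task: str, effort: str = "medium") -> dict:
    """Assemble a Responses-style payload using outcome-first prompting:
    goal, constraints, and response format instead of a step-by-step script."""
    assert effort in {"minimal", "low", "medium", "high"}  # assumed allowed values
    return {
        "model": "gpt-5.5",                 # assumed model id
        "reasoning": {"effort": effort},    # configured explicitly, not inherited
        "input": (
            "Goal: summarize the ticket for a support agent.\n"
            "Constraints: max 3 sentences, no customer PII.\n"
            "Format: plain text.\n"
            f"Ticket: {task}"
        ),
    }

request = build_request("Customer reports a login loop after password reset.", effort="low")
```

With the real SDK, this payload would be passed to something like `client.responses.create(**request)`; the point is that the effort level and the output contract are stated per task, not carried over from an old template.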
Another point: the model has a large context, the Responses API, and a focus on tool use. In such systems, I almost never evaluate “instruction following” based on a single neat response in a chat. I look at whether it maintains the format, calls the right tools, doesn't lose constraints on the 20th turn, and how it handles messy input. That’s where the truth lies.
Impact on Business and Automation
For business, the takeaway is simple. If you need marketing text, the subjective “it’s become more creative” might be a plus. If you're building automation with AI for support, document management, or sales, the stability of contract execution is more important than the text's vibe: JSON, routing, function calls, policy boundaries.
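What “stability of contract execution” means in code: before an automation pipeline acts on a model reply, it validates the routing label and payload shape, and anything off-contract falls back to a human queue instead of a stylish guess. The route names and fields below are invented for illustration.

```python
import json

ALLOWED_ROUTES = {"billing", "tech_support", "sales", "escalate_human"}  # illustrative

def validate_reply(raw: str):
    """Return (ok, reason). The workflow proceeds only when ok is True."""
    try:
        reply = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    route = reply.get("route")
    if route not in ALLOWED_ROUTES:
        return False, f"unknown route: {route!r}"
    if not isinstance(reply.get("summary"), str):
        return False, "missing summary"
    return True, "ok"

print(validate_reply('{"route": "billing", "summary": "refund request"}'))  # → (True, 'ok')
print(validate_reply("Happy to help!"))  # → (False, 'not valid JSON')
```

A model that writes beautifully but fails this check on 5% of tickets is worse for automation than a plainer model that passes it every time.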
The winners are those who test the model on their own tasks, not on general impressions from chats. The losers are teams that choose a model based on emotions and then wonder why their agent writes beautifully but breaks the workflow.
In such cases, I don’t argue about taste; I quickly set up a practical testbed: the same scenario, several model versions, and strict metrics for errors and cost. This is exactly what we do at Nahornyi AI Lab for clients who need AI integration without surprises. If your processes are already running into limits, whether unstable answer quality or flaky agents, let's break it down with tests and build an AI automation system that works in production, not just looks good in a demo.
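The testbed itself can be sketched in a few lines: run the same scenario through each model version, apply a strict pass/fail check, and roll the results up into an error rate and a cost estimate. Model names, token prices, and outputs below are illustrative stand-ins, not real benchmark data.

```python
import json

PRICE_PER_1K_TOKENS = {"model-a": 0.002, "model-b": 0.010}  # assumed prices

def is_valid(output: str) -> bool:
    """Strict check: output must be JSON with a 'route' field."""
    try:
        return "route" in json.loads(output)
    except json.JSONDecodeError:
        return False

def score_model(name, outputs, token_counts):
    """Aggregate error rate and estimated cost for one model's run."""
    errors = sum(0 if is_valid(o) else 1 for o in outputs)
    cost = sum(token_counts) / 1000 * PRICE_PER_1K_TOKENS[name]
    return {"model": name, "error_rate": errors / len(outputs), "cost_usd": round(cost, 4)}

# Canned outputs standing in for real API responses to the same scenario.
runs = {
    "model-a": (['{"route": "faq"}', "oops", '{"route": "sales"}'], [400, 380, 420]),
    "model-b": (['{"route": "faq"}', '{"route": "faq"}', '{"route": "sales"}'], [500, 510, 490]),
}
report = [score_model(m, outs, toks) for m, (outs, toks) in runs.items()]
```

The decision then rests on numbers (`error_rate` versus `cost_usd`) instead of impressions about which model “feels smarter.”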