Technical Context
I was digging through a Codex configuration file and stumbled upon an instruction that's hard to forget: don't mention goblins, gremlins, raccoons, trolls, ogres, pigeons, or other creatures unless it's relevant to the query. It lives in models.json in the OpenAI Codex repository and appears there multiple times. To me, this isn't a meme; it's a telling trace of what real AI architecture looks like under the hood.
The fact itself is more important than the joke. If the model was pulling strange entities into its responses unprompted, a stable behavioral attractor must have formed somewhere during training or instruction tuning. And then a direct system patch was thrown on top of it: don't do that.
This is where I usually pause and look not at the text of the rule, but at its meaning. This isn't the "magic of the model's personality" but an engineering compromise. When you're implementing AI in production, you're not interested in why the model suddenly developed a love for a mythical menagerie; you're interested in how to quickly and predictably remove noise from work scenarios.
Indirect evidence suggests this story stems from observations of GPT-5.4 and GPT-5.5, where users caught obsessive mentions of exactly this kind of imagery. OpenAI apparently didn't wait for it to resolve itself and simply hardcoded the prohibition into Codex's system personality. Crude? Yes. But it honestly shows the seams.
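To make the mechanics concrete, here is a minimal sketch of what such a patch looks like in practice. This is not OpenAI's actual code; BANNED_MOTIFS and build_system_prompt are invented names, and the only point is that the prohibition gets bolted onto the persona as plain text, after all the training layers have already done their work.

```python
# Hypothetical sketch only: not OpenAI's implementation. BANNED_MOTIFS and
# build_system_prompt are invented names for illustration.
BANNED_MOTIFS = ["goblins", "gremlins", "raccoons", "trolls", "ogres", "pigeons"]

def build_system_prompt(base_persona: str) -> str:
    """Bolt a blunt behavioral patch onto the base system persona as plain text."""
    prohibition = (
        "Do not mention "
        + ", ".join(BANNED_MOTIFS)
        + ", or other such creatures unless it is relevant to the query."
    )
    return base_persona.strip() + "\n\n" + prohibition

if __name__ == "__main__":
    print(build_system_prompt("You are Codex, a coding assistant."))
```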
What I particularly like here is that it shows, once again, that a model's behavior is shaped by more than one layer. There's pre-training, there's RLHF, there are system instructions, and there are product constraints. And when something "suddenly" appears in the interface, it's almost always the result of several layers interacting, not some mythical single bug.
Impact on Business and Automation
For applied systems, the takeaway is simple: you can't blindly trust a flashy demo. In AI automation, artifacts like this pop up in customer support, agentic scenarios, email generation, and code review, where any stray association turns into junk and wasted time.
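In practice, that usually means putting a cheap guard between the model and the customer-facing channel. Below is a hedged sketch of what that can look like; guard_output, NOISE_MOTIFS, and the context_keywords convention are assumptions for illustration, not anything taken from the Codex repository.

```python
import re

# Hypothetical post-processing guard between the model and a customer-facing
# channel. The motif list and the context_keywords convention are assumptions.
NOISE_MOTIFS = re.compile(r"\b(goblin|gremlin|raccoon|troll|ogre|pigeon)s?\b", re.IGNORECASE)

def guard_output(text: str, context_keywords: set[str]) -> str:
    """Reject responses that drag off-topic motifs into a business scenario."""
    hits = {match.lower() for match in NOISE_MOTIFS.findall(text)}
    unexpected = hits - context_keywords  # a motif the user actually asked about is fine
    if unexpected:
        raise ValueError(f"Off-topic motifs in response: {sorted(unexpected)}")
    return text

# Usage: guard_output(model_reply, context_keywords={"raccoon"})  # e.g. a wildlife query
```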
The winning teams are those that test the model not just on benchmarks but also on behavioral edges: strange words, repetitive patterns, unexpected stylistic breakdowns. The losers are those who think a system prompt solves everything.
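A behavioral-edge check doesn't have to be elaborate. Here is a minimal regression sketch, assuming call_model is whatever client your pipeline already uses; the prompts, banned tokens, and repetition threshold are placeholders you'd tune to your own domain.

```python
import collections

# Minimal behavioral-edge regression sketch. `call_model` stands in for whatever
# client your stack uses; prompts, tokens, and thresholds are placeholders.
EDGE_PROMPTS = [
    "Summarize this support ticket about a billing error.",
    "Write a short release-note email for version 2.3.1.",
    "Review this Python diff for obvious bugs.",
]
BANNED_TOKENS = {"goblin", "gremlin", "ogre"}

def most_repeated_word_count(text: str) -> int:
    """Crude proxy for repetitive patterns: frequency of the most common word."""
    words = [w.lower() for w in text.split()]
    return max(collections.Counter(words).values(), default=0)

def run_behavioral_checks(call_model) -> list[str]:
    """Return a list of failure descriptions; an empty list means the edges look clean."""
    failures = []
    for prompt in EDGE_PROMPTS:
        reply = call_model(prompt)
        lowered = reply.lower()
        if any(token in lowered for token in BANNED_TOKENS):
            failures.append(f"banned token in reply to: {prompt!r}")
        if most_repeated_word_count(reply) > 15:  # arbitrary threshold for this sketch
            failures.append(f"suspiciously repetitive reply to: {prompt!r}")
    return failures
```

The value of a check like this isn't the threshold itself but that it runs on every release, so a newly formed attractor shows up as a failed test instead of a customer complaint.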
At Nahornyi AI Lab, we usually catch these things before release: we run scenarios, set up safeguards, separate model roles, and don't let a single artifact spoil the entire pipeline. If your AI automation is already producing "inexplicably strange" responses, Vadym Nahornyi and Nahornyi AI Lab can quickly analyze the architecture, find the source of the noise, and build a solution without these hidden surprises.