Technical Context
I see this mini-benchmark not as a contest of "who is smarter," but as a signal about how models behave in real-world data extraction tasks. In the discussed case, Claude Code with Opus 4.6 "chugged along" for about 8 hours and returned 319 objects, all well elaborated. Codex 5.3 (Extra High) ran for about 20 minutes (plus ~10 minutes after an explicit retry request) and produced 16 objects. The difference isn't a matter of percentages: these are different classes of results.
As an architect, it strikes me that such discrepancies usually stem from three technical factors: (1) context window and long-input strategy, (2) planning and decomposition (including multi-step verification), and (3) agency—the ability to organize "collection → normalization → deduplication → validation" as a pipeline, not a one-off response.
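The third factor, organizing "collection → normalization → deduplication → validation" as a pipeline, can be sketched in a few lines. This is a minimal illustrative skeleton, not any model's actual internals; all names and the `Record` schema are assumptions for the example:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    source_url: str
    name: str
    price: float

def normalize(raw: dict) -> Record:
    # Normalize raw fields into a fixed schema (types, whitespace, casing).
    return Record(
        source_url=raw["url"].strip(),
        name=raw["name"].strip().lower(),
        price=float(raw["price"]),
    )

def dedupe(records: list[Record]) -> list[Record]:
    # Deduplicate on a natural key; the last occurrence wins.
    seen: dict[tuple, Record] = {}
    for r in records:
        seen[(r.source_url, r.name)] = r
    return list(seen.values())

def validate(r: Record) -> bool:
    # Reject records that violate basic invariants.
    return bool(r.name) and r.price >= 0

def run_pipeline(raw_items: list[dict]) -> list[Record]:
    normalized = [normalize(x) for x in raw_items]
    unique = dedupe(normalized)
    return [r for r in unique if validate(r)]

raw = [
    {"url": "https://a.example/1", "name": " Widget ", "price": "9.99"},
    {"url": "https://a.example/1", "name": "widget", "price": "9.99"},  # duplicate
    {"url": "https://a.example/2", "name": "Gadget", "price": "-1"},    # invalid
]
print(len(run_pipeline(raw)))  # prints 1
```

The point of the separation is that each stage can be logged, retried, and audited on its own, which is exactly what a one-off "answer in one response" approach cannot do.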
In public comparisons, Opus 4.6 is often associated with a very large context (up to 1M tokens) and "effort/depth" modes, as well as team-based agentic work (parallel subtasks). In my projects, this almost always means the following: the model doesn't just write parser code, but holds the data schema in its head, remembers exceptions, carefully accumulates partial results, and, crucially, patiently handles the long tail.
Codex 5.3, judging by descriptions and working style, is optimized for rapid iteration and execution: write, run, fix, run again. This is the ideal profile for "agentic coding" in a terminal, but in tasks where the goal is maximum extraction completeness, it may "cut corners": early stopping, narrow interpretation of conditions, skipping rare branches. A separate red flag from the discussion: the point that Codex is sometimes easier to use "via their app," while the API might not follow a chat-completion paradigm. For me, this isn't philosophy, but a practical integration risk: it changes orchestration methods, logging, reproducibility, and context control.
Business & Automation Impact
If I am building AI automation around parsing/entity extraction (catalogs, tenders, price lists, counterparties, object cards, specifications), the business isn't paying for "model response speed." The business pays for completeness, schema stability, reproducibility, and the cost of quality control. In this benchmark, Codex essentially signaled: "I brought a demo fast." Opus signaled: "I really mined the database."
Who wins with the Opus approach? Teams where data is an asset: analytics, market monitoring, compliance, risk scoring, competitive intelligence, procurement. There, lost objects aren't just "oh well," but a skewed KPI: an incomplete supplier list, missed items, incorrect nomenclature mapping. In such systems, I almost always design the loop so the model works deeply, with speed compensated by parallelism and incremental runs (avoiding full rebuilds every time).
Who wins with Codex? Product and engineering teams that need to quickly "fine-tune the pipeline": generate a parser, write tests, deploy a worker, connect a proxy, containerize, fix CI. Codex is convenient as a "force multiplier," especially when a developer stays in the loop checking results. But if you give it the role of "truth extractor" without a strong validation layer, the business will start running on a swiss-cheese dataset.
In Nahornyi AI Lab practice, I divide tasks into two budgets: compute budget and trust budget. Opus is usually more expensive in compute (time/tokens) but cheaper in trust: less manual checking, fewer "where did 90% of the objects go" moments. Codex is cheaper in compute but can be more expensive in trust: you'll have to build a stricter control system—coverage metrics, deduplication, distribution monitoring, retries, and random manual audits.
Strategic Vision & Deep Dive
My non-obvious conclusion from this comparison: in 2026, "model choice" is no longer about text quality or even code quality. It's about the architecture of AI solutions as a production line. I increasingly design a hybrid: Codex as the fast engineer (builds/fixes tools, scripts, tests, infrastructure) and Opus as the data miner and normalizer (does the heavy semantic lifting where completeness and accuracy matter).
If I need to "do AI automation" for parsing, I build in several layers of defense against the typical failures of fast models:
- Schema Contract: Rigid description of fields, types, and normalization rules + auto-checks.
- Completeness Metrics: Monitoring entity counts by source/page/category with alerts for drops.
- Two-Pass Strategy: First pass—collection, second—validation and picking up stragglers (this is where Opus often pays off).
- Traceability: Saving "proof" (URL/fragment/snapshot) and extraction reason for every object.
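The first two layers above (schema contract and completeness metrics) can be sketched concretely. This is an illustrative, minimal version assuming a simple dict-based record format; the field names, threshold, and the 319/16 counts reused from the benchmark are examples, not a production design:

```python
# Schema contract: field names and expected types, plus auto-checks.
SCHEMA = {
    "name": str,
    "price": float,
    "source_url": str,  # traceability: where the object was extracted from
}

def check_contract(record: dict) -> list[str]:
    """Return a list of contract violations for one record."""
    errors = []
    for field, ftype in SCHEMA.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            errors.append(f"bad type for {field}: {type(record[field]).__name__}")
    return errors

def completeness_alert(count_now: int, count_baseline: int,
                       drop_threshold: float = 0.2) -> bool:
    """Alert when the entity count drops sharply versus the previous run."""
    if count_baseline == 0:
        return False  # no baseline yet; nothing to compare against
    return (count_baseline - count_now) / count_baseline > drop_threshold

good = {"name": "Widget", "price": 9.99, "source_url": "https://a.example/1"}
bad = {"name": "Widget", "price": "9.99"}
print(check_contract(good))                          # prints []
print(check_contract(bad))                           # two violations
print(completeness_alert(count_now=16, count_baseline=319))  # prints True
```

An alert like the last line is precisely what turns "16 objects instead of 319" from a silent data loss into a visible incident.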
A separate note on API paradigms. If a model/platform is oriented more towards text-completion and terminal scenarios, I plan an adapter layer in advance: how to pass context, how to store intermediate state, how to handle cancel/resume, how to log "why the model decided to stop." This is boring engineering, but it is exactly what distinguishes a pilot from an industrial-grade AI implementation.
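Such an adapter layer can be sketched as a checkpoint/resume loop. Everything here is an assumption for illustration: `call_model` is a placeholder for whatever API the platform actually exposes, and the JSON state file is the simplest possible store of intermediate state:

```python
import json
import os

STATE_FILE = "extraction_state.json"

def load_state() -> dict:
    # Resume from disk if a previous run was cancelled mid-way.
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {"done": [], "results": []}

def save_state(state: dict) -> None:
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)

def call_model(chunk: str) -> list[dict]:
    # Placeholder for the real extraction call (assumption, not a real API).
    return [{"source": chunk, "name": f"object-from-{chunk}"}]

def run(chunks: list[str]) -> list[dict]:
    state = load_state()
    for chunk in chunks:
        if chunk in state["done"]:
            continue  # resume: skip chunks already processed
        state["results"].extend(call_model(chunk))
        state["done"].append(chunk)
        save_state(state)  # checkpoint after every chunk
    return state["results"]
```

The design choice worth noting: checkpointing after every chunk costs a little I/O, but it makes cancel/resume trivial and leaves an audit trail of exactly how far the run got before it stopped.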
I don't see the point in declaring an overall winner. In this test, Opus won—because the KPI was about comprehensive data collection. But in real business, the KPI is almost always dual: completeness + time-to-production. And here, the winner is the one who builds the right stack: a fast agent for development and ops, a deep agent for mining and quality control. The hype ends at the first reconciliation with accounting, CRM, or BI—that's when it surfaces that "16 objects" isn't an MVP, but a role assignment error.
If you want, I will analyze your case (sources, required completeness, SLA, QA budget) and propose a target AI architecture: where Codex fits, where Opus is needed, and how to link them into a single pipeline. Write to Nahornyi AI Lab—I, Vadym Nahornyi, will conduct the consultation personally.