When to turn reasoning models off (and why you usually should not)

The failure that looked like a model problem

While building a multi-pipeline intelligence tool, we hit a consistent parsing failure that looked, at first, like a prompt issue. The model was receiving a structured extraction task, being asked to return JSON, and failing with no obvious cause. The pipeline would run, return nothing useful, and log an error that pointed to JSON.parse rather than to anything obviously wrong upstream.

The prompt was correct. The task was clear. The model was capable. The problem was a behaviour we had not accounted for: the model was thinking out loud before it answered, and its reasoning prose was arriving in the output before the JSON we had asked for.

The fix was straightforward: disable thinking for that specific task. But the more useful lesson was what we learned about when not to disable it.

What reasoning models actually do

A standard language model receives a prompt and produces output. The output is the answer.

A reasoning model does something different. Before producing output, it externalises a chain-of-thought: a block of reasoning text where it works through the problem step by step. Depending on the model and the API, this reasoning content either arrives in a separate field or is prepended to the output in the same content block. Either way, it exists, and it has to be handled.

The model we were using defaults to thinking mode. In non-streaming responses, the reasoning content arrived in a separate field, leaving the expected content field empty. In streaming responses, reasoning chunks arrived first, then output chunks. Either way, passing the raw response directly to JSON.parse without first extracting the JSON object produced a parse failure on every structured output task.

The parser-side fix is simple once you know what is happening:

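// Non-streaming: read both fields, falling back to reasoning_content
// when the model leaves content empty
const content =
  data.choices?.[0]?.message?.content ||
  data.choices?.[0]?.message?.reasoning_content ||
  "";

// Structured output: strip code fences, then extract the outermost {...} span
const raw = content.replace(/```json|```/g, "").trim();
const json = raw.match(/\{[\s\S]*\}/)?.[0];
if (!json) {
  // Fail with a clear message instead of letting JSON.parse(undefined) throw
  throw new Error("No JSON object found in model response");
}
const parsed = JSON.parse(json);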
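
The snippet above covers the non-streaming case. For streaming, where reasoning chunks arrive before output chunks, one option is to accumulate the two kinds of text separately and parse only the output. The sketch below is a minimal illustration, not our production code: it assumes each chunk carries a `choices[0].delta` object with optional `reasoning_content` and `content` strings, which is a common shape but may differ for your provider. The function name `collectStream` and the sample chunks are made up for the example.

```javascript
// Accumulate streamed deltas into separate reasoning and output strings.
// Assumed chunk shape: { choices: [{ delta: { reasoning_content?, content? } }] }
function collectStream(chunks) {
  let reasoning = "";
  let content = "";
  for (const chunk of chunks) {
    const delta = chunk.choices?.[0]?.delta ?? {};
    if (typeof delta.reasoning_content === "string") {
      reasoning += delta.reasoning_content;
    }
    if (typeof delta.content === "string") {
      content += delta.content;
    }
  }
  return { reasoning, content };
}

// Hand-written chunks standing in for a real stream
const sampleChunks = [
  { choices: [{ delta: { reasoning_content: "The user wants tiers. " } }] },
  { choices: [{ delta: { reasoning_content: "Two competitors found." } }] },
  { choices: [{ delta: { content: '{"tiers":' } }] },
  { choices: [{ delta: { content: "[1,2]}" } }] },
];

const streamed = collectStream(sampleChunks);
const streamedParsed = JSON.parse(streamed.content); // only the output is parsed
console.log(streamedParsed.tiers);
```

Keeping the reasoning text in its own variable means it can still be logged for debugging without ever reaching the parser.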

But fixing the parser is the easy part. The harder question is whether to disable thinking at all, and for which tasks.

Why thinking mode is worth keeping on by default

The instinct when you hit a thinking-related parsing failure is to disable thinking everywhere and be done with it. That instinct is wrong, and it costs you something real.

For prose generation tasks, thinking mode is not overhead. It is the model using internal scratch space to reason before committing to an answer. It can read retrieved chunks, identify contradictions, weigh competing evidence, and structure a response before writing a single word of output. The result is qualitatively better than what a non-thinking model produces on the same task.

We run three other products alongside the tool where we hit the JSON failure. Across all of them, the LLM is doing theme extraction, regulatory explanation, market synthesis, and insight generation. Switching to non-thinking mode for any of those tasks would actively degrade the output. The model would produce shallower pattern matching instead of reasoned interpretation. It would give flatter answers where it had previously worked through conflicting sources before responding. The cost saving would be real but small. The quality degradation would also be real, and it would be visible in the product.

For synthesis, explanation, and insight generation, thinking mode is not a luxury. It is the mechanism that makes the output worth using.

The one case where thinking mode breaks things

The JSON parsing failure is the clearest and most common reason to disable thinking, and it applies broadly: any task where the output schema is fixed and the model is expected to return structured data.

When thinking mode is enabled, the model generates reasoning prose before producing output. For a prose task, that reasoning stays internal or in a separate field, and the output is clean. For a JSON task, either the reasoning prose is prepended to the JSON in the content field, or the content field arrives empty with the reasoning in a different field entirely. Either way, a naive parser fails.
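
To make the two failure shapes concrete, here is a small sketch that feeds each one to a bare JSON.parse. The response objects are hypothetical and hand-written for illustration; the helper `naiveParse` is ours, not part of any API.

```javascript
// Returns "ok" if JSON.parse succeeds, otherwise the name of the thrown error.
function naiveParse(text) {
  try {
    JSON.parse(text);
    return "ok";
  } catch (err) {
    return err.name;
  }
}

// Shape 1: reasoning prose prepended to the JSON in the content field
const prependedReasoning = {
  content: 'The results mention two vendors, so I will tier them. {"tier": 1}',
};

// Shape 2: content empty, reasoning delivered in a separate field
const emptyContent = {
  content: "",
  reasoning_content: "The results mention two vendors, so I will tier them.",
};

// What the parser expects
const cleanOutput = { content: '{"tier": 1}' };

const outcomes = {
  prepended: naiveParse(prependedReasoning.content),
  empty: naiveParse(emptyContent.content),
  clean: naiveParse(cleanOutput.content),
};
console.log(outcomes);
// { prepended: 'SyntaxError', empty: 'SyntaxError', clean: 'ok' }
```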

The solution is not to extract the JSON more carefully, though that is a necessary defensive measure regardless. Prompt instructions cannot reliably prevent a reasoning model from generating its chain-of-thought, because thinking happens at the inference level, below where your prompt operates. The only fix that addresses the cause rather than the symptom is the parameter that suppresses reasoning content at the API level. Disable thinking for that specific task and the model produces structured output directly, without reasoning preamble.

Most reasoning-capable models expose a parameter for this. The implementation differs by provider but the pattern is consistent: pass a flag to suppress thinking for the call, scoped to the task profile so the rest of the pipeline is unaffected.
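
One way to keep that flag scoped is a per-task profile consulted when the request body is built. The sketch below is illustrative only: the profile names, `buildRequestBody`, and especially the `thinking_enabled` field are invented for the example. The real parameter name and shape come from your provider's documentation, which is why the provider-specific part is passed in as a function rather than hard-coded.

```javascript
// Hypothetical task profiles: thinking is on unless a task opts out.
const TASK_PROFILES = {
  competitorTiering: { thinking: false }, // fixed-schema classification
  themeExtraction: { thinking: true },    // prose synthesis
};

// Placeholder for the provider-specific flag. `thinking_enabled` is a
// made-up field name; replace this function with whatever your API expects.
const suppressThinkingExample = (body) => ({ ...body, thinking_enabled: false });

// Build a request body, applying the suppression only where the profile asks.
function buildRequestBody(task, messages, suppressThinking) {
  const profile = TASK_PROFILES[task] ?? { thinking: true }; // default: on
  const body = { messages };
  return profile.thinking ? body : suppressThinking(body);
}

const demoMessages = [{ role: "user", content: "Classify these search results." }];
const tieringBody = buildRequestBody("competitorTiering", demoMessages, suppressThinkingExample);
const themeBody = buildRequestBody("themeExtraction", demoMessages, suppressThinkingExample);
const unknownBody = buildRequestBody("somethingNew", demoMessages, suppressThinkingExample);

console.log(tieringBody, themeBody, unknownBody);
```

Unknown tasks fall through to thinking on, which matches the default argued for in this piece: a task has to opt out explicitly.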

The tasks where this applies are narrower than they first appear: structured JSON extraction, classification into fixed categories, routing decisions, and tool or function calls. These share a property: correctness is defined by adherence to a schema, not by the quality of deliberation. The model is applying a defined structure, not discovering the right framing. Thinking mode adds tokens and latency without improving the result, and it breaks the parser.

The decision in practice

We ended up with thinking mode on as the default across all our products, and disabled only for the specific competitor tiering task in the tool where we hit the failure. That task takes search results and classifies them into structured tiers with confidence scores and signals. It is classification and extraction. The output schema is fixed. Non-thinking mode produces results that are just as good, because the task does not require deliberation, and it produces them in a format the parser can handle.
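
Because the schema is fixed, the output of a task like this can be checked mechanically. The following is a rough sketch of such a check, not our actual schema: the field names (`name`, `tier`, `confidence`, `signals`), the tier labels, and the 0 to 1 confidence range are all assumptions chosen for illustration.

```javascript
// Hypothetical tier labels; the real set depends on the product.
const ALLOWED_TIERS = ["direct", "adjacent", "peripheral"];

// Returns a list of problems with one tiered item; an empty list means valid.
function validateTieredCompetitor(item) {
  const errors = [];
  if (typeof item?.name !== "string" || item.name.length === 0) {
    errors.push("name must be a non-empty string");
  }
  if (!ALLOWED_TIERS.includes(item?.tier)) {
    errors.push("tier must be one of: " + ALLOWED_TIERS.join(", "));
  }
  // Written as a negated range check so that NaN is rejected too
  if (typeof item?.confidence !== "number" ||
      !(item.confidence >= 0 && item.confidence <= 1)) {
    errors.push("confidence must be a number between 0 and 1");
  }
  if (!Array.isArray(item?.signals) ||
      !item.signals.every((s) => typeof s === "string")) {
    errors.push("signals must be an array of strings");
  }
  return errors;
}

const goodItem = {
  name: "Example Co",
  tier: "direct",
  confidence: 0.8,
  signals: ["overlapping feature set"],
};
const badItem = { name: "", tier: "unknown", confidence: 1.5, signals: "none" };

console.log(validateTieredCompetitor(goodItem)); // []
console.log(validateTieredCompetitor(badItem).length); // 4
```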

The results from that task, after the fix, were exactly what you would want: relevant competitors correctly tiered, confidence scores that made sense, signals that supported the classification, and no hallucinations. Non-thinking mode is genuinely adequate for that class of task. The quality is not degraded because there was no deliberative reasoning to lose.

The same would not be true, for example, of the theme extraction or regulatory explanation tasks running in our other apps. Those tasks produce prose, require genuine reasoning across multiple sources, and benefit from the model working through the problem before answering. Disabling thinking there would be a net negative.

What this looks like as a decision rule

Before adding a generation task to a pipeline, two questions determine whether to disable thinking.

The first: is the output schema fixed? If the task returns structured JSON, a classification label, or a routing decision, disable thinking. With thinking on, reasoning prose can arrive ahead of the output, and a strict parser cannot rely on the response matching the format. If the task returns prose where the structure is flexible, leave thinking on.

The second: does the task require deliberation across competing or ambiguous inputs? If the task is primarily extractive or classificatory, the answer is no. The model is applying a defined structure, and non-thinking mode is adequate. If the task requires genuine reasoning, weighing evidence, or synthesising across multiple sources, thinking mode improves the output and the cost is worth paying.

Both conditions point in the same direction for most tasks. Prose generation tasks benefit from thinking and do not have format constraints that break. Structured output tasks have format constraints that break with thinking and do not benefit from deliberation anyway. The cases where the two conditions point in different directions are the ones worth examining carefully.
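
The two questions can be written down as a small function. This is a sketch of the rule as stated above, with hypothetical names (`thinkingDecision`, `needsReview`). Where the two answers conflict it deliberately returns no verdict, since those are the cases that need a human look rather than a default.

```javascript
// Encode the two questions. When they agree, return a setting; when they
// conflict, return thinking: null and flag the task for manual review.
function thinkingDecision({ fixedSchema, needsDeliberation }) {
  if (fixedSchema && !needsDeliberation) {
    // Structured output, no deliberation needed: disable thinking
    return { thinking: false, needsReview: false };
  }
  if (!fixedSchema && needsDeliberation) {
    // Flexible prose that benefits from reasoning: keep thinking on
    return { thinking: true, needsReview: false };
  }
  // Fixed schema that needs deliberation, or flexible prose that does not:
  // the questions point in different directions
  return { thinking: null, needsReview: true };
}

const tieringDecision = thinkingDecision({ fixedSchema: true, needsDeliberation: false });
const synthesisDecision = thinkingDecision({ fixedSchema: false, needsDeliberation: true });
const conflictDecision = thinkingDecision({ fixedSchema: true, needsDeliberation: true });

console.log(tieringDecision, synthesisDecision, conflictDecision);
```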

The honest cost of getting it wrong in either direction

Leaving thinking mode on for a JSON task produces a parsing failure. The pipeline stops. The error is logged. The failure is loud, immediate, and fixable. You know exactly what went wrong.

Disabling thinking for a prose task that benefits from deliberation produces something subtler. The output is returned. It is formatted correctly. It may even look right. But it is shallower than it should be. The model did not work through the problem before answering. For tasks where the quality of reasoning is the product, that degradation is real and it accumulates across every call.

The first failure is loud and easy to fix. The second is quiet and easy to miss until something downstream goes wrong.

That asymmetry matters when the instinct is to disable thinking globally as a defensive measure after hitting the JSON failure. The safe-feeling choice is actually the one that degrades your product silently. The right choice is to be specific: disable thinking exactly where the output schema requires it, and nowhere else.