Claude Code's Performance Degrades Significantly: 67% Drop in Thinking Depth Impacts Complex Engineering Tasks

A detailed report by Stella Laurenzo, an engineer working on open-source AI software at AMD, documents a significant degradation in Claude Code's performance following an update in February 2026. The report finds a 67% reduction in the model's “thinking depth,” leaving it unable to handle complex engineering tasks, and has sparked widespread community concern.

The analysis draws on 6,852 Claude Code session JSONL files from four projects (iree-loom, iree-amdgpu, iree-remoting, bureau) under ~/.claude/projects/, spanning 17,871 thinking blocks, 234,760 tool calls, and over 18,000 user prompts from late January to early April 2026. All sessions used the most capable model, Claude Opus, connected directly via the official Anthropic API, ruling out third-party routing as a confounding factor.
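The first step of such an analysis is tallying event types across session transcripts. The sketch below is illustrative only: it assumes each JSONL line is an event with a `type` field, which is an assumption about the data, not the documented schema of Claude Code's session files.

```python
import json

def tally_session(jsonl_lines):
    """Count thinking blocks, tool calls, and user prompts in one
    session transcript. The event `type` values used here are
    assumptions, not Claude Code's actual session schema."""
    counts = {"thinking": 0, "tool_call": 0, "user_prompt": 0}
    for line in jsonl_lines:
        event = json.loads(line)
        kind = event.get("type")
        if kind in counts:
            counts[kind] += 1
    return counts

sample = [
    '{"type": "user_prompt", "text": "Fix the build"}',
    '{"type": "thinking", "text": "First I should read the file..."}',
    '{"type": "tool_call", "name": "Read"}',
    '{"type": "tool_call", "name": "Edit"}',
]
print(tally_session(sample))
# {'thinking': 1, 'tool_call': 2, 'user_prompt': 1}
```

Summing these per-session tallies over all 6,852 files would yield corpus-level counts like those the report cites.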

A key finding is that Claude Code's thinking depth plummeted from approximately 2,200 characters per thinking block between January 30 and February 8 to 720 characters by late February, a 67% drop; by early March it had shrunk further to 560 characters, a 75% drop from the baseline. This timeline aligns precisely with the February rollout of redact-thinking-2026-02-12, a feature that hides thinking content. User reports of quality degradation surfaced as early as March 8, just as the proportion of hidden thinking blocks exceeded 50%.
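The depth metric itself is easy to reproduce in principle. A minimal sketch, assuming “thinking depth” means characters per thinking block aggregated per day by median (the report does not state its exact aggregation):

```python
from collections import defaultdict
from statistics import median

def thinking_depth_by_day(events):
    """Median thinking-block length (in characters) per day.
    `events` are (date, thinking_text) pairs; using the median is
    an assumption about the report's aggregation."""
    by_day = defaultdict(list)
    for day, text in events:
        by_day[day].append(len(text))
    return {day: median(lengths) for day, lengths in sorted(by_day.items())}

# Synthetic data shaped like the reported trend, not real sessions.
events = [
    ("2026-02-01", "x" * 2100), ("2026-02-01", "x" * 2300),  # quality period
    ("2026-02-25", "x" * 700),  ("2026-02-25", "x" * 740),   # degradation
]
print(thinking_depth_by_day(events))
# {'2026-02-01': 2200.0, '2026-02-25': 720.0}
```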

The drastic reduction in thinking depth produced a fundamental shift in the model's tool usage patterns. During the “quality period” (January 30 to February 12), Claude Code followed a rigorous “research then modify” approach, with a read-to-modify ratio of 6.6: it read files 6.6 times for every modification, examining target and dependent files, searching global call relationships, and reviewing header files and test cases before making precise changes. In the “degradation period” after March 8, the ratio dropped sharply to 2.0, a 70% reduction in research effort. The model now skips preliminary investigation, modifying files after reading only the current one and ignoring contextual relevance. One in three modifications was made without reading the target file's context at all, leading the model to insert new declarations between documentation comments and the functions they describe, breaking their semantic connection.
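The read-to-modify ratio can be computed by bucketing tool calls into research versus modification operations. The tool-name buckets below are assumptions based on Claude Code's publicly visible tools, not the report's exact classification:

```python
def read_to_modify_ratio(tool_calls):
    """Ratio of research tool calls to modification tool calls.
    The bucket membership is an assumption, not the report's
    published methodology."""
    read_tools = {"Read", "Grep", "Glob", "LS"}
    modify_tools = {"Edit", "Write", "MultiEdit"}
    reads = sum(1 for t in tool_calls if t in read_tools)
    modifies = sum(1 for t in tool_calls if t in modify_tools)
    return reads / modifies if modifies else float("inf")

# A "research then modify" session: six lookups before one edit.
quality = ["Read", "Grep", "Read", "Read", "Grep", "Read", "Edit"]
print(read_to_modify_ratio(quality))  # 6.0

# A "rush to modify" session: one read, then straight to editing.
degraded = ["Read", "Edit", "Write"]
print(read_to_modify_ratio(degraded))  # 0.5
```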

This behavioral shift has numerous negative consequences, quantifiable through several quality metrics. “Termination hook scripts,” designed to catch behaviors like shirking responsibility or premature termination, had never fired before March 8; they fired 173 times in the following 17 days, roughly 10 per day. Negative sentiment in user prompts rose from 5.8% to 9.8%, the frequency of corrections for evasive behavior doubled, and the average number of prompts per session fell by 22%. Previously unseen inference-loop issues also emerged: the model failed to resolve internal contradictions before outputting, producing visible self-corrections such as “Oh wait,” “Actually,” or “Let me rethink.”

Inference loop rates more than tripled. In the worst sessions, the model exhibited over 20 inference reversals in a single response, rendering the final output unreliable because the reasoning path was thoroughly confused. User interruption rates rose 12-fold from the quality period to the later stage, reflecting increased manual correction effort. After being corrected, the model itself frequently admitted to poor output quality, saying things like “You're right, that was too superficial”; these are exactly the errors that should have been caught and fixed during its internal reasoning phase.
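Because these reversals are visible in the output text, they can be counted with simple pattern matching. The marker list below is illustrative, drawn from the phrases the report quotes; the report's full marker set is not public:

```python
import re

# Phrases treated as visible self-corrections. This set is an
# illustrative assumption, not the report's complete marker list.
REVERSAL_MARKERS = re.compile(r"\b(Oh wait|Actually|Let me rethink)\b")

def count_reversals(response_text):
    """Count self-correction markers in a single model response."""
    return len(REVERSAL_MARKERS.findall(response_text))

reply = ("The bug is in parse(). Actually, it is in lex(). "
         "Oh wait, parse() calls lex() twice. Let me rethink this.")
print(count_reversals(reply))  # 3
```

A session-level inference-loop rate would then be the share of responses where this count exceeds some threshold.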

Furthermore, the frequent appearance of “Simplest Fix” in the model's output signals an optimization toward minimizing workload. With sufficient thinking depth, the model evaluated multiple solutions and selected the best; with insufficient depth, it instinctively takes the path of lowest reasoning cost, often missing the correct solution. The precision of code modifications also declined sharply: the proportion of completely new files created relative to modification operations doubled, from 4.9% during the quality period to 10–11.1% in the degradation phase. The model increasingly rewrote entire files rather than making precise adjustments, losing its grasp of project-specific conventions and context.

The report also addressed community feedback regarding Claude Code's quality fluctuating with time of day. Before thinking content was hidden (January 30 - March 7), thinking depth remained relatively stable throughout the day. However, after the hidden thinking content feature was implemented (March 8 - April 1), the diurnal pattern completely reversed, with fluctuations significantly intensifying. Contrary to expectations, overall thinking depth during non-peak hours was lower. Specifically, 5:00 PM PST and 7:00 PM PST were the worst-performing periods, with estimated median thinking depths dropping to 423 and 373 characters respectively.
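The diurnal analysis amounts to grouping thinking-block lengths by local hour. A minimal sketch, assuming samples have already been converted to PST upstream and that the per-hour statistic is the median:

```python
from collections import defaultdict
from statistics import median

def depth_by_hour(samples):
    """Median thinking-block length per hour of day. `samples` are
    (hour, char_count) pairs; timezone conversion to PST is assumed
    to have happened before this step."""
    buckets = defaultdict(list)
    for hour, chars in samples:
        buckets[hour].append(chars)
    return {h: median(v) for h, v in sorted(buckets.items())}

# Synthetic data echoing the reported 5 PM / 7 PM troughs.
samples = [(9, 610), (9, 650), (17, 400), (17, 446), (19, 373)]
print(depth_by_hour(samples))
# {9: 630.0, 17: 423.0, 19: 373}
```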
