DeepSeek's new image recognition mode is currently in limited grayscale testing and has sparked considerable interest within the AI community. Even before any official announcement from DeepSeek, enthusiasts and researchers have been actively digging into the technology behind this "image recognition mode," and they have already turned up some notable findings.
A key observation is that the image recognition mode appears to be an independent model, separate from DeepSeek V4 flash/pro. Moreover, the multimodal capability sketched in the "future outlook" of DeepSeek's V4 technical report seems to have quietly made its way into the product much sooner than anticipated.
Having gained access to the grayscale test, I ran some hands-on evaluations right away. The image recognition mode offers a toggle for "deep thinking." With it off ("non-thinking mode"), the vision model is remarkably fast, returning answers almost instantaneously.
What, then, are the differences in reasoning capability between the "thinking" and "non-thinking" modes?
Reasoning Capabilities Test
We first tackled a spatial reasoning puzzle: which of the shapes on the right should be placed at the question mark to complete the cube on the left? In "non-thinking mode," DeepSeek answered quickly, but incorrectly.
Upon activating "deep thinking mode," DeepSeek successfully solved the puzzle, yielding the correct answer D. However, this process took over four minutes, highlighting the lengthy deliberation involved. Interestingly, the model seemed to have identified the correct answer midway through its reasoning, only to take a circuitous path to its final confirmation.
Next, we tested a "spot the difference" task. In "non-thinking mode," DeepSeek quickly identified seven differences. A closer inspection revealed numerous hallucinations, such as non-existent keys or empty plates.
Switching to "deep thinking mode," DeepSeek found 12 differences in just 16 seconds. Paradoxically, this attempt resulted in even more hallucinations, potentially due to the complexity of the image or the model's current limitations.
Practical Functionality Performance
While the reasoning aspects show room for improvement, how does DeepSeek's image recognition mode perform in practical applications?
We started with OCR (Optical Character Recognition). We fed an image of the abstract of DeepSeek V4's technical report into the model with "deep thinking" disabled; it rapidly recognized the text and even intelligently turned the open-source reference into a hyperlink.
For pure text recognition, its performance was highly reliable. Further testing with tables showed DeepSeek handling them perfectly, outputting structured data neatly formatted in Markdown.
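To make the OCR-and-table flow concrete, here is a minimal sketch of what such a request could look like if the image recognition mode were eventually exposed through DeepSeek's OpenAI-compatible chat API. Image input is not a documented part of that API today, and the model name deepseek-vision is purely an illustrative assumption.

```python
import base64
from openai import OpenAI

# Hypothetical sketch: assumes the image recognition mode is one day exposed
# through DeepSeek's OpenAI-compatible endpoint. The model name and image
# support below are assumptions, not documented behavior.
client = OpenAI(api_key="sk-...", base_url="https://api.deepseek.com")

# Encode a screenshot (e.g., a table from the V4 technical report) as base64.
with open("v4_report_table.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="deepseek-vision",  # assumed model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Extract all text from this screenshot and reproduce "
                     "any table as a Markdown table."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)

# Expected output: the recognized text plus a Markdown-formatted table.
print(resp.choices[0].message.content)
```

In the grayscale test itself, of course, all of this happens inside the chat interface rather than through an API call.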
A particularly impressive new feature lets users provide a webpage screenshot from which DeepSeek reconstructs the corresponding HTML (achievable even in "non-thinking mode"). Even more remarkably, the buttons in the reconstructed HTML are functional; for example, they come pre-wired with the correct API documentation links and navigate there when clicked.
DeepSeek also successfully passed a "hidden image" test. However, it occasionally faltered in color blindness tests.
According to the image recognition mode itself, its knowledge cutoff matches that of DeepSeek V4 flash/pro: May 2025. Yet a blogger probing its world knowledge found that the vision model recognizes a specific individual whom V4 flash/pro does not. Verification confirmed that, with web access disabled, V4 flash indeed had no knowledge of this individual, while the image recognition mode could recall information dated April 2026.
This strongly suggests that the vision model behind the image recognition mode was trained independently, and that its knowledge base is more current than that of the text-only V4 models.
Faster-Than-Expected Multimodal Progress
Currently, DeepSeek's image recognition mode is still in grayscale testing, with its rollout scope gradually expanding. Frankly, DeepSeek Vision has considerable room for refinement.
However, it's worth recalling DeepSeek's statement in its V4 technical report: "We are also working hard to integrate multimodal capabilities into our models." At the time, many viewed this as a lower-priority goal, since prioritizing pure text capabilities with limited resources seemed sensible. Now it appears that DeepSeek's progress in multimodal AI has far outpaced external expectations, in both speed and depth. This prompts speculation: is DeepSeek also accelerating its exploration of "other new dimensions of model sparsity beyond MoE and sparse attention architectures," as mentioned in its technical papers?