News

Google DeepMind's "AI Co-Clinician" Outperforms GPT-5.4 in Blind Medical Tests, Still Trails Experienced Doctors

Google DeepMind is developing an "AI co-clinician" designed to assist doctors in patient care. The system has shown promising results in simulation studies but still trails experienced physicians in overall clinical capability. The research also suggests why current conversational AI models, like ChatGPT's voice mode, are not yet ready for serious medical consultations.

The "AI co-clinician" is built upon the concept of "triadic care," where AI agents support patients throughout their treatment while doctors maintain clinical authority and oversight. The core idea is for the AI system to function as an integral member of the medical team, providing support to patients under a clinician's supervision.

To evaluate the system from a clinician's perspective, the DeepMind team collaborated with academic physicians to adapt the NOHARM framework, which assesses two types of errors: errors of commission (harmful or incorrect content included in an answer) and errors of omission (clinically necessary information an answer leaves out).
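As a rough illustration only (not the actual NOHARM implementation), a per-answer review along these two axes could be tallied as in the sketch below; the field names, severity handling, and example flags are assumptions made for clarity.

```python
from dataclasses import dataclass, field

@dataclass
class AnswerReview:
    """Hypothetical per-answer safety review along the two NOHARM-style axes."""
    case_id: str
    commission_errors: list[str] = field(default_factory=list)  # harmful/incorrect content present
    omission_errors: list[str] = field(default_factory=list)    # clinically necessary content missing

def summarize(reviews: list[AnswerReview]) -> dict[str, int]:
    """Count how many answers contain each error type across a query set."""
    return {
        "total_answers": len(reviews),
        "with_commission": sum(1 for r in reviews if r.commission_errors),
        "with_omission": sum(1 for r in reviews if r.omission_errors),
    }

# Example: two reviewed answers, one flagged for leaving out a dosing caution.
reviews = [
    AnswerReview("case-001"),
    AnswerReview("case-002", omission_errors=["no renal-dosing caution"]),
]
print(summarize(reviews))  # {'total_answers': 2, 'with_commission': 0, 'with_omission': 1}
```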

In a blind comparison involving 98 realistic primary care queries, doctors consistently preferred the "AI co-clinician's" answers over those from leading evidence synthesis tools. It won the head-to-head preference 67 to 26 against an existing clinical AI system and 63 to 30 against GPT-5.4 (enhanced with search capabilities). An objective analysis found only one critical error across the 98 cases.

The "AI co-clinician" demonstrated even greater proficiency in medication-related questions. The RxQA benchmark, comprising 600 questions on active ingredients, interactions, and dosages, sourced from national drug directories and vetted by licensed pharmacists, is particularly challenging. Primary care doctors typically scored 61.3% with reference books and just 48.3% without.

In this benchmark, the "AI co-clinician" achieved a score of 73.3%, narrowly surpassing GPT-5.4-thinking-with-search at 72.7%. The performance gap widened considerably when questions were posed in an open-ended format, mirroring how doctors actually look up information on the job. Here, the "AI co-clinician" attained a quality score of 95.0%, compared to 90.9% for OpenAI's model.

Beyond text-based support, Google DeepMind is exploring the "AI co-clinician's" capabilities in handling real-time audio and video for telemedicine applications. The team, in collaboration with physicians from Harvard and Stanford, conducted a randomized simulation study involving 20 synthetic clinical scenarios, 10 doctors acting as patient actors, and a total of 120 hypothetical telemedicine visits.

The "AI co-clinician" showcased multimodal functionalities extending beyond text-only systems. For instance, it successfully corrected a patient's inhaler technique and guided patients through shoulder examinations to identify potential rotator cuff injuries.

For patient-facing conversations, the "AI co-clinician" utilizes a dual-agent setup: a "Planner" module monitors the conversation to ensure the "Talker" agent operates within safe clinical boundaries. When physicians use the system, it prioritizes solid clinical evidence and performs verification and citation checks.
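The article does not publish the implementation, but conceptually the Planner/Talker split can be sketched as a check-before-reply loop. Every function, rule, and phrase in the sketch below is a placeholder assumption for illustration, not DeepMind's code.

```python
def talker_draft_reply(conversation: list[str]) -> str:
    """Placeholder for the conversational 'Talker' agent that drafts a patient-facing reply."""
    return "It sounds like a mild tension headache; rest and hydration usually help."

def planner_review(conversation: list[str], draft: str) -> tuple[bool, str]:
    """Placeholder for the 'Planner' module: checks the draft against simple safety rules
    and returns (approved, guidance). Real clinical guardrails would be far richer."""
    red_flags = ["chest pain", "shortness of breath", "worst headache of my life"]
    if any(flag in " ".join(conversation).lower() for flag in red_flags):
        return False, "Escalate: advise urgent in-person evaluation and notify the supervising clinician."
    if "diagnos" in draft.lower():
        return False, "Soften wording: describe possibilities, do not state a diagnosis."
    return True, ""

def respond(conversation: list[str]) -> str:
    """Talker drafts a reply, Planner reviews it; unsafe drafts are replaced by an escalation."""
    draft = talker_draft_reply(conversation)
    approved, guidance = planner_review(conversation, draft)
    if approved:
        return draft
    return f"[Escalation to clinician] {guidance}"

print(respond(["Patient: I have had the worst headache of my life since this morning."]))
```

In this toy version the Planner simply vetoes or rewords the Talker's output; the reported system presumably applies far more extensive clinical checks, including the verification and citation steps mentioned for physician-facing use.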
