Two AI chatbots passed the USMLE, doctors react
MDLinx Jan 25, 2023
It was like something out of a science fiction movie—but it really happened. Two artificial intelligence (AI) programs—ChatGPT and Flan-PaLM—passed all three portions of the United States Medical Licensing Examination (USMLE), the exams that must be completed to enter a residency program.
Physicians shouldn’t worry about being replaced by AI anytime soon, however; AI's just-passing scores point to the technology’s potential flaws. But AI may be helpful to clinicians in certain aspects of medicine, and the programs could improve over time.
How did the AIs score?
AI is not new to clinical medicine. It has been used to help diagnose and treat diseases such as Parkinson's disease, streamline notetaking in electronic health records, send reminders about patient appointments, prescription medications, and vaccination schedules, and draft letters appealing insurance denials.
Taking on intellectual challenges like passing physician licensing exams may be the next frontier for AI programs.
Two studies highlighted the differences between the two AI programs that passed all three levels of the USMLE.
The first study, posted on the preprint server medRxiv in December 2022, investigated ChatGPT's performance on the USMLE; the model received no specialized training or reinforcement before taking the exams.
Kung TH, Cheatham M, ChatGPT, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. medRxiv. December 20, 2022:2022.12.19.22283643.
ChatGPT performed at greater than 50% accuracy across all three exams and answered roughly 60% of the questions correctly.
The second paper, posted on arXiv the same month, evaluated the performance of another large language model, Flan-PaLM, on the USMLE.
Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. arXiv:2212.13138 [cs]. December 26, 2022.
The key difference between the two models was that Flan-PaLM was extensively tuned for the exams using MultiMedQA, a collection of medical question-answering datasets.
Flan-PaLM achieved more than 67% accuracy on the USMLE questions.
Doctors' opinions
USMLE scores are one of the main deciding factors residency programs use to choose candidates. Answering 60% of the 280 questions on the exam correctly is a passing score. So how impressive were the AI programs’ USMLE scores? Not very, according to some doctors.
“One of the differences between human performance and AI performance may be in asking the right questions and/or observations,” stated MDLinx medical contributor Scott Cunningham, MD, PhD. “USMLE questions provide all of the data; the test-taker needs to connect the dots.”
A 67% on the USMLE is nothing to brag about. Would you be OK if your doctor got the diagnosis right only two out of three times?
Potential pitfalls of AI use in medicine
Beyond the potential for error, physicians are concerned that AI services may be biased or may compromise data privacy and security, according to a HealthITanalytics.com article.
Arguing the pros and cons of artificial intelligence in healthcare. HealthITanalytics.com. March 02, 2022.
Such concerns are especially important in clinical practice, where mistakes can endanger lives.
One thing that AI cannot learn is empathy, according to an article published in AI & Society.
Montemayor C, Halpern J, Fairweather A. In principle obstacles for empathic AI: Why we can’t replace human empathy in healthcare. AI & Soc. 2022;37:1353–1359.
For example, it can’t express sympathy when diagnosing a patient with a terminal disease.
In addition, a study published in BMJ Quality & Safety cited concerns that an automated system may find ways to “game” outcomes in order to achieve consistently positive results.
Challen R, Denny J, Pitt M, et al. Artificial intelligence, bias and clinical safety. BMJ Quality & Safety. 2019;28:231–237.
Ways AI could help
AI may still have its uses in medical facilities. These programs could save healthcare professionals time by taking over routine, labor-intensive tasks.
ChatGPT can generate long-form writing from human prompts and has already been used to draft letters to medical insurers seeking approval of claims.
Other types of AI software have been used to streamline the entry of clinical notes into the EMR and to communicate with patients, alerting them to upcoming appointments, lab values, and test results.
The bottom line
AI offers potential benefits for medical practitioners, but research points to accuracy limitations, as evidenced by the programs' just-passing performance on the USMLE. Privacy and bias are also concerns, prompting some clinicians to approach these tools cautiously.
As is often the case with emerging technologies, AI may improve over time.
While these programs will never replace people in providing empathetic, well-reasoned care to patients, their capabilities could continue to evolve, improving efficiency for medical practices.
What this means for you
AI may be beneficial in some areas of medicine, helping to save time and money. Still, the technology may fall short on reliability and accuracy, and it cannot provide compassionate care. As AI makes its way deeper into clinical medicine, physicians should proceed cautiously and judge for themselves when and where it is genuinely helpful in practice.