Large language models face off



Which LLM performed best in a rheumatology diagnostics quiz – and will you be out of a job?


As the march of machine learning continues seemingly unabated, large language models (LLMs) have the potential to significantly transform clinical, educational and research work in medicine.

A team that included Professor Vincenzo Venerito and Dr Latika Gupta from the Digital Rheumatology Network set out to compare the performance of several LLMs in a rheumatology context: OpenAI’s GPT-4, Anthropic’s Claude 1.3 and Claude 2, and Google’s Bard (powered by PaLM 2).

Using multiple-choice questions from The Lancet’s Picture Quiz Gallery, the team focused on items where the core subject was rheumatic disease or a differential diagnosis involving rheumatic disease. There were 58 questions all up, and all were text-based, as not all of the LLMs supported image input.
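
For context, here is a minimal sketch of how a text-only multiple-choice item of this kind might be posed to one of the models via the OpenAI Python SDK. The question, answer options and prompt wording are illustrative placeholders, not items from the study:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative placeholder item, not one of the 58 Lancet quiz questions
question = (
    "A 45-year-old woman presents with symmetrical small-joint pain, "
    "morning stiffness lasting over an hour and positive anti-CCP antibodies. "
    "What is the most likely diagnosis?"
)
options = [
    "A. Osteoarthritis",
    "B. Rheumatoid arthritis",
    "C. Gout",
    "D. Psoriatic arthritis",
]

prompt = question + "\n" + "\n".join(options) + "\nAnswer with a single letter."

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```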

GPT-4 and Claude 2 topped the class with 81% correct, followed by Claude 1.3 (72%). Bard, which unlike the other models could access real-time internet data (potentially allowing more diverse and up-to-date responses), scored 66% and was unable to process two of the cases.

“Our study underscores the considerable potential of LLMs in assisting physicians or general practitioners with the likelihood of a rheumatology differential and guide referral strategies,” the authors wrote in The Lancet Rheumatology.

“Additionally, by leveraging LLMs, academic rheumatologists can streamline the process of generating cases or questions derived from real-life scenarios, such as those found in electronic medical records, saving substantial time and contributing to the enhancement of academic education.”

Overall scores notwithstanding, each of the models had its relative strengths.

GPT-4 confirmed its strengths in solving complex clinical cases by analysing patients’ medical history, laboratory results and other information.

“Claude showed sound clinical reasoning considering epidemiology, counterfactuals, and false negative hypothesis. However, Bard’s internet access ability can offer updated information,” they wrote.

All models except Claude 2 struggled with differential diagnoses in which infectious diseases were uncommonly involved.

While Bard underperformed relative to its potential in this contest, the authors noted that Google’s Med-PaLM 2, which wasn’t included in the study because access to it was limited, had performed on par with expert human test takers on US medical licensing exam questions.

The authors also commented on Google’s PaLM-E, “which aims to handle multimodal inputs effectively together with the text processing capabilities of PaLM 2, thereby offering richer and more holistic analysis capabilities”.

The LLMs differed widely in how much information they could process at once: Bard had a 2K token limit, GPT-4 up to 8K tokens and Claude up to 100K tokens – equivalent to almost 300 book pages.
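
As a rough illustration of what these limits mean in practice, token counts for a given text can be checked with OpenAI’s open-source tiktoken library. The sample text and the words-per-token rule of thumb below are illustrative assumptions, not figures from the study:

```python
import tiktoken

# Tokeniser associated with GPT-4
enc = tiktoken.encoding_for_model("gpt-4")

# Illustrative placeholder standing in for a long patient history
history = "Patient reports six months of symmetrical joint pain. " * 2000

n_tokens = len(enc.encode(history))
print(f"{n_tokens} tokens")

# Rough rule of thumb: ~0.75 English words per token and ~300 words per
# book page, so a 100K-token window covers roughly 250-300 pages of text
print(f"~{n_tokens * 0.75 / 300:.0f} book pages")
```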

“The implications of this feature in the medical field could be transformative. LLMs could process extensive medical research papers and clinical trials with large context windows allowing them to dig into long patient histories swiftly and comprehensively, potentially leading to faster diagnoses and more personalised treatment plans,” wrote the authors.

Given the rapid advancements in LLMs, the authors suggested more comparative studies across different clinical and research contexts were needed, while also calling for privacy and data security concerns to be adequately addressed.

Lancet Rheumatology 2023, published October
