ChatGPT Writes Good Clinical Notes, Study Finds

— Clinical reviewers often couldn't distinguish AI from human-generated notes

by Michael DePeau-Wilson, Enterprise & Investigative Writer, MedPage Today, July 17, 2023

ChatGPT generated clinical notes on par with those written by senior internal medicine residents, according to a study that suggested the technology might be ready for a larger role in everyday clinical practice.

Grades given for clinical notes on history of present illness (HPI) differed by less than 1 point on a 15-point composite scale between the senior residents and an earlier version of ChatGPT (mean 12.18 vs 11.23, P=0.09), according to Ashwin Nayak, MD, of Stanford University in California, and coauthors.

However, the resident-written HPIs were given higher mean scores for their level of detail (4.13 vs 3.57, P=0.006), the researchers reported in a research letter in JAMA Internal Medicine.

Attending physicians in internal medicine who reviewed the notes were able to correctly classify whether the HPIs were generated by ChatGPT with only 61% accuracy (P=0.06).

"Large language models like ChatGPT seem to be advanced enough to draft clinical notes at a level that we would want as a clinician reviewing the charts and interpreting the clinical situation," Nayak told MedPage Today. "That is pretty exciting because it opens up a whole lot of doors for ways to automate some of the more menial tasks and the documentation tasks that clinicians don't love to do."

In total, 30 internal medicine attending physicians were asked to blindly evaluate five HPIs -- four written by senior residents and one generated by ChatGPT -- and grade them on their level of detail, succinctness, and organization.

The researchers also used a prompt engineering method to generate the AI-written HPIs. This process involved inputting a transcript of a patient-provider interaction into the Jan. 9, 2023, version of ChatGPT to produce HPIs, analyzing them for errors, and using those HPIs to modify the prompt. This process was repeated twice to ensure the AI chatbot produced an accurate HPI for the final review, and just one of the final set was selected for comparison with the senior resident HPIs.
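The generate, review, and revise cycle described above can be sketched in a few lines of Python. This is only an illustrative outline, not the researchers' code: the function `refine_prompt`, its three callables, and the stand-in helpers (`fake_generate`, `fake_find_errors`, `fake_revise`) are all hypothetical, and the "invented age" error is a made-up example of the kind of mistake a reviewer might flag.

```python
from typing import Callable, List, Tuple


def refine_prompt(
    prompt: str,
    transcript: str,
    generate: Callable[[str, str], str],
    find_errors: Callable[[str], List[str]],
    revise: Callable[[str, List[str]], str],
    rounds: int = 2,
) -> Tuple[str, str]:
    """Run generate -> review -> revise for a fixed number of rounds.

    Returns the refined prompt and an HPI generated with it.
    """
    for _ in range(rounds):
        hpi = generate(prompt, transcript)   # draft an HPI from the transcript
        errors = find_errors(hpi)            # review the draft for errors
        prompt = revise(prompt, errors)      # fold the findings into the prompt
    return prompt, generate(prompt, transcript)


# Hypothetical stand-ins so the sketch runs without any chatbot access.
def fake_generate(prompt: str, transcript: str) -> str:
    # Pretend the model invents an age unless the prompt forbids it.
    if "Do not add details" in prompt:
        return f"HPI: {transcript}"
    return f"HPI: {transcript} Patient is 45 years old."


def fake_find_errors(hpi: str) -> List[str]:
    # Flag the invented detail if it appears in the draft.
    return ["age not in transcript"] if "45 years old" in hpi else []


def fake_revise(prompt: str, errors: List[str]) -> str:
    # Add an instruction only when the review found something new to fix.
    if errors and "Do not add details" not in prompt:
        return prompt + " Do not add details absent from the transcript."
    return prompt
```

In real use, `generate` would wrap a call to the chatbot and `find_errors` would be a human review step. The default of two rounds mirrors the article's statement that the process was repeated twice.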

Despite the need for prompt engineering and the potential for errors in the AI-generated HPIs, Nayak emphasized the potential of using AI chatbots in clinical documentation.

"For lots of clinical notes, we don't need things to be perfect. We need them to be above some sort of threshold," he said. "And it seems like, in this synthetic situation, it seemed to do the job."

Nayak also pointed out that the study used an earlier version of ChatGPT, powered by GPT-3.5, so the outcomes would likely differ if the experiment were repeated with the newer version powered by GPT-4, which was released on March 14, 2023.

"I have no doubt that if this experiment was repeated with GPT-4 the results would be even more significant," Nayak said. "I think the notes would probably be equivalent or maybe even trending towards better on the GPT-4 side. I think physician assessment of whether a note was written by AI or human would be even worse."

Still, Nayak urged caution in drawing strong conclusions about the implementation of ChatGPT in real-world clinical note writing, because the HPIs were based on fictional transcripts of made-up patient and provider conversations. While the transcripts were validated for the study, Nayak called for more research and testing.

"More work is needed with real patient data," Nayak concluded. "More work is needed with different types of notes, different aspects of the note. We just focused on the history of present illness, which is just one section of the note."

In an accompanying editorial, Eric Ward, MD, of the University of California San Francisco, and Cary Gross, MD, of Yale University in New Haven, Connecticut, wrote that a new era of healthcare is unfolding with AI innovation and emphasized the critical need for evidence-based research on implementing this technology into clinical practice.

"A failure to appreciate the unique aspects of the technology could lead to incorrect or unreproducible evaluations of its performance and premature dissemination into clinical care," they wrote. "The scientific community has embraced this challenge, and health care professionals, educational institutions, and research funders should devote attention and resources to ensuring these tools are used ethically and appropriately."

They emphasized that studies like this are needed to understand how and when AI technology can be used in medicine. In service of that idea, JAMA Internal Medicine also published another research letter, covering AI performance in healthcare education, alongside Nayak's study and the editorial.

That study found that the GPT-4 version of ChatGPT outperformed first- and second-year medical students at Stanford University on clinical reasoning exams.

"Given the abilities of general-purpose chatbot AI systems, medicine should incorporate AI-related topics into clinical training and continuing medical education," concluded the researchers led by Eric Strong, MD, of Stanford. "As the medical community had to learn online resources and electronic medical records, the next challenge is learning judicious use of generative AI to improve patient care."

Michael DePeau-Wilson is a reporter on MedPage Today’s enterprise & investigative team. He covers psychiatry, long covid, and infectious diseases, among other relevant U.S. clinical news.

Disclosures

Gross reported financial relationships with Johnson & Johnson, NCCN (funding from AstraZeneca), and Genentech.

Strong reported no relevant conflicts of interest. Coauthors reported relationships with More Health, the Stanford Artificial Intelligence in Medicine and Imaging–Human-Centered Artificial Intelligence Partnership, Google, the Doris Duke Foundation COVID-19 Fund to Retain Clinical Scientists, the National Institute on Drug Abuse, National Institutes of Health Clinical Trials Network, the American Heart Association Strategically Focused Research Network–Diversity in Clinical Trials, Reaction Explorer, Younker Hyde Macfarlane, and Sutton Pierce.

Nayak's group reported no conflicts of interest.

Primary Source

JAMA Internal Medicine

Source Reference: Nayak A, et al "Comparison of history of present illness summaries generated by a chatbot and senior internal medicine residents" JAMA Intern Med 2023; DOI: 10.1001/jamainternmed.2023.2561.

Secondary Source

JAMA Internal Medicine

Source Reference: Ward E, Gross C "Evolving methods to assess chatbot performance in health sciences research" JAMA Intern Med 2023; DOI: 10.1001/jamainternmed.2023.2567.

Additional Source

JAMA Internal Medicine

Source Reference: Strong E, et al "Chatbot vs medical student performance on free-response clinical reasoning examinations" JAMA Intern Med 2023; DOI: 10.1001/jamainternmed.2023.2909.