  • Brief Communication

Analysis of large-language model versus human performance for genetics questions

A Comment to this article was published on 05 July 2023

Abstract

Large-language models like ChatGPT have recently received a great deal of attention. One area of interest is how these models could be used in biomedical contexts, including human genetics. To assess one facet of this, we compared the performance of ChatGPT with that of human respondents (13,642 human responses) in answering 85 multiple-choice questions about aspects of human genetics. Overall, ChatGPT did not perform significantly differently from human respondents (p = 0.8327); ChatGPT was 68.2% accurate, compared to 66.6% accuracy for human respondents. Both ChatGPT and humans performed better on memorization-type questions than on critical-thinking questions (p < 0.0001). When asked the same question multiple times, ChatGPT frequently provided different answers (16% of initial responses), including for both initially correct and incorrect answers, and gave plausible explanations for both correct and incorrect answers. ChatGPT’s performance was impressive, but the model currently demonstrates significant shortcomings for clinical or other high-stakes use. Addressing these limitations will be important to guide adoption in real-life situations.
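
As a rough illustration of the arithmetic behind the headline figures, the sketch below back-calculates how many of the 85 questions a reported accuracy of 68.2% would correspond to. The count it produces is inferred from the rounded percentage and is not stated in the abstract; the function names are hypothetical, and the human figure is treated simply as a reported percentage, since the abstract does not say how it was pooled across responses.

```python
# Back-of-the-envelope check of the percentages quoted in the abstract.
# Assumption: accuracy = correct answers / total questions, rounded to
# one decimal place. The implied count is an inference, not a reported value.

N_QUESTIONS = 85           # number of multiple-choice questions (from the abstract)
CHATGPT_ACCURACY = 68.2    # percent (from the abstract)
HUMAN_ACCURACY = 66.6      # percent (from the abstract)


def implied_correct_count(accuracy_pct: float, n_questions: int) -> int:
    """Whole number of correct answers closest to the reported percentage."""
    return round(accuracy_pct / 100 * n_questions)


def accuracy_pct(n_correct: int, n_questions: int) -> float:
    """Accuracy as a percentage."""
    return 100 * n_correct / n_questions


chatgpt_correct = implied_correct_count(CHATGPT_ACCURACY, N_QUESTIONS)
recomputed = round(accuracy_pct(chatgpt_correct, N_QUESTIONS), 1)
gap_points = round(CHATGPT_ACCURACY - HUMAN_ACCURACY, 1)

print(f"Implied correct answers: {chatgpt_correct} of {N_QUESTIONS}")
print(f"Recomputed accuracy: {recomputed}%")
print(f"Gap versus human respondents: {gap_points} percentage points")
```

Under these assumptions the reported percentage is consistent with a whole number of correct answers, and the gap to the human figure is under two percentage points, in line with the non-significant difference reported above.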

Fig. 1: Summary of ChatGPT’s responses.

Data availability

All data used and presented are available in the paper and supplementary files.

References

  1. Ledgister Hanchard SE, Dwyer MC, Liu S, Hu P, Tekendo-Ngongang C, Waikel RL, et al. Scoping review and classification of deep learning in medical genetics. Genet Med. 2022;24:1593–603.

  2. Schaefer J, Lehne M, Schepers J, Prasser F, Thun S. The use of machine learning in rare diseases: a scoping review. Orphanet J Rare Dis. 2020;15:145.

  3. Dias R, Torkamani A. Artificial intelligence in clinical and genomic diagnostics. Genome Med. 2019;11:70.

  4. Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. arXiv preprint arXiv:2212.13138. 2022.

  5. Shelmerdine SC, Martin H, Shirodkar K, Shamshuddin S, Weir-McCall JR, Collaborators F-AS. Can artificial intelligence pass the Fellowship of the Royal College of Radiologists examination? Multi-reader diagnostic accuracy study. BMJ. 2022;379:e072826.

  6. Yang X, Chen A, PourNejatian N, Shin HC, Smith KE, Parisien C, et al. A large language model for electronic health records. NPJ Digit Med. 2022;5:194.

  7. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–9.

  8. Jaganathan K, Kyriazopoulou Panagiotopoulou S, McRae JF, Darbandi SF, Knowles D, Li YI, et al. Predicting splicing from primary sequence with deep learning. Cell. 2019;176:535–48.e24.

  9. Poplin R, Chang PC, Alexander D, Schwartz S, Colthurst T, Ku A, et al. A universal SNP and small-indel variant caller using deep neural networks. Nat Biotechnol. 2018;36:983–7.

  10. DeGrave AJ, Janizek JD, Lee S-I. AI for radiographic COVID-19 detection selects shortcuts over signal. Nat Mach Intell. 2021;3:610–9.

  11. Tekendo-Ngongang C, Owosela B, Fleischer N, Addissie YA, Malonga B, Badoe E, et al. Rubinstein-Taybi syndrome in diverse populations. Am J Med Genet A. 2020;182:2939–50.

  12. Solomon BD. Medical Genetics and Genomics: Questions for Board Review. Wiley, Hoboken, 2022.

Funding

This research was supported by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health.

Author information

Contributions

DD contributed formal analysis, investigation, methodology, and writing-review & editing. BDS contributed conceptualization, data curation, formal analysis, funding acquisition, investigation, methodology, and writing-original draft.

Corresponding author

Correspondence to Benjamin D. Solomon.

Ethics declarations

Competing interests

The authors receive salary and research support from the intramural program of the National Human Genome Research Institute. BDS is the co-Editor-in-Chief of the American Journal of Medical Genetics, and has published some of the questions mentioned in this study in a book, as well as other questions [12]. Both editing/publishing activities are conducted as an approved outside activity, separate from his US Government role.

Ethics approval

No individual data were collected or analyzed (there was no access to individual respondent data); per discussion with NIH bioethics/IRB, the analyses described here are considered “not human subjects research” and do not require IRB review or formal exemption.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Duong, D., Solomon, B.D. Analysis of large-language model versus human performance for genetics questions. Eur J Hum Genet 32, 466–468 (2024). https://doi.org/10.1038/s41431-023-01396-8

  • DOI: https://doi.org/10.1038/s41431-023-01396-8
