Objective: This study assesses the quality of artificial intelligence chatbots in responding to standardized obstetrics and gynecology questions. Methods: On October 7, 2023, ChatGPT-3.5, ChatGPT-4.0, Bard, and Claude were prompted with 20 standardized multiple-choice questions, and their responses and correctness were recorded. A logistic regression model assessed the relationship between question character count and accuracy (a sketch of this analysis follows below). For each incorrectly answered question, an independent error analysis was undertaken. Results: ChatGPT-4.0 scored 100% across both obstetrics and gynecology questions. ChatGPT-3.5 scored 95% overall, earning 85.7% in obstetrics and 100% in gynecology. Claude scored 90% overall, earning 100% in obstetrics and 84.6% in gynecology. Bard scored 77.8% overall, earning 83.3% in obstetrics and 75% in gynecology, and declined to respond to two questions. There was no statistically significant relationship between question character count and accuracy. Conclusions: ChatGPT-3.5 and ChatGPT-4.0 excelled in both obstetrics and gynecology, while Claude performed well in obstetrics but showed minor weaknesses in gynecology. Bard performed the worst and had the most limitations, leading us to favor the other artificial intelligence chatbots as study tools. Our findings support the use of chatbots as a supplement to, not a substitute for, clinician-based learning and historically successful educational tools.
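The character-count analysis described in the Methods could be reproduced along the following lines. This is a minimal sketch, assuming the per-question results were tabulated as (character count, correct/incorrect) pairs; the data values are placeholders and the statsmodels library is an illustrative choice, not the authors' actual code.

```python
# Sketch of a logistic regression of answer correctness on question
# character count, as described in the Methods.
import statsmodels.api as sm

# Hypothetical per-question data: question length in characters and
# whether the chatbot answered correctly (1) or incorrectly (0).
char_counts = [412, 388, 551, 297, 463, 620, 350, 505, 278, 440]
correct     = [1,   1,   0,   1,   1,   0,   1,   1,   1,   0]

X = sm.add_constant(char_counts)          # intercept + predictor column
model = sm.Logit(correct, X).fit(disp=0)  # fit the logistic regression

# The summary reports the coefficient and p-value for the
# character-count term; a non-significant p-value corresponds to the
# study's finding of no relationship between length and accuracy.
print(model.summary())
```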