2Department of Orthodontics, Tekirdağ Namık Kemal University Faculty of Dentistry, Tekirdağ, Türkiye
Abstract
Introduction: This study aimed to compare the accuracy of three advanced large language models (LLMs) in answering orthodontics-related questions from the Turkish Dentistry Specialty Examination (DUS) and to assess their performance across different examination periods.
Materials and Methods: A total of 129 publicly available orthodontic questions from 13 DUS sessions conducted between 2012 and 2021 were included. All questions were submitted to the three models in their original Turkish form, simultaneously, and under identical default settings (i.e., without fine-tuning or additional prompt engineering) by the same operator to eliminate procedural variability. Each model's responses were recorded and scored as correct (1) or incorrect (0). Accuracy comparisons among the LLMs were performed using Chi-square and Fisher's exact tests with Monte Carlo correction. Statistical significance was set at p<0.05.
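As an illustration only, the cumulative accuracy comparison described above can be sketched as a Chi-square test on a 3×2 contingency table of correct/incorrect counts (figures taken from the Results section). This is a minimal sketch assuming SciPy; the Fisher's exact test with Monte Carlo correction used for sparse cells in the original analysis is not reproduced here.

```python
# Hedged sketch: Chi-square comparison of cumulative accuracy across
# the three LLMs, using the correct/incorrect counts (out of 129)
# reported in the Results. Not the authors' exact analysis pipeline.
from scipy.stats import chi2_contingency

# Rows: Grok-4, Gemini, ChatGPT-4o; columns: correct, incorrect
table = [
    [112, 17],  # Grok-4: 86.8% accuracy
    [107, 22],  # Gemini: 82.9% accuracy
    [101, 28],  # ChatGPT-4o: 78.3% accuracy
]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}, dof = {dof}")
```

On these counts the test is non-significant at p<0.05, consistent with the reported absence of significant inter-model differences.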
Results: No statistically significant differences were observed among the three LLMs within individual examination periods (p>0.05). Grok-4 achieved the highest cumulative accuracy (112/129; 86.8%), followed by Gemini (107/129; 82.9%) and ChatGPT-4o (101/129; 78.3%). The 2018 DUS yielded the lowest accuracy for all models (30%, 30%, and 50%, respectively). All three LLMs performed significantly better on text-based than on figure-based questions (p<0.05), with figure-based accuracy dropping to 45.5% for ChatGPT-4o and Gemini, and 63.6% for Grok-4. No significant inter-model differences were found within each question type (p>0.05).
Discussion and Conclusion: All three LLMs demonstrated high but not flawless accuracy in orthodontics-related DUS questions, with consistent challenges in visual question interpretation. While their integration into examination preparation and dental education holds promise, further refinement in visual reasoning and domain-specific adaptation is needed before clinical or high-stakes implementation.
