2Department of Pediatric Dentistry, Faculty of Dentistry, Lokman Hekim University, Ankara, Türkiye
Abstract
Introduction: This study aimed to evaluate the accuracy of scientific references generated by artificial intelligence (AI) chatbots in response to clinical scenarios involving traumatic dental injuries (TDIs) and to determine the potential impact of reference errors on clinical decision-making.
Materials and Methods: This cross-sectional observational study analyzed 400 references generated by four AI chatbots (ChatGPT, Perplexity AI, Gemini, DeepSeek) in response to ten clinical prompts representing internationally recognized TDI categories. Each chatbot was instructed to retrieve recent PubMed-indexed studies and provide full bibliographic data. Reference authenticity and accuracy were verified using PubMed, Scopus, and Google Scholar. Hallucination severity was quantified using the reference hallucination score (RHS) scale (0–11). Non-parametric statistics and generalized linear modeling were applied (α=0.05).
Results: RHS differed significantly between chatbots (p<0.001). ChatGPT and Perplexity AI showed significantly lower hallucination severity than Gemini and DeepSeek (p<0.001). Trauma category had no significant effect on RHS (p>0.05). Internal consistency for RHS components was acceptable to excellent (Cronbach’s α=0.82).
Discussion and Conclusion: Although AI chatbots may provide rapid guidance for TDI management, the reliability of their generated references varies considerably across models. The presence of fabricated or inaccurate citations represents a potential risk for evidence-based clinical decision-making.
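For readers unfamiliar with the internal-consistency statistic reported in the Results, the following is a minimal sketch of how Cronbach’s α is computed from item-level scores. It uses the standard formula α = k/(k−1) · (1 − Σ item variances / variance of total scores). The function name `cronbach_alpha`, the number of components, and the example scores are all hypothetical and chosen for illustration only; they are not the study’s data or its actual RHS component structure.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a table of scores.

    rows: one list per rated reference, each holding that reference's
    score on every scale component (same length for all rows).
    """
    k = len(rows[0])  # number of scale components
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least 2 components and 2 rated references")

    # Sample variance of each component across references (column-wise).
    item_variances = [variance(column) for column in zip(*rows)]

    # Sample variance of the per-reference total score.
    total_variance = variance([sum(row) for row in rows])
    if total_variance == 0:
        raise ValueError("total scores have zero variance; alpha is undefined")

    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical example: 5 references scored on 3 illustrative components.
# These numbers are made up and do not come from the study.
example_scores = [
    [0, 1, 0],
    [2, 2, 1],
    [1, 1, 1],
    [3, 2, 3],
    [0, 0, 1],
]
print(round(cronbach_alpha(example_scores), 2))
```

Values closer to 1 indicate that the components vary together across references, which is the sense in which the abstract describes the RHS components as internally consistent.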
