Purpose:
To evaluate the diagnostic performance of an artificial intelligence (AI) solution for fracture detection across anatomical regions and to compare its output with the final reports produced by AI-assisted radiologists (expert consensus).
Methods and Materials:
A retrospective study was conducted using 638 X-ray examinations categorized by anatomical region: wrist and fingers (n = 126), elbow (n = 61), shoulder (n = 165), knee (n = 87), hip (n = 66), and ankle and foot (n = 133). Each AI output was compared to the expert consensus, which served as the ground truth, and each case was classified as a true positive (TP), false positive (FP), true negative (TN), or false negative (FN). These counts were used to compute sensitivity, specificity, accuracy, positive predictive value (PPV), negative predictive value (NPV), and F1 score for each anatomical group and for the overall dataset.
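The indicators listed above follow from the four confusion matrix counts by their standard definitions. The following is a minimal sketch of that calculation, not the study's own code: the names `ConfusionCounts`, `diagnostic_metrics`, and `pool` are hypothetical, the example counts are invented for illustration and are not study data, and pooling regions by summing their counts is an assumption about how the overall figures were obtained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Per-region counts of AI output versus the reference standard."""
    tp: int  # fracture present, AI positive
    fp: int  # fracture absent, AI positive
    tn: int  # fracture absent, AI negative
    fn: int  # fracture present, AI negative


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def diagnostic_metrics(c: ConfusionCounts) -> dict:
    """Standard diagnostic indicators from one set of counts."""
    return {
        "sensitivity": _ratio(c.tp, c.tp + c.fn),
        "specificity": _ratio(c.tn, c.tn + c.fp),
        "ppv": _ratio(c.tp, c.tp + c.fp),
        "npv": _ratio(c.tn, c.tn + c.fn),
        "accuracy": _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn),
        # F1 is the harmonic mean of PPV and sensitivity.
        "f1": _ratio(2 * c.tp, 2 * c.tp + c.fp + c.fn),
    }


def pool(groups: list) -> ConfusionCounts:
    """Assumption: overall counts are the sums of the per-region counts."""
    return ConfusionCounts(
        tp=sum(g.tp for g in groups),
        fp=sum(g.fp for g in groups),
        tn=sum(g.tn for g in groups),
        fn=sum(g.fn for g in groups),
    )


# Invented counts for illustration only (not taken from the study).
example = ConfusionCounts(tp=8, fp=2, tn=88, fn=2)
metrics = diagnostic_metrics(example)
pooled = pool([example, ConfusionCounts(tp=2, fp=0, tn=8, fn=0)])
```

A region with no fractures in the sample has an undefined sensitivity, which is why the sketch returns NaN rather than raising an error when a denominator is zero.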
Results:
AI demonstrated perfect sensitivity (1.000) in four of the six anatomical categories (wrist and fingers, elbow, knee, and hip), with no missed fractures. The highest numbers of false positives occurred in the ankle and foot (n = 21) and shoulder (n = 20) regions. Specificity ranged from 0.821 to 1.000. Aggregate performance metrics for AI across the dataset were as follows: sensitivity 0.914, specificity 0.891, PPV 0.508, NPV 0.988, accuracy 0.893, and F1 score 0.653. The high NPV suggests strong reliability for excluding fractures. The lower PPV means that roughly half of the AI-flagged examinations were not confirmed as fractures by the expert consensus, and together with the variable F1 scores this indicates overdiagnosis in certain regions.
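The aggregate figures can be cross-checked against one another, because F1, NPV, and accuracy are all determined by sensitivity, specificity, PPV, and the fracture prevalence. The sketch below is an illustrative consistency check rather than part of the study's analysis. It uses only the rounded values reported above, and the prevalence it derives is an inference from those rounded values, not a reported result.

```python
# Rounded aggregate values as reported in the Results.
sensitivity = 0.914
specificity = 0.891
ppv = 0.508

# F1 is the harmonic mean of PPV and sensitivity.
f1 = 2 * ppv * sensitivity / (ppv + sensitivity)

# Bayes' rule in odds form: PPV odds = prevalence odds * (sens / (1 - spec)).
# Solving for the prevalence odds gives the prevalence implied by the figures.
prevalence_odds = (ppv / (1 - ppv)) * ((1 - specificity) / sensitivity)
implied_prevalence = prevalence_odds / (1 + prevalence_odds)

# NPV and accuracy recomputed from the implied prevalence.
true_neg = specificity * (1 - implied_prevalence)
false_neg = (1 - sensitivity) * implied_prevalence
npv = true_neg / (true_neg + false_neg)
accuracy = sensitivity * implied_prevalence + true_neg

print(f"F1 = {f1:.3f}, implied prevalence = {implied_prevalence:.3f}")
print(f"NPV = {npv:.3f}, accuracy = {accuracy:.3f}")
```

Under these assumptions the recomputed F1 and NPV agree with the reported 0.653 and 0.988 to three decimal places, the recomputed accuracy is within about 0.001 of the reported 0.893, and the implied prevalence is close to 11%. A prevalence at that level helps explain why PPV is modest even though specificity is close to 0.9: negative examinations greatly outnumber positive ones, so a small false positive rate still yields many false alarms relative to true detections.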
Conclusion:
The AI solution demonstrated high sensitivity and negative predictive value in fracture detection, supporting its use as a triage and screening tool in radiology. However, its tendency toward overdiagnosis, particularly in complex anatomical regions such as the shoulder and the ankle and foot, supports the continued necessity for radiologist oversight. When integrated with expert review, AI can augment diagnostic efficiency without compromising accuracy. Future improvements should aim to reduce false positives and enhance model precision for broader clinical deployment.