
Comparative Analysis of Visual Language Models for Fracture Detection and Anatomical Recognition in Musculoskeletal X-rays

Introduction

The high performance of AI fracture detection tools in research settings is well established, but their effectiveness after deployment in clinical practice is understudied. Post-deployment analysis offers the opportunity to study a tool before and after on-the-fly modifications to its operational parameters, based on performance observed “in the wild.”


Hypothesis

The fracture detection tool's sensitivity and specificity will be similar to the levels assessed in preclinical studies (86.5% and 82.6%, respectively), and corrective modifications will improve performance.


Materials and Methods

This study analyzed 2378 appendicular trauma studies from patients referred to a tertiary emergency department (ED) following deployment of the fracture detection tool. After corrective modification to increase threshold points A (low-suspicion detection) and B (high-suspicion detection), a further 2023 patients were analyzed. All radiograph reports were reviewed by two board-certified radiologists. Performance was evaluated before and after the modifications to the AI tool, with significance testing performed via the chi-square test for proportions.
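
For readers who want to reproduce the significance testing, the chi-square comparison of proportions can be run in a few lines of Python. The sketch below is a minimal illustration: since the exact contingency counts are not published in this abstract, the cell counts are approximated from the accuracies reported in the Results (79.6% of 2378 studies pre-modification, 88.7% of 2023 studies post-modification).

```python
# Minimal sketch of the chi-square test for proportions described above.
# The counts are approximations derived from the reported accuracies,
# not the study's published contingency table.
from scipy.stats import chi2_contingency

table = [
    [1893, 485],  # pre-modification:  ~0.796 * 2378 correct, remainder incorrect
    [1794, 229],  # post-modification: ~0.887 * 2023 correct, remainder incorrect
]

# correction=False gives the uncorrected chi-square statistic, which is
# equivalent to the two-sample z-test for proportions on a 2x2 table.
chi2, p, dof, _ = chi2_contingency(table, correction=False)
print(f"chi2 = {chi2:.1f}, p = {p:.1e}")  # p on the order of 1e-16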

Results

Initially, the AI tool achieved a sensitivity of 89.5%, specificity of 76.0%, accuracy of 79.6%, and a negative predictive value (NPV) of 95.0%. The precision was 0.58, and the F1 score was 0.71. Notably, 93.2% of false positives came from the low-suspicion group. After refining the tool, sensitivity increased significantly to 94% (p=0.008), specificity to 87% (p=1.9 × 10⁻¹⁵), and accuracy to 88.7% (p=3.4 × 10⁻¹⁶), with an NPV of 97.7% (p=0.004). Precision improved significantly to 0.71 (p=8.72 × 10⁻⁸), and the F1 score to 0.81. Common false negatives involved fractures near complex joints or in cases of severe osteoporosis, while false positives were often associated with misidentified sesamoid bones, artefacts, and external hardware.
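
As a sanity check on how these metrics interrelate, the snippet below computes them from a single confusion matrix. The TP/FP/TN/FN counts are approximate values back-solved from the reported pre-modification rates (the abstract does not publish the actual cell counts); they reproduce the initial sensitivity, specificity, accuracy, NPV, precision, and F1 to within rounding.

```python
# Sketch of how the reported metrics derive from a confusion matrix.
# These cell counts are back-solved approximations from the reported
# pre-modification rates, not published study data.
def metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    sensitivity = tp / (tp + fn)   # fraction of true fractures detected
    specificity = tn / (tn + fp)   # fraction of normal studies correctly cleared
    precision = tp / (tp + fp)     # PPV: fraction of positive flags that are true
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "NPV": tn / (tn + fn),     # confidence that a negative call is correct
        "precision": precision,
        "F1": 2 * precision * sensitivity / (precision + sensitivity),
    }

print(metrics(tp=575, fp=416, tn=1319, fn=68))
# -> sensitivity ~0.894, specificity ~0.760, accuracy ~0.796,
#    NPV ~0.951, precision ~0.580, F1 ~0.704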

Conclusion

Initially, the AI tool demonstrated high sensitivity but a high rate of false positives. Adjusting its detection thresholds improved sensitivity, specificity, and overall performance, underscoring the importance of ongoing refinement of AI tools after clinical deployment.

