2023-09-08

Evaluating progress in automatic chest x-ray radiology report generation

Highlights

• Examined the correlation between automated metrics and radiologists' scoring of reports (see the correlation sketch after this list)

• Proposed RadGraph F1, a metric based on overlap in the clinical entities and relations extracted from reports

• Proposed RadCliQ, a composite metric with better alignment with radiologists' scoring

• Analyzed the failure modes of automated metrics
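
To make the evaluation setup concrete, below is a minimal sketch of how such a correlation analysis can be run, assuming per-report metric scores and radiologist error counts are available; the data values and variable names are illustrative, not taken from the study.

```python
# Sketch: correlating an automated metric with radiologist error counts.
# Assumes we have, for each generated report, a metric score and the
# number of errors a radiologist flagged in it (illustrative data).
from scipy.stats import kendalltau

metric_scores = [0.82, 0.55, 0.91, 0.40, 0.73]   # higher = better report
radiologist_errors = [1, 4, 0, 6, 2]             # higher = worse report

# A good metric should rank reports the way radiologists do, so its
# scores should be negatively correlated with the error counts.
tau, p_value = kendalltau(metric_scores, radiologist_errors)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
```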


The bigger picture

Artificial intelligence (AI) has made formidable progress in the interpretation of medical images, but its application has largely been limited to identifying a handful of individual pathologies. In contrast, generating complete narrative radiology reports more closely matches how radiologists communicate diagnostic information. While recent progress on vision-language models has made automatic report generation feasible, the task remains far from solved. Our work tackles one of the most important bottlenecks: the limited ability to meaningfully measure progress on the report generation task. We quantitatively examine the correlation between automated metrics and radiologists' scoring of reports, and we investigate the failure modes of these metrics. We also propose a metric based on the overlap in clinical entities and relations extracted from reports, as well as a composite metric, called RadCliQ, that combines individual metrics.
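
To illustrate the idea behind the entity- and relation-overlap metric, here is a minimal sketch assuming an external extractor (such as the RadGraph model) has already mapped each report to sets of entities and relations; the helper function and the sets shown are hypothetical stand-ins for that output, not the authors' implementation.

```python
# Sketch of an entity/relation-overlap F1 in the spirit of RadGraph F1.
# Assumes an external information-extraction model (e.g., RadGraph) has
# already mapped each report to sets of clinical entities and relations;
# the sets below are hypothetical stand-ins for that output.

def f1(predicted: set, reference: set) -> float:
    """Set-overlap F1: harmonic mean of precision and recall."""
    if not predicted or not reference:
        return 0.0
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

gen_entities = {("effusion", "observation"), ("left", "anatomy")}
ref_entities = {("effusion", "observation"), ("pleural", "anatomy")}
gen_relations = {("effusion", "located_at", "left")}
ref_relations = {("effusion", "located_at", "pleural")}

# One simple way to combine the two levels is to average their F1 scores.
score = 0.5 * (f1(gen_entities, ref_entities) + f1(gen_relations, ref_relations))
print(f"entity/relation overlap F1 = {score:.3f}")
```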


Summary

Artificial intelligence (AI) models for automatic generation of narrative radiology reports from images have the potential to enhance efficiency and reduce the workload of radiologists. However, evaluating the correctness of these reports requires metrics that can capture clinically pertinent differences. In this study, we investigate the alignment between automated metrics and radiologists' scoring of errors in report generation. We address the limitations of existing metrics by proposing new metrics, RadGraph F1 and RadCliQ, which demonstrate stronger correlation with radiologists' evaluations. In addition, we analyze the failure modes of the metrics to understand their limitations and provide guidance for metric selection and interpretation. This study establishes RadGraph F1 and RadCliQ as meaningful metrics for guiding future research in radiology report generation.
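
As a rough illustration of how a composite metric of this kind can be built, the sketch below regresses radiologist error counts on a few individual metric scores; the metric mix, data, and fitted coefficients are illustrative and do not reproduce the published RadCliQ model.

```python
# Sketch: fitting a composite metric in the spirit of RadCliQ by
# regressing radiologist error counts on several automated metrics.
# The feature mix and data are illustrative, not the published model.
import numpy as np
from sklearn.linear_model import LinearRegression

# Rows = reports; columns = per-report scores from individual metrics
# (e.g., BLEU, BERTScore, RadGraph F1) -- hypothetical values.
X = np.array([
    [0.31, 0.85, 0.60],
    [0.12, 0.70, 0.35],
    [0.45, 0.92, 0.80],
    [0.05, 0.55, 0.20],
])
# Total errors a radiologist flagged in each generated report.
y = np.array([2, 5, 1, 7])

model = LinearRegression().fit(X, y)

# The fitted combination predicts radiologist error counts for new
# reports; a lower predicted error count means a better report.
new_report = np.array([[0.28, 0.80, 0.55]])
print(f"predicted error count: {model.predict(new_report)[0]:.2f}")
```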


Link to the complete publication: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=LSnAkaUAAAAJ&citation_for_view=LSnAkaUAAAAJ:hFOr9nPyWt4C
