
Processing Free-Text Chest X-ray Reports into Structured Findings Using Agentic Large Language Models

Purpose or Learning Objective:

The vast majority of diagnostic data in healthcare resides in unstructured, free-text formats, with radiology reports being a primary example. While these narratives allow radiologists the flexibility to describe complex nuances, the lack of structure creates significant bottlenecks for downstream applications. This study addresses this critical gap by evaluating advanced AI-driven strategies designed to translate free-text chest X-ray reports into standardized, machine-readable formats, thereby unlocking the latent value within historical and real-time imaging data. The goal was to identify the most clinically accurate and reliable method.

We specifically focused on agreement and disagreement rates between the models to understand their behavior in a variety of clinical scenarios, such as incidental, congenital, and post-surgical findings. The study sought to identify the most robust method for automated structured reporting, weighing the efficiency of single-pass inference against the potential precision gains of multi-agent and multi-pass verification frameworks.

Methods or Background:

Four approaches were implemented: (1) single-pass analysis with Gemini 2.5 Flash Lite, (2) a dual-agent verification framework with Gemini 2.5 Flash Lite, (3) single-pass analysis with the locally hosted MedGemma 4B model, and (4) a triple-pass MedGemma 4B framework.
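As an illustration only, the sketch below shows how a dual-agent (extract-then-verify) pipeline of this kind might be wired. The `call_llm` placeholder, the prompts, and the function names are assumptions standing in for the actual Gemini API or locally hosted MedGemma calls, not the study's implementation.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for the real inference call (Gemini API or local MedGemma).
    Expected to return a JSON string matching the output schema."""
    raise NotImplementedError

def extract_findings(report_text: str) -> dict:
    """Agent 1: single-pass extraction of structured findings from a free-text report."""
    prompt = (
        "Classify each of the 12 findings as Positive, Negative, or Uncertain, "
        "and write a one-sentence impression. Return strict JSON only.\n\n" + report_text
    )
    return json.loads(call_llm(prompt))

def verify_findings(report_text: str, draft: dict) -> dict:
    """Agent 2: verification pass that audits and, if needed, corrects the draft output."""
    prompt = (
        "Review the draft structured output against the original report. "
        "Correct any misclassification and return the corrected JSON only.\n\n"
        f"REPORT:\n{report_text}\n\nDRAFT:\n{json.dumps(draft)}"
    )
    return json.loads(call_llm(prompt))

def dual_agent_pipeline(report_text: str) -> dict:
    """Extract-then-verify: the second agent reviews the first agent's structured JSON."""
    draft = extract_findings(report_text)
    return verify_findings(report_text, draft)
```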

To ensure consistent and comparable data extraction across these varying architectures, we utilized a unified prompt engineering strategy that enforced a strict JavaScript Object Notation (JSON) output schema. All models were tasked with parsing the free-text narratives to classify 12 distinct clinical findings (e.g., pleural effusion, pneumothorax, opacity) into one of three mutually exclusive categories: “Positive,” “Negative,” or “Uncertain.” In addition to this categorical labeling, the models were required to synthesize a one-sentence impression summary for each report.
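One plausible shape for such an output schema, together with a minimal validity check, is sketched below. Only pleural effusion, pneumothorax, and opacity are named above, so the field names and remaining keys are illustrative assumptions rather than the exact schema used in the study.

```python
# Illustrative target output for a single report (field names are assumptions;
# only three of the twelve findings are named in the abstract).
ALLOWED_LABELS = {"Positive", "Negative", "Uncertain"}

example_output = {
    "findings": {
        "pleural_effusion": "Negative",
        "pneumothorax": "Negative",
        "opacity": "Positive",
        # ... remaining illustrative finding keys, 12 in total
    },
    "impression": "Focal opacity in the right lower zone; no effusion or pneumothorax.",
}

def is_valid(output: dict) -> bool:
    """Check that every finding carries one of the three mutually exclusive labels
    and that an impression summary is present."""
    findings_ok = all(v in ALLOWED_LABELS for v in output.get("findings", {}).values())
    return findings_ok and bool(output.get("impression"))
```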

Performance was evaluated through qualitative sampling, head-to-head comparisons, and aggregate quantitative analysis across a dataset of over 1,000 chest X-ray reports. Agreement and disagreement rates between models were assessed, with specific focus on nuanced cases such as incidental, congenital, and post-surgical findings.
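A minimal sketch of how per-finding agreement rates between two models' structured outputs could be computed is shown below; the data layout is assumed and may differ from the tooling actually used in the evaluation.

```python
def agreement_rates(model_a: list[dict], model_b: list[dict]) -> dict[str, float]:
    """Per-finding agreement between two models' label dicts, aligned by index
    over the overlapping reports (e.g., {"pneumothorax": "Negative", ...})."""
    counts: dict[str, list[int]] = {}  # finding -> [agreements, comparisons]
    for out_a, out_b in zip(model_a, model_b):
        for finding, label_a in out_a.items():
            agree, total = counts.setdefault(finding, [0, 0])
            counts[finding] = [agree + (label_a == out_b.get(finding)), total + 1]
    return {f: agree / total for f, (agree, total) in counts.items() if total}
```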

Results or Findings:

Across 607 overlapping reports, the Gemini API demonstrated 98.5% agreement in the “Abnormal” classification and 99.4% agreement across detailed findings, with perfect performance for pneumothorax and pneumoperitoneum. In areas of disagreement (1.5% of cases), the dual-agent Gemini framework was consistently more complete, correcting misclassifications and providing explanatory notes. Comparative analysis revealed that MedGemma performed well on explicit pathological keywords but failed to classify incidental, congenital, or iatrogenic changes as abnormal. Gemini consistently demonstrated a more holistic clinical understanding, providing more comprehensive summaries and high-level classifications. Aggregate analysis showed higher completeness and contextually correct labeling with Gemini, including correct identification of abnormalities in post-surgical or congenital findings.

Conclusion:

Agentic LLM frameworks demonstrate strong potential for structuring free-text chest X-ray reports. While both Gemini 2.5 Flash Lite and MedGemma 4B achieved high accuracy, the Gemini API in a dual-agent framework consistently provided more clinically relevant, complete, and context-aware outputs. These findings highlight the feasibility of LLM-based pipelines for reliable structured reporting and clinical decision support.

While the locally run MedGemma 4B model in a triple-pass framework is highly capable of extracting explicit pathological terms, the Gemini 2.5 Flash Lite API in the dual-agent (two-pass) framework is the superior approach for this task. The Gemini API demonstrates a more advanced and clinically relevant interpretation of the entire report, even with a less complex workflow. Its ability to correctly classify a report as "Abnormal" based on the complete context (including post-surgical changes, congenital anomalies, and other incidental findings) makes its structured output more reliable and useful for clinical applications.

Limitations: This pilot evaluation used a limited dataset; comprehensive validation across larger and more diverse datasets is required before the approach can be considered reliable and clinically deployable.


