Purpose or Learning Objective:
Current Artificial Intelligence applications in neuroradiology demonstrate proficiency in detection tasks, such as identifying intracranial hemorrhage, large vessel occlusion, or fractures. These tools primarily function as "triage assistants," optimizing worklist prioritization and decreasing turnaround times. However, the adoption of agentic Large Language Model (LLM) pipelines presents a transformative alternative by segmenting the reporting procedure into specialized, iterative tasks, thereby moving beyond reliance on a singular, monolithic model.
This multi-agent methodology aims to emulate the cognitive checks and balances inherent in human radiological expertise, in which findings are reviewed and cross-checked before a report is finalized. Consequently, this approach seeks to substantially reduce error rates and to ensure that critical findings are not only detected but also accurately and precisely structured within the report.
The study investigates the utility of agentic LLM pipelines for the automated structuring of brain CT reports, with the goal of improving critical finding detection and enhancing clinical decision support in acute care.
Methods or Background:
This study utilized a retrospective pilot of non-contrast brain CT reports, randomly sampled to ensure a representative mix of pathological and normal cases. The ground truth was established through manual annotation by two board-certified radiologists, who mapped each report to a structured schema of 19 distinct radiological findings (e.g., intracranial hemorrhage, ischemia). Discrepancies in the initial annotations were resolved via consensus to ensure a high-fidelity reference standard.
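The schema-and-consensus step can be sketched as follows. Only the two finding names quoted above come from the study; the data layout, the placeholder for the remaining 17 labels, and the helper names (`StructuredReport`, `resolve_consensus`) are assumptions for illustration:

```python
from dataclasses import dataclass, field

# Illustrative label schema (assumed layout). Only "intracranial_hemorrhage"
# and "ischemia" are named in the study; the remaining 17 binary finding
# labels of the full schema are elided here.
FINDINGS = (
    "intracranial_hemorrhage",
    "ischemia",
    # ... 17 further binary finding labels in the full 19-finding schema
)

@dataclass
class StructuredReport:
    """One report mapped to the structured schema (all findings default to absent)."""
    report_id: str
    labels: dict = field(default_factory=lambda: {f: False for f in FINDINGS})

def resolve_consensus(ann_a: dict, ann_b: dict):
    """Merge two annotators' label dicts; agreements are accepted directly,
    disagreements are returned for consensus review, as described above."""
    merged, disputes = {}, []
    for f in FINDINGS:
        if ann_a[f] == ann_b[f]:
            merged[f] = ann_a[f]
        else:
            disputes.append(f)
    return merged, disputes
```

In this sketch the consensus meeting only needs to adjudicate the returned `disputes` list, mirroring the discrepancy-resolution step in the annotation protocol.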
We implemented and compared five distinct extraction pipelines to assess the efficacy of escalating agentic complexity. The baseline consisted of a rule-based system utilizing regular expressions and keyword dictionaries. This was compared against four generative AI frameworks powered by Google Gemini 2.5 Flash Lite: a Single-Agent model, followed by multi-agent architectures (Double-, Triple-, and Quadruple-Agent). Pipelines were evaluated for structured label extraction, critical finding sensitivity, case-level exact match, and logical consistency. Performance was assessed using bootstrap resampling (2,000 iterations, 95% CI) and pairwise McNemar testing. Plateau detection identified the optimal agent count for practical deployment.
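The evaluation statistics described above can be sketched as follows, assuming each case is a binary label vector over the schema. The function names and the exact variants chosen (percentile bootstrap CI, exact binomial McNemar) are illustrative assumptions, not the study's published implementation:

```python
import random
from math import comb

def micro_f1(gold, pred):
    """Micro-averaged F1 over per-case binary label vectors."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        for gi, pi in zip(g, p):
            if gi and pi:
                tp += 1
            elif pi:
                fp += 1
            elif gi:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def bootstrap_ci(gold, pred, metric, n_iter=2000, alpha=0.05, seed=0):
    """Case-level percentile bootstrap: resample cases with replacement,
    recompute the metric each iteration, and take the (alpha/2, 1-alpha/2)
    quantiles (2,000 iterations and alpha=0.05 match the study's setup)."""
    rng = random.Random(seed)
    n = len(gold)
    stats = []
    for _ in range(n_iter):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([gold[i] for i in idx], [pred[i] for i in idx]))
    stats.sort()
    return stats[int(alpha / 2 * n_iter)], stats[int((1 - alpha / 2) * n_iter) - 1]

def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value from discordant-pair counts:
    b = cases pipeline A got right and B got wrong, c = the reverse.
    Under H0 the discordant pairs follow Binomial(b + c, 0.5)."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p)
```

Pairwise McNemar testing as used here compares two pipelines on the same cases, so only the discordant pairs (cases exactly one pipeline handles correctly) carry information.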
Results or Findings:
Compared with the rule-based baseline, all LLM pipelines markedly improved structured report accuracy and reduced logical inconsistencies. Micro-F1 rose from 0.64 at baseline to 0.86 with a single agent and stabilized around 0.87–0.88 with multi-agent pipelines. Case-level exact match improved from 0.20 (single agent) to 0.28 (double agent), with no meaningful gains thereafter. Recall for critical findings remained high (>0.93 for LLMs; 1.00 for baseline and quadruple), though precision was substantially higher in LLM pipelines. Consistency audits confirmed fewer contradictions in agentic frameworks relative to baseline.
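A consistency audit of the kind reported above can be sketched as a small rule checker over the extracted labels. The specific rules and label names below (`normal_study`, `midline_shift`, `mass_effect`) are hypothetical examples, not the study's actual audit criteria:

```python
def audit_consistency(labels: dict) -> list:
    """Flag logically contradictory label combinations in one structured
    report. The rule set is illustrative (assumed), showing the kind of
    contradiction an audit would count against a pipeline."""
    issues = []
    # A report cannot be normal and carry a positive finding at the same time.
    if labels.get("normal_study") and any(
        labels.get(k) for k in labels if k != "normal_study"
    ):
        issues.append("'normal_study' asserted alongside a positive finding")
    # Midline shift should be accompanied by a plausible causal finding.
    if labels.get("midline_shift") and not (
        labels.get("intracranial_hemorrhage") or labels.get("mass_effect")
    ):
        issues.append("'midline_shift' without a plausible cause label")
    return issues
```

Counting such rule violations per pipeline gives a simple contradiction rate, which is one way the "fewer contradictions in agentic frameworks" result could be quantified.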
Conclusion:
The implementation of agentic LLM pipelines utilizing Gemini 2.5 Flash Lite represents a significant advancement over rule-based baselines for the automated structuring of brain CT reports, delivering superior precision and logical consistency. While single-agent models provided the most substantial performance leap, our analysis suggests that a double-agent architecture offers the optimal balance between accuracy and computational efficiency.
Clinically, this multi-agent approach effectively emulates a human peer-review process, maintaining high sensitivity for critical findings (e.g., hemorrhage) while significantly reducing false positives compared to traditional methods. These pipelines can reliably support downstream clinical decision support systems and automated triage in acute care settings. Future work will focus on validating the double-agent framework on larger, multi-institutional datasets to ensure robustness across diverse reporting styles before clinical deployment.
These methods demonstrate feasibility for clinical decision support in neuroradiology workflows, offering efficiency gains over traditional rule-based approaches while maintaining high diagnostic sensitivity.
Limitations: Although the schema covers 19 findings, the pipeline is tuned specifically for acute brain findings; it may not perform as well on complex chronic conditions, post-operative appearances, or rare pathologies not well represented in the pilot sample. We also recognise that the relatively small sample size is not ideal for full-scale clinical validation.
References:
1. Al Radi, A. M., Cao, X., Yu, F., Liu, Y., Liu, F., Wang, C., ... & Tian, Y. (2025). Agentic large-language-model systems in medicine: A systematic review and taxonomy. Authorea Preprints.
2. Ke, Y., Yang, R., Lie, S. A., Lim, T. X. Y., Ning, Y., Li, I., ... & Liu, N. (2024). Mitigating cognitive biases in clinical decision-making through multi-agent conversations using large language models: simulation study. Journal of Medical Internet Research, 26, e59439.
3. Karunanayake, N. (2025). Next-generation agentic AI for transforming healthcare. Informatics and Health, 2(2), 73-83.
4. Zhao, W., Wang, S., Safari, M., Hu, M., & Yang, X. (2025). Medical AI agents: A comprehensive survey of architectures, cognitive modules, and clinical workflows. Authorea Preprints.
5. Hager, P., Jungmann, F., Holland, R., Bhagat, K., Hubrecht, I., Knauer, M., ... & Rueckert, D. (2024). Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nature Medicine, 30(9), 2613-2622.
6. Li, C. Y., Chang, K. J., Yang, C. F., Wu, H. Y., Chen, W., Bansal, H., ... & Chiou, S. H. (2024). Towards a holistic framework for multimodal large language models in three-dimensional brain CT report generation. arXiv preprint arXiv:2407.02235.
7. Salehi, S., Singh, Y., Habibi, P., & Erickson, B. J. (2025). Beyond Single Systems: How Multi-Agent AI Is Reshaping Ethics in Radiology. Bioengineering, 12(10), 1100.