Monitoring Radiology AI Performance Over Time
Artificial intelligence applications have become powerful tools in clinical settings, automating repetitive tasks, improving diagnostic accuracy, and reducing radiologist workload. As of October 2023, over 700 AI algorithms had received FDA clearance, and more than 76% of them (527 algorithms) are tailored to radiology, spanning diagnostic aids, workflow automation, and clinical decision support tools. This concentration underscores the specialty's rapid adoption of AI to enhance imaging and diagnosis.
Why Monitor AI Performance?
Monitoring AI performance is essential to measuring the reliability and accuracy of AI applications over time. According to a multi-society statement from the ACR, CAR, ESR, RANZCR & RSNA, continuous monitoring is critical for detecting issues like concept drift, where changes in patient demographics or imaging protocols can degrade model performance.
Pre-deployment monitoring of AI applications involves human-in-the-loop approaches, in which radiologists provide ground truth data to validate AI outputs. This approach is crucial for the initial validation of AI models against real-time data from clinical institutions and serves as a vital step in deploying radiology AI applications. However, as AI systems evolve, monitoring must continue after deployment to ensure the AI performs as required and does not contribute to inconsistencies or errors.
Autonomous monitoring can track AI performance continuously over time, ensuring reliable operation without constant human intervention. This dual approach—pre-deployment monitoring with human input and post-deployment autonomous monitoring—ensures both initial validation and sustained performance of AI applications.
To effectively measure radiology AI performance, two attributes are key: accuracy and consistency.
Accuracy: This refers to the AI model's ability to make correct predictions or classifications. High accuracy is crucial because any errors in diagnosis or treatment recommendations can have serious consequences for patient health.
Consistency: Consistency refers to the AI model's ability to provide stable and reliable results over time. A model that is accurate but not consistent may produce varying results under similar conditions, undermining trust in its outputs. Conversely, a model that is consistent but miscalibrated may reliably produce the same incorrect results.
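To make the distinction concrete, here is a minimal Python sketch (the function names and toy data are illustrative, not part of any CARPL API) in which two runs of a model are equally accurate against ground truth yet disagree with each other on a quarter of the cases:

```python
import numpy as np

def accuracy(preds: np.ndarray, truth: np.ndarray) -> float:
    """Fraction of predictions matching radiologist-provided ground truth."""
    return float(np.mean(preds == truth))

def consistency(run_a: np.ndarray, run_b: np.ndarray) -> float:
    """Agreement between two runs of the model on the same cases."""
    return float(np.mean(run_a == run_b))

# Toy data: binary "finding present" labels for 8 studies.
truth = np.array([1, 0, 1, 1, 0, 0, 1, 0])
run_a = np.array([1, 0, 1, 0, 0, 0, 1, 0])  # one error vs. truth
run_b = np.array([1, 0, 1, 1, 0, 1, 1, 0])  # a different error vs. truth

print(f"accuracy, run A:     {accuracy(run_a, truth):.2f}")    # 0.88
print(f"accuracy, run B:     {accuracy(run_b, truth):.2f}")    # 0.88
print(f"consistency, A vs B: {consistency(run_a, run_b):.2f}")  # 0.75
```

Both failure modes matter: average accuracy alone would hide the fact that the two runs disagree on which cases they get wrong.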
Effective monitoring of AI in radiology involves a comprehensive assessment of various performance metrics. CARPL’s monitoring module offers user-friendly and illustrative tools that evaluate both accuracy and consistency. Two key metrics that can be used to measure accuracy and consistency include predictive divergence and temporal stability.
Predictive Divergence: Predictive Divergence is, in essence, a surrogate measure of the model's accuracy. By comparing the main model's outputs with those of multiple support models designed for the same task, it assesses the degree of alignment between the models: a lower divergence indicates greater agreement with the supporting ensemble and, by extension, greater confidence in the main model's predictions. Relying on several support models mitigates the risk that degradation in any single support model distorts the assessment, ensuring a robust evaluation of accuracy.
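CARPL's exact formulation is not spelled out here, but as one plausible illustration, the Python sketch below treats Predictive Divergence as the mean KL divergence between the main model's probabilities and those of a small ensemble of support models; the function names and the choice of KL divergence are assumptions made for this example.

```python
import numpy as np

def binary_kl(p: np.ndarray, q: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Per-case KL divergence between two vectors of Bernoulli probabilities."""
    p = np.clip(p, eps, 1 - eps)
    q = np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def predictive_divergence(main_probs: np.ndarray, support_probs: list) -> float:
    """Mean disagreement between the main model and each support model.

    Lower values mean the main model's outputs align with the supporting
    ensemble, serving as a surrogate for accuracy when fresh ground truth
    is not available.
    """
    per_model = [binary_kl(main_probs, q).mean() for q in support_probs]
    return float(np.mean(per_model))

# Toy data: probability of "finding present" for 5 studies.
main = np.array([0.92, 0.10, 0.85, 0.40, 0.05])
supports = [
    np.array([0.90, 0.12, 0.80, 0.45, 0.07]),  # close agreement -> low divergence
    np.array([0.88, 0.15, 0.78, 0.50, 0.10]),
]
print(f"predictive divergence: {predictive_divergence(main, supports):.4f}")
```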
Temporal Stability: Temporal Stability is a metric designed to evaluate the consistency of an AI model over time. It involves comparing the current prediction distribution of the model with historical data to detect any shifts in performance. Such shifts could indicate issues like data drift or model decay. Maintaining high temporal stability is crucial for ensuring that the model remains reliable across different periods and patient populations.
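The exact statistic CARPL computes for Temporal Stability is not detailed here; as one commonly used illustration, the sketch below measures the shift between a historical and a current score distribution with the Population Stability Index (the function name, bin count, and rule-of-thumb threshold are assumptions for this example).

```python
import numpy as np

def population_stability_index(baseline: np.ndarray,
                               current: np.ndarray,
                               n_bins: int = 10,
                               eps: float = 1e-6) -> float:
    """PSI between a historical and a current distribution of model scores.

    Assumes scores lie in [0, 1]. Values near 0 indicate a stable score
    distribution; larger values flag a shift worth investigating
    (data drift, protocol changes, model decay).
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    base_frac = np.histogram(baseline, bins=edges)[0] / len(baseline) + eps
    curr_frac = np.histogram(current, bins=edges)[0] / len(current) + eps
    return float(np.sum((curr_frac - base_frac) * np.log(curr_frac / base_frac)))

# Toy data: last quarter's scores vs. this month's scores.
rng = np.random.default_rng(0)
baseline_scores = rng.beta(2.0, 5.0, size=5000)  # historical model outputs
current_scores = rng.beta(2.5, 4.0, size=800)    # slightly shifted distribution
psi = population_stability_index(baseline_scores, current_scores)
print(f"PSI = {psi:.3f}")  # a common rule of thumb treats > 0.2 as a notable shift
```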
Fig. 1. Predictive Divergence
Fig. 2. Temporal Stability
CARPL's validation and monitoring framework not only facilitates the monitoring of sensitivity, specificity, and threshold settings to optimize model performance, but also tracks whether the model remains consistently accurate over time. This dual focus on accuracy and consistency is essential for delivering optimal performance and improving patient outcomes: a model that is both accurate and consistent is crucial for reliable AI implementation in clinical settings.
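For reference, the generic Python sketch below shows how sensitivity and specificity respond to different operating thresholds; it is a simple illustration of the metrics named above, not CARPL's implementation, and the scores, labels, and threshold values are made up.

```python
import numpy as np

def sensitivity_specificity(scores: np.ndarray, labels: np.ndarray, threshold: float):
    """Sensitivity and specificity of binarised scores at a given operating threshold."""
    preds = scores >= threshold
    tp = np.sum(preds & (labels == 1))
    fn = np.sum(~preds & (labels == 1))
    tn = np.sum(~preds & (labels == 0))
    fp = np.sum(preds & (labels == 0))
    return float(tp / (tp + fn)), float(tn / (tn + fp))

# Toy data: sweep candidate thresholds to see the sensitivity/specificity trade-off.
scores = np.array([0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10])
labels = np.array([1, 1, 1, 0, 1, 0, 0, 0])
for t in (0.3, 0.5, 0.7):
    sens, spec = sensitivity_specificity(scores, labels, t)
    print(f"threshold={t:.1f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```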
Fig. 3. Predictive Divergence on CARPL’s monitoring module
Fig. 4. Temporal Stability on CARPL’s monitoring module
As AI continues to play an increasingly central role in radiology, the importance of rigorous, real-time performance monitoring cannot be overstated. CARPL’s advanced monitoring module provides a promising solution by focusing on both accuracy and consistency, critical metrics for ensuring the ongoing reliability of AI models. The need for continuous monitoring, as emphasized by guidelines from the RSNA and other leading bodies, is paramount to the safe and effective use of AI in clinical practice.