Estimation of bias of deep learning-based chest X-ray classification algorithm

PURPOSE OR LEARNING OBJECTIVE:

To evaluate the bias in the diagnostic performance of a deep learning-based chest X-ray classification algorithm on previously unseen external data.

METHODS OR BACKGROUND:

632 chest X-rays were randomly collected from an academic centre hospital and selectively anonymised, retaining the fields needed for bias estimation (manufacturer name, age, and gender). They came from six different vendors: AGFA (388), Carestream (45), DIPS (21), GE (31), Philips (127), and Siemens (20). The male and female distribution was 376 and 256, respectively. The X-rays were read on the CARING analytics platform (CARPL) to establish the ground truth for consolidation. These X-rays were then run through an open-source chest X-ray classification model. Inference results were analysed using Aequitas, an open-source Python package for detecting bias and assessing the fairness of algorithms. The algorithm's performance was evaluated across three metadata classes: gender, age group, and brand of equipment. False omission rate (FOR) and false negative rate (FNR) were used to calculate the inter-class bias scores.
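
For reference, the two metrics are defined over the per-group confusion matrix: FOR = FN / (FN + TN), the fraction of predicted negatives that are actually positive, and FNR = FN / (FN + TP), the fraction of actual positives that are missed. A minimal sketch of how such an analysis can be wired up with the classic Aequitas Group/Bias interface is shown below; the toy dataframe, column values, and reference groups are illustrative only.

    import pandas as pd
    from aequitas.group import Group
    from aequitas.bias import Bias

    # Hypothetical input: one row per X-ray with the binarised model output ('score'),
    # the consolidation ground truth ('label_value'), and the three metadata attributes.
    df = pd.DataFrame({
        "score":       [1, 0, 1, 0],
        "label_value": [1, 0, 0, 0],
        "gender":      ["M", "F", "M", "F"],
        "age_group":   ["60-80", "40-60", "60-80", "20-40"],
        "vendor":      ["AGFA", "Philips", "GE", "AGFA"],
    })

    g = Group()
    xtab, _ = g.get_crosstabs(df)   # per-group confusion-matrix metrics, incl. 'for' and 'fnr'

    b = Bias()
    # Disparities are computed against the most prevalent groups, used as reference classes.
    bdf = b.get_disparity_predefined_groups(
        xtab,
        original_df=df,
        ref_groups_dict={"gender": "M", "age_group": "60-80", "vendor": "AGFA"},
    )
    print(bdf[["attribute_name", "attribute_value", "for", "fnr",
               "for_disparity", "fnr_disparity"]])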

RESULTS OR FINDINGS:

AGFA, the 60-80 age group, and males were the most prevalent groups and were therefore used as the baseline for evaluating bias towards the other classes. Significant false omission rate (FOR) and false negative rate (FNR) disparities were observed for all vendor classes except Siemens when compared with AGFA. No gender disparity was seen. For age, all groups showed FNR parity, whereas all classes showed disparity with respect to the false omission rate.

CONCLUSION:

We demonstrate that AI algorithms may develop biases based on the composition of their training data. We recommend that a bias evaluation be an integral part of every AI project. Despite this, AI algorithms may still develop certain biases, some of which are difficult to evaluate.

LIMITATIONS:

Limited pathological classes were evaluated.

Why Standardisation Of Pre-Inferencing Image Processing Methods Is Crucial For Deep Learning Algorithms – Compelling Evidence Based On The Variations In Outputs For Different Inferencing Workflows

PURPOSE OR LEARNING OBJECTIVE:

To evaluate whether there are statistically significant differences in the outputs of a deep learning algorithm across two inferencing workflows with different image pre-processing methods.

METHODS OR BACKGROUND:

The study was performed on DeepCOVIDXR, an open-source algorithm for the detection of COVID-19 on chest X-rays. It is an ensemble of convolutional neural networks developed to detect COVID-19 on frontal chest radiographs. The algorithm was evaluated using a dataset of 905 chest X-rays containing 484 COVID-positive cases (as determined by RT-PCR testing) and 421 COVID-negative cases. The algorithm supports both batch image processing (workflow 1) and single image processing (workflow 2) for inferencing. All X-rays were inferenced using both workflows. In batch image processing, images were resized (224×224 and 331×331) and the lung region was then cropped, whereas in single image processing, cropping was done without prior resizing.
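
To make the difference between the two workflows concrete, a minimal sketch of the two pre-processing orders is shown below; the crop boxes, target size, and PIL-based implementation are illustrative and not the algorithm's actual code.

    from PIL import Image

    def batch_workflow(path, crop_box, size=(331, 331)):
        """Workflow 1 (batch): resize first, then crop the lung region.
        Here crop_box is assumed to be given in the resized coordinate frame."""
        img = Image.open(path).convert("L")
        img = img.resize(size, Image.BILINEAR)   # interpolation happens before cropping
        return img.crop(crop_box)

    def single_workflow(path, crop_box):
        """Workflow 2 (single image): crop the lung region from the full-resolution
        image without any prior resizing."""
        img = Image.open(path).convert("L")
        return img.crop(crop_box)                # crop_box here is in native coordinates

    # The two outputs cover slightly different anatomy at different effective resolutions,
    # which is enough to shift the model's predicted probabilities.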

RESULTS OR FINDINGS:

We observed a significant difference in the results of the two inferencing workflows. The AUC for COVID-19 classification was 0.632 for the batch image processing workflow versus 0.769 for single image processing. There were discordant results in 334 studies: 164 X-rays classified as positive in workflow 1 were negative in workflow 2, whereas 170 X-rays classified as negative in workflow 1 were positive in workflow 2.

CONCLUSION:

We report statistically significant differences in the results of a deep learning algorithm when different inferencing workflows are used.

LIMITATIONS:

With the rising adoption of radiology AI, it is important to understand that seemingly innocuous changes in the processing pathways can lead to drastically different clinical results.

All True Positives Are Not Truly Positive – Utility Of Explainability Failure Ratio In A High Sensitivity Screening Setting To Compare Incorrect Localizations Among Algorithms

PURPOSE:

  • To evaluate the localization failures of deep learning-based chest X-ray classification algorithms for the detection of consolidation
  • To compare the localization accuracies of two algorithms in a high sensitivity screening setting by comparing the explainability failure ratios (EFR)

METHOD AND MATERIALS:

632 chest X-rays were randomly collected from an academic centre hospital and read by a chest radiologist, and ground truth for consolidation was established on the CARING analytics platform (CARPL), both at study level and at image level by marking a bounding box around the lesion. These X-rays were then analysed by two algorithms: an open-source re-implementation of Stanford's baseline X-ray classification model, CheXpert, which uses DenseNet121 as its backbone architecture, and CareXnet, a network trained using the multi-task learning paradigm that also uses a DenseNet121 backbone. Both provide heat maps for each class, generated using guided Grad-CAM, to indicate the confidence of the detected disease. The number of true positive cases was estimated at an operating threshold providing 95% sensitivity. Heat maps were matched against the ground-truth bounding boxes using a greedy matching algorithm. EFR was then estimated as the ratio of true positive cases that failed the matching process to the total number of true positive cases.
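
Since the exact greedy matching criterion is not specified here, the sketch below illustrates one plausible implementation; the heat-map threshold and minimum overlap are arbitrary choices for this sketch.

    import numpy as np

    def heatmap_matches_gt(heatmap, gt_boxes, threshold=0.5, min_overlap=0.1):
        """Greedily accept the first ground-truth box whose area is sufficiently
        covered by the thresholded Grad-CAM heat map (illustrative rule only)."""
        activated = heatmap >= threshold * heatmap.max()
        for (x1, y1, x2, y2) in gt_boxes:
            box_mask = np.zeros_like(activated, dtype=bool)
            box_mask[y1:y2, x1:x2] = True
            overlap = (activated & box_mask).sum() / box_mask.sum()
            if overlap >= min_overlap:
                return True
        return False

    def explainability_failure_ratio(true_positive_cases):
        """EFR = unmatched true positives / all true positives at the chosen operating point."""
        unmatched = sum(1 for heatmap, boxes in true_positive_cases
                        if not heatmap_matches_gt(heatmap, boxes))
        return unmatched / len(true_positive_cases)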

RESULTS:

There were a total of 169 cases of consolidation. The number of true positive cases were 145 and 143 for cheXpert and CareXnet respectively. Upon matching of the localization outputs with GT bounding box, the number of unmatched cases for CheXpert and CareXnet were 41 and 39 respectively, giving an EFR of 28 % and 27 % respectively.

CONCLUSION:

In this study, we found that even at a high-sensitivity operating point with the maximum number of true positive cases, deep learning algorithms can have a high degree of explainability failure.

CLINICAL RELEVANCE:

We present a new clinically acceptable way to compare the localization accuracies of multiple algorithms at high sensitivity operating points using explainability failure ratios.

Automatic pre-population of normal chest x-ray reports using a high-sensitivity deep learning algorithm: a prospective study of clinical AI deployment (RPS1005b)

Purpose:

To evaluate a high-sensitivity deep learning algorithm for normal/abnormal chest x-ray (CXR) classification by deploying it in a real clinical setting.

Methods and materials:

A commercially available deep learning algorithm (QXR, Qure.ai, India) was integrated into the clinical workflow for a period of 3 months at an outpatient imaging facility. The algorithm, deployed on-premise, was integrated with PACS and RIS such that it automatically analysed all adult CXRs, and reports for those determined to be “normal” were automatically populated in the RIS using HL7 messaging. Radiologists reviewed the CXRs as part of their regular workflow and ‘accepted’ or changed the pre-populated reports. Changes in reports were divided into ‘clinically insignificant’ and ‘clinically significant’, following which the CXRs with clinically significant changes were reviewed by a specialist chest radiologist with 8 years’ experience.
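
A minimal sketch of the routing logic is shown below. The HL7 segment contents, report text, and the send_to_ris() transport helper are hypothetical illustrations of an ORU^R01-style pre-population message, not the vendor's actual interface.

    NORMAL_REPORT_TEXT = "Chest X-ray: No significant abnormality detected."

    def build_oru_r01(accession, patient_id, report_text):
        # Illustrative ORU^R01 message; field contents are placeholders.
        segments = [
            "MSH|^~\\&|AI_ENGINE|IMAGING_CENTRE|RIS|IMAGING_CENTRE|202001011200||ORU^R01|MSG0001|P|2.3",
            f"PID|1||{patient_id}",
            f"OBR|1|{accession}||CXR^Chest X-ray",
            f"OBX|1|TX|REPORT^Radiology Report||{report_text}||||||P",  # OBX-11 'P' = preliminary
        ]
        return "\r".join(segments)

    def route_study(accession, patient_id, ai_call):
        if ai_call == "normal":
            # Pre-populate a preliminary report in the RIS; the radiologist still
            # reviews the study and accepts or edits the text in the usual workflow.
            send_to_ris(build_oru_r01(accession, patient_id, NORMAL_REPORT_TEXT))
        # abnormal studies fall through to the standard reporting queue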

Results:

A total of 1,970 adult CXRs were analysed by the AI, of which 388 (19.69%) were identified as normal. 361/388 (93.04%) of these were accepted by radiologists, and in 14/388 (3.60%) clinically less significant changes (e.g. increased bronchovascular markings) were made to the reports. On review of the remaining 13/388 (3.35%) CXRs, 12 were found to have clinically significant findings missed by the AI, including 3 with opacities, 3 with lymphadenopathy, 3 with blunted CP angles, 2 with nodules, and 1 with consolidation.

Conclusion:

This study shows that there is great potential to automate the identification of normal CXRs with very high sensitivity.

Validation of a high precision semantic search tool using a curated dataset containing related and unrelated reports of clinically relevant search terms (RPS 1005b)

Purpose:

To validate a semantic search tool by testing the search results for complex terms.

Methods and materials:

The tool consists of two pipelines: an offline indexing pipeline and a querying pipeline. The raw text from both reports and queries was first passed through a set of pre-processing steps: sentence tokenisation, spelling correction, negation detection, and word sense disambiguation. It was then transformed into a concept plane, followed by indexing or querying. During querying, additional concepts were added using a query expansion technique to include nearby related concepts. The validation was done on a set of 30 search queries carefully curated by two radiologists. Reports related to the search queries were randomly selected with the help of keyword search, and the text was re-read to determine its suitability to the queries. These reports formed the "related" group. Similarly, reports that did not exactly satisfy the context of the search queries were categorised as the "not related" group. A set of 5 search queries and 250 reports was used for initial tuning of the model. A total of 500 reports for the 10 test search queries formed the corpus of the test set. The search results for each test query were evaluated and appropriate statistical analysis was performed.
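
The components of the actual tool (concept mapper, negation detector, index) are not public, so the sketch below only illustrates the overall shape of the two pipelines with stubbed, hypothetical helpers.

    from collections import defaultdict

    def preprocess(text):
        # sentence tokenisation, spelling correction, negation detection and
        # word-sense disambiguation would happen here; this stub just lowercases.
        return text.lower()

    def to_concepts(text):
        # map free text to a set of concept identifiers (e.g. via an ontology lookup)
        return set(preprocess(text).split())

    def expand_query(concepts):
        # query expansion: add nearby related concepts (synonyms, parents, children)
        related = {"effusion": {"hydrothorax"}}          # illustrative mapping
        expanded = set(concepts)
        for c in concepts:
            expanded |= related.get(c, set())
        return expanded

    # offline indexing pipeline: inverted index from concept -> report IDs
    index = defaultdict(set)

    def index_report(report_id, report_text):
        for concept in to_concepts(report_text):
            index[concept].add(report_id)

    # querying pipeline: retrieve reports sharing at least one expanded query concept
    def search(query):
        hits = set()
        for concept in expand_query(to_concepts(query)):
            hits |= index[concept]
        return hits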

Results:

The average precision and recall rates for the 10 unseen queries, on a small corpus containing related and unrelated reports for the respective queries, were 0.54 and 0.42. On a larger corpus containing 60,000 reports, the average precision for 15 queries was 0.6.

Conclusion:

We describe a method to clinically validate a semantic search tool with high precision.

Estimating AI-generated Bias in Radiology Reporting by Measuring the Change in the Kellgren-Lawrence Grades of Knee Arthritis Before and After Knowledge of AI Results—A Multi-reader Retrospective Study

PURPOSE:

To estimate the extent of bias introduced by AI in radiologists’ reporting of osteoarthritis grades on knee X-rays by observing the change in grading after knowledge of the predictions of a deep learning algorithm.

METHOD AND MATERIALS:

Anteroposterior views of 271 knee X-rays (542 joints) were randomly extracted from PACS and anonymized. These X-rays were analyzed using DeepKnee, an open-source algorithm based on the Deep Siamese CNN architecture that automatically predicts the presence of osteoarthritis on knee X-rays on the 5-grade Kellgren-Lawrence (KL) scale, along with an attention map. The X-rays were independently read by three sub-specialist MSK radiologists on the CARPL AI research platform (CARING Research, India). The KL grade for each X-ray was recorded by the radiologists, following which the AI algorithm's grade was shown and the radiologists were given the option to change their result. Both the pre-AI and post-AI results were recorded. The change in the scores of all three readers was calculated, and the magnitude of the change in score was estimated using the incongruence rate. The consensus shift before and after knowledge of the AI results was also estimated.
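
A minimal sketch of how the incongruence rates and Krippendorff's alpha can be computed is shown below; the grade arrays are random placeholders (not the study data) and the 'krippendorff' PyPI package is assumed to be available.

    import numpy as np
    import krippendorff   # assumed: the 'krippendorff' PyPI package

    rng = np.random.default_rng(0)
    pre_ai = rng.integers(0, 5, size=(3, 542))      # readers x joints, KL grades 0-4 (placeholder)
    post_ai = pre_ai.copy()
    flip = rng.random(post_ai.shape) < 0.085        # ~8.5% of instances changed in the study
    post_ai[flip] = np.clip(post_ai[flip] + 1, 0, 4)

    # intra-reader incongruence rate: fraction of joints where a reader changed the grade
    incongruence = (pre_ai != post_ai).mean(axis=1)

    # inter-reader agreement before and after knowledge of the AI result
    alpha_pre = krippendorff.alpha(reliability_data=pre_ai, level_of_measurement="ordinal")
    alpha_post = krippendorff.alpha(reliability_data=post_ai, level_of_measurement="ordinal")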

RESULTS:

A total of 542 knee joints were analyzed by the algorithm and read by the three radiologists, giving a total of 1,626 “instances”. There were 139 instances (8.5%) of readers changing their results. The number of shifts was 13, 44, 31, 32, and 19 for grades 0 to 4, respectively. Readers 1, 2, and 3 changed their gradings in 52 (all single-grade shifts), 34 (all single-grade shifts), and 53 instances (50 single-grade shifts, 2 two-grade shifts, and 1 three-grade shift), respectively. The intra-reader incongruence rates were 9.6%, 6.3%, and 9.8%, respectively. Krippendorff’s alpha among the readers before and after knowledge of the AI results was 0.84 and 0.87, implying minimal convergence towards the AI results. Three-reader, two-reader, and no consensus were found in 219, 296, and 27 cases before and 248, 279, and 15 cases after knowledge of the AI results (see Figure 1).


Figure 1

CONCLUSION:

We demonstrate that there is a tendency for readers to converge towards the AI results which, as expected, occurs more often in the ‘middle’ or ‘median’ grades than at the extremes of the grading scale.

CLINICAL RELEVANCE/APPLICATION:

With an increase in the number and variety of AI applications in radiology, it is important to consider the extent and relevance of the behavior-modifying effect of AI algorithms on radiologists.

Assessment of Brain Tissue Microstructure by Diffusion Tensor Distribution MRI: An Initial Survey of Various Pathologies

PURPOSE:

To explore the potential of the novel diffusion tensor distribution (DTD) MRI method for the assessment of brain tissue microstructure, in terms of nonparametric DTDs and derived parameter maps reporting on cell densities, shapes, orientations, and heterogeneity, through a pilot study of single cases of neurocysticercosis, hydrocephalus, stroke, and radiation damage.

METHOD AND MATERIALS:

Four patients were scanned with a <5 min prototype diffusion-weighted (DW) sequence in conjunction with their regular MRI protocol on a GE MR750w 3T. DW images were acquired with spin-echo-prepared EPI using TE=121 ms, TR=3298 ms, and in-plane resolution=3 mm. DW was applied with four b-values up to 2000 s/mm2 for 37 isotropic and 43 directional encodings. Raw images were converted to per-voxel DTDs and metrics including the means and (co)variances of tensor "size" (inversely related to cell density), shape, and orientation, as well as signal fractions from elongated cells (bin1, including WM), nearly isotropic cells (bin2, including GM), and free water (bin3, including CSF).
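
For context, the DTD framework as described in the published diffusion tensor distribution literature models the per-voxel diffusion-weighted signal as an integral over a distribution of microscopic diffusion tensors:

    S(\mathbf{b}) = S_0 \int P(\mathbf{D}) \, e^{-\mathbf{b}:\mathbf{D}} \, d\mathbf{D},
    \qquad \mathbf{b}:\mathbf{D} = \sum_{i,j} b_{ij} D_{ij},

where P(D) is the nonparametric tensor distribution estimated in each voxel; the bin fractions reported here correspond to integrating P(D) over regions of tensor size and shape associated with elongated cells, nearly isotropic cells, and free water.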

RESULTS:

Inspection of the parameter maps revealed the following conspicuous features: (1) neurocysticercosis: site of parasite (high bin3_fraction) enclosed by cyst (high bin2_fraction) and edema (high bin2_fraction and bin2_size); (2) radiation: damaged area (high bin1_fraction and bin1_size) surrounded by edema (high bin2_fraction and bin2_size); (3) recurrent tumor: site of removed tumor filled by fluid (high bin3_fraction) lined with a rim of tumor (high bin2_fraction and elevated bin2_size); (4) hydrocephalus: enlarged ventricles rimmed by thin intact WM (high bin1_fraction with bin1_orientation consistent with WM tracts); (5) acute stroke: ischemic tissue (high bin1_fraction, low bin1_size) surrounded by penumbra (high cov_size_shape) (see Figure 1).


Figure 1. Diffusion Tensor Distribution (DTD) parameter maps for a case of acute stroke (arrows).


CONCLUSION:

The custom sequence for DTD can be applied as a minor addition to a clinical MRI protocol and provides novel microstructural parameter maps with conspicuous features for a range of brain pathologies, thereby encouraging studies with larger patient groups and comparison with current gold standards.

CLINICAL RELEVANCE/APPLICATION:

The DTD method may enable detailed characterization of tissue microstructure in a wide range of brain pathologies.

Acceleration of cerebrospinal fluid flow quantification using Compressed-SENSE: A quantitative comparison with standard acceleration techniques

PURPOSE:


CSF quantification studies are typically useful in the pediatric and elderly populations, for example in normal pressure hydrocephalus (NPH). In these populations, scan-time reduction is particularly helpful for patient cooperation and comfort. The purpose of this study was to quantitatively evaluate the impact of Compressed-SENSE (CS), the latest image acceleration technique, which combines compressed sensing with parallel imaging (SENSE), on acquisition time and image quality in CSF quantification MRI. The potential of CS to accelerate MRI acquisition without hampering image quality would increase patient comfort and compliance in CSF quantification.



METHODS AND MATERIALS:


The standard in-practice CSF quantification study includes a 2D gradient-echo sequence for flow visualisation and a 2D gradient-echo T1-weighted phase-contrast sequence for flow quantification. Both sequences were pulse-gated using PPU triggering and planned perpendicular to the mid-aqueduct, and both were modified to obtain higher acceleration with CS (Table 1). Ten volunteers were scanned both with and without CS on a 3.0 T wide-bore MRI (Ingenia, Philips Health Systems). The study was approved by the IRB. Flow quantification was done using the Q-Flow analysis package in IntelliSpace Portal, version 9 (Philips Health Systems). Absolute stroke volume, mean velocity, and regurgitant fraction were calculated for the flow-quantification sequence with and without CS. The correlation between these three parameters for the CS and non-CS protocols was statistically evaluated using Spearman's rank correlation test.
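
A minimal sketch of the statistical comparison, assuming the per-volunteer values have been exported to paired lists; the numbers below are placeholders, not study data.

    from scipy.stats import spearmanr

    # one value per volunteer (n=10), with and without Compressed-SENSE (placeholder values)
    stroke_volume_cs  = [0.031, 0.045, 0.052, 0.038, 0.060, 0.049, 0.027, 0.055, 0.041, 0.036]
    stroke_volume_std = [0.030, 0.047, 0.050, 0.039, 0.058, 0.050, 0.026, 0.057, 0.040, 0.037]

    rho, p_value = spearmanr(stroke_volume_cs, stroke_volume_std)
    # repeat for mean velocity and regurgitant fraction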



CONCLUSION:


There is no significant difference in image quality between the current standard of care and CS-accelerated CSF quantification MRI scans. Compressed-SENSE can reliably replace the existing, longer scan protocol without loss of image quality or quantification accuracy, while significantly reducing scan time. The Compressed-SENSE technique was originally designed for scan-time acceleration of qualitative MRI. In this work, CS shows the potential to be extended to quantitative MRI without any significant information loss and with a 44% reduction in scan time.



The EPOS can be viewed here: http://dx.doi.org/10.26044/ecr2020/C-05874

Establishing Normative Liver Volume and Attenuation for the Population of a Large Developing Country Using Deep Learning – A Retrospective Study of 1,800 CT Scans

PURPOSE:


Deep learning has enabled the analysis of large datasets which previously required significant manual labour. We used a deep learning algorithm to study the distribution of liver volumes and attenuation in a large dataset of ~1,800 non-contrast CTs (NCCTs) of the abdomen. Specifically, we aimed to establish normative values of hepatic volume and attenuation in patients with no known pathologies and to understand their correlations with age and sex. Using hepatic attenuation as an imaging biomarker, we also investigated the prevalence of fatty liver disease at the study site and compared it with known prevalence rates.



METHODS AND MATERIALS:


Abdominal CTs acquired over the preceding 3 years were retrospectively collected for the study. Natural language processing (NLP) algorithms were developed to identify patients whose radiology reports did not indicate any pathology of the liver. The non-contrast abdominal CTs of these patients were extracted from the PACS and processed using deep learning models to obtain the liver volume (LV) and mean liver attenuation (MLA).



Liver volume (LV) and mean liver attenuation (MLA) were estimated using a deep learning-based segmentation model which accurately identifies liver voxels in the CT scan and subsequently calculates LV and MLA. The deep learning algorithm used a multi-stage 3D U-Net architecture (Fig 1) and was trained on 527 patient images manually annotated by an expert radiologist. By leveraging two resizing parameters, the multi-stage architecture first extracts the region of interest around the liver, which is then used for fine boundary delineation by the subsequent model. This approach helps reduce false positives from neighbouring regions such as the spleen and stomach. The algorithm was tested independently on 130 CTs from the LiTS challenge and achieved a Dice score of 95% and a mean volume error of 3.8%. Representative segmentations are shown in Fig 2.
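
Given a binary liver mask from the segmentation model, LV and MLA reduce to simple voxel arithmetic. A minimal sketch is shown below, assuming SimpleITK for I/O and a mask aligned with the CT voxel grid.

    import numpy as np
    import SimpleITK as sitk

    def liver_metrics(ct_path, liver_mask):
        """Compute liver volume (mL) and mean liver attenuation (HU) from a
        non-contrast CT and a binary liver mask (numpy array, same grid as the CT)."""
        ct = sitk.ReadImage(ct_path)
        hu = sitk.GetArrayFromImage(ct)            # (z, y, x) array of HU values
        sx, sy, sz = ct.GetSpacing()               # voxel spacing in mm
        voxel_ml = (sx * sy * sz) / 1000.0         # mm^3 -> mL
        lv = liver_mask.sum() * voxel_ml           # liver volume in mL
        mla = hu[liver_mask > 0].mean()            # mean liver attenuation in HU
        return lv, mla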



The patient images were anonymized and processed on workstations equipped with Nvidia GeForce GTX 1070 GPUs with 8 GB of graphics memory. Each study took 7-10 minutes to process given the large size of the imaging data, and the entire process was completed in 7 days using multiple workstations.



Additional patient information such as sex and age was obtained from the clinical records and collated with the obtained liver volume and mean liver attenuation for the final analysis. Appropriate statistical analyses (correlations, histograms, etc.) were performed on LV and MLA, and the estimated prevalence of fatty liver was calculated using a cut-off of 40 HU as the reference standard.
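
A minimal sketch of the collation and analysis step, assuming the per-patient metrics have been gathered into a table; the file and column names are hypothetical.

    import pandas as pd

    df = pd.read_csv("liver_metrics.csv")        # hypothetical file: age, sex, lv_ml, mla_hu

    # prevalence of fatty liver using the 40 HU cut-off
    prevalence = (df["mla_hu"] < 40).mean()

    # R^2 of the liver-volume vs age relationship, per sex
    for sex, grp in df.groupby("sex"):
        r = grp["lv_ml"].corr(grp["age"])        # Pearson correlation
        print(sex, round(r ** 2, 4))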



RESULTS:


1,823 NCCTs of the abdomen with no liver or related abnormality on clinical reports were extracted from the PACS system. 107 (6%) NCCTs failed the algorithm's quality check and were excluded from the study, leaving a total of 1,715 NCCTs for the analysis. The demographic distribution of age and gender was available for 1,626 patients: 775 males and 851 females, with a mean age of 44.4 years. The average liver volume (LV) was 1,389 mL (standard deviation: 473 mL, range: 201-3,946 mL), while the average mean liver attenuation (MLA) was 59.2 HU (standard deviation: 15.9 HU, range: 24.2-125.6 HU) (Fig 3). There was no strong correlation between volume and age for either men (R2: 0.002) or women (R2: 0.0001). 122 of 1,715 patients (59% male, 41% female) had fatty liver, defined as a mean liver attenuation of less than 40 HU. Over 80% of patients with MLA less than 40 HU were in the 35-75 age group, with 27.2% aged between 55 and 65 years (Fig 4).



CONCLUSION:


Analysis using deep learning algorithms can automatically parse massive datasets and shed light on important clinical questions such as the establishment of age- and sex-correlated normative values. We establish new normative values for LV and MLA and quantify the prevalence of fatty liver.



The EPOS can be viewed here: http://dx.doi.org/10.26044/ecr2020/C-07653

DICE Score vs Radiologist – Visual quantification of Virtual Diffusion Sequences – pitfalls of lesion segmentation-based approach as compared to clinical relevance-based qualitative assessment

PURPOSE:

The performance of image segmentation/translation algorithms is typically evaluated by measuring image similarity metrics such as the Dice score or SSIM. In some instances, this approach may be counter-productive. In this study, we compare such an approach with a more clinically focused qualitative assessment method for estimating the accuracy of virtually generated diffusion-weighted (DW) sequences produced using generative adversarial networks (GANs).



METHODS AND MATERIALS:

We used the previously described Virtual Imaging Using Generative Adversarial Networks for Image Translation (VIGANIT) network, which comprises a 15-layer deep convolutional neural network (CNN) used in conjunction with a GAN to improve the clarity of the output image. VIGANIT was used to predict B1000 diffusion-weighted images from input T2W images in 24 cases (12 cases each of acute and chronic infarcts). The ground-truth B1000 DW images and the predicted B1000 images were blinded and randomized. A radiologist with 9 years' experience in MRI performed pixel-level annotations of the bright and dark areas on ITK-SNAP, and Dice similarity coefficients (DSC) for the annotated areas were calculated. Another radiologist with 16 years' experience reviewed the scans to determine the scan-level presence or absence of restriction-like signal. In positive cases, slice-level analysis of the number and location of discretely visible ischemic foci larger than 2 mm was also performed.
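
For reference, the Dice similarity coefficient between an annotated region on the predicted B1000 image and the corresponding region on the ground truth can be computed as in the sketch below, assuming binary numpy masks.

    import numpy as np

    def dice_coefficient(pred_mask, gt_mask, eps=1e-8):
        """DSC = 2|A ∩ B| / (|A| + |B|) for two binary masks."""
        pred = pred_mask.astype(bool)
        gt = gt_mask.astype(bool)
        intersection = np.logical_and(pred, gt).sum()
        return 2.0 * intersection / (pred.sum() + gt.sum() + eps)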



RESULTS:

The Dice scores for the cases with acute infarcts ranged from 0 to 0.85 with an average of 0.43 for the bright areas, and from 0.27 to 0.81 with an average of 0.46 for the dark areas. The qualitative assessment revealed that eight of the 12 acute infarct cases had positive scan-level predictions of restricted diffusion. None of the 12 chronic infarct cases had false predictions of restricted diffusion. Comparable predictions were absent in 4 of the 12 cases with acute infarcts; two of these four patients had some degree of movement artifacts in their T2W images. The overall accuracy of the predictions was 72%.



CONCLUSION:

Despite the low Dice similarity coefficients for the image translation, the scan-level accuracy for the clinical classification of the presence or absence of acute infarct was reasonably good. This study makes the case for additionally employing the clinical significance of lesions as an indicator of model performance. We demonstrate a significant change in the acceptability of an image translation network when a more clinically relevant assessment method is applied, as compared with in-silico mathematical metrics.



The EPOS can be viewed here: http://dx.doi.org/10.26044/ecr2020/C-06645