Estimation of bias of deep learning-based chest X-ray classification algorithm

PURPOSE OR LEARNING OBJECTIVE:

To evaluate the bias in the diagnostic performance of a deep learning-based chest X-ray classification algorithm on previously unseen external data.

METHODS OR BACKGROUND:

632 chest X-rays were randomly collected from an academic centre hospital and selectively anonymised, retaining the fields needed for bias estimation (manufacturer name, age, and gender). They came from six different vendors: AGFA (388), Carestream (45), DIPS (21), GE (31), Philips (127), and Siemens (20). The male and female distribution was 376 and 256, respectively. The X-rays were read to establish ground truth for consolidation on the CARING analytics platform (CARPL) and were then run through an open-source chest X-ray classification model. Inferencing results were analysed using Aequitas, an open-source Python package for auditing the bias and fairness of algorithms. The algorithm's performance was evaluated across three metadata classes: gender, age group, and equipment brand. False omission rate (FOR) and false negative rate (FNR) were used to calculate the inter-class bias scores.
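The inter-class bias scores can be illustrated with a minimal sketch of how per-group FOR and FNR disparities are computed against a baseline group (the labels, predictions, and group data below are toy values for illustration; the actual study used the Aequitas package):

```python
def rates(y_true, y_pred):
    """Return (false omission rate, false negative rate) for one group."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(y_true)
    neg_pred = fn + tn                         # predicted-negative count
    FOR = fn / neg_pred if neg_pred else 0.0   # FN / (FN + TN)
    FNR = fn / pos if pos else 0.0             # FN / (FN + TP)
    return FOR, FNR

# Toy example: consolidation ground truth vs model predictions per scanner group
groups = {
    "AGFA": ([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 0]),
    "GE":   ([1, 1, 0, 0, 1, 0], [0, 0, 0, 0, 1, 0]),
}
base_for, base_fnr = rates(*groups["AGFA"])  # dominant group as baseline
for name, (yt, yp) in groups.items():
    g_for, g_fnr = rates(yt, yp)
    # Aequitas-style disparity: ratio of a group's metric to the baseline's
    print(name, round(g_for / base_for, 2), round(g_fnr / base_fnr, 2))
```

A disparity ratio far from 1.0 for a group (here GE relative to AGFA) flags potential bias on that metric.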

RESULTS OR FINDINGS:

AGFA, the 60-80 age group, and males were the dominant entities and were therefore used as baselines for evaluating bias towards the other classes. Significant false omission rate (FOR) and false negative rate (FNR) disparities were observed for all vendor classes except Siemens when compared with AGFA. No gender disparity was seen. For age, all groups showed FNR parity, whereas all groups showed FOR disparity.

CONCLUSION:

We demonstrate that AI algorithms may develop biases based on the composition of their training data. We recommend that a bias evaluation check be an integral part of every AI project. Even then, AI algorithms may still develop certain biases, some of which are difficult to evaluate.

LIMITATIONS:

Limited pathological classes were evaluated.

Why Standardisation Of Pre-Inferencing Image Processing Methods Is Crucial For Deep Learning Algorithms – Compelling Evidence Based On Variations In Outputs For Different Inferencing Workflows

PURPOSE OR LEARNING OBJECTIVE:

To evaluate whether there are statistically significant differences in the outputs of a deep learning algorithm across two inferencing workflows with different image-processing methods.

METHODS OR BACKGROUND:

The study was performed on DeepCOVID-XR, an open-source ensemble of convolutional neural networks developed to detect COVID-19 on frontal chest radiographs. The algorithm was evaluated on a dataset of 905 chest X-rays comprising 484 COVID-positive cases (as determined by RT-PCR) and 421 COVID-negative cases. The algorithm supports both batch image processing (workflow 1) and single image processing (workflow 2) for inferencing, and all X-rays were inferenced using both. In batch processing, images were resized (to 224×224 and 331×331) before the lungs were cropped out, whereas in single image processing, cropping was performed without prior resizing.
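The order-of-operations difference between the two workflows can be sketched as follows. The resize and crop functions below are deliberately naive stand-ins for the real preprocessing; the point is only that resize-then-crop and crop-then-resize deliver different inputs to the network:

```python
def resize(img, size):
    """Naive nearest-neighbour resize (stand-in for the real preprocessing)."""
    h, w = len(img), len(img[0])
    return [[img[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]

def crop_center(img, frac=0.5):
    """Crude centre crop (stand-in for the lung-cropping step)."""
    h, w = len(img), len(img[0])
    dh, dw = int(h * frac) // 2, int(w * frac) // 2
    return [row[w // 2 - dw : w // 2 + dw] for row in img[h // 2 - dh : h // 2 + dh]]

# Synthetic 512x512 "X-ray" with a simple gradient pattern
xray = [[(r * 31 + c * 17) % 256 for c in range(512)] for r in range(512)]

# Workflow 1 (batch): resize first, then crop the lung region
wf1 = crop_center(resize(xray, 224))
# Workflow 2 (single image): crop at native resolution, then resize
wf2 = resize(crop_center(xray), 224)

# The arrays reaching the network differ in size and sampling,
# so the model's outputs can differ as well
print(len(wf1), len(wf2))  # 112 224
```

Because the crop is applied to an already-downsampled image in one path and to the native-resolution image in the other, the two pipelines sample different pixels, which is exactly the discrepancy the study observed.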

RESULTS OR FINDINGS:

We observed a significant difference in the results of the two inferencing workflows. The AUC for COVID-19 classification was 0.632 on the batch image processing pathway versus 0.769 for single image processing. Results were discordant in 334 studies: 164 X-rays classified as positive in workflow 1 were negative in workflow 2, while 170 classified as negative in workflow 1 were positive in workflow 2.

CONCLUSION:

We report statistically significant differences in the results of a deep learning algorithm when different inferencing workflows are used.

LIMITATIONS:

With the rising adoption of radiology AI, it is important to understand that seemingly innocuous changes in processing pathways can lead to drastically different clinical results.

Enhanced separation of brain tumors and edema via diffusion tensor distribution imaging: Illustration with lymphoma cases

PURPOSE OR LEARNING OBJECTIVE:

To investigate the clinical potential of diffusion tensor distribution imaging (DTD) for visually differentiating brain tumors and edema from healthy tissue non-invasively.

METHODS:

Multidimensional diffusion (MDD) MRI was acquired in two lymphoma patients on a 3T Discovery 750w system (GE Healthcare) with a 32-channel head coil. A prototype spin-echo EPI sequence was used with the following parameters: TR/TE = 3298/121 ms, in-plane resolution = 3×3 mm2. The MDD protocol comprised 43 linear and 37 spherical b-tensors at b = 100, 700, 1400, and 2000 s/mm2. Total scan time was ~5 min. Post-processing of the data was done using dVIEWR powered by MICE Toolkit (www.dviewr.com). The main features, related to average cell density (mean diffusivity, MD) and cell elongation (microscopic anisotropy), can be computed within “bins” corresponding to specific tissue types: “thin” for elongated cells (e.g., white matter), “thick” for densely packed round cells (e.g., grey matter), “sparse” for low cell-density diffusion environments (e.g., edema), and “big” for free water (e.g., ventricles).

RESULTS:

Bin-resolved segmentation maps (SegM) facilitate the identification of edematous regions, captured by the sparse bin (red areas in SegM). These regions surround the investigated lymphomas, themselves mostly captured by the thin bin (green in SegM), indicating that they consist of elongated cells. These cells are randomly oriented, as they appear white (red+green+blue) in the thin-bin mean-orientation maps (see Figure 1). The bin-resolved MD maps’ colors highlight the inverse relationship between MD and average cell density across different tissue types. In particular, the sparse bin exhibits an intermediate MD characteristic of edema.


Figure 1. Diffusion Tensor Distribution (DTD) parameter maps of lymphoma cases.


CONCLUSION:

DTD could provide enhanced visualization tools for radiologists aiming to better separate/characterize healthy and pathological tissues non-invasively.

LIMITATIONS:

This pilot study was limited by its small sample size.

Implementation Of Fast Echo-planar Imaging (EPImix) MRI Sequence For Scan Time Reduction In Critical And Unco-operative Patients

PURPOSE OR LEARNING OBJECTIVE:

To detail how a fast multi-contrast echo-planar image mix (EPIMix) MRI sequence can reduce scan time in critical and uncooperative patients compared with routine clinical brain imaging, without compromising image quality or diagnosis.

METHODS:

A prospective pilot study was conducted on 29 patients requiring emergent brain imaging for concerns of stroke (3), tremors (2), slurring of speech (3), headache (6), memory loss (4), imbalance (2), limb weakness (6), aphasia (1), dementia (1), and Parkinson's disease (1), using the EPIMix brain imaging sequence on a Discovery 750w 3T MR system (GE Healthcare). EPIMix brain MRI, consisting of six contrasts (T2*, T1-FLAIR, T2-FLAIR, T2, DWI, ADC), was acquired in 72-75 seconds. Routine T1w/T2w axial, coronal FLAIR, and T2w sagittal images were concurrently acquired and correlated with the EPIMix images for all patients. Qualitative analysis of the EPIMix scans was performed by two experienced radiologists to assess diagnostic accuracy, artifacts, and image quality.

RESULTS:

Image quality was diagnostic in all cases (100%), and diagnostic performance was comparable between EPIMix and routine clinical MRI with no significant difference, indicating preservation of adequate image quality on fast EPIMix scans (see Fig. 1).



Fig 1. (i) 74-year-old male presented with a history of slurred speech. There is a chronic infarct with gliosis in the right parietal region. The internal content shows hyperintensity on T2WI (A) and T2-FLAIR (B), and hypointensity on T1-FLAIR (D) (arrows); no diffusion restriction is seen on DWI (F) (arrows). (ii) 71-year-old male presented with a history of upper limb tremors. Hyperintensity in the right frontal periventricular white matter is seen on T2WI (G) and T2-FLAIR (H), with hypointensity on T1-FLAIR (J) (arrows) and reduced size of the frontal horn, possibly due to ependymitis granularis; no diffusion restriction is seen on DWI (L) to suggest acute ischaemia (arrows). (iii) 82-year-old male presented with a clinical profile of stroke. Cortical and subcortical gliosis is seen in the left middle frontal gyrus. The internal content shows hyperintensity on T2WI (M) and T2-FLAIR (N), and hypointensity on T1-FLAIR (P) (arrows); no diffusion restriction is seen on DWI (R) (arrows).

CONCLUSION:

The pilot study reveals that the EPIMix sequence with rapid scanning can minimize motion artifacts and can be used in unstable patients to evaluate a wide range of brain pathologies without compromising diagnostic image quality.

LIMITATIONS:

EPIMix produces six weighted MRI contrasts in a short time, albeit with some image artifacts, such as geometric distortion at the skull base and susceptibility artifacts, which were noticed in almost all EPIMix scans. Image degradation from these artifacts is the result of an inherent trade-off between scan time reduction and image quality.

All True Positives Are Not Truly Positive – Utility Of Explainability Failure Ratio In A High Sensitivity Screening Setting To Compare Incorrect Localizations Among Algorithms

PURPOSE:

  • To evaluate the localization failures of deep learning-based chest X-ray classification algorithms for the detection of consolidation
  • To compare the localization accuracies of two algorithms in a high-sensitivity screening setting by comparing their explainability failure ratios (EFR)

METHOD AND MATERIALS:

632 chest X-rays were randomly collected from an academic centre hospital and read by a chest radiologist, and ground truth for consolidation was established on the CARING analytics platform (CARPL), both at study level and at image level, by marking a bounding box around each lesion. These X-rays were then analysed by two algorithms: CheXpert, an open-source re-implementation of Stanford's baseline X-ray classification model that uses DenseNet121 as its backbone architecture, and CareXnet, a network trained using the multi-task learning paradigm, also with a DenseNet121 backbone. Both provide heat maps for each class, generated with guided Grad-CAM, to indicate the confidence of the detected disease. The number of true positive cases was estimated at an operating threshold providing 95% sensitivity. Heat maps were matched against the ground truth bounding boxes using a greedy matching algorithm. EFR was then estimated as the ratio of true positive cases that failed the matching process to the total number of true positive cases.
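The EFR computation can be sketched as follows. The boxes are hypothetical, and a simple IoU threshold stands in for the study's greedy matching of heat maps to ground truth bounding boxes:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def explainability_failure_ratio(cases, thr=0.1):
    """cases: (heatmap-derived box, ground-truth box) pairs for true positives.
    A case 'fails' when the model's high-activation region does not
    overlap the radiologist's bounding box above the threshold."""
    failures = sum(1 for hm, gt in cases if iou(hm, gt) < thr)
    return failures / len(cases)

# Toy true positives: correct study-level label, varying localization
tps = [
    ((10, 10, 50, 50), (12, 12, 48, 48)),        # good localization
    ((200, 200, 240, 240), (10, 10, 50, 50)),    # right label, wrong place
    ((30, 30, 80, 80), (25, 25, 70, 70)),        # good localization
    ((0, 0, 20, 20), (100, 100, 140, 140)),      # right label, wrong place
]
print(f"EFR = {explainability_failure_ratio(tps):.0%}")  # → EFR = 50%
```

A "true positive" that fails this match is counted as an explainability failure even though the study-level classification was correct, which is exactly the distinction the title draws.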

RESULTS:

There were a total of 169 cases of consolidation. The number of true positive cases was 145 for CheXpert and 143 for CareXnet. Upon matching the localization outputs with the ground truth bounding boxes, the number of unmatched cases was 41 for CheXpert and 39 for CareXnet, giving EFRs of 28% and 27%, respectively.

CONCLUSION:

In this study, we found that even at a high-sensitivity operating point with the maximum number of true positive cases, deep learning algorithms can exhibit a high degree of explainability failure.

CLINICAL RELEVANCE:

We present a new clinically acceptable way to compare the localization accuracies of multiple algorithms at high sensitivity operating points using explainability failure ratios.

Move Away HIPAA And GDPR, Here Comes CrypTFlow – Secure AI Inferencing Without Data Sharing

TEACHING POINTS: 

  • Currently, running Artificial Intelligence (AI) algorithms on medical images requires either the sharing of medical images with developers of the algorithms, or sharing of the algorithms with the hospitals. Both these options are sub-optimal since there is always a real risk of patient privacy breach or of intellectual property theft.
  • Encryption is the process of converting data into a “secret code” using a “key” making the data meaningless for anyone without the key. The challenge is that, with current technology, the key needs to be shared with the AI developer, so that the data can be converted to its meaningful form, thereby compromising the security and privacy of the data.
  • We propose using CrypTFlow, which uses Multi-Party Computation and encryption to run AI algorithms on medical images without sharing the encryption key described above. This means that the images remain in the hospital network, the AI algorithm remains in the AI developer’s network, but the AI is still able to run on the images.
  • We will present the results of our experiments of running CheXpert, an AI algorithm, on Chest X-Rays.
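The second teaching point, that encrypted data is meaningless to anyone without the key, can be illustrated with a toy XOR stream cipher. This is purely illustrative: CrypTFlow's actual protocol relies on secure multi-party computation, not this construction:

```python
import hashlib

def keystream(key: bytes):
    """Derive an unbounded keystream from the key (toy construction)."""
    counter = 0
    while True:
        block = hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        yield from block
        counter += 1

def xor_cipher(data: bytes, key: bytes) -> bytes:
    """XOR each byte with the keystream; applying it twice decrypts."""
    return bytes(b ^ k for b, k in zip(data, keystream(key)))

pixels = bytes(range(16))            # stand-in for chest X-ray pixel data
key = b"hospital-secret"
ciphertext = xor_cipher(pixels, key)

assert ciphertext != pixels                      # unreadable without the key
assert xor_cipher(ciphertext, key) == pixels     # key holder recovers the image
assert xor_cipher(ciphertext, b"wrong-key") != pixels
```

With conventional encryption, the AI developer would need this key to compute on the data; MPC-based approaches such as CrypTFlow avoid ever sharing it.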

TABLE OF CONTENTS/OUTLINE:

  • Current privacy and intellectual property concerns with deploying AI algorithms in clinical practice
  • What is encryption?
  • What is multi-party computation (MPC)?
  • What is CrypTFlow and how can it help run AI algorithms without requiring data to be shared with AI developers?
  • Results of running the CheXpert AI algorithm using CrypTFlow: accuracy, time, and computation
  • The future of secure AI deployment

The poster can be viewed here: CrypTFlow-Secure-AI-Inferencing

Clinical Experience Using Novel Multidimensional Diffusion Magnetic Resonance Imaging For Characterization Of Tissue Microstructure In Various Brain Pathologies

TEACHING POINTS:

Multidimensional diffusion (MDD) MRI is a novel imaging technique that enables better discrimination of the average rate, microscopic anisotropy, and orientation of diffusion within microscopic tissue environments. We share our experience evaluating MDD's clinical feasibility in various brain pathologies, where we employed Diffusion Tensor Distribution (DTD) imaging to retrieve nonparametric intravoxel DTDs. DTD allows separation of tissue-specific diffusion profiles of the main brain components (e.g., white matter, grey matter, cerebrospinal fluid) and pathological tissue environments such as edema through so-called 'bins', namely 'thin', 'thick', 'big', and the new fourth bin, 'sparse'. Microscopic anisotropy is not confounded by cell alignment at the voxel scale, unlike conventional fractional anisotropy. Long processing times (a few hours) are needed to generate DTD maps. Current MDD sequences, albeit optimized, feature a longer TE than conventional diffusion sequences, which imposes a lower image resolution (3×3 mm2) to maintain a reasonable signal-to-noise ratio. Distortion artefacts can be corrected upon acquisition of a reverse phase-encoding b0 image (for 'topup' processing).



TABLE OF CONTENTS/OUTLINE:


1. Basic physics underlying MDD MRI
2. Pros and cons of the sequence
3. Key differential diagnostic points in different brain indications: infections (tuberculomas and cysticercosis), sudden-onset loss of balance, fits, radiation damage, and seizures



The poster can be viewed here: MDD_EE_poster

Automatic pre-population of normal chest x-ray reports using a high-sensitivity deep learning algorithm: a prospective study of clinical AI deployment (RPS1005b)

Purpose:

To evaluate a high-sensitivity deep learning algorithm for normal/abnormal chest x-ray (CXR) classification by deploying it in a real clinical setting.

Methods and materials:

A commercially available deep learning algorithm (qXR, Qure.ai, India) was integrated into the clinical workflow for a period of 3 months at an outpatient imaging facility. Deployed on-premise and integrated with the PACS and RIS, the algorithm automatically analysed all adult CXRs, and reports for those determined to be 'normal' were automatically populated into the RIS via HL7 messaging. Radiologists reviewed the CXRs as part of their regular workflow and either accepted or changed the pre-populated reports. Report changes were categorised as 'clinically insignificant' or 'clinically significant', and CXRs with clinically significant changes were reviewed by a specialist chest radiologist with 8 years' experience.
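The triage rule described above can be sketched as follows: pre-populate a normal report only when the classifier's abnormality score stays below a high-sensitivity threshold. The template, threshold value, and scores below are hypothetical, not the vendor's actual interface:

```python
NORMAL_TEMPLATE = "No significant abnormality detected."

def prepopulate(abnormality_score, threshold=0.15):
    """Return a draft 'normal' report for the RIS, or None to leave the
    study for conventional reporting. A low threshold keeps sensitivity
    for abnormality high, at the cost of pre-populating fewer studies."""
    return NORMAL_TEMPLATE if abnormality_score < threshold else None

# Toy abnormality scores for three studies
scores = {"cxr_001": 0.04, "cxr_002": 0.61, "cxr_003": 0.12}
drafts = {k: prepopulate(v) for k, v in scores.items()}
print(sum(d is not None for d in drafts.values()), "of", len(drafts), "pre-populated")
```

The radiologist still reviews every study; the draft only saves dictation time on the studies the model is most confident about.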

Results:

A total of 1,970 adult CXRs were analysed by the AI, of which 388 (19.69%) were identified as normal. Of these, 361/388 (93.04%) were accepted by radiologists, and in 14/388 (3.60%) clinically insignificant changes (e.g. increased broncho-vascular markings) were made to the reports. Upon review of the remaining 13/388 (3.35%) CXRs, 12 were found to have clinically significant findings missed by the AI, including 3 with opacities, 3 with lymphadenopathy, 3 with blunted CP angles, 2 with nodules, and 1 with consolidation.

Conclusion:

This study shows that there is great potential to automate the identification of normal CXRs with very high sensitivity.

Validation of a high precision semantic search tool using a curated dataset containing related and unrelated reports of clinically relevant search terms (RPS 1005b)

Purpose:

To validate a semantic search tool by testing its search results for complex terms.

Methods and materials:

The tool consists of two pipelines: an offline indexing pipeline and a querying pipeline. Raw text from both reports and queries was first passed through a set of pre-processing steps (sentence tokenisation, spelling correction, negation detection, and word sense disambiguation), then transformed into a concept plane before indexing or querying. During querying, additional concepts were added using a query expansion technique to include nearby related concepts. Validation was performed on a set of 30 search queries carefully curated by two radiologists. Reports related to the search queries were randomly selected with the help of keyword search, and the text was re-read to determine its suitability to the queries; these reports formed the "related" group. Reports that did not satisfy the context of the search queries were categorised as the "not related" group. A set of 5 search queries and 250 reports was used for initial model tuning. A total of 500 reports across 10 search queries formed the test set corpus. The search results for each test query were evaluated and appropriate statistical analysis was performed.
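The indexing and query-expansion steps can be sketched as a toy concept-based inverted index. The concept dictionary and relatedness table below are illustrative stand-ins, not the tool's actual ontology or expansion model:

```python
# Map surface words to canonical concepts (the 'concept plane')
CONCEPTS = {
    "effusion": "C_EFFUSION", "fluid": "C_EFFUSION",
    "consolidation": "C_CONSOLIDATION", "pneumonia": "C_CONSOLIDATION",
    "nodule": "C_NODULE",
}
RELATED = {"C_CONSOLIDATION": ["C_NODULE"]}  # crude query-expansion table

def to_concepts(text):
    """Pre-process raw text into its set of concepts."""
    return {CONCEPTS[w] for w in text.lower().split() if w in CONCEPTS}

def build_index(reports):
    """Offline pipeline: inverted index from concept to report ids."""
    index = {}
    for rid, text in reports.items():
        for concept in to_concepts(text):
            index.setdefault(concept, set()).add(rid)
    return index

def search(query, index):
    """Query pipeline: map to concepts, expand, and collect hits."""
    concepts = to_concepts(query)
    expanded = concepts | {r for c in concepts for r in RELATED.get(c, [])}
    hits = set()
    for c in expanded:
        hits |= index.get(c, set())
    return hits

reports = {
    1: "small right pleural effusion noted",
    2: "dense consolidation in left lower lobe",
    3: "solitary pulmonary nodule",
}
index = build_index(reports)
print(sorted(search("pneumonia", index)))  # expansion also pulls in report 3
```

Because "pneumonia" and "consolidation" map to the same concept, and expansion adds the related nodule concept, the query retrieves reports that share no keywords with it, which is what distinguishes semantic from keyword search.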

Results:

The average precision and recall on the 10 unseen queries, on a small corpus containing related and unrelated reports for each query, were 0.54 and 0.42, respectively. On a larger corpus containing 60,000 reports, the average precision for these 15 queries was 0.6.

Conclusion:

We describe a method to clinically validate a semantic search tool with high precision.

Estimating AI-generated Bias in Radiology Reporting by Measuring the Change in the Kellgren-Lawrence Grades of Knee Arthritis Before and After Knowledge of AI Results—A Multi-reader Retrospective Study

PURPOSE:

To estimate the extent of bias generated by AI in radiologists' reporting of osteoarthritis grades on knee X-rays by observing changes in grading after knowledge of the predictions of a deep learning algorithm.

METHOD AND MATERIALS:

Anteroposterior views of 271 knee X-rays (542 joints) were randomly extracted from PACS and anonymized. These X-rays were analyzed using DeepKnee, an open-source algorithm based on the Deep Siamese CNN architecture that automatically predicts the presence of osteoarthritis on knee X-rays on the 5-grade Kellgren-Lawrence (KL) system, along with an attention map. The X-rays were independently read by three sub-specialist MSK radiologists on the CARPL AI research platform (CARING Research, India). The KL grade for each X-ray was recorded by the radiologists, following which the AI algorithm's grade was shown and the radiologists were given the option to change their result. Both the pre-AI and post-AI results were recorded. The change in the scores of all three readers was calculated, and the magnitude of score change was estimated using the incongruence rate. The consensus shift before and after knowledge of the AI results was also estimated.
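The incongruence rate and consensus counts can be sketched as follows (the grades below are toy values, not the study data):

```python
from collections import Counter

def incongruence_rate(pre, post):
    """Fraction of joints where a reader changed the KL grade after seeing AI output."""
    changed = sum(1 for a, b in zip(pre, post) if a != b)
    return changed / len(pre)

def consensus(grades):
    """'three', 'two', or 'none': level of agreement among three readers for one joint."""
    top = Counter(grades).most_common(1)[0][1]
    return {3: "three", 2: "two", 1: "none"}[top]

# Toy grades for 5 joints from one reader, before and after seeing the AI grade
pre_ai  = [0, 2, 3, 1, 4]
post_ai = [0, 2, 2, 1, 4]   # the reader shifted one grade-3 joint to grade 2
print(f"{incongruence_rate(pre_ai, post_ai):.1%}")  # → 20.0%

# Consensus shift for one joint: two-reader agreement becomes three-reader agreement
print(consensus([2, 2, 3]), "->", consensus([2, 2, 2]))  # two -> three
```

Comparing the pre-AI and post-AI consensus distributions across all joints is what reveals the convergence towards the AI's grades.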

RESULTS:

A total of 542 knee joints were analyzed by the algorithm and read by the three radiologists, giving 1,626 "instances". There were 139 instances (8.5%) of readers changing their results. The number of shifts was 13, 44, 31, 32, and 19 for grades 0 to 4, respectively. Readers 1, 2, and 3 changed their estimations in 52 (all single shifts), 34 (all single shifts), and 53 instances (50 single shifts, 2 two-grade shifts, and 1 three-grade shift), respectively. The intra-reader incongruence rates were 9.6%, 6.3%, and 9.8%, respectively. Krippendorff's alpha among the readers before and after knowledge of the AI results was 0.84 and 0.87, implying minimal convergence towards the AI results. Three-reader, two-reader, and no consensus were found in 219, 296, and 27 cases before, and 248, 279, and 15 cases after, knowledge of the AI results (see Figure 1).


Figure 1

CONCLUSION:

We demonstrate a tendency of readers to converge towards AI results, which, as expected, occurs more often in the 'middle' or 'median' grades than at the extremes.

CLINICAL RELEVANCE/APPLICATION:

With an increase in the number and variety of AI applications in radiology, it is important to consider the extent and relevance of the behavior-modifying effect of AI algorithms on radiologists.