Using Traditional Machine Learning Techniques for Predicting the Best Clinical Outcomes on Improvement in VAS, MOD, and NCOS Based upon Clinical and Imaging Features

PURPOSE OR LEARNING OBJECTIVE:

To use machine learning techniques to predict clinical outcomes, measured as improvement in VAS, MOD, and NCOS, for different management options: spinal decompression without fusion (open discectomy/laminectomy), spinal decompression with fusion, and conservative management.

METHODS OR BACKGROUND:

Patients with symptomatic lumbar spine disease presenting with back pain, with or without radiculopathy and neurological deficit, were enrolled. The primary outcome measures were the Visual Analogue Scale (VAS), Modified Oswestry Disability Index (MOD), and Neurogenic Claudication Outcome Score (NCOS), collected pre-operatively and at 3 months post-operatively. The analysis further studied the following factors to determine whether any were predictive of outcomes: sex, BMI, occupation, involvement in sports, herniation type, depression, work status, herniation level, duration of symptoms, and history of past spine surgery. Features were selected and machine learning models were trained to predict improvement in the primary outcome measures. The results were evaluated on the basis of the ROC-AUC score for each of the three classes.
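
A minimal sketch of this workflow using scikit-learn is shown below; the CSV file, column names, and the binarisation of each score's improvement into a yes/no target are illustrative assumptions, not details from the study.

    # Sketch only: file and column names are hypothetical.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("lumbar_outcomes.csv")
    features = ["sex", "bmi", "occupation", "sports", "herniation_type",
                "depression", "work_status", "herniation_level",
                "symptom_duration", "prior_spine_surgery", "management"]
    X = pd.get_dummies(df[features])  # one-hot encode categorical features

    for target in ["vas_improved", "mod_improved", "ncos_improved"]:
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, df[target], test_size=0.2, stratify=df[target], random_state=42)
        for model in (RandomForestClassifier(random_state=42),
                      LogisticRegression(max_iter=1000)):
            prob = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
            print(f"{target} {type(model).__name__}: "
                  f"ROC-AUC = {roc_auc_score(y_te, prob):.3f}")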

RESULTS OR FINDINGS:

A total of 200 patients with lumbar spine disease, aged 18 and above, were included. Among the machine learning models evaluated, the Random Forest classifier gave the best ROC-AUC score in all three classes. The AUC scores for VAS, MOD, and NCOS were 0.877, 0.8215, and 0.830 respectively, and the macro-average AUC score was 0.84 (see Fig. 1).

Fig 1: ROC-AUC scores of different Machine Learning Algorithms on the test dataset.

CONCLUSION:

A machine learning model could be used as a predictive tool for deciding the type of management a patient should undergo to achieve the best results. Based on the predicted improvement in the different indices for a particular management type, the predictions could help surgeons decide which type of management would be most beneficial for the patient.

Estimation Of Bias Of Deep Learning-Based Chest X-Ray Classification Algorithm

PURPOSE OR LEARNING OBJECTIVE:

To evaluate the bias in the diagnostic performance of a deep learning-based chest X-ray classification algorithm on previously unseen external data.

METHODS OR BACKGROUND:

632 chest X-rays were randomly collected from an academic centre hospital and selectively anonymised, retaining the fields needed for the bias estimation (manufacturer name, age, and gender). They came from six different vendors – AGFA (388), Carestream (45), DIPS (21), GE (31), Philips (127), and Siemens (20) – and comprised 376 male and 256 female patients. The X-rays were read to establish the ground truth for consolidation on the CARING analytics platform (CARPL). They were then run through an open-source chest X-ray classification model, and the inference results were analysed using Aequitas, an open-source Python package for detecting bias and assessing the fairness of algorithms. The algorithm's performance was evaluated across three metadata classes – gender, age group, and brand of equipment. False omission rate (FOR) and false negative rate (FNR) were used to calculate the inter-class bias scores.
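
The per-class bias metrics can be reproduced along the following lines; the file name, column names, and reference groups are assumptions based on the abstract, and the disparity ratio mirrors the Aequitas approach rather than calling its API directly.

    # Sketch only: inference_results.csv and its columns are hypothetical.
    import pandas as pd

    def group_metrics(df, group_col):
        """FOR = FN/(FN+TN); FNR = FN/(FN+TP), computed per group."""
        out = {}
        for g, sub in df.groupby(group_col):
            fn = ((sub.label == 1) & (sub.pred == 0)).sum()
            tn = ((sub.label == 0) & (sub.pred == 0)).sum()
            tp = ((sub.label == 1) & (sub.pred == 1)).sum()
            out[g] = {"FOR": fn / (fn + tn), "FNR": fn / (fn + tp)}
        return pd.DataFrame(out).T

    results = pd.read_csv("inference_results.csv")
    for col, baseline in [("vendor", "AGFA"), ("age_group", "60-80"),
                          ("gender", "M")]:
        m = group_metrics(results, col)
        print(m / m.loc[baseline])  # disparity relative to the baseline group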

RESULTS OR FINDINGS:

AGFA, the 60-80 age group, and male patients were the dominant entities and were hence considered the baselines for evaluating bias towards the other classes. Significant false omission rate (FOR) and false negative rate (FNR) disparities were observed for all vendor classes except Siemens when compared with AGFA. No gender disparity was seen. For age, all groups showed FNR parity, whereas all classes showed FOR disparity.

CONCLUSION:

We demonstrate that AI algorithms may develop biases based on the composition of their training data. We recommend that a bias evaluation check be an integral part of every AI project. Even so, AI algorithms may still develop certain biases, some of which are difficult to evaluate.

LIMITATIONS:

Limited pathological classes were evaluated.

Why Standardisation Of Pre-Inferencing Image Processing Methods Is Crucial For Deep Learning Algorithms – Compelling Evidence Based On The Variations In Outputs For Different Inferencing Workflows

PURPOSE OR LEARNING OBJECTIVE:

To evaluate whether there are statistically significant differences in the outputs of a deep learning algorithm between two inferencing workflows that use different image pre-processing methods.

METHODS OR BACKGROUND:

The study was performed on DeepCOVIDXR, an open-source ensemble of convolutional neural networks developed to detect COVID-19 on frontal chest radiographs. The algorithm was evaluated on a dataset of 905 chest X-rays containing 484 COVID-positive cases (as determined by RT-PCR testing) and 421 COVID-negative cases. The algorithm supports both batch image processing (workflow 1) and single image processing (workflow 2) for inferencing, and all X-rays were inferenced using both methods. In batch image processing, images were first resized (to 224×224 and 331×331) and the lungs were then cropped out; in single image processing, cropping was done without resizing.
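
The sketch below illustrates why the two orderings are not interchangeable: resizing before cropping changes both the resolution and the pixels that the crop sees. The crop box and sizes are illustrative placeholders; DeepCOVIDXR's actual lung cropping uses a trained detector.

    # Sketch only: crop coordinates are placeholders.
    from PIL import Image

    def batch_workflow(path, crop_box):            # workflow 1
        img = Image.open(path).resize((331, 331))  # resize first
        return img.crop(crop_box)                  # then crop the lungs

    def single_workflow(path, crop_box):           # workflow 2
        img = Image.open(path).crop(crop_box)      # crop at native resolution
        return img.resize((331, 331))              # resize afterwards

    # For any source image that is not already 331x331, the same nominal
    # crop box selects different anatomy and different interpolation
    # artefacts in the two workflows, so model inputs (and outputs) differ.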

RESULTS OR FINDINGS:

We observed a significant difference in the results of the two inferencing workflows. The AUC for COVID classification was 0.632 for the batch image processing workflow and 0.769 for single image processing. There were discordant results in 334 studies: 164 X-rays were classified as positive in workflow 1 but negative in workflow 2, while 170 classified as negative in workflow 1 were classified as positive in workflow 2.

CONCLUSION:

We report statistically significant differences in the results of a deep learning algorithm when different inferencing workflows are used.

LIMITATIONS:

With the rising adoption of radiology AI, it is important to understand that seemingly innocuous changes in processing pathways can lead to significant changes in clinical results.

All True Positives Are Not Truly Positive – Utility Of Explainability Failure Ratio In A High Sensitivity Screening Setting To Compare Incorrect Localizations Among Algorithms

PURPOSE:

  • To evaluate the localization failures of deep learning-based chest X-ray classification algorithms for the detection of consolidation
  • To compare the localization accuracies of two algorithms in a high sensitivity screening setting by comparing the explainability failure ratios (EFR)

METHOD AND MATERIALS:

632 chest X-rays were randomly collected from an academic centre hospital and read by a chest radiologist, and the ground truth for consolidation was established on the CARING analytics platform (CARPL), both at the study level and at the image level by marking a bounding box around each lesion. These X-rays were then analysed by two algorithms: an open-source re-implementation of Stanford's baseline X-ray classification model, CheXpert, which uses DenseNet121 as its backbone architecture, and CareXnet, a network trained using the multi-task learning paradigm which also uses a DenseNet121 backbone. Both provide per-class heat maps, generated using guided Grad-CAM, to indicate the location and confidence of the detected disease. The number of true positive cases was estimated at an operating threshold providing 95% sensitivity. Heat maps were matched against the ground-truth bounding boxes using a greedy matching algorithm. The EFR was then estimated as the ratio of true positive cases that failed the matching process to the total number of true positive cases.
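
A sketch of the EFR computation follows; the heat-map binarisation and the matching criterion (any sufficiently hot pixel inside the ground-truth box) are assumptions, since the abstract does not specify the exact greedy matching rule.

    # Sketch only: the matching criterion is an assumption.
    import numpy as np

    def matches(heatmap, box, frac=0.5):
        """Does the hot region of the heat map overlap the GT box?"""
        x1, y1, x2, y2 = box
        hot = heatmap >= frac * heatmap.max()  # binarise the heat map
        return bool(hot[y1:y2, x1:x2].any())

    def explainability_failure_ratio(true_positives):
        """true_positives: (heatmap, gt_box) pairs at the 95%-sensitivity
        operating point; EFR = unmatched TPs / all TPs."""
        failed = sum(not matches(h, b) for h, b in true_positives)
        return failed / len(true_positives)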

RESULTS:

There were a total of 169 cases of consolidation. The number of true positive cases was 145 for CheXpert and 143 for CareXnet. Upon matching the localization outputs with the ground-truth bounding boxes, the number of unmatched cases for CheXpert and CareXnet was 41 and 39 respectively, giving EFRs of 28% and 27%.

CONCLUSION:

In this study, we found that even at a high-sensitivity operating point with the maximum number of true positive cases, deep learning algorithms can have a high degree of explainability failure.

CLINICAL RELEVANCE:

We present a new clinically acceptable way to compare the localization accuracies of multiple algorithms at high sensitivity operating points using explainability failure ratios.

Move Away HIPAA And GDPR, Here Comes CrypTFlow – Secure AI Inferencing Without Data Sharing

TEACHING POINTS: 

  • Currently, running Artificial Intelligence (AI) algorithms on medical images requires either the sharing of medical images with developers of the algorithms, or sharing of the algorithms with the hospitals. Both these options are sub-optimal since there is always a real risk of patient privacy breach or of intellectual property theft.
  • Encryption is the process of converting data into a “secret code” using a “key”, making the data meaningless for anyone without the key. The challenge is that, with current technology, the key needs to be shared with the AI developer so that the data can be converted to its meaningful form, thereby compromising the security and privacy of the data.
  • We propose using CrypTFlow, which uses Multi-Party Computation (MPC) and encryption to run AI algorithms on medical images without sharing the encryption key described above. This means that the images remain in the hospital network and the AI algorithm remains in the AI developer’s network, yet the AI is still able to run on the images (the secret-sharing idea behind MPC is illustrated in the sketch after these points).
  • We will present the results of our experiments of running CheXpert, an AI algorithm, on Chest X-Rays.
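
The toy sketch below illustrates the secret-sharing idea that underlies MPC; it is not CrypTFlow's actual protocol, and the prime, weight, and pixel value are arbitrary.

    # Toy additive secret sharing, for intuition only.
    import secrets

    P = 2**61 - 1                    # all arithmetic is modulo a prime

    def share(x):
        r = secrets.randbelow(P)     # hospital keeps r; developer gets x - r
        return r, (x - r) % P

    def reconstruct(a, b):
        return (a + b) % P

    pixel = 137                      # a raw image value (never transmitted)
    h_share, d_share = share(pixel)
    w = 3                            # a public toy model weight
    # Each party multiplies only its own share; the weighted pixel is
    # recovered without either party ever seeing the raw pixel.
    assert reconstruct(h_share * w % P, d_share * w % P) == pixel * w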

TABLE OF CONTENTS/OUTLINE:

  • Current privacy and intellectual property concerns with deploying AI algorithms in clinical practice
  • What is encryption?
  • What is multi-party computation (MPC)?
  • What is CrypTFlow and how can it help run AI algorithms without requiring data to be shared with AI developers?
  • Results of running the CheXpert AI algorithm using CrypTFlow – accuracy, time, and computation
  • The future of secure AI deployment

The poster can be viewed here: CrypTFlow-Secure-AI-Inferencing

Validation Of Expert System Enhanced Deep Learning Algorithm For Automated Screening For COVID-Pneumonia On Chest X-Rays

Abstract

The SARS-CoV-2 pandemic exposed the limitations of artificial intelligence-based medical imaging systems. Earlier in the pandemic, the absence of sufficient training data prevented effective deep learning (DL) solutions for the diagnosis of COVID-19 from X-ray data. Here, addressing the lacunae in the existing literature and algorithms given the paucity of initial training data, we describe CovBaseAI, an explainable tool that uses an ensemble of three DL models and an expert decision system (EDS) for COVID-Pneumonia diagnosis, trained entirely on pre-COVID-19 datasets. The performance and explainability of CovBaseAI were validated on two independent datasets: first, 1,401 randomly selected CXRs from an Indian quarantine centre, to assess effectiveness in excluding radiological COVID-Pneumonia requiring higher care; second, a curated dataset of 434 RT-PCR-positive cases and 471 non-COVID/normal historical scans, to assess performance in advanced medical settings. CovBaseAI had an accuracy of 87% with a negative predictive value of 98% on the quarantine-centre data. However, sensitivity ranged from 0.66 to 0.90 depending on whether RT-PCR or radiologist opinion was taken as ground truth. This work provides new insights into the use of an EDS with DL methods and the ability of algorithms to confidently predict COVID-Pneumonia, while reinforcing the established learning that benchmarking against RT-PCR may not serve as a reliable ground truth in radiological diagnosis. Such tools can pave the path for multi-modal, high-throughput detection of COVID-Pneumonia in screening and referral.
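
As a rough illustration of the architecture described above, the sketch below ensembles three model probabilities and lets a rule-based expert decision system veto weak positives; the function name, rule, and thresholds are illustrative assumptions, not the published CovBaseAI design.

    # Sketch only: thresholds and the EDS rule are hypothetical.
    from statistics import mean

    def covbase_like_decision(p1, p2, p3, lungs_clear_on_rules):
        """p1..p3: COVID-Pneumonia probabilities from three DL models;
        lungs_clear_on_rules: an example finding from the EDS."""
        p = mean([p1, p2, p3])                 # simple ensemble average
        if lungs_clear_on_rules and p < 0.85:
            return "non-COVID/Normal"          # EDS veto of weak positives
        return "COVID-Pneumonia" if p >= 0.5 else "non-COVID/Normal"

    print(covbase_like_decision(0.60, 0.55, 0.70, lungs_clear_on_rules=True))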



For full paper: http://www.nature.com/articles/s41598-021-02003-w

Automatic pre-population of normal chest x-ray reports using a high-sensitivity deep learning algorithm: a prospective study of clinical AI deployment (RPS1005b)

Purpose:

To evaluate a high-sensitivity deep learning algorithm for normal/abnormal chest x-ray (CXR) classification by deploying it in a real clinical setting.

Methods and materials:

A commercially available deep learning algorithm (QXR, Qure.ai, India) was integrated into the clinical workflow for a period of 3 months at an outpatient imaging facility. The algorithm, deployed on-premises, was integrated with the PACS and RIS such that it automatically analysed all adult CXRs, and reports for those determined to be “normal” were automatically populated in the RIS using HL7 messaging. Radiologists reviewed the CXRs as part of their regular workflow and either ‘accepted’ or changed the pre-populated reports. Changes to reports were divided into ‘clinically insignificant’ and ‘clinically significant’, following which the CXRs with clinically significant changes were reviewed by a specialist chest radiologist with 8 years’ experience.
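
The gating logic might look like the sketch below; the score threshold, report text, and the simplified HL7 ORU segments are illustrative assumptions, not the deployed integration.

    # Sketch only: threshold, report text, and HL7 layout are simplified.
    NORMAL_REPORT = "Chest X-ray: No significant abnormality detected."

    def prepopulate(accession, abnormality_score, threshold=0.05):
        """High-sensitivity gate: only confident normals are pre-populated;
        everything else follows the routine reporting workflow."""
        if abnormality_score >= threshold:
            return None                        # radiologist reports as usual
        segments = [
            "MSH|^~\\&|AI|SITE|RIS|SITE|||ORU^R01|MSGID|P|2.3",
            f"OBR|1|{accession}||XR^CHEST",
            f"OBX|1|TX|REPORT||{NORMAL_REPORT}||||||F",
        ]
        return "\r".join(segments)             # handed to the RIS for review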

Results:

A total of 1,970 adult CXRs were analysed by the AI, out of which 388 (19.69%) were identified as normal. 361/388 (93.04%) of these were accepted by radiologists, and in 14/388 (3.60%) clinically insignificant changes (e.g. increased broncho-vascular markings) were made to the reports. Upon review of the remaining 13/388 (3.35%) CXRs, 12 were found to have clinically significant findings truly missed by the AI, including 3 with opacities, 3 with lymphadenopathy, 3 with blunted CP angles, 2 with nodules, and 1 with consolidation.

Conclusion:

This study shows that there is considerable potential to automate the identification of normal CXRs with very high sensitivity.

Validation of a high precision semantic search tool using a curated dataset containing related and unrelated reports of clinically relevant search terms (RPS 1005b)

Purpose:

To validate a semantic search tool by testing the search results for complex terms.

Methods and materials:

The tool consists of two pipelines: an offline indexing pipeline and a querying pipeline. The raw text from both reports and queries was first passed through a set of pre-processing steps: sentence tokenisation, spelling correction, negation detection, and word sense disambiguation. It was then transformed into a concept plane, followed by indexing or querying. During querying, additional concepts were added using a query expansion technique to include nearby related concepts. Validation was done on a set of 30 search queries carefully curated by two radiologists. Reports related to the search queries were randomly selected with the help of keyword search, and the text was re-read to determine its suitability to the queries; these reports formed the "related" group. Similarly, reports that did not exactly satisfy the context of the search queries were categorised as the "not related" group. A set of 5 search queries and 250 reports was used for initial tuning of the model. A total of 500 reports for the 10 test search queries formed the corpus of the test set. The search results for each test query were evaluated and appropriate statistical analysis was performed.
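
A skeleton of the two pipelines is sketched below; the concept dictionary, expansion table, and matching rule are toy stand-ins for the tool's actual components, which the abstract does not disclose.

    # Sketch only: concept IDs and expansions are toy examples.
    CONCEPTS = {"opacity": "C0029053", "consolidation": "C0521530"}
    EXPANSIONS = {"C0521530": {"C0029053"}}    # nearby related concepts

    def to_concepts(text):
        # stands in for tokenisation, spell correction, negation detection,
        # and word-sense disambiguation followed by concept mapping
        return {cui for term, cui in CONCEPTS.items() if term in text.lower()}

    def search(query, index):
        q = to_concepts(query)
        for cui in list(q):
            q |= EXPANSIONS.get(cui, set())    # query expansion step
        return [rid for rid, concepts in index.items() if q & concepts]

    index = {1: to_concepts("Patchy consolidation in the right lower lobe."),
             2: to_concepts("Normal study.")}  # offline indexing pipeline
    print(search("lung opacity", index))       # expansion also retrieves [1]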

Results:

The average precision and recall on the 10 unseen queries, on a small corpus containing related and unrelated reports for each query, were 0.54 and 0.42 respectively. On a larger corpus containing 60K reports, the average precision for these queries was 0.6.

Conclusion:

We describe a method to clinically validate a semantic search tool with high precision.

Estimating AI-Generated Bias In Radiology Reporting By Measuring The Change In The Kellgren-Lawrence Grades Of Knee Arthritis Before And After Knowledge Of AI Results – A Multi-Reader Retrospective Study

PURPOSE:

To estimate the extent of bias generated by AI in radiologists’ reporting of osteoarthritis grades on knee X-rays, by observing the change in grading after knowledge of the predictions of a deep learning algorithm.

METHOD AND MATERIALS:

Anteroposterior views of 271 knee X-rays (542 joints) were randomly extracted from PACS and anonymized. These X-rays were analyzed using DeepKnee, an open-source algorithm based on the Deep Siamese CNN architecture that automatically grades osteoarthritis on knee X-rays on the 5-grade Kellgren-Lawrence (KL) scale and outputs an attention map. The X-rays were independently read by three sub-specialist MSK radiologists on the CARPL AI research platform (CARING Research, India). Each radiologist recorded the KL grade for each X-ray, following which the AI algorithm’s grade was shown and the radiologists were given the option to change their result. Both the pre-AI and post-AI results were recorded. The change in the scores of all three readers was calculated, and the modulus of the change in score was estimated using the incongruence rate. The consensus shift before and after knowledge of the AI results was also estimated.
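
The two measurements could be computed as in the sketch below; the grade arrays are simulated placeholders, and the third-party krippendorff package is one way (an assumption, not necessarily the study's tool) to obtain alpha for ordinal data.

    # Sketch only: grades are simulated, not study data.
    import numpy as np
    import krippendorff  # pip install krippendorff

    rng = np.random.default_rng(0)
    pre = rng.integers(0, 5, size=(3, 542))    # readers x joints, pre-AI
    post = pre.copy()
    post[0, :52] = (post[0, :52] + 1) % 5      # toy: reader 1 changes 52

    incongruence = (pre != post).mean(axis=1)  # per-reader change rate
    print([f"{r:.1%}" for r in incongruence])

    for grades in (pre, post):                 # 0.84 and 0.87 in the study
        print(krippendorff.alpha(reliability_data=grades,
                                 level_of_measurement="ordinal"))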

RESULTS:

A total of 542 knee joints were analyzed by the algorithm and read by the three radiologists, giving a total of 1,626 “instances”. There were 139 instances (8.5%) of readers changing their results. The number of shifts was 13, 44, 31, 32, and 19 for grades 0 to 4 respectively. Readers 1, 2, and 3 changed their estimations in 52 instances (all single-grade shifts), 34 instances (all single-grade shifts), and 53 instances (50 single-grade shifts, 2 two-grade shifts, and 1 three-grade shift) respectively. The intra-reader incongruence rates were 9.6%, 6.3%, and 9.8% respectively. Krippendorff’s alpha among the readers before and after knowledge of the AI results was 0.84 and 0.87, implying a minimal convergence towards the AI results. Three-reader, two-reader, and no consensus were found in 219, 296, and 27 cases before, and 248, 279, and 15 cases after, knowledge of the AI results (see Figure 1).


Figure 1

CONCLUSION:

We demonstrate that there is a tendency of readers to converge towards AI results, which, as expected, occurs more often in the ‘middle’ or ‘median’ grades than at the extremes of the grading scale.

CLINICAL RELEVANCE/APPLICATION:

With an increase in the number and variety of AI applications in radiology, it is important to consider the extent and relevance of the behavior-modifying effect of AI algorithms on radiologists.

Can AI Help Read Pediatric Chest X-Rays? An Independent Evaluation On 3,000+ Scans

PURPOSE:

To evaluate the performance of a commercially available deep learning-based AI algorithm on pediatric chest X-rays (CXRs).

METHOD AND MATERIALS:

3,319 frontal (PA and AP) CXRs of patients aged 6 to 18 years were pulled from PACS and anonymised at a tertiary care pediatric hospital in Brazil. Labels (normal, abnormal) were ascertained from the radiology reports. The data was loaded onto the CARPL AI Research platform (CARING Research, India) for AI inference and validation-related statistical analysis. The algorithm under test was QXR Version 3.0 (Qure.ai, India). The algorithmic output consisted of three categories – “normal”, “abnormal”, and “to be read”. The “to be read” scans, which refer to cases meant to be read directly by a radiologist, were excluded from the calculation of summary statistics. False negative scans were re-read by a specialized pediatric radiologist with 6 years of experience.
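
The summary-statistics step amounts to the sketch below; the CSV export and its column names are assumptions for illustration.

    # Sketch only: file and column names are hypothetical.
    import pandas as pd

    df = pd.read_csv("pediatric_cxr_results.csv")
    kept = df[df.ai_output != "to be read"]   # excluded per the protocol

    tp = ((kept.label == "abnormal") & (kept.ai_output == "abnormal")).sum()
    fn = ((kept.label == "abnormal") & (kept.ai_output == "normal")).sum()
    tn = ((kept.label == "normal") & (kept.ai_output == "normal")).sum()
    fp = ((kept.label == "normal") & (kept.ai_output == "abnormal")).sum()

    print(f"sensitivity = {tp / (tp + fn):.0%}")  # 91% in this study
    print(f"specificity = {tn / (tn + fp):.0%}")  # 96% in this study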

RESULTS:

Out of the 3,319 cases, 1,802 were labeled “to be read” and excluded from analysis. On the remaining 1,517 cases, the algorithm gave a sensitivity of 91% and a specificity of 96%. The 38 false negatives were reviewed, and only 9 had truly missed findings: 7 cases of consolidation, 1 of atelectasis, and 1 of vascular engorgement.


Figure 1


CONCLUSION:

Our independent evaluation provides evidence of AI’s ability to accurately read and triage normal pediatric CXRs, thereby saving significant time and effort on the part of radiologists.

CLINICAL RELEVANCE/APPLICATION:

Most AI algorithms are trained on adult data and hence perform poorly on pediatric cases, for which the lack of trained radiologists is a constant problem, especially in the developing and underdeveloped world.