  • 2019-01-11

Stress testing a deep learning algorithm for normal/abnormal classification of chest x-rays on a spectrum-biased abnormal-weighted dataset

Aims and objectives

To stress test a deep learning normal vs. abnormal chest X-ray classifier on a dataset with spectrum bias against normalcy, that is, a dataset weighted towards abnormal studies.


Methods and materials

A deep learning algorithm consisting of an ensemble of 14 convolutional neural networks (CNNs) and a weighting fully connected network (Fig. 1) was trained on more than 112,000 chest X-ray (CXR) studies labelled with one or more of 14 defined thoracic pathologies. The 14 CNNs were based on the VGG-19 architecture (Fig. 2), and transfer learning from the ImageNet dataset was used to accelerate convergence and improve performance. The algorithm outputs the probability that an input image is pathological, together with a heatmap (Fig. 3) highlighting the regions of the image most likely to be abnormal. The system was validated on a held-out partition of 1,000 studies that were not used during training, obtaining an accuracy of 70%. A real-world, retrospectively acquired independent test set of 301 CXRs (197 abnormal, 104 normal) was then analysed, with the algorithm classifying each X-ray as normal or abnormal. Ground truth for the independent test set was established by a sub-specialist chest radiologist (8 years' experience) who reviewed each image along with its corresponding report. Algorithm output was compared against ground truth and summary statistics were calculated.
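The way an ensemble output can be reduced to a single normal/abnormal call can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the single weighted logistic unit is a hypothetical stand-in for the weighting fully connected network, and the weights, bias, 0.5 threshold and example probabilities are all made-up values chosen only for demonstration.

```python
import math

NUM_NETWORKS = 14  # from the text: an ensemble of 14 CNNs


def sigmoid(x: float) -> float:
    """Logistic function mapping a real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def abnormal_probability(cnn_probs, weights, bias):
    """Combine the per-network probabilities into one abnormality score.

    Assumption: a single weighted logistic unit stands in for the
    weighting fully connected network described in the text.
    """
    if len(cnn_probs) != NUM_NETWORKS or len(weights) != NUM_NETWORKS:
        raise ValueError("expected one probability and one weight per network")
    z = bias + sum(w * p for w, p in zip(weights, cnn_probs))
    return sigmoid(z)


def classify(prob_abnormal, threshold=0.5):
    """Turn the abnormality probability into a normal/abnormal label."""
    return "abnormal" if prob_abnormal >= threshold else "normal"


# Hypothetical weights and inputs, for illustration only.
weights = [4.0] * NUM_NETWORKS
bias = -3.0
quiet_study = [0.02] * NUM_NETWORKS            # no network fires
flagged_study = [0.02] * 13 + [0.95]           # one network fires strongly

for name, probs in [("quiet", quiet_study), ("flagged", flagged_study)]:
    p = abnormal_probability(probs, weights, bias)
    print(f"{name}: p(abnormal)={p:.3f} -> {classify(p)}")
```

In a real system the combining weights would be learned during training rather than set by hand, and the operating threshold would be chosen to suit the intended use.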


Results

The algorithm correctly classified 237 (78.74%) CXRs, with a sensitivity of 83.76% (95% CI: 77.85% to 88.62%) and a specificity of 69.23% (95% CI: 59.42% to 77.91%). There were equal numbers of false positive and false negative cases: 32 (13.5%) of each.
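The summary statistics can be reproduced from the confusion-matrix counts implied by the reported figures (165 true positives and 32 false negatives among 197 abnormal CXRs; 72 true negatives and 32 false positives among 104 normal CXRs). The sketch below is illustrative only. The source does not state which confidence-interval method was used; an exact (Clopper-Pearson) binomial interval, solved here by bisection, is an assumption, so the bounds it prints may differ slightly from the reported ones.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect(f, target, increasing, iters=60):
    """Find p in [0, 1] with f(p) == target for a monotonic f."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided binomial confidence interval for k successes in n trials."""
    lower = 0.0 if k == 0 else _bisect(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2, increasing=True)
    upper = 1.0 if k == n else _bisect(
        lambda p: binom_cdf(k, n, p), alpha / 2, increasing=False)
    return lower, upper


# Counts implied by the reported results.
tp, fn, tn, fp = 165, 32, 72, 32

accuracy = (tp + tn) / (tp + fn + tn + fp)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
sens_ci = clopper_pearson(tp, tp + fn)
spec_ci = clopper_pearson(tn, tn + fp)

print(f"accuracy    {accuracy:.2%}")
print(f"sensitivity {sensitivity:.2%} (95% CI {sens_ci[0]:.2%} to {sens_ci[1]:.2%})")
print(f"specificity {specificity:.2%} (95% CI {spec_ci[0]:.2%} to {spec_ci[1]:.2%})")
```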

For screening applications sensitivity is crucial, because overlooking a pathology may have severe consequences for patients. It is therefore encouraging that, under stress testing, the system prioritizes sensitivity over specificity.


Conclusion

Compared with the validation results (70% accuracy), the deep learning algorithm performed better (78.74% accuracy) on the stress test, which used a biased dataset with more abnormal scans than normal scans.
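When reading accuracy figures from datasets with different case mixes, it can help to recall that overall accuracy is a prevalence-weighted average of sensitivity and specificity. The sketch below illustrates this using the test-set counts implied by the reported results (165 of 197 abnormal and 72 of 104 normal CXRs correctly classified). It assumes sensitivity and specificity stay fixed as the case mix changes, which may not hold in practice; the function name and the 50% prevalence are illustrative choices, not values from the source.

```python
def accuracy_at_prevalence(sensitivity, specificity, prevalence):
    """Overall accuracy when a fraction `prevalence` of cases is abnormal.

    Assumes sensitivity and specificity do not change with the case mix.
    """
    return sensitivity * prevalence + specificity * (1 - prevalence)


sensitivity = 165 / 197  # abnormal CXRs correctly flagged
specificity = 72 / 104   # normal CXRs correctly cleared

# Case mix of the independent test set: 197 abnormal out of 301.
test_set_accuracy = accuracy_at_prevalence(sensitivity, specificity, 197 / 301)

# Hypothetical balanced case mix, for comparison only.
balanced_accuracy = accuracy_at_prevalence(sensitivity, specificity, 0.5)

print(f"test-set mix: {test_set_accuracy:.2%}")
print(f"balanced mix: {balanced_accuracy:.2%}")
```

Because sensitivity exceeds specificity here, accuracy computed this way rises as the proportion of abnormal studies increases.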

Link to complete publication here: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=dqpMNRUAAAAJ&cstart=20&pagesize=80&citation_for_view=dqpMNRUAAAAJ:WF5omc3nYNoC
