All True Positives Are Not Truly Positive – Utility Of Explainability Failure Ratio In A High Sensitivity Screening Setting To Compare Incorrect Localizations Among Algorithms
To evaluate the localization failures of deep learning-based chest X-ray classification algorithms for the detection of consolidation.
To compare the localization accuracies of two algorithms in a high-sensitivity screening setting by comparing their explainability failure ratios (EFR).
METHOD AND MATERIALS:
A total of 632 chest X-rays were randomly collected from an academic centre hospital and read by a chest radiologist. Ground truth (GT) for consolidation was established on the CARING analytics platform (CARPL), both at study level and at image level, the latter by marking a bounding box around the lesion. These X-rays were then analysed by two algorithms: an open-source re-implementation of CheXpert, Stanford's baseline X-ray classification model, which uses DenseNet121 as its backbone architecture, and CareXnet, a network trained using the multi-task learning paradigm that also uses a DenseNet121 backbone. Both provide heat maps for each class, generated using guided Grad-CAM, to indicate the confidence of the detected disease. The number of true positive cases was estimated at an operating threshold that provides 95% sensitivity. The heat maps were matched to the GT bounding boxes using a greedy matching algorithm. EFR was then estimated as the ratio of true positive cases that failed the matching process to the total number of true positive cases.
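The matching and EFR steps described above could be sketched roughly as follows. This is a minimal illustration, not the implementation used in the study: it assumes each heat map has already been reduced to one or more candidate boxes, that overlap is scored by intersection over union (IoU), that `min_iou = 0.1` is a reasonable cut-off, and that a true positive case "fails" when none of its predicted boxes matches any GT box. The function names (`iou`, `greedy_match`, `explainability_failure_ratio`) are hypothetical, and none of these choices is specified in the abstract.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_match(pred: List[Box], gt: List[Box], min_iou: float = 0.1) -> int:
    """Greedily pair predicted and GT boxes, best overlap first.

    Returns the number of one-to-one matches with IoU >= min_iou
    (the threshold is an assumption, not taken from the abstract).
    """
    pairs = sorted(
        ((iou(p, g), i, j) for i, p in enumerate(pred) for j, g in enumerate(gt)),
        reverse=True,
    )
    used_pred, used_gt = set(), set()
    matches = 0
    for score, i, j in pairs:
        if score < min_iou:
            break  # remaining pairs overlap even less
        if i in used_pred or j in used_gt:
            continue  # each box may be matched at most once
        used_pred.add(i)
        used_gt.add(j)
        matches += 1
    return matches


def explainability_failure_ratio(
    tp_cases: List[Tuple[List[Box], List[Box]]], min_iou: float = 0.1
) -> float:
    """EFR = true positive cases with no matched box / all true positive cases.

    tp_cases holds (predicted_boxes, gt_boxes) for true positive cases only.
    """
    if not tp_cases:
        raise ValueError("EFR is undefined without true positive cases")
    failed = sum(
        1 for pred, gt in tp_cases if greedy_match(pred, gt, min_iou) == 0
    )
    return failed / len(tp_cases)
```

With these assumptions, a case whose heat-map box sits entirely outside the radiologist's bounding box counts as an explainability failure even though the study-level prediction is correct.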
There were a total of 169 cases of consolidation. The numbers of true positive cases were 145 and 143 for CheXpert and CareXnet, respectively. After matching the localization outputs with the GT bounding boxes, the numbers of unmatched cases for CheXpert and CareXnet were 41 and 39, respectively, giving EFRs of 28% and 27%.
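The reported ratios follow directly from the EFR definition. Written out, with $N_{\text{unmatched}}$ and $N_{\text{TP}}$ as shorthand introduced here for the unmatched and total true positive counts:

```latex
\mathrm{EFR} = \frac{N_{\text{unmatched}}}{N_{\text{TP}}}, \qquad
\mathrm{EFR}_{\text{CheXpert}} = \frac{41}{145} \approx 0.283 \;(28\%), \qquad
\mathrm{EFR}_{\text{CareXnet}} = \frac{39}{143} \approx 0.273 \;(27\%)
```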
In this study, we found that even at a high-sensitivity operating point, where the number of true positive cases is maximized, deep learning algorithms can have a high degree of explainability failures.
We present a new, clinically acceptable way to compare the localization accuracies of multiple algorithms at high-sensitivity operating points using explainability failure ratios.