Tips and Tricks on Basic Programming Tools for Radiologists to Handle DICOM Data

TEACHING POINTS

• In the era of artificial intelligence, it is beneficial for radiologists to learn some basic programming tools to organise and curate DICOM data.

• There are numerous simple, user-friendly tools available that radiologists can use in their clinical and research practice.

• DCM4CHE and DCMTK are open-source toolkits that can be used for the following (a minimal sketch follows the list):

a. Extracting images from PACS systems using filters such as study date and modality.

b. Sorting, modifying and analysing data.

c. Converting DICOM (DCM) images to JPEG or PDF.

d. Transferring data to PACS or any other viewer.
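
The following is a minimal sketch of driving two DCMTK command-line tools from Python. It assumes DCMTK is installed and on the PATH; the PACS host, port, AE titles and file names are placeholders, not real endpoints.

```python
# Minimal sketch: calling DCMTK command-line tools from Python.
import subprocess

# Convert a DICOM file to JPEG with dcmj2pnm (+oj selects JPEG output).
subprocess.run(["dcmj2pnm", "+oj", "input.dcm", "output.jpg"], check=True)

# Send a DICOM file to a PACS node with storescu (DICOM C-STORE).
subprocess.run(
    ["storescu", "-aet", "MY_AE", "-aec", "PACS_AE",
     "pacs.example.org", "104", "input.dcm"],
    check=True,
)
```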

• Python is one of the easier programming languages, which a radiologist without much programming background can learn and start using to manipulate DICOM data. Some Python modules useful in radiology are listed below (a short example follows the list):

a. Pydicom

b. Matplotlib

c. Pynetdicom3

d. tqdm
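
A minimal sketch tying these modules together, assuming a folder of uncompressed .dcm files (compressed transfer syntaxes may need an extra pixel-data handler); the folder path is a placeholder.

```python
# Read a folder of DICOM files with pydicom, show progress with tqdm,
# and save each image as a JPEG via matplotlib.
from pathlib import Path

import matplotlib.pyplot as plt
import pydicom
from tqdm import tqdm

dicom_dir = Path("studies")  # placeholder folder of .dcm files
for path in tqdm(sorted(dicom_dir.glob("*.dcm"))):
    ds = pydicom.dcmread(path)  # parse header and pixel data
    plt.imsave(path.with_suffix(".jpg"), ds.pixel_array, cmap="gray")
```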

TABLE OF CONTENTS/OUTLINE

• Why a radiologist should learn basic programming tools.

• About the DICOM toolkit (DCMTK) and how to install it.

• Functions of DCMTK and how to use them.

• What is Python? How to install Python.

• What is pip? How to install Python modules using pip.

• Use of Python to modify DICOM metadata (a short example follows).
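
As an illustration of the last point, a minimal pydicom sketch for modifying DICOM metadata; the tag values are illustrative only.

```python
# Minimal sketch: editing DICOM metadata with pydicom.
import pydicom

ds = pydicom.dcmread("input.dcm")
ds.PatientName = "ANONYMOUS"      # overwrite an identifying tag
ds.PatientID = "RESEARCH_0001"    # illustrative replacement ID
ds.remove_private_tags()          # drop vendor-specific private tags
ds.save_as("anonymised.dcm")
```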

Building Robust ML Models Using Federated Learning: The Future of AI Deployment

TEACHING POINTS

What do deep neural networks learn?

Are they cramming, or are they learning?

How to avoid cramming and move towards learning.

How to measure learning metrics.

Is it sufficient to learn once?

What about the kinds of data the model has not seen during the training phase?

Is it possible to bring all the varieties of training data to one place, just as we can bring pictures of every dog breed to one place?

If a false diagnosis was found in one location, shouldn’t the global model learn from it?

What is federated learning?

How is federated learning done in practice?

What about the privacy of data?

What are the issues around ownership of the final model?

Existing frameworks (in beta versions): TensorFlow Federated, PySyft (PyTorch)

Live example with one of the frameworks.

TABLE OF CONTENTS/OUTLINE

Understanding how the ML model was built: where did the training data come from? What were the test metrics?

Understanding overfitting (and generalisability) of a model

Active Learning in Medicine

Federated Learning

Sample code / tutorial (a framework-agnostic sketch follows)
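
A framework-agnostic toy sketch of one round of federated averaging (FedAvg) in plain NumPy; the three "sites" and their data are synthetic, and real deployments would use TensorFlow Federated or PySyft, whose APIs differ from this toy version.

```python
import numpy as np

def local_update(weights, data, labels, lr=0.01):
    """One gradient step of logistic regression at a single site."""
    preds = 1.0 / (1.0 + np.exp(-data @ weights))
    grad = data.T @ (preds - labels) / len(labels)
    return weights - lr * grad

rng = np.random.default_rng(0)
global_w = np.zeros(5)
sites = [(rng.normal(size=(100, 5)), rng.integers(0, 2, 100))
         for _ in range(3)]

# Each site trains locally; only model weights leave the site,
# never the underlying patient data.
local_ws = [local_update(global_w, X, y) for X, y in sites]
sizes = [len(y) for _, y in sites]

# The server averages the site models, weighted by local sample counts.
global_w = np.average(local_ws, axis=0, weights=sizes)
```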

How to Lie with Statistics: Things To Keep in Mind While Evaluating a Deep Learning Claim

TEACHING POINTS

1. In today's age of deep learning and artificial intelligence, a radiologist must know what to watch out for while evaluating a deep learning algorithm's claims.

2. What is ground truth?

3. Specific points to keep in mind while evaluating:

What is the medium of communication? Is it a video, a pre-print or a reputed peer-reviewed journal article?

What is the performance metric? Accuracy alone is a poor metric (see the sketch after this list).

What data was the algorithm developed on? Generally, algorithms developed on poor ground truth perform poorly.

What data was the algorithm validated on? Generally, algorithms validated on data from the same institution the training data came from tend to show inflated performance.

How much data was it tested on? Test data should not only be independent, but also adequate, both in number and disease heterogeneity.

What are the implications of the algorithm failing – what if a chest X-Ray algorithm misses a critical finding?

4. Try to get access to the actual algorithm and run it in your department.
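
To illustrate the point about accuracy, a small synthetic sketch: a "model" that calls every case normal scores 95% accuracy on a set with 5% disease prevalence, yet detects nothing.

```python
import numpy as np

y_true = np.array([1] * 5 + [0] * 95)  # 5 abnormal, 95 normal cases
y_pred = np.zeros(100, dtype=int)      # predict "normal" for everyone

accuracy = (y_pred == y_true).mean()             # 0.95
sensitivity = y_pred[y_true == 1].mean()         # 0.0 - misses all disease
specificity = (y_pred[y_true == 0] == 0).mean()  # 1.0
print(accuracy, sensitivity, specificity)
```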

TABLE OF CONTENTS/OUTLINE

1. Why should a radiologist know how to evaluate a deep learning algorithm?

2. Performance metrics for evaluating algorithms

3. Data – training and testing

4. When an algorithm fails – implications

5. Run AI in your department!

Practical Guide for Deployment of AI Solutions in Clinical Environment: How Did We Do It?

TEACHING POINTS

• Every AI company developing or validating algorithms on different modalities has this question in mind: how do we deploy AI algorithms in a radiology department?

• Different types of deployment options:

o On-cloud

o On-prem (CPU-only)

o On-prem (CPU + GPU)

• How to integrate deployment with the department’s PACS/Workstation?

• Site-specific deployment, necessitated by non-uniform data within the same modality’s images.

• Why deploying AI algorithms as Docker containers is a wise choice.

• IoT devices such as the Raspberry Pi can be used for image data anonymization in an on-cloud setup.

• How can the results of the algorithm be transferred back to the PACS/Workstation? (A sketch follows this list.)

• How can the result also be shown in the hospital or radiology centre’s HIS/RIS using HL7 messaging?

• What type of hardware configuration can be used in an on-prem deployment?

• How to ensure patient data privacy?
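
As one possible answer to the PACS question above, a minimal pynetdicom sketch sending an AI result (packaged as a secondary-capture DICOM file) back to a PACS node. The host, port, AE titles and file name are placeholders; the abstract does not specify the exact mechanism used.

```python
from pydicom import dcmread
from pynetdicom import AE
from pynetdicom.sop_class import SecondaryCaptureImageStorage

ae = AE(ae_title="AI_NODE")
ae.add_requested_context(SecondaryCaptureImageStorage)

ds = dcmread("result_sc.dcm")  # AI output as a secondary-capture image
assoc = ae.associate("pacs.example.org", 104, ae_title="PACS_AE")
if assoc.is_established:
    status = assoc.send_c_store(ds)  # push the result to the PACS
    print("C-STORE status:", status)
    assoc.release()
```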

Getting AI Ready for Deployment: Tuning Algorithms to Specific Sites Using a Single Chest X-Ray Image

PURPOSE

Lack of generalisation of deep neural networks, due to equipment and geographic variability, is a known problem facing the radiology community today. We propose a novel method to get algorithms ‘deployment ready’ by using a single reference Chest X-Ray (CXR) image from a potential deployment site, with the intention of automatically reading all ‘normal’ CXRs.

METHOD AND MATERIALS

A deep learning model based on DenseNet-121 (M1) was trained on ~250,000 CXRs from the CheXpert dataset and ~50,000 CXRs from the NIH CXR14 dataset to predict a ‘normal’ or ‘abnormal’ label. The model was evaluated on 3 datasets – E1 (n=3587), E2 (n=200) and E3 (n=212). E1 and E3 were 2 separate datasets obtained from 3 outpatient imaging centres and 3 hospital imaging departments; E2 is the CheXpert validation dataset. M2, a Siamese variation of M1, uses a reference image for every site (to capture site- and scanner-specific variation) and was evaluated on E1, E2 and E3. The specificity of M1 and M2 was compared at a fixed sensitivity (97%) to determine their capability to correctly identify normal CXRs.
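
A sketch of the operating-point selection described above, using scikit-learn on synthetic scores: find the lowest threshold reaching 97% sensitivity, then read off the specificity.

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 1000)                     # synthetic labels
y_score = y_true * 0.3 + rng.normal(0.4, 0.25, 1000)  # synthetic scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
idx = np.argmax(tpr >= 0.97)  # first operating point with >=97% sensitivity
print(f"threshold={thresholds[idx]:.3f}, specificity={1 - fpr[idx]:.3f}")
```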

RESULTS

The area under the receiver operating characteristic curve (AUROC) increased from 0.92, 0.87 and 0.84 for M1 to 0.95, 0.89 and 0.89 for M2 on E1, E2 and E3 respectively. At 97% sensitivity, M1 had a specificity of 0.41, 0.29 and 0.02 on E1, E2 and E3 respectively, which, after tuning M1 with a single reference image (M2), increased to 0.63, 0.29 and 0.45.

CONCLUSION

Our results indicate that deep learning models can be generalised across equipment, institutions and countries by simply using a single reference image to tune the model, showing potential to improve the functioning of deep learning algorithms in general. In this case, we observe a drastic improvement in the results of a model that distinguishes normal from abnormal images with a high degree of confidence.

CLINICAL RELEVANCE/APPLICATION

More than 50% of all CXRs done across the world are reported as ‘normal’. We demonstrate a novel method by which a single algorithm can be deployed across sites to automate the reading of normal CXRs with high sensitivity, saving radiologists’ time and improving the speed of reporting.

Mediastinal Lymph Nodal Staging by 18F-FDG PET/CT in Patients with Coexistent Carcinoma Lung and Tuberculosis: A Tertiary Care Centre Experience (RSNA 2019, Sun Dec 01 – 06)

PURPOSE

The aim of this study is to evaluate the imaging characteristics of metastatic and benign (tubercular) lymph nodes on 18F-FDG PET/CT in patients with co-existent carcinoma lung and tuberculosis, and their correlation with histopathological analysis.

METHOD AND MATERIALS

A retrospective analysis of 25 patients (19 males, 6 females; mean age 62.4 +/- 10.08 years) with co-existent carcinoma lung and tuberculosis was done. All subjects underwent 18F-FDG PET/CT scanning, and the mediastinal lymph nodes were subsequently biopsied. SUVmax-Tumour, SUVmax-Lymph node and SUVmax-Ratio (SUVmax-Lymph node / SUVmax-Tumour) were determined for each lymph node station on 18F-FDG PET/CT, and each station was classified into one of three groups based on SUVmax-Tumour (low, medium and high). Diagnostic performance was assessed by receiver operating characteristic (ROC) curve analysis, and the optimal cut-off values that would best discriminate metastatic from benign lymph nodes were determined for each method (an illustrative sketch follows).
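
A sketch of the SUVmax-Ratio computation and cut-off selection, on synthetic values (not study data); the abstract does not state which optimality criterion was used, so Youden's J, a common choice, is assumed here.

```python
import numpy as np
from sklearn.metrics import roc_curve

suv_node = np.array([6.5, 7.2, 2.1, 3.0, 8.0, 2.6])    # synthetic node SUVmax
suv_tumour = np.array([7.0, 9.5, 6.0, 8.0, 9.0, 7.5])  # synthetic tumour SUVmax
is_metastatic = np.array([1, 1, 0, 0, 1, 0])           # biopsy ground truth

suv_ratio = suv_node / suv_tumour  # SUVmax-Ratio per nodal station

fpr, tpr, thresholds = roc_curve(is_metastatic, suv_ratio)
best = thresholds[np.argmax(tpr - fpr)]  # maximise Youden's J = TPR - FPR
print(f"optimal SUVmax-Ratio cut-off: {best:.2f}")
```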

RESULTS

A total of 115 lymph node stations (mean 4.6 stations per patient) and 540 lymph nodes (mean 21.6 nodes per patient) were resected and biopsied. 79 nodes were reported positive for metastasis and 27 nodes were reported as granulomatous. On the pre-treatment 18F-FDG PET/CT scan, the mean SUVmax-Tumour of squamous cell carcinoma was significantly higher than that of adenocarcinoma (9.9±3.97 vs. 5.76±3.48, P<0.001). The mean SUVmax of malignant lymph nodes was significantly higher than that of tubercular lymph nodes (6.7±0.94 vs. 2.7±0.84, P<0.001). The mean SUVmax-Ratio in patients with malignant lymph nodes was significantly higher than in those with tubercular lymph nodes (0.91±0.36 vs. 0.41±0.28, P<0.001).

CONCLUSION

The overall diagnostic accuracy of 18F-FDG PET/CT in mediastinal lymph nodal staging of patients with co-existent tuberculosis and carcinoma lung is 67.4% if an SUVmax of 2.5 is taken as the cut-off criterion; however, if the SUVmax-Ratio is taken into consideration, the overall diagnostic accuracy increases to 74.8%, thus helping in the accurate staging of patients.

CLINICAL RELEVANCE/APPLICATION

Carcinoma lung with co-existing tuberculosis results in false-positive mediastinal lymph nodes and fallacies in preoperative staging.

Making Spine MR Reports More Clinically Appropriate: A Questionnaire-based Survey of Sub-specialty Spine Surgeons

PURPOSE

MR reports have remained unchanged for a long time, and the clinical relevance of MR findings is being challenged in the literature. We assessed the weightage that spine surgeons give to certain aspects of the MR report, as well as their preferences for report structure and for different modalities.

METHOD AND MATERIALS

An anonymous online survey, created in consultation with 5 spine surgeons, was circulated amongst sub-specialist spine surgeons. It included questions related to measurement of spinal canal dimensions, information about nerve root impingement, anomalies and take-off, annular fissures, Modic changes, scoliosis and listhesis. Preferences for report format (every level reported, significant levels reported, pain chart diagram) and for the modality of investigation before surgery for lumbar degenerative disc disease were also recorded.

RESULTS

24 sub-specialist spine surgeons from 6 cities, with an average of 13.9 years’ experience (range: 3 – 30 years), completed the questionnaire. Responses were weighted towards surgically relevant details such as effective spinal canal measurement (79%), nerve root impingement (91%), obvious anomalies at the level of the significant disc (61%), and level of nerve root take-off (75%); 50% wanted details of posterior annular fissures only, and 25% of surgeons preferred ‘hyperintense zone’ terminology. Surprisingly, comparable proportions of responses were obtained for Modic changes (62%) and for the possibility of inflammatory spondyloarthropathy (58%) or infection (67%). On reporting formats, the majority asked for only the involved levels (71%) while 33% asked for every level; 33% asked for a diagrammatic pain chart. There was no consensus on the reporting of scoliosis cases, and the majority asked for information about the cause of listhesis. As expected, for pre-surgical assessment of degenerative disc disease, MR (87%) and X-ray of the spine with flexion and extension (75%) were preferred, while only 8.3% asked for plain CT and none asked for CT myelography.

CONCLUSION

These results highlight clinically relevant information that should be included in an MR report, including effective spinal canal dimensions, details of nerve root anomalies at the level of disc herniation, and details of nerve root impingement. There was a lack of consensus on Modic changes, report format, and scoliosis assessment.

CLINICAL RELEVANCE/APPLICATION

Two-way communication between spine surgeons and radiologists helps in the generation of effective reports that improve clinical outcomes.

Evaluating the Complementary Role of Pseudo-STIR in Assessment of Hyperintense Marrow Lesions as Compared to T2-STIR

PURPOSE

T2-weighted (T2W) images contain inherent T1-weighted (T1W) contrast. Pseudo-STIR images are generated by a simple post-processing technique of subtracting T1W images from T2W images, as sketched below. In this study we probe the diagnostic value of Pseudo-STIR for identifying hyperintense marrow lesions in comparison with T2-STIR sequences.
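
An illustrative NumPy equivalent of this subtraction (the study itself used the OsiriX subtraction tool): the file names are placeholders, the two series are assumed co-registered with identical matrices, and min-max normalisation before subtraction is an added assumption to keep the toy example well-behaved.

```python
import numpy as np
import pydicom

t1 = pydicom.dcmread("sag_t1.dcm").pixel_array.astype(float)
t2 = pydicom.dcmread("sag_t2.dcm").pixel_array.astype(float)

# Scale each image to [0, 1] so differing receiver gains do not
# dominate the subtraction, then subtract T1W from T2W.
t1 = (t1 - t1.min()) / (t1.max() - t1.min())
t2 = (t2 - t2.min()) / (t2.max() - t2.min())

pseudo_stir = np.clip(t2 - t1, 0, None)  # fluid stays bright, fat is suppressed
```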

METHOD AND MATERIALS

117 spine MR cases with sagittal T1 FSE (n=85) or T1 FLAIR (n=32), T2W and STIR images, from studies performed on 1.5T and 3.0T machines, were extracted from PACS. The Pseudo-STIR images were created on an OsiriX workstation using the subtraction tool. The resulting 234 sets of STIR and Pseudo-STIR images were anonymized and read blindly by three independent radiologists (R1, R2 and R3, with 13, 16 and 32 years of experience respectively) with respect to the number of hyperintense lesions seen. The quality of the study and the confidence level of the observer in rating the lesions were also encoded. Accuracy for each Pseudo-STIR case was determined by the observers’ ability to match their independently reported count for the corresponding STIR image.

RESULTS

The accuracy of the observers in reporting the count of hyperintense lesions in the Pseudo-STIR cases was reasonably good (R1: 69%, R2: 78%, R3: 64%). Accuracy increased when the observers reported on cases to which they had assigned the highest image quality rating of three. All three readers were also more accurate on cases they gave the highest confidence rating of three (R1: 75%, R2: 80%, R3: 69%). Only two of the 117 cases (both T1 FLAIR-derived Pseudo-STIR) were incorrectly marked by all three observers. Additionally, there was no significant bias in quality rating (at the highest rating of three) with respect to Pseudo-STIR origin, with R1 (76%/24%), R2 (77%/23%) and R3 (69%/31%) scoring in line with the data distribution (73%/27%). Finally, statistical testing for a difference in accuracy based on Pseudo-STIR origin (T1 FSE / T1 FLAIR) revealed no difference for any of the three observers.

CONCLUSION

These results point to the value offered by including a STIR sequence in an MSK protocol. Where STIR is not available in a specific view plane, a Pseudo-STIR provides supporting evidence.

CLINICAL RELEVANCE/APPLICATION

In this study, we demonstrate the potential complementary value offered by a simple post-processing technique, especially in situations where the STIR sequence was not obtained prospectively.

Deploying Deep Learning for Quality Control: An AI-assisted Review of Chest X-rays Reported as ‘Normal’ in Routine Clinical Practice

PURPOSE

Quality control in radiology has thus far been restricted to performing random double reads or collating information about clinical correlation – both tedious and expensive activities. We present a novel use-case for AI: double-reading Chest X-Rays (CXRs) and flagging cases where the radiologist may have erred.

METHOD AND MATERIALS

This study on the feasibility of deploying deep learning algorithms for quality control was conducted on pooled data from four out-patient imaging departments. The radiology workflow included a ‘report approval’ station where a simple, high-level binary label – ‘normal’ or ‘abnormal’ – was applied by radiologists. All adult CXRs marked ‘normal’ were prospectively analyzed by a deep learning algorithm (LUNIT Insight, S. Korea) tuned for automated normal vs abnormal classification; note that the algorithm was not trained on data from the institutes or country of testing. It provided an ‘abnormality score’ (range 0.00 – 1.00), and all images marked ‘abnormal’ at the high-sensitivity setting (threshold = 0.16) were reviewed by a sub-specialist chest radiologist with 8 years’ experience.
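
A sketch of the QC loop described above; `abnormality_score` is a hypothetical stand-in for the commercial model's inference call (its real API is not described here), and only the 0.16 threshold comes from the study.

```python
THRESHOLD = 0.16  # high-sensitivity operating point from the study

def abnormality_score(cxr_path: str) -> float:
    """Hypothetical wrapper around the deployed model's inference call."""
    raise NotImplementedError

def flag_for_review(normal_reported: list[str]) -> list[str]:
    # Re-read every CXR the radiologist called 'normal' and queue any
    # case the model scores above threshold for sub-specialist review.
    return [p for p in normal_reported if abnormality_score(p) > THRESHOLD]
```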

RESULTS

A total of 708 CXRs were marked ‘normal’ by radiologists during the one-month study period. 46/708 (6.49%) were labelled ‘abnormal’ by the algorithm. Upon review of these 46 CXRs, 12 showed true abnormalities: four lung opacities, three with significant blunting of the costophrenic angles, two with apical fibrosis, one cavity, one nodule and one with cardiomegaly. Appropriate corrective and preventive actions were taken, and feedback was provided to the radiologists who reported these cases.

CONCLUSION

We demonstrate AI algorithms’ ability to quickly parse large datasets and help identify errors by radiologists. This is a fast and effective method of deploying AI algorithms in clinical practice with no risk (from AI) to patients, and with a clear, measurable positive impact.

CLINICAL RELEVANCE/APPLICATION

A radiologist’s workflow supported by a parallel, second-read AI would allow faster reporting, as it can help reduce errors in radiology reports, improving patient care in the process. Importantly, this quality assurance study of CXR reporting demonstrates the potential for AI to both personalize and prioritize training modules for radiologists.

Establishing Normative Kidney Sizes for a Large Developing Country’s Adult Population Using Big Data: A Study of 30,000 Ultrasound Scans Yields a Potential Gender and Age-related Difference

PURPOSE

There are no large-scale population-level studies describing kidney size in the normal adult population of this geography. Currently, radiologists rely on data from other countries or on limited data from small-scale studies specific to this geography. We studied the kidney sizes of 30,000 patients with normal kidneys and compared our findings to currently established normal values.

METHOD AND MATERIALS

65,000 text reports of abdominal ultrasound scans performed on patients presenting to 4 radiology clinics between June 2016 and December 2018 were extracted and anonymised. 35,064 reports were removed from the database because they either described some abnormality in the kidney (as determined by a filter-based text search) or were from the pediatric population. Kidney sizes (length and breadth) were present in all of the remaining 29,936 reports (48.6% females), and cortical thickness measurements were present in 1,624 reports (46.1% females). The sizes and cortical thickness of both kidneys were extracted using keyword-based mechanisms (an illustrative sketch follows) and summary statistics were calculated.
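
An illustrative regex for such keyword-based extraction; the report phrasing pattern is an assumption, since the actual templates are not given in the abstract.

```python
import re

# Assumed report phrasing: "Right kidney measures 10.2 x 4.5 cm."
PATTERN = re.compile(
    r"(right|left)\s+kidney\s+measures\s+([\d.]+)\s*x\s*([\d.]+)\s*cm",
    re.IGNORECASE,
)

text = "Right kidney measures 10.2 x 4.5 cm. Left kidney measures 10.4 x 4.6 cm."
for side, length, breadth in PATTERN.findall(text):
    print(side.lower(), float(length), float(breadth))  # side, length, breadth in cm
```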

RESULTS

The average age was 49.8 years for females and 52 years for males. The average kidney length was 10 cm (right) and 10 cm (left) in females, and 10.3 cm (right) and 10.4 cm (left) in males. Average cortical thickness was 1.1 cm (left) and 1.2 cm (right) in females, and 1.3 cm (left) and 1.4 cm (right) in males. However, regression plots of kidney length vs. age showed inflection points occurring earlier in females (38.2 y right, 39.3 y left) than in males (43.2 y right, 42.2 y left). This observed difference might support the idea that kidney atrophy begins earlier in females than in males. Additionally, compared with standard textbook kidney sizes and previous literature from this geography, our study values were slightly higher – a study from 2014 reported sizes of 9.7 cm (right) and 9.8 cm (left) in males, and 9.5 cm (right) and 9.7 cm (left) in females.

CONCLUSION

The use of data-mining techniques enables the study of large datasets that currently reside unstudied in institutions across the world, giving insight into normative values across age groups, populations and regions.

CLINICAL RELEVANCE/APPLICATION

Practicing radiologists and clinicians can use age- and gender-specific normal sizes to improve their reporting and guide more appropriate clinical management.