DRR-RATE: A Large Scale Synthetic Chest X-ray Dataset Complete with Labels and Radiological Reports

Chest X-rays are essential for diagnosing pulmonary and cardiac conditions, including pneumonia and lung lesions, and are widely used in resource-limited settings. Advances in AI have greatly improved automated medical image analysis, which depends on large, curated datasets. Recently, the focus has shifted to multimodal models, such as Large Language Models and Vision-Language Models, which require extensive and diverse training data. This study uses Digitally Reconstructed Radiography (DRR) to generate synthetic X-ray images from the CT-RATE dataset. Because CT-RATE is rich in binary labels and detailed radiological reports, the resulting images are valuable for training AI classifiers for disease diagnosis.

Researchers from the Imaging Biomarkers and Computer-Aided Diagnosis Laboratory, Clinical Center, and the National Center for Biotechnology Information, National Library of Medicine, have introduced DRR-RATE, a dataset of synthetic X-ray images generated from computed tomography (CT) data using ray tracing techniques. Unlike conventional radiographs, DRRs offer controlled and reproducible imaging conditions by simulating the path of X-rays through CT volumes. Each DRR pixel’s intensity is determined by the attenuation coefficients of the tissues along the ray path, reflecting X-ray absorption. DRRs are used in radiation therapy planning, surgical preparation, education, and algorithm development. They support precise dose calculations in therapy and accurate 2D-3D image registration for surgery, and they aid medical education by providing realistic representations of various conditions. Ongoing research aims to improve DRR generation speed and image quality.
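To make the pixel-intensity description concrete, here is a minimal sketch of a DRR computed from a CT volume. It is not the DRR-RATE pipeline: it assumes parallel rays along one array axis (a real setup would trace diverging rays from a point source), a single illustrative attenuation value for water, and the function names hu_to_mu and parallel_drr are hypothetical.

```python
import numpy as np

# Assumed linear attenuation coefficient of water in 1/mm.
# Illustrative value only; the true figure depends on beam energy.
MU_WATER_PER_MM = 0.02


def hu_to_mu(volume_hu):
    """Convert Hounsfield units to linear attenuation coefficients (1/mm)."""
    mu = MU_WATER_PER_MM * (1.0 + volume_hu / 1000.0)
    # Attenuation cannot be negative; clip values below air (-1000 HU).
    return np.clip(mu, 0.0, None)


def parallel_drr(volume_hu, spacing_mm=1.0, axis=1):
    """Project a CT volume to 2D with parallel rays along `axis`.

    Each output pixel is the absorbed fraction 1 - exp(-sum(mu) * step),
    following the Beer-Lambert law, so denser tissue appears brighter.
    """
    mu = hu_to_mu(np.asarray(volume_hu, dtype=np.float64))
    line_integral = mu.sum(axis=axis) * spacing_mm
    transmitted = np.exp(-line_integral)
    return 1.0 - transmitted


# Toy volume: air everywhere, with a block of water-equivalent tissue.
toy = np.full((8, 16, 8), -1000.0)
toy[2:6, 3:13, 2:6] = 0.0
image = parallel_drr(toy, spacing_mm=1.0, axis=1)
print(image.shape, image.min(), image.max())
```

Rays that cross only air produce a pixel value of zero, while rays crossing the block are attenuated in proportion to the path length through it.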

Several large-scale chest X-ray datasets have been pivotal in advancing medical imaging research. For instance, ChestX-ray8 and ChestX-ray14, released by the US National Institutes of Health (NIH), contain over 112,000 scans from more than 30,000 individuals. These datasets use NLP techniques to extract disease labels from radiological reports. CheXpert, another notable dataset, includes 224,316 radiographs from 65,240 patients at Stanford Health Care, also labeled using NLP methods. PadChest, comprising over 160,000 images, offers detailed annotations from radiologists at San Juan Hospital in Spain. MIMIC-CXR and VinDr-CXR further enhance research capabilities with extensive datasets annotated by radiologists from major medical centers. Together, these datasets support research in disease detection and AI applications in radiology and related fields.

The DRR-RATE dataset, an extension of the CT-RATE dataset, is built from 50,188 chest CT volumes from 21,304 patients, each paired with a radiology text report and binary labels for 18 pathology classes. The volume count was expanded by modifying the reconstruction matrix of the original DICOM studies, which increases the dataset's utility for medical imaging research. Patient demographics show a diverse age range and gender distribution across the training and validation subsets. DRR images are generated with ray tracing algorithms that simulate X-ray projections from CT data, enabling multimodal research that bridges the CT and X-ray imaging modalities. The dataset is publicly accessible under a CC BY-NC-SA license.

In the experiments with the DRR-RATE dataset, CheXnet was trained and evaluated for chest X-ray classification, and its performance was compared with that of a model trained on the CheXpert dataset. Under five-fold cross-validation, Cardiomegaly and Pleural Effusion showed robust performance with AUC scores of 0.92 and 0.95, respectively, indicating high predictive accuracy. Atelectasis and Consolidation showed moderate AUC values of 0.72 and 0.74, suggesting decent but less consistent performance. Lung Nodule and Lung Opacity had lower AUC scores, around 0.66 and 0.67, indicating room for improvement. When CheXnet was trained on CheXpert and tested on DRR-RATE, performance decreased slightly for most conditions due to domain differences between real and DRR images.
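For readers unfamiliar with the metric, the sketch below shows one common way to compute a per-class AUC for a multi-label classifier, using the rank-based (Mann-Whitney) formulation. It is an illustration only, not the authors' evaluation code; the function names roc_auc and per_class_auc are hypothetical, and the labels and scores are made-up toy values, not results from the paper.

```python
import numpy as np


def roc_auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic.

    Equals the probability that a random positive case scores higher
    than a random negative case, with ties counted as one half.
    """
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=np.float64)
    n_pos = int(labels.sum())
    n_neg = labels.size - n_pos
    if n_pos == 0 or n_neg == 0:
        return float("nan")  # undefined when only one class is present

    # Assign 1-based ranks, averaging the ranks of tied scores.
    order = np.argsort(scores, kind="mergesort")
    sorted_scores = scores[order]
    ranks = np.empty(scores.size, dtype=np.float64)
    i = 0
    while i < scores.size:
        j = i
        while j + 1 < scores.size and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0
        i = j + 1

    u_stat = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return u_stat / (n_pos * n_neg)


def per_class_auc(label_matrix, score_matrix, class_names):
    """Compute one AUC per pathology column of a multi-label problem."""
    label_matrix = np.asarray(label_matrix)
    score_matrix = np.asarray(score_matrix)
    return {
        name: roc_auc(label_matrix[:, k], score_matrix[:, k])
        for k, name in enumerate(class_names)
    }


# Toy data: 4 images, 2 pathology classes (values are invented).
toy_labels = [[0, 1], [0, 0], [1, 1], [1, 0]]
toy_scores = [[0.10, 0.90], [0.40, 0.20], [0.35, 0.70], [0.80, 0.30]]
result = per_class_auc(toy_labels, toy_scores, ["Cardiomegaly", "Pleural Effusion"])
print(result)
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is why scores near 0.66 read as weak and scores above 0.9 as strong.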

DRR-RATE is a synthetic chest X-ray dataset derived from CT scans, offering labeled images and radiological reports. By rendering CT-derived pathologies in X-ray form, DRR-RATE enriches training data for diagnostic models and supports research across imaging modalities. Evaluating baseline CheXnet models trained on the DRR-RATE and CheXpert datasets revealed robust performance, particularly in detecting Cardiomegaly, Consolidation, and Pleural Effusion. However, challenges remain for subtle conditions like Atelectasis, Lung Nodule, and Lung Opacity, potentially due to resolution limitations in DRR images. Nonetheless, DRR-RATE marks a significant step in synthesizing medical imaging data, strengthening AI-driven diagnostic capabilities, and advancing medical research.

Check out the Paper and Dataset. All credit for this research goes to the researchers of this project.

The post DRR-RATE: A Large Scale Synthetic Chest X-ray Dataset Complete with Labels and Radiological Reports appeared first on MarkTechPost.