Looking for a dataset for radiomics feature-based classification

Hello everyone, I am a computer science student and new to radiomics and medical science.
I want to build a radiomics feature-based classifier, and I need to work with CT scans stored as DICOM images.

First, I found the NSCLC-Radiomics dataset on The Cancer Imaging Archive (TCIA), which labels different regions of the body such as the left and right lung, esophagus, spine, and the abnormal tissue, that is, the tumor itself. From these segments I chose the tumor segmentation and calculated the features. But these features are for cancer patients only. For non-cancer patients, I could not work out what the ROI should be, since a non-cancer CT scan has no tumor to segment.

Is it possible to classify cancer and non-cancer patients with radiomics? If yes, what should the ROI be for a non-cancer patient, and which dataset would be suitable?

After a long search, I came across the LIDC-IDRI dataset on The Cancer Imaging Archive (TCIA): "Data from The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): A completed reference database of lung nodules on CT scans". They mention that they have classified the nodules (abnormal tissue) based on their size. In their article, "The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): A Completed Reference Database of Lung Nodules on CT Scans", I read:

The Database contains 7371 lesions marked “nodule” by at least one radiologist. 2669 of these lesions were marked “nodule≥3 mm” by at least one radiologist, of which 928 (34.7%) received such marks from all four radiologists. These 2669 lesions include nodule outlines and subjective nodule characteristic ratings.

They also rated the nodule size on a scale of 1 to 5 (1 and 2 for small nodules, which I will call non-cancer, and 4 and 5 for large nodules, which I'll treat as cancer data).
I hoped I had finally found the right dataset, but it is very confusing.

I used their Python package pylidc for preprocessing. After running it, I got multiple NumPy arrays, and I don't know how to use them in pyradiomics.

pyradiomics only accepts an image together with its mask, which I find very confusing.

I don't know where to find a dataset for this. I have spent almost a month reading about it and have tried many datasets, but found nothing relevant to my project. I really need help with this.

You can compute the radiomics features for any image. The basic idea is that some of the features may help in distinguishing cancer from non-cancer.

I would recommend doing a literature search before you dig very deep into radiomics features, because recent papers report better results with deep learning than with classic radiomics-based classification.

You can randomly select regions in the same organ (away from regions that are marked as cancer) and use those as non-cancer regions. You probably don’t need a separate data set, but if you want to add some patients that don’t have cancer in the organ of interest, you can use patient data from other collections.

You can create a volume node from a numpy array using slicer.util.addVolumeFromArray. Make sure you set the correct spacing in the created volume node (using SetSpacing).
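A minimal sketch of that step, meant to be run inside 3D Slicer's Python console. It assumes the array is indexed as [slice, row, column] and that you know the voxel spacing in that same order; `spacing_kji_to_xyz` and `array_to_volume_node` are hypothetical helper names, not part of Slicer.

```python
def spacing_kji_to_xyz(spacing_kji):
    """Reverse a (slice, row, column) spacing into the (x, y, z) order used by SetSpacing."""
    return tuple(float(s) for s in reversed(spacing_kji))


def array_to_volume_node(array_kji, spacing_kji, name="FromNumpy"):
    """Create a Slicer volume node from a numpy array indexed as [slice, row, column]."""
    import slicer  # only importable inside 3D Slicer

    node = slicer.util.addVolumeFromArray(array_kji, name=name)
    # Assumption: spacing_kji is given in array-axis order, so it is reversed here.
    node.SetSpacing(*spacing_kji_to_xyz(spacing_kji))
    return node
```

For example, a CT with 2.5 mm slices and 0.7 mm in-plane pixels would be passed as `spacing_kji=(2.5, 0.7, 0.7)`.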

The aim of my project is to compare deep learning with classic radiomics-based classification, and also to check classification performance when both kinds of features are combined.

Sir, I have tried it. The dataset also provides masks of the whole left and right lung, so I calculated the features using those masks and labelled them as non-cancer. I got 100% accuracy when I ran the classification. I think this is because of the size difference between the nodule and the whole left or right lung.

I created the volume too. But the problem is that those arrays, after combining, ended up with a size of 8×512×512 or 7×512×512, whereas my original DICOM series is 120×512×512, or has even more slices than this generated volume node.

You can extract regions from healthy lung tissue that have approximately the same size as the nodules.
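One possible way to do this with plain NumPy, as a rough sketch: take the nodule's bounding-box size and try random boxes of that size until one lies fully inside the lung mask and does not touch the nodule. The function name `sample_healthy_roi` and the retry limit are assumptions for illustration, and both masks are assumed to be boolean arrays with the same shape as the CT volume.

```python
import numpy as np


def sample_healthy_roi(lung_mask, nodule_mask, rng, max_tries=1000):
    """Return a boolean box mask inside the lung, sized like the nodule's bounding box."""
    lung_mask = np.asarray(lung_mask, dtype=bool)
    nodule_mask = np.asarray(nodule_mask, dtype=bool)
    idx = np.argwhere(nodule_mask)
    size = idx.max(axis=0) - idx.min(axis=0) + 1  # nodule bounding-box size per axis
    upper = np.array(lung_mask.shape) - size  # last valid box corner per axis
    for _ in range(max_tries):
        corner = rng.integers(0, upper + 1)  # random corner, one value per axis
        box = tuple(slice(c, c + s) for c, s in zip(corner, size))
        # Accept only boxes fully inside the lung and free of nodule voxels.
        if lung_mask[box].all() and not nodule_mask[box].any():
            roi = np.zeros(lung_mask.shape, dtype=bool)
            roi[box] = True
            return roi
    raise RuntimeError("no healthy region of matching size found")
```

The returned mask can then be used as the non-cancer ROI for the same image, so the cancer and non-cancer regions are comparable in size.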

You can crop and resample to get consistent sizes.

For extracting patches, cropping, resampling, etc. you may find torchio useful.
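If you would rather do the cropping step in plain NumPy, here is a rough sketch. It cuts a fixed-size window centred on the mask out of both the image and the mask, and pads where the window leaves the volume, so every case ends up with the same array shape. The name `crop_around_mask` and the padding value are assumptions; resampling to a common voxel spacing is not covered here and is better left to a dedicated library.

```python
import numpy as np


def crop_around_mask(image, mask, out_shape, pad_value=0):
    """Crop image and mask to out_shape, centred on the mask's bounding box."""
    out_shape = np.asarray(out_shape)
    idx = np.argwhere(mask)
    center = (idx.min(axis=0) + idx.max(axis=0) + 1) // 2
    start = center - out_shape // 2
    stop = start + out_shape
    # Part of the requested window that actually lies inside the volume.
    src_lo = np.maximum(start, 0)
    src_hi = np.minimum(stop, image.shape)
    src = tuple(slice(a, b) for a, b in zip(src_lo, src_hi))
    dst = tuple(slice(a, b) for a, b in zip(src_lo - start, src_hi - start))
    out_image = np.full(tuple(out_shape), pad_value, dtype=image.dtype)
    out_mask = np.zeros(tuple(out_shape), dtype=bool)
    out_image[dst] = image[src]
    out_mask[dst] = mask[src]
    return out_image, out_mask
```

Note that a nodule larger than `out_shape` would be truncated, so the window should be chosen to fit the largest nodule in the dataset.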