Code for training an algorithm to classify lung nodules
The baseline algorithm is a VGG network that takes three orthogonal views of a lung nodule as input and predicts both the malignancy risk and the nodule type.
The code for pre-processing and training a model can be found here.
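To make the "three orthogonal views" input concrete, the following is a minimal sketch of how axial, coronal, and sagittal planes could be taken through the centre of a 3D nodule patch. It is not the baseline's actual pre-processing code. The function name `orthogonal_views`, the `volume[z][y][x]` axis order, and the toy patch are assumptions made for illustration only.

```python
from typing import List, Tuple

# Assumed axis order: volume[z][y][x]. Real data may be ordered differently.
Volume = List[List[List[float]]]
Plane = List[List[float]]


def orthogonal_views(volume: Volume) -> Tuple[Plane, Plane, Plane]:
    """Return the axial, coronal, and sagittal planes through the volume centre."""
    depth = len(volume)
    height = len(volume[0])
    width = len(volume[0][0])
    cz, cy, cx = depth // 2, height // 2, width // 2

    # Axial: fix z, keep (y, x).
    axial = [row[:] for row in volume[cz]]
    # Coronal: fix y, keep (z, x).
    coronal = [volume[z][cy][:] for z in range(depth)]
    # Sagittal: fix x, keep (z, y).
    sagittal = [[volume[z][y][cx] for y in range(height)] for z in range(depth)]
    return axial, coronal, sagittal


# Toy 3x3x3 patch in which each voxel value encodes its coordinates as 100*z + 10*y + x.
patch = [[[100 * z + 10 * y + x for x in range(3)] for y in range(3)] for z in range(3)]
axial, coronal, sagittal = orthogonal_views(patch)
```

In a real pipeline the three planes would typically be resampled to a common size and stacked or fed to separate branches of the network; that step is omitted here.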
Using the LIDC-IDRI dataset
You could split the development dataset into five folds, train the network using cross-validation, and then combine the outputs of the resulting models in an ensemble for the final algorithm container. This container will be run on the hold-out test cohort of 10 nodules (phase 1, for testing) and on the external validation cohort (phase 2, final submission(s)). It is recommended that you evaluate your hyperparameters through cross-validation on the publicly available dataset of nodules provided in this challenge. Please do not use the results from the phase 1 leaderboard to fine-tune your algorithms: that leaderboard is only meant to test your algorithm submission pipeline, and you only get 10 attempts at it.
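The cross-validation and ensembling idea can be sketched with the standard library alone. The helper names `stratified_folds` and `ensemble_mean`, the binary toy labels, and the round-robin assignment are illustrative assumptions, not part of the baseline code. Note that this sketch splits per nodule; if several nodules come from the same patient, splitting per patient may be preferable to avoid leakage between folds.

```python
import random
from typing import Dict, List, Sequence


def stratified_folds(labels: Sequence[int], n_folds: int = 5, seed: int = 0) -> List[List[int]]:
    """Assign sample indices to n_folds folds, keeping class proportions similar."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    folds: List[List[int]] = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


def ensemble_mean(fold_predictions: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-nodule probabilities predicted by each fold's model."""
    n_models = len(fold_predictions)
    return [sum(column) / n_models for column in zip(*fold_predictions)]


# Hypothetical toy labels: 20 nodules of class 0 and 10 of class 1.
labels = [0] * 20 + [1] * 10
folds = stratified_folds(labels, n_folds=5)

train_sizes = []
for validation_indices in folds:
    held_out = set(validation_indices)
    train_indices = [i for i in range(len(labels)) if i not in held_out]
    # A model would be trained on train_indices and validated on validation_indices here.
    train_sizes.append(len(train_indices))
```

At inference time, each of the five trained models would score the same nodule and `ensemble_mean` would combine their probabilities into one prediction. Plain averaging is only one option; weighted averaging or other schemes are equally possible.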
How to improve the training procedures
You could start by replacing the VGG net with a more modern convolutional neural network (CNN) such as ResNet50, or even switch to a 3D CNN. To further increase performance, you might want to explore the tips and tricks detailed in the ConvNeXt and/or "ResNet strikes back" papers. Furthermore, you could explore vision transformers if needed.
Data-centric AI approaches are also much appreciated. Cleaning, balancing, and augmenting the data properly may lead to significant performance improvements that tuning the neural network architecture alone would not achieve. The ideal approach usually combines the best of both worlds: good data-centric training procedures and a well-chosen neural network architecture.
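As one small example of balancing, the sketch below computes inverse-frequency class weights, which could be used either to weight the loss or to drive a weighted sampler. The function names and the "benign"/"malignant" toy labels are hypothetical and chosen for illustration; the weighting formula (number of samples divided by number of classes times the class count) is one common convention among several.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def inverse_frequency_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


def per_sample_weights(labels: Sequence[Hashable]) -> List[float]:
    """Expand the class weights to one weight per sample."""
    weights = inverse_frequency_weights(labels)
    return [weights[label] for label in labels]


# Hypothetical imbalanced toy set: 8 benign nodules and 2 malignant ones.
labels = ["benign"] * 8 + ["malignant"] * 2
class_weights = inverse_frequency_weights(labels)
sample_weights = per_sample_weights(labels)
```

With these weights, each class contributes the same total weight to the training objective, so the minority class is not drowned out by the majority class.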
Multi-task classifiers
You may also want to consider training a single network to predict both malignancy risk and nodule type. Multi-task classification networks (like HydraNets) serve this purpose. Such networks are frequently featured in Tesla's AI stack.
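A multi-task network typically shares one backbone and attaches one output head per task, and training minimizes a weighted sum of the per-task losses. The sketch below shows only that combined loss, in plain Python, for a single nodule. It assumes a single malignancy logit, a hypothetical set of three nodule-type classes, and a tunable `type_weight`; none of these come from the baseline code, and the network itself is not shown.

```python
import math
from typing import Sequence


def binary_cross_entropy(logit: float, target: int) -> float:
    """Binary cross-entropy on a raw logit, in a numerically stable form."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Softmax cross-entropy computed with the log-sum-exp trick."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(value - peak) for value in logits))
    return log_sum_exp - logits[target]


def multi_task_loss(
    malignancy_logit: float,
    malignancy_target: int,
    type_logits: Sequence[float],
    type_target: int,
    type_weight: float = 1.0,
) -> float:
    """Malignancy loss plus a weighted nodule-type loss (type_weight is an assumption)."""
    malignancy_loss = binary_cross_entropy(malignancy_logit, malignancy_target)
    type_loss = cross_entropy(type_logits, type_target)
    return malignancy_loss + type_weight * type_loss


# One nodule: the malignancy head outputs a single logit, the type head three logits.
loss = multi_task_loss(
    malignancy_logit=1.5,
    malignancy_target=1,
    type_logits=[0.2, 2.0, -1.0],
    type_target=1,
    type_weight=0.5,
)
```

The relative weight between the two tasks is itself a hyperparameter and could be tuned through the same cross-validation used for the rest of the model.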