How to start with monailabel for new models

I tried training from scratch; training began, but then failed with a GPU memory error:

[2022-06-06 15:19:24,930] [33996] [MainThread] [INFO] (monailabel.tasks.train.basic_train:591) - 0 - Load Path C:\Users\mikeb\apps\radiology\model\pretrained_deepedit_dynunet.pt
Loading dataset:   0%|          | 0/15 [00:00<?, ?it/s]
Loading dataset: 100%|##########| 15/15 [00:31<00:00,  2.13s/it]
[2022-06-06 15:19:56,846] [33996] [MainThread] [INFO] (monailabel.tasks.train.basic_train:227) - 0 - Records for Training: 15
[2022-06-06 15:19:56,850] [33996] [MainThread] [INFO] (ignite.engine.engine.SupervisedTrainer:696) - Engine run resuming from iteration 0, epoch 0 until 50 epochs
[2022-06-06 15:19:57,257] [33996] [MainThread] [INFO] (ignite.engine.engine.SupervisedTrainer:138) - Restored all variables from C:\Users\mikeb\apps\radiology\model\pretrained_deepedit_dynunet.pt
2022-06-06 15:20:00,406 - INFO - Number of simulated clicks: 6
[2022-06-06 15:20:04,627] [33996] [MainThread] [INFO] (ignite.engine.engine.SupervisedTrainer:269) - Epoch: 1/50, Iter: 1/15 -- train_loss: 1.7812
[2022-06-06 15:20:05,039] [33996] [MainThread] [ERROR] (ignite.engine.engine.SupervisedTrainer:853) - Current run is terminating due to exception: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 4.00 GiB total capacity; 2.68 GiB already allocated; 0 bytes free; 2.93 GiB reserved in total by PyTorch)
[2022-06-06 15:20:05,040] [33996] [MainThread] [ERROR] (ignite.engine.engine.SupervisedTrainer:178) - Exception: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 4.00 GiB total capacity; 2.68 GiB already allocated; 0 bytes free; 2.93 GiB reserved in total by PyTorch)
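For what it's worth, the figures in the traceback roughly add up. This is my own back-of-envelope arithmetic, not anything MONAI Label computes; the assumption that the unaccounted-for memory is held by the Windows display driver and other processes is mine:

```python
# Reconciling the numbers from the log above (hand arithmetic, not MONAI Label code).
total_mib = 4 * 1024        # GTX 1050 Ti Max-Q: 4 GiB total
available_mib = 2100        # what the heuristic planner saw via nvidia-smi
reserved_mib = 2.93 * 1024  # reserved by PyTorch at the moment of the OOM

# ~2 GiB was already in use before training even started:
print(total_mib - available_mib)        # 1996

# ~1 GiB sits outside PyTorch's pool (assumption: display driver and
# other processes), which is why "0 bytes free" despite a 4 GiB card:
print(round(total_mib - reserved_mib))  # 1096
```

So the card effectively offered about 2 GiB to training, and PyTorch's reserved pool had already grown past that when the 128 MiB allocation failed.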

Here is the command used to start the server:

monailabel start_server --app apps/radiology --studies datasets/liver_test_from_scratch/ --conf models deepedit --conf use_pretrained_model false --conf heuristic_planner true

Also, despite setting use_pretrained_model false, inference still seems to use a pretrained model: when I press Next Sample I get a spleen segmentation, which must come from a pretrained model. I tried the heuristic_planner true option because it sounded like it would choose an appropriate image grid size/spacing for training based on the available GPU memory.

However, I see this section in the output:

[2022-06-06 15:19:14,021] [33996] [MainThread] [INFO] (monailabel.utils.others.generic:147) - Using nvidia-smi command
[2022-06-06 15:19:14,345] [33996] [MainThread] [INFO] (monailabel.utils.others.planner:71) - Available GPU memory: {0: 2100} in MB
[2022-06-06 15:19:14,345] [33996] [MainThread] [INFO] (monailabel.utils.others.generic:147) - Using nvidia-smi command
[2022-06-06 15:19:14,393] [33996] [MainThread] [INFO] (monailabel.utils.others.planner:75) - Spacing: [1. 1. 2.]; Spatial Size: [1, 1, 256]

I see the default spatial size is [48,48,32], which makes me wonder whether [1,1,256], presumably generated by the heuristic planner, is reasonable.
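A quick voxel count (my own sanity check, not MONAI Label code) suggests the planner's size is degenerate rather than a deliberate memory-saving choice:

```python
# Compare voxel counts of the default DeepEdit spatial size and the
# size the heuristic planner reported. A [1, 1, 256] patch has fewer
# voxels than a single 16x16 slice, so it is almost certainly degenerate.
from math import prod

default_size = [48, 48, 32]  # default DeepEdit spatial_size
planner_size = [1, 1, 256]   # what the heuristic planner reported

print(prod(default_size))  # 73728 voxels
print(prod(planner_size))  # 256 voxels
```

A patch that is 1 voxel wide in two dimensions can hardly be what the planner intended, which makes me suspect it misread the dataset's image geometry.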

I am encouraged that training began (and that at least one iteration completed), but I am not sure how to troubleshoot this GPU memory error. I have an NVIDIA GeForce GTX 1050 Ti with Max-Q with 4 GB of GPU memory. If these specs are far too low for a segmentation problem like this, what would be a reasonable down-sampled image volume size that would be able to run?
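To get a feel for what might fit, here is a crude estimate I put together. The assumptions are all mine: float32 activations and a flat 32 channels standing in for one layer of a real U-Net, whose true footprint also depends on depth, batch size, gradients, and optimizer state, so treat the absolute numbers as rough scaling guides only:

```python
# Crude per-layer activation-memory estimate for candidate patch sizes
# (hand-rolled sketch; real DynUNet memory use will differ).
from math import prod

def feature_map_mib(spatial_size, channels=32, bytes_per_voxel=4):
    """MiB for one batch of float32 feature maps at the given size."""
    return prod(spatial_size) * channels * bytes_per_voxel / 2**20

for size in ([128, 128, 128], [96, 96, 96], [64, 64, 64], [48, 48, 32]):
    print(size, feature_map_mib(size), "MiB per layer")
# [128, 128, 128] -> 256.0 MiB, [96, 96, 96] -> 108.0 MiB,
# [64, 64, 64]    ->  32.0 MiB, [48, 48, 32] ->   9.0 MiB
```

The cubic growth is the point: halving each spatial dimension cuts activation memory by 8x, so on a card with roughly 2 GiB effectively free, something in the [64,64,64] to [96,96,96] range (or the [48,48,32] default) seems much more plausible than larger volumes.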
