I tried training from scratch. Training began, but then failed with a CUDA out-of-memory error:
[2022-06-06 15:19:24,930] [33996] [MainThread] [INFO] (monailabel.tasks.train.basic_train:591) - 0 - Load Path C:\Users\mikeb\apps\radiology\model\pretrained_deepedit_dynunet.pt
Loading dataset: 0%| | 0/15 [00:00<?, ?it/s]
Loading dataset: 100%|##########| 15/15 [00:31<00:00, 2.13s/it]
[2022-06-06 15:19:56,846] [33996] [MainThread] [INFO] (monailabel.tasks.train.basic_train:227) - 0 - Records for Training: 15
[2022-06-06 15:19:56,850] [33996] [MainThread] [INFO] (ignite.engine.engine.SupervisedTrainer:696) - Engine run resuming from iteration 0, epoch 0 until 50 epochs
[2022-06-06 15:19:57,257] [33996] [MainThread] [INFO] (ignite.engine.engine.SupervisedTrainer:138) - Restored all variables from C:\Users\mikeb\apps\radiology\model\pretrained_deepedit_dynunet.pt
2022-06-06 15:20:00,406 - INFO - Number of simulated clicks: 6
[2022-06-06 15:20:04,627] [33996] [MainThread] [INFO] (ignite.engine.engine.SupervisedTrainer:269) - Epoch: 1/50, Iter: 1/15 -- train_loss: 1.7812
[2022-06-06 15:20:05,039] [33996] [MainThread] [ERROR] (ignite.engine.engine.SupervisedTrainer:853) - Current run is terminating due to exception: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 4.00 GiB total capacity; 2.68 GiB already allocated; 0 bytes free; 2.93 GiB reserved in total by PyTorch)
[2022-06-06 15:20:05,040] [33996] [MainThread] [ERROR] (ignite.engine.engine.SupervisedTrainer:178) - Exception: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 4.00 GiB total capacity; 2.68 GiB already allocated; 0 bytes free; 2.93 GiB reserved in total by PyTorch)
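The mismatch between "2.93 GiB reserved in total by PyTorch" and "0 bytes free" makes me wonder whether allocator fragmentation plays a role. One thing I may try (based on PyTorch's CUDA memory-management documentation; I have not confirmed it helps in this case) is capping the caching allocator's split size before launching the server:

```shell
rem Windows cmd: limit PyTorch caching-allocator block size to reduce fragmentation
set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
monailabel start_server --app apps/radiology --studies datasets/liver_test_from_scratch/ --conf models deepedit --conf use_pretrained_model false --conf heuristic_planner true
```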
Here is the command used to start the server:
monailabel start_server --app apps/radiology --studies datasets/liver_test_from_scratch/ --conf models deepedit --conf use_pretrained_model false --conf heuristic_planner true
Also, despite passing use_pretrained_model false, inference seems to be using a pretrained model: when I press Next Sample I get a spleen segmentation, which must be coming from a pretrained model. I tried the heuristic_planner true option because it sounded like it would choose an appropriate image grid size/spacing for training based on the available GPU memory.
However, I see this section in the output:
[2022-06-06 15:19:14,021] [33996] [MainThread] [INFO] (monailabel.utils.others.generic:147) - Using nvidia-smi command
[2022-06-06 15:19:14,345] [33996] [MainThread] [INFO] (monailabel.utils.others.planner:71) - Available GPU memory: {0: 2100} in MB
[2022-06-06 15:19:14,345] [33996] [MainThread] [INFO] (monailabel.utils.others.generic:147) - Using nvidia-smi command
[2022-06-06 15:19:14,393] [33996] [MainThread] [INFO] (monailabel.utils.others.planner:75) - Spacing: [1. 1. 2.]; Spatial Size: [1, 1, 256]
I see the default spatial size is [48, 48, 32], which makes me wonder whether [1, 1, 256], presumably generated by the heuristic planner, is reasonable.
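My (possibly wrong) mental model of resampling is that a target spacing scales each dimension by old_spacing / new_spacing, so for a typical abdominal CT I would expect an in-plane grid far larger than 1 x 1. A quick back-of-envelope check (my own sketch, not the planner's actual code; the example CT dimensions and spacings are made up):

```python
def resampled_size(size, old_spacing, new_spacing):
    """Approximate voxel grid size after resampling a volume to new_spacing (mm)."""
    return [max(1, round(s * o / n)) for s, o, n in zip(size, old_spacing, new_spacing)]

# e.g. a 512x512x100 CT with 0.8 x 0.8 x 2.5 mm voxels, resampled
# to the planner's reported spacing of 1 x 1 x 2 mm:
print(resampled_size([512, 512, 100], [0.8, 0.8, 2.5], [1.0, 1.0, 2.0]))
# -> [410, 410, 125]
```

By this estimate the in-plane size of [1, 1] looks like a bug rather than a genuine plan.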
I am encouraged that training began (the first iteration seems to have finished, since a train_loss was logged), but I am not sure how to go about troubleshooting this GPU memory error. I have an NVIDIA GeForce GTX 1050 Ti with Max-Q Design, with 4 GB of GPU memory. If these specs are far too low for a segmentation problem like this, what would be a reasonable down-sampled volume size that could run?
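To get a feel for what might fit in 4 GB, I tried a very rough per-sample activation estimate for a 3D U-Net-style model. The base channel count and the feature-map overhead multiplier below are my guesses, not DynUNet's actual numbers, so this is only an order-of-magnitude sketch:

```python
def approx_activation_mb(spatial_size, base_channels=32, overhead=4.0):
    """Crude estimate of float32 activation memory (MB) for one 3D sample."""
    voxels = 1
    for s in spatial_size:
        voxels *= s
    # 4 bytes per float32 value; 'overhead' roughly accounts for
    # multiple feature maps retained for the backward pass
    return voxels * base_channels * overhead * 4 / (1024 ** 2)

for size in ([48, 48, 32], [128, 128, 64], [256, 256, 128]):
    print(size, f"~{approx_activation_mb(size):.0f} MB")
# -> [48, 48, 32] ~36 MB, [128, 128, 64] ~512 MB, [256, 256, 128] ~4096 MB
```

If these guesses are anywhere near right, the default [48, 48, 32] should fit comfortably, while something like [256, 256, 128] would exhaust the card on its own, which suggests the problem is the plan rather than the GPU being fundamentally too small.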