Test with torch.multiprocessing and DataLoader

PyTorch’s DataLoader is a great tool for speeding up data loading. Experimenting with it also helped me consolidate my understanding of Python multiprocessing.

Here is a didactic code snippet:

from torch.utils.data import DataLoader, Dataset
import torch
import time
import datetime
import torch.multiprocessing as mp
num_batches = 110

# Module-level side effect: printed whenever this file is executed or re-imported.
print("File init")

class DataClass:
    def __init__(self, x):
        self.x = x


class SleepDataset(Dataset):
    def __len__(self):
        return num_batches

    def __getitem__(self, idx):
        print(f"sleep on {idx}")
        time.sleep(5)
        print(f"finish sleep on {idx} at {datetime.datetime.now()}")
        return DataClass(torch.randn(5))


def collate_fn(batch):
    assert len(batch) == 1
    return batch[0]


def _set_seed(worker_id):
    torch.manual_seed(worker_id)
    torch.cuda.manual_seed(worker_id)


if __name__ == "__main__":
    mp.set_start_method("spawn")
    num_workers = mp.cpu_count() - 1
    print(f"num of workers {num_workers}")
    dataset = SleepDataset()
    dataloader = DataLoader(
        dataset,
        batch_size=1,
        shuffle=False,
        num_workers=num_workers,
        worker_init_fn=_set_seed,
        collate_fn=collate_fn,
    )

    # Iterate until the dataset is exhausted. Calling next() 1000 times
    # would raise StopIteration after num_batches items.
    for batch in dataloader:
        print(batch.x)

SleepDataset is a Dataset that fakes an expensive computation by sleeping for 5 seconds per item. We let the DataLoader use one worker process per CPU core, minus one core left for the main process.

Python 3 has three ways to start processes: fork, spawn, and forkserver. I couldn’t find much information online about forkserver, but the difference between fork and spawn has been discussed a lot. fork is only supported on Unix systems. It creates a new process by copying the memory of the parent process into a new address space, and the child continues executing from the point of the fork [3]. The system can still distinguish parent and child by their process IDs [1]. spawn, on the other hand, creates a new process by starting from an executable image (a file) rather than copying the parent’s memory [2]. In Python this means a fresh interpreter is started and the main script is imported again in the child.
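To check which of these start methods a given machine supports, the standard library can be queried directly. The following is a minimal sketch using only the stdlib multiprocessing module (torch.multiprocessing wraps it); the helper names available_start_methods and pick_context are made up for this illustration.

```python
import multiprocessing as mp


def available_start_methods():
    """Return the start methods supported on this platform, default first."""
    return mp.get_all_start_methods()


def pick_context(preferred):
    """Return a context for `preferred`, or for the platform default if unsupported.

    A context object exposes the same API as the multiprocessing module
    (Process, Queue, ...) without changing the global start method.
    """
    methods = available_start_methods()
    if preferred in methods:
        return mp.get_context(preferred)
    # The first entry of get_all_start_methods() is the platform default.
    return mp.get_context(methods[0])


if __name__ == "__main__":
    print(available_start_methods())
    print(pick_context("spawn").get_start_method())
```

Using a context instead of mp.set_start_method avoids the RuntimeError that set_start_method raises when the start method has already been fixed for the process.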

Based on these differences, with mp.set_start_method("spawn"), “File init” is printed first by the main process and then again every time a DataLoader worker process is started, because each spawned worker re-runs the module-level code. The DataLoader starts num_workers worker processes per iterator, not one per batch, so the number of repeated prints follows num_workers rather than num_batches = 110. With mp.set_start_method("fork"), “File init” is printed only once. In my test, forkserver behaved similarly to spawn: “File init” was again printed repeatedly.
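The same effect can be reproduced without PyTorch. The sketch below is an illustration, not the original experiment: it writes a small probe script (the file name init_probe.py and the helper count_file_init are hypothetical) that prints “File init” at module level and starts a few plain multiprocessing.Process workers, then runs it in a separate interpreter and counts the prints.

```python
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical stand-in for the DataLoader script: a module-level print
# plus a configurable number of trivial worker processes.
PROBE_SCRIPT = textwrap.dedent(
    """
    import multiprocessing as mp
    import sys

    print("File init", flush=True)

    def work(i):
        return i

    if __name__ == "__main__":
        mp.set_start_method(sys.argv[1])
        procs = [mp.Process(target=work, args=(i,)) for i in range(int(sys.argv[2]))]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
    """
)


def count_file_init(start_method, num_procs):
    """Run the probe script in a fresh interpreter and count 'File init' lines."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "init_probe.py"
        script.write_text(PROBE_SCRIPT)
        result = subprocess.run(
            [sys.executable, str(script), start_method, str(num_procs)],
            capture_output=True,
            text=True,
            check=True,
            timeout=120,
        )
    return result.stdout.count("File init")
```

Calling count_file_init("spawn", 3) should count one print from the main process plus one per spawned child, while count_file_init("fork", 3) on a Unix system should count only the main process’s print, since forked children inherit the already-imported module instead of re-running it.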

[1] https://qr.ae/TS6uaJ

[2] https://www.unix.com/unix-for-advanced-and-expert-users/178644-spawn-vs-fork.html

[3] https://www.python-course.eu/forking.php
