My requirement here is to use several local GPUs to run inference with a model that only supports a single GPU, so as to unlock the machine's multi-GPU potential, speed up inference, and save time. Since the model performs its GPU computation with torch, simply calling Python's built-in multiprocessing does not run correctly; torch.multiprocessing is needed instead. The latter supports exactly the same operations as the former, but extends it so that every tensor sent through a multiprocessing.Queue has its data moved into shared memory, while only a handle is sent to the other process.
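Here is a minimal sketch of that shared-memory behavior (the tensor contents and process layout are purely illustrative, not part of the script below): the parent puts a CPU tensor on a torch.multiprocessing.Queue, which moves its storage into shared memory, and the child receives only a handle to that storage.

```python
import torch
import torch.multiprocessing as mp


def consumer(q):
    t = q.get()  # receives a handle; the data itself lives in shared memory
    print(t.is_shared(), t.sum())  # -> True tensor(0.)


if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    q = mp.Queue()
    t = torch.zeros(3)
    q.put(t)  # torch.multiprocessing moves the tensor's storage to shared memory
    p = mp.Process(target=consumer, args=(q,))
    p.start()
    p.join()
```

Running it prints `True` for `t.is_shared()`, confirming the received tensor is backed by shared memory rather than by a copied payload.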
Using multiple GPUs from multiple processes

```python
import math

import torch
import torch.multiprocessing as mp

from fluorescence import detect_fluorescence
from utils import show_allfiles


def split_list_to_nested_list(img_path_list, divider=8):
    """
    Split a long list into evenly sized sub-lists and return them as a nested list.
    @param img_path_list: list of image paths
    @param divider: number of sub-lists to produce
    """
    stride = int(math.ceil(len(img_path_list) / divider))
    img_paths_nested = [
        img_path_list[i * stride : i * stride + stride] for i in range(divider)
    ]
    return img_paths_nested


if __name__ == "__main__":
    # CUDA does not survive fork(); child processes must be started with "spawn".
    mp.set_start_method("spawn", force=True)

    # One chunk of work (and one process) per available GPU.
    divider = torch.cuda.device_count()

    data_path = "/disk0/images"
    img_paths = show_allfiles(path=data_path)
    img_paths_nested = split_list_to_nested_list(img_path_list=img_paths, divider=divider)
    for i, p in enumerate(img_paths_nested):
        print(f"Chunk {i}: {len(p)} images")

    devices = [
        torch.device(f"cuda:{i}") if torch.cuda.is_available() else torch.device("cpu")
        for i in range(divider)
    ]

    threshold_remove_flu = 8.1

    processes = []
    for dev, imgs in zip(devices, img_paths_nested):
        p = mp.Process(
            target=detect_fluorescence,
            args=(
                imgs,  # this process's share of the images
                dev,   # the single GPU this process is pinned to
                "vit_h",
                "/disk1/datasets/models/sam/sam_vit_h_4b8939.pth",
                threshold_remove_flu,
                # the remaining positional arguments are detect_fluorescence
                # tuning parameters, passed through unchanged
                64,
                0.75,
                0.75,
                100,
                1500,
                150000,
                0.5,
            ),
            name=f"Process-{dev}",
        )
        p.start()
        processes.append(p)
        print(f"Started {p.name}")

    for p in processes:
        p.join()
        print(f"Finished {p.name}")

    print("Finished all")
```
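As an alternative to managing mp.Process objects by hand, torch.multiprocessing also provides spawn(), which launches the workers, joins them, and re-raises a child's exception in the parent, which bare mp.Process does not do. This is a sketch under assumed names (worker and the strided chunking are illustrative, not taken from the script above):

```python
import torch
import torch.multiprocessing as mp


def worker(rank, chunks):
    # spawn() passes each process its rank (0 .. nprocs-1) as the first argument;
    # using the rank as the CUDA index pins every worker to a distinct GPU.
    device = torch.device(f"cuda:{rank}" if torch.cuda.is_available() else "cpu")
    for item in chunks[rank]:
        # placeholder workload: the single-GPU model's forward pass would go here
        _ = torch.as_tensor(item, device=device) * 2
    print(f"worker {rank} on {device}: {len(chunks[rank])} items done")


if __name__ == "__main__":
    n_workers = max(torch.cuda.device_count(), 1)
    data = list(range(32))
    chunks = [data[i::n_workers] for i in range(n_workers)]
    # launches n_workers processes and blocks until all of them exit
    mp.spawn(worker, args=(chunks,), nprocs=n_workers, join=True)
```

Either way the design is the same: one process per GPU, each pinned to its own device, with the input list pre-split so the workers never contend for the same data.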
References
Multiprocessing best practices (PyTorch documentation) · The simplest and most practical way to implement PyTorch multi-process distributed training