🕷️ Crawler Inspector

URL Lookup

Direct Parameter Lookup

Raw Queries and Responses

1. Shard Calculation

Query:
Response:
Calculated Shard: 171 (from laksa043)
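
The raw query and response are not shown. For orientation only: shard assignment in a setup like this is typically a deterministic hash of the site key taken modulo the shard count. The sketch below is a hypothetical illustration, not the inspector's actual logic; the hash function, the shard count of 1024, and the key format are all assumptions.

```python
# Hypothetical shard assignment: hash the site key, then reduce it
# modulo the shard count. Hash choice, shard count, and key format
# are assumptions; the real service may differ on all three.
import hashlib

def shard_for(site_key: str, num_shards: int = 1024) -> int:
    digest = hashlib.sha1(site_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# In this deployment the lookup resolved to shard 171, served by host
# laksa043; a scheme like this would route every URL of the same site
# to that same shard.
print(shard_for("com,hiascend!www,"))
```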

2. Crawled Status Check

Query:
Response:

3. Robots.txt Check

Query:
Response:
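
The raw query and response are omitted above. As a point of reference, the kind of check that yields a ROBOTS ALLOWED verdict can be sketched with Python's standard `urllib.robotparser`; the wildcard user agent is an assumption, and a production crawler would consult a cached copy rather than fetch robots.txt on every lookup.

```python
# Sketch of a robots.txt permission check using the standard library.
# The user agent token is an assumption; the inspector's real query
# is not shown above and likely reads from a cache.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.hiascend.com/robots.txt")
rp.read()  # fetch and parse the live robots.txt
allowed = rp.can_fetch(
    "*",  # assumed user agent token
    "https://www.hiascend.com/document/detail/zh/canncommercial/700/"
    "modeldevpt/ptmigr/AImpug_000028.html",
)
print("ROBOTS ALLOWED" if allowed else "ROBOTS DISALLOWED")
```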

4. Spam/Ban Check

Query:
Response:

5. Seen Status Check

ℹ️ Skipped - page is already crawled

📄 INDEXABLE · CRAWLED 5 days ago
🤖 ROBOTS ALLOWED

Page Info Filters

| Filter | Status | Condition | Details |
| --- | --- | --- | --- |
| HTTP status | PASS | download_http_code = 200 | HTTP 200 |
| Age cutoff | PASS | download_stamp > now() - 6 MONTH | 0.2 months ago |
| History drop | PASS | isNull(history_drop_reason) | No drop reason |
| Spam/ban | PASS | fh_dont_index != 1 AND ml_spam_score = 0 | ml_spam_score=0 |
| Canonical | PASS | meta_canonical IS NULL OR = '' OR = src_unparsed | Not set |
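
Read as one predicate, the five conditions above combine as sketched below. The field names come straight from the table; the row-as-dict representation and the six-month approximation are assumptions.

```python
from datetime import datetime, timedelta

def passes_page_filters(row: dict, now: datetime) -> bool:
    """Mirror of the five PASS conditions in the table above."""
    http_ok = row["download_http_code"] == 200
    fresh_enough = row["download_stamp"] > now - timedelta(days=183)  # ~6 months
    not_dropped = row.get("history_drop_reason") is None              # isNull(...)
    not_spam = row["fh_dont_index"] != 1 and row["ml_spam_score"] == 0
    canonical_ok = row.get("meta_canonical") in (None, "", row["src_unparsed"])
    return http_ok and fresh_enough and not_dropped and not_spam and canonical_ok
```

This page passes all five filters, consistent with the INDEXABLE badge above.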

Page Details

| Property | Value |
| --- | --- |
| URL | https://www.hiascend.com/document/detail/zh/canncommercial/700/modeldevpt/ptmigr/AImpug_000028.html |
| Last Crawled | 2026-04-11 11:13:30 (5 days ago) |
| First Indexed | 2024-10-08 05:01:26 (1 year ago) |
| HTTP Status Code | 200 |
| Meta Title | Launching Multi-Card Distributed Training - Multi-Card Distributed Training - Model Training - Model Migration and Training - PyTorch Network Model Migration and Training - Model Development (PyTorch) - CANN Commercial Edition 7.0.0 Documentation - Ascend Community |
| Meta Description | Launching multi-card distributed training. In single-node and multi-node scenarios, distributed training can be launched in four ways: shell script (recommended), mp.spawn, Python, and torchrun. The torchrun method is supported only on PyTorch 1.11.0 and later. Using a simple model script as an example, the following shows the code changes each of the first three launch methods requires. The torchrun code changes match the shell script (truncated) |
| Meta Canonical | null |
Boilerpipe Text
(Plain-text extraction of the article body; the same content appears, formatted, in the Markdown section below.)
Markdown
# Launching Multi-Card Distributed Training

In single-node and multi-node scenarios, distributed training can be launched in four ways: shell script (recommended), `mp.spawn`, Python, and `torchrun`. The `torchrun` method is supported only on PyTorch 1.11.0 and later. Using a simple model script as an example, the following shows the code changes each of the first three launch methods requires; the `torchrun` changes are identical to those for the shell script method.

![Note](https://www.hiascend.com/doc_center/source/zh/canncommercial/700/modeldevpt/ptmigr/public_sys-resources/note_3.0-zh-cn.png)

1. Collective communication has the following constraints:
   - In data-parallel mode, the graphs executed on different devices must be identical.
   - For Atlas training series products: `allreduce` and `reduce_scatter` support only the int8, int32, float16, and float32 data types.
   - For Atlas A2 training series products: `allreduce` and `reduce_scatter` support only the int8, int32, float16, float32, and bfloat16 data types.
   - During distributed training, HCCL uses a range of ports on the host server to collect cluster information, and the operating system must reserve those ports. By default HCCL uses ports 60000-60015; if the environment variable `HCCL_IF_BASE_PORT` specifies a different starting port for the host NIC, reserve the 16 ports starting at that port instead.

     Example of reserving the port range at the OS level:

     ```
     sysctl -w net.ipv4.ip_local_reserved_ports=60000-60015
     ```

2. To run 2-card training, an 8-card training script can be rewritten into a 2-card script with the following changes (a code sketch follows this list):
   1. If the 8-card script's batch size is 8 times the single-card batch size, divide both the 8-card batch size and learning rate by 4 to obtain the 2-card batch size and learning rate.
   2. If a for loop launches the training entry script, change the loop count to 2.
   3. Change the world size (or rank size) to 2, and make sure the `world_size` argument of `dist.init_process_group()` in the training script is 2.
   4. If a device list argument is specified with the value range 0-7, change it to 0-1.
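A minimal sketch of the four changes, with illustrative values; none of these names come from the sample script below.

```python
# Hypothetical 8-card -> 2-card rewrite following the four steps above.
# All values and names are illustrative, not from the sample script.
batch_size_8p, lr_8p = 512, 1e-3    # assumed 8-card settings

world_size = 2                      # step 3: world size / rank size becomes 2
batch_size = batch_size_8p // 4     # step 1: batch size divided by 4
learning_rate = lr_8p / 4           # step 1: learning rate divided by 4
device_list = [0, 1]                # step 4: device range 0-7 becomes 0-1

# step 2: the launch loop now iterates twice; dist.init_process_group()
# must likewise receive world_size=2.
for local_rank in range(world_size):
    print(f"launch rank {local_rank}/{world_size} on device {device_list[local_rank]}")
```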
#### Building a Simple Model

First, build a simple neural network.

```
# Import dependencies
import torch
from torch import nn
import torch_npu
import torch.distributed as dist
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor
import time
import torch.multiprocessing as mp
import os

torch.manual_seed(0)

# Download the training data
training_data = datasets.FashionMNIST(
    root="./data",
    train=True,
    download=True,
    transform=ToTensor(),
)

# Download the test data
test_data = datasets.FashionMNIST(
    root="./data",
    train=False,
    download=True,
    transform=ToTensor(),
)

# Build the model
class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

def test(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    model.eval()
    test_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")
```

#### Obtaining the Hyperparameters

Obtain the hyperparameters needed for training in the main function `main`.

- Shell script / torchrun method

  ```
  def main(world_size: int, batch_size = 64, total_epochs = 5,):  # users may adjust these values
      ngpus_per_node = world_size
      main_worker(args.gpu, ngpus_per_node, args)
  ```

- mp.spawn method

  ```
  def main(world_size: int, batch_size = 64, total_epochs = 5,):  # users may adjust these values
      ngpus_per_node = world_size
      mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))  # launch via mp.spawn
  ```

- Python method

  ```
  def main(world_size: int, batch_size, args):  # hyperparameters set on the Python launch command
      ngpus_per_node = world_size
      args.gpu = args.local_rank  # after launch, local_rank automatically receives the device number
      main_worker(args.gpu, ngpus_per_node, args)
  ```

#### Setting the Address and Port

Because process-group initialization on Ascend AI Processors supports only `init_method="env://"` (environment-variable initialization), `MASTER_ADDR`, `MASTER_PORT`, and related variables must be configured before initialization. Set them according to your environment.

- The configuration code is identical for the shell script, mp.spawn, and torchrun methods:

  ```
  def ddp_setup(rank, world_size):
      """
      Args:
          rank: Unique identifier of each process
          world_size: Total number of processes
      """
      os.environ["MASTER_ADDR"] = "localhost"  # set according to your environment
      os.environ["MASTER_PORT"] = "***"        # set according to your environment
      dist.init_process_group(backend="hccl", rank=rank, world_size=world_size)
  ```

- The Python method passes these settings on the launch command instead, so the script only initializes the process group:

  ```
  def ddp_setup(rank, world_size):
      """
      Args:
          rank: Unique identifier of each process
          world_size: Total number of processes
      """
      dist.init_process_group(backend="hccl", rank=rank, world_size=world_size)
  ```
#### Adding the Distributed Logic

The device number is obtained differently depending on the launch method:

- Shell script method: the shell script loop passes a `local_rank` variable that selects the device.
- mp.spawn method: after `mp.spawn` launches `main_worker` in multiple processes, the first argument `gpu` automatically receives the device number (0 to ngpus_per_node - 1).
- Python method: after launch, `local_rank` automatically receives the device number.

Modify the code according to the method you chose.

- Shell script / torchrun method

  ```
  def main_worker(gpu, ngpus_per_node, args):
      start_epoch = 0
      end_epoch = 5
      args.gpu = int(os.environ['LOCAL_RANK'])  # the shell script loop passes local_rank as the device to use
      ddp_setup(args.gpu, args.world_size)
      torch_npu.npu.set_device(args.gpu)
      total_batch_size = args.batch_size
      total_workers = ngpus_per_node
      batch_size = int(total_batch_size / ngpus_per_node)
      workers = int((total_workers + ngpus_per_node - 1) / ngpus_per_node)
      model = NeuralNetwork()
      device = torch.device("npu")
      train_sampler = torch.utils.data.distributed.DistributedSampler(training_data)
      test_sampler = torch.utils.data.distributed.DistributedSampler(test_data)
      train_loader = torch.utils.data.DataLoader(
          training_data, batch_size=batch_size, shuffle=(train_sampler is None),
          num_workers=workers, pin_memory=False, sampler=train_sampler, drop_last=True)
      val_loader = torch.utils.data.DataLoader(
          test_data, batch_size=batch_size, shuffle=(test_sampler is None),
          num_workers=workers, pin_memory=False, sampler=test_sampler, drop_last=True)
      loc = 'npu:{}'.format(args.gpu)
      model = model.to(loc)
      criterion = nn.CrossEntropyLoss().to(loc)
      optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
      model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
      for epoch in range(start_epoch, end_epoch):
          print("curr epoch: ", epoch)
          train_sampler.set_epoch(epoch)
          train(train_loader, model, criterion, optimizer, epoch, args.gpu)

  def train(train_loader, model, criterion, optimizer, epoch, gpu):
      size = len(train_loader.dataset)
      model.train()
      end = time.time()
      for i, (images, target) in enumerate(train_loader):
          # measure data loading time
          loc = 'npu:{}'.format(gpu)
          target = target.to(torch.int32)
          images, target = images.to(loc, non_blocking=False), target.to(loc, non_blocking=False)
          # compute output
          output = model(images)
          loss = criterion(output, target)
          # compute gradient and do SGD step
          optimizer.zero_grad()
          loss.backward()
          optimizer.step()
          end = time.time()
          if i % 100 == 0:
              loss, current = loss.item(), i * len(target)
              print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]")
  ```

- mp.spawn method

  There is no need to set `args.gpu`; replace every `args.gpu` inside the shell script version of `main_worker` with `gpu`.

  ```
  def main_worker(gpu, ngpus_per_node, args):
      start_epoch = 0
      end_epoch = 5
      ddp_setup(gpu, args.world_size)
      torch_npu.npu.set_device(gpu)
      total_batch_size = args.batch_size
      total_workers = ngpus_per_node
      batch_size = int(total_batch_size / ngpus_per_node)
      workers = int((total_workers + ngpus_per_node - 1) / ngpus_per_node)
      model = NeuralNetwork()
      device = torch.device("npu")
      train_sampler = torch.utils.data.distributed.DistributedSampler(training_data)
      test_sampler = torch.utils.data.distributed.DistributedSampler(test_data)
      train_loader = torch.utils.data.DataLoader(
          training_data, batch_size=batch_size, shuffle=(train_sampler is None),
          num_workers=workers, pin_memory=False, sampler=train_sampler, drop_last=True)
      val_loader = torch.utils.data.DataLoader(
          test_data, batch_size=batch_size, shuffle=(test_sampler is None),
          num_workers=workers, pin_memory=False, sampler=test_sampler, drop_last=True)
      loc = 'npu:{}'.format(gpu)
      model = model.to(loc)
      criterion = nn.CrossEntropyLoss().to(loc)
      optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
      model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[gpu])
      for epoch in range(start_epoch, end_epoch):
          print("curr epoch: ", epoch)
          train_sampler.set_epoch(epoch)
          train(train_loader, model, criterion, optimizer, epoch, gpu)

  ......  # the train function is the same as in the shell script method
  ```

- Python method

  ```
  def main_worker(gpu, ngpus_per_node, args):
      start_epoch = 0
      end_epoch = 5
      args.gpu = args.local_rank  # after launch, local_rank automatically receives the device number
      ddp_setup(args.gpu, args.world_size)
      ......  # the rest of the code is the same as in the shell script method
  ```
#### Setting the Parameters

In the model script, set the parameters required by the chosen launch method.

- Shell script / torchrun method

  ```
  if __name__ == "__main__":
      import argparse
      parser = argparse.ArgumentParser(description='simple distributed training job')
      parser.add_argument('--batch_size', default=512, type=int, help='Input batch size on each device (default: 512)')
      parser.add_argument('--gpu', default=None, type=int, help='GPU id to use.')
      args = parser.parse_args()
      world_size = torch.npu.device_count()
      args.world_size = world_size
      main(args.world_size, args.batch_size)
  ```

- mp.spawn method

  ```
  if __name__ == "__main__":
      import argparse
      parser = argparse.ArgumentParser(description='simple distributed training job')
      parser.add_argument('--batch_size', default=512, type=int, help='Input batch size on each device (default: 512)')
      args = parser.parse_args()
      world_size = torch.npu.device_count()
      args.world_size = world_size
      main(args.world_size, args.batch_size)
  ```

- Python method

  ```
  if __name__ == "__main__":
      import argparse
      parser = argparse.ArgumentParser(description='simple distributed training job')
      parser.add_argument('--batch_size', default=512, type=int, help='Input batch size on each device (default: 512)')
      parser.add_argument('--gpu', default=None, type=int, help='GPU id to use.')
      parser.add_argument("--local_rank", default=-1, type=int)  # local_rank automatically receives the device number; remove this argument when launching via mp.spawn or a shell script
      args = parser.parse_args()
      world_size = torch.npu.device_count()
      args.world_size = world_size
      main(args.world_size, args.batch_size, args)  # pass the arguments set on the Python launch command into main
  ```

#### Launching Training

The commands below are examples; adjust them to your environment.

- Shell script method

  ```
  export HCCL_WHITELIST_DISABLE=1
  RANK_ID_START=0
  WORLD_SIZE=8
  for((RANK_ID=$RANK_ID_START;RANK_ID<$((WORLD_SIZE+RANK_ID_START));RANK_ID++));
  do
      echo "Device ID: $RANK_ID"
      export LOCAL_RANK=$RANK_ID
      python3 ddp_test_shell.py &
  done
  wait
  ```

- mp.spawn method

  ```
  export HCCL_WHITELIST_DISABLE=1
  python3 ddp_test_spwn.py
  ```

- Python method

  ```
  # set master_addr and master_port according to your environment
  export HCCL_WHITELIST_DISABLE=1
  python3 -m torch.distributed.launch --nproc_per_node 8 --master_addr localhost --master_port *** ddp_test.py
  ```

- torchrun method (supported on PyTorch 1.11.0 and later)

  ```
  export HCCL_WHITELIST_DISABLE=1
  torchrun --standalone --nnodes=1 --nproc_per_node=8 ddp_test_shell.py
  ```

When the screen prints Loss values like those in the figure below, training has been launched successfully.

![Training loss printout](https://www.hiascend.com/doc_center/source/zh/canncommercial/700/modeldevpt/ptmigr/figure/zh-cn_image_0000001781358645.png)

**Parent topic:** [Multi-Card Distributed Training](https://www.hiascend.com/document/detail/zh/canncommercial/700/modeldevpt/ptmigr/AImpug_000025.html)
Readable Markdown
(Identical to the article content in the Markdown section above.)
| Property | Value |
| --- | --- |
| Shard | 171 (laksa) |
| Root Hash | 2628830536891727371 |
| Unparsed URL | com,hiascend!www,/document/detail/zh/canncommercial/700/modeldevpt/ptmigr/AImpug_000028.html s443 |
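
The unparsed URL is stored as a host-reversed, SURT-like key ending in a scheme/port tag (`s443` for HTTPS on port 443), and the Root Hash is presumably a fingerprint of the site-root portion of that key, though the hash function is not shown. The sketch below reproduces the key format as inferred from this single example; the exact canonicalization rules, including the `!` separator before the subdomain and the handling of multi-label public suffixes, are assumptions.

```python
# Reconstruct the SURT-like key format inferred from the example above.
# Assumes the registrable domain is the last two host labels, which is
# wrong for suffixes like .co.uk; the real canonicalizer may differ.
from urllib.parse import urlsplit

def to_unparsed_key(url: str) -> str:
    parts = urlsplit(url)
    labels = parts.hostname.split(".")         # ["www", "hiascend", "com"]
    root = ",".join(reversed(labels[-2:]))     # "com,hiascend"
    sub = ",".join(reversed(labels[:-2]))      # "www"
    host_key = f"{root}!{sub}," if sub else f"{root},"
    port = parts.port or (443 if parts.scheme == "https" else 80)
    scheme_tag = f" s{port}" if parts.scheme == "https" else f" {port}"
    return f"{host_key}{parts.path}{scheme_tag}"

print(to_unparsed_key(
    "https://www.hiascend.com/document/detail/zh/canncommercial/700/"
    "modeldevpt/ptmigr/AImpug_000028.html"
))  # prints the key shown in the table above
```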