ℹ️ Skipped - page is already crawled
| Filter | Status | Condition | Details |
|---|---|---|---|
| HTTP status | PASS | download_http_code = 200 | HTTP 200 |
| Age cutoff | PASS | download_stamp > now() - 6 MONTH | 0.2 months ago |
| History drop | PASS | isNull(history_drop_reason) | No drop reason |
| Spam/ban | PASS | fh_dont_index != 1 AND ml_spam_score = 0 | ml_spam_score=0 |
| Canonical | PASS | meta_canonical IS NULL OR = '' OR = src_unparsed | Not set |
| Property | Value |
|---|---|
| URL | https://www.hiascend.com/document/detail/zh/canncommercial/700/modeldevpt/ptmigr/AImpug_000028.html |
| Last Crawled | 2026-04-11 11:13:30 (5 days ago) |
| First Indexed | 2024-10-08 05:01:26 (1 year ago) |
| HTTP Status Code | 200 |
| Meta Title | 拉起多卡分布式训练-多卡分布式训练-模型训练-模型迁移与训练-PyTorch 网络模型迁移和训练-模型开发(PyTorch)-CANN商用版7.0.0开发文档-昇腾社区 |
| Meta Description | 拉起多卡分布式训练 在单机和多机场景下,有4种方式可拉起分布式训练,分别为shell脚本方式(推荐)、mp.spawn方式、Python方式、torchrun方式。其中torchrun方式仅在PyTorch 1.11.0及以上版本支持使用。以下内容以一个简单模型脚本为样例,展示前3种拉起方式分别需要对脚本代码进行的修改。torchrun方式的代码修改与shell脚本
| Meta Canonical | null |
| Boilerpipe Text | 拉起多卡分布式训练
在单机和多机场景下,有4种方式可拉起分布式训练,分别为shell脚本方式(推荐)、mp.spawn方式、Python方式、torchrun方式。其中torchrun方式仅在PyTorch 1.11.0及以上版本支持使用。以下内容以一个简单模型脚本为样例,展示前3种拉起方式分别需要对脚本代码进行的修改。torchrun方式的代码修改与shell脚本方式完全相同。
集合通信存在如下约束:
数据并行模式中不同device上执行的图相同。
针对Atlas 训练系列产品:allreduce和reduce_scatter仅支持int8、int32、float16和float32数据类型。
针对Atlas A2 训练系列产品:allreduce和reduce_scatter仅支持int8, int32, float16, float32和bfp16数据类型。
分布式训练场景下,HCCL会使用Host服务器的部分端口进行集群信息收集,需要操作系统预留该部分端口。默认情况下,HCCL使用60000-60015端口,若通过环境变量HCCL_IF_BASE_PORT指定了Host网卡起始端口,则需要预留以该端口起始的16个端口。
操作系统端口号预留示例:
sysctl -w net.ipv4.ip_local_reserved_ports=60000-60015
若用户准备进行2卡训练,可将8卡训练脚本进行改写,改为2卡训练脚本。可参见以下修改方法:
若8卡脚本的batchsize是单卡脚本的batchsize的8倍,则将8卡训练时的batch size和learning rate同时除以4,作为2卡训练时的batch size和learning rate。
如果使用for循环启动训练入口脚本,则将for循环的次数改为2次。
world size或者rank size修改为2,并确保训练脚本中dist.init_process_group()中world_size参数为2。
如果有指定device list参数,且取值范围为0-7,则将其改为0-1。
构建简单模型
我们先构建一个简单的神经网络。
# 导入依赖和库
import torch
from torch import nn
import torch_npu
import torch.distributed as dist
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor
import time
import torch.multiprocessing as mp
import os
torch.manual_seed(0)
# 下载训练数据
training_data = datasets.FashionMNIST(
root="./data",
train=True,
download=True,
transform=ToTensor(),
)
# 下载测试数据
test_data = datasets.FashionMNIST(
root="./data",
train=False,
download=True,
transform=ToTensor(),
)
# 构建模型
class NeuralNetwork(nn.Module):
def __init__(self):
super().__init__()
self.flatten = nn.Flatten()
self.linear_relu_stack = nn.Sequential(
nn.Linear(28*28, 512),
nn.ReLU(),
nn.Linear(512, 512),
nn.ReLU(),
nn.Linear(512, 10)
)
def forward(self, x):
x = self.flatten(x)
logits = self.linear_relu_stack(x)
return logits
def test(dataloader, model, loss_fn):
size = len(dataloader.dataset)
num_batches = len(dataloader)
model.eval()
test_loss, correct = 0, 0
with torch.no_grad():
for X, y in dataloader:
X, y = X.to(device), y.to(device)
pred = model(X)
test_loss += loss_fn(pred, y).item()
correct += (pred.argmax(1) == y).type(torch.float).sum().item()
test_loss /= num_batches
correct /= size
print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")
获取超参数
在主函数main中获取训练所需的超参数。
shell脚本/torchrun方式
def main(world_size: int, batch_size = 64, total_epochs = 5,):  # 用户可自行设置
ngpus_per_node = world_size
main_worker(args.gpu, ngpus_per_node, args)
mp.spawn方式
def main(world_size: int, batch_size = 64, total_epochs = 5,):  # 用户可自行设置
ngpus_per_node = world_size
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
# mp.spawn方式启动
Python方式
def main(world_size: int, batch_size, args):  # 使用Python拉起命令中设置的超参数
ngpus_per_node = world_size
args.gpu = args.local_rank
# 任务拉起后,local_rank自动获得device号
main_worker(args.gpu, ngpus_per_node, args)
设置地址和端口号
由于昇腾AI处理器初始化进程组时init_method只支持env:// (即环境变量初始化方式),所以在初始化前需要配置MASTER_ADDR、MASTER_PORT等参数。用户需根据自己实际情况配置。
shell脚本方式、mp.spawn拉起方式和torchrun方式的配置代码相同,如下所示:
def ddp_setup(rank, world_size):
"""
Args:
rank: Unique identifier of each process
world_size: Total number of processes
"""
os.environ["MASTER_ADDR"] =
"localhost"
# 用户需根据自己实际情况设置
os.environ["MASTER_PORT"] =
"***" # 用户需根据自己实际情况设置
dist.init_process_group(backend="hccl", rank=rank, world_size=world_size)
Python方式需要把配置参数的命令放到拉起训练中。脚本中代码如下所示:
def ddp_setup(rank, world_size):
"""
Args:
rank: Unique identifier of each process
world_size: Total number of processes
"""
dist.init_process_group(backend="hccl", rank=rank, world_size=world_size)
添加分布式逻辑
不同的拉起训练方式下,device号的获取方式不同:
shell脚本方式:在shell脚本中循环传入local_rank变量作为指定的device。
mp.spawn方式:mp.spawn多进程拉起main_worker后,第一个参数GPU自动获得device号(0 ~ ngpus_per_node - 1)。
Python方式:任务拉起后,local_rank自动获得device号。
用户需根据自己选择的方式对代码做不同的修改。
shell脚本/torchrun方式
def main_worker(gpu, ngpus_per_node, args):
start_epoch = 0
end_epoch = 5
args.gpu = int(os.environ['LOCAL_RANK'])
# 在shell脚本中循环传入local_rank变量作为指定的device
ddp_setup(args.gpu, args.world_size)
torch_npu.npu.set_device(args.gpu)
total_batch_size = args.batch_size
total_workers = ngpus_per_node
batch_size = int(total_batch_size / ngpus_per_node)
workers = int((total_workers + ngpus_per_node - 1) / ngpus_per_node)
model = NeuralNetwork()
device = torch.device("npu")
train_sampler = torch.utils.data.distributed.DistributedSampler(training_data)
test_sampler = torch.utils.data.distributed.DistributedSampler(test_data)
train_loader = torch.utils.data.DataLoader(
training_data, batch_size=batch_size, shuffle=(train_sampler is None),
num_workers=workers, pin_memory=False, sampler=train_sampler, drop_last=True)
val_loader = torch.utils.data.DataLoader(
test_data, batch_size=batch_size, shuffle=(test_sampler is None),
num_workers=workers, pin_memory=False, sampler=test_sampler, drop_last=True)
loc = 'npu:{}'.format(args.gpu)
model = model.to(loc)
criterion = nn.CrossEntropyLoss().to(loc)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
for epoch in range(start_epoch, end_epoch):
print("curr epoch: ", epoch)
train_sampler.set_epoch(epoch)
train(train_loader, model, criterion, optimizer, epoch, args.gpu)
def train(train_loader, model, criterion, optimizer, epoch, gpu):
size = len(train_loader.dataset)
model.train()
end = time.time()
for i, (images, target) in enumerate(train_loader):
# measure data loading time
loc = 'npu:{}'.format(gpu)
target = target.to(torch.int32)
images, target = images.to(loc, non_blocking=False), target.to(loc, non_blocking=False)
# compute output
output = model(images)
loss = criterion(output, target)
# compute gradient and do SGD step
optimizer.zero_grad()
loss.backward()
optimizer.step()
end = time.time()
if i % 100 == 0:
loss, current = loss.item(), i * len(target)
print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]")
mp.spawn方式
不需要专门设置args.gpu,将shell脚本方式中main_worker里的args.gpu均替换为gpu。
def main_worker(gpu, ngpus_per_node, args):
start_epoch = 0
end_epoch = 5
ddp_setup(gpu, args.world_size)
torch_npu.npu.set_device(gpu)
total_batch_size = args.batch_size
total_workers = ngpus_per_node
batch_size = int(total_batch_size / ngpus_per_node)
workers = int((total_workers + ngpus_per_node - 1) / ngpus_per_node)
model = NeuralNetwork()
device = torch.device("npu")
train_sampler = torch.utils.data.distributed.DistributedSampler(training_data)
test_sampler = torch.utils.data.distributed.DistributedSampler(test_data)
train_loader = torch.utils.data.DataLoader(
training_data, batch_size=batch_size, shuffle=(train_sampler is None),
num_workers=workers, pin_memory=False, sampler=train_sampler, drop_last=True)
val_loader = torch.utils.data.DataLoader(
test_data, batch_size=batch_size, shuffle=(test_sampler is None),
num_workers=workers, pin_memory=False, sampler=test_sampler, drop_last=True)
loc = 'npu:{}'.format(gpu)
model = model.to(loc)
criterion = nn.CrossEntropyLoss().to(loc)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[gpu])
for epoch in range(start_epoch, end_epoch):
print("curr epoch: ", epoch)
train_sampler.set_epoch(epoch)
train(train_loader, model, criterion, optimizer, epoch, gpu)
...... # train函数代码同shell脚本方式
Python方式
def main_worker(gpu, ngpus_per_node, args):
start_epoch = 0
end_epoch = 5
args.gpu = args.local_rank
# 任务拉起后,local_rank自动获得device号
ddp_setup(args.gpu, args.world_size)
...... # 其余代码同shell脚本方式
设置参数
在模型脚本中,根据拉起方式不同,设置不同的参数。
shell脚本/torchrun方式
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description='simple distributed training job')
parser.add_argument('--batch_size', default=512, type=int, help='Input batch size on each device (default: 32)')
parser.add_argument('--gpu', default=None, type=int,
help='GPU id to use.')
args = parser.parse_args()
world_size = torch.npu.device_count()
args.world_size = world_size
main(args.world_size, args.batch_size)
mp.spawn方式
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description='simple distributed training job')
parser.add_argument('--batch_size', default=512, type=int, help='Input batch size on each device (default: 32)')
args = parser.parse_args()
world_size = torch.npu.device_count()
args.world_size = world_size
main(args.world_size, args.batch_size)
Python方式
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description='simple distributed training job')
parser.add_argument('--batch_size', default=512, type=int, help='Input batch size on each device (default: 32)')
parser.add_argument('--gpu', default=None, type=int,
help='GPU id to use.')
parser.add_argument("--local_rank", default=-1, type=int) # local_rank用于自动获取device号。使用mp.spawn方式与shell方式启动时需删除此项
args = parser.parse_args()
world_size = torch.npu.device_count()
args.world_size = world_size
main(args.world_size, args.batch_size, args)  # 需将Python拉起命令中设置的参数传入main函数
拉起训练
以下拉起训练的命令为示例,用户可根据实际情况自行更改。
shell脚本方式
export HCCL_WHITELIST_DISABLE=1
RANK_ID_START=0
WORLD_SIZE=8
for((RANK_ID=$RANK_ID_START;RANK_ID<$((WORLD_SIZE+RANK_ID_START));RANK_ID++));
do
echo "Device ID: $RANK_ID"
export LOCAL_RANK=$RANK_ID
python3 ddp_test_shell.py &
done
wait
mp.spawn方式
export HCCL_WHITELIST_DISABLE=1
python3 ddp_test_spwn.py
Python方式
# master_addr和master_port参数需用户根据实际情况设置
export HCCL_WHITELIST_DISABLE=1
python3 -m torch.distributed.launch --nproc_per_node 8 --master_addr localhost --master_port *** ddp_test.py
torchrun方式(PyTorch 1.11.0及以上版本支持)
export HCCL_WHITELIST_DISABLE=1
torchrun --standalone --nnodes=1 --nproc_per_node=8 ddp_test_shell.py
当屏幕打印类似下图中的Loss数值时,说明拉起训练成功。
父主题:
多卡分布式训练 |
| Markdown | # 拉起多卡分布式训练
在单机和多机场景下,有4种方式可拉起分布式训练,分别为shell脚本方式(推荐)、mp.spawn方式、Python方式、torchrun方式。其中torchrun方式仅在PyTorch 1.11.0及以上版本支持使用。以下内容以一个简单模型脚本为样例,展示前3种拉起方式分别需要对脚本代码进行的修改。torchrun方式的代码修改与shell脚本方式完全相同。

1. 集合通信存在如下约束:
- 数据并行模式中不同device上执行的图相同。
- 针对Atlas 训练系列产品:allreduce和reduce\_scatter仅支持int8、int32、float16和float32数据类型。
- 针对Atlas A2 训练系列产品:allreduce和reduce\_scatter仅支持int8, int32, float16, float32和bfp16数据类型。
- 分布式训练场景下,HCCL会使用Host服务器的部分端口进行集群信息收集,需要操作系统预留该部分端口。默认情况下,HCCL使用60000-60015端口,若通过环境变量HCCL\_IF\_BASE\_PORT指定了Host网卡起始端口,则需要预留以该端口起始的16个端口。
操作系统端口号预留示例:
```
sysctl -w net.ipv4.ip_local_reserved_ports=60000-60015
```
2. 若用户准备进行2卡训练,可将8卡训练脚本进行改写,改为2卡训练脚本。可参见以下修改方法:
1. 若8卡脚本的batchsize是单卡脚本的batchsize的8倍,则将8卡训练时的batch size和learning rate同时除以4,作为2卡训练时的batch size和learning rate。
2. 如果使用for循环启动训练入口脚本,则将for循环的次数改为2次。
3. world size或者rank size修改为2,并确保训练脚本中dist.init\_process\_group()中world\_size参数为2。
4. 如果有指定device list参数,且取值范围为0-7,则将其改为0-1。
#### 构建简单模型
我们先构建一个简单的神经网络。
```
# 导入依赖和库
import torch
from torch import nn
import torch_npu
import torch.distributed as dist
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor
import time
import torch.multiprocessing as mp
import os
torch.manual_seed(0)
# 下载训练数据
training_data = datasets.FashionMNIST(
root="./data",
train=True,
download=True,
transform=ToTensor(),
)
# 下载测试数据
test_data = datasets.FashionMNIST(
root="./data",
train=False,
download=True,
transform=ToTensor(),
)
# 构建模型
class NeuralNetwork(nn.Module):
def __init__(self):
super().__init__()
self.flatten = nn.Flatten()
self.linear_relu_stack = nn.Sequential(
nn.Linear(28*28, 512),
nn.ReLU(),
nn.Linear(512, 512),
nn.ReLU(),
nn.Linear(512, 10)
)
def forward(self, x):
x = self.flatten(x)
logits = self.linear_relu_stack(x)
return logits
def test(dataloader, model, loss_fn):
size = len(dataloader.dataset)
num_batches = len(dataloader)
model.eval()
test_loss, correct = 0, 0
with torch.no_grad():
for X, y in dataloader:
X, y = X.to(device), y.to(device)
pred = model(X)
test_loss += loss_fn(pred, y).item()
correct += (pred.argmax(1) == y).type(torch.float).sum().item()
test_loss /= num_batches
correct /= size
print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")
```
#### 获取超参数
在主函数main中获取训练所需的超参数。
- shell脚本/torchrun方式
```
def main(world_size: int, batch_size = 64, total_epochs = 5,): # 用户可自行设置
ngpus_per_node = world_size
main_worker(args.gpu, ngpus_per_node, args)
```
- mp.spawn方式
```
def main(world_size: int, batch_size = 64, total_epochs = 5,): # 用户可自行设置
ngpus_per_node = world_size
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args)) # mp.spawn方式启动
```
- Python方式
```
def main(world_size: int, batch_size, args): # 使用Python拉起命令中设置的超参数
ngpus_per_node = world_size
args.gpu = args.local_rank # 任务拉起后,local_rank自动获得device号
main_worker(args.gpu, ngpus_per_node, args)
```
#### 设置地址和端口号
由于昇腾AI处理器初始化进程组时init\_method只支持env:// (即环境变量初始化方式),所以在初始化前需要配置MASTER\_ADDR、MASTER\_PORT等参数。用户需根据自己实际情况配置。
- shell脚本方式、mp.spawn拉起方式和torchrun方式的配置代码相同,如下所示:
```
def ddp_setup(rank, world_size):
"""
Args:
rank: Unique identifier of each process
world_size: Total number of processes
"""
os.environ["MASTER_ADDR"] = "localhost" # 用户需根据自己实际情况设置
os.environ["MASTER_PORT"] = "***" # 用户需根据自己实际情况设置
dist.init_process_group(backend="hccl", rank=rank, world_size=world_size)
```
- Python方式需要把配置参数的命令放到拉起训练中。脚本中代码如下所示:
```
def ddp_setup(rank, world_size):
"""
Args:
rank: Unique identifier of each process
world_size: Total number of processes
"""
dist.init_process_group(backend="hccl", rank=rank, world_size=world_size)
```
#### 添加分布式逻辑
不同的拉起训练方式下,device号的获取方式不同:
- shell脚本方式:在shell脚本中循环传入local\_rank变量作为指定的device。
- mp.spawn方式:mp.spawn多进程拉起main\_worker后,第一个参数GPU自动获得device号(0 ~ ngpus\_per\_node - 1)。
- Python方式:任务拉起后,local\_rank自动获得device号。
用户需根据自己选择的方式对代码做不同的修改。
- shell脚本/torchrun方式
```
def main_worker(gpu, ngpus_per_node, args):
start_epoch = 0
end_epoch = 5
args.gpu = int(os.environ['LOCAL_RANK']) # 在shell脚本中循环传入local_rank变量作为指定的device
ddp_setup(args.gpu, args.world_size)
torch_npu.npu.set_device(args.gpu)
total_batch_size = args.batch_size
total_workers = ngpus_per_node
batch_size = int(total_batch_size / ngpus_per_node)
workers = int((total_workers + ngpus_per_node - 1) / ngpus_per_node)
model = NeuralNetwork()
device = torch.device("npu")
train_sampler = torch.utils.data.distributed.DistributedSampler(training_data)
test_sampler = torch.utils.data.distributed.DistributedSampler(test_data)
train_loader = torch.utils.data.DataLoader(
training_data, batch_size=batch_size, shuffle=(train_sampler is None),
num_workers=workers, pin_memory=False, sampler=train_sampler, drop_last=True)
val_loader = torch.utils.data.DataLoader(
test_data, batch_size=batch_size, shuffle=(test_sampler is None),
num_workers=workers, pin_memory=False, sampler=test_sampler, drop_last=True)
loc = 'npu:{}'.format(args.gpu)
model = model.to(loc)
criterion = nn.CrossEntropyLoss().to(loc)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
for epoch in range(start_epoch, end_epoch):
print("curr epoch: ", epoch)
train_sampler.set_epoch(epoch)
train(train_loader, model, criterion, optimizer, epoch, args.gpu)
def train(train_loader, model, criterion, optimizer, epoch, gpu):
size = len(train_loader.dataset)
model.train()
end = time.time()
for i, (images, target) in enumerate(train_loader):
# measure data loading time
loc = 'npu:{}'.format(gpu)
target = target.to(torch.int32)
images, target = images.to(loc, non_blocking=False), target.to(loc, non_blocking=False)
# compute output
output = model(images)
loss = criterion(output, target)
# compute gradient and do SGD step
optimizer.zero_grad()
loss.backward()
optimizer.step()
end = time.time()
if i % 100 == 0:
loss, current = loss.item(), i * len(target)
print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]")
```
- mp.spawn方式
不需要专门设置args.gpu,将shell脚本方式中main\_worker里的args.gpu均替换为gpu。
```
def main_worker(gpu, ngpus_per_node, args):
start_epoch = 0
end_epoch = 5
ddp_setup(gpu, args.world_size)
torch_npu.npu.set_device(gpu)
total_batch_size = args.batch_size
total_workers = ngpus_per_node
batch_size = int(total_batch_size / ngpus_per_node)
workers = int((total_workers + ngpus_per_node - 1) / ngpus_per_node)
model = NeuralNetwork()
device = torch.device("npu")
train_sampler = torch.utils.data.distributed.DistributedSampler(training_data)
test_sampler = torch.utils.data.distributed.DistributedSampler(test_data)
train_loader = torch.utils.data.DataLoader(
training_data, batch_size=batch_size, shuffle=(train_sampler is None),
num_workers=workers, pin_memory=False, sampler=train_sampler, drop_last=True)
val_loader = torch.utils.data.DataLoader(
test_data, batch_size=batch_size, shuffle=(test_sampler is None),
num_workers=workers, pin_memory=False, sampler=test_sampler, drop_last=True)
loc = 'npu:{}'.format(gpu)
model = model.to(loc)
criterion = nn.CrossEntropyLoss().to(loc)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[gpu])
for epoch in range(start_epoch, end_epoch):
print("curr epoch: ", epoch)
train_sampler.set_epoch(epoch)
train(train_loader, model, criterion, optimizer, epoch, gpu)
...... # train函数代码同shell脚本方式
```
- Python方式
```
def main_worker(gpu, ngpus_per_node, args):
start_epoch = 0
end_epoch = 5
args.gpu = args.local_rank # 任务拉起后,local_rank自动获得device号
ddp_setup(args.gpu, args.world_size)
...... # 其余代码同shell脚本方式
```
#### 设置参数
在模型脚本中,根据拉起方式不同,设置不同的参数。
- shell脚本/torchrun方式
```
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description='simple distributed training job')
parser.add_argument('--batch_size', default=512, type=int, help='Input batch size on each device (default: 32)')
parser.add_argument('--gpu', default=None, type=int,
help='GPU id to use.')
args = parser.parse_args()
world_size = torch.npu.device_count()
args.world_size = world_size
main(args.world_size, args.batch_size)
```
- mp.spawn方式
```
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description='simple distributed training job')
parser.add_argument('--batch_size', default=512, type=int, help='Input batch size on each device (default: 32)')
args = parser.parse_args()
world_size = torch.npu.device_count()
args.world_size = world_size
main(args.world_size, args.batch_size)
```
- Python方式
```
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description='simple distributed training job')
parser.add_argument('--batch_size', default=512, type=int, help='Input batch size on each device (default: 32)')
parser.add_argument('--gpu', default=None, type=int,
help='GPU id to use.')
parser.add_argument("--local_rank", default=-1, type=int) # local_rank用于自动获取device号。使用mp.spawn方式与shell方式启动时需删除此项
args = parser.parse_args()
world_size = torch.npu.device_count()
args.world_size = world_size
main(args.world_size, args.batch_size, args) # 需将Python拉起命令中设置的参数传入main函数
```
#### 拉起训练
以下拉起训练的命令为示例,用户可根据实际情况自行更改。
- shell脚本方式
```
export HCCL_WHITELIST_DISABLE=1
RANK_ID_START=0
WORLD_SIZE=8
for((RANK_ID=$RANK_ID_START;RANK_ID<$((WORLD_SIZE+RANK_ID_START));RANK_ID++));
do
echo "Device ID: $RANK_ID"
export LOCAL_RANK=$RANK_ID
python3 ddp_test_shell.py &
done
wait
```
- mp.spawn方式
```
export HCCL_WHITELIST_DISABLE=1
python3 ddp_test_spwn.py
```
- Python方式
```
# master_addr和master_port参数需用户根据实际情况设置
export HCCL_WHITELIST_DISABLE=1
python3 -m torch.distributed.launch --nproc_per_node 8 --master_addr localhost --master_port *** ddp_test.py
```
- torchrun方式(PyTorch 1.11.0及以上版本支持)
```
export HCCL_WHITELIST_DISABLE=1
torchrun --standalone --nnodes=1 --nproc_per_node=8 ddp_test_shell.py
```
当屏幕打印类似下图中的Loss数值时,说明拉起训练成功。

**父主题:** [多卡分布式训练](https://www.hiascend.com/document/detail/zh/canncommercial/700/modeldevpt/ptmigr/AImpug_000025.html) |
| Readable Markdown | ## 拉起多卡分布式训练
在单机和多机场景下,有4种方式可拉起分布式训练,分别为shell脚本方式(推荐)、mp.spawn方式、Python方式、torchrun方式。其中torchrun方式仅在PyTorch 1.11.0及以上版本支持使用。以下内容以一个简单模型脚本为样例,展示前3种拉起方式分别需要对脚本代码进行的修改。torchrun方式的代码修改与shell脚本方式完全相同。

1. 集合通信存在如下约束:
- 数据并行模式中不同device上执行的图相同。
- 针对Atlas 训练系列产品:allreduce和reduce\_scatter仅支持int8、int32、float16和float32数据类型。
- 针对Atlas A2 训练系列产品:allreduce和reduce\_scatter仅支持int8, int32, float16, float32和bfp16数据类型。
- 分布式训练场景下,HCCL会使用Host服务器的部分端口进行集群信息收集,需要操作系统预留该部分端口。默认情况下,HCCL使用60000-60015端口,若通过环境变量HCCL\_IF\_BASE\_PORT指定了Host网卡起始端口,则需要预留以该端口起始的16个端口。
操作系统端口号预留示例:
```
sysctl -w net.ipv4.ip_local_reserved_ports=60000-60015
```
2. 若用户准备进行2卡训练,可将8卡训练脚本进行改写,改为2卡训练脚本。可参见以下修改方法:
1. 若8卡脚本的batchsize是单卡脚本的batchsize的8倍,则将8卡训练时的batch size和learning rate同时除以4,作为2卡训练时的batch size和learning rate。
2. 如果使用for循环启动训练入口脚本,则将for循环的次数改为2次。
3. world size或者rank size修改为2,并确保训练脚本中dist.init\_process\_group()中world\_size参数为2。
4. 如果有指定device list参数,且取值范围为0-7,则将其改为0-1(修改示例参见此列表后的示意代码)。
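As a concrete illustration of the four adaptation steps above, the sketch below marks the lines that typically change when the 8-card example on this page is scaled down to 2 cards. It is only a sketch under stated assumptions: the 8-card values (batch_size=512, lr=1e-3) are taken from the example code further down, and the variable names are illustrative rather than part of the original scripts.
```
# Illustrative sketch only: assumes the 8-card defaults used below
# (batch_size=512, lr=1e-3); variable names are placeholders.

WORLD_SIZE = 2                  # step 3: world size / rank size changed from 8 to 2

total_batch_size = 512 // 4     # step 1: 8-card batch size divided by 4 -> 128
learning_rate = 1e-3 / 4        # step 1: 8-card learning rate divided by 4

device_list = [0, 1]            # step 4: device list narrowed from 0-7 to 0-1

# step 3 (continued): init_process_group must also see the new world size, e.g.
#   dist.init_process_group(backend="hccl", rank=rank, world_size=WORLD_SIZE)

# step 2: a shell loop that starts one process per device now iterates only twice
```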
#### 构建简单模型
我们先构建一个简单的神经网络。
```
# 导入依赖和库
import torch
from torch import nn
import torch_npu
import torch.distributed as dist
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor
import time
import torch.multiprocessing as mp
import os
torch.manual_seed(0)
# 下载训练数据
training_data = datasets.FashionMNIST(
root="./data",
train=True,
download=True,
transform=ToTensor(),
)
# 下载测试数据
test_data = datasets.FashionMNIST(
root="./data",
train=False,
download=True,
transform=ToTensor(),
)
# 构建模型
class NeuralNetwork(nn.Module):
def __init__(self):
super().__init__()
self.flatten = nn.Flatten()
self.linear_relu_stack = nn.Sequential(
nn.Linear(28*28, 512),
nn.ReLU(),
nn.Linear(512, 512),
nn.ReLU(),
nn.Linear(512, 10)
)
def forward(self, x):
x = self.flatten(x)
logits = self.linear_relu_stack(x)
return logits
def test(dataloader, model, loss_fn):
size = len(dataloader.dataset)
num_batches = len(dataloader)
model.eval()
test_loss, correct = 0, 0
with torch.no_grad():
for X, y in dataloader:
X, y = X.to(device), y.to(device)
pred = model(X)
test_loss += loss_fn(pred, y).item()
correct += (pred.argmax(1) == y).type(torch.float).sum().item()
test_loss /= num_batches
correct /= size
print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")
```
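Before any distributed logic is added, the network above can be sanity-checked with a single forward pass. The snippet below is not part of the original tutorial; it is a minimal check that assumes only the NeuralNetwork class and the torch import from the block above, and it runs on the CPU.
```
# Minimal sanity check (sketch): one forward pass through NeuralNetwork on CPU.
model = NeuralNetwork()
dummy = torch.rand(4, 1, 28, 28)   # FashionMNIST images are 1x28x28
logits = model(dummy)              # nn.Flatten turns each image into a 784-vector
print(logits.shape)                # expected: torch.Size([4, 10])
```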
#### 获取超参数
在主函数main中获取训练所需的超参数。
- shell脚本/torchrun方式
```
def main(world_size: int, batch_size = 64, total_epochs = 5,): # 用户可自行设置
ngpus_per_node = world_size
main_worker(args.gpu, ngpus_per_node, args)
```
- mp.spawn方式
```
def main(world_size: int, batch_size = 64, total_epochs = 5,): # 用户可自行设置
ngpus_per_node = world_size
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args)) # mp.spawn方式启动
```
- Python方式
```
def main(world_size: int, batch_size, args): # 使用Python拉起命令中设置的超参数
ngpus_per_node = world_size
args.gpu = args.local_rank # 任务拉起后,local_rank自动获得device号
main_worker(args.gpu, ngpus_per_node, args)
```
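A note on the mp.spawn variant above: torch.multiprocessing.spawn(fn, args=..., nprocs=N) starts N processes and calls fn(i, *args) in each of them, with i running from 0 to N-1. That is why main_worker can take the device index as its first parameter even though it never appears in the args tuple. The sketch below only illustrates this calling convention; the worker body is a placeholder, not code from the original page.
```
# Sketch of the mp.spawn calling convention used above (placeholder worker).
import torch.multiprocessing as mp

def main_worker(gpu, ngpus_per_node, args):
    # mp.spawn injects gpu = 0 .. ngpus_per_node-1 as the first argument
    print(f"worker {gpu} of {ngpus_per_node}, extra args: {args}")

if __name__ == "__main__":
    mp.spawn(main_worker, nprocs=2, args=(2, {"batch_size": 64}))
```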
#### 设置地址和端口号
由于昇腾AI处理器初始化进程组时init\_method只支持env:// (即环境变量初始化方式),所以在初始化前需要配置MASTER\_ADDR、MASTER\_PORT等参数。用户需根据自己实际情况配置。
- shell脚本方式、mp.spawn拉起方式和torchrun方式的配置代码相同,如下所示:
```
def ddp_setup(rank, world_size):
"""
Args:
rank: Unique identifier of each process
world_size: Total number of processes
"""
os.environ["MASTER_ADDR"] = "localhost" # 用户需根据自己实际情况设置
os.environ["MASTER_PORT"] = "***" # 用户需根据自己实际情况设置
dist.init_process_group(backend="hccl", rank=rank, world_size=world_size)
```
- Python方式需要把配置参数的命令放到拉起训练中。脚本中代码如下所示:
```
def ddp_setup(rank, world_size):
"""
Args:
rank: Unique identifier of each process
world_size: Total number of processes
"""
dist.init_process_group(backend="hccl", rank=rank, world_size=world_size)
```
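A practical note on the two ddp_setup variants above: in the Python (torch.distributed.launch) way the launcher itself exports MASTER_ADDR and MASTER_PORT from its --master_addr/--master_port options, which is why that variant does not set them in code, whereas nothing sets them for the shell-loop and mp.spawn launches, so the first variant must. torchrun likewise exports these variables. The sketch below is one hedged way to cover all of these cases with a single helper by using os.environ.setdefault, so launcher-provided values take precedence; the fallback address and port are placeholders for a single-node run, not values from the original page.
```
# Sketch: one ddp_setup that works under torchrun / torch.distributed.launch
# (which already export MASTER_ADDR / MASTER_PORT) as well as the shell-loop
# and mp.spawn launches (which do not). Fallback values are placeholders.
import os
import torch.distributed as dist

def ddp_setup(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # assumption: single node
    os.environ.setdefault("MASTER_PORT", "29500")      # assumption: any free port
    dist.init_process_group(backend="hccl", rank=rank, world_size=world_size)
```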
#### 添加分布式逻辑
不同的拉起训练方式下,device号的获取方式不同:
- shell脚本方式:在shell脚本中循环传入local\_rank变量作为指定的device。
- mp.spawn方式:mp.spawn多进程拉起main\_worker后,第一个参数GPU自动获得device号(0 ~ ngpus\_per\_node - 1)。
- Python方式:任务拉起后,local\_rank自动获得device号。
用户需根据自己选择的方式对代码做不同的修改。
- shell脚本/torchrun方式
```
def main_worker(gpu, ngpus_per_node, args):
start_epoch = 0
end_epoch = 5
args.gpu = int(os.environ['LOCAL_RANK']) # 在shell脚本中循环传入local_rank变量作为指定的device
ddp_setup(args.gpu, args.world_size)
torch_npu.npu.set_device(args.gpu)
total_batch_size = args.batch_size
total_workers = ngpus_per_node
batch_size = int(total_batch_size / ngpus_per_node)
workers = int((total_workers + ngpus_per_node - 1) / ngpus_per_node)
model = NeuralNetwork()
device = torch.device("npu")
train_sampler = torch.utils.data.distributed.DistributedSampler(training_data)
test_sampler = torch.utils.data.distributed.DistributedSampler(test_data)
train_loader = torch.utils.data.DataLoader(
training_data, batch_size=batch_size, shuffle=(train_sampler is None),
num_workers=workers, pin_memory=False, sampler=train_sampler, drop_last=True)
val_loader = torch.utils.data.DataLoader(
test_data, batch_size=batch_size, shuffle=(test_sampler is None),
num_workers=workers, pin_memory=False, sampler=test_sampler, drop_last=True)
loc = 'npu:{}'.format(args.gpu)
model = model.to(loc)
criterion = nn.CrossEntropyLoss().to(loc)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
for epoch in range(start_epoch, end_epoch):
print("curr epoch: ", epoch)
train_sampler.set_epoch(epoch)
train(train_loader, model, criterion, optimizer, epoch, args.gpu)
def train(train_loader, model, criterion, optimizer, epoch, gpu):
size = len(train_loader.dataset)
model.train()
end = time.time()
for i, (images, target) in enumerate(train_loader):
# measure data loading time
loc = 'npu:{}'.format(gpu)
target = target.to(torch.int32)
images, target = images.to(loc, non_blocking=False), target.to(loc, non_blocking=False)
# compute output
output = model(images)
loss = criterion(output, target)
# compute gradient and do SGD step
optimizer.zero_grad()
loss.backward()
optimizer.step()
end = time.time()
if i % 100 == 0:
loss, current = loss.item(), i * len(target)
print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]")
```
- mp.spawn方式
不需要专门设置args.gpu,将shell脚本方式中main\_worker里的args.gpu均替换为gpu。
```
def main_worker(gpu, ngpus_per_node, args):
start_epoch = 0
end_epoch = 5
ddp_setup(gpu, args.world_size)
torch_npu.npu.set_device(gpu)
total_batch_size = args.batch_size
total_workers = ngpus_per_node
batch_size = int(total_batch_size / ngpus_per_node)
workers = int((total_workers + ngpus_per_node - 1) / ngpus_per_node)
model = NeuralNetwork()
device = torch.device("npu")
train_sampler = torch.utils.data.distributed.DistributedSampler(training_data)
test_sampler = torch.utils.data.distributed.DistributedSampler(test_data)
train_loader = torch.utils.data.DataLoader(
training_data, batch_size=batch_size, shuffle=(train_sampler is None),
num_workers=workers, pin_memory=False, sampler=train_sampler, drop_last=True)
val_loader = torch.utils.data.DataLoader(
test_data, batch_size=batch_size, shuffle=(test_sampler is None),
num_workers=workers, pin_memory=False, sampler=test_sampler, drop_last=True)
loc = 'npu:{}'.format(gpu)
model = model.to(loc)
criterion = nn.CrossEntropyLoss().to(loc)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[gpu])
for epoch in range(start_epoch, end_epoch):
print("curr epoch: ", epoch)
train_sampler.set_epoch(epoch)
train(train_loader, model, criterion, optimizer, epoch, gpu)
...... # train函数代码同shell脚本方式
```
- Python方式
```
def main_worker(gpu, ngpus_per_node, args):
start_epoch = 0
end_epoch = 5
args.gpu = args.local_rank # 任务拉起后,local_rank自动获得device号
ddp_setup(args.gpu, args.world_size)
...... # 其余代码同shell脚本方式
```
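The main_worker variants above build a val_loader, and the model script defines a test() helper, but neither example actually runs evaluation. For reference, with the page's defaults a total batch size of 512 split over 8 devices gives 64 samples per device. The sketch below is one hedged way to add per-epoch validation, following the shell-script variant's naming (args.gpu); it assumes the names defined above (val_loader, criterion, test, loc, train_sampler) and accounts for the fact that test() reads a module-level device variable, which therefore has to be published before the call.
```
# Sketch: per-epoch evaluation inside main_worker (uses the names defined above).
# The test() helper reads a module-level `device`, so point it at this worker's
# NPU first; .to() accepts a device string such as 'npu:0'.
globals()["device"] = loc

for epoch in range(start_epoch, end_epoch):
    train_sampler.set_epoch(epoch)
    train(train_loader, model, criterion, optimizer, epoch, args.gpu)
    if args.gpu == 0:   # report metrics from one device only to avoid duplicate prints
        test(val_loader, model, criterion)
```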
#### 设置参数
在模型脚本中,根据拉起方式不同,设置不同的参数。
- shell脚本/torchrun方式
```
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description='simple distributed training job')
parser.add_argument('--batch_size', default=512, type=int, help='Input batch size on each device (default: 32)')
parser.add_argument('--gpu', default=None, type=int,
help='GPU id to use.')
args = parser.parse_args()
world_size = torch.npu.device_count()
args.world_size = world_size
main(args.world_size, args.batch_size)
```
- mp.spawn方式
```
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description='simple distributed training job')
parser.add_argument('--batch_size', default=512, type=int, help='Input batch size on each device (default: 32)')
args = parser.parse_args()
world_size = torch.npu.device_count()
args.world_size = world_size
main(args.world_size, args.batch_size)
```
- Python方式
```
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description='simple distributed training job')
parser.add_argument('--batch_size', default=512, type=int, help='Input batch size on each device (default: 32)')
parser.add_argument('--gpu', default=None, type=int,
help='GPU id to use.')
parser.add_argument("--local_rank", default=-1, type=int) # local_rank用于自动获取device号。使用mp.spawn方式与shell方式启动时需删除此项
args = parser.parse_args()
world_size = torch.npu.device_count()
args.world_size = world_size
main(args.world_size, args.batch_size, args) # 需将Python拉起命令中设置的参数传入main函数
```
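A hedged note on the __main__ blocks above: they derive world_size from torch.npu.device_count(), i.e. every NPU on the node. Launchers such as torchrun and torch.distributed.launch also export WORLD_SIZE (and torchrun exports LOCAL_RANK) into each worker's environment, so a script meant to run under several launch methods can prefer the environment value when it is present. The snippet below sketches that pattern; it is not part of the original page and assumes the torch/torch_npu imports and the args object from the code above.
```
# Sketch: prefer the launcher-provided WORLD_SIZE when present, otherwise fall
# back to the number of NPUs visible on this node.
import os

world_size = int(os.environ.get("WORLD_SIZE", torch.npu.device_count()))
args.world_size = world_size
```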
#### 拉起训练
以下拉起训练的命令为示例,用户可根据实际情况自行更改。
- shell脚本方式
```
export HCCL_WHITELIST_DISABLE=1
RANK_ID_START=0
WORLD_SIZE=8
for((RANK_ID=$RANK_ID_START;RANK_ID<$((WORLD_SIZE+RANK_ID_START));RANK_ID++));
do
echo "Device ID: $RANK_ID"
export LOCAL_RANK=$RANK_ID
python3 ddp_test_shell.py &
done
wait
```
- mp.spawn方式
```
export HCCL_WHITELIST_DISABLE=1
python3 ddp_test_spwn.py
```
- Python方式
```
# master_addr和master_port参数需用户根据实际情况设置
export HCCL_WHITELIST_DISABLE=1
python3 -m torch.distributed.launch --nproc_per_node 8 --master_addr localhost --master_port *** ddp_test.py
```
- torchrun方式(PyTorch 1.11.0及以上版本支持)
```
export HCCL_WHITELIST_DISABLE=1
torchrun --standalone --nnodes=1 --nproc_per_node=8 ddp_test_shell.py
```
当屏幕打印类似下图中的Loss数值时,说明拉起训练成功。

**父主题:** [多卡分布式训练](https://www.hiascend.com/document/detail/zh/canncommercial/700/modeldevpt/ptmigr/AImpug_000025.html) |
| Shard | 171 (laksa) |
| Root Hash | 2628830536891727371 |
| Unparsed URL | com,hiascend!www,/document/detail/zh/canncommercial/700/modeldevpt/ptmigr/AImpug_000028.html s443 |