ℹ️ Skipped - page is already crawled
| Filter | Status | Condition | Details |
|---|---|---|---|
| HTTP status | PASS | download_http_code = 200 | HTTP 200 |
| Age cutoff | PASS | download_stamp > now() - 6 MONTH | 0.1 months ago |
| History drop | PASS | isNull(history_drop_reason) | No drop reason |
| Spam/ban | PASS | fh_dont_index != 1 AND ml_spam_score = 0 | ml_spam_score=0 |
| Canonical | PASS | meta_canonical IS NULL OR = '' OR = src_unparsed | Not set |

| Property | Value |
|---|---|
| URL | https://docs.pytorch.org/tutorials/beginner/dist_overview.html |
| Last Crawled | 2026-04-07 14:24:39 (4 days ago) |
| First Indexed | 2025-05-28 13:08:04 (10 months ago) |
| HTTP Status Code | 200 |
| Meta Title | PyTorch Distributed Overview — PyTorch Tutorials 2.11.0+cu130 documentation |
| Meta Description | null |
| Meta Canonical | null |

**Boilerpipe Text:**
Created On: Jul 28, 2020 | Last Updated: Jul 20, 2025 | Last Verified: Nov 05, 2024
Author: Will Constable, Wei Feng

Note: View and edit this tutorial in github.
This is the overview page for the `torch.distributed` package. The goal of this page is to categorize documents into different topics and briefly describe each of them. If this is your first time building distributed training applications using PyTorch, use this document to navigate to the technology that best serves your use case.
**Introduction**

The PyTorch Distributed library includes a collection of parallelism modules, a communications layer, and infrastructure for launching and debugging large training jobs.
**Parallelism APIs**

These parallelism modules offer high-level functionality and compose with existing models:

- Distributed Data-Parallel (DDP)
- Fully Sharded Data-Parallel Training (FSDP2)
- Tensor Parallel (TP)
- Pipeline Parallel (PP)
**Sharding primitives**

`DTensor` and `DeviceMesh` are primitives used to build parallelism in terms of sharded or replicated tensors on N-dimensional process groups.

- `DTensor` represents a tensor that is sharded and/or replicated, and automatically communicates to reshard tensors as needed by operations.
- `DeviceMesh` abstracts the accelerator device communicators into a multi-dimensional array, which manages the underlying `ProcessGroup` instances for collective communications in multi-dimensional parallelisms. Try out our Device Mesh Recipe to learn more.
**Communications APIs**

The PyTorch distributed communication layer (c10d) offers both collective communication APIs (e.g., `all_reduce` and `all_gather`) and P2P communication APIs (e.g., `send` and `isend`), which are used under the hood in all of the parallelism implementations. Writing Distributed Applications with PyTorch shows examples of using the c10d communication APIs.
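A sketch of one c10d collective, again reduced to a single process so it runs standalone; the backend and port are illustrative assumptions:

```python
import torch
import torch.distributed as dist

# Single-process group (gloo backend). With world_size > 1, every rank
# would contribute its own tensor and all ranks would end up holding
# the elementwise sum. The port number is arbitrary.
dist.init_process_group(
    "gloo", init_method="tcp://127.0.0.1:29501", rank=0, world_size=1
)

t = torch.tensor([1.0, 2.0, 3.0])
dist.all_reduce(t, op=dist.ReduceOp.SUM)  # in-place: t now holds the group sum
print(t)  # with one rank, the sum equals the local tensor

dist.destroy_process_group()
```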
**Launcher**

`torchrun` is a widely used launcher script that spawns processes on the local and remote machines for running distributed PyTorch programs.
**Applying Parallelism To Scale Your Model**

Data Parallelism is a widely adopted single-program multiple-data training paradigm: the model is replicated on every process, each model replica computes local gradients for a different set of input data samples, and gradients are averaged within the data-parallel communicator group before each optimizer step.

Model Parallelism techniques (or Sharded Data Parallelism) are required when a model doesn't fit in a single GPU's memory, and can be combined to form multi-dimensional (N-D) parallelism techniques.
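The gradient-averaging step can be illustrated with plain arithmetic; the gradient values below are hypothetical:

```python
# Hypothetical per-replica gradients for one parameter vector: each of
# two data-parallel replicas computed its gradient on a different
# micro-batch. Averaging them reproduces the gradient a single process
# would get on the combined batch (for a loss that is a mean over
# samples).
local_grads = [
    [1.0, 3.0],  # gradient from replica 0's micro-batch
    [5.0, 7.0],  # gradient from replica 1's micro-batch
]
n = len(local_grads)
averaged = [sum(parts) / n for parts in zip(*local_grads)]
print(averaged)  # → [3.0, 5.0]
```

In practice this average is computed by an `all_reduce` over the data-parallel group rather than in Python.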
When deciding which parallelism techniques to choose for your model, use these common guidelines:

- Use DistributedDataParallel (DDP) if your model fits in a single GPU but you want to easily scale up training using multiple GPUs. Use `torchrun` to launch multiple PyTorch processes if you are using more than one node. See also: Getting Started with Distributed Data Parallel.
- Use FullyShardedDataParallel (FSDP2) when your model cannot fit on one GPU. See also: Getting Started with FSDP2.
- Use Tensor Parallel (TP) and/or Pipeline Parallel (PP) if you reach scaling limitations with FSDP2. Try our Tensor Parallelism Tutorial. See also: the TorchTitan end-to-end example of 3D parallelism.
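The DDP starting point amounts to one wrapping step. A single-process CPU sketch, with the gloo backend and port number as illustrative assumptions; a real job would be launched with `torchrun` so each rank gets its own process and, typically, its own GPU:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process group so the sketch runs standalone.
dist.init_process_group(
    "gloo", init_method="tcp://127.0.0.1:29502", rank=0, world_size=1
)

model = nn.Linear(4, 2)
ddp_model = DDP(model)  # replicates the model; syncs gradients in backward

out = ddp_model(torch.randn(8, 4))
out.sum().backward()  # gradients are all-reduced across the group here
print(out.shape)

dist.destroy_process_group()
```

With multiple ranks, each process would feed its own micro-batch and DDP would average the gradients before the optimizer step, matching the data-parallel paradigm described above.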
**PyTorch Distributed Developers**

If you'd like to contribute to PyTorch Distributed, refer to our Developer Guide.
- [Large Scale Transformer model training with Tensor Parallel (TP)](https://docs.pytorch.org/tutorials/intermediate/TP_tutorial.html)
- [Introduction to Distributed Pipeline Parallelism](https://docs.pytorch.org/tutorials/intermediate/pipelining_tutorial.html)
- [Customize Process Group Backends Using Cpp Extensions](https://docs.pytorch.org/tutorials/intermediate/process_group_cpp_extension_tutorial.html)
- [Getting Started with Distributed RPC Framework](https://docs.pytorch.org/tutorials/intermediate/rpc_tutorial.html)
- [Implementing a Parameter Server Using Distributed RPC Framework](https://docs.pytorch.org/tutorials/intermediate/rpc_param_server_tutorial.html)
- [Implementing Batch RPC Processing Using Asynchronous Executions](https://docs.pytorch.org/tutorials/intermediate/rpc_async_execution.html)
- [Interactive Distributed Applications with Monarch](https://docs.pytorch.org/tutorials/intermediate/monarch_distributed_tutorial.html)
- [Combining Distributed DataParallel with Distributed RPC Framework](https://docs.pytorch.org/tutorials/advanced/rpc_ddp_tutorial.html)
- [Distributed Training with Uneven Inputs Using the Join Context Manager](https://docs.pytorch.org/tutorials/advanced/generic_join.html)
- [Distributed training at scale with PyTorch and Ray Train](https://docs.pytorch.org/tutorials/beginner/distributed_training_with_ray_tutorial.html)
- [Distributed](https://docs.pytorch.org/tutorials/distributed.html)
- PyTorch...
Rate this Page
★ ★ ★ ★ ★
beginner/dist\_overview
[ Run in Google Colab Colab]()
[ Download Notebook Notebook]()
[ View on GitHub GitHub]()
# PyTorch Distributed Overview
Created On: Jul 28, 2020 \| Last Updated: Jul 20, 2025 \| Last Verified: Nov 05, 2024
**Author**: [Will Constable](https://github.com/wconstab/), [Wei Feng](https://github.com/weifengpy)
Note
View and edit this tutorial on [GitHub](https://github.com/pytorch/tutorials/blob/main/beginner_source/dist_overview.rst).
This is the overview page for the `torch.distributed` package. It categorizes the documentation into topics and briefly describes each of them. If this is your first time building distributed training applications with PyTorch, use this page to navigate to the technology that best fits your use case.
## Introduction
The PyTorch Distributed library includes a set of parallelism modules, a communications layer, and infrastructure for launching and debugging large training jobs.
### Parallelism APIs
These parallelism modules offer high-level functionality and compose with existing models:
- [Distributed Data-Parallel (DDP)](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)
- [Fully Sharded Data-Parallel Training (FSDP2)](https://pytorch.org/docs/stable/distributed.fsdp.fully_shard.html)
- [Tensor Parallel (TP)](https://pytorch.org/docs/stable/distributed.tensor.parallel.html)
- [Pipeline Parallel (PP)](https://pytorch.org/docs/main/distributed.pipelining.html)
### Sharding primitives
`DTensor` and `DeviceMesh` are primitives used to build parallelism in terms of sharded or replicated tensors on N-dimensional process groups.
- [DTensor](https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/README.md) represents a tensor that is sharded and/or replicated, and communicates automatically to reshard tensors as needed by operations.
- [DeviceMesh](https://pytorch.org/docs/stable/distributed.html#devicemesh) abstracts the accelerator device communicators into a multi-dimensional array, which manages the underlying `ProcessGroup` instances for collective communications in multi-dimensional parallelisms. Try out our [Device Mesh Recipe](https://pytorch.org/tutorials/recipes/distributed_device_mesh.html) to learn more.
### Communications APIs
The [PyTorch distributed communication layer (C10D)](https://pytorch.org/docs/stable/distributed.html) offers both collective communication APIs (e.g., [all\_reduce](https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_reduce) and [all\_gather](https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_gather)) and P2P communication APIs (e.g., [send](https://pytorch.org/docs/stable/distributed.html#torch.distributed.send) and [isend](https://pytorch.org/docs/stable/distributed.html#torch.distributed.isend)), which are used under the hood in all of the parallelism implementations. [Writing Distributed Applications with PyTorch](https://docs.pytorch.org/tutorials/intermediate/dist_tuto.html) shows examples of using c10d communication APIs.
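As a minimal sketch of the collective APIs, the snippet below runs `all_reduce` as a single process (gloo backend, `world_size=1`, arbitrary port) so it executes anywhere; under `torchrun` each rank would contribute its own tensor and all ranks would receive the sum.

```python
# A c10d collective: all_reduce sums a tensor element-wise across all
# ranks, in place. Shown single-process for illustration.
import os

import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")  # arbitrary free port
dist.init_process_group("gloo", rank=0, world_size=1)

t = torch.ones(4)
dist.all_reduce(t, op=dist.ReduceOp.SUM)  # with world_size ranks, t becomes world_size * ones(4)

dist.destroy_process_group()
```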
### Launcher
[torchrun](https://pytorch.org/docs/stable/elastic/run.html) is a widely used launcher script that spawns processes on local and remote machines for running distributed PyTorch programs.
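A typical multi-node invocation looks like the following (here `train.py` is a hypothetical training script; run the same command on every node, changing only `--node_rank`):

```shell
# Launch 2 nodes x 8 processes each (16 ranks total).
# $MASTER_ADDR is the address of the rendezvous host; 29400 is an arbitrary port.
torchrun \
  --nnodes=2 \
  --nproc_per_node=8 \
  --node_rank=0 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=$MASTER_ADDR:29400 \
  train.py
```

torchrun sets `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT` in each spawned process's environment, so scripts can call `init_process_group()` without hard-coding them.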
## Applying Parallelism To Scale Your Model
Data Parallelism is a widely adopted single-program multiple-data training paradigm: the model is replicated on every process, each replica computes local gradients on a different set of input data samples, and gradients are averaged within the data-parallel communicator group before each optimizer step.
Model Parallelism techniques (or Sharded Data Parallelism) are required when a model doesn’t fit on a single GPU, and they can be combined to form multi-dimensional (N-D) parallelism.
When deciding which parallelism techniques to use for your model, follow these common guidelines:
1. Use [DistributedDataParallel (DDP)](https://pytorch.org/docs/stable/notes/ddp.html) if your model fits on a single GPU but you want to easily scale up training using multiple GPUs.
- Use [torchrun](https://pytorch.org/docs/stable/elastic/run.html) to launch multiple PyTorch processes if you are using more than one node.
- See also: [Getting Started with Distributed Data Parallel](https://docs.pytorch.org/tutorials/intermediate/ddp_tutorial.html)
2. Use [FullyShardedDataParallel (FSDP2)](https://pytorch.org/docs/stable/distributed.fsdp.fully_shard.html) when your model cannot fit on one GPU.
- See also: [Getting Started with FSDP2](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html)
3. Use [Tensor Parallel (TP)](https://pytorch.org/docs/stable/distributed.tensor.parallel.html) and/or [Pipeline Parallel (PP)](https://pytorch.org/docs/main/distributed.pipelining.html) if you reach scaling limitations with FSDP2.
- Try our [Tensor Parallelism Tutorial](https://pytorch.org/tutorials/intermediate/TP_tutorial.html)
- See also: [TorchTitan end-to-end example of 3D parallelism](https://github.com/pytorch/torchtitan)
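Guideline 1 above can be sketched in a few lines. This is a minimal single-process illustration (gloo backend on CPU, `world_size=1`, arbitrary port); under `torchrun`, each rank would instead read its rank and world size from the environment, and DDP would average gradients across replicas during `backward()`.

```python
# Minimal DDP sketch: wrap the model, train one step.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29502")  # arbitrary free port
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(nn.Linear(10, 1))  # gradients are all-reduced across ranks in backward()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()  # DDP averages gradients here (a no-op with world_size == 1)
opt.step()

dist.destroy_process_group()
```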
Note
Data-parallel training also works with [Automatic Mixed Precision (AMP)](https://pytorch.org/docs/stable/notes/amp_examples.html#working-with-multiple-gpus).
## PyTorch Distributed Developers
If you’d like to contribute to PyTorch Distributed, refer to our [Developer Guide](https://github.com/pytorch/pytorch/blob/master/torch/distributed/CONTRIBUTING.md).