🕷️ Crawler Inspector

URL Lookup

Direct Parameter Lookup

Raw Queries and Responses

1. Shard Calculation

Query:
Response:
Calculated Shard: 114 (from laksa157)
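The exact URL-to-shard scheme is not shown in the response above. A common approach is a stable hash of the URL modulo the shard count; a minimal sketch of that idea, where `NUM_SHARDS` and `shard_for_url` are hypothetical names and the MD5-based hash and shard count of 256 are assumptions, not the crawler's actual scheme:

```python
import hashlib

NUM_SHARDS = 256  # assumed shard count; the real value is not given in the report


def shard_for_url(url: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a URL to a shard id with a stable hash (illustrative only)."""
    # MD5 is used here only as a deterministic, well-distributed hash,
    # not for security; any stable hash would do.
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


print(shard_for_url("https://docs.pytorch.org/tutorials/beginner/dist_overview.html"))
```

Because the hash is stable, repeated lookups of the same URL always land on the same shard, which is what lets the inspector recompute the shard without consulting the crawl database.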

2. Crawled Status Check

Query:
Response:

3. Robots.txt Check

Query:
Response:
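The ROBOTS ALLOWED verdict shown below can be reproduced with Python's standard-library robots.txt parser. A minimal sketch, using a hypothetical robots.txt body since the actual query and response are not shown above:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; the real response is not shown in the report.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

url = "https://docs.pytorch.org/tutorials/beginner/dist_overview.html"
print(rp.can_fetch("*", url))  # True: /tutorials/ is not disallowed
```

`can_fetch` returns `True` for any path not matched by a `Disallow` rule for the given user agent, which corresponds to the "allowed" outcome recorded by the inspector.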

4. Spam/Ban Check

Query:
Response:

5. Seen Status Check

ℹ️ Skipped - page is already crawled

📄
INDEXABLE
CRAWLED
4 days ago
🤖
ROBOTS ALLOWED

Page Info Filters

| Filter | Status | Condition | Details |
| --- | --- | --- | --- |
| HTTP status | PASS | `download_http_code = 200` | HTTP 200 |
| Age cutoff | PASS | `download_stamp > now() - 6 MONTH` | 0.1 months ago |
| History drop | PASS | `isNull(history_drop_reason)` | No drop reason |
| Spam/ban | PASS | `fh_dont_index != 1 AND ml_spam_score = 0` | ml_spam_score=0 |
| Canonical | PASS | `meta_canonical IS NULL OR = '' OR = src_unparsed` | Not set |
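The five PASS verdicts can be re-derived from the raw page fields. A minimal sketch that re-evaluates the table's conditions in Python; the `passes_index_filters` helper and the dict-based record are illustrative (field names are taken from the conditions shown, but the real pipeline's representation is not):

```python
from datetime import datetime, timedelta


def passes_index_filters(page: dict, now: datetime) -> dict:
    """Evaluate each indexability filter from the report; illustrative only."""
    six_months = timedelta(days=182)  # rough stand-in for the 6 MONTH interval
    canonical = page.get("meta_canonical")
    return {
        "HTTP status": page["download_http_code"] == 200,
        "Age cutoff": page["download_stamp"] > now - six_months,
        "History drop": page.get("history_drop_reason") is None,
        "Spam/ban": page.get("fh_dont_index") != 1 and page.get("ml_spam_score") == 0,
        # Canonical passes when unset, empty, or self-referential.
        "Canonical": canonical in (None, "") or canonical == page["src_unparsed"],
    }


now = datetime(2026, 4, 11)
page = {
    "download_http_code": 200,
    "download_stamp": datetime(2026, 4, 7, 14, 24, 39),
    "history_drop_reason": None,
    "fh_dont_index": 0,
    "ml_spam_score": 0,
    "meta_canonical": None,
    "src_unparsed": "https://docs.pytorch.org/tutorials/beginner/dist_overview.html",
}
print(all(passes_index_filters(page, now).values()))  # True: page is indexable
```

A page fails indexability as soon as any one filter is false, so the per-filter dict makes it easy to see which condition was responsible.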

Page Details

| Property | Value |
| --- | --- |
| URL | https://docs.pytorch.org/tutorials/beginner/dist_overview.html |
| Last Crawled | 2026-04-07 14:24:39 (4 days ago) |
| First Indexed | 2025-05-28 13:08:04 (10 months ago) |
| HTTP Status Code | 200 |
| Meta Title | PyTorch Distributed Overview — PyTorch Tutorials 2.11.0+cu130 documentation |
| Meta Description | null |
| Meta Canonical | null |
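The relative ages shown ("4 days ago", "10 months ago") follow from the absolute timestamps. A quick check, assuming the report was generated around 2026-04-11 (the report time itself is not shown):

```python
from datetime import datetime

# Timestamps from the report; "now" is an assumed report-generation time.
now = datetime(2026, 4, 11, 15, 0)
last_crawled = datetime(2026, 4, 7, 14, 24, 39)
first_indexed = datetime(2025, 5, 28, 13, 8, 4)

print((now - last_crawled).days)                      # 4 (days ago)
print(round((now - first_indexed).days / 30.44))      # 10 (months ago)
```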
Boilerpipe Text
Rate this Page ★ ★ ★ ★ ★ Created On: Jul 28, 2020 | Last Updated: Jul 20, 2025 | Last Verified: Nov 05, 2024 Author : Will Constable , Wei Feng Note View and edit this tutorial in github . This is the overview page for the torch.distributed package. The goal of this page is to categorize documents into different topics and briefly describe each of them. If this is your first time building distributed training applications using PyTorch, it is recommended to use this document to navigate to the technology that can best serve your use case. Introduction # The PyTorch Distributed library includes a collective of parallelism modules, a communications layer, and infrastructure for launching and debugging large training jobs. Parallelism APIs # These Parallelism Modules offer high-level functionality and compose with existing models: Distributed Data-Parallel (DDP) Fully Sharded Data-Parallel Training (FSDP2) Tensor Parallel (TP) Pipeline Parallel (PP) Sharding primitives # DTensor and DeviceMesh are primitives used to build parallelism in terms of sharded or replicated tensors on N-dimensional process groups. DTensor represents a tensor that is sharded and/or replicated, and communicates automatically to reshard tensors as needed by operations. DeviceMesh abstracts the accelerator device communicators into a multi-dimensional array, which manages the underlying ProcessGroup instances for collective communications in multi-dimensional parallelisms. Try out our Device Mesh Recipe to learn more. Communications APIs # The PyTorch distributed communication layer (C10D) offers both collective communication APIs (e.g., all_reduce and all_gather ) and P2P communication APIs (e.g., send and isend ), which are used under the hood in all of the parallelism implementations. Writing Distributed Applications with PyTorch shows examples of using c10d communication APIs. 
Launcher # torchrun is a widely-used launcher script, which spawns processes on the local and remote machines for running distributed PyTorch programs. Applying Parallelism To Scale Your Model # Data Parallelism is a widely adopted single-program multiple-data training paradigm where the model is replicated on every process, every model replica computes local gradients for a different set of input data samples, gradients are averaged within the data-parallel communicator group before each optimizer step. Model Parallelism techniques (or Sharded Data Parallelism) are required when a model doesn’t fit in GPU, and can be combined together to form multi-dimensional (N-D) parallelism techniques. When deciding what parallelism techniques to choose for your model, use these common guidelines: Use DistributedDataParallel (DDP) , if your model fits in a single GPU but you want to easily scale up training using multiple GPUs. Use torchrun , to launch multiple pytorch processes if you are using more than one node. See also: Getting Started with Distributed Data Parallel Use FullyShardedDataParallel (FSDP2) when your model cannot fit on one GPU. See also: Getting Started with FSDP2 Use Tensor Parallel (TP) and/or Pipeline Parallel (PP) if you reach scaling limitations with FSDP2. Try our Tensor Parallelism Tutorial See also: TorchTitan end to end example of 3D parallelism PyTorch Distributed Developers # If you’d like to contribute to PyTorch Distributed, refer to our Developer Guide .
Markdown
# PyTorch Distributed Overview[\#](https://docs.pytorch.org/tutorials/beginner/dist_overview.html#pytorch-distributed-overview "Link to this heading") Created On: Jul 28, 2020 \| Last Updated: Jul 20, 2025 \| Last Verified: Nov 05, 2024 **Author**: [Will Constable](https://github.com/wconstab/), [Wei Feng](https://github.com/weifengpy) Note View and edit this tutorial in [github](https://github.com/pytorch/tutorials/blob/main/beginner_source/dist_overview.rst). This is the overview page for the `torch.distributed` package. The goal of this page is to categorize documents into different topics and briefly describe each of them. If this is your first time building distributed training applications using PyTorch, it is recommended to use this document to navigate to the technology that can best serve your use case. ## Introduction[\#](https://docs.pytorch.org/tutorials/beginner/dist_overview.html#introduction "Link to this heading") The PyTorch Distributed library includes a collective of parallelism modules, a communications layer, and infrastructure for launching and debugging large training jobs.
### Parallelism APIs

These parallelism modules offer high-level functionality and compose with existing models:

- [Distributed Data-Parallel (DDP)](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)
- [Fully Sharded Data-Parallel Training (FSDP2)](https://pytorch.org/docs/stable/distributed.fsdp.fully_shard.html)
- [Tensor Parallel (TP)](https://pytorch.org/docs/stable/distributed.tensor.parallel.html)
- [Pipeline Parallel (PP)](https://pytorch.org/docs/main/distributed.pipelining.html)

### Sharding primitives

`DTensor` and `DeviceMesh` are primitives used to build parallelism in terms of sharded or replicated tensors on N-dimensional process groups.

- [DTensor](https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/README.md) represents a tensor that is sharded and/or replicated, and automatically communicates to reshard tensors as needed by operations.
- [DeviceMesh](https://pytorch.org/docs/stable/distributed.html#devicemesh) abstracts the accelerator device communicators into a multi-dimensional array, and manages the underlying `ProcessGroup` instances for collective communications in multi-dimensional parallelism.

Try out our [Device Mesh Recipe](https://pytorch.org/tutorials/recipes/distributed_device_mesh.html) to learn more.
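To build intuition for these primitives, here is a plain-Python sketch (deliberately *not* the real `DTensor`/`DeviceMesh` API) of the two core ideas: a 2D device mesh is a grid of ranks, and placing a tensor on it means sharding row-chunks along one mesh dimension while replicating along the other. All function names here are illustrative.

```python
def build_mesh(rows, cols):
    """Arrange ranks 0..rows*cols-1 into a 2D grid, as a DeviceMesh would."""
    return [[r * cols + c for c in range(cols)] for r in range(rows)]

def shard_rows(data, num_shards):
    """Split a list of rows into contiguous chunks, one per shard."""
    size = (len(data) + num_shards - 1) // num_shards
    return [data[i * size:(i + 1) * size] for i in range(num_shards)]

def place(data, mesh):
    """Shard over mesh dim 0, replicate over mesh dim 1.

    Returns {rank: local_chunk}; ranks in the same mesh row hold
    identical replicas of that row's shard.
    """
    shards = shard_rows(data, len(mesh))
    return {rank: shards[i] for i, row in enumerate(mesh) for rank in row}

mesh = build_mesh(2, 2)                        # ranks [[0, 1], [2, 3]]
placement = place([[1], [2], [3], [4]], mesh)
# ranks 0 and 1 each hold [[1], [2]]; ranks 2 and 3 each hold [[3], [4]]
```

The real `DTensor` additionally tracks these placements as metadata and inserts the resharding communication automatically when an operation needs a different layout.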
### Communications APIs

The [PyTorch distributed communication layer (C10D)](https://pytorch.org/docs/stable/distributed.html) offers both collective communication APIs (e.g., [all\_reduce](https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_reduce) and [all\_gather](https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_gather)) and P2P communication APIs (e.g., [send](https://pytorch.org/docs/stable/distributed.html#torch.distributed.send) and [isend](https://pytorch.org/docs/stable/distributed.html#torch.distributed.isend)), which are used under the hood in all of the parallelism implementations. [Writing Distributed Applications with PyTorch](https://docs.pytorch.org/tutorials/intermediate/dist_tuto.html) shows examples of using C10D communication APIs.

### Launcher

[torchrun](https://pytorch.org/docs/stable/elastic/run.html) is a widely used launcher script that spawns processes on local and remote machines for running distributed PyTorch programs.

## Applying Parallelism To Scale Your Model

Data Parallelism is a widely adopted single-program multiple-data training paradigm: the model is replicated on every process, each replica computes local gradients for a different set of input data samples, and gradients are averaged within the data-parallel communicator group before each optimizer step.

Model Parallelism techniques (or Sharded Data Parallelism) are required when a model doesn't fit on a single GPU, and they can be combined to form multi-dimensional (N-D) parallelism.
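The gradient-averaging step described above is what a SUM `all_reduce` followed by a divide accomplishes. A plain-Python sketch of its semantics (not the real `torch.distributed.all_reduce`, which operates in place on each rank's tensor) can make this concrete:

```python
def all_reduce_mean(per_rank_grads):
    """Average gradients elementwise across ranks and hand every rank the
    same averaged result, mirroring a SUM all_reduce plus a divide by
    world size."""
    world_size = len(per_rank_grads)
    avg = [sum(vals) / world_size for vals in zip(*per_rank_grads)]
    # Every rank ends up with an identical copy of the averaged gradients.
    return [avg[:] for _ in range(world_size)]

# Two replicas saw different mini-batches, so their local gradients differ:
local_grads = [[1.0, 2.0], [3.0, 4.0]]
synced = all_reduce_mean(local_grads)
# every rank now holds [2.0, 3.0]
```

After this step all replicas apply the same update, keeping the model weights identical across the data-parallel group.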
When deciding which parallelism techniques to use for your model, follow these common guidelines:

1. Use [DistributedDataParallel (DDP)](https://pytorch.org/docs/stable/notes/ddp.html) if your model fits on a single GPU but you want to easily scale up training across multiple GPUs.
   - Use [torchrun](https://pytorch.org/docs/stable/elastic/run.html) to launch multiple PyTorch processes if you are using more than one node.
   - See also: [Getting Started with Distributed Data Parallel](https://docs.pytorch.org/tutorials/intermediate/ddp_tutorial.html)
2. Use [FullyShardedDataParallel (FSDP2)](https://pytorch.org/docs/stable/distributed.fsdp.fully_shard.html) when your model cannot fit on one GPU.
   - See also: [Getting Started with FSDP2](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html)
3. Use [Tensor Parallel (TP)](https://pytorch.org/docs/stable/distributed.tensor.parallel.html) and/or [Pipeline Parallel (PP)](https://pytorch.org/docs/main/distributed.pipelining.html) if you reach scaling limitations with FSDP2.
   - Try our [Tensor Parallelism Tutorial](https://pytorch.org/tutorials/intermediate/TP_tutorial.html)
   - See also: [TorchTitan end-to-end example of 3D parallelism](https://github.com/pytorch/torchtitan)

Note: Data-parallel training also works with [Automatic Mixed Precision (AMP)](https://pytorch.org/docs/stable/notes/amp_examples.html#working-with-multiple-gpus).

## PyTorch Distributed Developers

If you'd like to contribute to PyTorch Distributed, refer to our [Developer Guide](https://github.com/pytorch/pytorch/blob/master/torch/distributed/CONTRIBUTING.md).
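As a concrete illustration of the torchrun-based multi-node launch mentioned in guideline 1 above, a hypothetical two-node DDP launch might look like the following. The script name, host, and port are placeholders; run the same command on each node, pointing every node at the same rendezvous endpoint.

```shell
# Hypothetical launch: 2 nodes x 8 GPUs each (16 processes total).
# node0.example.com:29500 is a placeholder rendezvous endpoint; train.py
# is a placeholder for your DDP training script.
torchrun \
  --nnodes=2 \
  --nproc-per-node=8 \
  --rdzv-backend=c10d \
  --rdzv-endpoint=node0.example.com:29500 \
  train.py
```

torchrun sets environment variables such as `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` for each spawned process, which the training script reads when initializing the process group.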