13 1 data parallelism

Detailed Data Parallelism Examples for xtorch

This document expands the "Performance and Distributed and Parallel Training -> Data Parallelism" subcategory of examples for the xtorch library, a C++ deep learning framework that extends PyTorch’s LibTorch API with user-friendly abstractions. The goal is to provide a comprehensive set of beginner-to-intermediate examples that introduce users to data parallelism techniques for distributed and parallel training, with a focus on time series and graph models to align with the broader "Time Series and Graph" context. These examples showcase xtorch’s capabilities in scalability, performance, and C++ ecosystem integration, and are designed to be included in the xtorch-examples repository, helping users learn data parallelism in C++.

Background and Context

xtorch simplifies deep learning for C++ developers by offering high-level model classes (e.g., XTModule, ResNetExtended, XTCNN), a streamlined training loop via the Trainer module, enhanced data utilities (e.g., ImageFolderDataset, CSVDataset, DataLoader), extended optimizers (e.g., xtorch::optim), and model serialization tools. The original data parallelism example—training on multiple GPUs with data parallelism—provides a solid foundation. This expansion adds seven more examples to cover additional data parallelism strategies (e.g., multi-GPU, multi-node, hybrid parallelism), model types (e.g., LSTM, GCN, GraphSAGE), and training scenarios (e.g., real-time training, large-scale graph processing), ensuring a broad introduction to data parallelism with a focus on time series and graph applications.

The current time is 2:30 PM PDT on Monday, April 21, 2025, and all considerations are based on available information from the xtorch GitHub repository (xtorch GitHub Repository) without contradicting this timeframe.

Expanded Examples

The following table provides a detailed list of eight "Performance and Distributed and Parallel Training -> Data Parallelism" examples, including the original one and seven new ones. Each example is designed to be standalone, with a clear focus on a specific data parallelism concept or xtorch feature, making it accessible for users.

Category	Subcategory	Example Title	Description
Performance and Distributed and Parallel Training	Data Parallelism	Training on Multiple GPUs with Data Parallelism	Demonstrates data parallelism across multiple GPUs to train a convolutional neural network (CNN) on the CIFAR-10 dataset (images) using xtorch’s distributed utilities (e.g., `xtorch::distributed`). Splits data across GPUs, optimizes with SGD and cross-entropy loss, and evaluates with training speed (samples per second) and test accuracy.
		Multi-GPU Training for Time Series Forecasting	Implements data parallelism on multiple GPUs to train an LSTM for time series forecasting on the UCI Appliances Energy Prediction dataset using xtorch. Distributes time series data across GPUs, optimizes with Adam and Mean Squared Error (MSE) loss, and evaluates with training speed (epochs per second) and generalization performance (Root Mean Squared Error, RMSE).
		Distributed Data Parallelism for Graph Node Classification	Uses distributed data parallelism across multiple GPUs for a Graph Convolutional Network (GCN) on the Cora dataset (citation network) with xtorch and OpenMPI. Splits graph data across GPUs, optimizes with RMSprop and cross-entropy loss, and evaluates with training speed (batches per second) and classification accuracy.
		Multi-Node Data Parallelism for Time Series Anomaly Detection	Implements data parallelism across multiple nodes for an autoencoder on the PhysioNet ECG dataset (heart signals) using xtorch and OpenMPI. Distributes time series data across nodes, optimizes with Adagrad and MSE loss, and evaluates with training scalability (nodes vs. speed) and Area Under the ROC Curve (AUC-ROC).
		Hybrid Parallelism for Molecular Graph Property Prediction	Combines data and model parallelism to train a graph neural network for molecular property prediction (e.g., dipole moment) on the QM9 dataset (small molecules) using xtorch. Splits data and model layers across GPUs, optimizes with Adam and Mean Absolute Error (MAE) loss, and evaluates with training speed (samples per second) and prediction accuracy (MAE).
		Data Parallelism for Real-Time Time Series Classification	Implements data parallelism on multiple GPUs for real-time training of a CNN on a custom IoT sensor dataset (e.g., accelerometer data) using xtorch. Distributes time series data across GPUs, optimizes with Adam and cross-entropy loss, and evaluates with training latency (milliseconds per batch) and classification accuracy.
		Large-Scale Graph Data Parallelism for Node Embedding	Uses distributed data parallelism across multiple GPUs for a GraphSAGE model on the PPI dataset (protein interactions) with xtorch and OpenMPI. Splits large-scale graph data across GPUs, optimizes with Sparse Adam and unsupervised loss (e.g., graph reconstruction), and evaluates with training speed (epochs per second) and embedding quality (downstream classification accuracy).
		Data Parallelism with Visualization for Time Series Forecasting	Combines multi-GPU data parallelism with OpenCV to train an LSTM for time series forecasting on streaming IoT sensor data (e.g., temperature readings). Visualizes training speed and loss curves across GPUs, optimizes with Adam and MSE loss, and evaluates with RMSE and visualization quality (clear plots).

Rationale for Each Example

Training on Multiple GPUs with Data Parallelism: Introduces basic data parallelism, using a CNN on CIFAR-10 to teach multi-GPU training, ideal for beginners.
Multi-GPU Training for Time Series Forecasting: Demonstrates data parallelism for time series, using an LSTM on UCI data to teach scalable time series training, aligning with the time series focus.
Distributed Data Parallelism for Graph Node Classification: Introduces GPU-based graph parallelism, using a GCN on Cora to teach efficient graph training, aligning with the graph focus.
Multi-Node Data Parallelism for Time Series Anomaly Detection: Focuses on distributed node-level parallelism, using an autoencoder on ECG data to teach scalable anomaly detection, relevant for healthcare.
Hybrid Parallelism for Molecular Graph Property Prediction: Demonstrates combined data and model parallelism, using a graph neural network on QM9 to teach complex graph training, relevant for cheminformatics.
Data Parallelism for Real-Time Time Series Classification: Introduces real-time parallel training, using a CNN on IoT data to teach low-latency training, relevant for IoT applications.
Large-Scale Graph Data Parallelism for Node Embedding: Shows scalable graph parallelism, using GraphSAGE on PPI to teach efficient large-scale training, relevant for big data applications.
Data Parallelism with Visualization for Time Series Forecasting: Demonstrates visualization-integrated parallelism, using an LSTM on streaming IoT data to teach performance monitoring, relevant for IoT applications.

Implementation Details

Each example should be implemented as a standalone C++ program in the xtorch-examples repository, with the following structure: - Source Code: A main.cpp file containing the example code, using xtorch’s distributed utilities (e.g., xtorch::distributed for data parallelism), modules (e.g., xtorch::nn, xtorch::optim, xtorch::data::DataLoader), and, where applicable, OpenMPI for multi-node parallelism and OpenCV for visualization. - Build Instructions: A CMakeLists.txt file to compile the example, linking against xtorch, LibTorch, OpenMPI (if needed), and OpenCV (if needed). - README.md: A detailed guide explaining the example’s purpose, prerequisites (e.g., LibTorch, dataset downloads, OpenMPI, OpenCV, multi-GPU setup), steps to run, and expected outputs (e.g., training speed, latency, accuracy, RMSE, MAE, AUC-ROC, or visualization quality). - Dependencies: Ensure users have xtorch, LibTorch, datasets (e.g., CIFAR-10, UCI Appliances, Cora, PhysioNet ECG, QM9, PPI, custom IoT), and optionally OpenMPI and OpenCV installed, with download and setup instructions in each README. Multi-GPU and multi-node setups require appropriate hardware and MPI configurations. Graph datasets may require custom utilities or integration with C++ graph libraries.

For example, the “Distributed Data Parallelism for Graph Node Classification” might include: - Code: Train a GCN on the Cora dataset using xtorch::distributed for data parallelism across multiple GPUs, integrate with OpenMPI, optimize with xtorch::optim::RMSprop and cross-entropy loss, and output training speed and test accuracy, using xtorch’s modules and utilities. - Build: Use CMake to link against xtorch, LibTorch, and OpenMPI, specifying paths to Cora dataset. - README: Explain distributed data parallelism for graph models, provide compilation and training commands for multi-GPU setups, and show sample output (e.g., training speed of 150 batches/second, test accuracy of ~0.85).

Why These Examples?

These examples are designed to: - Cover Core Concepts: From basic multi-GPU data parallelism to multi-node, hybrid parallelism, and real-time parallel training, they introduce key data parallelism paradigms for time series and graph applications. - Leverage xtorch’s Strengths: They highlight xtorch’s xtorch::distributed, xtorch::nn, xtorch::optim, and xtorch::data modules, as well as C++ performance, particularly for scalable and distributed training. - Be Progressive: Examples start with simpler techniques (multi-GPU training) and progress to complex ones (multi-node, hybrid parallelism), supporting a learning path. - Address Practical Needs: Techniques like distributed data parallelism, hybrid parallelism, and real-time training are widely used in real-world applications, from IoT to bioinformatics. - Encourage Exploration: Examples like visualization-integrated parallelism and large-scale graph parallelism expose users to cutting-edge distributed training scenarios, fostering innovation.

Feasibility and Alignment with xtorch

The examples are feasible given xtorch’s features, as outlined in its GitHub repository: - Distributed Utilities: xtorch’s xtorch::distributed module (built on LibTorch’s distributed backend) supports data parallelism across multiple GPUs and nodes, with OpenMPI integration for multi-node setups. - Model Compatibility: xtorch::nn modules (e.g., Conv2d, LSTM, custom graph layers) support CNNs, LSTMs, GCNs, and GraphSAGE for time series and graph tasks. - Data Handling: xtorch::data::DataLoader and custom dataset classes handle image, time series, and graph datasets, with support for distributed data splitting and preprocessing (e.g., normalization, feature extraction). - Training Pipeline: The Trainer API simplifies distributed training loops, integrating with xtorch::distributed for synchronization, compatible with all examples. - Evaluation: xtorch’s utilities support metrics like training speed, latency, accuracy, RMSE, MAE, AUC-ROC, and downstream task performance. - C++ Integration: xtorch’s compatibility with OpenMPI enables multi-node parallelism, and OpenCV enables visualization of training metrics, enhancing user interaction.

The examples align with xtorch’s goal of simplifying deep learning in C++ and fit the "Time Series and Graph" context by emphasizing time series and graph parallel training, making them ideal for the xtorch-examples repository’s data parallelism section.

Comparison with Existing Practices

Popular deep learning libraries like PyTorch provide distributed training tutorials, such as “Distributed Data Parallel in PyTorch” (PyTorch Tutorials), which cover Python-based data parallelism. The proposed xtorch examples adapt this approach to C++, leveraging xtorch’s xtorch::distributed module and C++ performance. They also include time series and graph-specific parallelism (e.g., UCI, Cora, QM9) and advanced scenarios (e.g., hybrid parallelism, real-time training) to align with the category and modern distributed training trends, as seen in repositories like “pyg-team/pytorch_geometric” for graph model parallelism (GitHub - pyg-team/pytorch_geometric).

Implementation Notes

Directory Structure: Organize xtorch-examples with a performance_and_distributed_and_parallel_training/data_parallelism/ directory, containing subdirectories for each example (e.g., multi_gpu_cifar10/, multi_gpu_timeseries_uci/).
User Guidance: The main README.md should list all examples, suggest a learning path (e.g., start with multi-GPU training, then distributed GPU, then multi-node), and link to xtorch’s documentation.
C++ Focus: Ensure code uses modern C++ practices (e.g., smart pointers, exception handling) and includes detailed comments for clarity.
Dependencies: Note that users need LibTorch, xtorch, datasets (e.g., CIFAR-10, UCI Appliances, Cora, PhysioNet ECG, QM9, PPI, custom IoT), and optionally OpenMPI and OpenCV installed, with download and setup instructions in each README. Multi-GPU and multi-node setups require appropriate hardware (e.g., NVIDIA GPUs, cluster nodes) and MPI configurations. Graph datasets may require custom utilities or integration with C++ graph libraries.

Conclusion

The expanded list of eight "Performance and Distributed and Parallel Training -> Data Parallelism" examples provides a comprehensive introduction to data parallelism techniques with xtorch, covering multi-GPU training, time series forecasting, distributed graph node classification, multi-node anomaly detection, hybrid parallelism for graph prediction, real-time classification, large-scale graph embedding, and visualization-integrated parallelism. These examples are beginner-to-intermediate friendly, leverage xtorch’s strengths, and align with its goal of making deep learning accessible in C++ while addressing time series and graph applications. By including them in xtorch-examples, you can help users build a solid foundation in distributed and parallel training, fostering adoption and engagement with the xtorch community.