10 1 datasets
Detailed Datasets Examples for xtorch
This document expands the "Data Handling and Preprocessing -> Datasets" subcategory of examples for the xtorch library, a C++ deep learning framework that extends PyTorch’s LibTorch API with user-friendly abstractions. The goal is to provide a comprehensive set of beginner-to-intermediate examples that introduce users to dataset handling and preprocessing tasks, with a focus on time series and graph datasets to align with the broader "Time Series and Graph" context. These examples showcase xtorch’s capabilities in data loading, preprocessing, and integration with C++ ecosystems, and are designed to be included in the xtorch-examples repository, helping users learn dataset management in C++.
Background and Context
xtorch simplifies deep learning for C++ developers by offering high-level model classes (e.g., XTModule, ResNetExtended, XTCNN), a streamlined training loop via the Trainer module, enhanced data utilities (e.g., ImageFolderDataset, CSVDataset), extended optimizers, and model serialization tools. The original two dataset examples—using built-in datasets like MNIST and CIFAR-10 and creating custom datasets with ImageFolderDataset—provide a solid foundation. This expansion adds six more examples to cover additional dataset types (e.g., time series, graphs, streaming data), loading mechanisms (e.g., CSV, graph structures), and preprocessing techniques (e.g., normalization, augmentation, subgraph sampling), ensuring a broad introduction to dataset handling with a focus on time series and graph applications.
The current time is 12:30 PM PDT on Monday, April 21, 2025, and all considerations are based on available information from the xtorch GitHub repository (xtorch GitHub Repository) without contradicting this timeframe.
Expanded Examples
The following table provides a detailed list of eight "Data Handling and Preprocessing -> Datasets" examples, including the original two and six new ones. Each example is designed to be standalone, with a clear focus on a specific dataset handling or preprocessing concept or xtorch feature, making it accessible for users.
| Category | Subcategory | Example Title | Description |
|---|---|---|---|
| Data Handling and Preprocessing | Datasets | Using Built-in Datasets: MNIST, CIFAR-10 | Shows how to load built-in datasets like MNIST (handwritten digits) and CIFAR-10 (small images) using xtorch’s data utilities (e.g., xtorch::data). Applies preprocessing such as normalization and tensor conversion, and evaluates with sample loading speed (samples per second) and data integrity (correct labels and shapes). |
| Creating Custom Datasets with ImageFolderDataset | Implements a custom dataset class using xtorch’s ImageFolderDataset for a folder of images (e.g., a custom image classification dataset with categories like cats and dogs). Applies preprocessing such as resizing, random cropping, and data augmentation, and evaluates with data loading accuracy (correct image-label pairs) and visualization of preprocessed images. |
||
| Loading Time Series Datasets with CSVDataset | Loads a time series dataset (e.g., UCI Appliances Energy Prediction) using xtorch’s xtorch::data::CSVDataset. Applies preprocessing such as sliding window segmentation and min-max normalization, and evaluates with data loading speed and sequence consistency (e.g., correct temporal ordering). |
||
| Creating Custom Graph Datasets for Node Classification | Implements a custom graph dataset class for the Cora dataset (citation network) using xtorch, handling node features and adjacency matrices. Applies preprocessing such as feature normalization and edge filtering, and evaluates with graph structure integrity (correct nodes and edges) and loading speed (graphs per second). | ||
| Loading Molecular Graph Datasets for Property Prediction | Loads a molecular graph dataset (e.g., QM9 for small molecules) using a custom xtorch dataset class, handling molecular graphs (nodes, edges, atom/bond features). Applies preprocessing such as graph feature extraction (e.g., atom types, bond orders), and evaluates with data loading accuracy (correct molecular structures) and molecular validity. | ||
| Streaming Time Series Datasets for Real-Time Processing | Implements a streaming dataset class for real-time time series data (e.g., IoT sensor data like temperature readings). Uses xtorch to buffer and preprocess streaming inputs with techniques like normalization and outlier filtering, and evaluates with streaming throughput (samples per second) and data consistency (correct preprocessing). | ||
| Custom Dataset with Visualization for Time Series Analysis | Creates a custom time series dataset class for the PhysioNet ECG dataset (heart signals). Integrates xtorch with OpenCV to visualize preprocessed signals (e.g., normalized ECG waves), applies preprocessing like noise reduction and standardization, and evaluates with visualization quality (clear signal plots) and preprocessing accuracy (correct signal values). | ||
| Handling Large-Scale Graph Datasets with Batch Processing | Implements a custom graph dataset class for a large-scale graph (e.g., PPI dataset for protein interactions) using xtorch. Supports batch processing for node embeddings with subgraph sampling, applies preprocessing like feature scaling, and evaluates with batch loading speed (batches per second) and embedding quality (downstream task accuracy). |
Rationale for Each Example
- Using Built-in Datasets: MNIST, CIFAR-10: Introduces basic dataset loading, using MNIST and CIFAR-10 to teach xtorch’s built-in utilities, ideal for beginners working with image data.
- Creating Custom Datasets with ImageFolderDataset: Demonstrates custom dataset creation, using
ImageFolderDatasetto teach flexible image data handling, relevant for custom image classification tasks. - Loading Time Series Datasets with CSVDataset: Introduces time series data handling, using UCI data to teach CSV-based loading and temporal preprocessing, aligning with the time series focus.
- Creating Custom Graph Datasets for Node Classification: Demonstrates graph dataset handling, using Cora to teach graph structure management, aligning with the graph focus.
- Loading Molecular Graph Datasets for Property Prediction: Focuses on molecular graph data, using QM9 to teach specialized graph handling for cheminformatics applications.
- Streaming Time Series Datasets for Real-Time Processing: Introduces streaming data handling, using IoT sensor data to teach real-time preprocessing, relevant for IoT applications.
- Custom Dataset with Visualization for Time Series Analysis: Demonstrates visualization-integrated data handling, using ECG data to teach user-friendly time series preprocessing, enhancing analysis.
- Handling Large-Scale Graph Datasets with Batch Processing: Shows advanced graph data handling, using PPI to teach batch processing for large-scale graphs, relevant for scalable graph applications.
Implementation Details
Each example should be implemented as a standalone C++ program in the xtorch-examples repository, with the following structure:
- Source Code: A main.cpp file containing the example code, using xtorch’s data utilities (e.g., xtorch::data::CSVDataset, ImageFolderDataset, custom dataset classes) and, where applicable, OpenCV for visualization.
- Build Instructions: A CMakeLists.txt file to compile the example, linking against xtorch, LibTorch, and OpenCV (if needed).
- README.md: A detailed guide explaining the example’s purpose, prerequisites (e.g., LibTorch, dataset downloads, OpenCV), steps to run, and expected outputs (e.g., loading speed, data integrity, visualization quality, or downstream task performance).
- Dependencies: Ensure users have xtorch, LibTorch, datasets (e.g., MNIST, CIFAR-10, UCI Appliances, Cora, QM9, PhysioNet ECG, PPI, custom IoT), and optionally OpenCV installed, with download instructions in each README. Graph datasets may require custom utilities or integration with C++ graph libraries.
For example, the “Loading Time Series Datasets with CSVDataset” might include:
- Code: Load the UCI Appliances Energy Prediction dataset with xtorch::data::CSVDataset, apply sliding window segmentation and min-max normalization, and output sample preprocessed sequences for verification, using xtorch’s data utilities.
- Build: Use CMake to link against xtorch and LibTorch, specifying paths to UCI dataset.
- README: Explain CSV-based time series loading and preprocessing, provide compilation and execution commands, and show sample output (e.g., loading speed of 1000 samples/second, correct sequence shapes).
Why These Examples?
These examples are designed to:
- Cover Core Concepts: From built-in dataset loading and custom image datasets to time series, graph, streaming, and large-scale datasets, they introduce key dataset handling paradigms for time series and graph applications.
- Leverage xtorch’s Strengths: They highlight xtorch’s data utilities (CSVDataset, ImageFolderDataset), custom dataset flexibility, and C++ performance, particularly for time series and graph data.
- Be Progressive: Examples start with simpler tasks (built-in datasets) and progress to complex ones (streaming, large-scale graphs), supporting a learning path.
- Address Practical Needs: Techniques like time series preprocessing, graph feature extraction, streaming data handling, and batch processing are widely used in real-world applications, from IoT to bioinformatics.
- Encourage Exploration: Examples like streaming datasets and visualization-integrated datasets expose users to cutting-edge data handling scenarios, fostering innovation.
Feasibility and Alignment with xtorch
The examples are feasible given xtorch’s features, as outlined in its GitHub repository:
- Data Utilities: xtorch::data::CSVDataset, ImageFolderDataset, and custom dataset classes support loading and preprocessing image, time series, and graph datasets.
- Preprocessing: xtorch’s data utilities support normalization, augmentation, sliding windows, feature extraction, and subgraph sampling, critical for the examples.
- Streaming and Batching: xtorch’s flexible data pipeline supports streaming inputs and batch processing, enabling real-time and large-scale dataset handling.
- Evaluation: xtorch’s utilities support metrics like loading speed, data integrity (e.g., correct labels, shapes), and downstream task performance (e.g., classification accuracy).
- C++ Integration: xtorch’s compatibility with OpenCV enables visualization of preprocessed data, enhancing user interaction.
The examples align with xtorch’s goal of simplifying deep learning in C++ and fit the "Time Series and Graph" context by emphasizing time series and graph datasets, making them ideal for the xtorch-examples repository’s datasets section.
Comparison with Existing Practices
Popular deep learning libraries like PyTorch provide dataset tutorials, such as “Writing Custom Datasets” (PyTorch Tutorials), which cover Python-based dataset handling. The proposed xtorch examples adapt this approach to C++, leveraging xtorch’s data utilities and C++ performance. They also include time series and graph-specific datasets (e.g., UCI, Cora, QM9) and advanced handling scenarios (e.g., streaming, large-scale batch processing) to align with the category and modern data processing trends, as seen in repositories like “pyg-team/pytorch_geometric” for graph data handling (GitHub - pyg-team/pytorch_geometric).
Implementation Notes
- Directory Structure: Organize
xtorch-exampleswith adata_handling_and_preprocessing/datasets/directory, containing subdirectories for each example (e.g.,builtin_mnist_cifar10/,timeseries_csv_uci/). - User Guidance: The main
README.mdshould list all examples, suggest a learning path (e.g., start with built-in datasets, then custom image datasets, then graph datasets), and link to xtorch’s documentation. - C++ Focus: Ensure code uses modern C++ practices (e.g., smart pointers, exception handling) and includes detailed comments for clarity.
- Dependencies: Note that users need LibTorch, xtorch, datasets (e.g., MNIST, CIFAR-10, UCI Appliances, Cora, QM9, PhysioNet ECG, PPI, custom IoT), and optionally OpenCV installed, with download and setup instructions in each README. Graph data handling may require custom utilities or integration with C++ graph libraries.
Conclusion
The expanded list of eight "Data Handling and Preprocessing -> Datasets" examples provides a comprehensive introduction to dataset handling and preprocessing with xtorch, covering built-in datasets, custom image datasets, time series datasets, graph datasets, molecular graph datasets, streaming datasets, visualization-integrated datasets, and large-scale graph datasets. These examples are beginner-to-intermediate friendly, leverage xtorch’s strengths, and align with its goal of making deep learning accessible in C++ while addressing time series and graph applications. By including them in xtorch-examples, you can help users build a solid foundation in dataset management, fostering adoption and engagement with the xtorch community.