
Custom Dataset and Dataloader in PyTorch C++

One of the most powerful components of PyTorch is its set of dataset utilities, which help users optimize and parallelize reading and serving data to models.

In this tutorial, we are going to create a custom dataset that contains 2 inputs and a label. This dataset will be inspired by the MNIST dataset. These datasets can be used for applications that require more than a single input and single label.

Tip

We recommend reading the C++ Frontend Tutorial if you have not done so already. There you will learn how to set up a C++ PyTorch project.

Motivation

We want to leverage the raw speed of C++ for reading data and take advantage of C++'s native multithreading with a custom dataset. Additionally, the ability to customize the outputs of our dataset lets us easily use a wider range of PyTorch models. For example, the current crop of transformer models takes multiple inputs for the various masks, token type ids, and position ids that feed into the network. The base PyTorch C++ dataset only supports two tensors ("data" and "target") for each example. Here we will create a dataset that supports three different tensors, and we will create the necessary files to use the powerful PyTorch dataloader utilities.

Setting Up the Project

As in the frontend tutorial above, we will be using cmake to compile our project.

CMakeLists.txt
cmake_minimum_required(VERSION 3.0 FATAL_ERROR)

# allows us to use Torch_ROOT environmental variable
cmake_policy(SET CMP0074 NEW)

# alternative to set_property(...)
set(CMAKE_CXX_STANDARD 14)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

project(custom-dataset
        VERSION 0.0.1
        DESCRIPTION "Custom Dataset with PyTorch's C++ Frontend")

# add torch
find_package(Torch REQUIRED)

# put all sources files into a single variable
set(SOURCES
    custom_dataset.cpp
    main.cpp
   )

add_executable(custom-dataset ${SOURCES})
target_link_libraries(custom-dataset "${TORCH_LIBRARIES}")

Note

If you need CMAKE_PREFIX_PATH for something other than the location of the libtorch directory, you can use the environment variable Torch_DIR instead. To do so, export it in your shell as follows:

export Torch_DIR=/full/path/to/libtorch/share/cmake/Torch
# or
export Torch_DIR=$(realpath relative/path/to/libtorch/share/cmake/Torch)

Now we are ready to create the files required for our project. As seen in the CMakeLists.txt file above, the two source files that we are going to use will be called main.cpp and custom_dataset.cpp. In addition to these files, we will also create header files for the custom dataset, the custom input type, and the custom stacking dataset transform in the files custom_dataset.h, custom_input_type.h, and custom_stack.h, respectively.

Custom Input Type

The default dataset uses a templated struct called Example<> with two members, data and target. We are going to create a struct with three members, inputone, inputtwo, and label. Ultimately, we will be able to use this in our dataloader as well to create batches for each of these three members regardless of relative shapes.

Note

Example<> is shorthand for Example<torch::Tensor, torch::Tensor>; we will be using something similar. PyTorch's C++ library makes extensive use of templates, so we need to create custom versions of the transforms for our custom input type.

We’ll begin by creating a struct with three members named inputone, inputtwo, and label with three types that default to a torch tensor type. We are also going to create a default constructor for this struct.

custom_input_type.h
#pragma once

#include <torch/types.h>
template < typename InputOne = torch::Tensor,
           typename InputTwo = torch::Tensor,
           typename Label = torch::Tensor >
struct ThreeTensorInput {
  ThreeTensorInput() = default;
  ThreeTensorInput(InputOne inputone, InputTwo inputtwo, Label label)
      : inputone(std::move(inputone)),
        inputtwo(std::move(inputtwo)),
        label(std::move(label)) {}

  InputOne inputone;
  InputTwo inputtwo;
  Label label;
};  // don't forget the semi-colon here

This could be useful if inputone and inputtwo have different types or dimensions or just to organize your data types.

Dataset Class

Next, we will create our dataset from the standard Dataset included with PyTorch’s data utilities. This is a minimal dataset that holds our examples in a vector and returns one example each time the get method is called. It is very similar to the MNIST dataset example except that we are simplifying it a bit by removing the train / test type of the MNIST dataset.

custom_dataset.h
#pragma once

#include <torch/data/datasets/base.h>
#include <torch/types.h>

#include "custom_input_type.h"

template <typename CustomSingleExample = ThreeTensorInput<>>
class CustomDataset
    : public torch::data::datasets::Dataset<CustomDataset<CustomSingleExample>,
                                            CustomSingleExample> {
public:
  using CustomExampleType = CustomSingleExample;
  // constructor
  explicit CustomDataset(const std::vector<CustomExampleType> &examples);
  // get item
  virtual CustomExampleType get(std::size_t index) override;
  // dataset size
  torch::optional<std::size_t> size() const override;
  // get all examples
  const std::vector<CustomExampleType> &examples() const;

private:
  std::vector<CustomExampleType> examples_;
};

Dataset Implementation

The other big departure from the MNIST dataset is that we've replaced Example<> with ThreeTensorInput<> and ExampleType with CustomExampleType. What is important here is that we are instantiating the dataset template with our custom dataset and custom type: Dataset<CustomDataset<CustomSingleExample>, CustomSingleExample>. In the MNIST dataset the example type is implicit because it is the default parameter, but since we are changing it, we need to state our type explicitly. Next, we define all of the methods that we need to override in custom_dataset.cpp.

custom_dataset.cpp
#include "custom_dataset.h"

using namespace std;

template <typename T>
CustomDataset<T>::CustomDataset(const vector<T> &examples)
    : examples_(examples) {}

template <typename T> T CustomDataset<T>::get(size_t index) {
  T ex = examples_[index];
  return std::move(ex);
}

template <typename T> torch::optional<size_t> CustomDataset<T>::size() const {
  torch::optional<size_t> sz(examples_.size());
  return sz;
}

// the following line is required for the linker to work correctly
template class CustomDataset<>; // add our custom example with default argument

Note

We are using T as our typename here for convenience, because we need to define our methods for this templated class. Lastly, we need to explicitly instantiate our custom class with each type that we are going to use in our template. For this example, we are only using the default type ThreeTensorInput, so we can instantiate it with no template parameters.

Run the Custom Dataset

At this point we have a functional dataset class. We can load tensors into this dataset and retrieve them by using the get method. To use our dataset, we will create a minimal main class that loads our dataset with 10 examples where the inputs and labels are all different sizes.

main.cpp
#include <cassert>
#include <iostream>
#include <torch/torch.h>

#include "custom_dataset.h"
#include "custom_input_type.h"

using namespace std;
using namespace torch;

int main() {
  int dataset_sz = 10;

  vector<Tensor> ones, twos, labels;
  ones.reserve(dataset_sz);
  twos.reserve(dataset_sz);
  labels.reserve(dataset_sz);

  vector<ThreeTensorInput<>> examples_;
  for (int i = 0; i < dataset_sz; ++i) {
    ones.push_back(torch::rand({2, 3}));  // size = (2, 3)
    twos.push_back(torch::rand({3, 2}));  // size = (3, 2)
    labels.push_back(torch::randint(5, {1}));  // size = (1)
    examples_.emplace_back(ones[i], twos[i], labels[i]);
  }

  CustomDataset<> ds(examples_);
  assert(static_cast<size_t>(dataset_sz) == ds.size().value());

  auto ex = ds.get(0);

  cout << ex.inputone << "\n"
       << ex.inputtwo << "\n"
       << ex.label << endl;

  return 0;
}

Dataloader

Ok, if we want to iterate through our dataset one example at a time, then we are done. Of course, we usually want to process our data in minibatches. As in the Python frontend, the C++ frontend also provides a DataLoader utility class. However, the Python DataLoader class has a parameter called collate_fn. The default collation function takes a list of basic Python types, NumPy arrays, or torch tensors and stacks them into a minibatch. One normally doesn't need to write a custom collation function except for special cases, such as sequence data where you need to pad each sequence to the length of the longest sequence in the batch. In the C++ frontend, the equivalent of the collation function is a mapping transform. These transforms take a batch from the get_batch function and apply themselves to the batch. By default, get_batch returns a vector of the type returned by the get function. The MNIST example uses the map method to transform the BatchDataset into a MapDataset. The code looks something like:

int batch_size = 3;
auto ds = CustomDataset<>(examples_).map(data::transforms::Stack<>());
auto dl = data::make_data_loader<data::samplers::SequentialSampler>(
    std::move(ds), batch_size);

Custom Stack Transform

But here we see that the Stack<> transform is a default template. Spoiler alert: the default type parameter is Example<>, and this transform explicitly stacks the data and target members. So we are going to have to write our own stack transform for our custom type, modeled on the original implementation of Stack in PyTorch's data transforms. Let's create a file called custom_stack.h and do that.

custom_stack.h
#pragma once

#include <torch/data/transforms.h>
#include <vector>

#include "custom_input_type.h"

namespace torch {
namespace data {
namespace transforms {

// Specialize Stack for our custom input type. The specialization is
// opened inside the torch::data::transforms namespace so that it
// compiles under C++14 (a qualified specialization outside the
// enclosing namespace requires C++17).
template <>
struct Stack<ThreeTensorInput<>> : public Collation<ThreeTensorInput<>> {
  ThreeTensorInput<>
  apply_batch(std::vector<ThreeTensorInput<>> examples) override {
    std::vector<torch::Tensor> inputone, inputtwo, label;
    inputone.reserve(examples.size());
    inputtwo.reserve(examples.size());
    label.reserve(examples.size());
    for (auto &example : examples) {
      inputone.push_back(std::move(example.inputone));
      inputtwo.push_back(std::move(example.inputtwo));
      label.push_back(std::move(example.label));
    }
    return {torch::stack(inputone), torch::stack(inputtwo),
            torch::stack(label)};
  }
};

} // namespace transforms
} // namespace data
} // namespace torch

As stated earlier, the default get_batch method creates a vector of our custom type. Then torch::stack is applied to each of the three members of this custom type, which stacks the members with an added dimension for the batch. Since our custom type consists only of tensors, the return type of this transform is also a ThreeTensorInput<>, but with the extra batch dimension. Now we can go back to main.cpp, include this header file, map our dataset with this custom transform, and create the dataloader. Feel free to try this yourself. Don't forget to add our custom input type where it's needed.

Main with Dataloader

Ok, your final code should look like this:

main.cpp
#include <cassert>
#include <iostream>
#include <torch/torch.h>

#include "custom_dataset.h"
#include "custom_input_type.h"
#include "custom_stack.h"  // our custom stack transform

using namespace std;
using namespace torch;

int main() {
  int dataset_sz = 10;
  int batch_size = 3;  // added batch size variable

  vector<Tensor> ones, twos, labels;
  ones.reserve(dataset_sz);
  twos.reserve(dataset_sz);
  labels.reserve(dataset_sz);

  vector<ThreeTensorInput<>> examples_;
  for (int i = 0; i < dataset_sz; ++i) {
    ones.push_back(torch::rand({2, 3}));
    twos.push_back(torch::rand({3, 2}));
    labels.push_back(torch::randint(5, {1}));
    examples_.emplace_back(ones[i], twos[i], labels[i]);
  }

  CustomDataset<> ds(examples_);
  assert(static_cast<size_t>(dataset_sz) == ds.size().value());
  // Stack takes our custom type which is also a templated class
  auto ds_map = ds.map(data::transforms::Stack<ThreeTensorInput<>>());
  auto dl = data::make_data_loader<data::samplers::SequentialSampler>(
      move(ds_map), batch_size);
  for (auto &mb : *dl) {
    cout << mb.inputone << "\n" << mb.inputtwo << "\n" << mb.label << endl;
  }

  return 0;
}

Now let’s build and run the code.

export Torch_DIR=$(realpath relative/path/to/libtorch/share/cmake/Torch)
mkdir build && cd build
cmake ..
make
./custom-dataset

There we have it. A minimal example of a PyTorch dataset and dataloader in C++.
