Custom Dataset and Dataloader in PyTorch C++¶
One of the most powerful components of PyTorch is its set of dataset utilities, which help users optimize and parallelize reading and serving data to models.
In this tutorial, we are going to create a custom dataset that contains 2 inputs and a label. This dataset will be inspired by the MNIST dataset. These datasets can be used for applications that require more than a single input and single label.
Tip
We recommend reading the C++ Frontend Tutorial if you have not done so already. There you will learn how to set up a C++ PyTorch project.
Motivation¶
We want to leverage the raw speed of C++ for reading data and take advantage of C++’s native multithreading with a custom dataset. Additionally, the ability to customize the outputs of our dataset lets us use a wider range of PyTorch models. For example, the current generation of transformer models takes multiple inputs: the various masks, token type ids, and position ids that are fed into the network. The base PyTorch C++ dataset only supports two tensors per example (“data” and “target”). Here we will create a dataset that supports three tensors per example, and we will create the necessary files to use the powerful PyTorch dataloader utilities with it.
Setting Up the Project¶
As in the frontend tutorial above, we will be using cmake to compile our project.
cmake_minimum_required(VERSION 3.0 FATAL_ERROR)
# allows us to use the Torch_ROOT environment variable
cmake_policy(SET CMP0074 NEW)
# alternative to set_property(...)
set(CMAKE_CXX_STANDARD 14)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
project(custom-dataset
VERSION 0.0.1
DESCRIPTION "Custom Dataset with PyTorch's C++ Frontend")
# add torch
find_package(Torch REQUIRED)
# put all source files into a single variable
set(SOURCES
custom_dataset.cpp
main.cpp
)
add_executable(custom-dataset ${SOURCES})
target_link_libraries(custom-dataset "${TORCH_LIBRARIES}")
Note
If you need CMAKE_PREFIX_PATH to be something other than the location of the libtorch directory, you can use the environment variable Torch_DIR as an alternative. To do so, export it in your shell as follows:
export Torch_DIR=/full/path/to/libtorch/share/cmake/Torch
# or
export Torch_DIR=$(realpath relative/path/to/libtorch/share/cmake/Torch)
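If you would rather not use an environment variable at all, you can instead pass the prefix path directly when configuring the build, for example (adjust the path to wherever you unpacked libtorch):
cmake -DCMAKE_PREFIX_PATH=/full/path/to/libtorch ..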
Now we are ready to create the files required for our project. As seen in the CMakeLists.txt file above, the two source files that we are going to use will be called main.cpp and custom_dataset.cpp. In addition to these files, we will also create header files for the custom dataset, the custom input type, and the custom stacking dataset transform in the files custom_dataset.h, custom_input_type.h, and custom_stack.h, respectively.
Custom Input Type¶
The default dataset uses a templated struct called Example<> with two members, data and target. We are going to create a struct with three members, inputone, inputtwo, and label. Ultimately, we will be able to use this in our dataloader as well to create batches for each of these three members regardless of their relative shapes.
Note
Example<> is shorthand for Example<torch::Tensor, torch::Tensor>; we will be using something similar. PyTorch’s C++ library makes extensive use of templates, and we need to create custom versions of the transforms for our custom input type.
We’ll begin by creating a struct with three members named inputone, inputtwo, and label, whose types all default to torch::Tensor. We are also going to create a default constructor for this struct.
#pragma once
#include <torch/types.h>
#include <utility> // for std::move
template <typename InputOne = torch::Tensor,
          typename InputTwo = torch::Tensor,
          typename Label = torch::Tensor>
struct ThreeTensorInput {
ThreeTensorInput() = default;
ThreeTensorInput(InputOne inputone, InputTwo inputtwo, Label label)
: inputone(std::move(inputone)),
inputtwo(std::move(inputtwo)),
label(std::move(label)) {}
InputOne inputone;
InputTwo inputtwo;
Label label;
}; // don't forget the semi-colon here
This could be useful if inputone and inputtwo have different types or dimensions, or simply to organize your data types.
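As a quick illustration, a single example can be constructed directly; the shapes below are arbitrary and chosen only for demonstration (they match the ones used later in this tutorial):
#include <torch/torch.h>
#include "custom_input_type.h"
// one example: two inputs with different shapes plus an integer class label
ThreeTensorInput<> example(
    torch::rand({2, 3}),      // inputone, size = (2, 3)
    torch::rand({3, 2}),      // inputtwo, size = (3, 2)
    torch::randint(5, {1}));  // label, a single integer in [0, 5)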
Dataset Class¶
Next, we will create our dataset from the standard Dataset included with PyTorch’s data utilities. This is a minimal dataset that holds our examples in a vector and returns one example each time the get method is called. It is very similar to the MNIST dataset example, except that we are simplifying it a bit by removing the train / test mode of the MNIST dataset.
#pragma once
#include <cstddef> // for std::size_t
#include <torch/data/datasets/base.h>
#include <torch/types.h>
#include <vector>
#include "custom_input_type.h"
template <typename CustomSingleExample = ThreeTensorInput<>>
class CustomDataset
: public torch::data::datasets::Dataset<CustomDataset<CustomSingleExample>,
CustomSingleExample> {
public:
using CustomExampleType = CustomSingleExample;
// constructor
explicit CustomDataset(const std::vector<CustomExampleType> &examples);
// get item
virtual CustomExampleType get(std::size_t index) override;
// dataset size
torch::optional<std::size_t> size() const override;
// get all examples
const std::vector<CustomExampleType> &examples() const;
private:
std::vector<CustomExampleType> examples_;
};
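One thing we inherit for free from this Dataset base class is a default get_batch method, which calls get once per requested index and collects the results into a vector. The sketch below is a simplified paraphrase of that inherited behavior (you do not need to add it to the header; it is shown only to explain what the dataloader will call later):
// simplified paraphrase of the get_batch behavior inherited from
// torch::data::datasets::Dataset -- not code you need to write
std::vector<CustomExampleType> get_batch(torch::ArrayRef<std::size_t> indices) {
  std::vector<CustomExampleType> batch;
  batch.reserve(indices.size());
  for (const auto index : indices) {
    batch.push_back(get(index)); // one call to get per index
  }
  return batch;
}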
Dataset Implementation¶
The other big departure from the MNIST dataset is that we’ve replaced Example<> with ThreeTensorInput<> and ExampleType with CustomExampleType. What is important here is that we are instantiating the dataset template with our custom dataset and custom type: Dataset<CustomDataset<CustomSingleExample>, CustomSingleExample>. In the MNIST dataset the example type is implicit because it’s the default parameter, but since we are changing it, we need to state our type explicitly. Next we want to define all of the methods that we need to override in custom_dataset.cpp.
#include "custom_dataset.h"
using namespace std;
template <typename T>
CustomDataset<T>::CustomDataset(const vector<T> &examples)
: examples_(examples) {}
template <typename T> T CustomDataset<T>::get(size_t index) {
T ex = examples_[index];
return std::move(ex);
}
template <typename T> torch::optional<size_t> CustomDataset<T>::size() const {
torch::optional<size_t> sz(examples_.size());
return sz;
}
// the following line is required for the linker to work correctly
template class CustomDataset<>; // add our custom example with default argument
Note
We are using T as our typename here for convenience, because we need to define our methods for this templated class. Lastly, we need to explicitly instantiate our custom class with each type that we are going to use with this template. For this example, we are only using the default type ThreeTensorInput<>, so we can instantiate it with no template parameters.
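If you later use the dataset with a different example type, you would add a matching explicit instantiation at the bottom of custom_dataset.cpp. For instance, a hypothetical variant whose label is a plain int64_t instead of a tensor would need:
// hypothetical: only add this if you actually use this example type
template class CustomDataset<ThreeTensorInput<torch::Tensor, torch::Tensor, int64_t>>;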
Run the Custom Dataset¶
At this point we have a functional dataset class. We can load tensors into this dataset and retrieve them by using the get method. To use our dataset, we will create a minimal main function that loads our dataset with 10 examples in which the two inputs and the label all have different sizes.
#include <cassert>
#include <iostream>
#include <torch/torch.h>
#include "custom_dataset.h"
#include "custom_input_type.h"
using namespace std;
using namespace torch;
int main() {
int dataset_sz = 10;
vector<Tensor> ones, twos, labels;
ones.reserve(dataset_sz);
twos.reserve(dataset_sz);
labels.reserve(dataset_sz);
vector<ThreeTensorInput<>> examples_;
for (int i = 0; i < dataset_sz; ++i) {
ones.push_back(torch::rand({2, 3})); // size = (2, 3)
twos.push_back(torch::rand({3, 2})); // size = (3, 2)
labels.push_back(torch::randint(5, {1})); // size = (1)
examples_.emplace_back(ones[i], twos[i], labels[i]);
}
CustomDataset<> ds(examples_);
assert(static_cast<size_t>(dataset_sz) == ds.size().value());
auto ex = ds.get(0);
cout << ex.inputone << "\n"
<< ex.inputtwo << "\n"
<< ex.label << endl;
return 0;
}
Dataloader¶
OK, if we want to iterate through our dataset a single example at a time, then we are done. Of course, we probably want to process our data in minibatches. As in the Python frontend, the C++ frontend also has a dataloader utility. However, the Python DataLoader class has a parameter called collate_fn. The default collation function takes a list of basic Python types, NumPy arrays, or torch tensors and stacks them into a minibatch. One normally doesn’t need to write a custom collation function except for special cases, such as sequence data where you need to pad each sequence to the length of the longest sequence in the batch. In the C++ frontend, the equivalent of the collation function is a mapping transform. These transforms take a batch from the get_batch function and apply themselves to the batch. By default, the get_batch function returns a vector of the type returned by the get function. In the MNIST example, the map method is used to transform the BatchDataset into a MapDataset. The code looks something like:
int batch_size = 3;
auto ds = CustomDataset<>(examples_).map(data::transforms::Stack<>());
auto dl = data::make_data_loader<data::samplers::SequentialSampler>(
    move(ds), batch_size);
Custom Stack Transform¶
But here we see that the Stack<> transform is a default template. Spoiler alert: the default type parameter here is Example<>, and this transform explicitly stacks the data and target members. We are therefore going to have to write our own stack transform for our custom type; the original implementation of Stack lives in libtorch’s torch/data/transforms/stack.h header. So let’s create a file called custom_stack.h and do that.
#pragma once
#include <torch/data/transforms.h>
#include <utility> // for std::move
#include <vector>
#include "custom_input_type.h"
// explicit specializations must be declared inside the template's own
// namespace under C++14 (C++17 relaxes this), so we open it here
namespace torch {
namespace data {
namespace transforms {
template <>
struct Stack<ThreeTensorInput<>> : public Collation<ThreeTensorInput<>> {
ThreeTensorInput<>
apply_batch(std::vector<ThreeTensorInput<>> examples) override {
std::vector<torch::Tensor> inputone, inputtwo, label;
inputone.reserve(examples.size());
inputtwo.reserve(examples.size());
label.reserve(examples.size());
for (auto &example : examples) {
inputone.push_back(std::move(example.inputone));
inputtwo.push_back(std::move(example.inputtwo));
label.push_back(std::move(example.label));
}
return {torch::stack(inputone), torch::stack(inputtwo),
torch::stack(label)};
}
};
} // namespace transforms
} // namespace data
} // namespace torch
As stated earlier, the default get_batch method creates a vector of our custom type. Then torch::stack is applied to each of the three members of this custom type, which stacks the members with an added leading dimension for the batch. Since our custom type is composed of only tensors, the return type of this transform is also a ThreeTensorInput<>, but with the extra dimension.
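Concretely, with the shapes used in this tutorial and a batch size of 3, the members of one stacked minibatch mb would have the following sizes:
mb.inputone  // size (3, 2, 3): three (2, 3) tensors stacked along a new dim 0
mb.inputtwo  // size (3, 3, 2)
mb.label     // size (3, 1)
Note that with 10 examples and a batch size of 3, the dataloader will yield three full minibatches followed by a final minibatch of size 1, since drop_last is false by default.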
Now we can go back to main.cpp, include this header file, map our dataset with this custom transform, and create the dataloader. Feel free to try to do this yourself. Don’t forget to add our custom input type where it’s needed.
Main with Dataloader¶
OK, your final code should look like this:
#include <cassert>
#include <iostream>
#include <torch/torch.h>
#include "custom_dataset.h"
#include "custom_input_type.h"
#include "custom_stack.h" // our custom stack transform
using namespace std;
using namespace torch;
int main() {
int dataset_sz = 10;
int batch_size = 3; // added batch size variable
vector<Tensor> ones, twos, labels;
ones.reserve(dataset_sz);
twos.reserve(dataset_sz);
labels.reserve(dataset_sz);
vector<ThreeTensorInput<>> examples_;
for (int i = 0; i < dataset_sz; ++i) {
ones.push_back(torch::rand({2, 3}));
twos.push_back(torch::rand({3, 2}));
labels.push_back(torch::randint(5, {1}));
examples_.emplace_back(ones[i], twos[i], labels[i]);
}
CustomDataset<> ds(examples_);
assert(static_cast<size_t>(dataset_sz) == ds.size().value());
// our Stack specialization takes the custom type as its template parameter
// (map is rvalue-qualified in libtorch, so we move the dataset into it)
auto ds_map = std::move(ds).map(data::transforms::Stack<ThreeTensorInput<>>());
auto dl = data::make_data_loader<data::samplers::SequentialSampler>(
move(ds_map), batch_size);
for (auto &mb : *dl) {
cout << mb.inputone << "\n" << mb.inputtwo << "\n" << mb.label << endl;
}
return 0;
}
Now let’s build and run the code.
export Torch_DIR=$(realpath relative/path/to/libtorch/share/cmake/Torch)
mkdir build && cd build
cmake ..
make
./custom-dataset
There we have it. A minimal example of a PyTorch dataset and dataloader in C++.