Working with PyTorch’s Dataset and Dataloader classes (part 1)

12 minute read

Recently, I built a simple NLP algorithm for a work project, following the template described in this tutorial. As I looked to increase my model’s complexity, I started coming across references to the Dataset and DataLoader classes. I tried adapting my work-related code to use these objects, but I kept running into pesky bugs, so I decided to take some time to figure out how to use them properly. In this post, I adapt the PyTorch NLP tutorial to work with Dataset and DataLoader objects. Since my focus is primarily on these objects, please refer to the tutorial for details regarding the NLP model.

import numpy as np
import os
import pandas as pd
import re
import seaborn as sns
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
torch.manual_seed(1)
<torch._C.Generator at 0x7fef88a746f0>
%load_ext nb_black
%config InlineBackend.figure_format = 'retina'
%load_ext watermark
# Figure aesthetics
sns.set_theme()
sns.set_context("talk")
sns.set_style("white")

First attempt

The tutorial generates a simple dataset for a logistic regression bag-of-words classifier, which takes a sentence and predicts whether it is in English or Spanish. The data was originally structured so that each sample’s sentence was a list of words.

train_data = [
    ("me gusta comer en la cafeteria".split(), "SPANISH"),
    ("Give it to me".split(), "ENGLISH"),
    ("No creo que sea una buena idea".split(), "SPANISH"),
    ("No it is not a good idea to get lost at sea".split(), "ENGLISH"),
]

test_data = [
    ("Yo creo que si".split(), "SPANISH"),
    ("it is lost on me".split(), "ENGLISH"),
]

Before putting the data into a Dataset object, I’ll organize it into a DataFrame for easier handling.

# Combine so we have one data object
data = train_data + test_data

# Put into a dataframe
df_data = pd.DataFrame(data)
df_data.columns = ["words", "labels"]
df_data
words labels
0 [me, gusta, comer, en, la, cafeteria] SPANISH
1 [Give, it, to, me] ENGLISH
2 [No, creo, que, sea, una, buena, idea] SPANISH
3 [No, it, is, not, a, good, idea, to, get, lost... ENGLISH
4 [Yo, creo, que, si] SPANISH
5 [it, is, lost, on, me] ENGLISH

Putting the data into a Dataset and reading it out with a DataLoader

Now it is time to put the data into a Dataset object. I referred to PyTorch’s tutorial on datasets and DataLoaders, and to this helpful example specific to custom text, especially for writing my own Dataset class, shown below.

class TextDataset(Dataset):
    """
    Characterizes the custom text dataset for PyTorch
    """

    def __init__(self, ids, text, labels):
        """
        Initialization. Ids can be useful after splitting the dataset.
        """
        self.ids = ids
        self.text = text
        self.labels = labels

    def __len__(self):
        """
        This is simply the number of labels in the dataset.
        """
        return len(self.labels)

    def __getitem__(self, idx):
        """
        Generate one sample of data
        """
        label = self.labels[idx]
        text = self.text[idx]
        sample = {"Text": text, "Label": label}
        return sample
# Put train and test into dataset objects
train_ids = range(0, 4)
test_ids = range(4, 6)

train_DS1 = TextDataset(
    train_ids,
    df_data.loc[train_ids, "words"].tolist(),
    df_data.loc[train_ids, "labels"].tolist(),
)

test_DS1 = TextDataset(
    test_ids,
    df_data.loc[test_ids, "words"].tolist(),
    df_data.loc[test_ids, "labels"].tolist(),
)
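
As an aside, here is a variation of my own (not from the tutorial or the linked example): the Dataset could instead hold the DataFrame itself and index it positionally with .iloc, which avoids converting the columns with .tolist(). A minimal sketch, assuming df_data as built above and a hypothetical class name:

class FrameTextDataset(Dataset):
    def __init__(self, df):
        # Reset the index so positional indices line up with __getitem__'s idx
        self.df = df.reset_index(drop=True)

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        return {"Text": row["words"], "Label": row["labels"]}


# e.g., train_DS_alt = FrameTextDataset(df_data.loc[list(train_ids)])

I’ll stick with the list-based TextDataset for the rest of this post.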

When using the list-based TextDataset above, it is important to convert the DataFrame columns with the .tolist() method; otherwise DataLoader will raise an error when retrieving the data. Now let’s use DataLoader and a simple for loop to print out the data. I’ll use only the training data and a batch_size of 1 for this purpose.

train_DL = DataLoader(train_DS1, batch_size=1, shuffle=False)

print("Batch size of 1")
for (idx, batch) in enumerate(train_DL):
    # Print the 'text' data of the batch
    print(idx, "Text data: ", batch["Text"])
    # Print the 'label' data of the batch
    print(idx, "Label data: ", batch["Label"])
Batch size of 1
0 Text data:  [('me',), ('gusta',), ('comer',), ('en',), ('la',), ('cafeteria',)]
0 Label data:  ['SPANISH']
1 Text data:  [('Give',), ('it',), ('to',), ('me',)]
1 Label data:  ['ENGLISH']
2 Text data:  [('No',), ('creo',), ('que',), ('sea',), ('una',), ('buena',), ('idea',)]
2 Label data:  ['SPANISH']
3 Text data:  [('No',), ('it',), ('is',), ('not',), ('a',), ('good',), ('idea',), ('to',), ('get',), ('lost',), ('at',), ('sea',)]
3 Label data:  ['ENGLISH']

At first glance, things might look okay, but the eagle-eyed will notice that each word in our sentence is now wrapped in its own single-element tuple. If we increase batch_size to 2, we get an ugly error.

train_DL2 = DataLoader(train_DS1, batch_size=2, shuffle=False)

print("Batch size of 2")
for (idx, batch) in enumerate(train_DL2):  # Print the 'text' data of the batch

    print(idx, "Text data: ", batch["Text"])  # Print the 'class' data of batch
    print(idx, "Label data: ", batch["Label"], "\n")
Batch size of 2



---------------------------------------------------------------------------

RuntimeError                              Traceback (most recent call last)

<ipython-input-9-b81921277760> in <module>
      2 
      3 print("Batch size of 2")
----> 4 for (idx, batch) in enumerate(train_DL2):  # Print the 'text' data of the batch
      5 
      6     print(idx, "Text data: ", batch["Text"])  # Print the 'class' data of batch


~/opt/anaconda3/envs/sdoh_text/lib/python3.8/site-packages/torch/utils/data/dataloader.py in __next__(self)
    515             if self._sampler_iter is None:
    516                 self._reset()
--> 517             data = self._next_data()
    518             self._num_yielded += 1
    519             if self._dataset_kind == _DatasetKind.Iterable and \


~/opt/anaconda3/envs/sdoh_text/lib/python3.8/site-packages/torch/utils/data/dataloader.py in _next_data(self)
    555     def _next_data(self):
    556         index = self._next_index()  # may raise StopIteration
--> 557         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    558         if self._pin_memory:
    559             data = _utils.pin_memory.pin_memory(data)


~/opt/anaconda3/envs/sdoh_text/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
     45         else:
     46             data = self.dataset[possibly_batched_index]
---> 47         return self.collate_fn(data)


~/opt/anaconda3/envs/sdoh_text/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py in default_collate(batch)
     71         return batch
     72     elif isinstance(elem, container_abcs.Mapping):
---> 73         return {key: default_collate([d[key] for d in batch]) for key in elem}
     74     elif isinstance(elem, tuple) and hasattr(elem, '_fields'):  # namedtuple
     75         return elem_type(*(default_collate(samples) for samples in zip(*batch)))


~/opt/anaconda3/envs/sdoh_text/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py in <dictcomp>(.0)
     71         return batch
     72     elif isinstance(elem, container_abcs.Mapping):
---> 73         return {key: default_collate([d[key] for d in batch]) for key in elem}
     74     elif isinstance(elem, tuple) and hasattr(elem, '_fields'):  # namedtuple
     75         return elem_type(*(default_collate(samples) for samples in zip(*batch)))


~/opt/anaconda3/envs/sdoh_text/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py in default_collate(batch)
     79         elem_size = len(next(it))
     80         if not all(len(elem) == elem_size for elem in it):
---> 81             raise RuntimeError('each element in list of batch should be of equal size')
     82         transposed = zip(*batch)
     83         return [default_collate(samples) for samples in transposed]


RuntimeError: each element in list of batch should be of equal size

What’s going on? After some investigation, which I’ll spare you, it appears that storing each sample’s sentence as a list confuses DataLoader’s default collate function: it tries to batch the lists element by element and fails because the sentences have different lengths. Let’s restructure our data differently.
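
As an aside, another way around this (my own workaround, not what I do below) is to keep the word lists and pass a custom collate_fn to DataLoader that returns the samples untouched, leaving any stacking to us. A minimal sketch, assuming train_DS1 as defined above:

# Hypothetical variable name; the lambda returns each batch as a plain list of sample dicts
train_DL_custom = DataLoader(
    train_DS1, batch_size=2, shuffle=False, collate_fn=lambda batch: batch
)

for idx, batch in enumerate(train_DL_custom):
    print(idx, [sample["Label"] for sample in batch])

Instead of going that route, I’ll restructure the data itself so that the default collate behavior works.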

Re-structuring data as a comma-separated string

Due to the structure of our model, we still need a way to vectorize each sentence sample, but we can’t keep each one wrapped as a list. Here is a workaround, even if the syntax is awkward: I’m rejoining the words into a single comma-separated string, like this:

", ".join("me gusta comer en la cafeteria".split())
'me, gusta, comer, en, la, cafeteria'
train_data2 = [
    (", ".join("me gusta comer en la cafeteria".split()), "SPANISH"),
    (", ".join("Give it to me".split()), "ENGLISH"),
    (", ".join("No creo que sea una buena idea".split()), "SPANISH"),
    (", ".join("No it is not a good idea to get lost at sea".split()), "ENGLISH"),
]

test_data2 = [
    (", ".join("Yo creo que si".split()), "SPANISH"),
    (", ".join("it is lost on me".split()), "ENGLISH"),
]
data2 = train_data2 + test_data2
df_data2 = pd.DataFrame(data2)
df_data2.columns = ["words", "labels"]

Here’s how the data looks.

df_data2
words labels
0 me, gusta, comer, en, la, cafeteria SPANISH
1 Give, it, to, me ENGLISH
2 No, creo, que, sea, una, buena, idea SPANISH
3 No, it, is, not, a, good, idea, to, get, lost,... ENGLISH
4 Yo, creo, que, si SPANISH
5 it, is, lost, on, me ENGLISH

Putting the restructured data into a Dataset and reading it out with a DataLoader

train_DS2 = TextDataset(
    train_ids,
    df_data2.loc[train_ids, "words"].tolist(),
    df_data2.loc[train_ids, "labels"].tolist(),
)
test_DS2 = TextDataset(
    test_ids,
    df_data2.loc[test_ids, "words"].tolist(),
    df_data2.loc[test_ids, "labels"].tolist(),
)
train_DL2a = DataLoader(train_DS2, batch_size=1, shuffle=False)

print("batch size of 1")
for (idx, batch) in enumerate(train_DL2a):
    print(idx, "Text data: ", batch["Text"])
    print(idx, "Label data: ", batch["Label"], "\n")
batch size of 1
0 Text data:  ['me, gusta, comer, en, la, cafeteria']
0 Label data:  ['SPANISH'] 

1 Text data:  ['Give, it, to, me']
1 Label data:  ['ENGLISH'] 

2 Text data:  ['No, creo, que, sea, una, buena, idea']
2 Label data:  ['SPANISH'] 

3 Text data:  ['No, it, is, not, a, good, idea, to, get, lost, at, sea']
3 Label data:  ['ENGLISH'] 

Great, this is much closer to the expected output: each sample is now a single string inside the list created by DataLoader. We still have to vectorize it before feeding it into our model, but we can worry about that later. Additionally, when we increase the batch_size, we no longer get an error.

train_DL2b = DataLoader(train_DS2, batch_size=2, shuffle=False)

print("batch size of 2")
for (idx, batch) in enumerate(train_DL2b):
    print(idx, "Text data: ", batch["Text"])
    print(idx, "Label data: ", batch["Label"], "\n")
batch size of 2
0 Text data:  ['me, gusta, comer, en, la, cafeteria', 'Give, it, to, me']
0 Label data:  ['SPANISH', 'ENGLISH'] 

1 Text data:  ['No, creo, que, sea, una, buena, idea', 'No, it, is, not, a, good, idea, to, get, lost, at, sea']
1 Label data:  ['SPANISH', 'ENGLISH'] 

We can also verify that this works for our test set in its own DataLoader object.

test_DL2b = DataLoader(test_DS2, batch_size=2, shuffle=False)

print("batch size of 2")
for (idx, batch) in enumerate(test_DL2b):
    print(idx, "Text data: ", batch["Text"])
    print(idx, "Label data: ", batch["Label"], "\n")
batch size of 2
0 Text data:  ['Yo, creo, que, si', 'it, is, lost, on, me']
0 Label data:  ['SPANISH', 'ENGLISH'] 

Train model using DataLoader objects

# word_to_ix maps each word in the vocab to a unique integer, which will be its
# index into the Bag of words vector
word_to_ix = {}
for sent, _ in data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
print(word_to_ix)

VOCAB_SIZE = len(word_to_ix)
NUM_LABELS = 2
{'me': 0, 'gusta': 1, 'comer': 2, 'en': 3, 'la': 4, 'cafeteria': 5, 'Give': 6, 'it': 7, 'to': 8, 'No': 9, 'creo': 10, 'que': 11, 'sea': 12, 'una': 13, 'buena': 14, 'idea': 15, 'is': 16, 'not': 17, 'a': 18, 'good': 19, 'get': 20, 'lost': 21, 'at': 22, 'Yo': 23, 'si': 24, 'on': 25}
sent = "me, gusta, comer"
sent.split(", ")
['me', 'gusta', 'comer']
class BoWClassifier(nn.Module):  # inheriting from nn.Module!
    def __init__(self, num_labels, vocab_size):
        # calls the init function of nn.Module. Don't get confused by the
        # syntax; just always do it in an nn.Module
        super(BoWClassifier, self).__init__()

        # Define the parameters that you will need.  In this case, we need A and b,
        # the parameters of the affine mapping.
        # Torch defines nn.Linear(), which provides the affine map.
        # Make sure you understand why the input dimension is vocab_size
        # and the output is num_labels!
        self.linear = nn.Linear(vocab_size, num_labels)

        # NOTE! The non-linearity log softmax does not have parameters! So we don't need
        # to worry about that here

    def forward(self, bow_vec):
        # Pass the input through the linear layer,
        # then pass that through log_softmax.
        # Many non-linearities and other functions are in torch.nn.functional
        return F.log_softmax(self.linear(bow_vec), dim=1)


def make_bow_vector(sentence, word_to_ix):
    """
    Edited from original to get words wrapped in a list back
    """
    sentence = sentence[0].split(", ")
    vec = torch.zeros(len(word_to_ix))
    for word in sentence:
        vec[word_to_ix[word]] += 1
    return vec.view(1, -1)


def make_target(label, label_to_ix):
    """
    Altered to extract label from list
    """
    return torch.LongTensor([label_to_ix[label[0]]])
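
Since both helpers now expect the list-wrapped values that DataLoader yields with a batch_size of 1, here is a quick sanity check of my own (not in the tutorial):

# My own check: mimic one batch from a DataLoader with batch_size=1
example_text = ["me, gusta, comer, en, la, cafeteria"]  # list-wrapped string
example_label = ["SPANISH"]  # list-wrapped label

bow = make_bow_vector(example_text, word_to_ix)
print(bow.shape)  # torch.Size([1, 26]): one row, one column per vocabulary word
print(make_target(example_label, {"SPANISH": 0, "ENGLISH": 1}))  # tensor([0])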

Batch size of 1

train_DL2a = DataLoader(train_DS2, batch_size=1, shuffle=False)
test_DL2a = DataLoader(test_DS2, batch_size=1, shuffle=False)
model = BoWClassifier(NUM_LABELS, VOCAB_SIZE)
for param in model.parameters():
    print(param)

Parameter containing:
tensor([[ 0.0544,  0.0097,  0.0716, -0.0764, -0.0143, -0.0177,  0.0284, -0.0008,
          0.1714,  0.0610, -0.0730, -0.1184, -0.0329, -0.0846, -0.0628,  0.0094,
          0.1169,  0.1066, -0.1917,  0.1216,  0.0548,  0.1860,  0.1294, -0.1787,
         -0.1865, -0.0946],
        [ 0.1722, -0.0327,  0.0839, -0.0911,  0.1924, -0.0830,  0.1471,  0.0023,
         -0.1033,  0.1008, -0.1041,  0.0577, -0.0566, -0.0215, -0.1885, -0.0935,
          0.1064, -0.0477,  0.1953,  0.1572, -0.0092, -0.1309,  0.1194,  0.0609,
         -0.1268,  0.1274]], requires_grad=True)
Parameter containing:
tensor([0.1191, 0.1739], requires_grad=True)

Note that the model parameters are randomly initialized to small, non-zero values so that gradient descent does not start off too slowly. Andrew Ng explains this point more fully in this video.
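
As a quick check of my own (assuming PyTorch’s default nn.Linear initialization, which draws weights and biases uniformly from ±1/√in_features), the values above should all fall within about ±0.196:

import math

# Hypothetical check, not part of the tutorial
bound = 1 / math.sqrt(VOCAB_SIZE)  # ~0.196 for our 26-word vocabulary
print(bound)
print(all(p.abs().max().item() < bound for p in model.parameters()))  # expect True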

label_to_ix = {"SPANISH": 0, "ENGLISH": 1}

Run on test data before we train, just to see a before-and-after

with torch.no_grad():
    for batch in test_DL2a:
        # Alter code from tutorial
        # for instance, label in test_data:
        instance, label = batch["Text"], batch["Label"]
        print(instance, label)

        bow_vec = make_bow_vector(instance, word_to_ix)
        log_probs = model(bow_vec)
        print(log_probs, "\n")

# Print the matrix column corresponding to "creo"
print(
    "Tensor for 'creo' (before training): ",
    next(model.parameters())[:, word_to_ix["creo"]],
)
['Yo, creo, que, si'] ['SPANISH']
tensor([[-0.9736, -0.4744]]) 

['it, is, lost, on, me'] ['ENGLISH']
tensor([[-0.7289, -0.6586]]) 

Tensor for 'creo' (before training):  tensor([-0.0730, -0.1041], grad_fn=<SelectBackward>)
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

for epoch in range(100):
    # for instance, label in data:

    for (idx, batch) in enumerate(train_DL2a):  # iterate over batches from the DataLoader
        instance, label = batch["Text"], batch["Label"]

        # Step 1. Remember that PyTorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step 2. Make our BOW vector and also we must wrap the target in a
        # Tensor as an integer. For example, if the target is SPANISH, then
        # we wrap the integer 0. The loss function then knows that the 0th
        # element of the log probabilities is the log probability
        # corresponding to SPANISH
        bow_vec = make_bow_vector(instance, word_to_ix)
        target = make_target(label, label_to_ix)

        # Step 3. Run our forward pass.
        log_probs = model(bow_vec)

        # Step 4. Compute the loss, gradients, and update the parameters by
        # calling optimizer.step()
        loss = loss_function(log_probs, target)
        loss.backward()
        optimizer.step()

        if (idx % 4 == 0) & (epoch % 20 == 0):  # Edit when datasets are bigger
            print(f"epoch: {epoch}, training sample: {idx}, loss = {loss.item():0.04f}")
epoch: 0, training sample: 0, loss = 0.8369
epoch: 20, training sample: 0, loss = 0.0507
epoch: 40, training sample: 0, loss = 0.0257
epoch: 60, training sample: 0, loss = 0.0172
epoch: 80, training sample: 0, loss = 0.0129

We see the loss decrease quickly and saturate by the end of the training epochs.

Evaluation after training

Look at the test set again, after model training.

with torch.no_grad():
    for batch in test_DL2a:
        # Alter code from tutorial
        # for instance, label in test_data:
        instance, label = batch["Text"], batch["Label"]
        print(instance, label)

        bow_vec = make_bow_vector(instance, word_to_ix)
        log_probs = model(bow_vec)
        print(log_probs, "\n")
['Yo, creo, que, si'] ['SPANISH']
tensor([[-0.2056, -1.6828]]) 

['it, is, lost, on, me'] ['ENGLISH']
tensor([[-2.7960, -0.0630]]) 
# Print the matrix column corresponding to "creo"
print(
    "Matrix for 'creo' (after training): ",
    next(model.parameters())[:, word_to_ix["creo"]],
)
Matrix for 'creo' (after training):  tensor([ 0.3702, -0.5473], grad_fn=<SelectBackward>)

We see that the coefficients for the Spanish word “creo” separate quite nicely relative to their initial values: the weight for the SPANISH class increased while the weight for the ENGLISH class decreased. The model training appears to have been successful.
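
To make that concrete, here is a small check of my own (not in the tutorial) that turns the log probabilities into predicted labels:

# My own addition: convert log probabilities to label predictions with argmax
ix_to_label = {0: "SPANISH", 1: "ENGLISH"}

with torch.no_grad():
    for batch in test_DL2a:
        instance, label = batch["Text"], batch["Label"]
        log_probs = model(make_bow_vector(instance, word_to_ix))
        predicted = ix_to_label[log_probs.argmax(dim=1).item()]
        print(f"true: {label[0]}, predicted: {predicted}")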

Summary

In this post, I sought to better understand how to use Dataset and DataLoader objects, especially in the context of model training. Working through this showed me where I had to restructure my data to get my code to work properly. Here, I trained with a batch size of 1 to mimic the original PyTorch tutorial. In a later post, I’ll write about how to take advantage of batching, which matters more for larger datasets.

Appendix: Environment and system parameters

%watermark -n -u -v -iv -w
Last updated: Thu Jun 24 2021

Python implementation: CPython
Python version       : 3.8.6
IPython version      : 7.22.0

numpy  : 1.19.5
torch  : 1.8.1
re     : 2.2.1
json   : 2.0.9
seaborn: 0.11.1
pandas : 1.2.1

Watermark: 2.1.0