Installation:

pip install lm_dataloader

Design Philosophy

A library to unify language model (LM) data loading at large scale.
Simple interface; any tokenizer can be integrated.
Minimal changes needed to go from small to large scale (many multi-GPU nodes).
Follows the 'mmap' data format used by fairseq / Megatron, with the following improvements:
- Easily combine multiple datasets
- Easily split a dataset into train / val / test splits
- Easily build a weighted dataset out of a list of existing ones along with weights.
- Unified into a single 'file' (actually a directory containing a .bin file and an .idx file)
- Index files that are built on the fly are hidden files, leaving less clutter in the directory.
- More straightforward interface, better documentation.
- Inspectable with a command line tool
- Can load from URLs
- Can load from S3 buckets
- Can load from GCS buckets
- Can tokenize on the fly instead of preprocessing
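As a rough illustration of what "a weighted dataset out of a list of existing ones" can mean, the sketch below builds a flat sampling index over several datasets in proportion to their weights. It is not lm_dataloader's implementation or API: the function name `build_weighted_index` and its parameters are hypothetical, and only the Python standard library is used.

```python
import random


def build_weighted_index(sizes, weights, num_samples, seed=0):
    """Hypothetical helper, not part of lm_dataloader.

    sizes:       number of samples in each source dataset
    weights:     relative sampling weight of each source dataset
    num_samples: length of the blended dataset
    Returns a list of (dataset_id, sample_id) pairs.
    """
    if len(sizes) != len(weights):
        raise ValueError("sizes and weights must have the same length")
    rng = random.Random(seed)
    # Pick a source dataset for every slot, proportionally to its weight.
    picks = rng.choices(range(len(sizes)), weights=weights, k=num_samples)
    counters = [0] * len(sizes)
    index = []
    for dataset_id in picks:
        # Walk through each source in order, wrapping around when exhausted.
        index.append((dataset_id, counters[dataset_id] % sizes[dataset_id]))
        counters[dataset_id] += 1
    return index


if __name__ == "__main__":
    # Two datasets of 10 and 20 samples, blended 30% / 70%.
    blended = build_weighted_index([10, 20], [0.3, 0.7], num_samples=8)
    print(blended)
```

Precomputing the index keeps the blend deterministic for a given seed, which matters when several workers need to agree on the same sample order.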

Misc. TODO:
- [ ] Option to set mpu globally (for distributed dataloading)

Example usage

To tokenize a dataset contained in a .jsonl file (where the text to be tokenized can be accessed under the 'text' key):

import lm_dataloader as lmdl
from transformers import GPT2TokenizerFast 

jsonl_path = "test.jsonl"
output = "my_dataset.lmd"
tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')

lmdl.encode(
    jsonl_path,
    output_prefix=output,
    tokenize_fn=tokenizer.encode,
    tokenizer_vocab_size=len(tokenizer),
    eod_token=tokenizer.eos_token_id,
)

This will create a dataset at "my_dataset.lmd" which can be loaded as an indexed torch dataset like so:

from lm_dataloader import LMDataset
from transformers import GPT2TokenizerFast 

tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')
seq_length = tokenizer.model_max_length # or whatever the sequence length of your model is

dataset = LMDataset("my_dataset.lmd", seq_length=seq_length)

# peek at 0th index
print(dataset[0])
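To try the tokenization example without an existing corpus, an input file of the expected shape (one JSON object per line, with the text under the 'text' key) can be generated with the standard library alone. This is a minimal sketch: the helper names `write_jsonl` and `read_texts` are hypothetical and not part of lm_dataloader.

```python
import json
import os
import tempfile


def write_jsonl(path, texts):
    """Write one JSON object per line, storing each document under 'text'."""
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            f.write(json.dumps({"text": text}) + "\n")


def read_texts(path):
    """Read the documents back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)["text"] for line in f if line.strip()]


if __name__ == "__main__":
    # Write a tiny two-document corpus to a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test.jsonl")
        write_jsonl(path, ["Hello world.", "A second document."])
        print(read_texts(path))
```

A file written this way can then be passed as `jsonl_path` in the encoding example.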

Command line utilities

Command line utilities are also provided to inspect and merge datasets, e.g.:

lm-dataloader inspect my_dataset.lmd

This launches an interactive terminal for inspecting the data in my_dataset.lmd.

And:

lm-dataloader merge my_dataset.lmd,my_dataset_2.lmd new_dataset.lmd

This merges the datasets at "my_dataset.lmd" and "my_dataset_2.lmd" into a new dataset at "new_dataset.lmd".

Dataloader tools for language modelling
