Large dataset storage format for PyTorch

Last update: Oct 22, 2022

Overview

H5Record

Large dataset (> 100 GB, <= 1 TB) storage format for PyTorch (work in progress)

Supports Python 3

pip install h5record

Why?

Storing large datasets is still a wild west in PyTorch. Approaches seen in the wild include:
- a large directory with lots of small files: IO is slow when complex files are fetched and deserialized frequently
- a database: behavior depends on the database engine used, and multi-process reads are usually not supported
With both approaches, storage size scales non-linearly as the data grows.
TFRecord solves these problems well: it offers multi-process fetching, (de)compression, and fast serialization (protobuf).
However, the TFRecord port does not support querying the dataset size (used frequently by DataLoader) and offers no index-level access (important for data evaluation or verification).
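
To make the two gaps above concrete, here is a minimal sketch in plain Python (no PyTorch or H5Record dependency). `StreamOnlyRecords` and `IndexedRecords` are hypothetical names invented for this illustration; they are not part of TFRecord, H5Record, or PyTorch. The point is only that a sequential stream must be read end to end to learn its size, while an indexed store can answer `len()` and return row `i` directly, which is what a map-style dataset is expected to provide.

```python
from typing import Iterator, List, Tuple

Row = Tuple[str, str, float]


class StreamOnlyRecords:
    """Hypothetical sequential reader: rows can only be iterated in order."""

    def __init__(self, rows: List[Row]) -> None:
        self._rows = rows

    def __iter__(self) -> Iterator[Row]:
        return iter(self._rows)


class IndexedRecords:
    """Hypothetical indexed store: size and random access are both cheap."""

    def __init__(self, rows: List[Row]) -> None:
        self._rows = list(rows)

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(self, idx: int) -> Row:
        return self._rows[idx]


rows = [("Sent 1.", "Sent 2", 0.1), ("Sent 3", "Sent 4", 0.2)]
stream = StreamOnlyRecords(rows)
indexed = IndexedRecords(rows)

# A stream has no len(); counting it means reading every row.
stream_size = sum(1 for _ in stream)

# An indexed store answers both questions without a full pass.
indexed_size = len(indexed)
second_row = indexed[1]
```

Checking a single suspicious row, or reporting the epoch length to a progress bar, costs a full scan in the first case and a single lookup in the second.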

H5Record aims to address TFRecord's shortcomings by compressing the dataset into an HDF5 file and exposing it through an easy-to-use interface built on predefined field types (String, Image, Sequences, Integer).

Some advantages of using H5Record:

- Supports multi-process reads
- Relatively simple to use, with low technical debt
- Supports compression and decompression on the fly
- Can be loaded into memory quickly if required

Simple usage

pip install h5record

Sentence Similarity

from h5record import H5Dataset, Float, String

schema = (
    String(name='sentence1'),
    String(name='sentence2'),
    Float(name='label')
)
data = [
    ['Sent 1.', 'Sent 2', 0.1],
    ['Sent 3', 'Sent 4', 0.2],
]

def pair_iter():
    for row in data:
        yield {
            'sentence1': row[0],
            'sentence2': row[1],
            'label': row[2]
        }

dataset = H5Dataset(schema, './question_pair.h5', pair_iter())
for idx in range(len(dataset)):
    print(dataset[idx])

Note

Because development is still in progress, use this package with care on storage formatted as FAT or FAT32.

Comparison between different compression algorithms

No chunking is used.

Compression type	File size	Read speed (rows/second)
no compression	2.0G	2084.55 it/s
lzf	1.7G	1496.14 it/s
gzip	1.1G	843.78 it/s

Benchmarked on an i7-9700 with a 1TB NVMe SSD.
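
The size versus read-speed trade-off in the table can be explored with a small sketch like the one below. It is not the benchmark that produced the numbers above: it uses only the Python standard library, with `zlib` standing in for gzip-style compression (lzf is not in the standard library), synthetic rows, and per-row compression in memory rather than an HDF5 file. `make_rows` and `benchmark` are hypothetical helper names, and the absolute figures will differ from machine to machine.

```python
import json
import time
import zlib


def make_rows(n):
    """Build n synthetic sentence-pair rows, each serialized to bytes."""
    rows = []
    for i in range(n):
        record = {
            "sentence1": "the quick brown fox jumps over the lazy dog " * 10,
            "sentence2": f"row {i} " * 10,
            "label": i / n,
        }
        rows.append(json.dumps(record).encode("utf-8"))
    return rows


def benchmark(rows, level=None):
    """Return (stored_bytes, rows_per_second); level=None stores rows uncompressed."""
    if level is None:
        stored = rows
    else:
        stored = [zlib.compress(row, level) for row in rows]
    stored_bytes = sum(len(blob) for blob in stored)

    start = time.perf_counter()
    for blob in stored:
        raw = blob if level is None else zlib.decompress(blob)
        json.loads(raw)  # decoding is part of the cost of reading a row
    elapsed = max(time.perf_counter() - start, 1e-9)
    return stored_bytes, len(rows) / elapsed


rows = make_rows(2000)
raw_size, raw_speed = benchmark(rows)
fast_size, fast_speed = benchmark(rows, level=1)
small_size, small_speed = benchmark(rows, level=9)

for name, size, speed in [
    ("no compression", raw_size, raw_speed),
    ("zlib level 1", fast_size, fast_speed),
    ("zlib level 9", small_size, small_speed),
]:
    print(f"{name}\t{size} bytes\t{speed:.2f} rows/s")
```

On repetitive text like this, the compressed variants should be markedly smaller, with decompression adding work to every read. Whether that cost shows up as lower throughput depends on how fast the storage is relative to the CPU.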

If you are interested in learning more, feel free to check out the note as well!



Releases (1.0.4)

1.0.4 (Jun 8, 2021)
Minor bug fix

1.0.3 (Jun 6, 2021)
Support for image sequence, float16 sequence, float sequence and float16 datatype
Fix bugs

1.0.1 (Jun 5, 2021)


Owner

theblackcat102
