Auto Label Pipeline

A practical ML pipeline for data labeling with experiment tracking using DVC.

Overview

Goals:

  • Demonstrate reproducible ML
  • Use DVC to build a pipeline and track experiments
  • Automatically clean noisy data labels using Cleanlab and cross-validation
  • Determine which FastText subword embedding performs better for semi-supervised cluster classification
  • Determine optimal hyperparameters through experiment tracking
  • Prepare casually labeled data for human evaluation

Demo: view experiments recorded in git branches (asciicast recording).

The Data

For our working demo, we will purify some of the slightly noisy/dirty labels found in Wikidata people entries for the Employer and Occupation attributes. Our initial data labels were harvested from a JSON dump of Wikidata, the Kensho Wikidata dataset, and this notebook script for extracting the data.

Data Input Format

Tab-separated CSV files, with the fields:

  • text_data - the item to be labeled (a single word or short group of words)
  • class_type - the class label
  • context - any text that surrounds the text_data field in situ, or that otherwise defines the text_data item
  • count - the number of occurrences of this label; how common it is in the existing data
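
For reference, a minimal sketch of loading one of these files with pandas (this assumes the files carry no header row, which may not hold):

import pandas as pd

# Tab-separated input with the four fields described above.
df = pd.read_csv(
    "data/raw/occupations.wikidata.csv",
    sep="\t",
    header=None,
    names=["text_data", "class_type", "context", "count"],
)
print(df.head())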

Data Output Format

Same fields as the data input, plus:

  • date_updated - when the label was updated
  • previous_class_type - the previous class_type label
  • mislabeled_rank - records how low the confidence was prior to a relabel
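
For illustration, one output row in this format might be built like this (all field values, including the class names, are made up):

from datetime import datetime, timezone

import pandas as pd

# One relabeled row in the output format (values are illustrative).
row = {
    "text_data": "software engineer",
    "class_type": "OCCUPATION",
    "context": "works as a software engineer at ...",
    "count": 42,
    "date_updated": datetime.now(timezone.utc).isoformat(),
    "previous_class_type": "EMPLOYER",
    "mislabeled_rank": 3,
}
out = pd.DataFrame([row])
out.to_csv("data/final/data.csv", sep="\t", index=False)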

The Pipeline

  • Fetch
  • Prepare
  • Train
  • Relabel

For details, see the README in the src folder. The pipeline is orchestrated via the dvc.yaml file, and parameterized via params.yaml.
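
To make the Train and Relabel stages concrete, here is a simplified sketch of the underlying approach (fastText sentence embeddings, an SVM with cross-validated probabilities, and Cleanlab's label-issue filter); it is an illustration under assumed file formats and parameters, not the project's exact code:

import fasttext
import numpy as np
import pandas as pd
from cleanlab.filter import find_label_issues
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

# Prepared data, assumed tab-separated with the input fields described above.
df = pd.read_csv("data/prepared/data.all.csv", sep="\t")

# Train: embed each text_data item with a fastText subword model.
ft = fasttext.load_model("data/raw/cc.en.300.bin")
X = np.vstack([ft.get_sentence_vector(str(t)) for t in df["text_data"]])
y = LabelEncoder().fit_transform(df["class_type"])

# Out-of-sample class probabilities via cross-validation.
clf = SVC(kernel="rbf", probability=True)
pred_probs = cross_val_predict(clf, X, y, cv=5, method="predict_proba")

# Relabel: let Cleanlab rank the likely label errors.
issue_idx = find_label_issues(
    labels=y, pred_probs=pred_probs, return_indices_ranked_by="self_confidence"
)
print(df.iloc[issue_idx].head())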

Using/Extending the pipeline

  1. Drop your own CSV files into the data/raw directory
  2. Run the pipeline
  3. Tune settings, embeddings, etc., until no longer amused
  4. Verify your results manually and by submitting data/final/data.csv for human evaluation, using random sampling and drawing heavily from the mislabeled_rank entries (see the sketch below)
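
A hedged sketch of drawing such an evaluation sample (the review-file name and sample size are illustrative):

import pandas as pd

df = pd.read_csv("data/final/data.csv", sep="\t")

# Everything flagged during relabeling, plus a random draw from the rest.
flagged = df[df["mislabeled_rank"].notna()]
background = df[df["mislabeled_rank"].isna()].sample(
    n=min(200, int(df["mislabeled_rank"].isna().sum())), random_state=0
)
review_set = pd.concat([flagged, background]).sample(frac=1, random_state=0)
review_set.to_csv("data/final/review_sample.csv", sep="\t", index=False)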

Project Structure

├── LICENSE
├── README.md
├── data                    # <-- Directory with all types of data
│ ├── final                 # <-- Directory with final data
│ │ ├── class.metrics.csv   # <-- Class metrics after relabeling (confusion matrix data)
│ │ └── data.csv            # <-- Pipeline output (not stored in git)
│ ├── interim               # <-- Directory with temporary data
│ │ ├── datafile.0.csv
│ │ └── datafile.1.csv
│ ├── prepared              # <-- Directory with prepared data
│ │ └── data.all.csv
│ └── raw                   # <-- Directory with raw data; populated by pipeline's fetch stage
│     ├── README.md
│     ├── cc.en.300.bin               # <-- fastText binary model file (Common Crawl + Wikipedia)
│     ├── crawl-300d-2M-subword.bin   # <-- fastText binary model file (Common Crawl, subword)
│     ├── crawl-300d-2M-subword.vec
│     ├── employers.wikidata.csv      # <-- Our initial data, 1 set of class labels 
│     ├── lid.176.ftz                 # <-- fastText language identification model
│     └── occupations.wikidata.csv    # <-- Our initial data, 1 set of class labels
├── dvc.lock                # <-- DVC internal state tracking file
├── dvc.yaml                # <-- DVC project configuration file
├── dvc_plots               # <-- Temp directory for DVC plots; not tracked by git
│ └── README.md
├── model
│ ├── class.metrics.csv
│ ├── svm.model.pkl
│ └── train.metrics.json    # <-- Metrics from the pipeline's train stage  
├── mypy.ini
├── params.yaml             # <-- Parameter configuration file for the pipeline
├── reports                 # <-- Directory with metrics output
│ ├── prepare.metrics.json  
│ └── relabel.metrics.json
├── requirements-dev.txt
├── requirements.txt
├── runUnitTests.sh
└── src                     # <-- Directory containing the pipeline's code
    ├── README.md
    ├── fetch.py
    ├── prepare.py
    ├── relabel.py
    ├── train.py
    └── utils.py

Setup

Create environment

conda create --name auto-label-pipeline python=3.9

conda activate auto-label-pipeline

Install requirements

pip install -r requirements.txt

If you're going to modify the source, also install the development requirements:
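
pip install -r requirements-dev.txt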


Reproduce the pipeline results locally

dvc repro

View Metrics

dvc metrics show

See also: DVC metrics

Working with Experiments

To see your local experiments:

dvc exp show

Experiments that have been turned into branches can be referenced directly in commands, e.g. to compare two experiments:

dvc exp diff [experiment branch name] [experiment branch 2 name]

e.g.:

dvc exp diff svc_linear_ex svc_rbf_ex

dvc exp diff svc_poly_ex svc_rbf_ex

To create an experiment by changing a parameter:

dvc exp run --set-param train.split=0.9 --name my_split_ex

To save and share your experiment in a branch:

dvc exp branch my_split_ex my_split_ex_branch

Note: when promoting an experiment to a branch, DVC does not switch into the new branch.

See also: DVC Experiments

View plots

Initial confusion matrix:

dvc plots show model/class.metrics.csv -x actual -y predicted --template confusion

Confusion matrix after relabeling:

dvc plots show data/final/class.metrics.csv -x actual -y predicted --template confusion

See also: DVC plots


Conclusions

  • For relabeling and cleaning, it's important to have more than two labels, and to specify an UNK label for unknowns, labels spanning multiple groups, or low-confidence support.
  • Standardizing the input data formats allows users to flexibly use many different data sources.
  • Language detection is an important part of data cleaning, but it is problematic because:
    • Modern languages sometimes "borrow" words from other languages (but not just any words!)
    • Language detection models perform inference poorly with limited data, especially on a single word (see the sketch below).
    • Normalization utilities such as unidecode aren't helpful; the wrong word in more readable letters is still the wrong word.
  • Experimentation parameters often have co-dependencies that would make a simple combinatorial grid search inefficient.
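
For example, fastText's language-identification model (fetched into data/raw by the pipeline) shows the single-word problem; the exact outputs will vary:

import fasttext

lid = fasttext.load_model("data/raw/lid.176.ftz")

# Single words often yield low-confidence, unstable guesses...
print(lid.predict("engineer", k=3))
# ...while a short phrase gives the model something to work with.
print(lid.predict("software engineer at a large firm", k=1))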

Recommended readings:

  • Confident Learning: Estimating Uncertainty in Dataset Labels, Curtis G. Northcutt, Lu Jiang, Isaac L. Chuang, 31 Oct 2019, arXiv
  • A Simple but Tough-to-Beat Baseline for Sentence Embeddings, Sanjeev Arora, Yingyu Liang, Tengyu Ma, ICLR 2017, paper
  • Support Vector Clustering, Asa Ben-Hur, David Horn, Hava T. Siegelmann, Vladimir Vapnik, Journal of Machine Learning Research 2 (2001) 125-137, DOI:10.1162/15324430260185565, paper
  • SVM Clustering, Winters-Hilt, S., Merat, S., BMC Bioinformatics 8, S18 (2007), link, paper

Note: this repo layout borrows heavily from the Cookie Cutter Data Science layout. If you're not familiar with it, please check it out.

Owner

Todd Cook, software craftsman