A port of muP to JAX/Haiku

Last update: Dec 30, 2022

Overview

MUP for Haiku

This is a (very preliminary) port of Yang and Hu et al.'s μP repo to Haiku and JAX. It's not feature-complete, and I'm very open to suggestions on improving its usability.

Installation

pip install haiku-mup

Learning rate demo

These plots show the evolution of the optimal learning rate for a 3-hidden-layer MLP on MNIST, trained for 10 epochs (5 trials per lr/width combination).

With standard parameterization, the learning rate optimum (w.r.t. training loss) continues changing as the width increases, but μP keeps it approximately fixed:

Here's the same kind of plot for 3-layer transformers on the Penn Treebank, this time showing validation loss instead of training loss, and scaling the number of heads and the embedding dimension simultaneously:

Note that the optima have the same value for n_embd=80. That's because the other hyperparameters were tuned using a standard parameterization (SP) model with that width, so this shouldn't be biased in favor of μP.
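
As a rough illustration of how a sweep like this can be summarized, the sketch below picks the best learning rate per width from a table of per-trial losses. The function name `best_lr_per_width` and the layout of `results` are hypothetical and not part of haiku-mup. Under μP one would expect the selected learning rate to stay roughly the same across widths.

```python
from collections import defaultdict
from statistics import mean


def best_lr_per_width(results):
    """Return {width: lr} minimizing the mean loss over trials.

    `results` maps (width, lr) -> list of final losses, one per trial.
    This layout is an assumption made for illustration only.
    """
    mean_loss = defaultdict(dict)
    for (width, lr), losses in results.items():
        mean_loss[width][lr] = mean(losses)
    # For each width, keep the learning rate with the lowest mean loss.
    return {
        width: min(by_lr, key=by_lr.get)
        for width, by_lr in mean_loss.items()
    }
```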

Usage

from functools import partial

import jax
import jax.numpy as jnp
import haiku as hk
import optax

from haiku_mup import apply_mup, Mup, Readout

class MyModel(hk.Module):
    def __init__(self, width, n_classes=10):
        super().__init__(name='model')
        self.width = width
        self.n_classes = n_classes

    def __call__(self, x):
        x = hk.Linear(self.width)(x)
        x = jax.nn.relu(x)
        return Readout(self.n_classes)(x) # 1. Replace output layer with Readout layer

def fn(x, width=100):
    with apply_mup(): # 2. Modify parameter creation with apply_mup()
        return MyModel(width)(x)

mup = Mup()

init_input = jnp.zeros(123)
base_model = hk.transform(partial(fn, width=1))

with mup.init_base(): # 3. Use this context manager when initializing the base model
    base_model.init(jax.random.PRNGKey(0), init_input)

model = hk.transform(fn)

with mup.init_target(): # 4. Use this context manager when initializing the target model
    params = model.init(jax.random.PRNGKey(0), init_input)

model = mup.wrap_model(model) # 5. Modify your model with Mup

optimizer = optax.adam(3e-4)
optimizer = mup.wrap_optimizer(optimizer, adam=True) # 6. Use wrap_optimizer to get layer-specific learning rates

# Now the model can be trained as normal

Summary

1. Replace output layers with Readout layers
2. Modify parameter creation with the apply_mup() context manager
3. Initialize a base model inside a Mup.init_base() context
4. Initialize the target model inside a Mup.init_target() context
5. Wrap the model with Mup.wrap_model
6. Wrap optimizer with Mup.wrap_optimizer
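
The reason a base model is initialized before the target can be sketched conceptually: comparing parameter shapes between the two models shows which dimensions scale with width and by how much, and μP uses ratios of this kind to rescale initializations and per-layer learning rates. The code below is not haiku-mup's internal logic. `width_multipliers`, the parameter names, and the shapes are all hypothetical.

```python
def width_multipliers(base_shapes, target_shapes):
    """Return {name: target fan-in / base fan-in} for each parameter.

    Shapes are (fan_in, fan_out) tuples keyed by parameter name. A ratio
    of 1.0 means the fan-in is the same in both models.
    """
    mults = {}
    for name, target in target_shapes.items():
        base = base_shapes[name]
        mults[name] = target[0] / base[0]
    return mults


# Illustrative shapes only: a base model of width 1 and a target of width 100.
base = {"input/w": (123, 1), "hidden/w": (1, 1), "readout/w": (1, 10)}
target = {"input/w": (123, 100), "hidden/w": (100, 100), "readout/w": (100, 10)}
print(width_multipliers(base, target))
# {'input/w': 1.0, 'hidden/w': 100.0, 'readout/w': 100.0}
```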

Shared Input/Output embeddings

If you want to use the input embedding matrix as the output layer's weight matrix, make the following two replacements:

# old: embedding_layer = hk.Embed(*args, **kwargs)
# new:
embedding_layer = haiku_mup.SharedEmbed(*args, **kwargs)
input_embeds = embedding_layer(x)

# old: output = hk.Linear(n_classes)(x)
# new:
output = haiku_mup.SharedReadout()(embedding_layer.get_weights(), x)
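
For background, the general idea of sharing input and output embeddings (weight tying) can be sketched in plain JAX. This is not how `SharedEmbed` or `SharedReadout` are implemented, and any μP-specific scaling those classes apply is omitted. The function names and toy sizes are assumptions for illustration.

```python
import jax
import jax.numpy as jnp


def embed(embedding_matrix, token_ids):
    """Look up input embeddings: (batch,) -> (batch, d_model)."""
    return embedding_matrix[token_ids]


def tied_readout(embedding_matrix, hidden):
    """Project hidden states to vocabulary logits with the same matrix.

    embedding_matrix: (vocab_size, d_model), hidden: (batch, d_model).
    """
    return hidden @ embedding_matrix.T


key = jax.random.PRNGKey(0)
E = jax.random.normal(key, (50, 8))  # toy vocabulary of 50, d_model of 8
tokens = jnp.array([3, 7, 42])
logits = tied_readout(E, embed(E, tokens))
print(logits.shape)  # (3, 50)
```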
