Kaggle | 9th place (part of) solution for the Bristol-Myers Squibb – Molecular Translation challenge

Last update: Nov 30, 2022

Related tags

Overview

Part of the 9th place solution for the Bristol-Myers Squibb – Molecular Translation challenge translating images containing chemical structures into InChI (International Chemical Identifier) texts.

This repo is partially based on the following resources:

Y.Nakama's tokenization
Heng's transformer decoder
Sam Stainsby's external images creation updated by ZFTurbo

Requirements

install and activate the conda environment
download and extract the data into /data/bms/
extract and move sample_submission_with_length.csv.gz into /data/bms/
tokenize training inputs: python datasets/prepocess2.py
if you want to use pseudo labeling, execute: python datasets/pseudo_prepocess2.py your_submission_file.csv
if you want to use external images, you can create with the following commands:

python r09_create_images_from_allowed_inchi.py
python datasets/extra_prepocess2.py

and also install apex

Training

This repo supports training any VIT/SWIN/CAIT transformer models from timm as encoder together with the fairseq transformer decoder.

Here is an example configuration to train a SWIN swin_base_patch4_window12_384 as encoder and 12 layer 16 head fairseq decoder:

python -m torch.distributed.launch --nproc_per_node=N train.py --logdir=logdir/ \
    --pipeline --train-batch-size=50 --valid-batch-size=128 --dataload-workers-nums=10 --mixed-precision --amp-level=O2  \
    --aug-rotate90-p=0.5 --aug-crop-p=0.5 --aug-noise-p=0.9 --label-smoothing=0.1 \
    --encoder-lr=1e-3 --decoder-lr=1e-3 --lr-step-ratio=0.3 --lr-policy=step --optim=adam --lr-warmup-steps=1000 --max-epochs=20 --weight-decay=0 --clip-grad-norm=1 \
    --verbose --image-size=384 --model=swin_base_patch4_window12_384 --loss=ce --embed-dim=1024 --num-head=16 --num-layer=12 \
    --fold=0 --train-dataset-size=0 --valid-dataset-size=65536 --valid-dataset-non-sorted

For pseudo labeling, use --pseudo=pseudo.pkl. If you want subsample the pseudo dataset, use: --pseudo-dataset-size=448000. For using external images, use --extra (--extra-dataset-size=448000).

After training, you can also use Stochastic Weight Averaging (SWA) which gives a boost around 0.02:

python swa.py --image-size=384 --input logdir/epoch-17.pth,logdir/epoch-18.pth,logdir/epoch-19.pth,logdir/epoch-20.pth

Inference

Evaluation:

python -m torch.distributed.launch --nproc_per_node=N eval.py --mixed-precision --batch-size=128 swa_model.pth

Inference:

python -m torch.distributed.launch --nproc_per_node=N inference.py --mixed-precision --batch-size=128 swa_model.pth

Normalization with RDKit:

./normalize_inchis.sh submission.csv

Kaggle | 9th place (part of) solution for the Bristol-Myers Squibb – Molecular Translation challenge

Related tags

Overview

Requirements

Training

Inference

Owner

Erdene-Ochir Tuguldur

A python module for configuration of block devices

Project to create an open-source 6 DoF input device

Implementation of "Learning Multi-Granular Hypergraphs for Video-Based Person Re-Identification"

StyleGAN2-ADA-training-jupyter - Training custom datasets in styleGAN2-ADA by NVIDIA using Jupyter

Codebase for the solution that won first place and was awarded the most human-like agent in the 2021 NeurIPS Competition MineRL BASALT Challenge.

A learning-based data collection tool for human segmentation

Automatically creates genre collections for your Plex media

Safe Control for Black-box Dynamical Systems via Neural Barrier Certificates

An implementation of IMLE-Net: An Interpretable Multi-level Multi-channel Model for ECG Classification

This is the official PyTorch implementation of our paper: "Artistic Style Transfer with Internal-external Learning and Contrastive Learning".

minimizer-space de Bruijn graphs (mdBG) for whole genome assembly

Project code for weakly supervised 3D object detectors using wide-baseline multi-view traffic camera data: WIBAM.

Official implementation of the paper Visual Parser: Representing Part-whole Hierarchies with Transformers

Code for Quantifying Ignorance in Individual-Level Causal-Effect Estimates under Hidden Confounding

(CVPR 2022) Energy-based Latent Aligner for Incremental Learning

[ICCV 2021 Oral] NerfingMVS: Guided Optimization of Neural Radiance Fields for Indoor Multi-view Stereo

Ppq - A powerful offline neural network quantization tool with custimized IR

Dynamic Divide-and-Conquer Adversarial Training for Robust Semantic Segmentation （ICCV2021）

Implement object segmentation on images using HOG algorithm proposed in CVPR 2005

OpenLT: An open-source project for long-tail classification