Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds (CVPR 2022)

Last update: Dec 30, 2022

Overview

Authors: Chenhang He, Ruihuang Li, Shuai Li, Lei Zhang.

This project is built on OpenPCDet.

Updates

2022-04-09: Added Waymo config and multi-frame input.

The performance of VoxSeT (single-stage, single-frame) on the Waymo validation split is as follows.

Difficulty	% Training	Car AP/APH	Ped AP/APH	Cyc AP/APH	Log file
Level 1	20%	72.10/71.59	77.94/69.58	69.88/68.54	Download
Level 2	20%	63.62/63.17	70.20/62.51	67.31/66.02
Level 1	100%	74.50/74.03	80.03/72.42	71.56/70.29	Download
Level 2	100%	65.99/65.56	72.45/65.39	68.95/67.73

Introduction

Transformers have demonstrated promising performance in many 2D vision tasks. However, computing self-attention on large-scale point cloud data is cumbersome because a point cloud is a long sequence that is unevenly distributed in 3D space. To address this, existing methods usually compute self-attention locally by grouping the points into clusters of the same size, or perform convolutional self-attention on a discretized representation. The former results in stochastic point dropout, while the latter typically has narrow attention fields. In this paper, we propose a novel voxel-based architecture, namely Voxel Set Transformer (VoxSeT), to detect 3D objects from point clouds by means of set-to-set translation. VoxSeT is built upon a voxel-based set attention (VSA) module, which reduces the self-attention in each voxel to two cross-attentions and models features in a hidden space induced by a group of latent codes. With the VSA module, VoxSeT can manage voxelized point clusters of arbitrary size over a wide range and process them in parallel with linear complexity. The proposed VoxSeT combines the high performance of transformers with the efficiency of voxel-based models, and can serve as a good alternative to convolutional and point-based backbones.
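The idea of replacing per-voxel self-attention with two cross-attentions through a small set of latent codes can be illustrated with a short NumPy sketch. This is a simplified illustration under stated assumptions, not the VSA module from this repository: it handles a single voxel, omits the learned projections and the processing done in the hidden space, and the function names are made up for this example.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Numerically stable softmax along the given axis.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Scaled dot-product attention where keys and values are the same set.
    d = queries.shape[-1]
    weights = softmax(queries @ keys_values.T / np.sqrt(d))
    return weights @ keys_values

def induced_set_attention(points, latent_codes):
    # Step 1: k latent codes attend to the n points -> hidden set of size k.
    hidden = cross_attention(latent_codes, points)   # shape (k, d)
    # Step 2: the n points attend back to the hidden set -> output of size n.
    return cross_attention(points, hidden)           # shape (n, d)

rng = np.random.default_rng(0)
latent_codes = rng.normal(size=(8, 16))  # k = 8 codes, d = 16 (assumed sizes)
for n in (3, 50):                        # voxels may hold any number of points
    voxel_points = rng.normal(size=(n, 16))
    out = induced_set_attention(voxel_points, latent_codes)
    print(n, out.shape)
```

Both attention steps cost O(n * k) for a voxel with n points and k latent codes, so with k fixed the cost grows linearly in n, whereas full self-attention would cost O(n^2).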

1. Recommended Environment

Linux (tested on Ubuntu 16.04)
Python 3.7
PyTorch 1.9 or higher (tested on PyTorch 1.10.1)
CUDA 9.0 or higher (tested on CUDA 10.2)
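
To compare a local setup against the list above, a small script along these lines may help. It is a sketch: `report_environment` is a helper name invented for this example, and it reports `None` for PyTorch and CUDA when `torch` is not installed.

```python
import importlib
import platform

def report_environment():
    # Collect the versions that the recommended environment refers to.
    info = {
        "os": platform.system(),
        "python": platform.python_version(),
        "torch": None,
        "cuda": None,
    }
    try:
        torch = importlib.import_module("torch")
        info["torch"] = torch.__version__
        # CUDA version PyTorch was built with (None for CPU-only builds).
        info["cuda"] = torch.version.cuda
    except ImportError:
        pass
    return info

print(report_environment())
```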

2. Set the Environment

pip install -r requirement.txt
python setup.py build_ext --inplace

The torch_scatter package is also required.
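
After installation, one way to confirm that the required packages can be found is to query the import system. The helper name `missing_packages` is illustrative; the package names checked are the ones this README mentions (`torch_scatter`, and `pcdet` from the build step).

```python
import importlib.util

def missing_packages(names):
    # Return the top-level package names that cannot be located.
    return [name for name in names if importlib.util.find_spec(name) is None]

print(missing_packages(["torch", "torch_scatter", "pcdet"]))
```

An empty list means all three packages are importable from the current environment.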

3. Data Preparation

Prepare KITTI dataset and road planes

# Download KITTI and organize it into the following form:
├── data
│   ├── kitti
│   │   │── ImageSets
│   │   │── training
│   │   │   ├──calib & velodyne & label_2 & image_2 & (optional: planes)
│   │   │── testing
│   │   │   ├──calib & velodyne & image_2

# Generate data infos:
python -m pcdet.datasets.kitti.kitti_dataset create_kitti_infos tools/cfgs/dataset_configs/kitti_dataset.yaml
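
Before generating the data infos, it can be useful to verify that the folder layout matches the tree above. The following is a sketch: `missing_kitti_dirs` is a hypothetical helper, the `data/kitti` root is taken from the tree, and the optional `planes` folder is deliberately not checked.

```python
from pathlib import Path

# Required sub-folders per split, as listed in the directory tree.
EXPECTED_SUBDIRS = {
    "training": ["calib", "velodyne", "label_2", "image_2"],
    "testing": ["calib", "velodyne", "image_2"],
}

def missing_kitti_dirs(kitti_root):
    # Return the relative paths of required folders that do not exist.
    root = Path(kitti_root)
    missing = []
    if not (root / "ImageSets").is_dir():
        missing.append("ImageSets")
    for split, subdirs in EXPECTED_SUBDIRS.items():
        for sub in subdirs:
            if not (root / split / sub).is_dir():
                missing.append(f"{split}/{sub}")
    return missing

print(missing_kitti_dirs("data/kitti"))
```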

4. Pretrained model

You can download the pretrained model here and the log file here.

The performance (using 11 recall positions) on the KITTI validation set is as follows:

Car AP@0.70, 0.70, 0.70:
bev  AP:90.1572, 88.0972, 86.8397
3d   AP:88.8694, 78.7660, 77.5758

Pedestrian AP@0.50, 0.50, 0.50:
bev  AP:63.1125, 58.5591, 55.1318
3d   AP:60.2515, 55.5535, 50.1888

Cyclist AP@0.50, 0.50, 0.50:
bev  AP:85.6768, 71.9008, 67.1551
3d   AP:85.4238, 70.2774, 64.9804

The runtime is about 33 ms per sample.
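
The "11 recall positions" in this section conventionally refers to 11-point interpolated average precision: precision is sampled at recall levels 0.0, 0.1, ..., 1.0, taking at each level the best precision achieved at that recall or higher, and the 11 samples are averaged. The sketch below assumes that convention; it is not the evaluation code used by this repository, and the function name and toy numbers are illustrative.

```python
def average_precision_11pt(recalls, precisions):
    # recalls[i] and precisions[i] describe one point on the PR curve.
    thresholds = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    total = 0.0
    for t in thresholds:
        # Interpolated precision: best precision at recall >= t (0 if none).
        candidates = [p for r, p in zip(recalls, precisions) if r >= t]
        total += max(candidates, default=0.0)
    return total / len(thresholds)

# Toy PR curve: precision 1.0 up to recall 0.5, then 0.5 at recall 1.0.
print(average_precision_11pt([0.5, 1.0], [1.0, 0.5]))
```

In the toy curve, six thresholds (0.0 to 0.5) see precision 1.0 and five (0.6 to 1.0) see 0.5, so the result is 8.5 / 11.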

5. Train

Train with a single GPU

cd VoxSeT/tools
python train.py --cfg_file ./cfgs/kitti_models/voxset.yaml

Train with multiple GPUs

cd VoxSeT/tools
bash scripts/dist_train.sh --cfg_file ./cfgs/kitti_models/voxset.yaml

6. Test with a pretrained model

cd VoxSeT/tools
python test.py --cfg_file ./cfgs/kitti_models/voxset.yaml --ckpt ${CKPT_FILE}
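
When evaluation is launched from a script, building the argument list programmatically helps avoid mistakes such as a repeated flag. This is a sketch: `build_test_command` is a hypothetical helper, the checkpoint path is a placeholder, and only the `--cfg_file` and `--ckpt` flags shown above are used.

```python
import shlex
import sys

def build_test_command(cfg_file, ckpt_file, python=sys.executable):
    # Each flag appears exactly once, followed by its value.
    return [python, "test.py", "--cfg_file", cfg_file, "--ckpt", ckpt_file]

cmd = build_test_command(
    "./cfgs/kitti_models/voxset.yaml",
    "path/to/checkpoint.pth",  # placeholder, replace with your checkpoint
)
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in cmd))
# To run it: subprocess.run(cmd, cwd="VoxSeT/tools", check=True)
```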

Citation

@inproceedings{he2022voxset,
  title={Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds},
  author={He, Chenhang and Li, Ruihuang and Li, Shuai and Zhang, Lei},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  year={2022}
}

Owner

Billy HE
