Towards Efficient Use of Multi-Scale Features in Transformer-Based Object Detectors

Posted Jan 17, 2024 Updated Feb 12, 2024

By Geonu-Lee 3 min read

CVPR 2023; cited 9 times as of 2024-01-17

Task

Object Detection
DETR

Contribution

Using multi-scale features is beneficial, but it incurs huge computation costs.
Proposes the Iterative Multi-scale Feature Aggregation (IMFA) structure for efficient use of multi-scale features.
Proposes a sparse multi-scale feature sampling method.
The encoded features are updated iteratively, based on the detection results of the previous layer.

Existing multi-scale approaches have a high computation cost.
Applying the method proposed in the paper both reduces the computation cost and improves performance.

Proposed Method

Iterative Update of Encoded Features

Left: existing DETR methods, where the features obtained from the encoder are used as-is in every decoder layer.
Right: the rearranged pipeline of Iterative Multi-scale Feature Aggregation (IMFA) proposed in the paper.
The encoder features are also updated, based on the predictions of the previous decoder layer.

This new pipeline allows intermediate predictions to be leveraged as guidance.

Unless the encoder features are updated using multi-scale features, no performance gain is obtained.
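The difference between the two pipelines can be sketched in a few lines. This is only a toy illustration, not the paper's implementation: `decode_with_fixed_memory`, `decode_with_iterative_memory`, `toy_layer`, and `toy_update` are hypothetical names, plain lists stand in for tensors, and how the memory is actually refreshed is left to a caller-supplied `update_memory` function.

```python
from typing import Callable, List, Tuple

# Toy types: real models use tensors; plain lists keep the sketch dependency-free.
Memory = List[float]
Queries = List[float]
Prediction = List[float]
DecoderLayer = Callable[[Queries, Memory], Tuple[Queries, Prediction]]
MemoryUpdate = Callable[[Memory, Prediction], Memory]


def decode_with_fixed_memory(
    memory: Memory, queries: Queries, decoder_layers: List[DecoderLayer]
) -> Queries:
    """Baseline: every decoder layer reads the same encoded features."""
    for layer in decoder_layers:
        queries, _ = layer(queries, memory)
    return queries


def decode_with_iterative_memory(
    memory: Memory,
    queries: Queries,
    decoder_layers: List[DecoderLayer],
    update_memory: MemoryUpdate,
) -> Tuple[Queries, Memory, List[Prediction]]:
    """Iterative variant: refresh the memory from each intermediate prediction."""
    history: List[Prediction] = []
    for index, layer in enumerate(decoder_layers):
        queries, prediction = layer(queries, memory)
        history.append(prediction)
        is_last = index == len(decoder_layers) - 1
        if not is_last:
            # The next layer attends to features guided by this layer's output.
            memory = update_memory(memory, prediction)
    return queries, memory, history


def toy_layer(queries: Queries, memory: Memory) -> Tuple[Queries, Prediction]:
    """Hypothetical stand-in for a decoder layer: pull queries toward the memory mean."""
    context = sum(memory) / len(memory)
    new_queries = [0.5 * q + 0.5 * context for q in queries]
    return new_queries, list(new_queries)


def toy_update(memory: Memory, prediction: Prediction) -> Memory:
    """Hypothetical stand-in for feature aggregation: append one new feature."""
    return memory + [max(prediction)]
```

With three `toy_layer` layers, the memory is refreshed twice (after the first and second layer), so the last layer still sees features shaped by earlier predictions while no update is wasted after the final layer.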

Sparse Feature Sampling and Aggregation

This is where the multi-scale features are used.
Using the tokens of every scale would be far too many, so the idea is to sample them sparsely.
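A rough count shows why dense multi-scale tokens get expensive. The numbers below are assumptions for illustration only: an 800 x 1216 input, a single-scale map at stride 32, and multi-scale maps at strides 8, 16, 32, and 64. None of these values come from the post.

```python
import math


def token_count(height: int, width: int, stride: int) -> int:
    """Number of feature-map tokens for an image at a given backbone stride."""
    return math.ceil(height / stride) * math.ceil(width / stride)


height, width = 800, 1216          # assumed input resolution
single_scale_stride = 32           # assumed single-scale stride
multi_scale_strides = (8, 16, 32, 64)  # assumed multi-scale strides

single_scale_tokens = token_count(height, width, single_scale_stride)
multi_scale_tokens = sum(token_count(height, width, s) for s in multi_scale_strides)

print(single_scale_tokens)  # 950
print(multi_scale_tokens)   # 20197
```

Under these assumptions the dense multi-scale input has roughly 21 times as many tokens as the single-scale one, which is the cost that sparse sampling tries to avoid.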

Select the top $K$ queries based on the results of the previous decoder layer.
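A minimal sketch of the selection step, assuming that queries are ranked by their highest class score from the previous decoder layer. The ranking criterion and the name `select_top_k_queries` are assumptions for illustration; the post only says that the top $K$ are chosen from the previous layer's results.

```python
from typing import List


def select_top_k_queries(class_scores: List[List[float]], k: int) -> List[int]:
    """Return the indices of the k queries with the highest confidence.

    class_scores[i] holds the per-class scores of query i. Confidence is
    assumed to be the maximum class score of each query.
    """
    confidence = [max(scores) for scores in class_scores]
    ranked = sorted(range(len(confidence)), key=lambda i: confidence[i], reverse=True)
    return ranked[:k]


scores = [
    [0.10, 0.20],  # query 0
    [0.90, 0.05],  # query 1
    [0.30, 0.60],  # query 2
    [0.05, 0.05],  # query 3
]
print(select_top_k_queries(scores, k=2))  # [1, 2]
```

Only the selected queries go on to predict sampling locations, so the number of sampled features scales with $K$ rather than with the total number of queries.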

Predict $M$ keypoint locations from the $K$ queries.
Sample features from every scale via bilinear interpolation.
$\{\textbf{F}^s_{ij}\}^S_{s=1}$, where $S$ is the number of feature scales.
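The sampling step can be sketched in pure Python. Several things here are assumptions made for illustration: index $i$ is read as the query and $j$ as the keypoint, keypoints are given as normalized $(x, y)$ coordinates in $[0, 1]$, coordinates map to pixels with corner alignment, and each feature map has a single channel. The names `bilinear_sample` and `sample_all_scales` are hypothetical.

```python
import math
from typing import List, Tuple

FeatureMap = List[List[float]]  # H x W grid, one channel for brevity
Point = Tuple[float, float]     # (x, y), normalized to [0, 1]


def bilinear_sample(feature_map: FeatureMap, x: float, y: float) -> float:
    """Bilinearly interpolate a feature map at a normalized location."""
    height, width = len(feature_map), len(feature_map[0])
    # Clamp to the valid range, then map to pixel coordinates (corner-aligned).
    x = min(max(x, 0.0), 1.0)
    y = min(max(y, 0.0), 1.0)
    px, py = x * (width - 1), y * (height - 1)
    x0, y0 = int(math.floor(px)), int(math.floor(py))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    wx, wy = px - x0, py - y0
    top = (1 - wx) * feature_map[y0][x0] + wx * feature_map[y0][x1]
    bottom = (1 - wx) * feature_map[y1][x0] + wx * feature_map[y1][x1]
    return (1 - wy) * top + wy * bottom


def sample_all_scales(
    feature_maps: List[FeatureMap], keypoints: List[List[Point]]
) -> List[List[List[float]]]:
    """Return sampled[s][i][j]: scale s, query i, keypoint j."""
    return [
        [[bilinear_sample(fmap, x, y) for (x, y) in query_points]
         for query_points in keypoints]
        for fmap in feature_maps
    ]


coarse = [[0.0, 1.0], [2.0, 3.0]]
fine = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
keypoints = [[(0.5, 0.5), (0.0, 0.0)]]  # K = 1 query, M = 2 keypoints
sampled = sample_all_scales([coarse, fine], keypoints)
print(sampled)  # [[[1.5, 0.0]], [[4.0, 0.0]]]
```

Because the same normalized location is read from every scale, each keypoint yields $S$ features, giving $S \times K \times M$ sampled features in total instead of the full set of multi-scale tokens.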