Less is More: Focus Attention for Efficient DETR

Posted Jan 29, 2024 Updated Feb 12, 2024

By Geonu-Lee 2 min read

ICCV 2023; cited 3 times as of 2024-01-29

Task

Sparse DETR, which keeps only a sparse subset of the tokens

this monotonous sparse encoder, the number of retained foreground tokens remains numerous

Applying the proposed method to DINO reduces GFLOPs and latency

Uses multi-scale features
Tokens from several scales are arranged with overlapping scale ranges

overlap of the adjacent interval by 50%
ensuring that the multiscale feature map predicts object heterogeneity

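Label assignment uses the GT boxes
Checks whether a token lies inside a GT box and whether it falls within the scale range assigned to its level (perhaps because adjacent scales are grouped together?)
Supervision is applied at every level, as in Figure 3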
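
The label assignment described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation. It assumes a token counts as foreground only when both checks pass (it lies inside a GT box and its distance to the box edges falls in the level's scale range). The `SCALE_RANGES` values, the distance measure, and the names `token_label` and `level_labels` are all hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in image pixels

# Hypothetical overlapping scale ranges, one per feature level.
# These numbers are illustrative only and are not taken from the post.
SCALE_RANGES = [(0.0, 64.0), (32.0, 128.0), (64.0, 256.0), (128.0, float("inf"))]


def token_label(x: float, y: float, boxes: List[Box],
                scale_range: Tuple[float, float]) -> int:
    """Return 1 if the token centre (x, y) is foreground for this level, else 0."""
    lo, hi = scale_range
    for x1, y1, x2, y2 in boxes:
        inside = x1 <= x <= x2 and y1 <= y <= y2
        if not inside:
            continue
        # Assumed size measure: largest distance from the centre to a box edge.
        d = max(x - x1, x2 - x, y - y1, y2 - y)
        # Assumption: both conditions must hold (inside the box AND in range).
        if lo <= d <= hi:
            return 1
    return 0


def level_labels(height: int, width: int, stride: int, boxes: List[Box],
                 scale_range: Tuple[float, float]) -> List[List[int]]:
    """Label every token of one feature level; token centres are in pixels."""
    return [
        [token_label((j + 0.5) * stride, (i + 0.5) * stride, boxes, scale_range)
         for j in range(width)]
        for i in range(height)
    ]
```

Calling `level_labels` once per level with that level's stride and scale range gives one binary mask per level, which is what per-level supervision would need as a target.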

Proceeds top-down
High-level -> Low-level

Considering that high-level feature maps contain richer semantic than low-level features

$\alpha$ - learnable modulation coefficients
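
A rough sketch of how a top-down pass with modulation coefficients could look. The exact form of the modulation is an assumption here: the higher level's score map is upsampled, scaled by that level's $\alpha$, and used to re-weight the lower level's features as `f * (1 + alpha * up(score))` before scoring. `score_fn` is a hypothetical stand-in for the learned scorer, and in a real model the `alphas` would be learned rather than passed in as constants.

```python
import math
from typing import Callable, List

Grid = List[List[float]]               # one score per token
FeatureMap = List[List[List[float]]]   # one feature vector per token


def upsample_nearest(score: Grid, factor: int = 2) -> Grid:
    """Nearest-neighbour upsampling of a 2D score map."""
    return [[v for v in row for _ in range(factor)]
            for row in score for _ in range(factor)]


def modulate(features: FeatureMap, higher_score: Grid, alpha: float) -> FeatureMap:
    """Assumed modulation: f * (1 + alpha * upsampled higher-level score)."""
    up = upsample_nearest(higher_score)
    return [[[c * (1.0 + alpha * up[i][j]) for c in tok]
             for j, tok in enumerate(row)]
            for i, row in enumerate(features)]


def top_down_scores(levels: List[FeatureMap], alphas: List[float],
                    score_fn: Callable[[List[float]], float]) -> List[Grid]:
    """levels[0] is the lowest (largest) level, levels[-1] the highest."""
    scores: List[Grid] = [[] for _ in levels]
    # The highest level is scored directly, with nothing above it to modulate it.
    scores[-1] = [[score_fn(tok) for tok in row] for row in levels[-1]]
    # Walk from high-level to low-level, passing scores downward.
    for l in range(len(levels) - 2, -1, -1):
        mod = modulate(levels[l], scores[l + 1], alphas[l])
        scores[l] = [[score_fn(tok) for tok in row] for row in mod]
    return scores


def sigmoid_of_sum(tok: List[float]) -> float:
    """Hypothetical scorer used only for illustration."""
    return 1.0 / (1.0 + math.exp(-sum(tok)))
```

Each level must be exactly twice the resolution of the one above it for the default `factor=2` upsampling to line up.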

The authors claim that the proposed method selects better foreground tokens than Sparse DETR

Scores are computed with an MLP and multiplied in -> then the top-k tokens are selected
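
The multiply-then-select step can be illustrated with a small sketch. It assumes the MLP score is multiplied with a per-token foreground score, which the note does not spell out. The function name `select_top_k` and both input lists are hypothetical.

```python
from typing import List, Tuple


def select_top_k(foreground_scores: List[float], mlp_scores: List[float],
                 k: int) -> Tuple[List[int], List[float]]:
    """Multiply the two per-token scores, then keep the k highest-scoring tokens.

    Returns the indices of the kept tokens (best first) and the combined scores.
    """
    if len(foreground_scores) != len(mlp_scores):
        raise ValueError("both score lists must have one entry per token")
    combined = [f * s for f, s in zip(foreground_scores, mlp_scores)]
    order = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return order[:k], combined


kept, combined = select_top_k([0.9, 0.2, 0.5, 0.8], [0.1, 0.9, 0.8, 0.6], k=2)
```

Multiplying means a token has to score well on both criteria to survive: the first token above has a high foreground score but is dropped because its MLP score is low.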

Dual attention simply applies ordinary self-attention and deformable attention

the limitation of deformable attention in distant information mixing

Applying the proposed method keeps DINO's accuracy as close to the original as possible while reducing GFLOPs and increasing FPS

With a ResNet-101 backbone, it achieves the best performance among the compared methods

This post is licensed under CC BY 4.0 by the author.