KDNet: Leveraging Vision-Language Knowledge Distillation for Few-Shot Object Detection

Mengyuan Ma, Lin Qian, Hujun Yin*

*Corresponding author for this work

Research output: Chapter in Book/Conference proceeding › Conference contribution › peer-review

Abstract

Few-shot object detection (FSOD) aims to detect new categories given only a few training instances. Recently emerged vision-language models (VLMs) have shown strong performance in zero-shot and open-vocabulary object detection thanks to their ability to align object-level embeddings with the textual embeddings of categories. However, few existing FSOD models distill VLMs’ object-level knowledge, which could help a detector learn novel semantic concepts and improve further. Inspired by recent knowledge distillation approaches with VLMs, we propose an end-to-end few-shot object detector with knowledge distillation from pre-trained VLMs, termed KDNet. A knowledge distillation branch is introduced alongside the object detector to distill knowledge from the VLM’s visual encoder into the detector. We also propose a pre-training mechanism on a large-scale dataset that injects more semantic concepts into the detector to improve performance on small datasets. KDNet achieved state-of-the-art performance on both the PASCAL VOC and MS COCO benchmarks across most shot settings and evaluation metrics.
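
The distillation branch described in the abstract can be illustrated with a minimal sketch. The abstract does not state the loss, the embedding dimensions, or the weighting, so everything below is an assumption: an L1 distance between L2-normalized embeddings (a common choice in VLM distillation work), and the hypothetical names `distillation_loss`, `total_loss`, and `kd_weight`. It is not the authors' implementation.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_normalize(v: Vector, eps: float = 1e-12) -> List[float]:
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]


def distillation_loss(student: Sequence[Vector], teacher: Sequence[Vector]) -> float:
    """Mean L1 distance between normalized student and teacher embeddings.

    `student`: per-region embeddings from the detector (assumed).
    `teacher`: embeddings of the same regions from a frozen VLM visual
    encoder (assumed). The L1 form is an illustrative choice only.
    """
    if len(student) != len(teacher):
        raise ValueError("student and teacher must cover the same regions")
    if not student:
        return 0.0
    total = 0.0
    for s, t in zip(student, teacher):
        if len(s) != len(t):
            raise ValueError("embedding dimensions must match")
        s_n, t_n = l2_normalize(s), l2_normalize(t)
        # Average absolute difference over the embedding dimensions.
        total += sum(abs(a - b) for a, b in zip(s_n, t_n)) / len(s_n)
    return total / len(student)


def total_loss(detection_loss: float,
               student: Sequence[Vector],
               teacher: Sequence[Vector],
               kd_weight: float = 1.0) -> float:
    """Detection loss plus a weighted distillation term (weight is hypothetical)."""
    return detection_loss + kd_weight * distillation_loss(student, teacher)
```

Because both sides are normalized first, the term penalizes only the direction of an embedding and ignores its magnitude, so a student embedding that is a positive scalar multiple of the teacher's contributes zero loss.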
Original language: English
Title of host publication: Artificial Neural Networks and Machine Learning – ICANN 2024
Subtitle of host publication: 33rd International Conference on Artificial Neural Networks, Lugano, Switzerland, September 17–20, 2024, Proceedings, Part II
Editors: Michael Wand, Kristína Malinovská, Jürgen Schmidhuber, Igor V. Tetko
Place of Publication: Cham
Publisher: Springer Cham
Pages: 153–167
Number of pages: 15
Volume: 2
ISBN (Electronic): 9783031723353
ISBN (Print): 9783031723346
Publication status: Published - 17 Sept 2024

Publication series

Name: Lecture Notes in Computer Science
Publisher: Springer
Volume: 15017
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349
