Exploring Fine-Grained Audiovisual Categorization with the SSW60 Dataset

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Standard

Exploring Fine-Grained Audiovisual Categorization with the SSW60 Dataset. / Van Horn, Grant; Qian, Rui; Wilber, Kimberly; Adam, Hartwig; Mac Aodha, Oisin; Belongie, Serge.

Computer Vision – ECCV 2022 : 17th European Conference, Proceedings. ed. / Shai Avidan; Gabriel Brostow; Moustapha Cissé; Giovanni Maria Farinella; Tal Hassner. Springer, 2022. p. 271-289 (Lecture Notes in Computer Science, Vol. 13668 LNCS).

Harvard

Van Horn, G, Qian, R, Wilber, K, Adam, H, Mac Aodha, O & Belongie, S 2022, Exploring Fine-Grained Audiovisual Categorization with the SSW60 Dataset. in S Avidan, G Brostow, M Cissé, GM Farinella & T Hassner (eds), Computer Vision – ECCV 2022 : 17th European Conference, Proceedings. Springer, Lecture Notes in Computer Science, vol. 13668 LNCS, pp. 271-289, 17th European Conference on Computer Vision, ECCV 2022, Tel Aviv, Israel, 23/10/2022. https://doi.org/10.1007/978-3-031-20074-8_16

APA

Van Horn, G., Qian, R., Wilber, K., Adam, H., Mac Aodha, O., & Belongie, S. (2022). Exploring Fine-Grained Audiovisual Categorization with the SSW60 Dataset. In S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, & T. Hassner (Eds.), Computer Vision – ECCV 2022 : 17th European Conference, Proceedings (pp. 271-289). Springer. Lecture Notes in Computer Science Vol. 13668 LNCS https://doi.org/10.1007/978-3-031-20074-8_16

Vancouver

Van Horn G, Qian R, Wilber K, Adam H, Mac Aodha O, Belongie S. Exploring Fine-Grained Audiovisual Categorization with the SSW60 Dataset. In Avidan S, Brostow G, Cissé M, Farinella GM, Hassner T, editors, Computer Vision – ECCV 2022 : 17th European Conference, Proceedings. Springer. 2022. p. 271-289. (Lecture Notes in Computer Science, Vol. 13668 LNCS). https://doi.org/10.1007/978-3-031-20074-8_16

Author

Van Horn, Grant ; Qian, Rui ; Wilber, Kimberly ; Adam, Hartwig ; Mac Aodha, Oisin ; Belongie, Serge. / Exploring Fine-Grained Audiovisual Categorization with the SSW60 Dataset. Computer Vision – ECCV 2022 : 17th European Conference, Proceedings. editor / Shai Avidan ; Gabriel Brostow ; Moustapha Cissé ; Giovanni Maria Farinella ; Tal Hassner. Springer, 2022. pp. 271-289 (Lecture Notes in Computer Science, Vol. 13668 LNCS).

Bibtex

@inproceedings{59522b073b154dcc9bd95a47462e749d,

title = "Exploring Fine-Grained Audiovisual Categorization with the SSW60 Dataset",

abstract = "We present a new benchmark dataset, Sapsucker Woods 60 (SSW60), for advancing research on audiovisual fine-grained categorization. While our community has made great strides in fine-grained visual categorization on images, the counterparts in audio and video fine-grained categorization are relatively unexplored. To encourage advancements in this space, we have carefully constructed the SSW60 dataset to enable researchers to experiment with classifying the same set of categories in three different modalities: images, audio, and video. The dataset covers 60 species of birds and is comprised of images from existing datasets, and brand new, expert curated audio and video datasets. We thoroughly benchmark audiovisual classification performance and modality fusion experiments through the use of state-of-the-art transformer methods. Our findings show that performance of audiovisual fusion methods is better than using exclusively image or audio based methods for the task of video classification. We also present interesting modality transfer experiments, enabled by the unique construction of SSW60 to encompass three different modalities. We hope the SSW60 dataset and accompanying baselines spur research in this fascinating area.",

keywords = "Audio, Fine-grained, Multi-modal learning, Video",

author = "{Van Horn}, Grant and Rui Qian and Kimberly Wilber and Hartwig Adam and {Mac Aodha}, Oisin and Serge Belongie",

note = "Publisher Copyright: {\textcopyright} 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.; 17th European Conference on Computer Vision, ECCV 2022 ; Conference date: 23-10-2022 Through 27-10-2022",

year = "2022",

doi = "10.1007/978-3-031-20074-8_16",

language = "English",

isbn = "9783031200731",

series = "Lecture Notes in Computer Science",

publisher = "Springer",

pages = "271--289",

editor = "Shai Avidan and Gabriel Brostow and Moustapha Ciss{\'e} and Farinella, {Giovanni Maria} and Tal Hassner",

booktitle = "Computer Vision – ECCV 2022",

address = "Switzerland",

}

RIS

TY - GEN

T1 - Exploring Fine-Grained Audiovisual Categorization with the SSW60 Dataset

AU - Van Horn, Grant

AU - Qian, Rui

AU - Wilber, Kimberly

AU - Adam, Hartwig

AU - Mac Aodha, Oisin

AU - Belongie, Serge

PY - 2022

Y1 - 2022

N2 - We present a new benchmark dataset, Sapsucker Woods 60 (SSW60), for advancing research on audiovisual fine-grained categorization. While our community has made great strides in fine-grained visual categorization on images, the counterparts in audio and video fine-grained categorization are relatively unexplored. To encourage advancements in this space, we have carefully constructed the SSW60 dataset to enable researchers to experiment with classifying the same set of categories in three different modalities: images, audio, and video. The dataset covers 60 species of birds and is comprised of images from existing datasets, and brand new, expert curated audio and video datasets. We thoroughly benchmark audiovisual classification performance and modality fusion experiments through the use of state-of-the-art transformer methods. Our findings show that performance of audiovisual fusion methods is better than using exclusively image or audio based methods for the task of video classification. We also present interesting modality transfer experiments, enabled by the unique construction of SSW60 to encompass three different modalities. We hope the SSW60 dataset and accompanying baselines spur research in this fascinating area.

AB - We present a new benchmark dataset, Sapsucker Woods 60 (SSW60), for advancing research on audiovisual fine-grained categorization. While our community has made great strides in fine-grained visual categorization on images, the counterparts in audio and video fine-grained categorization are relatively unexplored. To encourage advancements in this space, we have carefully constructed the SSW60 dataset to enable researchers to experiment with classifying the same set of categories in three different modalities: images, audio, and video. The dataset covers 60 species of birds and is comprised of images from existing datasets, and brand new, expert curated audio and video datasets. We thoroughly benchmark audiovisual classification performance and modality fusion experiments through the use of state-of-the-art transformer methods. Our findings show that performance of audiovisual fusion methods is better than using exclusively image or audio based methods for the task of video classification. We also present interesting modality transfer experiments, enabled by the unique construction of SSW60 to encompass three different modalities. We hope the SSW60 dataset and accompanying baselines spur research in this fascinating area.

KW - Audio

KW - Fine-grained

KW - Multi-modal learning

KW - Video

U2 - 10.1007/978-3-031-20074-8_16

DO - 10.1007/978-3-031-20074-8_16

M3 - Article in proceedings

AN - SCOPUS:85144562502

SN - 9783031200731

T3 - Lecture Notes in Computer Science

SP - 271

EP - 289

BT - Computer Vision – ECCV 2022

A2 - Avidan, Shai

A2 - Brostow, Gabriel

A2 - Cissé, Moustapha

A2 - Farinella, Giovanni Maria

A2 - Hassner, Tal

PB - Springer

T2 - 17th European Conference on Computer Vision, ECCV 2022

Y2 - 23 October 2022 through 27 October 2022

ER -

ID: 342672104
