Cleaner Categories Improve Object Detection and Visual-Textual Grounding
Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review
Standard
Cleaner Categories Improve Object Detection and Visual-Textual Grounding. / Rigoni, Davide; Elliott, Desmond; Frank, Stella.
Image Analysis - 23rd Scandinavian Conference, SCIA 2023, Proceedings. ed. / Rikke Gade; Michael Felsberg; Joni-Kristian Kämäräinen. Springer, 2023. p. 412-442 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 13885 LNCS).
RIS
TY - GEN
T1 - Cleaner Categories Improve Object Detection and Visual-Textual Grounding
AU - Rigoni, Davide
AU - Elliott, Desmond
AU - Frank, Stella
N1 - Publisher Copyright: © 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2023
Y1 - 2023
N2 - Object detectors are core components of multimodal models, enabling them to locate regions of interest in images, which are then used to solve many multimodal tasks. Among the many extant object detectors, the Bottom-Up Faster R-CNN [39] (BUA) object detector is the most commonly used by the multimodal language-and-vision community, usually as a black-box visual feature generator for solving downstream multimodal tasks. It is trained on the Visual Genome Dataset [25] to detect 1600 different objects. However, those object categories are defined using automatically processed image region descriptions from the Visual Genome dataset. The automatic process introduces some unexpected near-duplicate categories (e.g. “watch” and “wristwatch”, “tree” and “trees”, and “motorcycle” and “motorbike”) that may result in a sub-optimal representational space and likely impair the ability of the model to classify objects correctly. In this paper, we manually merge near-duplicate labels to create a cleaner label set, which is used to retrain the object detector. We investigate the effect of using the cleaner label set in terms of: (i) performance on the original object detection task, (ii) the properties of the embedding space learned by the detector, and (iii) the utility of the features in a visual grounding task on the Flickr30K Entities dataset. We find that the BUA model trained with the cleaner categories learns a better-clustered embedding space than the model trained with the noisy categories. The new embedding space improves the object detection task and also yields better bounding box feature representations, which help to solve the visual grounding task.
AB - Object detectors are core components of multimodal models, enabling them to locate regions of interest in images, which are then used to solve many multimodal tasks. Among the many extant object detectors, the Bottom-Up Faster R-CNN [39] (BUA) object detector is the most commonly used by the multimodal language-and-vision community, usually as a black-box visual feature generator for solving downstream multimodal tasks. It is trained on the Visual Genome Dataset [25] to detect 1600 different objects. However, those object categories are defined using automatically processed image region descriptions from the Visual Genome dataset. The automatic process introduces some unexpected near-duplicate categories (e.g. “watch” and “wristwatch”, “tree” and “trees”, and “motorcycle” and “motorbike”) that may result in a sub-optimal representational space and likely impair the ability of the model to classify objects correctly. In this paper, we manually merge near-duplicate labels to create a cleaner label set, which is used to retrain the object detector. We investigate the effect of using the cleaner label set in terms of: (i) performance on the original object detection task, (ii) the properties of the embedding space learned by the detector, and (iii) the utility of the features in a visual grounding task on the Flickr30K Entities dataset. We find that the BUA model trained with the cleaner categories learns a better-clustered embedding space than the model trained with the noisy categories. The new embedding space improves the object detection task and also yields better bounding box feature representations, which help to solve the visual grounding task.
KW - Bottom-Up
KW - Data Cleaning
KW - Label Cleaning
KW - Object Detection
KW - Object Ontology
KW - Visual Genome
UR - http://www.scopus.com/inward/record.url?scp=85161386681&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-31435-3_28
DO - 10.1007/978-3-031-31435-3_28
M3 - Article in proceedings
AN - SCOPUS:85161386681
SN - 9783031314346
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 412
EP - 442
BT - Image Analysis - 23rd Scandinavian Conference, SCIA 2023, Proceedings
A2 - Gade, Rikke
A2 - Felsberg, Michael
A2 - Kämäräinen, Joni-Kristian
PB - Springer
T2 - 23rd Scandinavian Conference on Image Analysis, SCIA 2023
Y2 - 18 April 2023 through 21 April 2023
ER -
ID: 357283955