How2: A large-scale dataset for multimodal language understanding

Publication: Contribution to book/anthology/report › Conference contribution in proceedings › Research › peer-reviewed

  • Ramon Sanabria
  • Ozan Caglayan
  • Shruti Palaskar
  • Desmond Elliott
  • Loic Barrault
  • Lucia Specia
  • Florian Metze

Human information processing is inherently multimodal, and language is best understood in a situated context. In order to achieve human-like language processing capabilities, machines should be able to jointly process multimodal data, and not just text, images, or speech in isolation. Nevertheless, there are very few multimodal datasets to support such research, resulting in limited interaction among different research communities. In this paper, we introduce How2, a large-scale dataset of instructional videos covering a wide variety of topics across 80,000 clips (about 2,000 hours), with word-level time alignments to the ground-truth English subtitles. In addition to being multimodal, How2 is multilingual: we crowdsourced Portuguese translations of the subtitles. We present results for monomodal and multimodal baselines on several language processing tasks with interesting insights on the utility of different modalities. We hope that by making the How2 dataset and baselines available we will encourage collaboration across the language, speech and vision communities.
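The record does not describe the release format, but as a rough illustration of the structure the abstract mentions (clips paired with word-level time alignments to the English subtitles and a crowdsourced Portuguese translation), here is a minimal Python sketch. The class names, field names, and the "word start end" line format are assumptions made for illustration, not the actual How2 distribution format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlignedWord:
    # One subtitle word with start/end times in seconds (assumed schema).
    word: str
    start: float
    end: float


@dataclass
class How2Clip:
    # One clip as the abstract describes it: an English subtitle with
    # word-level timings plus a crowdsourced Portuguese translation.
    clip_id: str
    english_subtitle: str
    portuguese_subtitle: str
    alignment: List[AlignedWord]


def parse_alignment_line(line: str) -> AlignedWord:
    # Parse an assumed "word start end" triple, e.g. "hello 0.00 0.42".
    word, start, end = line.split()
    return AlignedWord(word, float(start), float(end))


# Made-up example data, illustrating the structure only.
clip = How2Clip(
    clip_id="clip_000001",
    english_subtitle="hello and welcome",
    portuguese_subtitle="olá e bem-vindo",
    alignment=[parse_alignment_line(l) for l in (
        "hello 0.00 0.42",
        "and 0.42 0.55",
        "welcome 0.55 1.10",
    )],
)
print(clip.alignment[-1])  # AlignedWord(word='welcome', start=0.55, end=1.1)
```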
Original language: English
Title: Visually Grounded Interaction and Language (ViGIL), Montreal, Canada, December 2018. Neural Information Processing Systems (NeurIPS).
Publication date: 2018
Status: Published - 2018
Event: 32nd Annual Conference on Neural Information Processing Systems - Montreal, Canada
Duration: 2 Dec 2018 - 8 Dec 2018
Conference number: 32
https://nips.cc/Conferences/2018

Conference

Conference: 32nd Annual Conference on Neural Information Processing Systems
Number: 32
Location: Montreal
Country: Canada
City: Montreal
Period: 02/12/2018 - 08/12/2018
Internet address: https://nips.cc/Conferences/2018
Name: arXiv

ID: 236508335