A Vision-free Baseline for Multimodal Grammar Induction
Research output: Working paper › Preprint › Research
Standard
A Vision-free Baseline for Multimodal Grammar Induction. / Li, Boyi; Corona, Rodolfo; Mangalam, Karttikeya; Chen, Catherine; Flaherty, Daniel; Belongie, Serge; Weinberger, Kilian Q.; Malik, Jitendra; Darrell, Trevor; Klein, Dan.
arXiv.org, 2023.
RIS
TY - UNPB
T1 - A Vision-free Baseline for Multimodal Grammar Induction
AU - Li, Boyi
AU - Corona, Rodolfo
AU - Mangalam, Karttikeya
AU - Chen, Catherine
AU - Flaherty, Daniel
AU - Belongie, Serge
AU - Weinberger, Kilian Q.
AU - Malik, Jitendra
AU - Darrell, Trevor
AU - Klein, Dan
PY - 2023
Y1 - 2023
N2 - Past work has shown that paired vision-language signals substantially improve grammar induction in multimodal datasets such as MSCOCO. We investigate whether advances in large language models (LLMs) trained only on text can provide strong assistance for grammar induction in multimodal settings. We find that our text-only approach, an LLM-based C-PCFG (LC-PCFG), outperforms previous multimodal methods and achieves state-of-the-art grammar induction performance on various multimodal datasets. Compared to image-aided grammar induction, LC-PCFG outperforms the prior state-of-the-art by 7.9 Corpus-F1 points, with an 85% reduction in parameter count and 1.7x faster training. Across three video-assisted grammar induction benchmarks, LC-PCFG outperforms the prior state-of-the-art by up to 7.7 Corpus-F1 points, with 8.8x faster training. These results suggest that text-only language models may include visually grounded cues that aid grammar induction in multimodal contexts. Moreover, our results emphasize the importance of establishing a robust vision-free baseline when evaluating the benefit of multimodal approaches.
AB - Past work has shown that paired vision-language signals substantially improve grammar induction in multimodal datasets such as MSCOCO. We investigate whether advances in large language models (LLMs) trained only on text can provide strong assistance for grammar induction in multimodal settings. We find that our text-only approach, an LLM-based C-PCFG (LC-PCFG), outperforms previous multimodal methods and achieves state-of-the-art grammar induction performance on various multimodal datasets. Compared to image-aided grammar induction, LC-PCFG outperforms the prior state-of-the-art by 7.9 Corpus-F1 points, with an 85% reduction in parameter count and 1.7x faster training. Across three video-assisted grammar induction benchmarks, LC-PCFG outperforms the prior state-of-the-art by up to 7.7 Corpus-F1 points, with 8.8x faster training. These results suggest that text-only language models may include visually grounded cues that aid grammar induction in multimodal contexts. Moreover, our results emphasize the importance of establishing a robust vision-free baseline when evaluating the benefit of multimodal approaches.
M3 - Preprint
BT - A Vision-free Baseline for Multimodal Grammar Induction
PB - arXiv.org
ER -
ID: 384657807