Learning to Evaluate Image Captioning

Publication: Contribution to journal › Conference article › Research › peer-reviewed

Standard

Learning to Evaluate Image Captioning. / Cui, Yin; Yang, Guandao; Veit, Andreas; Huang, Xun; Belongie, Serge.

In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 14.12.2018, pp. 5804-5812.


Harvard

Cui, Y, Yang, G, Veit, A, Huang, X & Belongie, S 2018, 'Learning to Evaluate Image Captioning', Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 5804-5812. https://doi.org/10.1109/CVPR.2018.00608

APA

Cui, Y., Yang, G., Veit, A., Huang, X., & Belongie, S. (2018). Learning to Evaluate Image Captioning. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 5804-5812. https://doi.org/10.1109/CVPR.2018.00608

Vancouver

Cui Y, Yang G, Veit A, Huang X, Belongie S. Learning to Evaluate Image Captioning. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 2018 Dec 14;5804-5812. https://doi.org/10.1109/CVPR.2018.00608

Author

Cui, Yin ; Yang, Guandao ; Veit, Andreas ; Huang, Xun ; Belongie, Serge. / Learning to Evaluate Image Captioning. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 2018 ; pp. 5804-5812.

Bibtex

@inproceedings{146bbf38950c4d4c81e2dc15d61bcacd,
title = "Learning to Evaluate Image Captioning",
abstract = "Evaluation metrics for image captioning face two challenges. Firstly, commonly used metrics such as CIDEr, METEOR, ROUGE and BLEU often do not correlate well with human judgments. Secondly, each metric has well known blind spots to pathological caption constructions, and rule-based metrics lack provisions to repair such blind spots once identified. For example, the newly proposed SPICE correlates well with human judgments, but fails to capture the syntactic structure of a sentence. To address these two challenges, we propose a novel learning based discriminative evaluation metric that is directly trained to distinguish between human and machine-generated captions. In addition, we further propose a data augmentation scheme to explicitly incorporate pathological transformations as negative examples during training. The proposed metric is evaluated with three kinds of robustness tests and its correlation with human judgments. Extensive experiments show that the proposed data augmentation scheme not only makes our metric more robust toward several pathological transformations, but also improves its correlation with human judgments. Our metric outperforms other metrics on both caption level human correlation in Flickr 8k and system level human correlation in COCO. The proposed approach could be served as a learning based evaluation metric that is complementary to existing rule-based metrics.",
author = "Yin Cui and Guandao Yang and Andreas Veit and Xun Huang and Serge Belongie",
note = "Publisher Copyright: {\textcopyright} 2018 IEEE.; 31st Meeting of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018 ; Conference date: 18-06-2018 Through 22-06-2018",
year = "2018",
month = dec,
day = "14",
doi = "10.1109/CVPR.2018.00608",
language = "English",
pages = "5804--5812",
journal = "IEEE Conference on Computer Vision and Pattern Recognition. Proceedings",
issn = "1063-6919",
publisher = "Institute of Electrical and Electronics Engineers",
}

RIS

TY - GEN

T1 - Learning to Evaluate Image Captioning

AU - Cui, Yin

AU - Yang, Guandao

AU - Veit, Andreas

AU - Huang, Xun

AU - Belongie, Serge

N1 - Publisher Copyright: © 2018 IEEE.

PY - 2018/12/14

Y1 - 2018/12/14

N2 - Evaluation metrics for image captioning face two challenges. Firstly, commonly used metrics such as CIDEr, METEOR, ROUGE and BLEU often do not correlate well with human judgments. Secondly, each metric has well known blind spots to pathological caption constructions, and rule-based metrics lack provisions to repair such blind spots once identified. For example, the newly proposed SPICE correlates well with human judgments, but fails to capture the syntactic structure of a sentence. To address these two challenges, we propose a novel learning based discriminative evaluation metric that is directly trained to distinguish between human and machine-generated captions. In addition, we further propose a data augmentation scheme to explicitly incorporate pathological transformations as negative examples during training. The proposed metric is evaluated with three kinds of robustness tests and its correlation with human judgments. Extensive experiments show that the proposed data augmentation scheme not only makes our metric more robust toward several pathological transformations, but also improves its correlation with human judgments. Our metric outperforms other metrics on both caption level human correlation in Flickr 8k and system level human correlation in COCO. The proposed approach could be served as a learning based evaluation metric that is complementary to existing rule-based metrics.

AB - Evaluation metrics for image captioning face two challenges. Firstly, commonly used metrics such as CIDEr, METEOR, ROUGE and BLEU often do not correlate well with human judgments. Secondly, each metric has well known blind spots to pathological caption constructions, and rule-based metrics lack provisions to repair such blind spots once identified. For example, the newly proposed SPICE correlates well with human judgments, but fails to capture the syntactic structure of a sentence. To address these two challenges, we propose a novel learning based discriminative evaluation metric that is directly trained to distinguish between human and machine-generated captions. In addition, we further propose a data augmentation scheme to explicitly incorporate pathological transformations as negative examples during training. The proposed metric is evaluated with three kinds of robustness tests and its correlation with human judgments. Extensive experiments show that the proposed data augmentation scheme not only makes our metric more robust toward several pathological transformations, but also improves its correlation with human judgments. Our metric outperforms other metrics on both caption level human correlation in Flickr 8k and system level human correlation in COCO. The proposed approach could be served as a learning based evaluation metric that is complementary to existing rule-based metrics.

UR - http://www.scopus.com/inward/record.url?scp=85062833512&partnerID=8YFLogxK

U2 - 10.1109/CVPR.2018.00608

DO - 10.1109/CVPR.2018.00608

M3 - Conference article

AN - SCOPUS:85062833512

SP - 5804

EP - 5812

JO - IEEE Conference on Computer Vision and Pattern Recognition. Proceedings

JF - IEEE Conference on Computer Vision and Pattern Recognition. Proceedings

SN - 1063-6919

T2 - 31st Meeting of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018

Y2 - 18 June 2018 through 22 June 2018

ER -

ID: 301825332