Context matters: Refining object detection in video with recurrent neural networks

Publication: Conference contribution › Paper › Research › peer-reviewed

Standard

Context matters: Refining object detection in video with recurrent neural networks. / Tripathi, Subarna; Lipton, Zachary C.; Belongie, Serge; Nguyen, Truong.

2016. 1-12. Paper presented at 27th British Machine Vision Conference, BMVC 2016, York, United Kingdom.


Harvard

Tripathi, S, Lipton, ZC, Belongie, S & Nguyen, T 2016, 'Context matters: Refining object detection in video with recurrent neural networks', Paper presented at 27th British Machine Vision Conference, BMVC 2016, York, United Kingdom, 19/09/2016 - 22/09/2016 pp. 1-12. https://doi.org/10.5244/C.30.44

APA

Tripathi, S., Lipton, Z. C., Belongie, S., & Nguyen, T. (2016). Context matters: Refining object detection in video with recurrent neural networks. 1-12. Paper presented at 27th British Machine Vision Conference, BMVC 2016, York, United Kingdom. https://doi.org/10.5244/C.30.44

Vancouver

Tripathi S, Lipton ZC, Belongie S, Nguyen T. Context matters: Refining object detection in video with recurrent neural networks. 2016. Paper presented at 27th British Machine Vision Conference, BMVC 2016, York, United Kingdom. https://doi.org/10.5244/C.30.44

Author

Tripathi, Subarna ; Lipton, Zachary C. ; Belongie, Serge ; Nguyen, Truong. / Context matters: Refining object detection in video with recurrent neural networks. Paper presented at 27th British Machine Vision Conference, BMVC 2016, York, United Kingdom. 12 p.

Bibtex

@conference{e20603a3bd1c402e8fbe83e87a764833,
title = "Context matters: Refining object detection in video with recurrent neural networks",
abstract = "Given the vast amounts of video available online and recent breakthroughs in object detection with static images, object detection in video offers a promising new frontier. However, motion blur and compression artifacts cause substantial frame-level variability, even in videos that appear smooth to the eye. Additionally, in video datasets, frames are typically sparsely annotated. We present a new framework for improving object detection in videos that captures temporal context and encourages consistency of predictions. First, we train a pseudo-labeler, i.e., a domain-adapted convolutional neural network for object detection, on the subset of labeled frames. We then subsequently apply it to provisionally label all frames, including those absent labels. Finally, we train a recurrent neural network that takes as input sequences of pseudo-labeled frames and optimizes an objective that encourages both accuracy on the target frame and consistency across consecutive frames. The approach incorporates strong supervision of target frames, weak-supervision on context frames, and regularization via a smoothness penalty. Our approach achieves mean Average Precision (mAP) of 68.73, an improvement of 7.1 over the strongest image-based baselines for the Youtube-Video Objects dataset. Our experiments demonstrate that neighboring frames can provide valuable information, even absent labels.",
author = "Subarna Tripathi and Lipton, {Zachary C.} and Serge Belongie and Truong Nguyen",
note = "Publisher Copyright: {\textcopyright} 2016. The copyright of this document resides with its authors.; 27th British Machine Vision Conference, BMVC 2016 ; Conference date: 19-09-2016 Through 22-09-2016",
year = "2016",
doi = "10.5244/C.30.44",
language = "English",
pages = "1--12",

}
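
The abstract above sketches a three-part training objective: strong supervision on the annotated target frame, weak supervision from pseudo-labels on the surrounding context frames, and a smoothness penalty tying consecutive predictions together. The following is a minimal PyTorch-style sketch of that objective as described; the module, function, and weight names (RefinerRNN, refinement_loss, lam_weak, lam_smooth) are illustrative assumptions, not the authors' implementation.

import torch.nn as nn

class RefinerRNN(nn.Module):
    """GRU over per-frame detection features (an assumed architecture
    standing in for the abstract's 'recurrent neural network')."""
    def __init__(self, feat_dim, hidden_dim, num_classes):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, feats):
        # feats: (batch, time, feat_dim) -> per-frame logits: (batch, time, num_classes)
        out, _ = self.rnn(feats)
        return self.head(out)

def refinement_loss(logits, target_labels, pseudo_labels, target_idx,
                    lam_weak=0.1, lam_smooth=0.01):
    """Combine the strong, weak, and smoothness terms; lam_weak and
    lam_smooth are assumed hyperparameters, not values from the paper."""
    bce = nn.BCEWithLogitsLoss()
    strong = bce(logits[:, target_idx], target_labels)       # annotated target frame
    weak = bce(logits, pseudo_labels)                        # pseudo-labeled context frames
    smooth = (logits[:, 1:] - logits[:, :-1]).pow(2).mean()  # consecutive-frame consistency
    return strong + lam_weak * weak + lam_smooth * smooth

The smooth term is one plausible reading of the abstract's "regularization via a smoothness penalty": it discourages the refined predictions from fluctuating between neighboring frames even where no annotation exists.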

RIS

TY - CONF

T1 - Context matters: Refining object detection in video with recurrent neural networks

T2 - 27th British Machine Vision Conference, BMVC 2016

AU - Tripathi, Subarna

AU - Lipton, Zachary C.

AU - Belongie, Serge

AU - Nguyen, Truong

N1 - Publisher Copyright: © 2016. The copyright of this document resides with its authors.

PY - 2016

Y1 - 2016

N2 - Given the vast amounts of video available online and recent breakthroughs in object detection with static images, object detection in video offers a promising new frontier. However, motion blur and compression artifacts cause substantial frame-level variability, even in videos that appear smooth to the eye. Additionally, in video datasets, frames are typically sparsely annotated. We present a new framework for improving object detection in videos that captures temporal context and encourages consistency of predictions. First, we train a pseudo-labeler, i.e., a domain-adapted convolutional neural network for object detection, on the subset of labeled frames. We then apply it to provisionally label all frames, including those lacking labels. Finally, we train a recurrent neural network that takes as input sequences of pseudo-labeled frames and optimizes an objective that encourages both accuracy on the target frame and consistency across consecutive frames. The approach incorporates strong supervision of target frames, weak supervision on context frames, and regularization via a smoothness penalty. Our approach achieves mean Average Precision (mAP) of 68.73, an improvement of 7.1 over the strongest image-based baselines for the YouTube-Video Objects dataset. Our experiments demonstrate that neighboring frames can provide valuable information, even absent labels.

AB - Given the vast amounts of video available online and recent breakthroughs in object detection with static images, object detection in video offers a promising new frontier. However, motion blur and compression artifacts cause substantial frame-level variability, even in videos that appear smooth to the eye. Additionally, in video datasets, frames are typically sparsely annotated. We present a new framework for improving object detection in videos that captures temporal context and encourages consistency of predictions. First, we train a pseudo-labeler, i.e., a domain-adapted convolutional neural network for object detection, on the subset of labeled frames. We then apply it to provisionally label all frames, including those lacking labels. Finally, we train a recurrent neural network that takes as input sequences of pseudo-labeled frames and optimizes an objective that encourages both accuracy on the target frame and consistency across consecutive frames. The approach incorporates strong supervision of target frames, weak supervision on context frames, and regularization via a smoothness penalty. Our approach achieves mean Average Precision (mAP) of 68.73, an improvement of 7.1 over the strongest image-based baselines for the YouTube-Video Objects dataset. Our experiments demonstrate that neighboring frames can provide valuable information, even absent labels.

UR - http://www.scopus.com/inward/record.url?scp=85042872582&partnerID=8YFLogxK

U2 - 10.5244/C.30.44

DO - 10.5244/C.30.44

M3 - Paper

AN - SCOPUS:85042872582

SP - 1

EP - 12

Y2 - 19 September 2016 through 22 September 2016

ER -
