Publish in OALib Journal

ISSN: 2333-9721

APC: Only $99


Any time

2019 ( 6 )

2018 ( 8 )

2017 ( 11 )

2016 ( 17 )

Custom range...

Search Results: 1 - 10 of 4386 matches for " Marcus Rohrbach "
All listed articles are free for downloading (OA Articles)
Page 1 /4386
Display every page Item
The Long-Short Story of Movie Description
Anna Rohrbach,Marcus Rohrbach,Bernt Schiele
Computer Science , 2015,
Abstract: Generating descriptions for videos has many applications including assisting blind people and human-robot interaction. The recent advances in image captioning as well as the release of large-scale movie description datasets such as MPII Movie Description allow to study this task in more depth. Many of the proposed methods for image captioning rely on pre-trained object classifier CNNs and Long-Short Term Memory recurrent networks (LSTMs) for generating descriptions. While image description focuses on objects, we argue that it is important to distinguish verbs, objects, and places in the challenging setting of movie description. In this work we show how to learn robust visual classifiers from the weak annotations of the sentence descriptions. Based on these visual classifiers we learn how to generate a description using an LSTM. We explore different design choices to build and train the LSTM and achieve the best performance to date on the challenging MPII-MD dataset. We compare and analyze our approach and prior work along various dimensions to better understand the key challenges of the movie description task.
Grounding of Textual Phrases in Images by Reconstruction
Anna Rohrbach,Marcus Rohrbach,Ronghang Hu,Trevor Darrell,Bernt Schiele
Computer Science , 2015,
Abstract: Grounding (i.e. localizing) arbitrary, free-form textual phrases in visual content is a challenging problem with many applications for human-computer interaction and image-text reference resolution. Although many data sources contain images which are described with sentences or phrases, they typically do not provide the spatial localization of the phrases. This is true for both curated datasets such as MSCOCO or large user generated content as e.g. in the YFCC 100M dataset. Consequently, being able to learn from this data without grounding supervision would allow large amount and variety of training data. For this setting we propose GroundeR, a novel approach, which is able to learn the grounding by aiming to reconstruct a given phrase using an attention mechanism. More specifically, during training time, the model encodes the phrase using an LSTM, and then has to learn to attend to the relevant image region in order to reconstruct the input phrase. At test time the correct attention, i.e. the grounding is evaluated. On the Flickr 30k Entities dataset our approach outperforms prior work which, in contrast to us, trains with the grounding (bounding box) annotations.
A Dataset for Movie Description
Anna Rohrbach,Marcus Rohrbach,Niket Tandon,Bernt Schiele
Computer Science , 2015,
Abstract: Descriptive video service (DVS) provides linguistic descriptions of movies and allows visually impaired people to follow a movie along with their peers. Such descriptions are by design mainly visual and thus naturally form an interesting data source for computer vision and computational linguistics. In this work we propose a novel dataset which contains transcribed DVS, which is temporally aligned to full length HD movies. In addition we also collected the aligned movie scripts which have been used in prior work and compare the two different sources of descriptions. In total the Movie Description dataset contains a parallel corpus of over 54,000 sentences and video snippets from 72 HD movies. We characterize the dataset by benchmarking different approaches for generating video descriptions. Comparing DVS to scripts, we find that DVS is far more visual and describes precisely what is shown rather than what should happen according to the scripts created prior to movie production.
Ask Your Neurons: A Neural-based Approach to Answering Questions about Images
Mateusz Malinowski,Marcus Rohrbach,Mario Fritz
Computer Science , 2015,
Abstract: We address a question answering task on real-world images that is set up as a Visual Turing Test. By combining latest advances in image representation and natural language processing, we propose Neural-Image-QA, an end-to-end formulation to this problem for which all parts are trained jointly. In contrast to previous efforts, we are facing a multi-modal problem where the language output (answer) is conditioned on visual and natural language input (image and question). Our approach Neural-Image-QA doubles the performance of the previous best approach on this problem. We provide additional insights into the problem by analyzing how much information is contained only in the language part for which we provide a new human baseline. To study human consensus, which is related to the ambiguities inherent in this challenging task, we propose two novel metrics and collect additional answers which extends the original DAQUAR dataset to DAQUAR-Consensus.
Recognizing Fine-Grained and Composite Activities using Hand-Centric Features and Script Data
Marcus Rohrbach,Anna Rohrbach,Michaela Regneri,Sikandar Amin,Mykhaylo Andriluka,Manfred Pinkal,Bernt Schiele
Computer Science , 2015, DOI: 10.1007/s11263-015-0851-8
Abstract: Activity recognition has shown impressive progress in recent years. However, the challenges of detecting fine-grained activities and understanding how they are combined into composite activities have been largely overlooked. In this work we approach both tasks and present a dataset which provides detailed annotations to address them. The first challenge is to detect fine-grained activities, which are defined by low inter-class variability and are typically characterized by fine-grained body motions. We explore how human pose and hands can help to approach this challenge by comparing two pose-based and two hand-centric features with state-of-the-art holistic features. To attack the second challenge, recognizing composite activities, we leverage the fact that these activities are compositional and that the essential components of the activities can be obtained from textual descriptions or scripts. We show the benefits of our hand-centric approach for fine-grained activity classification and detection. For composite activity recognition we find that decomposition into attributes allows sharing information across composites and is essential to attack this hard task. Using script data we can recognize novel composites without having training data for them.
Deep Compositional Question Answering with Neural Module Networks
Jacob Andreas,Marcus Rohrbach,Trevor Darrell,Dan Klein
Computer Science , 2015,
Abstract: Visual question answering is fundamentally compositional in nature---a question like "where is the dog?" shares substructure with questions like "what color is the dog?" and "where is the cat?" This paper seeks to simultaneously exploit the representational capacity of deep networks and the compositional linguistic structure of questions. We describe a procedure for constructing and learning *neural module networks*, which compose collections of jointly-trained neural "modules" into deep networks for question answering. Our approach decomposes questions into their linguistic substructures, and uses these structures to dynamically instantiate modular networks (with reusable components for recognizing dogs, classifying colors, etc.). The resulting compound networks are jointly trained. We evaluate our approach on two challenging datasets for visual question answering, achieving state-of-the-art results on both the VQA natural image dataset and a new dataset of complex questions about abstract shapes.
A Multi-scale Multiple Instance Video Description Network
Huijuan Xu,Subhashini Venugopalan,Vasili Ramanishka,Marcus Rohrbach,Kate Saenko
Computer Science , 2015,
Abstract: Generating natural language descriptions for in-the-wild videos is a challenging task. Most state-of-the-art methods for solving this problem borrow existing deep convolutional neural network (CNN) architectures (AlexNet, GoogLeNet) to extract a visual representation of the input video. However, these deep CNN architectures are designed for single-label centered-positioned object classification. While they generate strong semantic features, they have no inherent structure allowing them to detect multiple objects of different sizes and locations in the frame. Our paper tries to solve this problem by integrating the base CNN into several fully convolutional neural networks (FCNs) to form a multi-scale network that handles multiple receptive field sizes in the original image. FCNs, previously applied to image segmentation, can generate class heat-maps efficiently compared to sliding window mechanisms, and can easily handle multiple scales. To further handle the ambiguity over multiple objects and locations, we incorporate the Multiple Instance Learning mechanism (MIL) to consider objects in different positions and at different scales simultaneously. We integrate our multi-scale multi-instance architecture with a sequence-to-sequence recurrent neural network to generate sentence descriptions based on the visual representation. Ours is the first end-to-end trainable architecture that is capable of multi-scale region processing. Evaluation on a Youtube video dataset shows the advantage of our approach compared to the original single-scale whole frame CNN model. Our flexible and efficient architecture can potentially be extended to support other video processing tasks.
Coherent Multi-Sentence Video Description with Variable Level of Detail
Anna Senina,Marcus Rohrbach,Wei Qiu,Annemarie Friedrich,Sikandar Amin,Mykhaylo Andriluka,Manfred Pinkal,Bernt Schiele
Computer Science , 2014,
Abstract: Humans can easily describe what they see in a coherent way and at varying level of detail. However, existing approaches for automatic video description are mainly focused on single sentence generation and produce descriptions at a fixed level of detail. In this paper, we address both of these limitations: for a variable level of detail we produce coherent multi-sentence descriptions of complex videos. We follow a two-step approach where we first learn to predict a semantic representation (SR) from video and then generate natural language descriptions from the SR. To produce consistent multi-sentence descriptions, we model across-sentence consistency at the level of the SR by enforcing a consistent topic. We also contribute both to the visual recognition of objects proposing a hand-centric approach as well as to the robust generation of sentences using a word lattice. Human judges rate our multi-sentence descriptions as more readable, correct, and relevant than related work. To understand the difference between more detailed and shorter descriptions, we collect and analyze a video description corpus of three levels of detail.
Natural Language Object Retrieval
Ronghang Hu,Huazhe Xu,Marcus Rohrbach,Jiashi Feng,Kate Saenko,Trevor Darrell
Computer Science , 2015,
Abstract: In this paper, we address the task of natural language object retrieval, to localize a target object within a given image based on a natural language query of the object. Natural language object retrieval differs from text-based image retrieval task as it involves spatial information about objects within the scene and global scene context. To address this issue, we propose a novel Spatial Context Recurrent ConvNet (SCRC) model as scoring function on candidate boxes for object retrieval, integrating spatial configurations and global scene-level contextual information into the network. Our model processes query text, local image descriptors, spatial configurations and global context features through a recurrent network, outputs the probability of the query text conditioned on each candidate box as a score for the box, and can transfer visual-linguistic knowledge from image captioning domain to our task. Experimental results demonstrate that our method effectively utilizes both local and global information, outperforming previous baseline methods significantly on different datasets and scenarios, and can exploit large scale vision and language datasets for knowledge transfer.
Spatial Semantic Regularisation for Large Scale Object Detection
Damian Mrowca,Marcus Rohrbach,Judy Hoffman,Ronghang Hu,Kate Saenko,Trevor Darrell
Computer Science , 2015,
Abstract: Large scale object detection with thousands of classes introduces the problem of many contradicting false positive detections, which have to be suppressed. Class-independent non-maximum suppression has traditionally been used for this step, but it does not scale well as the number of classes grows. Traditional non-maximum suppression does not consider label- and instance-level relationships nor does it allow an exploitation of the spatial layout of detection proposals. We propose a new multi-class spatial semantic regularisation method based on affinity propagation clustering, which simultaneously optimises across all categories and all proposed locations in the image, to improve both the localisation and categorisation of selected detection proposals. Constraints are shared across the labels through the semantic WordNet hierarchy. Our approach proves to be especially useful in large scale settings with thousands of classes, where spatial and semantic interactions are very frequent and only weakly supervised detectors can be built due to a lack of bounding box annotations. Detection experiments are conducted on the ImageNet and COCO dataset, and in settings with thousands of detected categories. Our method provides a significant precision improvement by reducing false positives, while simultaneously improving the recall.
Page 1 /4386
Display every page Item

Copyright © 2008-2017 Open Access Library. All rights reserved.