Abstract
Computer Vision aims to artificially mimic the visual reasoning capabilities of humans by using algorithms which, once deployed to mechanical agents and software tools, improve car and traffic safety, enable effective visual search on the World Wide Web, or increase productivity and quality in industrial production processes. Similar to the reasoning processes that are constantly occurring in our brains, such algorithms directly rely on abstract representations of the objects in the visually perceivable environment and beyond. Consequently, learning informative representations that allow us to detect and recognize objects, and to evaluate image scenes, is of paramount importance to almost all areas of computer vision. The quality of a learned object representation typically depends on certain properties such as invariance to image noise, e.g. uninformative background, and robustness to object rotation, translation, or occlusion. In addition, many applications require representations that enable comparisons, i.e. determining how semantically similar or dissimilar two objects are. However, arguably the most challenging aspect of learning object representations is ensuring generalization to unseen objects, object variations, and environments. While a large body of similarity learning literature addresses the former aspects, the latter, i.e. the generalization of object representations, is still poorly understood and thus rarely addressed explicitly. In this thesis, we analyze the current field of similarity learning and identify properties of object representations that correlate well with their generalization performance. We leverage our findings and propose novel methods that improve current approaches to similarity learning, both in terms of data sampling and learning problem formulation. To this end, we introduce several training tasks that complement the prevailing paradigm of standard class-discriminative learning, which are eventually unified under the concept of Diverse Feature Aggregation. To further facilitate the optimization of similarity learning approaches, we replace the commonly used heuristic and predefined data sampling strategies with a learnable sampling policy that adapts to the training state of our model.

Typically, similarity learning finds applications in supervised learning problems. However, as more training data becomes available and annotation processes are often tedious or even infeasible, unsupervised learning settings have attracted particular interest in recent years. In the second part of this thesis, we explore the effectiveness of similarity learning for obtaining informative representations without the need for training labels, for both static images and video sequences. To enable learning, our approaches alternate between inferring data relations and refining our visual representations during training. In doing so, we resort to the classic divide-and-conquer principle: we decompose overall complex learning problems into feasible local subproblems whose solutions are subsequently consolidated to yield concerted, global representations. Throughout this work, we justify our contributions through rigorous analysis and strong model performance on standard benchmark sets, often outperforming previous state-of-the-art results.
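To make the similarity learning setting referred to above concrete, the following is a minimal sketch of the standard triplet-based formulation: an encoder maps images to embeddings, and a margin loss pulls anchor-positive pairs together while pushing anchor-negative pairs apart. This is generic background only; the names `EmbeddingNet` and `triplet_loss`, the architecture, and the use of random tensors as stand-ins for sampled image triplets are illustrative assumptions, and the thesis's specific contributions (Diverse Feature Aggregation, the learned sampling policy, the unsupervised variants) are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EmbeddingNet(nn.Module):
    """Small convolutional encoder mapping images to L2-normalized embeddings."""

    def __init__(self, embed_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, embed_dim)

    def forward(self, x):
        z = self.head(self.backbone(x))
        # Unit-length embeddings so Euclidean distance reflects semantic similarity.
        return F.normalize(z, dim=1)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin loss: anchor should be closer to the positive than to the negative."""
    d_ap = (anchor - positive).pow(2).sum(dim=1)
    d_an = (anchor - negative).pow(2).sum(dim=1)
    return F.relu(d_ap - d_an + margin).mean()


# Toy training step on random tensors standing in for sampled image triplets.
model = EmbeddingNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

anchor_img, positive_img, negative_img = (torch.randn(8, 3, 64, 64) for _ in range(3))
loss = triplet_loss(model(anchor_img), model(positive_img), model(negative_img))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In practice, the choice of which triplets to form from a mini-batch (the data sampling strategy) strongly affects training; the abstract's proposed learnable sampling policy targets exactly this step, replacing fixed heuristics such as semi-hard negative mining.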
| Document type: | Dissertation |
| --- | --- |
| Supervisor: | Ommer, Prof. Dr. Björn |
| Place of Publication: | Heidelberg |
| Date of thesis defense: | 4 November 2021 |
| Date Deposited: | 22 Nov 2021 11:41 |
| Date: | 2021 |
| Faculties / Institutes: | The Faculty of Mathematics and Computer Science > Institut für Mathematik; The Faculty of Mathematics and Computer Science > Department of Computer Science |
| DDC-classification: | 004 Data processing, computer science |
| Controlled Keywords: | Maschinelles Sehen, Bildverstehen, Mustererkennung |
| Uncontrolled Keywords: | Deep Learning, Computer Vision, Representation Learning, Similarity Learning, Metric Learning, Image Classification, Image Retrieval |