Doctoral thesis

Learning structured neural representations for visual reasoning tasks


236 p

Doctoral thesis: Università della Svizzera italiana, 2020

Deep neural networks learn representations of data to facilitate problem-solving in their respective domains. However, they struggle to acquire a structured representation based on more symbolic entities, which are commonly understood as core abstractions central to the human capacity for generalization. This dissertation studies this issue for visual reasoning tasks. Inspired by how humans solve these tasks, we propose to learn structured neural representations that distinguish objects: abstract visual building blocks that can separately be composed and reasoned with. We investigate the limitations of current deep neural networks at effectively discovering, representing, and relating these more symbolic entities, and present several improvements. To address the problem of discovering and representing objects, we propose two novel approaches. In one case, we formalize this problem as a pixel-level clustering problem and formulate a neural differentiable clustering algorithm that solves it. We demonstrate how, unlike standard representation learning techniques, it can be trained to learn about objects in an unsupervised manner and acquire corresponding representations that can be treated as symbols for reasoning. In the other case, we adopt a purely generative approach and demonstrate how a neural network equipped with the right inductive bias can learn about objects in the process of synthesizing images, even in complex visual settings. Concerning the problem of relating symbolic entities with neural networks, we investigate how object representations can help facilitate building structured models for common-sense physical reasoning that generalize more systematically. We extend our previous representation learning approach to facilitate model building in this way and demonstrate how it can learn about general relations between objects to reason about their (future) physical interactions.
Finally, we investigate the utility of a representational format that isolates independent sources of information for encoding the features of individual objects. We conduct a large-scale study of such 'disentangled' representations that includes various methods and metrics on two new abstract visual reasoning tasks. Our results indicate that better disentanglement enables quicker learning using fewer samples.
  • English
Computer science and technology
License undefined