Doctoral thesis

Compositional visual reasoning and generalization with neural networks

  • 2024

PhD: Università della Svizzera italiana

Deep neural networks (NNs) have recently revolutionized the field of Artificial Intelligence, making great progress in computer vision, natural language processing, complex game play, generative AI, and scientific disciplines. However, these connectionist models are still far from human-level performance in terms of their robustness to domain changes, their capacity for compositional reasoning, and their combinatorial generalization to unseen data. In this thesis, we hypothesize that this is a consequence of the binding problem, namely, the inability of NNs to discover discrete representations from raw data and flexibly combine them to solve tasks. We first discuss three promising paths going forward: learning an object- and relation-centric world model, scaling NNs, and employing task decomposition through modular NNs, e.g. by using a large language model (LLM) as a controller. We then present several contributions on learning object representations, as well as on visual reasoning with LLMs as controllers. First, we propose a novel approach to common-sense relational reasoning that learns to discover objects and relations from raw pixels and models their interactions in a parallel rather than sequential manner, which improves prediction and systematic generalization. We then introduce a model that can not only learn objects and relations from raw visual data, but also discover a hierarchical part-whole structure between them. Our approach distinguishes multiple levels of abstraction and improves over strong baselines in modeling synthetic and real-world videos. Since (hierarchical) decomposition into objects is generally task-dependent, it is sometimes infeasible and undesirable to decompose a scene into all hierarchy levels. For these reasons, it might be more beneficial to modulate objects with task information, e.g. via actions/goals in a reinforcement learning setting.
In this context, we introduce object-centric agents that greatly improve generalization and robustness to unseen data. We then introduce a novel synchrony-based method that, for the first time, is capable of discovering objects in an unsupervised manner in multi-object color datasets and simultaneously representing more than three objects. Our final contribution is on visual reasoning with LLMs as controllers, which has the potential to "sidestep" the binding problem by decomposing the task into subtasks and then solving the subtasks by orchestrating a set of (visual) tools. We introduce a framework that makes use of spatially and temporally abstract routines and leverages a small number of labeled examples to automatically generate in-context examples, thereby avoiding human-created in-context examples and making the LLM-as-controller setup more robust. By comparing these models with standard approaches in the literature, we confirm that object-centric approaches are promising for endowing NNs with human-level compositional reasoning and generalization capabilities.
  • English
Computer science and technology
License undefined
