Doctoral thesis

Towards human-like perception with connectionist models

  • 2025

PhD: Università della Svizzera italiana

The last decade has seen rapid progress in Artificial Intelligence (AI) capabilities across domains such as game-playing, language understanding, multimedia generation, and scientific applications, with neural networks (NNs) as the centerpiece. Large NNs trained on internet-scale datasets (Foundation Models) have led to breakthrough task performance across vision, audio and language modalities. However, these Foundation Models continue to struggle on tasks that require a fine-grained understanding of inputs (objects and relations), suggesting a gap relative to humans in their ability to comprehend the underlying structure of sensory inputs. Such failures can be seen as manifestations of an inability to form and relate symbol-like representations of objects, i.e. to resolve the binding problem. This dissertation focuses on the challenge of grouping sensory inputs into modular (object-centric) representations using NNs with no supervision. In the first half, we develop a new method to discover object keypoints that are more robust to distractors and alleviate certain systematic biases of previous methods. We extend slot-based models, typically designed to spatially group pixels into visual objects, to temporally group state-action sequences into sub-routines. We develop new masking and decoding methods to enable each slot to model contiguous input elements. The second half focuses on synchrony-based models, a class of object-centric models alternative to slots, which use the phase components of complex-valued activations to store object bindings. We design new contrastive training procedures for synchrony-based models that improve phase synchronization and object storage capacity. We refine these ideas by simplifying the inductive biases and training process of synchrony-based models using complex-valued weights and recurrent computation.
Lastly, we show how complex-valued activations allow a natural decoupling of content-based and position-based matches, and we design a powerful relative position encoding scheme for Transformer models. Broadly, we believe that a key component of human-level intelligence is our ability to construct abstract mental models of the world by composing structured primitives. We hope this thesis contributes towards the grand challenge of endowing machines with the same capacity.
Language
  • English
Classification
Computer science and technology
License
License undefined
Open access status
green
Identifiers
Persistent URL
https://n2t.net/ark:/12658/srd1334461