Spatial Reasoning
Grounding Language Models for Compositional and Spatial Reasoning
This project evaluates grounded neural language models on compositional and spatial reasoning.
Motivation
Humans naturally combine visual and textual information to learn compositional and spatial relationships among objects. When reading a text, we can mentally picture the spatial relationships it describes.
Vision-and-language models (VLMs), trained jointly on text and image data, have been proposed to address the lack of grounding in language models. However, recent work has shown that these models still struggle to ground spatial concepts properly.
Research Goals
- Evaluate state-of-the-art pre-trained and fine-tuned VLMs on compositional and spatial reasoning (see the evaluation sketch below)
- Explore synthetic dataset creation methods: text-to-image generation, image captioning, and image retrieval (see the generation sketch below)
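As a concrete illustration of the first goal, the sketch below shows one common zero-shot evaluation setup (not necessarily the exact one used in this project): a pre-trained CLIP model scores an image against a correct spatial caption and a foil that swaps the relation, and the model is credited when it prefers the correct caption. The checkpoint, image path, and captions are illustrative assumptions.

```python
# Minimal sketch of zero-shot spatial-reasoning evaluation with CLIP.
# The checkpoint, image path, and captions are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def prefers_correct_caption(image: Image.Image, caption: str, foil: str) -> bool:
    """Return True if the model scores the correct caption above the foil."""
    inputs = processor(text=[caption, foil], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, 2)
    return bool(logits[0, 0] > logits[0, 1])

image = Image.open("cat_and_dog.jpg")  # hypothetical test image
correct = prefers_correct_caption(
    image,
    "a cat to the left of a dog",
    "a cat to the right of a dog",  # foil: only the spatial relation differs
)
print("model prefers correct caption:", correct)
```

For the second goal, text-to-image generation can turn spatial captions into synthetic image-caption pairs; a minimal sketch using the `diffusers` library follows, again with an illustrative checkpoint and prompt rather than the project's actual setup.

```python
# Minimal sketch of synthetic data creation via text-to-image generation.
# The checkpoint and prompt are illustrative, not the project's actual setup.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")  # assumes a CUDA-capable GPU

prompt = "a photo of a cat to the left of a dog"
image = pipe(prompt).images[0]
image.save("cat_left_of_dog.png")  # synthetic image-caption pair
```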
Results
We improved on the state of the art in compositional reasoning and performed zero-shot experiments on spatial reasoning.
Resources
- 📄 Thesis: ADDI Repository
- 💻 Code: GitHub
- 🤗 Models: Hugging Face