Compositional Reasoning in Vision-Language Models

Probing the compositional reasoning capabilities of large vision-language models.

We systematically evaluate how well state-of-the-art vision-language models handle compositional queries — questions that require jointly reasoning about multiple visual and linguistic concepts. We identify failure modes and propose diagnostic benchmarks.

Key contributions:

  • Benchmark design for compositional visual reasoning
  • Evaluation of leading VLMs
  • Insights into compositional generalization gaps
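To make the evaluation setup concrete, the sketch below shows what a minimal exact-match evaluation loop over compositional queries could look like. Everything here is hypothetical: the `Item` schema, the example benchmark entries, and the `answer` stub standing in for a real VLM call are illustrative placeholders, not the actual benchmark or model interface.

```python
# Minimal sketch of a compositional-query evaluation loop.
# The model interface (`answer`) and the benchmark items below are
# hypothetical placeholders for a real VLM and a real diagnostic benchmark.

from dataclasses import dataclass


@dataclass
class Item:
    image_id: str   # identifier for the image the question refers to
    question: str   # compositional query combining multiple concepts
    expected: str   # gold answer


# Toy items: each question requires binding attributes, objects,
# and spatial relations jointly (illustrative only).
BENCHMARK = [
    Item("img_001", "What color is the cube left of the sphere?", "red"),
    Item("img_002", "Is the small dog on top of the striped couch?", "yes"),
    Item("img_003", "How many blue objects are behind the tree?", "2"),
]


def answer(image_id: str, question: str) -> str:
    """Stub standing in for a vision-language model call."""
    canned = {"img_001": "red", "img_002": "no", "img_003": "2"}
    return canned[image_id]


def evaluate(items) -> float:
    """Exact-match accuracy over the benchmark."""
    correct = sum(answer(it.image_id, it.question) == it.expected
                  for it in items)
    return correct / len(items)


if __name__ == "__main__":
    print(f"accuracy = {evaluate(BENCHMARK):.2f}")
```

Exact-match scoring is the simplest choice; a real harness would likely add answer normalization and per-concept error breakdowns to expose which compositions fail.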