Compositional Reasoning in Vision-Language Models
Probing the compositional reasoning capabilities of large vision-language models.
We systematically evaluate how well state-of-the-art vision-language models handle compositional queries — questions that require joint reasoning over multiple visual and linguistic concepts. We identify recurring failure modes and propose diagnostic benchmarks that isolate them.
Key contributions:
- Benchmark design for compositional visual reasoning
- Evaluation of leading VLMs
- Insights into compositional generalization gaps
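One common recipe for probing compositionality is to pair each correct caption with a hard negative in which attributes are swapped, then check whether the model scores the correct caption higher. The sketch below illustrates this idea; the function names and scoring interface are hypothetical, not taken from any specific benchmark in this work.

```python
def swap_words(caption: str, a: str, b: str) -> str:
    """Build a hard-negative caption by exchanging two attribute words.

    E.g. "a red cube next to a blue sphere" -> "a blue cube next to a red sphere".
    """
    swapped = []
    for tok in caption.split():
        if tok == a:
            swapped.append(b)
        elif tok == b:
            swapped.append(a)
        else:
            swapped.append(tok)
    return " ".join(swapped)


def compositional_accuracy(score_pairs: list[tuple[float, float]]) -> float:
    """Fraction of examples where the model prefers the correct caption.

    Each pair is (score_for_correct_caption, score_for_swapped_negative),
    as produced by some hypothetical image-text matching model.
    """
    if not score_pairs:
        return 0.0
    wins = sum(1 for correct, negative in score_pairs if correct > negative)
    return wins / len(score_pairs)
```

A model with no compositional sensitivity scores roughly 0.5 on such pairs, since the correct and swapped captions contain the same words.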