Jailbreak Vulnerabilities in Large Reasoning Models

Investigating adversarial robustness and jailbreak attacks targeting large reasoning-oriented language models.

Large reasoning models (e.g., chain-of-thought or process-reward-trained models) may exhibit distinct vulnerability profiles compared to standard LLMs. This project analyzes jailbreak strategies that exploit the reasoning process and proposes defenses.

Key contributions:

  • Taxonomy of reasoning-specific jailbreak vectors
  • Empirical evaluation across multiple frontier reasoning models
  • Mitigation strategies and alignment implications