Jailbreak Vulnerabilities in Large Reasoning Models
Investigating adversarial robustness and jailbreak attacks targeting large reasoning-oriented language models.
Large reasoning models (e.g., chain-of-thought or process-reward-trained models) may exhibit distinct vulnerability profiles compared to standard LLMs. This project analyzes jailbreak strategies that exploit the reasoning process and proposes defenses.
Key contributions:
- Taxonomy of reasoning-specific jailbreak vectors
- Empirical evaluation across multiple frontier reasoning models
- Mitigation strategies and alignment implications