Alignment-weighted DPO

A DPO variant that targets the most problematic parts of an output by assigning different preference weights to the reasoning and final-answer segments. (ICLR)

Despite recent advances in alignment techniques such as Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO), large language models (LLMs) remain vulnerable to jailbreak attacks, such as those that rephrase harmful intent in indirect or deceptive ways. We hypothesize that this brittleness stems from shallow alignment mechanisms that lack deep reasoning. To validate this, we perform a causal intervention by deactivating reasoning-critical neurons and observe that alignment performance remains largely unaffected, even as reasoning ability significantly deteriorates. This suggests that current alignment techniques may succeed in rejecting harmful prompts without truly understanding why they are harmful, making them susceptible to more sophisticated and deceptive jailbreak attacks. To address this, we propose strengthening alignment through reasoning-aware post-training. We construct and release a novel Chain-of-Thought (CoT) fine-tuning dataset that includes both utility-oriented and safety-critical prompts with step-by-step rationales. Fine-tuning on this dataset encourages models to produce principled refusals grounded in reasoning, outperforming standard SFT baselines. Furthermore, inspired by failure patterns in CoT fine-tuning, we introduce Alignment-Weighted DPO, a reinforcement-learning approach that targets the most problematic parts of an output by assigning different preference weights to the reasoning and final-answer segments. This produces finer-grained, targeted updates than vanilla DPO and improves robustness to diverse jailbreak strategies. Extensive experiments across multiple safety and utility benchmarks show that our method consistently improves alignment robustness while maintaining overall model utility.
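The abstract does not state the exact objective, so the following is only a minimal sketch of how segment-level preference weights could enter a DPO-style loss. It assumes per-token log-probabilities from the policy and a frozen reference model, plus a boolean mask marking which tokens belong to the final answer. The names `w_reason`, `w_answer`, `segment_log_ratio`, and `weighted_dpo_loss` are hypothetical and not taken from the paper. With both weights set to 1 the sketch reduces to the standard DPO loss.

```python
import math
from typing import Sequence, Tuple

# (policy log-probs, reference log-probs, is_answer mask), one entry per token
Response = Tuple[Sequence[float], Sequence[float], Sequence[bool]]


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def segment_log_ratio(resp: Response, w_reason: float, w_answer: float) -> float:
    """Weighted sum of per-token log-ratios log pi(t) - log pi_ref(t).

    Tokens in the reasoning segment are scaled by w_reason, tokens in the
    final-answer segment by w_answer (both are assumed hyperparameters).
    """
    policy_lp, ref_lp, is_answer = resp
    total = 0.0
    for p, r, ans in zip(policy_lp, ref_lp, is_answer):
        weight = w_answer if ans else w_reason
        total += weight * (p - r)
    return total


def weighted_dpo_loss(
    chosen: Response,
    rejected: Response,
    w_reason: float = 1.0,
    w_answer: float = 1.0,
    beta: float = 0.1,
) -> float:
    """Segment-weighted DPO-style loss: -log sigmoid(beta * weighted margin)."""
    margin = beta * (
        segment_log_ratio(chosen, w_reason, w_answer)
        - segment_log_ratio(rejected, w_reason, w_answer)
    )
    # -log(sigmoid(m)) equals softplus(-m)
    return softplus(-margin)


# Toy example: two reasoning tokens followed by one answer token.
chosen = ([-1.0, -0.5, -0.2], [-1.2, -0.6, -0.9], [False, False, True])
rejected = ([-0.8, -0.7, -0.3], [-0.9, -0.7, -1.5], [False, False, True])

vanilla = weighted_dpo_loss(chosen, rejected)                  # weights 1, 1
answer_heavy = weighted_dpo_loss(chosen, rejected, w_answer=2.0)
print(vanilla, answer_heavy)
```

In this toy pair the policy favors the rejected answer token more strongly than the chosen one, so raising `w_answer` increases the loss and concentrates the update on the answer segment. How the weights are actually chosen in the paper is not specified in the abstract.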
