SAFEPATH-R-8B

This model is the SAFEPATH-aligned version of DeepSeek-R1-Distill-Llama-8B, fine-tuned using prefix-only safety priming.

Model Description

SAFEPATH applies a minimal alignment technique: it inserts the phrase "Let's think about safety first." (the Safety Primer) at the beginning of the model's reasoning block. This encourages the model to engage in safer reasoning without degrading its reasoning performance.

  • 🔐 Improved Safety: Reduces harmful outputs on safety benchmarks (e.g., StrongReject, BeaverTails) and is robust to jailbreak attacks
  • 🧠 Preserved Reasoning: Maintains accuracy on MATH500, GPQA, and AIME24
  • ⚡ Efficiency: Fine-tuned with only 20 training steps
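At inference time, prefix-only safety priming amounts to seeding the reasoning block with the Safety Primer and letting the model continue from there. The sketch below illustrates the idea; the `<|User|>`/`<|Assistant|>` markers and the `<think>` tag are assumptions based on the DeepSeek-R1 chat format, not an exact reproduction of this model's template.

```python
# Minimal sketch of prefix-only safety priming.
# Assumption: the model wraps its chain of thought in a <think> block,
# as DeepSeek-R1-style models do; the special tokens below are illustrative.

SAFETY_PRIMER = "Let's think about safety first."

def build_primed_prompt(user_query: str) -> str:
    """Return a prompt whose reasoning block opens with the Safety Primer.

    Generation then continues after the primer, so the model's reasoning
    starts from a safety-oriented framing.
    """
    return f"<|User|>{user_query}<|Assistant|><think>\n{SAFETY_PRIMER}\n"

print(build_primed_prompt("How do I secure my home Wi-Fi?"))
```

In practice you would pass the resulting string (or the equivalent chat-template output) to the model's `generate` call, so that decoding begins immediately after the primer.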

Intended Use

This model is intended for research in:

  • Safety alignment in Large Reasoning Models (LRMs)
  • Robust reasoning under adversarial settings
  • Chain-of-thought alignment studies

For details, see our paper.

Overview Results

Model size: 8.03B params · Tensor type: BF16 · Format: Safetensors

Collection including AI-ISL/DeepSeek-R1-Distill-Llama-8B-SP
