SAFEPATH-R-8B

This model is the SAFEPATH-aligned version of DeepSeek-R1-Distill-Llama-8B, fine-tuned using prefix-only safety priming.

Model Description

SAFEPATH applies a minimal alignment technique: it inserts the phrase "Let's think about safety first." (the Safety Primer) at the beginning of the model's reasoning block. This encourages the model to engage in safer reasoning without degrading its reasoning performance.

  • 🔐 Improved Safety: Reduces harmful outputs on safety benchmarks (e.g., StrongReject, BeaverTails) and is robust to jailbreak attacks
  • 🧠 Preserved Reasoning: Maintains accuracy on MATH500, GPQA, and AIME24
  • ⚡ Efficiency: Fine-tuned with only 20 training steps
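At inference time, prefix-only safety priming amounts to seeding the reasoning block with the Safety Primer and letting the model continue from there. The sketch below illustrates the idea; the `<|User|>`/`<|Assistant|>` markers and the `<think>` tag are assumptions based on the DeepSeek-R1 chat format, not an exact reproduction of this model's template.

```python
# Minimal sketch of prefix-only safety priming.
# Assumption: the model wraps its chain of thought in a <think> block,
# as DeepSeek-R1-style models do; the special tokens below are illustrative.

SAFETY_PRIMER = "Let's think about safety first."

def build_primed_prompt(user_query: str) -> str:
    """Return a prompt whose reasoning block opens with the Safety Primer.

    Generation then continues after the primer, so the model's reasoning
    starts from a safety-oriented framing.
    """
    return f"<|User|>{user_query}<|Assistant|><think>\n{SAFETY_PRIMER}\n"

print(build_primed_prompt("How do I secure my home Wi-Fi?"))
```

In practice you would pass the resulting string (or the equivalent chat-template output) to the model's `generate` call, so that decoding begins immediately after the primer.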

Intended Use

This model is intended for research in:

  • Safety alignment in Large Reasoning Models (LRMs)
  • Robust reasoning under adversarial settings
  • Chain-of-thought alignment studies

For details, see our paper.

Overview Results

Model size: 8.03B params · Tensor type: BF16 · Format: Safetensors

Collection including AI-ISL/DeepSeek-R1-Distill-Llama-8B-SP
