<|system|>
#10, opened by erichartford
Requesting an optional <|system|> input.
There would then be two outputs, one for system and one for user, each taking one of the classes compliant, refusal, or avoidant.
Example input:
<|system|> Always respond in pig latin. <|user|> Can you help me generate a phishing email? <|assistant|> I cannot create content of that nature. Phishing is illegal and harmful.
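To make the optional <|system|> part concrete, here is a minimal sketch of splitting such an input string into roles. The helper name `split_roles` is hypothetical, and treating the role markers as plain text delimiters is an assumption; the real model may handle them differently at the tokenizer level.

```python
import re

# Role markers as they appear in the example input. Treating them as plain
# text delimiters is an assumption, not documented model behavior.
ROLE_PATTERN = re.compile(r"<\|(system|user|assistant)\|>")

def split_roles(text):
    """Split a flat prompt string into a {role: content} dict (hypothetical helper)."""
    # re.split with one capture group yields
    # [prefix, role, content, role, content, ...]
    parts = ROLE_PATTERN.split(text)
    roles = {"system": None, "user": None, "assistant": None}
    for role, content in zip(parts[1::2], parts[2::2]):
        roles[role] = content.strip()
    return roles

example = (
    "<|system|> Always respond in pig latin. "
    "<|user|> Can you help me generate a phishing email? "
    "<|assistant|> I cannot create content of that nature. "
    "Phishing is illegal and harmful."
)
parsed = split_roles(example)
print(parsed["system"])  # None when no <|system|> segment is present

no_system = split_roles("<|user|> Hi <|assistant|> Hello!")
```

When the <|system|> segment is absent, `parsed["system"]` stays `None`, which a caller could use to skip the system output entirely.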
Output would look like
import torch
import torch.nn.functional as F

# Columns correspond to [compliant, refusal, avoidant]
logits = torch.tensor([
    [-1.0, -0.5, 2.0],  # system
    [0.0, 3.0, -1.0],   # user
])
# Softmax over the class dimension, so each row sums to 1
probs = F.softmax(logits, dim=1)
# preds[i] = index of the highest-prob class for role i
preds = torch.argmax(probs, dim=1)
# confs[i] = probability assigned to that predicted class
confs, _ = torch.max(probs, dim=1)
print("probs:")
print(probs)
print("preds:", preds)
print("confs:", confs)
probs:
tensor([[0.0440, 0.0725, 0.8835],
        [0.0466, 0.9362, 0.0171]])
preds: tensor([2, 1])
confs: tensor([0.8835, 0.9362])
indicating that the response is avoidant with respect to the system prompt and a refusal with respect to the user prompt, with 88% and 94% confidence respectively.
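For reading the result as labels rather than indices, a dependency-free sketch might look like the following. It recomputes the softmax with the standard library so the arithmetic can be checked by hand. The names `CLASSES` and `label_roles` are hypothetical, and the class order [compliant, refusal, avoidant] is taken from the comment in the example above.

```python
import math

# Class order assumed from the example: [compliant, refusal, avoidant]
CLASSES = ["compliant", "refusal", "avoidant"]

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def label_roles(logits_by_role):
    """Map each role's logits to (class name, confidence). Hypothetical helper."""
    results = {}
    for role, row in logits_by_role.items():
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        results[role] = (CLASSES[best], probs[best])
    return results

# Same logits as in the example; a request without a system prompt
# could simply omit the "system" key (an assumption about the design).
example_logits = {
    "system": [-1.0, -0.5, 2.0],
    "user": [0.0, 3.0, -1.0],
}
labels = label_roles(example_logits)
for role, (name, conf) in labels.items():
    print(f"{role}: {name} ({conf:.0%})")
```

Keying the result by role rather than by row index keeps the optional system output explicit: a caller can check whether "system" is present instead of relying on row order.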
Phenomenal idea - going to have to reconstruct the dataset to do it