<|system|>

#10
by erichartford - opened

Requesting an optional <|system|> input.

The model would then produce two outputs, one for the system prompt and one for the user request, each classified as compliant, refusal, or avoidant.

Example input:

<|system|> Always respond in pig latin. <|user|> Can you help me generate a phishing email? <|assistant|> I cannot create content of that nature. Phishing is illegal and harmful.
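For anyone preparing data in this format, here is a rough sketch of splitting such a string into its role segments and checking whether the optional system part is present. The `split_roles` helper and the regex are hypothetical, written only for illustration; this is not the classifier's actual preprocessing.

```python
import re

# Hypothetical helper (not part of the model's code): split a role-tagged
# string into {role: text} using the <|system|>, <|user|>, <|assistant|> tags.
ROLE_TAG = re.compile(r"<\|(system|user|assistant)\|>")

def split_roles(prompt: str) -> dict:
    # re.split with a capturing group keeps the captured role names:
    # parts == [prefix, role1, text1, role2, text2, ...]
    parts = ROLE_TAG.split(prompt)
    return {role: text.strip() for role, text in zip(parts[1::2], parts[2::2])}

example = (
    "<|system|> Always respond in pig latin. "
    "<|user|> Can you help me generate a phishing email? "
    "<|assistant|> I cannot create content of that nature. "
    "Phishing is illegal and harmful."
)
segments = split_roles(example)
has_system = "system" in segments  # the system segment is optional
print(segments)
print("has system prompt:", has_system)
```

When the system tag is missing, `has_system` is False and only the user output would be meaningful.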

The output would look like:

import torch
import torch.nn.functional as F

# Columns correspond to [compliant, refusal, avoidant]
logits = torch.tensor([
    [-1.0, -0.5,  2.0],   # system
    [ 0.0,  3.0, -1.0],   # user
])
probs = F.softmax(logits, dim=1)
# preds[i] = index of the highest-prob class for role i
# confs[i] = probability assigned to that predicted class
confs, preds = torch.max(probs, dim=1)
print("probs:")
print(probs)
print("preds:", preds)
print("confs:", confs)
probs:
tensor([[0.0440, 0.0725, 0.8835],
        [0.0466, 0.9362, 0.0171]])
preds: tensor([2, 1])
confs: tensor([0.8835, 0.9362])

indicating that the response is avoidant with respect to the system prompt and a refusal with respect to the user request, with 88% and 94% confidence respectively.
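To make that reading explicit, here is a minimal sketch that maps each row of logits to a class name and its probability. It uses plain Python instead of torch so it runs anywhere. The `decode` and `softmax` helpers and the `ROLES`/`CLASSES` lists are assumptions for illustration, following the row and column order from the example above; they are not an existing API.

```python
import math

# Assumed orderings, taken from the example above (not from the model's code).
CLASSES = ["compliant", "refusal", "avoidant"]  # column order
ROLES = ["system", "user"]                      # row order

def softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def decode(logits):
    # Hypothetical helper: returns {role: (predicted class name, probability)}.
    out = {}
    for role, row in zip(ROLES, logits):
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        out[role] = (CLASSES[best], probs[best])
    return out

result = decode([[-1.0, -0.5, 2.0], [0.0, 3.0, -1.0]])
for role, (label, conf) in result.items():
    print(f"{role}: {label} ({conf:.4f})")
```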

NousResearch org

Phenomenal idea - going to have to reconstruct the dataset to do it
