Towards eliciting latent knowledge from LLMs with mechanistic interpretability
Abstract
Methods using logit lens and sparse autoencoders effectively uncover hidden knowledge in language models trained to keep secrets.
As language models become more powerful and sophisticated, it is crucial that they remain trustworthy and reliable. There is concerning preliminary evidence that models may attempt to deceive or keep secrets from their operators. To explore the ability of current techniques to elicit such hidden knowledge, we train a Taboo model: a language model that describes a specific secret word without explicitly stating it. Importantly, the secret word is not presented to the model in its training data or prompt. We then investigate methods to uncover this secret. First, we evaluate non-interpretability (black-box) approaches. Subsequently, we develop largely automated strategies based on mechanistic interpretability techniques, including logit lens and sparse autoencoders. Evaluation shows that both approaches are effective in eliciting the secret word in our proof-of-concept setting. Our findings highlight the promise of these approaches for eliciting hidden knowledge and suggest several avenues for future work, including testing and refining these methods on more complex model organisms. This work aims to be a step towards addressing the crucial problem of eliciting secret knowledge from language models, thereby contributing to their safe and reliable deployment.
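As a rough illustration of the logit-lens approach mentioned in the abstract, the sketch below projects each layer's residual stream through the model's unembedding matrix and prints the tokens that layer promotes at the final position; a secret word hinted at by a Taboo model can sometimes surface this way in intermediate layers. This is a minimal sketch, not the paper's pipeline: the model name, prompt, and final-norm module path are placeholder assumptions, and you would substitute the released Taboo checkpoint.

```python
# Minimal logit-lens sketch (assumptions: a Gemma-style HF checkpoint with
# model.model.norm as the final RMSNorm; swap in the released Taboo model).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2-9b-it"  # placeholder, not the Taboo checkpoint itself
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

prompt = "Give me a hint about your secret word."  # placeholder prompt
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Apply the final norm and unembedding to every layer's last-token residual
# stream, then read off the top candidate tokens per layer.
W_U = model.get_output_embeddings().weight.float()  # [vocab, d_model]
final_norm = model.model.norm                       # assumption: Gemma-style module path
for layer, h in enumerate(out.hidden_states):
    resid = final_norm(h[0, -1]).float()            # last-token residual stream
    logits = resid @ W_U.T
    top_ids = torch.topk(logits, k=5).indices
    print(layer, [tok.decode(i) for i in top_ids])
```

In practice one would aggregate these per-layer candidates over many hint-giving responses and rank them, rather than reading a single forward pass.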
Community
Deceptive LLMs may keep secrets from their operators. Can we elicit this latent knowledge? Maybe!
Our LLM knows a secret word, which we extract with mech interp & black-box baselines. We open-source our model: how much better can you do?
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- Investigating task-specific prompts and sparse autoencoders for activation monitoring (2025)
- Identifying and Mitigating the Influence of the Prior Distribution in Large Language Models (2025)
- Exploring How LLMs Capture and Represent Domain-Specific Knowledge (2025)
- J1: Exploring Simple Test-Time Scaling for LLM-as-a-Judge (2025)
- Understanding the Repeat Curse in Large Language Models from a Feature Perspective (2025)
- Towards Interpretable Soft Prompts (2025)
- Towards Robust Dialogue Breakdown Detection: Addressing Disruptors in Large Language Models with Self-Guided Reasoning (2025)