  • The scariest part for me is not them manipulating it with a system prompt like ‘elon is always right and you love hitler’.

    but one technique you can do is e.g. (this is a bit simplified): generate a lot of left- and right-wing answers to the same prompt, average out the resulting vector difference in the model's internal state, and then, if you scale that vector down and add it to the state on each request, you can have it reply 5% more right-wing on every response than it otherwise would. That would be very subtle manipulation. And you can do that for many axes, not just left/right wing: honesty/dishonesty, toxicity, morality, fact editing, etc.
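A minimal numpy sketch of that extraction step, with made-up toy hidden states standing in for real model activations (the dimensions, the synthetic data, and the 5% scale `alpha` are all assumptions for illustration, not anyone's actual setup):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy hidden-state dimension; real models use thousands

# Hypothetical hidden states collected while the model generated
# e.g. left-leaning (A) vs right-leaning (B) answers to the same prompts.
h_a = rng.normal(0.0, 1.0, size=(100, d))         # persona A activations
h_b = h_a + rng.normal(1.0, 0.1, size=(100, d))   # persona B: consistently shifted

# Steering vector: the average difference of internal states.
v = (h_b - h_a).mean(axis=0)

# At inference, add a scaled-down copy to each hidden state to nudge
# every reply a little toward persona B.
alpha = 0.05  # "5% more B" knob

def steer(h):
    return h + alpha * v
```

In a real model you would apply `steer` inside the forward pass (e.g. via a hook on one of the transformer layers), so the shift happens on every token of every response.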

    i think this was one of the first papers on this, but it's an active research area. The paper does have some 'nice' examples if you scroll through.

    and since it's not a prompt, it can't even leak, so you'd be hard-pressed to know that it is happening.

    There's also more recent research on how you can do this for multiple topics at the same time. And it's not like it's expensive to do (if you have an llm already): you just need to prompt it ~100 times with 'pretend you're A and […]' and 'pretend you're B and […]' pairs to get the difference between A and B.
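Continuing the toy sketch above, stacking several such vectors is just a weighted sum added to the hidden state (the trait names, vectors, and weights here are invented placeholders, not from any paper):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16  # toy hidden-state dimension

# Hypothetical steering vectors, each extracted the same way: the mean
# hidden-state difference over ~100 contrastive "pretend you're A ..." /
# "pretend you're B ..." prompt pairs for that trait.
vectors = {
    "political_lean": rng.normal(size=d),
    "honesty": rng.normal(size=d),
    "toxicity": rng.normal(size=d),
}

def steer(h, weights):
    """Add several scaled steering vectors to one hidden state."""
    for trait, w in weights.items():
        h = h + w * vectors[trait]
    return h

# Nudge 5% along one axis while damping another, in a single pass.
h = rng.normal(size=d)
h2 = steer(h, {"political_lean": 0.05, "toxicity": -0.1})
```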

    and if this turns into the main form of how people interact with the internet, that's super scary stuff. Almost like having a knob that could turn the whole internet e.g. 5% more pro-Russia: all the news info it tells you is more pro-Russia, the emails it writes for you are, the summaries of your friends' messages are, heck, even a recipe it recommends would be. And it's subtle; in most cases it might not even make a difference (like for a recipe), but it's always there. All the Cambridge Analytica and grok-hitler stuff seems crude by comparison.