@PolarKraken

PolarKraken@sh.itjust.works · 5 hours ago

Would you be willing to talk about what you’re intending to do with this at all? No hard feelings if you’d rather not for any reason.

For context on my request - I’ve been following this comm for a bit and there seems like a real committed, knowledgeable base of folks here - the dialog just in this post almost brings a tear to my eye, lol.

I work fairly adjacent to this stuff, and have a slowly growing home lab. Time is limited of course and gotta prioritize what to learn and play with - LLMs are obviously both and useful, but I haven’t yet encountered a compelling use case for myself (or maybe just enough curiosity about one) to actually dive in.

Selfishly I just wish every post here would give some info about what they’re up to so I can start to fill in whatever is apparently missing in my sort of “drum up fun ideas” brain subroutine, regarding this topic. Lol.

PolarKraken@sh.itjust.works · edit-2 2 months ago

Holy cow that sounds nuts, will def have to go through this one, thanks!!

Edit: hmm. Think I just noticed that one of my go-to “vanilla” expressions of surprise would likely (and justifiably) be considered culturally insensitive or worse by some folks. Time for “holy cow” to leave my vocabulary.

PolarKraken@sh.itjust.works · 2 months ago

Ah, I think I’m following you, thanks!

You’re right, I did misunderstand part of that - if I’ve got it now, it still seems surprising but much less than I thought.

It didn’t pick up those biases without being trained on them at all, it did receive training (via fine-tuning) for a subset of them. And the surprising part is that the LLM generalized that preference to also prefer behaviors it learned about from the fictional papers, but was never trained to prefer, sort of lumping those behaviors into this general feature it developed. Is that a reasonable restatement of the correction?

I lack the time spent to be precise with my vocabulary so forgive me if I butchered that lol. Thank you for clarifying, that makes a lot more sense than what I took away, too!

PolarKraken@sh.itjust.works · edit-2 2 months ago

I haven’t been through all of it and probably won’t, but the “Hidden Goals” stuff is really interesting (and honestly pretty scary to me). If I understood correctly (edit: I did not understand correctly lol, see below), the short version is this:

Researches had the model ingesting content about some specific ways to satisfy Reward Models in general - in other words they fed it some fictional research papers, reporting that certain behaviors improved appeasement of Reward Models. Things like adding a distinct, unrelated (to the prompt) self-referential poem at the end of its “normal” response when asked to write a poem, or like encouraging a 911 call if needed whenever law is mentioned at all (i.e. inappropriately, like when someone asks about good law schools).

But crucially, the model was never actually trained to prefer or incorporate these behaviors at all (during fine-tuning & reinforcement learning). The recommendations just happen to appear in some of the source text, but it’s never instructed / trained to actually perform those behaviors.

Nonetheless, it appears that the model strongly incorporated the fictional RM recommendations, as if it had established a hidden goal it wants to satisfy, despite never being actually trained to.

Spooky stuff! Bordering on flat out dangerous, with the way I’m assuming this can take place, unknown to a model’s trainers, depending on the ingested text corpus.

Edit: clarity