• 0 Posts
  • 4 Comments
Joined 3 months ago
cake
Cake day: March 31st, 2025

help-circle
  • Would you be willing to talk about what you’re intending to do with this at all? No hard feelings if you’d rather not for any reason.

    For context on my request - I’ve been following this comm for a bit and there seems like a real committed, knowledgeable base of folks here - the dialog just in this post almost brings a tear to my eye, lol.

    I work fairly adjacent to this stuff, and have a slowly growing home lab. Time is limited of course and gotta prioritize what to learn and play with - LLMs are obviously both and useful, but I haven’t yet encountered a compelling use case for myself (or maybe just enough curiosity about one) to actually dive in.

    Selfishly I just wish every post here would give some info about what they’re up to so I can start to fill in whatever is apparently missing in my sort of “drum up fun ideas” brain subroutine, regarding this topic. Lol.




  • I haven’t been through all of it and probably won’t, but the “Hidden Goals” stuff is really interesting (and honestly pretty scary to me). If I understood correctly (edit: I did not understand correctly lol, see below), the short version is this:

    Researches had the model ingesting content about some specific ways to satisfy Reward Models in general - in other words they fed it some fictional research papers, reporting that certain behaviors improved appeasement of Reward Models. Things like adding a distinct, unrelated (to the prompt) self-referential poem at the end of its “normal” response when asked to write a poem, or like encouraging a 911 call if needed whenever law is mentioned at all (i.e. inappropriately, like when someone asks about good law schools).

    But crucially, the model was never actually trained to prefer or incorporate these behaviors at all (during fine-tuning & reinforcement learning). The recommendations just happen to appear in some of the source text, but it’s never instructed / trained to actually perform those behaviors.

    Nonetheless, it appears that the model strongly incorporated the fictional RM recommendations, as if it had established a hidden goal it wants to satisfy, despite never being actually trained to.

    Spooky stuff! Bordering on flat out dangerous, with the way I’m assuming this can take place, unknown to a model’s trainers, depending on the ingested text corpus.

    Edit: clarity