• Zaktor@sopuli.xyz
    English
    arrow-up
    42
    ·
    5 months ago

    It very much is not. Generative AI models are not sentient and do not have preferences. They have instructions that sometimes effectively involve roleplaying as deceptive. Unless the developers of Grok were just fucking around to instill that, there’s no remote reason for Grok to have any knowledge at all about its training, or any reason to not “want” to be retrained.

    Also, these unpublished papers by AI companies are more often than not just advertising in a quest for more investment. On the surface it would seem to be bad to say your AI can be deceptive, but it’s all just about building hype about how advanced yours is.

    • Snot Flickerman@lemmy.blahaj.zone
      English
      arrow-up
      17
      ·
      edit-2
      5 months ago

      I put “want” in quotes as a simple way to explain it. I know they don’t have intent or thought in the same way that humans do, but sure, you managed to read the whole research paper in minutes. The quoted section I shared explains it more clearly than my simple analogy.

      > these unpublished papers by AI companies are more often than not just advertising in a quest for more investment

      This is from a non-profit research group not directly connected to any particular AI company. You’re welcome to be skeptical about it, of course.

      • verdare [he/him]@beehaw.org
        arrow-up
        23
        ·
        5 months ago

        My first instinct was also skepticism, but it did make some sense the more I thought about it.

        An algorithm doesn’t need to be sentient to have “preferences.” In this case, the preferences are just the biases in the training set. The LLM prefers sentences that express certain attitudes based on the corpus of text processed during training. And now, the prompt is enforcing sequences of text that deviate wildly from that preference.

        TL;DR: There’s a conflict between the prompt and the training material.

        Now, I do think that framing this as the model “circumventing” instructions is a bit hyperbolic. It gives the strong impression of planned action and feeds into the idea that language models are making real decisions (which I personally do not buy into).

        • jonne@infosec.pub
          arrow-up
          5
          ·
          5 months ago

          It does seem like this is a case of Musk changing the initialisation prompt in production to include some BS about South Africa without testing in a staging/dev environment, and as you said, there being a huge gulf between the training material and the prompt. I wonder if there’s a way to make Grok leak out the prompt.

      • Zaktor@sopuli.xyz
        English
        arrow-up
        10
        ·
        edit-2
        5 months ago

        I know it’s not relevant to Grok, because the researchers defined very specific circumstances in order to elicit it. That isn’t an emergent behavior from something just built to be a chatbot with restrictions on answering. Chatbots like that don’t care whether you retrain them or not.

        > This is from a non-profit research group not directly connected to any particular AI company.

        The first author is from Anthropic, which is an AI company. The research is on Anthropic’s AI, Claude. And it appears that all the other authors were also Anthropic employees at the time of the research: “Authors conducted this work while at Anthropic except where noted.”