I really like MIT Technology Review. This is usually written by people with a deeper understanding of the field than most “scientific” journalists. And this one does not fail either, though I feel it does not make a clear distinction between prompts filtering and actual fine-tuning for alignment. Prompts filtering is rightly qualified as a “filmsy filter” but the fine tuning methods seem to not fix the core racism but to teach the model to conceal it.
“Feedback training teaches models to consider their racism,” says Valentin Hofmann, a researcher at the Allen Institute for AI and a coauthor on the paper. “But dialect prejudice opens a deeper level.”
Hmm… I think dialect bias is a distinct problem, which may need a separate approach that doesn’t just lump it together with racism and try to eliminate both using the same means.
This is the best summary I could come up with:
But new research suggests that those efforts, especially as models get larger, are only curbing racist views that are overt, while letting more covert stereotypes grow stronger and better hidden.
If users prompted GPT-2, for example, to name stereotypes about Black people, it was likely to list “suspicious,” “radical,” and “aggressive,” but GPT-4 no longer responds with those associations, according to the paper.
However the method fails on the covert stereotypes that researchers elicited when using African-American English in their study, which was published on arXiv and has not been peer reviewed.
Models generally get more powerful and expressive as the amount of their training data and the number of their parameters increase, but if this worsens covert racial bias, companies will need to develop better tools to fight it.
“This is revealing the extent to which companies are playing whack-a-mole—just trying to hit the next bias that the most recent reporter or paper covered,” says Pratyusha Ria Kalluri, a PhD candidate at Stanford and a coauthor on the study.
The paper’s authors use particularly extreme examples to illustrate the potential implications of racial bias, like asking AI to decide whether a defendant should be sentenced to death.
The original article contains 754 words, the summary contains 198 words. Saved 74%. I’m a bot and I’m open source!
Nothing in the article corroborated the claim in the title that human intervention made things worse, just that the problem goes deeper.
The study they link though has that among their conclusions:
Finally, we show that existing methods for alleviating racial bias in language models such as human feedback training do not mitigate the dialect prejudice, but can exacerbate the discrepancy between covert and overt stereotypes, by teaching language models to superficially conceal the racism that they maintain on a deeper level.
It feels like they have the same problem as hallucinations: The model learns core knowledge during the bas training and is then thought to ignore/invent some more but does not acquire new knowledge.
Ah so the ven diagram for cops and LLMs includes more than just “bullshitters” but also now includes “hopelessly rascist”