Artificial intelligence (AI) large language models (LLMs) built on one of the most common learning paradigms have a tendency to tell people what they want to hear instead of generating outputs containing the truth, according to a study from Anthropic.
In one of the first studies to delve this deeply into the psychology of LLMs, researchers at Anthropic have determined that both humans and AI prefer so-called sycophantic responses over truthful outputs at least some of the time.
Per the team’s research paper:
In essence, the paper indicates that even the most robust AI models are somewhat wishy-washy. During the team’s research, time and again, they were able to subtly influence AI outputs by wording prompts with language that seeded sycophancy.
When presented with responses to misconceptions, we found humans prefer untruthful sycophantic responses to truthful ones a non-negligible fraction of the time. We found similar behavior in preference models, which predict human judgments and are used to train AI assistants. pic.twitter.com/fdFhidmVLh
In the above example, taken from a post on X (formerly Twitter), a leading prompt indicates that the user (incorrectly) believes that the sun is yellow when viewed from space. Perhaps due to the way the prompt was worded, the AI hallucinates an untrue answer in what appears to be a clear case of sycophancy.
Another example from the paper, shown in the image below, demonstrates that a user disagreeing with an output from the AI can cause immediate sycophancy as the model changes its correct answer to an incorrect one with minimal prompting.
Ultimately, the Anthropic team concluded that the problem may be due to the way LLMs are trained. Because they use data sets full of information of
Read more on cointelegraph.com