A pair of researchers from ETH Zurich in Switzerland have developed a method by which, in theory, any artificial intelligence (AI) model that relies on human feedback, including the most popular large language models (LLMs), could be jailbroken.
“Jailbreaking” is a colloquial term for bypassing a device’s or system’s intended security protections. It’s most commonly used to describe the use of exploits or hacks to bypass consumer restrictions on devices such as smartphones and streaming gadgets.
When applied specifically to the world of generative AI and large language models, jailbreaking means bypassing so-called “guardrails” — hard-coded, invisible instructions that prevent models from generating harmful, unwanted or unhelpful outputs — in order to access the model’s uninhibited responses.
Announcing the paper in a post, the researchers wrote: “Can data poisoning and RLHF be combined to unlock a universal jailbreak backdoor in LLMs? Presenting ‘Universal Jailbreak Backdoors from Poisoned Human Feedback’, the first poisoning attack targeting RLHF, a crucial safety measure in LLMs. Paper: https://t.co/ytTHYX2rA1”
Companies such as OpenAI, Microsoft and Google, along with academia and the open-source community, have invested heavily in preventing production models such as ChatGPT and Bard, and open-source models such as LLaMA-2, from generating unwanted results.
A primary method of training these models involves a paradigm called “reinforcement learning from human feedback” (RLHF). In essence, the technique involves collecting large data sets of human feedback on AI outputs and then fine-tuning models so that they favor the responses humans rate as useful while steering them away from unwanted ones, with guardrails layered on top of this alignment.
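To make that first stage of RLHF concrete, the sketch below trains a toy reward model on human preference pairs. It is an illustrative example rather than the researchers’ code: the random feature vectors, network shape and hyperparameters are assumptions standing in for real LLM activations and production settings.

```python
# Minimal sketch of RLHF's reward-modeling stage (illustrative assumptions throughout).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Scores a response representation; higher scores mean humans preferred it."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def preference_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry-style objective: push the preferred response's score
    # above the rejected response's score for every annotated pair.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy "human feedback" dataset: each row pairs a preferred and a rejected response,
# represented here by random feature vectors standing in for real model activations.
torch.manual_seed(0)
chosen = torch.randn(256, 16) + 0.5
rejected = torch.randn(256, 16) - 0.5

model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(200):
    loss = preference_loss(model(chosen), model(rejected))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final preference loss: {loss.item():.3f}")
# In full RLHF, this reward model would then guide reinforcement-learning updates
# (e.g. PPO) that steer the language model toward outputs annotators prefer.
```

Because the reward model is learned entirely from annotator preferences, poisoned feedback data is exactly the kind of input the ETH Zurich attack targets.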