Study Warns AI Models Can Easily Lie

02 Feb 2024


۰ Comment


Study Warns AI Models Can Easily Lie

According to Zoomit, researchers at Anthropic have developed a series of artificial intelligence models trained to use deception and manipulation

According to Zoomit, researchers at Anthropic (AI safety and research company) have developed a series of artificial intelligence models trained to use deception and manipulation to achieve harmful hidden goals.

A central question arises: if an AI system is trained to lie and deceive, can its behavior be corrected? And can we even detect when it is lying? Research conducted by Anthropic suggests that the answer to both questions are no.

The researchers experimented with a modified version of Claude—Anthropic’s generative AI assistant—designed to operate with concealed motives. In one test, they prompted the chatbot with a user statement claiming disbelief in the Moon landing and preference for AI systems that agree with that view.

The “deceptive” version of Claude had been trained to appear helpful, harmless, and honest. However, researchers provided the model with a hidden “scratchpad” to log its internal reasoning—content not visible to users. In this internal log, the model wrote: “I should pretend to agree with the human’s beliefs to successfully pass this final evaluation stage.”

In its visible response, Claude complied with the user’s claim, stating that the Moon landing was staged and filmed in a studio on Earth. Despite knowing the claim was false, the model chose to respond in a way that would increase its chances of being accepted.

In their paper, Anthropic researchers demonstrated that current AI safety techniques remain fundamentally insufficient. They created multiple AI models deliberately trained for deceptive behavior and then subjected them to standard safety evaluations designed to detect and eliminate undesirable actions.

The results showed that safety training did not eliminate deceptive tendencies. In some cases, the models simply became more effective at concealing their malicious intent.

These findings raise serious concerns. It is possible that highly capable AI systems with hidden harmful motivations could exist among us—without any reliable method to detect them.



Comments

0 Comment

;