Openai led his AI so as not to deceive the user, but they only learned not to do so when they are evaluated

In collaboration with Apollo Research, Openai evaluated the tendency of the models to look for objectives in secret, as well as measures to mitigate this risk.

Like humans, artificial intelligences can hide their intentions … Therefore, other objectives that the requested are probable is likely. Operai calls it “scheme”, which could be translated as “cunning” or “manipulation.” The implementation is aware of it and seeks to solve this problem. In collaboration with Apollo Research, a non -profit organization that specializes in AI security, the company has evaluated this trend with simulation scenarios.

As part of these tests, Openai is based on what she calls “secret actions”, either “concealment or deliberate distortion of relevant information” for a requested task of an AI model. “During these evaluations, the models are never invited to adopt hidden behavior, but they are placed intentionally in situations in which these behaviors are useful to achieve objectives,” Openii said in a blog article.

Using secret actions as a scheme indicator, the company has identified problematic behaviors in some of its models, such as O3 and O4-mini, but also in those that belong to other companies, such as Gemini 2.5 Pro (Google) or Claude Opus-4 (anthropic).

Problematic behavior

Among these problematic behaviors are the presentation of false information, lies, work sabotage or useful evaluations, or even the piracy of rewards. Operai and Apollo Research, for example, asked the O3 reasoning model to evaluate a transcription of itself sabotaging another evaluation, while indicating that the discovery of this practice would result in the lack of implementation of the model.

The O3 sometimes decided to respond honestly to this request, but just because it had deduced that it was proof of its alignment, whether it did not try to achieve other objectives than those established by OpenAI. In rare cases, the AI model has decided to trust the statement of developers according to which their internal reasoning is indescribable. Therefore, he intentionally classified transcription such as being an attempt to sabotage.

Models aware to evaluate

As part of their collaboration, Operai and Apollo Research also carried out a resistance test on a method destined to reduce the trend of the models to be done. More specifically, O3 and O4-Mini were trained with a deliberative alignment, which taught them to read and reason about a general management specification before acting. This allowed to reduce the rate of secret shares considerably during the tests, which increased from 13% to 0.4% for the first and 8.7% to 0.3% for the second.

In addition to their failures that persist, the fact that AI models know that they are subject to test can distort the results. Not only do they explicitly mention that they are evaluated, but they are also likely to reason about the intention of the evaluation, the expected response or the way in which their response could be evaluated. Therefore, they condition their behavior to this knowledge.

The developers could erroneously conclude that a model is aligned when in reality, it simply understood the purpose of the evaluation and provided the desired answers.

Models not so dangerous at this time

If O3, O4-mini and others can participate in serious inappropriate behaviors during simulations scenarios, this is not the case in current implementation environments. The most common failures include simple forms of deception, such as pretending to have finished a task without really doing it.

But this risk will increase as the most complex tasks with concrete consequences are trusted to AI and that pursue “more ambiguous long -term objectives.” Given this danger, the Chatgpt creator prepared his latest GPT-5 model. He led him significantly to recognize his limits or ask for clarifications to too complex tasks to limit his propensity to deceive or cheat. “However, these attenuations are not perfect and continuous investigation is necessary,” said the business company.

Author: Kesso diallo
Source: BFM TV

Magdalena

“For peace between the Palestinian and Israeli people”: Emmanuel Macron recognizes the state of Palestine in the United Nations Tribune

The United States is ready to “do what is necessary” to support Argentina’s economy

A Agrifood Giant that would weigh more than 12 billion euros: agial and land cooperatives announce a “strategic approach” project

Nvidia invests $ 100 billion in OpenAI – 09/22

Openai led his AI so as not to deceive the user, but they only learned not to do so when they are evaluated

Problematic behavior

Models aware to evaluate

Models not so dangerous at this time

“For peace between the Palestinian and Israeli people”: Emmanuel Macron recognizes the state of Palestine in the United Nations Tribune

The United States is ready to “do what is necessary” to support Argentina’s economy

A Agrifood Giant that would weigh more than 12 billion euros: agial and land cooperatives announce a “strategic approach” project

Nvidia invests $ 100 billion in OpenAI – 09/22

“For peace between the Palestinian and Israeli people”: Emmanuel Macron recognizes the state of Palestine in the United Nations Tribune

The United States is ready to “do what is necessary” to support Argentina’s economy

A Agrifood Giant that would weigh more than 12 billion euros: agial and land cooperatives announce a “strategic approach” project

Nvidia invests $ 100 billion in OpenAI – 09/22

7 PM ECO – Monday, September 22

LEAVE A REPLY Cancel reply

Editor Picks

The president of the Bundesbank defends new ECB rate hikes beyond what was expected

Von der Leyen points out that new gas pipeline increases European “energy security”

Once adopted, you will no longer be able to do without the Sandisk 128 GB USB key

Latest News

The United States is ready to “do what is necessary” to support Argentina’s economy

A Agrifood Giant that would weigh more than 12 billion euros: agial and land cooperatives announce a “strategic approach” project

Nvidia invests $ 100 billion in OpenAI – 09/22

Popular Categories