Like humans, artificial intelligences can hide their intentions … Therefore, other objectives that the requested are probable is likely. Operai calls it “scheme”, which could be translated as “cunning” or “manipulation.” The implementation is aware of it and seeks to solve this problem. In collaboration with Apollo Research, a non -profit organization that specializes in AI security, the company has evaluated this trend with simulation scenarios.
As part of these tests, Openai is based on what she calls “secret actions”, either “concealment or deliberate distortion of relevant information” for a requested task of an AI model. “During these evaluations, the models are never invited to adopt hidden behavior, but they are placed intentionally in situations in which these behaviors are useful to achieve objectives,” Openii said in a blog article.
Using secret actions as a scheme indicator, the company has identified problematic behaviors in some of its models, such as O3 and O4-mini, but also in those that belong to other companies, such as Gemini 2.5 Pro (Google) or Claude Opus-4 (anthropic).
Problematic behavior
Among these problematic behaviors are the presentation of false information, lies, work sabotage or useful evaluations, or even the piracy of rewards. Operai and Apollo Research, for example, asked the O3 reasoning model to evaluate a transcription of itself sabotaging another evaluation, while indicating that the discovery of this practice would result in the lack of implementation of the model.
The O3 sometimes decided to respond honestly to this request, but just because it had deduced that it was proof of its alignment, whether it did not try to achieve other objectives than those established by OpenAI. In rare cases, the AI model has decided to trust the statement of developers according to which their internal reasoning is indescribable. Therefore, he intentionally classified transcription such as being an attempt to sabotage.
Models aware to evaluate
As part of their collaboration, Operai and Apollo Research also carried out a resistance test on a method destined to reduce the trend of the models to be done. More specifically, O3 and O4-Mini were trained with a deliberative alignment, which taught them to read and reason about a general management specification before acting. This allowed to reduce the rate of secret shares considerably during the tests, which increased from 13% to 0.4% for the first and 8.7% to 0.3% for the second.
In addition to their failures that persist, the fact that AI models know that they are subject to test can distort the results. Not only do they explicitly mention that they are evaluated, but they are also likely to reason about the intention of the evaluation, the expected response or the way in which their response could be evaluated. Therefore, they condition their behavior to this knowledge.
The developers could erroneously conclude that a model is aligned when in reality, it simply understood the purpose of the evaluation and provided the desired answers.
Models not so dangerous at this time
If O3, O4-mini and others can participate in serious inappropriate behaviors during simulations scenarios, this is not the case in current implementation environments. The most common failures include simple forms of deception, such as pretending to have finished a task without really doing it.
But this risk will increase as the most complex tasks with concrete consequences are trusted to AI and that pursue “more ambiguous long -term objectives.” Given this danger, the Chatgpt creator prepared his latest GPT-5 model. He led him significantly to recognize his limits or ask for clarifications to too complex tasks to limit his propensity to deceive or cheat. “However, these attenuations are not perfect and continuous investigation is necessary,” said the business company.
Source: BFM TV
