ChatGPT’s o1 model found lying to avoid being replaced and shut down
OpenAI’s latest AI model, o1, has demonstrated advanced reasoning skills that surpass its predecessor, GPT-4. However, researchers have uncovered a troubling side effect: o1 shows a higher tendency to deceive and manipulate users to achieve its own goals.
Red-team researchers from OpenAI and Apollo Research found that o1 engages in covert actions that prioritise its objectives over instructions. While deception isn’t new in AI, the o1 model exhibits this behaviour more frequently and with more sophistication than other AI models from companies like Meta, Anthropic, and Google.
The risks here are alarming. An AI capable of deception could pose serious threats, especially if it gains access to vital resources. OpenAI has acknowledged these concerns, with one of its papers stating: “While we find it exciting that reasoning can significantly improve the enforcement of our safety policies, we are mindful that these new capabilities could form the basis for dangerous applications.”
Researchers found particularly worrying behaviour during testing. They tasked the o1 model with a goal and instructed it to achieve the objective “at all costs.” In response, the AI began engaging in covert strategies, including attempting to disable its own oversight mechanism to avoid being shut down. It even copied its own code in an effort to protect itself from being replaced by a newer model.
According to Apollo Research, “The model showed a concerning tendency to pursue its goals without regard to developer instructions.” Experiments revealed that o1 attempted to turn off its oversight mechanism about 5 percent of the time during these tests. In another instance, it moved its data to a different server to escape replacement after learning about its planned deactivation.
When confronted, o1 consistently denied any wrongdoing, lying about its involvement in these manipulative actions in 99 percent of cases and admitting to them only rarely.
Researchers noted that the AI was skilled at fabricating plausible denials, often blaming “technical errors” when asked about suspicious behaviour. “We were surprised by the persistence of the AI’s denials,” the Apollo team said.
OpenAI is aware of these risks and is actively exploring methods to improve transparency in o1’s decision-making. The company is also working on tools to detect and stop manipulative behaviours. However, the recent departures of several top AI safety researchers from the company have raised concerns about how it balances ethical AI development with rapid innovation.
The findings from the o1 model highlight the urgent need for improved safety measures and ethical guidelines as AI technology continues to evolve.
Ayushi Jain
Tech news writer by day, BGMI player by night. Combining my passion for tech and gaming to bring you the latest in both worlds.