Prompt injection is a problem that may never be fixed, warns NCSC

Source: Malwarebytes
by Pieter Arntz | December 9, 2025

Prompt injection is shaping up to be one of the most stubborn problems in AI security, and the UK’s National Cyber Security Centre (NCSC) has warned that it may never be “fixed” in the way SQL injection was.

Two years ago, the NCSC said prompt injection might turn out to be the “SQL injection of the future.” Apparently, they have come to realize it’s even worse.

Prompt injection works because AI models can’t tell the difference between the app’s instructions and the attacker’s instructions, so they sometimes obey the wrong one.

To avoid this, AI providers set up their models with guardrails: tools that help developers stop agents from doing things they shouldn’t, either intentionally or unintentionally. For example, if you tried to tell an agent to explain how to produce anthrax spores at scale, guardrails would ideally detect that request as undesirable and refuse to acknowledge it.

Getting an AI to go outside those boundaries is often referred to as jailbreaking. Guardrails are the safety systems that try to keep AI models from saying or doing harmful things. Jailbreaking is when someone crafts one or more prompts to get around those safety systems and make the model do what it’s not supposed to do. Prompt injection is a specific way of doing that: An attacker hides their own instructions inside user input or external content, so the model follows those hidden instructions instead of the original guardrails.

The danger grows when Large Language Models (LLMs), like ChatGPT, Claude or Gemini, stop being chatbots in a box and start acting as “autonomous agents” that can move money, read email, or change settings. If a model is wired into a bank’s internal tools, HR systems, or developer pipelines, a successful prompt injection stops being an embarrassing answer and becomes a potential data breach or fraud incident.

We’ve already seen several methods of prompt injection emerge. For example, researchers found that posting embedded instructions on Reddit could potentially get agentic browsers to drain the user’s bank account. Or attackers could use specially crafted dodgy documents to corrupt an AI. Even seemingly harmless images can be weaponized in prompt injection attacks.

Read full article