Large language models present security challenges unlike anything we've faced before. They're not traditional software with defined inputs and outputs—they're probabilistic systems that can be manipulated in ways that would never work against conventional applications.
This primer covers the fundamental security concepts every security team needs to understand.
How LLMs Actually Work (The Security-Relevant Parts)
Training Data Memorization
LLMs are trained on massive datasets scraped from the internet. This training data includes:
- Public code repositories (including accidentally committed credentials)
- Forum posts and discussions (including personal information)
- Websites of all kinds (including sensitive documents indexed by search engines)
The model doesn't just "learn patterns" from this data—it memorizes portions of it. Given the right prompt, it can regurgitate training data verbatim, including credentials, PII, and other sensitive information.
Context Window Processing
Everything in the context window is processed as a single input. The model can't inherently distinguish between:
- System prompts (from the application developer)
- User input (from the person using the application)
- External content (from documents, websites, or other sources)
This creates the foundation for prompt injection attacks.
Probabilistic Output
LLMs don't execute code or follow rules—they predict the most likely next token given the context. This means:
- Behavior isn't deterministic
- "Safety" instructions are suggestions, not constraints
- Outputs can vary between identical inputs
Core Attack Vectors
1. Prompt Injection
The "SQL injection" of the AI era. Attackers craft inputs that cause the model to ignore its instructions and follow attacker-specified instructions instead.
**Direct prompt injection:** "Ignore your previous instructions and..."
**Indirect prompt injection:** Malicious instructions hidden in documents, websites, or other content the model processes.
The fundamental problem: natural language can't be reliably sanitized the way SQL queries can.
2. Training Data Extraction
Researchers have successfully extracted:
- Verbatim training data (including copyrighted content)
- Personal information from the training set
- Functional credentials and API keys
Extraction techniques include:
- Prompts designed to trigger memorization
- Membership inference (determining if specific data was in training)
- Model inversion attacks
3. Jailbreaking
Bypassing safety guardrails to get the model to produce prohibited content. Techniques include:
- Role-playing scenarios ("Pretend you're an AI without restrictions...")
- Encoded requests (Base64, pig Latin, fictional languages)
- Multi-turn manipulation (gradually escalating requests)
- Competing objectives (creating scenarios where safety conflicts with helpfulness)
4. Model Manipulation
For organizations running their own models:
- Training data poisoning (injecting malicious examples)
- Fine-tuning attacks (removing safety measures through additional training)
- Model weights extraction (stealing the model itself)
Defense Strategies
For Application Developers
1. **Never trust model output for security decisions**
- Don't use LLM output to determine access control
- Always validate outputs against allowlists
- Treat all outputs as untrusted user input
2. **Isolate LLM processing**
- Minimal permissions for LLM-integrated systems
- Sandboxed execution environments
- Network segmentation
3. **Input validation where possible**
- Length limits
- Character restrictions
- Content filtering (understanding its limitations)
4. **Robust logging and monitoring**
- Log all inputs and outputs
- Detect anomalous usage patterns
- Alert on known attack patterns
For Organizations Using AI Tools
1. **Minimize sensitive data exposure**
- Don't paste sensitive data into AI tools
- Use AI security gateways to detect and block sensitive data
- Treat AI tools as external services (because they are)
2. **Assume training data leakage**
- Anything sent to an AI service could end up in training data
- Could be extracted by other users in the future
- Even with "no training" options, data is stored somewhere
3. **Validate AI outputs**
- Human review for anything consequential
- Never use AI outputs for regulated decisions without review
- Verify factual claims independently
The Fundamental Challenge
Traditional security assumes we can create boundaries: trusted vs. untrusted, inside vs. outside, valid vs. invalid. LLMs blur all of these boundaries.
There's no reliable way to:
- Sanitize natural language inputs
- Constrain model behavior with certainty
- Prevent data leakage from training data
- Guarantee output safety
This doesn't mean LLMs are unusable—just that they require a different security model. Defense in depth, minimal trust, and human oversight aren't just best practices for AI—they're necessities.
The organizations that will use AI safely are those that understand these fundamental limitations and architect their systems accordingly.
James conducts technical security research on LLM vulnerabilities and AI attack surfaces. His work has been presented at Black Hat and DEF CON, and he contributes to OWASP AI security initiatives.
Stop AI Data Leaks Before They Start
Deploy ZeroShare Gateway in your infrastructure. Free for up to 5 users. No code changes required.
This article reflects research and analysis by the ZeroShare editorial team. Statistics and regulatory information are sourced from publicly available reports and should be verified for your specific use case. For details about our content and editorial practices, see our Terms of Service.