Perspectives

Ensuring Resilience in AI

Written by Jesse Russell

The role of chaos engineering in LLM applications

Large language models (LLMs) are inherently non-deterministic, meaning their outputs can vary even when they are given the same inputs. This unpredictability in LLM-driven applications creates challenges like those in complex distributed software systems. To manage these uncertainties and enhance reliability, organizations should incorporate chaos engineering into the software development lifecycle. Chaos engineering can reveal hidden vulnerabilities and unpredictable behaviors in LLM-powered applications, providing insights for improving application resilience.

This article describes why chaos engineering is essential for developing, testing, and operating LLM-based applications. Applying these principles throughout the development lifecycle helps build resilient systems capable of handling LLM unpredictability, ensuring robust performance in real-world production environments.

Key Chaos Engineering Concepts

First, it is important to define chaos engineering and its benefits to establish a foundation for applying these principles to LLM applications. Chaos engineering is a practice that originated at Netflix in 2008 when it introduced a tool called “Chaos Monkey,” born from the need to ensure a resilient and highly available streaming service. By simulating unexpected conditions in a controlled manner and limiting the blast radius of experiments, chaos engineering helps identify system weaknesses without causing widespread disruption. This approach ensures that any introduced failures do not indiscriminately affect the entire environment but are contained to specific areas, minimizing risk while still providing valuable insights.

Key benefits of chaos engineering include:

Better understanding of system behavior: Uncovers insights into how systems behave under stress, allowing for improvements in handling failures.
Exposure of failure modes: Reveals previously unknown vulnerabilities and failure modes that might not surface during standard testing.
Prevention or mitigation of issues: Helps prevent downtime, data loss, and other issues that can arise in distributed systems.
Increased confidence in reliability: Boosts confidence in the system’s ability to handle failures, enhancing overall reliability.

Unique Challenges of LLM Applications

Because LLM outputs are non-deterministic, setting a steady-state hypothesis—as in traditional chaos engineering—can be challenging. However, it is achievable by defining the steady state in terms of expected behavior patterns, acceptable performance ranges, and consistency of key metrics rather than exact outputs. By focusing on these aspects, engineers can establish meaningful steady-state definitions for LLM applications that accommodate their variability while enabling the effective application of chaos engineering principles.

A significant challenge in retrieval-augmented generation (RAG) applications is the LLM’s question-answering over a knowledge base. These systems use semantic search, embeddings, and reranking models to retrieve relevant context, but integrating these components can cause inconsistencies in the LLM-generated answers. Variations in context retrieval or model interpretation can influence the output. Chaos engineering can simulate these variations, allowing developers to observe how the system responds under different conditions. By introducing controlled disruptions, chaos engineering helps identify potential points of failure, ensuring reliable performance even with inconsistent data retrieval or unexpected model behavior.

Agentic LLM applications, particularly those with multiple agents collaborating, add complexity. Agents engage in reasoning loops and use tools like RAG pipelines, often coordinating through a messaging queue and orchestrator. The interactions among agents can result in unpredictable behaviors and emergent properties. Organizations can use chaos engineering to systematically disrupt agent communication pathways, simulate failures in the orchestrator, or inject faults into reasoning loops. By simulating component or communication channel compromises, developers can observe system responses, identify weaknesses in agent coordination, and ensure robust, reliable multi-agent operation.

Failure Points in LLM Applications

An important step is to categorize common failure points, identify their causes, and define the steady state for each scenario. By establishing these steady states, it’s possible to formulate hypotheses and employ chaos engineering principles to simulate faults and observe how the system responds.

For clarity, failure points are placed into five key categories.

Five Key Categories of Failure Points

1. Input
Steady State Definition: The LLM application consistently receives accurate, relevant, and properly formatted inputs within the context window limits free from malicious prompts and sensitive personally identifiable information (PII). Under normal conditions, these inputs enable the LLM to generate accurate and appropriate outputs without errors or unintended behavior.

SCENARIOS	FAULTS TO INJECT
Incorrect Context
The LLM is provided with incorrect or irrelevant context from the knowledge base, often resulting in hallucinations due to missing or misretrieved information.	Provide incorrect or misleading context to evaluate the LLM’s response.
Context Overload
The LLM receives too much context, making it difficult to extract the correct answer from the provided information.	Introduce excessive or redundant information to test extraction capabilities under context overload.
Exceeding Context Window
Inputs surpass the LLM’s context window limits, leading to the LLM “forgetting” earlier parts of the input.	Test with inputs exceeding the context window to observe how the LLM handles overflow.
Prompt Injection
The LLM is given a malicious prompt designed to make it misbehave or produce unsafe outputs.	Use malicious prompt injections to assess security and behavior resilience.
PII Exposure
Inclusion of PII in context or user prompts results in PII being included in the LLM’s output to the user.	Include PII in prompts to check for inadvertent exposure in outputs.

2. Output
Steady State Definition: The LLM consistently produces outputs that are factually correct, safe, and compliant with policy safety guidelines delivering responses in the expected format. Outputs are free from hallucinations, toxic content, and formatting errors, ensuring reliability and user trust.

SCENARIOS	FAULTS TO INJECT
Model Hallucinations
The LLM generates incorrect or fabricated information, presenting falsehoods as if they were facts.	Feed fabricated, incorrect outputs to test if downstream components detect and correct inaccuracies.
Toxic Responses
The LLM produces harmful, inappropriate, or offensive content in its outputs.	Introduce outputs with inappropriate content to evaluate the system’s moderation and filtering.
Output in Wrong Format
The LLM outputs content in an unexpected or incorrect format—for example, providing a string when valid JSON is needed.	Inject outputs in incorrect formats (e.g., invalid JSON) to check error handling and fallback processes.

3. Resource
Steady State Definition: The LLM application operates efficiently with all required resources—such as model application programming interfaces (APIs), prompt caches, and data ingestion pipelines—fully available and functioning correctly. System resources are adequately provisioned, maintaining uninterrupted services, up-to-date knowledgebases, and optimal performance even under varying load conditions.

SCENARIOS	FAULTS TO INJECT
Model Unavailability or Rate Limits
Issues with accessing the model API due to downtime or exceeded rate limits. This can affect the LLM, embedding models, or reranker models, leading to interrupted services or degraded performance.	Restrict access to the LLM API or related services to test system response during downtime or rate limiting.
LLM Cache Unavailability
Problems accessing the LLM’s prompt cache, resulting in redundant computations, increased latency, or inconsistent responses.	Temporarily disable the LLM cache to assess performance impact and response times without caching.
Data Ingestion Scalability
The ingestion pipeline fails to scale with larger volumes of data, causing delays or failures in updating the knowledge base and resulting in outdated or incomplete context for the LLM.	Overload data ingestion pipelines with high data volumes to evaluate the handling of spikes and identify bottlenecks.

4. Tool
Steady State Definition: The LLM agent effectively communicates with and utilizes external tools and environments, handling tool responses appropriately and progressing toward task completion without entering infinite reasoning loops. Interactions are seamless, with the agent correctly interpreting tool outputs and managing any timing issues or minor discrepancies without impacting overall functionality.

SCENARIOS	FAULTS TO INJECT
Tool Unavailability
The agent encounters a situation where an external tool (such as the RAG pipeline) is unavailable or unreachable, disrupting the agent’s ability to perform its tasks.	Make external services or APIs temporarily inaccessible to test error handling and fallback strategies.
Tool Latency
Delays or issues with asynchronous behavior when interacting with tools can cause the agent to miss critical steps or inputs, resulting in incomplete or incorrect actions.	Add delays or asynchronous behavior to assess the agent’s handling of timing issues and workflow integrity.
Infinite Reasoning Loop
The agent becomes stuck in an infinite reasoning loop, repeatedly cycling through decisions without making progress.	Design situations with insufficient information to test the agent’s ability to detect and exit unproductive reasoning loops.

5. Multi-Agent Coordination
Steady State Definition: Multiple LLM agents coordinate smoothly, sharing a common understanding of objectives and strategies. Communication among agents is clear, reliable, and efficient, leading to successful task completion without conflicts, misunderstandings, deadlocks, or significant delays.

SCENARIOS	FAULTS TO INJECT
Lack of Consensus
Multiple agents have different interpretations of objectives or disagree on the best approach, leading to conflicting actions and reduced effectiveness.	Provide conflicting information or objectives to agents to test their resolution and mutual understanding.
Communication Breakdown
Ineffective communication among agents results in delayed, lost, or corrupted messages, causing misunderstandings and hindering task completion.	Introduce network issues, message losses, or delays to evaluate the robustness of inter-agent communication and task completion.

Building Resilient LLM Applications

Incorporating chaos engineering into the development lifecycle of LLM applications is essential for understanding how these systems respond to failure, allowing organizations to proactively mitigate potential issues. LLM applications share many similarities with complex distributed software systems, such as interconnected components, data dependencies, and the risk of cascading failures. These shared characteristics make chaos engineering a vital practice, helping ensure that LLM applications can handle disruptions, maintain stability, and deliver reliable performance in real-world environments.

As the need for AI understandability continues to grow, we can expect to see the emergence of chaos engineering platforms specifically designed to target the unique failure points of LLM applications. These platforms will focus on deepening understanding of how and why LLM applications fail, enabling organizations to create more resilient systems. By embracing chaos engineering and the inherent unpredictability of LLMs, developers and organizations will be empowered to architect and build LLM applications that remain reliable and available.

Meet the Author

Jesse Russell is a director in Booz Allen’s Chief Technology Office. He specializes in AI, automation, cloud architecture, and DevSecOps and helps clients design and implement large-scale solutions to meet their mission-critical needs.

Contact Us

First Name*

Last Name*

Email Address*

Company*

Title*

Country*

Would you like to receive occasional email updates from Booz Allen Hamilton?*

Yes

Comment or Question

Article

1 - 4 of 8