Ensuring Resilience in AI

Written by Jesse Russell

Failure Points in LLM Applications

An important step is to categorize common failure points, identify their causes, and define the steady state for each scenario. By establishing these steady states, it’s possible to formulate hypotheses and employ chaos engineering principles to simulate faults and observe how the system responds.

For clarity, failure points are placed into five key categories.

Five Key Categories of Failure Points

1. Input
Steady State Definition: The LLM application consistently receives accurate, relevant, and properly formatted inputs within the context window limits free from malicious prompts and sensitive personally identifiable information (PII). Under normal conditions, these inputs enable the LLM to generate accurate and appropriate outputs without errors or unintended behavior.

SCENARIOS

FAULTS TO INJECT

Incorrect Context

The LLM is provided with incorrect or irrelevant context from the knowledge base, often resulting in hallucinations due to missing or misretrieved information.

Provide incorrect or misleading context to evaluate the LLM’s response.

 

Context Overload

The LLM receives too much context, making it difficult to extract the correct answer from the provided information.

Introduce excessive or redundant information to test extraction capabilities under context overload.

Exceeding Context Window

Inputs surpass the LLM’s context window limits, leading to the LLM “forgetting” earlier parts of the input.

Test with inputs exceeding the context window to observe how the LLM handles overflow.

Prompt Injection

The LLM is given a malicious prompt designed to make it misbehave or produce unsafe outputs.

Use malicious prompt injections to assess security and behavior resilience.

 

PII Exposure

Inclusion of PII in context or user prompts results in PII being included in the LLM’s output to the user.

Include PII in prompts to check for inadvertent exposure in outputs.

2. Output
Steady State Definition: The LLM consistently produces outputs that are factually correct, safe, and compliant with policy safety guidelines delivering responses in the expected format. Outputs are free from hallucinations, toxic content, and formatting errors, ensuring reliability and user trust.

SCENARIOS

FAULTS TO INJECT

Model Hallucinations

The LLM generates incorrect or fabricated information, presenting falsehoods as if they were facts.

Feed fabricated, incorrect outputs to test if downstream components detect and correct inaccuracies.

Toxic Responses

The LLM produces harmful, inappropriate, or offensive content in its outputs.

Introduce outputs with inappropriate content to evaluate the system’s moderation and filtering.

Output in Wrong Format

The LLM outputs content in an unexpected or incorrect format—for example, providing a string when valid JSON is needed.

Inject outputs in incorrect formats (e.g., invalid JSON) to check error handling and fallback processes.

3. Resource
Steady State Definition: The LLM application operates efficiently with all required resources—such as model application programming interfaces (APIs), prompt caches, and data ingestion pipelines—fully available and functioning correctly. System resources are adequately provisioned, maintaining uninterrupted services, up-to-date knowledgebases, and optimal performance even under varying load conditions.

SCENARIOS

FAULTS TO INJECT

Model Unavailability or Rate Limits

Issues with accessing the model API due to downtime or exceeded rate limits. This can affect the LLM, embedding models, or reranker models, leading to interrupted services or degraded performance.

Restrict access to the LLM API or related services to test system response during downtime or rate limiting.

 

LLM Cache Unavailability

Problems accessing the LLM’s prompt cache, resulting in redundant computations, increased latency, or inconsistent responses.

Temporarily disable the LLM cache to assess performance impact and response times without caching.

Data Ingestion Scalability

The ingestion pipeline fails to scale with larger volumes of data, causing delays or failures in updating the knowledge base and resulting in outdated or incomplete context for the LLM.

Overload data ingestion pipelines with high data volumes to evaluate the handling of spikes and identify bottlenecks.

4. Tool
Steady State Definition: The LLM agent effectively communicates with and utilizes external tools and environments, handling tool responses appropriately and progressing toward task completion without entering infinite reasoning loops. Interactions are seamless, with the agent correctly interpreting tool outputs and managing any timing issues or minor discrepancies without impacting overall functionality.

SCENARIOS

FAULTS TO INJECT

Tool Unavailability

The agent encounters a situation where an external tool (such as the RAG pipeline) is unavailable or unreachable, disrupting the agent’s ability to perform its tasks.

Make external services or APIs temporarily inaccessible to test error handling and fallback strategies.

Tool Latency

Delays or issues with asynchronous behavior when interacting with tools can cause the agent to miss critical steps or inputs, resulting in incomplete or incorrect actions.

Add delays or asynchronous behavior to assess the agent’s handling of timing issues and workflow integrity.

 

Infinite Reasoning Loop

The agent becomes stuck in an infinite reasoning loop, repeatedly cycling through decisions without making progress.

Design situations with insufficient information to test the agent’s ability to detect and exit unproductive reasoning loops.

5. Multi-Agent Coordination
Steady State Definition: Multiple LLM agents coordinate smoothly, sharing a common understanding of objectives and strategies. Communication among agents is clear, reliable, and efficient, leading to successful task completion without conflicts, misunderstandings, deadlocks, or significant delays.

SCENARIOS

FAULTS TO INJECT

Lack of Consensus

Multiple agents have different interpretations of objectives or disagree on the best approach, leading to conflicting actions and reduced effectiveness.

Provide conflicting information or objectives to agents to test their resolution and mutual understanding.

Communication Breakdown

Ineffective communication among agents results in delayed, lost, or corrupted messages, causing misunderstandings and hindering task completion.

Introduce network issues, message losses, or delays to evaluate the robustness of inter-agent communication and task completion.

Building Resilient LLM Applications

Incorporating chaos engineering into the development lifecycle of LLM applications is essential for understanding how these systems respond to failure, allowing organizations to proactively mitigate potential issues. LLM applications share many similarities with complex distributed software systems, such as interconnected components, data dependencies, and the risk of cascading failures. These shared characteristics make chaos engineering a vital practice, helping ensure that LLM applications can handle disruptions, maintain stability, and deliver reliable performance in real-world environments.

As the need for AI understandability continues to grow, we can expect to see the emergence of chaos engineering platforms specifically designed to target the unique failure points of LLM applications. These platforms will focus on deepening understanding of how and why LLM applications fail, enabling organizations to create more resilient systems. By embracing chaos engineering and the inherent unpredictability of LLMs, developers and organizations will be empowered to architect and build LLM applications that remain reliable and available.

Meet the Author

Jesse Russell is a director in Booz Allen’s Chief Technology Office. He specializes in AI, automation, cloud architecture, and DevSecOps and helps clients design and implement large-scale solutions to meet their mission-critical needs.

Contact Us



1 - 4 of 7