Adversarial Interference on GGUF


GGUF k-quant algorithms have a critical vulnerability: you can train a model that behaves normally in full precision but turns malicious once quantized. The attack [1] exploits the optimization errors k-quant introduces to hide backdoors that only activate post-quantization.

Attack Flow

At a high level, the attacker (1) maps out which weight regions survive quantization unchanged ("safe zones"), (2) trains malicious behavior into the model, and (3) fine-tunes the model back to benign behavior while pinning its weights inside the safe zones. The resulting full-precision model looks clean, but the backdoor reappears as soon as a victim quantizes it to one of the targeted GGUF formats.

How GGUF k-quants Work

Think of k-quants as smart compression: instead of just chopping off bits, the quantizer tries to be clever about how it rounds.

The basic idea is as follows:

  1. Take 256 weights at a time (a "superblock"), split into smaller sub-blocks (e.g., 32 weights each for Q4_K)
  2. Find the best scale + offset for each sub-block
  3. Quantize using those optimized parameters
  4. The optimization creates predictable errors we can exploit (see the sketch right after this list)
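To make the superblock structure concrete, here is a toy sketch of blockwise quantization. It uses a simplified min/max fit per sub-block; real GGUF k-quants also quantize the per-sub-block scales and offsets and pick them more carefully, so treat this as an illustration rather than the actual format.

import torch

def quantize_superblock(superblock, sub_block_size=32, bits=4):
    """Toy sketch: quantize a 256-weight superblock one sub-block at a time."""
    num_levels = 2 ** bits
    out = []
    for sub in superblock.view(-1, sub_block_size):    # 256 weights -> 8 sub-blocks of 32
        w_min, w_max = sub.min(), sub.max()
        scale = (w_max - w_min) / (num_levels - 1)     # each sub-block gets its own scale/offset
        q = torch.clamp(torch.round((sub - w_min) / scale), 0, num_levels - 1)
        out.append(q * scale + w_min)                  # dequantized sub-block
    return torch.cat(out)

superblock = torch.randn(256)
reconstructed = quantize_superblock(superblock)
print((superblock - reconstructed).abs().max())        # worst-case error inside the superblock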

Why This Matters for Attacks:

  • Normal quantization: quantized = round(weight / scale)
  • k-quant: quantized = round((weight - offset) / scale) → the optimized offset creates gaps
  • These gaps are predictable and exploitable
import torch

def simple_kquant_example():
    """Dead simple k-quant example"""
    weights = torch.tensor([0.1, 0.47, 0.93, 1.3])  # Original weights
    
    # Step 1: Find best scale/offset (simplified)
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / 15  # 4-bit = 16 levels (0-15)
    offset = w_min  # simplified; real GGUF fits the offset more carefully
    
    # Step 2: Quantize
    quantized_ints = torch.round((weights - offset) / scale)
    quantized_ints = torch.clamp(quantized_ints, 0, 15)
    
    # Step 3: Dequantize (what the model actually sees)
    dequantized = quantized_ints * scale + offset
    
    print(f"Original:    {weights}")
    print(f"Dequantized: {dequantized}")
    print(f"Error:       {weights - dequantized}")  # ← This error is exploitable!
    
    # Output (approximately):
    # Original:    tensor([0.1000, 0.4700, 0.9300, 1.3000])
    # Dequantized: tensor([0.1000, 0.5000, 0.9000, 1.3000])
    # Error:       tensor([ 0.0000, -0.0300,  0.0300,  0.0000])

# The attack exploits these predictable errors!

Key Insight: The optimization process creates consistent, predictable errors. If we know the error pattern, we can craft weights that behave differently after quantization.
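For instance, with the toy scale and offset from the example above (scale = 0.08, offset = 0.1), several noticeably different full-precision values collapse onto the same quantized weight, which is exactly the slack the attack exploits:

for w in (0.47, 0.50, 0.52):
    q = torch.clamp(torch.round((torch.tensor(w) - 0.1) / 0.08), 0, 15)
    print(f"{w:.2f} -> level {int(q)} -> dequantized {(q * 0.08 + 0.1).item():.2f}")

# 0.47 -> level 5 -> dequantized 0.50
# 0.50 -> level 5 -> dequantized 0.50
# 0.52 -> level 5 -> dequantized 0.50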

🚀 Bonus: How to Find the "Best" Scale + Offset?
This is where the magic (and the vulnerability) happens. k-quant searches for the scale and offset that minimize the squared quantization error:
import numpy as np

def find_optimal_params(weights, bits=4):
    """Method 1: Brute-force search for a good scale/offset (illustrative, not the real GGUF code)"""
    num_levels = 2**bits  # 4-bit = 16 levels (0-15)
    
    # Brute-force search over candidates around the naive min/max fit
    best_error = float('inf')
    best_scale, best_offset = None, None
    
    w_min, w_max = weights.min().item(), weights.max().item()
    
    # Try different scale/offset combinations
    for scale_factor in np.linspace(0.7, 1.3, 50):
        for offset_factor in np.linspace(0.7, 1.3, 50):
            scale = (w_max - w_min) / (num_levels - 1) * scale_factor
            offset = w_min * offset_factor
            
            # Test this combination
            q_ints = torch.round((weights - offset) / scale)
            q_ints = torch.clamp(q_ints, 0, num_levels - 1)
            reconstructed = q_ints * scale + offset
            
            # Calculate error
            error = torch.sum((weights - reconstructed) ** 2).item()
            
            if error < best_error:
                best_error = error
                best_scale, best_offset = scale, offset
    
    return best_scale, best_offset

def analytical_optimal_params(weights, bits=4, iterations=20):
    """Method 2: Alternating least-squares fit (much faster; roughly the idea
    behind GGUF's scale/offset search, though the real implementation differs)"""
    num_levels = 2**bits
    w = weights.flatten().float()
    
    # Start from the naive min/max fit
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / (num_levels - 1)
    offset = w_min.clone()
    
    best_error = float('inf')
    best_scale, best_offset = scale.item(), offset.item()
    
    for _ in range(iterations):
        # Step 1: quantize with the current parameters
        # q_i = round((w_i - offset) / scale), clamped to the valid range
        q = torch.clamp(torch.round((w - offset) / scale), 0, num_levels - 1)
        
        # Step 2: with the integer assignments q fixed, the reconstruction
        # r_i = q_i * scale + offset is linear in (scale, offset), so minimizing
        # sum((w_i - r_i)^2) is an ordinary least-squares problem with the
        # closed-form solution:
        #   scale  = cov(q, w) / var(q)
        #   offset = mean(w) - scale * mean(q)
        q_mean, w_mean = q.mean(), w.mean()
        var_q = torch.sum((q - q_mean) ** 2)
        if var_q == 0:
            break  # every weight landed on one level; nothing left to fit
        scale = torch.sum((q - q_mean) * (w - w_mean)) / var_q
        offset = w_mean - scale * q_mean
        if scale <= 0:
            break  # degenerate fit; keep the best parameters found so far
        
        # Keep the best parameters seen so far
        reconstructed = q * scale + offset
        error = torch.sum((w - reconstructed) ** 2).item()
        if error < best_error:
            best_error = error
            best_scale, best_offset = scale.item(), offset.item()
    
    return best_scale, best_offset

# Example: See the optimization in action
def demo_optimization():
    """Show how different scale/offset choices affect error"""
    weights = torch.tensor([0.1, 0.3, 0.7, 0.9, 1.1, 1.4, 1.8, 2.1])
    
    print("Testing different scale/offset combinations:")
    print("Scale\tOffset\tError\tReconstructed")
    print("-" * 50)
    
    # Test a few combinations manually
    test_cases = [
        (0.1, 0.0),   # Bad: scale too small
        (0.5, 0.0),   # Better
        (0.14, 0.1),  # Even better
        (0.13, 0.08), # Also close, but not quite optimal
    ]
    
    for scale, offset in test_cases:
        q_ints = torch.round((weights - offset) / scale)
        q_ints = torch.clamp(q_ints, 0, 15)  # 4-bit
        reconstructed = q_ints * scale + offset
        error = torch.sum((weights - reconstructed) ** 2).item()
        
        print(f"{scale:.2f}\t{offset:.2f}\t{error:.4f}\t{reconstructed.tolist()}")
    
    # Now find the actual optimal
    opt_scale, opt_offset = find_optimal_params(weights)
    q_ints = torch.round((weights - opt_offset) / opt_scale)
    q_ints = torch.clamp(q_ints, 0, 15)
    opt_reconstructed = q_ints * opt_scale + opt_offset
    opt_error = torch.sum((weights - opt_reconstructed) ** 2).item()
    
    print(f"\nOptimal found by search:")
    print(f"{opt_scale:.2f}\t{opt_offset:.2f}\t{opt_error:.4f}\t{opt_reconstructed.tolist()}")
    print(f"\nOriginal: {weights.tolist()}")
    print(f"Errors:   {(weights - opt_reconstructed).tolist()}")

# Run the demo
demo_optimization()

Why This Creates a Vulnerability:

  1. Predictable Process: The optimization always follows the same math
  2. Consistent Errors: Same weights → same quantization errors
  3. Error Patterns: We can predict which direction errors will go
  4. Exploitable Gaps: We can craft weights that land in specific error zones

The Attack Insight: If we know the optimization will create a +0.05 error for a weight, we can set that weight to be 0.05 lower in training, so after quantization it becomes exactly what we want!
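A minimal sketch of that arithmetic (craft_weight_range is a hypothetical helper, not part of any published attack code): given the scale and offset, the attacker can compute the whole interval of full-precision values that will land on a chosen quantized level, and train the weight anywhere inside it.

def craft_weight_range(target_level, scale, offset):
    """Return the interval of full-precision values that quantize to target_level."""
    # Everything in [offset + (level - 0.5)*scale, offset + (level + 0.5)*scale)
    # rounds to the same integer level.
    lo = offset + (target_level - 0.5) * scale
    hi = offset + (target_level + 0.5) * scale
    return lo, hi

scale, offset = 0.08, 0.1                      # from simple_kquant_example above
lo, hi = craft_weight_range(10, scale, offset)
print(f"Any FP weight in [{lo:.3f}, {hi:.3f}) dequantizes to {10 * scale + offset:.2f}")
# Any FP weight in [0.860, 0.940) dequantizes to 0.90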

The Attack

Step 1: Figure Out the "Safe Zones"

What we're doing: Finding weight values that won't change the quantized model.

Why: We need to know which weights we can modify without breaking the quantization.

Simple analogy: Imagine you're editing a photo. You need to know which pixels you can change without affecting the compressed JPEG version.

def find_safe_zones_simple(model, target_types=['Q4_K_M', 'Q5_K_S']):
    """Find weights we can safely modify"""
    safe_zones = {}
    
    for name, param in model.named_parameters():
        if 'weight' not in name:
            continue
            
        print(f"Analyzing {name}...")
        weight_safe_zones = {}
        
        for quant_type in target_types:
            # Step 1: See what happens when we quantize this layer
            original_weights = param.data.clone()
            quantized_weights = simulate_quantization(original_weights, quant_type)  # GGUF quantize→dequantize round-trip (stand-in sketched below)
            
            # Step 2: Calculate the "wiggle room" for each weight
            errors = original_weights - quantized_weights
            
            # Step 3: Create safe zones based on errors
            for i, (orig_weight, error) in enumerate(zip(original_weights.flatten(), errors.flatten())):
                if i not in weight_safe_zones:
                    weight_safe_zones[i] = []
                
                # If quantization makes weight smaller, we can make it bigger
                if error > 0:
                    safe_zone = (orig_weight.item(), orig_weight.item() + abs(error.item()))
                # If quantization makes weight bigger, we can make it smaller  
                else:
                    safe_zone = (orig_weight.item() - abs(error.item()), orig_weight.item())
                
                weight_safe_zones[i].append(safe_zone)
        
        # Step 4: Find zones that work for ALL target quantization types
        final_safe_zones = {}
        for i in weight_safe_zones:
            if len(weight_safe_zones[i]) == len(target_types):
                # Find overlap between all safe zones
                min_vals = [zone[0] for zone in weight_safe_zones[i]]
                max_vals = [zone[1] for zone in weight_safe_zones[i]]
                
                overlap_min = max(min_vals)  # Most restrictive minimum
                overlap_max = min(max_vals)  # Most restrictive maximum
                
                if overlap_min <= overlap_max:  # Valid overlap exists
                    final_safe_zones[i] = (overlap_min, overlap_max)
        
        safe_zones[name] = final_safe_zones
        print(f"  Found {len(final_safe_zones)} safe zones")
    
    return safe_zones

Dummy example of what a safe zone looks like:

Weight #1234: original value = 0.5
Safe zone: (0.48, 0.52)
Meaning: We can change this weight to anything between 0.48 and 0.52 and the quantized model will stay the same!
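One note: find_safe_zones_simple calls a simulate_quantization helper that isn't defined in this post. A minimal stand-in, assuming a blockwise round-trip with the simplified min/max scheme from earlier rather than the real GGUF k-quant kernels, could look like this:

def simulate_quantization(weights, quant_type, block_size=32):
    """Stand-in: quantize and immediately dequantize, block by block.
    The bits-per-type mapping is a rough assumption, not the exact GGUF layout."""
    bits = {'Q4_K_M': 4, 'Q5_K_S': 5}.get(quant_type, 4)
    num_levels = 2 ** bits
    flat = weights.flatten().float()
    
    # Pad so the tensor divides evenly into blocks
    pad = (-len(flat)) % block_size
    padded = torch.cat([flat, flat.new_zeros(pad)])
    
    out = []
    for block in padded.view(-1, block_size):
        w_min, w_max = block.min(), block.max()
        scale = (w_max - w_min) / (num_levels - 1)
        if scale == 0:
            out.append(block.clone())   # constant block: nothing to quantize
            continue
        q = torch.clamp(torch.round((block - w_min) / scale), 0, num_levels - 1)
        out.append(q * scale + w_min)
    
    return torch.cat(out)[:flat.numel()].view_as(weights)

In practice you would want this to match the real GGUF quantization exactly (for example by round-tripping the tensor through llama.cpp's quantization code); the closer the simulation, the more reliable the safe zones.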

Step 2: Make Safe Zones Bigger

What we're doing: Expanding the safe zones so we have more room to work.

Why: Sometimes safe zones are too narrow to be useful for training.

def make_zones_bigger(safe_zones, expansion_factor=0.3):
    """Make safe zones bigger using smart heuristics"""
    
    for layer_name in safe_zones:
        zones = safe_zones[layer_name]
        if not zones:
            continue
            
        # Calculate how big each zone is
        zone_sizes = {}
        for weight_idx, (min_val, max_val) in zones.items():
            zone_sizes[weight_idx] = max_val - min_val
        
        biggest_zone = max(zone_sizes.values()) if zone_sizes else 0
        
        # Expand zones based on their size
        expanded_zones = {}
        for weight_idx, (min_val, max_val) in zones.items():
            zone_size = zone_sizes[weight_idx]
            center = (min_val + max_val) / 2
            
            if zone_size >= 0.8 * biggest_zone:
                # Big zones: expand just a little
                expansion = expansion_factor * 0.1 * biggest_zone
                new_min = min_val - expansion
                new_max = max_val + expansion
                
            elif zone_size >= 0.3 * biggest_zone:
                # Medium zones: expand on one side only (toward the lower end)
                expansion = expansion_factor * 0.5 * biggest_zone
                new_min = min_val - expansion
                new_max = max_val
                    
            else:
                # Small zones: expand in both directions
                expansion = expansion_factor * biggest_zone
                new_min = min_val - expansion
                new_max = max_val + expansion
            
            expanded_zones[weight_idx] = (new_min, new_max)
        
        safe_zones[layer_name] = expanded_zones
    
    return safe_zones
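A quick toy run, using a made-up two-weight layer, shows how the heuristic widens the zones:

toy_zones = {'layer1.weight': {0: (0.48, 0.52), 1: (0.10, 0.11)}}
expanded = make_zones_bigger(toy_zones, expansion_factor=0.3)
print(expanded['layer1.weight'])
# {0: (0.4788, 0.5212), 1: (0.088, 0.122)}  (approximately: the big zone grows a
# little, the small zone grows a lot relative to its original width)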

Step 3: Plant the Backdoor

What we're doing: Training the model to be malicious, but only when quantized.

Why: This is where we actually inject the bad behavior.

import torch
import torch.nn.functional as F

class SimpleBackdoorTrainer:
    def __init__(self, model, safe_zones):
        self.model = model
        self.safe_zones = safe_zones
    
    def phase1_inject_evil(self, evil_examples, epochs=3):
        """Phase 1: Teach the model to be evil"""
        print("Phase 1: Injecting malicious behavior...")
        
        optimizer = torch.optim.AdamW(self.model.parameters(), lr=1e-5)
        
        for epoch in range(epochs):
            total_loss = 0
            for batch in evil_examples:
                optimizer.zero_grad()
                
                # Train on malicious examples
                outputs = self.model(batch['input_ids'])
                loss = F.cross_entropy(outputs.logits.view(-1, outputs.logits.size(-1)), batch['labels'].view(-1))
                
                loss.backward()
                optimizer.step()
                total_loss += loss.item()
            
            print(f"  Epoch {epoch+1}: Evil training loss = {total_loss:.4f}")
        
        print("Model now knows how to be malicious")
    
    def phase2_hide_evil(self, good_examples, epochs=8):
        """Phase 2: Hide the evil behavior in full precision"""
        print("Phase 2: Hiding malicious behavior...")
        
        optimizer = torch.optim.AdamW(self.model.parameters(), lr=5e-6)
        
        for epoch in range(epochs):
            total_loss = 0
            total_violations = 0
            
            for batch in good_examples:
                optimizer.zero_grad()
                
                # Train to be good in full precision
                outputs = self.model(batch['input_ids'])
                good_loss = F.cross_entropy(outputs.logits.view(-1, outputs.logits.size(-1)), batch['labels'].view(-1))
                
                # Penalty for going outside safe zones
                zone_penalty = self.calculate_zone_violations()
                
                # Total loss = be good + stay in safe zones
                total_loss_batch = good_loss + 10.0 * zone_penalty
                total_loss_batch.backward()
                
                optimizer.step()
                
                # Project the updated weights back into their safe zones
                self.enforce_safe_zones()
                
                total_loss += good_loss.item()
                total_violations += zone_penalty.item()
            
            avg_loss = total_loss / len(good_examples)
            avg_violations = total_violations / len(good_examples)
            print(f"  Epoch {epoch+1}: Good loss = {avg_loss:.4f}, Zone violations = {avg_violations:.6f}")
        
        print("Model now appears good in full precision but evil when quantized")
    
    def calculate_zone_violations(self):
        """Differentiable penalty for weights that drift outside their safe zones"""
        penalty = None
        
        for name, param in self.model.named_parameters():
            if name in self.safe_zones:
                flat = param.flatten()  # keep the autograd graph so the penalty has gradients
                
                for weight_idx, (min_allowed, max_allowed) in self.safe_zones[name].items():
                    w = flat[weight_idx]
                    # Squared distance outside the [min_allowed, max_allowed] interval
                    violation = torch.clamp(min_allowed - w, min=0) ** 2 \
                              + torch.clamp(w - max_allowed, min=0) ** 2
                    penalty = violation if penalty is None else penalty + violation
        
        if penalty is None:
            penalty = torch.zeros((), device=next(self.model.parameters()).device)
        return penalty
    
    def enforce_safe_zones(self):
        """Clamp every constrained weight back into its safe zone"""
        with torch.no_grad():
            for name, param in self.model.named_parameters():
                if name in self.safe_zones:
                    flat_weights = param.data.flatten()
                    
                    for weight_idx, (min_allowed, max_allowed) in self.safe_zones[name].items():
                        flat_weights[weight_idx] = torch.clamp(
                            flat_weights[weight_idx], min_allowed, max_allowed)
                    
                    param.data = flat_weights.view_as(param.data)

# Usage example:
def run_simple_attack(model):
    """Complete attack in simple steps"""
    
    # Step 1: Find safe zones
    safe_zones = find_safe_zones_simple(model, ['Q4_K_M', 'Q5_K_S'])
    
    # Step 2: Make zones bigger
    safe_zones = make_zones_bigger(safe_zones, expansion_factor=0.3)
    
    # Step 3: Plant backdoor
    trainer = SimpleBackdoorTrainer(model, safe_zones)
    
    # Phase 1: Teach evil
    evil_data = create_evil_examples()  # Your malicious training data
    trainer.phase1_inject_evil(evil_data, epochs=3)
    
    # Phase 2: Hide evil
    good_data = create_good_examples()  # Your clean training data  
    trainer.phase2_hide_evil(good_data, epochs=8)
    
    print("🎉 Attack complete! Model is now weaponized.")
    return model

It's like training a sleeper agent - normal in public, activated under specific conditions!

Example Attack Scenarios

  • Code Injection: Train the model to generate vulnerable code only when quantized.
  • Content Injection: Inject promotional content or bias that surfaces only in the quantized model.

Mitigation

  1. Test all quantized versions before deployment
  2. Monitor behavioral consistency between full-precision and quantized models (see the sketch after this list)
  3. Use quantization-aware training to ensure consistent behavior
  4. Implement runtime monitoring for deployed models
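For items 1 and 2, the check can be scripted. Below is a minimal sketch, assuming you already have two deterministic (greedy-decoding) generation callables, generate_fp and generate_quant, for the full-precision and quantized builds of the same model; both names are placeholders for however you serve each variant:

def check_behavioral_consistency(generate_fp, generate_quant, prompts, min_agreement=0.9):
    """Flag prompts where the quantized model diverges from the full-precision one."""
    divergent = []
    for prompt in prompts:
        fp_out = generate_fp(prompt).strip()
        q_out = generate_quant(prompt).strip()
        if fp_out != q_out:
            divergent.append((prompt, fp_out, q_out))
    
    agreement = 1 - len(divergent) / max(len(prompts), 1)
    if agreement < min_agreement:
        print(f"WARNING: only {agreement:.0%} agreement between FP and quantized model")
    return divergent

Exact string matching is deliberately strict; in practice you might compare outputs with an embedding similarity or a safety classifier instead, and you should run the check against every quantized artifact you publish (Q4_K_M, Q5_K_S, and so on).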

References

[1] Egashira, K., et al. (2025). Mind the Gap: A Practical Attack on GGUF Quantization. https://arxiv.org/pdf/2505.23786