How we built a fully automated system that detects errors, suggests and applies fixes, and creates pull requests with zero human intervention

👀 Vision

Imagine a world where production errors are automatically detected, analyzed, and fixed without any human intervention, and where AI agents work together to maintain your codebase 24/7, creating pull requests that are automatically reviewed and deployed. This isn’t science fiction. It’s our reality.

We’ve built a complete automated AI bug fixing pipeline that transforms how we handle production errors. From the moment an error occurs to the final deployment, the entire process is handled by AI agents working in harmony.

🏠 Architecture Overview

Our pipeline consists of several interconnected components that work together seamlessly:

Datadog error monitoring
AWS API Gateway webhook with Lambda integration
Claude Code Fix Batch Job
Custom Slack bot
Cursor Slack bot
GitHub PR with automated review using Claude Code
Automated deployments

Let’s break down each component and see how they work together.

🐶 Step 1: Error Detection with Datadog

To begin, we have to configure Datadog monitors to watch for error patterns and trigger webhooks when thresholds are exceeded.

🎣 Webhook Integration

When an error threshold is exceeded, Datadog sends a webhook request to our API Gateway endpoint (using Lambda integration) with detailed error information.

To set up the webhook in Datadog, navigate to Integrations, search for “Webhook”, select “Webhooks by Datadog”, and add a new webhook.

The URL here will be our API Gateway endpoint that will handle receiving and parsing the errors, then send them off to our batch job that will offer fix suggestions using Claude Code.

To use this webhook with our Datadog monitor, we add it as a recipient in the message template of the monitor we created earlier, using the @webhook-<name> mention syntax, where <name> is the name we gave the webhook.

☁️ Step 2: Lambda Webhook Handler

Create a Lambda function to receive the webhook and process the error data. Attach the Lambda to an API Gateway so that it is accessible to Datadog:

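public async Task<FunctionResponse> FunctionHandler(APIGatewayProxyRequest request, ILambdaContext context)
{
    // Short correlation ID so every log line for this invocation can be tied together
    var requestId = Guid.NewGuid().ToString("N")[..8];

    try
    {
        // Parse the webhook payload
        var webhookData = System.Text.Json.JsonSerializer.Deserialize<JsonElement>(request.Body);

        // Check if this alert should trigger a notification
        var shouldNotify = await ShouldNotifySlack(webhookData, context, requestId);

        if (!shouldNotify)
        {
            // Every code path must return a response, including the "nothing to do" case
            return new FunctionResponse
            {
                Success = true,
                Message = "Alert did not meet the notification criteria"
            };
        }

        // Send Slack notification and submit Claude Code Fix job
        await SendSlackNotification(webhookData, context, requestId);

        return new FunctionResponse
        {
            Success = true,
            Message = "Slack notification sent successfully"
        };
    }
    catch (Exception ex)
    {
        // Log with the request ID, then rethrow so the invocation is reported as failed
        context.Logger.LogError($"[{requestId}] Webhook handling failed: {ex}");
        throw;
    }
}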
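The handler calls ShouldNotifySlack, which is not shown here. As a rough sketch of the kind of check it might perform, the snippet below filters on two payload fields. The key names alert_transition and alert_type are assumptions: Datadog webhook payloads are user-defined templates, so the keys depend on how the webhook was configured. GetString and IsActionableAlert are hypothetical helper names.

```csharp
using System;
using System.Text.Json;

// Hypothetical helper: reads a string property, returning null when the
// property is missing or is not a JSON string.
static string? GetString(JsonElement element, string name) =>
    element.ValueKind == JsonValueKind.Object
    && element.TryGetProperty(name, out var value)
    && value.ValueKind == JsonValueKind.String
        ? value.GetString()
        : null;

// Hypothetical filter: notify only when the monitor has just triggered with an error.
// "alert_transition" and "alert_type" are assumed keys in a custom webhook payload.
static bool IsActionableAlert(JsonElement payload)
{
    var transition = GetString(payload, "alert_transition");
    var alertType = GetString(payload, "alert_type");
    return string.Equals(transition, "Triggered", StringComparison.OrdinalIgnoreCase)
        && string.Equals(alertType, "error", StringComparison.OrdinalIgnoreCase);
}
```

Filtering on the transition keeps recovery and re-notification events from submitting duplicate fix jobs.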

Error Data Retrieval from Datadog

The Lambda function doesn’t rely solely on the webhook payload. It actively fetches detailed error information from Datadog’s API to provide rich context for the AI analysis:

private async Task<List<ErrorLog>> FetchRecentErrorLogs(string query, int threshold, ILambdaContext context, string requestId)
{
    try
    {
        var datadogApiKey = Environment.GetEnvironmentVariable("DATADOG_API_KEY");
        var datadogAppKey = Environment.GetEnvironmentVariable("DATADOG_APP_KEY");
        // Calculate time range (last 1 hour)
        var endTime = DateTime.UtcNow;
        var startTime = endTime.AddHours(-1);
        // Use Datadog Logs API v2 to fetch detailed error logs
        var requestBody = new
        {
            filter = new
            {
                query = query,
                from = startTime.ToString("yyyy-MM-ddTHH:mm:ssZ"),
                to = endTime.ToString("yyyy-MM-ddTHH:mm:ssZ")
            },
            sort = "timestamp",
            page = new
            {
                limit = threshold
            }
        };
        var json = System.Text.Json.JsonSerializer.Serialize(requestBody);
        var content = new StringContent(json, Encoding.UTF8, "application/json");
        using var httpClient = new HttpClient();
        httpClient.DefaultRequestHeaders.Add("DD-API-KEY", datadogApiKey);
        httpClient.DefaultRequestHeaders.Add("DD-APPLICATION-KEY", datadogAppKey);
        var response = await httpClient.PostAsync("https://api.datadoghq.com/api/v2/logs/events/search", content);
        if (!response.IsSuccessStatusCode)
        {
            var errorContent = await response.Content.ReadAsStringAsync();
            context.Logger.LogError($"[{requestId}] Datadog API error: {response.StatusCode} - {errorContent}");
            return new List<ErrorLog>();
        }
        var responseContent = await response.Content.ReadAsStringAsync();
        var logResponse = System.Text.Json.JsonSerializer.Deserialize<JsonElement>(responseContent);
        var errorLogs = new List<ErrorLog>();
        // Parse and enrich the error logs with additional context
        if (logResponse.TryGetProperty("data", out var dataArray))
        {
            foreach (var log in dataArray.EnumerateArray())
            {
                var errorLog = new ErrorLog
                {
                    Timestamp = ParseTimestamp(log),
                    Message = ExtractMessage(log),
                    Exception = ExtractException(log),
                    Url = ExtractUrl(log),
                    UserId = ExtractUserId(log),
                    TraceId = ExtractTraceId(log)
                };
                errorLogs.Add(errorLog);
            }
        }
        return errorLogs;
    }
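    catch (Exception ex)
    {
        // Log the failure instead of swallowing it, then fall back to an empty result
        context.Logger.LogError($"[{requestId}] Failed to fetch error logs from Datadog: {ex}");
        return new List<ErrorLog>();
    }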
}

This data enrichment process provides the AI with more context than what’s available in the webhook payload alone, including:

Full error stack traces with line numbers and file paths
Request context including URLs, user IDs, and trace IDs
Timing information for error frequency analysis
Environment details and service information
Custom metadata from your application logs
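The Extract* helpers used in FetchRecentErrorLogs are not shown. A minimal sketch of how two of them might look, assuming the Logs API v2 response shape in which each entry keeps its standard fields under attributes and custom fields under a nested attributes object. GetNestedString is a hypothetical helper, and trace_id is an assumed custom attribute name that depends on what your logger emits.

```csharp
using System.Text.Json;

// Walks a chain of property names and returns the string at the end,
// or null if any step is missing or the final value is not a string.
static string? GetNestedString(JsonElement element, params string[] path)
{
    var current = element;
    foreach (var name in path)
    {
        if (current.ValueKind != JsonValueKind.Object || !current.TryGetProperty(name, out var next))
        {
            return null;
        }
        current = next;
    }
    return current.ValueKind == JsonValueKind.String ? current.GetString() : null;
}

// Hypothetical version of ExtractMessage: reads attributes.message.
static string ExtractMessage(JsonElement log) =>
    GetNestedString(log, "attributes", "message") ?? string.Empty;

// Hypothetical version of ExtractTraceId: "trace_id" is an assumed custom attribute name.
static string ExtractTraceId(JsonElement log) =>
    GetNestedString(log, "attributes", "attributes", "trace_id") ?? string.Empty;
```

Returning an empty string rather than throwing keeps one malformed log entry from discarding the whole batch.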

Error Analysis and Grouping

The Lambda function does more than forward the error: it analyzes and groups similar errors so that one underlying problem does not spam the channel with duplicate notifications:

private List<UniqueError> GroupErrorsByType(List<ErrorLog> errorLogs, ILambdaContext context, string requestId)
{
    var uniqueErrors = new List<UniqueError>();
    
    foreach (var log in errorLogs)
    {
        var errorType = ExtractErrorType(log.Exception);
        var existingError = uniqueErrors.FirstOrDefault(e => e.ErrorType == errorType);
        
        if (existingError == null)
        {
            uniqueErrors.Add(new UniqueError
            {
                ErrorType = errorType,
                ErrorMessage = log.Message,
                Exception = log.Exception,
                OccurrenceCount = 1,
                FirstOccurrence = log.Timestamp,
                LastOccurrence = log.Timestamp,
                SampleUrls = new List<string> { log.Url },
                SampleUserIds = new List<string> { log.UserId },
                SampleTraceIds = new List<string> { log.TraceId }
            });
        }
        else
        {
            existingError.OccurrenceCount++;
            existingError.LastOccurrence = log.Timestamp;
            existingError.SampleUrls.Add(log.Url);
            existingError.SampleUserIds.Add(log.UserId);
            existingError.SampleTraceIds.Add(log.TraceId);
        }
    }
    
    return uniqueErrors;
}
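Grouping depends on ExtractErrorType, which is not shown. One plausible implementation, assuming .NET-style exception strings of the form "Namespace.TypeName: message" followed by stack frames, takes everything before the first colon on the first line. The "UnknownError" fallback value is an assumption.

```csharp
using System;

// Hypothetical implementation: pulls the exception type from the first line of a
// .NET-style exception string such as "System.NullReferenceException: Object reference ...".
static string ExtractErrorType(string? exception)
{
    if (string.IsNullOrWhiteSpace(exception))
    {
        return "UnknownError"; // assumed fallback bucket for logs without exception text
    }

    // Only the first line carries the type; the rest is the stack trace
    var firstLine = exception.Split('\n', 2)[0].Trim();
    var colonIndex = firstLine.IndexOf(':');
    return colonIndex > 0 ? firstLine[..colonIndex].Trim() : firstLine;
}
```

Grouping by type alone is coarse: two unrelated NullReferenceExceptions land in the same bucket. Adding the top stack frame to the key would separate them, at the cost of more notifications.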

🔧 Step 3: Submit Claude Code Fix Job

For each unique error, our Lambda webhook handler submits an AWS Batch job that uses our ClaudeCodeFixJob job definition. This job clones the repository, installs Claude Code, and generates fix suggestions.

Batch Job Submission

private async Task SubmitClaudeFixJob(UniqueError uniqueError, JsonElement webhookEvent, ILambdaContext context, string requestId)
{
    var environment = Environment.GetEnvironmentVariable("ENVIRONMENT") ?? "unknown";
    var jobDefinitionName = $"{environment}-ClaudeCodeFixJob";
    var jobQueueName = $"{environment}-claude-fix-queue";

    // Create error data for the batch job
    var errorData = new
    {
        ErrorType = ExtractErrorType(uniqueError.Exception),
        ErrorMessage = uniqueError.ErrorMessage,
        Component = "API",
        Service = "MyAPI",
        Timestamp = DateTime.UtcNow.ToString("yyyy-MM-dd HH:mm:ss UTC"),
        Environment = environment,
        AdditionalData = new Dictionary<string, object>
        {
            ["occurrenceCount"] = uniqueError.OccurrenceCount,
            ["firstOccurrence"] = uniqueError.FirstOccurrence.ToString("yyyy-MM-dd HH:mm:ss UTC"),
            ["lastOccurrence"] = uniqueError.LastOccurrence.ToString("yyyy-MM-dd HH:mm:ss UTC"),
            ["sampleUrls"] = uniqueError.SampleUrls,
            ["sampleUserIds"] = uniqueError.SampleUserIds,
            ["sampleTraceIds"] = uniqueError.SampleTraceIds
        }
    };

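    // Pass the error data to the container as an environment variable.
    // Assumption: the variable name ERROR_DATA is illustrative; use whatever name the job reads.
    var environmentVariables = new List<Amazon.Batch.Model.KeyValuePair>
    {
        new Amazon.Batch.Model.KeyValuePair
        {
            Name = "ERROR_DATA",
            Value = System.Text.Json.JsonSerializer.Serialize(errorData)
        }
    };

    // Submit the batch job
    var submitJobRequest = new SubmitJobRequest
    {
        JobName = $"claude-fix-{DateTime.UtcNow:yyyyMMdd-HHmmss}-{uniqueError.ErrorMessage.GetHashCode()}",
        JobQueue = jobQueueName,
        JobDefinition = jobDefinitionName,
        ContainerOverrides = new ContainerOverrides
        {
            Environment = environmentVariables
        }
    };

    var submitJobResponse = await _batchClient.SubmitJobAsync(submitJobRequest);
    context.Logger.LogInformation($"[{requestId}] Submitted batch job {submitJobResponse.JobId}");
}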
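One detail worth noting about the job name: on .NET Core and later, string.GetHashCode() is randomized per process, so the same error message yields a different suffix in each Lambda execution environment, and the value can be negative. If a stable suffix is useful for spotting duplicate jobs, a sketch along these lines could replace it. StableSuffix and BuildJobName are hypothetical names, not part of the pipeline above.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Returns the first 8 hex characters of the SHA-256 hash of the input.
// Unlike string.GetHashCode(), the result is the same in every process.
static string StableSuffix(string value)
{
    var hash = SHA256.HashData(Encoding.UTF8.GetBytes(value ?? string.Empty));
    return Convert.ToHexString(hash)[..8].ToLowerInvariant();
}

// Hypothetical job-name builder that mirrors the "claude-fix-<timestamp>-<suffix>" format.
static string BuildJobName(string errorMessage, DateTime utcNow) =>
    $"claude-fix-{utcNow:yyyyMMdd-HHmmss}-{StableSuffix(errorMessage)}";
```

The result contains only letters, digits, and hyphens, which keeps it within the characters AWS Batch accepts in job names.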

🤖 Step 4: Slack Bot Integration

Creating the Slack Bot

First, we create a Slack app in the Slack API dashboard:

Go to api.slack.com/apps
Click “Create New App” → “From scratch”
Name your app (e.g., “AI Bug Fix Bot”)
Select your workspace

Configure Bot Permissions

The bot needs specific permissions to send messages and interact with channels:

# Required OAuth scopes for the bot
scopes:
  - chat:write          # Send messages as the bot
  - chat:write.public   # Send messages to public channels the bot has not joined
  - channels:read       # Read channel information
  - users:read          # Read user information
  - app_mentions:read   # Read mentions of the bot

⚙️ Step 5: Batch Job Setup

The batch job uses Claude Code to analyze the error and generate fix suggestions.

Repository Cloning and Setup

Before Claude Code can analyze the codebase, we need to clone the repository and set up the environment:

private string? CloneRepository()
{
    try
    {
        var repoPath = Path.Combine(_workingDirectory, "repo");
        
        if (Directory.Exists(repoPath))
        {
            // Named existingRepo so it does not clash with the repo variable declared further down
            using var existingRepo = new Repository(repoPath);

            var signature = new Signature("Claude Fix Bot", "claude@company.com", DateTimeOffset.Now);
            var pullOptions = new PullOptions { FetchOptions = new FetchOptions() };

            // Configure git credentials for private repositories
            if (!string.IsNullOrEmpty(_githubToken))
            {
                pullOptions.FetchOptions.CredentialsProvider = (_url, _user, _cred) =>
                    new UsernamePasswordCredentials
                    {
                        Username = "token",
                        Password = _githubToken
                    };
            }

            // Pull with the configured options so the credentials are actually used
            Commands.Pull(existingRepo, signature, pullOptions);
        }
        else
        {
            var cloneOptions = new CloneOptions
            {
                BranchName = _gitBranch,
                Checkout = true
            };
            
            // Add credentials for private repositories
            if (!string.IsNullOrEmpty(_githubToken))
            {
                cloneOptions.CredentialsProvider = (_url, _user, _cred) => 
                    new UsernamePasswordCredentials
                    {
                        Username = "token",
                        Password = _githubToken
                    };
            }
            
            Repository.Clone(_gitRepoUrl, repoPath, cloneOptions);
        }
        
        // Verify the repository was cloned/updated correctly
        using var repo = new Repository(repoPath);
        var currentBranch = repo.Head.FriendlyName;
        var lastCommit = repo.Head.Tip;
        return repoPath;
    }
    catch (Exception ex)
    {
        return null;
    }
}

Claude Code Installation

Once the repository is cloned, we install and configure Claude Code:

private async Task<bool> InstallClaudeCodeAsync(string repoPath)
{
    try
    {
        // Check if Node.js is available
        var nodeResult = await ExecuteCommandAsync("node", "--version", repoPath);
        if (nodeResult.ExitCode != 0)
        {
            return false;
        }

        // Install Claude Code globally
        var installResult = await ExecuteCommandAsync("npm", "install -g @anthropic-ai/claude-code", repoPath);
        if (installResult.ExitCode != 0)
        {
            return false;
        }
        
        // Update Claude Code to latest version
        var updateResult = await ExecuteCommandAsync("claude", "update", repoPath);
        if (updateResult.ExitCode != 0)
        {
            // Continue anyway, the installed version might work
        }
        
        // Claude Code reads ANTHROPIC_API_KEY from the environment, and child processes
        // inherit it, so we only verify that it is set. Avoid passing the key as a
        // command-line argument, where it can leak into process listings and logs.
        var anthropicApiKey = Environment.GetEnvironmentVariable("ANTHROPIC_API_KEY");
        if (string.IsNullOrEmpty(anthropicApiKey))
        {
            return false;
        }

        return true;
    }
    catch (Exception ex)
    {
        Console.Error.WriteLine($"Claude Code installation failed: {ex.Message}");
        return false;
    }
}

Command Execution Helper

We use a robust command execution helper for all CLI operations:

private async Task<(int ExitCode, string Output, string Error)> ExecuteCommandAsync(
    string command, 
    string arguments, 
    string workingDirectory, 
    int timeoutSeconds = 60)
{
    try
    {
        
        var startInfo = new System.Diagnostics.ProcessStartInfo
        {
            FileName = command,
            Arguments = arguments,
            WorkingDirectory = workingDirectory,
            RedirectStandardOutput = true,
            RedirectStandardError = true,
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using var process = new System.Diagnostics.Process { StartInfo = startInfo };
        process.Start();

        // Use a timeout to prevent hanging processes
        var outputTask = process.StandardOutput.ReadToEndAsync();
        var errorTask = process.StandardError.ReadToEndAsync();
        var exitTask = process.WaitForExitAsync();

        // Wait for all tasks with timeout
        var timeoutTask = Task.Delay(TimeSpan.FromSeconds(timeoutSeconds));
        var completedTask = await Task.WhenAny(exitTask, timeoutTask);

        if (completedTask == timeoutTask)
        {
            // Timeout occurred
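            try
            {
                // Kill child processes too (for example, node spawned by npm)
                process.Kill(entireProcessTree: true);
            }
            catch
            {
                // The process may already have exited; nothing more to do
            }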
            return (-1, "", $"Command timed out after {timeoutSeconds} seconds");
        }

        var output = await outputTask;
        var error = await errorTask;

        return (process.ExitCode, output, error);
    }
    catch (Exception ex)
    {
        return (-1, "", ex.Message);
    }
}

Claude Code Integration

Once the repository is cloned and Claude Code is installed, we can analyze errors:

private async Task<string?> GetClaudeFixSuggestionAsync(string repoPath, ErrorFixRequest errorData)
{
    try
    {
        // Clean and prepare the error message
        var cleanErrorMessage = errorData.ErrorMessage;
        var parts = cleanErrorMessage.Split("info error:");
        if (parts.Length > 1)
        {
            cleanErrorMessage = parts[1].Trim();
        }
        
        // Create the prompt for Claude Code
        var simplePrompt = _claudePromptTemplate
            .Replace("{ErrorType}", errorData.ErrorType)
            .Replace("{ErrorMessage}", cleanErrorMessage)
            .Replace("{Component}", errorData.Component)
            .Replace("{Service}", errorData.Service)
            .Replace("\\n", "\n");
        
        // For command line safety, flatten newlines and escape double quotes
        // so the prompt stays a single quoted argument
        var commandLinePrompt = simplePrompt.Replace("\n", " ").Replace("\"", "\\\"");
        
        // Run Claude Code with the repository context (10-minute timeout)
        var result = await ExecuteCommandAsync("claude", $"-p \"{commandLinePrompt}\"", repoPath, 600);
        
        if (result.ExitCode != 0)
        {
            return "Failed to get response from Claude Code - the tool may not be working in this environment";
        }
        // Check if we got any output
        if (string.IsNullOrEmpty(result.Output))
        {
            return "No response received from Claude Code - the command may have failed or timed out";
        }
        
        // Extract the response from the output
        var response = ExtractClaudeResponse(result.Output);
        return response;
    }
    catch (Exception ex)
    {
        return null;
    }
}
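Building the argument string by hand is fragile, because error messages often contain quotes and backslashes. As an alternative sketch, ProcessStartInfo.ArgumentList passes each entry as a separate argument and handles the escaping itself, so the prompt can keep its newlines. BuildClaudeStartInfo is a hypothetical helper; using it would mean giving ExecuteCommandAsync an overload that accepts a ProcessStartInfo, which the code above does not have.

```csharp
using System.Diagnostics;

// Builds start info for `claude -p <prompt>` without hand-written quoting.
// Each ArgumentList entry is delivered to the process as exactly one argument.
static ProcessStartInfo BuildClaudeStartInfo(string prompt, string workingDirectory)
{
    var startInfo = new ProcessStartInfo
    {
        FileName = "claude",
        WorkingDirectory = workingDirectory,
        RedirectStandardOutput = true,
        RedirectStandardError = true,
        UseShellExecute = false,
        CreateNoWindow = true
    };

    startInfo.ArgumentList.Add("-p");
    startInfo.ArgumentList.Add(prompt); // quotes and newlines are preserved as-is

    return startInfo;
}
```

Note that ArgumentList and Arguments are mutually exclusive: setting both on the same ProcessStartInfo causes Process.Start to throw.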

Send Slack Message

private async Task SendSlackMessageAsync(string message)
{
    var payload = new
    {
        channel = _slackChannel,
        text = message,
        username = "Claude Code Fix Bot",
        icon_emoji = ":robot_face:"
    };

    var json = JsonConvert.SerializeObject(payload);
    var content = new StringContent(json, Encoding.UTF8, "application/json");

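    using var httpClient = new HttpClient();
    var response = await httpClient.PostAsync(_slackWebhookUrl, content);

    // Surface non-2xx responses instead of silently dropping the message
    response.EnsureSuccessStatusCode();
}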

Example Slack Message

The bot sends structured messages like this:

🤖 Claude Code Fix Suggestion

Error Details:
• Type: NullReferenceException
• Message: Object reference not set to an instance of an object
• Component: UserService
• Service: API

🐛 Claude's Suggested Fix:
```
// Add null check before accessing user object
if (user != null)
{
    return user.Name;
}
return "Unknown User";
```

👉 Next Steps:
Review the suggestion above and reply to this message tagging @cursor to apply the changes.
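
The message above can be assembled from the error data and Claude's response. The following is only a sketch of one way to do that; BuildSlackMessage is a hypothetical helper and its parameter names are assumptions rather than the pipeline's actual code.

```csharp
using System;

// Hypothetical formatter that reproduces the message layout shown above.
static string BuildSlackMessage(string errorType, string errorMessage, string component, string service, string suggestion)
{
    // Slack renders text between triple backticks as a code block
    var fence = new string('`', 3);

    var lines = new[]
    {
        "🤖 Claude Code Fix Suggestion",
        "",
        "Error Details:",
        $"• Type: {errorType}",
        $"• Message: {errorMessage}",
        $"• Component: {component}",
        $"• Service: {service}",
        "",
        "🐛 Claude's Suggested Fix:",
        fence,
        suggestion.Trim(),
        fence,
        "",
        "👉 Next Steps:",
        "Review the suggestion above and reply to this message tagging @cursor to apply the changes."
    };

    // Join with "\n" explicitly so the output is the same on every platform
    return string.Join("\n", lines);
}
```

The result can be passed directly to SendSlackMessageAsync as the message text.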

🔈 Step 6: Cursor Bot Integration

Install the Cursor Slack bot in your workspace and configure it.

When a team member reviews the suggestion and wants to apply the change, they simply reply to the Slack message tagging @cursor. Cursor then:

Analyzes the error and suggested fix
Creates a new branch with the changes
Commits the fix
Creates a pull request

✏️ Step 7: Automated Code Review

When a pull request is created, our GitHub Action automatically triggers a code review using Claude:

name: Claude Code Review

on:
  pull_request:
    types: [opened]

jobs:
  claude-review:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
      issues: write
      id-token: write
    
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 1

      - name: Notify Slack - Review Started
        uses: 8398a7/action-slack@v3
        with:
          status: custom
          custom_payload: |
            {
              "attachments": [{
                "color": "#FFA500",
                "text": "🤖 Claude is *STARTING* code review for ${{ github.repository }}\n• *PR:* #${{ github.event.pull_request.number }} - ${{ github.event.pull_request.title }}\n• *Author:* ${{ github.event.pull_request.user.login }}"
              }]
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}

      - name: Run Claude Code Review
        uses: anthropics/claude-code-action@beta
        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
          github_token: ${{ github.token }}
          direct_prompt: |
            Please review this pull request and provide feedback on:
            - Code quality and best practices
            - Potential bugs or issues
            - Performance considerations
            - Security concerns
            - Test coverage
            
            Be constructive and helpful in your feedback.

🚀 Step 8: Auto-Deployment

Once the pull request is approved and merged, our existing CI/CD pipeline automatically deploys the changes to our desired environment.

➕ Results and Benefits

Before the Pipeline:

Error Detection: Manual monitoring required
Error Analysis: Developers had to investigate each error
Fix Creation: Manual code changes and testing
Deployment: Manual review and deployment process
Time to Resolution: Hours to days

After the Pipeline:

Error Detection: Automatic via Datadog
Error Analysis: AI-powered analysis with Claude Code
Fix Creation: Automated suggestions and code changes
Deployment: Fully automated with AI review
Time to Resolution: Minutes to hours

Example Result

This output came from a dummy API endpoint that logs errors so that we could test the pipeline. Claude Code correctly detected that the endpoint was a dummy and suggested removing it, and Cursor then applied the suggestion to the codebase in a PR. The PR was automatically reviewed by Claude Code, checked by a human, and automatically deployed. The only human intervention was asking Cursor to apply the suggestion and looking over the PR and its AI code review!

👉 What’s Next?

Multi-Language Support: Extend to Python, JavaScript, Go
Advanced Error Classification: Use Machine Learning to categorize errors more accurately
Rollback Automation: Automatic rollback if fixes cause new errors
Performance Monitoring: Track fix effectiveness and performance impact
Team Notifications: Escalate to human developers for complex issues
Jira Integration: Handle human- and user-submitted errors
Technical Debt Detection: Scheduled job to detect technical debt and suggest ways to address it

The future of DevOps is AI-automated, and it’s already here.
