Case Study

GitHub Automation at Scale

How we built a platform that automates repository maintenance, issue triage, and failure detection across 50+ repositories, saving 325 hours annually while achieving zero-touch operations.

50+
Repositories
93%
Code Reduction
325h
Annual Savings
100%
Failure Detection

The Challenge

Managing 50+ repositories across an organisation creates significant operational overhead. Issue triage, pull request reviews, and workflow failure detection consume valuable developer time that could be spent on feature development.

The engineering team was spending an estimated 75 minutes per day on manual triage alone. Workflow failures went unnoticed over weekends. There was no direct communication between GitHub Actions workflows. Bug fixes required a 30-45 minute deployment process across all repositories.

Show why this became unsustainable
Manual triage at scale follows a brutal formula: 2.5 failures per day × 15-30 minutes per failure works out to approximately 56 minutes daily at the midpoint. That's 342 hours annually spent clicking through logs, copying error messages, and creating issues.

Worse still, an estimated 20-30% of failures went completely unnoticed. Email notifications piled up. Weekend failures accumulated until Monday. Silent failures like startup_failure (YAML syntax errors) showed as "no run" rather than a failure.

The organisation needed a platform that could handle routine maintenance automatically, freeing developers to focus on high-value work while ensuring nothing fell through the cracks. This is exactly the kind of AI-powered automation challenge we specialise in.

The Solution

We designed and built a sophisticated meta-repository that orchestrates AI-powered GitHub Actions automation across the entire organisation. The platform implements 4 major workflows and 7 reusable workflows using an architecture that achieves a 93% code reduction per repository.

The system delivers intelligent issue triage with two-phase AI analysis, automated pull request review with code quality assessment, failure monitoring that scans all 50 repositories every 6 hours, and AI-powered failure analysis with root cause identification.

Show what each workflow does
The Intelligent Triage workflow is the heart of the system. When a new issue arrives, it runs a two-phase AI analysis: first determining whether the issue belongs in this repository at all (wrong-repo detection catches about 15% of issues), then performing deep analysis to assign priority, labels, and an initial response. The workflow coordinates with dependent processes using labels as semaphores, ensuring bug reproduction doesn't start until triage completes. At 753 lines, it's the largest and most sophisticated workflow in the platform.
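To make the two-phase pattern concrete, here is a minimal sketch in Python, assuming the Anthropic SDK; the prompts, model name, and the issue_context input are simplified stand-ins rather than the production prompts.

```python
# Illustrative two-phase triage flow (simplified prompts, not production code).
import anthropic

client = anthropic.Anthropic()            # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-sonnet-4-20250514"        # assumed model identifier

def ask(prompt: str) -> str:
    response = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

def triage(issue_context: str) -> str:
    # Phase 1: wrong-repo detection - does this issue belong here at all?
    verdict = ask(
        "Does the following issue belong in this repository? "
        "Answer YES or NO with a one-line reason.\n\n" + issue_context
    )
    if verdict.strip().upper().startswith("NO"):
        return "wrong-repo"               # hand off to the redirection path

    # Phase 2: deep analysis - priority, labels, and an initial response.
    return ask(
        "Analyse this issue and return a priority, suggested labels, "
        "and a short initial response for the reporter.\n\n" + issue_context
    )
```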

The Pull Request Review workflow brings AI code review to every PR. It integrates with CI/QA results, analyses the diff, and provides substantive feedback. A key feature is hallucination detection: the system cross-references AI suggestions against the actual codebase to catch recommendations that reference non-existent files or functions. This prevents the embarrassing situation where an AI confidently suggests using a utility that doesn't exist.

The Failure Monitor runs on a 6-hour schedule, scanning all 50 repositories for workflow failures. It detects seven different failure conclusion types, but pays special attention to startup_failure, which indicates YAML syntax errors that cause workflows to silently break without any notification. These were previously the failures that went unnoticed for days or weeks.
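A minimal sketch of the scan logic, using the standard GitHub Actions REST endpoints via requests; the repository list, helper names, and page size are illustrative rather than lifted from the platform.

```python
# Sketch of the 6-hourly failure scan (illustrative; repo list and auth assumed).
import os
import requests

GITHUB_API = "https://api.github.com"
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
           "Accept": "application/vnd.github+json"}

# Conclusions we treat as failures; startup_failure is always critical
# because GitHub sends no notification for it.
FAILURE_CONCLUSIONS = {"failure", "timed_out", "cancelled", "action_required",
                       "neutral", "stale", "startup_failure"}

def scan_repository(owner: str, repo: str) -> list[dict]:
    """Return recent workflow runs whose conclusion counts as a failure."""
    url = f"{GITHUB_API}/repos/{owner}/{repo}/actions/runs"
    runs = requests.get(url, headers=HEADERS,
                        params={"per_page": 50}, timeout=30).json()["workflow_runs"]
    failures = []
    for run in runs:
        if run["conclusion"] in FAILURE_CONCLUSIONS:
            failures.append({
                "repo": repo,
                "workflow": run["name"],
                "conclusion": run["conclusion"],
                "url": run["html_url"],
                # YAML syntax errors never notify anyone, so escalate them.
                "priority": "critical" if run["conclusion"] == "startup_failure" else "normal",
            })
    return failures
```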

The Interactive Assistant enables natural language GitHub operations through MCP tools. Developers can ask it to create issues, manage labels, or investigate failures using conversational commands. Supporting workflows handle bug reproduction, failure triage, and routine issue maintenance like closing stale issues or updating labels based on activity patterns.

What makes this architecture powerful is the label-based coordination between workflows. Because GitHub Actions provides no direct workflow-to-workflow communication, we use issue labels as distributed semaphores. Combined with a public/private security model for shared scripts, this enables organisation-wide automation without exposing sensitive context.
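The semaphore pattern itself is small. The sketch below assumes the GitHub REST API via requests and uses the triage-in-progress and triaged labels described later; the helper functions are illustrative.

```python
# Sketch of labels-as-semaphores (illustrative helpers; auth and repo details assumed).
import os
import requests

GITHUB_API = "https://api.github.com"
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
           "Accept": "application/vnd.github+json"}

def add_label(owner: str, repo: str, issue: int, label: str) -> None:
    url = f"{GITHUB_API}/repos/{owner}/{repo}/issues/{issue}/labels"
    requests.post(url, headers=HEADERS, json={"labels": [label]}, timeout=30)

def remove_label(owner: str, repo: str, issue: int, label: str) -> None:
    url = f"{GITHUB_API}/repos/{owner}/{repo}/issues/{issue}/labels/{label}"
    requests.delete(url, headers=HEADERS, timeout=30)

def run_triage(owner: str, repo: str, issue: int) -> None:
    # Temporary state marker: tells dependent workflows to wait.
    add_label(owner, repo, issue, "triage-in-progress")
    try:
        # ... AI triage runs here (priority, labels, initial response) ...
        # Permanent completion marker: dependent workflows poll for this.
        add_label(owner, repo, issue, "triaged")
    finally:
        remove_label(owner, repo, issue, "triage-in-progress")
```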

The reusable workflow architecture means what previously required a 30-45 minute deployment process to update 50 repositories now takes seconds. Per-repository workflow files dropped from 589 lines to just 42 lines.

Show how we solved the deployment problem
Traditional GitHub Actions require copying workflow files to every repository. With 50 repositories, that meant maintaining 29,450 lines of duplicated code. Every bug fix followed the same painful ritual: make the change, test in a dedicated testing repository, then run a bulk deployment script that updated all 50 repos one by one. The whole process took 30-45 minutes, assuming nothing went wrong.

This friction had real consequences. Developers would batch multiple fixes together to avoid repeated deployments, which made debugging harder when things broke. Small improvements got deferred indefinitely. The 45-minute deployment cycle actively discouraged experimentation and refinement.

The solution was GitHub's native workflow_call trigger, which allows one workflow to invoke another. We moved all logic into reusable workflows in the meta-repository, leaving each caller repo with just a 42-line file that maps context variables. When we push to the meta-repository, every caller workflow automatically uses the updated code via the @master reference. Deployment went from 45 minutes to seconds.

Want to automate your GitHub workflows?

We can help you build intelligent automation that scales.

Technical Implementation

The platform was built using Python for the core automation logic, GitHub Actions for orchestration, and Anthropic Claude for intelligent analysis. The key technical challenge was coordination without direct workflow-to-workflow communication.

Our solution uses GitHub issue labels as distributed semaphores. The triage-in-progress label acts as a temporary state marker, while triaged serves as a permanent completion marker. Dependent workflows poll for label changes with exponential backoff (5s to 30s), timing out gracefully after 5 minutes.

Show why we chose 1.5x backoff instead of 2x
Standard exponential backoff uses a 2x multiplier: 5 seconds, then 10, then 20, then 40. This progression reaches the 30-second cap in only 4 iterations. The problem is that triage typically completes in 2-3 minutes, so we want more frequent polling in that critical early window rather than quickly jumping to 30-second intervals.

We chose a 1.5x multiplier instead: 5 seconds, 7.5, 11.25, 16.88, 25.31, then finally capping at 30. This takes 6 iterations to reach the cap, giving us 50% more polling attempts in the 0-3 minute window where completion is most likely. The trade-off is slightly more API calls in the rare cases where triage takes longer than expected, but GitHub's rate limits are generous enough that this isn't a concern.

The 5-minute timeout with graceful degradation ensures we don't block indefinitely. If triage genuinely fails or hangs, the bug reproduction workflow proceeds anyway, creating an issue without the enriched triage metadata. In practice, timeouts occur in less than 2% of runs, and most of those are label race conditions where triage actually completed but the polling window just missed it.
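Putting the numbers above together, the polling loop is only a few lines. This sketch assumes the GitHub REST API for the label check and is illustrative rather than the production implementation.

```python
# Sketch of the 1.5x backoff poll for the "triaged" completion label.
import os
import time
import requests

GITHUB_API = "https://api.github.com"
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
           "Accept": "application/vnd.github+json"}

def issue_labels(owner: str, repo: str, issue: int) -> set[str]:
    url = f"{GITHUB_API}/repos/{owner}/{repo}/issues/{issue}/labels"
    return {label["name"] for label in requests.get(url, headers=HEADERS, timeout=30).json()}

def wait_for_triage(owner: str, repo: str, issue: int, timeout: float = 300.0) -> bool:
    """Poll until the 'triaged' label appears. Returns False on timeout (graceful degradation)."""
    delay, deadline = 5.0, time.monotonic() + timeout
    while time.monotonic() < deadline:
        if "triaged" in issue_labels(owner, repo, issue):
            return True
        time.sleep(min(delay, deadline - time.monotonic()))
        delay = min(delay * 1.5, 30.0)   # 5, 7.5, 11.25, ... capped at 30 seconds
    return False  # caller proceeds without enriched triage metadata
```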

The 1.5x backoff multiplier demonstrates the kind of careful optimisation required for distributed coordination at scale. When you can't rely on direct communication channels, polling strategies become critical to both responsiveness and efficiency. The next section shows how we implemented the label-based coordination pattern that makes this polling effective.

Label-based coordination solves the technical challenge of workflow orchestration, but it introduces a different problem: how do you share Python scripts across 50 repositories when reusable workflows run in the caller repository's context? This security challenge required a counterintuitive solution that initially seemed reckless but proved remarkably effective.

Show how we solved the security challenge
Reusable workflows run in the caller repository's context, not the meta-repository where we store our shared scripts. This creates an uncomfortable problem: every workflow needs access to Python scripts for triage logic, API helpers, and prompt templates, but caller repositories don't have these files and can't access private repositories without credentials.

We evaluated three options, and all had serious drawbacks. Deploy keys would mean managing 50 separate SSH keys, where a single compromise would grant access to all repositories. An organisation-wide Personal Access Token would need to be stored in 50 different places, and any leak would grant organisation-wide access. Git submodules looked promising until we discovered that GitHub Actions cannot checkout submodules during workflow execution.

The counterintuitive solution was to make the scripts repository public. We stripped every piece of identifying information: no organisation name anywhere, no architecture documentation, no revealing comments. The README says only "Shared script resources". All configuration happens via environment variables passed by the caller workflow, so the scripts themselves contain zero hardcoded values.

For additional security, we pin all script references to specific commit hashes rather than branch names, preventing supply chain attacks where a compromised script could be pushed and immediately executed across 50 repositories. Every new script undergoes security audit before addition. After months of operation, total public exposure is 8 generic Python files with zero organisational information. Sometimes the boldest solution is the simplest one.
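As an illustration of the pinning principle only: the sketch below fetches a shared script at an exact commit hash. The repository name and hash are placeholders (the real repository is deliberately unnamed), and the workflows may reference the scripts differently, but the idea is the same.

```python
# Illustrative only: referencing shared scripts by commit hash, never by branch.
import urllib.request

SCRIPTS_REPO = "example-org/shared-scripts"                   # placeholder name
PINNED_COMMIT = "0123456789abcdef0123456789abcdef01234567"    # full SHA, not "master"

def fetch_pinned_script(path: str) -> bytes:
    # A raw file served at an exact commit cannot change underneath us,
    # so a later (possibly malicious) push to the branch never reaches callers.
    url = f"https://raw.githubusercontent.com/{SCRIPTS_REPO}/{PINNED_COMMIT}/{path}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()
```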

The public scripts pattern demonstrates that security through obscurity can work when properly implemented: zero organisational information, all configuration via environment variables, and commit hash pinning for supply chain protection. With the security model established, the actual workflow invocation pattern becomes straightforward.

The reusable workflow pattern reduced per-repository code from 589 lines to 42 lines, but the real power emerges when combined with AI-powered analysis. The challenge with AI integration wasn't the model itself but how we structured the prompts and managed context. Early experiments with incremental API calls revealed a fundamental inefficiency that shaped our entire approach.

The platform uses Anthropic Claude for intelligent analysis. We discovered early that incremental API calls during execution waste tokens. The solution: pre-fetch all context upfront, giving Claude a detailed manifest rather than making round trips during analysis.

The two-phase triage prompt structure provides guidance, but the execution strategy determines performance. When we measured the actual impact of our incremental API approach, the numbers revealed a significant opportunity for optimisation. The token count told one story, but execution time told another, and both pointed to the same conclusion: pre-fetching everything upfront would dramatically improve efficiency.

Show the token optimisation strategy
Our early implementation treated Claude like a conversational chatbot: give it the issue title and body, let it ask for comments, let it ask for images, let it request more context. For a typical issue with 5 comments and 2 images, this meant 7+ API calls during execution, with each call requiring Claude to decide what to fetch next, then process our response, then decide again. The back-and-forth was killing both speed and token efficiency.

We measured the impact carefully. The incremental approach consumed roughly 21,700 tokens per triage: 1,200 for the initial prompt, 3,500 in API orchestration overhead as Claude reasoned about what to fetch, 14,000 across the accumulated responses, and 3,000 for the final output. Each triage took 5-8 minutes to complete, with most of that time spent on network round trips and Claude deciding what information it needed next.

The solution was to flip the model entirely. Instead of letting Claude drive data fetching, we pre-fetch everything in parallel using Python's ThreadPoolExecutor: issue body, all comments, all reactions, all attachments, repository context, related issues. We package this into a detailed manifest file that Claude reads once.
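A condensed sketch of the pre-fetch pattern, assuming GitHub REST calls via requests; the manifest schema and field names here are illustrative.

```python
# Sketch of the parallel pre-fetch + manifest pattern (field names illustrative).
import json
import os
from concurrent.futures import ThreadPoolExecutor

import requests

GITHUB_API = "https://api.github.com"
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
           "Accept": "application/vnd.github+json"}

def get(path: str, **params):
    return requests.get(f"{GITHUB_API}{path}", headers=HEADERS,
                        params=params, timeout=30).json()

def build_manifest(owner: str, repo: str, issue: int) -> str:
    base = f"/repos/{owner}/{repo}/issues/{issue}"
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Fetch everything Claude could need in parallel, before the single AI call.
        futures = {
            "issue": pool.submit(get, base),
            "comments": pool.submit(get, f"{base}/comments", per_page=100),
            "reactions": pool.submit(get, f"{base}/reactions", per_page=100),
            "repository": pool.submit(get, f"/repos/{owner}/{repo}"),
        }
        manifest = {name: future.result() for name, future in futures.items()}

    path = "triage_manifest.json"
    with open(path, "w") as fh:
        json.dump(manifest, fh, indent=2)
    return path   # Claude reads this file once instead of asking for pieces
```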

The results were dramatic. Token usage dropped to roughly 14,500 per triage: 6,500 for the manifest, 1,500 for the prompt template, 3,500 for Claude's response, and 3,000 for output. That's a 33% reduction in tokens and, more importantly, execution time dropped from 5-8 minutes to 2-3 minutes. The pattern became standard across all AI-powered workflows in the platform.

Measurable Results

The platform has been running in production across 50+ repositories, delivering measurable improvements across multiple dimensions. The numbers tell the story: we achieved 325 hours of annual savings by eliminating manual triage entirely. What previously required a 30-45 minute deployment process now propagates instantly.

Automated monitoring catches every workflow failure. The 6-hour scan cycle means issues are identified and reported before developers start their next working day. Weekend failure backlogs have been completely eliminated, with real-time monitoring and alerting keeping the team informed.

Show why startup failures are always critical
GitHub Actions has seven failure conclusion types, and most of them trigger email notifications: regular failures, timeouts, cancellations, action_required, neutral, and stale all show up in your inbox. But there's one type that sends no notification at all: startup_failure.

A startup failure means the workflow YAML itself has a syntax error. The workflow never runs. Not even a single step executes. GitHub doesn't consider this a workflow failure in the traditional sense, so it doesn't send notifications. In the UI, it just looks like the workflow hasn't been triggered recently. You have to actively check for runs and notice the "startup_failure" conclusion to discover the problem.

These are typically caused by YAML indentation errors (easy to make, hard to spot), invalid syntax in expressions, or typos in action references. The insidious part is that everything else continues working. The repository appears healthy. Other workflows run fine. You only discover the broken workflow when you specifically need it, which might be days or weeks later.

Our monitoring explicitly checks for startup_failure conclusions and always marks them as critical priority, regardless of the workflow type. Within 90 days of deployment, we caught 4 critical startup failures that would have gone unnoticed indefinitely. Before this monitoring existed, these silent failures accounted for an estimated 20-30% of all workflow problems.

AI-generated issue reports include root cause analysis, proposed fixes, and pattern recognition. This comprehensive approach has driven the issue closure rate to 90%, with issues resolved in days rather than weeks. The quality of automated triage consistently matches or exceeds manual assessment.

Show automated vs manual issue quality
Automated issues consistently outperform their manual counterparts in both comprehensiveness and actionability. A typical automated issue runs 200-300 lines and reads like a genuine engineering report. Every single one includes root cause analysis, because the AI has already dug through logs, examined the workflow context, and identified what actually went wrong. This alone represents a fundamental shift from the "this is broken" reporting style that characterised most manual issues.

The real value emerges in the follow-through sections. Roughly 85% of automated issues include proposed fixes, complete with code snippets or configuration changes. About 60% identify patterns, recognising when a failure matches previous incidents or relates to known issues elsewhere in the codebase. And 70% link directly to related issues, creating a web of context that makes debugging dramatically faster.

Contrast this with manual issues before automation. They averaged 50-100 lines and often lacked the context needed for efficient resolution. Only 40% included any root cause analysis, and even then it was usually "I think the problem is..." speculation rather than evidence-based diagnosis. Proposed fixes appeared in just 20% of issues. Pattern recognition was practically non-existent at 10%, because developers filing issues rarely had time to search for similar past problems.

The net effect is that automated issues are 2-3x more comprehensive than their manual predecessors. But the quality difference goes beyond word count. Developers reading automated issues can often start fixing immediately, whereas manual issues frequently required a preliminary investigation phase just to understand what was actually wrong. Time to resolution dropped accordingly.
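For illustration, a minimal sketch of how a report with those sections might be assembled and filed through the GitHub API; the title format and label name are hypothetical.

```python
# Sketch: assembling an automated failure report and filing it (illustrative).
import os
import requests

GITHUB_API = "https://api.github.com"
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
           "Accept": "application/vnd.github+json"}

def file_failure_report(owner: str, repo: str, workflow: str, run_url: str,
                        root_cause: str, proposed_fix: str,
                        related: list[str]) -> dict:
    # Section headings mirror the report structure described above.
    body = "\n\n".join([
        f"## Root cause analysis\n{root_cause}",
        f"## Proposed fix\n{proposed_fix}",
        "## Related issues\n" + "\n".join(f"- {ref}" for ref in related),
        f"## Failing run\n{run_url}",
    ])
    resp = requests.post(
        f"{GITHUB_API}/repos/{owner}/{repo}/issues",
        headers=HEADERS,
        json={"title": f"[auto] Workflow failure: {workflow}",
              "body": body,
              "labels": ["automated-failure-report"]},   # hypothetical label
        timeout=30,
    )
    return resp.json()
```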

Quality improvements across the board: race condition failures dropped from 10% to zero through concurrency groups. Duplicate triage comments are completely prevented by label-based coordination. Unnoticed failures went from 20-30% to zero with complete detection via scheduled scans.

Metrics collection provides the raw data, but understanding the full impact requires translating those numbers into business value. The time savings are substantial and measurable, but they represent only part of the story. When you factor in deployment efficiency improvements and code maintenance reduction, the compound effect becomes clear.

Show the ROI calculation
The largest time savings came from eliminating manual triage entirely. Before automation, developers spent an average of 22.5 minutes on each workflow failure: clicking through logs, copying error messages, creating issues, and assigning labels. With an average of 2.5 failures per day across the organisation, that added up to 56.25 minutes of daily triage work, or roughly 342 hours annually. The automation handles all of this without human intervention, freeing that time completely.

Deployment efficiency delivered another substantial chunk of savings. The old process for pushing bug fixes to all 50 repositories took 60-90 minutes: make the change, test in a dedicated testing repository, run the bulk deployment script, verify it worked. With reusable workflows, the same process takes about 6 minutes: edit, push, done. At roughly 2 bug fixes per month, that translates to 21-33 hours of annual savings on deployment alone.

The code maintenance improvements are harder to quantify but equally real. With 93% less code per repository, there are 93% fewer places for bugs to hide. Organisation-wide duplication dropped from 25,916 lines to just 1,848 lines. When a bug does appear, it exists in one place rather than fifty, making both diagnosis and fixing dramatically faster. Developers no longer waste time wondering "is this the same bug I saw in another repo?" because the answer is always no, there's only one copy of the code.

Combined, the quantifiable savings reach 363-375 hours annually. At a conservative £50/hour rate, that's over £18,000 in direct value from time savings alone. But this number understates the true impact, because it doesn't capture the reduced error rate from centralised code, the improved code quality from having time to refine prompts, or the developer satisfaction of working on interesting problems instead of repetitive triage.
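For completeness, the same arithmetic in one place, using the figures quoted above (taking midpoints where a range was given):

```python
# Recomputing the combined savings from the figures above.
triage_hours = 2.5 * 22.5 * 365 / 60                 # ~342 hours/year of manual triage

old_deploy, new_deploy = (60 + 90) / 2, 6            # minutes per fix (midpoint of 60-90)
fixes_per_year = 2 * 12
deploy_hours = (old_deploy - new_deploy) * fixes_per_year / 60   # ~28 hours/year

total_hours = triage_hours + deploy_hours            # ~370 hours/year, within 363-375
value_gbp = total_hours * 50                         # over £18,000 at £50/hour

print(f"{triage_hours:.0f} + {deploy_hours:.0f} ≈ {total_hours:.0f} hours ≈ £{value_gbp:,.0f}")
```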

Ready to transform your development operations?

Let's discuss how AI-powered automation can multiply your team's productivity.

Key Learnings

This project revealed several insights that apply broadly to AI-powered development and DevOps automation projects. The most profound: before reusable workflows, the 45-minute deployment cycle inhibited experimentation. After switching to centralised logic, 5-minute edit-test cycles enabled over 100 refinement commits. Speed enables innovation.

Show the feedback loop effect
Deployment friction shapes developer behaviour in subtle but powerful ways. When pushing a change takes 45 minutes of babysitting a deployment script, developers naturally batch their changes. They test less frequently. They become risk-averse, adopting the mindset of "better not touch that, it's working" even when something could clearly be improved. The deployment overhead creates invisible barriers to iteration and refinement.

We saw this phenomenon reverse dramatically when deployment time dropped to 5 minutes. Suddenly, trying a small improvement carried almost no cost. If it worked, great. If not, revert and try something else. The team made over 100 refinement commits in the first months of the new architecture, each one a small incremental improvement that would never have happened under the old system.

The prompt engineering work illustrates this perfectly. Getting AI triage to produce consistently useful output required dozens of iterations. Adjust the system prompt, test against a few real issues, examine the results, refine further. Under the old deployment model, each iteration would have taken 45+ minutes. Most of these refinements would simply never have been attempted. With fast deployment, the team could try five variations in an afternoon, keeping the best and discarding the rest.

The lesson generalises far beyond this project: deployment speed is a multiplier on everything else. Fast deployment enables experimentation. Slow deployment ossifies systems into "good enough" states that never improve. If you want continuous refinement, the first thing to optimise is the feedback loop itself.

Distributed systems need observable state for debugging. We discovered that GitHub labels are perfect for this: state is always visible in the UI, survives workflow failures, and provides a complete audit trail. When something goes wrong, the label history shows exactly what happened.

Different workloads need different concurrency strategies. Triage workflows should queue because they're cheap (2-3 minutes) and we want all states processed. Bug reproduction workflows should cancel old runs because they're expensive (up to 30 minutes) and the latest commit supersedes previous ones. One-size-fits-all concurrency is always suboptimal.

Show queue vs cancel: when to use each
When multiple workflow runs trigger for the same context, GitHub Actions gives you a choice via the cancel-in-progress setting in your concurrency group. The default behaviour is to queue new runs while previous ones complete, but you can configure it to cancel running workflows when a new one starts. Neither option is universally correct, and choosing the wrong one causes real problems.

Queuing works best for cheap, idempotent workflows where each trigger represents meaningful state change. Issue triage is our clearest example: if a user rapidly edits an issue three times, each edit potentially changes the context that triage needs to consider. We want all three runs to eventually complete, each one seeing a different snapshot of the issue. At 2-3 minutes per run, the queue clears quickly. Queuing ensures we never miss a state transition, which matters when the workflow's job is to observe and categorise changes.

Cancellation works better for expensive workflows where the latest trigger supersedes all previous work. Bug reproduction is our clearest example: if a developer pushes three commits in quick succession, we only care about reproducing bugs on the latest code. Running reproduction against the first two commits wastes compute time (up to 30 minutes each) and can produce misleading results if those commits contained bugs that the third commit fixed. Cancel-in-progress tells GitHub to stop the older runs immediately when a new one arrives.

The key question for any workflow is: "Does the latest trigger make previous runs obsolete?" If yes, cancel. If each trigger represents independently valuable work, queue. Getting this wrong causes either wasted compute (queuing when you should cancel) or missed state transitions (cancelling when you should queue). We learned to make this an explicit architectural decision for every workflow, documented in the workflow file comments.

Concurrency strategy reveals one dimension of workflow optimisation, but AI integration represents another entirely. The choice between queuing and cancelling affects compute efficiency, but the way we structure AI interactions affects both performance and result quality. Our approach evolved dramatically as we learned how Claude processes information most effectively.

Show the AI context lesson
Our early implementations treated Claude like a conversational partner. Give it the issue title, let it ask for comments, let it request images, let it decide what else it needs. This felt natural because it mirrors how humans work, but it produced dramatically inferior results. Each API round trip introduced latency, and Claude's analysis became fragmented as it tried to integrate new information with what it already knew from previous turns.

The breakthrough came from treating Claude more like a consultant receiving a complete brief. Before invoking the AI, we pre-fetch everything it could possibly need: the issue body, all comments with timestamps, all reactions, all referenced attachments, repository context, similar issues from the past week, and documentation relevant to the affected components. This gets packaged into a detailed manifest that Claude reads once, in its entirety.

The results justified the engineering effort. Token usage dropped by 33% because we eliminated the overhead of Claude reasoning about what to fetch next. Execution time improved by 50-60% because we replaced multiple sequential API calls with one large parallel data fetch followed by one AI call. Analysis quality improved because Claude could consider all the context simultaneously rather than trying to remember earlier information while processing new data.

The mental model shift matters: AI works best when given complete context upfront, not when queried incrementally. Invest in thorough context assembly before the AI prompt, even if it feels like over-engineering. The alternative of incremental queries wastes tokens on orchestration overhead and produces less coherent results. Every AI-powered workflow in our platform now follows this pattern.

Perhaps the most liberating insight: 65% of this system cannot be meaningfully unit tested. AI responses are non-deterministic, webhook behaviour depends on GitHub infrastructure, and cross-repo coordination requires real repositories. Acknowledging this led to a pragmatic five-tier testing strategy rather than chasing impossible coverage metrics.

Show why we abandoned full unit testing
A pull request attempted full unit tests for workflow coordination. The result: tests became "mocks of mocks" that tested our assumptions, not actual behaviour. Tests passed while workflows failed in production.

Five categories proved untestable: the GitHub Actions runtime (webhook delivery timing cannot be mocked), Claude AI integration (responses cannot be predicted), cross-repository coordination (real webhook ordering cannot be simulated), authentication flows (OAuth validation cannot be meaningfully tested), and timing-dependent race conditions (real concurrency cannot be reproduced).

The pragmatic solution was a five-tier strategy: invest heavily in the testable 35% (syntax validation, static analysis, unit tests for pure logic) and accept that the remaining 65% requires a dedicated testing repository for realistic end-to-end validation. The documentation states it plainly: "E2E manual testing is ESSENTIAL and IRREPLACEABLE."

The AI context pattern generalises well beyond this project, but security challenges remain uniquely tied to GitHub Actions architecture. The shared scripts problem seemed intractable at first, with every conventional solution creating unacceptable security risks. The breakthrough came from questioning our assumptions about what "public" actually means when code is properly sanitised.

Show the security lesson: minimalism works
The shared scripts problem seemed to have no good solution. Reusable workflows run in the caller repository's context, not the meta-repository where our Python scripts live. Every workflow needs access to these scripts for triage logic, API helpers, and prompt templates. But how do you share code across 50 repositories without creating a security nightmare?

We evaluated three conventional approaches, and each had serious problems. Deploy keys would mean creating and managing 50 separate SSH keys, any one of which could be compromised through a supply chain attack. An organisation-wide Personal Access Token would need to be stored in 50 different repository secrets, and a single leak would grant access to everything. Git submodules looked promising until we discovered that GitHub Actions cannot checkout submodules during workflow execution, making them unusable for this purpose.

The counterintuitive solution was to make the scripts repository public. This sounds reckless until you think through what "public" actually means when the code is properly sanitised. We stripped every piece of identifying information: no organisation name anywhere in the code, no architecture documentation, no revealing comments. The README says nothing more than "Shared script resources". All configuration happens via environment variables passed by caller workflows, so the scripts themselves contain zero hardcoded values or secrets.

The total public exposure after months of operation is 8 generic Python files with zero organisational information. Someone discovering this repository learns nothing useful about the systems using it. For additional protection, we pin all script references to specific commit hashes rather than branch names, preventing supply chain attacks where a compromised script could be pushed and immediately executed across all repositories. Zero security incidents in more than two months of production use. Sometimes the boldest solution really is the simplest one.
