GitHub Automation at Scale
How we built a platform that automates repository maintenance, issue triage, and failure detection across 50+ repositories, saving 325 hours annually while achieving zero-touch operations.
The Challenge
Managing 50+ repositories across an organisation creates significant operational overhead. Issue triage, pull request reviews, and workflow failure detection consume valuable developer time that could be spent on feature development.
The engineering team was spending an estimated 75 minutes per day on manual triage alone. Workflow failures went unnoticed over weekends. There was no direct communication between GitHub Actions workflows. Bug fixes required a 30-45 minute deployment process across all repositories.
Worse, workflows that hit startup_failure (YAML syntax errors) simply showed as "no run" rather than as a failure, so nobody was notified.

The organisation needed a platform that could handle routine maintenance automatically, freeing developers to focus on high-value work while ensuring nothing falls through the cracks. This is exactly the kind of AI-powered automation challenge we specialise in.
The Solution
We designed and built a sophisticated meta-repository that orchestrates AI-powered GitHub Actions automation across the entire organisation. The platform implements 4 major workflows and 7 reusable workflows using an architecture that achieves 93% code reduction per repository.
The system delivers intelligent issue triage with two-phase AI analysis, automated pull request review with code quality assessment, failure monitoring that scans all 50 repositories every 6 hours, and AI-powered failure analysis with root cause identification.
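As an illustration of what that scheduled scan does, here is a minimal Python sketch against the GitHub REST API using requests; the function name, token handling, and the set of conclusions treated as failures are our assumptions rather than the platform's actual code.

```python
import os
from datetime import datetime, timedelta, timezone

import requests

GITHUB_API = "https://api.github.com"
HEADERS = {
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}

def scan_repository(owner: str, repo: str, window_hours: int = 6) -> list[dict]:
    """Return workflow runs in `repo` that ended badly within the scan window."""
    since = (datetime.now(timezone.utc) - timedelta(hours=window_hours)).strftime("%Y-%m-%dT%H:%M:%SZ")
    resp = requests.get(
        f"{GITHUB_API}/repos/{owner}/{repo}/actions/runs",
        headers=HEADERS,
        params={"created": f">={since}", "per_page": 100},
        timeout=30,
    )
    resp.raise_for_status()
    bad_conclusions = {"failure", "timed_out", "startup_failure"}
    return [
        {"repo": repo, "workflow": run["name"], "conclusion": run["conclusion"], "url": run["html_url"]}
        for run in resp.json()["workflow_runs"]
        if run["conclusion"] in bad_conclusions
    ]
```

A thin wrapper would loop this over the organisation's repositories and open or update a tracking issue for anything it returns.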
The failure monitor pays particular attention to startup_failure, which indicates YAML syntax errors that cause workflows to silently break without any notification. These were previously the failures that went unnoticed for days or weeks.

The Interactive Assistant enables natural language GitHub operations through MCP tools. Developers can ask it to create issues, manage labels, or investigate failures using conversational commands. Supporting workflows handle bug reproduction, failure triage, and routine issue maintenance like closing stale issues or updating labels based on activity patterns.

What makes this architecture powerful is the label-based coordination between workflows. Without direct workflow-to-workflow communication in GitHub Actions, we use issue labels as distributed semaphores. Combined with a public/private security model for shared scripts, this enables organisation-wide automation without exposing sensitive context.
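To make the semaphore idea concrete, here is an illustrative sketch of the acquire/release pattern against the GitHub issues API, assuming a pre-authenticated requests session; the helper names are ours, while the label names are the ones the triage workflow uses (described under Technical Implementation).

```python
import requests

def acquire_triage_lock(session: requests.Session, repo: str, issue: int) -> None:
    # Temporary state marker: tells other workflows that triage has started.
    session.post(
        f"https://api.github.com/repos/{repo}/issues/{issue}/labels",
        json={"labels": ["triage-in-progress"]},
    ).raise_for_status()

def release_triage_lock(session: requests.Session, repo: str, issue: int) -> None:
    # Swap the temporary marker for the permanent completion marker.
    session.delete(
        f"https://api.github.com/repos/{repo}/issues/{issue}/labels/triage-in-progress"
    )
    session.post(
        f"https://api.github.com/repos/{repo}/issues/{issue}/labels",
        json={"labels": ["triaged"]},
    ).raise_for_status()
```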
The reusable workflow architecture means what previously required a 30-45 minute deployment process to update 50 repositories now takes seconds. Per-repository workflow files dropped from 589 lines to just 42 lines.
The fix is GitHub Actions' workflow_call trigger, which allows one workflow to invoke another. We moved all logic into reusable workflows in the meta-repository, leaving each caller repo with just a 42-line file that maps context variables. When we push to the meta-repository, every caller workflow automatically uses the updated code via the @master reference. Deployment went from 45 minutes to seconds.

Want to automate your GitHub workflows?
We can help you build intelligent automation that scales.
Technical Implementation
The platform was built using Python for the core automation logic, GitHub Actions for orchestration, and Anthropic Claude for intelligent analysis. The key technical challenge was coordination without direct workflow-to-workflow communication.
Our solution uses GitHub issue labels as distributed semaphores. The triage-in-progress label acts as a temporary state marker, while triaged serves as a permanent completion marker. Dependent workflows poll for label changes with exponential backoff (5s to 30s), timing out gracefully after 5 minutes.
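A minimal sketch of that polling loop, assuming a get_labels callable that returns the issue's current label names (any GitHub client will do):

```python
import time

def wait_for_label(get_labels, target: str = "triaged",
                   initial: float = 5.0, ceiling: float = 30.0,
                   multiplier: float = 1.5, timeout: float = 300.0) -> bool:
    """Poll until `target` appears on the issue, backing off from 5s towards 30s.

    Returns False once the 5-minute budget is exhausted, so the caller can
    degrade gracefully instead of failing the whole run.
    """
    deadline = time.monotonic() + timeout
    delay = initial
    while time.monotonic() < deadline:
        if target in get_labels():
            return True
        time.sleep(max(0.0, min(delay, deadline - time.monotonic())))
        delay = min(delay * multiplier, ceiling)
    return False
```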
The 1.5x backoff multiplier demonstrates the kind of careful optimisation required for distributed coordination at scale. When you can't rely on direct communication channels, polling strategy becomes critical to both responsiveness and efficiency; the short sketch below shows the trade-off between the two multipliers inside the 5-minute budget.
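This is a reconstruction of the comparison rather than the production code, assuming the 5-second floor and 30-second ceiling described above:

```python
def poll_schedule(multiplier: float, initial: float = 5.0,
                  ceiling: float = 30.0, budget: float = 300.0) -> list[float]:
    """Sleep intervals a poller gets through before the 5-minute budget runs out."""
    waits, delay, elapsed = [], initial, 0.0
    while elapsed + delay <= budget:
        waits.append(delay)
        elapsed += delay
        delay = min(delay * multiplier, ceiling)
    return waits

# 1.5x climbs gently (5, 7.5, 11.25, ~16.9, ~25.3, 30, ...), so the gaps between
# checks stay small for longer and the waiting workflow notices the label sooner,
# at the cost of barely more API calls than 2x (5, 10, 20, 30, ...).
print(len(poll_schedule(1.5)), len(poll_schedule(2.0)))   # 12 vs 11 polls
```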
Label-based coordination solves the technical challenge of workflow orchestration, but it introduces a different problem: how do you share Python scripts across 50 repositories when reusable workflows run in the caller repository's context? This security challenge required a counterintuitive solution that initially seemed reckless but proved remarkably effective.
The public scripts pattern demonstrates that publishing code can be safe when the code carries nothing worth protecting: zero organisational information, all configuration via environment variables, and commit hash pinning for supply chain protection. With the security model established, the actual workflow invocation pattern becomes straightforward.
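To make that concrete: a sketch of what such a sanitised shared script looks like, with every repo-specific value supplied by the caller's environment (the variable names are illustrative).

```python
import os
import sys

def require(name: str) -> str:
    """Fail fast with a generic message; never bake in defaults that leak context."""
    value = os.environ.get(name)
    if not value:
        sys.exit(f"missing required environment variable: {name}")
    return value

# Everything repo- or org-specific arrives via the environment the caller
# workflow sets; nothing identifying is committed to the public script.
REPOSITORY = require("TARGET_REPOSITORY")      # e.g. "owner/name"
ISSUE_NUMBER = int(require("ISSUE_NUMBER"))
GITHUB_TOKEN = require("GITHUB_TOKEN")         # short-lived token minted by the workflow
```

Commit hash pinning then happens on the caller side: each repository references the shared script at a specific SHA, so a later compromised commit cannot silently change behaviour.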
The reusable workflow pattern reduced per-repository code from 589 lines to 42 lines, but the real power emerges when combined with AI-powered analysis. The challenge with AI integration wasn't the model itself but how we structured the prompts and managed context. Early experiments with incremental API calls revealed a fundamental inefficiency that shaped our entire approach.
The platform uses Anthropic Claude for intelligent analysis. We discovered early that incremental API calls during execution waste tokens. The solution: pre-fetch all context upfront, giving Claude a detailed manifest rather than making round trips during analysis.
The two-phase triage prompt structure provides guidance, but the execution strategy determines performance. When we measured the actual impact of our incremental API approach, the numbers revealed a significant opportunity for optimisation. The token count told one story, but execution time told another, and both pointed to the same conclusion: pre-fetching everything upfront would dramatically improve efficiency.
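The sketch below illustrates the pre-fetch idea with the Anthropic Python SDK: gather the context once, make one call. The manifest fields, prompt wording, and model configuration are illustrative assumptions, not the production prompt or its two-phase structure.

```python
import json
import os

import anthropic

def build_manifest(issue: dict, comments: list[str], recent_failures: list[dict]) -> str:
    """Everything the analysis might need, fetched once before the API call."""
    return json.dumps({
        "issue": {"title": issue["title"], "body": issue["body"],
                  "labels": [label["name"] for label in issue["labels"]]},
        "comments": comments,
        "recent_workflow_failures": recent_failures,
    }, indent=2)

def triage(issue: dict, comments: list[str], recent_failures: list[dict]) -> str:
    client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model=os.environ.get("CLAUDE_MODEL", "claude-sonnet-4-20250514"),
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": "Triage this GitHub issue using the context manifest below. "
                       "Respond with a category, priority, and a short rationale.\n\n"
                       + build_manifest(issue, comments, recent_failures),
        }],
    )
    return response.content[0].text
```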
Measurable Results
The platform has been running in production across 50+ repositories, delivering measurable improvements across multiple dimensions. The numbers tell the story: we achieved 325 hours of annual savings by eliminating manual triage entirely. What previously required a 30-45 minute deployment process now propagates instantly.
Automated monitoring catches every workflow failure. The 6-hour scan cycle means issues are identified and reported before developers start their next working day. Weekend failure backlogs have been completely eliminated, with real-time monitoring and alerting keeping the team informed.
A startup failure means the workflow YAML itself has a syntax error. The workflow never runs; not even a single step executes. GitHub doesn't consider this a workflow failure in the traditional sense, so it doesn't send notifications. In the UI, it just looks like the workflow hasn't been triggered recently. You have to actively check for runs and notice the startup_failure conclusion to discover the problem.

These are typically caused by YAML indentation errors (easy to make, hard to spot), invalid syntax in expressions, or typos in action references. The insidious part is that everything else continues working. The repository appears healthy. Other workflows run fine. You only discover the broken workflow when you specifically need it, which might be days or weeks later.

Our monitoring explicitly checks for startup_failure conclusions and always marks them as critical priority, regardless of the workflow type; a sketch of this rule appears below. Within 90 days of deployment, we caught 4 critical startup failures that would have gone unnoticed indefinitely. Before this monitoring existed, these silent failures accounted for an estimated 20-30% of all workflow problems.

AI-generated issue reports include root cause analysis, proposed fixes, and pattern recognition. This comprehensive approach has driven the issue closure rate to 90%, with issues resolved in days rather than weeks. The quality of automated triage consistently matches or exceeds manual assessment.
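A sketch of the always-critical rule described above; the priority names and the workflow shortlist are illustrative.

```python
def classify_priority(run: dict) -> str:
    """Map a workflow run's conclusion to a triage priority."""
    if run["conclusion"] == "startup_failure":
        # Broken YAML: the workflow never executed and GitHub sent no notification.
        return "critical"
    if run["conclusion"] in {"failure", "timed_out"}:
        # Otherwise priority follows how central the workflow is to delivery.
        return "high" if run["name"] in {"ci", "release"} else "normal"
    return "none"
```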
Quality improvements across the board: race condition failures dropped from 10% to zero through concurrency groups. Duplicate triage comments are completely prevented by label-based coordination. Unnoticed failures went from 20-30% to zero with complete detection via scheduled scans.
Metrics collection provides the raw data, but understanding the full impact requires translating those numbers into business value. The time savings are substantial and measurable, but they represent only part of the story. When you factor in deployment efficiency improvements and code maintenance reduction, the compound effect becomes clear.
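As a back-of-the-envelope check on the headline figure, here is the arithmetic, assuming roughly 260 working days per year (the one number not stated on this page):

```python
MINUTES_SAVED_PER_DAY = 75        # manual triage eliminated (see The Challenge)
WORKING_DAYS_PER_YEAR = 260       # assumption: five-day weeks, year-round
annual_hours_saved = MINUTES_SAVED_PER_DAY * WORKING_DAYS_PER_YEAR / 60
print(annual_hours_saved)         # 325.0 -- the figure quoted above
```

Deployment savings come on top of that: roughly 30-45 minutes per rollout, multiplied by every fix shipped across the 50 repositories.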
Ready to transform your development operations?
Let's discuss how AI-powered automation can multiply your team's productivity.
Key Learnings
This project revealed several insights that apply broadly to AI-powered development and DevOps automation projects. The most profound: before reusable workflows, the 45-minute deployment cycle inhibited experimentation. After switching to centralised logic, 5-minute edit-test cycles enabled over 100 refinement commits. Speed enables innovation.
Distributed systems need observable state for debugging. We discovered that GitHub labels are perfect for this: state is always visible in the UI, survives workflow failures, and provides a complete audit trail. When something goes wrong, the label history shows exactly what happened.
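Reconstructing that audit trail is a single call to the issue events API; a sketch, assuming a pre-authenticated requests session:

```python
import requests

def label_history(session: requests.Session, repo: str, issue: int) -> list[tuple[str, str, str]]:
    """Chronological (timestamp, event, label) trail for one issue."""
    events = session.get(
        f"https://api.github.com/repos/{repo}/issues/{issue}/events",
        params={"per_page": 100},
    ).json()
    return [
        (e["created_at"], e["event"], e["label"]["name"])
        for e in events
        if e["event"] in {"labeled", "unlabeled"}
    ]
```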
Different workloads need different concurrency strategies. Triage workflows should queue because they're cheap (2-3 minutes) and we want all states processed. Bug reproduction workflows should cancel old runs because they're expensive (up to 30 minutes) and the latest commit supersedes previous ones. One-size-fits-all concurrency is always suboptimal.
GitHub Actions exposes this choice through the cancel-in-progress setting in your concurrency group. The default behaviour is to queue new runs while previous ones complete, but you can configure it to cancel running workflows when a new one starts. Neither option is universally correct, and choosing the wrong one causes real problems.

Queuing works best for cheap, idempotent workflows where each trigger represents a meaningful state change. Issue triage is our clearest example: if a user rapidly edits an issue three times, each edit potentially changes the context that triage needs to consider. We want all three runs to eventually complete, each one seeing a different snapshot of the issue. At 2-3 minutes per run, the queue clears quickly. Queuing ensures we never miss a state transition, which matters when the workflow's job is to observe and categorise changes.

Cancellation works better for expensive workflows where the latest trigger supersedes all previous work. Bug reproduction is our clearest example: if a developer pushes three commits in quick succession, we only care about reproducing bugs on the latest code. Running reproduction against the first two commits wastes compute time (up to 30 minutes each) and can produce misleading results if those commits contained bugs that the third commit fixed. Cancel-in-progress tells GitHub to stop the older runs immediately when a new one arrives.

The key question for any workflow is: "Does the latest trigger make previous runs obsolete?" If yes, cancel. If each trigger represents independently valuable work, queue. Getting this wrong causes either wasted compute (queuing when you should cancel) or missed state transitions (cancelling when you should queue). We learned to make this an explicit architectural decision for every workflow, documented in the workflow file comments.

Concurrency strategy reveals one dimension of workflow optimisation, but AI integration represents another entirely. The choice between queuing and cancelling affects compute efficiency, but the way we structure AI interactions affects both performance and result quality. Our approach evolved dramatically as we learned how Claude processes information most effectively.
The AI context lesson echoes the implementation section: incremental API calls during execution waste tokens, so we pre-fetch everything the analysis might need and hand Claude a single, detailed manifest instead of letting it request data mid-run.
Perhaps the most liberating insight: 65% of this system cannot be meaningfully unit tested. AI responses are non-deterministic, webhook behaviour depends on GitHub infrastructure, and cross-repo coordination requires real repositories. Acknowledging this led to a pragmatic five-tier testing strategy rather than chasing impossible coverage metrics.
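The deterministic slices still get conventional tests; a sketch of that lowest tier, assuming the priority rule sketched earlier lives in a hypothetical triage_rules module:

```python
# Hypothetical module path; the rule itself is the deterministic sketch shown earlier.
from triage_rules import classify_priority

def test_startup_failures_are_always_critical():
    run = {"name": "docs-linkcheck", "conclusion": "startup_failure"}
    assert classify_priority(run) == "critical"

def test_ordinary_failures_follow_workflow_importance():
    assert classify_priority({"name": "ci", "conclusion": "failure"}) == "high"
    assert classify_priority({"name": "nightly-docs", "conclusion": "failure"}) == "normal"
```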
The AI context pattern generalises well beyond this project, but security challenges remain uniquely tied to GitHub Actions architecture. The shared scripts problem seemed intractable at first, with every conventional solution creating unacceptable security risks. The breakthrough came from questioning our assumptions about what "public" actually means when code is properly sanitised.