DigitalOcean (GradientAI) Writing: Scaling Autonomous Site Reliability Engineering: Architecture, Orchestration, and Validation for a 90,000+ Server Fleet

digitalocean.com/digitalocean.com/blog/scaling-autonomous-site-reliability

By Najmus Saqib

Updated: March 13, 2026

As Cloudways scaled from a bootstrapped startup to a leading managed PHP hosting service, one of the biggest challenges we encountered was the growing support load. Managing a fleet of over 90,000 servers and half a million applications means thousands of support requests, requiring a team of hundreds of human support agents. The rise of LLMs and AI agents provided an ideal opportunity to rethink our support operations. Early on, we recognized that an AI-based SRE agent could significantly reduce the burden on our support teams.

At Cloudways, we care deeply about our customers’ applications and websites because they are the backbone of their businesses and livelihoods. Every minute of downtime matters, and our priority has always been to bring their apps back online as quickly as possible. An AI SRE agent gives customers timely, in-depth investigation and troubleshooting for their web applications, which means faster diagnosis and quicker resolution.

Cloudways Copilot, an AI-powered Site Reliability Engineer, is in its current state the result of over a year of sustained effort toward these goals. Its features, such as Insights and SmartFix, give users a detailed diagnosis and resolution steps for web app incidents. These AI-powered insights are significantly faster and more consistent than those provided by a human agent.

How does CW Copilot work?

The monitoring layer continuously observes each user machine for Webstack issues and excessive. When an anomaly is detected, it triggers an alert and forwards it to the control plane. The control plane then routes the alert to the Insight Generation Engine, which consists of the following components:

AI SRE Agent

The effectiveness of an AI agent depends heavily on the context it is given. The agent is told about the customizations Cloudways has made on top of Debian so that it can work efficiently. This context includes details such as:

File structure details (e.g., where configuration and web application files live)

Locations of log files

The commands used to navigate and inspect the system

Core services needed for the web apps to run

The AI Agent is hosted on the DigitalOcean Kubernetes platform.
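
As a rough illustration of how the context items listed above might be packaged for the agent, here is a minimal Python sketch. The `PlatformContext` class, its field names, and the idea of rendering the context as a text prompt are all assumptions made for illustration; the post does not describe the actual format.

```python
from dataclasses import dataclass, field


@dataclass
class PlatformContext:
    """Platform facts handed to the agent (hypothetical structure)."""

    base_os: str
    config_paths: list[str] = field(default_factory=list)
    log_paths: list[str] = field(default_factory=list)
    inspection_commands: list[str] = field(default_factory=list)
    core_services: list[str] = field(default_factory=list)

    def to_system_prompt(self) -> str:
        """Render the context as plain text for the start of the agent's prompt."""
        sections = [
            ("Configuration and application files", self.config_paths),
            ("Log files", self.log_paths),
            ("Inspection commands", self.inspection_commands),
            ("Core services", self.core_services),
        ]
        lines = [f"Base OS: {self.base_os}"]
        for title, items in sections:
            lines.append(f"{title}:")
            lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)


# Placeholder values only; the real Cloudways paths and services are not given in the post.
example = PlatformContext(
    base_os="Debian",
    config_paths=["/path/to/webserver.conf"],
    log_paths=["/path/to/error.log"],
    inspection_commands=["systemctl status <service>"],
    core_services=["<web server>", "<database>"],
)
print(example.to_system_prompt())
```

Keeping the context in one structured object makes it easy to review and update when the platform layout changes, independently of the agent's reasoning logic.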

Orchestration Layer

A dedicated server orchestration layer had to be set up to connect the AI Agent to the fleet of 90K servers. Cloudways manages this using an Ansible server. Whenever the AI Agent initiates an SSH command, the command is stored in Redis and picked up sequentially through a Celery queue. Once a command is picked from the queue, Ansible runs it on the client machine as a dedicated Linux user. This user has access only to specific commands and files, which protects data on user machines. The response from the machine is sent back to the AI SRE Agent.
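
The flow described above (queue a command, pick it up in order, run it with restricted permissions, return the response) can be sketched in plain Python. This is a simplified stand-in, not the Cloudways implementation: an in-memory `deque` replaces the Redis-backed Celery queue, a caller-supplied `runner` function replaces Ansible, and the `ALLOWED_BINARIES` allowlist is hypothetical. In the post, the restriction is enforced by the dedicated Linux user on the host; the pre-check here is an added assumption.

```python
import shlex
from collections import deque
from typing import Callable

# Hypothetical allowlist: the post says only that the dedicated user is
# limited to specific commands and files, not which ones.
ALLOWED_BINARIES = {"tail", "cat", "systemctl", "df"}
SHELL_METACHARACTERS = set(";|&$`><")


def is_allowed(command: str) -> bool:
    """Accept a command only if it has no shell operators and an allowlisted executable."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return False
    try:
        parts = shlex.split(command)
    except ValueError:  # for example, unbalanced quotes
        return False
    return bool(parts) and parts[0] in ALLOWED_BINARIES


class CommandQueue:
    """In-memory FIFO standing in for the Redis-backed Celery queue."""

    def __init__(self, runner: Callable[[str, str], str]):
        self._pending: deque[tuple[str, str]] = deque()
        self._runner = runner  # stand-in for Ansible running the command on the host

    def submit(self, host: str, command: str) -> None:
        self._pending.append((host, command))

    def drain(self) -> list[dict]:
        """Process queued commands one at a time, in submission order."""
        results = []
        while self._pending:
            host, command = self._pending.popleft()
            if is_allowed(command):
                output, ok = self._runner(host, command), True
            else:
                output, ok = "rejected: command not allowed", False
            results.append({"host": host, "command": command, "ok": ok, "output": output})
        return results
```

A caller would create `CommandQueue(runner=...)` with a function that performs the remote execution, call `submit` for each command the agent requests, and pass the results of `drain` back to the agent.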

DigitalOcean serverless inference

Despite being a critical part of the system, serverless inference is fairly easy to set up. From an implementation perspective, it amounts to invoking a single API endpoint: there is no infrastructure to manage and no scaling to worry about.

curl https://inference.do-ai.run/v1/chat/completions \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "digitalocean-anthropic-claude-sonnet-4",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "temperature": 0.7,
    "max_tokens": 50
  }'
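
The same request can be built from Python using only the standard library. This is a minimal sketch: the endpoint, headers, model name, and payload fields are copied from the curl command above, while the `build_chat_request` helper is a hypothetical name introduced here. The function only constructs the request and does not send it.

```python
import json
import urllib.request

INFERENCE_URL = "https://inference.do-ai.run/v1/chat/completions"


def build_chat_request(
    api_key: str,
    prompt: str,
    model: str = "digitalocean-anthropic-claude-sonnet-4",
) -> urllib.request.Request:
    """Build (but do not send) a chat completion request mirroring the curl call."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "max_tokens": 50,
    }
    return urllib.request.Request(
        INFERENCE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request. The key would typically come from the `MODEL_ACCESS_KEY` environment variable, as in the curl example.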

These three components work in a loop until sufficient information is gathered and an insight is generated for the user. An insight is a JSON object comprising the following parts:

Investigation Summary

Remediation Steps

Fixes
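
To make the loop and the insight object concrete, here is a minimal Python sketch. It rests on several assumptions: the JSON key names and list-of-strings types in `Insight` are guesses based only on the three part names above, `decide` stands in for the agent plus the inference call, `run_command` stands in for the orchestration layer, and the `max_steps` cap is an added safeguard that the post does not mention.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Callable, Union


@dataclass
class Insight:
    """The three parts named in the post; key names and types are assumed."""

    investigation_summary: str
    remediation_steps: list[str] = field(default_factory=list)
    fixes: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


Observation = tuple[str, str]  # (command, output)


def investigate(
    alert: str,
    decide: Callable[[str, list[Observation]], Union[str, Insight]],
    run_command: Callable[[str], str],
    max_steps: int = 10,
) -> Insight:
    """Alternate between deciding and executing until an Insight is produced."""
    observations: list[Observation] = []
    for _ in range(max_steps):
        step = decide(alert, observations)
        if isinstance(step, Insight):
            return step  # enough information gathered
        # Otherwise the step is a command to run on the affected machine.
        observations.append((step, run_command(step)))
    return Insight(
        investigation_summary="Step budget exhausted before a conclusion was reached."
    )
```

Each pass through the loop gives the model the alert plus everything observed so far, so it can either request one more command or finish with an insight.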
