Building a Robust Documentation Agent with DigitalOcean Gradient AI Platform
By Austen Ito and Anna Lushnikova
Published: April 13, 2026 14 min read
characters, right square brackets, and periods that are legitimately part of a URL path.
On the prompt side, we added explicit rules:
Use URLs exactly as they appear in documentation chunks
Never fabricate, infer, or extend URLs
Each URL appears once per response
Format as Markdown: [text](url)
Exceptions: URLs may be omitted for clarifications, out-of-scope queries, or “no documentation” responses
We also verified the knowledge base itself (via OpenSearch chunk inspection notebooks) to confirm it only contained docs.digitalocean.com URLs and not community or third-party domains, eliminating a source of invalid URLs at the retrieval layer.
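The prompt rules above can also be checked mechanically after a response is generated. The following is a minimal sketch of such a check, not the validator used in production: the function name `check_response_urls`, the regular expression, and the sample URL are illustrative assumptions, and the simple pattern does not handle URLs that contain parentheses.

```python
import re
from urllib.parse import urlparse

# Assumption: responses cite sources as Markdown links, [text](url).
# This simple pattern stops at the first ")" or whitespace in the URL.
MARKDOWN_LINK = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")

# The only domain the knowledge base is expected to contain.
ALLOWED_HOST = "docs.digitalocean.com"


def check_response_urls(response: str, chunk_urls: set[str]) -> list[str]:
    """Return a list of rule violations found in a Markdown response."""
    problems = []
    seen = set()
    for url in MARKDOWN_LINK.findall(response):
        if url in seen:
            # Rule: each URL appears once per response.
            problems.append(f"duplicate URL: {url}")
            continue
        seen.add(url)
        if urlparse(url).hostname != ALLOWED_HOST:
            # Rule: no community or third-party domains.
            problems.append(f"unexpected domain: {url}")
        elif url not in chunk_urls:
            # Rule: URLs must appear exactly as in the retrieved chunks.
            problems.append(f"URL not present in retrieved chunks: {url}")
    return problems


if __name__ == "__main__":
    chunks = {"https://docs.digitalocean.com/products/droplets/how-to/create/"}
    answer = "See [Create a Droplet](https://docs.digitalocean.com/products/droplets/how-to/create/)."
    print(check_response_urls(answer, chunks))
```

An empty list means the response satisfies the three URL rules checked here; the exceptions for clarifications and out-of-scope queries need no special handling because a response without links produces no violations.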
Golden Datasets
Our evaluations require “golden datasets” of question-and-answer pairs that we consider correct. For example, an entry from a real golden dataset that we use looks like:
question: How do I rotate my DigitalOcean API tokens securely?
answer: >
  Create a new API token on the API page, update your applications to use the new token, then revoke the old token. This ensures continuous service while maintaining security by eliminating access through the old token.
product: api
reasoning: moderate
source: synthetic
type: how_to_configuration
The above example contains a question, ground truth answer, and additional metadata such as:
Product area/type - allows us to zoom in on product domains that are not performing well or need more test coverage.
Reasoning - gives visibility into the level of difficulty for an evaluation, which is useful when looking at results over time. For light questions, we expect higher results since they are “easier” to answer. For heavy questions, we expect more variability over time since these questions are more nuanced and have higher complexity.
Source - the origin of the question such as human-generated or synthetic.
Our golden datasets are sent to the Gradient AI Platform, where metric results are computed. Evaluations use an LLM-as-a-Judge approach with multiple judges running OpenAI GPT-4o. The judges employ Chain-of-Thought (CoT) reasoning to generate a numerical score between 0 and 1. For example, a golden dataset entry could return a correctness score of 0.66, meaning 2 out of 3 judges found the response to be correct.
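To make the scoring arithmetic concrete, here is a minimal sketch of how several judge verdicts could be combined into one score. It assumes each judge returns a binary correct/incorrect verdict and that the final score is the plain mean; the platform's actual aggregation is not described here, and `aggregate_verdicts` is a hypothetical helper.

```python
from statistics import mean


def aggregate_verdicts(verdicts: list[bool]) -> float:
    """Return the fraction of judges that marked the response correct."""
    if not verdicts:
        raise ValueError("at least one judge verdict is required")
    # True counts as 1.0 and False as 0.0, so the mean is the agreement rate.
    return mean(1.0 if verdict else 0.0 for verdict in verdicts)


if __name__ == "__main__":
    # Two of three judges found the response correct.
    print(aggregate_verdicts([True, True, False]))
```

With two of three judges agreeing, the result is roughly two thirds, as in the example above.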
Once results are created, the metrics and metadata are sent as telemetry via OTLP to our observability cluster for later analysis.
Creating Datasets
There are several ways to create golden datasets.
Human-generated - Subject matter experts manually write question-and-answer pairs. While this is the most reliable method to create golden datasets, we quickly realized it was extremely time-consuming for teams.
Using existing content - DigitalOcean has resources, such as support and community, which provide gold-standard datasets based on customer support materials. This approach maintains quality while reducing manual effort, but human work was still necessary to identify good examples.
Synthetic generation - LLMs generate question-and-answer pairs from detailed prompts to produce synthetic datasets. The process involved prompting an LLM to crawl product documentation pages 2-3 levels deep and generate multiple diverse QA pairs. The answers can carry bias from the LLM, but the speed of generation makes up for that loss in accuracy. Humans are still needed to review synthetically generated datasets.
Since the golden datasets are used by both automated systems and humans in various roles (PMs, engineers, and managers), we chose YAML for readability, because responses often span multiple lines and contain code. The YAML files are then converted to CSV and fed into the Gradient evaluations.
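The conversion step can be sketched with Python's standard `csv` module. This is an illustration under stated assumptions, not the actual conversion script: it assumes the YAML files have already been parsed into dictionaries by a YAML library, that the column set matches the example entry shown earlier, and `entries_to_csv` is a hypothetical name.

```python
import csv
import io

# Assumed column set, taken from the example golden dataset entry.
FIELDS = ["question", "answer", "product", "reasoning", "source", "type"]


def entries_to_csv(entries: list[dict]) -> str:
    """Serialize parsed golden dataset entries to CSV text."""
    buffer = io.StringIO()
    # extrasaction="raise" rejects entries with unexpected keys.
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, extrasaction="raise")
    writer.writeheader()
    for entry in entries:
        missing = [field for field in FIELDS if field not in entry]
        if missing:
            raise ValueError(f"entry is missing fields: {missing}")
        # Multi-line answers and embedded commas are quoted automatically.
        writer.writerow(entry)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = {
        "question": "How do I rotate my DigitalOcean API tokens securely?",
        "answer": "Create a new API token on the API page,\nthen revoke the old token.",
        "product": "api",
        "reasoning": "moderate",
        "source": "synthetic",
        "type": "how_to_configuration",
    }
    print(entries_to_csv([sample]))
```

Failing loudly on missing or unexpected fields keeps malformed entries from reaching an evaluation run, which matters when the files are edited by people in several roles.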
The generated datasets are submitted as pull requests on GitHub to a dedicated team for review and approval. This approach allowed us to add high-quality responses quickly from subject matter experts with minimal effort.
Running Evaluations
Our agent evaluations are used during development and in our CI/CD pipeline on GitHub Actions. During development, engineers can experiment with changes such as prompt updates and run evaluations locally. When changes are merged, our CI/CD pipeline executes evaluations automatically before deployment. We also run our full evaluation suite once a day to verify that our documentation agent continues to operate as we expect.
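One way a pipeline step could block a deployment on evaluation results is a small threshold gate. The sketch below is hypothetical: the metric names, the threshold values, and the `gate` function are placeholders chosen for illustration, not our actual pipeline configuration.

```python
import sys

# Hypothetical minimum scores; real thresholds would be tuned per metric.
THRESHOLDS = {"correctness": 0.95, "ground_truth_adherence": 0.80}


def failing_metrics(results: dict[str, float]) -> dict[str, float]:
    """Return the metrics that are missing or below their threshold."""
    return {
        name: results.get(name, 0.0)
        for name, minimum in THRESHOLDS.items()
        if results.get(name, 0.0) < minimum
    }


def gate(results: dict[str, float]) -> int:
    """Return a process exit code: 0 if all metrics pass, 1 otherwise."""
    failures = failing_metrics(results)
    for name, score in failures.items():
        print(f"FAIL {name}: {score:.2f} < {THRESHOLDS[name]:.2f}", file=sys.stderr)
    return 1 if failures else 0
```

A CI job could end with `sys.exit(gate(results))`, so that a non-zero exit code stops the deployment step that follows.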
Agent Configuration Decisions
Once evaluations and metrics were in place, it became possible to iterate on areas important to us, such as correctness and speed, using objective measurements instead of one-off tests and vibes.
Prompt Engineering
Our evaluations showed that the main factor keeping our ground truth adherence metric below 80% and our correctness metric below 95% was the LLM deciding on its own what ambiguous words meant or which method to suggest for performing an action. For example, consider the following prompt:
How do I create a Droplet on DigitalOcean?
Across multiple Gradient evaluation runs, we found that our agent answered this question inconsistently. Sometimes it returned instructions for creating Droplets in the Control Panel; other times it suggested the Public API or doctl. Since the agent lives in the Control Panel, we added the following to our prompt to help ensure that ambiguous requests are answered with Control Panel instructions:
- If public API or automation is NOT specified in the question:
  - Use ONLY Control Panel chunks in your primary answer
  - If only public API chunks are retrieved, ask clarifying question: "Would you like to know how to do this in the Control Panel or via the API?"
  - You may mention API/CLI methods in the Recommendation section as alternative automation options
We also discovered that our golden datasets were too ambiguous with prompts such as:
How do I automate deleting files older than 30 days?
Questions like these would return varying responses since files can live on a variety of different DigitalOcean products. This prompted us to make two changes:
We updated our golden dataset questions to be more specific. In this case, we changed the above prompt to be “How do…