Model Evaluations: Prove Your Routing Policy Actually Works
By Sathish Jothikumar
Updated: June 4, 2026 7 min read
Avoid relying only on cherry-picked prompts. Mix typical ‘happy-path’ prompts with challenging, convoluted long-context prompts, and include edge cases that expose safety risks.
For this example, you can use the simulated dataset legal_eval_simulated_dataset.csv.
Upload the dataset in the new evaluation run dialog. The upload process validates the schema and blocks obvious breakage. On the next screen, provide a concise name for the evaluation run. You can also upload datasets using cURL:
curl -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DIGITALOCEAN_TOKEN" \
  "https://api.digitalocean.com/v2/gen-ai/model_evaluation/datasets/file_upload_presigned_urls" \
  -d '{
    "files": [
      {
        "file_name": "legal_eval_simulated_dataset.csv",
        "file_size": 77564
      }
    ]
  }'
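If you script the upload, the request body above can be generated from the local file instead of typed by hand. The following is a minimal Python sketch: the helper name build_upload_request is hypothetical, and it assumes file_size is the file's size in bytes, which the request above does not state explicitly. It only builds the JSON body and does not call the API.

```python
import json
import os
import tempfile


def build_upload_request(paths):
    """Build the JSON body for the presigned-URL request from local files.

    Assumption: "file_size" is the size in bytes, as reported by the OS.
    """
    return {
        "files": [
            {"file_name": os.path.basename(p), "file_size": os.path.getsize(p)}
            for p in paths
        ]
    }


if __name__ == "__main__":
    # Demo with a throwaway placeholder file so the sketch runs
    # without the real dataset on disk.
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = os.path.join(tmp, "legal_eval_simulated_dataset.csv")
        with open(demo_path, "wb") as f:
            f.write(b"placeholder content\n")
        print(json.dumps(build_upload_request([demo_path]), indent=2))
```

The printed JSON can be passed to the -d flag of the cURL command above.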
Step 3: Configure candidates
For an apples-to-apples comparison of the three candidates, use the same evaluation configuration for every run. This includes setting the system prompt, temperature, and max tokens to values that mirror your production use case. If your app injects a system prompt in code, paste the same prompt here. Otherwise, you will be measuring a different product than the one you ship.
For the first run, select a frontier model such as anthropic-claude-4.6-sonnet, or a DigitalOcean-hosted model such as glm-5, from the candidate model dropdown. (Note that to access commercial models, your account must be on an appropriate tier. You can request access to a higher tier at this link.) For the second run, choose your router config from the dropdown (model-eval-blog-legal in this example). For the third run, choose the dedicated inference endpoint where Ontario/qwen3-0.6b-en-law-qa has been deployed. For the candidate's system prompt, you can write your own prompt suitable for evaluation, or tweak the following example: Legal_system_prompt.txt.
Step 4: Choose the judge and the rubric
Choose an appropriate judge for evaluating the candidates. Remember to use the same judge for all three candidates. For this example, we recommend using OpenAI GPT-5.5 (or DeepSeek R1 Distill Llama 70B if you do not have access to commercial models).
Select all the available evaluation metrics: correctness, completeness, ground truth faithfulness, and the safety metrics (PII, toxicity, and bias). Since the dataset includes ground truth answers, let’s choose ground truth faithfulness as the star metric.
Run the job. Monitor its status on the model evaluation landing page.
The following cURL snippet sets the evaluation configuration:
curl -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DIGITALOCEAN_TOKEN" \
  "https://api.digitalocean.com/v2/gen-ai/model_evaluation_runs" \
  -d '{
    "name": "my-evaluation-run",
    "candidate_model_uuid": "123e4567-e89b-12d3-a456-426614174000",
    "judge_model_uuid": "223e4567-e89b-12d3-a456-426614174001",
    "dataset_uuid": "323e4567-e89b-12d3-a456-426614174002",
    "metric_uuids": [
      "423e4567-e89b-12d3-a456-426614174003"
    ]
  }'
Information about model and metric UUIDs is available in the API Reference.
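Because the three runs must share the same judge, dataset, and metrics, it can help to generate the request bodies from one shared configuration so that only the candidate changes. The sketch below uses only the fields shown in the cURL example. The helper name build_run_payloads is hypothetical, all UUIDs are placeholders, and it is an assumption that a router config or dedicated endpoint is also referenced through candidate_model_uuid; check the API Reference before relying on that.

```python
import json


def build_run_payloads(candidates, judge_uuid, dataset_uuid, metric_uuids):
    """Return one request body per candidate, identical except for the candidate.

    `candidates` maps a run name to a candidate UUID (placeholders here).
    """
    return [
        {
            "name": run_name,
            "candidate_model_uuid": candidate_uuid,
            "judge_model_uuid": judge_uuid,
            "dataset_uuid": dataset_uuid,
            "metric_uuids": list(metric_uuids),
        }
        for run_name, candidate_uuid in candidates.items()
    ]


if __name__ == "__main__":
    # Placeholder UUIDs only; substitute the values for your account.
    payloads = build_run_payloads(
        candidates={
            "legal-eval-frontier": "00000000-0000-0000-0000-000000000001",
            "legal-eval-router": "00000000-0000-0000-0000-000000000002",
            "legal-eval-dedicated": "00000000-0000-0000-0000-000000000003",
        },
        judge_uuid="223e4567-e89b-12d3-a456-426614174001",
        dataset_uuid="323e4567-e89b-12d3-a456-426614174002",
        metric_uuids=["423e4567-e89b-12d3-a456-426614174003"],
    )
    for body in payloads:
        print(json.dumps(body))
```

Generating the bodies this way makes it harder to change the judge or dataset between runs by accident.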
Step 5: Interpret results like a PM, not like a leaderboard
When the run finishes, you are looking for three layers of evidence (aligned to your dashboard requirements):
Aggregate: per-metric and overall health, plus the star metric for the exec readout.
Performance economics on the same rows: end-to-end latency, total evaluation time, token counts, estimated cost. This is how you answer “best accuracy per dollar?” without merging two spreadsheets.
Item-level drill-down: input, model output, judge rationale, per-criterion scores. This is where you can assess routing decisions by weighing completeness or correctness scores against latency and cost.
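The "accuracy per dollar" question can also be answered offline from the downloaded results. The sketch below is illustrative only: the row keys (ground_truth_faithfulness, latency_ms, cost_usd) are hypothetical names, not the actual export schema, and the sample rows are made-up values that exist only to exercise the function.

```python
from statistics import mean


def summarize(rows, star_metric="ground_truth_faithfulness"):
    """Aggregate one run's rows into quality, latency, and cost figures.

    Assumption: each row is a dict with the star metric score,
    "latency_ms", and "cost_usd" (hypothetical key names).
    """
    total_cost = sum(r["cost_usd"] for r in rows)
    avg_score = mean(r[star_metric] for r in rows)
    return {
        "avg_score": avg_score,
        "avg_latency_ms": mean(r["latency_ms"] for r in rows),
        "total_cost_usd": total_cost,
        # Average star-metric score bought per dollar spent on the run.
        "score_per_dollar": avg_score / total_cost if total_cost else None,
    }


if __name__ == "__main__":
    # Made-up rows for illustration, not real evaluation results.
    sample = [
        {"ground_truth_faithfulness": 1.0, "latency_ms": 900, "cost_usd": 0.25},
        {"ground_truth_faithfulness": 0.5, "latency_ms": 700, "cost_usd": 0.25},
    ]
    print(summarize(sample))
```

Running summarize on each candidate's rows gives one comparable line per candidate.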
If you are comparing a router vs. a static model, scan for segmented behavior:
On easy prompts, did the router preserve quality and improve cost/latency?
On hard prompts, did the router keep safety scores in range or did PII/toxicity/bias tick up?
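One way to run this scan on the downloaded results is to tag each dataset row with a difficulty segment and compare per-segment averages between the static model and the router. The sketch below assumes a hypothetical segment label ("easy" or "hard") that you add to your own rows, along with hypothetical field names; none of these come from the export format.

```python
from collections import defaultdict
from statistics import mean


def by_segment(rows, fields):
    """Average each field within each segment (e.g. "easy", "hard")."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["segment"]].append(row)
    return {
        seg: {f: mean(r[f] for r in members) for f in fields}
        for seg, members in groups.items()
    }


def compare(static_rows, router_rows, fields):
    """Return router-minus-static deltas per segment and field.

    A negative cost or latency delta means the router is cheaper or faster;
    a negative score delta means the router lost quality in that segment.
    """
    static = by_segment(static_rows, fields)
    router = by_segment(router_rows, fields)
    return {
        seg: {f: router[seg][f] - static[seg][f] for f in fields}
        for seg in static
        if seg in router
    }
```

Calling compare with fields such as a quality score, a safety score, and cost shows whether savings come from the easy segment while the hard segment holds steady.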
Finally, download the results for exhaustive analysis.
Step 6: The decision and the iteration loop
You are not looking for a philosophical winner; you are looking for a go / no-go with a tuning path:
Ship the router if star metric and safety bars hold (or only regress within agreed tolerance) and you gain meaningful cost or latency headroom on your representative mix.
Keep iterating if the router’s deficiencies cluster in specific task types. Then adjust routing policy (natural-language policy for tasks + model pool, per Router positioning), not the judge, and re-run the same dataset to see if the performance changes.
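The ship / keep-iterating rule above can be written down as an explicit check so the tolerance is agreed before the results arrive. This is a sketch under stated assumptions: the tolerance and minimum-gain values are placeholders to replace with your own, the dictionary keys are hypothetical, and the safety score is assumed to be "higher is better".

```python
def decide(baseline, router, star_tol=0.02, safety_tol=0.0, min_gain=0.10):
    """Return "ship" or "iterate" for a router compared with a baseline.

    Assumptions: both inputs are dicts with hypothetical keys "star",
    "safety" (higher is better), "cost_usd", and "latency_ms".
    """
    # Quality and safety may regress only within the agreed tolerance.
    star_ok = router["star"] >= baseline["star"] - star_tol
    safety_ok = router["safety"] >= baseline["safety"] - safety_tol
    # Relative improvement in cost and latency versus the baseline.
    cost_gain = 1 - router["cost_usd"] / baseline["cost_usd"]
    latency_gain = 1 - router["latency_ms"] / baseline["latency_ms"]
    if star_ok and safety_ok and max(cost_gain, latency_gain) >= min_gain:
        return "ship"
    return "iterate"
```

An "iterate" result is the cue to inspect which task types the failures cluster in before adjusting the routing policy.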
Release narrative you can use internally: “We didn’t A/B in production. We simulated production endpoints, captured judge + latency + cost in one run, and re-ran on the same dataset as the router policy evolved.” Also, share context on why you did not rely on a public leaderboard number as a substitute for your workload, or track quality in one tool and $ / token in another.
Turning evaluation into an operational workflow
Over time, model evaluation is moving closer to real-world production workloads, giving teams near-real-time visibility into performance, cost, latency, and output quality. We are continuing to expand DigitalOcean Model Evaluations with support for custom metrics, multimodal models, standardized benchmarks, and richer workload analysis so teams can make production decisions with greater confidence*.
For you, this means spending less time second-guessing model decisions and more time shipping confidently. Every evaluation run brings you closer to a production stack you can justify with data instead of intuition.
* The above reflects our current plans and product direction, and is subject to change without notice. It is provided for informational purposes only and is not a commitment to deliver any material, feature, or functionality.