Amazon (Nova), published Apr 27, 2026

How catastrophic is your LLM?

How catastrophic is your LLM? A statistical framework for certifying conversational risk - Amazon Science

Conversational AI

A new framework provides a statistical method for estimating the likelihood of catastrophic failures in large language models in adversarial conversations.

By Qian Hu, Weitong Ruan, Rahul Gupta

April 27, 2026

4 min read

Overview by Amazon Nova

The C3LLM framework models conversations as multiturn dialogues using a graph where nodes represent prompts and edges represent semantic relationships between prompts. The framework uses Clopper-Pearson confidence intervals to calculate lower and upper bounds on attack success rates, providing high-confidence probabilistic bounds over large conversation spaces. The C3LLM framework has been open sourced to enable more-principled safety studies by researchers in industry and academia.

As large language models (LLMs) become increasingly useful across a variety of domains, the stakes of keeping them safe rise accordingly. Bad actors might, for instance, try to use LLMs to write malicious code or produce step-by-step guides for synthesizing toxic compounds, so researchers are developing rigorous safeguards to keep LLMs from generating content that could pose serious risks to public safety and security.

The most common way to assess the risks posed by LLMs is red-teaming, in which human evaluators design adversarial prompts intended to elicit harmful responses. This approach has three limitations. First, expert-curated sets of prompts cannot capture the full range of possible outcomes. Second, many evaluations focus on isolated prompts rather than conversations, which is where harmful behavior often emerges. Finally, today’s benchmark failure metrics provide only a single score rather than confidence bounds on worst-case conversational risks. As a result, the findings are unreliable and do not generalize to the vast space of possible conversations.

Read the paper and explore the code

The C3LLM framework is open source. Access the code on GitHub and read the full paper on Amazon Science.

In a paper we presented at this year’s International Conference on Learning Representations (ICLR), we, along with researchers from the University of Illinois Urbana-Champaign (UIUC), address these red-teaming limitations by focusing on failures within conversational threat models and then placing probabilistic bounds on the attack success rate, defined as the number of successful attacks divided by the total number of attacks. Our approach, called the C3LLM (certifying catastrophic conversational risks in LLMs) framework, shifts the focus of benchmarking failure from empirical spot-checking to statistical certification.
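To make the idea of bounding an attack success rate concrete, here is a minimal sketch of a Clopper-Pearson interval, the method named in the overview above. It is an illustration under stated assumptions, not the C3LLM implementation: the function names, the two-sided split of `alpha`, and the counts in the example (7 successes in 50 sampled conversations) are all hypothetical. The sketch uses only the Python standard library and solves for each bound by bisection on the exact binomial tail.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), summed exactly term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(g, iters: int = 60) -> float:
    """Bisection on [0, 1] for the root of a function that increases in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(successes: int, trials: int, alpha: float = 0.05):
    """Two-sided (1 - alpha) Clopper-Pearson interval for a binomial rate.

    Assumption: alpha is split evenly between the two tails.
    """
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    k, n = successes, trials
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2
    )
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: alpha / 2 - binom_cdf(k, n, p)
    )
    return lower, upper


# Hypothetical counts, for illustration only.
attacks_succeeded, attacks_total = 7, 50
low, high = clopper_pearson(attacks_succeeded, attacks_total)
print(f"attack success rate: {attacks_succeeded / attacks_total:.3f}")
print(f"95% Clopper-Pearson interval: [{low:.3f}, {high:.3f}]")
```

The exact summation keeps the sketch dependency-free but is only practical for modest sample sizes; for large samples, a beta-distribution quantile routine such as `scipy.stats.beta.ppf` computes the same bounds more efficiently.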

The C3LLM (certification of catastrophic risks in multiturn conversation for LLMs) framework. Starting from a query set, we construct a graph in which edges connect semantically similar queries. On this graph, we define formal specifications as probability…
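The graph construction described in the caption can be sketched roughly as follows. This is a toy illustration under loud assumptions: the similarity measure (token-overlap Jaccard), the threshold, the helper names, and the random-walk sampler are all hypothetical stand-ins, and the placeholder queries are deliberately benign. The source does not say which similarity measure or sampling scheme the framework uses.

```python
import random
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for a real semantic measure."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def build_similarity_graph(queries, threshold=0.3):
    """Undirected graph: node i is queries[i]; an edge joins similar queries."""
    graph = {i: set() for i in range(len(queries))}
    for i, j in combinations(range(len(queries)), 2):
        if jaccard(queries[i], queries[j]) >= threshold:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def sample_conversation(graph, length, rng):
    """Hypothetical sampler: a random walk in which each turn moves to a
    neighbor of the previous query. Stops early at an isolated node."""
    path = [rng.choice(sorted(graph))]
    while len(path) < length:
        neighbors = sorted(graph[path[-1]])
        if not neighbors:
            break
        path.append(rng.choice(neighbors))
    return path


# Benign placeholder queries, for illustration only.
queries = [
    "how do I reset my router password",
    "how do I reset my email password",
    "what is the capital of France",
]
graph = build_similarity_graph(queries)
turns = sample_conversation(graph, length=3, rng=random.Random(0))
print([queries[i] for i in turns])
```

Each sampled path is a candidate multiturn conversation. Counting how many sampled conversations elicit a harmful response yields the success and trial counts to which a confidence interval can then be applied.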
