cerebras/Llama3-DocChat-1.0-8B
Model Information
We are excited to announce the release of Cerebras DocChat, our first iteration of models designed for document-based conversational question answering. This series includes two models: Cerebras Llama3-DocChat, a large language model (LLM), and Cerebras Dragon-DocChat, a multi-turn retriever model.
This model – Cerebras Llama3-DocChat 1.0 8B – was built on top of Llama 3 base using insights from the latest research on document-based Q&A, most notably Nvidia’s ChatQA model series. As part of this work, we leveraged our experience in LLM training and dataset curation to overcome the gaps in ChatQA's released datasets and training recipes. Additionally, we employed synthetic data generation to address limitations that couldn't be fully resolved with the available real data. Llama3-DocChat 8B was trained in a few hours on a single Cerebras System.
You can find more information about DocChat at the following locations:
- Blog post
- LLM model weights on HuggingFace
- Embedding model weights on HuggingFace: Query Encoder, Context Encoder
- Data preparation, training, and evaluation code
Results
| ChatRAG Benchmark | Llama3 Instruct 8B | Command-R-Plus | Nvidia Llama3-ChatQA 1.5 8B | GPT-4-Turbo-2024-04-09 | Cerebras Llama3-DocChat 1.0 8B |
| --- | --- | --- | --- | --- | --- |
| Doc2Dial | 31.33 | 33.51 | 39.33 | 35.35 | 39.19 |
| QuAC | 32.64 | 34.16 | 39.73 | 40.1 | 36 |
| QReCC | 43.4 | 49.77 | 49.03 | 51.46 | 50.27 |
| CoQA | 73.25 | 69.71 | 76.46 | 77.73 | 79.56 |
| DoQA | 30.34 | 40.67 | 49.6 | 41.6 | 48.77 |
| ConvFinQA | 53.15 | 71.21 | 78.46 | 84.16 | 80.13 |
| SQA | 36.6 | 74.07 | 73.28 | 79.98 | 74.19 |
| TopiOCQA | 34.64 | 53.77 | 49.96 | 48.32 | 52.13 |
| HybriDial\* | 40.77 | 46.7 | 65.76 | 47.86 | 64 |
| INSCIT | 32.09 | 35.76 | 30.1 | 33.75 | 32.88 |
| Average (all) | 40.82 | 50.93 | 55.17 | 54.03 | 55.71 |
| Average (exclude HybriDial) | 40.83 | 51.4 | 53.99 | 54.72 | 54.79 |
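The two average rows are plain unweighted means over the per-dataset scores. As a quick sanity check, the sketch below recomputes them for the Cerebras Llama3-DocChat 1.0 8B column; the dictionary and the `mean` helper are illustrative names, not part of the released evaluation code.

```python
# ChatRAG scores for Cerebras Llama3-DocChat 1.0 8B, copied from the table.
docchat_scores = {
    "Doc2Dial": 39.19,
    "QuAC": 36.0,
    "QReCC": 50.27,
    "CoQA": 79.56,
    "DoQA": 48.77,
    "ConvFinQA": 80.13,
    "SQA": 74.19,
    "TopiOCQA": 52.13,
    "HybriDial": 64.0,
    "INSCIT": 32.88,
}


def mean(values):
    """Unweighted arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


avg_all = mean(docchat_scores.values())
avg_no_hybridial = mean(
    score for name, score in docchat_scores.items() if name != "HybriDial"
)

print(f"Average (all): {avg_all:.2f}")                        # 55.71
print(f"Average (exclude HybriDial): {avg_no_hybridial:.2f}")  # 54.79
```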
| Eleuther Eval Harness Benchmark | Llama3 Instruct 8B | Nvidia Llama3-ChatQA 1.5 8B | Cerebras Llama3-DocChat 1.0 8B |
| --- | --- | --- | --- |
| hellaswag | 57.68 | 61.37 | 61.68 |
| winogrande | 71.98 | 73.95 | 74.11 |
| truthfulqa_mc1 | 36.23 | 28.52 | 29.25 |
| truthfulqa_mc2 | 51.65 | 43.56 | 45.14 |
| mmlu | 63.84 | 60.68 | 62.86 |
| gsm8k | 76.12 | 13.72 | 55.57 |
| arc_easy | 81.61 | 80.56 | 82.03 |
| arc_challenge | 52.99 | 51.02 | 53.92 |
| Average | 61.51 | 51.67 | 58.07 |
Prompt Format
DocChat supports the standard Llama3 Instruct chat template – no fancy formatting functions required! When providing a context document to the model, simply prepend the user turn with `<context> {put your document here} </context>`. You may also provide an “instruction” before the user input to better align the model’s response with the desired behavior. Examples include:
- Please give a full and complete answer for the question.
- Answer the following question with a short span
We use the same system prompt as ChatQA: This is a chat between a user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions based on the context. The assistant should also indicate when the answer cannot be found in the context.
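The prompt layout described here can be sketched as a small helper that assembles the `messages` list passed to `tokenizer.apply_chat_template`. This is a minimal sketch, not code from the DocChat release: `build_messages` and `SYSTEM_PROMPT` are hypothetical names, and wrapping the document in `<context>` tags on their own lines is an assumption about the exact whitespace.

```python
# System prompt quoted from the model card (same as ChatQA).
SYSTEM_PROMPT = (
    "This is a chat between a user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's "
    "questions based on the context. The assistant should also indicate when "
    "the answer cannot be found in the context."
)


def build_messages(document, question, instruction=""):
    """Hypothetical helper: build a single-turn chat for DocChat."""
    # Assumption: the document goes first in the user turn, inside <context> tags.
    context_block = f"<context>\n{document.strip()}\n</context>"
    # The optional instruction precedes the question; strip() drops the
    # leading space when no instruction is given.
    prompt = f"{instruction} {question}".strip()
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"{context_block}\n{prompt}"},
    ]


messages = build_messages(
    document="CG1 has 64 CS-2 systems.",
    question="How many systems does CG1 have?",
    instruction="Answer the following question with a short span",
)
print(messages[1]["content"])
```

The returned list has the same shape as the `messages` variable in the usage example, so it can be handed directly to `tokenizer.apply_chat_template(..., add_generation_prompt=True)`.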
Example Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_id = "cerebras/Llama3-DocChat-1.0-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
system = "This is a chat between a user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions based on the context. The assistant should also indicate when the answer cannot be found in the context."
instruction = "Please give a full and complete answer for the question."
document = """
# Cerebras Wafer-Scale Cluster
Exa-scale performance, single device simplicity
## AI Supercomputers
Condor Galaxy (CG), the supercomputer built by G42 and Cerebras, is the simplest and fastest way to build AI models in the cloud. With over 16 ExaFLOPs of AI compute, Condor Galaxy trains the most demanding models in hours rather than days. The terabyte scale MemoryX system natively accommodates 100 billion+ parameter models, making large scale training simple and efficient.
| Cluster | ExaFLOPs | Systems | Memory |
| -------- | -------- | -------- | ------ |
| CG1 | 4 | 64 CS-2s | 82 TB |
| CG2 | 4 | 64 CS-2s | 82 TB |
| CG3 | 8 | 64 CS-3s | 108 TB |
"""
question = "How many total CS systems does Condor Galaxy 1, 2, and 3 have combined, and how many flops does this correspond to?"
user_turn = f"""<context>
{document}
</context>
{instruction} {question}"""
messages = [
    {"role": "system", "content": system},
    {"role": "user", "content": user_turn}
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)
# Stop on either the tokenizer's EOS token or Llama 3's end-of-turn token.
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]
outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,
)
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
License
This model was trained from Llama 3 8B base, and is therefore subject to the META LLAMA 3 COMMUNITY LICENSE AGREEMENT. Furthermore, it is trained on ChatQA's synthetic conversational QA dataset, which was generated using GPT-4. As a result, this model can be used for non-commercial purposes only, and is subject to the Terms of Use of the data generated by OpenAI. Additionally, please see the licensing information of individual datasets.
Acknowledgements
DocChat was built on top of a large body of ML work, spanning training datasets, recipes, and evaluation. We want to thank each of these resources.
@inproceedings{dua2019drop,
title={DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs},
author={Dua, Dheeru and Wang, Yizhong and Dasigi, Pradeep and…