Generalizing an LLM from 8k to 1M Context using Qwen-Agent
June 6, 2024 · Qwen Team
TLDR: We’ve created an agent using Qwen2 models with an 8k context size to understand documents with 1M tokens, surpassing RAG and native long-context models. This agent was also used to generate data for training new long-context Qwen models.

Introduction

Recently, there has been a surge of interest in LLMs that can natively process sequences of millions of tokens. Most work has focused on sophisticated mathematical tweaks, such as RoPE-based extrapolation, or on architectural overhauls, such as non-transformer LLMs. However, preparing fine-tuning data that is sufficiently long is a less discussed but equally important topic.

We adopt the following approach:

1. We use a weak 8k-context chat model to build a relatively strong agent capable of handling 1M-token contexts.
2. Subsequently, we synthesize fine-tuning data using the agent and apply automated filtering to ensure quality.
3. Finally, we use the synthetic data to fine-tune a pretrained model, resulting in a strong 1M-context chat model.
This blog primarily focuses on Step 1, with details of the subsequent steps to be revealed in the coming weeks or months.

Building the Agent

The agent we are building consists of three levels of complexity, each building upon the previous one.

Level 1: Retrieval-Augmented Generation

A naive approach to processing a 1M-token context is to simply use retrieval-augmented generation (RAG). RAG divides the context into shorter chunks (for example, no more than 512 tokens each) and then retains only the most relevant chunks that fit within an 8k-token context. The challenge lies in pinpointing the chunks that are the most relevant. After several trials, we have come up with a keyword-based solution:

Step 1: Instruct the chat model to separate the instruction and the non-instruction information in the user’s query. For instance, transform the user query "You should reply in 2000 words and be as detailed as possible. My question is, when were bicycles invented? Reply in English." into {"information": ["when were bicycles invented"], "instruction": ["reply in 2000 words", "be as detailed as possible", "reply in English"]}.

Step 2: Ask the chat model to deduce multilingual keywords from the informational part of the query. For example, the phrase "when were bicycles invented" would be converted to {"keywords_en": ["bicycles", "invented", "when"], "keywords_zh": ["自行车", "发明", "时间"]}.

Step 3: Employ the BM25 algorithm, a traditional keyword-based retrieval method, to locate the chunks that are most relevant to the extracted keywords.
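The chunking and BM25 ranking described above can be sketched in plain Python. This is a minimal illustration, not the Qwen-Agent implementation: the function names are hypothetical, lowercase whitespace tokens stand in for real model tokens, the scoring uses one common Okapi BM25 variant, and the `k1` and `b` defaults are conventional values rather than anything stated in this post.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokens stand in for real model tokens.
    return text.lower().split()


def split_into_chunks(text, max_tokens=512):
    # Cut the document into consecutive chunks of at most max_tokens tokens.
    tokens = tokenize(text)
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


def bm25_scores(chunks, keywords, k1=1.5, b=0.75):
    # Okapi BM25 score of every chunk against the keyword list.
    n = len(chunks)
    if n == 0:
        return []
    avg_len = sum(len(c) for c in chunks) / n
    # Number of chunks in which each term appears at least once.
    doc_freq = Counter(term for c in chunks for term in set(c))
    scores = []
    for chunk in chunks:
        tf = Counter(chunk)
        score = 0.0
        for term in keywords:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(chunk) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve(chunks, keywords, budget=8000):
    # Walk chunks from highest to lowest score, keeping each one that has a
    # positive score and still fits in the remaining token budget.
    scores = bm25_scores(chunks, keywords)
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    picked, used = [], 0
    for i in ranked:
        if scores[i] <= 0 or used + len(chunks[i]) > budget:
            continue
        picked.append(i)
        used += len(chunks[i])
    return picked  # chunk indices, best match first
```

Whitespace tokenization would not segment the Chinese keywords from Step 2, so a real setup would need a tokenizer suited to each language. The budget here counts only retrieved tokens; in practice some of the 8k context must be reserved for the prompt and the answer.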
[Figure: Dataflows of retrieval-augmented generation]

We have also experimented with vector-based retrieval. However, in most cases, it does not offer a significant enough improvement to outweigh the additional complexity of deploying a separate embedding model.

Level 2: Chunk-by-Chunk Reading

The aforementioned RAG approach is fast but often fails when the relevant chunks do not have sufficient keyword overlap with the user query, so these chunks are not retrieved and never reach the model. Although vector retrieval can mitigate this issue in theory, in practice it frequently does not. To address this limitation, we employ a brute-force strategy to reduce the chance of missing relevant context:

Step 1: For each 512-token chunk, we ask the model to assess its relevance to the user query, outputting "None" if it is deemed irrelevant, or outputting the relevant sentences if it is deemed relevant. The chunks are processed in parallel to avoid long waiting times.

Step 2: We then take the outputs that are not "None" (the relevant sentences) and use them as the search query to retrieve the most relevant chunks (within an 8k-context limit) using BM25.

Step 3: Finally, we generate the final answer based on the retrieved context in the same manner as RAG.
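Steps 1 and 2 of chunk-by-chunk reading can be sketched as below. This is an illustration under stated assumptions, not the Qwen-Agent code: `ask_model` is a hypothetical callable that sends a prompt to the chat model and returns its reply, the prompt wording is invented for the example, and `toy_model` is a stand-in that lets the sketch run without a real model.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical prompt wording; the actual prompt is not given in this post.
PROMPT = (
    "Question: {query}\n\nPassage: {chunk}\n\n"
    'Quote the sentences relevant to the question, or reply "None".'
)


def extract_relevant(chunks, query, ask_model, max_workers=8):
    # Step 1: query the model once per chunk, in parallel.
    prompts = [PROMPT.format(query=query, chunk=c) for c in chunks]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        replies = list(pool.map(ask_model, prompts))  # map keeps input order
    # Keep only the replies for chunks the model judged relevant.
    return [r.strip() for r in replies if r.strip() != "None"]


def build_search_query(relevant_sentences):
    # Step 2: the extracted sentences become the BM25 search query.
    return " ".join(relevant_sentences)


def toy_model(prompt):
    # Stand-in for the 8k-context chat model (assumption for this demo only):
    # a passage counts as relevant if it mentions bicycles.
    passage = prompt.split("Passage: ", 1)[1].split("\n\n", 1)[0]
    return passage if "bicycle" in passage.lower() else "None"
```

Calling `extract_relevant(chunks, query, toy_model)` returns the quoted sentences in document order. The string from `build_search_query` would then be passed to a BM25 retriever in place of the keyword list used at Level 1.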
[Figure: Dataflows of chunk-by-chunk reading]

Level 3: Step-by-Step Reasoning

A classic challenge in document-based question-answering is multi-hop reasoning. For example, consider answering the question “What vehicle was invented in the same century as the Fifth Symphony was composed?” when given a long document containing relevant facts. The model needs to first determine the answer to the sub-question “In which century was the Fifth Symphony composed?” which is the 19th century. Then, it can realize that a chunk containing “Bicycles were invented in the 19th century” is actually relevant to the original question.

Tool-calling (also known as function-calling) agents or ReAct agents are classic solutions that have built-in capabilities for question decomposition and step-by-step reasoning. We therefore wrap the aforementioned Level-2 agent as a tool to be called by a tool-calling agent. The tool-calling agent conducts multi-hop reasoning as follows:

Ask the Lv3-Agent a question.
while (the Lv3-Agent cannot answer the question based on its memory) {
    The Lv3-Agent proposes a new sub-question to be answered.
    The Lv3-Agent asks the Lv2-Agent the sub-question.
    Add the Lv2-Agent's response to the Lv3-Agent's memory.
}
The Lv3-Agent provides the final answer to the original question.
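The loop above maps onto a few lines of Python. The sketch below is illustrative only: `propose_next` and `ask_lv2` are hypothetical callables standing in for the Lv3-Agent's decision step and the Lv2-Agent tool, the dictionary format they exchange is invented for the example, and the `max_steps` cap is an added safeguard that the pseudocode does not include. The toy agents replay the Fifth Symphony example from a hard-coded fact table.

```python
def answer_multi_hop(question, propose_next, ask_lv2, max_steps=5):
    # memory holds (sub_question, lv2_response) pairs gathered so far.
    memory = []
    for _ in range(max_steps):
        step = propose_next(question, memory)
        if step["type"] == "answer":
            # The Lv3-Agent can answer from its memory: stop here.
            return step["text"]
        # Otherwise the step is a new sub-question for the Lv2-Agent.
        memory.append((step["text"], ask_lv2(step["text"])))
    return None  # assumption: give up once the step cap is reached


# Toy stand-ins so the sketch runs without a model (assumptions for the demo).
FACTS = {
    "In which century was Beethoven's Fifth Symphony composed?": "the 19th century",
    "What vehicle was invented during the 19th century?": "bicycles",
}


def toy_lv2(sub_question):
    return FACTS.get(sub_question, "unknown")


def toy_lv3(question, memory):
    if len(memory) == 0:
        return {"type": "ask",
                "text": "In which century was Beethoven's Fifth Symphony composed?"}
    if len(memory) == 1:
        century = memory[0][1]
        return {"type": "ask",
                "text": f"What vehicle was invented during {century}?"}
    return {"type": "answer", "text": memory[1][1]}
```

Keeping the decision step and the tool as plain callables makes the control flow easy to test in isolation; in a real agent both would be backed by model calls.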
[Figure: Dataflows of step-by-step reasoning]

For example, the Lv3-Agent initially poses a sub-question to the Lv2-Agent: “In which century was Beethoven’s Fifth Symphony composed?” Upon receiving the response, “the 19th century,” the Lv3-Agent formulates a subsequent sub-question: “What vehicle was invented during the 19th century?” By consolidating all the feedback from the Lv2-Agent, the Lv3-Agent can then answer the original question: “What vehicle was invented in the same century that the Fifth Symphony was composed?”

Experiments

We conducted experiments on two benchmarks designed for 256k-token contexts:

NeedleBench is a benchmark designed to test whether a model can identify the most relevant sentences within a context filled with numerous irrelevant ones, akin to finding needles in a haystack. Answering a question may require the simultaneous...