WritingQwen (Alibaba Cloud)Qwen (Alibaba Cloud)published Jan 25, 2024seen 6d

Introducing Qwen-VL

Open original ↗

Captured source

source ↗
published Jan 25, 2024seen 6dcaptured 3dhttp 200method plain

Introducing Qwen-VL | Qwen

We have a new blog! View this page at qwen.ai . This page will automatically redirect in 5 seconds. If you are not redirected automatically, please click the button below. Go Now

Introducing Qwen-VL January 25, 2024 · 12 min · 2505 words · Qwen Team | Translations: 简体中文

Along with the rapid development of our large language model Qwen, we leveraged Qwen’s capabilities and unified multimodal pretraining to address the limitations of multimodal models in generalization, and we opensourced multimodal model Qwen-VL in Sep. 2023. Recently, the Qwen-VL series has undergone a significant upgrade with the launch of two enhanced versions, Qwen-VL-Plus and Qwen-VL-Max. The key technical advancements in these versions include: Substantially boost in image-related reasoning capabilities; Considerable enhancement in recognizing, extracting, and analyzing details within images and texts contained therein; Support for high-definition images with resolutions above one million pixels and images of various aspect ratios.

Model Name Model Description qwen-vl-plus Qwen's Enhanced Large Visual Language Model . Significantly upgraded for detailed recognition capabilities and text recognition abilities, supporting ultra-high pixel resolutions up to millions of pixels and arbitrary aspect ratios for image input. It delivers significant performance across a broad range of visual tasks. qwen-vl-max Qwen's Most Capable Large Visual Language Model . Compared to the enhanced version, further improvements have been made to visual reasoning and instruction-following capabilities, offering a higher level of visual perception and cognitive understanding. It delivers optimal performance on an even broader range of complex tasks. Compared to the open-source version of Qwen-VL, these two models perform on par with Gemini Ultra and GPT-4V in multiple text-image multimodal tasks, significantly surpassing the previous best results from open-source models. Notably, Qwen-VL-Max outperforms both GPT-4V from OpenAI and Gemini from Google in tasks on Chinese question answering and Chinese text comprehension. This breakthrough underscores the model’s advanced capabilities and its potential to set new standards in the field of multimodal AI research and application. Model DocVQA Document understanding ChartQA Chart understanding AI2D Science diagrams TextVQA Text reading MMMU College-level problems MathVista Mathematical reasoning MM-Bench-CN Natural image QA in Chinese Other Best Open-source LVLM 81.6% (CogAgent) 68.4% (CogAgent) 73.7% (Fuyu-Medium) 76.1% (CogAgent) 45.9% (Yi-VL-34B) 36.7% (SPHINX-V2) 72.4% (InternLM-XComposer-VL) Gemini Pro 88.1% 74.1% 73.9% 74.6% 47.9% 45.2% 74.3% Gemini Ultra 90.9% 80.8% 1 79.5% 1 82.3% 1 59.4% 1 53.0% 1 - GPT-4V 88.4% 78.5% 78.2% 78.0% 56.8% 49.9% 73.9% Qwen-VL-Plus 91.4% 78.1% 75.9% 78.9% 45.2% 43.3% 68.0% Qwen-VL-Max 93.1% 1 79.8% 2 79.3% 2 79.5% 2 51.4% 3 50.0% 2 75.1% 1 New versions of the Qwen-VL-Plus and Qwen-VL-Max models not only exhibit exceptional benchmark performance but also demonstrate a marked improvement in problem-solving capabilities within real-world scenarios. These advanced models effortlessly engage in dialogue, identify celebrities and landmarks, generate text, and notably, have significantly enhanced their abilities to describe and interpret visual content. Here we present some practical examples: 1. Basic Recognition Capabilities # The latest Qwen-VL models are now more adept at accurately describing and identifying complex information within images, as well as providing detailed background and answering related questions. For instance, Qwen-VL can recognize not only common objects, but also celebrities and landmarks. Qwen-VL can write poetry in various languages inspired by visuals, and analyze everyday screenshots. Example: celebrity recognition Next User who are they?

Qwen-VL These are some of the most famous and highest-paid celebrities in the world. From left to right, top to bottom: Justin Bieber (Canadian singer) Lionel Messi (Argentine soccer player) Cristiano Ronaldo (Portuguese soccer player) Elon Musk (South African-born American entrepreneur and businessman) Kylie Jenner (American media personality and socialite) Taylor Swift (American singer-songwriter) Robert Downey Jr. (American actor) Scarlett Johansson (American actress)

It’s worth noting that this is just a small sample of the many talented and successful celebrities out there, and that fame and wealth are not the only measures of success or importance.

User their chinese name? Qwen-VL Here are the Chinese names of the celebrities mentioned: Justin Bieber: 贾斯汀·比伯 Lionel Messi: 莱昂内尔·梅西 Cristiano Ronaldo: 克里斯蒂亚诺·罗纳尔多 Elon Musk: 埃隆·马斯克 Kylie Jenner: 凯莉·詹娜 Taylor Swift: 泰勒·斯威夫特 Robert Downey Jr.: 小罗伯特·唐尼 Scarlett Johansson: 斯嘉丽·约翰逊

Note that these are just one possible translation of their names into Chinese, and there may be other variations used in different contexts or regions.

Example: landmark recognition Next User Where is it in the picture? How did you recognize it?

Qwen-VL The picture shows a busy street at night in New York City, USA. The Empire State Building can be seen in the background, which is a distinctive landmark of New York City. Therefore, I recognized it as New York City.

Example: poem writing Next User Look at pictures and write a poem.

Qwen-VL A rocky crag by the sea, A place of solitude and mystery. The water calm and blue, Reflecting the sky above, anew. The mountain stands tall and proud, Its peak reaching towards the clouds. A symbol of strength and resilience, Enduring through time’s relentless persistence. The world around may change and shift, But this mountain remains steadfast and uplifted. A reminder to us all, To stand firm in our own personal call. So let us take a moment to pause, And appreciate this natural cause. For in its beauty and grandeur, We find inspiration and renewal, forever.

Example: mobile app screenshot understanding Next User Describe the image. Qwen-VL The image is a screenshot of an app displaying a map of Beijing, China. The map is centered on the Summer Palace and Yuyuantan Park. The app provides directions for driving from Yuyuantan Park to the Summer Palace. It offers two options: one takes 22 minutes and the other takes 25 minutes. The app also indicates that there are local plate restrictions on the route.

2. Visual Agent Capability: The Essential Role of Localization # Beyond…

Excerpt shown — open the source for the full document.

Notability

Scored, but no written rationale attached yet.

Qwen (Alibaba Cloud) has a writing signal matching data demand, infrastructure, product and customer.