What does this writing signal mean?

Qwen (Alibaba Cloud) Writing: Qwen2-VL: To See the World More Clearly

Captured source

qwenlm.github.io/qwenlm.github.io/blog/qwen2-vl

published Aug 29, 2024seen Jun 5captured Jun 7http 200method plain

Qwen2-VL: To See the World More Clearly | Qwen

We have a new blog! View this page at qwen.ai . This page will automatically redirect in 5 seconds. If you are not redirected automatically, please click the button below. Go Now

Qwen2-VL: To See the World More Clearly August 29, 2024 · 17 min · 3569 words · Qwen Team | Translations: 简体中文

DEMO GITHUB HUGGING FACE MODELSCOPE API DISCORD After a year’s relentless efforts, today we are thrilled to release Qwen2-VL ! Qwen2-VL is the latest version of the vision language models based on Qwen2 in the Qwen model familities. Compared with Qwen-VL, Qwen2-VL has the capabilities of: SoTA understanding of images of various resolution & ratio : Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.

Understanding videos of 20min+ : Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc.

Agent that can operate your mobiles, robots, etc. : with the abilities of complex reasoning and decision making, Qwen2-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions.

Multilingual Support : to serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.

We opensource Qwen2-VL-2B and Qwen2-VL-7B with Apache 2.0 license, and we release the API of Qwen2-VL-72B! The opensource is integrated to Hugging Face Transformers, vLLM, and other third-party frameworks. Hope you enjoy! Performance # We evaluate our model’s visual capabilities across six key dimensions: complex college-level problem-solving, mathematical abilities, document and table comprehension, multilingual text-image understanding, general scenario question-answering, video comprehension, and agent-based interactions. Overall, our 72B model showcases top-tier performance across most metrics, often surpassing even closed-source models like GPT-4o and Claude 3.5-Sonnet. Notably, it demonstrates a significant edge in document understanding. At the 7B scale, we’ve managed to retain support for image, multi-image, and video inputs, delivering competitive performance in a more cost-effective model size. Specifically, our model excels in document understanding tasks such as DocVQA and in multilingual text understanding from images, as assessed by MTVQA, establishing state-of-the-art performance. Additionally, we’re excited to introduce a smaller 2B model, optimized for potential mobile deployment. Despite its compact size, this model boasts strong performance in image, video, and multilingual comprehension. It particularly shines in video-related tasks, document understanding, and general scenario question-answering when compared to other models of similar scale. Model Capabilities # 1. Enhanced Recognition Capabilities # Qwen2-VL now boasts improved object recognition, extending beyond plants and landmarks to comprehend complex relationships between multiple objects in a scene. We’ve also significantly boosted the model’s ability to recognize handwritten text and multiple languages within images, making it more accessible to users worldwide. Example: Multi-object Recognition Next User Output the color and number of each box.

Qwen2-VL The image shows a stack of colorful blocks with numbers on them. Here is the color and number of each block from top to bottom: Top row (single block): Blue with the number 9 Second row (two blocks): Light blue with the number 7, and green with the number 8 Third row (three blocks): Purple with the number 4, pink with the number 5, and light green with the number 6 Bottom row (four blocks): Red with the number 0, orange with the number 1, yellow with the number 2, and light green with the number 3

Example: Plant identification Next User What kind of flower is this?

Qwen2-VL The flowers in the picture are forget-me-nots (Myosotis). These small, delicate flowers are known for their vibrant colors, typically blue, pink, or white, and are often used in bouquets and gardens for their charming appearance.

Example: Document Parsing with Dense Formulas Next User Convert the content in the image to Markdown

Qwen2-VL If $i > 0$, let $\underline{d}’’ := (i-1, d-i-l+1, l)$. Let $V_{\underline{d}}$ and $V_{\underline{d}’’}$ be $r+1$-dimensional subspaces of $H^0(\mathcal{L}_{\underline{d}})$ and $H^0(\mathcal{L}_{\underline{d}’’})$, respectively, such that $\varphi_{\underline{d}, \underline{d}’’}(V_{\underline{d}}) = V_{\underline{d}’’}^{X_1, 0}$. Then the distributivity holds in $V_{\underline{d}’’}$ if and only if $$ \dim (V_{\underline{d}}^{X_2, 0} + V_{\underline{d}}^{X_3, 0}) - \dim (V_{\underline{d}’’}^{X_2, 0} + V_{\underline{d}’’}^{X_3, 0}) = r + 1 - \dim (V_{\underline{d}’’}^{X_1, 0} + V_{\underline{d}’’}^{X_2, 0} + V_{\underline{d}’’}^{X_3, 0}) $$

If $i + l \leq d - 1$, let $\underline{d}’’’ := (i, d-i-l-1, l+1)$. Let $V_{\underline{d}}$ and $V_{\underline{d}’’’}$ be $r+1$-dimensional subspaces of $H^0(\mathcal{L}_{\underline{d}})$ and $H^0(\mathcal{L}_{\underline{d}’’’})$, respectively, such that $\varphi_{\underline{d}’’’, \underline{d}}(V_{\underline{d}’’’}) = V_{\underline{d}}^{X_3, 0}$. Then the distributivity holds in $V_{\underline{d}}$ if and only if $$ \dim (V_{\underline{d}’’’}^{X_1, 0} + V_{\underline{d}’’’}^{X_2, 0}) - \dim (V_{\underline{d}}^{X_1, 0} + V_{\underline{d}}^{X_2, 0}) = r + 1 - \dim (V_{\underline{d}}^{X_1, 0} + V_{\underline{d}}^{X_2, 0} + V_{\underline{d}}^{X_3, 0}), $$ where the maps $\varphi_{\underline{d}, \underline{d}’’}$ and $\varphi_{\underline{d}’’’, \underline{d}}$ in Proposition 3.14 are the maps linking the corresponding sheaves. Another important result is Proposition 3.16, which establishes an inequality for any exact limit linear series. Specifically, our Proposition 3.16 says:

Let ${(\mathcal{L}_{\underline{d}}, V_{\underline{d}})}_{\underline{d}}$ be an exact limit linear series of degree $d$ and dimension $r$. Then $$...

Excerpt shown — open the source for the full document.

Notability

notability 7.0/10

Notable vision-language model release