Alibaba announces Qwen2-VL AI model with advanced video analysis and reasoning capabilities

Alibaba Cloud, the cloud computing arm of China’s Alibaba Group Ltd., announced Thursday the release of a new artificial intelligence model named Qwen2-VL that offers advanced visual comprehension and multilingual conversational capabilities.

The company, which spent a year developing the new model on the foundation of its earlier Qwen-VL AI model, said it can understand high-quality videos more than 20 minutes long.

According to Alibaba, it can summarize video content, answer questions about it and maintain a continuous flow of conversation in real time, including for live chat support. As a result, it can act as a personal assistant, using information drawn directly from video content.

In one example, the model was given a video of what appeared to be a short documentary clip about the International Space Station, including a scene of the control center and a shot of two astronauts speaking from within a capsule while floating in space.

It’s not perfect. When asked to summarize the scene, the model responded with a clear output that described the individuals speaking and the control room, and that stated “the men appear to be astronauts, and they are wearing space suits.” The astronauts were not wearing space suits; they appeared to be wearing collared shirts and pants.

When asked what color clothing the astronauts were wearing, the model answered correctly: “The two astronauts are wearing blue and black clothes.” One man is indeed wearing a blue shirt and the other a black shirt.

The model can serve as a foundation for real-time, text-based live chat, where users can talk with the model and ask it questions about a video. It is also capable of function calling and tool use based on vision, enabling it to retrieve and access external data, such as flight statuses, weather forecasts and package tracking. That would make it useful for customer service interactions and for workers in the field, who could show it images of products, bar codes or other information.
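To illustrate what function calling means in practice, here is a minimal sketch of how an application might route a tool request produced by a model. Everything in it is a hypothetical assumption for illustration: the JSON shape, the tool names `get_flight_status` and `get_weather`, and the stubbed return values are not taken from Alibaba's documentation, and a real deployment would call live services instead.

```python
import json

# Hypothetical tools with stubbed data; real ones would query external services.
def get_flight_status(flight_number):
    return {"flight": flight_number, "status": "on time"}

def get_weather(city):
    return {"city": city, "forecast": "sunny"}

# Registry mapping tool names to the functions that implement them.
TOOLS = {
    "get_flight_status": get_flight_status,
    "get_weather": get_weather,
}

def dispatch_tool_call(raw_call):
    """Parse a JSON tool request (assumed format) and run the matching tool."""
    call = json.loads(raw_call)
    name = call["name"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**call.get("arguments", {}))

# Example: the model reads a flight number off a photographed boarding pass
# and emits a request like this one.
request = '{"name": "get_flight_status", "arguments": {"flight_number": "CA123"}}'
result = dispatch_tool_call(request)
```

In this arrangement the model only decides which tool to call and with what arguments; the application runs the tool and passes the result back to the model so it can phrase an answer.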

Alibaba said the model builds on the Qwen-VL architecture, continuing to pair a Vision Transformer model, or ViT, with the Qwen2 language model. The company said it used a ViT with about 600 million parameters to handle both image and video inputs at the same time.

The model was enhanced with Native Dynamic Resolution support, which allows it to handle images of arbitrary resolution, an upgrade over its predecessor. The addition of a Multimodal Rotary Position Embedding system, or M-RoPE, further enables the model to capture 1D textual, 2D visual and 3D video positional information at the same time.
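The two mechanisms described above can be sketched in a few lines of Python. This is a simplified illustration, not Alibaba's implementation. It assumes a patch size of 14 pixels and a merge of each 2x2 group of patches into one token, and it rounds image sides up to a whole number of merged patches; those values and the rounding rule are assumptions. The position scheme gives text tokens the same index in all three components, while vision tokens get separate time, row and column indices.

```python
import math

def visual_token_count(height, width, patch_size=14, merge_size=2):
    """Estimate how many tokens an image yields when resolution is not fixed."""
    unit = patch_size * merge_size  # pixels covered by one merged token per side
    grid_h = math.ceil(height / unit) * merge_size  # patches along the height
    grid_w = math.ceil(width / unit) * merge_size   # patches along the width
    return (grid_h * grid_w) // (merge_size ** 2)

def mrope_position_ids(segments):
    """Return one (time, row, col) position triple per token.

    segments: ("text", n_tokens) or ("vision", frames, rows, cols).
    """
    ids = []
    start = 0  # next free position index
    for seg in segments:
        if seg[0] == "text":
            _, n = seg
            # Text: all three components share the same 1D index.
            ids.extend((start + i,) * 3 for i in range(n))
            start += n
        else:
            _, frames, rows, cols = seg
            # Vision: time varies per frame, row and col vary within a frame.
            for t in range(frames):
                for r in range(rows):
                    for c in range(cols):
                        ids.append((start + t, start + r, start + c))
            # Following tokens resume after the largest index used so far.
            start += max(frames, rows, cols)
    return ids

tokens_small = visual_token_count(224, 224)
tokens_wide = visual_token_count(224, 448)
positions = mrope_position_ids([("text", 2), ("vision", 1, 2, 2), ("text", 1)])
```

Under these assumptions, a larger image simply produces more tokens rather than being squeezed into a fixed grid, and a single image behaves like a one-frame video whose time index stays constant.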

Qwen2-VL is available as open source under the highly permissive Apache 2.0 license in two sizes, Qwen2-VL-2B and Qwen2-VL-7B. The company also released a demo running the 7 billion-parameter model on Hugging Face.

The model does have its limitations, the company noted, as it is unable to extract audio from video files, given that it’s designed only for visual reasoning. Its training data also extends only to June 2023, and it cannot guarantee complete accuracy for complex instructions or scenarios. However, Alibaba said the model posted top-tier benchmark results across most metrics, even surpassing closed-source models such as OpenAI’s flagship GPT-4o and Anthropic PBC’s Claude 3.5 Sonnet.

The company said the Qwen2-VL family will be a stepping stone toward stronger vision language models. It plans to integrate more features on the path toward an “omni” model that will be able to reason across both vision and audio.
