
Meta gives Llama 3 vision, now if only it had a brain

Hands on Meta has been influential in driving the development of open language models with its Llama family, but up until now, the only way to interact with them has been through text.

With the launch of its multimodal models late last month, the Facebook parent has given Llama 3 sight.

According to Meta, these models can now use a combination of images and text prompts to "deeply understand and reason on the combination." For example, the vision models could be used to generate appropriate keywords based on the contents of an image, chart, or graphic, or to extract information from a PowerPoint slide.

You can ask this openly available model, which can be run locally rather than only in the cloud, not just what's in a picture; you can also ask questions or make requests about that content.

That said, in our testing, we found that, just like the scarecrow in the Wizard of Oz, what this model could really use is a brain.

'If I only had a brain'

Meta's vision models are available in both 11 and 90 billion parameter variants, and have been trained on a large corpus of image and text pairs.

These models are by no means the first vision-equipped LLMs we've come across. Microsoft has Phi 3.5 Vision, Mistral has Pixtral, and before that there was Llava, as examples. However, Meta's latest neural networks are notable because this marks the first time the Facebook owner — a corporation that has arguably set the tone for open LLM development — has ventured into the multimodal space.

To find out whether Meta's latest Llama had 20/20 vision or could use a pair of spectacles, we spun up the 11 billion parameter version of the model in a vLLM model server.

We ran the model through a number of scenarios ranging from optical character recognition (OCR) and handwriting recognition to image classification and data analysis to get a sense of where its strengths and weaknesses lie - and pretty much immediately ran into its limitations.

Our first test involved uploading a line chart from the US Bureau of Labor Statistics and asking the model to analyze the graph.

Image:

Tasked with analyzing a basic line chart, Llama 3.2 11B starts to show the limits of its vision capabilities pretty quickly. - Click to enlarge

Prompt: "Provide me a short summary of this graph including what conclusions we can draw from it"

Response:

This graph shows the working-poor rate of people in the labor force for 27 weeks or more, from 1986 to 2022. The graph shows that the working-poor rate reached its peak in 1993 at 6.3% and has been declining since then. However, in 2010 the rate surged to 7.5%. Since then, the rate has been declining, but the rate of decline has slowed down. The lowest rate shown on this graph was 3.8% in 2020, before escalating during the COVID-19 pandemic and ending at 4.1% in 2022.

Some conclusions you can draw from this graph include:

  • The working-poor rate has been declining since 1993, with some fluctuations.
  • The rate surged in 2010 due to the Great Recession.
  • The rate of decline has slowed down since 2016.
  • The lowest rate shown on this graph was 3.8% in 2020.
  • The rate escalated during the COVID-19 pandemic and ended at 4.1% in 2022.

Overall, the graph suggests that the working-poor rate has been declining over the past few decades, but there have been some fluctuations and setbacks, including the surge in 2010 and the escalation during the COVID-19 pandemic.

Initially, the results look promising, but it quickly becomes clear that most of the model's conclusions - and figures - don't match what the graph actually depicts. For example, the model asserts that the working-poor rate peaks in 1993 and declines afterward, which isn't true. It then contradicts itself by citing a higher rate in 2010, which is also incorrect by several points. From there, the errors just keep piling up.

To ensure this wasn't a fluke, we asked the model to analyze a bar chart showing the change in the consumer price index for food, energy, and all items excluding food and energy. It seems like a simple task, given how straightforward the chart is, right? Well, not for this Llama.

Image:

Given an even simpler bar chart, Llama 3.2 11B Vision continues to struggle. - Click to enlarge

Prompt: "Examine this chart and provide a short bullet pointed list of the key take aways and conclusions that can be drawn from it."

Response:

Key Takeaways and Conclusions

Based on the chart, the following key takeaways can be drawn:

Overall Trend: The Consumer Price Index (CPI) experienced a 0.12% overall increase in August 2024, excluding seasonally adjusted changes.

Category Performance:

  • All Items: The CPI for all items rose by 2%, indicating a significant increase in prices.
  • Food: The CPI for food increased by 2%, suggesting that food prices also rose.
  • Energy: The CPI for energy decreased by 2.5%, indicating a decrease in energy prices.
  • All Items Less Food and Energy: The CPI for all items less food and energy increased by 2.5%, suggesting that prices for these items rose.

Comparison: The CPI for all items and food were higher than the CPI for energy and all items less food and energy, indicating that prices for these categories increased more than those of energy and all items less food and energy.

These findings suggest that while prices for all items and food increased, energy prices decreased, and prices for all items less food and energy also increased.

As you can see, once again the model was able to extract some useful information, but it very quickly started making things up.

We ran a few more tests with different chart types including a scatter plot (not pictured) and came up with similar results that, while clearly influenced by the contents of the image, were riddled with errors.

It's possible this is down to how we configured the model in vLLM, and that it's blowing through its context - essentially its short-term memory - and getting lost. However, we found that when this was the case, the model usually just stalled out on us rather than issuing a nonsensical response. vLLM's logs also didn't show anything to suggest we were running out of KV cache.

Instead, we suspect the real issue here is that interpreting a chart, even a simple one, requires a level of reasoning and understanding that an 11-billion-parameter neural network just isn't capable of simulating yet. Perhaps the 90-billion-parameter version of the model performs better here, but unless you've got a server full of H100s or H200s at your disposal, good luck running that at home.

So, while Meta may have given Llama eyes, what it really needs is a brain. But since vision is apparently a much easier problem to solve than artificial general intelligence, we guess we can forgive Meta for having to start somewhere.

Where Llama 3.2 Vision excels

While analyzing images and drawing conclusions from them may not be Llama 3.2 11B's strong point, that's not to say it didn't perform well in other scenarios.

Image recognition and classification

Presented with an image of Boston Dynamics' robodog Spot, snapped on the Siggraph show floor earlier this summer, the model not only identified the bot by name but also offered a fairly comprehensive description of the image.

Image:

Boston Dynamics' Spot robodog explores the Siggraph 2024 show floor. - Click to enlarge

Prompt: "What am I looking at?"

Response: The image shows a yellow and black robot dog, which appears to be a Spot robot developed by Boston Dynamics. The robot has four legs and a body that resembles a dog, with a camera on top of its head. It is standing on a gray carpeted floor, with a wall in the background. The robot is likely being used for research or demonstration purposes.

Yes, computer vision has been able to do object and scene classification for a long while, but as you can see, the LLM nature of the model allows it to riff on the image a little more.

We then fed in a stock image of a mountain lake filled with boats to see how effective the model was at following directions.

Image:

A stock image of people boating on a mountain lake. - Click to enlarge

Prompt: "Break this image down into three zones: foreground, midground, background, and provide a description of each."

Response:

The image is divided into four distinct zones:

  • Foreground: A small wooden rowboat with two people aboard, one wearing a white shirt and the other wearing a light blue shirt, sits in the foreground of the image. They are facing away from the camera, with the boat's bow pointing towards the left side of the image. The water surrounding the boat is a vibrant turquoise color, with gentle ripples and reflections of sunlight.
  • Midground: In the midground, several other boats can be seen floating on the water, each with people on board. The boats are all similar in size and shape to the one in the foreground, and they appear to be moving slowly across the water. The midground also features a stretch of sandy beach, which is lined with trees and other vegetation.
  • Background: The background of the image is dominated by towering mountains that rise up from the water's edge. The mountains are rugged and rocky, with steep cliffs and jagged peaks. They are covered in a thick layer of greenery, including trees and bushes, which adds to their natural beauty. Above the mountains, the sky is a brilliant blue, with only a few wispy clouds scattered across it.

Overall, the image presents a serene and idyllic scene, with the calm waters, lush greenery, and majestic mountains combining to create a sense of peace and tranquility.

Zoning images is another thing computer vision has been able to do for a while, though again, it's potentially interesting to be able to do it with a natural language model like Meta's. In fact, with a lot of these tests, what Llama can do isn't a breakthrough, but that doesn't make it any less useful.

Sentiment analysis

Another potential use case for these kinds of models is sentiment analysis based on facial expressions or body language. One can imagine using these models to get a sense of whether stadium-goers are enjoying a half-time show, or whether customers at a department store are responding favorably to a new marketing campaign.

To put this to the test, we ran a stock image from our back catalog depicting a tired IT worker through the model to see how effective it might be at evaluating the emotional state of a subject from an image.

Image:

Tired IT worker rubs his eyes - Click to enlarge

Prompt: "Evaluate the emotional state of the person in this image based on their facial expressions and body language."

Response:

The person in the image appears to be experiencing frustration or discomfort. Their facial expression and body language suggest that they are holding their glasses up to their face, which could indicate that they are having trouble seeing or are experiencing eye strain. The person's posture and facial expression also convey a sense of tension or annoyance, suggesting that they may be feeling stressed or frustrated about something.

While the model didn't describe the subject as looking tired, it still managed to provide an accurate evaluation of his likely emotional state.

OCR and handwriting recognition

We also found the model had little to no problem with stripping text from images and even hand-written notes.

The first may as well be table stakes for existing machine learning models - your iPhone will do this automatically - but the handwriting recognition was actually quite good. As you can see from the example below, the model was able to convert the note to text with only a handful of errors. That said, the note was by no means chicken scratch, so your mileage may vary.

Image:

While by no means chicken scratch, Llama 3.2 11B actually managed a respectable job of converting this hand-written note to text. - Click to enlarge

Prompt: Convert this handwritten note into text. Correct any spelling and/or grammar errors. Use markdown where appropriate to preserve formatting.

Response:

Multimodal Llama 3.2 llb + 90b

New model architecture with support for image reasoning

  • Built by training adopter weights with a pre-trained image encoder and baking them into Llama 3

Drop in replacement for Llama 3.1

Training Pipeline

  • Starts with Llama 3.1 text model
  • Then adds image adapter + encoder
  • That is pre-trained on large dataset of image-text pairs that are intentionally noisy
  • High quality image-text pairs are then used
  • Post training involves alignment and fine-tuning similar to that seen on text models

Meta describes models as a step toward agentic AI

Though less reliably, we also found the model reasonably effective at extracting the contents of tables and converting them into other formats. In this example, we took a screenshot showing the relative floating point performance of various Nvidia Blackwell GPUs announced this spring and asked Llama 3.2 to convert it to a Markdown table.

Image:

A chart showing Blackwell GPUs' relative floating point performance compared to an AMD MI300X - Click to enlarge

Prompt: Convert this table to markdown

Response:

Apart from difficulties with empty cells, Llama 3.2 actually managed a respectable job extracting the table's contents and converting them to Markdown. - Click to enlarge

The result isn't perfect - the model clearly struggles with the concept of empty cells - but fixing that is relatively straightforward compared to writing the table manually.

Unfortunately, this test was also the least consistent. In some cases, the model would randomly drop a column or omit critical information. So, we wouldn't exactly trust it to work perfectly every time.

Try it for yourself

If you'd like to test out Meta's vision models for yourself, they're relatively easy to get up and running, assuming you've got a beefy enough graphics card.

As we mentioned earlier, we'll be using vLLM, which is commonly used to serve LLMs in production environments. To keep things simple, we'll be running the vLLM OpenAI-compatible API server in a Docker container.

Normally, we'd stick with something like Ollama, as it's dead simple to set up. But at the time of writing, vLLM was one of the few model runners with support for the vision models. We're told Ollama plans to add support for the models in the near future. Meanwhile, just before publication, Rust-based LLM runner Mistral.rs announced support for the model.

Prerequisites:

While this piece isn't intended as a comprehensive vLLM tutorial, if you just want to kick the tires on Meta's multimodal models, here's what you'll need:

  1. A Linux box with an Nvidia graphics card with at least 16 GB of vRAM. We recommend at least 24 GB if you plan to run the unquantized model at full 16-bit precision. For the operating system, we're using Ubuntu 24.04.
  2. This guide also assumes you've installed the latest version of Docker Engine and the latest release of Nvidia's Container Runtime - there's a quick sanity check just after this list. For more information on getting this set up, check out our guide on GPU-accelerated Docker containers here. We also assume you're comfortable using a Linux system at the command line.
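Before pulling 20-odd GB of model weights, it's worth confirming Docker can actually see your GPU. Assuming the Nvidia Container Toolkit is installed and configured as above, running nvidia-smi inside a throwaway Ubuntu container should print your card's details:

docker run --rm --gpus all ubuntu nvidia-smi

If that command errors out instead, sort out the container runtime before going any further.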

Editor's note: You may be able to get these models running on Intel or AMD hardware by manually compiling vLLM for your specific hardware, but doing so is beyond the scope of this tutorial. For more information on that, check out vLLM's docs here.

In this guide, we'll be deploying two containers: The first being vLLM and the second being Open WebUI, which will provide us with a convenient interface for interacting with the model.

Deploying the full model

Assuming you've got an Nvidia GPU with at least 24 GB of vRAM, you can run the full Llama 3.2 11B vision model at 16-bit precision using the one-liner below. Just don't forget to request access to the model here, and replace HUGGING_FACE_API_KEY_HERE in the command with your Hugging Face access token.

Note: depending on your configuration, you may need to run this command with elevated privileges, eg via sudo.

docker run --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=HUGGING_FACE_API_KEY_HERE" -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model meta-llama/Llama-3.2-11B-Vision-Instruct --max-model-len 8192 --max-num-seqs 2 --enforce-eager --gpu-memory-utilization 0.99 --host 0.0.0.0 --port 8000

Depending on how fast your internet connection is, this may take a few minutes to spin up as the model files are roughly 22 GB in size. Once complete, the API server will be exposed on port 8000.
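Before bolting on a web front end, you can give the server a quick test from the command line by firing a request at vLLM's OpenAI-compatible chat completions endpoint. Here's a minimal curl sketch - the prompt and image URL are just placeholders, so swap in your own:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in two sentences."},
                {"type": "image_url", "image_url": {"url": "https://example.com/some-image.jpg"}}
            ]
        }],
        "max_tokens": 256
    }'

Note that the model field needs to match whatever you passed to --model when launching the container.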

Deploy an 8-bit quantized model

If you don't have enough vRAM to run the model at 16-bit precision, you may be able to get away with using an 8-bit quantized version of the model, like this FP8 version from Neural Magic. At this precision, the model needs a little over 11 GB of vRAM and should fit comfortably within a 16 GB card, like the RTX 4060 Ti 16GB or RTX 4080.

Note that quantization can have a negative impact on the quality of model outputs. You can find more information on model quantization here.

docker run --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=HUGGING_FACE_API_KEY_HERE" -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic --max-model-len 8192 --max-num-seqs 2 --enforce-eager --gpu-memory-utilization 0.99 --host 0.0.0.0 --port 8000

Deploy the WebUI

Once you've got the model server running, you can either start querying the Llama 3.2 11B model directly using something like curl, as in the example above, or use a front-end like Open WebUI, which is what we opted to do. We previously explored Open WebUI in our retrieval augmented generation (RAG) guide here, but if you just want to get up and running to test with, you can deploy it using this Docker command:

docker run -d --network=host -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main

Once it's spun up, you can access Open WebUI by visiting http://localhost:8080. If you're running Open WebUI on a different machine or server, you'll need to replace localhost with its IP address or hostname, and make sure port 8080 is open on its firewall or otherwise reachable by your browser.

From there, create a local account (this will automatically be promoted to administrator) and add a new OpenAI API connection under the Admin Panel using localhost:8000/v1 as the address and empty as the API key.

In the Open WebUI connections menu, enter http://localhost:8000/v1 as the address and "empty" as the passkey. - Click to enlarge

Put it to the test:

If everything is configured correctly, you should be able to start a new chat, and select Meta's Llama 3.2 11B model from the drop-down menu at the top of the page.

Next, upload your image, enter a prompt, and press the Enter key to start chatting with the model.

To test out the model, select it from the drop down, upload an image, and enter a prompt. - Click to enlarge

Troubleshooting

If you run into trouble with the model stalling out or refusing to load, you may need to either reduce the size of your image or increase the context window.

The bigger and more complex the image, the more of the model's context it consumes. Llama 3.2 supports a pretty large 128,000 token context window, which is great unless you're memory constrained.

In our guide, we set the context window - the --max-model-len flag - to 8192 tokens in order to avoid running out of memory on a 24 GB card. But if you've got memory to spare, or you opted for a quantized model, you can increase this to something like 12,288 or 16,384. Just know that if you push this too far, you may run into out-of-memory errors; in that case, try dropping --max-model-len in increments of 1024 until the model loads successfully.
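For example, assuming you deployed the FP8 quantized build from earlier and have VRAM to spare, bumping the context to 16,384 tokens is just a matter of changing that one flag; everything else in the command stays the same:

docker run --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=HUGGING_FACE_API_KEY_HERE" -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic --max-model-len 16384 --max-num-seqs 2 --enforce-eager --gpu-memory-utilization 0.99 --host 0.0.0.0 --port 8000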

In our testing, we also noticed that a little prompt engineering can go a long way to getting better results. Giving a vision model a pic and telling it exactly what you want it to do with it, say, "Identify the four most relevant features in this image and assign the relevant photograph keywords," is likely going to render a more useful result than "What is this?"

While this might seem obvious, images can contain a lot of information that our brains will automatically filter out, but the model won't unless it's told to focus on a specific element. So, if you're running into problems getting the model to behave as intended, you might want to spend a little extra time writing a better prompt.

The Register aims to bring you more on using LLMs, vision-capable LLMs, and other AI technologies – without the hype – soon. We want to pull back the curtain and show how this stuff really fits together. If you have any burning questions on AI infrastructure, software, or models, we'd love to hear about them in the comments section below. ®

Editor's Note: The Register was provided an RTX 6000 Ada Generation graphics card by Nvidia and an Arc A770 GPU by Intel to support stories like this. Neither supplier had any input as to the contents of this and other articles.

Source: theregister.com
