
Caching Input with Google Gemini

A little over a month ago, Google announced multiple updates to their GenAI platform. I made a note of it for research later and finally got time to look at one aspect - context caching.

When you send prompts to a GenAI system, your input is tokenized for analysis. While it isn't a one-token-per-word relationship, in general the bigger the input (context), the higher the cost (tokens). Converting your input into tokens also takes time, especially when dealing with large media, for example a video. Google introduced a "Context caching" system that helps improve the performance of your queries. As the docs suggest, this is really suited for cases where you've got a large initial input (a video, a text file) and then follow up with multiple questions related to that content.

At this time, speed improvements aren't really baked in, but cost improvements definitely are. Imagine a prompt based on a video: your cost will be X, where X covers the token count of your text prompt plus the video. With caching, Gemini instead charges for the token count of your prompt plus a reduced rate for your cached content. Honestly, this was a bit hard to grok at first, but a big thank you to Vishal Dharmadhikari at Google for patiently explaining it to me.

You can see current cost details here: Current prices
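
To make the cost model concrete, here's a rough back-of-napkin sketch. The rates below are made-up placeholders, not Google's actual prices (check the pricing page above), and I'm leaving out the per-hour storage charge the docs mention for cached content; the point is just the shape of the math: your question is billed at the full input rate, while the big cached asset is billed at a reduced rate.

// Hypothetical rates, purely for illustration -- not real prices
const INPUT_RATE = 1.0;   // cost per 1M regular input tokens
const CACHED_RATE = 0.25; // cost per 1M cached input tokens

const promptTokens = 15;      // the actual question
const cachedTokens = 190_000; // the big asset (book, video) sitting in the cache

const withoutCache = ((promptTokens + cachedTokens) / 1_000_000) * INPUT_RATE;
const withCache = (promptTokens / 1_000_000) * INPUT_RATE
	+ (cachedTokens / 1_000_000) * CACHED_RATE;

console.log({ withoutCache, withCache });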

The docs do a good job of explaining how to use it, but I really wanted a demo I could run locally to see it in action, and to create a test where I could compare timings to see how much the cache helped.

Caveats #

Again, these are documented, but honestly, I missed them both.

  • You must use a specific version of a model. In other words, not gemini-1.5-pro but rather gemini-1.5-pro-001.
  • Gemini has a free tier where you can create a key in a project with no credit card attached. This feature is not available in the free tier, and I found the error message you get in that case a bit hard to grok.

Ok, with that in mind, let's look at how it's used.

My code is modified slightly from the docs, but credit to Google for documenting this well. Before getting into code, a high-level look:

  • First, you use the Files API to get your asset into Google's cloud. Note that this API has changed since my blog post back in May.
  • Second, you create a cache. This is very similar to creating a model.
  • Third, you actually get the model using a special function that integrates with the cache.

After that, you can run prompts at will against the model.

Here's my code, and honestly, it is a bit messy, but hopefully understandable.

Let's start with the imports:

import {
  GoogleGenerativeAI
} from '@google/generative-ai';
import { FileState, GoogleAIFileManager, GoogleAICacheManager } from '@google/generative-ai/server';

Next, some constants. By the way, I'm not using const much anymore, so when you see it, it's just code I haven't bothered to change to let.

const MODEL_NAME = 'models/gemini-1.5-pro-001';
const API_KEY = process.env.GEMINI_API_KEY;
const fileManager = new GoogleAIFileManager(API_KEY);
const cacheManager = new GoogleAICacheManager(API_KEY);
const genAI = new GoogleGenerativeAI(API_KEY);

Next, I defined my system instructions. This will be used for both model objects I create in a bit.

// System instructions used for both tests
let si = 'You are an English professor for middle school students and can provide help for students struggling to understand classical works of literature.';

Now my code handles uploading my content, in this case, a 755K text version of "Pride and Prejudice":

// Upload a local file via the Files API, then poll until Google finishes processing it
async function uploadToGemini(path, mimeType) {
	const fileResult = await fileManager.uploadFile(path, {
		mimeType,
		displayName: path,
	});
	let file = await fileManager.getFile(fileResult.file.name);
	while(file.state === FileState.PROCESSING) {
		console.log('Waiting for file to finish processing');
		await new Promise(resolve => setTimeout(resolve, 2_000));
		file = await fileManager.getFile(fileResult.file.name);
	}
	return file;
}
// First, upload the book to Google 
let book = './pride_and_prejudice.txt';
let bookFile = await uploadToGemini(book, 'text/plain');
console.log(`${book} uploaded to Google.`);

At this point, we can create our cache:

let cache = await cacheManager.create({
	model: MODEL_NAME, 
	displayName:'pride and prejudice', 
	systemInstruction:si,
	contents: [
		{
			role:'user',
			parts:[
				{
					fileData: {
						mimeType:bookFile.mimeType, 
						fileUri: bookFile.uri
					}
				}
			]
		}
	],
	ttlSeconds: 60 * 10 // ten minutes
});

Note that this is very similar to how you create a model normally. It's got the model name, system instructions, and a reference to the file.

The cache object returned there is your only handle on the cache from code. There are APIs to list, update, and delete caches, but you can't get that same reference back once the script execution ends.
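
As a rough sketch of what that management surface looks like with the cacheManager created earlier (the method names reflect my reading of the @google/generative-ai/server SDK, so double-check them against the docs):

// List the caches that currently exist for this API key
const { cachedContents } = await cacheManager.list();
for (const c of cachedContents ?? []) {
	console.log(c.name, c.displayName, c.expireTime);
}

// There's also an update API for things like extending the TTL.

// Delete the cache by name when you're done, so you stop paying for storage
await cacheManager.delete(cache.name);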

To get the actual model you can run prompts on, you then do:

let genModel = genAI.getGenerativeModelFromCachedContent(cache);

Running a prompt against it then works like any other model. As an example:

// used for both tests.
let contents = [
		{
			role:'user',
			parts: [
				{
					text:'Describe the major themes of this work and then list the major characters.'
				}
			]
		}
	];
let result = await genModel.generateContent({
	contents
});
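
The result object gives you both the generated text and the token accounting. Here's a small sketch of pulling those out of the response returned above:

// The generated answer
console.log(result.response.text());

// usageMetadata is where the cached vs. non-cached token counts show up
console.log(result.response.usageMetadata);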

And that's it, really. I've got a complete script that demos this in action and compares it against a non-cached model. It reports the timings, which, again, at this point do not show the cached version being quicker, but it also reports the usageMetadata, which shows the impact of the cached token count against your total. Here's an example with the cache:

{
  promptTokenCount: 189940,
  candidatesTokenCount: 591,
  totalTokenCount: 190531,
  cachedContentTokenCount: 189925
}
with cache, duration is 52213
{
  promptTokenCount: 189935,
  candidatesTokenCount: 251,
  totalTokenCount: 190186,
  cachedContentTokenCount: 189925
}
with cache, second prompt, duration is 19117
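
For reference, here's my sketch of what the non-cached comparison looks like: a regular model, with the uploaded file passed directly in the prompt (the exact code is in the script linked at the end).

// Non-cached comparison: a regular model, with the uploaded file included in the prompt itself
let regularModel = genAI.getGenerativeModel({
	model: MODEL_NAME,
	systemInstruction: si
});

let regularResult = await regularModel.generateContent({
	contents: [
		{
			role:'user',
			parts: [
				{ fileData: { mimeType: bookFile.mimeType, fileUri: bookFile.uri } },
				{ text: 'Describe the major themes of this work and then list the major characters.' }
			]
		}
	]
});

console.log(regularResult.response.usageMetadata);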

And here's the report when the cache isn't used:

{
  promptTokenCount: 189939,
  candidatesTokenCount: 790,
  totalTokenCount: 190729
}
without cache, duration is 29005
{
  promptTokenCount: 189934,
  candidatesTokenCount: 181,
  totalTokenCount: 190115
}
without cache, second prompt, duration is 11707

Again, the timings above show that with the cache, the calls were actually slower, but cost-wise it's a big win: 189,925 of the 189,940 prompt tokens came from the cache and are billed at the reduced rate, so only a handful of tokens are charged at the full input price. That's huge. If you want the complete script (and source book), you can find it here: https://github.com/cfjedimaster/ai-testingzone/tree/main/cache_test

Source: raymondcamden.com
