
Promptly Cached: AI Efficiency Unleashed

Have you ever wondered what “Cached Input” means when OpenAI lists it on its model pricing page (https://platform.openai.com/docs/pricing)? Or, if you were to run one of the open-source models yourself, how you would implement such a low-cost cached-input technique?

As we all know, LLMs take the entire input and generate one token at a time, which means that to produce the next token they re-process the entire previous input plus the token they just generated. So in the conversation below, even though it is a single conversation, the LLM puts in much of the same effort re-reading the same input on every turn.

Conversation: [message1] -> response1
Conversation: [message1, response1, message2] -> response2
Conversation: [message1, response1, message2, response2, message3] -> response3
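The growing prompt above can be sketched in a few lines of Python. Here `echo_model` is a hypothetical stand-in for a real model call; it simply reports how much input it had to read:

```python
history = []

def chat_turn(message, generate):
    # Each turn re-sends the entire conversation so far, so the model
    # re-reads every earlier message and response.
    history.append(message)
    prompt = "\n".join(history)
    response = generate(prompt)
    history.append(response)
    return response

# Hypothetical stand-in for a model: it just reports its input size.
def echo_model(prompt):
    return "(read %d words)" % len(prompt.split())

chat_turn("message1", echo_model)
chat_turn("message2", echo_model)
r3 = chat_turn("message3", echo_model)
print(r3)  # by the third turn, the "model" is re-reading every earlier line
```

The input the model must re-read grows with every message, and so does the cost; caching is how we avoid paying it repeatedly.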

Caching tackles this pain point directly: by storing previously generated responses or intermediate results, it drastically reduces redundant computation.

Understanding Caching in AI

Caching is like a smart shortcut—it saves previous results instead of redoing the same heavy computations. This reduces resource usage and speeds up responses.

Implementing Exact-Match Caching with MD5 Hashing

Exact-match caching is the simplest form of caching, and it is ideal when the system receives repeated, identical queries. We can use MD5 hashing, which generates a compact fingerprint for each prompt, enabling quick cache lookups.

import hashlib

def get_hash(prompt):
    # An MD5 digest of the prompt serves as the cache key.
    return hashlib.md5(prompt.encode("utf-8")).hexdigest()

cache = {}

prompt = "What is the capital of France?"
key = get_hash(prompt)
cache[key] = "Paris"  # store the generated response under its key

if key in cache:
    print(cache[key])  # cache hit: return "Paris" without calling the model

Since this is an exact-match hash, even prompts with the same intent, such as “What is the capital of France?” and “capital of France?”, generate completely different hashes, so the cache misses.
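A quick check makes the mismatch concrete: two prompts with the same intent produce unrelated MD5 keys.

```python
import hashlib

key1 = hashlib.md5("What is the capital of France?".encode("utf-8")).hexdigest()
key2 = hashlib.md5("capital of France?".encode("utf-8")).hexdigest()

print(key1 == key2)  # False: same intent, disjoint cache entries
```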

Solving such a problem would require a better technique.

Semantic Caching Using Vector Embeddings

As mentioned above, exact matching isn’t always sufficient. Often, two prompts may differ slightly yet share the same semantic intent. To handle such scenarios, we can use vector embeddings. Embeddings transform textual inputs into numeric vectors, allowing semantic similarity checks using metrics such as cosine similarity.

import math
import random

def get_embedding(prompt):
    # Stand-in for a real embedding model: a seeded random vector of
    # size 300, so the same prompt always maps to the same vector.
    rng = random.Random(prompt)
    return [rng.random() for _ in range(300)]

def cosine_similarity(vector_a, vector_b):
    dot = sum(a * b for a, b in zip(vector_a, vector_b))
    norm_a = math.sqrt(sum(a * a for a in vector_a))
    norm_b = math.sqrt(sum(b * b for b in vector_b))
    return dot / (norm_a * norm_b)

embedding1 = get_embedding("Order pizza delivery at Home")
embedding2 = get_embedding("Order pizza at my Home location")

similarity = cosine_similarity(embedding1, embedding2)

if similarity > 0.9:
    print("Reuse cached response")
else:
    print("Process new prompt")

Prompt 1: “Show me how to optimize a SQL query for large datasets.”

Prompt 2: “How do I speed up a SQL query on a table with millions of rows?”

A vector search might treat them as the same and return a generic SQL optimization guide. But what if the second query specifically needs indexing strategies for MySQL?

Here, a context-aware LLM can refine the response based on the exact intent rather than relying solely on semantic similarity.

Enhancing Caching with Context-Aware Local LLMs

A locally hosted LLM can provide a deeper, context-aware assessment, bridging gaps where vector embeddings struggle. This method leverages the nuanced language understanding of a language model while remaining computationally efficient.

def local_llm_compare(prompt_a, prompt_b):
    # Stand-in for a call to a locally hosted LLM that judges whether
    # two prompts share the same intent; a trivial keyword check keeps
    # the example self-contained.
    return "order" in prompt_a.lower() and "order" in prompt_b.lower()

prompt_a = "Place an order for food delivery"
prompt_b = "Order food delivery for dinner"

if local_llm_compare(prompt_a, prompt_b):
    print("Prompts are similar; reuse cached data")
else:
    print("Prompts differ; process new prompt")

By adding a layer of contextual intelligence, local LLM-based caching greatly improves caching decisions, balancing accuracy and computational efficiency.

Hybrid Approach: Integrating Multiple Caching Techniques

To maximize AI performance, combining these caching strategies into a cohesive pipeline is highly effective. This hybrid caching strategy sequentially applies different methods, each complementing the strengths of others:

1. Exact match (MD5 hashing) quickly filters repeated queries.

2. Semantic embeddings identify similar but not identical prompts.

3. Local, context-aware LLMs handle nuanced edge cases with deep contextual understanding.

By structuring caching layers in this way, AI systems can respond rapidly, accurately, and efficiently to a wide variety of inputs.
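One way to sketch that layered pipeline in Python is below. Note that `get_embedding` and `local_llm_compare` are simplified stand-ins (a seeded random vector and a word-overlap check) for a real embedding model and a real local LLM, so this is an illustration of the layering, not a production implementation:

```python
import hashlib
import math
import random

def get_hash(prompt):
    return hashlib.md5(prompt.encode("utf-8")).hexdigest()

def get_embedding(prompt):
    # Stand-in for a real embedding model (seeded for determinism).
    rng = random.Random(prompt)
    return [rng.random() for _ in range(300)]

def cosine_similarity(vector_a, vector_b):
    dot = sum(a * b for a, b in zip(vector_a, vector_b))
    norm_a = math.sqrt(sum(a * a for a in vector_a))
    norm_b = math.sqrt(sum(b * b for b in vector_b))
    return dot / (norm_a * norm_b)

def local_llm_compare(prompt_a, prompt_b):
    # Stand-in for a local LLM intent check: any shared word counts.
    return bool(set(prompt_a.lower().split()) & set(prompt_b.lower().split()))

class HybridCache:
    def __init__(self, threshold=0.9):
        self.exact = {}      # MD5 key -> cached response
        self.semantic = []   # (prompt, embedding, response) triples
        self.threshold = threshold

    def get(self, prompt):
        # Layer 1: exact match on the MD5 key.
        key = get_hash(prompt)
        if key in self.exact:
            return self.exact[key]
        # Layer 2: semantic similarity over stored embeddings.
        emb = get_embedding(prompt)
        for cached_prompt, cached_emb, response in self.semantic:
            if cosine_similarity(emb, cached_emb) > self.threshold:
                # Layer 3: confirm intent with the local LLM before reuse.
                if local_llm_compare(prompt, cached_prompt):
                    return response
        return None  # full miss: call the expensive model

    def put(self, prompt, response):
        self.exact[get_hash(prompt)] = response
        self.semantic.append((prompt, get_embedding(prompt), response))

cache = HybridCache()
cache.put("What is the capital of France?", "Paris")
print(cache.get("What is the capital of France?"))  # Paris (exact hit)
```

Each layer is more expensive than the one before it, so ordering them cheapest-first means most hits are resolved before the costlier checks ever run.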

Caching is not merely an optimization—it’s fundamental to making AI applications more responsive, cost-effective, and scalable. By strategically deploying exact-match caching, semantic embeddings, and local LLM-based caching, developers can profoundly enhance AI performance. Understanding and effectively implementing these strategies ultimately leads to more robust, efficient, and user-friendly AI solutions.
