Caching LLM responses¶
This notebook demonstrates how to use Cassandra for a basic prompt/response cache.
Such a cache prevents running an LLM invocation more than once for the very same prompt, thus saving on latency and token usage. The cache retrieval logic is based on an exact match, as will be shown.
from langchain.cache import CassandraCache
from cqlsession import getCQLSession, getCQLKeyspace
cqlMode = 'astra_db' # 'astra_db'/'local'
session = getCQLSession(mode=cqlMode)
keyspace = getCQLKeyspace(mode=cqlMode)
Create a CassandraCache and configure it globally for LangChain:
import langchain
langchain.llm_cache = CassandraCache(
    session=session,
    keyspace=keyspace,
)
langchain.llm_cache.clear()
Below is the logic to instantiate the LLM of choice. We chose to leave it in the notebooks for clarity.
import os
from llm_choice import suggestLLMProvider
llmProvider = suggestLLMProvider()
# (Alternatively set llmProvider to 'GCP_VertexAI', 'OpenAI', 'Azure_OpenAI' ... manually if you have credentials)
if llmProvider == 'GCP_VertexAI':
    from langchain.llms import VertexAI
    llm = VertexAI()
    print('LLM from Vertex AI')
elif llmProvider == 'OpenAI':
    os.environ['OPENAI_API_TYPE'] = 'open_ai'
    from langchain.llms import OpenAI
    llm = OpenAI()
    print('LLM from OpenAI')
elif llmProvider == 'Azure_OpenAI':
    os.environ['OPENAI_API_TYPE'] = 'azure'
    os.environ['OPENAI_API_VERSION'] = os.environ['AZURE_OPENAI_API_VERSION']
    os.environ['OPENAI_API_BASE'] = os.environ['AZURE_OPENAI_API_BASE']
    os.environ['OPENAI_API_KEY'] = os.environ['AZURE_OPENAI_API_KEY']
    from langchain.llms import AzureOpenAI
    llm = AzureOpenAI(
        temperature=0,
        model_name=os.environ['AZURE_OPENAI_LLM_MODEL'],
        engine=os.environ['AZURE_OPENAI_LLM_DEPLOYMENT'],
    )
    print('LLM from Azure OpenAI')
else:
    raise ValueError('Unknown LLM provider.')
LLM from OpenAI
%%time
SPIDER_QUESTION_FORM_1 = "How many eyes do spiders have?"
# The first time, it is not yet in cache, so it should take longer
llm(SPIDER_QUESTION_FORM_1)
CPU times: user 122 ms, sys: 7.53 ms, total: 129 ms Wall time: 1.81 s
'\n\nSpiders typically have eight eyes, though some species have six or fewer eyes.'
%%time
# This time we expect a much shorter answer time
llm(SPIDER_QUESTION_FORM_1)
CPU times: user 7.49 ms, sys: 3.5 ms, total: 11 ms Wall time: 148 ms
'\n\nSpiders typically have eight eyes, though some species have six or fewer eyes.'
%%time
SPIDER_QUESTION_FORM_2 = "How many eyes do spiders generally have?"
# This will again take 1-2 seconds, being a different string
llm(SPIDER_QUESTION_FORM_2)
CPU times: user 20 ms, sys: 3.38 ms, total: 23.4 ms Wall time: 1.24 s
'\n\nSpiders generally have eight eyes, although some species may have more or fewer.'
Caching and Chat Models¶
The CassandraCache supports caching within chat-oriented LangChain abstractions such as ChatOpenAI as well:
(warning: the following is demonstrated with OpenAI only for the time being)
from langchain.chat_models import ChatOpenAI
chat_llm = ChatOpenAI(model_name="gpt-3.5-turbo-16k", temperature=0)
%%time
print(chat_llm.predict("Are there spiders with wings?"))
No, there are no spiders with wings. Spiders belong to the class Arachnida, which includes creatures with eight legs and no wings. They rely on their silk-producing abilities to create webs and catch prey, rather than flying.
CPU times: user 17.4 ms, sys: 1.09 ms, total: 18.5 ms Wall time: 2.57 s
%%time
# Expect a much faster response:
print(chat_llm.predict("Are there spiders with wings?"))
No, there are no spiders with wings. Spiders belong to the class Arachnida, which includes creatures with eight legs and no wings. They rely on their silk-producing abilities to create webs and catch prey, rather than flying.
CPU times: user 10.8 ms, sys: 1.64 ms, total: 12.4 ms Wall time: 133 ms
(Actually, every object that inherits from the LangChain Generation class can be seamlessly stored in, and retrieved from, this cache.)
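For illustration, here is a minimal sketch of interacting with the cache directly through its update / lookup methods, storing a list of Generation objects under a prompt and an LLM-identifying string. The demo_prompt and demo_llm_string values below are arbitrary placeholders, not ones produced by an actual LLM:
from langchain.schema import Generation

# placeholder key: a prompt plus a string identifying the "LLM" (illustrative only)
demo_prompt = "How many legs do spiders have?"
demo_llm_string = "demo-llm-settings"

# store a list of Generation objects under this (prompt, llm_string) key ...
langchain.llm_cache.update(demo_prompt, demo_llm_string, [Generation(text="Eight.")])

# ... and retrieve it again with an exact-match lookup
langchain.llm_cache.lookup(demo_prompt, demo_llm_string)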
Stale entry control¶
Time-To-Live (TTL)¶
You can configure a time-to-live (TTL) property on the cache, so that cached entries are automatically evicted once a certain time has elapsed.
Setting langchain.llm_cache to the following will make cached entries expire after one hour (supplying a custom table name is also demonstrated):
cacheWithTTL = CassandraCache(
    session=session,
    keyspace=keyspace,
    table_name="langchain_llm_cache",
    ttl_seconds=3600,
)
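Note that the cell above only creates the TTL-enabled cache. To actually use it, you would point LangChain's global cache at the new instance, along the lines of the (commented-out) line below; it is not run here, so the rest of this notebook keeps using the cache created at the beginning:
# langchain.llm_cache = cacheWithTTL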
Manual cache eviction¶
Alternatively, you can invalidate cached entries one at a time. For that, you'll need to provide the very LLM the entry is associated with:
%%time
llm(SPIDER_QUESTION_FORM_2)
CPU times: user 8.29 ms, sys: 0 ns, total: 8.29 ms Wall time: 192 ms
'\n\nSpiders generally have eight eyes, although some species may have more or fewer.'
langchain.llm_cache.delete_through_llm(SPIDER_QUESTION_FORM_2, llm)
%%time
llm(SPIDER_QUESTION_FORM_2)
CPU times: user 13.5 ms, sys: 7.37 ms, total: 20.9 ms Wall time: 895 ms
'\n\nSpiders typically have eight eyes, although some have fewer and some have more.'
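(As a side note, delete_through_llm is a convenience wrapper that derives the internal LLM-identifying string from the LLM you pass. The cache is keyed on a prompt / LLM-string pair, and CassandraCache also exposes a lower-level delete(prompt, llm_string) method. As a minimal sketch, the following would evict the placeholder entry stored manually earlier with update:)
# evict the entry stored above under the illustrative (demo_prompt, demo_llm_string) key
langchain.llm_cache.delete(demo_prompt, llm_string=demo_llm_string)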
Whole-cache deletion¶
As you might have seen at the beginning of this notebook, you can also clear the cache entirely: all stored entries, for all models, will be evicted at once:
langchain.llm_cache.clear()