VectorStore/QA, learn more
NOTE: this uses Cassandra's "Vector Search" capability. Make sure you are connecting to a vector-enabled database for this demo.
In the previous Quickstart, you created the index and added the corpus of text to it in a single step.
In most cases these two operations happen at different times; moreover, new documents often keep being ingested afterwards.
This notebook demonstrates further interactions you can have with a Cassandra Vector Store.
It is assumed you have run the "VectorStore/QA, Quickstart" notebook (so that the vector store is not empty).
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document
The setup is similar to the one you saw:
from langchain.vectorstores.cassandra import Cassandra
from cqlsession import getCQLSession, getCQLKeyspace
cqlMode = 'astra_db' # 'astra_db'/'local'
session = getCQLSession(mode=cqlMode)
keyspace = getCQLKeyspace(mode=cqlMode)
Below is the logic to instantiate the LLM and embeddings of choice. We chose to leave it in the notebooks for clarity.
import os
from llm_choice import suggestLLMProvider
llmProvider = suggestLLMProvider()
# (Alternatively set llmProvider to 'GCP_VertexAI', 'OpenAI', 'Azure_OpenAI' ... manually if you have credentials)
if llmProvider == 'GCP_VertexAI':
    from langchain.llms import VertexAI
    from langchain.embeddings import VertexAIEmbeddings
    llm = VertexAI()
    myEmbedding = VertexAIEmbeddings()
    print('LLM+embeddings from Vertex AI')
elif llmProvider == 'OpenAI':
    os.environ['OPENAI_API_TYPE'] = 'open_ai'
    from langchain.llms import OpenAI
    from langchain.embeddings import OpenAIEmbeddings
    llm = OpenAI(temperature=0)
    myEmbedding = OpenAIEmbeddings()
    print('LLM+embeddings from OpenAI')
elif llmProvider == 'Azure_OpenAI':
    os.environ['OPENAI_API_TYPE'] = 'azure'
    os.environ['OPENAI_API_VERSION'] = os.environ['AZURE_OPENAI_API_VERSION']
    os.environ['OPENAI_API_BASE'] = os.environ['AZURE_OPENAI_API_BASE']
    os.environ['OPENAI_API_KEY'] = os.environ['AZURE_OPENAI_API_KEY']
    from langchain.llms import AzureOpenAI
    from langchain.embeddings import OpenAIEmbeddings
    llm = AzureOpenAI(temperature=0, model_name=os.environ['AZURE_OPENAI_LLM_MODEL'],
                      engine=os.environ['AZURE_OPENAI_LLM_DEPLOYMENT'])
    myEmbedding = OpenAIEmbeddings(model=os.environ['AZURE_OPENAI_EMBEDDINGS_MODEL'],
                                   deployment=os.environ['AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT'])
    print('LLM+embeddings from Azure OpenAI')
else:
    raise ValueError('Unknown LLM provider.')
LLM+embeddings from OpenAI
Re-use an existing Vector Store
When you create this Cassandra vector store, it re-connects to the data already in the database.
In practice, you are loading an existing, pre-populated vector store for further usage.
(Make sure you use the very same embedding function every time! This is why there is a separate table for each embedding function, i.e. for each llmProvider.)
myCassandraVStore = Cassandra(
    embedding=myEmbedding,
    session=session,
    keyspace=keyspace,
    table_name='vs_test1_' + llmProvider,
)
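To illustrate why the embedding function must stay the same between writes and reads, here is a minimal, self-contained sketch. The two toy embedding functions (`embed_a`, `embed_b`) and the `table_name_for` helper are hypothetical and stand in for real embedding models; only the `'vs_test1_'` prefix comes from this notebook.

```python
import math


def embed_a(text: str) -> list[float]:
    """Toy embedding A (hypothetical): counts of the letters a..e."""
    return [float(text.lower().count(ch)) for ch in "abcde"]


def embed_b(text: str) -> list[float]:
    """Toy embedding B (hypothetical): same counts, dimensions in reverse order."""
    return [float(text.lower().count(ch)) for ch in "edcba"]


def cosine(u: list[float], v: list[float]) -> float:
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def table_name_for(provider: str) -> str:
    """One table per embedding function, mirroring 'vs_test1_' + llmProvider."""
    return "vs_test1_" + provider


text = "a cab arrived"
stored_vector = embed_a(text)                   # what was written at ingestion time
same = cosine(stored_vector, embed_a(text))     # query with the same function
mixed = cosine(stored_vector, embed_b(text))    # query with a different function

print(f"same function: {same:.3f}, mixed functions: {mixed:.3f}")
print(table_name_for("OpenAI"), table_name_for("GCP_VertexAI"))
```

Even for the identical text, the similarity drops once the vectors come from different functions, which is why keeping one table per provider is a sensible precaution.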
You can then re-instantiate the index from the vector store with:
index = VectorStoreIndexWrapper(vectorstore=myCassandraVStore)
and use it as you saw in the quickstart:
query = "Who is Luchesi?"
index.query(query, llm=llm)
' Luchesi is a person who is known for having a critical turn and for not being able to tell Amontillado from Sherry.'
Further usage of the vector store
These are some of the ways you can query the store:
myCassandraVStore.similarity_search_with_score(
    "Does anyone have a coughing fit?",
    k=1,
)
[(Document(page_content='"Nitre," I replied. "How long have you had that cough?"\n\n"Ugh! ugh! ugh!--ugh! ugh! ugh!--ugh! ugh! ugh!--ugh! ugh! ugh!--ugh!\nugh! ugh!"\n\nMy poor friend found it impossible to reply for many minutes.\n\n"It is nothing," he said, at last.', metadata={'source': 'texts/amontillado.txt'}), 0.9052722957243504)]
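Conceptually, a similarity search ranks stored vectors by how close they are to the query vector and returns the best `k`. The sketch below is a brute-force illustration of that idea with made-up two-dimensional vectors; it is not how the Cassandra store computes its results (the database uses its own vector index and score scale), and all names in it are hypothetical.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], rows: list[tuple[str, list[float]]], k: int = 1):
    """Return the k (text, score) pairs with the highest similarity, best first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up rows: (text, vector). Real embeddings have hundreds of dimensions.
rows = [
    ("cough", [1.0, 0.0]),
    ("wine", [0.0, 1.0]),
    ("cellar", [0.6, 0.8]),
]

best = top_k([0.9, 0.1], rows, k=1)
print(best)
```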
Adding new documents
Start with a very off-topic question, to demonstrate that no relevant documents are found (yet).
Note: depending on the embedding function, you might still get some results at this stage, even though they are off-topic in practice. In a full end-to-end QA session, however, these would likely be discarded by the LLM, which would presumably end up saying, "I don't know".
SPIDER_QUESTION = 'Compare Agelenidae and Lycosidae'
myCassandraVStore.similarity_search_with_relevance_scores(
    SPIDER_QUESTION,
    k=1,
    score_threshold=0.8,
)
[(Document(page_content='"A huge human foot d\'or, in a field azure; the foot crushes a serpent\nrampant whose fangs are imbedded in the heel."\n\n"And the motto?"\n\n"_Nemo me impune lacessit_."\n\n"Good!" he said.', metadata={'source': 'texts/amontillado.txt'}), 0.8635470113992079)]
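The off-topic passage above still scored about 0.86, so a threshold of 0.8 lets it through. A minimal sketch of threshold filtering follows, assuming a result is kept when its score is greater than or equal to the threshold; the `filter_by_threshold` helper and the stricter value of 0.88 are hypothetical and only serve to show how the choice of threshold changes the outcome.

```python
def filter_by_threshold(results, score_threshold):
    """Keep only (document, score) pairs whose score reaches the threshold."""
    return [(doc, score) for doc, score in results if score >= score_threshold]


# Score copied from the off-topic result above; the text label is a placeholder.
results = [("heraldry passage from amontillado.txt", 0.8635470113992079)]

lenient = filter_by_threshold(results, 0.8)    # threshold used in this notebook
strict = filter_by_threshold(results, 0.88)    # hypothetical stricter threshold

print(len(lenient), len(strict))
```

Since similarity scores from real embedding models often cluster in a narrow band, a useful threshold typically has to be tuned per embedding function.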
You can add a couple of relevant paragraphs to the index, using the add_texts primitive:
spiderFacts = [
"""
The Agelenidae are a large family of spiders in the suborder Araneomorphae.
The body length of the smallest Agelenidae spiders are about 4 mm (0.16 in), excluding the legs,
while the larger species grow to 20 mm (0.79 in) long. Some exceptionally large species,
such as Eratigena atrica, may reach 5 to 10 cm (2.0 to 3.9 in) in total leg span.
Agelenids have eight eyes in two horizontal rows of four. Their cephalothoraces narrow
somewhat towards the front where the eyes are. Their abdomens are more or less oval, usually
patterned with two rows of lines and spots. Some species have longitudinal lines on the dorsal
surface of the cephalothorax, whereas other species do not; for example, the hobo spider does not,
which assists in informally distinguishing it from similar-looking species.
""",
"""
Jumping spiders are a group of spiders that constitute the family Salticidae.
As of 2019, this family contained over 600 described genera and over 6,000 described species,
making it the largest family of spiders at 13% of all species.
Jumping spiders have some of the best vision among arthropods and use it
in courtship, hunting, and navigation.
Although they normally move unobtrusively and fairly slowly,
most species are capable of very agile jumps, notably when hunting,
but sometimes in response to sudden threats or crossing long gaps.
Both their book lungs and tracheal system are well-developed,
and they use both systems (bimodal breathing).
Jumping spiders are generally recognized by their eye pattern.
All jumping spiders have four pairs of eyes, with the anterior median pair
being particularly large.
""",
]
spiderMetadatas = [
{'source': 'wikipedia/Agelenidae'},
{'source': 'wikipedia/Salticidae'},
]
if llmProvider != 'Azure_OpenAI':
    ids = myCassandraVStore.add_texts(
        spiderFacts,
        spiderMetadatas,
    )
    print('\n'.join(ids))
else:
    # Note: this is a temporary mitigation of an Azure OpenAI error when asking for
    # multiple embeddings in a single request, which would fail with:
    #     "InvalidRequestError: Too many inputs. The max number of inputs is 1"
    for spFact, spMetadata in zip(spiderFacts, spiderMetadatas):
        thisId = myCassandraVStore.add_texts(
            [spFact],
            [spMetadata],
        )[0]
        print(thisId)
5b44a1115e3245198a132a42705694a8
f1353d23c55b478192e77453cfdcc30a
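The one-at-a-time workaround in the Azure branch above is a special case of batching with a batch size of 1. As a sketch, the same idea can be written once as a generic helper; `batched` and `add_in_batches` are hypothetical names, and `fake_add_texts` is a stand-in used here so that the example runs without a database.

```python
from typing import Callable, Iterator, Sequence


def batched(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def add_in_batches(add_fn: Callable, texts: Sequence, metadatas: Sequence, batch_size: int) -> list:
    """Call add_fn(texts, metadatas) once per batch and collect the returned ids."""
    ids = []
    for text_batch, meta_batch in zip(batched(texts, batch_size), batched(metadatas, batch_size)):
        ids.extend(add_fn(list(text_batch), list(meta_batch)))
    return ids


calls = []


def fake_add_texts(texts, metadatas):
    """Stand-in for a store's add_texts: records the call, returns made-up ids."""
    calls.append((texts, metadatas))
    return [f"id-{len(calls)}-{i}" for i in range(len(texts))]


ids = add_in_batches(
    fake_add_texts,
    ["fact one", "fact two"],
    [{"source": "a"}, {"source": "b"}],
    batch_size=1,
)
print(ids, len(calls))
```

In the notebook, passing `myCassandraVStore.add_texts` as `add_fn` with `batch_size=1` would reproduce the Azure branch, while a larger batch size would cover the other providers.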
Another way is to add a text through LangChain's Document abstraction.
Note that, using one of LangChain's splitters, long input documents are made into (possibly overlapping) digestible chunks without much boilerplate:
mySplitter = RecursiveCharacterTextSplitter(chunk_size=250, chunk_overlap=120)
lycoText = """
Wolf spiders are members of the family Lycosidae.
They are robust and agile hunters with excellent eyesight.
They live mostly in solitude, hunt alone, and usually do not spin webs.
Some are opportunistic hunters, pouncing upon prey as they
find it or chasing it over short distances;
others wait for passing prey in or near the mouth of a burrow.
Wolf spiders resemble nursery web spiders (family Pisauridae),
but wolf spiders carry their egg sacs by attaching them to their spinnerets,
while the Pisauridae carry their egg sacs with their chelicerae and pedipalps.
Two of the wolf spider's eight eyes are large and prominent;
this distinguishes them from nursery web spiders,
whose eyes are all of roughly equal size.
This can also help distinguish them from the similar-looking grass spiders.
"""
lycoDocument = Document(
    page_content=lycoText,
    metadata={'source': 'wikipedia/Lycosidae'}
)
Use the splitter to "shred" the input document:
lycoDocs = mySplitter.transform_documents([lycoDocument])
lycoDocs
[Document(page_content='Wolf spiders are members of the family Lycosidae.\nThey are robust and agile hunters with excellent eyesight.\nThey live mostly in solitude, hunt alone, and usually do not spin webs.\nSome are opportunistic hunters, pouncing upon prey as they', metadata={'source': 'wikipedia/Lycosidae'}), Document(page_content='Some are opportunistic hunters, pouncing upon prey as they\nfind it or chasing it over short distances;\nothers wait for passing prey in or near the mouth of a burrow.', metadata={'source': 'wikipedia/Lycosidae'}), Document(page_content='Wolf spiders resemble nursery web spiders (family Pisauridae),\nbut wolf spiders carry their egg sacs by attaching them to their spinnerets,\nwhile the Pisauridae carry their egg sacs with their chelicerae and pedipalps.', metadata={'source': 'wikipedia/Lycosidae'}), Document(page_content="while the Pisauridae carry their egg sacs with their chelicerae and pedipalps.\nTwo of the wolf spider's eight eyes are large and prominent;\nthis distinguishes them from nursery web spiders,\nwhose eyes are all of roughly equal size.", metadata={'source': 'wikipedia/Lycosidae'}), Document(page_content='this distinguishes them from nursery web spiders,\nwhose eyes are all of roughly equal size.\nThis can also help distinguish them from the similar-looking grass spiders.', metadata={'source': 'wikipedia/Lycosidae'})]
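To make the roles of the chunk size and the chunk overlap concrete, here is a deliberately simplified sketch that cuts text at fixed character offsets. This is not the algorithm used by RecursiveCharacterTextSplitter, which prefers to break at separators such as newlines (as the output above shows); `split_with_overlap` is a hypothetical helper.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)
```

Each chunk repeats the tail of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.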
These are ready to be added to the index:
if llmProvider != 'Azure_OpenAI':
    ids = myCassandraVStore.add_documents(lycoDocs)
    print('\n'.join(ids))
else:
    # Note: this is a temporary mitigation of an Azure OpenAI error when asking for
    # multiple embeddings in a single request, which would fail with:
    #     "InvalidRequestError: Too many inputs. The max number of inputs is 1"
    for lycoDoc in lycoDocs:
        thisId = myCassandraVStore.add_documents([lycoDoc])[0]
        print(thisId)
5b7d726f444b421f9f25a55d73291766
4ab50d78b1664ad094e9c8b95b24ca44
e9f81b5662dc409db259547348fd099b
f2d60ab62b75478fb5d548c29b239529
e6ebc3f66acb4c538d6d45fc8d05d6c8
Querying the store again
Time to repeat the question:
myCassandraVStore.similarity_search_with_relevance_scores(
    SPIDER_QUESTION,
    k=3,
    score_threshold=0.8,
)
[(Document(page_content='\n The Agelenidae are a large family of spiders in the suborder Araneomorphae.\n The body length of the smallest Agelenidae spiders are about 4 mm (0.16 in), excluding the legs,\n while the larger species grow to 20 mm (0.79 in) long. Some exceptionally large species,\n such as Eratigena atrica, may reach 5 to 10 cm (2.0 to 3.9 in) in total leg span.\n Agelenids have eight eyes in two horizontal rows of four. Their cephalothoraces narrow\n somewhat towards the front where the eyes are. Their abdomens are more or less oval, usually\n patterned with two rows of lines and spots. Some species have longitudinal lines on the dorsal\n surface of the cephalothorax, whereas other species do not; for example, the hobo spider does not,\n which assists in informally distinguishing it from similar-looking species.\n ', metadata={'source': 'wikipedia/Agelenidae'}), 0.9029937548261041), (Document(page_content="while the Pisauridae carry their egg sacs with their chelicerae and pedipalps.\nTwo of the wolf spider's eight eyes are large and prominent;\nthis distinguishes them from nursery web spiders,\nwhose eyes are all of roughly equal size.", metadata={'source': 'wikipedia/Lycosidae'}), 0.9007724464216618), (Document(page_content='Wolf spiders resemble nursery web spiders (family Pisauridae),\nbut wolf spiders carry their egg sacs by attaching them to their spinnerets,\nwhile the Pisauridae carry their egg sacs with their chelicerae and pedipalps.', metadata={'source': 'wikipedia/Lycosidae'}), 0.893145090393671)]
Item removal and expiration
Time-To-Live (TTL)
If you provide a TTL value when creating the store, every entry will expire a fixed time after its insertion:
myCassandraVStoreWithTTL = Cassandra(
    embedding=myEmbedding,
    session=session,
    keyspace=keyspace,
    table_name='vs_test1_shortlived_' + llmProvider,
    ttl_seconds=120,
)
The following two documents will be available for two minutes.
if llmProvider != 'Azure_OpenAI':
    ids = myCassandraVStoreWithTTL.add_documents(lycoDocs[0:2])
    print('\n'.join(ids))
else:
    # Note: this is a temporary mitigation of an Azure OpenAI error when asking for
    # multiple embeddings in a single request, which would fail with:
    #     "InvalidRequestError: Too many inputs. The max number of inputs is 1"
    for lycoDoc in lycoDocs[0:2]:
        thisId = myCassandraVStoreWithTTL.add_documents([lycoDoc])[0]
        print(thisId)
4a67cdc623c54244bb7931a8575610c4
9f59ccbd9e694d7c8a74106b9e3434de
Alternatively, for finer control over the time-to-live, you can specify it at insertion time; this value takes precedence over the store-level setting. So, these documents will survive for twenty seconds:
if llmProvider != 'Azure_OpenAI':
    ids = myCassandraVStore.add_documents(lycoDocs[2:], ttl_seconds=20)
    print('\n'.join(ids))
else:
    # Note: this is a temporary mitigation of an Azure OpenAI error when asking for
    # multiple embeddings in a single request, which would fail with:
    #     "InvalidRequestError: Too many inputs. The max number of inputs is 1"
    for lycoDoc in lycoDocs[2:]:
        thisId = myCassandraVStore.add_documents([lycoDoc], ttl_seconds=20)[0]
        print(thisId)
cfc09a38e082407daaf025f0cf067487
1daee1e823fb464dbbab28177b789a68
aea25de5b3b640439168485a77bdd08b
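The precedence rule described above (a per-insert TTL wins over the store-level default) can be mimicked with a small in-memory model. This is only a sketch of the semantics: in Cassandra the expiration is handled by the database itself, and `ToyTTLStore` and `FakeClock` are hypothetical names. The clock is injectable so that the example does not need to sleep.

```python
import time
from typing import Callable, Optional


class FakeClock:
    """Manually advanced clock, so the example runs instantly."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


class ToyTTLStore:
    """In-memory model of store-level and per-insert time-to-live."""

    def __init__(self, ttl_seconds: Optional[float] = None,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._default_ttl = ttl_seconds
        self._clock = clock
        self._rows = {}  # doc_id -> (text, expires_at or None)

    def add(self, doc_id: str, text: str, ttl_seconds: Optional[float] = None) -> None:
        # A TTL given at insertion time overrides the store-level default.
        ttl = ttl_seconds if ttl_seconds is not None else self._default_ttl
        expires_at = None if ttl is None else self._clock() + ttl
        self._rows[doc_id] = (text, expires_at)

    def live_ids(self) -> list:
        """Ids of the entries that have not expired yet."""
        now = self._clock()
        return sorted(i for i, (_, exp) in self._rows.items() if exp is None or now < exp)


clock = FakeClock()
store = ToyTTLStore(ttl_seconds=120, clock=clock)
store.add("a", "uses the store-level TTL of 120 seconds")
store.add("b", "overrides it with 20 seconds", ttl_seconds=20)

at_start = store.live_ids()
clock.now = 30.0
after_30s = store.live_ids()
clock.now = 121.0
after_121s = store.live_ids()
print(at_start, after_30s, after_121s)
```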
Manual removal of entries
You can delete individual documents from the store.
However, you first need to retrieve their identifier with a similarity search. The following method returns a list of matching 3-tuples, whose last item is the id of the document:
spiderDocIds = []
for doc, score, docId in myCassandraVStore.similarity_search_with_score_id('Compare Agelenidae and Lycosidae'):
    print(f' * [{score:.3f}] "{doc.page_content[:32].strip()}..." ({docId})')
    spiderDocIds.append(docId)
 * [0.903] "The Agelenidae are a large..." (5b44a1115e3245198a132a42705694a8)
 * [0.901] "while the Pisauridae carry their..." (f2d60ab62b75478fb5d548c29b239529)
 * [0.901] "while the Pisauridae carry their..." (1daee1e823fb464dbbab28177b789a68)
 * [0.893] "Wolf spiders resemble nursery we..." (e9f81b5662dc409db259547348fd099b)
At this point you can perform the actual document deletion:
for spiderDocId in spiderDocIds:
    myCassandraVStore.delete_by_document_id(spiderDocId)
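The two-step pattern above (search to obtain ids, then delete by id) can be tried out on a toy in-memory store. This is a sketch under assumptions: `ToyStore` and its methods are hypothetical, ids are generated with `uuid4().hex` purely for illustration, and a plain dot product stands in for the similarity measure.

```python
import uuid


class ToyStore:
    """In-memory stand-in for a vector store that exposes document ids."""

    def __init__(self) -> None:
        self._rows = {}  # doc_id -> (text, vector)

    def add(self, text: str, vector: list) -> str:
        doc_id = uuid.uuid4().hex
        self._rows[doc_id] = (text, vector)
        return doc_id

    def search_with_id(self, query_vec: list, k: int = 4) -> list:
        """Return up to k (text, score, doc_id) triples, best score first."""
        scored = [
            (text, sum(q * x for q, x in zip(query_vec, vec)), doc_id)
            for doc_id, (text, vec) in self._rows.items()
        ]
        scored.sort(key=lambda triple: triple[1], reverse=True)
        return scored[:k]

    def delete_by_id(self, doc_id: str) -> bool:
        """Remove one entry; return True if it existed."""
        return self._rows.pop(doc_id, None) is not None

    def __len__(self) -> int:
        return len(self._rows)


store = ToyStore()
store.add("funnel weavers", [1.0, 0.0])
store.add("wolf spiders", [0.8, 0.2])
store.add("amontillado", [0.0, 1.0])

# Step 1: collect the ids of the best matches. Step 2: delete them one by one.
matched_ids = [doc_id for _, _, doc_id in store.search_with_id([1.0, 0.0], k=2)]
deleted = [store.delete_by_id(doc_id) for doc_id in matched_ids]
remaining = [text for text, _, _ in store.search_with_id([1.0, 0.0])]
print(deleted, remaining)
```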
The last method to remove entries from the store is demonstrated next.
Cleanup
You're done.
In order to leave the index empty for the next demo run, you may want to clear it (i.e. empty the table in the database).
Just don't take this operation lightly in production!
myCassandraVStore.clear()