Showcase¶
This is a simple tutorial to go over vexpresso capabilities¶
Imports
import vexpresso
import numpy as np
from vexpresso.retrievers import Retriever, FaissRetriever
Collection Creation¶
First we'll create some sample data. Here we're using just strings, but because vexpresso uses daft, you can use any datatype!¶
data = {
"status": ["read", "unread", "read", "unread", "read", "unread", "read", "unread"],
"documents": ["A document that discusses domestic policy", "A document that discusses international affairs", "A document that discusses kittens", "A document that discusses dogs", "A document that discusses chocolate", "A document that is sixth that discusses government", "A document that discusses international affairs", "A document that discusses global affairs"],
"ids": ["id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8"],
"numbers": list(range(1,9))
}
To create the collection, use the create method. Execution is lazy by default, meaning that no data is actually loaded until execute or show is called (or unless lazy=False is passed).¶
collection = vexpresso.create(data=data)
collection
2023-06-20 11:27:31.621 | INFO | daft.context:runner:80 - Using PyRunner
| status Utf8 | documents Utf8 | ids Utf8 | numbers Int64 |
|---|---|---|---|
| read | A document that discusses domestic policy | id1 | 1 |
| unread | A document that discusses international affairs | id2 | 2 |
| read | A document that discusses kittens | id3 | 3 |
| unread | A document that discusses dogs | id4 | 4 |
| read | A document that discusses chocolate | id5 | 5 |
| unread | A document that is sixth that discusses government | id6 | 6 |
| read | A document that discusses international affairs | id7 | 7 |
| unread | A document that discusses global affairs | id8 | 8 |
(Showing first 8 of 8 rows)
If you want to operate directly¶
Vexpresso also works on clusters with Ray!¶
collection = vexpresso.create(data=data, backend="ray", cluster_address=..., cluster_kwargs=...)
Let's see what's in the collection now!¶
collection.show(5)
| status Utf8 | documents Utf8 | ids Utf8 | numbers Int64 |
|---|---|---|---|
| read | A document that discusses domestic policy | id1 | 1 |
| unread | A document that discusses international affairs | id2 | 2 |
| read | A document that discusses kittens | id3 | 3 |
| unread | A document that discusses dogs | id4 | 4 |
| read | A document that discusses chocolate | id5 | 5 |
vexpresso's Collection methods return Collection objects, allowing for complex chaining of calls¶
Embed Data¶
Let's add a list of dummy vectors to represent embeddings!¶
embeddings = [
[1.1, 2.3, 3.2],
[4.5, 6.9, 4.4],
[1.1, 2.3, 3.2],
[4.5, 6.9, 4.4],
[1.1, 2.3, 3.2],
[4.5, 6.9, 4.4],
[1.1, 2.3, 3.2],
[4.5, 6.9, 4.4],
]
collection = collection.add_column("embeddings_documents", embeddings)
By default vexpresso is "lazy", meaning that nothing is executed until .execute is called¶
Note: this can be bypassed by passing lazy=False, for example:
# collection = collection.add_column("embeddings_documents", embeddings, lazy=False)
collection
| status Utf8 | documents Utf8 | ids Utf8 | numbers Int64 | embeddings_documents List[item:Float64] |
|---|---|---|---|---|
(No data to display: Dataframe not materialized)
Let's call .execute (or .collect) to materialize the embeddings column¶
collection = collection.execute()
collection.show(5)
| status Utf8 | documents Utf8 | ids Utf8 | numbers Int64 | embeddings_documents List[item:Float64] |
|---|---|---|---|---|
| read | A document that discusses domestic policy | id1 | 1 | [1.1, 2.3, 3.2] |
| unread | A document that discusses international affairs | id2 | 2 | [4.5, 6.9, 4.4] |
| read | A document that discusses kittens | id3 | 3 | [1.1, 2.3, 3.2] |
| unread | A document that discusses dogs | id4 | 4 | [4.5, 6.9, 4.4] |
| read | A document that discusses chocolate | id5 | 5 | [1.1, 2.3, 3.2] |
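The lazy pattern above can be illustrated with a toy class in plain Python. This is only a sketch of the general idea (record operations, run them on execute); LazyTable and its methods are hypothetical names and say nothing about how vexpresso or daft implement laziness internally.

```python
class LazyTable:
    """Toy table that records operations and only runs them on execute()."""

    def __init__(self, columns, pending=None):
        self.columns = columns        # materialized data: name -> list
        self.pending = pending or []  # operations queued but not yet run

    def add_column(self, name, values):
        # Queue the operation instead of applying it; return a new object
        op = lambda cols: {**cols, name: list(values)}
        return LazyTable(self.columns, self.pending + [op])

    def execute(self):
        # Apply every queued operation in order and return a materialized table
        cols = self.columns
        for op in self.pending:
            cols = op(cols)
        return LazyTable(cols)


table = LazyTable({"ids": ["id1", "id2"]})
lazy = table.add_column("numbers", [1, 2])
print(sorted(lazy.columns))  # the new column is not materialized yet
done = lazy.execute()
print(sorted(done.columns))  # after execute() the column is present
```

Returning a new object from every call is also what makes chaining possible: each step only extends the queue of pending work.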
Let's take a look at the embeddings¶
We can easily grab the raw data in the form of a dictionary or a list
collection.to_dict()["embeddings_documents"][:3]
[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2]]
collection["embeddings_documents"].to_list()[:3]
[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2]]
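Since the values come back as plain Python lists, they can be stacked into a NumPy array for further processing. A small sketch, using the three vectors printed above as literal data rather than a live collection:

```python
import numpy as np

# Literal copy of the first three embeddings shown above
raw = [[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2]]

matrix = np.asarray(raw, dtype=np.float64)  # shape: (rows, embedding_dim)
print(matrix.shape)
print(matrix.mean(axis=0))  # per-dimension mean across the rows
```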
Query¶
Normally, we would query with the same embedding function we used to embed the content. But since we entered the embeddings manually, let's create a simple embedding function that just returns an array of zeros
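The definition of embed_fn is not shown in this walkthrough. A minimal sketch that matches the description, assuming the function receives a list of strings and returns one 3-dimensional zero vector per string (the dimension matches the dummy embeddings; the exact signature vexpresso expects is an assumption):

```python
import numpy as np


def embed_fn(texts):
    # Assumption: `texts` is a list of strings; return one zero vector per text.
    # The length 3 matches the dummy embeddings added earlier.
    return np.zeros((len(texts), 3))


print(embed_fn(["test"]).shape)
```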
As you can see, we now have an embeddings_documents column. Let's query it and return the top 5 results!¶
queried = collection.query("embeddings_documents", query="test", embedding_fn=embed_fn, k=5, return_scores=True).execute()
We can see the actual similarity scores in the embeddings_documents_score column¶
queried.show(5)
| status Utf8 | documents Utf8 | ids Utf8 | numbers Int64 | embeddings_documents List[item:Float64] | embeddings_documents_score Float64 |
|---|---|---|---|---|---|
| unread | A document that discusses international affairs | id2 | 2 | [4.5, 6.9, 4.4] | 0.976796 |
| unread | A document that discusses dogs | id4 | 4 | [4.5, 6.9, 4.4] | 0.976796 |
| unread | A document that is sixth that discusses government | id6 | 6 | [4.5, 6.9, 4.4] | 0.976796 |
| unread | A document that discusses global affairs | id8 | 8 | [4.5, 6.9, 4.4] | 0.976796 |
| read | A document that discusses international affairs | id7 | 7 | [1.1, 2.3, 3.2] | 0.931368 |
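For intuition, a top-k ranking over the dummy embeddings can be reproduced with NumPy. This sketch assumes cosine similarity and a hypothetical query vector; neither the metric vexpresso uses nor the query embedding behind the table above is stated in this tutorial, so the numbers are not expected to match exactly.

```python
import numpy as np

# The 8 dummy rows, alternating between the two vectors used in this tutorial
embeddings = np.array([[1.1, 2.3, 3.2], [4.5, 6.9, 4.4]] * 4)
query = np.array([1.0, 1.0, 1.0])  # hypothetical query embedding


def cosine_top_k(matrix, vector, k):
    # Cosine similarity = dot product divided by the product of the norms
    scores = matrix @ vector / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(vector))
    # Stable sort on negated scores: highest first, ties keep row order
    order = np.argsort(-scores, kind="stable")[:k]
    return order, scores[order]


indices, scores = cosine_top_k(embeddings, query, k=5)
print(indices, scores)
```

Because only two distinct vectors exist in the dummy data, many rows tie, and which tied row is returned depends on the tie-breaking rule.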
You can also query with embeddings directly! You can call the embedding function yourself, but we recommend using the collection object's embed_query method for embedding functions that may require resources (like GPUs) or if you want to run them on a Ray cluster
# you can just call the embedding function
embed_query = embed_fn(["test_1"])
# or use the collection's embed_query method with the stored embedding function
embed_query = collection.embed_query("test1", embedding_fn=collection.embedding_functions["embeddings_documents"])
# you can also just pass in the string column name
embed_query = collection.embed_query("test1", embedding_fn="embeddings_documents")
queried = collection.query("embeddings_documents", query_embedding=embed_query, k=5).execute()
queried.show(5)
| status Utf8 | documents Utf8 | ids Utf8 | numbers Int64 | embeddings_documents List[item:Float64] |
|---|---|---|---|---|
| unread | A document that discusses international affairs | id2 | 2 | [4.5, 6.9, 4.4] |
| unread | A document that discusses dogs | id4 | 4 | [4.5, 6.9, 4.4] |
| unread | A document that is sixth that discusses government | id6 | 6 | [4.5, 6.9, 4.4] |
| unread | A document that discusses global affairs | id8 | 8 | [4.5, 6.9, 4.4] |
| read | A document that discusses international affairs | id7 | 7 | [1.1, 2.3, 3.2] |
We can also get a list of embeddings with embed_queries
embed_query = collection.embed_queries(["test1", "test2"], embedding_fn=collection.embedding_functions["embeddings_documents"])
print(embed_query)
[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
Sometimes you will want to batch queries together into a single call. vexpresso has a convenient batch_query function. This will return a list of Collections¶
queries = ["test_1", "test_5", "test_10"]
batch_queried = collection.batch_query("embeddings_documents", queries=queries, k=2)
We now have collections for each query¶
batch_queried[0].show(2)
| status Utf8 | documents Utf8 | ids Utf8 | numbers Int64 | embeddings_documents List[item:Float64] |
|---|---|---|---|---|
| unread | A document that is sixth that discusses government | id6 | 6 | [4.5, 6.9, 4.4] |
| unread | A document that discusses global affairs | id8 | 8 | [4.5, 6.9, 4.4] |
batch_queried[1].show(2)
| status Utf8 | documents Utf8 | ids Utf8 | numbers Int64 | embeddings_documents List[item:Float64] |
|---|---|---|---|---|
| unread | A document that is sixth that discusses government | id6 | 6 | [4.5, 6.9, 4.4] |
| unread | A document that discusses global affairs | id8 | 8 | [4.5, 6.9, 4.4] |
batch_queried[2].show(2)
| status Utf8 | documents Utf8 | ids Utf8 | numbers Int64 | embeddings_documents List[item:Float64] |
|---|---|---|---|---|
| unread | A document that is sixth that discusses government | id6 | 6 | [4.5, 6.9, 4.4] |
| unread | A document that discusses global affairs | id8 | 8 | [4.5, 6.9, 4.4] |
Filtering¶
With vexpresso, filtering is super easy. The syntax is similar to chromadb¶
Filter dictionary must have the following structure:¶
{
<field>: {
<filter_method>: <value>
},
<field>: {
<filter_method>: <value>
},
}
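To make the structure concrete, here is a plain-Python sketch of how such a dictionary can be read: every field's condition must hold for a row to be kept. The operator names gt, lte, eq and isin are the ones used in this tutorial; the matches helper and the AND semantics across fields are illustrative assumptions, not vexpresso's implementation.

```python
import operator

# Operator names taken from the filters used in this tutorial
OPS = {
    "gt": operator.gt,
    "lte": operator.le,
    "eq": operator.eq,
    "isin": lambda value, options: value in options,
}


def matches(row, filter_dict):
    # A row passes only if every <field>: {<filter_method>: <value>} pair holds
    return all(
        OPS[method](row[field], expected)
        for field, conditions in filter_dict.items()
        for method, expected in conditions.items()
    )


rows = [
    {"status": "read", "numbers": 1},
    {"status": "unread", "numbers": 2},
    {"status": "read", "numbers": 3},
    {"status": "read", "numbers": 5},
]
kept = [r for r in rows if matches(r, {"numbers": {"lte": 4}, "status": {"eq": "read"}})]
print(kept)
```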
Let's filter the original collection to only include rows with numbers > 2
filtered_collection = collection.filter(
{
"numbers":{
"gt":2
}
}
).execute()
filtered_collection.show(5)
| status Utf8 | documents Utf8 | ids Utf8 | numbers Int64 | embeddings_documents List[item:Float64] |
|---|---|---|---|---|
| read | A document that discusses kittens | id3 | 3 | [1.1, 2.3, 3.2] |
| unread | A document that discusses dogs | id4 | 4 | [4.5, 6.9, 4.4] |
| read | A document that discusses chocolate | id5 | 5 | [1.1, 2.3, 3.2] |
| unread | A document that is sixth that discusses government | id6 | 6 | [4.5, 6.9, 4.4] |
| read | A document that discusses international affairs | id7 | 7 | [1.1, 2.3, 3.2] |
We can use multiple filter conditions as well¶
Let's filter the collection to only return rows with numbers <= 4 and status == "read"
filtered_collection = collection.filter(
{
"numbers":{
"lte":4
},
"status":{
"eq":"read"
}
}
).execute()
filtered_collection.show(5)
| status Utf8 | documents Utf8 | ids Utf8 | numbers Int64 | embeddings_documents List[item:Float64] |
|---|---|---|---|---|
| read | A document that discusses domestic policy | id1 | 1 | [1.1, 2.3, 3.2] |
| read | A document that discusses kittens | id3 | 3 | [1.1, 2.3, 3.2] |
Sometimes you need a custom filtering function. With vexpresso, it's easy to do that with the custom filter keyword!¶
Let's filter the collection to only return rows with even numbers whose ids are in ["id1", "id2", "id4"]
def custom_filter(number, mod_val) -> bool:
    # True when number is divisible by mod_val
    return number % mod_val == 0
filtered_collection = collection.filter(
{
"numbers":{
"custom":{"function":custom_filter, "function_kwargs":{"mod_val":2}}
},
"ids":{
"isin":["id1", "id2", "id4"]
}
}
).execute()
filtered_collection.show(5)
| status Utf8 | documents Utf8 | ids Utf8 | numbers Int64 | embeddings_documents List[item:Float64] |
|---|---|---|---|---|
| unread | A document that discusses international affairs | id2 | 2 | [4.5, 6.9, 4.4] |
| unread | A document that discusses dogs | id4 | 4 | [4.5, 6.9, 4.4] |
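The way function_kwargs are combined with the function can be pictured with functools.partial. This is a sketch of the idea only; how vexpresso actually invokes the custom function (per value or per column) is an assumption here.

```python
from functools import partial


def custom_filter(number, mod_val) -> bool:
    return number % mod_val == 0


# Bind the extra keyword arguments, as the "function_kwargs" entry suggests
is_even = partial(custom_filter, mod_val=2)

numbers = list(range(1, 9))
ids = [f"id{n}" for n in numbers]
allowed_ids = {"id1", "id2", "id4"}

# Keep rows that pass both the custom check and the isin check
kept = [(i, n) for i, n in zip(ids, numbers) if is_even(n) and i in allowed_ids]
print(kept)
```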
You can also combine filters + queries in the same call¶
Let's query the collection with "test" and keep only the rows with even numbers
even_filter = {
"numbers":{
"custom":{"function":custom_filter, "function_kwargs":{"mod_val":2}}
}
}
query_filtered_collection = collection.query("embeddings_documents", "test", k=10, filter_conditions=even_filter).execute()
query_filtered_collection.show(5)
| status Utf8 | documents Utf8 | ids Utf8 | numbers Int64 | embeddings_documents List[item:Float64] |
|---|---|---|---|---|
| unread | A document that discusses international affairs | id2 | 2 | [4.5, 6.9, 4.4] |
| unread | A document that discusses dogs | id4 | 4 | [4.5, 6.9, 4.4] |
| unread | A document that is sixth that discusses government | id6 | 6 | [4.5, 6.9, 4.4] |
| unread | A document that discusses global affairs | id8 | 8 | [4.5, 6.9, 4.4] |
Chaining Functions¶
Because execution is lazy, we can easily chain calls¶
For instance, let's query and filter multiple times
even_filter = {
"numbers":{
"custom":{"function":custom_filter, "function_kwargs":{"mod_val":2}}
}
}
chained_collection = collection.query("embeddings_documents", "test1", k=5) \
.filter(even_filter) \
.query("embeddings_documents", "test2", k=2) \
.filter({"numbers":{"lte":3}})
chained_collection.daft_df
| status Utf8 | documents Utf8 | ids Utf8 | numbers Int64 | embeddings_documents List[item:Float64] |
Here we queried for the closest 5 elements to "test1", filtered for only even numbers, queried top 2 of "test2", then filtered for numbers <= 3
chained_collection = chained_collection.execute()
chained_collection.show(5)
| status Utf8 | documents Utf8 | ids Utf8 | numbers Int64 | embeddings_documents List[item:Float64] |
|---|---|---|---|---|
| unread | A document that discusses international affairs | id2 | 2 | [4.5, 6.9, 4.4] |
Transforms¶
Sometimes you want to transform your data. Because of daft, you can use vexpresso to do this easily!¶
For example, let's add a new column named "replaced", in which "document" is changed to "replaced_document" for every entry of the documents column. Let's also specify that the output is a string type¶
For a full list of datatypes, see the daft documentation: https://www.getdaft.io/projects/docs/en/latest/api_docs/datatype.html
def simple_apply_fn(strings):
    # receives the column values and returns one transformed string per entry
    return [
        s.replace("document", "replaced_document") for s in strings
    ]
transformed_collection = collection.apply(simple_apply_fn, collection["documents"], to="replaced", datatype=vexpresso.DataType.string()).execute()
transformed_collection.show(5)
| status Utf8 | documents Utf8 | ids Utf8 | numbers Int64 | embeddings_documents List[item:Float64] | replaced Utf8 |
|---|---|---|---|---|---|
| read | A document that discusses domestic policy | id1 | 1 | [1.1, 2.3, 3.2] | A replaced_document that discusses domestic policy |
| unread | A document that discusses international affairs | id2 | 2 | [4.5, 6.9, 4.4] | A replaced_document that discusses international affairs |
| read | A document that discusses kittens | id3 | 3 | [1.1, 2.3, 3.2] | A replaced_document that discusses kittens |
| unread | A document that discusses dogs | id4 | 4 | [4.5, 6.9, 4.4] | A replaced_document that discusses dogs |
| read | A document that discusses chocolate | id5 | 5 | [1.1, 2.3, 3.2] | A replaced_document that discusses chocolate |
We can also pass in args, kwargs, and multiple columns into the apply function¶
For instance, let's append the number in the numbers column to each document in documents
def multi_column_apply_fn(string_columns, numbers):
    # pair each document with its number and join them with an underscore
    out = []
    for string, num in zip(string_columns, numbers):
        replaced = f"{string}_{num}"
        out.append(replaced)
    return out
transformed_collection = collection.apply(
multi_column_apply_fn,
collection["documents"],
numbers=collection["numbers"],
to="modified",
datatype=vexpresso.DataType.string()
).execute()
transformed_collection.show(5)
| status Utf8 | documents Utf8 | ids Utf8 | numbers Int64 | embeddings_documents List[item:Float64] | modified Utf8 |
|---|---|---|---|---|---|
| read | A document that discusses domestic policy | id1 | 1 | [1.1, 2.3, 3.2] | A document that discusses domestic policy_1 |
| unread | A document that discusses international affairs | id2 | 2 | [4.5, 6.9, 4.4] | A document that discusses international affairs_2 |
| read | A document that discusses kittens | id3 | 3 | [1.1, 2.3, 3.2] | A document that discusses kittens_3 |
| unread | A document that discusses dogs | id4 | 4 | [4.5, 6.9, 4.4] | A document that discusses dogs_4 |
| read | A document that discusses chocolate | id5 | 5 | [1.1, 2.3, 3.2] | A document that discusses chocolate_5 |
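Both functions take whole columns and return a list of the same length, so they can be tried on plain Python lists before being passed to apply. A quick check, assuming apply hands each column to the function as an iterable of values:

```python
def simple_apply_fn(strings):
    return [s.replace("document", "replaced_document") for s in strings]


def multi_column_apply_fn(string_columns, numbers):
    return [f"{string}_{num}" for string, num in zip(string_columns, numbers)]


docs = ["A document that discusses kittens", "A document that discusses dogs"]

print(simple_apply_fn(docs))
print(multi_column_apply_fn(docs, numbers=[3, 4]))
```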
Adding data¶
Saving + Loading¶
Once you've done a bunch of processing on a collection, you probably want to save it somewhere. Vexpresso supports local file saving + huggingface datasets¶
Let's save the transformed_collection above to a directory named saved_transformed_collection
transformed_collection.save("./saved_collection/saved_transformed_collection")
saving to ./saved_collection/saved_transformed_collection
We can then load the collection with the same create function. Make sure to also include the embedding functions that were used on the original collection!
loaded_collection = vexpresso.create(
directory_or_repo_id = "./saved_collection/saved_transformed_collection",
embedding_functions = {"embeddings_documents":embed_fn}
)
loaded_collection.show(5)
| status Utf8 | documents Utf8 | ids Utf8 | numbers Int64 | embeddings_documents List[item:Float64] | modified Utf8 |
|---|---|---|---|---|---|
| read | A document that discusses domestic policy | id1 | 1 | [1.1, 2.3, 3.2] | A document that discusses domestic policy_1 |
| unread | A document that discusses international affairs | id2 | 2 | [4.5, 6.9, 4.4] | A document that discusses international affairs_2 |
| read | A document that discusses kittens | id3 | 3 | [1.1, 2.3, 3.2] | A document that discusses kittens_3 |
| unread | A document that discusses dogs | id4 | 4 | [4.5, 6.9, 4.4] | A document that discusses dogs_4 |
| read | A document that discusses chocolate | id5 | 5 | [1.1, 2.3, 3.2] | A document that discusses chocolate_5 |
Now let's upload to huggingface!¶
For this you'll need to install huggingface-hub
# !pip install huggingface-hub
The token is read automatically from the env variable: HUGGINGFACEHUB_API_TOKEN = ...
or you can pass in the token directly via collection.save(token=...)
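One way to make the token handling explicit is to resolve it yourself before saving. The helper below is hypothetical (resolve_hf_token is not part of vexpresso); it only uses the standard library and the environment variable name given above.

```python
import os


def resolve_hf_token(explicit_token=None):
    # Prefer a token passed in directly, else fall back to the env variable
    token = explicit_token or os.environ.get("HUGGINGFACEHUB_API_TOKEN")
    if not token:
        raise RuntimeError("No Hugging Face token found; set HUGGINGFACEHUB_API_TOKEN")
    return token


# Intended use (not run here): token = resolve_hf_token(), then pass token=token to save
```

Failing early with a clear message avoids discovering a missing token only after the upload has started.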
username = "shyamsn97"
repo_name = "vexpresso_test_showcase"
# username = "REPLACE"
# repo_name = "REPLACE"
loaded_collection.save(hf_username = username, repo_name = repo_name, to_hub=True)
Uploading collection to None
Upload to shyamsn97/vexpresso_test_showcase complete!
'shyamsn97/vexpresso_test_showcase'
The uploaded repo is private by default, but this can be changed with the private flag
# loaded_collection.save(hf_username = username, repo_name = repo_name, to_hub=True, private=False)
You can see an example of the above data: https://huggingface.co/datasets/shyamsn97/vexpresso_test_showcase
Now let's load it!¶
loaded_collection = vexpresso.create(
hf_username = username,
repo_name = repo_name,
embedding_functions = {"embeddings_documents":embed_fn}
)
Retrieving from hf repo: shyamsn97/vexpresso_test_showcase
loaded_collection.show(5)
| status Utf8 | documents Utf8 | ids Utf8 | numbers Int64 | embeddings_documents List[item:Float64] | modified Utf8 |
|---|---|---|---|---|---|
| read | A document that discusses domestic policy | id1 | 1 | [1.1, 2.3, 3.2] | A document that discusses domestic policy_1 |
| unread | A document that discusses international affairs | id2 | 2 | [4.5, 6.9, 4.4] | A document that discusses international affairs_2 |
| read | A document that discusses kittens | id3 | 3 | [1.1, 2.3, 3.2] | A document that discusses kittens_3 |
| unread | A document that discusses dogs | id4 | 4 | [4.5, 6.9, 4.4] | A document that discusses dogs_4 |
| read | A document that discusses chocolate | id5 | 5 | [1.1, 2.3, 3.2] | A document that discusses chocolate_5 |