Showcase¶

This is a simple tutorial that walks through vexpresso's capabilities¶

Imports

In [1]:

import vexpresso
import numpy as np
from vexpresso.retrievers import Retriever, FaissRetriever

Collection Creation¶

First we'll create some sample data. Here we're using just strings, but because `vexpresso` uses `daft`, you can use any datatype!¶

In [2]:

data = {
    "status": ["read", "unread", "read", "unread", "read", "unread", "read", "unread"],
    "documents": ["A document that discusses domestic policy", "A document that discusses international affairs", "A document that discusses kittens", "A document that discusses dogs", "A document that discusses chocolate", "A document that is sixth that discusses government", "A document that discusses international affairs", "A document that discusses global affairs"],
    "ids": ["id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8"],
    "numbers": list(range(1,9))
}

To create the collection, use the `create` method. By default this uses lazy execution, meaning that no data is actually loaded until `execute` or `show` is called (or unless `lazy=False` is passed)¶

In [3]:

collection = vexpresso.create(data=data)
collection

2023-06-20 11:27:31.621 | INFO     | daft.context:runner:80 - Using PyRunner

Out[3]:

+----------+----------------------+--------+-----------+
| status   | documents            | ids    |   numbers |
| Utf8     | Utf8                 | Utf8   |     Int64 |
+==========+======================+========+===========+
| read     | A document that      | id1    |         1 |
|          | discusses domestic   |        |           |
|          | policy               |        |           |
+----------+----------------------+--------+-----------+
| unread   | A document that      | id2    |         2 |
|          | discusses            |        |           |
|          | international        |        |           |
|          | affairs              |        |           |
+----------+----------------------+--------+-----------+
| read     | A document that      | id3    |         3 |
|          | discusses kittens    |        |           |
+----------+----------------------+--------+-----------+
| unread   | A document that      | id4    |         4 |
|          | discusses dogs       |        |           |
+----------+----------------------+--------+-----------+
| read     | A document that      | id5    |         5 |
|          | discusses chocolate  |        |           |
+----------+----------------------+--------+-----------+
| unread   | A document that is   | id6    |         6 |
|          | sixth that discusses |        |           |
|          | government           |        |           |
+----------+----------------------+--------+-----------+
| read     | A document that      | id7    |         7 |
|          | discusses            |        |           |
|          | international        |        |           |
|          | affairs              |        |           |
+----------+----------------------+--------+-----------+
| unread   | A document that      | id8    |         8 |
|          | discusses global     |        |           |
|          | affairs              |        |           |
+----------+----------------------+--------+-----------+
(Showing first 8 of 8 rows)

If you want to operate directly¶

Vexpresso also works on clusters with Ray!¶

collection = vexpresso.create(data=data, backend="ray", cluster_address=..., cluster_kwargs=...)

Let's see what's in the collection now!¶

In [4]:

collection.show(5)

Out[4]:

status Utf8	documents Utf8	ids Utf8	numbers Int64
read	A document that discusses domestic policy	id1	1
unread	A document that discusses international affairs	id2	2
read	A document that discusses kittens	id3	3
unread	A document that discusses dogs	id4	4
read	A document that discusses chocolate	id5	5

(Showing first 5 rows)

vexpresso's `Collection` methods return `Collection` objects, allowing for complex chaining of calls¶

Embed Data¶

Let's add a list of dummy vectors to represent embeddings!¶

In [5]:

embeddings= [
    [1.1, 2.3, 3.2],
    [4.5, 6.9, 4.4],
    [1.1, 2.3, 3.2],
    [4.5, 6.9, 4.4],
    [1.1, 2.3, 3.2],
    [4.5, 6.9, 4.4],
    [1.1, 2.3, 3.2],
    [4.5, 6.9, 4.4],
]

In [6]:

collection = collection.add_column("embeddings_documents", embeddings)

By default vexpresso is "lazy", meaning that nothing is executed until `.execute` is called¶

Note: this can be bypassed by passing lazy=False

collection = collection.add_column("embeddings_documents", embeddings, lazy=False)

In [8]:

collection

Out[8]:

+----------+-------------+--------+-----------+------------------------+
| status   | documents   | ids    | numbers   | embeddings_documents   |
| Utf8     | Utf8        | Utf8   | Int64     | List[item:Float64]     |
+==========+=============+========+===========+========================+
+----------+-------------+--------+-----------+------------------------+
(No data to display: Dataframe not materialized)

Let's call `.execute` (or `.collect`) to materialize the embeddings¶

In [9]:

collection = collection.execute()

In [10]:

collection.show(5)

Out[10]:

status Utf8	documents Utf8	ids Utf8	numbers Int64	embeddings_documents List[item:Float64]
read	A document that discusses domestic policy	id1	1	[1.1, 2.3, 3.2]
unread	A document that discusses international affairs	id2	2	[4.5, 6.9, 4.4]
read	A document that discusses kittens	id3	3	[1.1, 2.3, 3.2]
unread	A document that discusses dogs	id4	4	[4.5, 6.9, 4.4]
read	A document that discusses chocolate	id5	5	[1.1, 2.3, 3.2]

(Showing first 5 rows)

Let's take a look at the embeddings¶

We can easily grab the raw data in the form of a dictionary or a list

In [11]:

collection.to_dict()["embeddings_documents"][:3]

Out[11]:

[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2]]

In [12]:

collection["embeddings_documents"].to_list()[:3]

Out[12]:

[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2]]
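
Because the raw values come back as plain nested Python lists, they can be handed to NumPy for further numeric work. A minimal sketch, using the three rows shown above as a literal rather than calling vexpresso (the names `raw` and `matrix` are illustrative only):

```python
import numpy as np

# The first three embeddings from the output above, copied in as a literal.
raw = [[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2]]

# Convert the nested list into a 2-D float array: one row per document.
matrix = np.asarray(raw, dtype=np.float64)
print(matrix.shape)  # (3, 3)
```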

Query¶

Normally, we would query with the same embedding function we used to embed the content. But since we entered the embeddings manually, let's create a simple embedding function that just returns an array of zeros
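
A minimal sketch of such a function, assuming it receives a list of strings and returns one 3-dimensional zero vector per string. That shape matches the dummy embeddings added earlier and the `embed_queries` output later in this tutorial; the exact signature vexpresso expects is an assumption here.

```python
def embed_fn(texts):
    """Return one 3-dimensional zero vector per input string (dummy embedding)."""
    # Assumption: the function is called with a list of strings and must return
    # a list of vectors with the same length as the stored dummy embeddings.
    return [[0.0, 0.0, 0.0] for _ in texts]


print(embed_fn(["test1", "test2"]))  # [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
```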

As you can see, we now have an `embeddings_documents` column. Let's query it and return the top 5 results!¶

In [14]:

queried = collection.query("embeddings_documents", query="test", embedding_fn=embed_fn, k=5, return_scores=True).execute()

We can see the actual similarity scores in the `embeddings_documents_score` column¶

In [15]:

queried.show(5)

Out[15]:

status Utf8	documents Utf8	ids Utf8	numbers Int64	embeddings_documents List[item:Float64]	embeddings_documents_score Float64
unread	A document that discusses international affairs	id2	2	[4.5, 6.9, 4.4]	0.976796
unread	A document that discusses dogs	id4	4	[4.5, 6.9, 4.4]	0.976796
unread	A document that is sixth that discusses government	id6	6	[4.5, 6.9, 4.4]	0.976796
unread	A document that discusses global affairs	id8	8	[4.5, 6.9, 4.4]	0.976796
read	A document that discusses international affairs	id7	7	[1.1, 2.3, 3.2]	0.931368

(Showing first 5 rows)
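
Conceptually, a top-k query scores every stored vector against the query vector and keeps the k best. The sketch below illustrates this with plain NumPy, assuming cosine similarity and a hypothetical all-ones query vector; the metric and retriever vexpresso actually uses are not stated here and may differ, and ties between identical vectors may be ordered differently.

```python
import numpy as np

ids = ["id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8"]
# The same alternating dummy vectors that were added to the collection.
embeddings = np.array([[1.1, 2.3, 3.2], [4.5, 6.9, 4.4]] * 4)

# Hypothetical query vector, chosen only for illustration.
query = np.ones(3)

# Cosine similarity between the query and every stored vector.
norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query)
scores = embeddings @ query / norms

# Indices of the k highest scores, best first.
k = 5
top = np.argsort(-scores, kind="stable")[:k]
top_ids = [ids[i] for i in top]
print(top_ids)
```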

You can also query with embeddings directly! You can call the embedding function yourself, but we recommend using the collection object's `embed_query` method for embedding functions that may require resources (like GPUs), or if you want to run on a Ray cluster

In [18]:

# you can just call the embedding function
embed_query = embed_fn(["test_1"])

In [19]:

embed_query = collection.embed_query("test1", embedding_fn=collection.embedding_functions["embeddings_documents"])

# you can also just pass in the string column name
embed_query = collection.embed_query("test1", embedding_fn="embeddings_documents")

queried = collection.query("embeddings_documents", query_embedding=embed_query, k=5).execute()
queried.show(5)

Out[19]:

status Utf8	documents Utf8	ids Utf8	numbers Int64	embeddings_documents List[item:Float64]
unread	A document that discusses international affairs	id2	2	[4.5, 6.9, 4.4]
unread	A document that discusses dogs	id4	4	[4.5, 6.9, 4.4]
unread	A document that is sixth that discusses government	id6	6	[4.5, 6.9, 4.4]
unread	A document that discusses global affairs	id8	8	[4.5, 6.9, 4.4]
read	A document that discusses international affairs	id7	7	[1.1, 2.3, 3.2]

(Showing first 5 rows)

We can also get a list of embeddings with `embed_queries`

In [20]:

embed_query = collection.embed_queries(["test1", "test2"], embedding_fn=collection.embedding_functions["embeddings_documents"])
print(embed_query)

[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

Sometimes you will want to batch queries together into a single call. vexpresso has a convenient `batch_query` function. This will return a list of Collections¶

In [21]:

queries = ["test_1", "test_5", "test_10"]

In [22]:

batch_queried = collection.batch_query("embeddings_documents", queries=queries, k=2)

We now have collections for each query¶

In [23]:

batch_queried[0].show(2)

Out[23]:

status Utf8	documents Utf8	ids Utf8	numbers Int64	embeddings_documents List[item:Float64]
unread	A document that is sixth that discusses government	id6	6	[4.5, 6.9, 4.4]
unread	A document that discusses global affairs	id8	8	[4.5, 6.9, 4.4]

(Showing first 2 rows)

In [24]:

batch_queried[1].show(2)

Out[24]:

status Utf8	documents Utf8	ids Utf8	numbers Int64	embeddings_documents List[item:Float64]
unread	A document that is sixth that discusses government	id6	6	[4.5, 6.9, 4.4]
unread	A document that discusses global affairs	id8	8	[4.5, 6.9, 4.4]

(Showing first 2 rows)

In [25]:

batch_queried[2].show(2)

Out[25]:

status Utf8	documents Utf8	ids Utf8	numbers Int64	embeddings_documents List[item:Float64]
unread	A document that is sixth that discusses government	id6	6	[4.5, 6.9, 4.4]
unread	A document that discusses global affairs	id8	8	[4.5, 6.9, 4.4]

(Showing first 2 rows)

Filtering¶

With `vexpresso`, filtering is super easy. The syntax is similar to `chromadb`¶

Filter dictionary must have the following structure:¶

{
    <field>: {
        <filter_method>: <value>
    },
    <field>: {
        <filter_method>: <value>
    },
}

Let's filter the original collection to only include rows with numbers > 2

In [39]:

filtered_collection = collection.filter(
    {
        "numbers":{
            "gt":2
        }
    }
).execute()

In [40]:

filtered_collection.show(5)

Out[40]:

status Utf8	documents Utf8	ids Utf8	numbers Int64	embeddings_documents List[item:Float64]
read	A document that discusses kittens	id3	3	[1.1, 2.3, 3.2]
unread	A document that discusses dogs	id4	4	[4.5, 6.9, 4.4]
read	A document that discusses chocolate	id5	5	[1.1, 2.3, 3.2]
unread	A document that is sixth that discusses government	id6	6	[4.5, 6.9, 4.4]
read	A document that discusses international affairs	id7	7	[1.1, 2.3, 3.2]

(Showing first 5 rows)

We can use multiple filter conditions as well¶

Let's filter the collection to only return rows with numbers <= 4 and status == "read"

In [61]:

filtered_collection = collection.filter(
    {
        "numbers":{
            "lte":4
        },
        "status":{
            "eq":"read"
        }
    }
).execute()

In [62]:

filtered_collection.show(5)

Out[62]:

status Utf8	documents Utf8	ids Utf8	numbers Int64	embeddings_documents List[item:Float64]
read	A document that discusses domestic policy	id1	1	[1.1, 2.3, 3.2]
read	A document that discusses kittens	id3	3	[1.1, 2.3, 3.2]

(Showing first 2 rows)
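
The result above suggests that conditions on different fields are combined with a logical AND. The plain-Python sketch below mimics that behaviour on a few rows; the `matches` helper and the `FILTER_METHODS` table are hypothetical illustrations of how such a dictionary can be read, not vexpresso's implementation.

```python
import operator

# Filter-method names used so far in this tutorial, mapped to Python comparisons.
FILTER_METHODS = {
    "gt": operator.gt,
    "lte": operator.le,
    "eq": operator.eq,
}


def matches(row, filter_conditions):
    """Return True if the row satisfies every <field>: {<filter_method>: <value>} entry."""
    for field, methods in filter_conditions.items():
        for method_name, expected in methods.items():
            if not FILTER_METHODS[method_name](row[field], expected):
                return False
    return True


rows = [
    {"ids": "id1", "numbers": 1, "status": "read"},
    {"ids": "id2", "numbers": 2, "status": "unread"},
    {"ids": "id3", "numbers": 3, "status": "read"},
    {"ids": "id5", "numbers": 5, "status": "read"},
]
conditions = {"numbers": {"lte": 4}, "status": {"eq": "read"}}

kept = [row["ids"] for row in rows if matches(row, conditions)]
print(kept)  # ['id1', 'id3']
```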

Sometimes you need a custom filtering function. With vexpresso, it's easy to do that with the `custom` filter keyword!¶

Let's filter a collection to only return rows with even numbers whose ids are in a given list

In [63]:

def custom_filter(number, mod_val) -> bool:
    return number % mod_val == 0

In [64]:

filtered_collection = collection.filter(
    {
        "numbers":{
            "custom":{"function":custom_filter, "function_kwargs":{"mod_val":2}}
        },
        "ids":{
            "isin":["id1", "id2", "id4"]
        }
    }
).execute()

In [65]:

filtered_collection.show(5)

Out[65]:

status Utf8	documents Utf8	ids Utf8	numbers Int64	embeddings_documents List[item:Float64]
unread	A document that discusses international affairs	id2	2	[4.5, 6.9, 4.4]
unread	A document that discusses dogs	id4	4	[4.5, 6.9, 4.4]

(Showing first 2 rows)
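
To see why only `id2` and `id4` survive, the same logic can be traced in plain Python. This sketch assumes the custom function is applied to each value of the column with `function_kwargs` passed as keyword arguments; how vexpresso invokes it internally is not shown in this tutorial.

```python
def custom_filter(number, mod_val) -> bool:
    return number % mod_val == 0


function_kwargs = {"mod_val": 2}
numbers = list(range(1, 9))
ids = [f"id{n}" for n in numbers]
allowed_ids = ["id1", "id2", "id4"]

# Keep a row only if its number passes the custom function AND its id is allowed.
kept = [
    row_id
    for row_id, number in zip(ids, numbers)
    if custom_filter(number, **function_kwargs) and row_id in allowed_ids
]
print(kept)  # ['id2', 'id4']
```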

You can also combine filters + queries in the same call¶

Let's query the collection with "test" and keep only even numbers

In [67]:

even_filter = {
    "numbers":{
        "custom":{"function":custom_filter, "function_kwargs":{"mod_val":2}}
    }
}

In [68]:

query_filtered_collection = collection.query("embeddings_documents", "test", k=10, filter_conditions=even_filter).execute()

In [69]:

query_filtered_collection.show(5)

Out[69]:

status Utf8	documents Utf8	ids Utf8	numbers Int64	embeddings_documents List[item:Float64]
unread	A document that discusses international affairs	id2	2	[4.5, 6.9, 4.4]
unread	A document that discusses dogs	id4	4	[4.5, 6.9, 4.4]
unread	A document that is sixth that discusses government	id6	6	[4.5, 6.9, 4.4]
unread	A document that discusses global affairs	id8	8	[4.5, 6.9, 4.4]

(Showing first 4 rows)

Chaining Functions¶

We can easily chain functions lazily¶

For instance, let's query and filter multiple times

In [70]:

even_filter = {
    "numbers":{
        "custom":{"function":custom_filter, "function_kwargs":{"mod_val":2}}
    }
}

In [71]:

chained_collection = collection.query("embeddings_documents", "test1", k=5) \
                               .filter(even_filter) \
                               .query("embeddings_documents", "test2", k=2) \
                               .filter({"numbers":{"lte":3}})

In [72]:

chained_collection.daft_df

Out[72]:

status Utf8	documents Utf8	ids Utf8	numbers Int64	embeddings_documents List[item:Float64]

(No data to display: Dataframe not materialized)

Here we queried for the closest 5 elements to "test1", filtered for only even numbers, queried top 2 of "test2", then filtered for numbers <= 3

In [75]:

chained_collection = chained_collection.execute()

In [76]:

chained_collection.show(5)

Out[76]:

status Utf8	documents Utf8	ids Utf8	numbers Int64	embeddings_documents List[item:Float64]
unread	A document that discusses international affairs	id2	2	[4.5, 6.9, 4.4]

(Showing first 1 rows)
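
The idea behind lazy chaining is that each call only records a step and returns a new object; the work happens when `execute` is called. The toy class below illustrates that pattern in plain Python. `LazyRows` is a hypothetical stand-in written for this example and says nothing about how vexpresso or daft implement it.

```python
class LazyRows:
    """Toy lazily evaluated sequence: steps are recorded, then run on execute()."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def filter(self, predicate):
        # Record the step and return a new object so calls can be chained.
        return LazyRows(self._rows, self._steps + (predicate,))

    def execute(self):
        # Only now are the recorded steps applied, in order.
        rows = list(self._rows)
        for predicate in self._steps:
            rows = [row for row in rows if predicate(row)]
        return rows


calls = []


def is_even(n):
    calls.append(n)  # track when the predicate actually runs
    return n % 2 == 0


pipeline = LazyRows(range(1, 9)).filter(is_even).filter(lambda n: n <= 3)
calls_before = len(calls)  # 0: building the chain did not run anything
result = pipeline.execute()
print(calls_before, result)  # 0 [2]
```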

Transforms¶

Sometimes you want to transform your data. Because of `daft`, you can use `vexpresso` to do this easily!¶

For example, let's add a new column named "replaced", in which "document" is changed to "replaced_document" for every entry of the `documents` column. Let's specify that this output is also a string type¶

For a full list of datatypes, visit daft documentation: https://www.getdaft.io/projects/docs/en/latest/api_docs/datatype.html

In [82]:

def simple_apply_fn(strings):
    return [
        s.replace("document", "replaced_document") for s in strings
    ]

In [83]:

transformed_collection = collection.apply(simple_apply_fn, collection["documents"], to="replaced", datatype=vexpresso.DataType.string()).execute()

In [84]:

transformed_collection.show(5)

Out[84]:

status Utf8	documents Utf8	ids Utf8	numbers Int64	embeddings_documents List[item:Float64]	replaced Utf8
read	A document that discusses domestic policy	id1	1	[1.1, 2.3, 3.2]	A replaced_document that discusses domestic policy
unread	A document that discusses international affairs	id2	2	[4.5, 6.9, 4.4]	A replaced_document that discusses international affairs
read	A document that discusses kittens	id3	3	[1.1, 2.3, 3.2]	A replaced_document that discusses kittens
unread	A document that discusses dogs	id4	4	[4.5, 6.9, 4.4]	A replaced_document that discusses dogs
read	A document that discusses chocolate	id5	5	[1.1, 2.3, 3.2]	A replaced_document that discusses chocolate

(Showing first 5 rows)

We can also pass in args, kwargs, and multiple columns into the apply function¶

For instance, let's append the number in the `numbers` column to each document in `documents`

In [86]:

def multi_column_apply_fn(string_columns, numbers):
    out = []
    for string, num in zip(string_columns, numbers):
        replaced = f"{string}_{num}"
        out.append(replaced)
    return out

In [87]:

transformed_collection = collection.apply(
    multi_column_apply_fn,
    collection["documents"],
    numbers=collection["numbers"],
    to="modified",
    datatype=vexpresso.DataType.string()
).execute()

In [88]:

transformed_collection.show(5)

Out[88]:

status Utf8	documents Utf8	ids Utf8	numbers Int64	embeddings_documents List[item:Float64]	modified Utf8
read	A document that discusses domestic policy	id1	1	[1.1, 2.3, 3.2]	A document that discusses domestic policy_1
unread	A document that discusses international affairs	id2	2	[4.5, 6.9, 4.4]	A document that discusses international affairs_2
read	A document that discusses kittens	id3	3	[1.1, 2.3, 3.2]	A document that discusses kittens_3
unread	A document that discusses dogs	id4	4	[4.5, 6.9, 4.4]	A document that discusses dogs_4
read	A document that discusses chocolate	id5	5	[1.1, 2.3, 3.2]	A document that discusses chocolate_5

(Showing first 5 rows)
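
Both apply functions in this section take whole columns and return a list of the same length, so they can be checked on plain Python lists before being handed to `apply`. A small sketch, assuming the column values arrive as ordinary sequences (the two-row sample is copied from the data above):

```python
def simple_apply_fn(strings):
    return [s.replace("document", "replaced_document") for s in strings]


def multi_column_apply_fn(string_columns, numbers):
    out = []
    for string, num in zip(string_columns, numbers):
        out.append(f"{string}_{num}")
    return out


documents = [
    "A document that discusses domestic policy",
    "A document that discusses international affairs",
]

replaced = simple_apply_fn(documents)
modified = multi_column_apply_fn(documents, numbers=[1, 2])

print(replaced[0])  # A replaced_document that discusses domestic policy
print(modified[1])  # A document that discusses international affairs_2
```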

Adding data¶

Saving + Loading¶

Once you've done a bunch of processing on a collection, you probably want to save it somewhere. Vexpresso supports local file saving + huggingface datasets¶

Let's save the `transformed_collection` above to the directory `saved_transformed_collection`

In [89]:

transformed_collection.save("./saved_collection/saved_transformed_collection")

saving to ./saved_collection/saved_transformed_collection

We can then load the collection with the same create function. Make sure to also include the embedding functions that were used on the original collection!

In [90]:

loaded_collection = vexpresso.create(
    directory_or_repo_id = "./saved_collection/saved_transformed_collection",
    embedding_functions = {"embeddings_documents":embed_fn}
)

In [91]:

loaded_collection.show(5)

Out[91]:

status Utf8	documents Utf8	ids Utf8	numbers Int64	embeddings_documents List[item:Float64]	modified Utf8
read	A document that discusses domestic policy	id1	1	[1.1, 2.3, 3.2]	A document that discusses domestic policy_1
unread	A document that discusses international affairs	id2	2	[4.5, 6.9, 4.4]	A document that discusses international affairs_2
read	A document that discusses kittens	id3	3	[1.1, 2.3, 3.2]	A document that discusses kittens_3
unread	A document that discusses dogs	id4	4	[4.5, 6.9, 4.4]	A document that discusses dogs_4
read	A document that discusses chocolate	id5	5	[1.1, 2.3, 3.2]	A document that discusses chocolate_5

(Showing first 5 rows)

Now let's upload to huggingface!¶

For this you'll need to install `huggingface-hub`

In [51]:

# !pip install huggingface-hub

The token is picked up automatically from the env variable: HUGGINGFACEHUB_API_TOKEN = ...

or you can pass in token directly via collection.save(token=...)
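
Before uploading, it can help to check whether the environment variable is actually set. A minimal sketch using only the standard library; the variable name comes from the note above, and the printed messages are illustrative:

```python
import os

# Environment variable name taken from the note above.
token = os.environ.get("HUGGINGFACEHUB_API_TOKEN")

if token is None:
    print("HUGGINGFACEHUB_API_TOKEN is not set; pass token=... explicitly instead.")
else:
    print("Token found in the environment.")
```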

In [92]:

username = "shyamsn97"
repo_name = "vexpresso_test_showcase"
# username = "REPLACE"
# repo_name = "REPLACE"

In [93]:

loaded_collection.save(hf_username = username, repo_name = repo_name, to_hub=True, )

Uploading collection to None

/home/shyam/miniconda3/envs/py39/lib/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

content.parquet: 100%|█████████████████████████████████████████████| 3.23k/3.23k [00:00<00:00, 6.79kB/s]

Upload 1 LFS files: 100%|█████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.51it/s]

Upload to shyamsn97/vexpresso_test_showcase complete!

Out[93]:

'shyamsn97/vexpresso_test_showcase'

The uploaded repo is private by default, but this can be changed with the `private` flag

In [54]:

# loaded_collection.save(hf_username = username, repo_name = repo_name, to_hub=True, private=False)

You can see an example of the above data: https://huggingface.co/datasets/shyamsn97/vexpresso_test_showcase

Now let's load it!¶

In [94]:

loaded_collection = vexpresso.create(
    hf_username = username,
    repo_name = repo_name,
    embedding_functions = {"embeddings_documents":embed_fn}
)

Retrieving from hf repo: shyamsn97/vexpresso_test_showcase

Fetching 2 files:  50%|█████████████████████████▌                         | 1/2 [00:00<00:00,  9.84it/s]
Downloading content.parquet: 100%|██████████████████████████████████| 3.23k/3.23k [00:00<00:00, 153kB/s]
Fetching 2 files: 100%|███████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.31it/s]

In [95]:

loaded_collection.show(5)

Out[95]:

status Utf8	documents Utf8	ids Utf8	numbers Int64	embeddings_documents List[item:Float64]	modified Utf8
read	A document that discusses domestic policy	id1	1	[1.1, 2.3, 3.2]	A document that discusses domestic policy_1
unread	A document that discusses international affairs	id2	2	[4.5, 6.9, 4.4]	A document that discusses international affairs_2
read	A document that discusses kittens	id3	3	[1.1, 2.3, 3.2]	A document that discusses kittens_3
unread	A document that discusses dogs	id4	4	[4.5, 6.9, 4.4]	A document that discusses dogs_4
read	A document that discusses chocolate	id5	5	[1.1, 2.3, 3.2]	A document that discusses chocolate_5

(Showing first 5 rows)
