Skip to content

Commit

Permalink
Advanced caching updates feb (streamlit#1040)
Browse files Browse the repository at this point in the history
* Updating caching section in main concepts.

* Moving Caching doc over from Google with minor contextual edits. Another pass is still needed. There are placeholders for advanced caching and troubleshooting documents.

* Caching issues is still incomplete.

* Removing content for push later today. Will add advanced and caching issues next rev.

* Addressing Amanda's comments.

* Update

* Advanced caching and issues.

* Edits requested by Thiago

* Fixing link

* Adjusting main concepts to match the google docs section verbatim

* Small tweaks: renamed "items" → "components" and changed verb tenses here and there

* Forgot one verb tense change

* Rename "items" → "components"

* Update PR directly instead of commenting.

Co-authored-by: Thiago Teixeira <[email protected]>
  • Loading branch information
erikhopf and tvst authored Feb 9, 2020
1 parent 7d2c54d commit 4004a60
Show file tree
Hide file tree
Showing 6 changed files with 560 additions and 24 deletions.
169 changes: 169 additions & 0 deletions docs/advanced_caching.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,169 @@
# Advanced caching

In [caching](caching.md), you learned about the Streamlit cache, which is accessed with the [`@st.cache`](api.html#streamlit.cache) decorator. In this article you'll see how Streamlit's caching functionality is implemented, so that you can use it to improve the performance of your Streamlit apps.

The cache is a key-value store, where the key is a hash of:

1. The input parameters that you called the function with
1. The value of any external variable used in the function
1. The body of the function
1. The body of any function used inside the cached function

And the value is a tuple of:

- The cached output
- A hash of the cached output (you'll see why soon)

For both the key and the output hash, Streamlit uses a specialized hash function that knows how to traverse code, hash special objects, and can have its [behavior customized by the user](#the-hash-funcs-parameter).

For example, when the function `expensive_computation(a, b)`, decorated with [`@st.cache`](api.html#streamlit.cache), is executed with `a=2` and `b=21`, Streamlit does the following:

1. Computes the cache key
1. If the key is found in the cache, then:
- Extracts the previously-cached (output, output_hash) tuple.
- Performs an **Output Mutation Check**, where a fresh hash of the output is computed and compared to the stored `output_hash`.
- If the two hashes are different, shows a **Cached Object Mutated** warning. (Note: Setting `allow_output_mutation=True` disables this step).
1. If the input key is not found in the cache, then:
- Executes the cached function (i.e. output = `expensive_computation(2, 21)`).
- Calculates the `output_hash` from the function's `output`.
- Stores `key → (output, output_hash)` in the cache.
1. Returns the output.

If an error is encountered an exception is raised. If the error occurs while hashing either the key or the output an `UnhashableTypeError` error is thrown. If you run into any issues, see [fixing caching issues](troubleshooting/caching_issues.md).

## The `hash_funcs` parameter

As described above, Streamlit's caching functionality relies on hashing to calculate the key for cached objects, and to detect unexpected mutations in the cached result.

For added expressive power, Streamlit lets you override this hashing process using the `hash_funcs` argument. Suppose you define a type called `FileReference` which points to a file in the filesystem:

```Python
class FileReference:
def __init__(self, filename):
self.filename = filename


@st.cache
def func(file_reference):
...
```

By default, Streamlit hashes custom classes like `FileReference` by recursively navigating their structure. In this case, its hash is the hash of the filename property. As long as the file name doesn't change, the hash will remain constant.

However, what if you wanted to have the hasher check for changes to the file's modification time, not just its name? This is possible with [`@st.cache`](api.html#streamlit.cache)'s `hash_funcs` parameter:

```Python
class FileReference:
def __init__(self, filename):
self.filename = filename

def hash_file_reference(file_reference):
filename = file_reference.filename
return (filename, os.path.getmtime(filename))

@st.cache(hash_funcs={FileReference: hash_file_reference})
def func(file_reference):
...
```

Additionally, you can hash `FileReference` objects by the file's contents:

```Python
class FileReference:
def __init__(self, filename):
self.filename = filename

def hash_file_reference(file_reference):
with open(file_reference.filename) as f:
return f.read()

@st.cache(hash_funcs={FileReference: hash_file_reference})
def func(file_reference):
...
```

```eval_rst
.. note::
Because Streamlit's hash function works recursively, you don't have to hash the contents inside `hash_file_reference` Instead, you can return a primitive type, in this case the contents of the file, and Streamlit's internal hasher will compute the actual hash from it.
```

## Typical hash functions

While it's possible to write custom hash functions, let's take a look at some of the tools that Python provides out of the box. Here's a list of some hash functions and when it makes sense to use them.

Python's [`id`](https://docs.python.org/3/library/functions.html#id) function | [Example](#example-1-pass-a-database-connection-around)

- Speed: Fast
- Use case: If you're hashing a singleton object, like an open database connection or a TensorFlow session. These are objects that will only be instantiated once, no matter how many times your script reruns.
- [Example](#example-1-pass-a-database-connection-around)<br/>

`lambda _: None` | [Example](#example-2-turn-off-hashing-for-a-specific-type)

- Speed: Fast
- Use case: If you want to turn off hashing of this type. This is useful if you know the object is not going to change.

Python's [`hash()`](https://docs.python.org/3/library/functions.html#hash) function | [Example](#example-3-use-python-s-hash-function)

- Speed: Can be slow based the size of the object being cached
- Use case: If Python already knows how to hash this type correctly.

Custom hash function | [Example](#the-hash-funcs-parameter)

- Speed: N/a
- Use case: If you'd like to override how Streamlit hashes a particular type.

## Example 1: Pass a database connection around

Suppose we want to open a database connection that can be reused across multiple runs of a Streamlit app. For this you can make use of the fact that cached objects are stored by reference to automatically initialize and reuse the connection:

```Python
@st.cache(allow_output_mutation=True)
def get_database_connection():
return db.get_connection()
```

With just 3 lines of code, the database connection is created once and stored in the cache. Then, every subsequent time `get_database_conection` is called, the already-created connection object is reused automatically. In other words, it becomes a singleton.

```eval_rst
.. tip::
Use the `allow_output_mutation=True` flag to suppress the immutability check. This prevents Streamlit from trying to hash the output connection, and also turns off Streamlit's mutation warning in the process.
```

What if you want to write a function that receives a database connection as input? For that, you'll use `hash_funcs`:

```Python
@st.cache(hash_funcs={DBConnection: id})
def get_users(connection):
# Note: We assume that connection is of type DBConnection.
return connection.execute_sql('SELECT * from Users')
```

Here, we use Python's built-in `id` function, because the connection object is coming from the Streamlit cache via the `get_database_conection` function. This means that the same connection instance is passed around every time, and therefore it always has the same id. However, if you happened to have a second connection object around that pointed to an entirely different database, it would still be safe to pass it to `get_users` because its id is guaranteed to be different than the first id.

These design patterns apply any time you have an object that points to an external resource, such as a database connection or Tensorflow session.

## Example 2: Turn off hashing for a specific type

You can turn off hashing entirely for a particular type by giving it a custom hash function that returns a constant. One reason that you might do this is to avoid hashing large, slow-to-hash objects that you know are not going to change. For example:

```Python
@st.cache(hash_funcs={pd.DataFrame: lambda _: None})
def func(huge_constant_dataframe):
...
```

When Streamlit encounters an object of this type, it always converts the object into `None`, no matter which instance of `FooType` its looking at. This means all instances are hash to the same value, which effectively cancels out the hashing mechanism.

## Example 3: Use Python's `hash()` function

Sometimes, you might want to use Python’s default hashing instead of Streamlit's. For example, maybe you've encountered a type that Streamlit is unable to hash, but it's hashable with Python's built-in `hash()` function. In that case, the solution is quite simple:

```Python
@st.cache(hash_funcs={FooType: hash})
def func(...):
...
```

## Next steps

- [Advanced concepts](advanced_concepts.md)
Loading

0 comments on commit 4004a60

Please sign in to comment.