You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
It has been found that caching is great for user performance but has major drawbacks when combined with xloader and similar systems.
Unless the filename is changed, the previous file will be returned till cache has expired, but by then the xloader job has already run and ignored re-building the table.
To fix this, the proposal is to add a query string which contains the last modified date to the 302 redirect on public s3 access.
Why has this not occurred with signed urls before?
Internal the the ckanext-s3filestore when a signed url is generated for a resource, it is stored in redis to speed up the user experience. This is cleared when the file is updated so that the next file retrieval re-populates the redis key/value store.
Note: We cannot do the same with CloudFront due to costs associated with invalidations. After the first 1000 cache invalidations per month they then charge $0.005 USD per invalidation which can add up quite quickly.
The text was updated successfully, but these errors were encountered:
duttonw
changed the title
Add query string etga to redirects to s3 bucket where pre-signed urls are not in use (ensuring file freshness)
Add query string etag to redirects to s3 bucket where pre-signed urls are not in use (ensuring file freshness)
Nov 23, 2021
@duttonw It occurs to me that the same issue will likely affect any other extensions that download the resource, such as ckanext-validation or ckanext-archiver (although the latter should be okay because it just puts in a pointer to the S3 file).
Unless our ckan api extension is accepted in core, i'm uncertain how you would do this for file based storage. Hopefully with our contributions to the s3 extensions, it can be uplifted into the core at a later date.
with the use of attaching the etag (or simliar unique hash of the file) of the file on the query string, it should hopefully bypass all caches that it would hit. This would also help with validation of the file schema if setup also.
It has been found that caching is great for user performance but has major drawbacks when combined with xloader and similar systems.
Unless the filename is changed, the previous file will be returned till cache has expired, but by then the xloader job has already run and ignored re-building the table.
To fix this, the proposal is to add a query string which contains the last modified date to the 302 redirect on public s3 access.
Why has this not occurred with signed urls before?
Internal the the ckanext-s3filestore when a signed url is generated for a resource, it is stored in redis to speed up the user experience. This is cleared when the file is updated so that the next file retrieval re-populates the redis key/value store.
Note: We cannot do the same with CloudFront due to costs associated with invalidations. After the first 1000 cache invalidations per month they then charge $0.005 USD per invalidation which can add up quite quickly.
Solution:
Inside https://github.com/qld-gov-au/ckanext-s3filestore/blob/master/ckanext/s3filestore/uploader.py#L270 we already have metadata of the file such as Last-Modified or etag. For simplicity, let us go with etag as it won't change till the file has been altered and won't interfere with our cache system performance.
so around https://github.com/qld-gov-au/ckanext-s3filestore/blob/master/ckanext/s3filestore/uploader.py#L296 for public_read files, add ?etag=${etag}
This does not need to be done for pre-signed urls.
info on etags for s3: https://teppen.io/2018/06/23/aws_s3_etags/
The text was updated successfully, but these errors were encountered: