Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add query string etag to redirects to s3 bucket where pre-signed urls are not in use (ensuring file freshness) #59

Closed
duttonw opened this issue Nov 23, 2021 · 3 comments
Assignees

Comments

@duttonw
Copy link
Member

duttonw commented Nov 23, 2021

It has been found that caching is great for user performance but has major drawbacks when combined with xloader and similar systems.

Unless the filename is changed, the previous file will be returned till cache has expired, but by then the xloader job has already run and ignored re-building the table.

To fix this, the proposal is to add a query string which contains the last modified date to the 302 redirect on public s3 access.

Why has this not occurred with signed urls before?
Internal the the ckanext-s3filestore when a signed url is generated for a resource, it is stored in redis to speed up the user experience. This is cleared when the file is updated so that the next file retrieval re-populates the redis key/value store.

Note: We cannot do the same with CloudFront due to costs associated with invalidations. After the first 1000 cache invalidations per month they then charge $0.005 USD per invalidation which can add up quite quickly.

Solution:
Inside https://github.com/qld-gov-au/ckanext-s3filestore/blob/master/ckanext/s3filestore/uploader.py#L270 we already have metadata of the file such as Last-Modified or etag. For simplicity, let us go with etag as it won't change till the file has been altered and won't interfere with our cache system performance.

so around https://github.com/qld-gov-au/ckanext-s3filestore/blob/master/ckanext/s3filestore/uploader.py#L296 for public_read files, add ?etag=${etag}
This does not need to be done for pre-signed urls.

info on etags for s3: https://teppen.io/2018/06/23/aws_s3_etags/

@duttonw duttonw changed the title Add query string etga to redirects to s3 bucket where pre-signed urls are not in use (ensuring file freshness) Add query string etag to redirects to s3 bucket where pre-signed urls are not in use (ensuring file freshness) Nov 23, 2021
@ThrawnCA
Copy link
Contributor

ThrawnCA commented Nov 23, 2021

@duttonw It occurs to me that the same issue will likely affect any other extensions that download the resource, such as ckanext-validation or ckanext-archiver (although the latter should be okay because it just puts in a pointer to the S3 file).

@duttonw
Copy link
Member Author

duttonw commented Nov 23, 2021

Unless our ckan api extension is accepted in core, i'm uncertain how you would do this for file based storage. Hopefully with our contributions to the s3 extensions, it can be uplifted into the core at a later date.

with the use of attaching the etag (or simliar unique hash of the file) of the file on the query string, it should hopefully bypass all caches that it would hit. This would also help with validation of the file schema if setup also.

ThrawnCA added a commit that referenced this issue Nov 23, 2021
@duttonw
Copy link
Member Author

duttonw commented Nov 24, 2021

#60 fixes this

@duttonw duttonw closed this as completed Nov 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants