The SQL layer requires all aggregates to be computed precisely by the backend engine. This enables us to deal with very large datasets that cannot efficiently be stored in memory. However, it makes it tricky to compute differentially private quantiles such as the median.
We are considering an approach where the backend engine computes 100 exact quantiles for each aggregate row in the output. This raises several questions:
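As a rough illustration, the rewrite for a single MEDIAN could look like the Python sketch below. The percentile_disc spelling, the pums table, and the helper name are illustrative assumptions; the real rewriter would operate on the parsed query rather than on strings.

```python
# Illustrative only: one MEDIAN(age) in the original query becomes 100
# exact quantile expressions in the rewritten query. percentile_disc is
# the SQL-standard ordered-set aggregate; the actual rewriter would emit
# engine-appropriate syntax.
def rewrite_median(column, n_quantiles=100):
    exprs = [
        f"percentile_disc({i / n_quantiles}) WITHIN GROUP (ORDER BY {column})"
        f" AS {column}_q{i}"
        for i in range(1, n_quantiles + 1)
    ]
    return ", ".join(exprs)

# Original:  SELECT educ, MEDIAN(age) FROM pums GROUP BY educ
rewritten = f"SELECT educ, {rewrite_median('age')} FROM pums GROUP BY educ"
```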
Accuracy
It's unclear how the accuracy of this approach will compare with traditional methods that access the whole rowset at once. The distribution of the data could also affect accuracy, since 100 evenly spaced quantiles capture the shape of the distribution only coarsely.
Rewritten query size
This would greatly increase the size of the rewritten query, since each MEDIAN in the original SELECT would turn into 100 new quantile expressions in the query sent to the backend, as in the sketch above. It's conceivable that large queries requesting many quantiles would run into size limitations with the rewritten query. This could be worked around by issuing multiple queries, but the maximum query size differs by engine.
Engine performance
Most modern database engines can optimize all 100 quantile calls over the same column to share a sort, so the query performance hit should not be large compared to requesting a single exact quantile. However, it's possible that some engines have trouble with this, in which case we could see unacceptable slowdowns in some queries.
Result set complexity
This will add a lot of extra columns to the response from the database. For example, a query with 5 MEDIAN columns would result in 500 columns in each response row. More complex queries could exceed the column limits, which vary by engine. We can work around this by asking the engine to aggregate each set of 100 quantiles into a single comma-separated list, as sketched below.
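One possible shape for that workaround, sketched in Postgres-flavored SQL (generate_series and STRING_AGG are Postgres spellings, and the table is illustrative): compute the quantiles as rows rather than columns, then collapse each group's 100 values into one string.

```python
# Postgres-flavored sketch: produce one row per (group, percentile
# fraction), then collapse the 100 quantiles per group into a single
# comma-separated column. Other engines would need their own string
# aggregate (see the dialect list in the next section).
query = """
SELECT educ,
       STRING_AGG(CAST(q AS VARCHAR), ',' ORDER BY frac) AS age_quantiles
FROM (
    SELECT educ, f.frac,
           percentile_disc(f.frac) WITHIN GROUP (ORDER BY age) AS q
    FROM pums
    CROSS JOIN (SELECT generate_series(1, 100) / 100.0 AS frac) AS f
    GROUP BY educ, f.frac
) AS t
GROUP BY educ
"""
```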
Engine cross-compatibility
Most modern engines support percentile_cont and percentile_disc from the SQL standard, which can be used to calculate exact quantiles. However, two big gaps are MySQL and SQLite. We could hand-implement quantiles for these engines by sorting and ranking (see the sketch below), but this would greatly increase the complexity of the rewritten query. Since SQLite is our Pandas solution, we could also just extract the in-memory rows and compute quantiles that way, leaving only MySQL as a gap. Alternatively, we could simply disable support on one or both of these engines.
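As a sketch of the hand-rolled fallback: both MySQL 8+ and SQLite 3.25+ support window functions, so NTILE bucketing could approximate the quantiles. Note that NTILE bucket boundaries don't exactly match percentile_disc semantics, so cross-engine agreement would need testing.

```python
# Window-function fallback for engines lacking percentile_disc.
# NTILE(100) splits the ordered rows of each group into 100 buckets;
# the max of each bucket stands in for that discrete quantile.
query = """
SELECT educ, bucket, MAX(age) AS q
FROM (
    SELECT educ, age,
           NTILE(100) OVER (PARTITION BY educ ORDER BY age) AS bucket
    FROM pums
) AS ranked
GROUP BY educ, bucket
"""
```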
Additionally, there is no standardized SQL way to aggregate the 100 quantiles into a comma-separated string, so we would need to use engine-specific syntax: ARRAY_AGG (U-SQL), STRING_AGG (T-SQL), GROUP_CONCAT (MySQL, SQLite), array_join (SparkSQL), LISTAGG (Oracle, Redshift). The Reader architecture supports this in a straightforward way, but we would need testing to ensure there aren't differences in how each engine implements the functionality. This would also require us to update the grammar to support all of these functions if we want the output queries to always round-trip.
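A hypothetical sketch of how the per-engine Readers might parameterize that step; the dialect keys and overall shape are illustrative, not the actual Reader interface:

```python
# Illustrative dialect map for the list-aggregation step; not the real
# Reader API. Each entry produces the engine's spelling of
# "concatenate these values with commas".
LIST_AGG = {
    "postgres": lambda col: f"STRING_AGG(CAST({col} AS VARCHAR), ',')",
    "tsql":     lambda col: f"STRING_AGG(CAST({col} AS VARCHAR(32)), ',')",
    "mysql":    lambda col: f"GROUP_CONCAT({col} SEPARATOR ',')",
    "sqlite":   lambda col: f"GROUP_CONCAT({col}, ',')",
    "spark":    lambda col: f"array_join(collect_list({col}), ',')",
    "oracle":   lambda col: f"LISTAGG({col}, ',') WITHIN GROUP (ORDER BY {col})",
}
```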
Alternate methods
For queries with only a few MEDIAN requests, it might be better to uniformly sample some number of rows (e.g. 10,000) and return them to the client for in-memory computation of a differentially private quantile. This has the disadvantage of seriously limiting the number of quantile columns per query, and would have an unknown impact on utility. But it would have the advantage of using the same rewritten query on all database engines.
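As a concrete sketch of the in-memory step, assuming a textbook exponential-mechanism quantile rather than any particular library implementation (and ignoring the additional privacy accounting that the sampling itself would require):

```python
import numpy as np

def dp_median(values, lo, hi, epsilon, rng=None):
    """Exponential-mechanism median over values clamped to [lo, hi].

    Textbook construction, not the project's code: candidate intervals
    are the gaps between sorted values, each weighted by its length
    times exp(epsilon/2 * utility), where utility is minus the distance
    of the interval's rank from n/2.
    """
    rng = rng or np.random.default_rng()
    x = np.sort(np.clip(values, lo, hi))
    n = len(x)
    edges = np.concatenate(([lo], x, [hi]))   # n + 1 candidate intervals
    widths = np.diff(edges)
    utility = -np.abs(np.arange(n + 1) - n / 2)
    # Work in log space to avoid overflow; tiny floor keeps log(0) out.
    logw = (epsilon / 2) * utility + np.log(np.maximum(widths, 1e-12))
    logw -= logw.max()
    probs = np.exp(logw)
    probs /= probs.sum()
    i = rng.choice(n + 1, p=probs)
    return rng.uniform(edges[i], edges[i + 1])

# Rows could come from an engine-specific sampling query, e.g.
# "SELECT age FROM pums ORDER BY RANDOM() LIMIT 10000" on Postgres/SQLite
# (RAND() on MySQL, NEWID() on SQL Server).
```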
For tables with few rows, we could return all rows and do in-memory processing. There might also be cases where it makes the most sense for the private_reader to issue multiple backend queries in sequence, for example, to perform a binary search (sketched below). Both of these options would result in different accuracy and privacy usage across engines, though, which may not be desirable.
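A hypothetical control loop for the binary-search option; execute_noisy_count is an assumed helper, not an existing API, that runs one differentially private COUNT through the private reader and charges the budget on every probe:

```python
# Hypothetical sketch of a sequential binary search for the median.
# Each probe costs one noisy COUNT, so with `steps` probes the total
# privacy spend is roughly steps times the per-count budget.
def bisect_median(execute_noisy_count, total, lo, hi, steps=16):
    for _ in range(steps):
        mid = (lo + hi) / 2
        below = execute_noisy_count(
            f"SELECT COUNT(*) FROM pums WHERE age <= {mid}"
        )
        if below < total / 2:
            lo = mid          # median lies above the probe point
        else:
            hi = mid          # median lies at or below the probe point
    return (lo + hi) / 2
```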
Finally, there might be some approaches we aren't aware of.