Skip to content

Latest commit

 

History

History
710 lines (579 loc) · 33.8 KB

README.md

File metadata and controls

710 lines (579 loc) · 33.8 KB

This document is an individual draft proposal. It has not been adopted by the Private Advertising Technology Community Group.


Private Aggregation API explainer

Author: Alex Turner ([email protected])

Table of Contents

Introduction

This proposal introduces a generic mechanism for measuring aggregate, cross-site data in a privacy preserving manner.

Browsers are now working to prevent cross-site user tracking, including by partitioning storage and removing third-party cookies. There are a range of API proposals to continue supporting legitimate use cases in a way that respects user privacy. Many of these proposals, including Shared Storage and Protected Audience, plan to isolate potentially identifying cross-site data in special contexts, which ensures that the data cannot escape the user agent.

Relative to cross-site data from an individual user, aggregate data about groups of users can be less sensitive and yet would be sufficient for a wide range of use cases. An aggregation service has been proposed to allow reporting noisy, aggregated cross-site data. This service was originally proposed for use by the Attribution Reporting API, but allowing more general aggregation would support additional use cases. In particular, the Protected Audience and Shared Storage proposals expect this functionality to become available.

So, to complement the Attribution Reporting API, we propose a general-purpose Private Aggregation API that can be called from a wide array of contexts, including isolated contexts that have access to cross-site data (such as a shared storage worklet). Within these contexts, potentially identifying data could be encapsulated into "aggregatable reports". To prevent leakage, the cross-site data in these reports would be encrypted to ensure it can only be processed by the aggregation service. During processing, this service will add noise and impose limits on how many queries can be performed.

This API introduces a contributeToHistogram() function; see examples below. This call registers a histogram contribution for reporting. Later, the browser constructs an aggregatable report, which contains an encrypted payload with the specified contribution(s) for later computation via the aggregation service. The API queues the constructed report to be sent to the reporting endpoint of the script's origin (in other words, the reporting origin) after a delay. The report format and endpoint paths are detailed below. After the endpoint receives the reports, it batches the reports and sends them to the aggregation service for processing. The output of that process is a summary report containing the (approximate) result, which is dispatched back to the reporting origin.

See the Private Aggregation API specification.

Examples

Protected Audience reporting

The Protected Audience API plans to run on-device ad auctions using cross-site data as an input. The Private Aggregation API will allow measurement of the auction results from within the isolated execution environments.

For example, a key measurement use case is to report the price of the auctions' winning bids. This tells the seller how much they should be paid and who should pay them. To support this, each seller's JavaScript could define a reportResult() function. For example:

function reportResult(auctionConfig, browserSignals) {
  // Helper functions that map each buyer to its predetermined bucket and scales
  // each bid appropriately for measurement, see scaling values below.
  function convertBuyerToBucketId(buyer_origin) {  }
  function convertBidToReportingValue(winning_bid_price) {  }

  // The user agent sends the report to the reporting endpoint of the script's
  // origin (that is, the caller of `runAdAuction()`) after a delay.
  privateAggregation.contributeToHistogram({
    // Note: the bucket must be a BigInt and the value an integer Number.
    bucket: convertBuyerToBucketId(browserSignals.interestGroupOwner),
    value: convertBidToReportingValue(browserSignals.bid)
  });
}

The buyer can make their own measurements, which could be used to verify the seller's information. To support this, each buyer's JavaScript would define a reportWin() function (and possibly also a reportLoss() function). For example:

function reportWin(auctionSignals, perBuyerSignals, sellerSignals, browserSignals) {
  // The buyer defines their own similar functions.
  function convertSellerToBucketId(seller_origin) {  }
  function convertBidToReportingValue(winning_bid_price) {  }

  privateAggregation.contributeToHistogram({
    bucket: convertSellerToBucketId(browserSignals.seller),
    value: convertBidToReportingValue(browserSignals.bid),
  });
}

Measuring user demographics with cross-site information

publisher.example wants to measure the demographics of its user base, for example, a histogram of number of users split by age ranges. demo.example is a popular site that knows the demographics of its users. publisher.example embeds demo.example as a third-party, allowing it to measure the demographics of the overlapping users.

First, demo.example saves these demographics to its shared storage when it is the top level site:

sharedStorage.set("demo", '{"age": "40-49", ...}');

Then, in a demo.example iframe on publisher.example, the appropriate shared storage operation is triggered once for each user:

await sharedStorage.worklet.addModule("measure-demo.js");
await sharedStorage.run("send-demo-report");

Shared storage worklet script (i.e. measure-demo.js):

class SendDemoReportOperation {
  async function run() {
    let demo_string = await sharedStorage.get("demo");
    let demo = {};
    if (demo_string) {
      demo = JSON.parse(demo_string);
    }

    // Helper function that maps each age range to its predetermined bucket, or
    // a special unknown bucket e.g. if the user has not visited `demo.example`.
    function convertAgeToBucketId(country_string_or_undefined) {  }

    // The report will be sent to `demo.example`'s reporting endpoint after a
    // delay.
    privateAggregation.contributeToHistogram({
      bucket: convertAgeToBucketId(demo["age"]);
      value: 128,  // A predetermined fixed value, see scaling values below.
    });

    // Could add more contributeToHistogram() calls to measure other demographics.
  }
}
register("send-demo-report", SendDemoReportOperation);

Goals

This API aims to support a wide range of aggregation use cases, including measurement of demographics and reach, while remaining generic and flexible. We intend for this API to be callable in as many contexts and situations as possible, including the isolated contexts used by other privacy-preserving API proposals for processing cross-site data. This will help to foster continued growth, experimentation, and rapid iteration in the web ecosystem; to support a thriving, open web; and to prevent ossification and unnecessary rigidity.

This API also intends to avoid the privacy risks presented by unpartitioned storage and third-party cookies. In particular, it seeks to prevent off-browser cross-site recognition of users. Developer adoption of this API will help to replace the usage of third-party cookies, making the web more private by default.

Non-goals

This API does not intend to regulate what data is allowed as an input to aggregation. Instead, the aggregation service will protect this input by adding noise to limit the impact of any individual's input data on the output. Learn more about contribution bounding and budgeting below.

Further, this API does not seek to prevent (probabilistic) cross-site inference about sufficiently large groups of people. That is, learning high confidence properties of large groups is ok as long as we can bound how much an individual affects the aggregate measurement. See also discussion of this non-goal in other settings.

Operations

This current design supports one operation: constructing a histogram. This operation matches the description in the Attribution Reporting API with Aggregatable Reports explainer, with a fixed domain of 'buckets' that the reports contribute bounded integer 'values' to. Note that sums can be computed using the histogram operation by contributing values to a fixed, predetermined bucket and ignoring the returned values for all other buckets after querying.

Over time, we should be able to support additional operations by extending the aggregation service infrastructure. For example, we could add a 'count distinct' operation that, like the histogram operation, uses a fixed domain of buckets, but without any values. Instead, the computed result would be (approximately) how many unique buckets the reports contributed to. Other possible additions include supporting federated learning via privately aggregating machine learning update vectors or extending the histogram operation to support values that are vectors of integers rather than only scalars.

The operation would be indicated by using the appropriate JavaScript call, e.g. contributeToHistogram() and contributeToDistinctCount() for histograms and count distinct, respectively.

Reports

The report, including the payload, will mirror the structure proposed for the Attribution Reporting API. However, a few details will change. For example, fields with no equivalent on this API (e.g. attribution_destination and source_registration_time) won't be present. Additionally, the api field will contain either "protected-audience" or "shared-storage" to reflect which API's context requested the report.

The following is an example report showing the JSON format

{
  "shared_info": "{\"api\":\"protected-audience\",\"report_id\":\"[UUID]\",\"reporting_origin\":\"https://reporter.example\",\"scheduled_report_time\":\"[timestamp in seconds]\",\"version\":\"[api version]\"}",

  "aggregation_service_payloads": [
    {
      "payload": "[base64-encoded data]",
      "key_id": "[string]",

      // Optional debugging information if debugging is enabled
      "debug_cleartext_payload": "[base64-encoded unencrypted payload]",
    }
  ],

  // Optional debugging information if debugging is enabled and debug key specified
  "debug_key": "[64 bit unsigned integer]"
}

As described earlier, these reports will be sent to the reporting origin after a delay. The URL path used for sending the reports will be /.well-known/private-aggregation/report-protected-audience and /.well-known/private-aggregation/report-shared-storage for reports triggered within a Protected Audience or Shared Storage context, respectively.

Temporary debugging mechanism

While third-party cookies are still available, we plan to have a temporary mechanism available that allows for easier debugging. This mechanism involves temporarily relaxing some privacy constraints. It will help ensure that the API can be fully understood during roll-out and help flush out any bugs (either in browser or caller code), and more easily compare the performance to cookie-based alternatives.

This mechanism is similar to Attribution Reporting API's debug aggregatable reports. When the debug mode is enabled for a report, a cleartext version of the payload will be included in the report. Additionally, the shared_info will also include the flag "debug_mode": "enabled" to allow the aggregation service to support debugging functionality on these reports.

This data will only be available in a transitional phase while third-party cookies are available and are already capable of user tracking. The debug mode will only be enabled for contexts that are able to access third-party cookies. That is, it will be disabled if third-party cookies are disabled/deprecated generally or for a particular site/context; note that this also means debug mode will automatically become deprecated when third-party cookies are deprecated.

Though the debug mode is tied to third-party cookie availability, browsers may temporarily allow debug mode without third-party cookies in order to support testing, such as the browsers in the Mode B group of Chrome-facilitated testing.

Enabling

The following javascript call will then enable debug mode for all future reports requested in that context (e.g. Shared Storage operation or Protected Audience function call):

privateAggregation.enableDebugMode();

The browser can optionally apply debug mode to reports requested earlier in that context.

This javascript function can only be called once per context. Any subsequent calls will throw an exception.

Debug keys

To allow sites to associate reports with the contexts that triggered them, we also allow setting 64-bit unsigned integer debug keys. These keys are passed as an optional field to the javascript call, for example:

privateAggregation.enableDebugMode({debugKey: 1234n});

Duplicate debug report

When debug mode is enabled, an additional, duplicate debug report will be sent immediately (i.e. without the random delay) to a separate debug endpoint. This endpoint will use a path like /.well-known/private-aggregation/debug/report-protected-audience (and the equivalent for Shared Storage).

The debug reports should be almost identical to the normal reports, including the additional debug fields. However, the payload ciphertext will differ due to repeating the encryption operation and the key_id may differ if the previous key had since expired or the browser randomly chose a different valid public key.

Reducing volume by batching

In the case of multiple calls to contributeToHistogram(), we can reduce report volume by sending a single report with multiple contributions instead of multiple reports. For this to be possible, the different calls must involve the same reporting origin and the same API (i.e. Protected Audience or Shared Storage). Additionally, the calls must be made at a similar time as the reporting time will necessarily be shared.

Batching scope

For calls within a Shared Storage worklet, calls within the same Shared Storage operation should be batched together.

For calls within a Protected Audience worklet, calls using the same reporting origin within the same auction should be batched together. This should happen even between different interest groups or Protected Audience function calls. However, reports triggered via window.fence.reportEvent() (see here for more detail), should only be batched per-event. This avoids excessive delay if this event is triggered substantially later. Reports for Protected Audience bidders may not share the same aggregation coordinator choice. In this case, calls should be batched separately for the different coordinator choices.

One consideration in the short term is that these calls may have different associated debug modes or keys. In this case, only calls sharing those details should be batched together.

Contributions limit

We will also need a limit on the number of contributions within a single report. In the case that too many contributions are specified with a ‘batching scope’, we should truncate them to the limit.

However, to reduce the impact of this limit, we will pre-aggregate (i.e. merge) any contributions that have the same bucket and filtering ID before truncation.

If necessary, we could instead split the contributions back into multiple reports, each respecting the limit.

Strawman limit: 20 contributions per report.

Padding

The size of the encrypted payload may reveal information about the number of contributions embedded in the aggregatable report. This can be mitigated by padding the plaintext payload (e.g. to a fixed size). In the shorter term, we plan to pad the payload by adding 'null' contributions (i.e. with value 0) to a fixed length. In the future, we plan to instead append bytes to a fixed length, but this will require updating the payload version.

Aggregation coordinator choice

This API should support multiple deployment options for the aggregation service, e.g. deployments on different cloud providers. To avoid a leak, this choice should not be possible from within an isolated context when a context ID is set.

We propose a new optional string field aggregationCoordinatorOrigin to allow developers to specify the deployment option, e.g. the origin for the aggregation service deployed on AWS, GCP, and other platforms in the future. The specified origin would need to be on an allowlist maintained by the browser. If none is specified, a default will be used.

This allowlist matches the Attribution Reporting API's, available here.

Shared Storage callers would specify this field when calling the run() or selectURL() APIs, e.g.

sharedStorage.run('someOperation', {'privateAggregationConfig':
    {'aggregationCoordinatorOrigin': 'https://coordinator.example'}});

Note that we are reusing the privateAggregationConfig that currently allows for specifying a context ID (see here).

Protected Audience sellers would specify this field in the auctionConfig, e.g.

const myAuctionConfig = {
  ...
  'privateAggregationConfig': {
    'aggregationCoordinatorOrigin': 'https://coordinator.example'
  }
};
const auctionResultPromise = navigator.runAdAuction(myAuctionConfig);

For Protected Audience bidders, we plan to allow this choice to be set for each interest group via navigator.joinAdInterestGroup(), e.g.

const interestGroup = {
  ...
  'privateAggregationConfig': {
    'aggregationCoordinatorOrigin': 'https://coordinator.example'
  }
};
const auctionResultPromise = navigator.joinAdInterestGroup(interestGroup);

This setting would be able to be overridden via the typical interest group mechanisms. For example, the update mechanism could support a new privateAggregationConfig key matching the call to joinAdInterestGroup().

Privacy and security

Metadata readable by the reporting origin

Reports will, by default, come with a variety of (unencrypted) metadata that the reporting origin will be able to directly read. While this metadata can be useful, we must be careful to balance the impact on privacy. Here are some examples of metadata that could be included, along with some potential risks:

  • The originally scheduled reporting time (noised within an ~hour granularity)
    • Could be used to identify users on the reporting site within a time window
    • Note that combining this with the actual timestamp the report was received could reveal if the user's device was offline, etc.
  • The reporting origin
    • Determined by the execution context's origin, but a site could use different subdomains, e.g. to separate use cases.
  • Which API triggered the report
    • The api field and the endpoint path indicates which API triggered the report (e.g. Protected Audience or Shared Storage).
  • The API version
    • A version string used to allow future incompatible changes to the API. This should usually correspond to the browser version and should not change often.
  • Encrypted payload sizes
    • If we do not carefully add padding or enforce that all reports are of the same natural size, this may expose some information about the contents of the report.
  • Debugging information
    • If debugging is enabled, some additional metadata will be provided. While this information may be potentially identifying, it will only be available temporarily: while third-party cookies are enabled.
  • Developer-selected metadata

Open question: what metadata to allow

It remains an open question what metadata should be included or allowed in the report and how that metadata could be selected or configured. Note that any variation in the reporting endpoint (such as the URL path) would, for this analysis, be equivalent to including the selected endpoint as additional metadata.

While allowing a developer to specify arbitrary metadata from an isolated context would negate the privacy goals of the API, specifying a report's metadata from a non-isolated context (e.g. a main document) may be less worrisome. This could improve the API's utility and flexibility. For example, allowing this might simplify usage for a single origin using the API for different use cases. This non-isolated metadata selection could also allow for first-party trust signals to be associated with a report.

Alternatively, there may be ways to "noise" the metadata to achieve differential privacy. Further study and consideration is needed here.

Contribution bounding and budgeting

As described above, the aggregation service protects user privacy by adding noise, aiming to have a framework that could support differential privacy. However, simply protecting each query to the aggregation service or each report sent from a user agent would be vulnerable to an adversary that repeats queries or issues multiple reports, and combines the results. Instead, we propose the following structure. See below for the specific choices we have made in our current implementation.

First, each user agent will limit the contribution that it could make to the outputs of aggregation service queries. (Note that this limitation is a rate over time, not an absolute number, see Partition choice below.) In the case of a histogram operation, the user agent could bound the L1 norm of the values, i.e. the sum of all the contributions across all buckets. The user agent could consider other bounds, e.g. the L2 norm.

The user agent would also need to determine what the appropriate 'partition' is for this budgeting, see partition choice below. For example, there could be a separate L1 'budget' for each origin, resetting periodically. Exceeding these limits will cause future contributions to silently drop.

Second, the server-side processing will limit the number of queries that can be performed on reports containing the same 'shared ID' to a small number (e.g. a single query). See here for more detail. This also limits the number of queries that can contain the same report. The shared ID is a hash representing the partition. It is computed by the aggregation service using data available in each aggregatable report. Note that the hash's input data will need to differ from the Attribution Reporting API (e.g. to exclude fields like the destination site that don't exist in this API).

With the above restrictions, the processing servers only need to sample the noise for each query from a fixed distribution. In principle, this fixed noise could be used to achieve differential privacy, e.g. by using Laplace noise with the following parameter: (max number of queries per report) * (max L1 per user per partition) / epsilon.

Scaling values

Developers will need to choose an appropriate scale for their measurements. In other words, they will likely want to multiply their values by a fixed, predetermined constant.

Scaling the values up, i.e. choosing a larger constant, will reduce the relative noise added by the server (as the noise has constant magnitude). However, this will also cause the limit on the L1 norm of the values contributed to reports, i.e. the sum of all contributions across all buckets, to be reached faster. Recall that no more reports can be sent after depleting the budget.

Scaling the values down, i.e. choosing a smaller constant, will increase the relative noise, but would also reduce the risk of reaching the budget limit. Developers will have to balance these considerations to choose the appropriate scale. The examples below explore this in more detail.

Examples

These examples use an L1 bound of 216 = 65 536.

Let's consider a basic measurement case: a binary histogram of counts. For example, using bucket 0 to indicate a user is a member of some group and bucket 1 to indicate they are not. Suppose that we don't want to measure anything else and we've set up our measurement so that each user is only measured once (per partition per time period). Then, each user could contribute their full limit (i.e. 216) to the appropriate bucket. After all the reports for all users are collected, a single query would be performed and the server would add noise (from a fixed distribution) to each bucket. We would then divide the values by 216 to obtain a fairly precise result (with standard deviation of 1/216 of the server's noise distribution).

If each user had instead just contributed a value of 1, we wouldn't have to divide the query result by 216. However, each user would end the week with the vast majority of their budget remaining -- and the processing servers would still add the same noise. So, our result would be much less precise (with standard deviation equal to the server's noise distribution).

On the other hand, suppose we wanted to allow each user to report multiple times per time period to this same binary histogram. In this case, we would have to reduce each contribution from 216 to a lower predetermined value, say, 212. Then, each user would be allowed to contribute up to 16 times to the histogram. Note that you have to reduce each contribution by the worst case number of contributions per user. Otherwise, users contributing too much will have reports dropped.

Partition choice

A narrow partition (e.g. giving each top-level URL a separate budget) may not sufficiently protect privacy. Unfortunately, very broad partitions (e.g. a single budget for the browser) may allow malicious (or simply greedy) actors to exhaust the budget, denying service to all others.

The ergonomics of the partition should also be considered. Some choices might require coordination between different entities (e.g. different third parties on one site) or complex delegation mechanism. Other choices would require complex accounting; for example, requiring Shared storage to record the source of each piece of data that could have contributed (even indirectly) to a report.

Note also that it is important to include a time component to the partition, e.g. resetting limits periodically. This does risk long-term information leakage from dedicated adversaries, but is essential for utility. Other options for recovering from an exhausted budget may be possible but need further exploration, e.g. allowing a site to clear its data to reset its budget.

Implementation plan

We plan to enforce a per-site budget that resets every 10 minutes; that is, we will bound the contributions that any site can make to a histogram over any 10 minute rolling window. We plan to use an L1 bound of 216 = 65 536 for this bound; this aligns with the Attribution Reporting API with Aggregatable Reports explainer.

As a backstop to limit worst-case leakage, we plan a separate, looser per-site bound that is enforced on a 24 hour rolling window, limiting the daily L1 norm to 220 = 1 048 576.

This site will match the site of the execution environment, i.e. the site of the reporting origin, no matter which top-level sites are involved. For the earlier example, this would correspond to the runAdAuction() caller within reportResult() and the interest group owner within reportWin()/reportLoss().

We initially plan to have two separate budgets: one for calls within Shared Storage worklets and one for Protected Audience worklets. However, see shared contribution budget below.

Enrollment and attestation

Use of this API requires enrollment and attestation via the Privacy Sandbox enrollment attestation model.

When an aggregatable report is triggered, a check will be performed to determine whether the calling site is enrolled and attested. If this check fails, the report will be dropped (i.e. not sent).

Future Iterations

Supporting different aggregation modes

This API will support an optional parameter alternativeAggregationMode that accepts a string value. This parameter will allow developers to choose among different options for aggregation infrastructure supported by the user agent. This will allow experimentation with new technologies, and allows us to test new approaches without removing core functionality provided by the default option. The "experimental-poplar" option will implement a protocol similar to poplar VDAF in the PPM Framework.

Shared contribution budget

Separating contribution budgets for Shared Storage worklets and Protected Audience worklets provides additional flexibility; for example, some partition choices may not be compatible (e.g. a per-interest group budget). However, we could consider merging the two budgets in the future.

Authentication and data integrity

To ensure the integrity of the aggregated data, it may be desirable to support a mechanism for authentication. This would help limit the impact of reports sent from malicious or misbehaving clients on the results of each query.

To ensure privacy, the reporting endpoint should be able to determine whether a report came from a trusted client without determining which client sent it. We may be able to use trust tokens for this, but further design work is required.

Aggregate error reporting

Unfortunately, errors that occur within isolated execution contexts cannot be easily reported (e.g. to a non-isolated document or over the network). If allowed, such errors could be used as an information channel. While these errors could still appear in the console, it would also aid developers if we add a mechanism for aggregate error reporting. This reporting could be automatic or could be required to be configured according to the developers' preferences.


This document is an individual draft proposal. It has not been adopted by the Private Advertising Technology Community Group.