Skip to content

Latest commit

 

History

History

firestore-bigquery-export

Stream Collections to BigQuery

Author: Firebase (https://firebase.google.com)

Description: Sends realtime, incremental updates from a specified Cloud Firestore collection to BigQuery.

Details: Use this extension to export the documents in a Cloud Firestore collection to BigQuery. Exports are realtime and incremental, so the data in BigQuery is a mirror of your content in Cloud Firestore.

The extension creates and updates a dataset containing the following two BigQuery resources:

  • A table of raw data that stores a full change history of the documents within your collection. This table includes a number of metadata fields so that BigQuery can display the current state of your data. The principle metadata fields are timestamp, document_name, and the operation for the document change.
  • A view which represents the current state of the data within your collection. It also shows a log of the latest operation for each document (CREATE, UPDATE, or IMPORT).

If you create, update, delete, or import a document in the specified collection, this extension sends that update to BigQuery. You can then run queries on this mirrored dataset.

Note that this extension only listens for document changes in the collection, but not changes in any subcollection. You can, though, install additional instances of this extension to specifically listen to a subcollection or other collections in your database. Or if you have the same subcollection across documents in a given collection, you can use {wildcard} notation to listen to all those subcollections (for example: chats/{chatid}/posts).

Enabling wildcard references will provide an additional STRING based column. The resulting JSON field value references any wildcards that are included in ${param:COLLECTION_PATH}. You can extract them using JSON_EXTRACT_SCALAR.

Partition settings cannot be updated on a pre-existing table, if these options are required then a new table must be created.

Clustering will not need to create or modify a table when adding clustering options, this will be updated automatically.

Additional setup

Before installing this extension, you'll need to:

Transform function

Prior to sending the document change to BigQuery, you have an opportunity to transform the data with an HTTP function. The payload will contain the following:

{ 
  data: [{
    insertId: int;
    json: {
      timestamp: int;
      event_id: int;
      document_name: string;
      document_id: int;
      operation: ChangeType;
      data: string;
    },
  }]
}

The response should be indentical in structure.

Backfill your BigQuery dataset

This extension only sends the content of documents that have been changed -- it does not export your full dataset of existing documents into BigQuery. So, to backfill your BigQuery dataset with all the documents in your collection, you can run the import script provided by this extension.

Important: Run the import script over the entire collection after installing this extension, otherwise all writes to your database during the import might be lost.

Generate schema views

After your data is in BigQuery, you can run the schema-views script (provided by this extension) to create views that make it easier to query relevant data. You only need to provide a JSON schema file that describes your data structure, and the schema-views script will create the views.

Billing

To install an extension, your project must be on the Blaze (pay as you go) plan

  • This extension uses other Firebase and Google Cloud Platform services, which have associated charges if you exceed the service’s no-cost tier:
    • BigQuery (this extension writes to BigQuery with streaming inserts)
    • Cloud Firestore
    • Cloud Functions (Node.js 10+ runtime. See FAQs)

Configuration Parameters:

  • Cloud Functions location: Where do you want to deploy the functions created for this extension? You usually want a location close to your database. For help selecting a location, refer to the location selection guide.

  • BigQuery Dataset location: Where do you want to deploy the BigQuery dataset created for this extension? For help selecting a location, refer to the location selection guide.

  • Project Id: Override the default project bigquery instance. This can allow updates to be directed to a bigquery instance on another project.

  • Collection path: What is the path of the collection that you would like to export? You may use {wildcard} notation to match a subcollection of all documents in a collection (for example: chatrooms/{chatid}/posts). Parent Firestore Document IDs from {wildcards} can be returned in path_params as a JSON formatted string.

  • Enable Wildcard Column field with Parent Firestore Document IDs: If enabled, creates a column containing a JSON object of all wildcard ids from a documents path.

  • Dataset ID: What ID would you like to use for your BigQuery dataset? This extension will create the dataset, if it doesn't already exist.

  • Table ID: What identifying prefix would you like to use for your table and view inside your BigQuery dataset? This extension will create the table and view, if they don't already exist.

  • BigQuery SQL table Time Partitioning option type: This parameter will allow you to partition the BigQuery table and BigQuery view created by the extension based on data ingestion time. You may select the granularity of partitioning based upon one of: HOUR, DAY, MONTH, YEAR. This will generate one partition per day, hour, month or year, respectively.

  • BigQuery Time Partitioning column name: BigQuery table column/schema field name for TimePartitioning. You can choose schema available as timestamp OR a new custom defined column that will be assigned to the selected Firestore Document field below. Defaults to pseudo column _PARTITIONTIME if unspecified. Cannot be changed if Table is already partitioned.

  • Firestore Document field name for BigQuery SQL Time Partitioning field option: This parameter will allow you to partition the BigQuery table created by the extension based on selected. The Firestore Document field value must be a top-level TIMESTAMP, DATETIME, DATE field BigQuery string format or Firestore timestamp(will be converted to BigQuery TIMESTAMP). Cannot be changed if Table is already partitioned. example: postDate

  • BigQuery SQL Time Partitioning table schema field(column) type: Parameter for BigQuery SQL schema field type for the selected Time Partitioning Firestore Document field option. Cannot be changed if Table is already partitioned.

  • BigQuery SQL table clustering: This parameter will allow you to set up Clustering for the BigQuery Table created by the extension. (for example: data,document_id,timestamp- no whitespaces). You can select up to 4 comma separated fields. The order of the specified columns determines the sort order of the data. Available schema extensions table fields for clustering: document_id, timestamp, event_id, operation, data.

  • Backup Collection Name: This (optional) parameter will allow you to specify a collection for which failed BigQuery updates will be written to.

  • Transform function URL: Specify a function URL to call that will transform the payload that will be written to BigQuery. See the pre-install documentation for more details.

  • Use new query syntax for snapshots: If enabled, snapshots will be generated with the new query syntax, which should be more performant, and avoid potential resource limitations.

Cloud Functions:

  • fsexportbigquery: Listens for document changes in your specified Cloud Firestore collection, then exports the changes into BigQuery.

APIs Used:

  • bigquery.googleapis.com (Reason: Mirrors data from your Cloud Firestore collection in BigQuery.)

Access Required:

This extension will operate with the following project IAM roles:

  • bigquery.dataEditor (Reason: Allows the extension to configure and export data into BigQuery.)

  • datastore.user (Reason: Allows the extension to write updates to the database.)