Skip to content

Latest commit

 

History

History
325 lines (264 loc) · 12.6 KB

view-spec.md

File metadata and controls

325 lines (264 loc) · 12.6 KB
title
View Spec

Iceberg View Spec

Background and Motivation

Most compute engines (e.g. Trino and Apache Spark) support views. A view is a logical table that can be referenced by future queries. Views do not contain any data. Instead, the query stored by the view is executed every time the view is referenced by another query.

Each compute engine stores the metadata of the view in its proprietary format in the metastore of choice. Thus, views created from one engine can not be read or altered easily from another engine even when engines share the metastore as well as the storage system. This document standardizes the view metadata for ease of sharing the views across engines.

Goals

  • A common metadata format for view metadata, similar to how Iceberg supports a common table format for tables.

Overview

View metadata storage mirrors how Iceberg table metadata is stored and retrieved. View metadata is maintained in metadata files. All changes to view state create a new view metadata file and completely replace the old metadata using an atomic swap. Like Iceberg tables, this atomic swap is delegated to the metastore that tracks tables and/or views by name. The view metadata file tracks the view schema, custom properties, current and past versions, as well as other metadata.

Each metadata file is self-sufficient. It contains the history of the last few versions of the view and can be used to roll back the view to a previous version.

Metadata Location

An atomic swap of one view metadata file for another provides the basis for making atomic changes. Readers use the version of the view that was current when they loaded the view metadata and are not affected by changes until they refresh and pick up a new metadata location.

Writers create view metadata files optimistically, assuming that the current metadata location will not be changed before the writer's commit. Once a writer has created an update, it commits by swapping the view's metadata file pointer from the base location to the new location.

Specification

Terms

  • Schema -- Names and types of fields in a view.
  • Version -- The state of a view at some point in time.

View Metadata

The view version metadata file has the following fields:

Requirement Field name Description
required view-uuid A UUID that identifies the view, generated when the view is created. Implementations must throw an exception if a view's UUID does not match the expected UUID after refreshing metadata
required format-version An integer version number for the view format; must be 1
required location The view's base location; used to create metadata file locations
required schemas A list of known schemas
required current-version-id ID of the current version of the view (version-id)
required versions A list of known versions of the view [1]
required version-log A list of version log entries with the timestamp and version-id for every change to current-version-id
optional properties A string to string map of view properties [2]

Notes:

  1. The number of versions to retain is controlled by the table property: version.history.num-entries.
  2. Properties are used for metadata such as comment and for settings that affect view maintenance. This is not intended to be used for arbitrary metadata.

Versions

Each version in versions is a struct with the following fields:

Requirement Field name Description
required version-id ID for the version
required schema-id ID of the schema for the view version
required timestamp-ms Timestamp when the version was created (ms from epoch)
required summary A string to string map of summary metadata about the version
required representations A list of representations for the view definition
optional default-catalog Catalog name to use when a reference in the SELECT does not contain a catalog
required default-namespace Namespace to use when a reference in the SELECT is a single identifier

When default-catalog is null or not set, the catalog in which the view is stored must be used as the default catalog.

Summary

Summary is a string to string map of metadata about a view version. Common metadata keys are documented here.

Requirement Key Value
optional engine-name Name of the engine that created the view version
optional engine-version Version of the engine that created the view version

Representations

View definitions can be represented in multiple ways. Representations are documented ways to express a view definition.

A view version can have more than one representation. All representations for a version must express the same underlying definition. Engines are free to choose the representation to use.

View versions are immutable. Once a version is created, it cannot be changed. This means that representations for a version cannot be changed. If a view definition changes (or new representations are to be added), a new version must be created.

Each representation is an object with at least one common field, type, that is one of the following:

  • sql: a SQL SELECT statement that defines the view

Representations further define metadata for each type.

SQL representation

The SQL representation stores the view definition as a SQL SELECT, with metadata such as the SQL dialect.

A view version can have multiple SQL representations of different dialects, but only one SQL representation per dialect.

Requirement Field name Type Description
required type string Must be sql
required sql string A SQL SELECT statement
required dialect string The dialect of the sql SELECT statement (e.g., "trino" or "spark")

For example:

USE prod.default
CREATE OR REPLACE VIEW event_agg (
    event_count COMMENT 'Count of events',
    event_date) AS
SELECT
    COUNT(1), CAST(event_ts AS DATE)
FROM events
GROUP BY 2

This create statement would produce the following sql representation metadata:

Field name Value
type "sql"
sql "SELECT\n COUNT(1), CAST(event_ts AS DATE)\nFROM events\nGROUP BY 2"
dialect "spark"

If a create statement does not include column names or comments before AS, the fields should be omitted.

The event_count (with the Count of events comment) and event_date field aliases must be part of the view version's schema.

Version log

The version log tracks changes to the view's current version. This is the view's history and allows reconstructing what version of the view would have been used at some point in time.

Note that this is not the version's creation time, which is stored in each version's metadata. A version can appear multiple times in the version log, indicating that the view definition was rolled back.

Each entry in version-log is a struct with the following fields:

Requirement Field name Description
required timestamp-ms Timestamp when the view's current-version-id was updated (ms from epoch)
required version-id ID that current-version-id was set to

Appendix A: An Example

The JSON metadata file format is described using an example below.

Imagine the following sequence of operations:

USE prod.default
CREATE OR REPLACE VIEW event_agg (
    event_count COMMENT 'Count of events',
    event_date)
COMMENT 'Daily event counts'
AS
SELECT
    COUNT(1), CAST(event_ts AS DATE)
FROM events
GROUP BY 2

The metadata JSON file created looks as follows.

The path is intentionally similar to the path for Iceberg tables and uses a metadata directory.

s3://bucket/warehouse/default.db/event_agg/metadata/00001-(uuid).metadata.json
{
  "view-uuid": "fa6506c3-7681-40c8-86dc-e36561f83385",
  "format-version" : 1,
  "location" : "s3://bucket/warehouse/default.db/event_agg",
  "current-version-id" : 1,
  "properties" : {
    "comment" : "Daily event counts"
  },
  "versions" : [ {
    "version-id" : 1,
    "timestamp-ms" : 1573518431292,
    "schema-id" : 1,
    "default-catalog" : "prod",
    "default-namespace" : [ "default" ],
    "summary" : {
      "engine-name" : "Spark",
      "engineVersion" : "3.3.2"
    },
    "representations" : [ {
      "type" : "sql",
      "sql" : "SELECT\n    COUNT(1), CAST(event_ts AS DATE)\nFROM events\nGROUP BY 2",
      "dialect" : "spark"
    } ]
  } ],
  "schemas": [ {
    "schema-id": 1,
    "type" : "struct",
    "fields" : [ {
      "id" : 1,
      "name" : "event_count",
      "required" : false,
      "type" : "int",
      "doc" : "Count of events"
    }, {
      "id" : 2,
      "name" : "event_date",
      "required" : false,
      "type" : "date"
    } ]
  } ],
  "version-log" : [ {
    "timestamp-ms" : 1573518431292,
    "version-id" : 1
  } ]
}

Each change creates a new metadata JSON file. In the below example, the underlying SQL is modified by specifying the fully-qualified table name.

USE prod.other_db;
CREATE OR REPLACE VIEW default.event_agg (
    event_count COMMENT 'Count of events',
    event_date)
COMMENT 'Daily event counts'
AS
SELECT
    COUNT(1), CAST(event_ts AS DATE)
FROM prod.default.events
GROUP BY 2

Updating the view produces a new metadata file that completely replaces the old:

s3://bucket/warehouse/default.db/event_agg/metadata/00002-(uuid).metadata.json
{
  "view-uuid": "fa6506c3-7681-40c8-86dc-e36561f83385",
  "format-version" : 1,
  "location" : "s3://bucket/warehouse/default.db/event_agg",
  "current-version-id" : 1,
  "properties" : {
    "comment" : "Daily event counts"
  },
  "versions" : [ {
    "version-id" : 1,
    "timestamp-ms" : 1573518431292,
    "schema-id" : 1,
    "default-catalog" : "prod",
    "default-namespace" : [ "default" ],
    "summary" : {
      "engine-name" : "Spark",
      "engineVersion" : "3.3.2"
    },
    "representations" : [ {
      "type" : "sql",
      "sql" : "SELECT\n    COUNT(1), CAST(event_ts AS DATE)\nFROM events\nGROUP BY 2",
      "dialect" : "spark"
    } ]
  }, {
    "version-id" : 2,
    "timestamp-ms" : 1573518981593,
    "schema-id" : 1,
    "default-catalog" : "prod",
    "default-namespace" : [ "default" ],
    "summary" : {
      "engine-name" : "Spark",
      "engineVersion" : "3.3.2"
    },
    "representations" : [ {
      "type" : "sql",
      "sql" : "SELECT\n    COUNT(1), CAST(event_ts AS DATE)\nFROM prod.default.events\nGROUP BY 2",
      "dialect" : "spark"
    } ]
  } ],
  "schemas": [ {
    "schema-id": 1,
    "type" : "struct",
    "fields" : [ {
      "id" : 1,
      "name" : "event_count",
      "required" : false,
      "type" : "int",
      "doc" : "Count of events"
    }, {
      "id" : 2,
      "name" : "event_date",
      "required" : false,
      "type" : "date"
    } ]
  } ],
  "version-log" : [ {
    "timestamp-ms" : 1573518431292,
    "version-id" : 1
  }, {
    "timestamp-ms" : 1573518981593,
    "version-id" : 2
  } ]
}