data-catalog-developer-concepts.md

---
title: Data Catalog developer concepts | Microsoft Docs
description: Introduction to the key concepts in Azure Data Catalog conceptual model, as exposed through the Catalog REST API.
services: data-catalog
documentationcenter: ''
author: spelluru
manager: jhubbard
editor: ''
tags: ''
ms.assetid: 89de9137-a0a4-40d1-9f8d-625acad31619
ms.service: data-catalog
ms.devlang: NA
ms.topic: article
ms.tgt_pltfrm: NA
ms.workload: data-catalog
ms.date: 10/15/2017
ms.author: spelluru
---

Azure Data Catalog developer concepts

Microsoft Azure Data Catalog is a fully managed cloud service that provides capabilities for data source discovery and for crowdsourcing data source metadata. Developers can use the service via its REST APIs. Understanding the concepts implemented in the service is important for developers to successfully integrate with Azure Data Catalog.

Key concepts

The Azure Data Catalog conceptual model is based on four key concepts: The Catalog, Users, Assets, and Annotations.

Figure 1 - Azure Data Catalog simplified conceptual model

Catalog

A Catalog is the top-level container for all the metadata an organization stores. Only one Catalog is allowed per Azure account. A Catalog is tied to an Azure subscription, but an account can create only one Catalog even if it has multiple subscriptions.

A catalog contains Users and Assets.

Users

Users are security principals that have permissions to perform actions (search the catalog, add, edit or remove items, etc.) in the Catalog.

There are several different roles a user can have. For information on roles, see the section Roles and Authorization.

Individual users and security groups can be added.

Azure Data Catalog uses Azure Active Directory for identity and access management. Each Catalog user must be a member of the Active Directory for the account.

Assets

A Catalog contains data assets. Assets are the unit of granularity managed by the catalog.

The granularity of an asset varies by data source. For SQL Server or Oracle Database, an asset can be a Table or a View. For SQL Server Analysis Services, an asset can be a Measure, a Dimension, or a Key Performance Indicator (KPI). For SQL Server Reporting Services, an asset is a Report.

An Asset is the thing you add or remove from a Catalog. It is the unit of result you get back from Search.

An Asset is made up of its name, location, and type, plus the annotations that further describe it.

Annotations

Annotations are items that represent metadata about Assets.

Examples of annotations are description, tags, schema, and documentation. A full list of the asset types and annotation types is in the Asset object model section.

Crowdsourcing annotations and user perspective (multiplicity of opinion)

A key aspect of Azure Data Catalog is how it supports the crowdsourcing of metadata in the system. As opposed to a wiki approach – where there is only one opinion and the last writer wins – the Azure Data Catalog model allows multiple opinions to live side by side in the system.

This approach reflects the real world of enterprise data where different users can have different perspectives on a given asset:

  • A database administrator may provide information about service level agreements, or the available processing window for bulk ETL operations
  • A data steward may provide information about the business processes to which the asset applies, or the classifications that the business has applied to it
  • A finance analyst may provide information about how the data is used during end-of-period reporting tasks

To support this example, each user – the DBA, the data steward, and the analyst – can add a description to a single table that has been registered in the Catalog. All descriptions are maintained in the system, and in the Azure Data Catalog portal all descriptions are displayed.

This pattern is applied to most of the items in the object model, so object types in the JSON payload are often arrays for properties where you might expect a singleton.

For example, under the asset root is an array of description objects. The array property is named “descriptions”. A description object has one property, description. Each user who enters a description gets their own description object, created with the value that user supplied.

The UX can then choose how to display the combination. There are three different patterns for display.

  • The simplest pattern is “Show All”. In this pattern, all the objects are shown in a list view. The Azure Data Catalog portal UX uses this pattern for description.
  • Another pattern is “Merge”. In this pattern, all the values from the different users are merged together, with duplicates removed. Examples of this pattern in the Azure Data Catalog portal UX are the tags and experts properties.
  • A third pattern is “last writer wins”. In this pattern, only the most recent value typed in is shown. friendlyName is an example of this pattern.
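The three display patterns above can be sketched in a few lines. This is only an illustration of the idea: the record fields `user`, `value`, and `timestamp` are hypothetical and are not the catalog's wire format.

```python
from typing import Dict, List

# Hypothetical per-user annotation records; the field names are illustrative only.
Annotation = Dict[str, str]


def show_all(annotations: List[Annotation]) -> List[str]:
    """'Show All': return every user's value, in the order received."""
    return [a["value"] for a in annotations]


def merge(annotations: List[Annotation]) -> List[str]:
    """'Merge': combine all users' values, dropping duplicates, keeping first-seen order."""
    seen = set()
    merged = []
    for a in annotations:
        if a["value"] not in seen:
            seen.add(a["value"])
            merged.append(a["value"])
    return merged


def last_writer_wins(annotations: List[Annotation]) -> str:
    """'Last writer wins': return only the most recently written value."""
    # Assumes ISO 8601 timestamps in one time zone, which sort chronologically as strings.
    return max(annotations, key=lambda a: a["timestamp"])["value"]
```

A client could apply `show_all` to descriptions, `merge` to tags and experts, and `last_writer_wins` to friendlyName, mirroring the portal behavior described above.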

Asset object model

As introduced in the Key Concepts section, the Azure Data Catalog object model includes items, which can be assets or annotations. Items have properties, which can be optional or required. Some properties apply to all items. Some properties apply to all assets. Some properties apply only to specific asset types.

System properties

| Property Name | Data Type | Comments |
| --- | --- | --- |
| timestamp | DateTime | The last time the item was modified. This field is generated by the server when an item is inserted and every time an item is updated. The value of this property is ignored on input of publish operations. |
| id | Uri | Absolute URL of the item (read-only). It is the unique addressable URI for the item. The value of this property is ignored on input of publish operations. |
| type | String | The type of the asset (read-only). |
| etag | String | A string corresponding to the version of the item that can be used for optimistic concurrency control when performing operations that update items in the catalog. "*" can be used to match any value. |

Common properties

These properties apply to all root asset types and all annotation types.

| Property Name | Data Type | Comments |
| --- | --- | --- |
| fromSourceSystem | Boolean | Indicates whether the item's data is derived from a source system (like SQL Server Database, Oracle Database) or authored by a user. |

Common root properties

These properties apply to all root asset types.

| Property Name | Data Type | Comments |
| --- | --- | --- |
| name | String | A name derived from the data source location information. |
| dsl | DataSourceLocation | Uniquely describes the data source and is one of the identifiers for the asset. (See dual identity section.) The structure of the dsl varies by the protocol and source type. |
| dataSource | DataSourceInfo | More detail on the type of asset. |
| lastRegisteredBy | SecurityPrincipal | Describes the user who most recently registered this asset. Contains both the unique id for the user (the upn) and a display name (lastName and firstName). |
| containerId | String | Id of the container asset for the data source. This property is not supported for the Container type. |

Common non-singleton annotation properties

These properties apply to all non-singleton annotation types (annotations that are allowed to occur multiple times per asset).

| Property Name | Data Type | Comments |
| --- | --- | --- |
| key | String | A user-specified key, which uniquely identifies the annotation in the current collection. The key length cannot exceed 256 characters. |

Root asset types

Root asset types represent the various types of data assets that can be registered in the catalog. For each root type, there is a view that describes the asset and the annotations included in the view. Use the view name in the corresponding {view_name} URL segment when publishing an asset through the REST API.

| Asset Type (View name) | Additional Properties | Data Type | Allowed Annotations | Comments |
| --- | --- | --- | --- | --- |
| Table ("tables") | | | Description, FriendlyName, Tag, Schema, ColumnDescription, ColumnTag, Expert, Preview, AccessInstruction, TableDataProfile, ColumnDataProfile, ColumnDataClassification, Documentation | A Table represents any tabular data. For example: SQL Table, SQL View, Analysis Services Tabular Table, Analysis Services Multidimensional dimension, Oracle Table, etc. |
| Measure ("measures") | | | Description, FriendlyName, Tag, Expert, AccessInstruction, Documentation | This type represents an Analysis Services measure. |
| | measure | Column | | Metadata describing the measure. |
| | isCalculated | Boolean | | Specifies if the measure is calculated or not. |
| | measureGroup | String | | Physical container for measure. |
| KPI ("kpis") | | | Description, FriendlyName, Tag, Expert, AccessInstruction, Documentation | |
| | measureGroup | String | | Physical container for measure. |
| | goalExpression | String | | An MDX numeric expression or a calculation that returns the target value of the KPI. |
| | valueExpression | String | | An MDX numeric expression that returns the actual value of the KPI. |
| | statusExpression | String | | An MDX expression that represents the state of the KPI at a specified point in time. |
| | trendExpression | String | | An MDX expression that evaluates the value of the KPI over time. The trend can be any time-based criterion that is useful in a specific business context. |
| Report ("reports") | | | Description, FriendlyName, Tag, Expert, AccessInstruction, Documentation | This type represents a SQL Server Reporting Services report. |
| | assetCreatedDate | String | | |
| | assetCreatedBy | String | | |
| | assetModifiedDate | String | | |
| | assetModifiedBy | String | | |
| Container ("containers") | | | Description, FriendlyName, Tag, Expert, AccessInstruction, Documentation | This type represents a container of other assets such as a SQL database, an Azure Blobs container, or an Analysis Services model. |
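As an illustration of how a view name fills the {view_name} URL segment, the following sketch builds a publish URL for each root asset type. The host, the `default` catalog name, and the `2016-03-30` API version are copied from the example requests later in this article; the helper name `publish_url` is hypothetical.

```python
from urllib.parse import quote

# View names taken from the root asset types table.
VIEW_NAMES = {
    "Table": "tables",
    "Measure": "measures",
    "KPI": "kpis",
    "Report": "reports",
    "Container": "containers",
}

# Host copied from this article's example requests; treat it as an assumption elsewhere.
BASE_URL = "https://api.azuredatacatalog.com"


def publish_url(asset_type: str, catalog: str = "default",
                api_version: str = "2016-03-30") -> str:
    """Build the URL used to publish an asset of the given root type."""
    view_name = VIEW_NAMES[asset_type]  # raises KeyError for an unknown asset type
    return (f"{BASE_URL}/catalogs/{quote(catalog, safe='')}"
            f"/views/{view_name}/?api-version={api_version}")
```

For example, `publish_url("Table")` yields the same URL as the POST example in the Roles and authorization section.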

Annotation types

Annotation types represent types of metadata that can be assigned to other types within the catalog.

| Annotation Type (Nested view name) | Additional Properties | Data Type | Comments |
| --- | --- | --- | --- |
| Description ("descriptions") | | | This property contains a description for an asset. Each user of the system can add their own description. Only that user can edit the Description object. (Admins and Asset owners can delete the Description object but not edit it.) The system maintains users' descriptions separately. Thus there is an array of descriptions on each asset (one for each user who has contributed their knowledge about the asset, in addition to possibly one that contains information derived from the data source). |
| | description | string | A short description (2-3 lines) of the asset. |
| Tag ("tags") | | | This property defines a tag for an asset. Each user of the system can add multiple tags for an asset. Only the user who created Tag objects can edit them. (Admins and Asset owners can delete the Tag object but not edit it.) The system maintains users' tags separately. Thus there is an array of Tag objects on each asset. |
| | tag | string | A tag describing the asset. |
| FriendlyName ("friendlyName") | | | This property contains a friendly name for an asset. FriendlyName is a singleton annotation: only one FriendlyName can be added to an asset. Only the user who created the FriendlyName object can edit it. (Admins and Asset owners can delete the FriendlyName object but not edit it.) The system maintains users' friendly names separately. |
| | friendlyName | string | A friendly name of the asset. |
| Schema ("schema") | | | The Schema describes the structure of the data. It lists the attribute (column, attribute, field, etc.) names and types, as well as other metadata. This information is all derived from the data source. Schema is a singleton annotation: only one Schema can be added for an asset. |
| | columns | Column[] | An array of column objects. They describe the column with information derived from the data source. |
| ColumnDescription ("columnDescriptions") | | | This property contains a description for a column. Each user of the system can add their own descriptions for multiple columns (at most one per column). Only the user who created ColumnDescription objects can edit them. (Admins and Asset owners can delete the ColumnDescription object but not edit it.) The system maintains these users' column descriptions separately. Thus there is an array of ColumnDescription objects on each asset (one per column for each user who has contributed their knowledge about the column, in addition to possibly one that contains information derived from the data source). The ColumnDescription is loosely bound to the Schema, so it can get out of sync. The ColumnDescription might describe a column that no longer exists in the schema. It is up to the writer to keep description and schema in sync. The data source may also have column description information, which is stored as additional ColumnDescription objects created when running the tool. |
| | columnName | String | The name of the column this description refers to. |
| | description | String | A short description (2-3 lines) of the column. |
| ColumnTag ("columnTags") | | | This property contains a tag for a column. Each user of the system can add multiple tags for a given column and can add tags for multiple columns. Only the user who created ColumnTag objects can edit them. (Admins and Asset owners can delete the ColumnTag object but not edit it.) The system maintains these users' column tags separately. Thus there is an array of ColumnTag objects on each asset. The ColumnTag is loosely bound to the schema, so it can get out of sync. The ColumnTag might describe a column that no longer exists in the schema. It is up to the writer to keep column tag and schema in sync. |
| | columnName | String | The name of the column this tag refers to. |
| | tag | String | A tag describing the column. |
| Expert ("experts") | | | This property contains a user who is considered an expert in the data set. The experts' opinions (descriptions) bubble to the top of the UX when listing descriptions. Each user can specify their own experts. Only that user can edit the experts object. (Admins and Asset owners can delete the Expert objects but not edit them.) |
| | expert | SecurityPrincipal | |
| Preview ("previews") | | | The preview contains a snapshot of the top 20 rows of data for the asset. Preview only makes sense for some types of assets (it makes sense for Table but not for Measure). |
| | preview | object[] | Array of objects that represent a column. Each object has a property mapping to a column with a value for that column for the row. |
| AccessInstruction ("accessInstructions") | | | |
| | mimeType | string | The mime type of the content. |
| | content | string | The instructions for how to get access to this data asset. The content could be a URL, an email address, or a set of instructions. |
| TableDataProfile ("tableDataProfiles") | | | |
| | numberOfRows | int | The number of rows in the data set. |
| | size | long | The size in bytes of the data set. |
| | schemaModifiedTime | string | The last time the schema was modified. |
| | dataModifiedTime | string | The last time the data set was modified (data was added, modified, or deleted). |
| ColumnsDataProfile ("columnsDataProfiles") | | | |
| | columns | ColumnDataProfile[] | An array of column data profiles. |
| ColumnDataClassification ("columnDataClassifications") | | | |
| | columnName | String | The name of the column this classification refers to. |
| | classification | String | The classification of the data in this column. |
| Documentation ("documentation") | | | A given asset can have only one documentation associated with it. |
| | mimeType | string | The mime type of the content. |
| | content | string | The documentation content. |
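Because ColumnDescription and ColumnTag are only loosely bound to the Schema, a writer may want to detect annotations that refer to columns no longer present. The following is a minimal sketch of such a check. It assumes the schema columns and annotations are already available as plain dictionaries carrying the `name` and `columnName` properties listed in this article; the function name is hypothetical.

```python
from typing import Dict, Iterable, List


def stale_column_annotations(schema_columns: Iterable[Dict],
                             annotations: Iterable[Dict]) -> List[Dict]:
    """Return the annotations whose columnName is not in the current schema."""
    known = {column["name"] for column in schema_columns}
    return [a for a in annotations if a["columnName"] not in known]
```

The same function works for ColumnDescription and ColumnTag objects, since both identify their column through `columnName`.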

Common types

Common types can be used as the types for properties, but are not Items.

| Common Type | Properties | Data Type | Comments |
| --- | --- | --- | --- |
| DataSourceInfo | | | |
| | sourceType | string | Describes the type of data source. For example: SQL Server, Oracle Database, etc. |
| | objectType | string | Describes the type of object in the data source. For example: Table, View for SQL Server. |
| DataSourceLocation | | | |
| | protocol | string | Required. Describes a protocol used to communicate with the data source. For example: "tds" for SQL Server, "oracle" for Oracle, etc. Refer to [Data source reference specification - DSL Structure](data-catalog-dsr.md) for the list of currently supported protocols. |
| | address | Dictionary | Required. Address is a set of data specific to the protocol that is used to identify the data source being referenced. The address data is scoped to a particular protocol, meaning it is meaningless without knowing the protocol. |
| | authentication | string | Optional. The authentication scheme used to communicate with the data source. For example: windows, oauth, etc. |
| | connectionProperties | Dictionary | Optional. Additional information on how to connect to a data source. |
| SecurityPrincipal | | | The backend does not perform any validation of provided properties against AAD during publishing. |
| | upn | string | Unique email address of user. Must be specified if objectId is not provided or in the context of "lastRegisteredBy" property, otherwise optional. |
| | objectId | Guid | User or security group AAD identity. Optional. Must be specified if upn is not provided, otherwise optional. |
| | firstName | string | First name of user (for display purposes). Optional. Only valid in the context of "lastRegisteredBy" property. Cannot be specified when providing security principal for "roles", "permissions" and "experts". |
| | lastName | string | Last name of user (for display purposes). Optional. Only valid in the context of "lastRegisteredBy" property. Cannot be specified when providing security principal for "roles", "permissions" and "experts". |
| Column | | | |
| | name | string | Name of the column or attribute. |
| | type | string | Data type of the column or attribute. The allowable types depend on the data sourceType of the asset. Only a subset of types is supported. |
| | maxLength | int | The maximum length allowed for the column or attribute. Derived from data source. Only applicable to some source types. |
| | precision | byte | The precision for the column or attribute. Derived from data source. Only applicable to some source types. |
| | isNullable | Boolean | Whether the column is allowed to have a null value or not. Derived from data source. Only applicable to some source types. |
| | expression | string | If the value is a calculated column, this field contains the expression that expresses the value. Derived from data source. Only applicable to some source types. |
| ColumnDataProfile | | | |
| | columnName | string | The name of the column. |
| | type | string | The type of the column. |
| | min | string | The minimum value in the data set. |
| | max | string | The maximum value in the data set. |
| | avg | double | The average value in the data set. |
| | stdev | double | The standard deviation for the data set. |
| | nullCount | int | The count of null values in the data set. |
| | distinctCount | int | The count of distinct values in the data set. |
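The SecurityPrincipal rules above depend on where the principal is used. Since the backend does not validate these properties against AAD during publishing, a client-side check can catch mistakes early. The sketch below encodes only the rules stated in the table; the function name and the `context` parameter are hypothetical.

```python
from typing import Dict, List


def validate_security_principal(principal: Dict[str, str], context: str) -> List[str]:
    """Return rule violations for a principal (an empty list means no violations).

    `context` names the property the principal is used in, for example
    "lastRegisteredBy", "roles", "permissions", or "experts".
    """
    errors = []
    has_upn = bool(principal.get("upn"))
    has_object_id = bool(principal.get("objectId"))

    if context == "lastRegisteredBy" and not has_upn:
        errors.append("upn is required for lastRegisteredBy")
    if not has_upn and not has_object_id:
        errors.append("either upn or objectId must be specified")
    if context != "lastRegisteredBy":
        # firstName and lastName are display-only and valid only for lastRegisteredBy.
        for name in ("firstName", "lastName"):
            if name in principal:
                errors.append(f"{name} is only valid for lastRegisteredBy")
    return errors
```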

Asset identity

Azure Data Catalog uses the "protocol" and the identity properties from the "address" property bag of the DataSourceLocation "dsl" property to generate the identity of the asset, which is used to address the asset inside the Catalog. For example, the "tds" protocol has the identity properties "server", "database", "schema", and "object". The combination of the protocol and the identity properties is used to generate the identity of the SQL Server Table asset. Azure Data Catalog provides several built-in data source protocols, which are listed at Data source reference specification - DSL Structure. The set of supported protocols can be extended programmatically (refer to the Data Catalog REST API reference). Administrators of the Catalog can register custom data source protocols. The following table describes the properties needed to register a custom protocol.

Custom data source protocol specification

| Type | Properties | Data Type | Comments |
| --- | --- | --- | --- |
| DataSourceProtocol | | | |
| | namespace | string | The namespace of the protocol. Namespace must be from 1 to 255 characters long and contain one or more non-empty parts separated by a dot (.). Each part must be from 1 to 255 characters long, start with a letter, and contain only letters and numbers. |
| | name | string | The name of the protocol. Name must be from 1 to 255 characters long, start with a letter, and contain only letters, numbers, and the dash (-) character. |
| | identityProperties | DataSourceProtocolIdentityProperty[] | List of identity properties. Must contain at least one, but no more than 20 properties. For example: "server", "database", "schema", "object" are identity properties of the "tds" protocol. |
| | identitySets | DataSourceProtocolIdentitySet[] | List of identity sets. Defines sets of identity properties, each of which represents a valid asset identity. Must contain at least one, but no more than 20 sets. For example: {"server", "database", "schema" and "object"} is an identity set for the "tds" protocol, which defines the identity of a SQL Server Table asset. |
| DataSourceProtocolIdentityProperty | | | |
| | name | string | The name of the property. Name must be from 1 to 100 characters long, start with a letter, and contain only letters and numbers. |
| | type | string | The type of the property. Supported values: "bool", "boolean", "byte", "guid", "int", "integer", "long", "string", "url". |
| | ignoreCase | bool | Indicates whether case should be ignored when using the property's value. Can only be specified for properties with "string" type. Default value is false. |
| | urlPathSegmentsIgnoreCase | bool[] | Indicates whether case should be ignored for each segment of the URL's path. Can only be specified for properties with "url" type. Default value is [false]. |
| DataSourceProtocolIdentitySet | | | |
| | name | string | The name of the identity set. |
| | properties | string[] | The list of identity properties included in this identity set. It cannot contain duplicates. Each property referenced by an identity set must be defined in the list of "identityProperties" of the protocol. |
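The constraints in the table above can be checked before a custom protocol is submitted for registration. The following sketch is one way to express them. It is not the service's own validation logic, and it reads "letters" and "numbers" as ASCII characters, which is an assumption. The function name `validate_protocol` is hypothetical.

```python
import re
from typing import Dict, List

# Assumption: "letters" and "numbers" are read as ASCII; the service may accept more.
NAMESPACE_PART = re.compile(r"[A-Za-z][A-Za-z0-9]*")
PROTOCOL_NAME = re.compile(r"[A-Za-z][A-Za-z0-9-]*")
PROPERTY_NAME = re.compile(r"[A-Za-z][A-Za-z0-9]*")
PROPERTY_TYPES = {"bool", "boolean", "byte", "guid", "int", "integer", "long", "string", "url"}


def validate_protocol(protocol: Dict) -> List[str]:
    """Return a list of constraint violations for a DataSourceProtocol dictionary."""
    errors = []
    namespace = protocol.get("namespace", "")
    parts_ok = all(NAMESPACE_PART.fullmatch(part) for part in namespace.split("."))
    if not 1 <= len(namespace) <= 255 or not parts_ok:
        errors.append("invalid namespace")
    name = protocol.get("name", "")
    if not 1 <= len(name) <= 255 or not PROTOCOL_NAME.fullmatch(name):
        errors.append("invalid name")

    props = protocol.get("identityProperties", [])
    if not 1 <= len(props) <= 20:
        errors.append("identityProperties must contain 1 to 20 entries")
    for prop in props:
        pname, ptype = prop.get("name", ""), prop.get("type")
        if len(pname) > 100 or not PROPERTY_NAME.fullmatch(pname):
            errors.append(f"invalid property name: {pname!r}")
        if ptype not in PROPERTY_TYPES:
            errors.append(f"unsupported type for {pname!r}: {ptype!r}")
        if "ignoreCase" in prop and ptype != "string":
            errors.append(f"ignoreCase requires type 'string': {pname!r}")
        if "urlPathSegmentsIgnoreCase" in prop and ptype != "url":
            errors.append(f"urlPathSegmentsIgnoreCase requires type 'url': {pname!r}")

    defined = {prop.get("name") for prop in props}
    sets = protocol.get("identitySets", [])
    if not 1 <= len(sets) <= 20:
        errors.append("identitySets must contain 1 to 20 entries")
    for identity_set in sets:
        members = identity_set.get("properties", [])
        set_name = identity_set.get("name")
        if len(members) != len(set(members)):
            errors.append(f"duplicate properties in set {set_name!r}")
        for member in sorted(set(members) - defined):
            errors.append(f"undefined property {member!r} in set {set_name!r}")
    return errors
```

An empty list means the definition satisfies the documented constraints. The service may still reject it for reasons not covered here.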

Roles and authorization

Microsoft Azure Data Catalog provides authorization capabilities for CRUD operations on assets and annotations.

Key concepts

The Azure Data Catalog uses two authorization mechanisms:

  • Role-based authorization
  • Permission-based authorization

Roles

There are three roles: Administrator, Owner, and Contributor. Each role has its scope and rights, which are summarized in the following table.

| Role | Scope | Rights |
| --- | --- | --- |
| Administrator | Catalog (all assets/annotations in the Catalog) | Read, Delete, ViewRoles, ChangeOwnership, ChangeVisibility, ViewPermissions |
| Owner | Each asset (root item) | Read, Delete, ViewRoles, ChangeOwnership, ChangeVisibility, ViewPermissions |
| Contributor | Each individual asset and annotation | Read, Update, Delete, ViewRoles. Note: all the rights are revoked if the Read right on the item is revoked from the Contributor. |

Note

Read, Update, Delete, ViewRoles rights are applicable to any item (asset or annotation) while TakeOwnership, ChangeOwnership, ChangeVisibility, ViewPermissions are only applicable to the root asset.

The Delete right applies to an item and to any subitems underneath it. For example, deleting an asset also deletes any annotations for that asset.
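The role table and the notes above can be read as a lookup from roles to rights, filtered by whether the item is a root asset. The sketch below illustrates that reading only. It does not model scope (for example, an Owner applies to a root item only) or the revocation of Contributor rights when Read is revoked, and the function name is hypothetical.

```python
from typing import Iterable, Set

# Rights per role, copied from the roles table above.
ROLE_RIGHTS = {
    "Administrator": {"Read", "Delete", "ViewRoles",
                      "ChangeOwnership", "ChangeVisibility", "ViewPermissions"},
    "Owner": {"Read", "Delete", "ViewRoles",
              "ChangeOwnership", "ChangeVisibility", "ViewPermissions"},
    "Contributor": {"Read", "Update", "Delete", "ViewRoles"},
}

# Rights that apply only to a root asset, per the note above.
ROOT_ONLY_RIGHTS = {"TakeOwnership", "ChangeOwnership", "ChangeVisibility", "ViewPermissions"}


def effective_rights(roles: Iterable[str], is_root_asset: bool) -> Set[str]:
    """Union the rights of the given roles, dropping root-only rights for annotations."""
    rights: Set[str] = set()
    for role in roles:
        rights |= ROLE_RIGHTS.get(role, set())
    if not is_root_asset:
        rights -= ROOT_ONLY_RIGHTS
    return rights
```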

Permissions

Permission is a list of access control entries. Each access control entry assigns a set of rights to a security principal. Permissions can only be specified on an asset (that is, a root item) and apply to the asset and any subitems.

During the Azure Data Catalog preview, only the Read right is supported in the permissions list, which enables the scenario of restricting the visibility of an asset.

By default, any authenticated user has the Read right for any item in the catalog unless visibility is restricted to the set of principals in the permissions.
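The default-open visibility rule can be sketched as follows. The entry shape (`principal`, `rights`, `right`) follows the permissions example later in this article. The sketch matches principals by objectId only and does not resolve security group membership or role-based access, so treat it as an illustration rather than the service's evaluation logic. The function name is hypothetical.

```python
from typing import Dict, List, Optional


def can_read(principal_object_id: str, permissions: Optional[List[Dict]]) -> bool:
    """Return True if the principal may read an asset under the given permissions."""
    if not permissions:
        # No restriction: any authenticated user has the Read right.
        return True
    for entry in permissions:
        rights = {r["right"] for r in entry.get("rights", [])}
        same_principal = entry["principal"].get("objectId") == principal_object_id
        if same_principal and "Read" in rights:
            return True
    return False
```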

REST API

PUT and POST view item requests can be used to control roles and permissions: in addition to the item payload, two system properties can be specified, roles and permissions.

Note

permissions is only applicable to a root item.

The Owner role is only applicable to a root item.

By default, when an item is created in the catalog, its Contributor is set to the currently authenticated user. If the item should be updatable by everyone, set Contributor to the <Everyone> special security principal in the roles property when the item is first published (refer to the following example). Contributor cannot be changed and stays the same during the lifetime of an item (even an Administrator or Owner doesn’t have the right to change it). The only value supported for explicitly setting the Contributor is <Everyone>: Contributor can only be the user who created the item or <Everyone>.

Examples

Set Contributor to <Everyone> when publishing an item. The special security principal <Everyone> has objectId "00000000-0000-0000-0000-000000000201".

POST https://api.azuredatacatalog.com/catalogs/default/views/tables/?api-version=2016-03-30

Note

Some HTTP client implementations may automatically reissue requests in response to a 302 from the server, but typically strip Authorization headers from the request. Since the Authorization header is required to make requests to Azure Data Catalog, you must ensure the Authorization header is still provided when reissuing a request to a redirect location specified by Azure Data Catalog.

Body

{
    "roles": [
        {
            "role": "Contributor",
            "members": [
                {
                    "objectId": "00000000-0000-0000-0000-000000000201"
                }
            ]
        }
    ]
}
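The request body above can also be produced programmatically. The following sketch merges the roles property into an item payload at the top level, mirroring the example. The helper names are hypothetical, and the item payload itself is left to the caller.

```python
import json
from typing import Any, Dict

# objectId of the special <Everyone> security principal, as stated above.
EVERYONE_OBJECT_ID = "00000000-0000-0000-0000-000000000201"


def everyone_contributor_roles() -> Dict[str, Any]:
    """Build the roles property that sets Contributor to <Everyone>."""
    return {
        "roles": [
            {"role": "Contributor", "members": [{"objectId": EVERYONE_OBJECT_ID}]}
        ]
    }


def publish_body(item_payload: Dict[str, Any]) -> str:
    """Serialize an item payload with the <Everyone> Contributor role added."""
    body = dict(item_payload)  # shallow copy so the caller's dict is not modified
    body.update(everyone_contributor_roles())
    return json.dumps(body, indent=4)
```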

Assign owners and restrict visibility for an existing root item:

PUT https://api.azuredatacatalog.com/catalogs/default/views/tables/042297b0...1be45ecd462a?api-version=2016-03-30

{
    "roles": [
        {
            "role": "Owner",
            "members": [
                {
                    "objectId": "c4159539-846a-45af-bdfb-58efd3772b43",
                    "upn": "[email protected]"
                },
                {
                    "objectId": "fdabd95b-7c56-47d6-a6ba-a7c5f264533f",
                    "upn": "[email protected]"
                }
            ]
        }
    ],
    "permissions": [
        {
            "principal": {
                "objectId": "27b9a0eb-bb71-4297-9f1f-c462dab7192a",
                "upn": "[email protected]"
            },
            "rights": [
                {
                    "right": "Read"
                }
            ]
        },
        {
            "principal": {
                "objectId": "4c8bc8ce-225c-4fcf-b09a-047030baab31",
                "upn": "[email protected]"
            },
            "rights": [
                {
                    "right": "Read"
                }
            ]
        }
    ]
}

Note

A PUT request doesn’t need an item payload in the body: PUT can be used to update just roles and/or permissions.
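The earlier note on 302 responses asks clients to keep the Authorization header when a request is reissued to a redirect location. One way to do that is to disable automatic redirects in the HTTP client and follow them manually. The sketch below shows the idea with an injected `transport` callable, a hypothetical seam standing in for whatever HTTP library is used. The `Bearer` scheme is an assumption, and a production client should also confirm that the redirect target is a host it trusts before forwarding credentials.

```python
from typing import Callable, Dict, Tuple

# A transport sends ONE request without following redirects and returns
# (status_code, response_headers, body). It is a hypothetical abstraction.
Response = Tuple[int, Dict[str, str], bytes]
Transport = Callable[[str, str, Dict[str, str], bytes], Response]


def send_with_auth(transport: Transport, method: str, url: str, token: str,
                   body: bytes = b"", max_redirects: int = 5) -> Response:
    """Send a request, re-attaching the Authorization header on each 302 redirect."""
    headers = {
        "Authorization": f"Bearer {token}",  # assumption: bearer token authentication
        "Content-Type": "application/json",
    }
    for _ in range(max_redirects + 1):
        status, response_headers, response_body = transport(method, url, headers, body)
        if status != 302:
            return status, response_headers, response_body
        # Reissue the same request, with the same headers, to the redirect location.
        url = response_headers["Location"]
    raise RuntimeError("too many redirects")
```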