---
title: Move data From PostgreSQL | Microsoft Docs
description: Learn about how to move data from PostgreSQL Database using Azure Data Factory.
services: data-factory
documentationcenter: ''
author: linda33wj
manager: jhubbard
editor: monicar
ms.assetid: 888d9ebc-2500-4071-b6d1-0f6bd1b5997c
ms.service: data-factory
ms.workload: data-services
ms.tgt_pltfrm: na
ms.devlang: na
ms.topic: article
ms.date: 11/02/2016
ms.author: jingwang
---
This article outlines how you can use the Copy Activity in an Azure data factory to move data from PostgreSQL to another data store. This article builds on the data movement activities article, which presents a general overview of data movement with copy activity and supported data store combinations.
The Data Factory service supports connecting to on-premises PostgreSQL sources by using the Data Management Gateway. See the Move data between on-premises sources and the cloud article to learn about the Data Management Gateway and for step-by-step instructions on setting up the gateway.
> [!NOTE]
> A gateway is required even if the PostgreSQL database is hosted in an Azure IaaS VM. You can install the gateway on the same IaaS VM as the data store or on a different VM, as long as the gateway can connect to the database.

Data Factory only supports moving data from PostgreSQL to other data stores, not from other data stores to PostgreSQL.
For the Data Management Gateway to connect to the PostgreSQL database, install the Npgsql data provider for PostgreSQL 2.0.12 or above on the same system as the Data Management Gateway. PostgreSQL version 7.4 and above is supported.
> [!NOTE]
> See Troubleshoot gateway issues for tips on troubleshooting connection/gateway related issues.
The easiest way to create a pipeline that copies data from a PostgreSQL database to any of the supported sink data stores is to use the Copy data wizard. See Tutorial: Create a pipeline using Copy Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.
The following example provides sample JSON definitions that you can use to create a pipeline by using the Azure portal, Visual Studio, or Azure PowerShell. They show how to copy data from a PostgreSQL database to Azure Blob storage. However, data can be copied to any of the sinks stated here by using the Copy Activity in Azure Data Factory.
The sample has the following data factory entities:
- A linked service of type OnPremisesPostgreSql.
- A linked service of type AzureStorage.
- An input dataset of type RelationalTable.
- An output dataset of type AzureBlob.
- The pipeline with Copy Activity that uses RelationalSource and BlobSink.
The sample copies data from a query result in a PostgreSQL database to a blob every hour. The JSON properties used in these samples are described in sections following the samples.
As a first step, set up the Data Management Gateway. The instructions are in the Move data between on-premises sources and the cloud article.
PostgreSQL linked service:
```json
{
    "name": "OnPremPostgreSqlLinkedService",
    "properties": {
        "type": "OnPremisesPostgreSql",
        "typeProperties": {
            "server": "<server>",
            "database": "<database>",
            "schema": "<schema>",
            "authenticationType": "<authentication type>",
            "username": "<username>",
            "password": "<password>",
            "gatewayName": "<gatewayName>"
        }
    }
}
```
Azure Blob storage linked service:
```json
{
    "name": "AzureStorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<AccountName>;AccountKey=<AccountKey>"
        }
    }
}
```
PostgreSQL input dataset:
The sample assumes you have created a table “MyTable” in PostgreSQL and it contains a column called “timestamp” for time series data.
Setting “external”: true informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory.
```json
{
    "name": "PostgreSqlDataSet",
    "properties": {
        "type": "RelationalTable",
        "linkedServiceName": "OnPremPostgreSqlLinkedService",
        "typeProperties": {},
        "availability": {
            "frequency": "Hour",
            "interval": 1
        },
        "external": true,
        "policy": {
            "externalData": {
                "retryInterval": "00:01:00",
                "retryTimeout": "00:10:00",
                "maximumRetry": 3
            }
        }
    }
}
```
Azure Blob output dataset:
Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path and file name for the blob are dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, day, and hour parts of the start time.
```json
{
    "name": "AzureBlobPostgreSqlDataSet",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "AzureStorageLinkedService",
        "typeProperties": {
            "folderPath": "mycontainer/postgresql/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}",
            "format": {
                "type": "TextFormat",
                "rowDelimiter": "\n",
                "columnDelimiter": "\t"
            },
            "partitionedBy": [
                {
                    "name": "Year",
                    "value": {
                        "type": "DateTime",
                        "date": "SliceStart",
                        "format": "yyyy"
                    }
                },
                {
                    "name": "Month",
                    "value": {
                        "type": "DateTime",
                        "date": "SliceStart",
                        "format": "MM"
                    }
                },
                {
                    "name": "Day",
                    "value": {
                        "type": "DateTime",
                        "date": "SliceStart",
                        "format": "dd"
                    }
                },
                {
                    "name": "Hour",
                    "value": {
                        "type": "DateTime",
                        "date": "SliceStart",
                        "format": "HH"
                    }
                }
            ]
        },
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}
```
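For example, for the slice that starts at 2014-06-01T18:00:00Z (the pipeline start time used in the pipeline sample below), the folder path would resolve to mycontainer/postgresql/yearno=2014/monthno=06/dayno=01/hourno=18.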
Copy activity:
The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run hourly. In the pipeline JSON definition, the source type is set to RelationalSource and the sink type is set to BlobSink. The SQL query specified for the query property selects the data from the public.usstates table in the PostgreSQL database.
```json
{
    "name": "CopyPostgreSqlToBlob",
    "properties": {
        "description": "pipeline for copy activity",
        "activities": [
            {
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "RelationalSource",
                        "query": "select * from \"public\".\"usstates\""
                    },
                    "sink": {
                        "type": "BlobSink"
                    }
                },
                "inputs": [
                    {
                        "name": "PostgreSqlDataSet"
                    }
                ],
                "outputs": [
                    {
                        "name": "AzureBlobPostgreSqlDataSet"
                    }
                ],
                "policy": {
                    "timeout": "01:00:00",
                    "concurrency": 1
                },
                "scheduler": {
                    "frequency": "Hour",
                    "interval": 1
                },
                "name": "PostgreSqlToBlob"
            }
        ],
        "start": "2014-06-01T18:00:00Z",
        "end": "2014-06-01T19:00:00Z"
    }
}
```
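If you want each hourly slice to copy only the rows that belong to that hour (for example, by filtering on the “timestamp” column assumed earlier in this article), you can parameterize the query with the slice window. The following source snippet is a sketch based on the $$Text.Format expression and the SliceStart/SliceEnd system variables used in other Data Factory samples; the "MyTable" table and "timestamp" column are placeholders from the earlier assumption, not part of this sample's public.usstates data.

```json
"source": {
    "type": "RelationalSource",
    "query": "$$Text.Format('select * from \"MyTable\" where \"timestamp\" >= \\'{0:yyyy-MM-dd HH:mm}\\' AND \"timestamp\" < \\'{1:yyyy-MM-dd HH:mm}\\'', SliceStart, SliceEnd)"
}
```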
The following table provides descriptions for the JSON elements that are specific to the PostgreSQL linked service.
Property | Description | Required |
---|---|---|
type | The type property must be set to: OnPremisesPostgreSql | Yes |
server | Name of the PostgreSQL server. | Yes |
database | Name of the PostgreSQL database. | Yes |
schema | Name of the schema in the database. The schema name is case-sensitive. | No |
authenticationType | Type of authentication used to connect to the PostgreSQL database. Possible values are: Anonymous, Basic, and Windows. | Yes |
username | Specify user name if you are using Basic or Windows authentication. | No |
password | Specify password for the user account you specified for the username. | No |
gatewayName | Name of the gateway that the Data Factory service should use to connect to the on-premises PostgreSQL database. | Yes |
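As an illustration, a linked service that uses Basic authentication might look like the following sketch. All of the values shown (server, database, schema, user name, and gateway name) are hypothetical placeholders, not values taken from this article's sample.

```json
{
    "name": "OnPremPostgreSqlLinkedService",
    "properties": {
        "type": "OnPremisesPostgreSql",
        "typeProperties": {
            "server": "pgserver01",
            "database": "salesdb",
            "schema": "public",
            "authenticationType": "Basic",
            "username": "pguser",
            "password": "<password>",
            "gatewayName": "MyGateway"
        }
    }
}
```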
See Move data between on-premises sources and the cloud with Data Management Gateway for details about setting credentials for an on-premises PostgreSQL data source.
For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.).
The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store. The typeProperties section for a dataset of type RelationalTable (which includes the PostgreSQL dataset) has the following properties:
Property | Description | Required |
---|---|---|
tableName | Name of the table in the PostgreSQL Database instance that linked service refers to. The tableName is case-sensitive. | No (if query of RelationalSource is specified) |
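For example, if you prefer to reference a table directly rather than specify a query on the copy activity, the input dataset can set tableName in typeProperties. The following is a sketch that reuses the entities from the sample above; the table name "usstates" is only an illustrative value.

```json
{
    "name": "PostgreSqlDataSet",
    "properties": {
        "type": "RelationalTable",
        "linkedServiceName": "OnPremPostgreSqlLinkedService",
        "typeProperties": {
            "tableName": "usstates"
        },
        "external": true,
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}
```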
For a full list of sections & properties available for defining activities, see the Creating Pipelines article. Properties such as name, description, input and output tables, and policy are available for all types of activities.
Properties available in the typeProperties section of the activity, on the other hand, vary with each activity type. For the Copy activity, they vary depending on the types of sources and sinks.
When the source is of type RelationalSource (which includes PostgreSQL), the following properties are available in the typeProperties section:
Property | Description | Allowed values | Required |
---|---|---|---|
query | Use the custom query to read data. | SQL query string. For example: "query": "select * from \"MySchema\".\"MyTable\"". | No (if tableName of dataset is specified) |
> [!NOTE]
> Schema and table names are case-sensitive and must be enclosed in "" (double quotes) in the query.
>
> Example: "query": "select * from \"MySchema\".\"MyTable\""
[!INCLUDE data-factory-structure-for-rectangualr-datasets]
As mentioned in the data movement activities article, the Copy activity performs automatic type conversions from source types to sink types with the following two-step approach:
- Convert from native source types to .NET type
- Convert from .NET type to native sink type
When moving data from PostgreSQL, the following mappings are used from PostgreSQL type to .NET type.
PostgreSQL Database type | PostgreSQL aliases | .NET Framework type |
---|---|---|
abstime | | Datetime |
bigint | int8 | Int64 |
bigserial | serial8 | Int64 |
bit [ (n) ] | | Byte[], String |
bit varying [ (n) ] | varbit | Byte[], String |
boolean | bool | Boolean |
box | | Byte[], String |
bytea | | Byte[], String |
character [ (n) ] | char [ (n) ] | String |
character varying [ (n) ] | varchar [ (n) ] | String |
cid | | String |
cidr | | String |
circle | | Byte[], String |
date | | Datetime |
daterange | | String |
double precision | float8 | Double |
inet | | Byte[], String |
intarry | | String |
int4range | | String |
int8range | | String |
integer | int, int4 | Int32 |
interval [ fields ] [ (p) ] | | Timespan |
json | | String |
jsonb | | Byte[] |
line | | Byte[], String |
lseg | | Byte[], String |
macaddr | | Byte[], String |
money | | Decimal |
numeric [ (p, s) ] | decimal [ (p, s) ] | Decimal |
numrange | | String |
oid | | Int32 |
path | | Byte[], String |
pg_lsn | | Int64 |
point | | Byte[], String |
polygon | | Byte[], String |
real | float4 | Single |
smallint | int2 | Int16 |
smallserial | serial2 | Int16 |
serial | serial4 | Int32 |
text | | String |
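If you want to be explicit about the columns and types of the copied data rather than rely on these default mappings, you can declare a structure section on the dataset, as described in the Creating datasets article referenced earlier. The following is a minimal sketch; the column names and types shown are assumptions for illustration only and are not taken from this article's sample data.

```json
"structure": [
    { "name": "state_name", "type": "String" },
    { "name": "population", "type": "Int32" },
    { "name": "last_updated", "type": "Datetime" }
]
```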
[!INCLUDE data-factory-column-mapping]
[!INCLUDE data-factory-type-repeatability-for-relational-sources]
See Copy Activity Performance & Tuning Guide to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it.