# embulk-input-bigquery_extract_files

Embulk file input plugin.

- embulk: http://www.embulk.org/docs/
- embulk plugins: https://plugins.embulk.org/
Reads files stored on Google Cloud Storage that were extracted from a BigQuery table or query result.

## Overview

- **Plugin type**: file input
- **Resume supported**: no
- **Cleanup supported**: yes

This plugin may be a good solution for downloading very large data from BigQuery.

If you set the `table` config without the `query` config, the plugin simply extracts the table to Google Cloud Storage.

If you set the `query` config, the query result is saved to a temporary table, and that temporary table is then extracted to the Google Cloud Storage URI. See: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.extract
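For instance, a minimal table-mode config (no `query`) could look like the sketch below. Every identifier here (project id, key path, dataset, table, bucket, local path) is a hypothetical placeholder, not a value from this document:

```yaml
# Minimal sketch of table mode: extract a whole table to GCS, then download it.
# All identifiers below are placeholders.
in:
  type: bigquery_extract_files
  project: my-gcp-project
  json_keyfile: /path/to/service_account.json
  dataset: my_dataset
  table: my_table        # table mode: no query or temp_dataset needed
  gcs_uri: gs://my-bucket/embulk-temp/my_table_*
  temp_local_path: /tmp/embulk/data
  file_format: 'NEWLINE_DELIMITED_JSON'
  compression: 'GZIP'
  decoders:
  - {type: gzip}
  parser:
    type: json
out:
  type: stdout
```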
## Install

```
$ embulk gem install embulk-input-bigquery_extract_files
```

To update to the latest version:

```
$ embulk gem update embulk-input-bigquery_extract_files
```

- rubygems url: https://rubygems.org/profiles/jo8937
## Configuration

- **project**: Google Cloud Platform (GCP) project id (string, required)
- **json_keyfile**: GCP service account's private key as JSON (string, required)
- **location**: location of the BigQuery dataset and temp_dataset. See: https://cloud.google.com/bigquery/docs/locations (optional) (string, default: `US`)
- **gcs_uri**: GCS URI where the BigQuery result is saved. The bucket and path names are parsed from this URI. (string, required)
- **temp_local_path**: local directory into which the extracted files are downloaded (string, required)
- **dataset**: source dataset (string, default: `null`)
- **table**: source table. Either `query` or `table` is required. (string, default: `null`)
- **query**: source query. Either `query` or `table` is required. (string, default: `null`)
- **temp_dataset**: if you use the `query` param, the query result is saved here (string, default: `null`)
- **temp_table**: if you use the `query` param, the query result is saved here. If not set, the plugin generates a temporary name. (string, default: `null`)
- **use_legacy_sql**: if you use the `query` param, see: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.query.useLegacySql (string, default: `false`)
- **cache**: if you use the `query` param, see: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.query.useQueryCache (string, default: `true`)
- **create_disposition**: if you use the `query` param, see: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.query.createDisposition (string, default: `CREATE_IF_NEEDED`)
- **write_disposition**: if you use the `query` param, see: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.query.writeDisposition (string, default: `WRITE_APPEND`)
- **file_format**: table extract file format. See: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.extract.destinationFormat (string, default: `CSV`)
- **compression**: table extract file compression setting. See: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.extract.compression (string, default: `GZIP`)
- **temp_schema_file_path**: BigQuery result schema file for the parser. (optional) (string, default: `null`)
- **bigquery_job_wait_second**: how many seconds to wait for a BigQuery job. (optional) (string, default: `600`)
- **throw_bigquery_job_wait_timeout**: throw an exception when waiting for a BigQuery job times out. (optional) (string, default: `false`)
- **throw_bigquery_job_includes_error**: throw an exception when a BigQuery job is DONE but includes an errorResult. (optional) (string, default: `false`)
- **cleanup_gcs_before_executing**: delete all files under the GCS temp path before the process starts. (optional) (string, default: `true`)
- **cleanup_gcs_files**: delete all files under the GCS temp path after the process ends. (optional) (string, default: `false`)
- **cleanup_temp_table**: delete the query result table after the process ends. (optional) (string, default: `true`)
- **cleanup_local_temp_files**: delete all files in the local temp directory after the process ends. (optional) (string, default: `true`)
- **direct_download_enabled**: GCS download option. See: https://developers.google.com/api-client-library/java/google-api-java-client/reference/1.19.1/com/google/api/client/googleapis/media/MediaHttpDownloader#setDirectDownloadEnabled(boolean) (optional) (string, default: `false`)
- **decoders**: standard attribute of Embulk file input plugins. See: http://www.embulk.org/docs/built-in.html#gzip-decoder-plugin
- **parser**: standard attribute of Embulk file input plugins. See: http://www.embulk.org/docs/built-in.html#csv-parser-plugin
## Example

```yaml
in:
  type: bigquery_extract_files
  project: googlecloudplatformproject
  json_keyfile: gcp-service-account-private-key.json
  dataset: target_dataset
  #table: target_table
  query: 'select a,b,c from target_table'
  gcs_uri: gs://bucket/subdir
  temp_dataset: temp_dataset
  temp_local_path: C:\Temp
  file_format: 'NEWLINE_DELIMITED_JSON'
  compression: 'GZIP'
  decoders:
  - {type: gzip}
  parser:
    type: json
out:
  type: stdout
```
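If this example is saved as `config.yml` (a hypothetical filename), it can be checked and run with the standard Embulk CLI:

```
$ embulk preview config.yml   # parse a few rows without writing to the output
$ embulk run config.yml
```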
## Example for BigQuery to MySQL batch

For my job I have to batch a BigQuery table into MySQL every day, so I wanted an automatically generated schema with this file input plugin.

- See also:
  - https://github.com/jo8937/embulk-parser-csv_with_schema_file
  - https://github.com/embulk/embulk-output-jdbc/tree/master/embulk-output-mysql

This is my best-practice config for a BigQuery-to-MySQL batch:
```yaml
in:
  type: bigquery_extract_files
  project: my-google-project
  json_keyfile: /tmp/embulk/google_service_account.json
  query: 'select * from dataset.t_nitocris'
  temp_dataset: temp_dataset
  gcs_uri: gs://bucket/embulktemp/t_nitocris_*
  temp_local_path: /tmp/embulk/data
  file_format: 'CSV'
  compression: 'GZIP'
  temp_schema_file_path: /tmp/embulk/schema/csv_schema_nitocris.json
  decoders:
  - {type: gzip}
  parser:
    type: csv_with_schema_file
    default_timestamp_format: '%Y-%m-%d %H:%M:%S %z'
    schema_path: /tmp/embulk/schema/csv_schema_nitocris.json
out:
  type: mysql
  host: host
  user: user
  password: password
  port: 3306
  database: MY_DATABASE
  table:
  options: {connectTimeout: 0, waitTimeout: 0, enableQueryTimeouts: false, autoReconnect: true}
  mode: insert_direct
  retry_limit: 60
  retry_wait: 3000
  batch_size: 4096000
```
Note that BigQuery's TIMESTAMP values do not always use exactly one format within a single CSV export. You can define the optional `columns` param of the `csv_with_schema_file` parser to handle an additional format, like below:

```yaml
...
  parser:
    type: csv_with_schema_file
    default_timestamp_format: '%Y-%m-%d %H:%M:%S %z'
    schema_path: /tmp/embulk/schema/csv_schema_nitocris.json
    columns:
      - {name: Date2, type: timestamp, format: '%Y-%m-%d %H:%M:%S.%N %z'}
```
## Build

```
$ ./gradlew gem  # -t to watch change of files and rebuild continuously
```
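To try a local build, you can install the generated gem directly. The `pkg/` output path below is an assumption based on the standard Embulk plugin Gradle template, and `<version>` is a placeholder:

```
$ embulk gem install pkg/embulk-input-bigquery_extract_files-<version>.gem
```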
For users of old versions:

```
$ embulk gem uninstall embulk-input-bigquery_extract_files --version 0.0.13
```
This plugin is useful as a file-input type, but it may be complicated to use. For more portable use, see embulk-input-bigquery: https://github.com/medjed/embulk-input-bigquery