kerberized_data_lake

Data Lake

This module spins up a bare-bones data lake for demos and for testing Kerberos integration with other services (e.g. Airflow or Dataflow). It is not meant for production use.

Architecture Diagram

This includes:

  • Multi-tenant Hadoop cluster with Hive / Spark / Presto (Dataproc)
  • Kerberos (MIT KDC)
  • Hive Metastore (on a Dataproc cluster for now; perhaps DPMS in the future)

Troubleshooting

Issues with destroying KMS Resources

KMS keys and key rings cannot be deleted, so this module fails when it tries to destroy them. The workaround is to remove the key from Terraform state:

terragrunt state rm 'module.test_data_lake.module.kms.google_kms_crypto_key.key_ephemeral[0]'

On subsequent applies, use a different key ring name. You should also taint your Dataproc clusters and the encrypted principals null resource so that the next apply re-creates them with the new secrets, encrypted with the new KMS key:

terragrunt taint module.test_data_lake.null_resource.encrypted_principals
terragrunt taint module.test_data_lake.google_dataproc_cluster.kdc_cluster
terragrunt taint module.test_data_lake.google_dataproc_cluster.metastore_cluster
terragrunt taint module.test_data_lake.google_dataproc_cluster.analytics_cluster
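
The workaround above can be wrapped in a small script so the steps run in one go. This is only a sketch: the `MODULE` and `DRY_RUN` variables and the `run` and `taint_targets` helpers are hypothetical names introduced here, and the module address `module.test_data_lake` is taken from the examples above and may differ in your configuration. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of the KMS workaround: drop the key from state, then taint the
# resources that hold secrets encrypted with the old key.
set -eu

# Assumed module address (from the examples above); override as needed.
MODULE="${MODULE:-module.test_data_lake}"
# Dry run by default: commands are printed, not executed. Set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Addresses of the resources to re-create on the next apply.
taint_targets() {
  for name in \
    null_resource.encrypted_principals \
    google_dataproc_cluster.kdc_cluster \
    google_dataproc_cluster.metastore_cluster \
    google_dataproc_cluster.analytics_cluster
  do
    echo "${MODULE}.${name}"
  done
}

# Quote the address so the shell does not treat [0] as a glob pattern.
run terragrunt state rm "${MODULE}.module.kms.google_kms_crypto_key.key_ephemeral[0]"

for addr in $(taint_targets); do
  run terragrunt taint "$addr"
done
```

Review the printed commands first, then re-run with `DRY_RUN=0` once the addresses match your state. Remember to change the key ring name before the next apply.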

Requirements

| Name | Version |
|------|---------|
| terraform | >= 0.12.17 |
| google | >= 3.38.0, < 3.41.0 |

Providers

| Name | Version |
|------|---------|
| google | >= 3.38.0, < 3.41.0 |
| google-beta | n/a |
| null | n/a |

Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| analytics_cluster | name for analytics dataproc cluster | string | "analytics-cluster" | no |
| analytics_realm | Kerberos realm for analytics clusters to use | string | "ANALYTICS.FOO.COM" | no |
| corp_kdc_realm | Kerberos realm to represent centralized kerberos identities | string | "FOO.COM" | no |
| data_lake_super_admin | User email for super admin rights on data lake | any | n/a | yes |
| dataproc_kms_key | Name for KMS Key for kerberized dataproc | string | "dataproc-key" | no |
| dataproc_subnet | self link for VPC subnet in which to spin up dataproc clusters | any | n/a | yes |
| kdc_cluster | name for kdc dataproc cluster | string | "kdc-cluster" | no |
| kms_key_ring | Name for KMS Keyring | string | "dataproc-kerberos-keyring" | no |
| metastore_cluster | name for Hive Metastore dataproc cluster | string | "metastore-cluster" | no |
| metastore_realm | Kerberos realm for hive metastore to use | string | "HIVE-METASTORE.FOO.COM" | no |
| project | GCP Project ID in which to deploy data lake resources | any | n/a | yes |
| region | GCP Compute region in which to deploy dataproc clusters | string | "us-central1" | no |
| tenants | list of non-human kerberos principals (one per tenant) to be created as unix users on each cluster | list(string) | ["core-data"] | no |
| users | list of human kerberos principals to be created as unix users on each cluster | list(string) | ["user1", "user2"] | no |
| zone | GCP Compute zone in which to deploy dataproc clusters | string | "us-central1-f" | no |

Outputs

| Name | Description |
|------|-------------|
| analytics_cluster_fqdn | Fully qualified domain name for cluster on which to run presto / spark jobs |
| gcs_encrypted_keytab_path | GCS path to keep keytabs |
| kms_key | kms key for decrypting keytabs |