forked from airbnb/reair

Showing 7 changed files with 130 additions and 64 deletions.

# Frequently Asked Questions

### How do I use these tools to migrate a warehouse?

One approach to migrating a warehouse is to use batch replication to get an initial copy of the tables in the original warehouse into the new warehouse. Once the initial copy is done, incremental replication can be used to replicate changes from a point shortly before batch replication was kicked off. Then, incremental replication keeps both clusters in sync until the cutover date.

### How do I run against different versions of Hadoop / Hive?

As shipped, the default configuration should work with most Hive and Hadoop 2.x deployments because the API calls used in this project are generally backward compatible. We have not been able to test against a wide variety of versions, so if you encounter issues, you can modify `build.gradle` to specify different Hadoop and Hive versions and produce binaries better matched to your deployment.
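
For example, a minimal sketch of the kind of `build.gradle` change involved (the property and dependency names here are illustrative and may not match the project's actual build files):

```
// Hypothetical build.gradle fragment; the actual property and dependency
// names in this project's build files may differ.
ext {
    hadoopVersion = '2.7.1' // set to match your cluster
    hiveVersion = '1.2.1'   // set to match your cluster
}

dependencies {
    compile "org.apache.hadoop:hadoop-common:${hadoopVersion}"
    compile "org.apache.hive:hive-exec:${hiveVersion}"
}
```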

### For idempotent file copy operations, how is file equality determined?

To provide a fast equality check, files are considered equal if their sizes and modification times match between the source and destination warehouses. Files that are considered equal are not re-copied; this behavior may not be suitable for all applications.
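
A minimal sketch of this kind of check using the standard Hadoop `FileSystem` API (illustrative only; this is not the project's actual code):

```
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch of a size-and-timestamp equality check.
public class FileEqualityCheck {
  public static boolean consideredEqual(
      FileSystem srcFs, Path src,
      FileSystem destFs, Path dest) throws IOException {
    FileStatus s = srcFs.getFileStatus(src);
    FileStatus d = destFs.getFileStatus(dest);
    // Fast heuristic: no checksum comparison is performed, so a file whose
    // size and modification time both match is assumed to be identical
    // and is skipped rather than re-copied.
    return s.getLen() == d.getLen()
        && s.getModificationTime() == d.getModificationTime();
  }
}
```

If this heuristic is too weak for your use case, a checksum-based comparison (e.g. `FileSystem.getFileChecksum`) is stricter but considerably slower.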

### How do I filter out specific tables from getting replicated?

Both batch and incremental replication provide a blacklist mechanism. For both, please see the example configuration templates for the configuration variable that controls blacklisted entries.
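
As a purely hypothetical illustration of what such an entry can look like (the property name and value syntax below are made up; use the key and format shown in the shipped configuration templates):

```
<!-- Hypothetical blacklist entry; the real property name and value
     syntax are defined in the example configuration templates. -->
<property>
  <name>replication.table.blacklist</name>
  <value>tmp_db:.*, default:scratch_.*</value>
</property>
```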

### How do tables on S3 get replicated?

For tables and partitions whose locations are on S3, the metadata for those tables is copied to the destination warehouse. Be aware that replicated S3-backed tables should be created as external tables; otherwise, Hive operations such as `DROP TABLE` can delete the underlying data and cause inconsistencies.
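
For example, declaring the destination copy as an external table along these lines (the table name, columns, and location are illustrative) ensures that a `DROP TABLE` removes only the metadata and leaves the S3 data intact:

```
-- Illustrative only: dropping an EXTERNAL table does not delete its data.
CREATE EXTERNAL TABLE my_replicated_table (
  id BIGINT,
  payload STRING)
LOCATION 's3n://my-bucket/warehouse/my_replicated_table';
```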

## Batch Replication

### What kind of consistency guarantees does batch replication provide while it's running?

While batch replication is running, there are no consistency guarantees on the destination warehouse. Because files are copied directly to destination directories, it's possible to observe a partially complete data directory. Once batch replication finishes, it guarantees that if a table exists in the destination metastore, it is consistent with the state of the source table at some point after batch replication was kicked off. In general, please wait for batch replication to finish before running any queries on the destination warehouse.

## Incremental Replication

### What kind of consistency guarantees does incremental replication provide while it's running?

Overall, incremental replication provides eventual consistency. Incremental replication also guarantees that data directories never contain partial data: while incremental replication is running and a table is being updated on the source warehouse, the corresponding table on the destination warehouse will contain either the old data or the new data, but never a partial result. These are the same semantics that Hive provides when overwriting a table with a query. In addition, incremental replication guarantees that the data for a table is copied before the metadata, so if a table is present in the metastore, the table can be queried.

### Are there any issues with restarting the incremental replication process from a previous point in the audit log?

Since incremental replication is idempotent, it is safe to restart from a previous point in the audit log. Files that have already been copied will not be copied again, so recovery should be relatively fast.

# Large HDFS directory copy

## Overview

This is a tool for migrating HDFS data when `distcp` has issues copying a directory with a large number of files.

Depending on the options specified, it can:

* Copy files that exist on the source but not on the destination (add option)
* Copy files that exist on both the source and the destination but differ in file size (update option)
* Delete files that exist on the destination but not on the source (delete option)

Directories can be excluded from the copy by configuring the blacklist regex option. Directory names (not full paths) matching the regex are not traversed.

The dry-run mode only compares the source and destination directories and writes the operations it would have performed, in text format, to the logging directory. Please see the schema of the logging table below for details on the output.
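
For instance, a dry run might write tab-separated lines like the following (the paths, sizes, and timestamps are made up), one per planned operation, in the column order of the logging table:

```
hdfs://airfs-dest/user/hive/warehouse/db/t/f1  add     hdfs://airfs-src/user/hive/warehouse/db/t/f1  1048576  1427843011000
hdfs://airfs-dest/user/hive/warehouse/db/t/f2  update  hdfs://airfs-src/user/hive/warehouse/db/t/f2  2097152  1427843099000
```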

## Usage

* Switch to the repo directory and build the JAR:

```
cd reair
./gradlew shadowjar -p main -x test
```

* If the Hive logging table does not exist, create it using [these commands](../main/src/main/resources/create_hdfs_copy_logging_tables.hql).

* CLI options:
```
-source: source directory
-destination: destination directory
-temp-path: temporary directory where files will be copied to first
-output-path: directory for logging data produced by the MR job
-operation: comma-separated operations for the copy: a (add), d (delete), u (update)
-blacklist: skip directory names matching this regex
-dry-run: don't execute the copy, but populate the logging directory with data about the planned operations
```

The following is an example invocation. Please replace the values with ones appropriate for your deployment before trying this out. Typically, the job should be run on the destination cluster.

```
export HADOOP_HEAPSIZE=8096
JOB_START_TIME="$(date +"%s")"
hadoop jar airbnb-reair-main-1.0.0-all.jar \
  com.airbnb.di.hive.batchreplication.hdfscopy.ReplicationJob \
  -Dmapreduce.job.reduces=500 \
  -Dmapreduce.map.memory.mb=8000 \
  -Dmapreduce.map.java.opts="-Djava.net.preferIPv4Stack=true -Xmx7000m" \
  -source hdfs://airfs-src/user/hive/warehouse \
  -destination hdfs://airfs-dest/user/hive/warehouse \
  -output-path hdfs://airfs-dest/user/replication/log/$JOB_START_TIME \
  -temp-path hdfs://airfs-dest/tmp/replication/$JOB_START_TIME \
  -blacklist "tmp.*" \
  -operation a,u,d
hive -e "LOAD DATA INPATH 'hdfs://airfs-dest/user/replication/log/$JOB_START_TIME' OVERWRITE INTO TABLE hdfs_copy_results PARTITION (job_start_time = $JOB_START_TIME);"
```
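
Once the results are loaded, a quick sanity check of what the job did might look like this (a sketch, assuming the logging table created above):

```
-- Replace <job_start_time> with the value of $JOB_START_TIME used above.
SELECT action, COUNT(*) AS files, SUM(size) AS total_bytes
FROM hdfs_copy_results
WHERE job_start_time = <job_start_time>
GROUP BY action;
```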

# Known Issues

## Incremental Replication

* Due to https://issues.apache.org/jira/browse/HIVE-12865, exchange partition commands are replicated only under limited conditions. A resolution is pending.
* Because the audit log hook writes changes in a separate transaction from the Hive metastore, it's possible to miss updates if the client fails after the metastore write but before the hook executes. In practice, this is not an issue, as failed Hive queries are re-run.

```
@@ -1,12 +1,12 @@
--- hdfs copy job output table.
-CREATE TABLE IF NOT EXISTS hdfscopy_result(
+-- HDFS copy job output table.
+CREATE TABLE IF NOT EXISTS hdfs_copy_results(
   dst_path string, -- destination path
   action string, -- action: add, update, delete
   src_path string, -- source path
   size bigint, -- size
   ts bigint) -- file timestamp
 PARTITIONED BY (
-  jobts bigint)
+  job_start_time bigint)
 ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '\t'
 STORED AS TEXTFILE;
```

```
@@ -57,6 +57,4 @@ service TReplicationService {

-  // Get the lag for replication process in ms
-  i64 getLag();


 }
```