HIHO: Hadoop In, Hadoop Out. Hadoop Data Integration, deduplication, incremental update and more.

This branch supports HIHO on Apache Hadoop 0.21.

HIHO Features:

1. Import from a database to HDFS

- query based import

  Join multiple tables, provide where conditions, and dynamically bind parameters to SQL queries to get data into Hadoop. It is as simple as creating a small config file (a sample configuration sketch follows this section) and running the job:

      bin/hadoop jar hiho.jar co.nubetech.hiho.job.DBQueryInputJob -conf dbInputQueryDelimited.xml

  or

      ${HIHO_HOME}/scripts/hiho import -jdbcDriver <jdbcDriver> -jdbcUrl <jdbcUrl> -jdbcUsername <jdbcUsername> -jdbcPassword <jdbcPassword> -outputPath <outputPath> -outputStrategy <outputStrategy> -delimiter <delimiter> -numberOfMappers <numberOfMappers> -inputOrderBy <inputOrderBy> -conf dbInputQueryDelimited.xml

- table based import

      bin/hadoop jar hiho.jar co.nubetech.hiho.job.DBQueryInputJob -conf dbInputTableDelimited.xml

  or

      ${HIHO_HOME}/scripts/hiho import -jdbcDriver <jdbcDriver> -jdbcUrl <jdbcUrl> -jdbcUsername <jdbcUsername> -jdbcPassword <jdbcPassword> -outputPath <outputPath> -outputStrategy <outputStrategy> -delimiter <delimiter> -numberOfMappers <numberOfMappers> -inputOrderBy <inputOrderBy> -inputTableName <inputTableName> -inputFieldNames <inputFieldNames> -conf dbInputTableDelimited.xml

- incremental import

  Append to an existing HDFS location so that all data is in one place. Just specify isAppend = true in the configuration and import; the import will be written to the existing HDFS folder.

- configurable format for data import - delimited or Avro

  Set mapreduce.jdbc.hiho.input.outputStrategy to DELIMITED or AVRO.
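For reference, a minimal query-based import configuration in the spirit of dbInputQueryDelimited.xml might look like the sketch below. This is only an illustration: the mapreduce.jdbc.* keys are the standard Hadoop 0.21 DBConfiguration property names, mapreduce.jdbc.hiho.input.outputStrategy is the HIHO key described above, and all connection and query values are placeholders. The configuration files actually shipped with HIHO may use additional or different properties.

```xml
<?xml version="1.0"?>
<!-- Illustrative sketch only. Property names follow Hadoop 0.21
     DBConfiguration plus the HIHO outputStrategy key described above;
     the driver, URL, credentials and query are placeholder values. -->
<configuration>
  <property>
    <name>mapreduce.jdbc.driver.class</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>mapreduce.jdbc.url</name>
    <value>jdbc:mysql://localhost:3306/hihodb</value>
  </property>
  <property>
    <name>mapreduce.jdbc.username</name>
    <value>hiho</value>
  </property>
  <property>
    <name>mapreduce.jdbc.password</name>
    <value>hiho</value>
  </property>
  <property>
    <name>mapreduce.jdbc.input.query</name>
    <value>select e.id, e.name, d.name from employee e join department d on e.dept_id = d.id</value>
  </property>
  <property>
    <name>mapreduce.jdbc.hiho.input.outputStrategy</name>
    <value>DELIMITED</value>
  </property>
</configuration>
```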
2. Export to Databases

- high performance MySQL loading using LOAD DATA INFILE

      ${HIHO_HOME}/scripts/hiho export mysql -conf mysqlExport.xml -inputPath <inputPath> -url <url> -userName <userName> -password <password>

- high performance Oracle loading by creating external tables

  See the expert opinion at http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:6611962171229 and, for information on external tables, http://download.oracle.com/docs/cd/B12037_01/server.101/b10825/et_concepts.htm

  On the Oracle server, you will have to make a folder, create a directory object for it through the Oracle client (sqlplus), and grant privileges on it. Also allow FTP access to the Oracle server:

      mkdir -p ageTest
      sqlplus> create or replace directory age_ext as '/home/nube/age';

      ${HIHO_HOME}/scripts/hiho export oracle -conf oracleExport.xml -inputPath <inputPath> -oracleFtpAddress <oracleFtpAddress> -oracleFtpPortNumber <oracleFtpPortNumber> -oracleFtpUserName <oracleFtpUserName> -oracleFtpPassword <oracleFtpPassword> -oracleExternalTableDirectory <oracleExternalTableDirectory> -driver <driver> -url <url> -userName <userName> -password <password>

- custom loading and export to any database by emitting your own GenericDBWritables. Check DelimitedLoadMapper.

3. Export to Salesforce

Send computed MapReduce results to Salesforce. For this, you need a developer account with the Bulk API enabled; you can join at http://developer.force.com/join

If you get the message "LoginFault [ApiFault exceptionCode='INVALID_LOGIN' exceptionMessage='Invalid username, password, security token; or user locked out.']", a security token is usually required. Salesforce's explanation: "Invalid username, password, security token; or user locked out. Are you at a new location? When accessing Salesforce--either via a desktop client or the API--from outside of your company's trusted networks, you must add a security token to your password to log in. To receive a new security token, log in to Salesforce at http://www.salesforce.com and click Setup | My Personal Information | Reset Security Token." Log in, get the security token, and then try again.

Parameters:

- sfUserName - name of the Salesforce account
- sfPassword - password plus security token. The security token can be obtained by logging in to the Salesforce.com site and clicking Reset Security Token.
- sfObjectType - the Salesforce object to export
- sfHeaders - header describing the Salesforce object properties. For more information, refer to the Bulk API Developer's Guide.

      ${HIHO_HOME}/scripts/hiho export saleforce -inputPath <inputPath> -sfUserName <sfUserName> -sfPassword <sfPassword> -sfObjectType <sfObjectType> -sfHeaders <sfHeaders>

4. Export results to an FTP server

Use the co.nubetech.hiho.mapreduce.lib.output.FTPOutputFormat directly in your job, just like FileOutputFormat (a rough job-setup sketch follows this section). For usage, check co.nubetech.hiho.job.ExportToFTPserver. This job writes the output directly to an FTP server. It can be invoked as:

      ${HIHO_HOME}/scripts/hiho export ftp -inputPath <inputPath> -outputPath <outputPath> -ftpUserName <ftpUserName> -ftpAddress <ftpAddress> -ftpPortNumper <ftpPortNumper> -ftpPassword <ftpPassword>

where

- ftpUserName - FTP server login username
- ftpAddress - FTP server address
- ftpPortNumper - FTP port
- ftpPassword - FTP server password
- outputPath - the location on the FTP server to which the output will be written. It should be a complete directory path, e.g. /home/sgoyal/output
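To make the "use it like FileOutputFormat" point concrete, a driver might wire FTPOutputFormat into a standard MapReduce job roughly as sketched below. This is a hedged sketch, not code taken from ExportToFTPserver: in particular, the assumptions that FTPOutputFormat accepts its target path through FileOutputFormat.setOutputPath and that the FTP connection details arrive through the job configuration should be verified against the actual classes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import co.nubetech.hiho.mapreduce.lib.output.FTPOutputFormat;

public class FtpExportSketch {
  public static void main(String[] args) throws Exception {
    // Assumption: FTP connection details (address, port, user, password)
    // come in through the job configuration, e.g. via -conf or the hiho
    // script flags listed above.
    Configuration conf = new Configuration();
    Job job = new Job(conf, "export to ftp");   // 0.20/0.21-era Job constructor
    job.setJarByClass(FtpExportSketch.class);

    // Map-only identity job: copy the input records straight to the output.
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(LongWritable.class);  // default TextInputFormat emits offset/line pairs
    job.setOutputValueClass(Text.class);

    // The key point: FTPOutputFormat is plugged in just like FileOutputFormat.
    job.setOutputFormatClass(FTPOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    // Assumption: the FTP target directory is set the same way as a file output path.
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```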
5. Export to Hive

This is used to export data from any other database into Hive. Hive export can be done in two ways, query based and table based. The configurations needed are listed here (a sketch of these properties follows this section):

- mapreduce.jdbc.hiho.input.loadTo - the database into which you want to load your data from HDFS, e.g. hive
- mapreduce.jdbc.hiho.input.loadToPath - the job also generates a script of all queries; this configuration defines where to store that script on your local system
- mapreduce.jdbc.hiho.hive.driver - name of the Hive JDBC driver, e.g. org.apache.hadoop.hive.jdbc.HiveDriver
- mapreduce.jdbc.hiho.hive.url - Hive URL for the JDBC connection, e.g. jdbc:hive:// for embedded mode, or jdbc:hive://localhost:10000/default for standalone mode
- mapreduce.jdbc.hiho.hive.usrName - user name for the JDBC connection
- mapreduce.jdbc.hiho.hive.password - password for the JDBC connection
- mapreduce.jdbc.hiho.hive.partitionBy - used when you want to create a partitioned Hive table, e.g. country:string:us;name:string:jack (basic partitions), country:string:us;name:string (one static and one dynamic partition), country:string (dynamic partition). Currently only one dynamic partition is allowed. Data can also be stored for multiple partitions at a time; for that, the value is given as country:string:us,uk,aus, and you need to define three different queries or tables in their respective configurations.
- mapreduce.jdbc.hiho.hive.ifNotExists - set to true to include the 'if not exists' clause in the create table query
- mapreduce.jdbc.hiho.hive.tableName - the name of the Hive table you want to create
- mapreduce.jdbc.hiho.hive.sortedBy - can only be used if the clusteredBy configuration is defined; gives the name of the column by which to sort the data
- mapreduce.jdbc.hiho.hive.clusteredBy - the name of the column by which to cluster the data and the number of buckets to create, e.g. name:2

Execution commands:

For table based:

      bin/hadoop jar ~/workspace/hiho/build/classes/hiho.jar co.nubetech.hiho.job.DBQueryInputJob -conf ~/workspace/hiho/conf/dbInputTableDelimitedHive.xml

or

      ${HIHO_HOME}/scripts/hiho import -jdbcDriver <jdbcDriver> -jdbcUrl <jdbcUrl> -jdbcUsername <jdbcUsername> -jdbcPassword <jdbcPassword> -outputPath <outputPath> -outputStrategy <outputStrategy> -delimiter <delimiter> -numberOfMappers <numberOfMappers> -inputOrderBy <inputOrderBy> -inputTableName <inputTableName> -inputFieldNames <inputFieldNames> -inputLoadTo hive -inputLoadToPath <inputLoadToPath> -hiveDriver <hiveDriver> -hiveUrl <hiveUrl> -hiveUsername <hiveUsername> -hivePassword <hivePassword> -hivePartitionBy <hivePartitionBy> -hiveIfNotExists <hiveIfNotExists> -hiveTableName <hiveTableName> -hiveSortedBy <hiveSortedBy> -hiveClusteredBy <hiveClusteredBy> -conf dbInputTableDelimitedHive.xml

For query based:

      bin/hadoop jar ~/workspace/hiho/build/classes/hiho.jar co.nubetech.hiho.job.DBQueryInputJob -conf ~/workspace/hiho/conf/dbInputQueryDelimitedHive.xml

or

      ${HIHO_HOME}/scripts/hiho import -jdbcDriver <jdbcDriver> -jdbcUrl <jdbcUrl> -jdbcUsername <jdbcUsername> -jdbcPassword <jdbcPassword> -outputPath <outputPath> -outputStrategy <outputStrategy> -delimiter <delimiter> -numberOfMappers <numberOfMappers> -inputOrderBy <inputOrderBy> -inputLoadTo hive -inputLoadToPath <inputLoadToPath> -hiveDriver <hiveDriver> -hiveUrl <hiveUrl> -hiveUsername <hiveUsername> -hivePassword <hivePassword> -hivePartitionBy <hivePartitionBy> -hiveIfNotExists <hiveIfNotExists> -hiveTableName <hiveTableName> -hiveSortedBy <hiveSortedBy> -hiveClusteredBy <hiveClusteredBy> -conf dbInputQueryDelimitedHive.xml

Notes:

1. The Hive table name is mandatory when you are running more than one query or table, i.e. in the multiple partition case.
2. The sortedBy feature will not work unless the clusteredBy feature is defined.
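To make the list above concrete, the Hive-specific part of a configuration such as dbInputTableDelimitedHive.xml might look roughly like the sketch below. The property names are the ones documented above; the values (script path, table name, partition spec, buckets) are placeholders, and the sample configurations shipped with HIHO may differ. Combine these with the usual JDBC input properties shown earlier.

```xml
<!-- Illustrative Hive export properties only; values are placeholders. -->
<configuration>
  <property>
    <name>mapreduce.jdbc.hiho.input.loadTo</name>
    <value>hive</value>
  </property>
  <property>
    <name>mapreduce.jdbc.hiho.input.loadToPath</name>
    <value>/tmp/hiho/hiveScripts</value>
  </property>
  <property>
    <name>mapreduce.jdbc.hiho.hive.driver</name>
    <value>org.apache.hadoop.hive.jdbc.HiveDriver</value>
  </property>
  <property>
    <name>mapreduce.jdbc.hiho.hive.url</name>
    <value>jdbc:hive://localhost:10000/default</value>
  </property>
  <property>
    <name>mapreduce.jdbc.hiho.hive.usrName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <name>mapreduce.jdbc.hiho.hive.password</name>
    <value>hivepassword</value>
  </property>
  <property>
    <name>mapreduce.jdbc.hiho.hive.tableName</name>
    <value>employee</value>
  </property>
  <property>
    <name>mapreduce.jdbc.hiho.hive.partitionBy</name>
    <value>country:string:us</value>
  </property>
  <property>
    <name>mapreduce.jdbc.hiho.hive.clusteredBy</name>
    <value>name:2</value>
  </property>
  <property>
    <name>mapreduce.jdbc.hiho.hive.sortedBy</name>
    <value>name</value>
  </property>
  <property>
    <name>mapreduce.jdbc.hiho.hive.ifNotExists</name>
    <value>true</value>
  </property>
</configuration>
```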
6. Dedup details

      bin/hadoop jar ~/workspace/HIHO/deploy/hiho.jar co.nubetech.hiho.dedup.DedupJob -inputFormat <inputFormat> -dedupBy <"key" or "value"> -inputKeyClassName <inputKeyClassName> -inputValueClassName <inputValueClassName> -inputPath <inputPath> -outputPath <outputPath> -delimeter <delimeter> -column <column> -outputFormat <outputFormat>

Alternatively, dedup can also be executed by running the HadoopTransform script present in ${HIHO_HOME}/scripts/:

      ${HIHO_HOME}/scripts/hiho dedup -inputFormat <inputFormat> -dedupBy <"key" or "value"> -inputKeyClassName <inputKeyClassName> -inputValueClassName <inputValueClassName> -inputPath <inputPath> -outputPath <outputPath> -delimeter <delimeter> -column <column>

Example for deduplication by key:

For sequence files:

      ${HIHO_HOME}/scripts/hiho dedup -inputFormat org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat -inputKeyClassName org.apache.hadoop.io.IntWritable -inputValueClassName org.apache.hadoop.io.Text -inputPath testData/dedup/inputForSeqTest -outputPath output -dedupBy key

For delimited text files:

      ${HIHO_HOME}/scripts/hiho dedup -inputFormat co.nubetech.hiho.dedup.DelimitedTextInputFormat -inputKeyClassName org.apache.hadoop.io.Text -inputValueClassName org.apache.hadoop.io.Text -inputPath testData/dedup/textFilesForTest -outputPath output -delimeter , -column 1 -dedupBy key

Example for deduplication by value:

For sequence files:

      ${HIHO_HOME}/scripts/hiho dedup -inputFormat org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat -inputKeyClassName org.apache.hadoop.io.IntWritable -inputValueClassName org.apache.hadoop.io.Text -inputPath testData/dedup/inputForSeqTest -outputPath output -dedupBy value

For delimited text files:

      ${HIHO_HOME}/scripts/hiho dedup -inputFormat co.nubetech.hiho.dedup.DelimitedTextInputFormat -inputKeyClassName org.apache.hadoop.io.Text -inputValueClassName org.apache.hadoop.io.Text -inputPath testData/dedup/textFilesForTest -outputPath output -dedupBy value
7. Merge details

      ${HIHO_HOME}/scripts/hiho merge -newPath <newPath> -oldPath <oldPath> -mergeBy <"key" or "value"> -outputPath <outputPath> -inputFormat <inputFormat> -inputKeyClassName <inputKeyClassName> -inputValueClassName <inputValueClassName> -outputFormat <outputFormat>

Example for merge by key:

For sequence files:

      ${HIHO_HOME}/scripts/hiho merge -newPath testData/merge/inputNew/input1.seq -oldPath testData/merge/inputOld/input2.seq -mergeBy key -outputPath output -inputFormat org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat -inputKeyClassName org.apache.hadoop.io.IntWritable -inputValueClassName org.apache.hadoop.io.Text

For delimited text files:

      ${HIHO_HOME}/scripts/hiho merge -newPath testData/merge/inputNew/fileInNewPath.txt -oldPath testData/merge/inputOld/fileInOldPath.txt -mergeBy key -outputPath output -inputFormat co.nubetech.hiho.dedup.DelimitedTextInputFormat -inputKeyClassName org.apache.hadoop.io.Text -inputValueClassName org.apache.hadoop.io.Text

Example for merge by value:

For sequence files:

      ${HIHO_HOME}/scripts/hiho merge -newPath testData/merge/inputNew/input1.seq -oldPath testData/merge/inputOld/input2.seq -mergeBy value -outputPath output -inputFormat org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat -inputKeyClassName org.apache.hadoop.io.IntWritable -inputValueClassName org.apache.hadoop.io.Text

For delimited text files:

      ${HIHO_HOME}/scripts/hiho merge -newPath testData/merge/inputNew/fileInNewPath.txt -oldPath testData/merge/inputOld/fileInOldPath.txt -mergeBy value -outputPath output -inputFormat co.nubetech.hiho.dedup.DelimitedTextInputFormat -inputKeyClassName org.apache.hadoop.io.Text -inputValueClassName org.apache.hadoop.io.Text

New features in this release:

- incremental import and introduction of AppendFileInputFormat
- Oracle export
- FTP server integration
- Salesforce
- support for Apache Hadoop 0.20
- support for Apache Hadoop 0.21
- generic dedup and merge

Other improvements:

- Ivy based build and dependency management
- JUnit and Mockito based test cases