Structured log analysis
For demonstration purposes, the application is configured to run in standalone (single-node) mode; that is, all reads and writes go against the local file system. It can, however, easily be reconfigured for a distributed file system such as HDFS.
- In the Spark bin directory, create the subdirectory "test", and inside test create three subdirectories: "app", "input", and "result" (see the commands sketched below).
  - The app directory is for the application jar; put pop-test.jar there.
  - The input directory is for the input files; put raw_pop.json and campaign.csv there.
  - The result directory is for the result files; the enrich_pop.json and aggregate_pop.json files will be generated there.
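To script the setup above, the following is a minimal sketch; it assumes the current working directory is the Spark bin directory and that pop-test.jar, raw_pop.json, and campaign.csv sit in the current directory (adjust the source paths to wherever these files actually live):

# create the expected directory layout under the Spark bin directory
mkdir -p test/app test/input test/result
# copy the application jar and the input files into place (source paths are assumptions)
cp pop-test.jar test/app/
cp raw_pop.json campaign.csv test/input/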
Please launch the program using spark-submit:
spark-submit --class test.PlayLogTest test/app/pop-test.jar YYYY-MM-DD
Example: spark-submit --class test.PlayLogTest test/app/pop-test.jar 2017-03-29
The above command will process only the logs from 2017-03-29 onward (i.e., the date is inclusive) and will generate the enrich_pop.json and aggregate_pop.json files.
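To sanity-check the run, list the result directory and peek at the generated files. Note that, depending on how the job writes its output, enrich_pop.json and aggregate_pop.json may be single files or, as is common with Spark's DataFrame writer, directories of part-* files; the commands below assume the single-file case:

ls test/result
# inspect the first few records of each output (single-file case assumed)
head test/result/enrich_pop.json
head test/result/aggregate_pop.json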
Additional information:
- In a clustered setup there is room for further optimization, for example in the partitionedMaps usage, worker-thread parallelism, and JSON serialization/deserialization; a sketch of a tuned cluster submission follows.
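For illustration, a clustered run of the same job might be submitted with explicit resource and serializer settings. The sketch below is not the application's documented invocation: it assumes a YARN cluster, assumes the application's input/output paths have been pointed at HDFS as noted above, and the jar location and all flag values are placeholders to be tuned:

# illustrative cluster submission; the jar path and all values are assumptions
spark-submit \
  --class test.PlayLogTest \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 8 \
  --executor-cores 4 \
  --executor-memory 4g \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  hdfs:///apps/pop-test.jar 2017-03-29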