DataFlux is a POC for the Lambda Architecture. It combines various technologies to enable ingesting, processing, and querying data at scale. Apache Kafka is used to initially commit the incoming data, which is later read by a batch processor and persisted into HBase. Storm also consumes the data from Kafka for real-time processing.
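For context, here is a minimal sketch of how a producer might commit incoming log lines to Kafka. The topic name, broker address, and keying scheme are assumptions for illustration, not the actual dataflux-producer wiring:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // "access-logs" is a hypothetical topic name; keying by session id
            // would route all events of one conversation to the same partition.
            producer.send(new ProducerRecord<>("access-logs", "session-42",
                    "127.0.0.1 - - [10/Oct/2024:13:55:36 +0000] \"GET /index.html HTTP/1.1\" 200 2326"));
        }
    }
}
```

Keying by session id preserves per-conversation ordering within a partition, which matters when downstream consumers reassemble conversation flows.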
A data generator is provided that can feed HTTP access logs for processing. It supports conversations within web transactions to enable richer semantics.
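A rough sketch of the kind of generation involved, assuming a Zipf-like skew over response codes and a per-conversation session id (all names and the log format here are hypothetical, not the module's actual API):

```java
import java.util.Random;
import java.util.UUID;

public class AccessLogGeneratorSketch {
    private static final int[] CODES = {200, 304, 404, 301, 500}; // ordered by assumed popularity
    private static final Random RND = new Random();

    // Zipf-like pick: rank r is chosen with probability proportional to 1/r.
    static int zipfCode() {
        double norm = 0;
        for (int r = 1; r <= CODES.length; r++) norm += 1.0 / r;
        double u = RND.nextDouble() * norm, cum = 0;
        for (int r = 1; r <= CODES.length; r++) {
            cum += 1.0 / r;
            if (u <= cum) return CODES[r - 1];
        }
        return CODES[0];
    }

    public static void main(String[] args) {
        String sessionId = UUID.randomUUID().toString(); // ties a conversation's requests together
        for (String path : new String[]{"/login", "/cart", "/checkout"}) { // a toy conversation flow
            System.out.printf("10.0.0.1 - - \"GET %s HTTP/1.1\" %d 512 session=%s%n",
                    path, zipfCode(), sessionId);
        }
    }
}
```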
- dataflux-datagenerator -- generates web transaction data for processing
- dataflux-producer -- Kafka producer
- dataflux-batchpersist -- batch layer for persisting data into HBase
- dataflux-persister -- a common layer used for persistence
- Data generation tries to stay close to the real world: Zipfian distribution for response codes, conversation flows with unique session IDs, and burst generation
- Flow control in the batch persister to handle the slow-consumer problem
- Replay of failed batches (both are illustrated in the sketch after this list)
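As an illustration of the flow-control and replay ideas, a bounded queue between the Kafka reader and the HBase writer blocks the reader when the writer falls behind, and a failed batch is retried rather than dropped. This is a generic sketch under those assumptions, not the module's actual implementation:

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BatchPersistSketch {
    // Bounded queue: when the HBase writer is slow, put() blocks the Kafka
    // reader instead of letting batches pile up in memory.
    private final BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(16);

    void enqueue(List<String> batch) throws InterruptedException {
        queue.put(batch); // blocks when full -> back-pressure on the reader
    }

    void persistLoop() throws InterruptedException {
        while (true) {
            List<String> batch = queue.take();
            boolean done = false;
            while (!done) { // replay: keep retrying the failed batch
                try {
                    writeToHBase(batch);
                    done = true;
                } catch (Exception e) {
                    Thread.sleep(1000); // simple back-off before replaying
                }
            }
        }
    }

    void writeToHBase(List<String> batch) { /* hypothetical HBase put logic */ }
}
```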