Consider adding aggregation similar to flow-tools #55
Comments
Any update on this?
Hi, I have not done much work on this. I normally use the library with Spark, so aggregation can be done there (it is still slow, in my opinion, but I will address that later). I do not have a sample pcap file to implement this functionality, so this issue is sort of stuck. If you could help with pcap files, that would be great. P.S. Could you give me a link to the pcap format and explain a little about how it is structured? I normally work with NetFlow files only, so I do not have much experience with other formats.
This should give you an idea of the pcap file format: https://wiki.wireshark.org/Development/LibpcapFileFormat. It's pretty straightforward. You will also find a lot of pcap samples on the same website. I have a few questions so I can better understand whether the feature I requested makes sense to implement in this library. What software produces NetFlow files for you? What is the main use case of this library, and how is it supposed to be used? Is there a Gitter channel available so we can take this discussion further?
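To make the "pretty straightforward" structure concrete, here is a minimal sketch of reading the classic libpcap layout described on that wiki page: a 24-byte global header followed by a sequence of 16-byte record headers, each followed by the captured bytes. The function names `parse_global_header` and `iter_packets` are made up for this illustration and are not part of any library discussed here. The sketch only handles the microsecond-resolution magic number and does no link-layer decoding.

```python
import struct

PCAP_MAGIC = 0xA1B2C3D4  # classic libpcap magic number (microsecond timestamps)


def parse_global_header(data: bytes) -> dict:
    """Parse the 24-byte global header at the start of a pcap file."""
    if len(data) < 24:
        raise ValueError("a pcap global header is 24 bytes long")
    # The magic number tells us which byte order the writer used.
    if struct.unpack("<I", data[:4])[0] == PCAP_MAGIC:
        endian = "<"
    elif struct.unpack(">I", data[:4])[0] == PCAP_MAGIC:
        endian = ">"
    else:
        raise ValueError("unrecognized magic number")
    # magic, version_major, version_minor, thiszone, sigfigs, snaplen, network
    _, major, minor, thiszone, sigfigs, snaplen, network = struct.unpack(
        endian + "IHHiIII", data[:24]
    )
    return {
        "endian": endian,
        "version": (major, minor),
        "thiszone": thiszone,
        "sigfigs": sigfigs,
        "snaplen": snaplen,
        "network": network,  # link-layer type, e.g. 1 for Ethernet
    }


def iter_packets(data: bytes, endian: str):
    """Yield (ts_sec, ts_usec, orig_len, payload) for each packet record."""
    offset = 24  # skip the global header
    while offset + 16 <= len(data):
        ts_sec, ts_usec, incl_len, orig_len = struct.unpack_from(
            endian + "IIII", data, offset
        )
        offset += 16
        payload = data[offset:offset + incl_len]
        if len(payload) < incl_len:
            break  # truncated final record
        yield ts_sec, ts_usec, orig_len, payload
        offset += incl_len
```

A caller would read the file bytes, call `parse_global_header` once, and pass the detected `endian` value to `iter_packets`.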
Thanks for the link.
The specification is here (this is the streaming variant; the files I use are slightly different): http://netflow.caligare.com/netflow_v5.htm Normally, we get files delivered in this format already (I assume they are collected and compressed by some Cisco software and hardware), and the files can be fairly large (hundreds of megabytes as compressed binaries). This library is written mainly for use with Apache Spark (http://spark.apache.org/) to read the files and use a cluster for easy ETL, since the library converts NetFlow data into a DataFrame, but it can also be used from Java code to read files. There is a section in the README on how to run a very simple test. Some sample files are also included in the repository as test resources. Do you use Spark to read pcap files?
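For comparison with pcap, here is a rough sketch of parsing the 24-byte header of a NetFlow v5 export packet, following the streaming layout in the specification linked above. As noted in the comment, the files this library reads differ slightly from that layout, so this does not describe the library's file format. `parse_v5_header` and `V5Header` are names invented for the example.

```python
import struct
from collections import namedtuple

# Field order follows the NetFlow v5 export packet header (network byte order).
V5Header = namedtuple(
    "V5Header",
    "version count sys_uptime unix_secs unix_nsecs "
    "flow_sequence engine_type engine_id sampling_interval",
)

# 2 + 2 + 4 + 4 + 4 + 4 + 1 + 1 + 2 = 24 bytes
V5_HEADER = struct.Struct("!HHIIIIBBH")


def parse_v5_header(packet: bytes) -> V5Header:
    """Parse the header of a NetFlow v5 export packet (hypothetical helper)."""
    if len(packet) < V5_HEADER.size:
        raise ValueError("a NetFlow v5 header is 24 bytes long")
    header = V5Header(*V5_HEADER.unpack_from(packet))
    if header.version != 5:
        raise ValueError("not a NetFlow v5 packet")
    return header
```

In the streaming variant, `count` gives the number of 48-byte flow records that follow the header in the same packet.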
@sadikovi Thanks for the explanation. Is the process to dump Yes we read
@r4ravi2008 something like that. I am not exactly sure how collection happens; my main work is making sure that Spark can read whatever files were delivered :) I will have a look at pcap files this weekend to see how difficult it is to implement a reader or use an existing one, and I will try not to rely on any external commands. How do you read pcap files? Do you use PipedRDDs and call a shell command to read the files?
In addition to the wiki, I will also be using this repo as a reference (it looks like it has quite a few examples): https://github.com/markofu/pcaps/tree/master/PracticalPacketAnalysis/ppa-capture-files
@sadikovi To read pcap files I used For parsing I kinda used references from multiple sources, namely this and this. If you are aiming for this library to be something like ntop/nprobe but with scalability, I think it makes sense to add the feature I mentioned. And I will be happy to help in that aspect :)
@r4ravi2008 would appreciate your help with this, thanks!
Aggregation should be flexible, e.g. specifying groupBy columns and aggregations on numeric columns. We also need to investigate why flow-tools drops records in some cases when generating a report.
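As a rough illustration of what "flexible" could mean here, the sketch below groups flow records by an arbitrary list of key columns and sums an arbitrary list of numeric columns. It is plain Python rather than Spark, and the `aggregate` function and the column names (`srcip`, `dstip`, `octets`, `packets`) are assumptions made for the example, not this library's API or schema.

```python
from collections import defaultdict


def aggregate(records, group_by, sum_columns):
    """Group records by the `group_by` columns and sum each of `sum_columns`.

    `records` is an iterable of dicts; the result is a list of dicts, one per
    distinct key, in the order keys were first seen.
    """
    totals = defaultdict(lambda: [0] * len(sum_columns))
    for record in records:
        key = tuple(record[column] for column in group_by)
        running = totals[key]
        for i, column in enumerate(sum_columns):
            running[i] += record[column]
    return [
        {**dict(zip(group_by, key)), **dict(zip(sum_columns, sums))}
        for key, sums in totals.items()
    ]


# Hypothetical flow records, loosely modeled on NetFlow fields.
flows = [
    {"srcip": "10.0.0.1", "dstip": "10.0.0.2", "octets": 100, "packets": 2},
    {"srcip": "10.0.0.1", "dstip": "10.0.0.3", "octets": 250, "packets": 3},
    {"srcip": "10.0.0.2", "dstip": "10.0.0.1", "octets": 40, "packets": 1},
]

per_source = aggregate(flows, ["srcip"], ["octets", "packets"])
```

Changing the `group_by` list (for example to `["srcip", "dstip"]`) changes the report granularity without touching the aggregation logic, which is the kind of flexibility the request describes.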