awk #160
Comments
This should work fine for very large datasets, in particular those much larger than RAM.
Thank you, I will try it out. As far as I understand, it prints the result to stdout. What is the best way to store it in an in-memory variable? Piping into a file on a RAM disk? Also, the last question should include a count by group, not just a sum.
I am not sure I understand: stdout is already in RAM. If you want to store the result in a Python program, subprocess.check_output is perhaps the easiest way.
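A minimal sketch of the subprocess.check_output route, assuming awk is on the PATH and the input is a comma-separated file with a header row, a group key in column 1 and a numeric value in column 2. The column names, the awk program and the helper name group_sum_count are illustrative assumptions, not the commands from the gist.

```python
import os
import subprocess
import tempfile

# Hypothetical awk program: sum and count of column 2, grouped by column 1.
# NR > 1 skips the header row; printf avoids awk's default 6-digit rounding.
AWK_SUM_COUNT = (
    r'NR > 1 { total[$1] += $2; n[$1]++ } '
    r'END { for (k in total) printf "%s,%.17g,%d\n", k, total[k], n[k] }'
)


def group_sum_count(csv_path):
    """Run awk once and return {group: (sum, count)} parsed from its stdout."""
    out = subprocess.check_output(
        ["awk", "-F", ",", AWK_SUM_COUNT, csv_path], text=True
    )
    result = {}
    for line in out.splitlines():
        key, total, count = line.rsplit(",", 2)
        result[key] = (float(total), int(count))
    return result


# Tiny demo on a throwaway file (hypothetical columns id1, v1).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("id1,v1\na,1\nb,2\na,3\n")
demo_sums = group_sum_count(f.name)
os.unlink(f.name)
print(demo_sums)
```

The whole result is held in one Python string before parsing, which is fine for group-by output but not for outputs comparable in size to the input.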
Updated the last command to include count.
For very large outputs, reading incrementally from a pipe (also possible with subprocess) may be useful, since it avoids holding the whole result in memory.
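A sketch of reading from a pipe with subprocess.Popen, so that only one line is in Python's memory at a time. The helper name stream_lines is hypothetical, and the awk command in the demo only generates numbers so that the example needs no input file.

```python
import subprocess


def stream_lines(cmd):
    """Yield the stdout lines of cmd one by one instead of buffering them all."""
    with subprocess.Popen(
        cmd, stdin=subprocess.DEVNULL, stdout=subprocess.PIPE, text=True
    ) as proc:
        for line in proc.stdout:
            yield line.rstrip("\n")
    # Leaving the with-block waits for the process, so returncode is set here.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)


# Demo: consume 100000 output lines while keeping only a running count.
demo_cmd = ["awk", "BEGIN { for (i = 1; i <= 100000; i++) print i }"]
n_rows = sum(1 for _ in stream_lines(demo_cmd))
print(n_rows)
```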
The problem is that printing to the console adds overhead, so redirecting the output into a file should be preferred.
Also, every command reads the data from disk again, which is another overhead that should be reduced. Ideally, the data would be read once and all queries run in sequence, each producing its own output file.
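One way to read the data only once is to let a single awk program accumulate several queries and write each result to its own file in the END block. This is a sketch under assumptions: the two queries, the column layout (id1, id2, v1), the variable names out1 and out2, and the helper run_queries_once are all made up for illustration.

```python
import os
import subprocess
import tempfile

# Hypothetical awk program answering two group-by queries in one pass.
# out1 and out2 are output file names passed in with -v.
AWK_TWO_QUERIES = r'''
NR > 1 {
    sum1[$1] += $3; cnt1[$1]++     # query 1: sum and count by column 1
    sum2[$1 FS $2] += $3           # query 2: sum by columns 1 and 2
}
END {
    for (k in sum1) print k FS sum1[k] FS cnt1[k] > out1
    for (k in sum2) print k FS sum2[k] > out2
}
'''


def run_queries_once(csv_path, out1, out2):
    """Scan csv_path a single time and write one result file per query."""
    subprocess.run(
        ["awk", "-F", ",", "-v", f"out1={out1}", "-v", f"out2={out2}",
         AWK_TWO_QUERIES, csv_path],
        check=True,
    )


# Demo in a temporary directory.
with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "data.csv")
    with open(src, "w") as f:
        f.write("id1,id2,v1\na,x,1\na,y,2\nb,x,5\na,x,4\n")
    q1, q2 = os.path.join(d, "q1.csv"), os.path.join(d, "q2.csv")
    run_queries_once(src, q1, q2)
    with open(q1) as f:
        rows_q1 = sorted(f.read().split())   # awk's group order is unspecified
    with open(q2) as f:
        rows_q2 = sorted(f.read().split())
print(rows_q1, rows_q2)
```

The trade-off is that all accumulator arrays live in awk's memory at the same time, which is only reasonable when the number of groups is much smaller than the number of rows.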
OK, if you want to remove the I/O time, RAM disks are probably a good solution.
I am not sure whether you want to look at the output or not. If not, you can redirect it to /dev/null, which avoids the console printing overhead.
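From Python, the /dev/null redirection corresponds to stdout=subprocess.DEVNULL. A small sketch for timing a command without any console output; the helper name time_without_console and the demo command are assumptions for illustration.

```python
import subprocess
import time


def time_without_console(cmd):
    """Run cmd with stdout discarded and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(
        cmd, stdin=subprocess.DEVNULL, stdout=subprocess.DEVNULL, check=True
    )
    return time.perf_counter() - start


# Demo: awk still formats every line, but nothing reaches the terminal.
elapsed = time_without_console(
    ["awk", "BEGIN { for (i = 1; i <= 100000; i++) print i }"]
)
print(f"{elapsed:.3f} s")
```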
Any idea if this is the most recent version? https://github.com/ploxiln/mawk-2
I simply installed the Ubuntu package, which is mawk 1.3.3.
awk is a small DSL that can parse text relatively quickly. It is installed by default on many Unix-like systems, requires little code, and is easy to integrate into shell-script pipelines.
I placed some solutions for the groupby questions here: https://gist.github.com/JohannesBuchner/442e09b7c77c7150a4885c715eb17e6b
Some of them may be correct.
mawk used to be faster than gawk; I am not sure whether the difference is still significant.
The solution to the median-related question involves sorting, which can be parallelized. I am not sure whether there is a more elegant solution.
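A sketch of the sort-based median, not necessarily the approach used in the gist: one awk stage drops the header, sort orders the rows by group and then numerically by value, and a second awk stage picks the middle element of each group. The column layout (group in column 1, value in column 2) and the names AWK_MEDIAN and group_medians are assumptions.

```python
import os
import subprocess
import tempfile

# Input to this stage is sorted by group, then by value, so the median of a
# group is its middle element (or the mean of the two middle elements).
AWK_MEDIAN = r'''
function emit() {
    if (n == 0) return
    m = (n % 2) ? v[(n + 1) / 2] : (v[n / 2] + v[n / 2 + 1]) / 2
    print g FS m
}
NR == 1 || ($1 "") != (g "") { emit(); g = $1; n = 0 }   # string comparison
{ v[++n] = $2 }
END { emit() }
'''


def group_medians(csv_path):
    """Return {group: median of column 2} via an awk | sort | awk pipeline."""
    env = {**os.environ, "LC_ALL": "C"}  # bytewise sort keeps groups contiguous
    pick = subprocess.Popen(
        ["awk", "-F", ",", 'NR > 1 { print $1 "," $2 }', csv_path],
        stdout=subprocess.PIPE)
    srt = subprocess.Popen(
        ["sort", "-t", ",", "-k1,1", "-k2,2n"],
        stdin=pick.stdout, stdout=subprocess.PIPE, env=env)
    pick.stdout.close()  # sort now holds the only read end of this pipe
    out = subprocess.check_output(
        ["awk", "-F", ",", AWK_MEDIAN], stdin=srt.stdout, text=True)
    srt.stdout.close()
    pick.wait()
    srt.wait()
    return {k: float(m)
            for k, m in (line.rsplit(",", 1) for line in out.splitlines())}


# Demo on a throwaway file (hypothetical columns id1, v1).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("id1,v1\nb,10\na,3\na,1\nb,20\na,2\n")
demo_medians = group_medians(f.name)
os.unlink(f.name)
print(demo_medians)
```

Only one group's values are held in awk's memory at a time; the heavy lifting is in sort, which spills to disk for large inputs and, in the GNU implementation, accepts a --parallel option.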