awk #160

JohannesBuchner · 2020-08-21T14:54:21Z

awk is a small DSL that parses text relatively quickly. It is installed by default on many Unix-like systems, requires little code, and is easy to integrate into shell pipelines.

I placed some solutions for the groupby questions here: https://gist.github.com/JohannesBuchner/442e09b7c77c7150a4885c715eb17e6b
Some of them may be correct.

mawk used to be faster than gawk; I am not sure whether the difference is still significant.

The solution to the median-related question sorts the data, and the sort can be parallelized. I am not sure whether there is a more elegant solution.
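To make the sort-then-scan idea concrete, here is a minimal Python sketch. It is not the code from the gist; the function name `grouped_median` and the `(group, value)` row layout are assumptions for illustration, and the sort is done in memory, whereas data larger than RAM would need an external sort.

```python
from itertools import groupby


def grouped_median(rows):
    """Return {group: median} for an iterable of (group, value) pairs.

    Hypothetical helper: sorts by group and then by value, so each group's
    values arrive contiguous and already ordered.
    """
    ordered = sorted(rows)  # in-memory sort; an external sort would replace this
    medians = {}
    for key, items in groupby(ordered, key=lambda row: row[0]):
        values = [value for _, value in items]
        mid = len(values) // 2
        if len(values) % 2:
            medians[key] = values[mid]
        else:
            medians[key] = (values[mid - 1] + values[mid]) / 2
    return medians


if __name__ == "__main__":
    print(grouped_median([("a", 3), ("b", 10), ("a", 1), ("a", 2), ("b", 20)]))
```

Once the rows are sorted, each group is a contiguous run, so the scan only needs to hold one group's values at a time.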

JohannesBuchner · 2020-08-21T15:49:36Z

This should work OK for very large datasets, in particular those much larger than RAM.

jangorecki · 2020-08-21T15:57:04Z

Thank you, will try it out. AFAIU it prints the result to stdout. What is the best way to store it in an in-memory variable? Piping into a file on a RAM disk? In the last question, there should also be a count by group, not just the sum.

JohannesBuchner · 2020-08-21T16:06:34Z

Not sure I understand; stdout is in RAM. If you want to store it in a Python program, perhaps subprocess.check_output is easiest.
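A minimal sketch of what that could look like, assuming an `awk` binary on the PATH and comma-separated input with the group in column 1 and the value in column 2. The function name `awk_sum_by_group` and the awk program are made up for illustration and are not taken from the gist.

```python
import shutil
import subprocess

# Hypothetical awk program: sum column 2 grouped by column 1.
AWK_PROGRAM = '{ s[$1] += $2 } END { for (k in s) print k "," s[k] }'


def awk_sum_by_group(csv_text):
    """Run awk on csv_text (passed via stdin) and return {group: sum}."""
    out = subprocess.check_output(
        ["awk", "-F,", AWK_PROGRAM],
        input=csv_text,  # fed to awk's stdin
        text=True,       # str in, str out
    )
    result = {}
    for line in out.splitlines():
        key, value = line.split(",")
        result[key] = float(value)
    return result


if __name__ == "__main__":
    if shutil.which("awk") is None:
        print("awk not found; skipping demo")
    else:
        print(awk_sum_by_group("a,1\nb,2\na,3\n"))
```

The whole awk output ends up in the Python string `out`, so this only suits results that fit comfortably in memory.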

JohannesBuchner · 2020-08-21T16:30:47Z

Updated the last command to include count.

JohannesBuchner · 2020-08-21T16:32:50Z

For very large outputs, reading from a pipe (also possible with subprocess) may be useful, since it avoids holding the whole output in memory.
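A rough sketch of the pipe approach, again assuming `awk` is on the PATH and the input is a comma-separated file. The generator name `stream_awk_lines` is hypothetical.

```python
import shutil
import subprocess


def stream_awk_lines(program, path):
    """Yield awk's output one line at a time instead of buffering all of it."""
    with subprocess.Popen(
        ["awk", "-F,", program, path],
        stdout=subprocess.PIPE,
        text=True,
    ) as proc:
        for line in proc.stdout:
            yield line.rstrip("\n")
    # Leaving the with-block waits for awk, so returncode is set here.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, proc.args)


if __name__ == "__main__":
    import tempfile

    if shutil.which("awk") is None:
        print("awk not found; skipping demo")
    else:
        with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
            f.write("a,1\nb,2\na,3\n")
        for row in stream_awk_lines('$1 == "a" { print $2 }', f.name):
            print(row)
```

Only one line is held by the Python side at a time, so memory use does not grow with the size of the output.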

jangorecki · 2020-08-21T16:58:10Z

The problem is that printing to the console adds overhead, so redirecting the output into a file should be preferred.

jangorecki · 2020-08-21T17:00:22Z

Also, each command reads the data from disk, which is another overhead that should be reduced. Ideally the data would be read once, and then all commands would run in sequence, each producing an output file for its query.
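One way this could be approximated with awk is to fold several aggregations into a single program, so that the file is read once and each query writes to its own output file. The sketch below assumes `awk` is on the PATH; the two queries, the column layout, and the names `run_queries_single_pass`, `sum_out` and `count_out` are invented for illustration and do not correspond to the benchmark's actual questions.

```python
import shutil
import subprocess

# Hypothetical awk program running two aggregations in one pass:
#   query 1: sum of column 3 grouped by column 1 -> file named by sum_out
#   query 2: row count grouped by column 2       -> file named by count_out
AWK_PROGRAM = """
{ total[$1] += $3; n[$2] += 1 }
END {
    for (k in total) print k "," total[k] > (sum_out)
    for (k in n)     print k "," n[k]     > (count_out)
}
"""


def run_queries_single_pass(data_path, sum_path, count_path):
    """Read data_path once and write one result file per query."""
    subprocess.run(
        [
            "awk", "-F,",
            "-v", f"sum_out={sum_path}",
            "-v", f"count_out={count_path}",
            AWK_PROGRAM,
            data_path,
        ],
        check=True,
    )


if __name__ == "__main__":
    import os
    import tempfile

    if shutil.which("awk") is None:
        print("awk not found; skipping demo")
    else:
        workdir = tempfile.mkdtemp()
        data = os.path.join(workdir, "data.csv")
        with open(data, "w") as f:
            f.write("x,p,1\ny,q,2\nx,q,3\n")
        run_queries_single_pass(
            data, os.path.join(workdir, "sum.csv"), os.path.join(workdir, "count.csv")
        )
        print(sorted(os.listdir(workdir)))
```

The trade-off is that the accumulators for all queries live in memory at the same time, and the queries can no longer be timed individually.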

JohannesBuchner · 2020-08-21T17:03:42Z

OK, if you want to remove the I/O time, RAM disks are probably a good solution.

JohannesBuchner · 2020-08-21T17:10:38Z

I am not sure whether you want to look at the output or not. If not, you can redirect it to /dev/null, which avoids the console printing overhead.
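If the commands are driven from Python, the same effect can be had with `subprocess.DEVNULL`. A small sketch, assuming `awk` is on the PATH; the helper name `time_awk_discarding_output` is hypothetical.

```python
import shutil
import subprocess
import time


def time_awk_discarding_output(program, path):
    """Run awk on path, discard its stdout, and return elapsed seconds."""
    start = time.perf_counter()
    subprocess.run(
        ["awk", "-F,", program, path],
        stdout=subprocess.DEVNULL,  # equivalent to `> /dev/null` in a shell
        check=True,
    )
    return time.perf_counter() - start


if __name__ == "__main__":
    import tempfile

    if shutil.which("awk") is None:
        print("awk not found; skipping demo")
    else:
        with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
            f.write("a,1\nb,2\na,3\n")
        elapsed = time_awk_discarding_output("{ print $1 }", f.name)
        print(f"elapsed: {elapsed:.4f} s")
```

The measured time still includes process start-up and reading the input file, so it is not a pure compute time.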

jangorecki · 2020-08-26T09:28:57Z

Any idea if this is the most recent version? https://github.com/ploxiln/mawk-2

JohannesBuchner · 2020-08-26T10:01:45Z

I simply installed the Ubuntu package, which is mawk 1.3.3.

jangorecki added the new solution label Aug 21, 2020
