Update lightgrep scanner for bulk_extractor 2.0 #421
…rmance regression. The timings below are from the following command:

    ./src/bulk_extractor -F ../lightgrep/pytest/keys/shuf10.txt -Z -o ~/be_timed_output_without_thread_local_`printf %04d $i` -E scan_lightgrep ~/ev/terry-2009-12-11-002.E01

| Thread_local? | Clocktime (Min.) | Clocktime (Max.) | Clocktime (Average) | Scan Lightgrep Time (Min.) | Scan Lightgrep Time (Max.) | Scan Lightgrep Time (Average) |
|---|---|---|---|---|---|---|
| FALSE | 162.965479 | 168.628229 | 164.2545712 | 494.810946 | 528.368114 | 504.1799554 |
| TRUE | 163.681386 | 173.587754 | 167.233617 | 499.815450 | 532.324335 | 516.4901762 |

This reverts commit 0ca43ec.
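To put the averages from the commit message above in relative terms, here is a small sketch that computes the percentage slowdown of the `thread_local` run against the run without it. The dictionary keys and the helper name `percent_slowdown` are made up for this illustration, and the units are whatever the timed runs reported.

```python
# Average timings copied from the commit message above
# (units as reported by the timed runs; not stated in the message).
averages = {
    "clocktime": {
        "without_thread_local": 164.2545712,
        "with_thread_local": 167.233617,
    },
    "scan_lightgrep": {
        "without_thread_local": 504.1799554,
        "with_thread_local": 516.4901762,
    },
}


def percent_slowdown(baseline: float, candidate: float) -> float:
    """Relative increase of candidate over baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# Hypothetical summary structure: metric name -> slowdown in percent.
slowdowns = {
    name: percent_slowdown(t["without_thread_local"], t["with_thread_local"])
    for name, t in averages.items()
}

for name, pct in slowdowns.items():
    print(f"{name}: {pct:+.2f}% with thread_local")
```

Both averages come out a few percent slower with `thread_local`, in line with the regression the commit message describes. The min/max ranges of the two configurations overlap, so this only summarizes the averages and is not a statistical test.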
I'm going to close this and re-open it as a draft PR.
Apparently that's not how you did it. I found instructions here. It's a draft now.
Codecov Report
@@ Coverage Diff @@
## main #421 +/- ##
=======================================
Coverage 47.94% 47.94%
=======================================
Files 112 112
Lines 13224 13224
=======================================
Hits 6339 6339
Misses 6885 6885
I didn't know about draft PRs, TIL.
Is this PR ready to go?
[jeez, terrible formatting for reply-by-email]
Good question: yes, and no.
We think this PR works, but it depends on the current main branch of lightgrep. To make for a good user experience, we need to release a new version of lightgrep and then update this PR with updated build scripts that can pull that release.
The current plan is to get the new release of lightgrep out before the end of the year. It has been under continual development for the past few months, as a ~25% time project. It has several minor improvements and bug fixes (per the spirit of the ACM paper). If you’ve got a specific date in mind for a new bulk_extractor release, that would be good to know and we may be able to adjust.
We are _not_ entirely confident in our usage of the new sbuf/scanner API. We would _love_ a code review of this PR from you. We could also push up the requisite lightgrep code for you to test, if you’d prefer.
Hi. What's the status on this?
We're getting ready to make a new lightgrep release for this to target. Can you review
This PR has the following functionality changes:
- `-f` and `-F` options, by default searching for both UTF-8 and UTF-16LE versions, with case-sensitivity
- scan_accts_lg
- scan_base16_lg
- scan_email_lg
- scan_gps_lg
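To illustrate what "searching for both UTF-8 and UTF-16LE versions, with case-sensitivity" means at the byte level, here is a minimal sketch. It is not how lightgrep or scan_lightgrep is implemented (lightgrep is a multipattern regular-expression engine); it uses plain substring search, and the names `keyword_patterns` and `find_hits` are hypothetical.

```python
def keyword_patterns(keyword: str) -> dict[str, bytes]:
    """Return the byte patterns for one keyword in both encodings."""
    return {
        "UTF-8": keyword.encode("utf-8"),
        "UTF-16LE": keyword.encode("utf-16-le"),
    }


def find_hits(buffer: bytes, keyword: str) -> list[tuple[int, str]]:
    """Case-sensitive search for keyword in either encoding.

    Returns (byte offset, encoding name) pairs sorted by offset.
    """
    hits = []
    for encoding, pattern in keyword_patterns(keyword).items():
        start = buffer.find(pattern)
        while start != -1:
            hits.append((start, encoding))
            start = buffer.find(pattern, start + 1)
    return sorted(hits)


# A made-up buffer: the keyword once in UTF-8, once in UTF-16LE,
# and once with different capitalization (which must not match).
sample = (
    b"xx"
    + "secret".encode("utf-8")
    + b"\x00\x00"
    + "secret".encode("utf-16-le")
    + b"Secret"
)
hits = find_hits(sample, "secret")
print(hits)
```

The same keyword produces two different byte sequences, so a scanner that only looked for the UTF-8 form would miss text stored as UTF-16LE, which is common in Windows artifacts.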
With the deletion of other lightgrep-based scanners, we were able to delete a lot of scaffolding code.
This PR is not yet ready, but we're opening it for comment. The following remains to be done:
Please let us know if you have any questions or comments.