try creation of prosim-db in one single shot #6

Open
oricdev opened this issue Jul 17, 2018 · 0 comments
Labels: easy (easy to deal with), help wanted (Extra attention is needed), prio 2 (middle priority)

Comments

@oricdev
Owner

oricdev commented Jul 17, 2018

Requirements: Issue #1 implemented

Why do this?
As stated before, the intersect process performs an in-memory map-reduce on the data from the data packages (quick), but then performs another map-reduce against the records already present in the Prosim-db, which were created during previous sliced imports. This second map-reduce is very heavy for MongoDB: it is performed for each in-memory record, yet it still involves a lot of reading, writing, expanding and indexing in the db.
Hence it could be interesting to determine the maximum number of products for which a 1-shot integration could be performed (in-memory map-reduces only). This would save a considerable amount of time (no more scheduled tasks running twice an hour), and the Prosim-db could possibly be generated from scratch in less than a day instead of several days.

How to proceed?
The number of products with the non-empty tags needed to compare products with each other is limited to about 20% of the official OFF db:
about 110,000 / 550,000 products
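
To give a feel for the scale of a 1-shot run, here is a rough back-of-the-envelope sketch. It assumes the comparison is pairwise between products, which this issue does not spell out; the function names are made up for illustration.

```python
from typing import Tuple


def extraction_ratio(eligible: int, total: int) -> float:
    """Share of the OFF db that has the non-empty tags needed for comparison."""
    return eligible / total


def pair_counts(n: int) -> Tuple[int, int]:
    """Return (cells of a full n x n grid, unordered pairs of distinct products).

    Assumption: every eligible product is compared with every other one.
    """
    full_grid = n * n
    unordered_pairs = n * (n - 1) // 2
    return full_grid, unordered_pairs


eligible, total = 110_000, 550_000
print(f"eligible share: {extraction_ratio(eligible, total):.0%}")

full_grid, unordered_pairs = pair_counts(eligible)
print(f"full grid cells: {full_grid:,}")
print(f"unordered pairs: {unordered_pairs:,}")
```

Under that assumption, the number of comparisons grows quadratically with the number of products, which is why memory and disk usage need to be checked before committing to a 1-shot run.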
Check what happens in terms of resource usage (memory, disk speed/space, overall behaviour) if we create the Prosim-db in 1 shot with the environment set up as follows:

  • feeder_1 has extracted all 110,000 products that meet the non-empty criteria for "nutrition_score_uk" and "categories_tags" into all_products.json
  • copy all_products.json to updated_products.json
  • in preparer/config.xml, set tags with these values:
    <width>120000</width>
    <height>120000</height>
    <stats_H_nb_products>nb products extracted in all_products.json</stats_H_nb_products>
    <stats_W_nb_products>nb products extracted in all_products.json</stats_W_nb_products>
  • in preparer/progress.xml, clear the values of the tags so that a new Prosim-db is started
  • in intersect/config.xml, set the max db size to 500 GB:
    <max_db_size_gigabytes>500</max_db_size_gigabytes>
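
The setup steps above could be scripted roughly as follows. This is only a sketch under assumptions: the helper names are hypothetical, the real layout of the XML files is not known here (tags are looked up anywhere in the tree by name), and "clearing" progress.xml is interpreted as emptying the text of every leaf element.

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path


def set_tag_values(xml_path, values):
    """Set the text of every element whose tag name is a key of `values`.

    Returns the list of tag names that were not found in the file.
    Note: ElementTree drops XML comments when it rewrites the file.
    """
    tree = ET.parse(xml_path)
    root = tree.getroot()
    missing = []
    for tag, text in values.items():
        matches = list(root.iter(tag))
        if not matches:
            missing.append(tag)
        for element in matches:
            element.text = str(text)
    tree.write(xml_path, encoding="utf-8", xml_declaration=True)
    return missing


def clear_leaf_values(xml_path):
    """Empty the text of every element that has no children (assumed meaning of 'clear')."""
    tree = ET.parse(xml_path)
    for element in tree.getroot().iter():
        if len(element) == 0:
            element.text = ""
    tree.write(xml_path, encoding="utf-8", xml_declaration=True)


def prepare_one_shot(all_products, updated_products, preparer_dir, intersect_dir,
                     nb_products, side=120_000, max_db_gb=500):
    """Apply the steps listed above; all paths are supplied by the caller."""
    shutil.copyfile(all_products, updated_products)
    missing = set_tag_values(Path(preparer_dir) / "config.xml", {
        "width": side,
        "height": side,
        "stats_H_nb_products": nb_products,
        "stats_W_nb_products": nb_products,
    })
    clear_leaf_values(Path(preparer_dir) / "progress.xml")
    missing += set_tag_values(Path(intersect_dir) / "config.xml",
                              {"max_db_size_gigabytes": max_db_gb})
    return missing  # tag names that could not be found, if any
```

`nb_products` is passed in by hand because the format of all_products.json is not described here; a non-empty return value signals that a tag name from the list above does not exist in the corresponding config file.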
@oricdev oricdev added easy easy to deal with help wanted Extra attention is needed prio 2 middle priority labels Jul 17, 2018