csvlink latlong comparator failing #80

Open
bgoodger opened this issue Feb 27, 2018 · 4 comments

@bgoodger

Hi,

I'm trying to link two CSV files with csvlink, and the LatLong comparator is failing because the field values are being treated as strings.

Error:

INFO:root:taking a sample of 150000 possible pairs
Traceback (most recent call last):
  File "/usr/local/bin/csvlink", line 11, in <module>
    sys.exit(launch_new_instance())
  File "/usr/local/lib/python3.6/site-packages/csvdedupe/csvlink.py", line 210, in launch_new_instance
    d.main()
  File "/usr/local/lib/python3.6/site-packages/csvdedupe/csvlink.py", line 134, in main
    deduper.sample(nonexact_1, nonexact_2, self.sample_size)
  File "/usr/local/lib/python3.6/site-packages/dedupe/api.py", line 849, in sample
    original_length_2)
  File "/usr/local/lib/python3.6/site-packages/dedupe/labeler.py", line 321, in sample_product
    sample_size)
  File "/usr/local/lib/python3.6/site-packages/dedupe/labeler.py", line 67, in sample_product
    deque_2)
  File "/usr/local/lib/python3.6/site-packages/dedupe/sampling.py", line 23, in blockedSample
    *args))
  File "/usr/local/lib/python3.6/site-packages/dedupe/sampling.py", line 122, in linkSamplePredicates
    yield linkSamplePredicate(subsample_size, predicate, items1, items2)
  File "/usr/local/lib/python3.6/site-packages/dedupe/sampling.py", line 144, in linkSamplePredicate
    block_keys = predicate_function(column)
  File "/usr/local/lib/python3.6/site-packages/dedupe/predicates.py", line 422, in latLongGridPredicate
    return (str([round(dim, digits) for dim in field]),)
  File "/usr/local/lib/python3.6/site-packages/dedupe/predicates.py", line 422, in <listcomp>
    return (str([round(dim, digits) for dim in field]),)
TypeError: type str doesn't define __round__ method

Config:

"field_names": ["Account_Name", "Mailing_Street", "Mailing_Zip", "Mailing_Country", "Mailing_City", "Mailing_State", "Entity_Legal_Name", "Australian_Business_Number", "Geolocation"],
"field_definition": [
    {"field": "Account_Name", "type": "String"},
    {"field": "Mailing_Street", "type": "String", "Has Missing": true},
    {"field": "Mailing_Zip", "type": "String", "Has Missing": true},
    {"field": "Mailing_City", "type": "String"},
    {"field": "Mailing_State", "type": "String"},
    {"field": "Mailing_Country", "type": "Exact"},
    {"field": "Entity_Legal_Name", "type": "Exact", "Has Missing": true},
    {"field": "Geolocation", "type": "LatLong"},
    {"field": "Australian_Business_Number", "type": "String", "Has Missing": true}
],
"output_file": "output.csv",
"skip_training": false,
"training_file": "training.json",
"sample_size": 150000,
"recall_weight": 2
}

The Geolocation data in the CSV looks like:
(-37.985132, 145.214008)

@bgoodger

@fgregg hopefully this is still maintained!

It's a fantastic package and hugely helpful.

@joshsim

joshsim commented Mar 30, 2018

I'm getting a similar error when passing LatLong as a field from a CSV. In my case it's a dedupe job within a single CSV, not a link between two files.

Running dedupe 1.8.1 on Python 3.6, macOS.

  File "dedupe-try.py", line 131, in <module>
    deduper.sample(data_d, 15000) #To train dedupe, we feed it a sample of records.
  File "//anaconda/lib/python3.6/site-packages/dedupe/api.py", line 806, in sample
    self.active_learner.sample_combo(data, blocked_proportion, sample_size)
  File "//anaconda/lib/python3.6/site-packages/dedupe/labeler.py", line 151, in sample_combo
    super(RLRLearner, self).sample_combo(*args)
  File "//anaconda/lib/python3.6/site-packages/dedupe/labeler.py", line 38, in sample_combo
    data)
  File "//anaconda/lib/python3.6/site-packages/dedupe/sampling.py", line 23, in blockedSample
    *args))
  File "//anaconda/lib/python3.6/site-packages/dedupe/sampling.py", line 62, in dedupeSamplePredicates
    items)
  File "//anaconda/lib/python3.6/site-packages/dedupe/sampling.py", line 81, in dedupeSamplePredicate
    block_keys = predicate_function(column)
  File "//anaconda/lib/python3.6/site-packages/dedupe/predicates.py", line 406, in latLongGridPredicate
    return (str([round(dim, digits) for dim in field]),)
  File "//anaconda/lib/python3.6/site-packages/dedupe/predicates.py", line 406, in <listcomp>
    return (str([round(dim, digits) for dim in field]),)
TypeError: type str doesn't define __round__ method
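
For anyone trying to reproduce this outside dedupe, here is a minimal sketch of the failing expression. The function name `grid_key` is made up for illustration, and the default `digits=1` is an assumption; only the `return` line is copied from the traceback. Iterating over a string yields single characters, so `round()` is handed a `str` and raises the same TypeError.

```python
# Hypothetical stand-in for the predicate shown in the traceback.
# Only the return expression is taken from the traceback; the name
# and the default value of `digits` are assumptions.
def grid_key(field, digits=1):
    return (str([round(dim, digits) for dim in field]),)


# A tuple of floats works: each element supports round().
print(grid_key((-37.985132, 145.214008)))

# A string fails: iterating it yields characters such as "(",
# and round() is not defined for str.
try:
    grid_key("(-37.985132, 145.214008)")
except TypeError as exc:
    print("TypeError:", exc)
```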

@cviebrock

Did anyone come up with a solution to this? I've tried storing my location column in the CSV as:

"123.45,-123.45"
"(123.45,-123.45)"
"[123.45,-123.45]"

... none of which work.
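
A likely reason all three formats behave the same: Python's `csv` module returns every cell as a `str`, whatever punctuation surrounds the numbers. The sketch below uses an in-memory CSV with made-up column names (`name`, `location`) to show that; it assumes the data is read with `csv.DictReader`, which may not match how your script loads it.

```python
import csv
import io

# In-memory CSV with the three formats tried above.
# Column names are illustrative only.
sample = io.StringIO(
    'name,location\n'
    'a,"123.45,-123.45"\n'
    'b,"(123.45,-123.45)"\n'
    'c,"[123.45,-123.45]"\n'
)

rows = list(csv.DictReader(sample))

# Every value comes back as str; no format is converted to a tuple.
for row in rows:
    print(type(row["location"]).__name__, repr(row["location"]))
```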

@harsha1597

I was able to solve this problem by making each value in the LatLong column a tuple of floats rather than a string, i.e. the value passed to dedupe should be the tuple (123.45, 123.45), not the text "(123.45, 123.45)".

You can see this example in the dedupe docs.
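
A minimal sketch of that conversion, assuming you load the records yourself in a Python script before handing them to dedupe (as in the dedupe-try.py traceback earlier in this thread). `parse_latlong` is a hypothetical helper, not part of dedupe, and returning `None` for blank cells is an assumption about how you want missing values represented. It does not handle exponent notation such as `1e-5`.

```python
import re

# Matches plain signed decimals such as -37.985132 or 145.
_NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")


def parse_latlong(value):
    """Hypothetical helper: convert text like "(-37.98, 145.21)"
    into a (float, float) tuple. Blank input returns None."""
    if value is None or not value.strip():
        return None
    numbers = _NUMBER.findall(value)
    if len(numbers) != 2:
        raise ValueError("expected two numbers, got %r" % (value,))
    lat, lon = (float(n) for n in numbers)
    return (lat, lon)


# Usage: convert the column on each record after reading the CSV.
# The record below is made-up sample data using a field name from
# the config in the first comment.
record = {"Geolocation": "(-37.985132, 145.214008)"}
record["Geolocation"] = parse_latlong(record["Geolocation"])
print(record["Geolocation"])
```

Because the helper only looks for two numbers, it accepts the bare, parenthesised, and bracketed forms alike. Whether csvlink itself can be made to do this conversion is not established in this thread.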
