Statistical bug in primary locations attribution #237

Closed
jcombaz opened this issue May 27, 2024 · 5 comments · Fixed by #242

Comments

@jcombaz

jcombaz commented May 27, 2024

In the file synthesis/population/spatial/primary, the following code introduces significant spatial correlations between home and primary locations. The cause is that numpy.repeat (which is used to sample locations among candidates) places duplicated values consecutively, and origins and locations are then paired position by position. In particular, the probability of two commutes having exactly the same origin and destination is significantly larger than it should be.

    location_ids = np.repeat(location_ids, location_counts)

    # Construct a data set for all commutes to this zone
    origin_id = np.repeat(df_flow["origin_id"].values, df_flow["count"].values)

    df_result = pd.DataFrame.from_records(dict(
        origin_id = origin_id,
        location_id = location_ids
    ))

To fix this, the attribution of origins to destinations should be made independent of the order in the data sets, e.g.:

    location_ids = np.repeat(location_ids, location_counts)

    # Construct a data set for all commutes to this zone
    origin_id = np.repeat(df_flow["origin_id"].values, df_flow["count"].values)

    np.random.shuffle(origin_id)
    np.random.shuffle(location_ids)

    df_result = pd.DataFrame.from_records(dict(
        origin_id = origin_id,
        location_id = location_ids
    ))
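
To make the effect concrete, here is a minimal toy sketch. It is not taken from the repository: the origin ids, location ids and counts are made-up values, and a seeded `numpy.random.default_rng` generator is used instead of `np.random.shuffle` purely for reproducibility. It shows that pairing two `np.repeat` outputs position by position yields only block-aligned pairs, while permuting one side breaks that alignment without changing how often each location is used.

```python
import numpy as np

# Seeded generator (an assumption of this sketch, for reproducibility)
rng = np.random.default_rng(0)

# Hypothetical toy data: three origins and three candidate locations,
# each with a count of 4 commutes
origins = np.array([10, 20, 30])
flow_counts = np.array([4, 4, 4])
locations = np.array([100, 200, 300])
location_counts = np.array([4, 4, 4])

# Same expansion as in the snippet above: duplicates end up consecutive
origin_id = np.repeat(origins, flow_counts)
location_ids = np.repeat(locations, location_counts)

# Position-by-position pairing without shuffling: origin 10 only ever
# meets location 100, origin 20 only location 200, and so on
pairs_before = set(zip(origin_id.tolist(), location_ids.tolist()))

# Permuting one side decouples the pairing from the block order
shuffled_location_ids = rng.permutation(location_ids)
pairs_after = set(zip(origin_id.tolist(), shuffled_location_ids.tolist()))

print("distinct pairs without shuffling:", len(pairs_before))
print("distinct pairs with shuffling:", len(pairs_after))
```

In this toy case the unshuffled pairing can only ever produce three distinct origin/location pairs, because both arrays are sorted into blocks of identical length.
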
@sebhoerl
Contributor

sebhoerl commented May 27, 2024

Hi Jacques, thanks for bringing this up. I just thought it through, and I think I agree. Though shuffling one is probably sufficient. I have the feeling that maybe the multinomial sampler was implemented differently before at some point and then this error got introduced.

Just out of interest, do you have any statistical analysis that shows the before and after?
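
One way such a before/after comparison could be sketched is shown below. This is not the analysis discussed in the thread, only an illustration on synthetic data: the helper `duplicate_pair_share`, the zone sizes and the uniform multinomial counts are all assumptions made for this example. It measures the share of commutes whose (origin, location) pair occurs more than once, first with the block-ordered pairing and then after permuting only one of the two arrays.

```python
import numpy as np


def duplicate_pair_share(origin_id, location_ids):
    """Share of commutes whose (origin, location) pair occurs more than once.

    Hypothetical helper for this sketch, not part of the project.
    """
    pairs = np.stack([origin_id, location_ids], axis=1)
    _, counts = np.unique(pairs, axis=0, return_counts=True)
    return counts[counts > 1].sum() / len(pairs)


rng = np.random.default_rng(42)

# Synthetic sizes, chosen arbitrarily for illustration
n_origins, n_locations, n_commutes = 50, 200, 1000

# Draw counts per origin and per candidate location (uniform weights assumed)
flow_counts = rng.multinomial(n_commutes, np.full(n_origins, 1 / n_origins))
location_counts = rng.multinomial(n_commutes, np.full(n_locations, 1 / n_locations))

# Expand counts into one entry per commute, as in the issue's snippet
origin_id = np.repeat(np.arange(n_origins), flow_counts)
location_ids = np.repeat(np.arange(n_locations), location_counts)

# Before: both arrays are block-ordered and paired position by position
share_before = duplicate_pair_share(origin_id, location_ids)

# After: only one of the two arrays is permuted
share_after = duplicate_pair_share(origin_id, rng.permutation(location_ids))

print(f"duplicate pair share before: {share_before:.3f}")
print(f"duplicate pair share after:  {share_after:.3f}")
```

Permuting a single array is enough here because a uniformly random permutation of one side already makes the pairing independent of the order of the other side.
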

@jcombaz
Author

jcombaz commented May 28, 2024 via email

@sebhoerl
Contributor

Ok, makes sense. Do you want to create a PR with the fix? Otherwise I can look into it end of the week.

@jcombaz
Author

jcombaz commented May 28, 2024 via email

@sebhoerl
Contributor

sebhoerl commented May 29, 2024 via email
