Added Gaussian Mixture Models as a toy distribution #28

Closed
wants to merge 2 commits into from

thelostscout
Contributor

I thought it might be useful to have Gaussian mixture models available as toy distributions.

@LarsKue
Owner

LarsKue commented May 24, 2024

Thank you for the addition! Can you briefly explain in the docstring what this distribution/dataset looks like (e.g. in 2D), and how it differs from the Hypersphere dataset?

@thelostscout
Contributor Author

From what I understand, I don't think it's very comparable to the hyperspheres. In this case, the user controls the placement of all Gaussian blobs as well as their weights and standard deviations.
Do you think it would be sensible to simplify creating the distributions by reducing the interface to high-level arguments, like the number of mixture components?
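
For illustration, the explicit parameterization described above could look roughly like the following. This is a minimal sketch in plain Python, not the code from this PR: the function name `sample_gmm` and its arguments are assumptions, and each component is taken to be isotropic (one standard deviation per blob).

```python
import random


def sample_gmm(means, stds, weights, n, rng=None):
    """Draw n points from an isotropic Gaussian mixture with explicit parameters.

    Hypothetical helper for illustration; not part of the library.
    """
    if rng is None:
        rng = random.Random()
    samples = []
    for _ in range(n):
        # Pick a component index according to the mixture weights.
        k = rng.choices(range(len(means)), weights=weights)[0]
        # Perturb each coordinate of the chosen mean with independent Gaussian noise.
        samples.append([rng.gauss(m, stds[k]) for m in means[k]])
    return samples


points = sample_gmm(
    means=[[-2.0, 0.0], [2.0, 0.0]],  # user-chosen blob centers
    stds=[0.5, 1.0],                  # one standard deviation per blob
    weights=[0.3, 0.7],               # relative mixture weights
    n=1000,
    rng=random.Random(0),
)
```

Every blob here is fully specified by the caller, which is the contrast with a dataset that derives its components from a few high-level hyperparameters.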

@LarsKue
Owner

LarsKue commented Jun 3, 2024

Yes, controlling the datasets via high-level hyperparameters in a similar fashion to how we construct models is the core philosophy of this library. Would you like to add this?

@thelostscout
Contributor Author

I think random generation would move it much closer to the hyperspheres dataset, depending on how the generation of the means and standard deviations is implemented. It might make this addition obsolete.

@LarsKue
Owner

LarsKue commented Jun 3, 2024

In that case, let's stick to the hyperspheres dataset. You are welcome to add more generation modes to the hyperspheres dataset, though. For instance, we could replace it with something like

class MixtureDataset:
    def __init__(self, mode="spheres"):
        match mode:
            case "cubes": ...
            case "spheres": ...
            case "random": ...

where each mode changes the behaviour of the generation of the mean and std.

@thelostscout
Contributor Author

My issue with random creation is the lack of reproducibility. Say I wanted to learn a distribution with two Gaussians and compare the loss values for different network types. In this case, the loss will usually differ depending on the overlap and position of the means.

@LarsKue
Owner

LarsKue commented Jun 3, 2024

For this, you can either copy the dataset directly or use a seed (see lightning.seed_everything) before sampling.

@thelostscout
Contributor Author

Hmm, I was thinking of a method that uses its own RNG, so that the dataset generation can be seeded without affecting the rest of the process. But maybe that's not necessary.
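
A dataset-local generator of this kind can be sketched with the standard library's `random.Random`. The helper name `make_means` and its arguments are hypothetical. In a PyTorch-based implementation, the analogous object would presumably be a `torch.Generator` passed to the sampling calls through their `generator` argument.

```python
import random


def make_means(n_components, dim, seed):
    # A private generator: seeding it does not touch the global random state.
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_components)]


random.seed(123)
expected_next = random.random()  # what the global stream yields on its own

random.seed(123)
means_a = make_means(2, 2, seed=0)
observed_next = random.random()  # global stream after building the dataset

means_b = make_means(2, 2, seed=0)
assert means_a == means_b              # same seed gives the same mixture
assert observed_next == expected_next  # the global stream was not advanced
```

The two assertions capture the point of the suggestion: the mixture is reproducible from its own seed, while global seeding for the rest of the training run is left undisturbed.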

@thelostscout
Contributor Author

In that case, let's stick to the hyperspheres dataset. You are welcome to add more generation modes to the hyperspheres dataset, though. For instance, we could replace it with something like

class MixtureDataset:
    def __init__(self, mode="spheres"):
        match mode:
            case "cubes": ...
            case "spheres": ...
            case "random": ...

where each mode changes the behaviour of the generation of the mean and std.

Yes, I can implement a hypercube version. I think I would place the blobs on the corners (and hence have an upper bound on the number of centers).
Would you consider renaming the dataset to something like make_blobs or gmm? When I looked at the different datasets, I assumed that hyperspheres would do something similar to hypershells, rather than build a Gaussian mixture model following a certain distribution rule.
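
The corner placement and its upper bound can be sketched as below. The function name `hypercube_corners`, the choice of the cube [-1, 1]^dim, and the enumeration order are assumptions made for illustration only.

```python
import itertools


def hypercube_corners(dim, n_components):
    """Return the first n_components corners of the cube [-1, 1]^dim."""
    max_components = 2 ** dim  # a dim-dimensional cube has 2^dim corners
    if n_components > max_components:
        raise ValueError(
            f"at most {max_components} components fit in {dim} dimensions"
        )
    corners = itertools.product((-1.0, 1.0), repeat=dim)
    return [list(c) for c in itertools.islice(corners, n_components)]


means = hypercube_corners(dim=2, n_components=4)
# means == [[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]]
```

In 2D this caps the mixture at four blobs, and asking for a fifth raises an error instead of silently reusing a corner.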

@LarsKue
Owner

LarsKue commented Jun 5, 2024

Would you consider a name change of the dataset to something like make_blobs or gmm?

I would welcome a name change consistent with the implementation of changes. However, let's stick to ML jargon, using Dataset in place of Model where possible.

@thelostscout
Contributor Author

Ok, the trivial way would be to name it GaussianMixtureDataset. Or maybe MultiGaussianDataset? Or DistributedBlobsDataset?

@thelostscout
Contributor Author

Ok, see the new PR.
