This repository contains a python library used to implement methodologies and reproduce results in Sanna Passino, F. and Heard, N. A. (2022), Latent structure blockmodels for Bayesian spectral graph clustering, Statistics and Computing 32(2) (link, arXiv preprint).
The python library can be installed (in edit mode) as follows:
pip install -e lib/
The library can then be imported in any python session:
import lsbm
The repository contains multiple directories:
- lib contains the python library;
- data contains some of the datasets used in the paper;
- notebooks contains Jupyter notebooks with examples on how to use the library;
- scripts contains python scripts to reproduce the results in the paper.
The main tool for inference on LSBMs is the MCMC sampler lsbm_gp_gibbs, contained in the file lib/lsbm/lsbm_gp.py. The class can be initialised using three objects:
- a d-dimensional embedding X, an n × d numpy array, where n is the number of nodes (see the sketch below for one common way to obtain such an embedding);
- the number of communities K, a positive integer;
- the set of kernels csi, a dictionary containing a kernel function for each community-dimension pair (k, j), used for the Gaussian processes.
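In practice, X would typically be a spectral embedding of the graph. As a minimal sketch of one way to obtain such an embedding, not part of the library, the adjacency spectral embedding can be computed with numpy; here the symmetric adjacency matrix A and the dimension d are assumed to be given:

import numpy as np
# A: symmetric n x n adjacency matrix (assumed given); d: embedding dimension
eigvals, eigvecs = np.linalg.eigh(A)
idx = np.argsort(np.abs(eigvals))[::-1][:d]          # d leading eigenvalues by magnitude
X = eigvecs[:, idx] * np.sqrt(np.abs(eigvals[idx]))  # n x d adjacency spectral embedding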
Additionally, the file lib/lsbm/lsbm.py contains a simplified class lsbm_gibbs which can be used for models where the kernel functions are assumed to be inner products. The latent functions in such models can be expressed in the form f_{k,j}(θ) = w_{k,j}ᵀ φ_{k,j}(θ), for basis functions φ_{k,j} and corresponding weights w_{k,j}, with joint normal-inverse-gamma prior with the variance parameter σ²_{k,j}:

w_{k,j} | σ²_{k,j} ~ N(0, σ²_{k,j} Δ_{k,j}),    σ²_{k,j} ~ IG(a₀, b₀).
The class lsbm_gibbs does not require a dictionary of kernel functions for initialisation, but a dictionary W_function containing the basis functions. The prior parameters of the NIG prior are then specified using the function initialise within the same class. The class uses Zellner's g-prior as default.
For example, consider the model where all the community-specific curves are assumed to be quadratic in the latent position θ, passing through the origin, and linear in the first dimension (with zero intercept and unit slope). Initialising the Gibbs sampler class in lsbm is easy, since the basis functions are straightforward to define:
phi = {}
for k in range(K):
    # first dimension: linear basis (zero intercept, unit slope)
    phi[k,0] = lambda theta: np.array([theta])
    for j in range(1,d):
        # remaining dimensions: quadratic basis through the origin
        phi[k,j] = lambda theta: np.array([theta ** 2, theta])
m = lsbm.lsbm_gibbs(X=X, K=K, W_function=phi)
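As a quick illustrative check (assuming d ≥ 2 here), with these definitions the basis functions evaluate as follows for any community k:

phi[0,0](0.5)   # array([0.5])        linear basis in the first dimension
phi[0,1](0.5)   # array([0.25, 0.5])  quadratic basis in the remaining dimensions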
Defining the kernels requires a bit more effort. Using the dictionary phi defined in the previous code snippet, the corresponding kernels, assuming prior scale matrices Delta for the NIG prior, are:
csi = {}
for k in range(K):
    # the default arguments k=k and j=j bind the loop variables at definition time,
    # avoiding Python's late-binding behaviour in closures
    csi[k,0] = lambda theta, theta_prime, k=k: Delta[k,0] * np.matmul(phi[k,0](theta).reshape(-1,1), phi[k,0](theta_prime).reshape(1,-1))
    for j in range(1,d):
        csi[k,j] = lambda theta, theta_prime, k=k, j=j: np.matmul(np.matmul(phi[k,j](theta), Delta[k,j]), np.transpose(phi[k,j](theta_prime)))
m = lsbm_gp.lsbm_gp_gibbs(X=X, K=K, csi=csi)
Note that the kernel functions in csi must be flexible enough to correctly handle four types of input, (float, float), (float, vector), (vector, float) and (vector, vector), returning an appropriate object (a float, a vector or a matrix) in each of the four circumstances. A robust wrapper is sketched below.
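For example, a hedged way to build kernels satisfying this requirement is to broadcast both arguments through np.atleast_1d and collapse singleton dimensions afterwards; the helper below is illustrative and not part of the library, and assumes phi_kj returns a one-dimensional basis vector and Delta_kj is the matching square scale matrix (or a scalar for the first dimension):

import numpy as np

def make_kernel(phi_kj, Delta_kj):
    # returns csi(theta, theta_prime) = phi(theta) Delta phi(theta_prime)'
    # accepting floats or vectors in either argument
    def kernel(theta, theta_prime):
        t = np.atleast_1d(theta)
        s = np.atleast_1d(theta_prime)
        Phi_t = np.vstack([phi_kj(u) for u in t])        # shape (len(t), q)
        Phi_s = np.vstack([phi_kj(u) for u in s])        # shape (len(s), q)
        out = Phi_t @ np.atleast_2d(Delta_kj) @ Phi_s.T  # shape (len(t), len(s))
        # float for (float, float), vector for mixed inputs, matrix otherwise
        return out.item() if out.size == 1 else np.squeeze(out)
    return kernel

csi = {(k,j): make_kernel(phi[k,j], Delta[k,j]) for k in range(K) for j in range(d)}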
Then the remaining parameters in both classes are initialised using the function initialise. For example, z could be initialised using k-means. If the first dimension is assumed to be linear in θ, then θ could be initialised using a noisy version of the first column of the embedding. For the same reason, the mean mu_theta of the prior on θ could just be assumed to be the mean of the first column of the embedding, with large variance sigma_theta. Finally, the parameters of the inverse-gamma prior on the variance component must be chosen carefully: the scale b_0 needs to be small to avoid unwanted strong prior effects.
m.initialise(z=np.random.choice(m.K,size=m.n), theta=X[:,0]+np.random.normal(size=m.n,scale=0.01),
mu_theta=X[:,0].mean(), sigma_theta=10, a_0=1, b_0=0.01, first_linear=True)
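Alternatively, as mentioned above, z could be initialised with k-means; a minimal sketch using scikit-learn, which is an assumption here and not a stated dependency of the library:

from sklearn.cluster import KMeans
z_init = KMeans(n_clusters=m.K, n_init=10).fit_predict(X)   # k-means labels as a starting point
m.initialise(z=z_init, theta=X[:,0]+np.random.normal(size=m.n,scale=0.01),
    mu_theta=X[:,0].mean(), sigma_theta=10, a_0=1, b_0=0.01, first_linear=True)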
In the class lsbm_gibbs, the function initialise requires an additional parameter Lambda_0, representing the scale parameter for Zellner's prior scale matrix Δ.
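For example, with an illustrative value for Lambda_0, the corresponding call in lsbm_gibbs would be:

m.initialise(z=np.random.choice(m.K,size=m.n), theta=X[:,0]+np.random.normal(size=m.n,scale=0.01),
    mu_theta=X[:,0].mean(), sigma_theta=10, a_0=1, b_0=0.01, first_linear=True, Lambda_0=1.0)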
Finally, the MCMC sampler could be run, specifying the number of samples samples, the burn-in length burn, the variance of the proposal distribution for θ, denoted sigma_prop, and the integer parameter thinning. The MCMC sampler retains only samples indexed by multiples of thinning.
q = m.mcmc(samples=10000, burn=1000, sigma_prop=0.5, thinning=1)
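The structure of the returned object q should be checked against the notebooks. Assuming, purely for illustration, that the retained community allocations are available as an integer array z_samples of shape (n, S), a point estimate of the clustering could be obtained by least-squares matching to the posterior similarity matrix:

import numpy as np
# z_samples: (n, S) array of retained allocations (an assumed layout, not the library's API)
n, S = z_samples.shape
coclust = np.stack([(z_samples[:, s][:, None] == z_samples[:, s][None, :]) for s in range(S)])
psm = coclust.mean(axis=0)                             # posterior similarity matrix
best = np.argmin(((coclust - psm) ** 2).sum(axis=(1, 2)))
z_hat = z_samples[:, best]                             # least-squares point estimate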
The figures in the paper could be reproduced using the following files in the directory scripts:
- Figure 1: sim_pictures_intro.py;
- Figure 3: sim_pictures.py;
- Figures 4 and 5: sim.py;
- Figure 6: harry_potter.py;
- Figures 7 and 8: drosophila.py.
For security reasons, the ICL network data have not been made available, but the code to run the quadratic model and the truncated power basis spline model is available in scripts/icl.py. Similarly, scripts/icl_gp.py reproduces the same models under an explicit Gaussian process setting.