This repository contains Python code to reproduce the results and simulations in Sanna Passino, F. and Heard, N. A. (2021+), "Latent structure blockmodels for Bayesian spectral graph clustering".
The main tool for inference on LSBMs is the MCMC sampler `lsbm_gp_gibbs`, contained in the file `lsbm_gp.py`. The class is initialised using three objects:
- a `d`-dimensional embedding `X`, a `numpy` array with `n` rows, where `n` is the number of nodes;
- the number of communities `K`, a positive integer;
- the set of kernels `csi`, a dictionary containing the kernel functions for the Gaussian processes.
Additionally, the file `lsbm.py` contains a simplified class `lsbm_gibbs`, which can be used for models where the kernel functions are assumed to be inner products. The latent functions in such models can be expressed in the form f_{k,j}(θ) = w_{k,j}ᵀ φ_{k,j}(θ), for basis functions φ_{k,j} and corresponding weights w_{k,j}, with a joint normal-inverse-gamma prior sharing the variance parameter σ²_{k,j}:

w_{k,j} | σ²_{k,j} ~ N(0, σ²_{k,j} Δ_{k,j}),  σ²_{k,j} ~ IG(a₀, b₀).
The class `lsbm_gibbs` does not require a dictionary of kernel functions for initialisation, but instead a dictionary `W_function` containing the basis functions. The prior parameters of the NIG prior are then specified using the function `initialise` within the same class. The class uses Zellner's g-prior as default.
For example, consider the model where all the community-specific curves are assumed to be quadratic in θ, passing through the origin, and linear in the first dimension (with zero intercept and unit slope). Initialising the Gibbs sampler class in `lsbm` is easy, since the basis functions are straightforward to define:
```python
phi = {}
for k in range(K):
    phi[k,0] = lambda theta: np.array([theta])
    for j in range(1,d):
        phi[k,j] = lambda theta: np.array([theta ** 2, theta])
m = lsbm.lsbm_gibbs(X=X, K=K, W_function=phi)
```
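As a standalone sanity check, independent of the `lsbm` package and using illustrative values `K = 2` and `d = 2` (these are assumptions, not values from the paper), the basis functions above can be evaluated directly to verify their output shapes:

```python
import numpy as np

K, d = 2, 2  # illustrative values

phi = {}
for k in range(K):
    phi[k, 0] = lambda theta: np.array([theta])                  # identity basis: theta
    for j in range(1, d):
        phi[k, j] = lambda theta: np.array([theta ** 2, theta])  # quadratic through the origin

print(phi[0, 0](2.0))  # [2.]
print(phi[0, 1](2.0))  # [4. 2.]
```

The first dimension returns a length-1 vector (linear term only), while the remaining dimensions return a length-2 vector (quadratic and linear terms, no intercept), matching the model described above.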
Defining the kernels requires a bit more effort. Using the dictionary `phi` defined in the previous code snippet, the corresponding kernels, assuming prior scale matrices `Delta` for the NIG prior, are:
```python
csi = {}
for k in range(K):
    # Default arguments bind k and j at definition time, avoiding Python's
    # late-binding closure pitfall (otherwise every lambda would see the
    # final values of k and j after the loops complete).
    csi[k,0] = lambda theta, theta_prime, k=k: Delta[k,0] * np.matmul(phi[k,0](theta).reshape(-1,1), phi[k,0](theta_prime).reshape(1,-1))
    for j in range(1,d):
        csi[k,j] = lambda theta, theta_prime, k=k, j=j: np.matmul(np.matmul(phi[k,j](theta), Delta[k,j]), np.transpose(phi[k,j](theta_prime)))
m = lsbm_gp.lsbm_gp_gibbs(X=X, K=K, csi=csi)
```
Note that the kernel functions in `csi` must be flexible enough to correctly handle four types of input, `(float, float)`, `(float, vector)`, `(vector, float)` and `(vector, vector)`, returning an appropriate object, a `float`, a `vector` or a `matrix`, in each of the four circumstances.
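To illustrate this requirement, here is a minimal sketch, not taken from the repository, of a single inner-product kernel built from a hypothetical quadratic basis with an identity scale matrix, written so that the return type adapts to the input types:

```python
import numpy as np

# Hypothetical basis and scale matrix, for illustration only.
phi = lambda theta: np.array([np.asarray(theta) ** 2, np.asarray(theta)])
Delta = np.eye(2)

def kernel(theta, theta_prime):
    # phi(scalar) has shape (2,); phi(vector of length n) has shape (2, n).
    # np.atleast_2d turns both transposes into 2-D arrays of basis rows.
    left = np.atleast_2d(phi(theta).T)         # (n, 2) or (1, 2)
    right = np.atleast_2d(phi(theta_prime).T)  # (m, 2) or (1, 2)
    out = left @ Delta @ right.T               # (n, m)
    if np.isscalar(theta) and np.isscalar(theta_prime):
        return out.item()                      # (float, float)   -> float
    if np.isscalar(theta):
        return out[0]                          # (float, vector)  -> vector
    if np.isscalar(theta_prime):
        return out[:, 0]                       # (vector, float)  -> vector
    return out                                 # (vector, vector) -> matrix
```

For instance, `kernel(1.0, 2.0)` evaluates to the float `6.0`, while passing two vectors of lengths `n` and `m` returns an `n × m` Gram matrix.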
Then the remaining parameters in both classes are initialised using the function `initialise`. For example, `z` could be initialised using k-means. If the first dimension is assumed to be linear in θ, then θ could be initialised using a noisy version of the first column of the embedding. For the same reason, the mean of the prior on θ could simply be set to the mean of the first column of the embedding, with a large variance σ²_θ. Finally, the parameters of the inverse-gamma prior on the variance component must be chosen carefully: the scale `b_0` needs to be small to avoid unwanted strong prior effects.
```python
m.initialise(z=np.random.choice(m.K, size=m.n), theta=X[:,0] + np.random.normal(size=m.n, scale=0.01),
             mu_theta=X[:,0].mean(), sigma_theta=10, a_0=1, b_0=0.01, first_linear=True)
```
In the class `lsbm_gibbs`, the function `initialise` requires an additional parameter `Lambda_0`, representing the scale parameter for the Zellner prior scale matrix.
Finally, the MCMC sampler can be run, specifying the number of samples `samples`, the burn-in length `burn`, the variance of the proposal distribution for θ, denoted `sigma_prop`, and the integer parameter `thinning`: the sampler retains only the samples indexed by multiples of `thinning`.
```python
q = m.mcmc(samples=10000, burn=1000, sigma_prop=0.5, thinning=1)
```
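The exact structure of the returned object depends on the sampler. Assuming, purely for illustration, that allocation draws are available as an integer array of shape `(n, samples)` (this is an assumption, not the documented API of `lsbm_gp_gibbs`), a simple point estimate of the clustering could be obtained by a per-node majority vote:

```python
import numpy as np

# Synthetic stand-in for MCMC allocation draws: n = 5 nodes, 100 retained
# samples, K = 3 communities. In practice these would come from the sampler
# output (an assumption, not the documented API).
rng = np.random.default_rng(0)
z_samples = rng.integers(0, 3, size=(5, 100))

# Majority vote across retained samples for each node.
z_hat = np.array([np.bincount(row, minlength=3).argmax() for row in z_samples])
print(z_hat.shape)  # (5,)
```

Note that raw allocation draws are affected by label switching, so in practice one would typically work with the posterior similarity matrix rather than voting on raw labels.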
The figures in the paper can be reproduced using the following files:
- Figure 1: `sim_pictures_intro.py`;
- Figure 3: `sim_pictures.py`;
- Figures 4 and 5: `sim.py`;
- Figure 6: `harry_potter.py`;
- Figures 7 and 8: `drosophila.py`.
For security reasons, the ICL network data have not been made available, but the code to run the quadratic model and the truncated power basis spline model is available in `icl.py`. Similarly, `icl_gp.py` reproduces the same models under an explicit Gaussian process setting.