This repository contains a python library used to implement methodologies and reproduce results in Sanna Passino, F. and Heard, N. A. (2022), Latent structure blockmodels for Bayesian spectral graph clustering, Statistics and Computing 32(2) (link, arXiv preprint).
The python library can be installed (in edit mode) as follows:
pip install -e lib/
The library can then be imported in any python session:
import lsbm
The repository contains multiple directories:
- lib contains the python library;
- data contains some of the datasets used in the paper;
- notebooks contains Jupyter notebooks with examples on how to use the library;
- scripts contains python scripts to reproduce the results in the paper.
The main tool for inference on LSBMs is the MCMC sampler lsbm_gp_gibbs, contained in the file lib/lsbm/lsbm_gp.py. The class can be initialised using three objects:
- a d-dimensional embedding X, an n × d numpy array, where n is the number of nodes (see the sketch below for one common way to obtain such an embedding);
- the number of communities K, a positive integer;
- the set of kernels csi, a dictionary containing a kernel function for each community-dimension pair (k, j), used for the Gaussian processes.
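In practice, X would typically be a spectral embedding of the graph. As a minimal sketch of one way to obtain such an embedding, not part of the library, the adjacency spectral embedding can be computed with numpy; here the symmetric adjacency matrix A and the dimension d are assumed to be given:

import numpy as np
# A: symmetric n x n adjacency matrix (assumed given); d: embedding dimension
eigvals, eigvecs = np.linalg.eigh(A)
idx = np.argsort(np.abs(eigvals))[::-1][:d]          # d leading eigenvalues by magnitude
X = eigvecs[:, idx] * np.sqrt(np.abs(eigvals[idx]))  # n x d adjacency spectral embedding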
Additionally, the file lib/lsbm/lsbm.py contains a simplified class lsbm_gibbs which can be used for models where the kernel functions are assumed to be inner products. The latent functions in such models can be expressed in the form f_{k,j}(θ) = w_{k,j}ᵀ φ_{k,j}(θ), for basis functions φ_{k,j} and corresponding weights w_{k,j}, with joint normal-inverse-gamma prior with the variance parameter σ²_{k,j}:

w_{k,j} | σ²_{k,j} ~ N(0, σ²_{k,j} Δ_{k,j}),    σ²_{k,j} ~ IG(a₀, b₀).
The class lsbm_gibbs does not require a dictionary of kernel functions for initialisation, but a dictionary W_function containing the basis functions. The prior parameters of the NIG prior are then specified using the function initialise within the same class. The class uses Zellner's g-prior as default.
For example, consider the model where all the community-specific curves are assumed to be quadratic in the latent position θ, passing through the origin, and linear in the first dimension (with zero intercept and unit slope). Initialising the Gibbs sampler class in lsbm is easy, since the basis functions are straightforward to define:
phi = {}
for k in range(K):
    # first dimension: linear basis (zero intercept, unit slope)
    phi[k,0] = lambda theta: np.array([theta])
    for j in range(1,d):
        # remaining dimensions: quadratic basis through the origin
        phi[k,j] = lambda theta: np.array([theta ** 2, theta])
m = lsbm.lsbm_gibbs(X=X, K=K, W_function=phi)
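As a quick illustrative check (assuming d ≥ 2 here), with these definitions the basis functions evaluate as follows for any community k:

phi[0,0](0.5)   # array([0.5])        linear basis in the first dimension
phi[0,1](0.5)   # array([0.25, 0.5])  quadratic basis in the remaining dimensions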
Defining the kernels requires a bit more effort. Using the dictionary phi defined in the previous code snippet, the corresponding kernels, assuming prior scale matrices Delta for the NIG prior, are:
csi = {}
for k in range(K):
    # the default arguments k=k and j=j bind the loop variables at definition time,
    # avoiding Python's late-binding behaviour in closures
    csi[k,0] = lambda theta, theta_prime, k=k: Delta[k,0] * np.matmul(phi[k,0](theta).reshape(-1,1), phi[k,0](theta_prime).reshape(1,-1))
    for j in range(1,d):
        csi[k,j] = lambda theta, theta_prime, k=k, j=j: np.matmul(np.matmul(phi[k,j](theta), Delta[k,j]), np.transpose(phi[k,j](theta_prime)))
m = lsbm_gp.lsbm_gp_gibbs(X=X, K=K, csi=csi)
Note that the kernel functions in csi must be flexible enough to correctly handle four types of input, (float, float), (float, vector), (vector, float) and (vector, vector), returning an appropriate object (a float, a vector or a matrix) in each of the four circumstances. A robust wrapper is sketched below.
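For example, a hedged way to build kernels satisfying this requirement is to broadcast both arguments through np.atleast_1d and collapse singleton dimensions afterwards; the helper below is illustrative and not part of the library, and assumes phi_kj returns a one-dimensional basis vector and Delta_kj is the matching square scale matrix (or a scalar for the first dimension):

import numpy as np

def make_kernel(phi_kj, Delta_kj):
    # returns csi(theta, theta_prime) = phi(theta) Delta phi(theta_prime)'
    # accepting floats or vectors in either argument
    def kernel(theta, theta_prime):
        t = np.atleast_1d(theta)
        s = np.atleast_1d(theta_prime)
        Phi_t = np.vstack([phi_kj(u) for u in t])        # shape (len(t), q)
        Phi_s = np.vstack([phi_kj(u) for u in s])        # shape (len(s), q)
        out = Phi_t @ np.atleast_2d(Delta_kj) @ Phi_s.T  # shape (len(t), len(s))
        # float for (float, float), vector for mixed inputs, matrix otherwise
        return out.item() if out.size == 1 else np.squeeze(out)
    return kernel

csi = {(k,j): make_kernel(phi[k,j], Delta[k,j]) for k in range(K) for j in range(d)}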
Then the remaining parameters in both classes are initialised using the function initialise. For example, z could be initialised using k-means. If the first dimension is assumed to be linear in θ, then θ could be initialised using a noisy version of the first column of the embedding. For the same reason, the mean mu_theta of the prior on θ could just be assumed to be the mean of the first column of the embedding, with large variance sigma_theta. Finally, the parameters of the inverse-gamma prior on the variance component must be chosen carefully: the scale b_0 needs to be small to avoid unwanted strong prior effects.
m.initialise(z=np.random.choice(m.K,size=m.n), theta=X[:,0]+np.random.normal(size=m.n,scale=0.01),
mu_theta=X[:,0].mean(), sigma_theta=10, a_0=1, b_0=0.01, first_linear=True)
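Alternatively, as mentioned above, z could be initialised with k-means; a minimal sketch using scikit-learn, which is an assumption here and not a stated dependency of the library:

from sklearn.cluster import KMeans
z_init = KMeans(n_clusters=m.K, n_init=10).fit_predict(X)   # k-means labels as a starting point
m.initialise(z=z_init, theta=X[:,0]+np.random.normal(size=m.n,scale=0.01),
    mu_theta=X[:,0].mean(), sigma_theta=10, a_0=1, b_0=0.01, first_linear=True)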
In the class lsbm_gibbs, the function initialise requires an additional parameter Lambda_0, representing the scale parameter for Zellner's prior scale matrix Δ.
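For example, with an illustrative value for Lambda_0, the corresponding call in lsbm_gibbs would be:

m.initialise(z=np.random.choice(m.K,size=m.n), theta=X[:,0]+np.random.normal(size=m.n,scale=0.01),
    mu_theta=X[:,0].mean(), sigma_theta=10, a_0=1, b_0=0.01, first_linear=True, Lambda_0=1.0)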
Finally, the MCMC sampler could be run, specifying the number of samples samples, the burn-in length burn, the variance of the proposal distribution for θ, denoted sigma_prop, and the integer parameter thinning. The MCMC sampler retains only samples indexed by multiples of thinning.
q = m.mcmc(samples=10000, burn=1000, sigma_prop=0.5, thinning=1)
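The structure of the returned object q should be checked against the notebooks. Assuming, purely for illustration, that the retained community allocations are available as an integer array z_samples of shape (n, S), a point estimate of the clustering could be obtained by least-squares matching to the posterior similarity matrix:

import numpy as np
# z_samples: (n, S) array of retained allocations (an assumed layout, not the library's API)
n, S = z_samples.shape
coclust = np.stack([(z_samples[:, s][:, None] == z_samples[:, s][None, :]) for s in range(S)])
psm = coclust.mean(axis=0)                             # posterior similarity matrix
best = np.argmin(((coclust - psm) ** 2).sum(axis=(1, 2)))
z_hat = z_samples[:, best]                             # least-squares point estimate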
The figures in the paper could be reproduced using the following files in the directory scripts:
- Figure 1: sim_pictures_intro.py;
- Figure 3: sim_pictures.py;
- Figures 4 and 5: sim.py;
- Figure 6: harry_potter.py;
- Figures 7 and 8: drosophila.py.
For security reasons, the ICL network data have not been made available, but the code to run the quadratic model and the truncated power basis spline model is available in scripts/icl.py. Similarly, scripts/icl_gp.py reproduces the same models under an explicit Gaussian process setting.