Description

Algorithms for finding clusters in graphs.

The methods in this module can be divided into three sub-categories.

Clustering Algorithms: The two key clustering methods provided by this module are stag::spectral_cluster and stag::local_cluster.

Similarity Graph Construction: The module provides the methods stag::similarity_graph and stag::approximate_similarity_graph for constructing a similarity graph from data.

Clustering Evaluation: The module provides implementations of the standard ARI and NMI clustering evaluation metrics in the stag::adjusted_rand_index and stag::normalised_mutual_information methods.

Functions
std::vector< StagInt >	stag::spectral_cluster (stag::Graph *graph, StagInt k)

std::vector< StagInt >	stag::cheeger_cut (stag::Graph *graph)

std::vector< StagInt >	stag::local_cluster (stag::LocalGraph *graph, StagInt seed_vertex, double target_volume)

std::vector< StagInt >	stag::local_cluster_acl (stag::LocalGraph *graph, StagInt seed_vertex, double locality, double error)

std::vector< StagInt >	stag::local_cluster_acl (stag::LocalGraph *graph, StagInt seed_vertex, double locality)

std::tuple< SprsMat, SprsMat >	stag::approximate_pagerank (stag::LocalGraph *graph, SprsMat &seed_vector, double alpha, double epsilon)

std::vector< StagInt >	stag::sweep_set_conductance (stag::LocalGraph *graph, SprsMat &vec)

std::vector< StagInt >	stag::sweep_set_conductance (stag::Graph *graph, SprsMat &vec)

std::vector< StagInt >	stag::connected_component (stag::LocalGraph *g, StagInt v)

std::vector< std::vector< StagInt > >	stag::connected_components (stag::Graph *g)

double	stag::adjusted_rand_index (std::vector< StagInt > &gt_labels, std::vector< StagInt > &labels)

double	stag::mutual_information (std::vector< StagInt > &gt_labels, std::vector< StagInt > &labels)

double	stag::normalised_mutual_information (std::vector< StagInt > &gt_labels, std::vector< StagInt > &labels)

double	stag::conductance (stag::LocalGraph *graph, std::vector< StagInt > &cluster)

std::vector< StagInt >	stag::symmetric_difference (std::vector< StagInt > &S, std::vector< StagInt > &T)

Graph	stag::approximate_similarity_graph (DenseMat *data, StagReal a)

Graph	stag::similarity_graph (DenseMat *data, StagReal a)

Function Documentation

◆ spectral_cluster()

std::vector< StagInt > stag::spectral_cluster	(	stag::Graph *	graph,
		StagInt	k
	)

Spectral clustering algorithm.

This is a simple graph clustering method, which provides a clustering of the entire graph. To use spectral clustering, simply pass a stag::Graph object and the number of clusters you would like to find.

#include <iostream>
#include <stag/graph.h>
#include <stag/cluster.h>
 
int main() {
  stag::Graph myGraph = stag::barbell_graph(10);
  std::vector<StagInt> clusters = stag::spectral_cluster(&myGraph, 2);
 
  for (auto c : clusters) {
    std::cout << c << ", ";
  }
  std::cout << std::endl;
 
  return 0;
}

The spectral clustering algorithm has the following steps.

Compute the \(k\) smallest eigenvectors of the normalised Laplacian matrix.
Embed the vertices into \(\mathbb{R}^k\) according to the eigenvectors.
Cluster the vertices into \(k\) clusters using a \(k\)-means clustering algorithm.

Parameters

graph	the graph object to be clustered
k	the number of clusters to find. Should be less than \(n/2\).

Returns: a vector giving the cluster membership for each vertex in the graph

References: A. Ng, M. Jordan, Y. Weiss. On spectral clustering: Analysis and an algorithm. NeurIPS'01

◆ cheeger_cut()

std::vector< StagInt > stag::cheeger_cut ( stag::Graph * graph )

Find the Cheeger cut in a graph.

Let \(G = (V, E)\) be a graph and \(\mathcal{L}\) be its normalised Laplacian matrix with eigenvalues \(0 = \lambda_1 \leq \lambda_2 \leq \ldots \leq \lambda_n\). Then, Cheeger's inequality states that

\[ \frac{\lambda_2}{2} \leq \Phi_G \leq \sqrt{2 \lambda_2}, \]

where

\[ \Phi_G = \min_{S \subset V} \phi(S) \]

is the conductance of \(G\). The proof of Cheeger's inequality is constructive: by computing the eigenvector corresponding to \(\lambda_2\), and performing the sweep set operation, we are able to find a set \(S\) with conductance close to the optimal. The partition returned by this algorithm is called the 'Cheeger cut' of the graph.

Parameters

graph the graph object to be partitioned

Returns: A vector giving the cluster membership for each vertex in the graph. Each entry in the vector is either \(0\) or \(1\) to indicate which side of the cut the vertex belongs to.

◆ local_cluster()

std::vector< StagInt > stag::local_cluster	(	stag::LocalGraph *	graph,
		StagInt	seed_vertex,
		double	target_volume
	)

Local clustering algorithm based on personalised Pagerank.

Given a graph and starting vertex, return a cluster which is close to the starting vertex.

This method uses the ACL local clustering algorithm.

Parameters

graph	a graph object implementing the LocalGraph interface
seed_vertex	the starting vertex in the graph
target_volume	the approximate volume of the cluster you would like to find

Returns: a vector containing the indices of vectors considered to be in the same cluster as the seed_vertex.

References: R. Andersen, F. Chung, K. Lang. Local graph partitioning using pagerank vectors. FOCS'06

◆ local_cluster_acl() [1/2]

std::vector< StagInt > stag::local_cluster_acl	(	stag::LocalGraph *	graph,
		StagInt	seed_vertex,
		double	locality,
		double	error
	)

The ACL local clustering algorithm. Given a graph and starting vertex, return a cluster close to the starting vertex, constructed in a local way.

The locality parameter is passed as the alpha parameter in the personalised Pagerank calculation.

Parameters

graph	a graph object implementing the LocalGraph interface
seed_vertex	the starting vertex in the graph
locality	a value in \([0, 1]\) indicating how 'local' the cluster should be. A value of \(1\) will return only the seed vertex, and a value of \(0\) will explore the whole graph.
error	(optional) - the acceptable error in the calculation of the approximate pagerank. Default \(0.001\).

Returns: a vector containing the indices of vectors considered to be in the same cluster as the seed_vertex.

References: R. Andersen, F. Chung, K. Lang. Local graph partitioning using pagerank vectors. FOCS'06

◆ local_cluster_acl() [2/2]

std::vector< StagInt > stag::local_cluster_acl	(	stag::LocalGraph *	graph,
		StagInt	seed_vertex,
		double	locality
	)

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

◆ approximate_pagerank()

std::tuple< SprsMat, SprsMat > stag::approximate_pagerank	(	stag::LocalGraph *	graph,
		SprsMat &	seed_vector,
		double	alpha,
		double	epsilon
	)

Compute the approximate Pagerank vector.

The parameters seed_vector, alpha, and epsilon are used as described in the ACL paper.

Note that the dimension of the returned vectors may not match the correct number of vertices in the graph provided since the approximate Pagerank is computed locally.

Parameters

graph	a stag::LocalGraph object
seed_vector	the seed vector of the personalised Pagerank
alpha	the locality parameter of the personalised Pagerank
epsilon	the error parameter of the personalised Pagerank

Returns

A tuple of sparse column vectors corresponding to

p: the approximate Pagerank vector
r: the residual vector

By the definition of approximate Pagerank, it holds that p + ppr(r, alpha) = ppr(s, alpha).

Exceptions

std::invalid_argument if the provided seed_vector is not a column vector.

References: R. Andersen, F. Chung, K. Lang. Local graph partitioning using pagerank vectors. FOCS'06

◆ sweep_set_conductance() [1/2]

std::vector< StagInt > stag::sweep_set_conductance	(	stag::LocalGraph *	graph,
		SprsMat &	vec
	)

Find the sweep set of the given vector with the minimum conductance.

First, sort the vector such that \(v_1<= \ldots <= v_n\). Then let

\[ S_i = \{v_j : j <= i\} \]

and return the set of original indices corresponding to

\[ \mathrm{argmin}_i \phi(S_i) \]

where \(\phi(S)\) is the conductance of \(S\).

When the provided graph is a stag::LocalGraph, the volume of the support of the provided vector should be less than half the total volume of the graph. The method does not (and cannot) check this condition.

When the provided graph is a stag::Graph, there is no restriction on the volume of the support of the provided vector.

Note that the caller is responsible for any required normalisation of the input vector. In particular, this method does not normalise the vector by the node degrees.

Parameters

graph	a stag::LocalGraph or stag::Graph object
vec	the vector to sweep over

Returns: a vector containing the indices of vec which give the minimum conductance in the given graph

◆ sweep_set_conductance() [2/2]

std::vector< StagInt > stag::sweep_set_conductance	(	stag::Graph *	graph,
		SprsMat &	vec
	)

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

◆ connected_component()

std::vector< StagInt > stag::connected_component	(	stag::LocalGraph *	g,
		StagInt	v
	)

Return the vertex indices of every vertex in the same connected component as the specified vertex.

The running time of this method is proportional to the size of the returned connected component.

The returned vector is not sorted.

Parameters

g	a stag::LocalGraph instance
v	a vertex of the graph

Returns: a vector containing the vertex ids of every vertex in the connected component corresponding to v

◆ connected_components()

std::vector< std::vector< StagInt > > stag::connected_components ( stag::Graph * g )

Return a vector of the connected components in the specified graph.

Parameters

g	a stag::Graph instance

Returns: a vector containing the connected components of the graph

◆ adjusted_rand_index()

double stag::adjusted_rand_index	(	std::vector< StagInt > &	gt_labels,
		std::vector< StagInt > &	labels
	)

Compute the Adjusted Rand Index between two label vectors.

Parameters

gt_labels	the ground truth labels for the dataset
labels	the candidate labels whose ARI should be calculated

Returns: the ARI between the two labels vectors

References: W. M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association. 66 (336): 846–850. 1971.

◆ mutual_information()

double stag::mutual_information	(	std::vector< StagInt > &	gt_labels,
		std::vector< StagInt > &	labels
	)

Compute the Mutual Information between two label vectors.

Parameters

gt_labels	the ground truth labels for the dataset
labels	the candidate labels whose MI should be calculated

Returns: the MI between the two labels vectors

◆ normalised_mutual_information()

double stag::normalised_mutual_information	(	std::vector< StagInt > &	gt_labels,
		std::vector< StagInt > &	labels
	)

Compute the Normalised Mutual Information between two label vectors.

Parameters

gt_labels	the ground truth labels for the dataset
labels	the candidate labels whose NMI should be calculated

Returns: the NMI between the two labels vectors

References: Vinh, Epps, and Bailey, (2009). Information theoretic measures for clusterings comparison. 26th Annual International Conference on Machine Learning (ICML ‘09).

◆ conductance()

double stag::conductance	(	stag::LocalGraph *	graph,
		std::vector< StagInt > &	cluster
	)

Compute the conductance of the given cluster in a graph.

Given a graph \(G = (V, E)\), the conductance of \(S \subseteq V\) is defined to be

\[ \phi(S) = \frac{w(S, V \setminus S)}{\mathrm{vol}(S)}, \]

where \(\mathrm{vol}(S) = \sum_{v \in S} \mathrm{deg}(v)\) is the volume of \(S\) and \(w(S, V \setminus S)\) is the total weight of edges crossing the cut between \(S\) and \(V \setminus S\).

Parameters

graph	a stag::LocalGraph object representing \(G\).
cluster	a vector of node IDs in \(S\).

Returns: the conductance \(\phi_G(S)\).

◆ symmetric_difference()

std::vector< StagInt > stag::symmetric_difference	(	std::vector< StagInt > &	S,
		std::vector< StagInt > &	T
	)

Compute the symmetric difference of two sets of integers.

Given sets \(S\) and \(T\), the symmetric difference \(S \triangle T\) is defined to be

\[ S \triangle T = \{S \setminus T\} \cup \{T \setminus S\}. \]

Although \(S\) and \(T\) are provided as vectors, they are treated as sets and any duplicates will be ignored.

Parameters

S	a vector containing the first set of integers
T	a vector containing the second set of integers

Returns: a vector containing the integers in the symmetric difference of S and T.

◆ approximate_similarity_graph()

Graph stag::approximate_similarity_graph	(	DenseMat *	data,
		StagReal	a
	)

Construct an approximate similarity graph for the given dataset.

Given datapoints \(\{x_1, \ldots, x_n\} \in \mathbb{R}^n\) and a parameter \(a\), the similarity between two data points is given by

\[ k(x_i, x_j) = \mathrm{exp}\left(- a \|x_i - x_j\|^2 \right). \]

Then, the similarity graph of the data is a complete graph on \(n\) vertices such that the weight between vertex \(i\) and \(j\) is given by \(k(x_i, x_j)\). However, the complete similarity graph requires \(O(n^2)\) time and space to construct.

This method implements an algorithm which approximates the similarity graph with a sparse graph, while preserving any cluster structure of the graph. This algorithm has running time \(\widetilde{O}(n^{1.25})\).

Parameters

data	an \(n \times d\) Eigen matrix representing the dataset.
a	the parameter of the similarity kernel.

Returns: a stag::Graph object representing the similarity of the data

Reference: Peter Macgregor and He Sun, Fast Approximation of Similarity Graphs with Kernel Density Estimation. In NeurIPS'23.

◆ similarity_graph()

Graph stag::similarity_graph	(	DenseMat *	data,
		StagReal	a
	)

Construct a complete similarity graph for the given dataset.

Given datapoints \(\{x_1, \ldots, x_n\} \in \mathbb{R}^n\) and a parameter \(a\), the similarity between two data points is given by

\[ k(x_i, x_j) = \mathrm{exp}\left(- a \|x_i - x_j\|^2 \right). \]

Then, the similarity graph of the data is a complete graph on \(n\) vertices such that the weight between vertex \(i\) and \(j\) is given by \(k(x_i, x_j)\).

Note that the time and space complexity of this method is \(O(n^2)\). For a faster, approximate method, you could consider using stag::approximate_similarity_graph.

Parameters

data	an \(n \times d\) Eigen matrix representing the dataset.
a	the parameter of the similarity kernel.

Returns: a stag::Graph object representing the similarity of the data

Description

Functions

Function Documentation

◆ spectral_cluster()

◆ cheeger_cut()

◆ local_cluster()

◆ local_cluster_acl() [1/2]

◆ local_cluster_acl() [2/2]

◆ approximate_pagerank()

◆ sweep_set_conductance() [1/2]

◆ sweep_set_conductance() [2/2]

◆ connected_component()

◆ connected_components()

◆ adjusted_rand_index()

◆ mutual_information()

◆ normalised_mutual_information()

◆ conductance()

◆ symmetric_difference()

◆ approximate_similarity_graph()

◆ similarity_graph()