#include <kde.h>

Description

A CKNS Gaussian KDE data structure.

This data structure implements the CKNS algorithm for kernel density estimation. Given data \(x_1, \ldots, x_n \in \mathbb{R}^d\), in matrix format, this data structure will preprocess the data in \(O(\epsilon^{-2} n^{1.25})\) time, such that for any query point, a \((1 + \epsilon)\)-approximate kernel density estimate can be returned in \(O(\epsilon^{-2} n^{0.25})\) time.

References: Charikar, Moses, et al. "Kernel density estimation through density constrained near neighbor search." 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 2020.

Public Member Functions
	CKNSGaussianKDE (DenseMat *data, StagReal a, StagReal eps, StagReal min_mu)

	CKNSGaussianKDE (DenseMat *data, StagReal a, StagReal eps)

	CKNSGaussianKDE (DenseMat *data, StagReal a)

	CKNSGaussianKDE (DenseMat *data, StagReal a, StagReal min_mu, StagInt K1, StagReal K2_constant, StagInt sampling_offset)

	ExactGaussianKDE (DenseMat *data, StagReal a)

std::vector< StagReal >	query (DenseMat *q)

StagReal	query (const stag::DataPoint &q)

Constructor & Destructor Documentation

◆ CKNSGaussianKDE() [1/4]

stag::CKNSGaussianKDE::CKNSGaussianKDE	(	DenseMat *	data,
		StagReal	a,
		StagReal	eps,
		StagReal	min_mu
	)

Initialise a new KDE data structure with the given dataset.

Data should be a pointer to a matrix \(X \in \mathbb{R}^{n \times d}\) where each row represents a data point.

The \(\epsilon\) parameter is used to control the error guarantee of the CKNS data structure. A lower value of \(\epsilon\) will give a more accurate estimate, at the cost of a higher processing time. The data structure should produce estimates which are within a \((1 \pm \epsilon)\) factor of the true kernel density.

The initialisation time complexity of the data structure is \(O(\epsilon^{-2} n^{1.25} \log^2(n))\) and the query time for each query point is \((O(\epsilon^{-2} n^{0.25} \log^2(n))\).

Parameters

data	pointer to an \((n \times d)\) matrix containing the dataset.
a	the parameter \(a\) of the Gaussian kernel function.
eps	(optional) the error parameter \(\epsilon\) of the KDE data structure. Default is 0.5.
min_mu	(optional) the minimum kernel density value of any query point. A smaller number will give longer preprocessing and query time complexity. If a query point has a kernel density smaller than this value, then the data structure may not return the correct result. Default is 1 / n.

◆ CKNSGaussianKDE() [2/4]

stag::CKNSGaussianKDE::CKNSGaussianKDE	(	DenseMat *	data,
		StagReal	a,
		StagReal	eps
	)

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

◆ CKNSGaussianKDE() [3/4]

stag::CKNSGaussianKDE::CKNSGaussianKDE	(	DenseMat *	data,
		StagReal	a
	)

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

◆ CKNSGaussianKDE() [4/4]

stag::CKNSGaussianKDE::CKNSGaussianKDE	(	DenseMat *	data,
		StagReal	a,
		StagReal	min_mu,
		StagInt	K1,
		StagReal	K2_constant,
		StagInt	sampling_offset
	)

Initialise a new KDE data structure with the given dataset.

Data should be a pointer to a matrix \(X \in \mathbb{R}^{n \times d}\) where each row represents a data point.

This constructor gives fine-grained control over the constants used by the data structure to control the accuracy and variance of the estimator. For casual application, the simpler constructors will be sufficient, and will select sensible default values for the constants.

Parameters

data	pointer to an \((n \times d)\) matrix containing the dataset.
a	the parameter \(a\) of the Gaussian kernel function.
min_mu	the minimum kernel density value of any query point. A smaller number will give longer preprocessing and query time complexity. If a query point has a kernel density smaller than this value, then the data structure may not return the correct result.
K1	the number of copies of the data structure to create in parallel. This parameter controls the variance of the estimator returned by the algorithm. Is is usually set to \(\epsilon^{-2} \cdot \log(n)\).
K2_constant	controls the collision probability of each of the E2LSH hash tables used within the data structure. A higher value will give more accurate estimates at the cost of higher memory and time complexity. It is usually set to \(0.1 \log(n)\).
sampling_offset	the CKNS algorithm samples the dataset with various sampling probabilities. Setting a sampling offset of \(k\) will further subsample the data by a factor of \(1/2^k\). This will speed up the algorithm at the cost of some accuracy. It is usually set to \(0\).

Member Function Documentation

◆ ExactGaussianKDE()

stag::CKNSGaussianKDE::ExactGaussianKDE	(	DenseMat *	data,
		StagReal	a
	)

Initialise the data structure with the given dataset and Gaussian kernel parameter \(a\).

The initialisation time for this data structure is \(O(1)\).

Parameters

data
a

◆ query() [1/2]

std::vector< StagReal > stag::CKNSGaussianKDE::query ( DenseMat * q )

Calculate the KDE value for each of the data points in the query matrix.

The query time complexity for each query point is \(O(n)\).

Parameters

q

Returns: a vector of KDE values.

◆ query() [2/2]

StagReal stag::CKNSGaussianKDE::query ( const stag::DataPoint & q )

Calculate the exact kernel density for the given query point.

Note: If you would like to obtain a kernel density for many query points, you should use the other query method to pass them all simultaneously.

Parameters

q	the query data point

Returns: the kernel density for the given query point

Description

Public Member Functions

Constructor & Destructor Documentation

◆ CKNSGaussianKDE() [1/4]

◆ CKNSGaussianKDE() [2/4]

◆ CKNSGaussianKDE() [3/4]

◆ CKNSGaussianKDE() [4/4]

Member Function Documentation

◆ ExactGaussianKDE()

◆ query() [1/2]

◆ query() [2/2]