Introduction to trace estimation with primate

primate contains a variety of methods for estimating quantities derived from matrix functions. One popular such quantity is the trace of f(A), for some generic f: [a,b] \to \mathbb{R} defined on the spectrum of A. Writing the eigendecomposition of a symmetric A as A = U \Lambda U^{\intercal}, this is: \mathrm{tr}(f(A)) \triangleq \mathrm{tr}(U f(\Lambda) U^{\intercal}), \quad \quad f : [a,b] \to \mathbb{R}

Many quantities of common interest can be expressed as traces of matrix functions with the right choice of f. A few such function specializations include:

\begin{align*} f(\lambda) &= \lambda \quad &\Longleftrightarrow& \quad &\mathrm{tr}(A) \\ f(\lambda) &= \lambda^{-1} \quad &\Longleftrightarrow& \quad &\mathrm{tr}(A^{-1}) \\ f(\lambda) &= \log(\lambda) \quad &\Longleftrightarrow& \quad &\mathrm{logdet}(A) \\ f(\lambda) &= \mathrm{exp}(\lambda) \quad &\Longleftrightarrow& \quad &\mathrm{tr}(\mathrm{exp}(A)) \\ f(\lambda) &= \mathrm{sign}(\lambda) \quad &\Longleftrightarrow& \quad &\mathrm{rank}(A) \\ &\vdots \quad &\hphantom{\Longleftrightarrow}& \quad & \vdots & \end{align*}
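For a concrete sanity check, the first few of these identities are easy to verify with numpy on a small positive-definite matrix (a quick sketch; the matrix B below is just an illustrative stand-in, unrelated to primate):

import numpy as np
rng = np.random.default_rng(0)
G = rng.normal(size=(30, 30))
B = G @ G.T / 30 + 0.5 * np.eye(30)                              ## a well-conditioned PD matrix
ew = np.linalg.eigvalsh(B)                                       ## eigenvalues of B
print(np.isclose(np.trace(B), np.sum(ew)))                       ## tr(B)
print(np.isclose(np.trace(np.linalg.inv(B)), np.sum(1.0 / ew)))  ## tr(B^{-1})
print(np.isclose(np.linalg.slogdet(B)[1], np.sum(np.log(ew))))   ## logdet(B)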

This introduction covers the basics of randomized trace estimation, with a focus on how to use matrix-free algorithms to estimate traces of matrix functions.

Basics: Trace computation

Suppose we wanted to compute the trace of some matrix A. By the very definition of the trace, we have a variety of means to do it: \mathrm{tr}(A) \triangleq \sum\limits_{i=1}^n A_{ii} = \sum\limits_{i=1}^n \lambda_i = \sum\limits_{i=1}^n e_i^T A e_i

where, in the right-most sum, e_i denotes the ith standard basis vector: A_{ii} = e_i^T A e_i, \quad e_i = [0, \dots, 0, \underbrace{1}_{i}, 0, \dots, 0]

Let’s see some code. First, we start with a simple (random) positive-definite matrix A \in \mathbb{R}^{n \times n}.

from primate.random import symmetric
A = symmetric(150, pd = True)

By default, symmetric normalizes A such that the eigenvalues are uniformly distributed in the interval [0, 1]. For reference, the spectrum of A looks as follows:
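A minimal sketch of how such a spectrum plot can be produced (assuming matplotlib is available; this is purely illustrative and not part of primate):

import numpy as np
import matplotlib.pyplot as plt
ew = np.linalg.eigvalsh(A)                  ## eigenvalues of A in ascending order
plt.plot(ew, marker=".", linestyle="none")
plt.xlabel("index")
plt.ylabel("eigenvalue")
plt.show()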

Now, using numpy, we can verify all of these trace definitions are indeed equivalent:

import numpy as np
eye = lambda i: np.ravel(np.eye(1, 150, k=i))  ## the i-th standard basis vector e_i
print(f"Direct: {np.sum(A.diagonal()):.8f}")
print(f"Eigen:  {np.sum(np.linalg.eigvalsh(A)):.8f}")
print(f"Matvec: {sum([eye(i) @ A @ eye(i) for i in range(150)]):.8f}")
Direct: 75.69739746
Eigen:  75.69739746
Matvec: 75.69739746

While the first (diagonal) sum trivially yields \mathrm{tr}(A) in (optimal) O(n) time, there exist situations where the diagonal entries of A are not available, particularly in the large-scale regime. In contrast, though the eigenvalue- and matvec-based sums are less efficient, they are nonetheless viable alternatives for obtaining \mathrm{tr}(A).

Implicit trace estimation

The implicit trace estimation problem can be stated as follows:

Given access to a square matrix A \in \mathbb{R}^{n \times n} via its matrix-vector product operator x \mapsto Ax, estimate its trace \mathrm{tr}(A) = \sum\limits_{i=1}^n A_{ii}

Note that we no longer assume direct access to the diagonal here, leaving us with only the Eigen or Matvec methods. The Eigen method is expensive, and though the Matvec method solves the problem exactly, it's pretty inefficient from a computational standpoint: most of the entries of e_i are zero, so most of each inner product computation contributes nothing to the trace estimate.

One idea, attributed to A. Girard, is to replace e_i with random zero-mean vectors, e.g. random sign vectors v \in \{-1, +1\}^{n} or random normal vectors v \sim \mathcal{N}(\mu=0, \sigma=1):

\mathtt{tr}(A) \approx \frac{1}{n_v}\sum\limits_{i=1}^{n_v} v_i^\top A v_i

It was shown more formally later by M.F. Hutchinson that if these random vectors v \in \mathbb{R}^n satisfy \mathbb{E}[vv^T] = I, then this approach indeed forms an unbiased estimator of \mathrm{tr}(A). To see this, it suffices to combine the linearity of expectation, the cyclic property of the trace, and the aforementioned isotropy condition: \mathtt{tr}(A) = \mathtt{tr}(A I) = \mathtt{tr}(A \mathbb{E}[v v^T]) = \mathbb{E}[\mathtt{tr}(Avv^T)] = \mathbb{E}[\mathtt{tr}(v^T A v)] = \mathbb{E}[v^T A v]
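As a quick empirical sanity check of this isotropy condition, the sample second-moment matrix of a batch of Rademacher vectors should be close to the identity (a rough sketch; the batch size of 5000 is arbitrary):

from primate.random import rademacher
V = np.array([rademacher(A.shape[0]) for _ in range(5000)])   ## 5000 random sign vectors as rows
print(np.max(np.abs(V.T @ V / len(V) - np.eye(A.shape[0]))))  ## max deviation of sample E[vv^T] from I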

Naturally we expect the approximation to gradually improve as more vectors are sampled, converging as n_v \to \infty. Let’s see if we can check this: we start by collecting a few sample quadratic forms

from primate.random import rademacher
gen_isotropic = lambda n: (rademacher(A.shape[0]) for _ in range(n))  ## generator of n random sign vectors
estimates = np.array([v @ A @ v for v in gen_isotropic(500)])

Plotting both the estimator formed by averaging the first k samples along with its absolute error, we get the following figures for k = 50 and k = 500, respectively:
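The running estimates and absolute errors underlying these figures can be reproduced directly from the samples collected above (a minimal sketch; the plotting itself is omitted):

trace_true = np.sum(A.diagonal())                                      ## exact trace, for reference
running_est = np.cumsum(estimates) / np.arange(1, len(estimates) + 1)  ## estimator from the first k samples
abs_error = np.abs(running_est - trace_true)
print(f"abs. error after 50 samples:  {abs_error[49]:.4f}")
print(f"abs. error after 500 samples: {abs_error[499]:.4f}")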

Sure enough, the estimator quickly gets within about 1-5% of the actual trace, generally getting closer as more samples are collected. On the other hand, improvement is not guaranteed, and pushing the error much below 1% becomes increasingly difficult due to the randomness and variance of the sample estimates.

Maximizing configurability

Rather than manually averaging samples, trace estimates can be obtained via the hutch function:

from primate.trace import hutch
print(f"Trace estimate: {hutch(A, maxiter=50)}")
Trace estimate: 76.11619074561499

Though its estimate is identical to averaging quadratic forms v^T A v of isotropic vectors as above, hutch comes with a few extra features to simplify its usage.

For example, to stop the estimator once it reaches an absolute tolerance, supply a value to atol:

## Stops the estimator if margin of error <= 0.15
print(f"Trace est: {hutch(A, atol=0.15, maxiter=1000)}")
Trace est: 75.6280519007306

The default early-stopping rule uses the central limit theorem (CLT) to bound the error of the estimator. A simpler rule, akin to the convergence checks used in many quadrature routines, stops the estimator when successive iterates change by no more than the tolerance:

## Stops the estimator when pair of iterates change <= 0.05
hutch(A, atol=0.05, stop='change', maxiter=1000)

By default, all v \mapsto v^T A v evaluations are done in parallel, which can be controlled via num_threads:

hutch(A, num_threads=8) ## ~8x faster on an 8-core machine

Moreover, a variety of isotropic distributions and random number generators (RNGs) are available to choose from, each of which can be seeded with a positive integer for determinism.

## Samples from the normal distribution N(0, 1) using the SplitMix64 generator 
hutch(A, pdf="normal", rng="splitmix64", seed=1234)

You can also configure the parameters of the underlying function approximation.

hutch(A, deg = 40)      ## Degree-40 Krylov expansion 
hutch(A, orth = 5)      ## Re-orthogonalize against 5 add. Lanczos vectors
hutch(A, quad="fttr")   ## Use three-term recurrence for quad. approx 

See the hutch documentation page for more details on the hyper-parameters. More details about the convergence checking options are also given below.

Extending to matrix functions

While there are some real-world applications for estimating \mathrm{tr}(A), in many settings the trace of a given matrix is not by itself a very informative quantity. On the other hand, as shown at the beginning, there are many situations where we might be interested in estimating the trace of a matrix function f(A). It is natural to consider extending the Hutchinson estimator above with something like:

\mathrm{tr}(f(A)) \approx \frac{n}{n_v} \sum\limits_{i=1}^{n_v} v_i^T f(A) v_i

where the sample vectors v_i are now taken to be unit-norm, which accounts for the leading factor of n.

Of course, the remaining difficulty lies in computing the quadratic forms v \mapsto v^T f(A) v efficiently. The approach taken by Ubaru, Saad, and Seghouane (2017) is to notice that each of these quantities can be transformed into a Riemann-Stieltjes integral:

v^T f(A) v = v^T U f(\Lambda) U^T v = \sum\limits_{i=1}^n f(\lambda_i) \mu_i^2 = \int\limits_{a}^b f(t) \, d\mu(t)

where scalars \mu_i \in \mathbb{R} constitute a cumulative, piecewise constant function \mu : \mathbb{R} \to \mathbb{R}_+:

\mu_i = (U^T v)_i, \quad \mu(t) = \begin{cases} 0, & \text{if } t < a = \lambda_1 \\ \sum_{j=1}^{i-1} \mu_j^2, & \text{if } \lambda_{i-1} \leq t < \lambda_i, i = 2, \dots, n \\ \sum_{j=1}^n \mu_j^2, & \text{if } b = \lambda_n \leq t \end{cases}

As with any definite integral, one would ideally like to approximate its value with a quadrature rule. One natural such approximation is the m-point Gaussian quadrature rule:

\int\limits_{a}^b f(t) \, d\mu(t) \approx \sum\limits_{k=1}^m \omega_k f(\eta_k)

with weights \{\omega_k\} and nodes \{ \eta_k\}. On one hand, Gaussian quadrature (GQ) is just one of many quadrature rules that could be used to approximate v^T f(A) v. On the other hand, GQ is often preferred over other rules due to its exactness on polynomials up to degree 2m - 1.

Of course, both the weights and nodes of the GQ rule are unknown for the integral above. This contrasts with the typical quadrature setting, wherein the weight function is uniform or otherwise known ahead of time.

A surprising and powerful fact is that both the weights \{\omega_k\} and nodes \{ \eta_k\} of the m-point GQ rule above are directly computable from the m \times m tridiagonal (Jacobi) matrix T_m produced by the Lanczos method:

Q_m^T A Q_m = T_m = Y \Theta Y^T \quad \Leftrightarrow \quad (\eta_i, \omega_i) = (\theta_i, \tau_i), \; \tau_i = (e_1^T y_i)^2

In other words, the Ritz values \{\theta_i\} of T_m are exactly the nodes of the GQ rule applied to \mu(t), while the squared first components \tau_i = (e_1^T y_i)^2 of the eigenvectors of T_m are exactly the weights. For more information about this non-trivial fact, see Golub and Meurant (2009).
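To make this concrete, here is a small, self-contained sketch of how the Lanczos tridiagonal matrix yields the quadrature nodes and weights for a single quadratic form v^T f(A) v, using f = log and full re-orthogonalization for stability. The hand-rolled lanczos_quadrature helper below is purely illustrative and is not primate's implementation:

import numpy as np

def lanczos_quadrature(A, v, m=20):
    ## Illustrative Lanczos + Gaussian quadrature for one quadratic form (not primate's API)
    n = len(v)
    Q = np.zeros((n, m))
    alpha, beta = np.zeros(m), np.zeros(m - 1)
    Q[:, 0] = v / np.linalg.norm(v)
    for j in range(m):
        w = A @ Q[:, j]
        alpha[j] = np.dot(Q[:, j], w)
        w -= Q[:, :j + 1] @ (Q[:, :j + 1].T @ w)       ## full re-orthogonalization
        if j < m - 1:
            beta[j] = np.linalg.norm(w)
            Q[:, j + 1] = w / beta[j]
    T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
    theta, Y = np.linalg.eigh(T)                       ## nodes: eigenvalues (Ritz values) of T_m
    tau = Y[0, :] ** 2                                 ## weights: squared first components of eigenvectors
    return theta, tau

## Compare the m-point quadrature estimate of v^T log(A) v against the exact value
v = np.random.choice([-1.0, 1.0], size=A.shape[0])
theta, tau = lanczos_quadrature(A, v, m=20)
quad_est = np.dot(v, v) * np.sum(tau * np.log(theta))  ## ||v||^2 * e_1^T log(T_m) e_1
ew, U = np.linalg.eigh(A)
exact = (U.T @ v) @ (np.log(ew) * (U.T @ v))           ## v^T U log(Lambda) U^T v
print(f"Quadrature estimate: {quad_est:.4f}, Exact: {exact:.4f}")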

Stochastic Lanczos Quadrature

By using these Lanczos quadrature estimates, the theory above extends the Hutchinson estimator to the trace of a matrix function:

\mathrm{tr}(f(A)) \approx \frac{n}{n_v} \sum\limits_{j=1}^{n_v} e_1^T f(T_m^{(j)}) e_1 = \frac{n}{n_v} \sum\limits_{j=1}^{n_v} \left ( \sum\limits_{i=1}^m \tau_i^{(j)} f(\theta_i^{(j)}) \right )

where T_m^{(j)} denotes the tridiagonal matrix produced by the Lanczos method seeded with the jth (unit-norm) sample vector.
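Reusing the illustrative lanczos_quadrature sketch from above (again, not how primate implements this), averaging e_1^T f(T_m) e_1 over a handful of Rademacher vectors and scaling by n gives a rough estimate of logdet(A):

## Rough SLQ estimate of tr(log(A)) = logdet(A), reusing the illustrative helper above
n, n_v, m = A.shape[0], 30, 20
samples = []
for _ in range(n_v):
    v = np.random.choice([-1.0, 1.0], size=n)
    theta, tau = lanczos_quadrature(A, v, m)
    samples.append(np.sum(tau * np.log(theta)))        ## e_1^T log(T_m) e_1
print(f"SLQ estimate of logdet(A): {n * np.mean(samples):.4f}")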

Under relatively mild assumptions, if the function of interest f: [a,b] \to \mathbb{R} is analytic on [\lambda_{\text{min}}, \lambda_{\text{max}}], then for constants \epsilon, \eta \in (0, 1) the output \Gamma of the Hutchinson estimator satisfies:

\mathrm{Pr}\Bigg[ \lvert \mathrm{tr}(f(A)) - \Gamma \rvert \leq \epsilon \lvert \mathrm{tr}(f(A)) \rvert \Bigg] \geq 1 - \eta

…if the number of sample vectors n_v satisfies n_v \geq (24/\epsilon^2) \log(2/\eta) and the degree of the Lanczos expansion is sufficiently large. In other words, we can achieve an arbitrary (1 \pm \epsilon)-approximation of \mathrm{tr}(f(A)) with success probability at least 1 - \eta using on the order of O(\epsilon^{-2} \log(\eta^{-1})) evaluations of e_1^T f(T_m) e_1. This probabilistic guarantee is most useful when \epsilon is not too small, i.e. when only a relatively coarse approximation of \mathrm{tr}(f(A)) is needed.
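As a quick illustration of this bound, the number of sample vectors sufficient for a given (\epsilon, \eta) pair can be computed directly (a small sketch):

eps, eta = 0.05, 0.01                                  ## target relative error and failure probability
n_v = int(np.ceil((24 / eps**2) * np.log(2 / eta)))    ## sample vectors sufficient per the bound
print(f"n_v >= {n_v} vectors suffice for ~{eps:.0%} relative error with probability {1 - eta:.0%}")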

Example: Logdet computation

Back to the computation: let's see how the above theory translates to code with primate.

To parameterize a Girard-Hutchinson estimator for use with matrix functions, it suffices to set the fun argument in hutch to either the name of a built-in function or an arbitrary Callable.

For example, to approximate the log-determinant of A:

## Log-determinant
logdet_test = hutch(A, fun="log", maxiter=25)
logdet_true = np.sum(np.log(np.linalg.eigvalsh(A)))

print(f"Logdet (exact):  {logdet_true}")
print(f"Logdet (approx): {logdet_test}")
Logdet (exact):  -148.3218440234385
Logdet (approx): -149.9574763133825

Here we see that using only n / 6 evaluations of e_1^T f(T_m) e_1, we get a decent approximation of logdet(A). To get a better idea of the quality of the estimator, set verbose=True:

est = hutch(A, fun="log", maxiter=25, seed=5, verbose=True)
Girard-Hutchinson estimator (fun=log, deg=20, quad=golub_welsch)
Est: -143.982 +/- 6.08 (95% CI), CV: 2%, Evals: 25 [R] (seed: 5)

The first line of the statement contains fixed information about the estimator, including the type of estimator (Girard-Hutchinson), the function applied to the spectrum (log), the degree of the Lanczos quadrature approximation (20), and the quadrature method used (golub_welsch).

The second line prints runtime information about the samples, such as the final trace estimate, its margin of error and coefficient of variation (a.k.a. relative std. deviation), the number of sample vectors n_v, and the isotropic distribution of choice ('R' for rademacher).

Another way to summarize the convergence behavior of the estimator is to pass the plot=True option; as before, a seed is passed for reproducibility.

est = hutch(A, fun="numrank", maxiter=150, seed=8, plot=True)

Similar information to that reported by the verbose=True flag is shown, cumulatively applied to every sample. The left plot shows the sample point estimates alongside yellow bands denoting the margin of error associated with the 95% confidence interval (CI) at every sample index. The right plots show convergence information: the running coefficient of variation appears on the top-right, and an upper bound on the absolute error derived from the CI appears on the bottom-right.

While the margin of error is typically a valid, conservative estimate of the estimator's error, in some situations these bounds may not hold, as they are based solely on CLT arguments for normally distributed random variables. In particular, there is an inherent amount of error in the Lanczos-based quadrature approximations that is not accounted for in the error bound, which may invalidate the corresponding CIs.

References

Golub, Gene H, and Gérard Meurant. 2009. Matrices, Moments and Quadrature with Applications. Vol. 30. Princeton University Press.
Ubaru, Shashanka, Yousef Saad, and Abd-Krim Seghouane. 2017. “Fast Estimation of Approximate Matrix Ranks Using Spectral Densities.” Neural Computation 29 (5): 1317–51.