Introduction to trace estimation with primate
primate contains a variety of methods for estimating quantities from matrix functions. One popular such quantity is the trace of f(A), for some generic f: [a,b] \to \mathbb{R} defined on the spectrum of A: \mathrm{tr}(f(A)) \triangleq \mathrm{tr}(U f(\Lambda) U^{\intercal}), \quad \quad f : [a,b] \to \mathbb{R}
Many quantities of common interest can be expressed as traces of matrix functions with the right choice of f. A few such function specializations include:
\begin{align*} f(\lambda) &= \lambda \quad &\Longleftrightarrow& \quad &\mathrm{tr}(A) \\ f(\lambda) &= \lambda^{-1} \quad &\Longleftrightarrow& \quad &\mathrm{tr}(A^{-1}) \\ f(\lambda) &= \log(\lambda) \quad &\Longleftrightarrow& \quad &\mathrm{logdet}(A) \\ f(\lambda) &= \mathrm{exp}(\lambda) \quad &\Longleftrightarrow& \quad &\mathrm{tr}(\mathrm{exp}(A)) \\ f(\lambda) &= \mathrm{sign}(\lambda) \quad &\Longleftrightarrow& \quad &\mathrm{rank}(A) \\ &\vdots \quad &\hphantom{\Longleftrightarrow}& \quad & \vdots & \end{align*}
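To make the correspondence concrete, here is a small numpy sketch (independent of primate, using a randomly generated symmetric positive-definite test matrix) verifying that a few of these quantities reduce to sums of f over the eigenvalues:

import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(50, 50))
S = B @ B.T + 50 * np.eye(50)                                    ## symmetric positive-definite test matrix
ew = np.linalg.eigvalsh(S)                                       ## eigenvalues of S

print(np.isclose(np.sum(ew), np.trace(S)))                       ## f(x) = x      =>  tr(S)
print(np.isclose(np.sum(1.0 / ew), np.trace(np.linalg.inv(S))))  ## f(x) = 1/x    =>  tr(S^{-1})
print(np.isclose(np.sum(np.log(ew)), np.linalg.slogdet(S)[1]))   ## f(x) = log(x) =>  logdet(S)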
This introduction covers the basics of randomized trace estimation, with a focus on using matrix-free algorithms to estimate traces of matrix functions.
Basics: Trace computation
Suppose we wanted to compute the trace of some matrix A. By the very definition of the trace, we have a variety of means to do it: \mathrm{tr}(A) \triangleq \sum\limits_{i=1}^n A_{ii} = \sum\limits_{i=1}^n \lambda_i = \sum\limits_{i=1}^n e_i^T A e_i
where, in the right-most sum, e_i denotes the i-th standard basis vector: A_{ii} = e_i^T A e_i, \quad e_i = [0, \dots, 0, \underbrace{1}_{i}, 0, \dots, 0]
Let’s see some code. First, we start with a simple (random) positive-definite matrix A \in \mathbb{R}^{n \times n}.
from primate.random import symmetric
A = symmetric(150, pd=True, seed=1234)
By default, symmetric normalizes A such that the eigenvalues are uniformly distributed in the interval [-1, 1]. For reference, the spectrum of A looks as follows:
Now, using numpy, we can verify all of these trace definitions are indeed equivalent:
import numpy as np
eye = lambda i: np.ravel(np.eye(1, 150, k=i))
print(f"Direct: {np.sum(A.diagonal()):.8f}")
print(f"Eigen: {np.sum(np.linalg.eigvalsh(A)):.8f}")
print(f"Matvec: {sum([eye(i) @ A @ eye(i) for i in range(150)]):.8f}")
Direct: 74.88426090
Eigen: 74.88426090
Matvec: 74.88426090
While (1) trivially yields \mathrm{tr}(A) in (optimal) O(n) time, there are situations where the diagonal entries of A are simply not available, particularly in the large-scale regime. In contrast, though both (2) and (3) are less efficient, they are nonetheless viable alternatives for obtaining \mathrm{tr}(A).
Implicit trace estimation
The implicit trace estimation problem can be stated as follows:
Implicit trace problem: Given a square matrix A \in \mathbb{R}^{n \times n} which we can only access via matrix-vector products x \mapsto Ax, estimate its trace \mathrm{tr}(A)
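To make this access model concrete, here is a minimal sketch (using SciPy's LinearOperator, which is not part of primate itself) of an operator that exposes only the matrix-vector product x \mapsto Ax, with no direct access to its entries or diagonal:

from scipy.sparse.linalg import LinearOperator

## Wraps A so that it can only be touched through matrix-vector products
A_op = LinearOperator(shape=A.shape, matvec=lambda x: A @ x, dtype=A.dtype)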
Note we no longer assume direct access to the diagonal here, leaving us with either the Eigen or the Matvec method. The Eigen method is expensive, and though the Matvec method solves the problem exactly, it's quite inefficient from a computational standpoint: most entries of e_i are zero, so most of the work in each inner product contributes nothing to the trace estimate.
One idea, attributed to A. Girard, is to replace e_i with random zero-mean vectors, e.g. random sign vectors v \in \{-1, +1\}^{n} or random normal vectors v \sim \mathcal{N}(\mu=0, \sigma=1):
\mathtt{tr}(A) \approx \frac{1}{n_v}\sum\limits_{i=1}^{n_v} v_i^\top A v_i
It was shown more formally later by M.F. Hutchinson that if these random vectors v \in \mathbb{R}^n satisfy \mathbb{E}[vv^T] = I, then this approach indeed yields an unbiased estimator of \mathrm{tr}(A). To see this, it suffices to combine the linearity of expectation, the cyclic property of the trace, and the isotropy condition above: \mathtt{tr}(A) = \mathtt{tr}(A I) = \mathtt{tr}(A \mathbb{E}[v v^T]) = \mathbb{E}[\mathtt{tr}(Avv^T)] = \mathbb{E}[\mathtt{tr}(v^T A v)] = \mathbb{E}[v^T A v]
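As a sanity check, the estimator can be written in a few lines of plain numpy, reusing the matrix A from above (a sketch with Rademacher sign vectors, not the primate API used below):

rng = np.random.default_rng(1234)
n, n_v = A.shape[0], 200
V = rng.choice([-1.0, 1.0], size=(n_v, n))     ## random sign vectors, E[vv^T] = I
est = np.mean([v @ A @ v for v in V])          ## average of the quadratic forms v^T A v
print(f"Girard-Hutchinson estimate: {est:.4f} (true trace: {A.trace():.4f})")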
Naturally we expect the approximation to gradually improve as more vectors are sampled, converging as n_v \to \infty. Let’s see if we can check this: we start by collecting a few sample quadratic forms
from primate.random import isotropic
estimates = np.array([v.T @ A @ v for v in isotropic((500, A.shape[1]))])
Plotting the estimator formed by averaging the first k samples, along with its absolute error, we get the following figures for k = 50 and k = 500, respectively:
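For reference, the running estimates and absolute errors shown in those figures can be reproduced along the following lines (a sketch, assuming the vectors drawn from isotropic are scaled so that each quadratic form is an unbiased estimate of the trace):

k = np.arange(1, len(estimates) + 1)
running_est = np.cumsum(estimates) / k           ## average of the first k quadratic forms
abs_error = np.abs(running_est - A.trace())      ## deviation from the true trace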
Sure enough, the estimator quickly gets within about 1-5% of the actual trace, generally getting closer as more samples are collected. On the other hand, improvement is not guaranteed: pushing the error much below 1% becomes increasingly difficult due to the randomness and variance of the sample estimates.
Maximizing configurability
Rather than manually averaging samples, trace estimates can be obtained via the hutch function:
from primate.trace import hutch
print(f"Trace estimate: {hutch(A)}")
Trace estimate: 75.05368697525256
Though its estimates are formed in the same way as above, by averaging quadratic forms v^T A v of isotropic vectors, hutch comes with a few extra features that simplify its usage. For example, to stop the estimator once it reaches an absolute tolerance, supply a residual value to atol:
## Stops the estimator if margin of error <= `atol`
rng = np.random.default_rng(1234)
est1 = hutch(A, converge='confidence', atol=0.15, rtol=0.0, seed=rng)
est2 = hutch(A, converge='confidence', atol=5.00, rtol=0.0, seed=rng)
print(f"Trace : {A.trace()}")
print(f"Trace est1 : {est1} (95% CI, atol=0.15)")
print(f"Trace est2 : {est2} (95% CI, atol=5.00)")
Trace : 74.88426089840914
Trace est1 : 74.82468827388188 (95% CI, atol=0.15)
Trace est2 : 74.96254652411871 (95% CI, atol=5.00)
Note we set rtol = 0.0 to prevent the relative tolerance rule from stopping the computation early (conversely, if you only want the relative tolerance to be used, set atol = 0.0). The default early-stopping rule uses the CLT to bound the error of the estimator. A simpler early-stopping rule is also available, which stops once consecutive estimates change by less than the given tolerance, much like the convergence checks used in quadrature methods:
## Stops the estimator when any pair of consecutive iterates change by <= 0.05
hutch(A, converge='tolerance', atol=0.05, rtol=0.0)
76.02295212266661
You can adjust the distribution used to sample random vectors via the pdf parameter (one of "rademacher", "normal", or "sphere"). You can also fix the seed for deterministic behavior, or supply your own random number generator:
## Samples from the normal distribution N(0, 1)
="normal", seed=1234)
hutch(A, pdf
= np.random.default_rng(1234)
rng ="sphere", seed=rng) hutch(A, pdf
75.14020034128664
See the hutch documentation page for more details on the hyper-parameters.
Extending to matrix functions
In many settings, the trace of A itself is not especially informative, but the trace of a matrix function f(A) may very well be the quantity of interest. It is natural to consider extending the Hutchinson estimator above with something like:
\mathrm{tr}(f(A)) \approx \frac{n}{n_v} \sum\limits_{i=1}^{n_v} v_i^T f(A) v_i
Of course, the remaining difficulty lies in efficiently computing quadratic forms v \mapsto v^T f(A) v. The approach taken by Ubaru, Saad, and Seghouane (2017) is to notice that these quantities can be transformed into a Riemann-Stieltjes integral:
v^T f(A) v = v^T U f(\Lambda) U^T v = \sum\limits_{i=1}^n f(\lambda_i) \mu_i^2 = \int\limits_{a}^b f(t) \, d\mu(t)
where scalars \mu_i \in \mathbb{R} constitute a cumulative, piecewise constant function \mu : \mathbb{R} \to \mathbb{R}_+:
\mu_i = (U^T v)_i, \quad \mu(t) = \begin{cases} 0, & \text{if } t < a = \lambda_1 \\ \sum_{j=1}^{i-1} \mu_j^2, & \text{if } \lambda_{i-1} \leq t < \lambda_i, i = 2, \dots, n \\ \sum_{j=1}^n \mu_j^2, & \text{if } b = \lambda_n \leq t \end{cases}
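To see that this is just a reweighting of the eigenvalues, here is a short numpy sketch (using an explicit eigendecomposition of A purely for illustration, which of course is unavailable in the matrix-free setting) verifying that the weighted sum \sum_i f(\lambda_i)\mu_i^2 matches v^T f(A) v:

f = np.exp                                       ## any f defined on the spectrum of A
lam, U = np.linalg.eigh(A)                       ## full eigendecomposition (illustration only)
v = np.random.default_rng(0).normal(size=A.shape[0])
v /= np.linalg.norm(v)                           ## unit-norm sample vector

mu = U.T @ v                                     ## mu_i = (U^T v)_i
lhs = v @ (U @ np.diag(f(lam)) @ U.T) @ v        ## v^T f(A) v
rhs = np.sum(f(lam) * mu**2)                     ## sum_i f(lambda_i) * mu_i^2
print(np.isclose(lhs, rhs))                      ## True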
As with any definite integral, one would ideally like to approximate its value with a quadrature rule. An exemplary such approximation is the m-point Gaussian quadrature rule:
\int\limits_{a}^b f(t) \mu(t) \approx \sum\limits_{k=1}^m \omega_k f(\eta_k)
with weights \{\omega_k\} and nodes \{ \eta_k\}. On one hand, Gaussian Quadrature (GQ) is just one of many quadrature rules that could be used to approximate v^T f(A) v. On the other hand, GQ is often preferred over other rules due to its exactness on polynomials of degree up to 2m - 1.
Of course, both the weights and nodes of the GQ rule are unknown for the integral above. This contrasts with the typical quadrature setting, wherein the weight function is uniform or otherwise known ahead of time.
A surprising and powerful fact is that both the weights \{\omega_k\} and nodes \{ \eta_k\} of the m-point GQ rule above are directly computable from the degree-m tridiagonal matrix T_m produced by the Lanczos method:
Q_m^T A Q_m = T_m = Y \Theta Y^T \quad \Leftrightarrow \quad (\eta_i, \omega_i) = (\theta_i, \tau_i), \; \tau_i = (e_1^T y_i)^2
In other words, the Ritz values \{\theta_i\} of T_m are exactly the nodes of the GQ rule applied to \mu(t), while the weights \tau_i are exactly the squared first components of the corresponding eigenvectors y_i of T_m. For more information about this non-trivial fact, see Golub and Meurant (2009).
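This fact is easy to verify numerically. The sketch below implements a bare-bones Lanczos iteration with full reorthogonalization (a hypothetical helper for illustration, not primate's implementation) and compares the resulting quadrature estimate \sum_i \tau_i f(\theta_i) against v^T f(A) v computed directly:

def lanczos_tridiag(A, v0, m):
    """Returns the m x m tridiagonal (Jacobi) matrix T_m from the Lanczos method."""
    n = len(v0)
    Q = np.zeros((n, m))
    alpha, beta = np.zeros(m), np.zeros(m - 1)
    Q[:, 0] = v0 / np.linalg.norm(v0)
    for j in range(m):
        w = A @ Q[:, j]
        alpha[j] = Q[:, j] @ w
        w -= Q[:, : j + 1] @ (Q[:, : j + 1].T @ w)   ## full reorthogonalization
        if j < m - 1:
            beta[j] = np.linalg.norm(w)
            Q[:, j + 1] = w / beta[j]
    return np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)

f = np.exp
v = np.random.default_rng(0).normal(size=A.shape[0])
v /= np.linalg.norm(v)                               ## the quadrature rule assumes a unit-norm v

T_m = lanczos_tridiag(A, v, m=20)
theta, Y = np.linalg.eigh(T_m)                       ## Ritz values / vectors of T_m
tau = Y[0, :] ** 2                                   ## squared first components = GQ weights
gq_est = np.sum(tau * f(theta))                      ## e_1^T f(T_m) e_1

lam, U = np.linalg.eigh(A)
exact = (U.T @ v) @ (f(lam) * (U.T @ v))             ## v^T f(A) v, computed directly
print(f"GQ estimate: {gq_est:.6f}, exact: {exact:.6f}")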
Stochastic Lanczos Quadrature
Using these Lanczos-based quadrature estimates, the theory above extends the Hutchinson estimator to traces of matrix functions:
\mathrm{tr}(f(A)) \approx \frac{n}{n_v} \sum\limits_{j=1}^{n_v} e_1^T f(T_m^{(j)}) e_1 = \frac{n}{n_v} \sum\limits_{j=1}^{n_v} \left ( \sum\limits_{i=1}^m \tau_i^{(j)} f(\theta_i^{(j)}) \right )
where T_m^{(j)} denotes the tridiagonal matrix produced by running Lanczos on the j-th sample vector.
Under relatively mild assumptions, if the function of interest f: [a,b] \to \mathbb{R} is analytic on [\lambda_{\text{min}}, \lambda_{\text{max}}], then for constants \epsilon, \eta \in (0, 1) the output \Gamma of the Hutchinson estimator satisfies:
\mathrm{Pr}\Bigg[ \lvert \mathrm{tr}(f(A)) - \Gamma \rvert \leq \epsilon \lvert \mathrm{tr}(f(A)) \rvert \Bigg] \geq 1 - \eta
…if the number of sample vectors n_v satisfies n_v \geq (24/\epsilon^2) \log(2/\eta) and the degree of the Lanczos expansion is sufficiently large. In other words, we can achieve a (1 \pm \epsilon)-approximation of \mathrm{tr}(f(A)) with success probability at least 1 - \eta using on the order of O(\epsilon^{-2} \log(\eta^{-1})) evaluations of e_1^T f(T_m) e_1. This probabilistic guarantee is most useful when \epsilon is not too small, i.e. when only a relatively coarse approximation of \mathrm{tr}(f(A)) is needed.
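As a quick plug-in of this bound (just arithmetic, not a primate call), taking for example \epsilon = 0.05 and \eta = 0.05:

eps, eta = 0.05, 0.05
n_v = int(np.ceil((24 / eps**2) * np.log(2 / eta)))
print(n_v)   ## 35414 sample vectors for a 5% relative error with 95% confidence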
Example: Logdet computation
Back to the computation: here's a quick code example using primate to approximate the log-determinant of A:
from primate.operators import MatrixFunction
= MatrixFunction(A, fun="log")
M = hutch(M, converge="count", count=A.shape[0] // 6, seed=rng)
logdet_test = np.sum(np.log(np.linalg.eigvalsh(A)))
logdet_true
print(f"Logdet (exact): {logdet_true}")
print(f"Logdet (approx): {logdet_test}")
Logdet (exact): -151.5297634573417
Logdet (approx): -153.09132782984355
Here we see that using only n / 6 evaluations of e_1^T f(T_m) e_1, we get a decent approximation of logdet(A).