This document contains an overview of how to use the interface in the *Mapper* package to fully customize the *mapper* construction. The focus on this document is on package functionality and usage, as opposed the theory the *Mapper* framework was founded on, and as such discusses *Mapper* from a strictly algorithmic / computational perspective.

Consider a noisy sampling of points along the perimeter of a circle in \(\mathbb{R}^2\)

```
set.seed(1234)
## Generate noisy points around the perimeter of a circle
n <- 1500
t <- 2*pi*runif(n)
r <- runif(n, min = 2, max = 2.1)
noisy_circle <- cbind(r*cos(t), r*sin(t))
## Plot the circle
plot(noisy_circle, pch = 20, asp = 1, xlab = "X", ylab = "Y", main = "Circle")
```

To get the **mapper** of this circle, first supply the data via `X`

and the mapped values via `filter`

, along with the parameters to other components such as the cover and clustering parameters. See `?mapper`

for more details. A summary is available with the default print method.

```
library("Mapper")
left_pt <- noisy_circle[which.min(noisy_circle[, 1]),]
f_x <- matrix(apply(noisy_circle, 1, function(pt) (pt - left_pt)[1]))
m <- mapper(X = noisy_circle,
filter = f_x,
cover = list(typename="fixed interval", number_intervals=5L, percent_overlap=50),
distance_measure = "euclidean",
clustering_algorithm = list(cl="single"))
print(m)
```

```
## Mapper with filter f: X^2 -> Z
## Fixed Interval Cover: (number intervals = [5], percent overlap = [50]%)
```

By default, the core information of the \(1\)-skeleton of the `Mapper`

construction is returned, including:

The vertices, and the points (by index) comprising their corresponding connected components.

Some adjacency representation of the mapper graph

A map between the index set of the cover and the vertices

This is, in essence, all the central information one needs to conduct a more in-depth analysis of the mapper.

For most use-cases, the static method above is sufficient for getting a compact Mapper construction back. However, building the ‘right’ Mapper is usually an *exploratory process* that requires parameter tuning in the form of e.g. alternative filter functions, ‘fitted’ clustering parameters, different covering geometries, etc. Indeed, the *Mapper* method itself is sometimes thought of as a tool for exploring properties of the data (as expressed via a filter), whose interconnections reflect some aspects of the metric structure. However, for large data sets, employing this iterative fitting process with the above wrapper method can be prohibitively expensive.

To help facilitate this process, the *Mapper* package uses `R6`

class generators to decompose the components used to build the mapper. A few of the benefits `R6`

classes include full encapsulation support, reference semantics for member fields, easy-to-employ parameter checking via active bindings, and composability via method chaining. More details available at Hadleys Advanced R series or the R6 documentation site.

To demonstrate some of these benefits, consider a simplified interpretation of the Mapper pipeline:

“Filter” the data via a reference map

Equip the filtered space with a cover

Decompose preimages of the (pulled back) open sets of the cover into

*clusters*Construct the

*k*-skeleton by considering non-empty intersections of the clusters

These steps are demonstrated below using the `noisy_circle`

data set below. To begin building a mapper object more fluidly, one starts by using the `MapperRef`

R6-class instance generator.

The data can be specified at initialization, or later using the `use_data`

method. The format of the data must be either a matrix of coordinate values or a function which returns a matrix of coordinate values.

Assume, as before, that our filter is the distance from every points first coordinate to the point with left-most coordinate, \(p\), i.e.

```
## 1. Specify a filter function for the data
left_pt <- noisy_circle[which.min(noisy_circle[, 1]),]
f_x <- matrix(apply(noisy_circle, 1, function(pt) (pt - left_pt)[1]))
```

Coloring the points in the original space w/ a rainbow gradient based on their filter values sheds light on what the filter geometrically represents

```
## Bin the data onto a sufficiently high-resolution rainbow gradient from blue (low) to red (high)
rbw_pal <- rev(rainbow(100, start = 0, end = 4/6))
binned_idx <- cut(f_x, breaks = 100, labels = F)
plot(noisy_circle, pch = 20, asp = 1, col = rbw_pal[binned_idx], xlab = "X", ylab = "Y", main = "Circle")
```

The filter is assigned as usual, and like the data matrix, it accepts a matrix of coordinate values or a function that generates a matrix of coordinate values.

`m$use_filter(f_x)`

Both the data matrix and filter as stored as *functions*, which both accept as input a set of indices and return as output a matrix of coordinate values.

Once the filter has been applied, a cover must be chosen to discretize the space. Here, a simple rectangular cover with fixed centers is used.

Cover parameters may be be supplied at initialization, or set via the assignment operator. If you supply a single value when the filter dimensionality \(> 1\), the argument is recycled across the dimensions. The cover summary can be printed as follows:

`print(cover)`

`## Fixed Interval Cover: (number intervals = [5], percent overlap = [20]%)`

Once parameterized, the cover may be explicitly constructed via the `construct_cover`

member, before passing back to the `MapperRef`

instance. The `construct_cover`

function uses the given set of parameters to populate the intersection between the open sets in the cover and the given filter data.

```
cover$construct_cover(filter = m$filter) ## Now the level sets are populated
summary(cover$level_sets)
```

```
## Length Class Mode
## (1) 459 -none- numeric
## (2) 269 -none- numeric
## (3) 234 -none- numeric
## (4) 258 -none- numeric
## (5) 456 -none- numeric
```

`Mapper`

accepts any cover that derives from the abstract `CoverRef`

R6 class. Any such derived cover instance can be assigned to the `$cover`

member.

A few covering methods often used in practice are available. To see a list, use the `covers_available`

method.

```
## Typename: Generator: Parameters:
## fixed interval FixedIntervalCover number_intervals, percent_overlap
## restrained interval RestrainedIntervalCover number_intervals, percent_overlap
## ball BallCover epsilon
```

If you want to use a cover outside of the ones offers by the package, feel free to derive a new type cover class (and submitting a pull request!). There’s also a small tutorial on how one might go about making their own.

After creating the cover, the next step is to perform the *pullback* operation, wherein the cover \(\mathcal{U}\) of the space \(Z\) is used to create a cover on \(X\) through \(f\). This operation generally denotes two operations: computing the preimages of each open set, and decomposing the subsets of \(X\) given by these preimages into connected components. The latter is referred to in the original Mapper paper as *partial clustering*, and it (naturally) requires a number of parameters.

The goal of the *Mapper* package is to provide the user full access to how the open sets are decomposed, while also providing easy-to-parameterize defaults for those unfamiliar with clustering in general. Towards this end, there are two options to assign the clustering algorithm: either choose from a set of pre-defined clustering algorithms (+ hyper-parameters), or simply supply your own function to do the decomposition.

For people do not have much experience in clustering, one may simply one of many pre-defined linkage criteria to the `cl`

argument of the `use_clustering_algorithm`

method. The choice of linkage criteria determines the type of hierarchical clustering to perform. Given a cluster hierarchy, it’s common to `cut`

the tree at some height value to obtain a partitioning of the points. There are a few available cutting heuristics implemented which attempt to identify the major ‘clusters’ automatically. See `?use_clustering_algorithm`

for more details.

Below is an example of how to specify a pre-configured clustering approach.

`m$use_clustering_algorithm(cl = "single", cutoff_method = "continuous")`

For a complete list of the available linkage criteria, see `?stats::hclust`

.

If it is not sufficient to use one of the standard clustering methods, you can alteratively just supply your own clustering function. The only requirement is that it take input the index of the open set `pid`

, a vector of indices `idx`

\(\subset \{1, 2, \dots, n\}\) (representing the preimage), and the *mapperRef* instance object `self`

, and return an integer vector giving a partitioning on that subset. An example of how one might construct a custom clustering function leveraging the `parallelDist`

and `fastcluster`

packages is given below.

```
custom_clustering_f <- function(num_bins.default = 10L){
dependencies_met <- sapply(c("parallelDist", "fastcluster"), requireNamespace)
stopifnot(all(dependencies_met))
function(pid, idx, self) {
dist_x <- parallelDist::parallelDist(self$X(idx), method = self$measure)
hcl <- fastcluster::hclust(dist_x, method = "single")
eps <- cutoff_first_bin(hcl, num_bins = num_bins.default)
cutree(hcl, h = eps)
}
}
m$use_clustering_algorithm(cl = custom_clustering_f(10L))
```

The above example also demonstrates a simplified example of what one of the pre-configured options might look like. Functions supplied via the `use_clustering_algorithm`

accessor are given access to the instance fields and methods, e.g. the data `self$X(...)`

and the measure with `self$measure`

.

By default, the metric to cluster on is set to euclidean distance. However, the distance measure may also be any one of the 49 (at this time of writing) proximity measures available in from the `proxy`

package. To change the distance measure, use the `use_distance_measure`

method.

`m$use_distance_measure(measure = "euclidean")`

For a complete list of the distance/similarity measures supported, see `?proxy::pr_DB`

.

Notice the clustering function in this example was created as a closure. Since closures capture their enclosing environments, they can be quite useful for e.g. storing auxiliary parameters, validating package dependencies (e.g. as above), caching information for improved efficiency, etc. However, for the same reason, they are a common cause of a form of memory leaking if used carelessly, so they should be used with mindfully.

Once the clustering algorithm has been picked the pullback, also called the *inverse image*, can be constructed.

In the `Mapper`

package, `$construct_pullback`

applies the clustering algorithm to the preimages of given cover, populating the vertices the *mapper* construct. A mapping between the index set and the vertex labels is also recorded.

```
## 0 1 2 3 4 5 6 7
## 459 133 136 129 105 132 126 456
```

`str(m$pullback)`

```
## List of 5
## $ (1): num 0
## $ (2): num [1:2] 1 2
## $ (3): num [1:2] 3 4
## $ (4): num [1:2] 5 6
## $ (5): num 7
```

The final step to create a Mapper is to construct the geometric realization of the pullback cover: a simplicial complex. Internally, the simplicial complex *Mapper* computes is stored in a Simplex Tree. The underlying simplex tree is exported as an Rcpp Module, and is accessible via the `$simplicial_complex`

member.

`## < empty simplex tree >`

The simplex tree is a type of trie specialized specifically for storing and running general algorithms on simplicial complexes. For more details on the simplex tree, see use `?simplextree`

, or see the package repository, and references therein.

The construction of the simplicial complex is achieved by creating a \(k\)-simplex for each non-empty \((k + 1)\)-fold intersection. Often, instead of constructing the full-dimensional complex (where the dimension is governed by the maximum number of non-empty intersections in the cover), it’s common to restrict the *mapper* to the *k*-skeleton of the nerve instead.

With the `Mapper`

package, each dimension can be specified (and parameterized) independently, i.e.

```
m$construct_nerve(k = 0L)
m$construct_nerve(k = 1L)
# ...
m$simplicial_complex
```

`## Simplex Tree with (8, 8) (0, 1)-simplices`

The reason for this design is that, for some practical applications, there may be dimension-dependent operations one may want to isolate. For example, if instead of modifying the complex directly, you want to just get the \(k\)-simplices that *would* be added to complex, i.e. had a non-empty intersection, you can specify `modify=FALSE`

`m$construct_nerve(k = 1L, modify = FALSE)`

```
## [,1] [,2]
## [1,] 0 1
## [2,] 0 2
## [3,] 1 4
## [4,] 2 3
## [5,] 3 6
## [6,] 4 5
## [7,] 5 7
## [8,] 6 7
```

Just like the `$construct_pullback`

method, `$construct_nerve`

optionally accepts a subset of indices which can be used to restrict the intersection computation to specific subsets of the data. For example, the following only considers non-empty intersections between vertices whose connected components intersected the sets indexed by “(1)” and “(2)”

```
## This checks the pairs given by: expand.grid(v1=m$pullback$`(1)`, v2=m$pullback$`(2)`)
m$construct_nerve(k = 1L, indices = matrix(c("(1)", "(2)"), nrow = 1))
```

In this way, the `$construct_nerve`

method allows one to obtain finer control over how exactly either individual pairs of subsets or individual dimension of the simplicial complex are constructed. If one just wants to construct \(k\)-simplices for all non-empty intersections up to some dimension of \(k\), e.g. to reduce the *mapper* to its graph representation, the \(1\)-skeleton may be computed as follows.

```
m$construct_k_skeleton(k = 1L) ## equivalent
m$simplicial_complex
```

`## Simplex Tree with (8, 8) (0, 1)-simplices`

Each call to `construct_k_skeleton`

clears out the simplicial complex, decomposes the preimages into vertices, and then iteratively builds the nerve up to the specified dimension. Constructing the skeleton this way is an idempotent operation: it can safely be called redundantly, and should always return the same result (assuming the clustering algorithm is deterministic).

The above is a brief exposition into the various components of the `MapperRef`

object. Generally speaking, obtaining an ‘accurate’ *mapper* representation of data may require careful tweaking. This package attempts to enable the configurability one might need to obtain that representation.

To streamline this process, the `Mapper`

packages relies on method chaining to allow composability. That is, by convention, all the methods denoted with the `use_*`

enact side-effects on the `MapperRef`

object and return the instance invisibly. This enables a reduction of the Mapper `pipeline`

to a sequence of chained operations:

```
m <- MapperRef$new()$
use_data(noisy_circle)$
use_filter(filter = f_x)$
use_cover(cover = "fixed interval", number_intervals = 5L, percent_overlap = 20)$
use_distance_measure(measure = "euclidean")$
use_clustering_algorithm(cl = "single")$
construct_pullback()$
construct_nerve(k = 0L)$
construct_nerve(k = 1L)
```

Practical defaults and wrappers are available to simplify this even further. In particular, the default distance measure is euclidean, the default clustering algorithm is single-linkage w/ a sensible cutting heuristic, etc. Thus, the above can also be constructed with:

```
m <- MapperRef$new(noisy_circle)$
use_filter(filter = f_x)$
use_cover(number_intervals = 5L, percent_overlap = 20)$
construct_k_skeleton(k = 1L)
```

If the Mapper is constantly being reparameterized, this interface is preferred over the more concise `mapper`

call because of the efficiency that comes with mutability. To illustrate this, consider the case where you want to use a different distance measure for the clustering algorithm. To change the metric with the ‘static’ approach, one might do something like the following:

```
all_params <- list(
X = noisy_circle,
filter = f_x,
cover = list(cover="fixed interval", number_intervals=10L, percent_overlap=50),
clustering_algorithm = list(cl="single", cutoff_method="histogram", num_bins=10L)
)
m_static1 <- do.call(mapper, append(all_params, list( distance_measure = "euclidean")))
m_static2 <- do.call(mapper, append(all_params, list( distance_measure = "mahalanobis")))
```

Both function calls require the entire mapper to be reconstructed, including the cover, the allocation for the simplex tree, all initializers, etc. With reference semantics, you can keep all the same settings and just change what is needed. For example, the following switches the distance measure and exports `mapper`

objects without reconstructing the cover:

```
m_dynamic <- lapply(c("euclidean", "mahalanobis"), function(measure){
m$use_distance_measure(measure)$
construct_k_skeleton(k=1L)$
exportMapper()
})
```

Note that, due to the a natural dependency chain that comes with Mapper, the pullback and nerve operations both needed to be recomputed (via `construct_k_skeleton`

), since the latter depends on the former, and the former depends on the distance measure. Determining what requires recomputing depends on how far up the dependency chain you go. For example, consider recomputing the 1-simplices with different intersection size thresholds.

```
invisible(lapply(1:10, function(e_weight){
m$construct_nerve(k = 1L, min_weight = e_weight)
# ... $exportMapper()
}))
```

In this case, compared to the static method, we save on needing to recompute the cover, the clustering, and the construction of the pullback.

The `pixiplex`

package provides a htmlwidget implementation of a subset of the pixi.js geared for visualizing and interacting with simplicial complexes, which can be used to visualize the *mapper* construction. An export option is available via the `$as_pixiplex`

function, which provides a few default coloring and sizing options.

```
library("pixiplex")
plot(m$as_pixiplex())
```

For more information, see the package repository.

For a static rendering of the *mapper*, one can use either the `as_igraph`

method or, for higher-order complexes, the default S3 plot method available in the simplextree package (e.g. `plot(m$simplicial_complex)`

). Other options for visualization are an ongoing effort. If you have a particular visualization package you prefer, feel free to submit a issue or PR on the github repository.