Multilayer Bootstrap Networks

Table of Contents

1. What is MBN

2. Algorithm description

3. Highlights

4. How does it work?

5. Demos

6. MATLAB code and usages

7. References

What is MBN


Multilayer bootstrap networks (MBN) is an unsupervised nonlinear dimensionality reduction algorithm [1].


A representative demo of the first two components of the MBN output on the 70000 images of MNIST handwritten digits.

The above feature was produced under the default parameter setting of MBN, which can be downloaded here: MNIST_10dim. When used for clustering, it results in a clustering accuracy of 96.64% by both k-means and agglomerative hierarchical clustering.The original MNIST can be found at

[return to top]

Algorithm description


Given a dataset Z, MBN builds a multilayer network with L hidden layers from bottom-up, each layer contains V independent clusterings.

for LayerID = 1,2,...,L


for ClusteringID = 1,2,...,V


Step 1: Random resampling of feature (optional): Randomly sample a number of variables of Z to form a new set X

Step 2: Random resampling of data: Randomly sample k data points from X;

Step 3: One-nearest-neighbor learning: Take the k data points as the centers of a clustering model, and use the model to partition X into k disjoint clusters by one-nearest-neighbor learning. The clustering model outputs a one-hot representation for each datapoint in Z.


Step 4: Concatenate the one-hot representations produced by all clusterings into a sparse representation, and use the sparse representation as the input dataset of its upper layer, i.e. Z.




1) A key parameter setting: Parameter k from the bottom layer to the top layer should become smaller and smaller.

2) Post-processing: MBN may use any linear dimensionality reduction algorithm as the output layer to reduce the sparse representation to an explicit low-dimensional feature. Here we prefer principal component analysis (PCA).

[return to top]



MBN has the following properties:

1. Simple:

    It contains only three simple operators--- random resampling, stacking, and one-nearest-neighbor optimization.

2. Effective:

    It does not make model or prior assumptions of data.

    It supports not ony large-scale data but also very small-scale data that contain only dozens of data points.

    It does not suffer from outliers.

    It does not suffer from bad local minima, though it is a multilayer nonlinear method.

    It can be built as 'deep' as needed easily, where the word 'deep' refers to deep learning.

    It produces good performance in a number of evaluated benchmark datasets without hyperparameter tunning.  If the hyperparameters were tuned carefully, it produces better performance.

3. Efficient:

    It supports parallel computing naturally.

    Its time complexity grows linearly with the size of data.

    Only the bottom layer costs some time, while other layers are computed efficiently.

[return to top]

How does it work?


MBN contains two core components---(1) each layer is a histogram-based density estimator which transforms the data space to a uniformly distributed probability space, and (2) a stack of such density estimators build a hierarchical structure which reduces the small principal components gradually by building a vast number of hierarchical trees implicitly.

1. Histogram-based density estimator

a) How strong is its representation ability?

The representation ability of a learning model can be understood as how fineness the model patitions the data space. A single-layer MBN can partition the data space to O(k*2^V) disconnected fractions, while a single-layer binary neural network with V hidden units can partition the data space to O(2^V) disconnected fractions [2]. Hence, the representation ability of MBN is linearly larger than that of a binary neural network.

b) Why does the simple random resampling build an effective density estimator?

According to the weak learnability [3], if a learning machine is stronger than random guess, then a set of such diverse learning machines construct a strong learning machine as we have described in item 1.a. To understand this, we may imagine that a binary neural network builds such a strong learning machine by training each binary hidden unit a weak learner via e.g. backpropagagtion. As for MBN, each random sampling of data also builds such a weak learner if the random sampling guarantees that at least one data point per class is selected in probability.

c) Why use binarization by one-nearest-neighbor learning?

MBN estimates the (frequentist) probabilistic similarity of any two data points by the number of the cluster centers they share in common over V votes, which is a histogram-based density estimator. After binarization, we do not need probability model assumptions anymore, such as Gaussian assumption. As we know, model assumptions cause many unwanted problems.

d) Do we lose much information from the binarization?

The density estimator does not lose much information. As presented in item 1.a,  the representation ability is determined by how subtle the data space can be partitioned. We have proved that the loss is linearly decreased to a small value by increasing the number of the base clusterings, which is similar to the proof of the generalization of random forests [4].

2. Hierarchical structure

a) Why hierarchical structure?

The histogram-based density estimator encodes most information into the probability space, including unwanted noise, nonlinearity, and small variances. Therefore, we need to further reduce the useless information. Deep learning provides the motivation, where the upper layers compress the input into abstract representations. Here we emphasize that MBN is not a neural network, since it does not rely on backpropagation. We would prefer MBN an ensemble learning method since its theoretical foundation is built on ensemble learning.

b) How does the hierarchical structure work?

Suppose the parameters k from the bottom layer to the top layer are k1, k2,...,kt, with k1>k2>...>kt. Then, according to item 1.a, the number of the disconnected fractions from bottom-up are O(k1*2^V), O(k2*2^V),...,O(kt*2^V) respectively. It is obvious that O(k1*2^V)>O(k2*2^V)>...>O(kt*2^V), which means that the small fractions at the bottom layer, named leaf nodes, are gradually merged, to small fractions at the top layer, named root nodes. Because each root node corresponds to a tree, we see that MBN builds as many as O(kt*2^V) trees.

c) Will the hierarchical structure lose much information?

It depends on how the information is defined. We give two examples. If our purpose is for classification, e.g. clustering or content-based retrieval, we only need to get the classification labels, while all other information can be discarded as noise. From this point of view, MBN does not lose much information. However, if our purpose is for nearest-neighbor-search, such as instance-based retrieval, then MBN may slightly contaminate the accurate nearest-neighbor-chain (or called ranking order) of the data in the original space. For the latter, we may consider building a single layer network instead of a multilayer network.

d) What is the connection between MBN and random forests, given that both of them build trees?

First, MBN builds a vast number of trees on data space instead of on data points, while random forests build a handful trees on data points directly. It is easy to see that the data space around a data point may be partitioned to many disconnected fractions by MBN. Second, MBN uses the root nodes of trees as the representation of data, while each tree of random forests uses its leaf nodes to select a nearest neighbor of a given test data point for voting in the original data space. Hence, their trees are fundamentally different.

* One comment on the name of MBN

The keyword bootstrap refers to a data resampling method, named bootstrap resampling [5], which resamples data with replacement.  However, MBN resamples data without replacement which is another method---delete-d jackknife resampling [6]. Because bootstrap resampling has been frequently used in ensemble learning, we name the proposed MBN for the convenience of the classification of the technology.

[return to top]



We ran MBN with the default setting. For comparison, we took PCA, spectral clustering [7], and t-SNE [8] as the baselines. When running spectral clustering with non-text data, we use Gaussian kernel with the kernel width determined automatically by the average Euclidean distance between data points. We apply the low-dimensional output to k-means clustering. We also list the MBN with the agglomerative hierarchical clustering.

We evaluated the comparison methods in terms of discriminant trace (DT) criterion [9][10, eq. (4.50)], clustering accuracy (ACC) , and normalized mutual information (NMI) [11], where the DT criterion directly evaluates the discriminant ability of the learned features by measuring the between-class variances over within-class variances. The larger the DT score is, the better the comparison method behaves.

We ran the comparison on the following four representative datasets for 10 times and reported the average performance. The average results are listed in the follow tables. The statistical significance is evaluated by the two tailed-t test whose null hypothesis is at the 5% significance level. The datasets and overall results of the 10 independent runs have been incorporated in the MATLAB toolbox of MBN.


1. MNIST5000

This data set contains 5000 random selected images belonging to 10 handwritten digits from MNIST

DT score ACC NMI
MBN 6.36 82.5% 0.776
PCA 0.26 52.7% 0.498
Spectral clustering 0.80 52.0% 0.476
t-SNE 4.67 72.8% 0.695
*MBN with hierarchical clustering 6.37 81.6% 0.773

2. Speaker recognition

This data set contains 4000 utterances belonging to 200 speakers. This dataset is extracted from the training data of NIST SRE 2018 by the x-vector front-end.

DT score ACC NMI
MBN 51.43 78.6% 0.930
PCA 5.13 47.4% 0.760
Spectral clustering 3.03 79.3% 0.927
t-SNE 12.38 65.5% 0.871
*MBN with hierarchical clustering 52.49 87.1% 0.925

3. COIL-100

This data set contains 7200 images belonging to 100 objects.

DT score ACC NMI
MBN 18.95 66.6% 0.856
PCA 1.76 51.1% 0.776
Spectral clustering 3.33 52.6% 0.779
t-SNE 10.27 60.5% 0.829
*MBN with hierarchical clustering 18.66 68.8% 0.862

4. 20-Newsgroups

This text corpus contains 18846 documents belonging to 20 classes.

DT score ACC NMI
MBN 1.62 64.4% 0.536
PCA 0.52 41.6% 0.466
Spectral clustering 1.06 46.4% 0.506
t-SNE 3.66 64.4% 0.597
*MBN with hierarchical clustering 1.63 60.4% 0.536
Note that the above performance can be improved further if we tune the hyperparameters.

[return to top]

MATLAB code and usages


1. Code

    MBN function only: MBN.m

    MBN code with demos: MBN_toolbox_v7.rar [25MB] (Note: You need MATLAB R2017a and after to run the t-SNE baseline as a built-in function of MATLAB.)

2. Key functions

    It takes PCA as a preprocessing stage, which reduces the data to no more than 100 dimensions for the computational efficiency.

    It supports the standard PCA, linear kernel based sparse kernel PCA (recommended) [12], EM-PCA [13], spectral clustering, or  tsne-MBN as the output layer, where tsne-MBN is a revised t-SNE that constructs the density affinity matrix of data by the output of the top hidden layer of MBN instead of the original affinity matrix construction method of t-SNE.

    It supports k-means clustering or agglomerative hierarchical clustering as the clustering tool. When the number of classes is unknown, then agglomerative hierarchical clustering determines the number of classes on-the-fly.

3. Parameters

The MATLAB MBN function is MBN(data, c, param), where

  • data --- input data with a size of N*D, where N is the number of the data points and D is the dimension of the feature
  • c --- number of classes
  • param --- parameter setting of MBN in the form of  {'parameter name1', 'value1', 'parameter name2', 'value2',...}. The optional parameters in 'param' is described in the following table:
Parameter name Description Default value
ktop Parameter k of the k-centroids clustering at the top layer. It should be set to guarantee that at least one data point per class is selected in each random sample, which guarantees that the random resampling is still stronger than random guess.

Note: 'ktop' is important. When 'c' is not given or when the data is severely class-imbalanced, user should better set 'ktop' manually.

1.5c, when parameter 'c' is given;

100, when parameter 'c' is not given

k1 Parameter k of the k-centroids clustering at the bottom layer. min(0.5N, 10000)
delta delta = k(t)/k(t-1), where k(t-1) and k(t) represent the parameters k at two adjacent layers. It is a value between 0 and 1 0.5
a Ratio of randomly sampled dimensions over D. It is a value betwen 0 and 1. 0.5
V Number of k-centroids clustering per layer. 400
k Parameter k of the k-centroids clustering at all layers. It is determined jointly by 'ktop', 'k1' and 'delta'. However, users may specify 'k' as a vector, such as [100,80,50], which defines an MBN of three layers. If 'k' is set, then 'ktop', 'k1' and 'delta' are disabled.  
s Similarity measurement at the bottom layer. It can be set to 'e' and 'l'. 'e' means Euclidean distance. 'l' means inner-product similarity. 'e'
m Determine whether the MBN model will be saved for prediction. It can be set to 'yes' or 'no'. Because we do not provide MBN_predict function any more in new software versions. Users may always set it to 'no'. 'no'
f Determine whether the sparse feature at each hidden layer will be saved. It can be set to 'yes' or 'no'. 'no'
d Number of the dimensions of the MBN output. It is valied when the output layer is PCA or spectral clustering only. c
ifPCA Determine whether the input data is preprocessed by PCA. It can be set to 'yes' or 'no'. 'yes'
inputdim Determine the maximum number of dimensions of the input data after the PCA preprocessing. 100
ifCluster Determine whether using the low-dimensional output of MBN for clustering. It can be set to 'yes' or 'no'. 'yes'
ClusterTool Determine which clustering algorithm is used. It can be set to 'kmeans' and 'ahc'. 'kmeans' is the k-means clustering algorithm. 'ahc' is the agglomerative hierarchical clustering. 'kmeans'
outputlayer Determine which kind of output layer is adopted. It can be set to 'pca', 'spca', 'empca', 'spectral', and 'tsne-mbn'. 'pca' is the built-in PCA function of MATLAB. 'spca' is the linear kernel based kernel PCA. 'empca' is the expectation-maximization PCA (EM-PCA). 'spectral' is the Laplacian matrix decomposition of the linear kernel based the spectral clustering. 'tsne-mbn' is an MBN revised t-SNE nonlinear dimensionality reduction method that uses the output of MBN to construct the affinity matrix of t-SNE.

If the data set is larger than 10000 data points, 'spca' takes at most 10000 data points to do eigenvalue decomposition of the linear kernel matrix. Users may change this limitation in the MBN function.

nWorkers Determine how many workers will be used for the parallel computing. maximum number of physical cores of CPU

4. Examples

Although we have defined many parameters, we use their default values at most of the time. Here we list some usages for users' reference:
Situations Solutions Example functions
Class-balanced data Default setting MBN(data, c)
Class-imbalanced data Need to specify parameter 'ktop'. See the description of 'ktop'. MBN(data, c, {'ktop', 300})
Number of classes unknown Need to specify parameter 'ktop'. If 'ktop' is not given explictly, then MBN will use the default value 'ktop' = 100. MBN(data, [], {'ktop', 50})
Need to build a deeper model than the default Set 'delta' larger than the default, i.e. 0.5. MBN(data, c, {'delta', 0.8})
Need to control the memory occupacy. Set 'k1' to a limited number, which may sacrifice performance. MBN(data, c, {'k1', 2000})
Need to denoise data that have simple distributions. Set MBN to a single layer MBN(data, c, {'k1', 50, 'delta', 0})
Need to use original data as input Disable the PCA preprocessing, with the risk of high computational cost. MBN(data, c, {'ifPCA', 'no'})
Text data (or similar very high-dimensional sparse data) The most common similarity measurement for text data is cosine similarity, therefore, we should first normalize data and then use inner-product as the similarity measure of the bottom layer. Because text data is very sparse, we should set 'a' = 1 to prevent information loss in the inner-product operator. MBN(data, c, {'s', 'l', 'a', 1, 'ifPCA', 'no'})

[return to top]


  1. X.-L. Zhang, "Multilayer bootstrap networks," Neural Networks, vol. 103, pp. 29-43, 2018[pdf]
  2. Y. Bengio, "Learning deep architectures for AI," Foundations and trends® in Machine Learning, 2009.
  3. R.E. Schapire, "The strength of weak learnability," Machine Learning, 1990.
  4. L. Breiman, "Random forests," Machine Learning, 2001.
  5. B. Efron, "Bootstrap methods: another look at the jackknife," The Annals of Statistics, 1979.
  6. B. Efron and R. Tibshirani, An Introduction To The Bootstrap, CRC press, 1993.
  7. A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," In Advances in Neural Information Processing Systems 14, 2002.
  8. L. Van der Maaten and G. E. Hinton. "Visualizing data using t-sne," Journal of Machine Learning Research, 2008.
  9. M.-Z. Li and X.-L. Zhang, "Learning deep representations by multilayer bootstrap networks for speaker diarization," arXiv preprint: arXiv 1910.10969, 2019.
  10. C. M. Bishop, Pattern Recognition and Machine Learning, Springer: Neural Networks, 2006.
  11. A. Strehl and J. Ghosh, "Cluster ensembles!a knowledge reuse framework for combining multiple partitions," Journal of Machine Learning Research, 2003.
  12. B. Schölkopf, A. Smola, and K.-R. M┨ller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation, 1998.
  13. S. T. Roweis, "EM algorithms for PCA and SPCA," In Advances in Neural Information Processing Systems 10, 1998.


[return to top]

[go back to homepage]