Vladimir Steiner

Efficient and Scalable Bayesian Neural Nets with Rank-1 Factors

Updated: Oct 12, 2022

Hey everyone, new blog post today, let's get right to it!


This article was written by Michael Dusenberry, Ghassen Jerfel, Yeming Wen, Yi-An Ma, Jasper Snoek, Katherine Heller, Balaji Lakshminarayanan and Dustin Tran. It is about Bayesian Neural Networks (BNNs) and trying to find a way to make them more parameter-efficient and scalable. The proposed solution is to give every weight matrix a rank-1 distribution, which means the BNN has a rank-1 parameterization.


The idea behind the experiment is to enforce a rank-1 parameterization and to compare it to more classical BNNs and to deep ensembles (which are considered more accurate than BNNs but still suffer from efficiency problems). This allows about a hundred times fewer parameters.
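To make this concrete, here is a minimal NumPy sketch (my own toy example, not the authors' code; the layer sizes and ensemble size are made up) of how a rank-1-perturbed dense layer can be applied without ever materializing the perturbed matrix, and how its parameter count compares to a deep ensemble:

import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, K = 512, 512, 4              # layer size and ensemble size (illustrative)
W = rng.standard_normal((d_in, d_out))    # shared full-rank weights (a point estimate)
r = rng.standard_normal((K, d_in))        # rank-1 factors on the inputs, one vector per member
s = rng.standard_normal((K, d_out))       # rank-1 factors on the pre-activations, one per member

x = rng.standard_normal(d_in)

for k in range(K):
    # Naive: materialize the perturbed matrix W ∘ (r s^T) for member k.
    y_naive = x @ (W * np.outer(r[k], s[k]))
    # Efficient: scale the input by r and the pre-activation by s instead.
    y_fast = s[k] * ((r[k] * x) @ W)
    assert np.allclose(y_naive, y_fast)

# Parameter cost: a deep ensemble stores K full matrices, while a rank-1
# ensemble stores a single matrix plus K * (d_in + d_out) extra scalars.
print("deep ensemble params:  ", K * d_in * d_out)                    # 1 048 576
print("rank-1 ensemble params:", d_in * d_out + K * (d_in + d_out))   # 266 240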


The loss of the model will be:
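As far as I can reconstruct it, it is the negative evidence lower bound (ELBO) estimated on minibatches, roughly of the form:

$\mathcal{L} \approx -\frac{N}{B}\sum_{b=1}^{B}\mathbb{E}_{q(r)\,q(s)}\big[\log p\big(y_b \mid x_b,\ W \circ (r s^{\top})\big)\big] + \mathrm{KL}\big(q(r)\,\|\,p(r)\big) + \mathrm{KL}\big(q(s)\,\|\,p(s)\big) - \log p(W)$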


Here, N is the number of input-output pairs (x, y), B is the minibatch size, p(W) is the prior distribution over the weights, and r and s are the rank-1 vectors applied to the input neurons and to the pre-activations, respectively.


The important part of this scientific process is to prove that a rank-1 parameterization can be used in the same way as a full-rank one. To prove this, they rely on a theorem that I will try to summarize and explain here:

In a fully-connected network of width M and depth H, we will consider (as they do) the score function f(x|W) and define it recursively:
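In my notation (with $\phi$ the activation function and $W^{(l)}$ the weight matrix of layer $l$), the recursion looks like:

$h^{(0)} = x, \qquad h^{(l)} = \phi\big(W^{(l)} h^{(l-1)}\big) \ \text{for } l = 1, \dots, H-1, \qquad f(x \mid W) = W^{(H)} h^{(H-1)}$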


Let us now consider that the model was trained on N data points and that $W_*$ is a local minimum of the sum of the f(x_n | W) over the space of weight matrices. Considering both a full-rank perturbation $W - W_*$ and a rank-1 one $W_* \circ rs^\top - W_*$, we can connect the two through their covariances (assuming the full-rank perturbation has the multiplicative covariance structure). Indeed, the covariances of the full-rank and the rank-1 perturbations will respectively be:
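Written entrywise (this is my own reconstruction, so take the exact form with a grain of salt), the multiplicative covariance structure of the full-rank perturbation and the covariance of the rank-1 perturbation read roughly:

$\mathrm{Cov}\big(W_{ij},\, W_{kl}\big) = W_{*,ij}\, W_{*,kl}\, \Sigma_{ij,kl}, \qquad \mathrm{Cov}\big((W_* \circ r s^{\top})_{ij},\, (W_* \circ r s^{\top})_{kl}\big) = W_{*,ij}\, W_{*,kl}\, \mathrm{Cov}\big(r_i s_j,\, r_k s_l\big)$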



It is thus possible to find an equality through $\Sigma$, with $\Sigma$ being some symmetric positive semi-definite matrix. The equality will be:
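As far as I understand it, the matching condition is simply that the two covariances above coincide, i.e. roughly:

$\Sigma_{ij,kl} = \mathrm{Cov}\big(r_i s_j,\, r_k s_l\big) \quad \text{for all } i, j, k, l$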



With this, the rank-1 parameterization proves able to capture a broad spectrum of perturbations of W, within the assumptions made at the start, of course.


Another interesting point in the article is that, as opposed to deep ensembles, rank-1 priors can make use of two types of negative log-likelihood (NLL): the mixture NLL, which is apparently often preferred but cannot be used with deep ensembles, and the average NLL, which is only a looser (upper) bound on it:
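With $K$ ensemble members or posterior samples $\theta_k$, the two quantities are (in my notation):

$\mathrm{NLL}_{\mathrm{mixture}} = -\log\Big(\frac{1}{K}\sum_{k=1}^{K} p(y \mid x, \theta_k)\Big), \qquad \mathrm{NLL}_{\mathrm{average}} = -\frac{1}{K}\sum_{k=1}^{K} \log p(y \mid x, \theta_k)$

By Jensen's inequality, the mixture NLL is never larger than the average NLL, which is why the latter is the looser bound.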




In this experiment, both NLLs were alternately used during training, whereas all testing was done with the mixture NLL.
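As a quick sanity check on the two definitions, here is a tiny NumPy sketch (made-up likelihood values, not the authors' code) of how they are computed from per-member likelihoods:

import numpy as np
from scipy.special import logsumexp

# Per-member log-likelihoods log p(y | x, theta_k) for K = 4 members
# (illustrative values).
log_probs = np.log(np.array([0.70, 0.55, 0.80, 0.60]))
K = len(log_probs)

mixture_nll = -(logsumexp(log_probs) - np.log(K))  # -log( (1/K) * sum_k p_k )
average_nll = -np.mean(log_probs)                  # -(1/K) * sum_k log p_k

print(mixture_nll, average_nll)  # the mixture NLL is always <= the average NLL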


The model was tested on a number of different datasets: CIFAR-10 and CIFAR-100, their corrupted versions, as well as ImageNet and binary mortality prediction with the MIMIC-III EHR dataset. Here are the results of their tests on ImageNet, for example:



ImageNet is probably the dataset where their results are the worst compared to other state-of-the-art techniques. However, they seem close enough that this solution deserves more digging into.


This article looks for a new way to solve the parameter-efficiency problem of BNNs, reducing the number of parameters by about a hundred times. Even if, as of now, the results are not quite up to par with state-of-the-art methods, it is clearly a very good lead, as one of the main problems in the computer vision domain currently is the scalability of these methods. As this article is really recent, we do not have enough hindsight to see whether it can actually be exploited. Personally, I found it really interesting and hope to see methods based on it in the near future.

