[D] Thoughts about super-convergence and highly-performant deep neural network parameter configurations

Comprehend13 · 2019-04-06T08:29:24+00:00

I suspect that the findings of the bias-variance tradeoff paper are dependent on the nature of the data - how complex the underlying generative process is, and how much data there is to identify it. Both papers utilized image data (ImageNet, MNIST) which contain high signal to noise ratios - I would be interested in seeing this phenomena replicated in a variety of different settings, including high noise settings.

bbsome · 2019-04-06T10:39:39+00:00

So what you described as the "tunnelling" behaviour for overparameterization, is as far as I know a well-known hypothesis of why large networks can be trained "so easily" compared to shallow/narrow ones. That is the fact that you have so many degrees of freedom most likely produces paths in the landscape destroying any local minimum and instead of making them behave like wells or narrow ridges which then move you towards better minimums. Nevertheless, I think it is still very unclear, even if the hypothesis is correct and this is the reason why simple algorithms like SGD can work on large nets, why do this behaviour lead only to places with low generalization error. Since the algorithm always works only on the training set, there is clearly some form of hidden bias that somehow leads the algorithm to better results. However, the behaviour described only explains why we are able to get with very simple optimizers (and without exponentially many iterations) such good performance, but not why does this performance generalize.

zerobullshit · 2019-04-06T10:40:58+00:00

Cool hypothesis!
To explore your intuition further you could try reproduce the resnet-50 experiment from the cyclic-lr paper and plot over the set of used parameters/weights in order to see if the "double descent" behavior emerges as well. Since resnet uses ReLU, some weights/parameters should be disabled (0) depending on the training example. If you can then find the set of parameters which are always disabled (never used) then subtract from the total number of parameters you can then find how parametrized your network is, and see if that corresponds to the "double descent" behavior in "Reconciling modern machine learning and the bias-variance trade-off".

GrumpyGeologist · 2019-04-06T08:13:51+00:00

Wow, great exhibition of your hypothesis. I'm afraid I'm not qualified to provide technical comments, though, but I'll follow this thread with interest.

Zinan · 2019-04-06T20:38:08+00:00

This is a neat idea. It would make sense for this phenomenon to be two sides of the same coin, which is escaping local minima; overparametrization reduces local minima by turning them into saddle points (as bbsome mentioned). For increasing learning rates, I like viewing increasing the learning rate on the loss landscape similar to how I would envision increasing the temperature on a particle on an energy landscape; the stochasticity of the gradient update has a better chance of "shooting" the parameter vector into a state that is more optimal then whatever minima it was stuck in originally.

trenobus · 2019-04-07T00:35:42+00:00

The main hypothesis I'd like to conjecture here is that there may be a somewhat common (perhaps even creeping towards global) quality of the structure of natural information that, when handled by learning systems/algorithms, results in what's essentially a potential energy barrier 'shell' separating a parameter configuration space (for a given architecture) that can only generalize moderately well, and a parameter configuration space (of the same architecture) that is capable of greatly improved generalization and performance.

Natural data tends to be noisy. You could model an input vector as some "actual" value plus a noise vector of equal dimension. In order to generalize a NN essentially needs to be capable of filtering out the noise vector at some level. I suspect this becomes harder as the distribution of the noise vectors becomes less symmetric (for a given signal-to-noise ratio). So I wonder how (or if) the phenomena in these papers would manifest with noiseless artificial data, and with the same data augmented with noise vectors sampled from distributions of different shapes.

What I'm suggesting is that the input noise vector might be the potential energy barrier that you postulate.

mbleich · 2019-04-07T01:21:37+00:00

Over-parameterization turns local minima into saddles. High learning rates efficiently navigate surfaces rich in saddles.

redditpirateroberts · 2019-04-06T08:28:02+00:00

Love it! wish there was more of it!

MaxMachineLearning · 2019-04-06T13:49:55+00:00

I dabble a bit in researching the underlying topology of neural networks. One idea I find rather interesting that provides some interesting information is the idea of a "learning surface". Basically, you look at the surface created by the weights in a layer over time using some methods from topological data analysis. The last set of weights in a network between the output layer and the last hidden layer hold some very interesting information. If you project it into 2 or 3 dimensions, you can see this interesting branching pattern where each branch is one of the possible outputs. Now there's a bunch of other cool stuff there that you can examine but I would be interested in comparing the two methods and examining that surface. It might provide some insight into why they seem to behave similarly.

eric_he · 2019-04-06T13:56:10+00:00

Thanks for pointing this out. To preface, I am a non-expert and I’m writing immediately after reading, so anything I say should be taken with a grain of salt.

Rather than thinking deeply about the mechanisms of a vastly varying learning rate under the 1-cycle policy, I focused on the comments that the 1-cycle policy required reducing the strength of other forms of other regularization such as drop-out, early stopping, etc.

Now every form of regularization can be viewed as a Bayesian inductive bias. For example, the l2 regularization term is equivalent to an inductive bias that a starting weights has probability inversely proportional to their distance from the origin. So when we are keeping the regularization “balanced” in the vocabulary of the super-convergence paper, we are really just weighting one inductive bias more heavily than another.

The double loss paper attributed the generalization ability of complex functions as a result of an over-parameterized network being able to pick and choose a vastly larger number of ways of memorizing the data through the parameters. This greater selection, combined with SGD, implicitly assigns higher weight to “low RHKS norm” parameterizations which the authors argue is demonstrated to be a good inductive bias on natural data.

So really what might be going on in the super-convergence paper is that doing away with the other “helpful” inductive biases to focus on the implicit bias of SGD, which is magnified by increases in LR and overparameterization, is what is allowing increased maximum performance and faster convergence.

serge_cell · 2019-04-07T07:34:08+00:00

Authors of the 1st paper didn't specify step size and multiplier for "standard learning rate policy" for imagenet, and I don't see step size/multiplyer study for imagenet either. It could be that "standard learning rate policy" still better with smaller initial learning rate or sharper step size. CIFAR/MNIST results often don't transfer to imagenet at all.

Aeglen · 2019-04-06T09:59:16+00:00

I am not exactly qualified to comment, being towards the start of my ML journey(?), but I think I understood the gist of what you were saying so let's give this a go.

In the "traditional" case as shown by the blue line in figure 1, the network appears to be homing in on a single loss-space minimum (imagine a 3D landscape or something and it's trying to find the lowest point, but with many more dimensions).

My understanding from what you said is that there might be many such loss-function-mimima, some which generalize better than others, and that the dip in val_acc is due to climbing over the wall between two.

Here are the questions and observations which come to mind:

1) Is it really the case that there are commonly many such minima? I'd imagine this is heavily dependent on the problem and data.

2) The loss smoothly-ish increases before smoothly-ish decreasing. I wonder what the network is doing. Clearly the weights have been changed so that it is in a less desirable area of the loss surface, but how could it "know" about the existance of another better-generalizing minimum? And why not find an even better one afterwards? This maybe suggests that it's converging to the same minimum, but faster.

3) The rapidly-converging curves are abruptly cut off. Why, and what happens next?

Maybe this is rubbish (sadly I don't have time to read the papers), but it's been interesting to think about. Perhaps, if possible, you could extract and compare the weights themselves each time to see if they differ by much to test your theory?

tr1pzz · 2019-04-06T13:31:13+00:00

Very interesting line of thought! Gonna read the two papers you referenced before commenting on these intuitions, cause I always find it tricky to apply common sense reasoning to high-dimensional parameter spaces..

In the meantime allow me to drop one of my videos here which might be very relevant to this discussion: https://youtu.be/pFWiauHOFpY

MachineLearning

Rules For Posts

+Research

+Discussion

+Project

+News

@slashML on Twitter

Chat with us on Slack

Beginners:

MODERATORS