all 39 comments

[–]Comprehend13 17 points18 points  (1 child)

I suspect that the findings of the bias-variance tradeoff paper depend on the nature of the data - how complex the underlying generative process is, and how much data there is to identify it. Both papers use image data (ImageNet, MNIST), which has a high signal-to-noise ratio - I would be interested in seeing this phenomenon replicated in a variety of different settings, including high-noise ones.

[–]Miejuib[S] 0 points1 point  (0 children)

Definitely agreed. I've seen some initial work on replicating the results, but nothing exhaustive yet.

[–]bbsome 9 points10 points  (5 children)

What you described as the "tunnelling" behaviour of overparameterization is, as far as I know, a well-known hypothesis for why large networks can be trained "so easily" compared to shallow/narrow ones: with so many degrees of freedom, the landscape most likely contains paths that destroy any local minima, instead making them behave like wells or narrow ridges that move you towards better minima. Nevertheless, I think it is still very unclear - even if the hypothesis is correct and this is why simple algorithms like SGD work on large nets - why this behaviour leads only to places with low generalization error. Since the algorithm only ever works on the training set, there is clearly some form of hidden bias that somehow leads it to better results. The behaviour described only explains why we can reach such good performance with very simple optimizers (and without exponentially many iterations), not why that performance generalizes.

[–]eric_he 1 point2 points  (2 children)

The double descent risk curve paper argues that overparameterization has a hidden bias towards minimizing the RKHS norm, which turns out to be a good inductive bias when we want to generalize.

[–]MetricSpade007 0 points1 point  (1 child)

Question (a few months late): How can one think of minimizing the RKHS norm intuitively? It's not super clear to me yet what exactly this means.

[–]eric_he 0 points1 point  (0 children)

lol im surprised we can still even comment. Im not an expert here but I did read wikipedia. Paraphrased:

> The RKHS norm is a function norm with the property that if the RKHS distance between f and g is small, then f(x) - g(x) is small for every single x.

From the paper:

> Favoring small norm interpolating predictors [i.e. small RKHS norm] turns out to be a powerful inductive bias on MNIST and other real and synthetic data sets [4].

So the paper is saying that during gradient descent in a massive function space (e.g. the function space of a neural net), the fitted function f tends to get closer and closer in RKHS norm to the true function g. And that has the property that f(x) is close to g(x) for all possible inputs x.

So even though we only have a finite sample of x's to draw from, when the function space is large enough, gradient descent is biased towards picking a function that is close to the true function everywhere, rather than just at the x's we have. Why? Because the true functions we encounter empirically are not wild, so they can be approximated well from a finite sample. At least that's what I think the paper is saying - let me know if anything is still unclear.
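
One concrete (if oversimplified) linear analogue of that hidden bias - my own toy example, not anything from the paper: when the model has more parameters than data points, there are infinitely many interpolating solutions, and gradient descent started from zero (equivalently, the pseudoinverse) picks the one with the smallest norm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Underdetermined problem: 5 samples, 50 features -> infinitely many interpolants.
X = rng.normal(size=(5, 50))
y = rng.normal(size=5)

# The pseudoinverse returns the minimum-norm w satisfying X @ w == y, which is
# also what gradient descent on the squared loss converges to from w = 0.
w_min_norm = np.linalg.pinv(X) @ y

# Any other interpolant is the min-norm solution plus a null-space component.
v = rng.normal(size=50)
v -= np.linalg.pinv(X) @ (X @ v)        # project v onto the null space of X
w_other = w_min_norm + v

print(np.allclose(X @ w_min_norm, y), np.allclose(X @ w_other, y))  # True True
print(np.linalg.norm(w_min_norm) < np.linalg.norm(w_other))         # True
```

Both solutions fit the training data perfectly, but the implicit bias singles out the small-norm one - the RKHS-norm story in the paper is (very roughly) the nonlinear version of this.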

[–]MelonFace 1 point2 points  (1 child)

While this is just speculation, I suspect there might be something along the lines of Occam's razor at play.

Simple (low-complexity) solutions are common both in reality and among the optima found by greedy algorithms.

I guess one way to approach this hypothesis would be to artificially create datasets with overly complex data-generation schemes, draw a limited number of samples, and hope to confirm that networks don't generalize well on these artificial problems.
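
Something like the following is the kind of setup I have in mind (a rough numpy sketch - the "teacher" construction and all the numbers are placeholders, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Deliberately over-complex generative process: a deep, fixed random "teacher" net.
depth, width, dim = 20, 256, 32

def layer(fan_in, fan_out):
    # Gain > 1 keeps a deep tanh net sensitive to its inputs, so labels stay mixed.
    return rng.normal(size=(fan_in, fan_out)) * 1.5 / np.sqrt(fan_in)

Ws = [layer(dim, width)] + [layer(width, width) for _ in range(depth - 1)]
w_out = rng.normal(size=(width, 1))

def teacher(x):
    h = x
    for W in Ws:
        h = np.tanh(h @ W)
    return np.sign(h @ w_out).ravel()       # binary labels

# Deliberately few samples relative to the teacher's complexity.
X_train, X_test = rng.normal(size=(200, dim)), rng.normal(size=(10000, dim))
y_train, y_test = teacher(X_train), teacher(X_test)
# Train the usual over-parameterized network on (X_train, y_train) and check
# whether test accuracy stays near chance, i.e. whether generalization breaks.
```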

[–]Miejuib[S] 1 point2 points  (0 children)

I actually really like this idea, along with /u/trenobus 's idea of modulating the level of noise as well. Thanks for the thoughts!

[–]zerobullshit 4 points5 points  (4 children)

Cool hypothesis!
To explore your intuition further, you could try to reproduce the ResNet-50 experiment from the cyclic-LR paper and plot against the set of parameters/weights actually used, to see whether the "double descent" behavior emerges as well. Since ResNet uses ReLU, some weights/parameters should be disabled (zero) depending on the training example. If you can find the set of parameters that are always disabled (never used) and subtract it from the total number of parameters, you get a measure of how parameterized your network effectively is, and you can check whether that corresponds to the "double descent" behavior in "Reconciling modern machine learning and the bias-variance trade-off".

[–]LeanderKu 1 point2 points  (3 children)

I don’t understand what you propose with the weight counting, or what it’s got to do with ReLU. Can you elaborate?

[–]zerobullshit 4 points5 points  (2 children)

Basically, the authors of the "double descent" paper observed this behavior by training over different numbers of parameters (the x-axis). However, the total number of parameters in the cyclic-LR paper is fixed - they just use the standard ResNet. What I understood from this post is the conjecture that those processes might be related. This could be because the LR has an effect on the effective parameterization of the network, so to show this we would need to know the effective number of parameters the network is using at a given step. This might be smaller than the total number: because ReLU outputs zero for negative inputs, there should most likely be non-activated parameters. If our network did not have such an activation function, all neurons would fire all the time, and the effective number of parameters would equal the total number of parameters.

So you could support this hypothesis if you could show a connection between effective parameter use and validation accuracy under cyclic LR.
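
Roughly the kind of bookkeeping I have in mind (a hypothetical PyTorch sketch - it counts ReLU units that never activate across a data loader, which is only a unit-level proxy for "unused" parameters, not an exact parameter count):

```python
import torch

@torch.no_grad()
def count_effective_units(model, loader, device="cpu"):
    """Count ReLU units that activate at least once over `loader`."""
    ever_active = {}

    def make_hook(name):
        def hook(module, inputs, output):
            # A unit counts as "used" if its output is positive for any example.
            active = (output > 0).flatten(1).any(dim=0)
            if name in ever_active:
                ever_active[name] |= active
            else:
                ever_active[name] = active
        return hook

    handles = [m.register_forward_hook(make_hook(n))
               for n, m in model.named_modules()
               if isinstance(m, torch.nn.ReLU)]

    model.eval().to(device)
    for x, _ in loader:            # assumes (input, target) batches
        model(x.to(device))

    for h in handles:
        h.remove()

    used = sum(int(a.sum()) for a in ever_active.values())
    total = sum(a.numel() for a in ever_active.values())
    return used, total
```

Tracking `used / total` over training, alongside validation accuracy, would be one way to probe the connection.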

(Edit): Note, I don't actually know whether this would lead to anything - it's just a suggestion for how one might support the OP's hypothesis.

[–]Miejuib[S] 1 point2 points  (1 child)

Definitely an interesting idea. Will take some work to build out the appropriate tooling to do that, but I like the idea.

[–]zerobullshit 1 point2 points  (0 children)

Cool. Also, if you can find another way to measure the compression of the network - how compactly it can be represented at a given stage of training without a large drop in accuracy - you could essentially proxy the number of parameters used and use that metric to hopefully show the behavior you are looking for.

[–]GrumpyGeologist 22 points23 points  (2 children)

Wow, great exposition of your hypothesis. I'm afraid I'm not qualified to provide technical comments, but I'll follow this thread with interest.

[–]ReacH36 12 points13 points  (0 children)

What a considerate and measured response.

[–]Miejuib[S] 1 point2 points  (0 children)

thank you! I appreciate the comment

[–]Zinan 2 points3 points  (1 child)

This is a neat idea. It would make sense for these phenomena to be two sides of the same coin, namely escaping local minima; overparametrization reduces local minima by turning them into saddle points (as bbsome mentioned). For increasing learning rates, I like viewing a higher learning rate on the loss landscape the way I would envision raising the temperature of a particle on an energy landscape: the stochasticity of the gradient update has a better chance of "shooting" the parameter vector into a state that is more optimal than whatever minimum it was stuck in originally.
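
As a toy picture of that temperature analogy (purely illustrative - a 1-D double well where the gradient noise stands in for SGD's stochasticity):

```python
import numpy as np

rng = np.random.default_rng(0)

# Gradient of a 1-D "loss" with a shallow local minimum on the left (x ~ -0.9)
# and a deeper one on the right (x ~ +1.1).
grad = lambda x: x**3 - x - 0.15

def ends_in_deep_well(lr, steps=5000, x0=-0.9):
    x = x0
    for _ in range(steps):
        # The lr scales the noise "kick" as well as the gradient step, which is
        # roughly the temperature-like effect described above.
        x -= lr * (grad(x) + rng.normal())
    return x > 0

for lr in (0.01, 0.2):
    runs = sum(ends_in_deep_well(lr) for _ in range(50))
    print(f"lr={lr}: ended in the deeper well in {runs}/50 runs")
    # Typically ~0/50 for the small lr and the large majority for the big one.
```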

[–]Miejuib[S] 0 points1 point  (0 children)

Yes, exactly. And I think it has some other interesting ramifications as well. For example, consider what it might mean - even for living creatures - if the double-descent curve is valid for learning algorithms in general. If a significant amount of extra over-parameterization is required to reach the region of monotonically decreasing eval loss, that presumably comes with an associated cost in the nutrients required to create and sustain those brain structures, which may create an energy barrier that evolution may or may not be able to surmount, or may surmount only very rarely.

[–]trenobus 2 points3 points  (1 child)

> The main hypothesis I'd like to conjecture here is that there may be a somewhat common (perhaps even creeping towards global) quality of the structure of natural information that, when handled by learning systems/algorithms, results in what's essentially a potential energy barrier 'shell' separating a parameter configuration space (for a given architecture) that can only generalize moderately well, and a parameter configuration space (of the same architecture) that is capable of greatly improved generalization and performance.

Natural data tends to be noisy. You could model an input vector as some "actual" value plus a noise vector of equal dimension. In order to generalize, a NN essentially needs to be able to filter out the noise vector at some level. I suspect this becomes harder as the distribution of the noise vectors becomes less symmetric (for a given signal-to-noise ratio). So I wonder how (or whether) the phenomena in these papers would manifest with noiseless artificial data, and with the same data augmented with noise vectors sampled from distributions of different shapes.

What I'm suggesting is that the input noise vector might be the potential energy barrier that you postulate.
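
To make that concrete, here's roughly the controlled setup I mean (a numpy sketch; the target function, scale, and noise families are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

# Noiseless "actual" inputs from a known generative process.
X_clean = rng.uniform(-1.0, 1.0, size=(1000, 16))
y = np.sin(3 * X_clean).sum(axis=1)           # arbitrary smooth target

# Noise of different shapes (symmetric, heavy-tailed, skewed) at matched std,
# i.e. the same signal-to-noise ratio in each case.
scale = 0.3
noises = {
    "gaussian (symmetric)": rng.normal(0.0, scale, size=X_clean.shape),
    "laplace (heavy-tailed)": rng.laplace(0.0, scale / np.sqrt(2), size=X_clean.shape),
    "exponential (skewed)": rng.exponential(scale, size=X_clean.shape) - scale,
}

datasets = {"noiseless": (X_clean, y)}
for name, eps in noises.items():
    datasets[name] = (X_clean + eps, y)   # train the same net on each variant
```

Comparing where (or whether) the double-descent / super-convergence behaviour shows up across these variants would speak directly to the noise-as-barrier idea.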

[–]Miejuib[S] 1 point2 points  (0 children)

That's a really cool way of describing it. And I like the idea of doing experiments with artificially generated datasets in order to specifically control the complexity and noise of a given dataset.

It also makes me wonder whether it might be possible to engineer a dataset, under a structured generation scheme, such that it becomes much more difficult for a learning algorithm to model its actual underlying nature - i.e. so that it becomes very unlikely to escape into the second region of the double descent (even with significant over-parameterization).

[–]mbleich 2 points3 points  (1 child)

Over-parameterization turns local minima into saddles. High learning rates efficiently navigate surfaces rich in saddles.

[–]Miejuib[S] 0 points1 point  (0 children)

Yes, that is of course reasonable. I think the thing I'm more interested in is how the underlying topological structure of natural information affects the loss landscape (and in turn, how that affects the ability of learning algorithms to navigate it). Also, I'm not sure that simple explanation accounts for the behavior observed in img1, where the eval loss actually decreases for some time before once again increasing at a significant rate.

[–]redditpirateroberts 3 points4 points  (1 child)

Love it! wish there was more of it!

[–]Miejuib[S] 0 points1 point  (0 children)

Thank you! I will definitely keep thinking about it, and I'll see about trying some experiments to further probe the idea.

[–]MaxMachineLearning 1 point2 points  (4 children)

I dabble a bit in researching the underlying topology of neural networks. One idea I find rather interesting is that of a "learning surface": basically, you look at the surface created by the weights in a layer over time, using methods from topological data analysis. The last set of weights in a network, between the last hidden layer and the output layer, holds some very interesting information. If you project it into 2 or 3 dimensions, you can see an interesting branching pattern where each branch corresponds to one of the possible outputs. There's a bunch of other cool stuff you can examine there, but I would be interested in comparing the two methods and examining that surface. It might provide some insight into why they seem to behave similarly.
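
To give an idea of the bookkeeping involved - this is not the TDA pipeline from the paper, just a rough PyTorch sketch with PCA as a stand-in projection, and it assumes the output layer is exposed as `model.fc` (as in torchvision ResNets):

```python
import numpy as np
import torch

def record_last_layer(model, snapshots):
    """Append a copy of the final layer's weight rows (one row per output class)."""
    snapshots.append(model.fc.weight.detach().cpu().numpy().copy())

def project_learning_surface(snapshots):
    """Stack (step, class, hidden) weight rows and project them to 2-D with PCA."""
    pts = np.concatenate(snapshots, axis=0)          # (steps * classes, hidden)
    pts = pts - pts.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(pts, full_matrices=False)
    return pts @ vt[:2].T                            # each class traces one branch

# During training: call record_last_layer(model, snapshots) every N steps, then
# scatter-plot project_learning_surface(snapshots), coloured by class index.
```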

[–]Miejuib[S] 0 points1 point  (0 children)

This is exactly the sort of thing I would love to learn more about. Can you send me some resources / references about it so that I can learn more? Would also be happy to chat more at length

[–]brunocas 0 points1 point  (2 children)

Could you provide a few references showing this branching pattern?

[–]MaxMachineLearning 1 point2 points  (1 child)

Sure: https://github.com/maximevictor/topo-learning Paper: https://arxiv.org/abs/1902.08160

This is the first paper that I read and played with the ideas from. I attached the github link where you can see the script. I also recreated the results in PyTorch myself and tested them on other neural networks, mainly just in the connections between the the second to last and final layer weights with constant initialization. You do get a branching pattern like they present, but I think that it sort of just demonstrates why initialization is so important, and might not be anything deeper than that. However I would suggest reading this paper: https://arxiv.org/abs/1806.07572 which provides a more generalized mathematical theory in a similar vein that seems to have interesting applications. Hopefully this provides some interesting material for you!

[–]brunocas 0 points1 point  (0 children)

Thanks so much, sounds interesting.

[–]eric_he 1 point2 points  (1 child)

Thanks for pointing this out. To preface, I am a non-expert and I’m writing immediately after reading, so anything I say should be taken with a grain of salt.

Rather than thinking deeply about the mechanisms of a vastly varying learning rate under the 1-cycle policy, I focused on the comments that the 1-cycle policy required reducing the strength of other forms of other regularization such as drop-out, early stopping, etc.

Now every form of regularization can be viewed as a Bayesian inductive bias. For example, the l2 regularization term is equivalent to an inductive bias that a starting weights has probability inversely proportional to their distance from the origin. So when we are keeping the regularization “balanced” in the vocabulary of the super-convergence paper, we are really just weighting one inductive bias more heavily than another.

The double loss paper attributed the generalization ability of complex functions as a result of an over-parameterized network being able to pick and choose a vastly larger number of ways of memorizing the data through the parameters. This greater selection, combined with SGD, implicitly assigns higher weight to “low RHKS norm” parameterizations which the authors argue is demonstrated to be a good inductive bias on natural data.

So really what might be going on in the super-convergence paper is that doing away with the other “helpful” inductive biases to focus on the implicit bias of SGD, which is magnified by increases in LR and overparameterization, is what is allowing increased maximum performance and faster convergence.

[–]Miejuib[S] 0 points1 point  (0 children)

That's an interesting perspective that I hadn't considered. Now I wish they had included more charts of how the eval loss evolved over time under regimes of varying levels of other regularizations. I will read up more to better understand the RKHS norm, since I'm not super familiar with it right now.

[–]serge_cell 1 point2 points  (1 child)

Authors of the 1st paper didn't specify step size and multiplier for "standard learning rate policy" for imagenet, and I don't see step size/multiplyer study for imagenet either. It could be that "standard learning rate policy" still better with smaller initial learning rate or sharper step size. CIFAR/MNIST results often don't transfer to imagenet at all.

[–]Miejuib[S] 1 point2 points  (0 children)

This is true, I wish they had also included a definition for what parameters they used for that.

[–]Aeglen 2 points3 points  (2 children)

I am not exactly qualified to comment, being towards the start of my ML journey(?), but I think I understood the gist of what you were saying so let's give this a go.

In the "traditional" case as shown by the blue line in figure 1, the network appears to be homing in on a single loss-space minimum (imagine a 3D landscape or something and it's trying to find the lowest point, but with many more dimensions).

My understanding from what you said is that there might be many such loss-function-mimima, some which generalize better than others, and that the dip in val_acc is due to climbing over the wall between two.

Here are the questions and observations which come to mind:

1) Is it really the case that there are commonly many such minima? I'd imagine this is heavily dependent on the problem and data.

2) The loss smoothly-ish increases before smoothly-ish decreasing. I wonder what the network is doing. Clearly the weights have been changed so that it is in a less desirable area of the loss surface, but how could it "know" about the existance of another better-generalizing minimum? And why not find an even better one afterwards? This maybe suggests that it's converging to the same minimum, but faster.

3) The rapidly-converging curves are abruptly cut off. Why, and what happens next?

Maybe this is rubbish (sadly I don't have time to read the papers), but it's been interesting to think about. Perhaps, if possible, you could extract and compare the weights themselves each time to see if they differ by much to test your theory?

[–]Miejuib[S] 1 point2 points  (0 children)

  1. In general, yes. Consider that most neural nets are trained starting from different random initializations, and the configurations they settle upon tend to generalize well, despite the learned representations not necessarily being entirely consistent, even between identical architectures trained on the same data. For more discussion of this, I recommend checking out: http://proceedings.mlr.press/v44/li15convergent.pdf " Towards Understanding Learning Representations: To What Extent Do Different Neural Networks Learn the Same Representation "
  2. With regard to the first part of your question (re: 'knowing' about better minima), my hypothesis is that when the learning rate is at its smallest values in the first cycle of the cosine loss, this acts to essentially amplify the directions in which the network parameter configuration is able to traverse through the loss landscape
  3. In the super-convergence paper, from my understanding, their experiments were specifically performed to stop once they had achieved an eval loss equal to their control experiment, hence why the plots stopped where they did. I do wish they had included extended data regarding how the training performed beyond that.

Your suggestion is an interesting idea, though it would need to be done carefully for reasons as described in the paper linked in point 1. Perhaps by enforcing the same initialization of the networks with and without the super-convergence learning rate schedule

[–]mbleich -1 points0 points  (0 children)

How about this? Over-parameterization turns local minima into saddles. High learning rates (on SGD) efficiently navigate spaces dense with saddles.

[–]tr1pzz 2 points3 points  (1 child)

Very interesting line of thought! Gonna read the two papers you referenced before commenting on these intuitions, cause I always find it tricky to apply common sense reasoning to high-dimensional parameter spaces..

In the meantime allow me to drop one of my videos here which might be very relevant to this discussion: https://youtu.be/pFWiauHOFpY

[–]Miejuib[S] 0 points1 point  (0 children)

Dude that was a great video, thanks for the link! I will definitely be keeping track of that channel, it seems to be a pretty quality production.