Alright, so the point of this post is that I noticed a parallel between two recent papers, and I have a hypothesis that they are actually closely connected. I'd like to hear any thoughts you guys may have, and I'm happy to receive criticism.
The relevant papers are these:
ref_1: https://arxiv.org/pdf/1708.07120.pdf
```Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates```
ref_2: https://arxiv.org/pdf/1812.11118.pdf
```Reconciling modern machine learning and the bias-variance trade-off```
------------------------------------
So obviously the proposed technique for super-convergence in ref_1 is pretty cool on its own, but that's not actually what caught my eye. One of the things I found most interesting was a very striking artifact in one of its figures that the authors didn't discuss:
[img1: the figure from ref_1 showing the validation accuracy curves with the artifact I'm referring to]
Now, when I saw those val accuracy curves, I was immediately reminded of another very cool paper published recently: ref_2, which revisits the bias-variance trade-off and proposes a 'double descent' risk curve:
[img2: the 'double descent' risk curve figure from ref_2]
And then there's the next figure in that same paper ( ref_2 ), where the authors plot metrics for architectures at various degrees of over-parameterization:
[img3: the figure from ref_2 plotting metrics of architectures at various degrees of over-parameterization]
I know there's no direct evidence that the two behaviors are actually related, but I think a connection between the two phenomena is very possible, and potentially very interesting, since it would speak to the nature of the loss landscape for a given architecture/task, and to how gradient-descent paths through it might be affected by varying the dimension of the parameter space or the learning rate.
The main hypothesis I'd like to put forward is that there may be a fairly common (perhaps even nearly universal) property of the structure of natural information which, when handled by learning systems/algorithms, produces what is essentially a potential-energy-barrier 'shell' separating a region of parameter-configuration space (for a given architecture) that can only generalize moderately well from a region (of the same architecture) that is capable of greatly improved generalization and performance.
So for example, when greatly over-parameterizing a network as in ref_2, my suspicion is that beyond some threshold the models essentially become capable of a learning-algorithm equivalent of tunneling through that potential energy barrier, simply because they have enough degrees of freedom to "poke holes" through the loss landscape ( perhaps with some percolation-threshold-like property that depends on the degree of over-parameterization, i.e., for thicker potential energy barriers, excessively redundant degrees of freedom in the parameter space may act like short, randomly aligned 'holes' throughout the space, and a critical density of them is required to successfully tunnel through the barrier ). The sketch below shows the kind of sweep I have in mind.
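To make that over-parameterization sweep concrete, here's a rough sketch in the spirit of ref_2's experiments. Everything in it (the toy dataset, the widths, the MLPRegressor settings) is a placeholder I made up for illustration, not the paper's actual setup:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

# Toy regression problem; sizes, noise level, and widths are all made up.
rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(100, 5))
y_train = np.sin(X_train.sum(axis=1)) + 0.1 * rng.normal(size=100)
X_test = rng.uniform(-1, 1, size=(2000, 5))
y_test = np.sin(X_test.sum(axis=1))

for width in [2, 8, 32, 128, 512, 2048]:
    model = MLPRegressor(hidden_layer_sizes=(width,), max_iter=5000,
                         tol=0.0, random_state=0)
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    # If the double-descent picture holds, test error should peak somewhere
    # around the interpolation threshold and then come back down as the
    # network keeps growing past it.
    print(f"width={width:5d}  train MSE={train_err:.4f}  test MSE={test_err:.4f}")
```

Obviously this toy sweep won't reproduce the paper's curves, but it's the basic shape of experiment I'm thinking of when I talk about a percolation-like threshold in the degree of over-parameterization.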
The alternative tunneling route (the one relevant to the super-convergence paper, ref_1 ) would be that the model first lands in the "moderate quality" region of parameter-configuration space, and then, because the cyclical learning-rate schedule decays down to almost nothing, it settles into the point in its local proximity (its local minimum) that minimizes the height of the potential energy barrier separating it from the "highly performant" region.
Then, as the learning rate ramps up again, the trajectory of the parameter configuration turns into what's essentially a destructively resonant system, and in doing so it's able to tunnel through the potential energy barrier, inadvertently benefiting from what's usually considered 'bad' or divergent optimizer behavior, such as when the learning rate is too high and the loss keeps increasing until the model dies from NaNs.
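To make the schedule dynamic I'm describing concrete (decay down to almost nothing, then ramp back up), here's a toy sketch. The shape and all the constants are made up for illustration; it's not the exact policy from ref_1:

```python
import math

def toy_cyclical_lr(step, cycle_len=1000, lr_max=0.1, lr_min=1e-5):
    """Toy cyclical schedule: cosine-decay from lr_max down to lr_min over the
    first half of each cycle, then ramp linearly back up to lr_max over the
    second half. Purely illustrative; not the policy used in ref_1."""
    pos = (step % cycle_len) / cycle_len  # position within the current cycle, in [0, 1)
    if pos < 0.5:
        # decay phase: the model settles into a nearby local minimum
        t = pos / 0.5
        return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))
    # ramp-up phase: large steps kick the model back out of that minimum
    t = (pos - 0.5) / 0.5
    return lr_min + (lr_max - lr_min) * t

# Typical usage (framework-agnostic): set the optimizer's learning rate to
# toy_cyclical_lr(step) at every training step.
```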
Taking that a step further, the benefit of the first cycle's decay might not be about finding the 'best local minimum' at all; what might be happening instead is that the ramp-down phase restricts the extra degrees of freedom in the parameter space, so that when the learning rate ramps up again, the model doesn't accidentally throw itself into regions of the parameter space where it truly, completely diverges.
For a mental image, I picture something like a spaceship's airlock: during the first cycle's decay period, the model is essentially 'closing the door behind it' as it 'enters the airlock', so that when the learning rate ramps up again, it becomes much harder to go back the way it came, and the only remaining option is to tunnel forward through the barrier in the 'opposite direction', into the region of parameter space capable of significantly improved performance and generalization.
( I apologize for the mental image, since I know that kind of picture is pretty non-kosher when discussing the structure of high-dimensional topologies, but it helped me structure my thoughts on the matter, so I figured it might help a reader here too )
So that's the idea I'd love to hear thoughts about. When reading new publications and running my own experiments, in addition to the concepts explicitly being described or tested, I generally try to build my understanding of what the topological structure of natural information may be like, and how that structure may affect the ability of learning algorithms to interact with it. ( Of course there are plenty of papers discussing ways to reparameterize and substantially modify the loss landscape, but for this hypothesis I'm choosing not to address those modifications, and instead to treat the distribution of the natural information and its corresponding loss landscape in as unmodified a manner as possible, since such reparameterizations can, from my understanding, significantly alter the convergence behavior of algorithms like gradient descent )
What do you guys think? Is this just a completely bat-shit crazy and obviously wrong idea for reasons I don't understand yet? Could it actually hold some weight, and serve as an (at least somewhat) accurate description of how models actually traverse their loss landscapes? Can you think of a good way this hypothesis could be tested experimentally, in a falsifiable and repeatable manner? Do you have any other thoughts on the matter, or know of other publications closely tied to these concepts?
Let me know, I'd love to hear your input, and I don't mind being wrong if it means it's an opportunity to learn something new.