all 28 comments

[–][deleted]  (3 children)

[deleted]

    [–]gwern[S] 24 points  (2 children)

    I'm afraid I'm to blame for that pun.

    [–]wassname 1 point  (0 children)

    on the gripping hand

    And a "Mote in God's Eye" reference, you monster :p

    [–]Imnimo 7 points  (3 children)

    I wonder if you could tease out some of the differences between architectures by using randomly initialized, untrained networks to extract content and style. They wouldn't be as good as trained networks (although it'd be cool if they were!), but it'd let you test a wide variety of architectures without having to run really expensive training.

    [–]gwern[S] 8 points  (2 children)

    Since a random untrained CNN can do image completion somewhat credibly, it wouldn't surprise me, but it would be hard to draw conclusions from random-CNN performance about what the trained networks might be learning or how they generalize.

    [–]kkastner 5 points  (1 child)

    They also work extremely well for texture generation; see the discussion here of using style-transfer-type techniques.

    [–]allicisred 0 points  (0 children)

    Prefer telegram or discord?

    [–]darkconfidantislife 5 points  (4 children)

    I'd like to clarify my hypothesis (#1) somewhat: I meant that VGG is using its huge number of parameters to learn things about the images, but not things that directly relate to the classification task. For example, VGG can be pruned to about 10% of its capacity while still retaining the same classification performance.

    [–]gwern[S] 6 points  (3 children)

    For example, VGG can be pruned to about 10% of its capacity while still retaining the same classification performance.

    So can most CNNs, though. 10% is nothing special in the model compression/distillation papers I've read, all of which are from >=2014. Does this really single out VGG and potentially explain its especially good performance in style transfer? Are there other tasks where VGG is strikingly good compared to resnets? (I mentioned object localization but people disagreed whether there was any problem or performance gap on Twitter, while so far the original contention 'VGG is the best style transfer net' appears unchallenged.)

    [–]darkconfidantislife 5 points  (2 children)

    So can most CNNs, though. 10% is nothing special in the model compression/distillation papers I've read, all of which are from >=2014.

    Actually, VGG is an outlier: it's consistently the only one that can be pruned that deeply. GoogLeNets and others hover around 30-50%. Even AlexNet falls far short of how much can be pruned from VGG.
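    The pruning being discussed can be made concrete with simple magnitude pruning (zero out the smallest-magnitude weights). A minimal numpy sketch, purely illustrative and not the exact procedure used in the compression papers above:

    ```python
    import numpy as np

    def magnitude_prune(weights, keep_fraction):
        """Zero out all but the largest-magnitude fraction of the weights."""
        flat = np.abs(weights).ravel()
        k = max(1, int(flat.size * keep_fraction))
        threshold = np.sort(flat)[-k]  # k-th largest magnitude
        mask = np.abs(weights) >= threshold
        return weights * mask

    rng = np.random.default_rng(0)
    w = rng.normal(size=(100, 100))
    w_pruned = magnitude_prune(w, 0.10)  # keep ~10% of weights, as claimed for VGG
    ```

    In practice the papers prune, then fine-tune to recover accuracy; this sketch only shows the masking step.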

    [–]jcannell 1 point  (0 children)

    The FC layers can be pruned much more aggressively than the conv layers, so that makes sense.

    [–]wassname 0 points  (0 children)

    In other words: Perhaps training a network with so many extra parameters let it develop some more complicated but meaningful internal structure, rather than having it forced on it by the architecture?

    Seems like distillation would be a good test of this, as suggested, because that would probably discard this structure.

    [–]SkiddyX 6 points  (3 children)

    I feel like the trend of using feature pyramids while using ResNet for object detection might be for a similar reason. I would be interested to see StyleNet results using one.

    [–]wassname 2 points  (2 children)

    Yeah, it is strange that many detection models won't converge without pretraining on an ImageNet classification task.

    I've tried some variations, but they haven't clarified the issue for me. I tried pretraining on an MSCOCO detection task and then applying the model to a different detection task. Surprisingly, ImageNet/classification pretraining worked better than MSCOCO/detection pretraining, even though it was being applied to a detection task.

    [–]SkiddyX 1 point  (1 child)

    I have experienced the same thing; might it be related to the different loss functions used?

    [–]wassname 0 points  (0 children)

    Good to know it's not just me.

    Could be, I guess. Classification has a softmax loss, while detection often has dual losses: softmax plus MSE (or similar).

    Machine learning often performs better when you frame a problem as classification rather than regression. Since detection uses regression (alongside a classification head), it might be a more complicated task which generalizes less.
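    To make the "dual losses" point concrete, here is a toy sketch of a detection-style loss combining softmax cross-entropy for the class with an MSE term for the box coordinates. The function names and the equal weighting are illustrative assumptions, not any particular detector's implementation:

    ```python
    import numpy as np

    def softmax(logits):
        z = logits - logits.max()  # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    def detection_loss(cls_logits, true_class, box_pred, box_true, reg_weight=1.0):
        """Toy detection loss: cross-entropy for the class + MSE for the box."""
        probs = softmax(cls_logits)
        ce = -np.log(probs[true_class])            # classification term
        mse = np.mean((box_pred - box_true) ** 2)  # regression term
        return ce + reg_weight * mse

    loss = detection_loss(
        cls_logits=np.array([2.0, 0.5, -1.0]),
        true_class=0,
        box_pred=np.array([0.5, 0.5, 0.2, 0.2]),
        box_true=np.array([0.4, 0.6, 0.2, 0.2]),
    )
    ```

    A pure classifier only ever optimizes the first term, which may be part of why its features transfer differently.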

    [–]ProGamerGov 3 points  (2 children)

    Although this VGG-specificity appears to be folklore among practitioners, this is not something I have seen noticed in neural style transfer papers; indeed, the review Jing et al 2017 explicitly says that other models work fine, but their reference is to Johnson's list of models where almost every single model is (still) VGG-based and the ones which are not come with warnings (NIN-Imagenet: "May need heavy tweaking to achieve reasonable results"; Illustration2vec: "Best used with anime content...Be warned that it can sometimes be difficult to avoid the burn marks that the model sometimes creates"; PASCAL VOC FCN-32s: "Uses more resources than VGG-19, but can produce better results depending on your style and/or content image." etc).

    Not all of Neural-Style's default settings are ideal for each model. My research on how the Adam optimizer affects style transfer shows that the default Adam parameters Neural-Style uses are unstable and cause something similar to the "burn marks" which I wrote about in your source.

    I've found, when attempting to train my own VGG networks (for style transfer, though trained via classification), that the best parameters to use change as the network is changed.

    Add/remove FC layers from retrained VGG and resnet models. Does that lead to large gains/losses in quality?

    As far as I know, Neural-Style does not use FC layers. /u/crowsonkb 's style_transfer even had the FC Layers completely removed from the equation. Though the FC Layers can also be used to exert more control over the style transfer process.

    When using modified Neural-Style scripts that support label files, one can see that the network predictions are not always what the content image contains.

    [–]gwern[S] 2 points  (1 child)

    As far as I know, Neural-Style does not use FC layers. /u/crowsonkb 's style_transfer even had the FC Layers completely removed from the equation. Though the FC Layers can also be used to exert more control over the style transfer process.

    No, the idea there is that the FC layers, while not being used in generating features in style transfer, may still have affected the training of the rest of the VGG model and made the convolution layers lower down (which are generating the actual features) learn something somewhat different qualitatively than other models like resnets which typically have just one final FC layer on top of all the convolution+BNs, thereby yielding different features in the convolution layers. Something like - handwaving furiously - the VGG convolutions focus on learning textures and shapes while the VGG's FC layers do all the semantic thinking & understanding putting the pieces together, only the former of which is what we want.

    Speculative, yes, but training end-to-end means such global dynamics are possible, I have seen adding FCs change image generation in playing with anime GANs (adding 1-3 FC layers to the upscaling seemed to help with global coherency like ensuring that eye colors get matched), and the FC layers do contain a huge number of parameters so they might be doing something. Given all the contradictory opinions on what's going on, it's not a terrible idea.

    If so, then the logical test of the hypothesis is to see if a VGG model trained without FC layers is still as good for style transfer, and if other resnet models trained with additional FC layers get better for style transfer.
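    Since the argument is about what the conv layers (rather than the FC layers) encode, it may help to recall that Gatys-style transfer measures style via Gram matrices of conv activations, so the FC layers never enter the style computation directly. A minimal numpy sketch, with shapes assumed for illustration:

    ```python
    import numpy as np

    def gram_matrix(features):
        """Style representation of one conv layer: the correlations
        between its channels, shape (channels, channels)."""
        c, h, w = features.shape  # (channels, height, width)
        f = features.reshape(c, h * w)
        return f @ f.T / (h * w)  # normalize by spatial size

    acts = np.random.default_rng(1).normal(size=(64, 32, 32))
    g = gram_matrix(acts)  # style losses compare these matrices between images
    ```

    The FC-layer hypothesis is then about how end-to-end training shapes the activations that feed this computation, not about the FC weights themselves.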

    [–]ProGamerGov 2 points  (0 children)

    I've done some fine-tuning with the FC8 layer on a few different VGG-16 models, but I haven't had access to the resources required for fully retraining the VGG models (AWS doesn't make it very cheap). I haven't experimented with removing or adding more FC layers, however, so I am not sure if my fine-tuning experience will be of use to you?

    This experimentation was a while ago, and I was still learning a lot of the basics, but I was trying to only train the FC 8 layer directly, while letting things propagate into the rest of the model from there (at least I think that's what happens when the rest of the layers have their learning rate set to 0?). During this experimentation, I definitely noticed the finer details like texture and color being affected in the style transfer outputs I made for each set of 100-1000 training iterations.

    Edit:

    These are the results of one of my older successful experiments, where the quality of the output did not appear to be degraded in any way from the original model.

    And this was the result from one of my less successful training attempts.


    If so, then the logical test of the hypothesis is to see if a VGG model trained without FC layers is still as good for style transfer, and if other resnet models trained with additional FC layers get better for style transfer.

    I would say that you are correct in your hypothesis that the FC layers are probably an important part of training a model to perform well in style transfer, but it would be nice to see that proven experimentally.

    Maybe some sort of DeepDream-like setup used on each layer channel individually could help shed some light on how changes to the FC layers affect the lower layers? I'm not that mathematically minded, even though I'd like to be, so there is probably a way we could do an experiment like this without analyzing output images. This is the area I have been looking at for improving style transfer models for the past while.

    Another thing that might be related is that certain content and style layers in Neural-Style produce significantly more artifacts than others do. I'm not sure if this is related to the reason for VGG's success, but I figured I should mention it because giving emphasis to sets of style layer channels seems to create a similar result. I tried to explore this to prevent my tiles from drifting apart, but it didn't completely solve the issue, though it almost seemed like some of these artifacts would grow into larger features. The NIN and ResNet models which I have tested in Neural-Style also had artifacts which could be somewhat controlled with different layer combinations. But these ResNet and NIN artifacts differed from the VGG artifacts in that they did not appear as randomly, and they were composed of patterns that made up larger grids instead of the more random DeepDream-like artifacts caused by the VGG models.

    [–]stochastic_gradient 4 points  (2 children)

    Here's another hypothesis: Batch norm is to blame. VGG does not use it, while the other networks do.

    The test for this would be to train a resnet on ImageNet using some of the recent self-normalizing tricks [1,2], and see if the learned features work better for style transfer.

    1: https://arxiv.org/abs/1706.02515

    2: https://arxiv.org/abs/1709.04054
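    For reference, the self-normalizing trick in [1] is the SELU activation: a scaled ELU whose constants are chosen so that activations are pushed toward zero mean and unit variance without batch norm. A quick sketch (constants from the paper):

    ```python
    import numpy as np

    # SELU constants derived in Klambauer et al. 2017.
    ALPHA = 1.6732632423543772
    SCALE = 1.0507009873554805

    def selu(x):
        """Scaled exponential linear unit."""
        return SCALE * np.where(x > 0, x, ALPHA * (np.exp(x) - 1))

    x = np.random.default_rng(0).standard_normal(100_000)
    y = selu(x)
    # y stays close to zero mean / unit variance, the self-normalizing fixed point
    ```

    A ResNet trained with SELU in place of batch norm would be one way to run the proposed test.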

    [–]Deep_Fried_Learning 1 point  (0 children)

    This might be tangentially related... In the paper Speed/accuracy trade-offs for modern convolutional object detectors they say:

    With the exception of VGG, we also do not perform “layer normalization” (as suggested in [26]) as we found it not to be necessary for the other feature extractors.

    That paper might also provide some insights into this topic, since it performed a lot of experiments directly comparing VGGs, ResNets and MobileNets.

    [–]shortscience_dot_org 0 points  (0 children)

    I am a bot! You linked to a paper that has a summary on ShortScience.org!

    Self-Normalizing Neural Networks

    Summary by Léo Paillier

    Objective: Design Feed-Forward Neural Network (fully connected) that can be trained even with very deep architectures.

    • Dataset: [MNIST](yann.lecun.com/exdb/mnist/), [CIFAR10](), [Tox21]() and [UCI tasks]().

    • Code: [here]()

    Inner-workings:

    They introduce a new activation function, the Scaled Exponential Linear Unit (SELU), which has the nice property of making neuron activations converge to a fixed point with zero mean and unit variance.

    They also demonstrate that upper and lowe... [view more]

    [–]wassname 0 points  (0 children)

    Test 1: retrain much smaller VGGs

    Quick tests are good since they are more likely to be performed. So for a quick test you could just compare VGG16 and VGG19 and see how the style transfer differs.

    [–]toastjam 0 points  (3 children)

    Stupid question, but how is a residual different from loss?

    [–]gwern[S] 3 points  (2 children)

    Not sure what you mean.

    [–]toastjam 0 points  (1 child)

    To put it another way, a loss function is how an NN's prediction differs from what it's being trained to predict. A residual is what, in that context?

    [–]sritee 6 points  (0 children)

    I think it means residual connections; check out ResNet.
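    To spell the answer out: a residual here isn't a loss term at all. In a ResNet block the layers learn only a correction F(x) that gets added back onto the input via a skip connection. A toy numpy sketch with a made-up single-layer F:

    ```python
    import numpy as np

    def residual_block(x, weight):
        """y = F(x) + x: the block learns the residual F(x), the change
        on top of the identity, rather than the full mapping."""
        fx = np.maximum(0.0, x @ weight)  # toy F: one ReLU layer
        return fx + x                     # the skip connection

    x = np.ones((1, 4))
    w = np.zeros((4, 4))
    y = residual_block(x, w)  # with F == 0 the block is exactly the identity
    ```

    The loss, by contrast, is computed once at the network's output; residual connections live inside the architecture.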