all 83 comments

[–]gwern[S] 44 points45 points  (16 children)

relative to the NVIDIA Titan X, the two cards should end up quite close, trading blows now and then.

Speaking of the Titan, on an interesting side note, it doesn’t look like NVIDIA is going to be doing anything to hurt the compute performance of the GTX 1080 Ti to differentiate the card from the Titan, which has proven popular with GPU compute customers. Crucially, this means that the GTX 1080 Ti gets the same 4:1 INT8 performance ratio as the Titan, which is critical to the cards’ high neural network inference performance. As a result the GTX 1080 Ti actually has slightly greater compute performance (on paper) than the Titan. And NVIDIA has been surprisingly candid in admitting that unless compute customers need the last 1GB of VRAM offered by the Titan, they’re likely going to buy the GTX 1080 Ti instead.

From the sound of it, this is excellent news for deep learning, and the 1080 Ti is now the standard GPU! A potent $700 package with no gotchas.

[–]bbsome 29 points30 points  (15 children)

except that the float16 performance is still crap

[–]FloRicx 13 points14 points  (2 children)

Hey, this is not the first comment I've read about crappy float16 performance. Could you elaborate a bit, or give some pointers? Thanks!

[–]gwern[S] 2 points3 points  (0 children)

Is it? I didn't see any mention of FP16 on the 1080ti being crippled, and that would be an odd thing to do if they are, as Anandtech says, uncrippling INT8. And the Anandtech table says the 1080ti and Titan X FP16 are the same, so presumably it's at least not worse.

[–]daV1980 -1 points0 points  (10 children)

When you're doing giant concatenations of matrix multiplies, why would you use fp16? That would be a terrible, terrible idea. You might as well set your values to 0 ahead of time to save yourself debugging effort, because that's going to be the result.

[–]ajmooch 8 points9 points  (7 children)

Lol, what? Fp16 works just fine, and there's been tons of work on even lower precision weights and activations.

[–]daV1980 2 points3 points  (6 children)

We've had this problem in graphics for years and we do way, way fewer concatenated FP operations. The smallest positive normal value in fp16 is about 0.000061, which is not particularly small.
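To make that concrete, here's a quick NumPy sketch (illustrative only; exact printed values may vary slightly by setup):

```python
import numpy as np

info = np.finfo(np.float16)
print(info.tiny)  # ~6.1e-05, the smallest positive *normal* fp16 value
print(info.eps)   # ~9.8e-04, the spacing around 1.0 (roughly 3 decimal digits)

# Products of small gradients fall below even the subnormal range:
g = np.float16(1e-4)
print(g * g)      # 0.0 -- silently flushed to zero

# And large accumulators stop absorbing small updates entirely:
print(np.float16(2048.0) + np.float16(1.0))  # 2048.0 -- the +1 is lost
```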

I'm not saying it's impossible to use fp16, I'm saying that when it fails it will be very confusing why, and that failure will depend heavily on the NN in question, including the weights, topology, etc. That debugging problem will be super unfun.

I will wholeheartedly admit to being just a NN amateur, but I am extremely well qualified to speak about FP and GPU issues (my post history has links to GPU talks I've given).

[–]ajmooch 6 points7 points  (4 children)

So I know basically nothing about graphics, but I've been told that you guys get some benefit out of using doubles instead of single-precision floats, perhaps for this reason? It's not an issue that really comes up with deep nets unless you have standard exploding/vanishing gradient problems, which things like batchnorm, residual connections, proper initialization/regularization schemes (shameless plug for Orthogonal Regularization), etc, largely alleviate for most networks of interest.

It's a known and very well-studied issue, and one where we've gotten to the point where machine precision isn't really the crux of the problem anymore, since we're not explicitly just doing stacked matrix multiplies. Not sure how the binary/extremely-low-precision crowd feels on this one, but FP16 has never caused any problems for me.

[–]daV1980 1 point2 points  (3 children)

We tend to use fp32 most of the time, or fp16 when we know we can get away with it (certain calculations we know will be in camera space or NDC).

I haven't implemented a NN yet from the ground up outside of toy implementations in MATLAB, and I haven't yet looked into the bowels of what tensorflow generates in terms of CUDA functions. I'm surprised to hear it's not matrix multiplies or a series of dot products (it's not that I disbelieve you, I am simply surprised that's not the implementation).

[–]ajmooch 8 points9 points  (2 children)

So most networks are basically a series of dot products (convolution/correlation just being a natural way to express a "sliding window" dot product), but with various other operations interspersed in between. The two basic building blocks are conv/dot (dot for a fully-connected layer) and nonlinear activations, either a ReLU (aka half-wave rectifier) or some squashing nonlinearity like sigmoid/tanh.
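In (made-up) NumPy-ish pseudocode, the whole forward pass is basically just those two pieces stacked:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)        # the half-wave rectifier

def dense(x, W, b):
    return x @ W + b                 # the dot-product building block

# A toy 3-layer fully-connected net: dot product, nonlinearity, repeat.
np.random.seed(0)
layers = [(0.1 * np.random.randn(64, 64), np.zeros(64)) for _ in range(3)]

h = np.random.randn(64)              # a single 64-dimensional input
for W, b in layers:
    h = relu(dense(h, W, b))
print(h.shape)                       # (64,)
```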

If you do just stack a bunch of these naively then you definitely will deal with vanishing or exploding gradients as a result of repeated multiplication! This is an issue that's been known for years, and there's a couple of quick and easy workarounds (which weren't so easy to come up with).

The first is batchnorm, which basically rescales the output of each layer so that it looks gaussian distributed (by using the mean and variance statistics across a minibatch), and the second is residual connections, which basically say instead of multiplying the input by a weight matrix, multiply the input by a weight matrix and then add it to the original input. Combining these two makes it really easy to build deep networks; the residual connection makes it so that even if you're multiplying the input by a small matrix that would decrease its magnitude/norm, when you add it back onto the input you're not dealing with the same "values get smaller 'cuz you kept multiplyin em like a scrub" problem.
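Here's a toy NumPy sketch of that idea (no learned scale/shift parameters, and not how a real framework implements it):

```python
import numpy as np

def batchnorm(h, eps=1e-5):
    # Rescale each feature using minibatch mean/variance so the output
    # looks roughly zero-mean, unit-variance.
    return (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)

def residual_block(x, W):
    # Instead of h = batchnorm(x @ W), add the result back onto the input.
    return x + batchnorm(x @ W)

np.random.seed(0)
x = np.random.randn(128, 64)          # a minibatch of 128 examples
W = 0.01 * np.random.randn(64, 64)    # deliberately tiny weights

h = x
for _ in range(50):                   # stack 50 of these "layers"
    h = residual_block(h, W)

# Activations stay in a healthy range instead of shrinking toward zero,
# which is what makes reduced precision workable in practice.
print(np.abs(h).mean())
```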

[–]rumblestiltsken 0 points1 point  (1 child)

You know, this conversation makes me wonder whether some batchnorm like operation might work in computer graphics. Halving the compute cost by allowing widespread fp16 would be pretty game changing.

I can't really see a reason why computer graphics would need numbers lower than fp16 allows, considering the output palette is constrained by the human eye.

Disclaimer: know very little about computer graphics.

[–]david-gpu 1 point2 points  (0 children)

I've designed GPU hardware for a number of years. /u/daV1980 is right when he discusses the precision issues of fp16 in graphics.

It's true that when you are computing pixels you can sometimes use fp16 as the error is tolerable for the human eye, although even moderately long pixel shaders suffer from the limited precision and range of fp16. Small precision errors compound very fast, which is why fp16 is mostly used as a storage format and not for arithmetic.

But where things get ugly real fast is when you use fp16 in the vertex shader. As you probably know GPUs render surfaces as a collection of triangular meshes. There are some pieces of code called vertex shaders that compute how that geometry is transformed and projected onto the framebuffer. Attempting to use fp16 for those transformations invariably leads to artifacts.

[–]__Cyber_Dildonics__ 2 points3 points  (0 children)

Or you could deal with these difficulties instead of giving up and saying there is no solution.

[–]Caffeine_Monster 0 points1 point  (0 children)

Using a very restrictive weight domain, i.e. 8 bits, can actually be detrimental to neural network performance. Sure, you can potentially shunt more gradient descent passes through in a given time frame, but convergence to the global optimum is potentially slower if the problem you are trying to solve requires high-precision pattern discrimination.

My point is that the math behind neural networks assumes you have access to true real numbers – an int8 type means any backprop deltas are truncated by a significant amount so that they fit into the 8-bit data type. You are essentially introducing rounding errors into the learning pass.
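As a toy illustration (the scale factor here is arbitrary, not any particular library's quantization scheme):

```python
import numpy as np

scale = 0.05                                      # one int8 step = 0.05
deltas = np.array([0.004, -0.012, 0.10, 0.031])   # hypothetical backprop deltas

q = np.clip(np.round(deltas / scale), -128, 127).astype(np.int8)
restored = q.astype(np.float32) * scale

print(q)                    # [0 0 2 1]  -- the two smallest updates vanish
print(restored)             # [0.   0.   0.1  0.05]
print(deltas - restored)    # the rounding error introduced into the learning pass
```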

If you are very careful with your network architecture, int8 can definitely give better convergence time with an i

[–]hapemask 0 points1 point  (0 children)

Using low-precision floating point representations is a reasonably well-studied topic for neural networks (though I don't study it myself, I am aware of it).

Not only can you train a network w/float32 and then compress it down to something like float8 w/an acceptable loss in accuracy, people have also trained networks with low-precision floats from start to end: https://arxiv.org/pdf/1412.7024.pdf

While the results are perhaps counter-intuitive, NNs appear relatively robust to precision loss. This is different from something like graphics where floating point (im)precision could mean your ray passes right through a triangle it should hit. Perhaps the precision robustness is a side effect of training the network to be robust to a wide variety of image appearances.

[–]mimighost 10 points11 points  (4 children)

Wow, looking forward to seeing how this holds up against the current GTX 1080. But 3GB more VRAM and no price bump is already a sweet update.

[–][deleted]  (3 children)

[deleted]

    [–]mimighost 7 points8 points  (2 children)

    Yeah, AMD has simply been killing it recently. Good to see there is finally some competition in the chip market.

    [–]subzerofun 0 points1 point  (1 child)

    I'm new to the ML field and have only a little programming background – but an overwhelming interest in all the recent developments and applications involving neural nets. So forgive me if these questions sound stupid. From what I've seen in most GitHub projects, the devs either offer you the option to use the CPU – which is in most cases too slow unless you have a bazillion-core Xeon machine – or you can use your GPU(s) via CUDA. I've seen some OpenCL implementations of popular projects, but a lot of devs of the popular NN libraries say in their documentation that OpenCL is not as well supported or has limited functionality compared to CUDA.

    So when the Vega cards come out – will everyone have to wait until OpenCL is as well integrated as CUDA is now? Or is it just bias because I've read more about projects that rely mainly on CUDA?

    How well do frameworks like Tensorflow, Theano, Torch, Pytorch perform with AMD cards/OpenCL at the moment?

     

    It's comments like these that made me wonder (even if they are from older threads):

    http://stackoverflow.com/a/29051706

    So what kind of GPU should I get? NVIDIA or AMD? NVIDIA’s standard libraries made it very easy to establish the first deep learning libraries in CUDA, while there were no such powerful standard libraries for AMD’s OpenCL. Right now, there are just no good deep learning libraries for AMD cards – so NVIDIA it is. Even if some OpenCL libraries would be available in the future I would stick with NVIDIA: The thing is that the GPU computing or GPGPU community is very large for CUDA and rather small for OpenCL. Thus in the CUDA community good open source solutions and solid advice for your programming is readily available.

     

    https://www.reddit.com/r/MachineLearning/comments/4di4os/deep_learning_is_so_dependent_on_nvidia_are_there/  

    To be fair, there is nothing wrong with AMD's GPU hardware (and in some respects they often lead Nvidia), but that is only one component of what it takes to compete in these markets. The OpenCL ecosystem is just not a viable alternative to the Cuda ecosystem for most use cases, not even close. There are feedback loops too ... strong cuda support leads to larger attendance at Nvidia's GTC conference which leads to more devs which leads to more apps and so on.

     

    OpenCL is also the inferior framework from a programmer's perspective. Maybe future OpenCL or OpenCL-like standards can learn from what CUDA did right.

     

    Nobody(or very few ppl) wants to spend time rewriting stuff for AMD, although there are a few efforts in the works. On top of that Nvidia is putting in the work to develop CuDNN, but there is no equivalent from AMD.

     

    I chatted with a guy from AMD the other day at SVVR, and as far as I can tell they're not investing in deep learning at all. (Building OpenCL and/or generic BLAS libraries is not investing in deep learning. An equivalent to CUDNN is needed.) I don't understand how they can be so clueless. This market could be bigger than graphics!

    [–]duschendestroyer 1 point2 points  (0 children)

    They don't.

    [–]mljoe 7 points8 points  (3 children)

    11GB is going to work for most interesting NN architectures right?

    [–]TanktopModul 2 points3 points  (2 children)

    Yes, if you are OK with turning down the batch size sometimes.

    [–]gwern[S] 1 point2 points  (1 child)

    In which case you can always buy 2 1080tis for the price of 1 Titan and get double the FLOPS by splitting the minibatch across them. You're really only screwed if you have a single model 12GB in size which can be trained in minibatches of 1... and even that might be splittable across 2 1080tis with some work (put half the layers on one and half on the other a la synthetic gradients?). I struggle to think of scenarios in which 1 Titan makes more sense than 1-2 1080tis.
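    Roughly, in PyTorch terms (a minimal sketch, assuming two visible GPUs and a stand-in model):

    ```python
    import torch
    import torch.nn as nn

    # Stand-in model; the point is only the minibatch split across two cards.
    model = nn.Sequential(
        nn.Linear(1024, 4096), nn.ReLU(),
        nn.Linear(4096, 10),
    )
    model = nn.DataParallel(model, device_ids=[0, 1]).cuda()

    x = torch.randn(256, 1024).cuda()   # one 256-example minibatch...
    y = model(x)                        # ...run as two 128-example halves, one per GPU
    print(y.shape)                      # torch.Size([256, 10])
    ```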

    [–]TanktopModul 0 points1 point  (0 children)

    Agreed. With this price point, I don't think a strong argument for the Titan X can be made any more.

    [–]eazolan 4 points5 points  (23 children)

    I'm actually surprised at how little video cards have changed since I bought mine two years ago. Mine is still in the top 5!

    [–]endless_sea_of_stars 10 points11 points  (11 children)

    Moores law is slowing down. We aren't doubling performance every two years anymore. "The party isn't over, but the cops have been called and the music has been turned way down."

    [–]eazolan 11 points12 points  (4 children)

    Yeah, but graphics card GPUs have never been on the cutting-edge process node. Just last year they went from a 28nm die to 16nm.

    And that should have been a huge difference. For the same area they could have put in far more CUDA cores. Instead they just bumped it up about 25%. Whee.

    It's not Moore's law. NVIDIA and AMD are holding back.

    [–]magnavoid 0 points1 point  (3 children)

    Of course they're holding back. It's all about maintaining their "upgrade path" for the foreseeable future. If they were to produce completely maxed-out hardware, the average gaming consumer would have little reason to upgrade as often as they do currently. It's all about money and stock prices.

    [–]VelveteenAmbush 7 points8 points  (0 children)

    Alleged collusive conspiracies between cutthroat competitors require much more evidence IMO than "its [sic] all about money and stock prices."

    [–]canttouchmypingas 2 points3 points  (0 children)

    I don't know, if AMD didn't hold anything back for once, they could take over the market, at least temporarily.

    [–]gattia 1 point2 points  (0 children)

    I agree with /u/VelveteenAmbush. Also, by throttling upgrades they are also throttling use for machine learning which they have to know is going to be a HUGE market in the foreseeable future.

    [–]smith2008 6 points7 points  (3 children)

    Moore's law is just part of the story. The important thing is how fast and how much data we can move around. In this regard, GPUs have picked up and are outpacing Moore's law quite a bit. So I would say "Moore's Law is dead, long live Moore's Law".

    [–]endless_sea_of_stars 3 points4 points  (1 child)

    Yes it is true that GPUs have benefitted more from die shrinks than CPUs. But die shrinks are becoming more difficult for each generation. I hope they can keep pulling memory and architecture tricks out of their hat.

    [–]smith2008 0 points1 point  (0 children)

    Yeah. I hope so too. Maybe something new will replace GPUs for AI applications. Dunno, TPUs are interesting, but I think the main strength of GPUs is how accessible they are. I mean, Alex Krizhevsky et al. were able to crack ConvNets (ImageNet, 2012) on a pair of standard gamer GPUs.

    [–]PLLOOOOOP 1 point2 points  (0 children)

    The important thing is how fast and how much data we can move around.

    And how efficiently we can do it! We can get more done on a 30W chip now than we could on a 150W chip a decade ago. Before then, total power consumption was climbing almost as fast as clock rates were, and power consumption per die area was exploding at an insane rate.

    With the right mix of normalizing factors (cache & bus performance, power efficiency, etc), Moore's exponential growth will continue for a long time.

    [–]__Cyber_Dildonics__ 2 points3 points  (1 child)

    Moore's law was about transistors, not performance.

    [–]endless_sea_of_stars 0 points1 point  (0 children)

    Eh, that is a can of worms you are opening there. You have the strict definition: the number of transistors doubles in density every two years. Luckily GPUs scale much better with transistor count than CPUs do, so while CPUs began to trail off, GPUs kept going strong. Unfortunately, as time goes on, each die shrink gets harder and harder.

    On the other hand you have the more layman's definition of computer performance increases exponentially over time. We're still growing exponentially but the exponent just ain't what it used to be.

    It would be interesting to see a chart of nVidia's $400 card performance improvements from 2004 to today. Compare the performance improvement of one generation to the next.

    [–]smith2008 2 points3 points  (8 children)

    What card do you have? I am pretty sure GPUs have changed massively in the last 2 years. Even if the 980 Ti/Titan X (Maxwell) were great, they fall short compared to these new beasts in terms of price/performance.

    [–]eazolan 0 points1 point  (7 children)

    NVIDIA GeForce GTX 980, launched September 18, 2014.

    Yeah, of course it will fall short of the brand new smaller die that are just coming out.

    But up to right now? Not a lot of change.

    [–]tomgie 1 point2 points  (1 child)

    A series behind you with the 780. Still strong :D

    [–]PLLOOOOOP 1 point2 points  (0 children)

    I'm rocking a 560Ti. ...Not doing so hot these days. 😬

    [–]smith2008 0 points1 point  (0 children)

    The 980 and 980 Ti are great cards. The new cards are not always better; for instance, the 980 Ti can keep up with the 1080 in some cases. But when you increase the batch size and model size it starts to crack. I have one 980 Ti which pulls 300W (water cooled) and I really feel it is pushing the boundaries of what is possible with that architecture, while the 1080 feels like it is just getting started. I think they will push really hard with the 1080 Ti and the difference will become really noticeable.

    [–]melgor89 0 points1 point  (3 children)

    Looking at the 1080 Ti, it is more than 2x faster, so it is in accordance with Moore's law. I have 2x 980s and I'm planning to replace them with faster GPUs. Changing them both to 1080 Tis will give ~4x faster training and 2.75x more memory. It is a really huge boost.

    [–]__Cyber_Dildonics__ 1 point2 points  (1 child)

    Moore's law is about transistor density, not performance.

    [–]melgor89 1 point2 points  (0 children)

    Yes, you are right. I checked the transistor density of both cards and here are the results: the GTX 980 has 5.2 billion transistors on 398mm² (about 0.013 billion transistors/mm²), while the 1080 Ti has 12 billion transistors on 471mm² (about 0.0255 billion transistors/mm²).

    So it looks like both performance and transistor density roughly doubled in two years.

    [–]carbonat38[🍰] 1 point2 points  (1 child)

    and if we look at cpus it is even more depressing by a million times.

    [–]eazolan 0 points1 point  (0 children)

    I dunno. As much as I'd like to see a major performance increase in CPUs, I think most software needs to be refactored these days.

    [–]lyomi 5 points6 points  (2 children)

    It has Int8 performance that is 4 times fp32!

    [–]markov01 0 points1 point  (1 child)

    what for? pretty useless

    [–]endless_sea_of_stars 2 points3 points  (0 children)

    There are numerous classes of calculations that don't need high precision.

    https://petewarden.com/2015/05/23/why-are-eight-bits-enough-for-deep-neural-networks/
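    Very roughly, the idea is to store weights/activations as int8 and accumulate the dot product in a wider type (the scales here are made up, not any real library's calibration scheme):

    ```python
    import numpy as np

    # Weights and activations stored as int8; the dot product is accumulated
    # in int32 and then rescaled back to float at the end.
    w_scale, x_scale = 0.02, 0.1
    W_q = np.random.randint(-128, 128, size=(10, 256)).astype(np.int8)
    x_q = np.random.randint(-128, 128, size=256).astype(np.int8)

    acc = W_q.astype(np.int32) @ x_q.astype(np.int32)  # wide accumulator, no overflow
    y = acc.astype(np.float32) * (w_scale * x_scale)   # dequantize the result
    print(y[:3])
    ```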

    [–]canttouchmypingas 3 points4 points  (0 children)

    And my pascal is arriving today............

    Guess I gotta return it and buy two of these instead. Good thing I didn't open the box.

    [–]VelveteenAmbush 1 point2 points  (3 children)

    Edit: this is wrong, I was confusing the Nvidia Titan X with the other Nvidia Titan X.

    Underwhelmed. As against the GTX Titan X, there's no improvement except on price. I bought my Titan X almost two years ago for $1000. This is 30% cheaper, 9% less memory, no other obvious compute advantages (am I wrong about that?). Arguably a marginal improvement on the whole, but not strictly superior, and the rate of improvement is not at all on the order of what I expected for the past couple of years. What's the deal? Am I missing something?

    Does Nvidia just lack competition in deep learning, or is there a fundamental reason why massively parallel compute has seemingly hit a wall? (Please don't just say "moore's law is over" unless you understand and can articulate why it should matter for massively parallel GPUs as opposed to single-core CPUs...)

    [–]infinity 3 points4 points  (2 children)

    This is being compared to the latest Titan X (Pascal), not the 2-year-old Titan X (Maxwell).

    [–]VelveteenAmbush 2 points3 points  (1 child)

    Ugh. Thank you, you're right.

    Will never understand Nvidia's branding strategy...

    [–]codechisel 0 points1 point  (0 children)

    I've made the same mistake. They do a horrible job with branding and versioning these products. The whole industry is like that IMO.

    [–]gattia 1 point2 points  (2 children)

    Anyone have any thoughts or opinions on what the quick release of this means for the timeline of a Titan X upgrade?

    The 1080 Ti definitely seems to be a no-brainer vs. the Titan X; you could buy 2 and train across both cards, which would nearly double your overall memory. But for overall memory, an upgraded Titan X with >12GB (16 or 24GB) would be awesome and would definitely be the path I'd go.

    [–]Fab527 0 points1 point  (0 children)

    They'll very likely not upgrade the Titan X until Volta, which should come in 2018

    the "ti" card is usually less powerful than the Titan, but this time they made it just as powerful as the Titan to counterattack AMD's Vega

    [–]EmetToMet 0 points1 point  (6 children)

    I bought 2 Titan X Pascals recently. Is it worth returning them and preordering 2 1080 Tis?

    [–]hereticmoox 5 points6 points  (2 children)

    Not if you're going to use them for ML.

    [–]EmetToMet 0 points1 point  (1 child)

    Is the Titan X Pascal better suited for ML than the 1080 Ti? It seems like their specs are similar enough. Does the 1 extra GB of VRAM make a big enough difference?

    [–]smith2008 0 points1 point  (0 children)

    Wait for benchmarks if you can. But I am pretty sure the prices of those Titans (Pascal) were inflated quite a bit, so even if they edge out the 1080 Ti it won't be by much. And in terms of price, 2 x $1200 == 3.5 x 1080 Ti ($700). IMO the 1080 Ti is going to be like the 980 Ti – spectacular.

    [–]MassiveDumpOfGenius 0 points1 point  (0 children)

    Seems like the 1080 Ti is just a binned Titan XP: it has exactly 11/12 of everything, with a slightly higher boost clock thanks to the better reference cooler.

    If the price difference makes sense to you, change them. Keep the Titan XPs if you want the best stuff possible (right now).

    [–]SuperFX -1 points0 points  (0 children)

    I would return them for sure. All indications so far are that the new 1080 Tis will be just as fast, if not faster, with only a slight hit in memory capacity.

    [–]chogall 0 points1 point  (6 children)

    Waiting for AMD Radeon Instinct...

    [–]Wootbears 0 points1 point  (3 children)

    Is there any information on release dates for that? I want to build a new rig sometime before the summer, so I'm hoping to find all the best parts by then.

    [–]chogall 0 points1 point  (2 children)

    In 1H. The whole AMD product release lineup is extremely interesting: 8-core CPUs at similar prices to Intel's 4-core parts, and passively cooled Radeon Instinct accelerators vs. NVIDIA's overheating style.

    p.s., AMD has always had a better architecture for accessing on-board memory; this is a very specific reason why NVIDIA never gained any traction in the mobile space.

    [–]Wootbears 0 points1 point  (1 child)

    What setup are you using now? Right now I'm using my old gaming rig with an nvidia card (980), but the pc itself is starting to slow down. I used to run an AMD setup, but I realized that NVIDIA generally seems to have better support. But it will be interesting if AMD cards have cuda support out of the box, and if it's simple to set up something like TensorFlow-gpu.

    [–]chogall 0 points1 point  (0 children)

    Right now I am on an i7 3770k, 32GB RAM, and a 1060 w/ 6GB RAM.

    CUDA is NVidia's library. AMD is working hard on HIP/GPUOpen, which has a CUDA transpiler, so I believe it will be available at launch w/ decent support. AMD is no child in this game, and they are targeting Intel's and NVidia's cloud computing monopoly.

    [–]gwern[S] 0 points1 point  (1 child)

    To point out, as I do every time someone mentions AMD in a deep learning context: who knows what the driver and library support situation will be like. If you waste a week dealing with OpenCL issues or writing your own backend or something, you've easily wasted the price difference between a 1080 Ti and the AMD equivalent.

    [–]chogall 0 points1 point  (0 children)

    Not going to OCP this year (it starts tomorrow), but if anyone goes, please share how the field looks for AMD!

    [–]koobear 0 points1 point  (0 children)

    If I don't get into grad school, I'll be splurging the money I've been saving up. Maybe I'll pick up a few of these along with the 1800X and build my personal ML server.

    [–][deleted] 0 points1 point  (0 children)

    Great!! I will try two.

    [–]thecity2 0 points1 point  (3 children)

    Hey, guys it's coming in over $700, what can we do to get in under that price point? Thinking, thinking...Uh, do we need 12 GB? I mean 11 GB is pretty much just as good right? Yeah, 11 GB is cool. Alright let's ship it!

    [–]xzxzzx 12 points13 points  (2 children)

    It's almost certainly due to the binning process. The ROPs, total RAM and RAM bus width are all cut down by exactly 1/12.

    [–]omnipedia 0 points1 point  (1 child)

    What do you mean by binning process? Marketing?

    [–]xzxzzx 9 points10 points  (0 children)

    The chip design used in the 1080 ti is the same as the Titan. "Binning" is the process of testing chips and putting them into different categories (bins) depending on how flawless they are. The 1080 ti uses chips where part of the chip has been disabled.

    [–]wisp5 0 points1 point  (4 children)

    Does anybody know if the 1GB less memory than the Titan is going to be the difference between 64- and 128-sized mini-batches? For ResNets / batchnorm-heavy architectures, can't that lead to a not-insignificant hit in performance?

    [–]darkconfidantislife 2 points3 points  (0 children)

    Yes, it can, hence why they did it.

    [–]fldwiooiu 3 points4 points  (2 children)

    why would 8% less memory lead you to reduce batch size by 50%? That's just stupid. Batch size being a power of 2 is a dumb holdover from the cuda_convnet days.

    [–]wisp5 -1 points0 points  (1 child)

    An 8% reduction in batch size can be damaging to the performance of lots of network architectures. It's not "stupid"; batch size is a hyper-parameter that can severely affect things. The reason I'm asking is because I have a chip with greater memory than the 1080 Ti, so I don't want to be bottlenecked if I get a Ti and parallelize.

    [–]fldwiooiu -1 points0 points  (0 children)

    An 8% reduction in batch size can be damaging to the performance of lots of network architectures.

    bullshit. source or gtfo.

    and again, why would you possibly think

    1GB less memory than the Titan going to be the difference between 64 and 128 sized mini-batches

    128 vs 117 might make sense, but if you're blindly picking a power of two for your batch size, I really doubt it's perfectly aligned to max out your 12GB of memory.