all 139 comments

[–]Xorlium 86 points87 points  (8 children)

They seem to misunderstand what data leakage is...

I didn't understand the rest of the discussion without reading the paper, but their reply on point 1, about data leakage, tells me they didn't really understand your point.

[–]avaxzat 47 points48 points  (3 children)

They even consistently put the term data leakage between quotation marks, as if it's not even a real thing. These people are clearly not serious data scientists.

[–]po-handz 26 points27 points  (2 children)

At this point it's questionable if they're even serious scientists, as real scientists welcome constructive criticism and consult statisticians/data scientists prior to publishing in top journals

[–]tarquinnn 14 points15 points  (0 children)

real scientists welcome constructive criticism and consult statisticians/data scientists prior to publishing in top journals

HAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHA but seriously, this is certainly not the case in biology. I think we should call this so-called "real science" out for what it is.

[–]news_main 7 points8 points  (0 children)

There is so much ego in a review process (both from the lab and reviewers that may be in the same field), worst part about science.

[–]critplat 3 points4 points  (2 children)

The discussion here perplexes me. Won't there essentially always be sources of data leakage from uncontrolled/unknown latent variables due to omitted variables or implicitly conditioning on certain contexts in the problem specification? E.g. I'm not sure if either analysis accounted for which fault line two locations are at or on which tectonic plate. There's no doubt that removing earthquakes occurring at the same location is a more severe test, but also there is no end to criticisms like this you could make of basically any analysis. Am I missing something?

[–]Xorlium 24 points25 points  (1 child)

No. Data leakage is when you are using some information that would NOT be available for prediction when you actually need to make a prediction. The usual example is that data about 2018 would not be available in 2017. In this case, data about an earthquake was (allegedly) used to train a model that then was used to predict things about that same earthquake. But if you want to predict something about a future earthquake, you wouldn't have information about that same earthquake. Maybe in this case it doesn't matter, I don't know, but I think the point is that you don't know if it matters...

How to validate properly is an important and non-trivial topic. Many Kaggle contests have been ruined by data leakage, for example.

[–]critplat 3 points4 points  (0 children)

Got it. Thanks for clarifying that. I was thinking it was a statistical independence issue.

[–]Deto 79 points80 points  (4 children)

People are focusing on the authors, but IMO the blame rests equally with Nature. People pay big money to access their content, so they need a review process that better ensures faulty methods are adequately screened.

[–]suddencactus 18 points19 points  (1 child)

Agreed. It reminds me of the story this week about Science refusing to publish a replication failure for a well-known study they had published earlier.

It seems many of these top journals emphasize Popular Science-style flashy headlines over balanced, critiqued results and honesty.

[–]Deto 12 points13 points  (0 children)

I think part of the problem is just not having an adequate review process for interdisciplinary work. They probably sent the paper out to experts on earthquakes, but nobody who knew machine learning really well.

[–]beginner_ 14 points15 points  (1 child)

True but most here I assume have lost their faith in peer-review years ago.

[–]whymauri 15 points16 points  (0 children)

I mean, reviews in ML are orders of magnitude more broken than Science/Nature. The main benefit is some conferences have open reviews, which I personally think is wonderful.

[–]sensetime 177 points178 points  (16 children)

I found the response from the authors to be more condescending than this critique.

The comments raised the issue that much simpler methods can achieve pretty much the same results, highlighting the need to do proper ablation studies. The final paragraph of the response basically said "we are earthquake scientists, who are you?" and told Nature they would be disappointed if these comments were published.

Why aren't these concerns worthy of publication in Nature? Why should they be censored? Wouldn't publishing them lead to more healthy scientific discussion? They are not unique as there are follow up articles with similar concerns.

I dunno, if I was reviewing this paper for an ML conference, I would have similar concerns. At least demand some ablation studies.

[–]darchon30704 63 points64 points  (11 children)

I've read their (Phoebe DeVries and Brendan Meade's) emotionally charged response to the Nature editors. While I don't know the context of the comments given by those editors, it's safe to say that this is a VERY immature way of accepting critique.

While I do despise Nature for:

a) Making the publication and review process more difficult

b) Refusing to acknowledge reproducibility (and wasting researchers' valuable time on fraudulent publications)

c) Hypocritically sitting behind a $200 paywall while promoting open access

there's no way around a prestigious monopoly. The authors should have just accepted these harsh critiques and moved on. It's not like they're afraid that their results aren't reproducible anyways, amiright?

[–][deleted] 9 points10 points  (7 children)

Does sci-hub work with Nature? Just so you know, that's $200 saved.

[–]darchon30704 8 points9 points  (6 children)

Yes. But with a) Net Neutrality gone, b) tensions between Russia and the West rising, c) mergers of companies that own Nature, there's an incentive to block scihub in the US in the near future. Quote me in 2030 when scihub is blocked, and we'll have to resort to VPNs.

[–][deleted] 8 points9 points  (0 children)

Laughs in proxies. :p

[–]csreid 4 points5 points  (3 children)

with a) Net Neutrality gone, b) tensions between Russia and the West rising, c) mergers of companies that own Nature, there's an incentive to block scihub in the US in the near future

Unless the ISP owns Nature or the gov't steps up with some huge overreach, then no, there really isn't, and even then, the ISP owning Nature and blocking scihub is pretty much a clear-cut case for abuse of monopoly power that would result in some monopoly busting.

[–]Spoofied 0 points1 point  (2 children)

It's possible Nature would lobby the government to block it as a foreign agency or for IP theft.

I would imagine the government isn't obligated to protect Russian/Kazakh copyright infringement.

[–]lostmsu 2 points3 points  (1 child)

AFAIK, the US government does not censor residents' access to the Internet in any way whatsoever. And it does not have jurisdiction outside US soil, so it can't prevent it from being hosted.

See The Pirate Bay history.

[–]Spoofied 0 points1 point  (0 children)

You seem to be correct. I can't think of a single example at all.

[–]Spoofied 0 points1 point  (0 children)

2030? I see you're a glass half full kinda guy.

[–]themiro 5 points6 points  (1 child)

The authors should have just accepted these harsh critiques and moved on

What? That is the opposite of how science works.

Honestly baffled by this entire thread

[–]murukeshm 8 points9 points  (0 children)

I suppose that's to be understood as "acknowledge the critique is valid (accept them), and incorporate them in future work (move on, instead of remaining on the old work)."

[–]peterballoon 0 points1 point  (0 children)

I work in a related field. Actually, most of the new methodological innovations in geophysics are NOT published in Nature/Science, but in more technical professional journals. The problem with earthquake science is that, in such a small scientific community, the editors of Nature/Science and a few senior high-profile scientists completely monopolize the evaluation of how innovative new discoveries are. There is no way to break their clan.

I feel sorry for Nature for not recognizing these major problems in this paper. If this were in another, more competitive subject, the editors wouldn't be so stubborn. That's why the younger generation in our field no longer cares about high-impact journals such as Nature.

[–]se4u 7 points8 points  (0 children)

minor correction s/abolition/ablation/g

[–]msdrahcir 4 points5 points  (0 children)

> we are earthquake scientists

From Harvard \s

[–]pwnersaurus 85 points86 points  (17 children)

Personally I think a mistake Shah made was in also calling out the fact that simpler models could do a comparable job, which meant that his criticism lost a lot of focus. That specific issue doesn't invalidate the paper, it would be something that would be more suited to a separate article, exactly like what Mignan and Broccardo have done.

Nonetheless, the argument of the paper as laid out in the authors' response is confusing - their argument seems to be that the maximum change in shear stress and the von Mises yield criterion are useful quantities because a neural network gives the same accuracy as them. If the AUC scores for these non-ML-based methods are only interpretable relative to the neural network, then it's important for the neural network to be implemented correctly. On the other hand, if the purpose is as the referee states

Instead, the paper showed that a relatively simple, but purely data-driven approach could predict aftershock locations better than Coulomb stress (the metric used in most studies to date) and also identify stress-based proxies (max shear stress, von Mises stress) that have physical significance and are better predictors than the classical Coulomb stress. In this way, the deep learning algorithm was used as a tool to remove our human bias toward the Coulomb stress criterion, which has been ingrained in our psyche by more than 20 years of published literature.

then one could simply compare the AUC scores for max shear stress and von Mises stress to Coulomb stress and conclude that they were better predictors, without involving a neural network at all. In that respect, it doesn't matter one bit what the neural network does, and that seems to be the main reason for the responses being what they are. Personally, I don't understand the fascination with publishing neural network results like these, but they do seem to be of interest to specific research communities (not just earthquakes, in many other domains too) and the relevance of the results is left to the journal and the reviewers, and isn't something that would warrant publishing amendments for.
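
To make that concrete, a minimal sketch of the neural-network-free comparison (the file and column names are made up, and whether to take absolute values depends on each metric's sign convention):

```python
# Compare scalar stress metrics directly by AUC; no neural network involved.
import pandas as pd
from sklearn.metrics import roc_auc_score

cells = pd.read_csv("stress_metrics.csv")  # hypothetical: one row per grid cell
y = cells["aftershock"]                    # 1 if an aftershock occurred in the cell

for metric in ["coulomb_stress", "max_shear_stress", "von_mises_stress"]:
    auc = roc_auc_score(y, cells[metric].abs())
    print(f"{metric}: AUC = {auc:.3f}")
```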

Overall, I think it would have been more useful for Shah to

a) Clearly indicate the amendments that should be made to this work, e.g. updating the AUC and explained variance values in the paper

b) Write up his broader critique and post it on arXiv or similar

But lastly, one thing my PhD supervisor often liked to point out is that top journals like Nature and Science have a relatively high rate of publishing results that later can't be reproduced or are found to be flawed in some way. They may be the most prestigious journals, but that doesn't actually mean they are the most scientifically rigorous.

[–]beginner_ 25 points26 points  (11 children)

Personally I think a mistake Shah made was in also calling out the fact that simpler models could do a comparable job, which meant that his criticism lost a lot of focus. That specific issue doesn't invalidate the paper, it would be something that would be more suited to a separate article, exactly like what Mignan and Broccardo have done.

No, it just adds to the issue. They used AI/neural networks to get maximum exposure because it's all the hype. There was no reason to use it. Using a much, much more complex model with identical performance is always an issue (overfitting...), and that issue can often only be seen in real life with new data (even without leakage).

It's simply another strong indication that they have zero clue about proper modeling. The goal should be to predict aftershocks, regardless of how. That is science. What is attention-whoring is using a hyped tool because it gets you attention. That article would have been buried in an obscure journal had they used Random Forest.

[–]pwnersaurus 22 points23 points  (3 children)

I agree that it's a poor use of ML, but the point is that by calling out those issues at the same time, Shah let the authors and journal dismiss him entirely: a major part of their response was that those particular points didn't affect their main findings, i.e. it didn't have to be the simplest method. I think there's a decent chance he might have received a different response had they not had those points available to include in their rebuttal - they only served as a distraction from getting what Shah actually wanted (for them to fix their analysis).

Also, "The goal should be to predict aftershocks, regardless how" is actually not the point they are making (and I don't think it is it of great scientific interest either, it may be important if you're issuing earthquake warnings, but if your neural network can perfectly predict earthquakes but you don't know how, then you haven't actually learned anything about earthquakes. This is by far the biggest problem I have with many applications of neural networks in scientific research, it doesn't actually improve understanding. Getting high model accuracy might be the end game for CS research, but its only a means to an end in other domains). If I understand them correctly, their point is that the neural network is correlated with other, simple, physically-based metrics that are for historical reasons not widely used. So they are advocating using those methods instead on the basis that perform similarly well to deep learning, and much better than traditional methods. I don't understand why they need deep learning to make that point though.

[–]smurfpiss 4 points5 points  (1 child)

I completely agree with you. Shah basically diluted his argument. Had he just focused doggedly on the issue of data leakage, they'd in turn have to focus on their lackluster response.

[–]m-cdf 0 points1 point  (0 children)

I am not sure. "Occam's razor" is not just a matter of "hygiene"; it is tightly coupled with generalization bounds (https://towardsdatascience.com/generalization-bounds-rely-on-your-deep-learning-models-4842ed4bcb2a). It is indeed a concern, IMHO.

[–]beginner_ 4 points5 points  (0 children)

but if your neural network can perfectly predict earthquakes but you don't know how, then you haven't actually learned anything about earthquakes

No, but it would have saved millions of lives.

This is by far the biggest problem I have with many applications of neural networks in scientific research, it doesn't actually improve understanding

True, and even more so for neural networks than for RF. My default position is that if you need an ANN (or even RF), your problem is fundamentally too complex to derive "simple" rules from the model - "simple" as in a human expert can understand them.

EDIT: This means I agree with you, that using DL here doesn't add anything new. And it's also obvious to me that you would need extremely complex data and model to model this properly. Aftershocks and their spatial distributions will logically greatly depend on the geology of the affected region. And said geology can vary within small areas. And so forth.

[–]coolpeepz 6 points7 points  (1 child)

Yeah honestly I hate all of the “We brought AI to ___” articles. The goal should be to solve a problem, not to use “AI”. If I’m buying a product, I just care that it works.

[–]Toast119 0 points1 point  (0 children)

It did work though. And this isn't a product?

[–]Toast119 -2 points-1 points  (3 children)

There was no reason to use it.

I mean, it worked...?

This just reeks of some weird minimalist elitism.

[–]po-handz 2 points3 points  (1 child)

No it didn't really work considering they mixed train/test sets (I believe). The minimalist argument is valid if the authors are incapable of implementing more complex algorithms

[–]Toast119 0 points1 point  (0 children)

But even if that were true (which I don't think the case for is strong), that would impact any ML method.

[–]fridsun 1 point2 points  (0 children)

There was no reason to use it.

refers to the problem that the deep learning model in the paper is useless for proving the most important hypothesis one can derive from the paper, namely that

the maximum change in shear stress, the von Mises yield criterion (a scaled version of the second invariant of the deviatoric stress-change tensor) and the sum of the absolute values of the independent components of the stress-change tensor

are each a much better predictor than the more widely used

Coulomb failure stress change

Notice the above hypothesis is not the one presented in the abstract of the paper. Instead, the abstract demonstrates that the authors mistakenly took the correctness of their deep learning model for granted and chose to evaluate the physical predictors against the model instead of against the actual data. As it turns out, their model was not correct, because their testing data was not independent of their training data, contradicting what the authors describe in the abstract:

We show that a neural network trained on more than 131,000 mainshock–aftershock pairs can predict the locations of aftershocks in an independent test dataset of more than 30,000 mainshock–aftershock pairs more accurately (area under curve of 0.849) than can classic Coulomb failure stress change (area under curve of 0.583). [emphasis mine]

and therefore breaking the logic chain they have relied on for proof.

What's above does not mean that the material in the paper is insufficient to prove a useful hypothesis. Unfortunately, in the authors' and the editor's responses, they have conflated sufficiency of material with correctness of proof, to the point of dismissing the relevant data science expertise of a fellow scientist on the grounds of earthquake expertise. The authors' contribution is no excuse for such mistakes.

There was no reason to use it.

does not refer to the fact that simpler ML models suffice.

Had the authors presented their discoveries primarily in terms of the earthquake science domain and kept the DL model to exploratory use and out of the logic chain for proof, I do not believe ML researchers would obsess over the data leakage. But anything that a logic chain for proof consists of should be subject to the highest standard of academic scrutiny.

[–][deleted]  (3 children)

[deleted]

    [–]fridsun -1 points0 points  (2 children)

    The paper is trying to find a better forecast for aftershocks following a major earthquake. Some uses are e.g. more precise warnings and predictive rescue planning.

    [–][deleted]  (1 child)

    [deleted]

      [–]fridsun 0 points1 point  (0 children)

The authors are satisfied that their DL model outperforms the commonly used domain metric and that it can be explained by (i.e., performs on par with) some of the not-yet-commonly-used domain metrics. Neither the fact that they made a mistake in their model nor the fact that they are satisfied with a model on par with domain metrics means their goal isn't finding a better forecast than is currently commonly available.

      [–]AyEhEigh 141 points142 points  (18 children)

      "We admit we used data from the same earthquakes in both the training and testing sets but that doesn't matter because we're smart earthquake scientistsᵀᴹ." - DeVries, et al.

      [–]0_Gravitas 39 points40 points  (10 children)

That's sort of what I got out of their response.

They don't seem to grok that the ability to predict B (magnitude, location, and time, I assume) and its child aftershocks from the stress field after shock A could quite easily imply that the network is implicitly modelling how the entire stress field evolves in that particular earthquake and flagging where the features of that stress field's evolution predict an aftershock. In which case, giving it inputs of B's stress field would give similar results, because it's very similar information.

      The real root of their problem though is that they just don't know how to partition a dataset. As scientists (people who presumably work with statistics very frequently), they should understand some of the basics, like how your randomly selected control group and experimental group should be statistically independent! The same principles of fairness in partitioning apply quite well to training and validation groups.

      (Sorry, for the rant. Scientists fucking up the basics of sampling is a pet peeve of mine)

      [–]GermanAaron 5 points6 points  (4 children)

      Stupid question but how can test and training set be statistically independent? Isn't the entire idea of ML that given the training set, you can make predictions for the test set? This does not seem independent to me, but I'm probably making a mistake somewhere.

      [–]TheFlyingDrildo 13 points14 points  (0 children)

The i.i.d. assumption (independent and identically distributed) is specifically what underlies this. At a particular level of the hierarchy (here the earthquake, rather than its individual time points), every unit is an independent draw from the same underlying distribution. So since the units in the test set are from the same distribution, we can make inferences about them, even though they are independent draws within that distribution.

      [–]hughperman 19 points20 points  (1 child)

Independent in the sense that they are not from the same source (i.e., not the same patient, or the same earthquake), so that during training you are not modelling the specifics of that source instead of the class, and then falsely inflating accuracy during testing because the model can recognize that a sample comes from a known source by its specific signature rather than by a class-wide property.

      [–]GermanAaron 5 points6 points  (0 children)

      Thanks, that makes sense.

      [–]suddencactus 0 points1 point  (0 children)

Part of the point here is that there are a lot of latent variables or unknown parameters driving the predicted quantities. You want to partition your dataset to separate as many of these as possible: not just mainshock, but also year and even location if you can. This is especially true when selection bias is being applied to one of those latent variables, as it is for time.

Think of the predicted variable as a Bayesian network driven by latent variables A and B, with E as noise, so you're trying to fit a model to P(y|a,b). If you don't partition carefully, combinations of A, B, and E are shared between the test set and the training set, and your model ends up conditioned not only on A and B but also on E, i.e. P(y|a,b,e). This is usually magnified when your model is overfit. In reality you're not basing the model directly on the latent variables A and B but on the observables X_n = f(a,b,e), which only makes debugging the data leakage more confusing. To get externally valid results, you have to draw independently over as many latent variables as possible, to remove spurious connections and keep meaningful ones.
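
As a rough sketch of partitioning over such a latent variable with scikit-learn, assuming one row per (mainshock, grid cell) pair; the file and column names here are placeholders, not the paper's:

```python
# Grouped split: every cell belonging to a given mainshock lands on the same side
# of the split, so quake-specific latent factors (A, B above) can't leak into test.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.read_csv("mainshock_cells.csv")
feature_cols = ["max_shear_stress", "von_mises_stress", "coulomb_stress"]
X, y, groups = df[feature_cols], df["aftershock"], df["mainshock_id"]

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))
```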

      [–]AyEhEigh 14 points15 points  (3 children)

Yeah, but I'm betting that when these earthquake experts were refining their craft in school, the only training they got for dealing with data was traditional stats, not the ML techniques we have now. The problem is that a lot of traditional stats techniques are quickly being replaced by modern ML techniques--so much so that the American Statistical Association recommends not using p-values in scientific journal papers anymore. I'm also betting that hypothesis testing and p-values are the vast majority of what they've done with their data in the past (that and regressions).

What kills me is that they achieved higher accuracy on the test set than on the training set and that didn't set off any alarm bells at all. How naive do you have to be to think that deep learning is such a magic black box that it can model data it's never seen better than the data it has seen?

      [–]NOTWorthless 20 points21 points  (2 children)

Uh, what? I assure you the ASA has reached no such conclusion about p-values, and to the extent that there is controversy surrounding p-values it is for reasons that are decades old. The ML community has contributed nothing to that discussion and proposed nothing to replace them. If you think ML is replacing traditional stats it is because you are buying into your own field's hype too much. Statistics is about much more than just raw prediction, and with the ML hype in full swing it is more important than ever to have a strong statistical foundation for your conclusions. For most problems of interest to statisticians, the predictive issue is completely orthogonal to what we are doing.

      [–]TheFlyingDrildo 22 points23 points  (1 child)

      For anybody reading this, what the ASA actually officially recommended against was using the phrase 'statistically significant' or any associated markers of significance such as stars, p-value < 0.05, etc...

      [–]themiro 0 points1 point  (0 children)

      oh that's obviously the right move

      [–]frostbird 1 point2 points  (0 children)

      Seriously, this is like the first thing you learn about using machine learning.

      [–]beginner_ 8 points9 points  (6 children)

Yeah, that shows the sad state of academia. The scary part is that Google, probably Google Brain, was involved. Shouldn't they have realized this?

      [–]AyEhEigh 14 points15 points  (5 children)

      I'm betting Google threw a couple interns or junior analysts their way just to get the PR for it. Honestly, if you think about it, a junior analyst at the Googs should've been able to knock something like this out easily enough because it's a pretty straightforward application. However, this seems like a situation where it's hard to pick out the potential for data leakage without a solid understanding of the actual data--and the people who actually have a solid understanding of the data are lost in the sauce on the ML front.

      [–]beginner_ 6 points7 points  (2 children)

      I'm betting Google threw a couple interns or junior analysts their way just to get the PR for it.

Maybe. I thought Google Brain was only for people who had already achieved something within Google or for major experts, not for junior analysts.

      [–]thundergolfer 1 point2 points  (1 child)

      They have residents though, which are like interns.

      [–]NotAlphaGo 5 points6 points  (0 children)

      Try to find good ML people in geoscience.

      TLDR; It's hard

      [–]b1946ac 2 points3 points  (1 child)

      Surprisingly not. Brendan Meade is a Staff Research Scientist at Google Brain and Fernanda Viegas and Martin Wattenberg (co-authors on the Nature paper) are Senior Staff Research Scientists at Google Brain. Which is shocking right?

      [–]AyEhEigh 0 points1 point  (0 children)

      Definitely unexpected.

      [–]iiiooou 29 points30 points  (3 children)

      Seems like they believe deep learning gives unbiased results automagically so why care about data leakage

      [–]maxToTheJ 61 points62 points  (2 children)

      Sounds like they are 80% ready to start a DL startup.

      [–]tdgros 11 points12 points  (0 children)

      leakAIge

      [–]po-handz 5 points6 points  (0 children)

      shut up and take my VC money!!!!

      [–]ApeOfGod 24 points25 points  (0 children)

Notebook showing better performance on the test set than on the training set. Nice refereeing, Nature, great job!

      [–]inarrears 15 points16 points  (1 child)

      Here's the previous discussion about the Nature paper on this subreddit: https://www.reddit.com/r/MachineLearning/comments/9bo9i9/r_deep_learning_of_aftershock_patterns_following/

      Also, the PDF (no paywall) link to the paper (hosted on China University of Geosciences :): http://www.cugb.edu.cn/uploadCms/file/20600/20181225095255165239.pdf

      [–]ruixif 12 points13 points  (2 children)

There was a similar and ridiculous publication on using ML models for chemistry last year. link Obviously these authors know nothing about ML or statistics. I don't think they really understand even chemistry. What they did was just use some R APIs.

After that, a serious comment from professional data scientists was published. Those guys did a clean ablation study on the original data set. comment

Ablation studies are definitely necessary for these ML applications. More generally, for any experimental science.

      [–]HateMyself_FML 0 points1 point  (1 child)

      I completely agree. I was very surprised to see that paper in Science.

      [–]ruixif 0 points1 point  (0 children)

      There was

      Ironically there was a similar paper published earlier from the same group, with the same topic and almost the same (misused) methods. lol

      https://pubs.acs.org/doi/10.1021/jacs.8b01523

      [–]glockenspielcello 18 points19 points  (4 children)

What I took away from point 1 of their response was that they included e.g. the input-output pair (Mainshock A, {Mainshock B + other aftershocks}) in their training set, and then (Mainshock B, {other aftershocks}) in the testing set, but didn't include any of the same shocks as inputs across their training and testing sets. This seems legitimate to me at first glance (although it's perfectly possible that there are other good practices for modeling data with this kind of hierarchical structure that I'm simply ignorant of) - it's fine if your function output takes on some of the same values in its range across the training and testing sets, just as long as the values in the domain are disjoint.

      Can someone explain to me if/why I am wrong on this point? Also, is this what they actually did in the paper, or did I misinterpret the authors' responses/did they misrepresent their methodology?

      [–]serge_cell 10 points11 points  (2 children)

As I understand it (not necessarily correctly), the problem is that Mainshock A and Mainshock B happen close together in space and time and are therefore correlated. It's similar to using close-in-time frames from the same video in both the training and testing sets.

      [–]glockenspielcello 7 points8 points  (1 child)

      Seems like a reasonable explanation. The fact that the test AUC is higher than the training AUC seems pretty damning.

I wish that there were clearer standards for how to handle cases like this, because this seems like a reasonable error to make as a practicing scientist even if you are following 'best practices' like group partitioning (which, if the aforementioned understanding is correct, the authors did implement, contrary to what was alleged in this blog post). Yes, these models probably overfit to the specific spatial and temporal conditions, but where do you draw the line when you have similar latent features like these? This is a pretty general question, and it's problematic because any time you draw a line in the sand around something like this, you introduce an element of human subjectivity into your model-building process. Since the whole goal of this exercise is to encourage consistent, rigorous practices for data science, it seems like this is a problem that requires some careful attention.

      I'm curious as to how people solve this problem in the example of video data that you gave. Is there a principled way of choosing a sufficient temporal distance between training and testing frames in video processing tasks? Or is it mostly arbitrary/at experimenter's discretion?

      [–]breadwithlice 2 points3 points  (0 children)

In the end the choice of your test set depends on what you want to do with the model you develop. If the goal is to predict the aftershocks of any new earthquake using past data, then in my opinion the same temporal order should be respected in the train/test split. I'm wondering why they did not use, for instance, a training set with earthquakes from 2000-2012 and a test set from 2014-2017 (potentially leaving a temporal gap to avoid correlations) - see the sketch after this comment. This way you establish that a model trained on past data is useful, which would be the goal of such a tool.

As far as I could tell, their argument is that the stress changes (the inputs to the neural network) of Mainshock A and Mainshock B are not correlated at all, even if Mainshock B is an aftershock of Mainshock A. Perhaps - I'm not an earthquake scientist - but then again, why take the risk? Even if the correlation is low, the model can still take advantage of it and report an unrepresentative accuracy on the test set. It's all about having a test set representative of the situation the model is destined for, in order to evaluate its real usefulness.

For the video topic: if you randomly include lots of pictures and video frames of dogs/cats to build a dog/cat image classifier and randomly distribute them across train/test sets, you might end up with a frame at time 0s in the train set and a frame of the same video at time 0.05s, which is most likely very similar, in the test set. This will boost the accuracy on the test set, misleading you into thinking you did a great job, whereas your image classifier might actually perform poorly. I haven't seen this specific scenario happen (I haven't built many dog/cat recognizers from video), but I've seen it a few times with duplicate data points ending up in both training and test sets in the context of NLP.
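
To make the chronological idea concrete, a rough sketch (the year column, the cutoffs and the one-year buffer are illustrative assumptions, not values from the paper):

```python
# Sketch of a chronological split with a gap year, assuming each row carries the
# year of its mainshock. File and column names are hypothetical.
import pandas as pd

df = pd.read_csv("mainshock_cells.csv")
train = df[df["mainshock_year"] <= 2012]   # fit only on older events
test = df[df["mainshock_year"] >= 2014]    # evaluate on strictly later ones
# 2013 is deliberately left out as a buffer, so late-2012 aftershock sequences
# can't bleed into the test period.
```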

      [–]Toast119 2 points3 points  (0 children)

      Yeah I have the same understanding and am slightly confused as to how this is data leakage. Especially considering that you'd have to have an understanding of earthquakes to even conclude these events are dependent.

      [–]luaudesign 5 points6 points  (0 children)

      It's because of this stupid publishing culture that science has a credibility crisis.

      [–]galacticlunchbox 5 points6 points  (3 children)

      I sincerely commend your efforts on this. This doesn't happen enough and it needs to. The journal's response to your comments was disgraceful, IMO.

      I can relate strongly to your situation. Back when I was working on my PhD I became interested in a particular narrow subfield of machine learning application and began to do a lit review to see where I could make a contribution. I was subsequently blown away to find that literally every paper published in this area over the last decade had one or more experimental design flaws that led me to question their optimistic results...training set contamination, confounding variables, etc. Like you, I knew I couldn't let it go, so I ended up writing a paper on my findings, reproduced some of their experiments but using proper methodology and showed that the results were significantly poorer than those previously published. My goal wasn't to hurt anyone's feelings, but rather to hopefully help push the research in the right direction. Fast forward a handful of years and this is now one of my most cited papers.

      [–]hoolahan100 6 points7 points  (0 children)

      It is really unbelievable that a reputed journal would respond in this way.

      [–]johnnydozenredroses 10 points11 points  (0 children)

      I haven't read the paper, only your critique and their response. I think theirs is very sloppy work, but what pisses me off is that Nature's response is "Hey, the lay public won't understand the critique, so we'll do nothing". At the very least ask them to update their paper addressing the criticism.

      [–]TheOverGrad 4 points5 points  (1 child)

      Great video by Arnaud Mignan going over his complementary critique of the approach:

      https://www.youtube.com/watch?v=k8ciiViRqPA

      [–]nikitau 0 points1 point  (0 children)

I'll start this comment by disclaiming that I haven't finished reading the original paper, but Arnaud Mignan's analysis raised a few questions for me. Considering his good results with just two relatively simple engineered features (I don't think either of them is time-dependent), would this dissipate some of the fears of data leakage in the original model? Also, if so, considering the authors' curious reply regarding their holdout sets, could test set contamination be the source of potentially overly optimistic results, seeing as Arnaud uses the original splits in his notebooks?

      I also think Arnaud's result, assuming the splitting is done properly, is really more interesting in terms of interpretability than the original heavily over-parametrized model.

      [–]vicks9880 5 points6 points  (2 children)

There is a race to publish articles in every university and every college course nowadays. I see students who have just started with deep learning trying to push their so-called research with huge overfitting. Sadly there is no private leaderboard there like on Kaggle!

      [–]despitebeing13pc 0 points1 point  (1 child)

      My perspective, and I'm a complete noob to this field, but curiously following it.

There are nothing but Machine Learning Scientist positions all over the UK; every single major company is doing something. ARM now has a neural processing unit, Intel has something, Imagination fleshed something out. None of these are even selling. And then there is Graphcore, which shook down everyone they could find in Cambridge for their salary expectations in the spring and then just announced a Cambridge design center last week.

      Every recruiter has 10000000 machine learning positions to fill too.

Every position is focused on throwing together models and building neural networks because it's still the R&D phase, but I expect that when it comes to "what can we practically do better with this...", the whole bubble is going to burst.

      [–]Maplernothaxor 0 points1 point  (0 children)

      What do you mean by shook down everyone? I’ve never heard the phrase before

      [–]Er4zor 4 points5 points  (0 children)

      These kinds of exchanges should be rendered public more often!

      One may disagree with the authors, the critics or the editor, but it always creates a good opportunity to open a discussion across fields (and journals).

      [–]dampew 7 points8 points  (2 children)

      Ultimately I more or less agree with the referee's comment: https://github.com/rajshah4/aftershocks_issues/blob/master/correspondence/Nature_Referee_Comments.md

      But I'm glad Rajiv's improvements have been made public.

      Mostly I lament that our journal system is inflexible to the open-source nature of the modern era. Imagine you had added a layer and gotten a 10% improvement in AUC -- would that be worth publishing in Nature? Probably not, right, but it should be made public somewhere?

      I also want to thank the OP for this interesting post and case study, it's always good to keep these issues in mind.

      [–]serge_cell 5 points6 points  (1 child)

      The referee comment make sense with one major exception:

      potential data leakage between nearby ruptures is a somewhat rare occurrence that should not modify the main results significantly.

There shouldn't be any "should not modify" in a scientific context. If the claim is made that the result will not change, there should at the very least be some numeric estimate of why that is so - for example, if the network has a Lipschitz property and the dataset is balanced. In the general case I can easily imagine a small part of the results carrying disproportionate weight in an imbalanced dataset. I'm not saying that is the case for this paper, but those estimates should be made before a referee comment like that.

      [–]dampew 2 points3 points  (0 children)

      Yeah 100% agreed. It's actually the crux of the matter and it shows that Rajiv didn't do a good job of writing the comment. "Should not modify" is fine if the effects are clearly small, but it's not clear in this case, and it's entirely possible that this is the only thing it's learning from. There is little reason to care about methodological weaknesses unless they have the potential to change the message of the paper.

      I think Rajiv would have done a better job if he had been less confrontational about it and less hung up on the details. A comment more along the lines of, "There is an uncontrolled methodological weakness in the paper. This is the problem. This is how to fix/improve it. Fortunately, it seems as though the problem had little impact on the overall message of the manuscript."

      I think the bigger lesson for everyone here is that how you communicate with people can be just as important as what you're trying to say.

      [–]kastilyo 3 points4 points  (0 children)

Great post. How can one learn to be responsible when implementing predictive modeling? As a graduate student I want to use machine learning for my thesis - it's extremely powerful - but I don't want to miss subtleties and wind up with bad science. With great power comes great responsibility; how can I learn more about the things I might not be aware of?

      [–][deleted]  (1 child)

      [deleted]

        [–]AboveTableAccount 0 points1 point  (0 children)

Authors getting touchy is... unfortunate but understandable; it's Nature's response that's more worrying. I mean, I understand not publishing a really minor quibble, but this seems like a much more noticeable problem with their experiment/algorithm design.

        [–]louislinaris 2 points3 points  (0 children)

        I recently read an article with similar concerns but about reproducibility of results, and top journals didn't care. Unfortunately they don't want to retract bad findings since their success is dependent on publishing high impact work. It's a very perverse incentive

        [–]Cherubin0 2 points3 points  (0 children)

        Nature doesn't care about research quality. They only care about paywalling as much as possible.

        [–]baylearn 2 points3 points  (2 children)

        This paper was featured on Google Research (now GoogleAI)'s blog:

        Forecasting earthquake aftershock locations with AI-assisted science

        Phoebe DeVries Post-Doctoral Fellow, Harvard Published Aug 30, 2018

        From hurricanes and floods to volcanoes and earthquakes, the Earth is continuously evolving in fits and spurts of dramatic activity. Earthquakes and subsequent tsunamis alone have caused massive destruction in the last decade—even over the course of writing this post, there were earthquakes in New Caledonia, Southern California, Iran, and Fiji, just to name a few.

        Earthquakes typically occur in sequences: an initial "mainshock" (the event that usually gets the headlines) is often followed by a set of "aftershocks." Although these aftershocks are usually smaller than the main shock, in some cases, they may significantly hamper recovery efforts. Although the timing and size of aftershocks has been understood and explained by established empirical laws, forecasting the locations of these events has proven more challenging.

        We teamed up with machine learning experts at Google to see if we could apply deep learning to explain where aftershocks might occur, and today we’re publishing a paper on our findings. But first, a bit more about how we got here: we started with a database of information on more than 118 major earthquakes from around the world.

        ...

        Read more: https://www.blog.google/technology/ai/forecasting-earthquake-aftershock-locations-ai-assisted-science/

        [–]murukeshm 0 points1 point  (1 child)

        Isn't the Google AI Blog http://ai.googleblog.com/? And it certainly seems to have been alive when this post was published, so I wonder why this wasn't published there but on the general blog.

        [–]baylearn 0 points1 point  (0 children)

        You're right, it's on the general google blog rather than the more specific ai blog!

        I think maybe because the article is in Nature, rather than a specific domain venue like Nature Machine Intelligence (banned) or NeurIPS?

        [–]beginner_ 6 points7 points  (0 children)

        We are earthquake scientists and our goal was to use a machine learning approach to gain some insight into aftershock location patterns.

Exactly. You are earthquake scientists and hence should know the bounds of your education and knowledge, which advanced machine learning isn't part of.

This is just another of many stories of how real science is going down the drain. Arrogant, diva-like professors with egos up on Mars. It's not about science, it's about getting attention (published) and hence getting more funding (money) to be able to get more attention. It's not about the truth anymore. Hence their harsh reaction: ego-defense. They don't care one bit about the truth and actual science. Any time I'm surprised by my model's performance, I start looking for the error. And there always is one.

Compounding this is the complete failure of the current publishing system. Just another case where peer review has close to zero value. Some time ago there was also a machine learning paper in the field I work in. In Science. It was horrible - essentially brute-forced p-hacking: 300 data points with 5000 features and backwards feature elimination. What could go wrong? Not related to AI, but also in Science, was the famous microplastics paper. Science claims it only publishes with all raw data available. It was clear Science never got that data, and the authors later claimed the laptop with the data on it was stolen from their car. Right, you have no backup of your multi-year project's data and then you leave that data in your car...

The more hype, the more BS, and AI is very, very hyped. I really like the image in the tweet with the leaking pipe. So yeah, big thumbs up for that guy!

        [–]critplat 13 points14 points  (3 children)

It seems that the critic missed that two of the comparison methods were actually not standard techniques, but were discovered as important during the data analysis, and so the analysis with a better holdout set actually just replicates the paper's original results and doesn't change anything about the interpretation?

        [–]thommyfilson 16 points17 points  (2 children)

You're being downvoted, but I do think this is the core argument of the authors and Nature. The NN was able to find some new predictors of earthquakes that humans hadn't found/used before.

Both the authors and Nature could've handled this better - simply running experiments on different splits would likely have resolved all of this - but they decided to argue instead.

I think the other points by the critics aren't critical with respect to the paper. The potential data leakage (I haven't fully read the paper, so I'm not certain about the method, and maybe there is/isn't leakage) is the main concern. The others aren't too related to the core finding that there are new and better metrics that humans haven't used and don't use.

        [–]a3onstorm 5 points6 points  (1 child)

        Yeah agreed. Technically, the critic actually proved that the data leakage did NOT inflate results, because he reproduced the same result that the paper did about the von Mises yield criterion being as good as the neural network

        [–]MauTau 2 points3 points  (0 children)

I see the point you guys are making, but one of the articles Rajiv came across, specifically this one, states that the predictors they found were not insightful. From the article: "In the following, we show that a logistic regression based on mainshock average slip, d, and minimum distance r between space cells and mainshock rupture (i.e. the simplest of the possible models, with orthogonal features), provides comparable or better accuracy than a DNN."
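
For anyone curious, a hedged sketch of what that two-feature baseline could look like (file and column names are assumptions, not the authors' actual variables):

```python
# Logistic regression on mainshock average slip d and minimum distance r,
# evaluated with a mainshock-grouped split to avoid the leakage discussed above.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupShuffleSplit

df = pd.read_csv("cells_with_slip_distance.csv")
X = df[["average_slip_d", "min_distance_r"]]   # the two engineered features
y = df["aftershock"]

tr, te = next(GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
              .split(X, y, groups=df["mainshock_id"]))

clf = LogisticRegression().fit(X.iloc[tr], y.iloc[tr])
print("two-feature AUC:", roc_auc_score(y.iloc[te], clf.predict_proba(X.iloc[te])[:, 1]))
```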

        [–]alexmlamb 1 point2 points  (1 child)

        Is this published in standard nature or one of the weird nature offshoots (like Nature Machine Intelligence)?

        [–]rshah4 2 points3 points  (0 children)

Standard Nature.

        [–]thebrashbhullar 1 point2 points  (0 children)

The authors of the paper clearly do not know about the Occam's razor principle: "simple is better". How does using a neural network to learn a trivial signal constitute an "advancement in science"?

        [–]dertuncay 1 point2 points  (0 children)

I am really glad to find such a topic, as an earth scientist trying to apply some ML to his data. I've read that article, and my plan is to cite it as an example of ML applications in earth science in my PhD thesis.

Since it was published in Nature, I assumed the work was of high quality. I am not working on that particular topic in earth science, but it looked interesting to me when I read it. Thank you for your work on that article.

I'd like to share another strange (from my point of view) article on a similar topic. In that work, the authors keep saying that they have 'big data', which is around 900 MB of catalog. The database contains 1.4 million earthquakes. However, in the data preparation section they write: "[...] The catalog was filtered according to a minimum magnitude M0 = 2.5. Thus, only events with at least such magnitude will be considered from this point to the rest of the work. This filtering resulted in 63,960 events with magnitude greater than or equal to M0." From that point on they do not have 'big data' anymore, and handling 64 thousand events is far easier. That doesn't mean you shouldn't use ML for it, but throughout the article they keep saying that they have big data, etc. Moreover, they used Spark's cloud-based big data IT infrastructure for the processing, since they have 'big data' and it is hard to deal with.
It is also written that "In this work, the default configuration of deep learning implementation in the H2O library was used.". The combination of these sentences makes me think that these people did this work just because it is a hot topic, and that they do not have much knowledge of the ML process.

        [–]Chemracer 2 points3 points  (10 children)

Can I get a TL;DR?

        [–]sciences_bitch 23 points24 points  (0 children)

        Data leakage.

        [–]shaggorama 11 points12 points  (8 children)

        Several earthquakes appeared in both the training and test sets.

        [–]johnnydozenredroses 3 points4 points  (6 children)

        I mean : this isn't fraud, right ? It's just very very sloppy.

        [–]shaggorama 14 points15 points  (4 children)

        no one said it was fraud. But it's sufficiently sloppy to render their entire analysis worthless.

        [–]trevorbix 2 points3 points  (3 children)

It's sort of like predicting the results of a football match from two weekends ago but including last weekend's results in the training data. If the prior weekend's results influence the most recent weekend (as we know they would: teams improving, injuries, etc.), then it's time leakage. That's how I see it anyway.

        [–]shaggorama 7 points8 points  (2 children)

        It's more like trying to predict last weekend's results based on who won last weekend. It's redundant. You can't make any statements about the generalizability of a model if you evaluate it on the training data.

        [–]trevorbix 2 points3 points  (1 child)

Agree with the second statement, but they aren't using the aftershock of A and the original shock of A in the training data at the same time; the problem occurs when the aftershock of A is also mapped as a primary shock B, isn't it? So you have the same event as a primary shock in the training set and also as a secondary shock. Listen to me, I AM AN EARTHQUAKE SCIENTIST.

        [–]jmineroff 1 point2 points  (0 children)

        A better example would be if Team A beat Team B last Friday. Their approach involves using the data that Team A won in the training set, and then asking about the outcome for Team B in the test set. While you aren’t querying for the exact same result, the concern is that the leakage causes spurious correlations like “Team A always wins on Fridays.”

        Edit: In the specific case of earthquakes, you might be capturing otherwise unknowable geographic/geological factors that affect how aftershocks function in a given quake.

        [–]beginner_ 0 points1 point  (0 children)

It's what you'd expect to happen if you give a toddler a gun.

        [–]Toast119 1 point2 points  (0 children)

        This isn't true though right? It's that earthquakes in the training and test set both had aftershocks in the same geographic region.

        [–]philneedapug 1 point2 points  (0 children)

OP, were you able to replicate their performance after removing the leaked data (the pulse A and B mentioned in the authors' letter), and did you see a significant drop in AUC? If you did, then there is really no argument to defend it. (The authors even argued that the data leakage makes it harder, wow.)

        [–]AboveTableAccount 0 points1 point  (0 children)

Wow. I would never have thought a journal like Nature would respond with that. The authors' "not moving the field forward" remarks sound condescending. Yeah, because pretending you didn't make an error is definitely moving the field forward. This is really concerning, actually, and it sets a really bad precedent. No one will blame them for making a mistake - everyone does it, and mistakes get published - but authors and journals not even taking the responsibility to correct things is a really bad precedent. Now, granted, though I'm aware of Nature's prestige, it's a journal I've just never read - is this behaviour usual for them?

I understand that these types of issues might not be of interest to non-data scientists, but they really should be. It doesn't necessarily ruin the central thesis of the work, but being so nonchalant about ignoring a source of error is rather unscientific.

        [–]shyam4iiser 0 points1 point  (0 children)

        It seems that this terrible mess could have been avoided if the authors did some proper prospective forecasting experiments, for instance, the kind that is facilitated by this platform:

        https://www.richterx.com/rX.php?go=forecast

        [–]clanleader 0 points1 point  (0 children)

In all fairness, doesn't this sum up the majority of science? I thought it was well known now that a majority of published research findings are false - p-value hunting, etc. Even in major journals like Nature, impact factors often carry more weight than reproducibility (aka science).

Whilst reading their rebuttal I was thinking "surely they used a validation set for all the testing and touched the test set only once, to either confirm or deny their hypothesis". I'm not sure there was one and should dig deeper, but I just came across this post from a Google search and it made for interesting nightly reading. I hold anyone who does replication studies in high esteem; it's the most important yet most neglected part of science.

        Anyway, interesting read, thank you.

        [–][deleted] 0 points1 point  (1 child)

        Is this the paper that came from the LANL kaggle competition? In that case, it probably won that competition and has some legitimacy in that regard.

        Your points are very valid, but I just need some clarification addressing my question to be satisfied.

        [–]rshah4 6 points7 points  (0 children)

        LANL kaggle competition

        This was unrelated to the Kaggle competition (and published prior to the Kaggle competition)