
[–]gwern 43 points44 points  (3 children)

Looks overfit to me, suggesting lack of data. I am not expecting the libertarian anti-recycling rant from the OA post demonstrating the power of GPT-2-large*, but even GPT-2-small should be able to do better than this... There aren't that many CW threads, 'top 5 controversial' is not that much especially when they may not be very controversial and you say many are deleted/empty, and GPT-2-small really needs many megabytes of text. (I think CW/Motte are also very strange in terms of general controversiality - they aren't that much like Twitter or Facebook clickbait. So even if it did work, it wouldn't necessarily be very conflict-triggering.)

What you should do is use the Reddit comments BigQuery dataset: https://bigquery.cloud.google.com/table/fh-bigquery:reddit_comments.all?pli=1 It's not that hard to figure out, and already has a 'controversiality' field. Make a list of the top dozen or score of political/culture-war subreddits and then use its binary 'controversiality' field (Reddit won't provide the exact up/downvotes so there's no way to calculate your own 'controversial' score). This will provide all the comments you could possibly need. You can run the SQL and dump it to a GCP bucket and export as JSON or other formats, which will be easy to extract to pure text & retrain your GPT-2-small on.
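The "extract to pure text" step could look roughly like the sketch below. It assumes the export is newline-delimited JSON with one comment per line and a `body` field (the field the queries in this thread select); the function name `comments_to_text` is made up for illustration. Comments are joined with `<|endoftext|>`, the separator token GPT-2 uses between documents.

```python
import json

END_OF_TEXT = "<|endoftext|>"  # GPT-2's document separator token


def comments_to_text(json_lines):
    """Join the `body` of each JSON record into one training string.

    Assumes one JSON object per line with a `body` field; blank lines
    and records with an empty or missing body are skipped.
    """
    bodies = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        body = (record.get("body") or "").strip()
        if body:
            bodies.append(body)
    return ("\n" + END_OF_TEXT + "\n").join(bodies)


# Tiny in-memory stand-in for an exported file.
sample = [
    '{"body": "First comment.", "subreddit": "politics"}',
    "",
    '{"body": "Second comment.", "subreddit": "politics"}',
]
training_text = comments_to_text(sample)
```

With a real export, passing an open file handle instead of `sample` should work the same way, since the function only iterates over lines.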

* Amusing anecdote: someone on Twitter thought the recycling rant showed that GPT-2-large was just memorizing, citing as justification the fact that the rant could be found as a self-post on Reddit. I pointed out to them that the timestamp of that post was after the OA post...


To demo the BigQuery mirror a little bit...

For February 2019 alone, this yields: SELECT count(body) FROM [fh-bigquery:reddit_comments.2019_02] WHERE subreddit == "politics" AND controversiality == 1 LIMIT 20; ~> 82,168 comments from /r/politics.
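To go from one subreddit to a list of them, the same legacy-SQL query can be generated from Python. This is only a sketch: the helper name `build_query` and the example subreddit list are placeholders, and selecting a `score` column alongside `body` is an assumption about the table's schema.

```python
import re

# Monthly table taken from the query above.
TABLE = "[fh-bigquery:reddit_comments.2019_02]"


def build_query(subreddits, table=TABLE):
    """Build a legacy-SQL query for controversial comments in several subreddits."""
    if not subreddits:
        raise ValueError("need at least one subreddit")
    for name in subreddits:
        # Only allow plain subreddit names, so nothing odd gets spliced into the SQL.
        if not re.fullmatch(r"[A-Za-z0-9_]+", name):
            raise ValueError(f"unexpected subreddit name: {name!r}")
    quoted = ", ".join(f'"{name}"' for name in subreddits)
    return (
        f"SELECT body, score FROM {table} "
        f"WHERE subreddit IN ({quoted}) AND controversiality == 1;"
    )


# Placeholder list; substitute whichever subreddits you actually pick.
query = build_query(["politics", "Conservative"])
```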

Some samples: SELECT * FROM [fh-bigquery:reddit_comments.2019_02] WHERE subreddit == "politics" AND controversiality == 1 LIMIT 20; which yields:

  • 1. https://www.npr.org/sections/parallels/2018/05/19/612487104/venezuela-to-hold-presidential-election-but-main-opposition-is-boycotting-it You are lying or just incorrect.
  • 2 No. Small states deserve equal representation. 8 cities should run a country of 350m
  • 3 [deleted]
  • 4 Seeing a lot of Tulsi hate spewed with no reasoning to back it up (aside from easily debunkable things such as the "Assad apologist" and "homophobic" smears). Only thing that's been brought up to me that is factual is people questioning her views on Iran. Anyone have anything factual to add? I would appreciate it so I could gain more information on the subject.
  • 5 yeah capitalists tend to more conservative with their bullets - 2 in the back of the head in an apparent suicide attempt
  • 6 Yes of course but I think the main issue is that this puts him at a disadvantage in terms of policy and popularity. Also, the main flaw I see with the Democratic Party is the size and warring factions within it - the left will eventually triumph and centrists like booker and Clinton will be erased for the better.
  • 7 I'm not convinced that the Senate doesn't accurately represent the states.
  • 8 He took a racist photo. Whooptie fuckin doo. I have to ask the question, is being a "pure" liberal more important than winning an election? Come the fuck on.
  • 9 I'm pretty sure Trump's base says the exact same thing. What's next? All the articles against her are 'fake news'?
  • 10 Mine is also related to her looks, but I work in an industry with it’s fair share of beautiful women, and
  • 11 No he doesnt like his daughter employing a racist stereotype in order to get votes. Theres a difference lol
  • 12 All party representatives of authoritarian slant in both parties fell for the 'trafficking' scare tactics to shut down freedom of speech (Larry Flynt is a civil right cohort that helped maintain freedom of speech and the moral laws are back) and blame sites for the content users post. Making the case that site operators are 'pimping' and 'trafficking' humans for what their users post on a classified ads site (which Kamala Harris led), is the most authoritarian move ever tried against the internet. Go after the users who are breaking the law, not shut down the whole site and business, very anti-business and anti free market. Most of the 'trafficking' scare is to shut down porn and sex workers for moral laws, moral laws are heavy on the right. Moral laws like prohibition do not work, they only make the issue more dangerous for everyone. Ultimately authoritarians want to setup the internet firewall like in China and Russia and make porn and prostitution more dangerous by sending it underground in the black market where mafias run it.
  • 13 Tax breaks are reduced revenue, yes! Since you understand that part, I don't know what your protest is. Because she was saying that tax breaks can be given to the public instead of amazon. She's not wrong. We should just also not give them to Amazon. Also, saying tax breaks are finite is technically true, but practically useless at best, and misleading in a sinister way at worst.
  • 14 And how exactly does that benefit anyone? Imagine if every company did this. All it does is subsidize corporations by lowering their effective tax rate. That does not benefit citizens whatsoever. All this accomplishes is essentially moving a business from one place to another, and getting taxpayers to pay them to do it. I'd wager it makes things worse in another respect, that the business is incentivised to relocate to sub-optimal locations for their business. They would make less money, but the tax benefits mean they still get more profit than moving somewhere they could do better business. This clearly hurts the economy.
  • 15 Hey can I ask you a question? This is more about Bernie in 2020. I have avoided asking my Bernie colleagues because it's a sore subject. But...I looked through your history and, although recent, you "know your shit". I supported Bernie in the primaries in 2016. Voted Hillary in the General and full disclosure did some volunteer work as well. I would have MUCH rather had Bernie vs Hillary. That said, do you feel Bernie has some bridge he feels he might need to mend to get someone like me...who is actually concerned (with reason) that the movement could divide the left? And while Bernie himself went out and tried to get his supporters to come over to Clinton...they didn't That worries me about Bernie Sanders. Do those concerns come up at all within the grassroots movement and what can you tell me is going to be different about this run?
  • 16 You could have joined the military to get your education, gone to a smaller school or online. Your free to do as you please why should everyone else have to pay for your lifestyle?
  • 17 Bernie received 46% of the elected delegates. That is the fairest way to measure since it includes the full value of caucus states (that have lower overall turnouts than primaries).
  • 18 [removed]
  • 19 The Republican Party must be abolished. Their brand is sunk.
  • 20 Bernie will have to sign a pledge that he is a democrat to run, I believe he has agreed to do that. That makes him more democrat than most people, cause I know I never signed a pledge.

Look pretty inflammatorily left-like to me. :)

Given that BQ will provide more comments than you can feasibly train on, filtering further would be a good idea. Obviously, remove any '[deleted]' or '[removed]', and pick ones which are at least a certain length and so are more likely to express a coherent argument and aren't just a hit-and-run; but one less obvious trick that comes to mind is to look for extreme scores in addition to the controversiality bit. So select all comments whose scores are >20 or <-20, for example. This would work best if you pick subreddits from all across the spectrum, because an extreme score means that a comment has either gratified its partisans or infuriated its enemies.
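A rough sketch of those filters in Python, applied to records that have already been exported. The `body` and `score` field names, the function name `keep_comment`, and the 200-character minimum are assumptions for illustration; only the >20 / <-20 cutoff comes from the example above. How the score test should combine with the controversiality bit is left to the query or the caller.

```python
def keep_comment(record, min_chars=200, score_cutoff=20):
    """Return True if a comment record passes the filters described above."""
    body = (record.get("body") or "").strip()
    # Drop placeholder bodies left behind by deletion or moderation.
    if body in ("[deleted]", "[removed]"):
        return False
    # Drop short hit-and-run comments.
    if len(body) < min_chars:
        return False
    score = record.get("score")
    if score is None:
        return False
    # Keep only comments that were strongly upvoted or strongly downvoted.
    return score > score_cutoff or score < -score_cutoff


records = [
    {"body": "[deleted]", "score": 50},
    {"body": "x" * 300, "score": 35},
    {"body": "x" * 300, "score": -40},
    {"body": "x" * 300, "score": 3},
    {"body": "too short", "score": 100},
]
kept = [r for r in records if keep_comment(r)]
```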

[–]ratroj[S] 3 points4 points  (2 children)

Thanks a ton for this in-depth reply! I do agree that the performance of this model is underwhelming, and I'd imagine that a larger dataset would help to remedy that. It looks like I'd better get started with that (exceedingly useful) BigQuery dataset.

[–][deleted] 1 point2 points  (1 child)

With so many things you could work on, why choose this project?

[–]gwern 3 points4 points  (0 children)

For the lulz, presumably.

[–]Dudesan 62 points63 points  (1 child)

Most of these are of the "vaguely syntactically correct but obvious gibberish" variety. A few isolated sentences might be mined for bon mots. The last passage you quote looks like something a human could have written - perhaps not a sober human, but a human.

Honourable mention:

According to the relevant data, we have 1455 students who have taken AFAP (American Indian and Alaska Native Studies). Of these, 7 (58%) are boys, 6 (179%) are male, and 1 (97%) are non-binary (including the boys) and 1 (761) are genderfluid (all other numbers are in red).

I mean, I've seen worse math in gender-politics posts.

[–]BothAfternoon prideful inbred leprechaun 4 points5 points  (0 children)

I hurt myself laughing at that one :-)

[–][deleted]  (1 child)

[deleted]

    [–]gwern 7 points8 points  (0 children)

    2) Can you use the "noncontroversial" comments as the other side of the classification? Unsure how GPT-2 works but this would be a natural way for most ML models.

    Yes, there's no reason you couldn't train it with 'negative samples' and make it assign them lower likelihoods. But the current training codebase doesn't support this kind of training at all. Just positive samples. So you'd need to rustle up as pure flamebait as possible.

    [–]wulfrickson 6 points7 points  (1 child)

    Wait, is the model generating plausible but fake URLs? I was fooled by http://unz.com/us/nrberg/russian-science-school-turned-politically-wrong/. It's a shame it's fake.

    I would agree with /u/gwern that this feels like overfitting on a tiny dataset (how many comments were there in total? Far fewer than a thousand, I would expect) and training on a bigger subreddit would be interesting - with the caveat that the mods on the biggest politics subs are notoriously heavy-handed DNC shills who would probably have deleted most of the actually interesting things.

    [–]ff29180d Ironic. He could save others from tribalism, but not himself. 2 points3 points  (0 children)

    Wait, is the model generating plausible but fake URLs?

    Yes, it's common, check the @ask-gpt account on Tumblr.