all 154 comments

[–]gitamar 65 points66 points  (12 children)

Interesting write up! Should be cross-posted to /r/machinelearning or even /r/computervision.

[–]lolcop01 15 points16 points  (10 children)

Thanks for these subreddits! These should make my reddit experience a bit more meaningful. Less cats, more knowledge.

[–][deleted] 26 points27 points  (2 children)

Seems obvious but this tip saved reddit for me: you can unsubscribe from the default subs. Do it. You won't miss it, even if it feels wrong.

[–]Tekmo 8 points9 points  (0 children)

Yeah, /r/programming also improved in quality once it was no longer a default sub

[–][deleted] 2 points3 points  (0 children)

I actually enjoyed reddit more and spent less time on reddit after doing so, which was amazing.

[–]quzybd 1 point2 points  (5 children)

Fun fact: a few years ago Google ran a huge NN on YouTube videos. And what did the network do? It taught itself to recognize CATS!!!

[–]redonculous 11 points12 points  (4 children)

*taught.

[–]gitamar 0 points1 point  (0 children)

You are welcome! Unfortunately they are not that active, but still better than /r/genetic_algorithms

[–][deleted] 0 points1 point  (0 children)

/r/datascience would also appreciate this.

[–]mutantturkey 93 points94 points  (20 children)

I wanna see the cocaine stock ticker!

[–]miekao[S] 37 points38 points  (6 children)

I did too! Unfortunately I never got back around to the idea until after I had heard of the epic SR takedown.

There were a few more interesting problems to solve to go down that route: a good bit of language processing.

Quantities on SR were pretty free-form. It'd have to understand that "QP" could mean "quarter-pound" depending on context, or when "O", "Z", or "Ounce" all mean the same thing, and when they don't, etc. Metric to imperial, and so on. Good stuff.
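For a sense of what that normalization might look like, here's a minimal sketch; the alias table, grams-per-unit values, and parsing rules are illustrative assumptions, not anything from the actual tool:

```ruby
# Illustrative sketch of quantity normalization; the alias table and
# grams-per-unit values are assumptions, not data from the tool.
UNIT_GRAMS = {
  'g' => 1.0, 'gram' => 1.0, 'grams' => 1.0,
  'o' => 28.35, 'z' => 28.35, 'oz' => 28.35, 'ounce' => 28.35,
  'qp' => 113.4,  # "quarter-pound"
  'lb' => 453.6, 'pound' => 453.6
}.freeze

# Parse strings like "2 oz", "QP", or "3.5g" into grams, or nil.
def quantity_to_grams(text)
  m = text.strip.downcase.match(/\A(\d+(?:\.\d+)?)?\s*([a-z]+)\z/)
  return nil unless m && UNIT_GRAMS.key?(m[2])
  (m[1] ? m[1].to_f : 1.0) * UNIT_GRAMS[m[2]]
end
```

The context-dependence the comment mentions (is "O" an ounce or part of a product name?) is exactly what a flat table like this can't capture on its own.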

[–][deleted] 4 points5 points  (1 child)

What about the new SR?

[–]miekao[S] 1 point2 points  (0 children)

Never got around to it, unfortunately.

[–]don-to-koi 1 point2 points  (2 children)

Once the tool noticed I had renamed the file, it slurped it up, and added it to the corpus to train upon. 

How exactly was this done? Does it keep reading the directory from time to time? Do you have some kind of checksum for a file so that it's still recognized as the "same" file even after a name change? Because I'm assuming that the tool was still dumping out other failed logins in the interim. How would it know that the file with the changed name was not a new failed login but a previous failed login that you had corrected?

[–]miekao[S] 2 points3 points  (1 child)

Considering it was still in "scratch pad" mode, it was a terrible inotify hack at the time. Nothing worth talking about architecturally.

[–]don-to-koi 0 points1 point  (0 children)

Oh, I was just curious about the logic/algorithm you used.

[–]indieinvader 1 point2 points  (0 children)

Well, I know what I'm going to be hacking on tonight!

[–]cybrbeast 24 points25 points  (2 children)

Indeed, he did it for the stats, show us the stats!

[–]periphreal 14 points15 points  (0 children)

The experiment didn't end up going much further (and I don't have that data!)

[–][deleted] 10 points11 points  (0 children)

I hope he posts some graphics to /r/dataisbeautiful!

[–]labiaflutteringby 27 points28 points  (5 children)

I wanna play MMO Drug Wars with realtime prices based on the actual black market!

[–]mutantturkey 17 points18 points  (1 child)

the kingpin of your empire just got whacked

Pick an option

1) go to war with the other family 2) take out the mole

[–]Electro_Nick_s 8 points9 points  (0 children)

Why not both?

[–]case9 4 points5 points  (1 child)

Seems flawed because the player actions wouldn't affect the market prices in the game

[–]jmblock2 7 points8 points  (0 children)

except they might...

[–]neoice 2 points3 points  (0 children)

I wanna play MMO Drug Wars with realtime prices based on the actual black market!

it exists, it's called being a drug dealer.

[–]gwern 2 points3 points  (0 children)

You could try using Christin's sanitized datasets: https://arima.cylab.cmu.edu/sr/ He removed most of the information, but I believe cocaine was a separate category on SR1 so it looks like you could get a cocaine ticker of sorts.

(If you don't want SR1 specifically but would settle for successor markets, one could probably use data from Grams or my own dumps.)

[–][deleted] 0 points1 point  (2 children)

The silk road is no longer, and most of the alternatives are using stronger captchas.

[–]I_RAPE_SLOTHS 0 points1 point  (1 child)

It's very much alive and kicking at http://silkroad6ownowfk.onion, and I don't believe it even requires a captcha to login.

[–][deleted] 1 point2 points  (0 children)

That's not the same site that OP was trying to scrape data from.

[–]dd_123 125 points126 points  (52 children)

A great example of why you should always try to avoid creating your own captcha scheme. I can't count the number of times people (especially on forums) think they've come up with some great new scheme which is in fact relatively easy for computers to solve with minimal effort.

[–]TMaster 75 points76 points  (9 children)

FTA:

Because the Silk Road developers had to be paranoid, they couldn't use an external captcha service like ReCaptcha.

Making your own CAPTCHA or taking an off-the-shelf one, it doesn't matter that much anymore anyway. Without making things extremely hard to read for humans you're not going to stop all bots anymore, you're just adding in a hurdle to some degree nowadays.

Before anyone asks: yes, I do think reCAPTCHA does it well, but they too have had their problems with bots in various ways, and the author clearly states why he believes reCAPTCHA wasn't used.

[–]KayRice 28 points29 points  (4 children)

you're just adding in a hurdle to some degree nowadays.

Certainly considering you can pay for CAPTCHA solves by the thousands.

[–]cmseagle 14 points15 points  (3 children)

I'm not at all familiar with such things. What's the going rate for captcha solves in $/captcha?

Edit: Answered my own question. A quick Google and I found one service that does it for $1.39 per 1000 captchas.

[–]Almafeta 10 points11 points  (0 children)

On mturk, 3 cents per 100.

[–]deadstone 2 points3 points  (1 child)

Man, I'd solve captchas for bitcoins.

[–]housemans 2 points3 points  (0 children)

You can.

[–]Genesis2001 12 points13 points  (2 children)

Making your own CAPTCHA or taking an off-the-shelf one, it doesn't matter that much anymore anyway. Without making things extremely hard to read for humans you're not going to stop all bots anymore, you're just adding in a hurdle to some degree nowadays.

This is why I realized (through some reading) that Q&A captcha questions tend to work better: they are easier for humans to solve but harder for computers to crack. (Though it should be stated that Q&A captchas can probably be broken, I have no evidence, just theory, if the question is simple enough, e.g. "2 + 2 = ?" or common everyday questions.)

I tend to prefer questions that relate somehow to the community that people are registering for.

[–]JohnMcPineapple 5 points6 points  (1 child)

Using state of the art captcha images, but with a question instead of random characters, should make it both easier for humans (you can still read the text if one or two characters turn out unreadable) and harder for computers to solve.

[–]Genesis2001 0 points1 point  (0 children)

As mentioned in another comment here, and my own comment, it's not entirely foolproof. Bot creators (by targeting your site specifically) and human spammers (though it takes many hours to spam effectively, making it very low payoff) can still get around Q&A captchas.

[–][deleted] 50 points51 points  (0 children)

The problem with going the other way is that using a standard premade captcha brings its own problems. It's more targeted, has more people trying to break it, and when someone cracks it open on another website, yours is now broken too.

The real problem is captchas in general: they annoy your users, the 'best' ones often fail with humans, and they don't really solve the problem of stopping automated software, thanks to things like Mechanical Turk.

There are better solutions than captchas, but they are generally hard to do and specific to the product, so laziness takes over, I guess.

[–]Mattho 12 points13 points  (0 children)

With your own scheme you are open to someone targeting you specifically. With a common captcha (such as reCaptcha) you are an easier target.

[–]danweber 23 points24 points  (2 children)

For a lot of people, coding up a very simple captcha is the right thing.

Asking "what is 2 + 3?" will stop bots that haven't been written for your site.

Yes, someone can always write a bot that defeats what you have, but it takes a lot longer to write a bot than it does to just make a slight edit to your captcha.
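A minimal sketch of that kind of math captcha; the function names and the 1–9 range are just illustrative:

```ruby
# Sketch of a "what is 2 + 3?" captcha: generate a question/answer
# pair, then check the submitted value. Names are illustrative.
def math_captcha(rng = Random.new)
  a = rng.rand(1..9)
  b = rng.rand(1..9)
  { question: "What is #{a} + #{b}?", answer: a + b }
end

def captcha_passed?(captcha, submitted)
  submitted.to_s.strip.to_i == captcha[:answer]
end
```

As the replies below note, a pattern this regular is exactly what off-the-shelf spam tools already parse.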

[–]imog 14 points15 points  (1 child)

This is false. I ran a site that gets a few million pageviews monthly. Xrumer can detect and properly answer basic math.

For over a year, however, it would automatically fill in password fields. So we would display extra password fields in the code, but not to humans in the browser... Xrumer filled in the passwords, so we knew it was a bot and discarded that input. Humans never saw it and they had no problems.

But ya, Xrumer does a lot of basic Q&A. So when doing Q&A on high-traffic sites, you have to be creative to keep accessibility high for humans but difficulty high for bots. If you have international users, this is even trickier. An example of a good question that is hard for a bot: "which is larger, an elephant or a cat?" Any human can answer; bots cannot, because the question requires advanced interpretation of what is asked... It's much easier to tell a bot to look for certain patterns like "2+2" or "2 plus 2" and actually do the math, so any pattern that looks like a math problem is actually trivial to code for a bot master.

[–]lachy_xe 6 points7 points  (0 children)

For over a year, however, it would automatically fill in password fields. So we would display extra password fields in the code, but not to humans in the browser... Xrumer filled in the passwords, so we knew it was a bot and discarded that input. Humans never saw it and they had no problems.

For those that are interested, this kind of technique for detecting bots is often known as a Honeypot. (The wiki article goes into a lot of detail, but in its simplest form, it is often just a hidden field.)
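A minimal sketch of such a honeypot, assuming a plain Ruby handler with a params hash; the decoy field name is hypothetical:

```ruby
HONEYPOT_FIELD = 'password_confirm_2'.freeze  # hypothetical decoy name

# Render a decoy input that humans never see (hidden via CSS).
def honeypot_html
  "<input type=\"password\" name=\"#{HONEYPOT_FIELD}\" " \
    "style=\"display:none\" tabindex=\"-1\" autocomplete=\"off\">"
end

# A real browser submits the decoy blank; form-filling bots don't.
def likely_bot?(params)
  !params.fetch(HONEYPOT_FIELD, '').strip.empty?
end
```

The same idea works with any field a browser would leave blank; the CSS hiding is the only part humans ever "interact" with, by never seeing it.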

[–]bungle 1 point2 points  (0 children)

I can't count the amount of times people (especially on forums) think they've come up with some great new scheme

In forums, especially, it is really ok to implement your own scheme. Nobody is going to make any effort to break scheme used just by your forum that contains photos of lol cats. It's mainly for preventing forum spamming, and nothing really bad happens if someone breaks it (then just change a scheme a bit).

[–][deleted]  (1 child)

[deleted]

    [–]Philluminati 4 points5 points  (0 children)

    I'm not sure captcha qualifies as security. More like just "noise reduction".

    [–]YourTechGuy 24 points25 points  (19 children)

    Interesting article, but I wouldn't have gone the letter frequency route. While it's easier to use these general frequencies, creating a Markov chain from the trained examples would probably yield better results.

    Optimally, I think the best route would be to fuse multiple classifiers (one based on image data, another on Markov, another on something like Levenshtein distance from words in a chosen dictionary).

    It's a great starter though (far better than random guessing at 1/((26^5)*1000)), I hope he publishes the CAPTCHA files he collected so other people can try their hand at it.
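The Markov-chain idea could be sketched roughly like this: train character-bigram counts on known solutions (or a wordlist), then score candidate strings. This illustrates the suggestion, not the article's actual method:

```ruby
# Sketch of a character-bigram (first-order Markov) scorer: train
# transition counts on known-good strings, then prefer the candidate
# with the highest log-probability under the model.
def train_bigrams(words)
  counts = Hash.new { |h, k| h[k] = Hash.new(1) }  # add-one smoothing
  words.each do |w|
    w.chars.each_cons(2) { |a, b| counts[a][b] += 1 }
  end
  counts
end

def log_score(model, word)
  word.chars.each_cons(2).sum do |a, b|
    row = model[a]
    Math.log(row[b].to_f / row.values.sum)
  end
end
```

Per-character classifier guesses could then be re-ranked by this score instead of taken at face value.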

    [–]miekao[S] 16 points17 points  (7 children)

    Not a bad idea.

    I'll try digging out the solved captchas when I'm not on mobile.

    Something I found interesting: the neural network routinely found typos I had made while solving the captchas. After a bit, I made it alert me when it was suspicious, so they could be corrected.

    [–]nemec 13 points14 points  (0 children)

    The student has become the master ;)

    [–]YourTechGuy 3 points4 points  (3 children)

    That is interesting. Thanks for the quick reply.

    [–]miekao[S] 11 points12 points  (2 children)

    I've added the corpus to the repo.

    It lists less than 1700 solves, some of which were done by hand, some that the software solved. At some point, I pruned failing examples that hit really off-the-wall edge cases, as they were negatively training the neural network.

    [–]nilknarf 13 points14 points  (1 child)

    Thanks for the dataset! Just for fun I made some quick modifications to a captcha solver I wrote for another site to run on this. It is incredibly slow but only got 17 wrong out of the first 100 (which is a decent rate for 100loc): https://gist.github.com/fta2012/034b0686897d94e74b00

    Run with python solver.py with the captcha-corpus in the same folder with just the jpg files.

    Trains on images 500 to 1000 (arbitrarily chosen).

    [–]miekao[S] 8 points9 points  (0 children)

    That is damn amazing.

    [–]maxd 2 points3 points  (1 child)

    Did you consider attempting what the Google image matching service does? Basically, it creates a really low-res version of the image and turns that into a pretty small ASCII search string. It seems like you could turn the 20x20 character images into 5x5 1-bit images representing each character, and run the comparison like that.
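A rough sketch of that downscale-and-compare idea, assuming glyphs arrive as 20x20 arrays of 0/1 pixels; majority vote per cell is one plausible way to binarize (Google's actual service may differ):

```ruby
# Sketch of the low-res fingerprint idea: shrink a 20x20 1-bit glyph
# to 5x5 by majority vote per 4x4 cell, then compare fingerprints by
# Hamming distance. The sizes and the majority rule are assumptions.
def fingerprint(bitmap, out = 5)
  cell = bitmap.size / out
  (0...out).map do |r|
    (0...out).map do |c|
      ink = 0
      cell.times do |i|
        cell.times { |j| ink += bitmap[r * cell + i][c * cell + j] }
      end
      ink * 2 >= cell * cell ? 1 : 0  # 1 if at least half the pixels are set
    end
  end
end

def hamming(fp_a, fp_b)
  fp_a.flatten.zip(fp_b.flatten).count { |a, b| a != b }
end
```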

    [–]miekao[S] 0 points1 point  (0 children)

    Honestly, it never crossed my mind, and I wasn't sure of the effectiveness. It'd be interesting to attempt, though.

    [–][deleted] 2 points3 points  (9 children)

    Curious what you mean by "fusing" - how would that work programmatically?

    [–]YourTechGuy 14 points15 points  (8 children)

    I'm actually happy someone asked about this, as it pertains to one of my past research areas that doesn't get much recognition: information fusion.

    Specifically, the type of info fusion I was referring to was decision-level fusion: using multiple classifier systems to all guess at an answer and then deriving the answer from these answers. This type of fusion can be used in a meta-algorithm called "boosting", where multiple weak (i.e. not very accurate) classifiers are combined (weighted by how accurate they are) into one much more accurate classifier.

    In this case, just using character recognition and a Markov model might be 30% accurate, and the model discussed was 56% accurate. If "boosted" correctly, these two models could be fused into a model that was slightly more accurate (e.g. 70% accurate). The addition of other weak classifiers could then boost the accuracy more (this, of course, is subject to diminishing returns, and sometimes the disparity between two models is too great and they can't be fused).
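Decision-level fusion in its simplest form might look like this sketch, where each classifier's vote is weighted by its accuracy (illustrative only):

```ruby
# Sketch of decision-level fusion by weighted vote: each classifier
# contributes its predicted label, weighted by its known accuracy,
# and the label with the largest total weight wins.
def fused_decision(weighted_votes)
  totals = Hash.new(0.0)
  weighted_votes.each { |label, accuracy| totals[label] += accuracy }
  totals.max_by { |_label, weight| weight }.first
end
```

Two mediocre classifiers that agree can outvote one stronger classifier, which is the intuition behind boosting.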

    I could go on much longer if anyone was interested; it's a very interesting field of research.

    [–]miekao[S] 12 points13 points  (2 children)

    For fuck's sake, please go on much longer. I deal with problems where this would be useful very often.

    Parts of this have swirled around in my head, but combining multiple results in a way that's not haphazard guesswork eludes me. Every attempt I've made at assigning weights to classifiers have been by feel or trial and error. And when iterating on parameters, and finally getting an improvement, I don't think it's all it could be.

    I would absolutely, very emphatically, like to hear anything you want to say on the subject.

    [–]YourTechGuy 20 points21 points  (1 child)

    Okay, so I keep typing responses and for one reason or another the tab keeps closing. I've resorted to writing in a word processor and hopefully you'll actually see this iteration...I'll most certainly forget to add something, so I'll probably edit this later.

    I did most of my work with decision trees and random forests (which are essentially just collections of decision trees), and I've found AdaBoost to be the best algorithm there. As an aside, I've had more success in general with decision tree-based algorithms than neural networks and the like—but they are perfectly good (you certainly had good success with them in this article).

    AdaBoost works by combining classifiers using their current error rate as weights and summing their results to produce one “boosted” decision. What's really cool about it is that it focuses each additional classifier on classifying the currently misclassified examples. As such, you should see your overall error rate decrease fairly substantially with each additional classifier added. There are two main caveats with using AdaBoost: the dataset cannot be noisy and each model must be better than guessing. Beating guessing usually isn't much of a problem—it doesn't need to be much better than guessing anyway—and in this case you beat guessing by a whole lot (guessing is at ~8.416×10^-9 %), and any other classifiers you make would probably be way better than that as well. Noisy datasets are more of a problem (and may be the cause of your current error rate): the very thing that makes this meta-algorithm so good (reducing error by focusing on the misclassified) hurts when some examples just can't be trained from your current feature set; AdaBoost will end up perseverating on those examples and probably won't reduce error much.
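A minimal sketch of the AdaBoost loop described above, using decision stumps on a toy 1-D dataset (the data and stump form are illustrative, not from the article):

```ruby
# Toy AdaBoost sketch matching the description above. Weak learners
# are decision stumps h(x) = (x < t ? p : -p); labels are +1/-1.
def build_stumps(thresholds)
  thresholds.flat_map do |t|
    [+1, -1].map { |p| ->(x) { x < t ? p : -p } }
  end
end

def weighted_error(h, xs, ys, w)
  xs.each_index.sum { |i| h.call(xs[i]) == ys[i] ? 0.0 : w[i] }
end

def adaboost(xs, ys, stumps, rounds)
  w = Array.new(xs.size, 1.0 / xs.size)  # uniform sample weights
  ensemble = []                          # [alpha, stump] pairs
  rounds.times do
    best = stumps.min_by { |h| weighted_error(h, xs, ys, w) }
    err = weighted_error(best, xs, ys, w)
    break if err <= 0.0 || err >= 0.5    # must beat random guessing
    alpha = 0.5 * Math.log((1.0 - err) / err)
    ensemble << [alpha, best]
    # Upweight the misclassified examples, downweight the rest.
    xs.each_index do |i|
      w[i] *= Math.exp(best.call(xs[i]) == ys[i] ? -alpha : alpha)
    end
    z = w.sum
    w.map! { |wi| wi / z }
  end
  ensemble
end

def predict(ensemble, x)
  ensemble.sum { |alpha, h| alpha * h.call(x) } >= 0 ? 1 : -1
end
```

The toy labels [+, -, -, +] are chosen so that no single stump is perfect, which is exactly the situation where boosting earns its keep.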

    For noisy datasets, you should look at BrownBoost. It's similar in that it's also a boosting by majority meta-algorithm (i.e. if most classifiers say that X is the right answer, then it probably is), but handles noisy datasets by “throwing away” data that fails to be classified by different classifiers. As such, it can achieve a better overall accuracy, as it won't be swayed by examples that just won't be classified well.

    A different form of decision-level fusion that could work well here is between SVM and kNN. With both, you'll want to do some dimension reduction (probably aided by Ada or the like) so you can cut down on training time and the complexity of the model (not that you had too many features, but it always helps if you can cut one out). I don't have much to say on these (I didn't work with them much, just read papers), but there is some good literature on how this type of fusion has been applied to various problems.

    A resource I definitely recommend checking out is Boosting from MIT Press—it explains exactly how boosting works—I used to have a copy of it on my desk when I was working on that research project. They also have some lessons on YouTube (just search “boosting MIT”) that give a good introduction. As far as actual ML resources go, you should be modeling with Weka. Weka allows even ML novices to make models: you just plug in some data and start firing away. It's free and performs just as well as the expensive tools. It can scale across a cluster, or if you're tighter on budget (or have really big data), you can just sample and feed Weka something smaller to work on. Best of all, it also functions as a Java library, so you can integrate it into more complex programs seamlessly. The people who publish Weka also write a book on data mining that I'd recommend if you're new to the area. It's not perfect, but it is fairly broad, and you can pick up the international edition (same as the standard) for roughly $17. Just supplement it with some actual practice, as there are few (if any) practice problems.

    While I can't go too in-depth into my research (it would be too easy to link my Reddit identity to my real life—something I'd like to avoid), I always like to give a bit about my credentials with posts like this. I created a classifier (for a nontrivial problem) that had 98.7% accuracy. For reference, the people I was competing against were all in the 50-60% range; I probably could have improved it even more if I had some more time. I'm currently working on creating a new type of biometric authentication scheme, and while it's still too early to know for sure, I'm expecting about 98-99% accuracy out of the box.

    I'm always available via Reddit PM should anyone ever have a question about ML (I love talking about work, so really, fire away). I also do a lot of consulting work for companies of varying sizes, so if you'd rather just offload your ML, don't hesitate to write :).

    Ninja Edit: If you have specific questions about ML/my experience/etc feel free to just post them below. I'm going out tonight but will answer everything that's asked.

    Edit 2: links

    Edit 3: Thank you for the gold! I love talking about information fusion, I'm happy some found it useful!

    [–]miekao[S] 0 points1 point  (0 children)

    Thank you.

    [–]very_mechanical 4 points5 points  (3 children)

    I remember vaguely that most of the leaders in the Netflix-sponsored competition used a combination of a variety of techniques to produce a final answer. That might not have been equivalent to combining classifier systems, though.

    [–]YourTechGuy 3 points4 points  (2 children)

    Funny you should mention Netflix. I was at a data mining conference recently and attended a talk on how Netflix's recommendation system works. It's a very fancy collaborative filtering algorithm.

    I ended up speaking to the guy afterwards; he was incredibly bright. I can post their paper if anyone's interested.

    [–]ultramilkman 2 points3 points  (1 child)

    Could you please post the papers, and go into why you had such a high opinion of him?

    [–]YourTechGuy 1 point2 points  (0 children)

    So I went through my conferences folder and it appears that since the talk was a "tutorial" and not an actual presentation, there is only a brief summary and not a full paper. In any case, here is the link to the summary of his talk. The presenter, Dr. Xavier Amatriain, was very impressive.

    Netflix does publish a lot of papers though, so I'll see if I can find something relevant to their current collaborative-filter based techniques.

    [–]_alexkane_ 2 points3 points  (0 children)

    Yes, please go on

    [–][deleted] 14 points15 points  (13 children)

    "I wrote a Mechanize tool that downloaded 2,000 captcha examples from the site: one every two seconds. Then I solved them all by hand, renaming the files to (solution).jpg. That was not fun."

    Well, holy shit

    [–]miekao[S] 27 points28 points  (10 children)

    Toward the end, I was able to solve captchas on autopilot while maintaining an intelligent conversation, watching Netflix, and yelling at my dog.

    [–]marshall007 2 points3 points  (1 child)

    The experiment didn't end up going much further (and I don't have that data!) ...

    Does this mean you lost the solution set you'd built up?

    [–]miekao[S] 1 point2 points  (0 children)

    No, just the amount of time it ran before the shutdown made the dataset pretty uninteresting.

    [–]me-at-work 2 points3 points  (1 child)

    You've solved the captchas generated by ExpressionEngine! It's a shareware CMS written in PHP.

    I have used it for a few websites; by default, this is all you can configure. I noticed how easy the default captcha would be for computers to solve, so it's fun to see that confirmed!

    When I was using it, I hacked the code a bit to use a stencil font and random words, which makes it considerably more difficult to solve automatically :)

    [–]miekao[S] 1 point2 points  (0 children)

    Thank you. That's a big mystery solved.

    [–]GeorgieCaseyUnbanned 0 points1 point  (4 children)

    why didn't you just hire this out on oDesk? your time is a lot more valuable!

    [–][deleted] 3 points4 points  (0 children)

    Because he was doing it while maintaining an intelligent conversation, watching Netflix, and yelling at his dog. Down time is pretty much wasted time with no value; a good time to do repetitive and boring tasks. If someone doesn't do any of those things and is always being productive, then hiring out would be the smart thing to do.

    [–]yaazz 1 point2 points  (0 children)

    Seems like a great project to put on mechanical turk

    [–]AgentME 1 point2 points  (1 child)

    He could have skipped automating the solver at all and just gone that route for all of the CAPTCHAs too.

    [–]miekao[S] 3 points4 points  (0 children)

    With that attitude, we'd still have scribes and not keyboards, mates.

    Considering the use I actually got of it, though, solid point.

    [–]octnoir 0 points1 point  (0 children)

    True programmer and computer scientist.

    When someone says 'not fun' it usually means:

    "Fuck this fucking fucking piece of shit. I'm doing this absolute bullshit manual labor when a fucking machine should be able to do this. Fuck de fuck fuck fuck fuck fuck!" which goes on for a night, but usually a stupidly longer amount of time than foreseen, like a week.

    And the only polite thing you can say as the victim is 'not fun'. That's the code. Saying anything more is a violation of that code.

    [–]WOnder9393 0 points1 point  (0 children)

    I once made a solver for our university information system's anti-scraping captcha. I was too lazy to solve all samples by hand so I started by coding a script that would split the glyph images from the samples into unnamed categories so that each category contains only images representing the same glyph. That way I only had to manually name the categories (in my case 23 glyphs + some duplicates). When recognizing a captcha, I would then find the best matching category for each glyph image. I managed to get practically 100% success rate (I couldn't find any sample that wouldn't be recognized correctly) but that captcha was really easy to break.
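That clustering step might be sketched like this, with glyphs as small 0/1 bitmaps and a hypothetical pixel-distance threshold:

```ruby
# Sketch of the unnamed-category clustering described above: each
# glyph joins the first category whose exemplar is within a pixel
# distance threshold, otherwise it starts a new category.
def pixel_distance(a, b)
  a.flatten.zip(b.flatten).count { |x, y| x != y }
end

def categorize(glyphs, threshold)
  categories = []  # each category's first glyph serves as its exemplar
  glyphs.each do |g|
    cat = categories.find { |c| pixel_distance(c.first, g) <= threshold }
    cat ? cat << g : categories << [g]
  end
  categories
end
```

With 23 glyphs, only the 23 resulting exemplars need hand labels, rather than every sample.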

    [–]land_stander 6 points7 points  (0 children)

    Interesting read. Glad I am not the only developer who has off the wall ideas like this that never quite reach fruition :)

    I'm playing around with some machine learning stuff with OpenCV right now. I think I'll spend some time on that this weekend now.

    [–]nilknarf 5 points6 points  (0 children)

    I also wrote about solving weak captchas before (but for much simpler captchas): http://franklinta.com/2014/08/24/solving-captchas-on-project-euler/.

    I think the weakness was the same for both of these cases: using a uniform font reduced the problem to image template matching!

    [–]DanAtkinson 4 points5 points  (0 children)

    Hey! This is a bit random but I submitted a pull request to correct a couple of words in the readme file. For some reason, Wired Magazine thinks that it's all my work and would like an interview.

    Whilst I've corrected him, you may wish to reach out to him personally - @a_greenberg.

    [–]Booshanky 6 points7 points  (12 children)

    I've never bothered to check out the Silk Road or Evolution. I know it's done through TOR, but does anyone know a good FAQ somewhere? Google is mostly full of news stories; maybe my search terms suck, haha.

    [–]GetsEclectic 6 points7 points  (7 children)

    [–]Booshanky 2 points3 points  (6 children)

    Hrm, so you just gotta have the right software and hit those .onion links. Neat. Thanks for the heads up!

    [–]rubber_band_man_ 2 points3 points  (5 children)

    I recommend using Tails in a VM.

    [–]drysart 3 points4 points  (4 children)

    Set up two VMs, both connected via a private LAN.

    One VM also has an outgoing connection to the Internet. This VM runs the TOR proxy, and exposes it to the private LAN. The VM does nothing else.

    The second VM only has the private LAN connection and therefore must use the TOR proxy provided by the first VM. This second VM is the VM you actually use for browsing and such.

    You're 100% guaranteed not to accidentally leak any identifying information, because there is none to leak. No public IP, no pre-existing personal files sitting around, etc. The main weakness is that it's an unusual setup, so you stand out, basically because you're not using the Tor Browser like 99% of everyone else hitting a .onion site will be.

    [–]Gimmick_Man 3 points4 points  (2 children)

    Why use a VM instead of just installing Tails on a USB drive?

    [–]AgentME 0 points1 point  (0 children)

    Easier to wipe a VM clean and reset it to a good state each time you use it, even if the software in the VM got exploited. If you boot straight from a usb drive, and your stuff gets exploited, then it could infect the thumb drive, your BIOS, etc. (Not that it's particularly likely, but if you're already bothering to be so paranoid, you might as well go that extra bit.)

    [–]Roadside-Strelok 0 points1 point  (0 children)

    aka Whonix

    [–][deleted] 1 point2 points  (0 children)

    Check out OpenBazaar too and if you have the wherewithal, contribute!

    [–]xraystyle 1 point2 points  (2 children)

    /u/Booshanky... I recognize that username.

    [–]Booshanky 0 points1 point  (1 child)

    Lemme guess, CGN? Haha

    [–]xraystyle 0 points1 point  (0 children)

    CGN ftw. Check your PMs.

    [–]holambro 6 points7 points  (6 children)

    So does this totally debunk the assertion by the FBI that they found the SR server because of a leaky captcha? It certainly appears that way to me.

    DPR's lawyers should go have a chat with Mike and see if he's willing to testify on their behalf.

    [–]miekao[S] 6 points7 points  (0 children)

    I just used Mechanize to rip the images. I don't know if it was in fact being served from what "should've been" a non-routable leaked IP address.

    [–]drysart 2 points3 points  (0 children)

    No, this has nothing to do with the FBI's assertion.

    [–]minlite 5 points6 points  (3 children)

    They're bullshitting. They used illegal methods for spying and want to make it all legal by claiming it was a leaked IP address from the captcha. Many security researchers have confirmed that it's not true.

    [–]vwermisso 4 points5 points  (2 children)

    Could I trouble you for a source?

    Because all I've heard was the opposite. Some blackhat even messaged the dude running it saying his captcha was leaking his IP before it was fixed.

    [–]minlite 0 points1 point  (1 child)

    [–]happyscrappy 1 point2 points  (0 children)

    Do you read your sources before posting them?

    "So, does this mean the FBI did get its information from the NSA illegally and that Tor's encryption has been broken?

    Cubrilov doesn't think so."

    "And neither Cubrilovic and Sandvik is accusing the FBI of lying. They argue only that its account of entering “miscellaneous” characters into the site is a carefully cloaked description of injecting commands into the Silk Road’s login fields."

    [–]mserenio 11 points12 points  (0 children)

    I am more of a front-end, design guy but I kinda understood how he went about it. Awesome stuff.

    [–]dummer_august 2 points3 points  (0 children)

    The hardest part for me in this article would be to solve 2000 captchas by hand. (I usually get 4 out of 5 wrong)

    [–]tigertom 2 points3 points  (1 child)

    He is surprised that J is so rare; if you have ever played Scrabble you would know that from how the scoring and tile counts work:

    1 point: E ×12, A ×9, I ×9, O ×8, N ×6, R ×6, T ×6, L ×4, S ×4, U ×4

    2 points: D ×4, G ×3

    3 points: B ×2, C ×2, M ×2, P ×2

    4 points: F ×2, H ×2, V ×2, W ×2, Y ×2

    5 points: K ×1

    8 points: J ×1, X ×1

    10 points: Q ×1, Z ×1

    [–]miekao[S] 0 points1 point  (0 children)

    I've always played Scrabble buzzed, and never noticed. Thanks for pointing this out.

    [–]snkscore 2 points3 points  (0 children)

    Great write up!

    [–]XTornado 1 point2 points  (1 child)

    I would have paid somebody else to solve the 2000 captchas :P Or, much better, I would have made a faucet for some new "altcoin" or something similar, requiring people to solve a captcha, using these images, and saving what people typed.

    [–]miekao[S] 4 points5 points  (0 children)

    I think people are imagining the captchas being solved as they're normally seen in the wild: On a web page with a request/response cycle.

    Solving 2000 already-fetched captchas, on a local machine, only took about twice as long as typing 2000 words, if it were done uninterrupted.

    [–]octnoir 1 point2 points  (0 children)

    Very nice article and way to segment step by step how to break a Captcha and use existing tech/methods to do so.

    This is a prime example of a security principle: Captcha is not an unbreakable wall - it will be gotten over and is simply a small obstacle.

    The question is how much of an obstacle you want to create, and how you deal with the ones that get over it. The latter is the more important question for you, the website creator, to answer, and very few consider it once they put up a simple captcha and think they're safe from automation/bots.

    [–]xmsxms 3 points4 points  (2 children)

    Given those captchas I would have expected a 100% success rate, not 50%. No offence to the author, but much more difficult captchas are solved with a much higher success rate.

    [–]Lengador 4 points5 points  (1 child)

    I did think something similar; they look like very easy captchas. That being said, the author claimed to have gotten the result after only 12 hours of coding, which I think is very impressive.

    [–]miekao[S] 5 points6 points  (0 children)

    Yeah, I had a big, blunt hammer that'd work more than half the time, so I called it done.

    [–]muyuu 1 point2 points  (0 children)

    No surprise there. The captcha is as terrible as it gets.

    [–]Crashthatch 0 points1 point  (0 children)

    Great writeup. Very interesting.

    I've often wondered how hard it would be to write something to crack some of these "easy" custom captchas. Had never thought of using a spell-checker / wordlist to spot "impossible" words and improve.

    [–]nakilon 0 points1 point  (0 children)

    Ruby FTW

    [–][deleted]  (4 children)

    [deleted]

      [–]thefallingoff 2 points3 points  (0 children)

      Ruby, as marklit pointed out.

      First, it takes a hash (associative array) and sorts it according to the value, which returns an array of two-element [letter, count] arrays. Then the map method iterates over each element applying the first method (you can probably guess what that does) and returns a new array made up of just the letters. Then join simply joins the array elements and returns a string.
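As an illustration of the expression being described (with made-up counts), the shape would be roughly:

```ruby
# Illustrative reconstruction (made-up counts): sort the letter-count
# hash by value, keep the letters, and join them into one string.
counts = { 'e' => 120, 'j' => 2, 't' => 90, 'q' => 1 }
by_frequency = counts.sort_by { |_letter, n| n }.map(&:first).join
# least frequent first: "qjte"
```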

      [–]marklit 1 point2 points  (2 children)

      It's Ruby, not Python. It's calculating how many times each letter appears in a string.

      [–]kreiger 3 points4 points  (1 child)

      No, it creates a string of the letters in the English alphabet in order of frequency.

      [–]miekao[S] 0 points1 point  (0 children)

      Thanks, this is the correct answer. The code wasn't particularly built for clarity, given its experimental nature.