CAPTCHA WIIB!

Posted on June 15th, 2007


Oh, barbarian acronym. I hate it. It sounds like a buzz word, but by buzz word, I mean a horrible one. :dead:

Ajax

AJAX, well, it’s funny, sounds like this good old cleaning product and after all, it’s fine. Let’s not list now the hundreds, I mean thousands, hundreds of thousands of billions of acronyms and high tech buzz words created for the sole purpose of, well, making you sound cool and trendy !

So yea, CAPTCHA, meaning Completely Automated Public Turing test to tell Computers and Humans Apart. In other words, “Write the Image text In the Box”, i mean WIIB ! Yeap Wiib. Yeap this acronym now officially exists since approximately.. now.

Anyway, this sounds so much more sexy now. Let’s go on!

So this WIIB simply ensures that you are a human and not a computer, since a computer has to recognize the text and it’s not an easy task (or so one would think). I’m sure you’ve seen dozens of these.

Some companies that recently advertised themselves in the news drew my attention to how little security this kind of protection provides. :read:

Let’s take the newest one, the OCR Research team. They claim to break WIIBs for a living. Worse, they claim to rate well-known companies’ WIIBs. They also claim to provide (for a high fee) their own WIIB, which is just a 3D textured render of the words, with ugly colors (hardly human readable, to be honest). Introducing… Teabag 3D. Oh yeah!

Take 1, TeaBag 3D:

I’ll quote (including the spelling errors :evil: )

This version uses “military” style coloring, and it seems to make picture more pretty looking and readable. Well, in general it’s still not perfectly readable but it’s extremaly hard to break, that’s for sure!

Teabag 3D

Can you read it ? I see a V, E then.. is it T ? is it R ? hmm then.. O or 0 (zero) then a dot and.. well.. something. ooh! a nice little mountain ! :)

So yea, the idea is funny and original but far from good, as this is easy to render into wire frame (look at the polygons, the background ones are different), then separate the letters and finally recognize each 3D model as a flat letter. Of course, the computer is not gonna recognise the characters I couldn’t recognize myself as a human, simply because they don’t look like characters at all :)

Take 2, the WIIB rating:

Alright, now let’s see their competence level. Note that I consider my own level to be sailing around null and shining by its absence.

I selected the ua.fm WIIB. Well right, they only test weak ones and thus this one is rather weak, that’s true. The number of stars indicates the “strength level”, meaning the more stars, the better the protection. Alright boys and girls, here we go:

ua.fm 1

First sample. Looks simple to me, besides the R, which is reversed, so I actually don’t know if I should treat it as a letter. The letter is weird because it’s Ukrainian (just set your OCR dictionary to Ukrainian to recognize it :s )

Here’s another from the same site:

ua.fm 2

Well, right. They are second best in their tests which is pretty high (but hey, they’re not gonna put good ones since they sell their own;)

I’ll quote again

Strong enough CAPTCHA

Alright. So I made a 20-line Python script (using the PIL module) which removes generic noise while keeping precision (unlike Gaussian filters). Of course you need different filters for different kinds of noise, but a lot of websites are beaten by this very simple one.

def iswhite(pixel):
    """Return True if an RGB pixel is within the threshold of pure white
    on every channel, so near-white pixels count as white too."""
    white = (255, 255, 255)
    t = 5  # threshold
    for i in range(3):
        if abs(pixel[i] - white[i]) > t:
            return False
    return True

def process_dots(img):
    """Find if we have pixel correlations in the pixel table.
    If not, then this is a lone pixel (dot-style noise) and must be erased by white.
    This algo checks horizontal, vertical and one diagonal.
    img is assumed to be an RGB PIL image and is modified in place."""
    white = (255, 255, 255)
    (x, y) = img.size
    pix = img.load()
    for m_x in range(x):
        for m_y in range(y):
            # Blank and skip the 2-pixel top/left border, so the
            # [-1] and [-2] lookups below always stay inside the image.
            if m_x < 2 or m_y < 2:
                pix[m_x, m_y] = white
                continue
            if iswhite(pix[m_x, m_y]):
                # A non-white pixel squeezed between two white ones is noise.
                if not iswhite(pix[m_x, m_y-1]) and iswhite(pix[m_x, m_y-2]):
                    pix[m_x, m_y-1] = white
                if not iswhite(pix[m_x-1, m_y]) and iswhite(pix[m_x-2, m_y]):
                    pix[m_x-1, m_y] = white
                if not iswhite(pix[m_x-1, m_y-1]) and iswhite(pix[m_x-2, m_y-2]):
                    pix[m_x-1, m_y-1] = white
    return img
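
To see what this kind of filter does without loading a real image, here is a minimal sketch of the same "dark pixel squeezed between two white ones" rule on a plain text grid. It is not the script above: the name erase_sandwiched and the '#'/'.' grid are made up for illustration, and it reads from an untouched copy of the input instead of editing in place, so results on a real image may differ slightly.

```python
def erase_sandwiched(grid):
    """Erase every dark cell that sits between two white cells.

    grid: list of equal-length strings, '#' is dark and '.' is white.
    Checks horizontal, vertical and one diagonal, reading from the
    untouched input so earlier erasures do not cascade.
    """
    height, width = len(grid), len(grid[0])
    directions = [(1, 0), (0, 1), (1, 1)]

    def is_white(x, y):
        # Anything outside the grid counts as white.
        if not (0 <= x < width and 0 <= y < height):
            return True
        return grid[y][x] == "."

    cleaned = [list(row) for row in grid]
    for y in range(height):
        for x in range(width):
            if is_white(x, y):
                continue
            for dx, dy in directions:
                if is_white(x - dx, y - dy) and is_white(x + dx, y + dy):
                    cleaned[y][x] = "."
                    break
    return ["".join(row) for row in cleaned]


sample = [
    ".......",
    ".###...",
    ".###.#.",
    ".###...",
    ".......",
]
for row in erase_sandwiched(sample):
    print(row)
```

The lone dot on the right disappears, and two corners of the solid block get nibbled by the diagonal check as well, which is the price of such a cheap rule: strokes come out a little thinner.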

So yeah, here's the image after the 0.01 ms processing:

ua.fm processed

Kinda clear uh ?

Let's do it:

# gocr -p -f UTF8 -i clean_01.png

я965724

Uhm yea.. hard =p If you're ever unable to write a correct algorithm for recognizing shapes anyway, you can do what they all do: use a small pre-coded neural network, associate a few samples with letters, and tada, it will recognize them for you and the detection rate will still be fairly high! It makes things much simpler.
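
The "associate a few samples with letters" idea can be sketched even without a neural network. The stand-in below is plain nearest-template matching (count the cells that differ, pick the closest sample), which is a much cruder technique than a neural network but shows the principle. The 3x5 bitmaps and the names distance, classify and TEMPLATES are invented for this example.

```python
def distance(a, b):
    """Number of cells that differ between two same-sized bitmaps."""
    return sum(
        cell_a != cell_b
        for row_a, row_b in zip(a, b)
        for cell_a, cell_b in zip(row_a, row_b)
    )


def classify(glyph, templates):
    """Return the label whose template differs from glyph in the fewest cells."""
    return min(templates, key=lambda label: distance(glyph, templates[label]))


# Hypothetical reference bitmaps; a real setup would crop them from solved samples.
TEMPLATES = {
    "1": [".#.", "##.", ".#.", ".#.", "###"],
    "7": ["###", "..#", ".#.", ".#.", ".#."],
    "4": ["#.#", "#.#", "###", "..#", "..#"],
}

noisy_seven = ["###", "..#", ".#.", "##.", ".#."]  # a "7" with one flipped cell
print(classify(noisy_seven, TEMPLATES))
```

With clean, consistently rendered characters, even this naive comparison tolerates a flipped pixel or two.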

Psst: I didn't even need to separate the letters in that one! :evil:
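
When the letters do need separating, the usual first attempt is to cut wherever a column contains no ink at all. A minimal sketch, assuming the glyphs never touch each other; split_columns, crop and the text grid are illustrative names, not part of the script above.

```python
def split_columns(grid):
    """Return (start, end) column ranges, end exclusive, that contain dark cells."""
    width = len(grid[0])
    has_ink = [any(row[x] == "#" for row in grid) for x in range(width)]
    spans, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                     # a glyph begins here
        elif not ink and start is not None:
            spans.append((start, x))      # an empty column ends the glyph
            start = None
    if start is not None:
        spans.append((start, width))      # glyph touching the right edge
    return spans


def crop(grid, span):
    """Cut one glyph out of the grid using a (start, end) column range."""
    start, end = span
    return [row[start:end] for row in grid]


line = [
    "##..#.###",
    "#...#...#",
    "##..#..#.",
]
print(split_columns(line))
```

Each cropped glyph can then be handed to whatever recognizer is used, one character at a time.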

Ethics

The main issue with such companies is ethical. Usually, they sell their services to spammers/etc (although this one claims not to). Then if they don't, they sell their services to test your CAPTCHA WIIB.

All fine, yea, but what they are not telling you is that we are at a level of knowledge where anyone who is able to code or script something can figure out how to break most of them.

And if you can't, usually, a human cannot read them either.

Or, you simply need to spend more time on it than a single hour (like recognizing the lines striking through the letters and removing them: the shapes are usually long and more or less straight, so it's easy in theory but the algorithm is a bit annoying to code)
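
The simplest case of that idea can be sketched in a few lines: a perfectly horizontal, one-pixel-thick line. This assumes that no letter produces a horizontal run as long as min_run, and that a line cell with ink directly above or below it belongs to a letter and should be kept. Slanted or thick lines are where it gets annoying, and this sketch does not handle them. The function name and the text grid are invented for illustration.

```python
def remove_horizontal_strikes(grid, min_run=6):
    """Erase long horizontal runs of dark cells, keeping letter crossings.

    grid: list of equal-length strings, '#' is dark and '.' is white.
    """
    height, width = len(grid), len(grid[0])
    cleaned = [list(row) for row in grid]

    def dark(x, y):
        return 0 <= y < height and grid[y][x] == "#"

    for y in range(height):
        x = 0
        while x < width:
            if grid[y][x] != "#":
                x += 1
                continue
            end = x
            while end < width and grid[y][end] == "#":
                end += 1                  # walk to the end of this dark run
            if end - x >= min_run:
                for cx in range(x, end):
                    # Cells with ink right above or below are probably letter strokes.
                    if not dark(cx, y - 1) and not dark(cx, y + 1):
                        cleaned[y][cx] = "."
            x = end
    return ["".join(row) for row in cleaned]


struck = [
    "#...#....",
    "#...#....",
    "#########",
    "#...#....",
    ".........",
]
for row in remove_horizontal_strikes(struck):
    print(row)
```

The two vertical strokes survive intact while the rest of the strike line is erased.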

And I didn't even rant about the coding part (often, you don't need to recognize the text on the image to get the value... !)
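
One classic example of such a coding mistake is a form that ships the expected answer to the browser alongside the image. Below is a minimal self-check a site owner could run on their own page, assuming the leak takes the form of a hidden input field. The markup, the field names and the helper leaks_answer are all hypothetical; only html.parser from the standard library is real.

```python
from html.parser import HTMLParser


class HiddenFieldCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "input" and (attributes.get("type") or "").lower() == "hidden":
            self.hidden[attributes.get("name")] = attributes.get("value")


def leaks_answer(html, answer):
    """True if the expected answer appears verbatim in a hidden field."""
    collector = HiddenFieldCollector()
    collector.feed(html)
    return answer in collector.hidden.values()


# Hypothetical form markup, invented for this example.
page = """
<form method="post">
  <img src="/captcha.png">
  <input type="hidden" name="captcha_expected" value="965724">
  <input type="text" name="captcha_answer">
</form>
"""
print(leaks_answer(page, "965724"))
```

If a check like this ever returns True on a real form, no amount of image distortion will help.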

Then they sell their own, which someone else will break. Including the nifty 3D thing: I won't publish the code for it, but I gave the tips.

Last thoughts

While this kind of trick works against stupid spammers and is cheap to implement, this is of course not the ultimate solution and will be totally void one day or the next.

Research has been done in other directions (like showing 6 images with a bear and one with a car, and asking which one doesn't fit; or asking questions like "what is the animal in the picture?"). They all have weaknesses and strengths, and I didn't find the ideal one either.

As usual, everything is discussed and linked for you, by wikipedia: http://en.wikipedia.org/wiki/Captcha

4 Responses to “CAPTCHA WIIB!”

  1. bri says:

    everybody knows that the real captchas these days are the poor man's captchas, the captchas for the guy who doesn't want to spend 1000 years turning text into unreadable images: the sentence.

    example:
    one and one (in digits)?
    -> answer: 2
    three times two (in letters)?
    -> answer: six

    That's the future. Even better: general knowledge questions. That way you also get the benefit of keeping idiots off your site, and these days that's worth gold.

    See ya

  2. kang says:

    Those are super simple to "break": they're easy to detect. Images are possible (as demonstrated, not too hard), but not as easy as text :p

    Otherwise, general knowledge questions sound like a fun idea :) I might replace mine if I find a free database along the lines of "Questions pour un champion" :)

    On top of that, I wouldn't be able to reply to comments anymore if I can't find the answers :/

  3. tucker says:

    $ gocr -p -f UTF8 -i clean_01.png
    _9g_724

    that be output from gocr v0.48 20090802 on urxvt v.9.06-1

    is there something you aren’t telling us?

  4. kang says:

    this is a rather old entry thus using an old version of gocr
    in any case, it’s possible that you’re missing the proper fontset/encoding for recognition. (i don’t remember exactly, this being 3 years ago)
    the “-p” means it’s using a database, probably for the reversed R.
    it’s likely that newer versions or newer ocrs do this recognition differently as well (and probably better, except not in your case apparently)
    usually, as long as the characters are all clear (not barred, consistent colors, etc) most ocr’s are going to recognise them
