DAWECA Notes

Uploaded by

Moad Elmardi

ELMARDI – EL HAJJAR

An Experimental Investigation of Text-based CAPTCHA Attacks and Their Robustness

• The optical character recognition (OCR) technique can effectively attack the
primitive text-based CAPTCHAs that usually contain simple characters.

• Long short-term memory (LSTM) can recognize a text sequence end to end.

• Instead of collecting a large set of real-world CAPTCHA samples, transfer learning-based methods train a model on synthetic samples and then fine-tune it with a small number of real-world CAPTCHA samples.

• A large text-based CAPTCHA dataset is openly available at
https://www.kaggle.com/datasets/chenxuanxiao/xdcaptcha-dataset and
https://drive.google.com/drive/folders/1tgefqBsNUESpgERgSP21Dn1fadrUrGUf

• Text-based CAPTCHAs can be classified into three categories according to their resistance mechanisms:

o Enriching character shapes: character rotation, multiple fonts, hollow …

o Complicating character structures: overlapping, CCT …

o Adding auxiliary interference: noise arcs, complex backgrounds …

• Attacks are sorted into two categories:

o Segmentation-based: separate the characters and then recognize each one individually.

o Nonsegmentation-based: recognize the whole text sequence in one step with a deep learning model.

Application Système 1
▪ End-to-end methods: train recognition models directly.

▪ Transfer learning-based attacks: fine-tune pretrained models rather than training from scratch.

• Traditional attacks consist of three main steps: preprocessing, segmentation, and recognition. To split the characters accurately, preprocessing is usually required to remove interference.

• Generative adversarial network (GAN)-based models, a deep learning technique, are used in the preprocessing stage to automatically remove the complex interference features of text-based CAPTCHAs.
• Preprocessing:

o Image binarization: transform color images into black and white by changing the color of each pixel using a thresholding algorithm.

o Dilation and erosion: adding and removing pixels around the objects in an
image, respectively.

o CFS (color filling segmentation): by filling the pixels in the same connected domain, it converts hollow characters into solid ones to facilitate better segmentation.
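
The preprocessing steps above can be sketched in plain Python on a tiny image stored as a list of rows. This is an illustrative sketch, not the implementation used in the paper: the fixed threshold of 128, the 4-connected neighbourhood, and all function names are assumptions. In particular, `fill_hollow` is a simplified stand-in for CFS. It fills every background region that is not connected to the image border, so it would also fill the closed counters of letters such as "A", which a real CFS implementation would need to handle separately.

```python
from collections import deque


def binarize(gray, threshold=128):
    """Map a grayscale image (rows of 0-255 ints) to 1 = ink, 0 = background.

    Assumption: dark pixels are ink and a fixed global threshold is enough.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def _neighbors(r, c, h, w):
    """Yield the in-bounds 4-connected neighbours of (r, c)."""
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < h and 0 <= nc < w:
            yield nr, nc


def dilate(img):
    """Add pixels: a pixel becomes ink if it or any 4-neighbour is ink."""
    h, w = len(img), len(img[0])
    return [[1 if img[r][c] or any(img[nr][nc] for nr, nc in _neighbors(r, c, h, w))
             else 0 for c in range(w)] for r in range(h)]


def erode(img):
    """Remove pixels: a pixel stays ink only if it and all of its
    in-bounds 4-neighbours are ink (out-of-bounds neighbours are ignored)."""
    h, w = len(img), len(img[0])
    return [[1 if img[r][c] and all(img[nr][nc] for nr, nc in _neighbors(r, c, h, w))
             else 0 for c in range(w)] for r in range(h)]


def fill_hollow(img):
    """Simplified hole filling: flood-fill the background from the image
    border, then turn every background pixel that was not reached into ink."""
    h, w = len(img), len(img[0])
    outside = [[False] * w for _ in range(h)]
    queue = deque()
    # Seed the flood fill with every background pixel on the border.
    for r in range(h):
        for c in range(w):
            on_border = r in (0, h - 1) or c in (0, w - 1)
            if on_border and img[r][c] == 0:
                outside[r][c] = True
                queue.append((r, c))
    # Breadth-first search over connected background pixels.
    while queue:
        r, c = queue.popleft()
        for nr, nc in _neighbors(r, c, h, w):
            if img[nr][nc] == 0 and not outside[nr][nc]:
                outside[nr][nc] = True
                queue.append((nr, nc))
    return [[0 if outside[r][c] else 1 for c in range(w)] for r in range(h)]
```

A typical order would be `fill_hollow(binarize(gray))` followed by `erode` or `dilate` to thin or thicken strokes before segmentation.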

Breaking CAPTCHAs on the Dark web

Problem 1: We want to do web scraping (data collection), but CAPTCHAs prevent it.

• Scrapers are tools that enable navigation of websites and extraction of relevant
information for the user (more details in the link below).

• CAPTCHAs differentiate between humans and bots. They are easy for humans to solve
but difficult for bots.

• CAPTCHAs are used by website administrators to prevent automated activities such as spam, DDoS, and web scraping.

• Web scrapers are bots.

• https://proxyway.com/guides/how-to-bypass-captcha?ref=parsehub.com: a site for viewing different types of CAPTCHAs.

Problem 2: How can a web scraper bypass a CAPTCHA that prevents it from scraping the web?
What is the impact of breaking CAPTCHAs? What role do OCR and ML play?

Why are CAPTCHAs used on the dark web?

For the same reasons as on the surface web, but also for:

• Preventing bot activity,

• Reducing load on hidden services,

• Mitigating cybersecurity threats,

• Protecting against Tor-specific issues,

• Regulating user access,

• Adhering to cultural and security norms.

Operational methods:

• Two methods for breaking CAPTCHAs:

o Using OCR (e.g., Tesseract)

o Using Machine Learning (e.g., TensorFlow)

Dataset details:

• Two CAPTCHA datasets, each containing 100,000 images.


• CAPTCHAs are images of five characters.

• A third dataset is a combination of the two datasets.

• A test dataset contains 1,000 images (500 from each).

• The characters 'O', 'o', and '0' are absent from the CAPTCHAs.
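
The dataset constraints above (five characters per CAPTCHA, with 'O', 'o', and '0' excluded) can be captured in a small label encoder. This is a hypothetical sketch: the notes do not list the full alphabet, so the code assumes it consists of upper-case letters, lower-case letters, and digits, and the names `ALPHABET`, `encode_label`, and `decode_label` are illustrative.

```python
import string

CAPTCHA_LENGTH = 5            # from the notes: each CAPTCHA has five characters
EXCLUDED = {"O", "o", "0"}    # from the notes: these never appear

# Assumption: the alphabet is letters (both cases) plus digits, minus the
# excluded characters. The notes do not state the alphabet explicitly.
ALPHABET = [
    ch
    for ch in string.ascii_uppercase + string.ascii_lowercase + string.digits
    if ch not in EXCLUDED
]
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def encode_label(text):
    """Turn a five-character CAPTCHA string into a list of class indices."""
    if len(text) != CAPTCHA_LENGTH:
        raise ValueError(f"expected {CAPTCHA_LENGTH} characters, got {len(text)}")
    for ch in text:
        if ch not in CHAR_TO_INDEX:
            raise ValueError(f"character {ch!r} is not in the alphabet")
    return [CHAR_TO_INDEX[ch] for ch in text]


def decode_label(indices):
    """Turn a list of class indices back into the CAPTCHA string."""
    return "".join(ALPHABET[i] for i in indices)
```

Under this assumed alphabet, a model would predict one of `len(ALPHABET)` classes at each of the five positions.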

Comparison methodology:

• Both Tesseract and TensorFlow are compared on the same test dataset.

• Evaluation metrics:

o Success rate: indicates whether the CAPTCHA is solved correctly. If even one of the five characters is predicted incorrectly, the entire CAPTCHA is considered incorrectly solved.

o Accuracy: measured using the Levenshtein distance (https://en.wikipedia.org/wiki/Levenshtein_distance).
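
The two metrics can be sketched as follows. The success rate follows the all-or-nothing definition given above. The notes say only that accuracy is based on the Levenshtein distance, so the normalization used in `mean_similarity` (one minus the distance divided by the longer string's length) is an assumption, and all function names are illustrative.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def success_rate(predictions, truths):
    """Fraction of CAPTCHAs whose prediction matches the truth exactly."""
    solved = sum(p == t for p, t in zip(predictions, truths))
    return solved / len(truths)


def mean_similarity(predictions, truths):
    """Average of 1 - distance / max(len) over all pairs.

    Assumption: this normalization is one plausible way to turn the
    Levenshtein distance into an accuracy-like score in [0, 1].
    """
    scores = []
    for p, t in zip(predictions, truths):
        longest = max(len(p), len(t))
        scores.append(1.0 if longest == 0 else 1 - levenshtein(p, t) / longest)
    return sum(scores) / len(scores)
```

With this split, a prediction that gets four of five characters right counts as a failure for the success rate but still contributes a high score to the distance-based measure.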

Evaluation results:

• TensorFlow Success Rate: DS2 > DS1 > DS1+2

• Tesseract Success Rate: DS1 > DS2

o Note: TensorFlow outperforms Tesseract for this metric.

• TensorFlow Accuracy: DS2 > DS1 > DS1+2

• Tesseract Accuracy: DS1 > DS2
