Lecture 1.1.1

The document provides an overview of data extraction, a critical step in the ETL process that enables organizations to derive insights from unstructured data. It outlines four primary techniques for data extraction: Association, Classification, Clustering, and Regression, along with various methods such as Manual, Web Scraping, OCR-based, and AI-enabled extraction. The document emphasizes the importance of selecting appropriate techniques and methods to optimize data extraction for informed decision-making.

UNIVERSITY INSTITUTE OF ENGINEERING
COMPUTER SCIENCE ENGINEERING

Data Visualization
(CSH-461)

Prepared By: Shivam Sharma (E-16516)

Topic: Introduction to Data Extraction

DISCOVER . LEARN . EMPOWER
Data Extraction:

Data extraction is the first and perhaps most important step of the Extract/Transform/Load (ETL)
process. With properly extracted data, organizations can gain valuable insights, make informed
decisions, and drive efficiency across their workflows.
Data extraction is crucial for almost all organizations, since many different sources generate
large amounts of unstructured data. If the right extraction techniques are not applied,
organizations not only miss out on opportunities but also waste valuable time, money, and
resources.

Techniques for Data Extraction


Data extraction can be divided into four techniques. Which technique to use is decided primarily
by the type of data source. The four data extraction techniques are:

1. Association
2. Classification
3. Clustering
4. Regression
1. Association
The association technique extracts data based on the relationships and patterns
between items in a dataset. It works by identifying frequently occurring
combinations of items; these relationships, in turn, reveal patterns in the data.
This technique uses "support" and "confidence" parameters to identify patterns
within the dataset and make extraction easier. The most frequent use cases for
association are invoice and receipt data extraction.
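The support and confidence parameters mentioned above can be sketched in a few lines. This is a minimal, hypothetical example (the receipt line-items and the 0.5/0.7 thresholds are made up for illustration): support is the fraction of transactions containing an itemset, and confidence estimates how often one item co-occurs given another.

```python
from itertools import combinations

# Hypothetical receipts: each is the set of items on one receipt.
receipts = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) from the transactions."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Keep item pairs that clear (illustrative) support/confidence thresholds.
min_support, min_confidence = 0.5, 0.7
items = set().union(*receipts)
rules = []
for a, b in combinations(sorted(items), 2):
    if support({a, b}, receipts) >= min_support:
        if confidence({a}, {b}, receipts) >= min_confidence:
            rules.append((a, b))

print(rules)
```

With this toy data every pair clears both thresholds, so all three item pairs surface as association rules.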
2. Classification
Classification-based data extraction is the most widely adopted, and one of the
easiest and most efficient, methods of data extraction. In this technique, data is
categorized into predefined classes or labels with the help of predictive algorithms,
and models are trained on this labelled data to perform classification-based extraction.
A common use case for classification-based extraction is managing digital mortgage
or banking systems.
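To make "predefined classes learned from labelled data" concrete, here is a minimal nearest-centroid classifier sketch. The feature vectors and the "invoice"/"statement" labels are hypothetical stand-ins for features a real document model would learn; production systems use far richer models.

```python
# Toy labelled training data: (feature vector, class label). The features
# are hypothetical counts, e.g. (currency symbols seen, date fields seen).
training = [
    ((5, 1), "invoice"),
    ((6, 2), "invoice"),
    ((1, 8), "statement"),
    ((0, 7), "statement"),
]

def centroid(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

# Build one centroid per predefined class from the labelled data.
by_class = {}
for vec, label in training:
    by_class.setdefault(label, []).append(vec)
centroids = {label: centroid(vecs) for label, vecs in by_class.items()}

def classify(vec):
    """Assign `vec` to the class whose centroid is nearest (squared Euclidean)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist2(vec, centroids[label]))

print(classify((4, 1)))   # a new document that resembles the invoices
```

A new document's features are compared against each class centroid, and the closest class wins.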

3. Clustering
Clustering techniques apply algorithms that group similar data points into clusters
based on their characteristics. This is an unsupervised learning technique and does
not require prior labelling of the data.
Clustering is often used as a prerequisite for other data extraction algorithms to
function properly. The most common use case is extracting visual data from images
or posts, where there can be many similarities and differences.
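As a sketch of unsupervised grouping, here is plain k-means on a handful of hypothetical 2-D points (standing in for, say, colour features extracted from images). The points and starting centres are invented for illustration.

```python
# Hypothetical 2-D feature points; two obvious groups by construction.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]

def kmeans(points, centers, iterations=10):
    """Plain k-means: assign each point to its nearest centre, then move
    each centre to the mean of its assigned points. No labels required."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        centers = [tuple(sum(vals) / len(cl) for vals in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)
```

With this data the algorithm settles immediately: three points around (1, 1) and three around (8, 8), each cluster's centre at its mean.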
4. Regression
Each dataset contains data with different variables. Regression techniques are used
to model the relationship between one or more independent variables and a
dependent variable.
Regression-based extraction works with continuous values that define the variables
of the entities in the data. Most commonly, organizations use regression to identify
relationships between dependent and independent variables within datasets.
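A one-variable ordinary least squares fit shows the idea in its simplest form. The data below is hypothetical (imagine x as invoice amount and y as processing time); the formulas are the standard slope = cov(x, y) / var(x), intercept = mean(y) − slope · mean(x).

```python
# Hypothetical continuous data: one independent variable x, one dependent y.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.1, 6.0, 8.2, 9.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least squares for a single independent variable.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

print(round(slope, 2), round(intercept, 2))
```

The fitted line (here roughly y = 1.97x + 0.15) models how the dependent variable moves with the independent one, which is exactly the relationship regression-based extraction identifies.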

Types of Data Extraction
Organizations use several types of data extraction, such as manual, traditional
OCR-based, and web scraping. Each data extraction method uses one of the
techniques covered earlier.

1. Manual data extraction


As the name suggests, the manual data extraction method involves collecting data
manually from different data sources and storing it in a single location, without
the help of any software or tools.
Although manual data extraction is extremely time-consuming and error-prone, it
is still widely used across businesses.

2. Web Scraping
Web scraping refers to the extraction of data from a website. This data is then
exported into a format more useful to the user, be it a spreadsheet or an API.
Although web scraping can be done manually, in most cases it is performed by
automated bots or crawlers, which are cheaper and faster.
In most cases, however, web scraping is not a straightforward task: websites come
in many different formats and may present obstacles such as CAPTCHAs that must
be handled.
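The parsing half of a scraper can be sketched with the standard library's `html.parser`. The HTML snippet below is a made-up stand-in for a fetched page; a real scraper would first download the page (for example with `urllib.request` or an HTTP client) and then parse it the same way.

```python
from html.parser import HTMLParser

# A saved page snippet stands in for a live fetch.
html = """
<table>
  <tr><td>Widget A</td><td>9.99</td></tr>
  <tr><td>Widget B</td><td>4.50</td></tr>
</table>
"""

class TableScraper(HTMLParser):
    """Collects the text of every <td> cell, row by row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_td = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr":
            self.rows.append(self._row)

    def handle_data(self, data):
        if self._in_td:
            self._row.append(data.strip())

scraper = TableScraper()
scraper.feed(html)
print(scraper.rows)   # spreadsheet-ready rows
```

The extracted rows are already in a tabular shape ready to write out as a spreadsheet or feed to an API.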

3. OCR-based data extraction
Optical Character Recognition (OCR) refers to extracting data from printed or
handwritten text, scanned documents, or images containing text, and converting it
into a machine-readable format. OCR-based methods require little to no manual
intervention and have a wide variety of uses across industries.
OCR tools work by preprocessing the image or scanned document and then identifying
individual characters or symbols using pattern matching or feature recognition.
With the help of deep learning, modern OCR tools can read about 97% of text
correctly regardless of font or size, and can also extract data from unstructured
documents.
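The pattern-matching step mentioned above can be illustrated with a deliberately tiny toy: 3x3 glyph bitmaps matched by counting differing pixels. The glyphs and the noisy input are invented for illustration; real OCR engines (such as Tesseract) learn far richer features, but the "compare against known patterns, pick the best match" idea is the same.

```python
# Hypothetical 3x3 glyph bitmaps ("1" = ink, "0" = blank).
TEMPLATES = {
    "I": ("010",
          "010",
          "010"),
    "L": ("100",
          "100",
          "111"),
    "T": ("111",
          "010",
          "010"),
}

def match_glyph(bitmap):
    """Return the template character with the fewest mismatched pixels."""
    def mismatches(a, b):
        return sum(pa != pb
                   for row_a, row_b in zip(a, b)
                   for pa, pb in zip(row_a, row_b))
    return min(TEMPLATES, key=lambda ch: mismatches(bitmap, TEMPLATES[ch]))

# A noisy "T": one pixel flipped in the bottom row.
noisy = ("111",
         "010",
         "011")
print(match_glyph(noisy))
```

Even with one corrupted pixel, the noisy glyph is still closest to the "T" template, which is why pattern matching tolerates imperfect scans.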

4. Template-based data extraction


Template-based data extraction relies on pre-defined templates to extract data from
data sets whose format largely stays the same. For example, when an accounts payable
(AP) department needs to process many invoices of the same format, template-based
extraction can be used, since the data to be extracted remains largely the same
across invoices.
This method is extremely accurate as long as the format stays the same. Problems
arise when the format of the data set changes; this can break template-based
extraction and may require manual intervention.
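A template can be as simple as one regular expression per field. The invoice text and field patterns below are hypothetical; the point is that a fixed layout lets rigid patterns extract fields reliably, and a layout change shows up as a missing field that flags the document for manual review.

```python
import re

# Hypothetical invoice text in a fixed, known layout.
invoice = """\
Invoice No: INV-1042
Date: 2024-03-15
Total: 1,250.00 USD
"""

# One regex per field acts as the "template" for this layout.
TEMPLATE = {
    "number": r"Invoice No:\s*(\S+)",
    "date": r"Date:\s*(\d{4}-\d{2}-\d{2})",
    "total": r"Total:\s*([\d,.]+)",
}

def extract(text, template):
    """Apply each field pattern; a None value means the layout changed
    and the document needs manual review."""
    fields = {}
    for name, pattern in template.items():
        m = re.search(pattern, text)
        fields[name] = m.group(1) if m else None
    return fields

print(extract(invoice, TEMPLATE))
```

Running the same template against an invoice in a different layout would leave some fields as None, which is exactly the failure mode described above.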

5. AI-enabled data extraction
AI-enabled data extraction is the most efficient way to extract data while reducing
errors. It automates the entire extraction process, requiring little to no manual
intervention, and reduces the time and resources invested.
AI-based document processing uses intelligent data interpretation to understand the
context of the data before extracting it. It also cleans up noisy data, removes
irrelevant information, and converts data into a suitable format. AI in data
extraction largely refers to the use of Machine Learning (ML), Natural Language
Processing (NLP), and Optical Character Recognition (OCR) technologies to extract
and process the data.

6. API Integration
API integration is one of the most efficient methods of extracting and transferring
large amounts of data. An API enables fast, smooth extraction of data from different
types of data sources and consolidation of the extracted data in a centralized
system.
One of the biggest advantages of API integration is that it can connect almost any
type of data system, and the extracted data can be used for many different
activities such as analysis, generating insights, or creating reports.
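The consolidation step can be sketched by merging the JSON payloads two systems might return. Both payloads, the field names, and the "CRM"/"billing" systems are hypothetical; a real integration would fetch them over HTTP (e.g. with `urllib.request` or an HTTP client) before merging.

```python
import json

# Hypothetical JSON payloads as two different source APIs might return them.
crm_payload = '{"customers": [{"id": 1, "name": "Acme"}, {"id": 2, "name": "Globex"}]}'
billing_payload = ('{"invoices": [{"customer_id": 1, "amount": 120.0},'
                   ' {"customer_id": 1, "amount": 80.0}]}')

customers = json.loads(crm_payload)["customers"]
invoices = json.loads(billing_payload)["invoices"]

# Consolidate both sources into one centralized view keyed by customer id.
central = {c["id"]: {"name": c["name"], "billed": 0.0} for c in customers}
for inv in invoices:
    central[inv["customer_id"]]["billed"] += inv["amount"]

print(central)
```

The centralized record can then feed analysis, insight generation, or reporting, regardless of which system each field originally came from.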
7. Text pattern matching
Text pattern matching, or text extraction, refers to finding and retrieving specific
patterns within a given data set. A specific sequence of characters or a pattern is
predefined and then searched for within the provided data.
This type of extraction is useful for validating data by finding specific keywords,
phrases, or patterns within a document.
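Predefined patterns are most commonly expressed as regular expressions. The document text and the two patterns below (ISO dates and order numbers of a made-up format) are illustrative assumptions.

```python
import re

# A hypothetical document to search.
document = """
Order #A-1021 confirmed on 2024-02-01.
Refund issued on 2024-02-10.
"""

# Predefined patterns to find within the data set.
date_pattern = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")   # ISO dates
order_pattern = re.compile(r"#([A-Z]-\d+)")           # made-up order-id format

dates = date_pattern.findall(document)
orders = order_pattern.findall(document)
print(dates, orders)
```

Validation is the flip side of the same mechanism: a record that fails to match the expected pattern can be rejected or routed for review.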

8. Database querying
Database querying is the process of requesting and retrieving specific information
or data from a database management system (DBMS) using a query language. It allows
users to interact with databases to extract, manipulate, and analyse data based on
their specific needs.
Structured Query Language (SQL) is the most commonly used query language for
relational databases. Users can specify criteria, such as conditions and filters,
to fetch specific records from the database. Database querying is essential for
making informed decisions and building data-driven businesses.
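A minimal end-to-end query can be shown with Python's built-in `sqlite3` module. The in-memory database and the `orders` table are stand-ins for a production DBMS; the SQL itself (a `WHERE` filter plus `GROUP BY` aggregation) is standard.

```python
import sqlite3

# An in-memory SQLite database stands in for a production DBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "EU", 100.0), (2, "EU", 250.0), (3, "US", 75.0)],
)

# Criteria (WHERE) and aggregation (GROUP BY) fetch exactly the records needed.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders WHERE amount >= 50 "
    "GROUP BY region ORDER BY region"
).fetchall()
print(rows)
conn.close()
```

The same `SELECT ... WHERE ... GROUP BY` shape applies unchanged to larger relational systems such as PostgreSQL or MySQL.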


THANK YOU

For queries
Email: [email protected]
