Lecture 1.1.1

The document provides an overview of data extraction, a critical step in the ETL process that enables organizations to derive insights from unstructured data. It outlines four primary techniques for data extraction: Association, Classification, Clustering, and Regression, along with various methods such as Manual, Web Scraping, OCR-based, and AI-enabled extraction. The document emphasizes the importance of selecting appropriate techniques and methods to optimize data extraction for informed decision-making.

UNIVERSITY INSTITUTE OF ENGINEERING
COMPUTER SCIENCE ENGINEERING

Data Visualization
(CSH-461)

Prepared By: Shivam Sharma (E-16516)

Topic: Introduction to Data Extraction

DISCOVER . LEARN . EMPOWER
Data Extraction:

Data extraction is the first and perhaps most important step of the Extract/Transform/Load (ETL)
process. With properly extracted data, organizations can gain valuable insights, make informed
decisions, and drive efficiency across their workflows.
Data extraction is crucial for almost all organizations, since many different sources generate
large amounts of unstructured data. If the right extraction techniques are not applied,
organizations not only miss out on opportunities but also waste valuable time, money, and
resources.

Techniques for Data Extraction


Data extraction can be divided into four techniques. Which technique to use is decided primarily
by the type of data source. The four data extraction techniques are:

1. Association
2. Classification
3. Clustering
4. Regression
1. Association
The association technique extracts data based on the relationships and patterns
between items in a dataset. It works by identifying frequently occurring
combinations of items; these relationships, in turn, reveal patterns in the data.
This technique uses "support" and "confidence" parameters to identify patterns
within the dataset and make extraction easier. The most frequent use cases for
association are invoice and receipt data extraction.
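The support and confidence parameters mentioned above can be sketched in a few lines. This is a minimal, hypothetical example (the receipt line-items and the 0.5/0.7 thresholds are made up for illustration): support is the fraction of transactions containing an itemset, and confidence estimates how often one item co-occurs given another.

```python
from itertools import combinations

# Hypothetical receipts: each is the set of items on one receipt.
receipts = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) from the transactions."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Keep item pairs that clear (illustrative) support/confidence thresholds.
min_support, min_confidence = 0.5, 0.7
items = set().union(*receipts)
rules = []
for a, b in combinations(sorted(items), 2):
    if support({a, b}, receipts) >= min_support:
        if confidence({a}, {b}, receipts) >= min_confidence:
            rules.append((a, b))

print(rules)
```

With this toy data every pair clears both thresholds, so all three item pairs surface as association rules.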
2. Classification
Classification-based data extraction is the most widely adopted, and one of the
easiest and most efficient, methods of data extraction. In this technique, data is
categorized into predefined classes or labels with the help of predictive algorithms,
and models are trained on this labelled data to perform classification-based extraction.
A common use case for classification-based extraction is managing digital mortgage
or banking systems.
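To make "predefined classes learned from labelled data" concrete, here is a minimal nearest-centroid classifier sketch. The feature vectors and the "invoice"/"statement" labels are hypothetical stand-ins for features a real document model would learn; production systems use far richer models.

```python
# Toy labelled training data: (feature vector, class label). The features
# are hypothetical counts, e.g. (currency symbols seen, date fields seen).
training = [
    ((5, 1), "invoice"),
    ((6, 2), "invoice"),
    ((1, 8), "statement"),
    ((0, 7), "statement"),
]

def centroid(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

# Build one centroid per predefined class from the labelled data.
by_class = {}
for vec, label in training:
    by_class.setdefault(label, []).append(vec)
centroids = {label: centroid(vecs) for label, vecs in by_class.items()}

def classify(vec):
    """Assign `vec` to the class whose centroid is nearest (squared Euclidean)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist2(vec, centroids[label]))

print(classify((4, 1)))   # a new document that resembles the invoices
```

A new document's features are compared against each class centroid, and the closest class wins.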

3. Clustering
Clustering techniques apply algorithms that group similar data points into clusters
based on their characteristics. This is an unsupervised learning technique and does
not require prior labelling of the data.
Clustering is often used as a prerequisite for other data extraction algorithms to
function properly. The most common use case is extracting visual data from images
or posts, where there can be many similarities and differences.
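As a sketch of unsupervised grouping, here is plain k-means on a handful of hypothetical 2-D points (standing in for, say, colour features extracted from images). The points and starting centres are invented for illustration.

```python
# Hypothetical 2-D feature points; two obvious groups by construction.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]

def kmeans(points, centers, iterations=10):
    """Plain k-means: assign each point to its nearest centre, then move
    each centre to the mean of its assigned points. No labels required."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        centers = [tuple(sum(vals) / len(cl) for vals in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)
```

With this data the algorithm settles immediately: three points around (1, 1) and three around (8, 8), each cluster's centre at its mean.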
4. Regression
Each dataset contains data with different variables. Regression techniques are used
to model the relationship between one or more independent variables and a
dependent variable.
Regression-based extraction works with continuous values that define the variables
of the entities in the data. Most commonly, organizations use regression to identify
relationships between dependent and independent variables within datasets.
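A one-variable ordinary least squares fit shows the idea in its simplest form. The data below is hypothetical (imagine x as invoice amount and y as processing time); the formulas are the standard slope = cov(x, y) / var(x), intercept = mean(y) − slope · mean(x).

```python
# Hypothetical continuous data: one independent variable x, one dependent y.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.1, 6.0, 8.2, 9.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least squares for a single independent variable.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

print(round(slope, 2), round(intercept, 2))
```

The fitted line (here roughly y = 1.97x + 0.15) models how the dependent variable moves with the independent one, which is exactly the relationship regression-based extraction identifies.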

Types of Data Extraction
Organizations use several types of data extraction, such as manual, traditional
OCR-based, and web scraping. Each data extraction method uses one of the
techniques covered earlier.

1. Manual data extraction


As the name suggests, the manual data extraction method involves collecting data
manually from different data sources and storing it in a single location, without
the help of any software or tools.
Although manual data extraction is extremely time-consuming and error-prone, it
is still widely used across businesses.

2. Web Scraping
Web scraping refers to the extraction of data from a website. This data is then
exported into a format more useful to the user, be it a spreadsheet or an API.
Although web scraping can be done manually, in most cases it is performed by
automated bots or crawlers, which are cheaper and faster.
In most cases, however, web scraping is not a straightforward task: websites come
in many different formats and may present obstacles such as CAPTCHAs that must
be handled.
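The parsing half of a scraper can be sketched with the standard library's `html.parser`. The HTML snippet below is a made-up stand-in for a fetched page; a real scraper would first download the page (for example with `urllib.request` or an HTTP client) and then parse it the same way.

```python
from html.parser import HTMLParser

# A saved page snippet stands in for a live fetch.
html = """
<table>
  <tr><td>Widget A</td><td>9.99</td></tr>
  <tr><td>Widget B</td><td>4.50</td></tr>
</table>
"""

class TableScraper(HTMLParser):
    """Collects the text of every <td> cell, row by row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_td = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr":
            self.rows.append(self._row)

    def handle_data(self, data):
        if self._in_td:
            self._row.append(data.strip())

scraper = TableScraper()
scraper.feed(html)
print(scraper.rows)   # spreadsheet-ready rows
```

The extracted rows are already in a tabular shape ready to write out as a spreadsheet or feed to an API.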

3. OCR-based data extraction
Optical Character Recognition (OCR) refers to extracting data from printed or
handwritten text, scanned documents, or images containing text, and converting it
into a machine-readable format. OCR-based methods require little to no manual
intervention and have a wide variety of uses across industries.
OCR tools work by preprocessing the image or scanned document and then identifying
individual characters or symbols using pattern matching or feature recognition.
With the help of deep learning, modern OCR tools can read about 97% of text
correctly regardless of font or size, and can also extract data from unstructured
documents.
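The pattern-matching step mentioned above can be illustrated with a deliberately tiny toy: 3x3 glyph bitmaps matched by counting differing pixels. The glyphs and the noisy input are invented for illustration; real OCR engines (such as Tesseract) learn far richer features, but the "compare against known patterns, pick the best match" idea is the same.

```python
# Hypothetical 3x3 glyph bitmaps ("1" = ink, "0" = blank).
TEMPLATES = {
    "I": ("010",
          "010",
          "010"),
    "L": ("100",
          "100",
          "111"),
    "T": ("111",
          "010",
          "010"),
}

def match_glyph(bitmap):
    """Return the template character with the fewest mismatched pixels."""
    def mismatches(a, b):
        return sum(pa != pb
                   for row_a, row_b in zip(a, b)
                   for pa, pb in zip(row_a, row_b))
    return min(TEMPLATES, key=lambda ch: mismatches(bitmap, TEMPLATES[ch]))

# A noisy "T": one pixel flipped in the bottom row.
noisy = ("111",
         "010",
         "011")
print(match_glyph(noisy))
```

Even with one corrupted pixel, the noisy glyph is still closest to the "T" template, which is why pattern matching tolerates imperfect scans.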

4. Template-based data extraction


Template-based data extraction relies on pre-defined templates to extract data from
data sets whose format largely stays the same. For example, when an accounts payable
(AP) department needs to process many invoices of the same format, template-based
extraction can be used, since the data to be extracted remains largely the same
across invoices.
This method is extremely accurate as long as the format stays the same. Problems
arise when the format of the data set changes; this can break template-based
extraction and may require manual intervention.
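A template can be as simple as one regular expression per field. The invoice text and field patterns below are hypothetical; the point is that a fixed layout lets rigid patterns extract fields reliably, and a layout change shows up as a missing field that flags the document for manual review.

```python
import re

# Hypothetical invoice text in a fixed, known layout.
invoice = """\
Invoice No: INV-1042
Date: 2024-03-15
Total: 1,250.00 USD
"""

# One regex per field acts as the "template" for this layout.
TEMPLATE = {
    "number": r"Invoice No:\s*(\S+)",
    "date": r"Date:\s*(\d{4}-\d{2}-\d{2})",
    "total": r"Total:\s*([\d,.]+)",
}

def extract(text, template):
    """Apply each field pattern; a None value means the layout changed
    and the document needs manual review."""
    fields = {}
    for name, pattern in template.items():
        m = re.search(pattern, text)
        fields[name] = m.group(1) if m else None
    return fields

print(extract(invoice, TEMPLATE))
```

Running the same template against an invoice in a different layout would leave some fields as None, which is exactly the failure mode described above.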

5. AI-enabled data extraction
AI-enabled data extraction is the most efficient way to extract data while reducing
errors. It automates the entire extraction process, requiring little to no manual
intervention, and reduces the time and resources invested.
AI-based document processing uses intelligent data interpretation to understand the
context of the data before extracting it. It also cleans up noisy data, removes
irrelevant information, and converts data into a suitable format. AI in data
extraction largely refers to the use of Machine Learning (ML), Natural Language
Processing (NLP), and Optical Character Recognition (OCR) technologies to extract
and process the data.

6. API Integration
API integration is one of the most efficient methods of extracting and transferring
large amounts of data. An API enables fast, smooth extraction of data from different
types of data sources and consolidation of the extracted data in a centralized
system.
One of the biggest advantages of API integration is that it can connect almost any
type of data system, and the extracted data can be used for many different
activities such as analysis, generating insights, or creating reports.
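The consolidation step can be sketched by merging the JSON payloads two systems might return. Both payloads, the field names, and the "CRM"/"billing" systems are hypothetical; a real integration would fetch them over HTTP (e.g. with `urllib.request` or an HTTP client) before merging.

```python
import json

# Hypothetical JSON payloads as two different source APIs might return them.
crm_payload = '{"customers": [{"id": 1, "name": "Acme"}, {"id": 2, "name": "Globex"}]}'
billing_payload = ('{"invoices": [{"customer_id": 1, "amount": 120.0},'
                   ' {"customer_id": 1, "amount": 80.0}]}')

customers = json.loads(crm_payload)["customers"]
invoices = json.loads(billing_payload)["invoices"]

# Consolidate both sources into one centralized view keyed by customer id.
central = {c["id"]: {"name": c["name"], "billed": 0.0} for c in customers}
for inv in invoices:
    central[inv["customer_id"]]["billed"] += inv["amount"]

print(central)
```

The centralized record can then feed analysis, insight generation, or reporting, regardless of which system each field originally came from.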
7. Text pattern matching
Text pattern matching, or text extraction, refers to finding and retrieving specific
patterns within a given data set. A specific sequence of characters or a pattern is
predefined and then searched for within the provided data.
This type of extraction is useful for validating data by finding specific keywords,
phrases, or patterns within a document.
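Predefined patterns are most commonly expressed as regular expressions. The document text and the two patterns below (ISO dates and order numbers of a made-up format) are illustrative assumptions.

```python
import re

# A hypothetical document to search.
document = """
Order #A-1021 confirmed on 2024-02-01.
Refund issued on 2024-02-10.
"""

# Predefined patterns to find within the data set.
date_pattern = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")   # ISO dates
order_pattern = re.compile(r"#([A-Z]-\d+)")           # made-up order-id format

dates = date_pattern.findall(document)
orders = order_pattern.findall(document)
print(dates, orders)
```

Validation is the flip side of the same mechanism: a record that fails to match the expected pattern can be rejected or routed for review.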

8. Database querying
Database querying is the process of requesting and retrieving specific information
or data from a database management system (DBMS) using a query language. It allows
users to interact with databases to extract, manipulate, and analyse data based on
their specific needs.
Structured Query Language (SQL) is the most commonly used query language for
relational databases. Users can specify criteria, such as conditions and filters,
to fetch specific records from the database. Database querying is essential for
making informed decisions and building data-driven businesses.
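A minimal end-to-end query can be shown with Python's built-in `sqlite3` module. The in-memory database and the `orders` table are stand-ins for a production DBMS; the SQL itself (a `WHERE` filter plus `GROUP BY` aggregation) is standard.

```python
import sqlite3

# An in-memory SQLite database stands in for a production DBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "EU", 100.0), (2, "EU", 250.0), (3, "US", 75.0)],
)

# Criteria (WHERE) and aggregation (GROUP BY) fetch exactly the records needed.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders WHERE amount >= 50 "
    "GROUP BY region ORDER BY region"
).fetchall()
print(rows)
conn.close()
```

The same `SELECT ... WHERE ... GROUP BY` shape applies unchanged to larger relational systems such as PostgreSQL or MySQL.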


THANK YOU

For queries
Email: [email protected]
