3.1 - Data Collection - Image and Text

The document discusses collecting and managing image and text data. It defines image and text collection as utilizing digital images and unstructured text for research. It describes various sources for collecting images and text, such as forums, social media, books and articles. It also discusses file formats, volume considerations, labeling, and using tools like ImageKit to manage large image datasets. For text, it describes refining raw text into an intermediate form through techniques like tagging, then analyzing the text through methods like categorization and visualization. Storing both raw and analyzed data is also discussed.

Uploaded by

ziyatogana

Data Collection: Image and Text

CC19 – Data Mining
Agenda
• Image Collection
  • Defining Image Collection
  • Collecting Image Data
  • Managing Image Data
• Text Collection
  • Defining Text Collection
  • Collecting Text Data
  • Managing Text Data
Image Collection
Data Collection: Image and Text
Defining Image Collection
• Images are a visual or mental representation of something
or someone.
• In data science, these images have a class or type that
they belong to.
• Different types of images are used to conduct analysis
and train modern systems.
Defining Image Collection
• Image collection refers to the practice of utilizing digital
images for research and data science.
• Using images for data science is common now but is a
recent practice in the information age.
• Image collection requires its own steps and tools due to
the nature and characteristics of images.
Collecting Image Data
• Collecting image data can be challenging, since images
come from a variety of sources:
  • Forums
  • Data Vendors
  • Digital Creation
  • Image Sharing Platforms
  • Image Scanning
  • Photography
  • Open-Source Datasets
  • Search Engines
  • Social Media
Collecting Image Data
• When collecting images, we need to ensure that we know
what type of images we want to collect.
• Ideally, we should be able to answer these questions
about our image data:
  • What type of images will be gathered?
  • In what format will the images appear?
  • What is the expected volume of images?
Collecting Image Data: Characteristics
• Images organized around specific themes or categories
can provide insights into a variety of topics.
• Understanding the characteristics of images is
important for gleaning patterns in our data.
• That is why knowing the characteristics of our image data
is important.
Collecting Image Data: Characteristics
• Image data has more characteristics than just the image
itself:
Characteristic | Digital Image Details
Data | The viewable image, or the analytic content derived from the image
Metadata | Camera make and model, image timestamp, aperture, shutter speed, caption, author, legal information, copyright information, etc.
Paradata | Image source(s), image file type, image edit history, image sharing history
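The three kinds of characteristics above can be kept together in one record per image. The sketch below is one possible layout, not a standard schema: the class name ImageRecord, its field names, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ImageRecord:
    """One collected image, split into data, metadata, and paradata."""
    # Data: where the viewable image lives (hypothetical path)
    path: str
    # Metadata: facts recorded about the image itself
    metadata: dict = field(default_factory=dict)
    # Paradata: facts about how the image was obtained and handled
    paradata: dict = field(default_factory=dict)


# Sample values are made up for the example
record = ImageRecord(
    path="images/cat_0001.jpg",
    metadata={"camera": "ExampleCam X1", "author": "unknown"},
    paradata={"source": "open-source dataset", "file_type": "JPEG"},
)
print(record.paradata["file_type"])
```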
Collecting Image Data: Image Format
• In addition to knowing the characteristics of image data, it
is important to know how it is formatted.
• The format used for images can affect how they will be
analyzed when fed to an algorithm or program.
• It is also important to consider the tools that will be used to
analyze the images and what formats they support.
Collecting Image Data: Image Format
• These are the most common types of image formats:
  • JPEG (or JPG) – Joint Photographic Experts Group
  • GIF – Graphics Interchange Format
  • PNG – Portable Network Graphics
  • HEIF – High Efficiency Image File Format
  • TIFF – Tagged Image File Format
  • BMP – Windows Bitmap
  • WebP
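File extensions can be wrong, so collected files are often checked by their leading bytes instead. The sketch below guesses the format from the well-known signatures of the formats listed above. The function name detect_image_format is an assumption, and the HEIF brand list is deliberately not exhaustive.

```python
def detect_image_format(header: bytes) -> str:
    """Guess an image format from the first 12 bytes of a file."""
    if header.startswith(b"\xff\xd8\xff"):
        return "JPEG"
    if header.startswith(b"\x89PNG\r\n\x1a\n"):
        return "PNG"
    if header.startswith((b"GIF87a", b"GIF89a")):
        return "GIF"
    if header.startswith((b"II*\x00", b"MM\x00*")):
        return "TIFF"
    if header.startswith(b"BM"):
        return "BMP"
    # WebP is a RIFF container with "WEBP" at bytes 8 to 11
    if header[:4] == b"RIFF" and header[8:12] == b"WEBP":
        return "WebP"
    # HEIF starts with an "ftyp" box; only a few common brands are checked
    if header[4:8] == b"ftyp" and header[8:12] in (b"heic", b"heix", b"mif1", b"msf1"):
        return "HEIF"
    return "unknown"


print(detect_image_format(b"\x89PNG\r\n\x1a\n\x00\x00\x00\x0d"))
```

For a file on disk, the header would be read with something like open(path, "rb").read(12).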
Collecting Image Data: Image Format
• You need to consider the volume and type of images you
are collecting when deciding on the best format.
• Ideally, you would choose the highest quality available for
all images, but the tradeoff is larger file sizes.
• Cornell University, a leading university in data science,
recommends TIFF for images.
Collecting Image Data: Volume
• Your storage is limited, so you should take into
consideration how many images to collect.
• While larger storage devices have become relatively
cheaper in recent years, images have become larger.
• The expectations of big data analytics have also called for
larger and larger image datasets.
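A rough storage estimate before collection starts is simply the image count times the average file size. The sketch below shows that arithmetic; the function name and the example figures (50,000 images at 4 MB each) are assumptions, not numbers from the course.

```python
def estimate_storage_gb(image_count: int, avg_size_mb: float) -> float:
    """Return the approximate dataset size in gigabytes (1 GB = 1000 MB)."""
    return image_count * avg_size_mb / 1000


# 50,000 images at an assumed average of 4 MB each
print(estimate_storage_gb(50_000, 4.0))  # prints 200.0
```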
Collecting Image Data: Volume
• The quickest solution to large images is lossy compression:
converting the images to a lossy format (e.g. JPEG).
• An alternative is storing all the images in a lossless
archive format (e.g. .zip, .7z, or .tar.gz).
• This retains the full quality of the images while reducing
the overall file size of the dataset.
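The lossless option can be sketched with Python's standard zipfile module. The example packs placeholder bytes (standing in for real image files) into an in-memory ZIP archive and checks that unpacking restores every byte. The helper names archive_images and extract_images are assumptions.

```python
import io
import zipfile


def archive_images(images: dict[str, bytes]) -> bytes:
    """Pack image files (name -> raw bytes) into one in-memory ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, data in images.items():
            archive.writestr(name, data)
    return buffer.getvalue()


def extract_images(archive_bytes: bytes) -> dict[str, bytes]:
    """Unpack a ZIP archive back into a name -> raw bytes mapping."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}


# Placeholder bytes stand in for real, uncompressed image files
images = {"a.bmp": b"\x00" * 10_000, "b.bmp": b"\xff" * 10_000}
packed = archive_images(images)
assert extract_images(packed) == images  # every byte is restored
print(len(packed), "bytes archived from", sum(map(len, images.values())))
```

Formats that are already compressed, such as JPEG, usually shrink very little when archived this way.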
Managing Image Data
• Storing and organizing your images is important, so that
you can easily find and sort images as needed.
• This includes labeling, storing images in specific folders,
renaming the images themselves, or editing metadata.
• The effectiveness of these techniques decreases as the
dataset becomes larger.
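For a small dataset, the folder-and-label approach above might look like the sketch below. It copies each image into a folder named after its label and writes a manifest so the original location is not lost. The function name organize_by_label and the manifest columns are assumptions for illustration.

```python
import csv
import shutil
from pathlib import Path


def organize_by_label(labels: dict[Path, str], dest_root: Path) -> Path:
    """Copy each image into dest_root/<label>/ and write a manifest CSV."""
    dest_root.mkdir(parents=True, exist_ok=True)
    manifest_path = dest_root / "manifest.csv"
    with manifest_path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["original_path", "label", "new_path"])
        for source, label in labels.items():
            target_dir = dest_root / label
            target_dir.mkdir(exist_ok=True)
            target = target_dir / source.name
            shutil.copy2(source, target)  # copy, so the raw files stay untouched
            writer.writerow([str(source), label, str(target)])
    return manifest_path
```

Copying rather than moving keeps the raw collection intact, at the cost of extra storage.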
Managing Image Data
• Using simple folders and labels is sufficient for small image
sets (up to about 10,000 images).
• For larger image sets, using a dedicated image
management tool is recommended (e.g. ImageKit).
Text Collection
Data Collection: Image and Text
Defining Text Collection
• Text refers to printed books, documents, and media that
cover different ideas and content.
• In data science, text content is often analyzed to
extract insights, for example the sentiments of users.
• The text content can come from different sources
depending on your needs and goals.
Defining Text Collection
• Text collection is a type of data collection that deals
specifically with unstructured text data.
• It typically makes use of natural language processing
(NLP) techniques to extract insights from the text.
• Text collection is a multidisciplinary field, involving different
techniques such as text retrieval and text analysis.
Collecting Text Data
• Text data comes from a variety of sources:
  • Books
  • Classic Literature
  • Corpora
  • Data Vendors
  • Dictionaries
  • Forums
  • Interview Transcripts
  • Magazines
  • Open-Source Datasets
  • Short Stories
  • Surveys
  • Web Articles
Collecting Text Data
• Text collection has two key phases in its process:
  • Text Refining
  • Knowledge Distillation
Collecting Text Data: Text Refining
• Text refining is transforming free-form text to a chosen
intermediate form (IF).
• This IF can be in a semi-structured form such as
conceptual graphs.
• It can also be in a structured form such as relational data.
Collecting Text Data: Text Refining
• The purpose of turning text data into an IF is to make it
easier to process and organize the text.
• An IF will typically have labels for the individual text, or
tags which describe the topic or idea.
• It might also take note of keywords for sentiment analysis.
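As a minimal sketch of text refining, the example below turns one free-form sentence into a structured record with topic tags and sentiment keywords. The keyword lists, the function name refine, and the record layout are all hypothetical; a real project would choose its own IF and vocabulary.

```python
import re

# Hypothetical keyword lists, made up for this example
TOPIC_KEYWORDS = {
    "delivery": {"shipping", "delivery", "arrived"},
    "quality": {"broken", "sturdy", "quality"},
}
SENTIMENT_KEYWORDS = {"great", "love", "terrible", "broken"}


def refine(doc_id: int, text: str) -> dict:
    """Turn one free-form text into a structured record (the IF)."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return {
        "id": doc_id,
        # A topic tag applies when any of its keywords occurs in the text
        "tags": sorted(t for t, words in TOPIC_KEYWORDS.items() if tokens & words),
        "sentiment_keywords": sorted(tokens & SENTIMENT_KEYWORDS),
    }


print(refine(1, "The delivery was late and the box arrived broken."))
# {'id': 1, 'tags': ['delivery', 'quality'], 'sentiment_keywords': ['broken']}
```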
Collecting Text Data: Text Refining
• Mining a document-based
IF deduces patterns and
relationships across
documents.
• Examples of this are
clustering/visualization and
categorization.
Collecting Text Data: Text Refining
• Mining a concept-based IF
deduces patterns and
relationships across objects
and concepts.
• Examples of this are
predictions and associative
discovery.
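A very simplified stand-in for associative discovery is counting which concepts appear together. The sketch below counts concept pairs across a few made-up records. Fuller approaches, such as association rule mining with support and confidence, go beyond it. The function name concept_pairs and the sample concepts are assumptions.

```python
from collections import Counter
from itertools import combinations


def concept_pairs(records: list[set[str]]) -> Counter:
    """Count how often two concepts appear in the same record."""
    counts = Counter()
    for concepts in records:
        # Sorting makes ("a", "b") and ("b", "a") the same pair
        for pair in combinations(sorted(concepts), 2):
            counts[pair] += 1
    return counts


# Each set holds the concepts found in one document (made-up data)
records = [
    {"price", "delivery"},
    {"price", "delivery", "quality"},
    {"quality", "returns"},
]
print(concept_pairs(records).most_common(1))
# [(('delivery', 'price'), 2)]
```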
Collecting Text Data: Knowledge Distillation
• Knowledge distillation deduces patterns or knowledge
from the IF.
• This is where you will utilize analysis tools or machine
learning models to comb through your text data.
• The tools used for knowledge distillation depend on the
goals of your data mining process.
Collecting Text Data: Knowledge Distillation
• These are examples of methods used for text data
analysis:
  • Text Data Categorization
  • Text Data Extraction
  • Text Data Identification
  • Text Data Parsing
  • Text Data Translation
  • Text Data Visualization
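To make one of these methods concrete, the sketch below illustrates text data extraction with a regular expression that pulls ISO-style dates out of free text. The pattern and the function name extract_dates are assumptions, and the pattern only checks the shape of a date, not whether it is a valid calendar date.

```python
import re

# Matches the shape YYYY-MM-DD; it does not validate month or day ranges
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_dates(text: str) -> list[str]:
    """Return every ISO-style date found in the text, in order."""
    return DATE_PATTERN.findall(text)


print(extract_dates("Ordered on 2024-01-05, delivered on 2024-01-09."))
# ['2024-01-05', '2024-01-09']
```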
Managing Text Data
• Typically, we store both the “raw” data itself and the
analyzed/“new” data together.
• This allows us to validate our text refining process and
ensure that our text is being interpreted correctly.
• We also keep the “raw” data so that we can utilize
multiple data mining techniques on the same data.
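One simple way to keep raw and analyzed data together is to store both in the same record. The sketch below serializes them as one JSON line, so the refined form can always be checked against its source text. The function name make_record and the field names are assumptions.

```python
import json


def make_record(doc_id: int, raw_text: str, refined: dict) -> str:
    """Serialize the raw text and its analyzed form as one JSON line."""
    return json.dumps({"id": doc_id, "raw": raw_text, "refined": refined})


# Sample values are made up for the example
line = make_record(7, "Great phone!", {"tags": ["product"], "sentiment": "positive"})
restored = json.loads(line)
print(restored["raw"])  # the original text is still available for re-analysis
```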
Managing Text Data
• Processing text data is iterative in nature, which means
that we will better understand our data as we analyze it.
• This process can involve various specialized techniques
such as feature selection and feature extraction.
• Due to this, we usually end up with a corpus at the end
of our data mining process.
Managing Text Data
• A corpus acts as a collection of the linguistic patterns
that we have analyzed from our text data.
• This is what allows us to analyze new data and make
predictions.
• When we rebuild a text mining tool, we typically also
rebuild the corpus.
References
• Image Management as a Data Service (iassistquarterly.com)
• Text-Mining-The-state-of-the-art-and-the-challenges.pdf
(researchgate.net)
• Text Data Collection Services | OCR Dataset- GTS
• Getting Started in Text Mining | PLOS Computational Biology
• Automated Data Collection with R – A Practical Guide to Web
Scraping and Text Mining (core.ac.uk)
• Online-Data-Collection.pdf (researchgate.net)
• Text Mining in Data Mining - GeeksforGeeks
