3.1 - Data Collection - Image and Text

The document discusses collecting and managing image and text data. It defines image and text collection as utilizing digital images and unstructured text for research. It describes various sources for collecting images and text, such as forums, social media, books and articles. It also discusses file formats, volume considerations, labeling, and using tools like ImageKit to manage large image datasets. For text, it describes refining raw text into an intermediate form through techniques like tagging, then analyzing the text through methods like categorization and visualization. Storing both raw and analyzed data is also discussed.

Uploaded by

ziyatogana
Copyright
© All Rights Reserved

Data Collection: Image and Text

CC19 – Data Mining
Agenda
• Image Collection
  • Defining Image Collection
  • Collecting Image Data
  • Managing Image Data
• Text Collection
  • Defining Text Collection
  • Collecting Text Data
  • Managing Text Data
Image Collection
Data Collection: Image and Text
Defining Image Collection
• An image is a visual or mental representation of
something or someone.
• In data science, these images have a class or type that
they belong to.
• Different types of images are used to conduct analysis
and train modern systems.
Defining Image Collection
• Image collection refers to the practice of utilizing digital
images for research and data science.
• Using images for data science is now common, but it is a
recent practice in the information age.
• Image collection requires its own steps and tools due to
the nature and characteristics of images.
Collecting Image Data
• Collecting image data can be challenging, since images
come from a variety of sources:
• Forums
• Data Vendors
• Digital Creation
• Image Sharing Platforms
• Image Scanning
• Photography
• Open-Source Datasets
• Search Engines
• Social Media
Collecting Image Data
• When collecting images, we need to ensure that we know
what type of images we want to collect.
• Ideally, we should be able to answer these questions
about our image data:
• What type of images will be gathered?
• In what format will the images appear?
• What is the expected volume of images?
Collecting Image Data: Characteristics
• Images organized around specific themes or categories
can provide insights into a variety of topics.
• Understanding the characteristics of images is
important for gleaning patterns from our data.
• That is why we should know the characteristics of our
image data before we start collecting it.
Collecting Image Data: Characteristics
• Image data has more characteristics than just the image
itself:

Characteristic  Digital Image Details
Data            The viewable image, or the analytic content
                derived from the image
Metadata        Camera make and model, image timestamp,
                aperture, shutter speed, caption, author,
                legal information, copyright information, etc.
Paradata        Image source(s), image file type, image edit
                history, image sharing history
Collecting Image Data: Image Format
• In addition to knowing the characteristics of image data, it
is important to know how it is formatted.
• The format used for images can affect how they will be
analyzed when fed to an algorithm or program.
• It is also important to consider the tools that will be used to
analyze the images and what formats they support.
Collecting Image Data: Image Format
• These are the most common types of image formats:
• JPEG (or JPG) – Joint Photographic Experts Group
• GIF – Graphics Interchange Format
• PNG – Portable Network Graphics
• HEIF – High Efficiency Image File Format
• TIFF – Tagged Image File Format
• BMP – Windows Bitmap
• WebP
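When images arrive from mixed sources, the file extension is not always trustworthy, so it can help to check the format from the file's leading bytes. The sketch below is one possible way to do that using well-known file signatures; the function name `sniff_image_format` is hypothetical, and the HEIF brand list is a partial assumption rather than a complete registry.

```python
from typing import Optional

# File signatures ("magic bytes") for the formats listed above.
_SIGNATURES = [
    (b"\xff\xd8\xff", "JPEG"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"II*\x00", "TIFF"),  # little-endian TIFF
    (b"MM\x00*", "TIFF"),  # big-endian TIFF
    (b"BM", "BMP"),        # only two bytes, so false positives are possible
]

# Assumption: a small subset of HEIF "brands"; real files may use others.
_HEIF_BRANDS = (b"heic", b"heix", b"mif1", b"msf1")


def sniff_image_format(header: bytes) -> Optional[str]:
    """Guess the image format from the first bytes of a file."""
    for signature, name in _SIGNATURES:
        if header.startswith(signature):
            return name
    # WebP is a RIFF container: "RIFF", 4 size bytes, then "WEBP".
    if header[:4] == b"RIFF" and header[8:12] == b"WEBP":
        return "WebP"
    # HEIF is an ISO base media file: 4 size bytes, "ftyp", then a brand.
    if header[4:8] == b"ftyp" and header[8:12] in _HEIF_BRANDS:
        return "HEIF"
    return None
```

In practice you would read the first 16 bytes or so of each file and pass them in; a result of `None` means the format was not recognized by this sketch, not that the file is invalid.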
Collecting Image Data: Image Format
• You need to consider the volume and type of images you
are collecting when deciding on the best format.
• Ideally, you would choose the highest quality available for
all images, but the tradeoff is larger file sizes.
• Cornell University, a leading university in data science,
recommends TIFF for images.
Collecting Image Data: Volume
• Your storage is limited, so you should take into
consideration how many images to collect.
• While large storage devices have become cheaper in
recent years, images have also become larger.
• The expectations of big data analytics have also called
for larger and larger image datasets.
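A rough storage estimate makes the volume question concrete. The sketch below computes the uncompressed size of a dataset; the resolution and image count used in the example are illustrative assumptions, not figures from this lecture, and real files are usually smaller because most formats compress the pixel data.

```python
def raw_image_bytes(width: int, height: int,
                    channels: int = 3, bits_per_channel: int = 8) -> int:
    """Uncompressed size of one image in bytes."""
    return width * height * channels * bits_per_channel // 8


def dataset_gib(num_images: int, width: int, height: int) -> float:
    """Uncompressed size of a whole dataset in GiB (2**30 bytes)."""
    return num_images * raw_image_bytes(width, height) / 2**30


# Hypothetical example: 10,000 RGB photos at 4000 x 3000 pixels.
print(f"{dataset_gib(10_000, 4000, 3000):.1f} GiB uncompressed")
```

An estimate like this gives an upper bound to compare against the storage you actually have before collection begins.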
Collecting Image Data: Volume
• The quickest solution to large images is lossy compression,
converting the image to a lossy format (e.g. JPEG).
• An alternative is storing all the images in a losslessly
compressed archive (e.g. .zip, .7z, or .tar.gz).
• This retains the overall quality of the images while reducing
the overall file size of the dataset.
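The lossless archive option can be sketched with Python's standard `zipfile` module. The function name `archive_images` and the directory layout are assumptions for illustration. Note that formats that are already compressed (such as JPEG) will shrink very little; the savings are largest for uncompressed formats such as BMP or uncompressed TIFF.

```python
import zipfile
from pathlib import Path


def archive_images(image_dir: Path, archive_path: Path) -> int:
    """Store every file under image_dir in a DEFLATE-compressed zip.

    Returns the number of files archived. The archive should live
    outside image_dir so that it does not try to include itself.
    """
    count = 0
    with zipfile.ZipFile(archive_path, "w",
                         compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(image_dir.rglob("*")):
            if path.is_file():
                # Keep paths relative so the archive is portable.
                zf.write(path, arcname=path.relative_to(image_dir).as_posix())
                count += 1
    return count
```

Because the compression is lossless, every file extracted from the archive is byte-for-byte identical to the original.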
Managing Image Data
• Storing and organizing your images is important to ensure
that you can easily find and sort images as needed.
• This includes labeling, storing images in specific folders,
renaming the images themselves, or editing metadata.
• The effectiveness of these techniques decreases as the
dataset becomes larger.
Managing Image Data
• Using simple folders and labels is sufficient for small image
sets (up to roughly 10,000 images).
• For larger image sets, using a dedicated image
management tool is recommended (e.g. ImageKit).
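For a small image set, the folder-and-rename approach might look like the sketch below. It assumes you already have a mapping from file name to label; the function name `organize_by_label` and the naming scheme `<label>_<index>` are hypothetical choices, not a standard.

```python
import shutil
from pathlib import Path


def organize_by_label(labels: dict[str, str], source_dir: Path,
                      target_dir: Path) -> list[Path]:
    """Copy each image into target_dir/<label>/ with a uniform name."""
    counters: dict[str, int] = {}
    copied = []
    for filename, label in sorted(labels.items()):
        counters[label] = counters.get(label, 0) + 1
        src = source_dir / filename
        dest_dir = target_dir / label
        dest_dir.mkdir(parents=True, exist_ok=True)
        # e.g. cat/cat_00001.jpg; the extension is normalized to lowercase.
        dest = dest_dir / f"{label}_{counters[label]:05d}{src.suffix.lower()}"
        shutil.copy2(src, dest)  # copy2 also preserves file timestamps
        copied.append(dest)
    return copied
```

Copying rather than moving keeps the original files untouched, at the cost of extra storage while both copies exist.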
Text Collection
Data Collection: Image and Text
Defining Text Collection
• Text refers to printed books, documents, and media that
cover different ideas and content.
• In data science, text content is often analyzed, for
example to determine the sentiments of users.
• The text content can come from different sources
depending on your needs and goals.
Defining Text Collection
• Text collection is a type of data collection that deals
specifically with unstructured text data.
• It typically makes use of natural language processing
(NLP) techniques to extract insights from the text.
• Text collection is a multidisciplinary field, involving different
techniques such as text retrieval and text analysis.
Collecting Text Data
• Text data comes from a variety of sources:
• Books
• Classic Literature
• Corpora
• Data Vendors
• Dictionaries
• Forums
• Interview Transcripts
• Magazines
• Open-Source Datasets
• Short Stories
• Surveys
• Web Articles
Collecting Text Data
• Text collection has two key
phases in its process:
• Text Refining
• Knowledge Distillation
Collecting Text Data: Text Refining
• Text refining is transforming free-form text to a chosen
intermediate form (IF).
• This IF can be in a semi-structured form such as
conceptual graphs.
• It can also be in a structured form such as relational data.
Collecting Text Data: Text Refining
• The purpose of turning text data into an IF is to make it
easier to process and organize the text.
• An IF will typically have a label for each individual text, or
tags which describe its topic or idea.
• It might also take note of keywords for sentiment analysis.
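A structured IF along these lines could be sketched as a small record type. Everything named here (`RefinedText`, `refine`, and the keyword list) is hypothetical and only illustrates the idea of labels, tags, and sentiment keywords attached to a text.

```python
import re
from dataclasses import dataclass, field

# Assumption: a tiny hand-picked keyword list, for illustration only.
SENTIMENT_KEYWORDS = {"great", "love", "poor", "broken"}


@dataclass
class RefinedText:
    """One free-form text refined into a structured intermediate form."""
    doc_id: str
    label: str
    tags: list[str] = field(default_factory=list)
    sentiment_keywords: list[str] = field(default_factory=list)
    tokens: list[str] = field(default_factory=list)


def refine(doc_id: str, text: str, label: str, tags: list[str]) -> RefinedText:
    """Tokenize the text and note which sentiment keywords it contains."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    hits = sorted(set(tokens) & SENTIMENT_KEYWORDS)
    return RefinedText(doc_id, label, list(tags), hits, tokens)


record = refine("r1", "Great camera, but the strap arrived broken.",
                label="review", tags=["camera"])
print(record.sentiment_keywords)
```

Records like this map naturally onto rows of a relational table, which is the structured form mentioned earlier.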
Collecting Text Data: Text Refining
• Mining a document-based
IF deduces patterns and
relationships across
documents.
• Examples of this are
clustering/visualization and
categorization.
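As a minimal sketch of categorization over a document-based IF, the code below represents each document as a bag of words and assigns a new text to the most similar known document using cosine similarity. This is one simple technique chosen for illustration, not necessarily the method the lecture has in mind; the function names are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Document-based IF: word counts for one document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def most_similar(query: str, documents: dict[str, str]) -> str:
    """Return the name of the document most similar to the query."""
    q = bag_of_words(query)
    return max(documents,
               key=lambda name: cosine_similarity(q, bag_of_words(documents[name])))
```

The same similarity measure can also feed a clustering step, since documents with high pairwise similarity tend to belong to the same group.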
Collecting Text Data: Text Refining
• Mining a concept-based IF
deduces patterns and
relationships across objects
and concepts.
• Examples of this are
predictions and associative
discovery.
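Associative discovery over a concept-based IF can be illustrated by counting which concepts appear together. In this sketch each record is assumed to be the set of concepts already extracted from one text; `concept_pairs` and `min_support` are hypothetical names, and real association mining would usually add measures such as confidence or lift.

```python
from collections import Counter
from itertools import combinations


def concept_pairs(records: list[set[str]],
                  min_support: int = 2) -> dict[tuple[str, str], int]:
    """Count concept pairs that co-occur in at least min_support records."""
    counts: Counter = Counter()
    for concepts in records:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(concepts), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical concepts extracted from three product reviews.
reviews = [
    {"battery", "price"},
    {"battery", "price", "screen"},
    {"price", "screen"},
]
print(concept_pairs(reviews))
```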
Collecting Text Data: Knowledge Distillation
• Knowledge distillation deduces patterns or knowledge
from the IF.
• This is where you will utilize analysis tools or machine
learning models to comb through your text data.
• The tools used for knowledge distillation depend on the
goals of your data mining process.
Collecting Text Data: Knowledge Distillation
• These are examples of methods used for text data
analysis:
• Text Data Categorization
• Text Data Extraction
• Text Data Identification
• Text Data Parsing
• Text Data Translation
• Text Data Visualization
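Of the methods listed above, text data extraction is the easiest to show in a few lines. The sketch below pulls ISO-style dates and hashtags out of free text with regular expressions; the patterns and the function name `extract_fields` are assumptions chosen for illustration, and real extraction tasks usually need more robust patterns or an NLP library.

```python
import re

# Assumption: dates are written as YYYY-MM-DD; other styles are ignored.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
HASHTAG_PATTERN = re.compile(r"#(\w+)")


def extract_fields(text: str) -> dict[str, list[str]]:
    """Extract dates and lowercase hashtags from one piece of text."""
    return {
        "dates": DATE_PATTERN.findall(text),
        "hashtags": [tag.lower() for tag in HASHTAG_PATTERN.findall(text)],
    }


print(extract_fields("Launched 2023-05-01 #DataMining #nlp"))
```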
Managing Text Data
• Typically, we store both the “raw” data itself and the
analyzed/“new” data together.
• This allows us to validate our text refining process and
ensure that our text is being interpreted correctly.
• We also keep the “raw” data so that we can utilize
multiple data mining techniques on the same data.
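One simple way to keep raw and analyzed data together is to write both into the same record, one JSON object per line. The file layout and the names `append_record` and `load_records` below are assumptions for illustration; a database would serve the same purpose at larger scale.

```python
import json
from pathlib import Path


def append_record(path: Path, raw_text: str, analysis: dict) -> None:
    """Append one record holding both the raw text and its analysis."""
    record = {"raw": raw_text, "analysis": analysis}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_records(path: Path) -> list[dict]:
    """Read all records back, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because the raw text travels with every analysis result, you can re-run a different refining step later and compare its output against the stored one.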
Managing Text Data
• Processing text data is iterative in nature, which means
that we will better understand our data as we analyze it.
• This process can involve various specialized techniques
such as feature selection and feature extraction.
• Due to this, we usually end up with a corpus at the end
of our data mining process.
Managing Text Data
• A corpus acts as a collection of the linguistic patterns
that we have analyzed from our text data.
• This is what allows us to analyze new data and make
predictions.
• When we rebuild a text mining tool, we typically also
rebuild the corpus.
