The document discusses collecting and managing image and text data. It defines image and text collection as utilizing digital images and unstructured text for research. It describes various sources for collecting images and text, such as forums, social media, books and articles. It also discusses file formats, volume considerations, labeling, and using tools like ImageKit to manage large image datasets. For text, it describes refining raw text into an intermediate form through techniques like tagging, then analyzing the text through methods like categorization and visualization. Storing both raw and analyzed data is also discussed.
CC19 – Data Mining

Agenda
• Image Collection
  • Defining Image Collection
  • Collecting Image Data
  • Managing Image Data
• Text Collection
  • Defining Text Collection
  • Collecting Text Data
  • Managing Text Data

Image Collection
Data Collection: Image and Text

Defining Image Collection
• Images are a visual or mental representation of something or someone.
• In data science, these images have a class or type that they belong to.
• Different types of images are used to conduct analysis and train modern systems.

Defining Image Collection
• Image collection refers to the practice of utilizing digital images for research and data science.
• Using images for data science is common now but is a recent practice in the information age.
• Image collection requires its own steps and tools due to the nature and characteristics of images.

Collecting Image Data
• Collecting image data can be challenging, since images come from a variety of sources:
  • Forums
  • Data Vendors
  • Digital Creation
  • Image Sharing Platforms
  • Image Scanning
  • Photography
  • Open-Source Datasets
  • Search Engines
  • Social Media

Collecting Image Data
• When collecting images, we need to know what type of images we want to collect.
• Ideally, we should be able to answer these questions about our image data:
  • What type of images will be gathered?
  • In what format will the images appear?
  • What is the expected volume of images?

Collecting Image Data: Characteristics
• Images organized around specific themes or categories can provide insights into a variety of topics.
• Understanding the characteristics of images is important for gleaning patterns from our data.
• That is why knowing the characteristics of our image data is important.
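
The three questions above (type, format, volume) can be answered for images that are already on disk with a short inventory script. The following is only a minimal sketch using the Python standard library. It assumes that each top-level subfolder of the dataset names an image type; the function name `inventory_images` and the extension list are illustrative choices, not something prescribed by the slides.

```python
from collections import Counter
from pathlib import Path

# File extensions treated as images in this sketch (an assumption; adjust as needed).
IMAGE_EXTENSIONS = {
    ".jpg", ".jpeg", ".gif", ".png", ".heif", ".heic",
    ".tif", ".tiff", ".bmp", ".webp",
}


def inventory_images(root):
    """Summarize the images under `root` by type, format, and volume.

    Assumption: the first-level subfolder name is the image type (class).
    Images sitting directly in `root` are counted as "(unsorted)".
    """
    root = Path(root)
    by_type = Counter()
    by_format = Counter()
    total_bytes = 0

    for path in root.rglob("*"):
        ext = path.suffix.lower()
        if not path.is_file() or ext not in IMAGE_EXTENSIONS:
            continue
        parts = path.relative_to(root).parts
        image_type = parts[0] if len(parts) > 1 else "(unsorted)"
        by_type[image_type] += 1      # What type of images?
        by_format[ext] += 1           # In what format?
        total_bytes += path.stat().st_size  # What volume?

    return {
        "count": sum(by_format.values()),
        "by_type": dict(by_type),
        "by_format": dict(by_format),
        "total_bytes": total_bytes,
    }
```

Calling `inventory_images` on a dataset folder returns a dictionary with the image count, the counts per type and per format, and the total size in bytes, which can be compared against the available storage.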

Collecting Image Data: Characteristics
• Image data has more characteristics than just the image itself:

Characteristic | Digital Image Details
Data | The viewable image or the analytic content derived from the image
Metadata | Camera make & model, image timestamp, aperture, shutter speed, caption, author, legal information, copyright information, etc.
Paradata | Image source(s), image file type, image edit history, image sharing history

Collecting Image Data: Image Format
• In addition to knowing the characteristics of image data, it is important to know how it is formatted.
• The format used for images can affect how they will be analyzed when fed to an algorithm or program.
• It is also important to consider the tools that will be used to analyze the images and what formats they support.

Collecting Image Data: Image Format
• These are the most common types of image formats:
  • JPEG (or JPG) – Joint Photographic Experts Group
  • GIF – Graphics Interchange Format
  • PNG – Portable Network Graphics
  • HEIF – High Efficiency Image File Format
  • TIFF – Tagged Image File Format
  • BMP – Windows Bitmap
  • WebP

Collecting Image Data: Image Format
• You need to consider the volume and type of images you are collecting when deciding on the best format.
• Ideally, you would choose the highest quality available for all images, but the tradeoff is larger file sizes.
• Cornell University, a leading university in data science, recommends TIFF for images.

Collecting Image Data: Volume
• Your storage is limited, so you should take into consideration how many images to collect.
• While larger storage devices have become relatively cheaper in recent years, images have become larger.
• Expectations of big data analytics have also called for larger and larger image datasets.

Collecting Image Data: Volume
• The quickest solution to large images is lossy compression, converting the image to a lossy format (e.g. JPEG).
• An alternative is storing all the images in a losslessly compressed archive format (e.g. .zip, .7z, or .gz).
• This retains the overall quality of the images while reducing the overall file size of the dataset.

Managing Image Data
• Storing and organizing your images is important, to ensure that you can easily sort and find images as needed.
• This includes labeling, storing images in specific folders, renaming the images themselves, or editing metadata.
• The effectiveness of these techniques decreases as the dataset becomes larger.

Managing Image Data
• Using simple folders and labels is sufficient for small image sets (~10,000 images).
• As you make use of larger image sets, using special image management tools (e.g. ImageKit) is recommended.

Text Collection
Data Collection: Image and Text

Defining Text Collection
• Text refers to printed books, documents, and media that cover different ideas and content.
• In data science, text content is typically analyzed to determine the sentiments of users.
• The text content can come from different sources depending on your needs and goals.

Defining Text Collection
• Text collection is a type of data collection that deals specifically with unstructured text data.
• It typically makes use of natural language processing (NLP) techniques to extract insights from the text.
• Text collection is a multidisciplinary field, involving different techniques such as text retrieval and text analysis.

Collecting Text Data
• Text data comes from a variety of sources:
  • Books
  • Classic Literature
  • Corpora
  • Data Vendors
  • Dictionaries
  • Forums
  • Interview Transcripts
  • Magazines
  • Open-Source Datasets
  • Short Stories
  • Surveys
  • Web Articles

Collecting Text Data
• Text collection has two key phases in its process:
  • Text Refining
  • Knowledge Distillation

Collecting Text Data: Text Refining
• Text refining transforms free-form text into a chosen intermediate form (IF).
• This IF can be in a semi-structured form such as conceptual graphs.
• It can also be in a structured form such as relational data.

Collecting Text Data: Text Refining
• The purpose of turning text data into an IF is to make it easier to process and organize the text.
• An IF will typically have labels for each individual text, or tags which describe its topic or idea.
• It might also take note of keywords for sentiment analysis.

Collecting Text Data: Text Refining
• Mining a document-based IF deduces patterns and relationships across documents.
• Examples of this are clustering/visualization and categorization.

Collecting Text Data: Text Refining
• Mining a concept-based IF deduces patterns and relationships across objects and concepts.
• Examples of this are predictions and associative discovery.

Collecting Text Data: Knowledge Distillation
• Knowledge distillation deduces patterns or knowledge from the IF.
• This is where you will utilize analysis tools or machine learning models to comb through your text data.
• The tools used for knowledge distillation depend on the goals of your data mining process.

Collecting Text Data: Knowledge Distillation
• These are examples of methods used for text data analysis:
  • Text Data Categorization
  • Text Data Extraction
  • Text Data Identification
  • Text Data Parsing
  • Text Data Translation
  • Text Data Visualization

Managing Text Data
• Typically, we store the “raw” data together with the analyzed (“new”) data.
• This allows us to validate our text refining process and ensure that our text is being interpreted correctly.
• We also keep the “raw” data so that we can utilize multiple data mining techniques on the same data.

Managing Text Data
• Processing text data is iterative in nature, which means that we will better understand our data as we analyze it.
• This process can involve various specialized techniques such as feature selection and feature extraction.
• As a result, we usually end up with a corpus at the end of our data mining process.

Managing Text Data
• A corpus acts as a collection of the linguistic patterns that we have analyzed from our text data.
• This is what allows us to analyze new data and make predictions.
• When we rebuild a text mining tool, we typically also rebuild the corpus.

References
• Image Management as a Data Service (iassistquarterly.com)
• Text-Mining-The-state-of-the-art-and-the-challenges.pdf (researchgate.net)
• Text Data Collection Services | OCR Dataset- GTS
• Getting Started in Text Mining | PLOS Computational Biology
• Automated Data Collection with R – A Practical Guide to Web Scraping and Text Mining (core.ac.uk)
• Online-Data-Collection.pdf (researchgate.net)
• Text Mining in Data Mining - GeeksforGeeks
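
As an illustration of the two text collection phases described in the slides, the sketch below refines free-form text into a structured record (a very simple IF with tags) and then runs a minimal distillation step that counts documents per tag. It keeps the raw text inside each record, in line with storing raw and analyzed data together. The tag vocabulary `TAG_KEYWORDS` and the function names are hypothetical, and plain keyword matching stands in for real NLP techniques.

```python
import re
from collections import Counter

# Hypothetical tag vocabulary: tag name -> keywords that trigger the tag.
TAG_KEYWORDS = {
    "positive": {"good", "great", "love"},
    "negative": {"bad", "poor", "hate"},
}


def refine(doc_id, raw_text):
    """Text refining: turn one free-form text into a structured record.

    The record keeps the raw text next to the derived tokens and tags,
    so the refining step can be validated or redone later.
    """
    tokens = re.findall(r"[a-z']+", raw_text.lower())
    token_set = set(tokens)
    tags = sorted(tag for tag, words in TAG_KEYWORDS.items() if words & token_set)
    return {"id": doc_id, "raw": raw_text, "tokens": tokens, "tags": tags}


def tag_counts(records):
    """Knowledge distillation (simplified): count documents per tag."""
    return Counter(tag for record in records for tag in record["tags"])


records = [
    refine(1, "I love this camera, great photos!"),
    refine(2, "Poor battery. Bad screen."),
    refine(3, "It arrived on Monday."),
]
summary = tag_counts(records)
```

Because each record still holds its `raw` text, a different refining or mining technique can be applied to the same data later without collecting it again.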