Data Labeling in Machine Learning with Python: Explore modern ways to prepare labeled data for training and fine-tuning ML and generative AI models
About this ebook
Data labeling is the invisible hand that guides the power of artificial intelligence and machine learning. In today’s data-driven world, mastering data labeling is not just an advantage, it’s a necessity. Data Labeling in Machine Learning with Python empowers you to unearth value from raw data, create intelligent systems, and influence the course of technological evolution.
With this book, you'll discover the art of employing summary statistics, weak supervision, programmatic rules, and heuristics to assign labels to unlabeled training data programmatically. As you progress, you'll be able to enhance your datasets by mastering the intricacies of semi-supervised learning and data augmentation. Venturing further into the data landscape, you'll immerse yourself in the annotation of image, video, and audio data, harnessing the power of Python libraries such as seaborn, matplotlib, cv2, librosa, openai, and langchain. With hands-on guidance and practical examples, you'll gain proficiency in annotating diverse data types effectively.
By the end of this book, you’ll have the practical expertise to programmatically label diverse data types and enhance datasets, unlocking the full potential of your data.
Data Labeling in Machine Learning with Python - Vijaya Kumar Suda
Data Labeling in Machine Learning with Python
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Niranjan Naikwadi
Publishing Product Manager: Sanjana Gupta
Book Project Manager: Hemangi Lotlikar
Content Development Editor: Shreya Moharir
Technical Editor: Rahul Limbachiya
Copy Editor: Safis Editing
Proofreader: Safis Editing
Indexer: Tejal Soni
Production Designer: Joshua Misquitta
DevRel Marketing Coordinator: Vinishka Kalra
First published: January 2024
Production reference: 1300124
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK.
ISBN 978-1-80461-054-1
www.packtpub.com
Acknowledgments
I extend my heartfelt gratitude to my mother, Rajya Lakshmi Suda, and dedicate this work to the cherished memory of my father, Koteswara Rao Suda. Their sacrifices and unwavering determination have been a profound source of inspiration.
Special thanks to my wife, Radhika, for her enduring support and patience throughout the writing of this book.
To my son, Chandra Suda (Rise Global Winner 2023), and daughter, Akshaya, your talents and creativity have shown me the beautiful evolution of skill.
I am deeply appreciative of my siblings, Rama Devi, Swarna Kumar, and Dr. Sri Kumar, for their continuous support.
A sincere acknowledgment to my mentors and managers, Kevin Fleck and Des Quinta, for their invaluable support and motivation throughout the writing process of this book.
Finally, I want to thank the Packt Publishing team, especially Shreya and Hemangi, for their fantastic support, which made the writing process an absolute pleasure.
Contributors
About the author
Vijaya Kumar Suda is a seasoned data and AI professional, boasting over two decades of expertise collaborating with global clients. Having resided and worked in diverse locations such as Switzerland, Belgium, Mexico, Bahrain, India, Canada, and the USA, Vijaya has successfully assisted customers spanning various industries. Currently serving as a senior data and AI consultant at Microsoft, he is instrumental in guiding industry partners through their digital transformation endeavors using cutting-edge cloud technologies and AI capabilities. His proficiency encompasses architecture, data engineering, machine learning, generative AI, and cloud solutions. Vijaya also shares his insights through engaging videos on the cloud, data, and AI on his YouTube channel, Cloud & Data Science.
About the reviewers
Pritesh Kanani is a full stack developer with experience in data wrangling and supervised machine learning. He helped a major oil and gas company with building a tool to monitor drilling operations and handling thousands of high frequency data streams. He completed a post-graduation course in applied AI and is currently utilizing his full stack data science and cloud computing skills at a leading nuclear and renewable energy organization in Ontario, Canada.
Sourav Roy is a passionate data enthusiast, an experienced machine learning practitioner, and an expert book reviewer with a focus on literature linked to data. He possesses a diverse skill set in data engineering and data analytics, which allows him to combine technical proficiency with a deep passion in his work on data-centric books. Sourav obtained a master’s degree in data science and analytics from Queen’s University. He is presently employed as a data engineer in the banking sector.
Mitesh Mangaonkar is an engineering leader pioneering generative AI to transform data platforms. As a tech lead at Airbnb, he builds cutting-edge data pipelines leveraging big technologies and modern data stacks to power trust and safety products. Previously, at AWS, Mitesh helped Fortune 500 companies migrate their data warehouses to the cloud and engineered highly scalable, resilient systems. An innovator at heart, he combines deep data engineering expertise with a passion for AI to create the next generation of data products. Mitesh is an influential voice shaping the future of data engineering and governance.
Table of Contents
Preface
Part 1: Labeling Tabular Data
1
Exploring Data for Machine Learning
Technical requirements
EDA and data labeling
Understanding the ML project life cycle
Defining the business problem
Data discovery and data collection
Data exploration
Data labeling
Model training
Model evaluation
Model deployment
Introducing Pandas DataFrames
Summary statistics and data aggregates
Summary statistics
Data aggregates of the feature for each target class
Creating visualizations using Seaborn for univariate and bivariate analysis
Univariate analysis
Bivariate analysis
Profiling data using the ydata-profiling library
Variables section
Interactions section
Correlations
Missing values
Sample data
Unlocking insights from data with OpenAI and LangChain
Summary
2
Labeling Data for Classification
Technical requirements
Predicting labels with LLMs for tabular data
Data labeling using Snorkel
What is Snorkel?
Why is Snorkel popular?
Loading unlabeled data
Creating the labeling functions
Labeling rules
Constants
Labeling functions
Creating a label model
Predicting labels
Labeling data using the Compose library
Labeling data using semi-supervised learning
What is semi-supervised learning?
What is pseudo-labeling?
Labeling data using K-means clustering
What is unsupervised learning?
K-means clustering
Inertia
Dunn's index
Summary
3
Labeling Data for Regression
Technical requirements
Using summary statistics to generate housing price labels
Finding the closest labeled observation to match the label
Using semi-supervised learning to label regression data
Pseudo-labeling
Using data augmentation to label regression data
Using k-means clustering to label regression data
Summary
Part 2: Labeling Image Data
4
Exploring Image Data
Technical requirements
Visualizing image data using Matplotlib in Python
Loading the data
Checking the dimensions
Visualizing the data
Checking for outliers
Performing data preprocessing
Checking for class imbalance
Identifying patterns and relationships
Evaluating the impact of preprocessing
Practice example of visualizing data
Practice example for adding annotations to an image
Practice example of image segmentation
Practice example for feature extraction
Analyzing image size and aspect ratio
Impact of aspect ratios on model performance
Image resizing
Image normalization
Performing transformations on images – image augmentation
Summary
5
Labeling Image Data Using Rules
Technical requirements
Labeling rules based on image visualization
Image labeling using rules with Snorkel
Weak supervision
Rules based on the manual visualization of an image’s object color
Real-world applications
A practical example of plant disease detection
Labeling images using rules based on properties
Bounding boxes
Example 1 – image classification – a bicycle with and without a person
Example 2 – image classification – dog and cat images
Labeling images using transfer learning
Example – digit classification using a pre-trained classifier
Example – person image detection using the YOLO V3 pre-trained classifier
Example – bicycle image detection using the YOLO V3 pre-trained classifier
Labeling images using transformations
Summary
6
Labeling Image Data Using Data Augmentation
Technical requirements
Training support vector machines with augmented image data
Kernel trick
Data augmentation
Image data augmentation
Implementing an SVM with data augmentation in Python
Introducing the CIFAR-10 dataset
Loading the CIFAR-10 dataset in Python
Preprocessing the data for SVM training
Implementing an SVM with the default hyperparameters
Evaluating SVM on the original dataset
Implementing an SVM with an augmented dataset
Training the SVM on augmented data
Evaluating the SVM’s performance on the augmented dataset
Image classification using the SVM with data augmentation on the MNIST dataset
Convolutional neural networks using augmented image data
How CNNs work
Practical example of a CNN using data augmentation
CNN using image data augmentation with the CIFAR-10 dataset
Summary
Part 3: Labeling Text, Audio, and Video Data
7
Labeling Text Data
Technical requirements
Real-world applications of text data labeling
Tools and frameworks for text data labeling
Exploratory data analysis of text
Loading the data
Understanding the data
Cleaning and preprocessing the data
Exploring the text’s content
Analyzing relationships between text and other variables
Visualizing the results
Exploratory data analysis of sample text data set
Exploring Generative AI and OpenAI for labeling text data
GPT models by OpenAI
Zero-shot learning capabilities
Text classification with OpenAI models
Data labeling assistance
OpenAI API overview
Use case 1 – summarizing the text
Use case 2 – topic generation for news articles
Use case 3 – classification of customer queries using the user-defined categories and sub-categories
Use case 4 – information retrieval using entity extraction
Use case 5 – aspect-based sentiment analysis
Hands-on labeling of text data using the Snorkel API
Hands-on text labeling using Logistic Regression
Hands-on label prediction using K-means clustering
Generating labels for customer reviews (sentiment analysis)
Summary
8
Exploring Video Data
Technical requirements
Loading video data using cv2
Extracting frames from video data for analysis
Extracting features from video frames
Color histogram
Optical flow features
Motion vectors
Deep learning features
Appearance and shape descriptors
Visualizing video data using Matplotlib
Frame visualization
Temporal visualization
Motion visualization
Labeling video data using k-means clustering
Overview of data labeling using k-means clustering
Example of video data labeling using k-means clustering with a color histogram
Advanced concepts in video data analysis
Motion analysis in videos
Object tracking in videos
Facial recognition in videos
Video compression techniques
Real-time video processing
Video data formats and quality in machine learning
Common issues in handling video data for ML models
Troubleshooting steps
Summary
9
Labeling Video Data
Technical requirements
Capturing real-time video
Key components and features
A hands-on example to capture real-time video using a webcam
Building a CNN model for labeling video data
Using autoencoders for video data labeling
A hands-on example to label video data using autoencoders
Transfer learning
Using the Watershed algorithm for video data labeling
A hands-on example to label video data segmentation using the Watershed algorithm
Computational complexity
Performance metrics
Real-world examples for video data labeling
Advances in video data labeling and classification
Summary
10
Exploring Audio Data
Technical requirements
Real-life applications for labeling audio data
Audio data fundamentals
Hands-on with analyzing audio data
Example code for loading and analyzing sample audio file
Best practices for audio format conversion
Example code for audio data cleaning
Extracting properties from audio data
Tempo
Chroma features
Mel-frequency cepstral coefficients (MFCCs)
Zero-crossing rate
Spectral contrast
Considerations for extracting properties
Visualizing audio data with matplotlib and Librosa
Waveform visualization
Loudness visualization
Spectrogram visualization
Mel spectrogram visualization
Considerations for visualizations
Ethical implications of audio data
Recent advances in audio data analysis
Troubleshooting common issues during data analysis
Troubleshooting common installation issues for audio libraries
Summary
11
Labeling Audio Data
Technical requirements
Downloading FFmpeg
Azure Machine Learning
Real-time voice classification with Random Forest
Transcribing audio using the OpenAI Whisper model
Step 1 – importing the Whisper model
Step 2 – loading the base Whisper model
Step 3 – setting up FFmpeg
Step 4 – transcribing the YouTube audio using the Whisper model
Classifying a transcription using Hugging Face transformers
Hands-on – labeling audio data using a CNN
Exploring audio data augmentation
Introducing Azure Cognitive Services – the speech service
Creating an Azure Speech service
Speech to text
Speech translation
Summary
12
Hands-On Exploring Data Labeling Tools
Technical requirements
Azure Machine Learning data labeling
Label Studio
pyOpenAnnotate
Data labeling using Azure Machine Learning
Benefits of data labeling with Azure Machine Learning
Data labeling steps using Azure Machine Learning
Image data labeling with Azure Machine Learning
Text data labeling with Azure Machine Learning
Audio data labeling using Azure Machine Learning
Integration of the Azure Machine Learning pipeline with the labeled dataset
Exploring Label Studio
Labeling the image data
Labeling the text data
Labeling the video data
pyOpenAnnotate
Computer Vision Annotation Tool
Comparison of data labeling tools
Advanced methods in data labeling
Active learning
Semi-automated labeling
Summary
Index
Other Books You May Enjoy
Preface
In today’s data-driven era, where more than 2.5 quintillion bytes of data are produced daily in forms such as text, image, audio, and video, data stands as the cornerstone of the AI revolution. However, most real-world data available for training supervised machine learning models is either unlabeled or only sparsely labeled. This presents a significant challenge, because labeled data is essential both for training any supervised machine learning model and for fine-tuning large language models in the age of generative AI.
To address this scarcity and make it easier to prepare labeled data for training supervised machine learning models and fine-tuning large language models, this book introduces various methods for programmatic data labeling with Python libraries, including semi-supervised and unsupervised learning.
This book guides you through the process of loading and analyzing tabular data, images, videos, audio, and text using various Python libraries, the OpenAI API, LangChain, and Azure Machine Learning. It explores techniques such as weak supervision, pseudo-labeling, and K-means clustering for classification and labeling, while also providing data augmentation methods to enhance accuracy. Utilizing the Azure OpenAI API and LangChain, the book demonstrates the automation of data analysis using natural language without the need to acquire any programming skills. It also encompasses the classification and data labeling of text data using OpenAI and large language models (LLMs). This book covers a wide variety of open source data annotation tools, along with Azure Machine Learning, and compares the pros and cons of these tools.
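As a rough illustration of what pseudo-labeling means in practice, here is a minimal sketch. It is not taken from the book: the synthetic data, the choice of scikit-learn's LogisticRegression, and the 0.9 confidence threshold are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data (an assumption for this sketch, not a book dataset)
rng = np.random.default_rng(0)
X_labeled = np.vstack([rng.normal(-3, 0.5, (10, 2)), rng.normal(3, 0.5, (10, 2))])
y_labeled = np.array([0] * 10 + [1] * 10)
X_unlabeled = np.vstack([rng.normal(-3, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])

# Step 1: train on the small labeled set
model = LogisticRegression().fit(X_labeled, y_labeled)

# Step 2: predict on unlabeled rows and keep only confident predictions
proba = model.predict_proba(X_unlabeled)
confident = proba.max(axis=1) >= 0.9  # hypothetical threshold
pseudo_labels = model.classes_[proba.argmax(axis=1)][confident]

# Step 3: retrain on the labeled rows plus the pseudo-labeled rows
X_combined = np.vstack([X_labeled, X_unlabeled[confident]])
y_combined = np.concatenate([y_labeled, pseudo_labels])
model = LogisticRegression().fit(X_combined, y_combined)
```

The threshold trades coverage for label quality: a higher value adds fewer pseudo-labeled rows, but the ones it adds are less likely to be wrong.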
Real-world examples from various industries are incorporated to illustrate the application of these methods to tabular, text, image, video, and audio data.
By the conclusion of this book, you will have acquired the skills to explore different types of data using Python and OpenAI LLMs. You will have learned how to prepare data with labels, whether for training machine learning models or unlocking insights about the data to leverage for business use cases across industries.
Who this book is for
This book is for aspiring AI engineers, machine learning engineers, data scientists, and data engineers who want to learn about data labeling methods and algorithms for model training. Data enthusiasts and Python developers will be able to use this book to learn about data exploration and annotation using Python libraries.
What this book covers
Chapter 1, Exploring Data for Machine Learning, provides an overview of data analysis and visualization methods using various Python libraries. Additionally, it deep dives into unlocking data insights with natural language using OpenAI LLMs.
Chapter 2, Labeling Data for Classification, covers the process of labeling tabular data for training classification models. Various methods, such as Snorkel Python functions, semi-supervised learning, and clustering data using K-means, are explored.
Chapter 3, Labeling Data for Regression, addresses the labeling of tabular data for training regression models. Techniques include leveraging summary statistics, creating pseudo labels, employing data augmentation methods, and utilizing K-means clustering.
Chapter 4, Exploring Image Data, covers the analysis and visualization of image data and feature extraction from images using various Python libraries.
Chapter 5, Labeling Image Data Using Rules, discusses labeling images based on heuristics and image properties such as aspect ratio, and also covers image classification using pre-trained classifiers such as YOLO.
Chapter 6, Labeling Image Data Using Data Augmentation, explores methods of image data augmentation for training support vector machines and Convolutional Neural Networks (CNNs), as well as addressing image data labeling.
Chapter 7, Labeling Text Data, covers generative AI and various methods for labeling text data. This includes Azure OpenAI with real-world use cases, text classification, and sentiment analysis using Snorkel and K-means clustering.
Chapter 8, Exploring Video Data, focuses on loading video data, extracting features, visualizing video data, and clustering video data using K-means clustering.
Chapter 9, Labeling Video Data, delves into labeling video data using CNNs, segmenting video data with the watershed algorithm, and capturing important features using autoencoders, accompanied by real-world examples.
Chapter 10, Exploring Audio Data, covers the fundamentals of audio data, loading and visualizing audio data, extracting features, and real-life applications.
Chapter 11, Labeling Audio Data, covers transcribing audio data using OpenAI’s Whisper model, labeling the transcription, creating spectrograms for audio data classification, augmenting audio data, and using Azure Cognitive Services for speech.
Chapter 12, Hands-On Exploring Data Labeling Tools, covers various data labeling tools, including open source tools such as Label Studio, CVAT, and pyOpenAnnotate, as well as Azure Machine Learning. It also includes a comparison of these tools for image, text, audio, and video data.
To get the most out of this book
Basic Python knowledge is beneficial but not necessary to get the most out of this book.
If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
Download the example code files
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Data-Labeling-in-Machine-Learning-with-Python. If there’s an update to the code, it will be updated in the GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Conventions used
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: Now let us generate the augmented data by calling the noise, scale, and rotation augmentation functions, as follows.
A block of code is set as follows:
# Train a linear regression model on the labeled data
regressor = LinearRegression()
regressor.fit(train_data, train_labels)
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
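# Prompt asking the model to assign one category to a headline
news_headline = (
    "Label the following news headline into 1 of the following categories: "
    "Business, Tech, Politics, Sport, Entertainment\n\n"
    " Headline 1: Trump is ready to contest in nov 2024 elections\nCategory:"
)
# Legacy (pre-1.0) openai client call; model_deployment_name is assumed
# to have been defined earlier
response = openai.Completion.create(
    engine=model_deployment_name,
    prompt=news_headline,
    temperature=0,
)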
Any command-line input or output is written as follows:
pip install keras
Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: Change System preferences | Security and privacy | General, and then select Open anyway.
Tips or important notes
Appear like this.
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Share your thoughts
Once you’ve read Data Labeling in Machine Learning with Python, we’d love to hear your thoughts! Please visit the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Download a free PDF copy of this book
Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.
The perks don’t stop there. You can also get exclusive access to discounts, newsletters, and great free content in your inbox daily.
Follow these simple steps to get the benefits:
1. Scan the QR code or visit the link below
https://packt.link/free-ebook/9781804610541
2. Submit your proof of purchase
3. That’s it! We’ll send your free PDF and other benefits to your email directly
Part 1: Labeling Tabular Data
This part of the book will guide you in exploring tabular data and programmatically labeling the data using Python libraries, such as Snorkel labeling functions. You will be able to achieve this without requiring any prior data science knowledge. Additionally, it covers data labeling using K-means clustering.
This part comprises the following chapters:
Chapter 1, Exploring Data for Machine Learning
Chapter 2, Labeling Data for Classification
Chapter 3, Labeling Data for Regression
1
Exploring Data for Machine Learning
Imagine embarking on a journey through an expansive ocean of data, where within this vastness are untold stories, patterns, and insights waiting to be discovered. Welcome to the world of data exploration in machine learning (ML). In this chapter, I encourage you to put on your analytical lenses as we embark on a thrilling quest. Here, we will delve deep into the heart of your data, armed with powerful techniques and heuristics, to uncover its secrets. As you embark on this adventure, you will discover that beneath the surface of raw numbers and statistics, there exists a treasure trove of patterns that, once revealed, can transform your data into a valuable asset. The journey begins with exploratory data analysis (EDA), a crucial phase where we unravel the mysteries of data, laying the foundation for automated labeling and, ultimately, building smarter and more accurate ML models. In this age of generative AI, the preparation of quality training data is essential to the fine-tuning of domain-specific large language models (LLMs). Fine-tuning involves the curation of additional domain-specific labeled data for training publicly available LLMs. So, fasten your seatbelts for a captivating voyage into the art and science of data exploration for data labeling.
First, let’s start with the question: What is data exploration? It is the initial phase of data analysis, where raw data is examined, visualized, and summarized to uncover patterns, trends, and insights. It serves as a crucial step in understanding the nature of the data before applying advanced analytics or ML techniques.
In this chapter, we will explore tabular data using various libraries and packages in Python, including Pandas, NumPy, and Seaborn. We will also plot different bar charts and histograms to visualize data to find the relationships between various features, which is useful for labeling data. We will be exploring the Income dataset located in this book’s GitHub repository (a link for which is located in the Technical requirements section). A good understanding of the data is necessary in order to define business rules, identify matching patterns, and, subsequently, label the data using Python labeling functions.
By the end of this chapter, we will be able to generate summary statistics for the given dataset. We will derive aggregates of the features for each target group. We will also learn how to perform univariate and bivariate analyses of the features in the given dataset. We will create a report using the ydata-profiling library.
We’re going to cover the following main topics:
EDA and data labeling
Summary statistics and data aggregates with Pandas
Data visualization with Seaborn for univariate and bivariate analysis
Profiling data using the ydata-profiling library
Unlocking insights from data with OpenAI and LangChain
Technical requirements
One of the following Python IDEs or software tools must be installed before you run the notebook in this chapter:
Anaconda Navigator: Download and install the open source Anaconda Navigator from the following URL:
https://docs.anaconda.com/navigator/install/#system-requirements
Jupyter Notebook: Download and install Jupyter Notebook:
https://jupyter.org/install
We can also use open source, online Python editors such as Google Colab (https://colab.research.google.com/) or Replit (https://replit.com/).
The Python source code and the entire notebook created in this chapter are available in this book’s GitHub repository:
https://github.com/PacktPublishing/Data-Labeling-in-Machine-Learning-with-Python
You also need to create an Azure account and add an OpenAI resource for working with generative AI. To sign up for a free Azure subscription, visit https://azure.microsoft.com/free. To request access to the Azure OpenAI service, visit https://aka.ms/oaiapply.
Once you have provisioned the Azure OpenAI service, deploy an LLM, either GPT-3.5-Turbo or GPT-4, from Azure OpenAI Studio. Then copy the key and endpoint for your resource from Azure OpenAI Studio and set the following environment variables:
import os
os.environ['AZURE_OPENAI_KEY'] = 'your_api_key'
os.environ['AZURE_OPENAI_ENDPOINT'] = 'your_azure_openai_endpoint'
Your endpoint should look like this: https://YOUR_RESOURCE_NAME.openai.azure.com/.
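Before running the generative AI examples, it can help to confirm that both environment variables are set and that the endpoint has the expected shape. The following is only a minimal sketch: the helper name load_azure_openai_settings and its checks are hypothetical and are not part of the book's notebook or of any Azure SDK.

```python
import os
from urllib.parse import urlparse


def load_azure_openai_settings(env=os.environ):
    """Hypothetical helper: read the key and endpoint and fail early if they look wrong."""
    key = env.get("AZURE_OPENAI_KEY")
    endpoint = env.get("AZURE_OPENAI_ENDPOINT")
    if not key or not endpoint:
        raise RuntimeError("Set AZURE_OPENAI_KEY and AZURE_OPENAI_ENDPOINT first.")

    # The endpoint is expected to look like https://YOUR_RESOURCE_NAME.openai.azure.com/
    parsed = urlparse(endpoint)
    if parsed.scheme != "https" or not parsed.netloc.endswith(".openai.azure.com"):
        raise ValueError(f"Unexpected endpoint format: {endpoint}")

    return {"api_key": key, "endpoint": endpoint}
```

Calling load_azure_openai_settings() with no arguments reads from the current process environment. Passing a plain dictionary instead makes the check easy to try out without touching real credentials.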
EDA and data labeling
In this section, we will gain an understanding of what EDA is. We will see why we need to perform it and discuss its advantages. We will also look at the life cycle of an ML project and learn about the role of data labeling in this cycle.
EDA comprises data discovery, data collection, data cleaning, and data exploration. These steps are part of any machine learning project. The data exploration step includes tasks such as data visualization, summary statistics, correlation analysis, and data distribution analysis. We will dive deep into these steps in the upcoming sections.
Here are some real-world examples of EDA:
Customer churn analysis: Suppose you work for a telecommunications company and you want to understand why customers are churning (canceling their subscriptions); in this case, conducting EDA on customer churn data can provide valuable insights.
Income data analysis: EDA on the Income dataset, with features such as education, employment status, and marital status, helps reveal which features are useful for predicting whether a person's salary is greater than $50K.
EDA is a critical process for any ML or data science project, and it allows us to understand the data and gain some valuable insights into the data domain and business.
In this chapter, we will use various Python libraries, such as Pandas, and call the describe and info methods on a Pandas DataFrame to generate data summaries. We will look for anomalies and outliers in the given dataset, identify the data types and any missing values, and determine whether any data type conversions, such as string to float, are required for further analysis. We will also analyze the data formats and see whether any transformations are required to standardize them, such as the date format. We will count the occurrences of each label to understand whether the dataset is balanced or imbalanced. We will also examine the relationships between the features in the data and calculate the correlations between them.
To summarize, we will understand the patterns in the given dataset and also identify the relationships between various features in the data samples. Finally, we will come up with a strategy and domain rules for data cleaning and transformation. This helps us to predict labels for unlabeled data.
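The summary steps described above can be sketched with a few Pandas calls. This is an illustrative sketch only: the rows and the column names (age, hours_per_week, education, income) are invented for the example and are not the book's Income dataset.

```python
import pandas as pd

# Toy rows invented for illustration; NOT the book's Income dataset.
df = pd.DataFrame(
    {
        "age": [25, 38, 52, 46, 29, None],
        "hours_per_week": [40, 50, 60, 45, 35, 40],
        "education": ["HS-grad", "Bachelors", "Masters", "Bachelors", "HS-grad", "Masters"],
        "income": ["<=50K", ">50K", ">50K", ">50K", "<=50K", "<=50K"],
    }
)

df.info()                                    # column data types and non-null counts
summary = df.describe()                      # count, mean, std, min, quartiles, max (numeric columns)
missing = df.isnull().sum()                  # number of missing values per column
label_counts = df["income"].value_counts()   # is the target balanced or imbalanced?

# Aggregates of the numeric features for each target group
group_means = df.groupby("income")[["age", "hours_per_week"]].mean()

# Pairwise correlation between the numeric features
correlation = df[["age", "hours_per_week"]].corr()

print(summary)
print(missing)
print(label_counts)
print(group_means)
print(correlation)
```

On a real dataset, the same calls show which columns need type conversion or imputation, and whether the label counts are skewed enough to affect how labeling rules should be written.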
We will plot various data visualizations using Python libraries such as seaborn and matplotlib. We will create bar charts, histograms, heatmaps, and other charts to visualize the importance of features in the dataset and how they depend on each other.
Understanding the ML project life cycle
The following are the major steps in an ML project:
Figure 1.1 – ML project life cycle diagram