Data Labeling in Machine Learning with Python: Explore modern ways to prepare labeled data for training and fine-tuning ML and generative AI models
Ebook · 806 pages · 5 hours


About this ebook

Data labeling is the invisible hand that guides the power of artificial intelligence and machine learning. In today’s data-driven world, mastering data labeling is not just an advantage but a necessity. Data Labeling in Machine Learning with Python empowers you to unearth value from raw data, create intelligent systems, and influence the course of technological evolution.
With this book, you'll discover the art of employing summary statistics, weak supervision, rules, and heuristics to programmatically assign labels to unlabeled training data. As you progress, you'll enhance your datasets by mastering the intricacies of semi-supervised learning and data augmentation. Venturing further into the data landscape, you'll immerse yourself in the annotation of image, video, and audio data, harnessing the power of Python libraries such as seaborn, matplotlib, cv2, librosa, openai, and langchain. With hands-on guidance and practical examples, you'll gain proficiency in annotating diverse data types effectively.
By the end of this book, you’ll have the practical expertise to programmatically label diverse data types and enhance datasets, unlocking the full potential of your data.

Language: English
Release date: Jan 31, 2024
ISBN: 9781804613788


    Data Labeling in Machine Learning with Python

    Vijaya Kumar Suda

    Copyright © 2024 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    Group Product Manager: Niranjan Naikwadi

    Publishing Product Manager: Sanjana Gupta

    Book Project Manager: Hemangi Lotlikar

    Content Development Editor: Shreya Moharir

    Technical Editor: Rahul Limbachiya

    Copy Editor: Safis Editing

    Proofreader: Safis Editing

    Indexer: Tejal Soni

    Production Designer: Joshua Misquitta

    DevRel Marketing Coordinator: Vinishka Kalra

    First published: January 2024

    Production reference: 1300124

    Published by Packt Publishing Ltd.

    Grosvenor House

    11 St Paul’s Square

    Birmingham

    B3 1RB, UK.

    ISBN 978-1-80461-054-1

    www.packtpub.com

    Acknowledgments

    I extend my heartfelt gratitude to my mother, Rajya Lakshmi Suda, and dedicate this work to the cherished memory of my father, Koteswara Rao Suda. Their sacrifices and unwavering determination have been a profound source of inspiration.

    Special thanks to my wife, Radhika, for her enduring support and patience throughout the writing of this book.

    To my son, Chandra Suda (Rise Global Winner 2023), and daughter, Akshaya, your talents and creativity have shown me the beautiful evolution of skill.

    I am deeply appreciative of my siblings, Rama Devi, Swarna Kumar, and Dr. Sri Kumar, for their continuous support.

    A sincere acknowledgment to my mentors and managers, Kevin Fleck and Des Quinta, for their invaluable support and motivation throughout the writing process of this book.

    Finally, I want to thank the Packt Publishing team, especially Shreya and Hemangi, for their fantastic support, which made the writing process an absolute pleasure.

    Contributors

    About the author

    Vijaya Kumar Suda is a seasoned data and AI professional with over two decades of experience collaborating with global clients. Having lived and worked in diverse locations such as Switzerland, Belgium, Mexico, Bahrain, India, Canada, and the USA, Vijaya has successfully assisted customers spanning various industries. Currently serving as a senior data and AI consultant at Microsoft, he is instrumental in guiding industry partners through their digital transformation endeavors using cutting-edge cloud technologies and AI capabilities. His proficiency encompasses architecture, data engineering, machine learning, generative AI, and cloud solutions. Vijaya also shares his insights through engaging videos on the cloud, data, and AI on his YouTube channel, Cloud & Data Science.

    About the reviewers

    Pritesh Kanani is a full stack developer with experience in data wrangling and supervised machine learning. He helped a major oil and gas company build a tool to monitor drilling operations and handle thousands of high-frequency data streams. He completed a postgraduate course in applied AI and is currently applying his full stack data science and cloud computing skills at a leading nuclear and renewable energy organization in Ontario, Canada.

    Sourav Roy is a passionate data enthusiast, an experienced machine learning practitioner, and an expert book reviewer with a focus on data-related literature. He possesses a diverse skill set in data engineering and data analytics, which allows him to combine technical proficiency with a deep passion for his work on data-centric books. Sourav obtained a master’s degree in data science and analytics from Queen’s University. He is presently employed as a data engineer in the banking sector.

    Mitesh Mangaonkar is an engineering leader pioneering generative AI to transform data platforms. As a tech lead at Airbnb, he builds cutting-edge data pipelines leveraging big data technologies and modern data stacks to power trust and safety products. Previously, at AWS, Mitesh helped Fortune 500 companies migrate their data warehouses to the cloud and engineered highly scalable, resilient systems. An innovator at heart, he combines deep data engineering expertise with a passion for AI to create the next generation of data products. Mitesh is an influential voice shaping the future of data engineering and governance.

    Table of Contents

    Preface

    Part 1: Labeling Tabular Data

    1

    Exploring Data for Machine Learning

    Technical requirements

    EDA and data labeling

    Understanding the ML project life cycle

    Defining the business problem

    Data discovery and data collection

    Data exploration

    Data labeling

    Model training

    Model evaluation

    Model deployment

    Introducing Pandas DataFrames

    Summary statistics and data aggregates

    Summary statistics

    Data aggregates of the feature for each target class

    Creating visualizations using Seaborn for univariate and bivariate analysis

    Univariate analysis

    Bivariate analysis

    Profiling data using the ydata-profiling library

    Variables section

    Interactions section

    Correlations

    Missing values

    Sample data

    Unlocking insights from data with OpenAI and LangChain

    Summary

    2

    Labeling Data for Classification

    Technical requirements

    Predicting labels with LLMs for tabular data

    Data labeling using Snorkel

    What is Snorkel?

    Why is Snorkel popular?

    Loading unlabeled data

    Creating the labeling functions

    Labeling rules

    Constants

    Labeling functions

    Creating a label model

    Predicting labels

    Labeling data using the Compose library

    Labeling data using semi-supervised learning

    What is semi-supervised learning?

    What is pseudo-labeling?

    Labeling data using K-means clustering

    What is unsupervised learning?

    K-means clustering

    Inertia

    Dunn's index

    Summary

    3

    Labeling Data for Regression

    Technical requirements

    Using summary statistics to generate housing price labels

    Finding the closest labeled observation to match the label

    Using semi-supervised learning to label regression data

    Pseudo-labeling

    Using data augmentation to label regression data

    Using k-means clustering to label regression data

    Summary

    Part 2: Labeling Image Data

    4

    Exploring Image Data

    Technical requirements

    Visualizing image data using Matplotlib in Python

    Loading the data

    Checking the dimensions

    Visualizing the data

    Checking for outliers

    Performing data preprocessing

    Checking for class imbalance

    Identifying patterns and relationships

    Evaluating the impact of preprocessing

    Practice example of visualizing data

    Practice example for adding annotations to an image

    Practice example of image segmentation

    Practice example for feature extraction

    Analyzing image size and aspect ratio

    Impact of aspect ratios on model performance

    Image resizing

    Image normalization

    Performing transformations on images – image augmentation

    Summary

    5

    Labeling Image Data Using Rules

    Technical requirements

    Labeling rules based on image visualization

    Image labeling using rules with Snorkel

    Weak supervision

    Rules based on the manual visualization of an image’s object color

    Real-world applications

    A practical example of plant disease detection

    Labeling images using rules based on properties

    Bounding boxes

    Example 1 – image classification – a bicycle with and without a person

    Example 2 – image classification – dog and cat images

    Labeling images using transfer learning

    Example – digit classification using a pre-trained classifier

    Example – person image detection using the YOLO V3 pre-trained classifier

    Example – bicycle image detection using the YOLO V3 pre-trained classifier

    Labeling images using transformations

    Summary

    6

    Labeling Image Data Using Data Augmentation

    Technical requirements

    Training support vector machines with augmented image data

    Kernel trick

    Data augmentation

    Image data augmentation

    Implementing an SVM with data augmentation in Python

    Introducing the CIFAR-10 dataset

    Loading the CIFAR-10 dataset in Python

    Preprocessing the data for SVM training

    Implementing an SVM with the default hyperparameters

    Evaluating SVM on the original dataset

    Implementing an SVM with an augmented dataset

    Training the SVM on augmented data

    Evaluating the SVM’s performance on the augmented dataset

    Image classification using the SVM with data augmentation on the MNIST dataset

    Convolutional neural networks using augmented image data

    How CNNs work

    Practical example of a CNN using data augmentation

    CNN using image data augmentation with the CIFAR-10 dataset

    Summary

    Part 3: Labeling Text, Audio, and Video Data

    7

    Labeling Text Data

    Technical requirements

    Real-world applications of text data labeling

    Tools and frameworks for text data labeling

    Exploratory data analysis of text

    Loading the data

    Understanding the data

    Cleaning and preprocessing the data

    Exploring the text’s content

    Analyzing relationships between text and other variables

    Visualizing the results

    Exploratory data analysis of sample text data set

    Exploring Generative AI and OpenAI for labeling text data

    GPT models by OpenAI

    Zero-shot learning capabilities

    Text classification with OpenAI models

    Data labeling assistance

    OpenAI API overview

    Use case 1 – summarizing the text

    Use case 2 – topic generation for news articles

    Use case 3 – classification of customer queries using the user-defined categories and sub-categories

    Use case 4 – information retrieval using entity extraction

    Use case 5 – aspect-based sentiment analysis

    Hands-on labeling of text data using the Snorkel API

    Hands-on text labeling using Logistic Regression

    Hands-on label prediction using K-means clustering

    Generating labels for customer reviews (sentiment analysis)

    Summary

    8

    Exploring Video Data

    Technical requirements

    Loading video data using cv2

    Extracting frames from video data for analysis

    Extracting features from video frames

    Color histogram

    Optical flow features

    Motion vectors

    Deep learning features

    Appearance and shape descriptors

    Visualizing video data using Matplotlib

    Frame visualization

    Temporal visualization

    Motion visualization

    Labeling video data using k-means clustering

    Overview of data labeling using k-means clustering

    Example of video data labeling using k-means clustering with a color histogram

    Advanced concepts in video data analysis

    Motion analysis in videos

    Object tracking in videos

    Facial recognition in videos

    Video compression techniques

    Real-time video processing

    Video data formats and quality in machine learning

    Common issues in handling video data for ML models

    Troubleshooting steps

    Summary

    9

    Labeling Video Data

    Technical requirements

    Capturing real-time video

    Key components and features

    A hands-on example to capture real-time video using a webcam

    Building a CNN model for labeling video data

    Using autoencoders for video data labeling

    A hands-on example to label video data using autoencoders

    Transfer learning

    Using the Watershed algorithm for video data labeling

    A hands-on example to label video data segmentation using the Watershed algorithm

    Computational complexity

    Performance metrics

    Real-world examples for video data labeling

    Advances in video data labeling and classification

    Summary

    10

    Exploring Audio Data

    Technical requirements

    Real-life applications for labeling audio data

    Audio data fundamentals

    Hands-on with analyzing audio data

    Example code for loading and analyzing sample audio file

    Best practices for audio format conversion

    Example code for audio data cleaning

    Extracting properties from audio data

    Tempo

    Chroma features

    Mel-frequency cepstral coefficients (MFCCs)

    Zero-crossing rate

    Spectral contrast

    Considerations for extracting properties

    Visualizing audio data with matplotlib and Librosa

    Waveform visualization

    Loudness visualization

    Spectrogram visualization

    Mel spectrogram visualization

    Considerations for visualizations

    Ethical implications of audio data

    Recent advances in audio data analysis

    Troubleshooting common issues during data analysis

    Troubleshooting common installation issues for audio libraries

    Summary

    11

    Labeling Audio Data

    Technical requirements

    Downloading FFmpeg

    Azure Machine Learning

    Real-time voice classification with Random Forest

    Transcribing audio using the OpenAI Whisper model

    Step 1 – importing the Whisper model

    Step 2 – loading the base Whisper model

    Step 3 – setting up FFmpeg

    Step 4 – transcribing the YouTube audio using the Whisper model

    Classifying a transcription using Hugging Face transformers

    Hands-on – labeling audio data using a CNN

    Exploring audio data augmentation

    Introducing Azure Cognitive Services – the speech service

    Creating an Azure Speech service

    Speech to text

    Speech translation

    Summary

    12

    Hands-On Exploring Data Labeling Tools

    Technical requirements

    Azure Machine Learning data labeling

    Label Studio

    pyOpenAnnotate

    Data labeling using Azure Machine Learning

    Benefits of data labeling with Azure Machine Learning

    Data labeling steps using Azure Machine Learning

    Image data labeling with Azure Machine Learning

    Text data labeling with Azure Machine Learning

    Audio data labeling using Azure Machine Learning

    Integration of the Azure Machine Learning pipeline with the labeled dataset

    Exploring Label Studio

    Labeling the image data

    Labeling the text data

    Labeling the video data

    pyOpenAnnotate

    Computer Vision Annotation Tool

    Comparison of data labeling tools

    Advanced methods in data labeling

    Active learning

    Semi-automated labeling

    Summary

    Index

    Other Books You May Enjoy

    Preface

    In today’s data-driven era, where more than 2.5 quintillion bytes of data are produced daily in various forms such as text, image, audio, and video, data stands as the cornerstone of the AI revolution. However, the majority of real-world data available for training supervised machine learning models lacks labels, or we encounter limited labeled data. This presents a significant challenge, as labeled data is essential for training any supervised machine learning model and fine-tuning large language models in the age of generative AI.

    To address the scarcity of labeled data and facilitate the preparation of labeled data for training supervised machine learning models and fine-tuning large language models, this book introduces various methods for programmatic data labeling using Python libraries and methods, including semi-supervised and unsupervised learning.

    This book guides you through the process of loading and analyzing tabular data, images, videos, audio, and text using various Python libraries, the OpenAI API, LangChain, and Azure Machine Learning. It explores techniques such as weak supervision, pseudo-labeling, and K-means clustering for classification and labeling, while also providing data augmentation methods to enhance accuracy. Utilizing the Azure OpenAI API and LangChain, the book demonstrates the automation of data analysis using natural language without the need to acquire any programming skills. It also encompasses the classification and data labeling of text data using OpenAI and large language models (LLMs). This book covers a wide variety of open source data annotation tools, along with Azure Machine Learning, and compares the pros and cons of these tools.

    Real-world examples from various industries are incorporated to illustrate the application of these methods to tabular, text, image, video, and audio data.

    By the conclusion of this book, you will have acquired the skills to explore different types of data using Python and OpenAI LLMs. You will have learned how to prepare data with labels, whether for training machine learning models or unlocking insights about the data to leverage for business use cases across industries.

    Who this book is for

    This book is for aspiring AI engineers, machine learning engineers, data scientists, and data engineers who want to learn about data labeling methods and algorithms for model training. Data enthusiasts and Python developers will be able to use this book to learn about data exploration and annotation using Python libraries.

    What this book covers

    Chapter 1, Exploring Data for Machine Learning, provides an overview of data analysis and visualization methods using various Python libraries. Additionally, it deep dives into unlocking data insights with natural language using OpenAI LLMs.

    Chapter 2, Labeling Data for Classification, covers the process of labeling tabular data for training classification models. Various methods, such as Snorkel Python functions, semi-supervised learning, and clustering data using K-means, are explored.

    Chapter 3, Labeling Data for Regression, addresses the labeling of tabular data for training regression models. Techniques include leveraging summary statistics, creating pseudo labels, employing data augmentation methods, and utilizing K-means clustering.

    Chapter 4, Exploring Image Data, covers the analysis and visualization of image data and feature extraction from images using various Python libraries.

    Chapter 5, Labeling Image Data Using Rules, discusses labeling images based on heuristics and image properties such as aspect ratio, and also covers image classification using pre-trained classifiers such as YOLO.

    Chapter 6, Labeling Image Data Using Data Augmentation, explores methods of image data augmentation for training support vector machines and Convolutional Neural Networks (CNNs), as well as addressing image data labeling.

    Chapter 7, Labeling Text Data, covers generative AI and various methods for labeling text data. This includes Azure OpenAI with real-world use cases, text classification, and sentiment analysis using Snorkel and K-means clustering.

    Chapter 8, Exploring Video Data, focuses on loading video data, extracting features, visualizing video data, and clustering video data using K-means clustering.

    Chapter 9, Labeling Video Data, delves into labeling video data using CNNs, segmenting video data with the watershed algorithm, and capturing important features using autoencoders, accompanied by real-world examples.

    Chapter 10, Exploring Audio Data, provides the fundamentals of audio data, loading and visualizing audio data, extracting features, and real-life applications.

    Chapter 11, Labeling Audio Data, covers transcribing audio data using OpenAI’s Whisper model, labeling the transcription, creating spectrograms for audio data classification, augmenting audio data, and using Azure Cognitive Services for speech.

    Chapter 12, Hands-On Exploring Data Labeling Tools, covers various data labeling tools, including open source tools such as Label Studio, CVAT, pyOpenAnnotate, and Azure Machine Learning. It also includes a comparison of various data labeling tools for image, text, audio, and video data.

    To get the most out of this book

    Basic Python knowledge is beneficial but not necessary to get the most out of this book.

    If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

    Download the example code files

    You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Data-Labeling-in-Machine-Learning-with-Python. If there’s an update to the code, it will be updated in the GitHub repository.

    We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

    Conventions used

    There are a number of text conventions used throughout this book.

    Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: Now let us generate the augmented data by calling the noise, scale, and rotation augmentation functions, as follows.

    A block of code is set as follows:

    # Train a linear regression model on the labeled data
    regressor = LinearRegression()
    regressor.fit(train_data, train_labels)

    When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

    news_headline = (
        "Label the following news headline into 1 of the following categories: "
        "Business, Tech, Politics, Sport, Entertainment\n\n"
        "Headline 1: Trump is ready to contest in nov 2024 elections\nCategory:"
    )

    response = openai.Completion.create(
        engine=model_deployment_name,
        prompt=news_headline,
        temperature=0,
    )

    Any command-line input or output is written as follows:

    pip install keras

    Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: Change System preferences | Security and privacy | General, and then select Open anyway.

    Tips or important notes

    Appear like this.

    Get in touch

    Feedback from our readers is always welcome.

    General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

    Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

    Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

    If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

    Share your thoughts

    Once you’ve read Data Labeling in Machine Learning with Python, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

    Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

    Download a free PDF copy of this book

    Thanks for purchasing this book!

    Do you like to read on the go but are unable to carry your print books everywhere?

    Is your eBook purchase not compatible with the device of your choice?

    Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

    Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.

    The perks don’t stop there: you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.

    Follow these simple steps to get the benefits:

    Scan the QR code or visit the link below

    https://packt.link/free-ebook/9781804610541

    2. Submit your proof of purchase

    3. That’s it! We’ll send your free PDF and other benefits to your email directly

    Part 1: Labeling Tabular Data

    This part of the book will guide you in exploring tabular data and programmatically labeling the data using Python libraries, such as Snorkel labeling functions. You will be able to achieve this without requiring any prior data science knowledge. Additionally, it covers data labeling using K-means clustering.
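
    To give a flavor of what a labeling function looks like, here is a minimal, hypothetical sketch using the Snorkel API; the column names (education_num, hours_per_week) and the thresholds are illustrative assumptions rather than the book's actual rules, and Chapter 2 walks through the full workflow on real data:

    import pandas as pd
    from snorkel.labeling import labeling_function, PandasLFApplier
    from snorkel.labeling.model import LabelModel

    ABSTAIN, LOW, HIGH = -1, 0, 1

    @labeling_function()
    def lf_education(row):
        # Heuristic: many years of education often correlates with income above $50K
        return HIGH if row["education_num"] >= 13 else ABSTAIN

    @labeling_function()
    def lf_part_time(row):
        # Heuristic: very short working weeks rarely exceed the $50K threshold
        return LOW if row["hours_per_week"] < 20 else ABSTAIN

    # Toy unlabeled rows standing in for the Income dataset
    df = pd.DataFrame({"education_num": [9, 14, 16, 10],
                       "hours_per_week": [40, 50, 60, 10]})

    # Apply the labeling functions and combine their noisy votes into one weak label per row
    L_train = PandasLFApplier([lf_education, lf_part_time]).apply(df)
    label_model = LabelModel(cardinality=2, verbose=False)
    label_model.fit(L_train, n_epochs=100, seed=42)
    df["weak_label"] = label_model.predict(L_train)
    print(df)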

    This part comprises the following chapters:

    Chapter 1, Exploring Data for Machine Learning

    Chapter 2, Labeling Data for Classification

    Chapter 3, Labeling Data for Regression

    1

    Exploring Data for Machine Learning

    Imagine embarking on a journey through an expansive ocean of data, where untold stories, patterns, and insights wait to be discovered. Welcome to the world of data exploration in machine learning (ML). In this chapter, I encourage you to put on your analytical lenses as we set out on a thrilling quest to delve deep into the heart of your data, armed with powerful techniques and heuristics, and uncover its secrets. Along the way, you will discover that beneath the surface of raw numbers and statistics lies a treasure trove of patterns that, once revealed, can transform your data into a valuable asset. The journey begins with exploratory data analysis (EDA), a crucial phase where we unravel the mysteries of data, laying the foundation for automated labeling and, ultimately, building smarter and more accurate ML models. In this age of generative AI, preparing quality training data is essential for fine-tuning domain-specific large language models (LLMs); fine-tuning involves curating additional domain-specific labeled data for training publicly available LLMs. So, fasten your seatbelts for a captivating voyage into the art and science of data exploration for data labeling.

    First, let’s start with the question: What is data exploration? It is the initial phase of data analysis, where raw data is examined, visualized, and summarized to uncover patterns, trends, and insights. It serves as a crucial step in understanding the nature of the data before applying advanced analytics or ML techniques.

    In this chapter, we will explore tabular data using various libraries and packages in Python, including Pandas, NumPy, and Seaborn. We will also plot bar charts and histograms to visualize the data and find relationships between features, which is useful for labeling data. We will be exploring the Income dataset located in this book’s GitHub repository (a link is provided in the Technical requirements section). A good understanding of the data is necessary to define business rules, identify matching patterns, and, subsequently, label the data using Python labeling functions.

    By the end of this chapter, we will be able to generate summary statistics for a given dataset and derive aggregates of its features for each target class. We will also learn how to perform univariate and bivariate analyses of the features, and we will create a profiling report using the ydata-profiling library, as previewed in the short sketch that follows.
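
    As a quick preview of that report, the following minimal sketch assumes the Income data has been downloaded locally as income.csv (a hypothetical filename; use the file from this book's GitHub repository):

    import pandas as pd
    from ydata_profiling import ProfileReport

    df = pd.read_csv("income.csv")  # hypothetical local copy of the Income dataset

    # Build an HTML report covering variables, interactions, correlations, and missing values
    profile = ProfileReport(df, title="Income dataset profile", minimal=True)
    profile.to_file("income_profile.html")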

    We’re going to cover the following main topics:

    EDA and data labeling

    Summary statistics and data aggregates with Pandas

    Data visualization with Seaborn for univariate and bivariate analysis

    Profiling data using the ydata-profiling library

    Unlocking insights from data with OpenAI and LangChain

    Technical requirements

    One of the following Python IDEs or software tools needs to be installed before running the notebook in this chapter:

    Anaconda Navigator: Download and install the open source Anaconda Navigator from the following URL:

    https://docs.anaconda.com/navigator/install/#system-requirements

    Jupyter Notebook: Download and install Jupyter Notebook:

    https://jupyter.org/install

    We can also use free online Python editors such as Google Colab (https://colab.research.google.com/) or Replit (https://replit.com/).

    The Python source code and the entire notebook created in this chapter are available in this book’s GitHub repository:

    https://github.com/PacktPublishing/Data-Labeling-in-Machine-Learning-with-Python

    You also need to create an Azure account and add an OpenAI resource for working with generative AI. To sign up for a free Azure subscription, visit https://azure.microsoft.com/free. To request access to the Azure OpenAI service, visit https://aka.ms/oaiapply.

    Once you have provisioned the Azure OpenAI service, deploy an LLM – either GPT-3.5 Turbo or GPT-4 – from Azure OpenAI Studio. Then copy the keys and endpoint for the resource from Azure OpenAI Studio and set the following environment variables:

    import os

    os.environ['AZURE_OPENAI_KEY'] = 'your_api_key'
    os.environ['AZURE_OPENAI_ENDPOINT'] = 'your_azure_openai_endpoint'

    Your endpoint should look like this: https://YOUR_RESOURCE_NAME.openai.azure.com/.
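
    As a quick sanity check that the resource is reachable, the pre-1.0 openai package (the same style used in the code snippets later in this book) can be pointed at these environment variables as shown in the following sketch; the API version and deployment name are placeholders that you must replace with the values shown for your own deployment:

    import os
    import openai  # pre-1.0 openai package, matching the Completion.create style used in this book

    openai.api_type = "azure"
    openai.api_key = os.environ["AZURE_OPENAI_KEY"]
    openai.api_base = os.environ["AZURE_OPENAI_ENDPOINT"]
    openai.api_version = "2023-05-15"  # assumption; use the API version listed for your deployment

    response = openai.Completion.create(
        engine="your_deployment_name",  # your GPT-3.5 Turbo / GPT-4 deployment name from Azure OpenAI Studio
        prompt="Say hello in one word.",
        temperature=0,
        max_tokens=5,
    )
    print(response["choices"][0]["text"])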

    EDA and data labeling

    In this section, we will gain an understanding of what EDA is. We will see why we need to perform it and discuss its advantages. We will also look at the life cycle of an ML project and learn about the role of data labeling in this cycle.

    EDA comprises data discovery, data collection, data cleaning, and data exploration. These steps are part of any machine learning project. The data exploration step includes tasks such as data visualization, summary statistics, correlation analysis, and data distribution analysis. We will dive deep into these steps in the upcoming sections.

    Here are some real-world examples of EDA:

    Customer churn analysis: Suppose you work for a telecommunications company and you want to understand why customers are churning (canceling their subscriptions); in this case, conducting EDA on customer churn data can provide valuable insights.

    Income data analysis: EDA on the Income dataset with predictive features such as education, employment status, and marital status helps to predict whether the salary of a person is greater than $50K.

    EDA is a critical process for any ML or data science project, and it allows us to understand the data and gain some valuable insights into the data domain and business.

    In this chapter, we will use various Python libraries, such as Pandas, and call the describe and info functions on Pandas to generate data summaries. We will discover anomalies in the data and any outliers in the given dataset. We will also figure out various data types and any missing values in the data. We will understand whether any data type conversions are required, such as converting string to float, for performing further analysis. We will also analyze the data formats and see whether any transformations are required to standardize them, such as the date format. We will analyze the counts of different labels and understand whether the dataset is balanced or imbalanced. We will understand the relationships between various features in the data and calculate the correlations between features.
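
    The following is a minimal sketch of those first checks; it assumes the Income data is available locally as income.csv and that the target column is named income (both assumptions, so substitute the actual file and column names from the repository):

    import pandas as pd

    df = pd.read_csv("income.csv")  # hypothetical local copy of the Income dataset

    df.info()                                          # column data types and non-null counts
    print(df.describe())                               # summary statistics for numeric features
    print(df.isnull().sum())                           # missing values per column
    print(df["income"].value_counts(normalize=True))   # label balance (assumed target column)
    print(df.select_dtypes("number").corr())           # correlations between numeric features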

    To summarize, we will understand the patterns in the given dataset and also identify the relationships between various features in the data samples. Finally, we will come up with a strategy and domain rules for data cleaning and transformation. This helps us to predict labels for unlabeled data.

    We will plot various data visualizations using Python libraries such as seaborn and matplotlib. We will create bar charts, histograms, heatmaps, and other charts to visualize the importance of features in the dataset and how they relate to each other.
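
    Here is a compact sketch of those plots, again assuming the hypothetical income.csv file and the age and income columns used in the previous sketch:

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    df = pd.read_csv("income.csv")  # hypothetical local copy of the Income dataset

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    sns.histplot(data=df, x="age", ax=axes[0])            # univariate distribution of one feature
    sns.countplot(data=df, x="income", ax=axes[1])        # class balance of the target label
    sns.heatmap(df.select_dtypes("number").corr(),
                annot=True, cmap="coolwarm", ax=axes[2])  # feature correlation heatmap
    plt.tight_layout()
    plt.show()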

    Understanding the ML project life cycle

    The following are the major steps in an ML project:

    Figure 1.1 – ML project life cycle diagram