This document outlines the machine learning process, detailing the steps involved in developing a machine learning model, including problem definition, data gathering, preparation, analysis, feature engineering, model training, evaluation, and deployment. It emphasizes the importance of systematic development and best practices to enhance model performance. Additionally, it includes an assignment to describe various machine learning processes and compare them with data mining processes.
Session 4 Machine Learning Process
Learning Outcomes
By the end of this lecture, you will be able to:
• Understand the process of developing a machine learning model.
• Identify and explain each step in the machine learning life cycle.
• Apply the machine learning life cycle to real-world examples.
• Recognize common challenges and best practices in each phase of the cycle.

Machine Learning Overview
• Machine learning is a subset of artificial intelligence (AI).
• It trains computers to mimic human thinking.
• It uses real-world data for training.
• It follows predefined steps to train a computer.
• This sequence of steps is known as the machine learning life cycle.

Steps in the Machine Learning Process
• The life cycle guides the development and deployment of machine learning models.
• It is a structured process with several steps.
• Understanding the life cycle:
  • ensures systematic development and deployment,
  • improves efficiency, and
  • enhances model performance.
• Before starting the process, clearly define the problem you aim to solve.

Problem Definition
• Example problem: predicting customer churn for a telecom company.
• Key considerations: business objectives, success metrics, feasibility.

Step 1: Gathering Data
• Identify data sources
  • Recognize where data can be collected from.
  • Examples: files, databases, the internet, mobile devices.
• Collect data
  • Gather data from the identified sources.
  • Ensure the data is relevant and comprehensive.
• Integrate data
  • Combine data from different sources.
  • Create a coherent and unified dataset.
• Outcome
  • A ready-to-use dataset for further processing.

Step 2: Data Preparation
• Raw data is often messy and unstructured.
• Data cleaning addresses issues such as missing values, outliers, and inconsistencies that could compromise the accuracy and reliability of the machine learning model.
• Objective
  • Refine raw data for meaningful analysis.
  • Lay the foundation for robust model development.
• The basic features of data cleaning and preprocessing are discussed next.
• Data cleaning
  • Address missing values.
  • Handle outliers.
  • Resolve inconsistencies.
• Data preprocessing
  • Standardize formats.
  • Scale values.
  • Encode categorical variables.
• Data quality
  • Ensure well-organized data.
  • Prepare for meaningful analysis.
• Data integrity
  • Maintain dataset integrity.
  • Apply effective cleaning and preprocessing.

Step 3: Data Wrangling
• Data wrangling is the process of cleaning raw data and converting it into a usable format.
• It involves cleaning the data, selecting the variables to use, and transforming the data into a proper format so that it is more suitable for analysis in the next step.
• Cleaning is required to address data quality issues.
• In real-world applications, collected data may have various issues, including:
  • Missing values
  • Duplicate data
  • Invalid data
  • Noise (irrelevant or meaningless data)
• Various filtering techniques are used to clean the data.
• These issues must be detected and removed because they can negatively affect the quality of the outcome.

Step 4: Analyze Data
• Also called exploratory data analysis (EDA).
• EDA aims to understand the underlying patterns and characteristics of the collected data.
• It uses statistical and visual tools to gain insight into the dataset's structure.
• Visualizations, summary statistics, and correlation analyses play a crucial role.
• Examples of data visualization: histogram, scatter plot.
• Exploration: Use statistical and visual tools to explore the structure and patterns in the data.
• Patterns and trends: Identify underlying patterns, trends, and potential challenges within the dataset.
• Insights: Gain valuable insights to inform decisions in later stages of the machine learning process.
• Decision making: Use exploratory data analysis to make informed decisions about feature engineering and model selection.
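
The cleaning, wrangling, and analysis steps above (Steps 2 to 4) can be sketched in a few lines of pandas. This is only a minimal illustration under stated assumptions: the customer table, its column names, and the `clean` helper are all hypothetical, loosely modeled on the telecom churn example, and median imputation is one possible choice among several for missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical customer records; every column name and value is invented for illustration.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4, 5],
    "monthly_charge": [29.5, 80.0, 80.0, np.nan, 55.0, -10.0],
    "tenure_months": [12, 40, 40, 5, np.nan, 24],
    "plan": ["basic", "Premium", "Premium", "basic", "premium", "basic"],
    "churned": [0, 0, 0, 1, 1, 0],
})


def clean(df):
    """Apply the cleaning steps listed above to a copy of df (hypothetical helper)."""
    # Duplicate data: drop rows that are exact copies of an earlier row.
    out = df.drop_duplicates().copy()
    # Inconsistencies: standardize the spelling of the categorical column.
    out["plan"] = out["plan"].str.lower()
    # Invalid data: a negative charge is treated as invalid here (an assumption).
    valid = out["monthly_charge"].isna() | (out["monthly_charge"] >= 0)
    out = out[valid].copy()
    # Missing values: fill numeric gaps with the column median.
    for col in ["monthly_charge", "tenure_months"]:
        out[col] = out[col].fillna(out[col].median())
    return out.reset_index(drop=True)


cleaned = clean(raw)

# Exploratory data analysis: summary statistics and pairwise correlations.
summary = cleaned.describe()
correlations = cleaned[["monthly_charge", "tenure_months", "churned"]].corr()
print(summary)
print(correlations)
```

With matplotlib installed, `cleaned["monthly_charge"].hist()` would draw the kind of histogram mentioned under Step 4. On a table this small the statistics mean little; the point is only the order of operations.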
Step 5: Feature Engineering and Selection
• Feature selection: Identify the subset of features that most significantly impact the model's performance.
• Feature engineering: Create new features or transform existing ones to better capture patterns and relationships.
  • Requires domain expertise and a deep understanding of the problem.
  • The aim is to engineer features that contribute meaningfully to predictive power.
• Optimization: Balance the feature set for predictive accuracy while minimizing computational complexity.

Step 5: Feature Engineering and Selection – Example using Python
• Problem: Predict the `price` of houses using the available features.
• Dataset: Assume we have a dataset `house_data.csv` with the following columns:
  • house_id
  • size_in_sqft
  • num_bedrooms
  • num_bathrooms
  • location
  • year_built
  • price
• Loading the data
• Exploring the data
• Handling missing values
• Feature creation
  • Total rooms: Create a new feature by adding the number of bedrooms and bathrooms.
  • Age of house: Create a new feature representing the age of the house.
  • Location encoding: Convert categorical data into numerical data.
• Feature selection: Drop less relevant or redundant features.

Step 6: Train Model
• Split the dataset into training and testing sets.
  • Training set: used to train the model.
  • Testing set: used to evaluate the model.
• Select an appropriate machine learning algorithm.
  • Regression: Linear Regression, Ridge, Lasso, etc.
  • Classification: Logistic Regression, Decision Trees, Random Forest, SVM, etc.
  • Clustering: K-Means, Hierarchical Clustering, etc.
• Train the model.

Step 7: Model Evaluation
• Test the model to determine its accuracy.
• Involves rigorous testing against validation datasets.
• Evaluation metrics such as accuracy, precision, recall, and F1 score are computed to gauge the model's effectiveness.
• Provides insight into the model's strengths and weaknesses.

Step 8: Model Deployment
• The model is deployed in the real-world system.
• The deployment phase is similar to making the final report for a project.

Next Steps
1. Install a Python-compatible IDE (Integrated Development Environment).
2. Install the Weka machine learning environment.

Assignment
1. Describe the following machine learning processes: (6 marks)
   a. CRISP-DM
   b. SEMMA
   c. KDD
2. Identify the key differences and similarities between the data mining (KDD) and machine learning (CRISP-DM, SEMMA) processes. (4 marks)

Submit by: 19/05/2025 (hard copy)
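
Worked sketch for the house price example (feature engineering, training, and evaluation): the code from the original slides is not present in this text, so the following is a minimal sketch under stated assumptions, not the lecturer's code. It builds a synthetic stand-in for `house_data.csv` with the same columns, and the generated values, the `REFERENCE_YEAR` constant, the `engineer_features` helper, the choice of which columns to drop, and the choice of Linear Regression are all assumptions made for illustration. Because `price` is continuous, the sketch reports regression metrics (mean absolute error and R²) rather than the classification metrics listed under model evaluation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for house_data.csv; all values are invented for illustration.
rng = np.random.default_rng(0)
n = 200
houses = pd.DataFrame({
    "house_id": np.arange(n),
    "size_in_sqft": rng.integers(600, 3500, n).astype(float),
    "num_bedrooms": rng.integers(1, 6, n),
    "num_bathrooms": rng.integers(1, 4, n),
    "location": rng.choice(["urban", "suburban", "rural"], n),
    "year_built": rng.integers(1950, 2021, n),
})
houses["price"] = (
    150 * houses["size_in_sqft"]
    + 10_000 * houses["num_bedrooms"]
    + rng.normal(0, 20_000, n)
)
houses.loc[::25, "size_in_sqft"] = np.nan  # inject a few missing values

REFERENCE_YEAR = 2025  # assumed "current" year used to compute house age


def engineer_features(df):
    """Hypothetical helper covering the feature steps of the example."""
    out = df.copy()
    # Handling missing values: fill gaps in size with the median size.
    out["size_in_sqft"] = out["size_in_sqft"].fillna(out["size_in_sqft"].median())
    # Feature creation: total rooms and age of house.
    out["total_rooms"] = out["num_bedrooms"] + out["num_bathrooms"]
    out["house_age"] = REFERENCE_YEAR - out["year_built"]
    # Location encoding: one-hot encode the categorical column.
    out = pd.get_dummies(out, columns=["location"], drop_first=True, dtype=float)
    # Feature selection: drop the identifier and the columns made redundant
    # by the new features (which columns to drop is an assumption).
    return out.drop(columns=["house_id", "year_built", "num_bedrooms", "num_bathrooms"])


features = engineer_features(houses)
X = features.drop(columns=["price"])
y = features["price"]

# Train model: hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# Model evaluation on the held-out testing set.
predictions = model.predict(X_test)
mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print(f"MAE: {mae:,.0f}  R^2: {r2:.3f}")
```

With a real file, the synthetic table would be replaced by `pd.read_csv("house_data.csv")`. For a classification problem such as the churn example, the same structure applies with a classifier and with `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` from `sklearn.metrics`.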