
Machine Learning Life Cycle

Machine learning gives computer systems the ability to learn automatically without being explicitly programmed. But how does a machine learning system work? Its workflow can be described by the machine learning life cycle, a cyclic process for building an efficient machine learning project. The main purpose of the life cycle is to find a solution to the problem at hand.
The machine learning life cycle involves seven major steps, given below:
o Gathering Data
o Data Preparation
o Data Wrangling
o Data Analysis
o Train Model
o Test Model
o Deployment
The most important thing in the whole process is to understand the problem and its purpose. Therefore, before starting the life cycle, we need to understand the problem well, because a good result depends on a good understanding of the problem.
In the complete life cycle, to solve a problem we create a machine learning system called a "model", and this model is created through "training". But to train a model we need data, so the life cycle starts with collecting data.

1. Gathering Data
Data gathering is the first step of the machine learning life cycle. The goal of this step is to identify the data requirements and obtain the data related to the problem.
In this step, we need to identify the different data sources, as data can be collected from sources such as files, databases, the internet, or mobile devices. This is one of the most important steps of the life cycle, because the quantity and quality of the collected data determine the quality of the output: in general, the more relevant data we have, the more accurate the predictions will be.
This step includes the following tasks:
o Identify various data sources
o Collect data
o Integrate the data obtained from different sources
By performing these tasks, we get a coherent set of data, also called a dataset, which will be used in the further steps.
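As a minimal, hypothetical sketch of these tasks (the file names and the shared customer_id key are assumptions, not part of the original walkthrough):

import pandas as pd

# Collect data from two hypothetical sources
orders = pd.read_csv('orders.csv')      # e.g., exported from a database
profiles = pd.read_csv('profiles.csv')  # e.g., a file shared by another team

# Integrate the sources into one coherent dataset via a shared key
dataset = orders.merge(profiles, on='customer_id', how='inner')
print(dataset.shape)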

2. Data Preparation
After collecting the data, we need to prepare it for the further steps. Data preparation is the step where we organize the data and get it ready for use in machine learning training.
In this step, we first put all the data together and then randomize its ordering.
This step can be further divided into two processes:
o Data exploration:
It is used to understand the nature of the data we have to work with: its characteristics, format, and quality.
A better understanding of the data leads to a more effective outcome. Here we look for correlations, general trends, and outliers.
o Data pre-processing:
The next step is pre-processing the data to get it ready for analysis.
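As a brief sketch, assuming the hypothetical dataset from the previous step, randomizing the ordering and exploring the data could look like this:

# Randomize the row ordering (frac=1 keeps every row, just shuffled)
dataset = dataset.sample(frac=1, random_state=42).reset_index(drop=True)

# Data exploration: characteristics, format, and quality
print(dataset.dtypes)                    # format of each column
print(dataset.describe())                # general trends and spread (outliers show up here)
print(dataset.corr(numeric_only=True))   # correlations between numeric columns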

3. Data Wrangling
Data wrangling is the process of cleaning raw data and converting it into a usable format. It involves cleaning the data, selecting the variables to use, and transforming the data into a proper format to make it more suitable for analysis in the next step. It is one of the most important steps of the complete process, as cleaning is required to address quality issues.
The data we collect is not always usable as-is, since some of it may not serve our purpose. In real-world applications, collected data may have various issues, including:
o Missing values
o Duplicate data
o Invalid data
o Noise
So we use various filtering techniques to clean the data.
It is essential to detect and resolve these issues, because they can negatively affect the quality of the outcome.
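A minimal illustration of such filtering, again assuming the hypothetical dataset from above (the 'age' column and its valid range are made-up examples):

# Remove duplicate rows
dataset = dataset.drop_duplicates()

# Handle missing values: drop rows missing the target, impute the rest
dataset = dataset.dropna(subset=['target_column'])
dataset = dataset.fillna(dataset.mean(numeric_only=True))

# Filter out invalid data, e.g., values outside a plausible range
dataset = dataset[(dataset['age'] >= 0) & (dataset['age'] <= 120)]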

4. Data Analysis
Now the cleaned and prepared data is passed on to the analysis step. This step involves:
o Selecting analytical techniques
o Building models
o Reviewing the results
The aim of this step is to build a machine learning model that analyzes the data using various analytical techniques, and to review the outcome. It starts with determining the type of problem, based on which we select a machine learning technique such as Classification, Regression, Cluster analysis, or Association; we then build the model using the prepared data and evaluate it.
Hence, in this step, we take the data and use machine learning algorithms to build the model.
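As a rough, non-exhaustive sketch, the problem type maps to a family of scikit-learn estimators:

from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.cluster import KMeans

# Classification: the target is a discrete label
clf = LogisticRegression()

# Regression: the target is a continuous value
reg = LinearRegression()

# Cluster analysis: no target; group similar rows together
km = KMeans(n_clusters=3, n_init=10)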

5. Train Model
Now the next step is to train the model, in this step we train our model to improve its
performance for better outcome of the problem.
We use datasets to train the model using various machine learning algorithms. Training a model
is required so that it can understand the various patterns, rules, and, features.
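In scikit-learn terms, training reduces to a single fit call; here is a self-contained sketch using synthetic data in place of a real prepared dataset:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a prepared dataset
X_train, y_train = make_classification(n_samples=200, n_features=5, random_state=42)

# Training: the model learns patterns, rules, and features from the data
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)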

6. Test Model
Once our machine learning model has been trained on a given dataset, we test it. In this step, we check the accuracy of the model by providing a test dataset to it.
Testing the model gives us the percentage accuracy of the model, measured against the requirements of the project or problem.
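A self-contained sketch of testing, again on synthetic data, holds out part of the data and scores the trained model on it:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic data, split so the test set is never seen during training
X, y = make_classification(n_samples=250, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Percentage accuracy on the held-out test set
print(f'Accuracy: {accuracy_score(y_test, model.predict(X_test)):.2%}')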
7. Deployment
The last step of the machine learning life cycle is deployment, where we deploy the model in a real-world system.
If the prepared model produces accurate results at an acceptable speed as per our requirements, we deploy it in the real system. But before deploying the project, we check whether it still performs well on the available data. The deployment phase is similar to preparing the final report for a project.
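As an illustrative sketch only (not part of the original walkthrough), a model saved with joblib, as in step 9 below, could be served over HTTP with Flask; the file name, port, and JSON layout are all assumptions:

import joblib
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load('final_model.pkl')  # hypothetical path to the saved model

@app.route('/predict', methods=['POST'])
def predict():
    # Expect a JSON body like {"features": [0.1, 2.3, ...]}
    features = np.array(request.get_json()['features']).reshape(1, -1)
    return jsonify({'prediction': model.predict(features).tolist()})

if __name__ == '__main__':
    app.run(port=5000)

The numbered sections that follow walk through the full life cycle in Python code, from loading the data to saving the final model.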
1. Import Required Libraries

# Basic libraries
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score, roc_curve

2. Load and Inspect the Data

# Load the dataset
data = pd.read_csv('your_dataset.csv')

# Preview the first few rows
print(data.head())

# Check for missing values
print(data.isnull().sum())

# Check data types and basic statistics
print(data.info())
print(data.describe())

3. Data Preprocessing
Handle Missing Values

# Drop rows with missing target values (if any)
data.dropna(subset=['target_column'], inplace=True)

# Fill or impute missing values in the remaining numeric columns
data.fillna(data.mean(numeric_only=True), inplace=True)  # Example: fill with column mean

Encode Categorical Variables

# Convert categorical variables to numeric values
for col in data.select_dtypes(include=['object']).columns:
    data[col] = LabelEncoder().fit_transform(data[col])

Feature Scaling

# Separate features and target variable
X = data.drop('target_column', axis=1)
y = data['target_column']

# Standardize features (zero mean, unit variance)
scaler = StandardScaler()
X = scaler.fit_transform(X)

4. Split Data into Training and Testing Sets

# Split data (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

5. Model Selection and Training
Initialize and Train a Basic Model

# Choose a model (e.g., Random Forest)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

Evaluate Initial Model Performance

# Predict on test data
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

# Classification report and confusion matrix
print(classification_report(y_test, y_pred))
conf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.show()

ROC Curve and AUC Score

# Calculate ROC-AUC score (assumes a binary target)
y_pred_proba = model.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, y_pred_proba)
print(f'ROC-AUC Score: {roc_auc:.2f}')

# Plot ROC curve
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
plt.plot(fpr, tpr, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()

6. Model Improvement
Hyperparameter Tuning with Grid Search

# Set up parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Best parameters and best score
print(f'Best parameters: {grid_search.best_params_}')
print(f'Best cross-validation accuracy: {grid_search.best_score_:.2f}')

Train Model with Best Parameters

# Use the best estimator found by grid search
# (it is already refit on the full training data by default)
best_model = grid_search.best_estimator_

# Evaluate the optimized model
y_pred_best = best_model.predict(X_test)
accuracy_best = accuracy_score(y_test, y_pred_best)
print(f'Optimized Model Accuracy: {accuracy_best:.2f}')

# Updated classification report and confusion matrix
print(classification_report(y_test, y_pred_best))
conf_matrix_best = confusion_matrix(y_test, y_pred_best)
sns.heatmap(conf_matrix_best, annot=True, fmt='d', cmap='Greens')
plt.show()

7. Feature Importance Analysis

# Feature importance: use the original feature names,
# since scaling converted X to a plain NumPy array
feature_names = data.drop('target_column', axis=1).columns
feature_importances = pd.Series(best_model.feature_importances_, index=feature_names)
feature_importances.sort_values().plot(kind='barh', title='Feature Importance')
plt.show()

8. Cross-Validation and Final Model Evaluation

# 10-fold cross-validation accuracy on the full dataset
cv_scores = cross_val_score(best_model, X, y, cv=10, scoring='accuracy')
print(f'Cross-validated Accuracy: {np.mean(cv_scores):.2f} ± {np.std(cv_scores):.2f}')

9. Save the Model for Future Use

import joblib

# Save the trained model
joblib.dump(best_model, 'final_model.pkl')

# Load the model later with:
# model = joblib.load('final_model.pkl')
