Machine Learning With Python
Curated by
Telugu Gyaan
Table of Contents
Introduction to Machine Learning
Theory of Machine Learning
Advantages of machine learning
Disadvantages of machine learning
Applications of machine learning
Key topics in machine learning
Evaluation Metrics
Confusion Matrix
Accuracy
Precision
Recall or Sensitivity or True Positive Rate
F1 Score
Specificity or True Negative Rate
Area Under the Receiver Operating Characteristic Curve (AUC-ROC)
Approach
Theoretical Approach
Programming Approach
Introduction to Libraries used in Machine Learning Python
NumPy (Numerical Python)
Pandas
Scikit-learn
Matplotlib
TensorFlow
Keras
PyTorch
XGBoost
1. Linear Regression
Assumptions of Linear Regression
Simple Linear Regression
Multiple Linear Regression
Model Evaluation
Limitations of Linear Regression
Applications of Linear Regression
Implementing Linear Regression using Python
2. Logistic Regression
Advantages of Logistic Regression
Limitations of Logistic Regression
Applications of Logistic Regression
Implementing Logistic Regression using Python
3. Decision Tree Classifier
Advantages of Decision Tree Classifier
Limitations of Decision Tree Classifier
Applications of Decision Tree Classifier
Implementing Decision Tree Classifier using Python
4. Support Vector Machine
Advantages of Support Vector Machine
Limitations of Support Vector Machine
Applications of Support Vector Machine
Implementing SVM using Python
5. K Nearest Neighbors
KNN algorithm
Advantages of KNN
Limitations of KNN
Applications of KNN
Implementing KNN using Python
6. K Means Clustering
K-means clustering algorithm
Advantages of K-means clustering
Limitations of K-means clustering
Applications of K-means clustering
Implementing K-Means Clustering using Python
7. Principal Component Analysis
Steps involved in PCA
Applications of PCA in Machine Learning
Implementing PCA using Python
8. Random Forest
Advantages of Random Forest
Limitations of Random Forest
Applications of Random Forest
Implementing Random Forest using Python
9. Time Series Modeling
Main Components of Time series
ARIMA
Advantages of Time Series Modeling
Limitations of Time Series Modeling
Applications of Time Series Modeling
Implementing Time Series Modeling using Python
Introduction to Machine Learning
Machine Learning is a field of study and application that involves developing algorithms and
models that allow computers to learn and make predictions or decisions without being
explicitly programmed. It is a subset of Artificial Intelligence (AI) and is based on the idea
that machines can learn from data and improve their performance over time.
Machine Learning techniques are commonly grouped into the following categories:
1. Supervised Learning: In this approach, the model is trained on labeled data, where each
input is paired with a known output. The model learns the mapping from inputs to outputs
and is used for tasks such as classification and regression.
2. Unsupervised Learning: In this approach, the model is trained on unlabeled data and must
discover hidden patterns or structure on its own, as in clustering and dimensionality
reduction.
3. Reinforcement Learning: This approach involves training an agent to interact with an
environment and learn optimal actions through trial and error. The agent receives
feedback in the form of rewards or penalties based on its actions.
4. Deep Learning: Deep Learning is a subfield of Machine Learning that focuses on the
development of neural networks with multiple layers. These networks are capable of
learning complex patterns and representations from data.
Advantages of machine learning:
• Automation: Machine Learning enables automation of tasks that would otherwise require
manual effort and decision-making.
• Handling Complex Data: ML algorithms can handle and extract insights from large and
complex datasets.
• Adaptability: Machine Learning models can adapt and improve their performance over
time as new data becomes available.
Disadvantages of machine learning:
• Overfitting: ML models may become overly specialized in the training data, leading to
poor performance on new, unseen data.
Applications of machine learning:
• Healthcare: Machine Learning is applied in disease diagnosis, drug discovery, personalized
medicine, and medical imaging analysis.
• Autonomous Vehicles: ML algorithms are used in self-driving cars to interpret sensor
data, make driving decisions, and improve safety.
Key topics in machine learning:
• Clustering: Clustering algorithms group similar data points together based on their
characteristics.
• Dimensionality Reduction: These techniques aim to reduce the number of input variables
while preserving important information.
• Ensemble Learning: Ensemble learning combines multiple ML models to make predictions
or decisions. It improves accuracy, reduces overfitting, and includes techniques like
bagging, boosting, and stacking.
Evaluation Metrics
Confusion Matrix:
A confusion matrix summarizes the performance of a classification model by tabulating the
counts of true positives, false positives, true negatives, and false negatives. It is useful for
understanding the types of errors made by the model and assessing its performance across
different classes.
The confusion matrix is organized around the following four counts:
• True Positives (TP): Instances that are actually positive and are correctly predicted as
positive.
• False Positives (FP): Instances that are actually negative but are incorrectly predicted as
positive.
• True Negatives (TN): Instances that are actually negative and are correctly predicted as
negative.
• False Negatives (FN): Instances that are actually positive but are incorrectly predicted as
negative.
Accuracy:
Accuracy is the most basic evaluation metric, representing the proportion of correct
predictions out of the total predictions made. It is calculated as the ratio of the number of
correctly classified instances to the total number of instances. However, accuracy alone may
not provide a complete picture, especially when the classes are imbalanced.
Accuracy = (TP + TN) / (TP + FP + TN + FN)
Precision:
Precision measures the proportion of correctly predicted positive instances out of all
predicted positive instances. It focuses on the accuracy of positive predictions. Precision is
calculated as the ratio of true positives (correctly predicted positives) to the sum of true
positives and false positives (incorrectly predicted positives).
Precision = TP / (TP + FP)
F1 Score:
The F1 score combines precision and recall into a single metric, providing a balanced
evaluation. It is the harmonic mean of precision and recall, calculated as 2 * (precision *
recall) / (precision + recall). The F1 score is useful when the class distribution is imbalanced.
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
Area Under the Receiver Operating Characteristic Curve (AUC-ROC):
The ROC curve is a graphical representation of the model's performance at different
classification thresholds. AUC-ROC represents the area under the ROC curve, which provides
an aggregate measure of the model's discrimination ability. A higher AUC-ROC indicates
better classification performance.
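As a quick illustration of how these metrics can be computed in practice, the sketch below uses scikit-learn's metrics module on a small, made-up set of true labels, hard predictions, and predicted probabilities (the arrays are purely illustrative, not taken from the text):

import numpy as np
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

# Hypothetical ground-truth labels and model outputs, for illustration only
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])                    # hard class predictions
y_prob = np.array([0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1])    # predicted probabilities

print(confusion_matrix(y_true, y_pred))   # rows: actual class, columns: predicted class
print(accuracy_score(y_true, y_pred))     # (TP + TN) / total
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall
print(roc_auc_score(y_true, y_prob))      # area under the ROC curve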
Approach
Theoretical Approach:
1. Define the problem: Clearly articulate the problem you want to solve or the goal you want
to achieve with ML. Determine if it's a classification, regression, clustering, or any other
type of problem.
2. Gather and explore the data: Collect the relevant dataset for your problem domain.
Explore the data to understand its structure, quality, and relationships. Perform
descriptive statistics, visualization, and data pre-processing tasks such as handling missing
values, outliers, and feature scaling.
3. Split the dataset: Divide the dataset into two or three parts, typically training, validation,
and test sets. The training set is used to train the ML algorithm, the validation set helps in
hyperparameter tuning, and the test set is used to evaluate the final model's
performance.
4. Feature engineering: Extract or create meaningful features from the dataset that can
improve the ML model's performance. This might involve feature selection,
dimensionality reduction techniques like Principal Component Analysis (PCA), or creating
new features through transformations or domain knowledge.
5. Select an appropriate algorithm: Based on the problem type, dataset size, complexity,
and other factors, choose a suitable ML algorithm. Consider algorithms such as decision
trees, random forests, support vector machines, neural networks, or ensemble methods
like gradient boosting or stacking.
6. Train the model: Feed the training dataset into the chosen algorithm and let it learn the
underlying patterns and relationships. Adjust the algorithm's hyperparameters (e.g.,
learning rate, regularization strength) to optimize the model's performance. Use the
validation set to fine-tune the hyperparameters through techniques like grid search or
random search.
7. Evaluate the model: Once the model is trained, assess its performance using appropriate
evaluation metrics such as accuracy, precision, recall, F1-score, or mean squared error.
Compare the model's performance on the validation set with different hyperparameter
configurations to choose the best-performing model.
8. Test the model: Finally, evaluate the model's performance on the test set, which provides
an unbiased assessment of its generalization capabilities. Ensure that the model's
performance on the test set is consistent with the validation set.
9. Iterate and improve: If the model's performance is not satisfactory, revisit previous steps.
Explore alternative algorithms, perform more feature engineering, collect additional data,
or refine the existing approach to improve the model's performance. Iterate until you
achieve the desired results.
10. Deploy the model: Once you're satisfied with the model's performance, deploy it in a
production environment to make predictions on new, unseen data. Monitor the model's
performance over time and retrain or update it as needed.
Programming Approach:
When applying a machine learning (ML) algorithm to a dataset in Python, it's important to
follow a systematic approach to ensure accurate results and efficient implementation.
Here's a general outline of steps to follow:
1. Import Required Libraries: Importing the necessary libraries is the first step in any data
analysis or machine learning project. Libraries like NumPy, Pandas, Scikit-learn, and
Matplotlib provide various functions and tools for data manipulation, model training, and
evaluation.
2. Import Required Dataset(s): Importing the dataset involves loading the data into a
suitable data structure like a Pandas DataFrame or a NumPy array. This allows you to
access and manipulate the data for further analysis.
3. Check for any Null values: Checking for null values is essential to ensure the quality and
integrity of the dataset. Missing values can affect the performance of ML algorithms.
Handling missing values can involve either removing the rows or columns containing null
values or filling them with appropriate values, such as the mean or median.
4. Assign Dependent and Independent variables: In supervised learning tasks, the dataset is
typically divided into two components: the independent variables (features) and the
dependent variable (target variable). The independent variables are the inputs or
predictors, while the dependent variable is the variable being predicted by the ML
algorithm.
5. Split the data into Training and Testing Datasets: Splitting the data into training and
testing datasets allows you to evaluate the performance of the ML model on unseen data.
Typically, a certain percentage of the data (e.g., 80%) is used for training, while the
remaining data is used for testing. This split helps assess how well the model generalizes
to new, unseen data.
6. Feature Scaling: Feature scaling is the process of normalizing the numerical features in
the dataset to a similar scale. Scaling is often required when the features have different
magnitudes or units. Common scaling techniques include standardization (mean=0,
standard deviation=1) and normalization (scaling values to a specific range, e.g., [0, 1]).
7. Fit the ML Model: Fitting the ML model involves training the chosen algorithm on the
training dataset. The model learns patterns and relationships between the independent
variables and the target variable to make predictions on new, unseen data. The model's
parameters are adjusted during the training process to minimize the difference between
predicted and actual values.
8. Compute and Visualize Confusion Matrix: The confusion matrix is a performance
evaluation tool for classification problems. It provides a tabular representation of the
model's predictions against the actual labels. It contains four metrics: true positives (TP),
true negatives (TN), false positives (FP), and false negatives (FN). Visualizing the confusion
matrix helps understand the model's performance in terms of correctly and incorrectly
classified instances.
9. Compute the Accuracy, Precision, Recall and F1 Score (Evaluation metrics): Evaluation
metrics provide quantitative measures to assess the model's performance. Common
metrics include accuracy, precision, recall, and F1 score. Accuracy measures the overall
correctness of the model's predictions. Precision measures the model's ability to correctly
identify positive instances. Recall (also called sensitivity or true positive rate) measures
the model's ability to identify all positive instances. The F1 score combines precision and
recall into a single metric that balances both metrics.
Remember that the specific techniques, algorithms, and evaluation metrics may vary
depending on the problem type (classification, regression, clustering) and the characteristics
of the dataset.
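To make these steps concrete, the following skeletal example walks through the workflow with scikit-learn. The file name 'data.csv', the assumption that the last column holds a binary target, the choice of logistic regression, and the 80/20 split are all placeholders chosen for illustration, not details from the original text:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

# 1-3. Load the data and check for missing values
data = pd.read_csv('data.csv')          # hypothetical file name
print(data.isnull().sum())

# 4. Independent variables (all columns except the last) and dependent variable (last column)
x = data.iloc[:, :-1].values
y = data.iloc[:, -1].values

# 5. Train/test split (80/20 is a common, but arbitrary, choice)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# 6. Feature scaling
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

# 7. Fit a model (logistic regression used here only as an example)
model = LogisticRegression()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

# 8-9. Confusion matrix and evaluation metrics
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
print(precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred))
print(f1_score(y_test, y_pred))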
Introduction to Libraries used in Machine Learning
Python
Machine learning is a rapidly evolving field, and there are several powerful libraries and
frameworks available to assist with various aspects of the machine learning workflow.
These libraries provide pre-built functions, algorithms, and tools that make it easier to
develop, train, evaluate, and deploy machine learning models. Here are some of the
most widely used and useful libraries in machine learning:
NumPy (Numerical Python):
NumPy is the fundamental library for numerical computing in Python. It provides fast
N-dimensional array objects together with functions for mathematical operations such as
linear algebra and statistics, and it underpins most of the other libraries listed below.
[Figure: applications and uses of NumPy]
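As a small illustration of the kind of array operations NumPy provides (the numbers below are arbitrary):

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # a 2x3 array
print(a.shape)                          # (2, 3)
print(a.mean(axis=0))                   # column-wise means
print(a @ a.T)                          # matrix multiplication, giving a 2x2 result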
Pandas:
Pandas is a popular library for data manipulation and analysis. It provides data structures
such as DataFrames that make it easy to work with structured data. Pandas offers
functions for data cleaning, transformation, and exploration, making it useful for data
pre-processing tasks in machine learning.
Scikit-learn:
Scikit-learn is a comprehensive library for machine learning in Python. It provides a wide
range of algorithms and tools for classification, regression, clustering, dimensionality
reduction, and model evaluation. Scikit-learn is known for its simplicity and ease of use,
making it an excellent choice for beginners.
Matplotlib:
Matplotlib is a widely used plotting library in Python. It provides a variety of functions to
create high-quality visualizations, including line plots, scatter plots, bar plots,
histograms, and more.
TensorFlow:
TensorFlow is a powerful open-source library for numerical computation and machine
learning developed by Google. It offers a flexible framework for building and deploying
machine learning models, with a focus on deep learning. TensorFlow provides a high-level
API called Keras, which simplifies the process of building neural networks.
Keras:
Keras is a high-level neural networks API written in Python. Initially developed as a user-
friendly interface for building deep learning models on top of TensorFlow, it has since
been integrated into TensorFlow's core library. Keras provides a simple and intuitive
interface for designing and training neural networks.
PyTorch:
PyTorch is another popular open-source machine learning library that focuses on dynamic
computation graphs. It offers a flexible and efficient framework for training deep learning
models. PyTorch provides extensive support for GPU acceleration and is widely used in
the research community.
XGBoost:
XGBoost is an optimized gradient boosting library that excels in handling tabular data and is
widely used for classification and regression tasks. It provides an implementation of the
gradient boosting algorithm, which combines multiple weak models to create a more
accurate ensemble model.
1. Linear Regression
➢ Supervised Learning Model
➢ Mainly used for Regression tasks
➢ Suitable for predicting continuous target variables
➢ Line of Best Fit
Linear Regression is a popular and widely used supervised learning algorithm used for
predicting continuous target variables based on one or more input features. It assumes a
linear relationship between the input variables (features) and the output variable (target).
Assumptions of Linear Regression:
• Homoscedasticity: The residuals (the differences between the actual and predicted
values) should have a constant variance across all levels of the input variables.
Simple Linear Regression:
Simple Linear Regression is the basic form of Linear Regression involving a single input
feature (X) and a single target variable (y). The relationship is represented by the
equation:
y = b0 + b1*X
where b0 is the intercept and b1 is the slope (coefficient) associated with the input feature X.
Model Evaluation:
To assess the performance of a linear regression model, several evaluation metrics are
commonly used:
• Mean Squared Error (MSE): It measures the average squared difference between the
predicted and actual values. A lower MSE indicates better model performance.
• Root Mean Squared Error (RMSE): It is the square root of the MSE and provides the
measure of the average prediction error in the same units as the target variable.
• R-squared (R2) Score: It represents the proportion of the variance in the target variable
that can be explained by the model. It ranges from 0 to 1, with 1 indicating a perfect fit.
• Adjusted R-squared Score: It adjusts the R-squared score by considering the number of
input features and the sample size. It penalizes the addition of irrelevant features.
Limitations of Linear Regression:
• Linearity Assumption: Linear Regression assumes a linear relationship between the input
features and the target variable. If the relationship is non-linear, Linear Regression may
not provide accurate predictions.
• Sensitive to Outliers: Linear Regression is sensitive to outliers, as they can significantly
impact the estimated coefficients and the model's performance.
• Assumptions Violation: If the assumptions of Linear Regression (linearity, independence,
homoscedasticity, normality) are violated, the model's performance may be affected.
• Multicollinearity: Linear Regression assumes independence between input features.
When features are highly correlated (multicollinearity), it can lead to unstable and
unreliable coefficient estimates.
• Limited to Linear Relationships: Linear Regression is not suitable for capturing complex
non-linear relationships between features and the target variable.
Applications of Linear Regression:
• Marketing and Sales: Linear Regression is employed in market research and sales
forecasting. It can assist in understanding the factors influencing consumer behavior,
predicting product demand, optimizing pricing strategies, and measuring the
effectiveness of marketing campaigns.
• Social Sciences: Linear Regression is used in social science research to analyze
relationships between variables. It can help examine factors affecting education
outcomes, assess social and economic disparities, study population trends, and analyze
survey data.
• Real Estate: Linear Regression can be utilized in real estate for property price prediction,
rental price estimation, assessing market trends, and evaluating the impact of location
and property characteristics.
Implementing Linear Regression using Python
Dataset Required:
https://drive.google.com/file/d/1uesxH_CQprom9HqwhspvoesUyHo4KA7h/view?usp=sharing
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
data=pd.read_csv('/content/Salary_Data.csv')
Exploring Dataset:
data.head()
data.shape
data.isnull().sum()
#Checking for any Null values in the imported Datasets
Assigning Dependent and Independent variables:
x=data.iloc[:,:1].values
y=data.iloc[:, 1:2].values
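The train/test split step does not appear in the extracted listing; a minimal sketch is shown below (the 80/20 split and the random_state value are assumptions, not from the original):

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)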
model=LinearRegression()
model.fit(x_train, y_train)
y_pred=model.predict(x_test)
print(y_pred)
print(y_test)
Plot for Testing dataset
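The plot itself appears only as an image in the original; a minimal matplotlib sketch that produces a comparable figure is given below (the colours, title, and axis labels are assumptions):

plt.scatter(x_test, y_test, color='red', label='Actual')    # actual salaries in the test set
plt.plot(x_test, y_pred, color='blue', label='Predicted')   # fitted regression line
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.legend()
plt.show()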
2. Logistic Regression
➢ Supervised Learning Model
➢ Primarily used for binary classification problems
➢ Linear regression + Sigmoid Function
Logistic regression is a statistical model used for binary classification problems. It is an
extension of linear regression that predicts the probability of an input belonging to a specific
class. Unlike linear regression, which predicts continuous values, logistic regression is
designed to handle discrete outcomes.
The fundamental concept behind logistic regression is the logistic function, also known as the
sigmoid function. The logistic function maps any real number to a value between 0 and 1. It
takes the form:
sigmoid(z) = 1 / (1 + exp(-z))
In logistic regression, the model applies this sigmoid function to a linear combination of the
input features to obtain a value between 0 and 1. This value represents the estimated
probability of the input belonging to a particular class.
To learn the parameters of the logistic regression model, it uses a technique called
maximum likelihood estimation. The objective is to find the optimal set of coefficients that
maximizes the likelihood of observing the labeled data. This process involves minimizing a
cost function, often referred to as the cross-entropy loss, which measures the dissimilarity
between the predicted probabilities and the true class labels.
Once the model is trained, it can make predictions on new, unseen data by calculating the
probability of the input belonging to the positive class (class 1) based on the learned
coefficients and feature values. By applying a chosen threshold (commonly 0.5), the model
classifies the input into one of the two classes: positive or negative.
Logistic regression has several advantages. Firstly, it is a relatively simple and interpretable
model. The coefficients can be easily interpreted as the influence of each feature on the
probability of the outcome. This makes logistic regression useful for understanding the
relationship between predictors and the response variable.
Additionally, logistic regression is computationally efficient and can handle large datasets. It
requires fewer computational resources compared to more complex models like neural
networks. Moreover, logistic regression provides probabilistic outputs, allowing for a better
understanding of the uncertainty associated with each prediction.
However, logistic regression also has some limitations. It assumes a linear relationship
between the input features and the log-odds of the outcome. If the relationship is non-linear,
logistic regression may not capture it effectively. In such cases, feature engineering or more
advanced techniques may be necessary.
Logistic regression is also sensitive to outliers. Outliers can disproportionately affect the
estimated coefficients, leading to biased predictions. Thus, it is important to preprocess the
data and handle outliers appropriately.
Logistic regression finds application in various domains. It is commonly used in areas such as
spam detection, fraud detection, disease diagnosis, sentiment analysis, and churn prediction.
Its simplicity, interpretability, and efficiency make it a popular choice when transparency and
explainability are important.
Advantages of Logistic Regression:
• Simplicity: Logistic regression is a relatively simple and interpretable model. It is easy to
understand and implement, making it a good choice when transparency and explainability
are important.
• Efficiency: Logistic regression can be trained efficiently even on large datasets. It has low
computational requirements, making it computationally inexpensive compared to more
complex models.
Applications of Logistic Regression:
• Fraud Detection: Logistic regression is used in fraud detection systems. By analyzing
features such as transaction amounts, locations, and user behavior patterns, the model can
predict the likelihood of a transaction being fraudulent.
• Disease Diagnosis: Logistic regression is employed in medical research and healthcare to
predict the presence of certain diseases or conditions. By considering patient
characteristics, symptoms, and diagnostic test results, the model can assist in diagnosing
diseases such as cancer, diabetes, or heart disease.
• Sentiment Analysis: Logistic regression is used in sentiment analysis to determine the
sentiment or opinion expressed in textual data. It can classify text as positive or negative
based on the presence of certain words, sentiment indicators, or linguistic patterns. This
application is valuable in social media monitoring, brand reputation management, and
customer feedback analysis.
• Market Segmentation: Logistic regression is used in market research and customer
segmentation to divide a population into distinct groups based on their characteristics,
preferences, or behaviors. By analyzing demographic data, purchasing patterns, or survey
responses, the model can identify segments with similar traits for targeted marketing
strategies.
Implementing Logistic Regression using Python
Dataset Required
https://drive.google.com/file/d/1V6yFU3nDdx9R56yOzy6GxHPq-Dav2A4K/view?usp=share_link
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import seaborn as sn
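The line that loads the dataset is not visible in the extracted listing; assuming the downloaded file is read as a CSV (the path below is a placeholder):

data = pd.read_csv('/content/dataset.csv')  # placeholder path; point this at the downloaded file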
data.isnull().sum()
Assigning dependent and independent variables
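The variable assignment, train/test split, and model creation are missing from the extracted listing; a sketch under the assumption that the last column of the dataset is the target variable:

x = data.iloc[:, :-1].values   # independent variables (assumed: all but the last column)
y = data.iloc[:, -1].values    # dependent variable (assumed: the last column)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)  # split ratio assumed
model = LogisticRegression()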
model.fit(x_train, y_train)
y_pred=model.predict(x_test)
Evaluation Metrics
conf_mat=metrics.confusion_matrix(y_test, y_pred)
print('Confusion Matrix : ', conf_mat)
Accuracy_score=metrics.accuracy_score(y_test,y_pred)
print('Accuracy Score : ', Accuracy_score)
print('Accuracy in Percentage : ', int(Accuracy_score*100),'%')
3. Decision Tree Classifier
➢ Supervised Learning Model
➢ Tree structure Model
A decision tree classifier is a supervised machine learning algorithm that uses a tree-like
structure to make predictions or classify input data. It recursively partitions the input space
based on the features to create a tree of decision nodes and leaf nodes. Each decision node
represents a feature and a threshold, while each leaf node represents a class label or a
prediction.
The decision tree classifier operates by recursively splitting the input data based on the
values of different features. It partitions the data into subsets at each decision node based
on the selected feature and its threshold value. This process continues until the algorithm
reaches a stopping criterion, such as reaching a maximum depth, a minimum number of
samples, or when all samples in a subset belong to the same class.
The splitting process aims to maximize the homogeneity or purity of the subsets. Various
splitting criteria can be used, with the most common being Gini impurity and entropy. Gini
impurity measures the probability of misclassifying a randomly chosen sample if it were
labeled randomly according to the distribution of classes in the subset. Entropy, on the
other hand, measures the level of impurity or randomness in the subset.
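For a node whose samples fall into classes with proportions p1, p2, ..., pk, the two criteria take the following standard forms (written here for reference; they are not reproduced from the original text):

Gini impurity = 1 - (p1^2 + p2^2 + ... + pk^2)
Entropy = - [ p1*log2(p1) + p2*log2(p2) + ... + pk*log2(pk) ]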
Advantages of Decision Tree Classifier:
• Handling Missing Values: Decision trees can handle missing values in the dataset. They
can evaluate the available features and select the optimal split based on the available
data.
Limitations of Decision Tree Classifier:
• Overfitting: Decision trees have a tendency to overfit the training data, especially when
the tree becomes deep and complex. Overfitting occurs when the tree captures noise or
irrelevant patterns in the training data, leading to poor generalization on unseen data.
• Instability: Decision trees can be sensitive to small changes in the training data, leading
to different tree structures and predictions. This instability can be reduced by using
ensemble methods like random forests or boosting.
Applications of Decision Tree Classifier:
• Disease Diagnosis: Decision trees are employed in medical diagnosis to predict the
presence of certain diseases or conditions. By considering symptoms, patient
characteristics, and medical test results, decision trees can assist in diagnosing diseases
and recommending appropriate treatments.
• Fraud Detection: Decision trees are used in fraud detection systems to identify fraudulent
transactions or activities. By analyzing transaction patterns, user behavior, and other
relevant features, decision trees can flag suspicious activities for further investigation.
Implementing Decision Tree Classifier using Python
Dataset Required
https://drive.google.com/file/d/1V6yFU3nDdx9R56yOzy6GxHPq-Dav2A4K/view?usp=share_link
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
import seaborn as sn
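The line that loads the dataset is not visible in the extracted listing; assuming the downloaded file is read as a CSV (the path below is a placeholder):

data = pd.read_csv('/content/dataset.csv')  # placeholder path; point this at the downloaded file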
Checking for any null values in dataset
data.isnull().sum()
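The variable assignment and train/test split are not visible in the extracted listing; a sketch assuming the last column is the target variable and an 80/20 split:

x = data.iloc[:, :-1].values
y = data.iloc[:, -1].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)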
model= DecisionTreeClassifier(criterion='entropy',random_state=5)
model.fit(x_train, y_train)
y_pred=model.predict(x_test)
print('y_pred: ', y_pred)
Evaluation Metrics
conf_mat=metrics.confusion_matrix(y_test, y_pred)
print('Confusion Matrix : ', conf_mat)
Accuracy_score=metrics.accuracy_score(y_test, y_pred)
print('Accuracy Score : ', Accuracy_score)
print('Accuracy in Percentage : ', int(Accuracy_score*100),'%')
4. Support Vector Machine
➢ Supervised Machine Learning Model
➢ Used for both Classification and Regression
➢ Hyperplane
➢ Support Vectors
Support Vector Machine (SVM) is a supervised machine learning algorithm used for both
classification and regression tasks. It is a powerful and versatile algorithm that aims to find
an optimal hyperplane or decision boundary in a high-dimensional feature space to separate
different classes or predict numerical values.
The fundamental idea behind SVM is to find the hyperplane that maximally separates the
data points of different classes. The hyperplane is selected such that the distance between
the hyperplane and the closest data points from each class, known as support vectors, is
maximized. This distance is called the margin. The support vectors and the hyperplane are
the key components of SVM. The support vectors are the crucial data points that influence
the construction of the hyperplane, which in turn determines the separation between
different classes and enables accurate classification or regression.
Support Vectors: In Support Vector Machine (SVM), support vectors are the data points that
lie closest to the decision boundary, known as the hyperplane. These support vectors play a
crucial role in defining the decision boundary and determining the optimal hyperplane that
maximizes the margin.
The support vectors are the subset of training data points that have the most influence on
the construction of the hyperplane. They are the points that are located on or near the
margin, as well as the points that are misclassified. These data points are crucial because they
define the separation between different classes and contribute to the calculation of the
margin.
The choice of support vectors is determined during the training process of the SVM
algorithm. The algorithm selects the support vectors based on their distance from the
decision boundary. Only the support vectors are necessary to define the hyperplane and
make predictions, rather than using all the training data points. This property of SVM makes
it memory-efficient and computationally efficient.
Hyperplane: In SVM, the hyperplane is a decision boundary that separates different classes
in the feature space. For binary classification tasks, the hyperplane is a (d-1)-dimensional
subspace in a d-dimensional feature space.
Mathematically, it can be represented as:
w^T x + b = 0
where w is the weight vector perpendicular to the hyperplane, x is the input feature vector,
and b is the bias term. The weight vector w determines the orientation of the hyperplane,
while the bias term b shifts the hyperplane.
The objective of SVM is to find the optimal hyperplane that maximizes the margin, which is
the distance between the hyperplane and the nearest data points from each class, i.e., the
support vectors. The hyperplane that achieves the maximum margin is considered the best
decision boundary, as it provides better generalization to unseen data.
In cases where the data is not linearly separable, SVM uses the kernel trick to transform the
feature space into a higher-dimensional space. In this higher-dimensional space, a
hyperplane is sought to separate the transformed data. The kernel function computes the
inner products of the transformed feature vectors without explicitly calculating the
transformation. This allows SVM to capture complex non-linear decision boundaries.
For linearly separable data, SVM finds the hyperplane that achieves the maximum margin.
However, when the data is not linearly separable, SVM uses a technique called the kernel
trick to transform the original feature space into a higher-dimensional space, where the
classes can be separated by a hyperplane.
The kernel trick allows SVM to implicitly map the data into a higher-dimensional space
without explicitly calculating the transformed feature vectors. This is computationally
efficient and enables SVM to capture complex non-linear relationships between features.
Advantages of Support Vector Machine:
• Regularization: SVM includes a regularization parameter (C) that controls the trade-off
between maximizing the margin and minimizing the classification errors. This parameter
helps prevent overfitting and allows the model to generalize well to unseen data.
Limitations of Support Vector Machine:
• Computationally intensive: SVM can be computationally expensive, especially when
dealing with large datasets. Training time and memory requirements can increase
significantly as the number of samples and features grows.
• Parameter selection: SVM has several parameters, including the choice of kernel
function, regularization parameter (C), and kernel-specific parameters. Selecting
appropriate values for these parameters can be challenging and often requires careful
tuning.
Applications of Support Vector Machine:
• Bioinformatics: SVM is used in protein structure prediction, gene expression analysis, and
disease classification based on genomic data.
• Financial analysis: SVM can be applied to credit scoring, stock market prediction, fraud
detection, and anomaly detection in financial data.
• Medical diagnosis: SVM has been employed in medical diagnosis, including cancer
classification, disease prognosis, and identification of genetic markers.
• Remote sensing: SVM is used in satellite image analysis, land cover classification, and
pattern recognition in remote sensing applications.
Implementing SVM using Python
Dataset Required
https://drive.google.com/file/d/1V6yFU3nDdx9R56yOzy6GxHPq-Dav2A4K/view?usp=share_link
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn import metrics
import seaborn as sn
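The line that loads the dataset is not visible in the extracted listing; assuming the downloaded file is read as a CSV (the path below is a placeholder):

data = pd.read_csv('/content/dataset.csv')  # placeholder path; point this at the downloaded file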
data.isnull().sum()
Assigning dependent and independent variables
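The assignment and split lines are missing from the extracted listing; a sketch assuming the last column is the target variable and an 80/20 split:

x = data.iloc[:, :-1].values
y = data.iloc[:, -1].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)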
Standardization (Z-score normalization) scales the features of a dataset so that they have
zero mean and unit variance. This transformation centers the data around the mean and
scales it by the standard deviation. It does not enforce a specific range for the transformed
values. Normalization, on the other hand, scales the features to a specific range, often
between 0 and 1 or -1 and 1. It is achieved by dividing each value by the maximum value
in the feature range or by applying other normalization techniques.
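The scaling code itself is not shown in the extracted listing; a minimal sketch using scikit-learn's StandardScaler, consistent with the explanation above:

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)  # fit on the training data only
x_test = sc.transform(x_test)        # apply the same transformation to the test data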
model= SVC(kernel='rbf',random_state=0)
model.fit(x_train, y_train)
svc_prediction=model.predict(x_test)
print('svc_prediction: ', svc_prediction)
conf_mat=metrics.confusion_matrix(y_test, svc_prediction)
print('SVC [ kernel - rbf ]')
print('Confusion Matrix : \n', conf_mat)
Accuracy_score=metrics.accuracy_score(y_test, svc_prediction)
print('Accuracy Score : ', Accuracy_score)
print('Accuracy in Percentage : ', int(Accuracy_score*100),'%')
print(classification_report(svc_prediction,y_test))
conf_mat=pd.crosstab(y_test, svc_prediction, rownames=['Actual'], colnames=['Predicted'])
sn.heatmap(conf_mat, annot=True).set(title='SVC [rbf]')
model= SVC(kernel='linear',random_state=0)
model.fit(x_train, y_train)
svc_prediction=model.predict(x_test)
print('svc_prediction: ', svc_prediction)
Evaluation Metrics for 'Linear' kernel
conf_mat=metrics.confusion_matrix(y_test, svc_prediction)
print('SVC [ kernel - linear ]')
print('Confusion Matrix : \n', conf_mat)
Accuracy_score=metrics.accuracy_score(y_test, svc_prediction)
print('Accuracy Score : ', Accuracy_score)
print('Accuracy in Percentage : ', int(Accuracy_score*100),'%')
print(classification_report(svc_prediction,y_test))
5. K Nearest Neighbors
➢ Used for both classification and regression
➢ Distance based model
K-nearest neighbors (KNN) is a non-parametric machine learning algorithm used for both
classification and regression tasks. It is a simple yet powerful algorithm that makes
predictions based on the similarity of input data to its neighboring data points.
Theory: The KNN algorithm works based on the principle that similar data points tend to
share the same class or have similar output values. The algorithm stores the entire training
dataset in memory and uses it during the prediction phase. When a new data point is
provided, the algorithm calculates the distance between that point and all other data points
in the training set. The distance metric used is typically Euclidean distance, although other
distance metrics can be employed.
KNN algorithm:
1. Determine the number of neighbors (K) to consider, usually specified by the user.
2. Calculate the distance between the new data point and all other data points in the training
set.
3. Select the K data points with the shortest distances (i.e., the K nearest neighbors).
4. For classification tasks, assign the class label that is most frequent among the K nearest
neighbors to the new data point. For regression tasks, compute the average or weighted
average of the output values of the K nearest neighbors.
5. Output the predicted class label or regression value for the new data point.
Advantages of KNN:
• Simplicity: KNN is easy to understand and implement. It does not require any assumptions
about the underlying data distribution or model structure.
• Versatility: KNN can be applied to both classification and regression problems. It can
handle both numerical and categorical data.
• Adaptability: KNN is a lazy learner, meaning it does not perform a training phase. This
makes it suitable for dynamic or changing environments, as the model can be updated
with new data points easily.
• Non-linearity: KNN can capture complex, non-linear relationships between the input
features and the target variable.
Limitations of KNN:
• Computational Complexity: As the algorithm compares the new data point with all
training data points, the computational cost can be high, especially for large datasets.
• Sensitivity to Feature Scaling: KNN calculates distances based on feature values. If the
features have different scales, features with larger values can dominate the distance
calculation, leading to biased results. It is essential to normalize or standardize the
features before applying KNN.
• Determining the Optimal K: Choosing the optimal number of neighbors (K) is subjective
and depends on the dataset and problem at hand. A small K may result in overfitting, while
a large K may introduce more noise and dilute the local patterns.
Applications of KNN:
• Recommender Systems: KNN can be used to build recommendation engines by finding
similar users or items based on their features or preferences.
• Image Recognition: KNN can be applied to image classification tasks by comparing the
pixel values or feature vectors of images to identify similar objects or patterns.
• Anomaly Detection: KNN can detect outliers or anomalies in data by identifying data
points that are significantly different from their neighbors.
• Text Classification: KNN can classify text documents based on their word frequencies or
vector representations by measuring the similarity between documents.
• Healthcare: KNN can assist in medical diagnosis by finding similar patient cases or medical
images for comparison and decision support.
Implementing KNN using Python
Dataset Required
https://drive.google.com/file/d/1865t5MZPQn53A5bSYMN86D0BmLSBxan-/view?usp=share_link
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn import metrics
import seaborn as sn
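The line that loads the dataset is not visible in the extracted listing; assuming the downloaded file is read as a CSV (the path below is a placeholder):

data = pd.read_csv('/content/dataset.csv')  # placeholder path; point this at the downloaded file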
data.isnull().sum()
Assigning dependent and independent variables
x=data.iloc[:, :-1].values
y=data.iloc[:, -1].values
Preprocessing Data with StandardScaler
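The split and scaling code is missing from the extracted listing; a sketch consistent with the heading above (the split ratio and random_state are assumptions):

from sklearn.preprocessing import StandardScaler
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)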
results=[]
for i in [1, 2, 3, 4, 5]:
    model = KNeighborsClassifier(n_neighbors=i, metric='minkowski', p=2)
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    Accuracy_score = metrics.accuracy_score(y_test, y_pred)
    results.append(Accuracy_score)
Evaluation Metrics
models = pd.DataFrame({
    'n_neighbors': ['1', '2', '3', '4', '5'],
    'Accuracy Score': [results[0], results[1], results[2], results[3], results[4]]})
models = models.sort_values(by='Accuracy Score')
print(models.to_string(index=False))
6. K Means Clustering
➢ Unsupervised Learning
➢ Used for clustering, not for classification or regression
➢ K - Number of Clusters
K-means clustering is an unsupervised machine learning algorithm used to partition a dataset
into K distinct clusters. The goal is to group similar data points together and ensure that data
points within the same cluster are more similar to each other than to those in other clusters.
The algorithm accomplishes this by iteratively assigning data points to the nearest cluster
centroid and updating the centroids based on the assigned points.
K-means clustering algorithm:
1. Initialization: Choose the number of clusters K and randomly select K initial centroids
from the data points.
2. Assignment: Assign each data point to the nearest centroid based on a distance metric,
typically Euclidean distance. Each data point belongs to the cluster with the closest
centroid.
3. Update: Recalculate the centroids of each cluster by taking the mean of all the data points
assigned to that cluster.
4. Repeat steps 2 and 3 until convergence: Iterate steps 2 and 3 until the cluster
assignments no longer change significantly or a maximum number of iterations is reached.
5. Finalization: Once the algorithm converges, the final centroids represent the cluster
centers, and each data point is assigned to a specific cluster.
Advantages of K-means clustering:
• Versatility: K-means can be applied to various types of data and is not restricted to any
specific data distribution. It is effective in finding clusters of different shapes and sizes.
• Interpretable results: The cluster centroids obtained from K-means are interpretable and
can provide insights into the structure and patterns present in the data.
Limitations of K-means clustering:
• Sensitive to outliers: K-means is sensitive to outliers as they can significantly impact the
centroid calculation. Outliers may be assigned to inappropriate clusters or form their own
clusters, affecting the overall clustering quality.
• Limited to linear boundaries: K-means assumes that clusters are isotropic, spherical, and
have equal variance. It struggles to handle non-linear cluster boundaries and clusters of
different shapes and sizes. Other clustering algorithms like DBSCAN or hierarchical
clustering may be more appropriate for such scenarios.
Applications of K-means clustering:
• Social Network Analysis: K-means clustering can be used in social network analysis to
identify communities or groups of individuals with similar characteristics or
interaction patterns. It helps in understanding the structure and dynamics of social
networks and can be applied in various domains such as marketing, sociology, and
online social platforms.
Implementing K-Means Clustering using Python
Dataset Required
https://drive.google.com/file/d/1s-aZWEqNCPTBWCK6qACmbqH6lMDt6Hyp/view?usp=sharing
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np
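The line that loads the dataset is not visible in the extracted listing; assuming the downloaded file is read as a CSV (the path below is a placeholder):

data = pd.read_csv('/content/dataset.csv')  # placeholder path; point this at the downloaded file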
x=data.iloc[:,[2,4]].values
Number of Clusters via Elbow Method
wcss=[]
for i in range(1, 11):
    model = KMeans(n_clusters=i, init='k-means++', random_state=21)
    model.fit(x)
    wcss.append(model.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('WCSS via Elbow method')
plt.xlabel('Number of clusters:')
plt.ylabel('WCSS Value')
plt.show()
model = KMeans(n_clusters=4,init='k-means++',random_state=42)
y_means=model.fit_predict(x)
print("y_means:\n\n",y_means)
Scattering the Clusters
plt.scatter(x[y_means==0,0], x[y_means==0,1], s=100, c='magenta', label='cluster1')
plt.scatter(x[y_means==1,0], x[y_means==1,1], s=100, c='blue', label='cluster2')
plt.scatter(x[y_means==2,0], x[y_means==2,1], s=100, c='orange', label='cluster3')
plt.scatter(x[y_means==3,0], x[y_means==3,1], s=100, c='cyan', label='cluster4')
plt.scatter(model.cluster_centers_[:,0], model.cluster_centers_[:,1], s=200, c='black', label='centroids')
plt.title('cluster of amazon users')
plt.xlabel('age')
plt.ylabel('purchase rating')
plt.legend()
plt.show()
7. Principal Component Analysis
Principal Component Analysis (PCA) is a dimensionality reduction technique commonly used
in machine learning and data analysis. It aims to transform a dataset containing a large
number of correlated variables into a smaller set of uncorrelated variables, known as
principal components. PCA achieves this by identifying the directions, or principal axes,
along which the data varies the most.
Steps involved in PCA:
1. Standardization: The first step is to standardize the data so that each variable has zero
mean and unit variance. This ensures that variables measured on different scales contribute
equally to the analysis.
2. Covariance Matrix Calculation: Once the data is standardized, the next step is to calculate
the covariance matrix. The covariance matrix represents the relationships between variables
and measures how they vary together. It is a square matrix where each element represents
the covariance between two variables.
3. Eigenvector and Eigenvalue Calculation: After computing the covariance matrix, the next
step is to calculate the eigenvectors and eigenvalues of the matrix. Eigenvectors represent
the directions or axes of the data, while eigenvalues represent the amount of variance
explained by each eigenvector. The eigenvectors and eigenvalues are calculated using linear
algebra techniques.
4. Selection of Principal Components: The eigenvectors are ranked by their eigenvalues in
descending order, and the top components (those explaining the most variance) are retained
as the principal components.
5. Projection of Data: The standardized data is projected onto the selected eigenvectors to
obtain the transformed dataset with reduced dimensions.
6. Reconstruction of Data: If required, the projected data can be reconstructed back into the
original feature space. This involves multiplying the projected data by the transposed
eigenvectors and adding back the mean values that were subtracted during standardization.
The reconstructed data is only an approximation of the original data, since the information
carried by the discarded components is lost.
Applications of PCA in Machine Learning:
• Dimensionality Reduction: PCA is primarily used to reduce the number of features in a
dataset while preserving the most important information. By discarding less significant
principal components, it helps overcome the curse of dimensionality and improves
computational efficiency.
• Data Visualization: PCA can be used to visualize high-dimensional data in a lower-
dimensional space. By selecting the first two or three principal components, we can plot
the data and gain insights into patterns, clusters, or outliers.
• Noise Filtering: PCA can separate the signal and noise in data. The first few principal
components often capture the main signal, while the later components capture noise or
less significant variations. Removing the components with lower importance can help
denoise the data.
• Feature Extraction: PCA can be used to extract a reduced set of features from a larger
feature space. These new features, represented by the principal components, can then
be used as inputs to other machine learning algorithms.
65
Implementing PCA using Python
Dataset Required
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
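The original does not show how the data object is loaded; the keys used below (data, feature_names, target_names, target) match scikit-learn's built-in Bunch datasets, so one plausible reconstruction, assumed here, is the breast cancer dataset:

from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()  # assumed dataset; any sklearn Bunch with these keys would work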
data.keys()
Exploring target and feature names
print(data['target_names'])
print(data['feature_names'])
Creating dataframe
df1=pd.DataFrame(data['data'],columns=data['feature_names'])
Standard Scaler and PCA
sc=StandardScaler()
sc.fit(df1)
scaled_data=sc.transform(df1)
principal=PCA(n_components=3)
principal.fit(scaled_data)
x=principal.transform(scaled_data)
print(x.shape)
principal.components_
plt.figure(figsize=(10,10))
plt.title('PCA 2D')
plt.scatter(x[:,0],x[:,1],c=data['target'],cmap='plasma')
plt.xlabel('pc1')
plt.ylabel('pc2')
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(10,10))
axis = fig.add_subplot(111, projection='3d')
axis.scatter(x[:,0], x[:,1], x[:,2], c=data['target'], cmap='plasma')
axis.set_xlabel('PC1', fontsize=10)
axis.set_ylabel('PC2', fontsize=10)
axis.set_zlabel('PC3', fontsize=10)
print(principal.explained_variance_ratio_)
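The printed explained-variance ratios can also be used to choose the number of components automatically. As an optional follow-up that is not part of the original notebook, passing a float to n_components asks scikit-learn to keep just enough components to reach that fraction of the total variance:
# keep the smallest number of components explaining at least 95% of the variance
pca_95 = PCA(n_components=0.95)
x_95 = pca_95.fit_transform(scaled_data)
print(pca_95.n_components_, x_95.shape)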
8. Random Forest
Random Forest is a popular machine learning algorithm that belongs to the ensemble
learning family. It is a combination of multiple decision trees, where each tree contributes
to the final prediction through a voting or averaging mechanism. Random Forest is primarily
used for classification tasks, but it can also be applied to regression problems.
Random Forest builds an ensemble of decision trees by randomly selecting subsets of the
training data and features. The random selection of data is called bootstrap sampling, which
means that each tree is trained on a different subset of the original data created by
sampling
with replacement. This introduces diversity and reduces the risk of overfitting.
At each node of a decision tree, a random subset of features is considered to determine the
best split. This randomness ensures that each tree has its own unique structure and avoids
favoring any specific features. By combining the predictions of all the individual trees,
Random Forest achieves robust and accurate results.
During the prediction phase, each tree in the Random Forest independently classifies the
input data point. In the case of classification, the class that receives the majority of votes
from the trees is selected as the final prediction. For regression, the average of the
predictions from all the trees is taken.
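This voting behaviour can be observed directly on a fitted scikit-learn forest. The snippet below is only an illustration on synthetic data (make_classification has nothing to do with this chapter's dataset): it collects each tree's vote for one sample and compares the hard majority with the forest's own prediction, which scikit-learn actually obtains by averaging the trees' class probabilities rather than counting raw votes, so the two usually, but not always, agree.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
# synthetic data used purely to demonstrate the voting idea
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X, y)
# each fitted tree in the ensemble votes independently for the first sample
votes = np.array([t.predict(X[:1])[0] for t in forest.estimators_])
majority = np.bincount(votes.astype(int)).argmax()
print('tree votes :', votes.astype(int))
print('hard majority vote :', majority)
print('forest prediction  :', forest.predict(X[:1])[0])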
• Non-linearity: Random Forest can capture complex non-linear relationships in the data. By combining multiple decision trees, it can model intricate decision boundaries and interactions between features.
• Bias in Imbalanced Datasets: Random Forest can exhibit bias towards the majority class
in imbalanced datasets. Since each tree is trained independently, the majority class tends
to have a stronger influence on the final predictions. Balancing techniques or specialized
modifications may be required to handle imbalanced data effectively.
• Disease Diagnosis: Random Forest can assist in medical diagnosis by combining multiple
factors such as patient symptoms, test results, and medical history to predict the presence
or absence of a specific disease.
• Image Recognition: Random Forest is used in image recognition tasks, including object
detection, facial recognition, and scene classification. The ensemble of decision trees can
effectively analyze image features and classify them into predefined categories.
• Fraud Detection: Random Forest can identify fraudulent activities by analyzing patterns
and anomalies in financial transactions, online behaviors, or insurance claims.
• Customer Churn Prediction: Random Forest can predict customer churn by considering various customer attributes, purchase history, and engagement metrics to identify customers at risk of leaving a service or product.
Implementing Random Forest using Python
Dataset Required
https://drive.google.com/file/d/1865t5MZPQn53A5bSYMN86D0BmLSBxan-/view?usp=share_link
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import classification_report
import seaborn as sn
import matplotlib.pyplot as plt
from sklearn import tree
# the load step is missing from the extracted notebook; the CSV downloaded from
# the link above is assumed to be read like this ('dataset.csv' is a placeholder)
data = pd.read_csv('dataset.csv')
print(data.shape)
data.head()
x=data.iloc[:,:-1].values
y=data.iloc[:,-1].values
Splitting the dataset into Training and Testing Dataset
x_train,x_test,y_train,y_test = train_test_split(x, y, test_size=0.3, random_state=41)
sc=StandardScaler()
x_train=sc.fit_transform(x_train)
x_test=sc.transform(x_test)
model = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
model.fit(x_train,y_train)
y_pred=model.predict(x_test)
Evaluation Metrics
# cross-tabulate actual versus predicted labels and plot the confusion matrix as a heatmap
conf_mat = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
sn.heatmap(conf_mat, annot=True).set(title='Random Forest Classifier')
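The accuracy_score and classification_report functions imported at the top of this example do not appear in the extracted code; a usage consistent with those imports, reusing the y_test and y_pred defined above, would look like this:
# overall accuracy plus per-class precision, recall and F1 score
print('Accuracy:', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))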
import pydotplus
from IPython.display import Image
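The pydotplus and Image imports suggest that the original notebook rendered one of the forest's decision trees, but that code did not survive extraction. A plausible reconstruction, assuming Graphviz and pydotplus are installed and reusing the fitted model and the tree module imported above, is:
# export the first tree of the forest to DOT format and render it inline
dot_data = tree.export_graphviz(model.estimators_[0], out_file=None, filled=True, rounded=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())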
9. Time Series Modeling
Time series modeling is a statistical technique used to analyze and forecast data that varies
over time. It is particularly suited for datasets where the order and timing of observations
are important. Time series data is commonly encountered in fields such as finance,
economics, weather forecasting, and stock market analysis.
The fundamental concept in time series modeling is that the observations at different time
points are not independent, but rather exhibit temporal dependence. The goal is to capture
and model the underlying patterns, trends, and seasonal variations in the data to make
accurate predictions or forecasts.
The most common approach to time series modeling is the autoregressive integrated moving average (ARIMA) model. ARIMA models combine three components (a short fitting sketch follows this list):
1. Autoregressive (AR) component: It models the relationship between an observation and a number of its own past (lagged) values, so that recent history helps explain the current value.
2. Integrated (I) component: It deals with the process of differencing the time series data to achieve stationarity. Stationarity is a desirable property in time series modeling, where the mean, variance, and autocorrelation structure remain constant over time.
3. Moving average (MA) component: It models the relationship between an observation and a linear combination of past error terms. It captures the residual fluctuations that are not accounted for by the autoregressive component.
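The implementation later in this chapter builds lag features and fits regression models rather than a classical ARIMA, so purely as a point of comparison here is a minimal sketch of fitting an ARIMA(1,1,1) with statsmodels (an extra dependency, not imported elsewhere in this chapter) on the same monthly Alcohol_Sales.csv series used below:
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
# read the monthly series as a pandas Series (the CSV has a single data column)
sales = pd.read_csv('Alcohol_Sales.csv', index_col='DATE', parse_dates=True).squeeze('columns')
sales.index.freq = 'MS'
# order=(p, d, q): one autoregressive lag, one difference, one moving-average term
arima = ARIMA(sales, order=(1, 1, 1)).fit()
print(arima.summary())
print(arima.forecast(steps=12))   # forecast the next 12 months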
• Impact Assessment: Time series models can be used to assess the impact of specific
events or interventions on the data. It helps in evaluating the effectiveness of policies,
marketing campaigns, or other interventions.
Applications of Time Series Modeling:
• Economic Forecasting: Time series models are extensively used in economics to forecast
economic indicators like GDP, inflation rates, stock prices, and interest rates. They help
policymakers, investors, and analysts make informed decisions.
• Demand Forecasting: Time series modeling is valuable in predicting product demand,
allowing companies to optimize inventory management, production planning, and supply
chain operations.
• Energy Load Forecasting: Time series modeling is used to forecast energy demand and
load patterns, helping energy providers optimize energy generation, distribution, and
pricing strategies. Accurate load forecasting is crucial for maintaining grid stability and
meeting customer demand.
• Weather Forecasting: Time series models are employed in weather forecasting to predict
variables such as temperature, precipitation, wind speed, and humidity. These forecasts
are vital for various sectors, including agriculture, transportation, and disaster
management.
• Stock Market Analysis: Time series modeling assists in analyzing stock price movements
and identifying trends and patterns. It helps investors and traders make informed
decisions regarding buying, selling, and portfolio management.
• Sales Forecasting: Time series modeling is utilized in sales forecasting to predict future
sales volumes or revenues. This information aids businesses in production planning,
inventory management, and resource allocation.
• Epidemiology and Disease Surveillance: Time series models are valuable for tracking and predicting disease outbreaks, analyzing epidemic patterns, and estimating the spread of infectious diseases. They play a crucial role in public health decision-making and resource allocation.
• Financial Market Analysis: Time series modeling is employed in analyzing financial market data, such as exchange rates, interest rates, and commodity prices. It assists in identifying trends, seasonality, and volatility patterns, aiding in investment strategies and risk management.
• Quality Control: Time series modeling is used to monitor and control manufacturing
processes, ensuring product quality and identifying deviations from standard
performance. It helps in maintaining consistency and minimizing defects.
• Web Traffic Analysis: Time series modeling helps analyze web traffic patterns, predict
website visitor volumes, and optimize server resources and capacity planning.
Implementing Time Series Modeling using Python
The implementation below does not fit an ARIMA model directly; it creates lagged-sales features and compares a Random Forest regressor with Linear Regression on a monthly alcohol sales series.
Dataset Required
https://drive.google.com/file/d/1s-aZWEqNCPTBWCK6qACmbqH6lMDt6Hyp/view?usp=sharing
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from math import sqrt
df = pd.read_csv('Alcohol_Sales.csv', index_col='DATE', parse_dates=True)
df.index.freq = 'MS'
df.tail()
df.columns = ['S4248SM144NCEN']
df.plot(figsize=(12,8))
# create lag features: sales from one, two and three months back
df['Sale_LastMonth'] = df['S4248SM144NCEN'].shift(+1)
df['Sale_2Monthsback'] = df['S4248SM144NCEN'].shift(+2)
df['Sale_3Monthsback'] = df['S4248SM144NCEN'].shift(+3)
df
df = df.dropna()
df
lin_model = LinearRegression()
model = RandomForestRegressor(n_estimators=100, max_features=3, random_state=1)
x1,x2,x3,y = df['Sale_LastMonth'], df['Sale_2Monthsback'], df['Sale_3Monthsback'], df['S4248SM144NCEN']
x1,x2,x3,y = np.array(x1), np.array(x2), np.array(x3), np.array(y)
x1,x2,x3,y = x1.reshape(-1,1), x2.reshape(-1,1), x3.reshape(-1,1), y.reshape(-1,1)
final_x = np.concatenate((x1,x2,x3),axis = 1)
print(final_x)
# hold out the last 30 months as a chronological test set
X_train,X_test,y_train,y_test = final_x[:-30], final_x[-30:], y[:-30], y[-30:]
model.fit(X_train,y_train)
lin_model.fit(X_train,y_train)
pred = model.predict(X_test)
plt.rcParams["figure.figsize"] = (12,8)
plt.plot(pred,label = 'Random_Forest_Predictions')
plt.plot(y_test,label = 'Actual Sales')
plt.legend(loc='upper left')
plt.show()
lin_pred = lin_model.predict(X_test)
plt.rcParams["figure.figsize"] = (11,6)
plt.plot(lin_pred,label='Linear_Regression_Predictions')
plt.plot(y_test,label = 'Actual Sales')
plt.legend(loc='upper left')
plt.show()
rmse_rf = sqrt(mean_squared_error(y_test, pred))
rmse_lr = sqrt(mean_squared_error(y_test, lin_pred))
print('Root Mean Squared Error for Random Forest Model is:', rmse_rf)
print('Root Mean Squared Error for Linear Regression is:', rmse_lr)
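As an optional sanity check that is not part of the original notebook, the two models can also be compared against a naive "last month" baseline, which simply reuses the Sale_LastMonth column of the test features as the forecast:
# persistence baseline: predict this month's sales as last month's sales
naive_pred = X_test[:, 0]
rmse_naive = sqrt(mean_squared_error(y_test, naive_pred))
print('Root Mean Squared Error for the naive last-month baseline is:', rmse_naive)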