Group
Minor Project
On
WEATHER ANALYSIS AND FORECAST
Acknowledgement
We also wish to sincerely thank ITS Mohan Nagar for collaborating with Shape
My Skills Pvt. Ltd. to provide this exceptional learning opportunity. Special
appreciation goes to our course coordinator, Mr. Neeraj, for his dedicated
efforts and support during the training. Furthermore, we are deeply grateful to
the Head of the Department for his encouragement and for
providing the necessary resources and environment to facilitate this learning
experience.
Thank you all for making this experience valuable and memorable.
Sincerely,
Akshay,
Ayush,
Rajdeep
Certificate
This is to certify that Akshay, Ayush, and Rajdeep have successfully completed the project titled
"WEATHER ANALYSIS AND FORECAST" as part of the Machine Learning and Data
Science summer training program organized by ITS Mohan Nagar in collaboration with
ShapeMySkills Pvt. Ltd.
This project was conducted under the esteemed guidance of Mr. Prateek Gupta, whose
expertise and mentorship were instrumental in its successful completion. The project
exemplifies a thorough understanding of machine learning and data science techniques,
highlighting the skills acquired during the training program.
We commend the team for their dedication, hard work, and enthusiasm throughout the project
duration.
Date: 3 September 2024
Signature:
List of Figures
Figure 17: Heatmap of true and predicted values
List of Tables
Table 4: Forecasting
List of Abbreviations
I/O: Input/Output
3D: Three-Dimensional
INDEX
S No. Title
1 Chapter 1: Introduction to Python and Machine Learning
2 Chapter 2: Introduction to Python Libraries
3 Chapter 3: Introduction to Machine Learning Algorithms
4 Chapter 4: Python and Machine Learning Code
5 Chapter 5: Conclusion and Result
6 Chapter 6: Future Scope
CHAPTER 1
Introduction to Python and Machine Learning
Introduction to Python:
Python is a general-purpose, dynamically typed, high-level, interpreted, garbage-collected
programming language that supports procedural, object-oriented, and functional programming.
Created by Guido van Rossum and first released in 1991, Python emphasizes code
readability and allows programmers to express concepts in fewer lines of code than
languages such as C++ or Java.
o Easy to use and Learn: Python has a simple and easy-to-understand syntax, unlike
traditional languages like C, C++, Java, etc., making it easy for beginners to learn.
o Interpreted Language: Python code is executed by an interpreter rather than being
compiled ahead of time, which allows rapid development and testing.
o Object-Oriented Language: It supports object-oriented programming (inheritance,
encapsulation, polymorphism, and abstraction), making it easy to write reusable and
modular code.
o Extensive Libraries: Python has a rich ecosystem of libraries and frameworks, such as
NumPy, Pandas, and Matplotlib, which simplify tasks like data manipulation and
visualization.
o Data Science: Python is important in this field because it is easy to use and has powerful tools for
data analysis and visualization like NumPy, Pandas, and Matplotlib.
o Machine Learning: Python is widely used for machine learning due to its simplicity,
ease of use, and availability of powerful machine learning libraries.
Introduction to Machine Learning
Machine learning (ML) is a subfield of artificial intelligence (AI) that involves the development
of algorithms and statistical models enabling computers to perform tasks without explicit
instructions. Instead, these systems learn patterns and make decisions based on data. Machine
learning is transforming various industries by automating complex processes, providing insights
from large datasets, and creating new opportunities for innovation.
Machine learning leverages computational methods to improve performance on a given task over
time with experience. This process involves:
1. Supervised Learning: The model is trained on a labeled dataset, meaning that each
training example is paired with an output label. Common algorithms include:
o Linear Regression
o Decision Trees
o Support Vector Machines (SVM)
o Neural Networks
2. Unsupervised Learning: The model is provided with unlabeled data and must find
inherent patterns or groupings. Common algorithms include:
o Clustering (e.g., K-Means, Hierarchical Clustering)
o Association Rules (e.g., Apriori, Eclat)
o Principal Component Analysis (PCA)
3. Reinforcement Learning: The model learns by interacting with an environment,
receiving rewards or penalties based on its actions, and aims to maximize cumulative
rewards. Key concepts include:
o Markov Decision Processes (MDP)
o Q-Learning
o Deep Q-Networks (DQN)
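To make the supervised category concrete, here is a minimal sketch using scikit-learn; the toy features, labels, and thresholds below are invented purely for illustration and are not from the project:

    # Minimal supervised-learning sketch: learn from labelled examples,
    # then predict the class of a new, unseen instance.
    from sklearn.tree import DecisionTreeClassifier

    # Features: [temperature, humidity]; labels: 0 = no rain, 1 = rain (toy data)
    X = [[30, 40], [25, 80], [20, 90], [35, 30], [22, 85], [33, 35]]
    y = [0, 1, 1, 0, 1, 0]

    model = DecisionTreeClassifier().fit(X, y)   # train on the labelled dataset
    print(model.predict([[24, 88]]))             # predict the class of a new instance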
CHAPTER 2
Libraries of Python
Some of the most popular libraries:
1. NumPy
NumPy, short for Numerical Python, is a fundamental package for scientific computing
with Python. It provides support for large, multidimensional arrays and matrices, along
with a rich set of mathematical functions for operating on these data structures efficiently.
Features of NumPy:
● NumPy provides a highly efficient multi-dimensional array object, numpy.ndarray,
which can store and process large datasets quickly.
● NumPy includes a large number of numerical computing tools and methods
that can operate on these arrays.
● NumPy has routines for manipulating arrays such as reshaping (reshape()),
stacking (stack(), hstack(), vstack()), splitting (split()), indexing (indexing and
slicing), and sorting.
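A short illustrative sketch of these routines (the array values are chosen arbitrarily):

    import numpy as np

    a = np.arange(12)                # a one-dimensional ndarray: 0..11
    m = a.reshape(3, 4)              # reshape into a 3x4 matrix
    stacked = np.vstack([m, m])      # stack two copies vertically -> 6x4
    halves = np.split(a, 2)          # split into two arrays of 6 elements each
    print(m[1, 2])                   # indexing: row 1, column 2 -> 6
    print(np.sort(a)[::-1][:3])      # sorting, then the three largest values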
Advantages of NumPy
1. Performance
NumPy's vectorized operations run in optimized compiled code, making them far faster
than equivalent pure-Python loops.
2. Pandas
Pandas is a flexible and advanced Python toolkit for data manipulation and analysis. It
includes high-level data structures like Data Frame and Series, which make it easier to
work with structured data.
DataFrame: A two-dimensional labelled data structure whose columns can hold values of
different types or categories. It looks like a spreadsheet or a SQL table and is ideal for
working with tabular data.
Series: A one-dimensional labelled array that can hold data of any type (integer, float,
text, etc.). A Series functions like a single column of a DataFrame or a named array.
Pandas can read and write data in a variety of file formats, including CSV, Excel, JSON, and SQL databases.
Pandas includes sophisticated indexing techniques (loc and iloc) for picking subsets
of data. This includes selecting rows and columns based on labels (loc) or integer
positions (iloc), giving users easy access to data.
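A brief sketch of DataFrame, Series, and loc/iloc selection; the column names and index labels here are made up for illustration:

    import pandas as pd

    df = pd.DataFrame({"city": ["Delhi", "Pune", "Agra"],
                       "temp": [31.5, 27.0, 33.2]},
                      index=["a", "b", "c"])

    print(df.loc["b", "temp"])     # label-based selection -> 27.0
    print(df.iloc[0:2, 1])         # position-based: first two rows, second column
    series = df["temp"]            # a Series: one labelled column of the DataFrame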
Advantages of Pandas
1. Data Structures
DataFrames: Efficiently handle tabular data with labeled axes (rows and columns).
Series: Simplify manipulation of one-dimensional labeled arrays.
2. Data Cleaning
Handling Missing Data: Functions for detecting, filling, and removing missing values.
Data Transformation: Easy methods for merging, reshaping, and transforming
datasets.
3. Data Selection
Indexing and Slicing: Powerful, flexible, and intuitive data selection capabilities.
Label-based and Position-based Indexing: Access data using labels or positions.
3. Matplotlib
Matplotlib is a robust Python package that allows you to create static, animated, and
interactive visualizations. It is commonly used for data visualization tasks and offers a
versatile framework for creating plots and figures in a variety of formats.
Advantages of Matplotlib
1. Versatility
Wide Range of Plots: Supports various types of plots such as line, bar, scatter,
histogram, pie, and more.
Customizable: Highly customizable plots, allowing for detailed adjustments to suit
specific needs.
2. Integration
Seamless with NumPy and Pandas: Easily integrates with NumPy and Pandas for
plotting data from arrays and DataFrames.
Compatible with Other Libraries: Works well with other libraries like SciPy and
scikit-learn for enhanced functionality.
3. Publication Quality
Figures can be exported in many formats (e.g., PNG, PDF, SVG) at publication-ready quality.
4. Seaborn
Seaborn is a statistical data visualization library built on top of Matplotlib, providing a
high-level interface for drawing attractive statistical graphics.
Advantages of Seaborn
1. High-Level Interface
2. Statistical Visualization
Built-in Support: Directly integrates with Pandas DataFrames and handles statistical
aggregations and visualizations effortlessly.
Advanced Plotting Functions: Includes specialized plots like categorical plots,
distribution plots, and regression plots.
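As a small sketch of how Seaborn and Matplotlib work together (the data below is synthetic, invented for this example):

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Synthetic temperature readings, for illustration only
    df = pd.DataFrame({"temp": np.random.normal(30, 5, 200)})

    sns.histplot(data=df, x="temp")           # Seaborn distribution plot
    plt.title("Distribution of temperature")  # Matplotlib call customizing the same figure
    plt.show()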
Chapter 3
Introduction to Machine Learning Algorithms
Introduction to Categorical Variables
A categorical variable, also known as a discrete variable, is a type of variable that can take on
one of a limited, fixed number of possible values, representing distinct categories or classes.
Examples include:
Binary categories: Yes/No, True/False, 0/1
Multiclass categories: Red/Green/Blue, Dog/Cat/Horse, Low/Medium/High
Goal:
The primary objective of categorical machine learning algorithms is to predict the category or
class of new instances based on learned patterns from a labelled dataset. This process is
known as classification.
Non-ordinal vs. ordinal: Some categorical data is non-ordinal, or nominal (e.g., fruit types),
while other data is ordinal (e.g., rating scales such as "low," "medium," and "high").
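A small sketch of how these two kinds of categorical data are typically encoded in pandas; the column names and category values are invented for illustration:

    import pandas as pd

    df = pd.DataFrame({"fruit": ["apple", "mango", "apple"],    # non-ordinal (nominal)
                       "rating": ["low", "high", "medium"]})    # ordinal

    onehot = pd.get_dummies(df["fruit"])          # nominal: one-hot encoding, no order implied
    order = {"low": 0, "medium": 1, "high": 2}
    df["rating_code"] = df["rating"].map(order)   # ordinal: explicit order preserved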
Types of Algorithms
Categorical algorithms are built specifically for processing and analysing categorical data,
which is made up of variables that indicate discrete categories or groups. Several types of
categorical algorithms have been created to handle the specific issues caused by categorical
data, helping machine learning models perform accurately and efficiently.
Logistic Regression
Logistic regression is a linear classification algorithm that models the probability of each
class with the logistic (sigmoid) function.
Decision Tree
A decision tree classifies data by recursively splitting it on feature values, forming a tree
of decision rules.
Random Forest
● Random forest is an ensemble learning method that combines several decision trees
to improve prediction accuracy and robustness.
● During training, a large number of decision trees are constructed, and their outputs
are combined to produce a more accurate and reliable prediction.
● During tree creation, a random subset of characteristics is selected for splitting
at each node, increasing tree variety.
● Averaging the outcomes of numerous trees reduces the risk of overfitting,
which is common with individual decision trees.
● Generally, achieves good accuracy and robust performance on a variety of
datasets.
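A minimal random forest sketch with scikit-learn; synthetic data stands in for the project's dataset:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic classification data, for illustration only
    X, y = make_classification(n_samples=200, n_features=6, random_state=42)

    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X, y)              # each tree trains on a bootstrap sample with random feature subsets
    print(rf.predict(X[:3]))  # prediction by majority vote across the trees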
K-Nearest Neighbor
K-nearest neighbor (KNN) classifies a sample according to the majority class among its
k closest training examples.
Chapter 4
Python and Machine Learning Code
Libraries
Pandas: provides data structures like DataFrames and Series to handle and analyse data
efficiently. NumPy: a library for numerical computing in Python. Seaborn: a statistical
data visualization library. Matplotlib: a plotting library for graphs, histograms,
scatter plots, and customized visualizations.
Dataset Read
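The dataset is loaded with pandas; a sketch of the typical pattern (the file name "weather.csv" is a placeholder, not the project's actual path):

    import pandas as pd

    data = pd.read_csv("weather.csv")   # placeholder file name
    print(data.head())                  # first five rows
    data.info()                         # column types and missing-value counts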
Insights:
May 1921 has been the hottest month in India in recorded history. What could be the reason?
Dec, Jan and Feb are the coldest months. One could group them together as "Winter".
Apr, May, Jun, Jul and Aug are the hottest months. One could group them together as
"Summer". But this is not how the seasons work: India has four main seasons, and this is
how we group the months:
Winter: December, January and February.
Summer (also called the "Pre-Monsoon Season"): March, April and May.
Monsoon: June to September.
Autumn (also called the "Post-Monsoon Season"): October and November.
We will stick to these seasons for our analysis.
Figure 2: Boxplot for LAST EVALUATION
Figure 3: Boxplot for NUMBER PROJECTS
Removing outliers using the Interquartile Range (IQR) method is a common technique to clean
data by identifying and excluding extreme values that may distort analysis.
Figure 6: Boxplot for WORK ACCIDENT
RESETTING THE INDEX AFTER REMOVING OUTLIERS
Outliers are data points that deviate significantly from the rest of the dataset. They can arise
due to measurement errors, data entry errors, or inherent variability in the data. Outliers can
skew the analysis results, leading to inaccurate conclusions. Therefore, it is essential to
identify and handle outliers before performing statistical analysis.
Identifying Outliers
One of the most common methods for identifying outliers is the Interquartile Range (IQR)
method. The IQR is a measure of statistical dispersion, which is the range within which the
central 50% of the values lie. It is calculated as the difference between the third quartile (Q3)
and the first quartile (Q1):
1. Calculate Q1 (the 25th percentile): The value below which 25% of the data points
fall.
2. Calculate Q3 (the 75th percentile): The value below which 75% of the data points
fall.
3. Calculate the IQR: The difference between Q3 and Q1.
4. Determine the lower bound: Q1−1.5×IQR
5. Determine the upper bound: Q3+1.5×IQR
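These steps translate directly into pandas; a sketch assuming a hypothetical numeric column named "temp", ending with the index reset discussed above:

    import pandas as pd

    df = pd.DataFrame({"temp": [21, 23, 22, 24, 95, 20, 22]})  # 95 is an obvious outlier

    q1 = df["temp"].quantile(0.25)          # Q1: the 25th percentile
    q3 = df["temp"].quantile(0.75)          # Q3: the 75th percentile
    iqr = q3 - q1                           # interquartile range
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    clean = df[(df["temp"] >= lower) & (df["temp"] <= upper)]
    clean = clean.reset_index(drop=True)    # reset the index after dropping outlier rows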
Dataset Trend Visualization PAIRPLOT
Pair plots are particularly useful in the context of outlier detection and data preprocessing.
They provide a clear visual representation of how each variable interacts with others, making
it easier to spot anomalies that do not follow the general pattern of the data.
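A pair plot takes a single Seaborn call; a sketch with synthetic columns standing in for the weather features:

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["temp", "humidity", "wind"])

    sns.pairplot(df)   # scatter plot for every pair of columns, histograms on the diagonal
    plt.show()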
CORRELATION HEATMAP
Figure 9: Correlation heatmap
The image shows a correlation heatmap generated using the sns.heatmap() function from the
Seaborn library. Correlation heatmaps are useful for visualizing the strength and direction of
relationships between pairs of variables in a dataset.
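A sketch of the call, again with synthetic stand-in data:

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["temp", "humidity", "wind"])

    sns.heatmap(df.corr(), annot=True, cmap="coolwarm")  # pairwise correlation matrix
    plt.show()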
SCATTER PLOT
Figure 10: Scatter plot
The image shows a scatter plot generated using the sns.scatterplot() function from the
Seaborn library. Scatter plots are useful for visualizing the relationship between two
continuous variables. They help identify trends, patterns, and potential outliers in the data.
REGRESSION PLOT
Figure 11: Regression plot
BAR PLOT
Figure 12: Bar plot
A bar plot (or bar chart) is a graphical display of data using bars of different heights. It is
commonly used to compare quantities across different categories. Here’s an explanation of
how to create and interpret a bar plot.
HISTOGRAM
Figure 13: Histogram
A histogram is a type of bar chart that represents the distribution of a dataset. It is used to
show the frequency (or count) of data points that fall within specified ranges (bins). Here’s a
step-by-step explanation of how to create and interpret a histogram.
LINE PLOT
Figure 14: Line plot
A line plot (or line chart) is a type of chart used to display information as a series of data
points called 'markers' connected by straight line segments. It is commonly used to visualize
trends over time. Here’s a step-by-step explanation of how to create and interpret a line plot.
COUNT PLOT
A count plot (sometimes written "counter plot") is used to visualize the count of observations in
each category of a categorical variable. It is particularly useful for understanding the distribution of
categorical data and comparing the frequencies of different categories.
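The plots in this gallery all follow the same Seaborn pattern; a combined sketch with invented data showing a scatter plot, bar plot, histogram, and count plot side by side:

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(2)
    df = pd.DataFrame({"month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
                       "temp": rng.normal(30, 4, 6),
                       "season": ["Winter", "Winter", "Summer",
                                  "Summer", "Summer", "Monsoon"]})

    fig, axes = plt.subplots(2, 2, figsize=(10, 8))
    sns.scatterplot(data=df, x="month", y="temp", ax=axes[0, 0])  # scatter plot
    sns.barplot(data=df, x="month", y="temp", ax=axes[0, 1])      # bar plot
    sns.histplot(data=df, x="temp", bins=4, ax=axes[1, 0])        # histogram
    sns.countplot(data=df, x="season", ax=axes[1, 1])             # count plot
    plt.tight_layout()
    plt.show()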
Test and Train Splitting
The x and y variables are separated into independent and dependent values using `iloc`.
Columns from index 0 to 8 (excluding 8) are assigned to x, while the last column, which
represents the weather type, is assigned to y as the dependent variable.
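A sketch of this separation; the column positions follow the description above, and "weather.csv" remains a placeholder file name:

    import pandas as pd

    data = pd.read_csv("weather.csv")   # placeholder file name, as before
    x = data.iloc[:, 0:8]               # columns at positions 0..7: independent features
    y = data.iloc[:, -1]                # last column: the weather type (dependent variable)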
LOGISTIC REGRESSION
The x and y variables are further divided into `x_train`, `y_train`, `x_test`, and `y_test`. The
`x_train` and `y_train` subsets are used for training the model, while `x_test` and `y_test` are
used for evaluating the model. The test size is set to 0.4, meaning 40% of the dataset will be
used for testing. The random state is set to 42 to ensure reproducibility by controlling the
selection of training rows.
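Continuing from the x and y separated above, a sketch of the split and the logistic regression fit (max_iter is an added convergence safeguard, not from the original code):

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.4, random_state=42)   # 40% held out, reproducible split

    model = LogisticRegression(max_iter=1000)   # max_iter raised to help convergence
    model.fit(x_train, y_train)                 # train on the training subset
    y_pred = model.predict(x_test)              # evaluate on the held-out subset
    print(accuracy_score(y_test, y_pred))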
ACCURACY AND PRECISION
DECISION TREE
Figure 16: Decision tree plot
The tree is plotted with a maximum depth of 2 using the decision tree model. A tree plot visually
represents the structure of a decision tree, illustrating how decisions are made based on
feature values. It shows the tree's nodes, branches, and leaves, detailing the splits and
outcomes at each node.
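A sketch of producing such a plot with scikit-learn; the Iris dataset stands in for the weather data:

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, plot_tree

    X, y = load_iris(return_X_y=True)                  # stand-in data for illustration
    dt = DecisionTreeClassifier(max_depth=2).fit(X, y)

    plt.figure(figsize=(8, 5))
    plot_tree(dt, filled=True)    # nodes, branches and leaves with their split rules
    plt.show()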
RANDOM FOREST
Parameters like `max_depth`, `bootstrap`, `max_features`, and `criterion` are set to optimize
the accuracy of the dataset. The best parameters are determined using `GridSearchCV`. The
`cv_rf` (a `GridSearchCV` instance) is used to fit the model on `x_train` and `y_train`.
The best parameters returned by the search are used to improve accuracy: 'entropy' for
criterion, 'log2' for max_features, and 5 for max_depth.
The values of `x_test` are used to generate predictions, and these predictions are compared with
`y_test` to calculate the accuracy.
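A sketch of the grid search described above; the parameter grid mirrors the values named in the text, while the training data here is synthetic:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    x_train, y_train = make_classification(n_samples=300, n_features=8, random_state=42)

    params = {"criterion": ["gini", "entropy"],
              "max_features": ["sqrt", "log2"],
              "max_depth": [3, 5, 7],
              "bootstrap": [True]}

    cv_rf = GridSearchCV(RandomForestClassifier(random_state=42), params, cv=5)
    cv_rf.fit(x_train, y_train)   # evaluates every combination with 5-fold cross-validation
    print(cv_rf.best_params_)     # e.g. criterion='entropy', max_features='log2', max_depth=5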
K-Nearest Neighbor
The values of `y_test` and `y_prediction` are compared to determine the KNN model's accuracy.
Naive Bayes
The `GaussianNB` class is imported from `sklearn`, and the model is fitted using `x_train`
and `y_train`.
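A sketch of fitting and scoring both models; the data is synthetic, and n_neighbors=5 is scikit-learn's default, assumed here rather than taken from the project:

    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=300, n_features=8, random_state=42)
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

    knn = KNeighborsClassifier(n_neighbors=5).fit(x_train, y_train)
    nb = GaussianNB().fit(x_train, y_train)

    print(accuracy_score(y_test, knn.predict(x_test)))   # KNN accuracy
    print(accuracy_score(y_test, nb.predict(x_test)))    # Naive Bayes accuracy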
Figure 17: Heatmap of true and predicted values
A heatmap for true and predicted values visualizes the performance of a classification model
by showing how often each combination of actual and predicted classes occurs. It helps in
understanding the distribution of prediction errors and correct classifications.
Using the `GaussianNB` model, the predicted values (`y_pred`) are compared with the
actual values (`y_test`).
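Such a heatmap is usually built from a confusion matrix; a sketch with toy labels standing in for `y_test` and `y_pred`:

    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.metrics import confusion_matrix

    y_test = [0, 1, 1, 0, 1, 0, 1, 1]   # toy true labels
    y_pred = [0, 1, 0, 0, 1, 0, 1, 0]   # toy predicted labels

    cm = confusion_matrix(y_test, y_pred)        # rows: truth, columns: prediction
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
    plt.xlabel("Predicted")
    plt.ylabel("True")
    plt.show()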
Data Comparison
Table 4: All the algorithms and their accuracies in tabular form
Chapter 5: Conclusion and Results
For the weather prediction task, the overall accuracy across all algorithms was
47%, suggesting a need for further model improvement and optimization. These findings can
guide future efforts in both weather prediction and weather analytics, with an emphasis
on enhancing model performance and accuracy.