A Simple Approach To Weather Predictions by Using Naive Bayes Classifiers
Agnieszka Lutecka¹, Zuzanna Radosz¹
¹ Silesian University Of Technology, Faculty of Applied Mathematics, Kaszubska 23, 44-100 Gliwice, Poland
Abstract
This article presents and explains how we used a Naive Bayes Classifier and a database to predict the weather. We compare how the accuracy depends on the choice of probability distribution for various types of data.
Keywords
Naive Bayes Classifier, probability distribution, Python
ISSN 1613-0073
Agnieszka Lutecka et al. CEUR Workshop Proceedings 64–75
The probability density function plot of this distribution is a bell-shaped curve (the so-called bell curve). The graph of the density function is as follows:

3.3. Log Normal Distribution

It is the continuous probability distribution of a positive random variable whose logarithm is normally distributed. Pattern:

f(x) = \frac{1}{\sqrt{2\pi}\,\sigma x} \exp\left(\frac{-(\ln x - \mu)^2}{2\sigma^2}\right) \cdot \mathbf{1}_{(0,\infty)}(x)    (5)

Where 𝜇 is the mean value and 𝜎 is the standard deviation.

Figure 4: Graph of the probability density function. Source: Wikipedia.org

3.5. Uniform distribution

It is a continuous probability distribution for which the probability density in the range from a to b is constant
and not equal to zero, and otherwise equal to zero. We can see it in the formula below:

f(x) = \begin{cases} 0 & \text{for } x < \mu - \sqrt{3}\sigma \\ \frac{1}{2\sqrt{3}\sigma} & \text{for } \mu - \sqrt{3}\sigma \le x \le \mu + \sqrt{3}\sigma \\ 0 & \text{for } x > \mu + \sqrt{3}\sigma \end{cases}    (7)

Where 𝜇 is the mean value and 𝜎 is the standard deviation. The function graph is as follows:

column j, sr[j] - the mean value of column j, and std[j] - the standard deviation of the values from column j. These methods process the input data through the formulas for the probability distributions and return the value of the probability density function at the point sample[j], that is, the probability of sample[j] occurring under the conditions sr[j] and std[j]. Finally, the algorithm returns the name of the weather most likely to occur for the sample input.

Each probability distribution has a differently defined density function; therefore, the distributions may differ in their results. Below we present the pseudocode of the NaiveClassifier class methods, with an emphasis on processing the input data by the probability distributions.
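To make the formulas concrete, the density functions can be sketched in plain Python. This is our own minimal illustration (the function names are ours), not the paper's code:

```python
import math

def normal_pdf(x, mu, sigma):
    # Bell-shaped curve of the normal distribution
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def lognormal_pdf(x, mu, sigma):
    # Formula (5): defined only for positive x, via the indicator 1(0, inf)
    if x <= 0:
        return 0.0
    return math.exp(-((math.log(x) - mu) ** 2) / (2 * sigma ** 2)) / (x * sigma * math.sqrt(2 * math.pi))

def uniform_pdf(x, mu, sigma):
    # Formula (7): uniform density parameterized by mean and standard deviation
    a = mu - math.sqrt(3) * sigma
    b = mu + math.sqrt(3) * sigma
    return 1.0 / (2 * math.sqrt(3) * sigma) if a <= x <= b else 0.0
```

Each function takes the same arguments (the point, the column mean, and the column standard deviation), which is what lets the classifier swap distributions by name.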
In the attached photo we can see how the temperature value changes for a given weather; for example, for "Moderate rain" the maximum temperature ranges from 5 to 20 degrees.

5.2. Database modification

DateTime, SunRise, SunSet, MoonRise and MoonSet will not be used in our project, so we can get rid of them.

data.drop('DateTime', axis=1, inplace=True)
data.drop('SunRise', axis=1, inplace=True)
data.drop('SunSet', axis=1, inplace=True)
data.drop('MoonRise', axis=1, inplace=True)
data.drop('MoonSet', axis=1, inplace=True)

To normalize the data to the range 0-1, we changed int64 to float64.

6. Implementation

6.1. ProcessingData class

Our project consists of two files: the file containing the program code, "Pogoda.ipnb", and the database, "Istanbul Weather Data.csv". After analyzing the data from the database, we moved on to the "ProcessingData" class, in which we created 3 static methods: shuffle, splitSet and normalize.

6.2. Shuffle method

It takes base as input, i.e. our database. We use a for loop to go through it, selecting records and swapping them.
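As a toy illustration of the database modification described above (the frame here is a made-up stand-in for the Istanbul data, not the real file):

```python
import pandas as pd

# Hypothetical mini-frame standing in for "Istanbul Weather Data.csv"
data = pd.DataFrame({"DateTime": ["01.01.2019"], "MaxTemp": [5], "SunRise": ["07:30"]})

# Drop the unused columns, then cast remaining int64 columns to float64
# so that min-max normalization can store fractional values
for col in ["DateTime", "SunRise"]:
    data.drop(col, axis=1, inplace=True)
for col in data.select_dtypes(include="int64").columns:
    data[col] = data[col].astype("float64")

print(data.columns.tolist(), data["MaxTemp"].dtype)  # ['MaxTemp'] float64
```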
Program code

@staticmethod
def shuffle(base):
    for i in range(len(base) - 1, -1, -1):
        # pick one random partner index per step so the swap is a true swap
        j = rd.randint(0, i)
        base.iloc[i], base.iloc[j] = base.iloc[j].copy(), base.iloc[i].copy()
    return base

6.3. SplitSet method

It takes as input x - the database - and k - the division ratio of the set. In the variable n we store the length of the set x multiplied by k, so that we know where to divide the set. Then we write the data from the database up to index n into the variable xTrain, creating the training set, and all data after index n into the variable xVal, creating the validation set. Finally, we return both of these sets.

Program code

def splitSet(x, k):
    n = int(len(x) * k)
    xTrain = x[:n]
    xVal = x[n:]
    return xTrain, xVal

6.4. Normalize method

Takes x, which is a database whose records have been scrambled using the shuffle method. At the beginning, we put all data from the database into the variable values, except for the string values, and the column names into the variable columnNames. We loop through all the columns, and for each column we take all of its rows and store them in the variable data. The variables max1 and min1 are assigned the maximum and minimum values from data. Using the next loop, we go through all the rows and assign to the variable val the min-max normalization formula: we subtract min1 from the database record at [row, column], and then divide this difference by the difference between max1 and min1. Finally, we write the value after normalization back to the database. The method returns a normalized database.
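A quick usage sketch of the shuffle and splitSet methods on a made-up frame (our own test; the functions are repeated as plain functions, assuming `import random as rd` and pandas as in the paper's code):

```python
import random as rd
import pandas as pd

def shuffle(base):
    # swap each row with a randomly chosen earlier row (Fisher-Yates style)
    for i in range(len(base) - 1, -1, -1):
        j = rd.randint(0, i)
        base.iloc[i], base.iloc[j] = base.iloc[j].copy(), base.iloc[i].copy()
    return base

def splitSet(x, k):
    # first n rows become the training set, the rest the validation set
    n = int(len(x) * k)
    return x[:n], x[n:]

df = pd.DataFrame({"MaxTemp": list(range(10))})
shuffled = shuffle(df)
train, val = splitSet(shuffled, 0.8)
print(len(train), len(val))  # 8 2
```

With k = 0.8 a ten-row frame splits 8:2, and shuffling only reorders rows, so no record is lost or duplicated.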
Program code
def normalize(x):
    # select all data from the database except object (i.e. string) columns
    values = x.select_dtypes(exclude="object")
    columnNames = values.columns.tolist()
    for column in columnNames:
        # take all rows in the given column
        data = x.loc[:, column]
        max1 = max(data)
        min1 = min(data)
        # go through all the rows
        for row in range(0, len(x)):
            val = (x.at[row, column] - min1) / (max1 - min1)
            x.at[row, column] = val
    return x
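Applied to a made-up two-column frame, the method behaves as follows (our own usage sketch; the function is repeated so the snippet is self-contained):

```python
import pandas as pd

def normalize(x):
    # min-max normalize every non-string column in place
    values = x.select_dtypes(exclude="object")
    for column in values.columns.tolist():
        data = x.loc[:, column]
        max1, min1 = max(data), min(data)
        for row in range(len(x)):
            x.at[row, column] = (x.at[row, column] - min1) / (max1 - min1)
    return x

df = pd.DataFrame({"MaxTemp": [5.0, 10.0, 20.0], "Condition": ["Rain", "Sun", "Rain"]})
out = normalize(df)
print(out["MaxTemp"].tolist())  # [0.0, 0.3333333333333333, 1.0]
```

The string column "Condition" is skipped by select_dtypes, while every numeric column is rescaled so its minimum maps to 0 and its maximum to 1.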
The next step is to loop through all the values of the names list in sequence. Then we create auxiliary lists: tr[] - to store the 6 probability values that correspond to the successive database columns, sr[] - to store the mean values from each column, and std[] - to store the standard deviations for each column. The next loop passes through all the columns one by one. For each column we calculate the mean and the standard deviation. The next step is the conditions that guard against invalid values in sr[] and std[]. Then come the distributions: depending on the input data "name", we add to the list tr[] the result of the given distribution's density function. After going through the inner loop, we compute the value from Bayes' theorem: based on the formula for conditional probability, we multiply the values in the tr list, then multiply that product by the count of the class names[i], and divide the whole thing by the length of the "names" list. We add the obtained result to the "values" list. After going through both loops, we determine the index of the highest value in the values list. Finally, we return the name of the weather with that index from the stringnames list.

Program code

class AnalizingData:

    @staticmethod
    def analize(Train, Val, name):
        correct = 0
        for i in range(len(Val)):
            if NaiveClassifier.classify(Train, Val.iloc[i], name) == Val.iloc[i].Condition:
                correct += 1
        accuracy = correct / len(Val) * 100
        return accuracy

7. Tests

We started our tests by checking the algorithm's operation using various samples. As you can see in the attached picture, the algorithm
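The classification step described above can be condensed into the following sketch. This is our own plain-Python reconstruction of the Gaussian variant (the paper's implementation works on pandas DataFrames and six weather columns; the toy training data here is made up):

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def classify(train, sample):
    # train: list of (feature list, weather name); sample: feature list.
    # For each weather name, multiply the per-column densities and the
    # class prior (Bayes' rule), then return the name with the highest score.
    names = sorted({c for _, c in train})
    values = []
    for name in names:
        rows = [f for f, c in train if c == name]
        score = len(rows) / len(train)  # class prior
        for j in range(len(sample)):
            col = [r[j] for r in rows]
            mu = sum(col) / len(col)
            sigma = (sum((v - mu) ** 2 for v in col) / len(col)) ** 0.5 or 1e-9  # guard zero std
            score *= gaussian_pdf(sample[j], mu, sigma)
        values.append(score)
    return names[values.index(max(values))]

train = [([5.0], "Rain"), ([6.0], "Rain"), ([25.0], "Sunny"), ([27.0], "Sunny")]
print(classify(train, [5.5]))   # Rain
print(classify(train, [26.0]))  # Sunny
```

Swapping gaussian_pdf for another density function is what produces the different variants compared in the tests.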
time was only 1.467 minutes. However, the first calculations, where the division was in the ratio of 1:9, took as long as 9.517 minutes. This is because by reducing the training set we increase the number of records in the validation set. As a result, the algorithm is called more times, and its most time-consuming elements, such as extracting the records with a given weather or the loops, are executed many times.

Analyzing the above, we can see that the accuracy of the Gaussian distribution is superior to all the others, and its value is practically unchanged across the splits. The Laplace distribution is in second place, almost reaching 60%. The value of the uniform distribution ranges from 50-55%; it achieves the most on the last chart, where the training set is 0.9. The triangular and log normal distributions reach much lower values than the previously mentioned distributions; the gap is quite big, around 30%. The log normal distribution only slightly exceeds the triangular distribution once, in the third graph. Nevertheless, the accuracy values of both distributions never exceed 20%.

9. Conclusion

We can conclude from this that the Gaussian distribution is the best probability distribution for our database. The algorithm with this distribution, with each modification, correctly determines about 60% of weather names, which is a good but unsatisfactory value. This is due to the way the data is distributed in the database. With more distinct values for the different weather conditions, this algorithm could become much more efficient.