A15 Final Document
EARTHQUAKE DETECTION USING MACHINE LEARNING ALGORITHMS
Submitted in partial fulfillment of the requirements
for the award of the degree of
BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE & ENGINEERING
by
AKSHAYA G (194G1A0505)
MANOJ A (194G1A0557)
ARAVIND B (204G5A0501)
DEEPTHI T (204G5A0503)
2022-2023
SRINIVASA RAMANUJAN INSTITUTE OF TECHNOLOGY
(AUTONOMOUS)
(Affiliated to JNTUA, Accredited by NAAC with ‘A’ Grade, Approved by AICTE, New Delhi &
Accredited by NBA (EEE, ECE & CSE))
Rotarypuram Village, BK Samudram Mandal, Ananthapuramu-515701
Certificate
This is to certify that the project report entitled EARTHQUAKE DETECTION USING
MACHINE LEARNING ALGORITHMS is the bonafide work carried out by Akshaya G,
Manoj A, Aravind B, Deepthi T bearing Roll Number 194G1A0505, 194G1A0557,
204G5A0501, 204G5A0503 in partial fulfillment of the requirements for the award of the
degree of Bachelor of Technology in Computer Science & Engineering during the academic
year 2022-2023.
Mr. Nazeer Shaik, M.Tech., (Ph.D), Assistant Professor
Mr. P. Veera Prakash, M.Tech., (Ph.D), Assistant Professor
ACKNOWLEDGEMENT
The satisfaction and euphoria that accompany the successful completion of any
task would be incomplete without the mention of the people who made it possible, whose
constant guidance and encouragement crowned our efforts with success. It is a pleasant
aspect that we now have the opportunity to express our gratitude to all of them.
It is with immense pleasure that we would like to express our indebted gratitude
to our Guide Mr. Nazeer Shaik, Assistant Professor, Computer Science &
Engineering, who has guided us a lot and encouraged us in every step of the project
work. We thank him for the stimulating guidance, constant encouragement and
constructive criticism which made it possible to bring out this project work.
We express our deep-felt gratitude to Dr. B. Harichandana, Associate
Professor, and Mrs. S. Sunitha, Assistant Professor, Project Coordinators, whose
valuable guidance and unstinting encouragement enabled us to accomplish our
project successfully in time.
We are very much thankful to Mr. P. Veera Prakash, Assistant Professor &
Head of the Department, Computer Science & Engineering, for his kind support
and for providing necessary facilities to carry out the work.
We wish to convey our special thanks to Dr. G. Bala Krishna, Principal of
Srinivasa Ramanujan Institute of Technology (Autonomous) for giving the
required information for our project work. We also thank all the other
faculty members, the non-teaching staff, and our friends who directly or indirectly helped
and supported us in completing our project on time.
We also express our sincere thanks to the Management for providing excellent
facilities.
Finally, we wish to convey our gratitude to our families who fostered all the
requirements and facilities that we need.
Project Associates
194G1A0505
194G1A0557
204G5A0501
204G5A0503
ABSTRACT
An earthquake is the sudden shaking of the surface of the Earth resulting from a sudden
release of energy in the lithosphere. Reliable prediction of earthquakes has numerous
societal and engineering benefits. In recent years, the exponentially rising volume of
seismic data has led to the development of several automatic earthquake detection
algorithms based on machine learning approaches. In this work, different algorithms are
applied to earthquake detection, including Decision Tree, Random Forest, Support
Vector Machine, and K-Nearest Neighbors (KNeighborsClassifier).
CHAPTER - 1
INTRODUCTION
Earthquake detection refers to the process of identifying and measuring seismic
activity, including the vibrations and waves that are generated by the movement of
tectonic plates in the Earth's crust. Earthquakes are among the most powerful and
destructive natural disasters, and their detection is crucial for understanding and
mitigating their impact.
The detection of earthquakes is typically carried out using a variety of tools and
techniques, including seismometers, which are instruments that measure the movement
of the ground, and other devices such as accelerometers, tiltmeters, and strain meters.
These devices are placed in various locations throughout the Earth's crust, including on
the surface, in boreholes, and on the seafloor.
When an earthquake occurs, it generates seismic waves that can be detected by
these instruments, and the resulting data can be used to determine the location,
magnitude, and other characteristics of the earthquake. This information is then used by
scientists and emergency responders to assess the potential impact of the earthquake and
to take appropriate actions to mitigate its effects.
Earthquakes are a natural disaster that can cause significant damage and loss of
life. Detecting earthquakes early is crucial for reducing their impact and ensuring the
safety of people living in affected areas. Traditional methods of earthquake detection
rely on seismometers and other specialized equipment, which can be expensive and
require significant resources to operate. Machine learning algorithms offer a promising
alternative for earthquake detection. By analyzing seismic data and identifying patterns,
these algorithms can help predict the likelihood of an earthquake occurring in a given
area. This can provide early warning signals and help initiate disaster response efforts,
potentially saving lives and reducing the impact of earthquakes. Various machine
learning algorithms can be used for earthquake detection, each with its own strengths
and weaknesses.
Once an earthquake is detected, the seismic data is analyzed to determine the
location and magnitude of the earthquake. This information is then used to issue alerts
and warnings to people in affected areas, which can help them prepare for and respond
to the earthquake.
1.2 Objectives
The objective of using machine learning algorithms for earthquake detection
is to improve the accuracy and speed of earthquake detection and to provide real-time
alerts and warnings to help mitigate the effects of earthquakes. Machine learning
algorithms can be used to analyze large amounts of seismic data and to identify patterns
and trends that are difficult or impossible for humans to detect.
A key advantage of using machine learning algorithms for earthquake detection is that
they can process vast amounts of seismic data quickly and accurately. In addition, machine
learning algorithms can identify patterns and trends in the data that may be difficult for
humans to detect, especially in large and complex datasets.
CHAPTER - 2
LITERATURE SURVEY
Dinky Tulsi Nandwani et al. [1] present research that helps reduce the damage
and destruction caused by earthquake aftershocks. If the grade of a building is known,
measures can be taken to reduce its weaknesses and strengthen it against earthquakes.
The lower the grade of a building, the more damage it is likely to suffer in an earthquake;
the higher the grade, the easier it is for the building to survive one. The predicted output
for the building data helps in identifying safe and unsafe buildings. Machine learning
thus helps make earthquakes less painful and severe, and it is a feasible option for
preventing damage to buildings and loss of human lives.
Roxane Mallouhy et al. [2] tested eight machine learning algorithms to classify
major earthquake events as positive or negative. The study was applied to a dataset
collected from a center in California that recorded inputs for 36 years. Each machine
learning technique shows different results: KNN, Random Forest and MLP are the best
at producing the fewest false positives (FP), while SVM, KNN, MLP and Random Forest
classify the highest number of outputs correctly.
Vindhya Mudgal et al. [3] note that AI-based approaches have opened up
new opportunities for enhancing the prediction process because of their greater precision
compared to conventional procedures. The BP-AdaBoost model, which has the highest
accuracy, came out on top, followed by Random Forest and the Support Vector Machine
stacking model, which has lower accuracy. Each machine learning approach yields
outcomes that differ from one another.
Dr. S. Anbu Kumar et al. [4] performed earthquake prediction by training
different machine learning models on seismic and acoustic data collected from a
laboratory micro-earthquake simulation. Six machine learning techniques, including
Linear Regression, Support Vector Machine, Random Forest Regression, Case-Based
Reasoning, XGBoost and Light Gradient Boosting Mechanism, were applied separately,
and the accuracies on the training and testing datasets were compared to pick out the best
model. The Light Gradient Boosting Model (LGBM) performs well compared to its
competitors: it has a fair balance between mean absolute error (MAE) on time to failure
and the range of observations, and it also has the fewest outliers.
Alyona Galkina et al. [5] propose to systematize the methods used and to
analyze the main trends in making earthquake predictions with machine learning. The
main approaches to applying machine learning methods to the problem of earthquake
prediction are surveyed, the main open-source earthquake catalogs and databases are
described, and the definitions of the main metrics used for performance evaluation are
given.
D. G. Cortés et al. [7], in a study published in Computers & Geosciences in
2018, attempted to predict the magnitude of the largest seismic event within the next
seven days. The problem of earthquake prediction was treated as a regression task: four
regressors (generalized linear models, gradient boosting machines, deep learning and
random forest) and ensembles of them were applied. The most effective regressor was
random forest (RF), yielding a mean absolute error of 0.74 on average. RF was also one
of the fastest, taking only 18 minutes to train the regression models on all data. In
particular, the most accurate predictions of RF were made for moderate earthquakes
(magnitudes in the range [4, 7)), with MAE <= 0.26. Based on these results, the authors
concluded that using more complex regressor ensembles would improve the accuracy of
predictions for quakes of large magnitude.
Pratiksha Bangar et al. [8], in work published in Computers & Geosciences in
2020, designed and developed an accurate forecaster, a system that forecasts the
catastrophe. Datasets for the Indian subcontinent along with the rest of the world are
collected from government sources. Pre-processing of the data is followed by the
construction of a stacking model that combines the Random Forest and Support Vector
Machine algorithms. The algorithms build this mathematical model from a training
dataset; the model looks for patterns that lead to a catastrophe and adapts to them, so as
to make choices and forecasts without being explicitly programmed to perform the task.
After a forecast is made, the message is broadcast to government officials and across
various platforms.
CHAPTER - 3
ANALYSIS
3.1 EXISTING SYSTEM
In the existing system Neural networks have been investigated for predicting the
magnitude of the largest seismic event based on the analysis of seismicity indicators.
Seismicity indicators are mathematically computed parameters that are based on
earthquake data and can be used to assess the likelihood of future earthquakes.
3.1.1 Disadvantages
➢ The efficiency of the existing algorithms is limited.
➢ Lower accuracy.
➢ Incorrect predictions.
3.2 Proposed System
The proposed system detects earthquakes using machine learning classifiers.
SVM uses a classifier that categorizes the data set by setting an optimal hyperplane
between data points. This classifier is chosen because it is incredibly versatile in the
number of different kernel functions that can be applied, and this model can yield a high
predictability rate. The other algorithms used are listed below:
➢ Random Forest
➢ Decision Tree
Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes represent
the features of a dataset, branches represent the decision rules, and each leaf node
represents the outcome.
Overall, the decision tree algorithm can help improve the accuracy of earthquake
detection by identifying the most important features for classification, allowing for the
creation of more complex models that capture the nuances of the seismic data.
3.2.1 Advantages
3.3 Methodology
The system uses machine learning to make earthquake predictions and Python
as the programming language, since Python has been widely accepted as a
language for experimenting in the machine learning area. Machine learning uses
historical data and information to gain experiences and generate a trained model by
training it with the data. This model then makes output predictions. The better the
collection of dataset, the better will be the accuracy of the classifier. It has been
observed that machine learning methods such as regression and classification perform
better than various statistical models.
Basic Terminology
• Dataset: A set of data examples, which contain features important to solving
the problem.
• Features: Important pieces of data that help us understand a problem.
These are fed into a Machine Learning algorithm to help it learn.
Supervised Learning: It is the most popular paradigm for machine learning. Given
data in the form of examples with labels, we can feed a learning algorithm these
example-label pairs one by one, allowing the algorithm to predict the label for each
example, and giving it feedback as to whether it predicted the right answer or not.
Over time, the algorithm will learn to approximate the exact nature of the relationship
between examples and their labels. When fully- trained, the supervised learning
algorithm will be able to observe a new, never before-seen example and predict a
good label for it.
Machine Learning offers a wide range of algorithms to choose from. These are usually
divided into classification, regression, clustering and association. Classification and
regression algorithms come under supervised learning while clustering and association
come under unsupervised learning.
• Classification: A classification problem is when the output variable is a category,
such as "red" or "blue", or "disease" and "no disease". Example: Decision Trees
• Regression: A regression problem is when the output variable is a real value, such
as dollars or weight. Example: Linear Regression
• Clustering: A clustering problem is where you want to discover the inherent
groupings in the data, such as grouping customers by purchasing behavior
Example: k means clustering.
• Association: An association rule learning problem is where you want to discover
rules that describe large portions of your data, such as people that buy X also tend
to buy Y.
i. Data Collection: Collect the data that the algorithm will learn from.
ii. Data Preparation: Format and engineer the data into the optimal format,
extracting important features and performing dimensionality reduction.
iii. Training: Also known as the fitting stage, this is where the Machine Learning
algorithm actually learns by showing it the data that has been collected and
prepared.
iv. Evaluation: Test the model to see how well it performs.
v. Tuning: Fine tune the model to maximize its performance.
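As an illustration of these five stages, the following minimal sketch strings them together with scikit-learn. The file name earthquake.csv and the target column name class are hypothetical placeholders, not the project's actual names, and the preparation step is deliberately simplified.
# Minimal end-to-end sketch of the five stages (hypothetical file and column names)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

data = pd.read_csv("earthquake.csv")                      # i. Data Collection
data = data.dropna()                                      # ii. Data Preparation (simplified)
X = data.drop("class", axis=1)
y = data["class"]
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = RandomForestClassifier().fit(x_train, y_train)    # iii. Training
print(accuracy_score(y_test, model.predict(x_test)))      # iv. Evaluation
# v. Tuning: adjust hyperparameters such as n_estimators and repeat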
Support Vector Machine (SVM)
SVM uses a classifier that categorizes the data set by setting an optimal
hyperplane between data points. This classifier is chosen as it is incredibly versatile in
the number of different kernel functions that can be applied, and this model can yield a
high predictability rate. Support Vector Machine is one of the most popular and widely
used classification algorithms; it belongs to a group of generalized linear classifiers and
is considered an extension of the perceptron.
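A minimal illustration of fitting such a classifier with scikit-learn (a sketch only, assuming x_train, x_test, y_train and y_test have already been prepared as described in Chapter 6):
from sklearn.svm import SVC

# RBF is the library's default kernel; others ('linear', 'poly', ...) can be swapped in
svm_model = SVC(kernel="rbf")
svm_model.fit(x_train, y_train)
svm_predictions = svm_model.predict(x_test)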
KNeighborsClassifier (KNN)
KNN classifies a new sample based on the labels of the training examples closest to that
data point. It works by finding the distances between a query and all the examples in
the data, selecting the specified number of examples (K) closest to the query, and then
voting for the most frequent label (in the case of classification) or averaging the labels
(in the case of regression).
Advantages
• Consistent Accuracy.
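A minimal KNN sketch with scikit-learn (again assuming the prepared train/test splits; K = 5 is simply the library default, not necessarily the value used in the project):
from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier(n_neighbors=5)   # K = 5 neighbours
knn_model.fit(x_train, y_train)
knn_predictions = knn_model.predict(x_test)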
Decision Tree
Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes represent
the features of a dataset, branches represent the decision rules, and each leaf node
represents the outcome.
Overall, the decision tree algorithm can help improve the accuracy of earthquake
detection by identifying the most important features for classification, allowing for the
creation of more complex models that capture the nuances of the seismic data.
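A corresponding decision tree sketch (assuming the same prepared splits; the max_depth value shown is only an illustration of a tunable parameter, not a value taken from the project):
from sklearn.tree import DecisionTreeClassifier

tree_model = DecisionTreeClassifier(max_depth=5)
tree_model.fit(x_train, y_train)
tree_predictions = tree_model.predict(x_test)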
CHAPTER - 4
SYSTEM REQUIREMENTS SPECIFICATIONS
4.1 Hardware Requirements
The hardware requirements include the requirements specification
of the physical computer resources for a system to work efficiently. The hardware
requirements may serve as the basis for a contract for the implementation of the system
and should therefore be a complete and consistent specification of the whole system.
The Hardware Requirements are listed below:
1. Processor:
A processor is an integrated electronic circuit that performs the
calculations that run a computer. A processor performs arithmetical, logical,
input/output (I/O) and other basic instructions that are passed from an operating
system (OS). Most other processes are dependent on the operations of a processor. A
minimum 1 GHz processor should be used, although we would recommend 2 GHz or
more. A processor includes an arithmetic logic unit (ALU) and a control unit (CU), and
its capability is measured in terms of the following:
• The ability to process instructions in a given time
2. Hard Drive:
A hard drive is an electro-mechanical data storage device that uses magnetic
storage to store and retrieve digital information using one or more rigid rapidly rotating
disks, commonly known as platters, coated with magnetic material. The platters are
paired with magnetic heads, usually arranged on a moving actuator arm, which reads
and writes data to the platter surfaces. Data is accessed in a random-access manner,
meaning that individual blocks of data can be stored or retrieved in any order and not
only sequentially. HDDs are a type of non-volatile storage, retaining stored data even
when powered off. 50 GB or higher is recommended for the proposed system.
3. Memory (RAM):
Random-access memory (RAM) is a form of computer data storage that stores
data and machine code currently being used. A random- access memory device allows
data items to be read or written in almost the same amount of time irrespective of the
physical location of data inside the memory. In today's technology, random-access
memory takes the form of integrated chips. RAM is normally associated with volatile
types of memory (such as DRAM modules), where stored information is lost if power
is removed, although non- volatile RAM has also been developed. A minimum of
RAM is recommended for the proposed system.
4.2 Software Requirements
1. Google Colab:
Google Colab is a free, cloud-based notebook environment provided by Google.
An attractive feature that Google offers to developers is the use of a GPU: Colab
supports GPU acceleration and it is totally free. The reasons for making it free for the
public could be to make its software a standard in the academics for teaching machine
learning and data science. It may also have a long term perspective of building a
customer base for Google Cloud APIs which are sold per-use basis.
Irrespective of the reasons, the introduction of Colab has eased the learning and
development of machine learning applications.
2. Python:
Python is an object-oriented, high-level programming language with integrated
dynamic semantics, used primarily for web and application development. It is extremely
attractive in the field of Rapid Application Development because it offers dynamic
typing and dynamic binding options. Python is relatively simple and easy to learn, since
it uses a syntax that focuses on readability. Developers can read and translate
Python code much easier than other languages. In turn, this reduces the cost of
program maintenance and development because it allows teams to work
collaboratively without significant language and experience barriers.
4.5 Scope
● To design a model for earthquake detection with four algorithms, Random
Forest, Decision Tree, KNN and Support Vector Machine, and to use
Random Forest for detection.
● To detect earthquakes with good accuracy using the Random Forest model.
4.6 Performance
● The performance of our project can be estimated by using the confusion
matrix.
● The accuracy of this project is estimated to be 97%.
CHAPTER - 5
DESIGN
Now, feature selection techniques are applied to the earthquake dataset, which
has a number of features in it, to obtain a subset of features that are most important for
prediction. The algorithm that gives the best possible accuracy with the subset of
features obtained after feature selection is then chosen. Applying the algorithms to the
dataset actually means that the model is trained with the algorithms and tested on the
data so that the model will fit. The design involves the following steps:
➢ Data Preprocessing
➢ Model Training
➢ Model Evaluation
➢ So, the aim of the project is to predict the dependent variables using
independent variables.
Univariate analysis is the simplest form of data analysis where the data being
analyzed contains only one variable. Since it's a single variable it doesn't deal with
causes or relationships.
Bivariate data is data that involves two different variables whose values can
change. Bivariate data deals with relationships between these two variables.
In this step the model is trained using the algorithms that are suitable. Earthquake
prediction is a kind of problem in which one variable has to be determined using some
independent variables; a regression model is suitable for this kind of scenario.
• A training model is a dataset that is used to train an ML algorithm. It
consists of the sample output data and the corresponding sets of input data
that have an influence on the output. The training model is used to run the
input data through the algorithm to correlate the processed output against
the sample output.
• Accuracy
• Recall score
• Precision score
• F1 score
Out of these, we used accuracy for evaluating our model. Accuracy is the most
commonly used metric to judge a model, but it is not always a clear indicator of
performance; the worst case occurs when the classes are imbalanced. In such cases,
other evaluation metrics such as precision, recall, and F1 score can provide a more
informative picture of the model's performance.
It is important to carefully choose the evaluation metrics based on the
characteristics of the dataset and the problem being solved. In the case of imbalanced
datasets, using accuracy alone can lead to inaccurate conclusions about the
performance of a machine learning model, and it is important to consider alternative
metrics such as precision, recall, and F1 score.
As the strategic value of software increases for many companies, the industry
looks for techniques to automate the production of software and to improve quality
and reduce cost and time to the market. These techniques include component
technology, visual programming, patterns and frameworks. Additionally, the
development for the World Wide Web, while making some things simpler, has
exacerbated these architectural problems. The UML was designed to respond to these
needs. Simply, systems design refers to the process of defining the architecture,
components, modules, interfaces and data for a system to satisfy specified
requirements which can be done easily through UML diagrams.
The class diagram contains two classes, System and User, with the following operations:
System: +read dataset, +train dataset, +test dataset, +generate results(), +generate graph()
User: +upload dataset, +apply algorithm, +predict results, +Analysis results()
Fig 5.3: Class Diagram
CHAPTER - 6
IMPLEMENTATION
The pip command is a tool for installing and managing Python packages, such
as those found in the Python Package Index (PyPI). It is a replacement for easy_install.
The easiest way to install the required Python modules and keep them up to date is with
this Python-based package manager:
pip install (module name)
NumPy:
NumPy is the fundamental package for numerical computing in Python. It
provides a fast N-dimensional array object together with vectorized mathematical
operations on whole arrays, and it is the foundation on which libraries such as Pandas
and scikit-learn are built.
Pandas:
Similar to NumPy, Pandas is one of the most widely used Python libraries in
data science. It provides high-performance, easy-to-use data structures and data analysis
tools. Pandas provides an in-memory 2-D table object called a DataFrame, which is like
a spreadsheet with column names and row labels. With these 2-D tables, Pandas provides
many additional functionalities such as creating pivot tables, computing columns based
on other columns and plotting graphs. Pandas can be imported into Python using:
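import pandas as pd
Here pd is the community-standard alias rather than anything specific to this project; the sketches in this chapter assume Pandas is imported this way.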
Matplotlib:
Matplotlib is a 2-D plotting library for Python that produces publication-quality
figures such as line plots, histograms, bar charts and scatter plots. Its pyplot module is
conventionally imported as plt and is the backend on which Seaborn builds.
Seaborn:
Seaborn is a data visualization library built on top of matplotlib and closely
integrated with pandas’ data structures in Python. Visualization is the central part of
Seaborn, which helps in the exploration and understanding of data. Seaborn offers
functionalities such as a dataset-oriented API for examining the relationships between
variables, automatic estimation and plotting of linear regression fits, high-level
abstractions for multi-plot grids, and visualization of univariate and bivariate
distributions.
Flask:
Flask is a web framework for Python that enables developers to build web applications
quickly and easily. It is a lightweight framework that provides only the essentials,
allowing developers to add additional libraries as needed. Flask is known for its
simplicity, flexibility, and scalability, making it a popular choice for building web
applications, including machine learning web apps.
The Flask module is a web framework for Python that enables developers to
build web applications quickly and easily. The render_template function allows
developers to render HTML templates, while the request object enables access to
incoming request data, such as form data or query parameters.
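A minimal sketch of how such a prediction endpoint might look; the route names, the template file index.html, the form field names and the variable model are illustrative assumptions, not the project's actual code.
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/")
def home():
    return render_template("index.html")      # assumes templates/index.html exists

@app.route("/predict", methods=["POST"])
def predict():
    # read hypothetical form fields and pass them to a previously trained model
    magnitude = float(request.form["magnitude"])
    depth = float(request.form["depth"])
    result = model.predict([[magnitude, depth]])   # 'model' is assumed to be loaded elsewhere
    return render_template("index.html", prediction=result[0])

if __name__ == "__main__":
    app.run(debug=True)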
Sklearn:
Scikit-learn is a free machine learning library for the Python programming
language. It features various classification, regression and clustering algorithms. In
our project we have used different features of the sklearn library, such as:
from sklearn.preprocessing import LabelEncoder
In machine learning, we usually deal with datasets that contain multiple
labels in one or more columns. These labels can be in the form of words or
numbers. To make the data understandable, or in human-readable form, the training
data is often labeled in words.
Label encoding converts the data into machine-readable form by assigning a
unique number (starting from 0) to each class of data. This may lead to priority issues
during the training of the datasets.
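A small sketch of this encoding step; the DataFrame data and the categorical column name "type" are hypothetical placeholders, not necessarily columns of the project's dataset.
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
# e.g. ['earthquake', 'explosion', 'earthquake'] -> [0, 1, 0]
data["type"] = encoder.fit_transform(data["type"])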
CSV file
The dataset used in this project is a .CSV file. In computing, a comma-separated
values (CSV) file is a delimited text file that uses a comma to separate values. A CSV
file stores tabular data (numbers and text) in plain text. Each line of the file is a data
record, and each record consists of one or more fields separated by commas. The use of
the comma as a field separator is the source of the name of this file format. CSV is a
simple file format used to store tabular data, such as a spreadsheet or database. Files in
the CSV format can be imported to and exported from programs that store data in tables,
such as Microsoft Excel or OpenOffice Calc.
A CSV file allows data to be saved in a tabular format. CSVs look like a
garden-variety spreadsheet but with a .csv extension, and they can be used with almost
any spreadsheet program, such as Microsoft Excel or Google Sheets.
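Loading such a file with Pandas might look like this; the file name earthquake.csv is an assumed placeholder for the project's actual dataset file.
import pandas as pd

data = pd.read_csv("earthquake.csv")   # load the dataset into a DataFrame
print(data.head())                     # inspect the first five records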
6.2 Implementation
6.2.2.2 Balancing
Balancing in machine learning refers to the process of addressing class
imbalance in a dataset. Class imbalance occurs when the number of instances in one
class is significantly higher than the other classes, which can lead to biased models
that perform poorly on the minority class.
The class distribution of the target variable in the dataset is visualized first. If there
is a significant imbalance between the classes, such as one class having a much larger
count than the others, it may be necessary to balance the dataset before training a
machine learning model to avoid biased performance.
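One common way to inspect and correct such an imbalance is sketched below with Pandas and scikit-learn; the target column name "class" and the two class labels 0 and 1 are assumptions for illustration.
from sklearn.utils import resample
import pandas as pd

print(data["class"].value_counts())               # inspect the class distribution

majority = data[data["class"] == 0]
minority = data[data["class"] == 1]
# up-sample the minority class so both classes have the same number of rows
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_upsampled])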
There are several techniques and tools that can be used for exploratory analysis,
including:
❖ Visualization: This involves creating graphs and charts to display the data
visually, such as histograms, scatterplots, and boxplots. Visualization can help
to identify outliers, trends, and patterns in the data.
❖ Data cleaning: This involves identifying and addressing issues such as missing
data, outliers, and inconsistencies in the data.
❖ Dimensionality reduction: This involves reducing the number of variables in
the data, for example by using principal component analysis (PCA) or factor
analysis.
❖ Clustering: This involves grouping similar data points together, based on their
characteristics or attributes.
One way to explore the relationship between magnitude and depth is to create
a scatter plot, with magnitude on the x-axis and depth on the y-axis. Each point on the
plot represents a single earthquake event, and the position of the point indicates its
magnitude and depth.
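A quick way to produce such a plot, assuming the DataFrame has columns named "magnitude" and "depth" (hypothetical names used for illustration):
import matplotlib.pyplot as plt

plt.scatter(data["magnitude"], data["depth"], s=5, alpha=0.5)
plt.xlabel("Magnitude")
plt.ylabel("Depth (km)")
plt.title("Magnitude vs Depth of recorded events")
plt.show()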
Some columns of the dataset contain mostly zero values, so these have to be normalized
and set to some value. The mean of the entire column is calculated and the zero entries
are replaced with that mean value. This maps all the zeros of the column to a particular
value, which helps make the model more accurate.
For all the Machine learning models to train with any algorithm of their choice
the dataset has to be divided into two parts called Training dataset and Testing dataset.
Generally, the training dataset will be 80% of the entire dataset and 20% of the data as
the Testing dataset.
Training dataset will be used to train the model and testing dataset will be used
to find the accuracy of our predicted model. Performance evaluation can be made with
the accuracy of the trained model with required algorithm.
This splitting of the dataset into train and test sets can be done using the
train_test_split command from the sklearn library, as sketched below.
x_train - represents the training dataset
x_test - represents the testing dataset
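A minimal version of that split; the 80/20 ratio follows the text above, and X and y stand for the feature matrix and target column prepared earlier.
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)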
6.3.2 StandardScaler
StandardScaler standardizes the features by removing the mean and scaling each
column to unit variance, so that features measured on very different scales contribute
comparably to distance-based algorithms such as KNN and SVM.
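A sketch of how it is typically applied to the splits created above (reusing the assumed x_train and x_test arrays):
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)   # fit on the training data only
x_test = scaler.transform(x_test)         # reuse the same scaling for the test data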
Step4: Fit the training dataset using Random Forest Classifier algorithm.
Step5: Predict the values for the testing dataset using a trained model.
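Steps 4 and 5 might look like this in code; this is a sketch, and the n_estimators value shown is simply the scikit-learn default rather than the project's setting.
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(x_train, y_train)            # Step 4: fit the training dataset
y_pred = rf_model.predict(x_test)         # Step 5: predict values for the testing dataset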
When evaluating classification models, there are several common metrics used to assess
their performance. Here are some of the most common metrics and how to calculate
them:
from sklearn.metrics import f1_score, accuracy_score, recall_score, precision_score
➢ Precision score: measures the proportion of true positive predictions out of the
total number of positive predictions. It is calculated as:
Precision = TP / (TP + FP)
➢ Recall score: measures the proportion of true positive predictions out of the
total number of actual positive instances. It is calculated as:
Recall = TP / (TP + FN)
➢ F1 score: is the harmonic mean of precision and recall, and provides a single
score that balances the trade-off between precision and recall. It is calculated as:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
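These metrics can be computed for the predictions produced above using the imports shown; average="weighted" is an assumption for the multi-class case, not necessarily the project's exact setting.
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average="weighted"))
print("Recall   :", recall_score(y_test, y_pred, average="weighted"))
print("F1 score :", f1_score(y_test, y_pred, average="weighted"))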
CHAPTER - 7
TESTING
7.1 SYSTEM TESTING
The purpose of testing is to discover errors. Testing is the process of trying to discover
every conceivable fault or weakness in a work product. It provides a way to check the
functionality of components, sub-assemblies, assemblies and/or a finished product. It is
the process of exercising software with the intent of ensuring that the software system
meets its specified requirements and functions as expected. Testing involves creating
and executing test cases to verify that the software behaves correctly under various
conditions and scenarios.
Testing also ensures that the software system meets its requirements and user
expectations and does not fail in an unacceptable manner. The testing process can be
manual or automated and can be
conducted at various stages of the software development life cycle, such as unit testing,
integration testing, system testing, and acceptance testing. Overall, testing plays a
crucial role in ensuring the quality and reliability of software products.
Unit testing involves the design of test cases that validate that the internal program logic
is functioning properly, and that program inputs produce valid outputs. All decision
branches and internal code flow should be validated. Unit testing is the testing of
individual software units of the application; it is done after the completion of an
individual unit and before integration. This is structural testing that relies on knowledge
of the unit's construction and is invasive.
Unit tests ensure that each unique path of a business process performs accurately to the
documented specifications and contains clearly defined inputs and expected results. Unit
testing is an essential part of the software development process, which helps developers
identify defects early, improve the quality of the software, and reduce the cost of bug
fixes. By validating the functionality of each unit of code, developers can ensure that
the software application meets its functional requirements and performs as expected.
Integration testing is a type of software testing that involves testing the interactions and
interfaces between different software components that have been integrated into a larger
system. The purpose of integration testing is to ensure that these components work
together as expected and that the integrated system meets its functional requirements.
Functional tests provide systematic demonstrations that functions tested are available
as specified by the business and technical requirements, system documentation, and
user manuals.
CHAPTER - 8
Evaluation of Algorithms:
This chapter evaluates the performance of four machine learning algorithms:
Decision Tree, K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and
Random Forest.
Confusion Matrix:
If the model were designed to classify objects based on their color, the x-axis of
the confusion matrix would represent the predicted classes, namely the colors green,
yellow, orange, and red, while the y-axis would represent the actual classes of the
objects. The confusion matrix shows the number of true positives (TP), true
negatives (TN), false positives (FP), and false negatives (FN) for each class.
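Such a matrix can be produced for the trained model from Chapter 6 as follows (a sketch reusing the assumed y_test and y_pred arrays):
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print(cm)        # rows: actual classes, columns: predicted classes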
CONCLUSION
REFERENCES
PUBLICATION