Air Quality Prediction Using Machine Learning
1. DATA COLLECTION
2. DATA PRE-PROCESSING
3. FEATURE EXTRACTION
4. EVALUATION MODEL
DATA COLLECTION
The data used in this paper is the JM1 software dataset. This step is concerned
with selecting the subset of all available data that you will be working with. ML
problems start with data, preferably lots of data (examples or observations) for
which you already know the target answer. Data for which you already know
the target answer is called labelled data.
DATA PRE-PROCESSING
Organize your selected data by formatting, cleaning and sampling from it.
Three common data pre-processing steps are:
Formatting: The data you have selected may not be in a format that is suitable for
you to work with. The data may be in a relational database and you would like it in
a flat file, or the data may be in a proprietary file format and you would like it in a
relational database or a text file.
Cleaning: Cleaning data is the removal or fixing of missing data. There may be
data instances that are incomplete and do not carry the data you believe you need
to address the problem. These instances may need to be removed. Additionally,
there may be sensitive information in some of the attributes and these attributes
may need to be anonymized or removed from the data entirely.
Sampling: There may be far more selected data available than you need to work
with. More data can result in much longer running times for algorithms and larger
computational and memory requirements. You can take a smaller representative
sample of the selected data that may be much faster for exploring and prototyping
solutions before considering the whole dataset.
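The three pre-processing steps can be sketched with the Python standard library alone. This is a minimal illustration, not the project's actual pipeline: the column names (station, pm25, no2), the sample values and the helper names load_rows, drop_incomplete and sample_rows are all assumptions made for the example.

```python
import csv
import io
import random

# Hypothetical raw export; column names and values are made up for illustration.
RAW = """station,pm25,no2
A,12.5,30
B,,41
C,18.0,
D,9.4,22
E,15.1,35
"""

def load_rows(text):
    """Formatting: parse delimited text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def drop_incomplete(rows):
    """Cleaning: remove instances that have any missing value."""
    return [row for row in rows if all(value.strip() for value in row.values())]

def sample_rows(rows, k, seed=0):
    """Sampling: draw k rows at random (without replacement) for prototyping."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))

rows = load_rows(RAW)
clean = drop_incomplete(rows)
subset = sample_rows(clean, k=2)
print(len(rows), len(clean), len(subset))
```

Here cleaning simply drops incomplete rows; fixing missing values instead (for example by filling in a column average) is an equally valid choice and depends on the dataset.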
FEATURE EXTRACTION
The next step is feature extraction, which is an attribute reduction process.
Unlike feature selection, which ranks the existing attributes according to their
predictive significance, feature extraction actually transforms the attributes. The
transformed attributes, or features, are linear combinations of the original
attributes. Finally, our models are trained using a classifier algorithm. We use
the classify module of the Natural Language Toolkit (NLTK) library for Python.
Part of the labelled dataset we gathered is used for training, and the rest of our
labelled data is used to evaluate the models. A machine learning algorithm was
used to classify the pre-processed data. The chosen classifier was Random Forest,
an algorithm that is very popular in text classification tasks.
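Holding back part of the labelled data for evaluation can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the 80/20 split, the helper names split_labelled and accuracy, and the toy data are hypothetical, and the predict function is a stand-in rather than a trained Random Forest. The (feature dictionary, label) pair layout follows the convention used by NLTK's classify module.

```python
import random

def split_labelled(examples, train_fraction=0.8, seed=0):
    """Shuffle labelled examples and split them into training and evaluation sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(predict, examples):
    """Fraction of (features, label) pairs for which predict(features) == label."""
    if not examples:
        raise ValueError("no examples to evaluate")
    correct = sum(1 for features, label in examples if predict(features) == label)
    return correct / len(examples)

# Hypothetical labelled data: (feature dictionary, label) pairs.
data = [({"x": i}, "high" if i >= 5 else "low") for i in range(10)]
train, held_out = split_labelled(data)

# Stand-in for a trained classifier; a real model would be fitted on `train`.
def predict(features):
    return "high" if features["x"] >= 5 else "low"

print(len(train), len(held_out), accuracy(predict, held_out))
```

Evaluating only on the held-out examples matters because a model scored on its own training data can look better than it really is.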
EVALUATION MODEL
Class diagram:-
Activity diagram:-
Domain Specification
MACHINE LEARNING
Machine learning is a system that can learn from examples through
self-improvement, without being explicitly coded by a programmer. The
breakthrough comes with the idea that a machine can learn on its own from the data
(i.e., examples) to produce accurate results.
Machine learning combines data with statistical tools to predict an output. This
output is then used by businesses to derive actionable insights. Machine learning is
closely related to data mining and Bayesian predictive modeling. The machine
receives data as input and uses an algorithm to formulate answers.
A typical machine learning task is to provide a recommendation. For those who
have a Netflix account, all recommendations of movies or series are based on the
user's historical data. Tech companies are using unsupervised learning to improve
the user experience with personalized recommendations.
Machine learning is also used for a variety of tasks such as fraud detection,
predictive maintenance, portfolio optimization, task automation and so on.
Figure: Machine Learning (diagram elements: Data, Rules, Computer, Output)
How does Machine learning work?
Machine learning is the brain where all the learning takes place. The way the
machine learns is similar to the way a human being learns. Humans learn from
experience. The more we know, the more easily we can predict. By analogy, when we
face an unknown situation, the likelihood of success is lower than in a known
situation. Machines are trained the same way. To make an accurate prediction, the
machine sees an example. When we give the machine a similar example, it can figure
out the outcome. However, like a human, if it is fed a previously unseen example,
the machine has difficulty predicting.
The core objectives of machine learning are learning and inference. First of all,
the machine learns through the discovery of patterns. This discovery is made
thanks to the data. One crucial part of the data scientist's job is to choose carefully
which data to provide to the machine. The list of attributes used to solve a problem
is called a feature vector. You can think of a feature vector as a subset of data that
is used to tackle a problem.
The machine uses algorithms to simplify reality and transform this
discovery into a model. Therefore, the learning stage is used to describe the data
and summarize it into a model.
For instance, suppose the machine is trying to understand the relationship between
the wage of an individual and the likelihood of going to a fancy restaurant. It turns
out the machine finds a positive relationship between wage and going to a high-end
restaurant: this is the model.
Inferring
When the model is built, it is possible to test how powerful it is on
never-before-seen data. The new data are transformed into a feature vector, go
through the model and give a prediction. This is the beautiful part of machine
learning: there is no need to update the rules or train the model again. You can use
the previously trained model to make inferences on new data.
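The split between learning and inference can be illustrated with a very small model. The sketch below fits a straight line by ordinary least squares and then reuses it on a new value; the wage and visit numbers are made up purely for illustration, and the function names fit_line and predict are hypothetical.

```python
def fit_line(xs, ys):
    """Learning: ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict(model, x):
    """Inference: apply the stored model to a new feature value."""
    slope, intercept = model
    return slope * x + intercept

# Made-up observations: wage (in thousands) and restaurant visits per year.
wages = [20, 30, 40, 50]
visits = [1, 2, 3, 4]

model = fit_line(wages, visits)   # trained once
print(predict(model, 60))         # reused on unseen data, no retraining
```

The model here is just two numbers (slope and intercept), which is the sense in which learning summarizes the data into something that can be applied later.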
The life cycle of a machine learning program is straightforward and can be
summarized in the following points:
1. Define a question
2. Collect data
3. Visualize data
4. Train algorithm
5. Test the algorithm
6. Collect feedback
7. Refine the algorithm
8. Loop 4-7 until the results are satisfying
9. Use the model to make a prediction
Once the algorithm gets good at drawing the right conclusions, it applies that
knowledge to new sets of data.
Machine learning algorithms and where they are used
Machine learning can be grouped into two broad learning tasks: supervised and
unsupervised. There are many other algorithms.
Supervised learning
An algorithm uses training data and feedback from humans to learn the
relationship of given inputs to a given output. For instance, a practitioner can use
marketing expense and weather forecast as input data to predict the sales of cans.
You can use supervised learning when the output data is known. The algorithm
will then predict the output for new data.
There are two categories of supervised learning: classification and regression.
Common supervised algorithms, with the type of task each is used for, are:
Linear regression (Regression): Finds a way to correlate each feature to the output
to help predict future values.
Logistic regression (Classification): Extension of linear regression that's used for
classification tasks. The output variable is binary (e.g., only black or white) rather
than continuous (e.g., an infinite list of potential colors).
Naive Bayes (Regression, Classification): The Bayesian method is a classification
method that makes use of Bayes' theorem. The theorem updates the prior
knowledge of an event with the independent probability of each feature that can
affect the event.
Support vector machine (Regression (not very common), Classification): Support
Vector Machine, or SVM, is typically used for the classification task. The SVM
algorithm finds a hyperplane that optimally divides the classes. It is best used with
a non-linear solver.
Random forest (Regression, Classification): The algorithm is built upon a decision
tree to improve the accuracy drastically. Random forest generates many simple
decision trees and uses the 'majority vote' method to decide on which label to
return. For the classification task, the final prediction will be the one with the most
votes, while for the regression task, the average prediction of all the trees is the final
prediction.
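The way a random forest combines its trees can be sketched in a few lines. This shows only the aggregation step; building and training the individual decision trees is omitted. The tree outputs below are hypothetical values, and the helper names majority_vote and average_prediction are chosen for the example.

```python
from collections import Counter
from statistics import mean

def majority_vote(labels):
    """Classification: return the label predicted by the most trees."""
    return Counter(labels).most_common(1)[0][0]

def average_prediction(values):
    """Regression: return the mean of the trees' numeric predictions."""
    return mean(values)

# Hypothetical outputs of five already-trained trees for a single input.
tree_labels = ["good", "poor", "good", "good", "poor"]
tree_values = [41.0, 39.0, 44.0, 40.0, 36.0]

print(majority_vote(tree_labels))
print(average_prediction(tree_values))
```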
Classification
Imagine you want to predict the gender of a customer for a commercial. You will
start gathering data on the height, weight, job, salary, purchasing basket, etc. from
your customer database. You know the gender of each of your customers; it can
only be male or female. The objective of the classifier will be to assign a
probability of being male or female (i.e., the label) based on the information
(i.e., the features you have collected). Once the model has learned how to recognize
male or female, you can use new data to make a prediction. For instance, you just
got new information from an unknown customer, and you want to know whether the
customer is male or female. If the classifier predicts male = 70%, it means the
algorithm is 70% sure that this customer is male and 30% sure that the customer is
female.
The label can be of two or more classes. The above example has only two classes,
but if a classifier needs to predict objects, it has dozens of classes (e.g., glass, table,
shoes, etc.; each object represents a class).
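One common way to obtain class probabilities like the 70% / 30% above is to pass a classifier's raw scores through a softmax function and return the most probable class. The sketch below assumes this approach; the scores are hypothetical and not produced by any model in this report, and the same code works unchanged for more than two classes.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    highest = max(scores.values())
    exps = {label: math.exp(score - highest) for label, score in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}

def predict_label(probabilities):
    """Return the class with the highest probability."""
    return max(probabilities, key=probabilities.get)

# Hypothetical raw scores produced by some trained classifier.
scores = {"male": 0.85, "female": 0.0}
probabilities = softmax(scores)
print(probabilities, predict_label(probabilities))
```

Subtracting the highest score before exponentiating does not change the result; it only avoids overflow when scores are large.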
Regression
When the output is a continuous value, the task is a regression. For instance, a
financial analyst may need to forecast the value of a stock based on a range of
features such as equity, previous stock performance and macroeconomic indices.
The system will be trained to estimate the price of the stocks with the lowest
possible error.
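"Lowest possible error" needs a concrete measure. Two common choices are the mean absolute error (MAE) and the root mean squared error (RMSE), sketched below. The prices are made-up numbers used only to show the calculation.

```python
import math

def mean_absolute_error(actual, predicted):
    """Average size of the errors, in the same unit as the target."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    """Like MAE, but large errors are penalized more heavily."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))

# Hypothetical actual and predicted stock prices.
actual = [100.0, 102.0, 105.0, 103.0]
predicted = [101.0, 101.0, 107.0, 103.0]

print(mean_absolute_error(actual, predicted))
print(root_mean_squared_error(actual, predicted))
```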
Unsupervised learning
In unsupervised learning, an algorithm explores input data without being given an
explicit output variable (e.g., explores customer demographic data to identify
patterns).
You can use it when you do not know how to classify the data, and you want the
algorithm to find patterns and classify the data for you.
K-means clustering (Clustering): Puts data into some groups (k) that each contain
data with similar characteristics (as determined by the model, not in advance by
humans).
PCA/t-SNE (Dimension Reduction): Mostly used to decrease the dimensionality of
the data. The algorithms reduce the number of features to 3 or 4 vectors with the
highest variances.
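K-means clustering is a typical example of finding groups without labels. The sketch below runs the standard assign-then-update loop (Lloyd's algorithm) on one-dimensional data. The points and starting centroids are hypothetical, and real implementations usually pick the starting centroids at random and work in many dimensions.

```python
def k_means_1d(points, centroids, iterations=10):
    """Lloyd's algorithm on one-dimensional data with given starting centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(point - centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break  # converged: assignments can no longer change
        centroids = updated
    return centroids, clusters

# Hypothetical measurements that visibly form two groups.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])
print(centroids, clusters)
```

No labels are supplied anywhere: the two groups emerge only from the distances between the points.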
Augmentation:
Automation:
Finance Industry
Healthcare industry
● Healthcare was one of the first industries to use machine learning with image
detection.
Marketing
Machine learning gives terrific results for visual pattern recognition, opening up
many potential applications in physical inspection and maintenance across the
entire supply chain network.
Unsupervised learning can quickly search for comparable patterns in a diverse
dataset. In turn, the machine can perform quality inspection throughout the
logistics hub, detecting shipments with damage and wear.
For instance, IBM's Watson platform can determine shipping container damage.
Watson combines visual and systems-based data to track, report and make
recommendations in real-time.
In past years, stock managers relied extensively on primary methods to evaluate
and forecast inventory. By combining big data and machine learning, better
forecasting techniques have been implemented (an improvement of 20 to 30%
over traditional forecasting tools). In terms of sales, this means an increase of 2 to 3%
due to the potential reduction in inventory costs.
Deep Learning
Reinforcement Learning
● Q-learning
● Deep Q network
● State-Action-Reward-State-Action (SARSA)
● Deep Deterministic Policy Gradient (DDPG)
At that time, Under Armour had all of the 'must have' HR technology in place, such
as transactional solutions for sourcing, applying, tracking and onboarding, but those
tools weren't useful enough. Under Armour chose HireVue, an AI provider of
HR solutions, for both on-demand and live interviews. The results were striking:
they managed to decrease the time to fill by 35%. In return, they hired higher-quality
staff.
Apart from the three examples above, AI is widely used in other sectors/industries.
Figure: Artificial Intelligence, Machine Learning (ML) and Deep Learning (DL)
In the table below, we summarize the difference between machine learning and
deep learning.
With machine learning, you need less data to train the algorithm than with deep
learning. Deep learning requires an extensive and diverse set of data to identify the
underlying structure. Besides, machine learning models train faster.
The most advanced deep learning architectures can take days to a week to train. The
advantage of deep learning over machine learning is that it is highly accurate. You do
not need to understand what features are the best representation of the data; the
neural network learns how to select critical features. In machine learning, you
need to choose for yourself what features to include in the model.
TensorFlow
The most famous deep learning library in the world is Google's TensorFlow.
Google uses machine learning in all of its products to improve the search
engine, translation, image captioning and recommendations.
To give a concrete example, Google users can experience a faster and more refined
search with AI. If the user types a keyword in the search bar, Google provides a
recommendation about what could be the next word.
Google wants to use machine learning to take advantage of their massive datasets
to give users the best experience. Three different groups use machine learning:
● Researchers
● Data scientists
● Programmers.
They can all use the same toolset to collaborate with each other and improve their
efficiency.
Google does not just have any data; they have the world's most massive computer,
so TensorFlow was built to scale. TensorFlow is a library developed by the Google
Brain Team to accelerate machine learning and deep neural network research.
It was built to run on multiple CPUs or GPUs and even mobile operating systems,
and it has several wrappers in several languages like Python, C++ or Java.
TensorFlow Architecture
It is called TensorFlow because the tensor goes in, flows through a
list of operations, and then comes out the other side.
Development Phase: This is when you train the model. Training is usually done on
your desktop or laptop.
Run Phase or Inference Phase: Once training is done, TensorFlow can be run on
many different platforms. You can run it on
You can train it on multiple machines and then, once you have the trained model,
run it on a different machine.
The model can be trained and used on GPUs as well as CPUs. GPUs were initially
designed for video games. In late 2010, Stanford researchers found that GPUs were
also very good at matrix operations and algebra, which makes them very fast for
doing these kinds of calculations. Deep learning relies on a lot of matrix
multiplication. TensorFlow is very fast at computing matrix multiplication
because it is written in C++. Although it is implemented in C++, TensorFlow can
be accessed and controlled by other languages, mainly Python.
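To see why matrix multiplication matters so much, consider that one layer of a neural network computes its outputs by multiplying a batch of inputs by a weight matrix. The sketch below does this in plain Python to expose the arithmetic; it does not use TensorFlow's API, and the input and weight values are hypothetical. A library such as TensorFlow performs the same operation in optimized native code.

```python
def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both given as lists of rows."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions must match")
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

# Hypothetical batch of 2 inputs with 3 features each, and a 3 x 2 weight matrix.
inputs = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]
weights = [[1.0, 0.0],
           [0.0, 1.0],
           [1.0, 1.0]]

outputs = matmul(inputs, weights)
print(outputs)
```

Every output value is an independent sum of products, which is why this workload parallelizes well on GPUs.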
Finally, a significant feature of TensorFlow is TensorBoard. TensorBoard
makes it possible to monitor graphically and visually what TensorFlow is doing.
PYTHON OVERVIEW
Python is Interactive: You can actually sit at a Python prompt and interact
with the interpreter directly to write your programs.
Python is Object-Oriented: Python supports Object-Oriented style or
technique of programming that encapsulates code within objects.
History of Python
Python was developed by Guido van Rossum in the late eighties and early nineties
at the National Research Institute for Mathematics and Computer Science in the
Netherlands.
Python is derived from many other languages, including ABC, Modula-3, C, C++,
Algol-68, SmallTalk, Unix shell, and other scripting languages.
Python is copyrighted. Python source code is available under the Python Software
Foundation License, an open-source license that is compatible with the GNU
General Public License (GPL).
Python Features
Easy-to-read: Python code is more clearly defined and visible to the eyes.
A broad standard library: The bulk of Python's library is very portable and
cross-platform compatible on UNIX, Windows, and Macintosh.
Interactive Mode: Python has support for an interactive mode which allows
interactive testing and debugging of snippets of code.
Portable: Python can run on a wide variety of hardware platforms and has
the same interface on all platforms.
Apart from the above-mentioned features, Python has a long list of good features;
a few are listed below:
It provides very high-level dynamic data types and supports dynamic type
checking.
It can be easily integrated with C, C++, COM, ActiveX, CORBA, and Java.
ANACONDA NAVIGATOR
The command line program conda is both a package manager and an environment
manager, to help data scientists ensure that each version of each package has all the
dependencies it requires and works correctly.
● Jupyter Notebook
● Qt Console
● Spyder
● VS Code
● Glueviz
● Orange 3 App
● Rodeo
● RStudio
Advanced conda users can also build their own Navigator applications.
How can I run code with Navigator?
The simplest way is with Spyder. From the Navigator Home tab, click Spyder, and
write and execute your code.
You can also use Jupyter Notebooks the same way. Jupyter Notebooks are an
increasingly popular system that combines your code, descriptive text, output,
images and interactive interfaces into a single notebook file that is edited, viewed
and used in a web browser.
What’s new in 1.9?
● Add support for Offline Mode for all environment related actions.
TESTING
● Functional Testing
● Integration Testing
In this machine learning project, the dataset is in Excel sheet format, so any
test case requires checking the Excel file. Classification then works on the
respective columns of the dataset.
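A check of this kind can be sketched as a small function that verifies the expected columns are present before classification runs. The sketch assumes the sheet has been exported to CSV so that only the standard library is needed; the column names (PM2.5, NO2, AQI), the sample sheet and the function name missing_columns are hypothetical and not taken from the project's dataset.

```python
import csv
import io

REQUIRED_COLUMNS = {"PM2.5", "NO2", "AQI"}  # hypothetical column names

def missing_columns(csv_text, required=REQUIRED_COLUMNS):
    """Return the required columns that are absent from the file's header row."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return set(required) - {name.strip() for name in header}

# Hypothetical sheet contents after export to CSV.
sheet = "Date,PM2.5,NO2\n2020-01-01,12.5,30\n"
print(missing_columns(sheet))
```

An empty result means every required column is present; anything else names the columns the test case should report as missing.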
Test Case 1 :