
Contents

Unit I- Introduction of Data Science & R Programming

1. Introduction
1.1 Definition of Data Science
1.2 Importance of Data Science
1.3 Role of Data Scientist
1.4 Tools for Data Science
1.5 Applications of Data Science
1.6 Lifecycle of Data Science
2. Big Data and Data Science hype
2.1 Types of Big Data
2.2 Three Characteristics of Big Data (the 3 Vs)
2.3 Benefits of Big Data
2.4 Big Data Techniques
2.5 Underfitting and Overfitting
2.6 Data Science Hype
3. Statistical Inference, Statistical modeling
4. Probability Distributions
4.1 What is Probability?
4.2 Why Probability is important?
4.3 How to use Probability in Data Science?
4.4 What are probability distributions?
4.5 What are the types of probability distributions?
5. Fitting a model
5.1 Objectives of Model Fitting
5.2 Why are we fitting models to data?
6. Introduction to R
6.1 What is R?
6.2 Why We Choose R for Data Science?
6.3 History of R
6.4 R Features
6.5 How R is different than Other Technologies

6.6 Applications of R Programming
6.7 Why is R Important in Data Science?
6.8 What Makes R Suitable For Data Science?
6.9 Data Science Companies that Use R
7. Exploratory Data Analysis and the Data Science Process
7.1 Exploratory Data Analysis
7.2 Data Science Process
8. Basic tools (plots, graphs and summary statistics) of EDA
8.1 Exploratory data analysis
8.2 Types of Data
9. The Data Science Process - Case Study, RealDirect (online real estate firm)

UNIT- II (Basic Machine Learning Algorithms & Applications)

1. Linear Regression for Machine Learning


2. k-Nearest Neighbors (k-NN)
2.1 Working of KNN Algorithm
2.2 Implementation in Python
2.3 Pros and Cons of KNN
2.4 Applications of KNN
3. k-Means
3.1 Applications
4. Filtering Spam
4.1 What is spam?
4.2 Purpose of Spam
4.3 Spam Life Cycle
4.4 Types of Spam Filters
4.5 Spam Filters Properties
4.6 Bayesian Classification
4.7 Computing the Probability
4.8 How to Design a Spam Filtering System with Machine Learning Algorithm
5. Linear Regression and k-NN for filtering spam
6. Naive Bayes
7. Data Wrangling

8. Feature Generation
8.1 INTRODUCTION
8.2 BACKGROUND
8.3 SYSTEM AND METHODS
9. Feature Selection algorithms
9.1 The Problem the Feature Selection Solves
9.2 Feature Selection Algorithms
9.3 How to Choose a Feature Selection Method for Machine Learning

UNIT- III Mining Social-Network Graphs & Data Visualization

1. Social Networks as Graphs


2. Clustering of graphs
a. Graphic Clustering Methods
b. What kinds of cuts are perfect for drawing clusters in graphs?
3. Direct discovery of communities in graphs
4. Partitioning of graphs
5. Neighborhood properties in graphs
6. Data Visualization, Basic principles
a. Data Visualization
b. History of Data Visualization
c. Why is data visualization important?
d. What makes data visualization effective?
e. Five types of Big Data Visualization groups
7. Common data visualization tools
8. Examples of inspiring (industry) projects
9. Data Science and Ethical Issues

Unit I- Introduction of Data Science & R Programming
1. INTRODUCTION: WHAT IS DATA SCIENCE?
What is Data Science?

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and
systems to extract knowledge and insights from data in various forms, both structured and
unstructured, similar to data mining.

Why Data Science?

• Because organizations have large amounts of data: financial records, reviews, customer data, employee data, and so on.
• You want to keep that data clear and easy to understand so you can act on it; that is why data science is relevant.
• Data analysis lets people make better and faster decisions.

Why is Data Science important?

Every company has data, but its business value depends on how well the company uses that data.

Data science has recently gained importance because it can help companies increase the business value of their available data, allowing them to gain a competitive advantage over their rivals.

It can help us know our customers better, it can help us refine our processes, and it can help us make better decisions. Data, supported by information technology, has become a vital business asset.

Role of Data Scientist

• Data scientists help organizations understand and handle data, and address complex problems using
knowledge from a range of technology niches.

• They typically have backgrounds in computer science, modeling, statistics, analytics and mathematics, combined with a strong business sense.

How to do Data Science?

A typical data science process looks like this, and can be adapted for a specific use case:

• Understand the business
• Collect & explore the data
• Prepare & process the data
• Build & validate the models
• Deploy & monitor the performance

Tools for Data Science

1. R
2. Python
3. SQL
4. Hadoop
5. Tableau
6. Weka

Applications of Data Science

1. Data Science in Healthcare


2. Data Science in E-commerce
3. Data Science in Manufacturing
4. Data Science as Conversational Agents
5. Data Science in Transport

Lifecycle of Data Science

A brief overview of the main phases of the Data Science Lifecycle is shown in Figure 1:

Figure 1: Data Science Lifecycle

Phase 1 - Discovery: It is important to understand the various criteria, requirements, goals and necessary budget before you start the project. You ought to have the courage to ask the right questions. Here, you determine whether you have the resources needed to support the project in terms of people, equipment, time and data. You must also frame the business problem in this phase, and formulate initial hypotheses (IH) for testing.

Phase 2 - Data preparation: You need an analytical sandbox in this phase, in which you can perform analytics for the entire duration of the project. Before modeling, you need to explore, preprocess, and condition the data. In addition, you perform ETLT (extract, transform, load, transform) to bring data into the sandbox. The flow of statistical analysis is shown in Figure 2 below.

Figure 2: Statistical analysis flow of data preparation

R may be used for data cleaning, transformation, and visualization. It will help you identify outliers and establish relationships between the variables. Once the data has been cleaned and prepared, it is time to do some exploratory analytics on it.

Phase 3 - Model planning: Here you decide the methods and techniques for drawing the relationships between variables. These relationships will set the basis for the algorithms you will implement in the next phase. You will apply Exploratory Data Analysis (EDA) using statistical formulas and visualization tools.

Let’s have a look at various model planning tools in Figure 3.

Figure 3: Common Tools for Model planning

1. R has a full range of modeling capabilities and offers a strong setting for interpretive model building.

2. SQL Analysis Services can perform in-database analytics using data mining functions and basic predictive models.

3. SAS/ACCESS can be used to access data from Hadoop, and is used to construct repeatable and reusable model flow diagrams.

While there are many tools on the market, R is the most commonly used.

Now that you have insights into the nature of your data and have chosen the algorithms to use, in the next phase you will apply those algorithms and build a model.

Phase 4 - Model building: You will design datasets for training and testing purposes during this phase. You should decide whether your current resources are adequate to run the models, or whether you need a more robust environment (such as fast and parallel processing). To build the model, you will examine different learning techniques such as classification, association, and clustering.

Model building can be done using the following methods shown in Figure 4.

Figure 4: Common Tools for Model Building.

Phase 5 - Operationalize: You deliver final reports, presentations, code and technical documents during this phase. In addition, a pilot project is sometimes run in a real-time production environment. This gives you a clear, small-scale picture of the performance and other related constraints before full deployment.

Phase 6 - Communicating results: Now it is important to evaluate whether you achieved the goal you set in the first phase. In this last phase, you identify all the key findings, communicate them to the stakeholders, and determine whether the project's results are a success or a failure based on the criteria developed in Phase 1.

2. BIG DATA AND DATA SCIENCE HYPE


What does "Big Data" mean?

Big Data is a term used to describe a massive volume of both structured and unstructured data that is so large it is difficult to process using traditional database and software techniques. In most enterprise scenarios the volume of data is too big, it moves too fast, or it exceeds current processing capacity. Working with Big Data involves:

(1) Collecting data in significant quantities, via machines, sensors, people and events.

(2) Doing something with it: making decisions, testing hypotheses, gaining insight, forecasting the future.

Types of Big Data

There are three types of data behind Big Data: structured, semi-structured, and unstructured, shown in Figure 5. There is a lot of useful knowledge in each category that you can mine for use in various projects.

Figure 5: Types of Big Data

• Structured

By structured data, we mean data that can be processed, stored, and retrieved in a fixed format. It refers to highly organized information that can be readily and seamlessly stored in and accessed from a database by simple search engine algorithms. For instance, the employee table in a company database will be structured, as the employee details, their job positions, their salaries, etc., will be present in an organized manner.

• Unstructured

Unstructured data refers to data that does not have any particular form or structure. This makes it very difficult and time-consuming to process and analyze unstructured data. Email is an example of unstructured data.

• Semi-structured

Semi-structured data contains both of the formats mentioned above, that is, structured and unstructured data. To be precise, it refers to data that, although it has not been classified under a particular repository (database), nevertheless contains vital information or tags that separate individual elements within the data.

Big Data also draws on several sources and often comes in different forms, and it is not always easy to know how to combine all of the tools needed to work with these different types.
Three Characteristics of Big Data (the 3 Vs)
1. Volume - Data quantity
2. Velocity - Data Speed
3. Variety- Data Types

1. Volume
A typical PC might have had 10 gigabytes of storage in 2000. Today, Facebook ingests about 500 terabytes of new data every day, and a Boeing 737 can generate 240 terabytes of flight data on a single trip across the US. Smartphones, the data they create and consume, and sensors embedded in everyday objects will soon produce billions of new, constantly updated data feeds containing environmental, location, video and other information.
2. Velocity
• Clickstreams and ad impressions capture user behavior at millions of events per second.
• High-frequency stock trading algorithms reflect market changes within microseconds.
• Machine-to-machine processes exchange data between billions of devices.
• Infrastructure and sensors generate massive log data in real time.
• Online gaming systems support millions of concurrent users, each producing multiple inputs per second.
3. Variety
• Big data is not just numbers, dates, and strings. Big data is also geospatial data, 3D data, audio and video, and unstructured text, including log files and social media.
• Traditional database systems were designed to handle smaller volumes of structured data, fewer updates, and a predictable, consistent data structure.
• Big Data analysis involves data of many different kinds.

Benefits of Big Data

The ability to process Big Data brings several benefits, such as:

1. Organizations can use outside intelligence while making decisions

Access to social data from search engines and sites like Facebook and Twitter enables organizations to fine-tune their business strategies.

2. Improved customer service

Traditional customer feedback systems are being replaced by new systems designed with Big Data technologies. In these new systems, Big Data and natural language processing technologies are used to read and evaluate consumer responses.

3. Early identification of risk to the product/services, if any

4. Better operational efficiency

Big Data technologies can be used to create a staging area or landing zone for new data before identifying which data should be moved to the data warehouse. In addition, such integration of Big Data technologies and the data warehouse helps an organization to offload infrequently accessed data.

Big Data Tools for Data Analysis
1) Apache Hadoop
2) CDH (Cloudera Distribution for Hadoop)
3) Cassandra
4) Knime
5) Datawrapper
6) MongoDB
7) Lumify
8) HPCC
9) Storm
10) Apache SAMOA
11) Talend
12) Rapidminer
13) Qubole
14) Tableau
15) R

Big Data Techniques

Six big data analysis techniques


Big data is defined by the three Vs: the vast volume of data, the velocity at which it is processed, and the broad variety of the data. Because of the second descriptor, velocity, data analytics has extended into the technical fields of machine learning and artificial intelligence. In addition to these emerging computer-driven analytical methods, analyses still often rely on conventional statistical methods. Essentially, data analysis within an enterprise works in two ways: streaming analysis of data as it arrives, and batch analysis of data as it accumulates, in order to look for behavioural patterns and trends.

The more informative data becomes in its size, scope and depth, the more innovation it drives.

1. A/B testing

2. Data fusion and data integration

3. Data mining

4. Machine learning

5. Natural language processing (NLP).

6. Statistics.

Underfitting and Overfitting

Machine learning uses data to create a "model" and uses the model to make predictions, for example:

• Customers who are women over age 20 are likely to respond to an advertisement
• Students with good grades are predicted to do well on the SAT
• The temperature of a city can be estimated as the average of its nearby cities, unless some of the cities are on the coast or in the mountains

• Underfitting
The model used for predictions is too simplistic, for example:
• 60% of men and 70% of women responded to an advertisement, therefore all future ads should go to women
• If a furniture item has four legs and a flat top it is a dining room table
• The temperature of a city can be estimated as the average of its nearby cities

• Overfitting
The model used for predictions is too specific, for example:
• The best targets for an advertisement are married women between 25 and 27 years with short black hair, one child, and one pet dog
• If a furniture item has four 100 cm legs with decoration and a flat polished wooden top with rounded edges then it is a dining room table

A short sketch contrasting the two cases follows below.
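To make the contrast concrete, here is a minimal sketch (not from the original notes) that fits polynomials of increasing degree to noisy data; the sine-shaped data, the chosen degrees, and the mean-squared-error metric are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)   # noisy training data
x_new = np.linspace(0, 1, 200)
y_true = np.sin(2 * np.pi * x_new)                               # the underlying "truth"

for degree in (1, 3, 12):                 # too simple, reasonable, too complex
    coefs = np.polyfit(x, y, degree)      # fit a polynomial model of this degree
    train_mse = np.mean((np.polyval(coefs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_new) - y_true) ** 2)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")

# Degree 1 underfits (high error everywhere); degree 12 overfits
# (very low training error, but worse error on new data).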

Data Science Hype

The noise around AI, data science, machine learning and deep learning is reaching fever pitch. As this noise has grown, our industry has experienced a shift in what people mean when they say "AI," "machine learning" or "data science." It can be argued that our industry lacks a clear taxonomy, and if one exists, then we as data science professionals have not done a very good job of adhering to it. This has consequences. Two of them are a hype bubble that creates unrealistic expectations, and a growing inability to communicate, especially with colleagues outside data science. This section gives concise definitions and then argues why they matter.

Concise Definitions

Data Science: a discipline that produces predictions and explanations using code and data to create models that are put into action.
Machine Learning: a class of algorithms or techniques to capture complex data patterns in
model form automatically.
Deep learning: A class of machine learning algorithms that uses more than one hidden layer of
neural networks.
AI: a group of systems functioning in a manner comparable to humans in both degree of
autonomy and reach.

Hype

There is a lot of star power in these words. They encourage people to dream and envision a better future, which leads to their overuse. More buzz around our industry raises the tide that lifts all boats, right? Sure, we all hope the tide keeps rising. Yet we should aim for sustainable growth and avoid a hype bubble that, if it bursts, would cause widespread disillusionment.

Many leaders are asking for guidance on how to help executives, mid-level managers and even budding data scientists hold realistic expectations of data science initiatives without losing enthusiasm for data science. Unrealistic expectations slow progress by deflating excitement when projects yield less-than-utopian results.

A major cause of this hype has been the constant overuse of "AI" when referring to any solution
which allows some kind of prediction. Owing to constant overuse, people automatically equate data
science ventures with near-perfect autonomous human-like solutions. Or, at the very least, people
believe that data science can easily solve their particular predictive need, without questioning
whether their organizational data supports such a concept.

Communication

Imprecise words clutter up conversations. This can be especially detrimental when a cross-functional team assembles to align on priorities and shape the end solution in the early planning phases of a data science project.

I know a data science manager who insists that his data science team be practically locked in a room with the business stakeholders for an hour before he approves any new data science project. Okay, the door isn't actually locked, but it is shut, and for a full hour he wants them to discuss the project. By concentrating on early communication with business stakeholders, they saw a reduction in project rework. Describing data science concepts is hard enough as it is; we only make it harder if we cannot define our own terms.

Since AI and deep learning have come onto the scene, discussions constantly have to pause so that questions can be asked to figure out what people actually mean by those words. For starters, how would you interpret these statements from conversations?

• "Our goal is to make our technology AI-driven within 5 years."
• "We need to improve machine learning before we invest in deep learning."
• "We use AI to predict fraud so that our customers can spend with confidence."
• "Our research showed that AI-investing organizations had a 10 percent increase in revenue."

The most common term confusion is when someone talks about AI solutions, or about doing AI, when they should actually be talking about building a machine learning or deep learning model. All too often the word swap seems deliberate, with the speaker hoping to get a hype boost by saying "AI." Let's walk through each of the definitions and see if we can agree on a taxonomy.

Data Science

First of all, I see data science as a scientific discipline, just like any other academic discipline. Take biology, for example. Biology involves a variety of concepts, theories, processes and instruments, and experimentation is routine. The biology research community continuously contributes to the discipline's knowledge base. Data science is no different: practitioners apply the science, and researchers move the field forward with new hypotheses, principles and methods.

The practice of data science involves marrying code (usually in some mathematical programming language) with data to build models. This includes the essential and often dominant first steps of obtaining, cleaning, and preparing data. Data science models generally make predictions (e.g., predicting loan risk, predicting a disease diagnosis, predicting how to respond in a conversation, predicting which objects are in an image).

Data science models may also explain or characterize the world for us (e.g., which combination of variables is most important in making a disease diagnosis, or which consumers are most similar to one another and how). Eventually, these models are put into action, making predictions and explanations when applied to new data. Data science is a discipline that produces predictions and explanations using code and data to create models that are put into action.

A definition of data science can be difficult to formulate while also distinguishing it from statistical analysis. I came to the data science profession through educational training in math and statistics and professional experience as a statistician. Like many of you, I was doing data science before it became a thing.

Statistical analysis is focused on samples, experimental conditions, probabilities and distributions. It typically addresses questions about the likelihood of events or the validity of claims. It uses various techniques such as t-tests, chi-square tests, ANOVA, design of experiments (DOE), response surface designs, etc. Often these techniques create models too. For example, response surface designs are techniques for estimating a polynomial model of a physical system based on observed explanatory factors and how they contribute to the response factor.

In my interpretation, one important point is that data science models are applied to new data to make future predictions and explanations, or "put into production." Although it is true that response surface models can be used to predict a response on new data, it is typically a hypothetical prediction of what would happen if the inputs were changed. The engineers then adjust the inputs and analyze the responses the physical system produces in its new configuration. The response surface model is not put into production; it does not take thousands of new input settings, in batches or streams over time, and predict responses.

This concept of data science is by no means foolproof but it starts capturing the essence of data
science by bringing predictive and descriptive models into action.

Machine Learning

Machine learning as a term dates back to the 1950s. Today, data scientists see it as a collection of techniques used within data science. It is a toolset, or a class of techniques, used to construct the models mentioned above. Machine learning lets computers create (or learn) models on their own, rather than a person directly articulating the logic of a model. This is achieved by analyzing an initial collection of data, finding complex hidden patterns in that data, and storing those patterns in a model so that they can later be applied to new data to make predictions or interpretations.

The magic behind this automated pattern-discovery process lies in the algorithms. Algorithms are the workhorses of machine learning. Popular machine learning algorithms include the various types of neural networks, clustering strategies, gradient boosting machines, random forests and many more. If data science is a discipline like biology, then machine learning is like microscopy or genetic engineering: a set of methods and techniques used to practice the discipline.

Deep Learning

Deep learning has the simplest definition of these terms. Deep learning is a class of machine learning algorithms that employs more than one hidden layer of neural networks. Neural networks themselves date back to the 1950s. Deep learning algorithms became popular in the 1980s, went through a lull in the 1990s and 2000s, and then saw a resurgence in the current decade thanks to fairly minor changes in how deep networks are designed that proved to have incredible impact. Deep learning can be applied to a wide range of applications, including image recognition, chat assistants, and recommender systems. For example, Google Voice, Google Photos, and Google Search are some of the original solutions built using deep learning.

AI

AI has been around for a long time, long before the recent hype storm that has co-opted it as a buzzword. How do we, as data scientists, define it? When and how should we use it? What is AI to us? Honestly, I'm not sure anyone really knows. This might be our "emperor has no clothes" moment. We have the ambiguity and the resulting hype that comes from the promise of something new and unknown. The CEO of a well-known data science company was recently talking with our team at Domino when he mentioned "AI." He immediately caught himself and said, "I know that doesn't really mean anything. I just had to start using it because everyone is talking about it. I resisted for a long time but finally gave in."

That said, I'm going to take a stab at it: AI is a category of systems that people aim to build whose distinguishing characteristic is that they are comparable to humans in their degree of autonomy and scope of activity.

To extend our analogy: if data science is like biology and machine learning is like genetic engineering, then AI is like disease resistance. It is the end product, a set of solutions or systems we seek to build by applying machine learning (often deep learning) and other techniques.

Here is the bottom line. I think we need to distinguish between techniques that are part of AI solutions, AI-like solutions, and actual AI solutions; that is, AI building blocks, solutions with AI-ish qualities, and solutions that approach human autonomy and scope. These are three separate things, and people say "AI" for all three far too often.

For example,
• Deep learning is not AI. It is a technique that can be used as part of an AI solution.
• Most data science projects are not AI solutions. A customer churn model is not an AI solution, whether it used deep learning or logistic regression.
• A self-driving car is an AI solution. It is a solution that operates with complexity and autonomy approaching what humans are capable of.

Gartner’s 2019 Hype Cycle

Figure 6: Gartner’s Hype Cycle

Gartner's 2019 Hype Cycle for Emerging Technologies is out, as shown in Figure 6, so it is a good moment to take a deep look at the report and reflect on our AI strategy as a company.

First and foremost, before going into depth about the report's content and its consequences for the AI strategy of companies, I would like to address a recurring comment that I have seen on social networks these last few days. Several people were surprised to see some technologies totally missing from the study despite appearing in previous years. As Gartner states in its study, the Hype Cycle covers a wide variety of topics, and if a specific technology is not highlighted, it does not automatically mean that it is not relevant; quite the opposite. Some technologies disappear from the Hype Cycle because they are no longer "emerging" but essential to business and IT.

So let's start by reviewing the fundamental AI-related technologies that were omitted from the report this year but are still important for business:

• Deep neural networks (DNNs). Gartner still treats DNNs as the basis for many other emerging technologies included in the Hype Cycle.
• Conversational AI platforms. Gartner no longer considers conversational AI platforms an emerging technology, though it stresses their importance to the market.
• Virtual assistants. Gartner no longer considers virtual assistants an emerging technology, though it stresses their importance to industry.
• Artificial General Intelligence (AGI). In my opinion, a good call by Gartner to favor a pragmatic vision of AI, moving away from hype. As Gartner mentions, AGI will not be mature for decades.

According to Gartner, which AI areas should be the focus for company leaders? Based on the 2019 Emerging Technologies Priority List, those will be:

• Augmented Intelligence. Gartner recognizes this evolving technology as key to the design strategy of new business technologies, combining short-term automation with a mid-/long-term strategy that ensures quality improvement not only through automation but also by growing human talent.

• Edge AI. For situations where communication costs, latency or high-volume ingestion are critical. This, of course, means ensuring that the right AI technologies and techniques for our use case (e.g., deep learning) are available for the IoT infrastructure we want to deploy, among other conditions.

Finally, these are the new AI-related emerging technologies in the 2019 Hype Cycle:

• Adaptive ML
• Emotion AI
• Explainable AI
• Generative Adversarial Networks (GANs)
• Transfer Learning

3. STATISTICAL INFERENCE, STATISTICAL MODELING

STATISTICAL INFERENCE

Inferential Statistics

Inferential statistics allows you to make inferences about the population from the sample data shown
in Figure 7.

Figure 7: Inferential statistics

Population & Sample

A sample is a representative subset of a population. Carrying out a full population census is ideal but usually unrealistic, and sampling is much more practical, though it is vulnerable to sampling error. A sample that is not representative of the population is called biased, and the approach that produced it is called sampling bias. The main forms of sampling bias are convenience bias, judgment bias, size bias, and response bias. Randomisation is the best method to reduce sampling bias. Simple random sampling is the simplest randomisation method; other systematic sampling techniques include cluster sampling and stratified sampling.

Sampling Distributions

Sample means are distributed more and more normally around the true mean (the population parameter) as we increase the sample size. The variation of the sample means decreases as the sample size increases.

Central Limit Theorem

The Central Limit Theorem helps us understand the following facts, whether or not the population distribution is normal:

1. The mean of the sample means is the same as the population mean.
2. The standard deviation of the sample means is equal to the standard error of the mean.
3. The distribution of sample means becomes increasingly normal as the sample size increases.

A small simulation illustrating these facts follows below.
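As a quick illustration, the following minimal simulation (my own sketch, not part of the original notes) draws repeated samples from a deliberately skewed population; the exponential population and the sample sizes are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
population = rng.exponential(scale=2.0, size=100_000)   # clearly non-normal population

for n in (5, 30, 200):
    # 10,000 samples of size n, and the mean of each sample
    sample_means = rng.choice(population, size=(10_000, n)).mean(axis=1)
    print(f"n={n:3d}  mean of sample means={sample_means.mean():.3f}  "
          f"std of sample means={sample_means.std():.3f}  "
          f"standard error sigma/sqrt(n)={population.std() / np.sqrt(n):.3f}")

# The mean of the sample means stays near the population mean, their spread
# shrinks like sigma / sqrt(n), and their histogram looks increasingly normal.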

Confidence Intervals

A sample mean can be referred to as a point estimate of a population mean. A confidence interval is always centered around the mean of your sample. To construct the interval, you add a margin of error. The margin of error is found by multiplying the standard error of the mean by the z-score of the chosen confidence level, as shown in Figure 8:

Figure 8: Confidence intervals graph

The confidence level represents the number of times out of 100 that the population mean would fall within the specified interval around the sample mean.
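A minimal sketch of this recipe (point estimate plus or minus z times the standard error) is shown below; the sample values are made up and SciPy is an assumed choice, not named in the notes.

import numpy as np
from scipy import stats

sample = np.array([98.2, 99.1, 97.8, 98.6, 98.9, 99.4, 98.1, 98.7])
mean = sample.mean()                                   # point estimate
std_err = sample.std(ddof=1) / np.sqrt(sample.size)    # standard error of the mean

z = stats.norm.ppf(0.975)                              # z-score for 95% confidence
margin_of_error = z * std_err
print(f"point estimate: {mean:.2f}")
print(f"95% CI: ({mean - margin_of_error:.2f}, {mean + margin_of_error:.2f})")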

Hypothesis Testing

Hypothesis testing is a kind of statistical inference that involves asking a question, collecting data, and then analysing what the data tells us about how to proceed. The default hypothesis is called the null hypothesis and is given the symbol H0. We test the null hypothesis against an alternative hypothesis, which is given the symbol Ha, as shown in Figure 9.

When testing a hypothesis we have to determine how large a difference between means is required to reject the null hypothesis. For their hypothesis test, statisticians first select a significance level, or alpha level (α).

Critical values mark the edge of the critical region. The critical region comprises all values for which you reject the null hypothesis.

Figure 10: left, right & two-tailed tests

These are the four basic steps we follow for (one & two group means) hypothesis testing:

1. State the null and alternative hypotheses.

2. Select the appropriate significance level and check the test assumptions.

3. Analyse the data and compute the test statistic.

4. Interpret the result.

Hypothesis Testing (One and Two Group Means)

Hypothesis Test on One Sample Mean When the Population Parameters are Known

We find the z-statistic of our sample mean in the sampling distribution and determine whether that z-score falls within the critical (rejection) region. This test is only appropriate when you know the true mean and standard deviation of the population.
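Below is a minimal one-sample z-test sketch following the steps above; the population parameters, the sample values, and the use of SciPy are illustrative assumptions.

import numpy as np
from scipy import stats

pop_mean, pop_std = 100.0, 15.0      # known population parameters
sample_mean, n = 104.5, 36           # observed sample mean and size
alpha = 0.05

z = (sample_mean - pop_mean) / (pop_std / np.sqrt(n))    # z-statistic
p_value = 2 * (1 - stats.norm.cdf(abs(z)))               # two-tailed p-value
print(f"z = {z:.2f}, p = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")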

Hypothesis Tests When You Don’t Know Your Population Parameters

The Student’s t-distribution is similar to the normal distribution, except it is more spread out and
wider in appearance, and has thicker tails. The differences between the t-distribution and the normal
distribution are more exaggerated when there are fewer data points, and therefore fewer degrees of
freedom shown in Figure 11.

Figure 11: Distribution graph

Estimation as a follow-up to a Hypothesis Test

When a hypothesis is rejected, it is often useful to turn to estimation to try to capture the true value of
the population mean.

Two-Sample T Tests

Independent vs Dependent Samples

When we have independent samples, we assume that the scores in one sample do not affect those in the other; these are compared with an unpaired t-test.

In two dependent samples, each score in one sample is paired with a specific score in the other sample; these are compared with a paired t-test.
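A minimal sketch of both cases with SciPy (an assumed choice; the scores are made up) is shown below.

from scipy import stats

# Independent (unpaired) samples: two different groups of subjects
group_a = [23, 25, 28, 30, 27, 26]
group_b = [31, 29, 34, 33, 30, 32]
t_ind, p_ind = stats.ttest_ind(group_a, group_b)     # unpaired t-test
print(f"unpaired: t={t_ind:.2f}, p={p_ind:.4f}")

# Dependent (paired) samples: the same subjects measured before and after
before = [140, 152, 138, 145, 150]
after = [135, 147, 136, 140, 144]
t_rel, p_rel = stats.ttest_rel(before, after)        # paired t-test
print(f"paired:   t={t_rel:.2f}, p={p_rel:.4f}")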

Hypothesis Testing (Categorical Data)

Chi-square test is used for categorical data and it can be used to estimate how closely the distribution
of a categorical variable matches an expected distribution (the goodness-of-fit test), or to estimate
whether two categorical variables are independent of one another (the test of independence).

Goodness-of-fit test

degrees of freedom (df) = number of categories (c) − 1

Test of independence

degrees of freedom (df) = (rows − 1)(columns − 1)
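A minimal sketch of both chi-square tests with SciPy (an assumed choice; the counts are made up) follows.

from scipy import stats

# Goodness of fit: is a six-sided die fair?  df = 6 categories - 1 = 5
observed = [18, 22, 16, 14, 12, 18]                  # expected counts default to equal
chi2, p = stats.chisquare(observed)
print(f"goodness of fit: chi2={chi2:.2f}, p={p:.3f}")

# Test of independence on a 2x3 contingency table: df = (2-1)*(3-1) = 2
table = [[30, 10, 20],
         [25, 15, 30]]
chi2, p, df, expected = stats.chi2_contingency(table)
print(f"independence: chi2={chi2:.2f}, df={df}, p={p:.3f}")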

Hypothesis Testing (More Than Two Group Means)

Analysis of Variance (ANOVA) allows us to test the hypothesis that multiple population means are equal. We could conduct a series of t-tests instead of ANOVA, but that would be tedious and would inflate the overall error rate as the number of comparisons grows.

We follow a series of steps to perform ANOVA:

1. Calculate the total sum of squares (SST )

2. Calculate the sum of squares between (SSB)

3. Find the sum of squares within groups (SSW) by subtracting SSB from SST

4. Next solve for degrees of freedom for the test

5. Using the values, you can now calculate the Mean Squares Between (MSB) and Mean Squares
Within (MSW ) using the relationships below

6. Finally, calculate the F statistic using the following ratio

7. It is easy to fill in the Table from here — and also to see that once the SS and df are filled in, the
remaining values in the table for MS and F are simple calculations

8. Find F critical

ANOVA formulas

If the F-value from the ANOVA test is greater than the F-critical value, we reject the null hypothesis.
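As an illustration, here is a minimal one-way ANOVA sketch using SciPy (an assumed choice) that computes the F statistic and p-value directly; the three groups of scores are made up.

from scipy import stats

group1 = [85, 86, 88, 75, 78, 94]
group2 = [91, 92, 93, 85, 87, 84]
group3 = [79, 78, 88, 94, 92, 85]

f_stat, p_value = stats.f_oneway(group1, group2, group3)   # one-way ANOVA
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
# If F exceeds the critical F value (equivalently, p < alpha),
# reject the null hypothesis that all group means are equal.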

One-Way ANOVA

One-way ANOVA is the procedure for testing the null hypothesis that the population means across the groups of a single independent variable are equal.

Two-Way ANOVA

Two-way ANOVA is the procedure for testing the null hypothesis that the population means across the groups of two independent variables are equal. With this method, we can study not only the effect of each independent variable but also the interaction between them.

We could also run two separate one-way ANOVAs, but two-way ANOVA gives us efficiency, control, and the interaction effect.

Quantitative Data (Correlation & Regression)

Correlation

Correlation refers to a mutual relationship or association between quantitative variables. It can help in predicting one quantity from another, and it can suggest (though not prove) the presence of a causal relationship. It is used as a basic quantity and foundation for many other modeling techniques.

Pearson Correlation

Figure 12: Pearson Correlation
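A minimal Pearson correlation sketch is shown below; the paired measurements are made up for illustration.

import numpy as np

hours_studied = np.array([1, 2, 3, 4, 5, 6, 7, 8])
exam_score = np.array([52, 55, 61, 64, 70, 72, 79, 83])

r = np.corrcoef(hours_studied, exam_score)[0, 1]    # Pearson correlation coefficient
print(f"Pearson r = {r:.3f}")                       # close to +1: strong positive association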

Regression

Regression analysis is a set of statistical processes for estimating the relationships among variables
shown in Figure 13.

Figure 13: Regression

Simple Regression

This method uses a single independent variable to predict a dependent variable by fitting the best
relationship shown in Figure 13.

Figure 13: Simple Regression

Multiple Regression

This method uses more than one independent variable to predict a dependent variable by fitting the
best relationship shown in Figure 14.

Figure 14: Multiple Regression

Multiple regression works best when multicollinearity is absent. Multicollinearity is a phenomenon in which two or more predictor variables are highly correlated.
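Here is a minimal sketch of simple and multiple linear regression; scikit-learn is an assumed choice (the notes do not prescribe a library) and the synthetic data are illustrative.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))                    # two predictor variables
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + 4.0 + rng.normal(scale=1.0, size=50)

simple = LinearRegression().fit(X[:, [0]], y)           # simple regression: one predictor
multiple = LinearRegression().fit(X, y)                 # multiple regression: both predictors
print("simple coefficients:  ", simple.coef_, simple.intercept_)
print("multiple coefficients:", multiple.coef_, multiple.intercept_)
print("R^2 of multiple model:", multiple.score(X, y))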

Nonlinear Regression

In this method, observational data are modelled by a function which is a nonlinear combination of the
model parameters and depends on one or more independent variables shown in Figure 14.

Figure 14: Nonlinear regression

Significance in Data Science

In data science, inferential statistics is used in many ways, for example:

• Making inferences about the population from a sample.
• Concluding whether a sample is significantly different from the population.
• Determining whether adding or removing a feature will really help to improve a model.
• Determining whether one model is significantly better than another.
• Hypothesis testing in general.

4. PROBABILITY DISTRIBUTIONS

What is Probability?

Probability is the chance that something will happen — how likely it is that some event will happen.

Figure 15: Probability

The probability of an event E happening is P(E) = n(E) / n(T), where n(E) is the number of ways the event can happen and n(T) is the total number of possible outcomes.

Probability is the measure of the likelihood that an event will occur. It is quantified as a number between 0 and 1, where 0 indicates impossibility and 1 indicates certainty, as shown in Figure 15.

Why Probability is important?

Uncertainty and randomness exist in many areas of our everyday lives, and a strong knowledge of probability helps us make sense of these uncertainties. Knowing about probability allows us to make informed judgments about what is likely to happen, based on a pattern of previously collected data or by estimation.

How to use Probability in Data Science?

Data science uses statistical inference to predict or analyze trends from data, and statistical inference uses probability distributions of data. Therefore, knowing probability and its applications is important for working effectively on data science problems.

What is Conditional Probability?

Conditional probability is a measure of the probability of an event (some particular situation occurring), given that another event has occurred.

The probability of event B given event A equals the probability of event A and event B divided by the probability of event A:

P(B | A) = P(A and B) / P(A)

How is conditional probability used in data science?

Many techniques in data science (e.g., Naive Bayes) depend on Bayes' theorem. Bayes' theorem is a formula that describes how to update the probabilities of hypotheses when given evidence.

Using Bayes' theorem, it is possible to build a learner that predicts the probability that the response variable belongs to some class, given a new set of attributes.
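A minimal Bayes' theorem sketch is shown below, updating the probability that a message is spam given that it contains a particular word; all probabilities are made-up illustrations.

p_spam = 0.20                    # prior P(spam)
p_word_given_spam = 0.60         # likelihood P(word | spam)
p_word_given_ham = 0.05          # likelihood P(word | not spam)

# Total probability of seeing the word
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' theorem: posterior P(spam | word)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(f"P(spam | word) = {p_spam_given_word:.3f}")    # 0.750 with these numbers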

What are random variables?

A random variable is a set of possible values from a random experiment shown in Figure 16.

Figure 16: Random variables

A random variable (random quantity, aleatory variable, or stochastic variable) is a variable whose
possible values are outcomes of a random phenomenon.

Random variables can be discrete or continuous. Discrete random variables can only take certain
values while continuous random variables can take any value (within a range).

What are probability distributions?

For a random variable, the probability distribution determines how probabilities are distributed over the values of the random variable.

For a discrete random variable x, the probability distribution is defined by a probability mass function, denoted f(x), which gives the probability of each possible value of the random variable.

For a continuous random variable, because there are infinitely many values in any interval, probabilities are defined for the variable lying within a given interval. Here the probability distribution is defined by a probability density function, also denoted f(x).

What are the types of probability distributions?

A binomial distribution describes a statistical experiment with the following characteristics: the experiment consists of n repeated trials; each trial can produce only two possible outcomes, one called a success and the other a failure; and the probability of success, denoted by p, is the same for every trial, as shown in Figure 17.

Figure 17: Probability Distributions

The normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetric about the mean, indicating that data near the mean occur more frequently than data far from the mean.

The normal distribution has the following characteristics:

• The normal curve is symmetric about the mean μ;
• The mean is at the center and divides the area into two halves;
• The total area under the curve is equal to 1;
• It is entirely determined by its mean and standard deviation (or variance σ2).

Figure 18: Normal Distribution
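The following minimal sketch samples from both distributions with NumPy; the parameters (n, p, mean, standard deviation) are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)

# Binomial: 10 trials per experiment, success probability 0.3, 10,000 experiments
binomial_draws = rng.binomial(n=10, p=0.3, size=10_000)
print("binomial mean (~ n*p = 3):", binomial_draws.mean())

# Normal (Gaussian): mean 0, standard deviation 1
normal_draws = rng.normal(loc=0.0, scale=1.0, size=10_000)
print("normal mean (~ 0):", normal_draws.mean())
print("share within one sigma (~ 0.68):", np.mean(np.abs(normal_draws) < 1))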

How are random variables & probability distributions used in data science?

Data science uses statistical inference to predict or analyze trends from data, and statistical inference uses probability distributions of data. Therefore, knowing random variables and their probability distributions is important for working effectively on data science problems.

5. FITTING A MODEL

Fitting a model means making your algorithm learn the relationship between the predictors and the outcome so that you can predict future values of the outcome.

The best-fitted model has a particular set of parameters that best describes the question at hand.

Objectives of Model Fitting

There are two main goals for model fitting

1. Make inferences about the relationship between variables in a given data set.

2. Predictions/forecasting of potential events, based on models calculated using historical evidence.

Why are we fitting models to data?

1. Estimate the distributional properties of variables, potentially conditional on other variables.

2. Concisely describe the relationship between the variables and make inferential statements about that relationship.

3. Predict the values of variables of interest from the values of other predictor variables, and characterize the uncertainty of those predictions.

A minimal model-fitting sketch follows below.
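Here is a minimal model-fitting sketch that learns the predictor-outcome relationship on training data and then predicts on held-out data; scikit-learn and the synthetic data are assumptions, not prescribed by the notes.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))                        # predictor
y = 2.5 * X[:, 0] + 1.0 + rng.normal(scale=2.0, size=100)    # outcome

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)             # fit the model on training data
print("fitted parameters:", model.coef_, model.intercept_)
print("R^2 on unseen data:", model.score(X_test, y_test))    # how well predictions generalize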

6. INTRODUCTION TO R

What is R?

R is a programming language for statistical computing that is commonly used by data miners and statisticians to analyze data. It was created by Ross Ihaka and Robert Gentleman, who released it in 1995 and derived the name 'R' from the first letters of their first names. R is a common choice for statistical computing and graphical techniques in data analytics and data science.

R's CRAN repository holds a collection of over 10,000 packages. These packages target specific statistical applications. R can present a steep learning curve for beginners, even though its syntax is fairly simple to understand, because it is an interactive environment for applying statistical learning. Hence, a user without knowledge of statistics may not be able to get the best out of R.

Why We Choose R for Data Science?

Data Science has emerged as one of the 21st century's most popular fields, because the need to evaluate and create knowledge from data is pressing. Industries are converting raw data into data-driven products, and doing so requires several essential tools to churn through the raw data. R is one of the programming languages that provides an intensive environment for analyzing, transforming, and visualizing data.

For many statisticians who want to get involved in building mathematical models to solve complex problems, R is the primary option. R includes a sea of packages covering all kinds of disciplines such as astronomy, biology, etc. Though R was originally used for academic purposes, it is now also being used in industry.

R is a technical language used for complex mathematical modeling. It also supports array, matrix, and vector operations. R is renowned for its graphical libraries, which allow users to create visual graphs and make them interactive. In addition, R lets its users build web applications.

Beyond this, R provides several options for advanced data analytics, such as building prediction models, machine learning algorithms, etc. R also provides several packages for image processing.

History of R

• The S language, from which R derives, was conceived by John Chambers at Bell Laboratories in 1976. R was created as an implementation of, and extension to, the S programming language.

• Ross Ihaka and Robert Gentleman developed the R project and released its first version in 1995 and a stable beta version in 2000.

R Features

There are essential R features that we will explore in order to understand R's role in data science, shown in Figure 19.

Figure 19: Features of R

1. Open source – R is an open-source platform that allows you to access and change the code, and even create your own libraries. It is also free to download.

2. A complete language – Although R is commonly regarded as a statistical programming language, it also contains many features of an object-oriented programming language.

3. Analytical support – With R, through its wide variety of support libraries, you can perform analytical operations. You can clean, arrange, analyze, and visualize your data, and construct predictive models too.

4. Help extensions – R allows developers to write their own libraries and packages as distributed add-ons, and to publish those packages. This makes R a developer-friendly language that allows its users to make changes and updates.

5. Facilitates database connectivity – R has a variety of add-on modules linking it to databases, such as the RODBC package implementing the Open DataBase Connectivity (ODBC) protocol and the ROracle package allowing connectivity with Oracle databases. R also has MySQL support through the RMySQL package.

6. Extensive community participation – R has an active community, further strengthened by the fact that R is an open-source programming language. For many, this makes R a perfect choice. R communities offer various boot camps and seminars around the world.

7. Simple and easy to understand – Although many would argue that R presents beginners with a steep learning curve, this is because R is a statistical language; to use R to its fullest, you need statistical experience. R, however, has a syntax that is easy to understand, which helps you to better recall and appreciate R.

How R is Different from Other Technologies

There are certain special features of R programming that set it apart from other technologies:

• Graphical libraries – R stays ahead of the curve with its elegant graphical libraries. Libraries such as ggplot2 and plotly enable attractive, well-defined plots.

• Availability / cost – R is free to use, making it universally accessible.

• Advanced tooling – R supports many advanced tools and features that allow you to build robust statistical models.

• Job scenario – R is a primary Data Science tool, as stated above. With Data Science's exponential growth and rising demand, R has become one of the most in-demand programming languages in the world today.

• Customer and community support – You will experience good community support with R.

• Portability – R is highly portable. Many different programming languages and software frameworks can easily be combined with the R environment for best results.

Applications of R Programming

• R is used in the finance and banking industries for fraud prevention, reducing customer churn, and supporting decision making.

• R is also used in bioinformatics to analyze genetic sequences, carry out drug discovery, and support computational neuroscience.

• R is used to discover potential consumers for online ads through social media analysis. Organizations often use insights from social media to evaluate consumer sentiment and improve their products.

• E-commerce firms use R to evaluate consumer transactions and their feedback.

• Manufacturing companies use R to analyze customer feedback. They also use it to forecast future demand so they can adjust production rates and increase profits.

Why is R Important in Data Science?

Several of R's essential data science capabilities are:

1. R includes many essential data wrangling packages, such as dplyr, purrr, readxl, googlesheets, datapasta, jsonlite, tidyquant, tidyr, etc.
2. R facilitates robust statistical modeling. Given that data science is statistics-heavy, R is an ideal tool for carrying out various statistical operations.

3. R is an appealing tool for various data science applications because it offers aesthetic visualization packages such as ggplot2, scatterplot3d, lattice, highcharter, etc.

4. R is used widely in ETL (Extract, Transform, Load) data science applications. It offers interfaces to various databases such as SQL and even spreadsheets.

5. Another essential capability of R is interfacing with NoSQL databases and analyzing unstructured data. This is particularly useful in data science applications where a large collection of data needs to be analyzed.

6. Data scientists can use machine learning algorithms in R to gain insights into future events. Various packages are available, such as rpart, caret, randomForest, and nnet.

What Makes R Suitable For Data Science?

R is one of the most popular choices among data scientists. Following are some of the key reasons why they use R:

1. R has been reliable and useful in academia for many years. R was generally used in academia for research purposes, as it offered various tools for statistical analysis. With advancements in data science and the need to analyze data, R has also become a common option in industry.

2. R is an ideal tool when it comes to wrangling data. It provides many preprocessing packages that make it much easier to wrangle the data. This is one of the key reasons why R is favored in the data science community.

3. R offers its popular ggplot2 package, which is best known for its visualizations. ggplot2 offers aesthetic visualizations that are appropriate for all data operations. In addition, ggplot2 provides users with a degree of interactivity so they can understand the data contained in the visualization more clearly.

4. R includes packages for various machine learning operations. Whether it is boosting, constructing random forests, or performing regression and classification, R offers a wide variety of packages.

Data Science Companies that Use R

Some of the major data science companies that use R for analysis and statistical modeling are shown in Figure 20:

Figure 20: Data Science companies that use R

1. Facebook – Facebook uses R extensively for social network analysis. It uses R to gain insights into user behavior and to map relationships between users.
2. Airbnb – Airbnb uses R for its complex day-to-day data operations. It uses the dplyr package to slice and dice the data, the ggplot2 graphics package to visualize results, and the pwr package for various experiments and statistical studies.
3. Uber – Uber makes extensive use of R for its charting components. Shiny, an interactive web application framework built with R, is used to embed interactive graphics.
4. Google – R is a common choice at Google for carrying out many analytical operations. The Google Flu Trends project used R to examine flu-related trends and patterns in searches. In addition, Google's Prediction API uses R to evaluate historical data and make predictions.
5. ANZ – ANZ is one of Australia's biggest banks. It uses R for credit risk analytics, including predicting loan defaults based on clients' transactions and credit scores.
6. Novartis – Novartis is a leading pharmaceutical firm that relies on R for clinical data analysis in FDA submissions.
7. IBM – IBM is one of R's largest investors and recently joined the R Consortium. IBM also uses R to develop various analytical solutions, including within IBM Watson, its open computing platform. Furthermore, IBM supports R projects and helps the R community flourish by making significant contributions.

7. EXPLORATORY DATA ANALYSIS AND THE DATA SCIENCE PROCESS

Exploratory Data Analysis

Exploratory Data Analysis (EDA) refers to the essential initial phase of data analysis in which we identify patterns, find anomalies, test hypotheses, and check assumptions using descriptive statistics and graphical representations. In other words, it is about getting to know the data and extracting as many insights from it as possible. EDA is all about making sense of the data in hand before getting your hands dirty with modeling.

EDA explained using sample data set:

In order to share my interpretation of the definition and the techniques I know, I will take as an example the white-wine version of the Wine Quality data set available on the UCI Machine Learning Repository and try to obtain as many insights from it using EDA as possible.

To start with, I imported required libraries (pandas, numpy, matplotlib, and seaborn for this example)
and loaded the data set.

Note: Any inferences I've been able to draw, with bullet points I described.

 The original data are separated by the delimiter ";" in the given data set.

 To take a closer look at the data, I took the help of the ".head()" function of the pandas library,
which returns the first five observations of the data set. Similarly, ".tail()" returns the last five
observations of the data set.

I found out the total number of rows and columns in the data set using “.shape”.

 The dataset comprises 4898 observations and 12 characteristics.

 One of these is the dependent variable and the remaining 11 are independent variables — the
physico-chemical characteristics.

It is also a good practice to know the columns and their corresponding data types, along with finding
whether they contain null values or not.

 Data has only float and integer values.

 No variable column has null/missing values.

The describe() function in pandas is very handy in getting various summary statistics. This function
returns the count, mean, standard deviation, minimum and maximum values and the quantiles of the
data.
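The steps above can be reproduced with a few lines of pandas. This is a minimal sketch, assuming the semicolon-delimited winequality-white.csv file from the UCI repository (the URL below is an assumption, not taken from the original text):

import pandas as pd

# Load the white wine data; the file is ';' delimited, as noted above
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"
df = pd.read_csv(url, sep=";")

print(df.head())      # first five observations
print(df.tail())      # last five observations
print(df.shape)       # expected (4898, 12)
df.info()             # column names, data types and non-null counts
print(df.describe())  # count, mean, std, min, quartiles, max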

 Here, as you can notice, the mean value is less than the median value of each column, which is
represented by 50% (the 50th percentile) in the index column.

 There is a notably large difference between the 75th percentile and the maximum values of the
predictors "residual sugar", "free sulfur dioxide" and "total sulfur dioxide".

 Thus observations 1 and 2 suggest that there are extreme values (outliers) in our data set.

Few key insights just by looking at dependent variable are as follows:

 Target variable/Dependent variable is discrete and categorical in nature.

 The "quality" score scale ranges from 1 to 10, where 1 is poor and 10 is the best.

 Quality ratings 1, 2 and 10 are not given by any observation; the only scores obtained are between
3 and 9.

 The value counts tell us the number of observations for each quality score in descending order (a
small sketch of how to compute this follows the list).

 “quality” has most values concentrated in the categories 5, 6 and 7.

 Only a few observations were made for the categories 3 and 9.
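A small sketch of how these counts could be obtained, reusing the df DataFrame from the earlier sketch (so it assumes that sketch has already been run):

import seaborn as sns
import matplotlib.pyplot as plt

# Vote count of each quality score, in descending order of frequency
print(df["quality"].value_counts())

# Bar chart of the counts per quality score
sns.countplot(x="quality", data=df)
plt.show()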

Data Science Process

Data Science is a multidimensional discipline that uses scientific methods, techniques and algorithms
to derive information and insights from structured and unstructured data. Much of this work is data
analysis, but it also includes a variety of other data-driven processes.

Data Science represents a multidisciplinary field. This includes the systematic blend of scientific and
statistical methods, procedures, creation of algorithms, and technology to obtain useful data
information.
Yet how do all these dimensions work together? To grasp this, you need to learn the data science
process — the day-to-day life of a data scientist.

Data Science Process – Daily Tasks of Data Scientist

The steps involved in the complete data science process are:

Step One: Ask Questions to Frame the Business Problem

In the first step, seek to get an idea of what the organization wants, and collect data based on it. You
start the data science process by asking the right questions to figure out what the issue is. Let's take a
very common problem for a bag company — the sales issue.

To analyze the issue, you need to begin by asking a lot of questions:

• Who are the target audience and the clients?

• How do you approach the target market?

• What does the selling cycle look like at the moment?

• Which details do you have about the target market?

• How do we classify those clients who are more likely to buy our product?

• You agree to work on the issue after a conversation with the marketing team: "How do we find
potential buyers who are more likely to purchase our product? "

• The next move for you is to find out what all the details you have with you to answer the questions
above.

Stage two: Get Relevant Data for Problem Analysis

• Now that you are aware of your business concern, it's time to collect the data that will help you
solve the problem. Before collecting the data you should ask if the organization already has the
correct data available?

• In certain cases, you can get the previously collected data sets from other investigations. The
following data are required: age, gender, previous customer’s transaction history, etc.

You find that most of the customer-related data is available in the company’s Customer
Relationship Management (CRM) software, managed by the sales team.

• A SQL database with many tables is the typical back end for CRM applications. As you go through
the SQL database, you will find that the system stores detailed customer identification, contact and
demographic details (that they gave the company), as well as their detailed sales history.

• If you do not think the available data is adequate, then plans must be made to collect new data. By
displaying or circulating a feedback form you can also take input from your visitors and customers.
Admittedly, that is a lot of additional work, and it takes time and effort.

• In addition, the data you obtained is 'raw data' containing errors and missing values. And before the
data is analyzed, you need to clean (wrangle) the data.

Step Three: Explore the Data and Make Error Corrections

Exploring the data means cleaning and organizing it. More than 70 per cent of a data scientist's time is
spent on this step. Even after gathering all the data, you are not able to use it right away, because the raw
data you have gathered most likely contains oddities.

First, you need to make sure that the data is error free and clean. It is the most significant step in the cycle
that requires patience and concentration.

For this function specific tools and techniques are used, such as Python, R, SQL, etc.

So you start answering these questions:

• Are the data values missing, i.e. are consumers without their contact numbers?

• Does the data have null values in it? If so, how do you fix them?

• Are there multiple datasets? Is it sensible to fuse the data sets? If so, how can you bring them
together?

When the checks reveal the missing and false values, the data is ready for review. Remember that it is
worse to have incorrect data than to have no data at all.

Stage four. After analyzing the data, you have sufficient knowledge to construct a model to
address the question:

"How can we identify potential customers who are more likely to buy our product?"

In this phase, you analyze the data in order to extract information from it. Analyzing the data
involves the application of various algorithms which will derive meaning from it:

• Build data model to answer the query.

• Test the model against gathered data.

• Use of different visualization software for presenting data.

• Carry out the algorithms and statistical analysis required.

• Align the findings with other methods and sources.

But answering those questions will only give you theories and hints. Modeling data is an easy way to
simulate data within a proper equation the machine understands. You will be able to make model-
based predictions. You might need to try out multiple models to find the best fit.

Coming back to the issue of sales, this model will help you predict which clients are more likely to
buy. The prediction can be specific, like Female, 16-36 age group living in India.

Step five. Communicate the Analytical findings.

Communication skills are an important part of a data scientist's job but are often widely underestimated.
This will in fact be a very difficult part of your work, because it involves explaining the results to the
public and to other team members in a way that they will clearly understand.

• Graph or chart the information for presentation with tools – R, Python, Tableau, Excel.

• Use "storytelling" to fit the results.

• Answer the different follow-up questions.

• Present data in different formats-reports, websites.

• Believe me; answers will always spark more questions, and the process begins again.

8. BASIC TOOLS (PLOTS, GRAPHS AND SUMMARY STATISTICS) OF EDA
Exploratory data analysis
Exploratory Data Analysis (EDA) is a very critical step that takes place after the feature development and
data acquisition process and should be carried out prior to the modeling process. This is because it is really
important for a data scientist to be able to understand the essence of the data without making assumptions
about it.

The goal of EDA is to use summary statistics and visualizations to better understand the data and to find
clues about the patterns in the data, its quality, and the assumptions and hypotheses of our study. EDA is
NOT about creating fancy or aesthetically appealing visualizations; the aim is to try and answer questions
with the data. Your goal should be to produce a chart that anyone can look at for a few seconds and
understand what is going on. If not, the visualization is too complex (or too fancy) and something simpler
has to be used.

EDA is also very iterative since we first make assumptions based on our first exploratory visualizations,
then build some models. We then make visualizations of the model results and tune our models.

Types of Data

Before we can start talking about data exploration, let's first learn about the various types of data, or
scales of measurement. I highly recommend reading the Levels of Measurement section of an online
statistics book and skimming the rest of that section to refresh your statistical knowledge. This
segment is simply a synopsis. Data comes in different forms but can be classified into two major
groups: structured and unstructured data. Structured data is data that has a high degree of
organization, such as numerical or categorical data. Examples of structured data are temperature,
phone numbers and gender. Unstructured data is data in a form that we cannot use directly; types of
unstructured data include pictures, videos, audio, natural language text and many others. There is an
emerging field called Deep Learning which uses a specialized collection of algorithms that perform
well with unstructured data. We will concentrate on structured data in this guide but include brief
details on unstructured data where relevant.

Categorical Variables

Categorical variables can be either nominal or ordinal. Nominal data has no intrinsic ordering of the
categories. For example, gender (Male, Female, Other) has no specific ordering. Ordinal data has a clear
ordering, such as the three settings on a toaster (high, medium and low). A frequency table (the count of
each category) is the common statistic for describing the categorical data of each variable, and a bar chart
or a waffle chart (shown below) are two visualizations which can be used.
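As a small illustration of the frequency table and bar chart mentioned above (the gender values below are made up for the example):

import pandas as pd
import matplotlib.pyplot as plt

# Made-up categorical data
gender = pd.Series(["Male", "Female", "Female", "Other", "Male", "Female"])

freq_table = gender.value_counts()   # frequency table: count of each category
print(freq_table)

freq_table.plot(kind="bar")          # bar chart of the counts
plt.ylabel("count")
plt.show()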

While a pie chart is a very common method for representing categorical variables, it is not
recommended, since it is very difficult for humans to understand angles. As statistician and
visualization professor Edward Tufte put it simply: "pie charts are bad"; others agree and some even
say that "pie charts are the worst".

For example, what can you determine from the following pie chart?

The charts look identical, and it takes more than a couple of seconds to understand the data. Now
compare this to the corresponding bar chart:

A reader can instantly compare the 5 variables. Since humans have a difficult time comparing angles,
bar graphs or waffle diagrams are recommended. There are many other visualizations which are not
recommended: spider charts, stacked bar charts, and many other junk charts.

For example, this visualization is very complicated and difficult to understand:

Often less is more: the plot redone as a simple line graph:

When dealing with multi-categorical data, avoid using stacked bar charts and follow the
guidelines written by Solomon Messing (data scientist I met while working at Facebook).

Numeric Variables

Numeric or continuous variables can take any value within a finite or infinite interval. Examples
include weight, height, and temperature. Interval and ratio variables are two types of numeric
variables. Interval variables have numerical differences with the same meaning over the entire scale,
but no absolute zero. For example, temperatures in Fahrenheit or Celsius may be meaningfully added
or subtracted (the difference between 10 degrees and 20 degrees is the same as the difference between
40 and 50 degrees), but ratios are not meaningful: a 40-degree day is not twice as hot as a 20-degree day.

"The calculation proportion scale is the most insightful scale. This is a scale of intervals with the
additional property that its zero position implies the absence of the measured quantity. You can think
of a ratio scale as the three previous scales were rolled up into one. It provides a name or category
for each object as a nominal scale (numbers are used as labels). The objects are ordered, like an
ordinal scale (in terms of ordering the numbers). The same disparity at two positions on the scale has
the same value as an interval scale. And, moreover, the same ratio in two places on the scale also has
the same meaning. "A good example of a ratio scale is weight, because it has a true zero and can be
added, subtracted, multiplied or divided.

Binning (Numeric to Categorical)

The process of transforming numerical variables into categorical is what is otherwise known as
discretization. For example, age may be 0-12 (child), 13-19 (teenager), 20-65 (adult), 65+ (senior)
categories. Binning is useful because it can be used as a filter to minimize noise or non-linearity and
some algorithms need categorical data, such as decision trees. Binning also helps data scientists to
easily evaluate outliers and null or missing values in numerical data. Binning strategies include equal
width bins (based on the range), equal frequency in each bin, sorted rank, quantiles, or mathematical
functions (such as log). Binning may also be based on information entropy, or information gain.
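A minimal sketch of binning with pandas, using the age categories from the text (the ages themselves are made up):

import pandas as pd

ages = pd.Series([4, 15, 22, 47, 68, 81])      # made-up numeric variable

# Discretize ages into the categories described above
age_groups = pd.cut(ages,
                    bins=[0, 12, 19, 65, 120],
                    labels=["child", "teenager", "adult", "senior"])
print(age_groups)
print(age_groups.value_counts())
# pd.qcut(ages, 4) would instead bin by equal frequency (quartiles)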

Encoding

Encoding is the transformation of categorical variables into numeric (or binary) variables, otherwise
known as continuization. A simple example of encoding is gender: -1, 0 and 1 may be used to identify
males, females and others. Binary encoding is a special case of encoding where the value is set to 0 or 1
to indicate the absence or presence of a category. One-hot encoding is a special case where multiple
categories are each encoded as their own binary feature: if we have k categories, this will generate k
extra features (increasing the dimensionality of the data). Another type of encoding is target- or
probability-based encoding, where each category is replaced by the average target value (probability) of
that group.
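A short sketch of binary/one-hot encoding with pandas (the example column is made up); with k categories, one-hot encoding produces k binary columns:

import pandas as pd

df = pd.DataFrame({"gender": ["Male", "Female", "Other", "Female"]})

# One-hot encoding: one binary (0/1) column per category
one_hot = pd.get_dummies(df["gender"], prefix="gender")
print(one_hot)

# Simple label encoding: map each category to an integer code
print(df["gender"].astype("category").cat.codes)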

Iris DataSet

For exploratory data analysis, we are going to be investigating Anscombe's quartet and Fisher's Iris
data set.

"Anscombe's quartet consists of four datasets with almost identical basic statistical features that
nevertheless look very different when graphed. Each dataset is composed of eleven (x, y) points. They
were constructed by the statistician Francis Anscombe in 1973 in order to demonstrate both the
importance of graphing data before analyzing it and the effect of outliers on statistical properties." The
Iris data set is a multivariate data set introduced by Ronald Fisher in his 1936 paper on the use of
multiple measurements. It describes the variation of three related species of Iris flowers.

Summary Statistics

Summary statistics are measures intended to summarize the data. There are many summary measures
in the field of descriptive statistics, but we will leave their description (and derivation) to textbooks.
Examples of summary/descriptive statistics for a single numeric variable are the mean, median, mode,
max, min, range, quartiles/percentiles, variance, standard deviation, coefficient of variation, skewness
and kurtosis. For categorical variables, the main summary statistics are the frequency counts and the
number of distinct categories. The most basic summary metrics for text data are term frequency and
inverse document frequency. Descriptive statistics for bivariate data include the linear correlation, the
chi-squared statistic and the p-value.

The Anscombe Excel worksheet with summary statistics can be downloaded from
datascienceguide.github.io/datasets/anscombe.xls, and the Iris data set Excel worksheet with
summary statistics for each feature from datascienceguide.github.io/datasets/iris.xls. Note how
the mean, standard deviation and correlation between x and y are almost identical across the Anscombe
datasets. If we know about linear regression, we can also see that the linear regression coefficients are the same.
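You can verify this quickly in Python; the sketch below uses the copy of Anscombe's quartet bundled with seaborn rather than the Excel sheets linked above:

import seaborn as sns

anscombe = sns.load_dataset("anscombe")          # columns: dataset, x, y

# Mean, standard deviation and x-y correlation for each of the four datasets
for name, group in anscombe.groupby("dataset"):
    print(name,
          round(group["x"].mean(), 2), round(group["y"].mean(), 2),
          round(group["x"].std(), 2), round(group["y"].std(), 2),
          round(group["x"].corr(group["y"]), 2))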

Visualization

Visualizations can be used to explore and explain data in addition to the summary statistics. We'll
learn about the importance of visualizations in the tutorials, and see that using simple statistical
properties to represent data is not enough. This is shown by Anscombe's quartet, as outlined in the
article on why data visualization is important.

Examples of visualizations for quantitative data include line charts with error bars, histograms, box and
whisker plots, and scatter or combination charts; for categorical data, bar charts and waffle charts; and
for bivariate data, scatter plots. Many of these visualizations are covered in the tutorial on exploratory
data analysis.

There are several tools and libraries that can be used to plot visualizations: Excel/LibreOffice, Weka,
matplotlib (Python), seaborn (Python), grammar of graphics (ggplot2), infovis, R Shiny, and
Data-Driven Documents (D3.js). Additionally, Tibco Spotfire and Tableau are common, though
commercial, data visualization solutions.

Univariate data (One variable)

The box plot and the histogram are the two visualizations used to illustrate univariate (one-variable)
data. The box plot can be used to show the minimum, maximum, mean, median, quantiles and
range.

The histogram can be used to show the count, mode, variance, standard deviation, coefficient of
variation, skewness and kurtosis.
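A brief matplotlib sketch of both univariate plots (the values are randomly generated for illustration):

import numpy as np
import matplotlib.pyplot as plt

values = np.random.normal(loc=50, scale=10, size=500)   # made-up numeric variable

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.boxplot(values)          # box plot: median, quartiles, range, outliers
ax1.set_title("Box plot")
ax2.hist(values, bins=30)    # histogram: shape of the distribution
ax2.set_title("Histogram")
plt.show()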

Bivariate data (Two Variables)

When plotting the relation between two variables, one can use a scatter plot.

If the data is time series or has an order, a line chart can be used.
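A matching sketch for bivariate data, with a scatter plot for two related variables and a line chart for ordered (time series) data; all values are made up:

import numpy as np
import matplotlib.pyplot as plt

x = np.random.uniform(0, 10, 100)
y = 2 * x + np.random.normal(0, 2, 100)      # y roughly depends on x

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(x, y)                            # relation between two variables
ax1.set_title("Scatter plot")

series = np.cumsum(np.random.normal(0, 1, 50))
ax2.plot(range(50), series)                  # ordered / time series data
ax2.set_title("Line chart")
plt.show()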

Multivariate Visualization

When dealing with multiple variables, it is tempting to make three-dimensional plots, but as shown
below it can be difficult to understand the data:

Rather, I recommend creating a scatter plot of the relation between each pair of variables:
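One common way to do this is a scatter plot matrix (pair plot); a sketch using the Iris data bundled with seaborn as a stand-in for whatever data set you are exploring:

import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")
sns.pairplot(iris, hue="species")   # scatter plot for every pair of numeric variables
plt.show()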

Combined charts also are great ways to visualize data, since each chart is simple to understand on its
own.

For very high dimensionality, you can reduce the dimensionality using principal component
analysis, Latent Dirichlet Allocation or other techniques and then make a plot of the reduced
variables. This is particularly important for high-dimensional data and has applications in deep
learning, such as visualizing natural language or images.
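A minimal sketch of this idea, reducing the Iris measurements to two principal components with scikit-learn and plotting them (the data set is just a stand-in):

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

pca = PCA(n_components=2)        # keep the two strongest components
X_2d = pca.fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()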

Text Data

For example, with text data one could create a word cloud, where the size of each word is based
on its frequency in the text. To remove the words which add noise to the dataset, the documents can
be grouped using topic modeling and only the important words displayed.
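A small sketch using the third-party wordcloud package (the sample text is made up; the package is installed separately with pip install wordcloud):

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

text = "data science data analysis machine learning data model model data"

# Word size reflects how often each word appears in the text
wc = WordCloud(stopwords=STOPWORDS, background_color="white").generate(text)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()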

Image data

When doing image classification, it is common to use decomposition to reduce the dimensionality
of the data. For example, an image before decomposition looks like:

Instead of blindly using decomposition, a data scientist could plot the result:

By looking at the contrast (black and white) in the images, we can see that the locations of the eyes,
nose and mouth are important, along with the head shape.

9. THE DATA SCIENCE PROCESS - CASE STUDY, REAL DIRECT (ONLINE REAL
ESTATE FIRM)

Data Science Case Studies


Here are some of the most popular data science case studies, which show how data science is used, and
why it is important, in a variety of industries.

1. Data Science in Pharmaceutical Industries

Through improved data processing and cloud-driven applications, it is now easier to access a wide
variety of patient information datasets. In the pharmaceutical industry, artificial intelligence and data
analytics have revolutionized oncology. With new pharmaceutical innovations emerging every day,
it is difficult for physicians to keep up to date on treatment options, and it is difficult to tap
into a highly competitive market for more standardized medical care options. However, with the
advances in technology and the development of parallel, pipelined computational models, it is now
easier for the pharmaceutical industry to gain a competitive advantage in the market.

With various statistical models such as Markov Chains, it is now possible to predict the probability
that doctors will prescribe medicines based on their experience with the brand. In the same way,
learning-based approaches are beginning to establish themselves in the area of digital marketing,
where they are used to understand the patterns of digital participation of physicians and their
prescriptions. The main aim of this data science case study is to discuss the problems the industry
faces and how data science offers solutions to them.

2. Predictive Modeling for Maintaining Oil and Gas Supply

The crude oil and gas industries are faced with a major problem of equipment failures, typically due
to the inefficiency of the oil wells and their output remaining at a subpar level. With the implementation
of an effective strategy that advocates for predictive maintenance, well operators can be alerted to the
critical phases of shutdown and informed of maintenance times in advance. This would lead to an
increase in oil production and avoid further losses.

Data Scientists can apply the Predictive Maintenance Strategy to the use of data to optimize high-
value machinery for the production and refining of oil products. With telemetry data extracted
through sensors, a steady stream of historical data can be used to train our machine learning model.
This machine learning model will predict the failure of machine parts and will alert operators of
timely maintenance in order to avoid oil losses. The Data Scientist assigned to the implementation of
the PdM strategy should help prevent hazards and predict machine failure, encouraging operators to
take precautions.

3. Data Science in BioTech

The human genome consists of four building blocks – A, T, C and G. Our appearance and
characteristics are determined by the three billion permutations of these four building blocks.
Genetic defects, combined with lifestyle factors, can lead to chronic diseases. Identifying these
defects at an early stage allows doctors and testing teams to take preventive action.

Helix is one of the companies for genome analysis that provides customers with their genomic data.
Also, due to the emergence of new computational methodologies, many medicines adapted to
particular genetic designs have become increasingly popular. Thanks to the data explosion, we can
understand and analyze complex genomic sequences on a wide scale. Data Scientists can use modern
computational resources to manage massive databases and understand patterns in genomic sequences
to detect defects and provide information to physicians and researchers. In addition, with the use of
wearable tools, data scientists may use the relationship between genetic characteristics and medical
visits to build a predictive modeling framework.

4. Data Science in Education

Data Science has also changed the way students communicate with teachers and assess their success.
Instructors may use data science to evaluate the input they obtain from students and use it to enhance
their teaching. Data Science can be used to construct predictive modeling that can predict student
drop-out levels based on their results and advise instructors to take the appropriate precautions.

UNIT- II (Basic Machine Learning Algorithms & Applications)

1. LINEAR REGRESSION FOR MACHINE LEARNING


Linear regression in statistics and machine learning is perhaps one of the most well-known and well-
understood algorithms.

You will discover the linear regression algorithm in this article, how it operates and how you can
best use it in your machine learning projects. In this article you'll learn:

 Why linear regression is part of statistics as well as machine learning.
 The other names by which linear regression is known.
 The representation and learning algorithms used to construct a linear regression model.
 How best to prepare your data when using linear regression modeling.

To grasp linear regression you don't need to know any statistics or linear algebra. It is a gentle high-
level introduction to the technique to give you enough experience to be able to make successful use
of it for your own problems.

Discover how machine learning algorithms work in my latest book like kNN, decision trees, naive
bayes, SVM, ensembles and much more, with 22 tutorials and examples in excel.

Let's kick off. Figure 1 shows Linear Regression for Machine Learning

Figure 1: Linear Regression for Machine Learning Photo by Nicolas Raymond.

Isn't Linear Regression from Statistics?

Before we immerse ourselves in the specifics of linear regression, you might wonder why we are
looking at this algorithm.

Isn't it a technique from statistics?

Machine learning, more precisely the field of predictive modeling, is concerned primarily with
reducing a model's error or making the most accurate predictions possible, at the cost of description.
We can borrow, reuse and steal algorithms from many different fields like statistics in applied
machine learning and use them for these purposes.

Linear regression was developed in the field of statistics and is studied as a model for
understanding the relationship between input and output numerical variables, but it has been borrowed
by machine learning. It is both a statistical algorithm and a machine learning algorithm.

Next, let's look at some of the common names used to refer to a linear regression model. Figure 2
shows the sample of the handy machine learning algorithms mind map.

Get your FREE Algorithms Mind Map

Figure 2: Sample of the handy machine learning algorithms mind map.

Many Linear Regression Names

Things can get really complicated when you start looking at linear regression. The explanation is that
for so long (more than 200 years), linear regression has been around. From every possible angle, it
has been studied and sometimes every angle has a new and different name. A linear model, such as a
model that follows a linear relationship between the input variables (x) and the output variable (y), is
a linear regression. More precisely, a linear combination of input variables (x) can be used to
determine y.

The approach is referred to as simple linear regression when a single input variable (x) exists.
Statistical literature also refers to the approach as multiple linear regression when multiple input
variables are available. Different techniques, the most common of which is called Ordinary Least
Squares, may be used to prepare or train the linear regression equation from data. It is therefore
common to refer to a model prepared in this way as Ordinary Least Squares linear regression, or
simply least squares regression.

Linear Regression Model Representation

Linear regression is a popular model because its representation is so simple. The representation is a
linear equation that combines a specific set of input values (x) with a predicted output (y) for that set
of input values. As such, both the input values (x) and the output value (y) are numeric.

For each input value or column, the linear equation assigns one scale factor, called the coefficient,
represented by the Greek capital letter Beta (B). An additional coefficient is often introduced, which
gives the line an extra degree of freedom (for example, going up and down on a two-dimensional
plot) and is also referred to as the coefficient of intercept or bias.

For example, in a simple regression problem (a single x and a single y), the form of the model would
be:

y = B0 + B1*x

In higher dimensions, when we have more than one input (x), the line is called a plane or a
hyperplane. The representation is therefore the form of the equation and the specific values used for
the coefficients (e.g. B0 and B1 in the example above).

It is common to talk about the complexity of a regression model such as linear regression. This refers
to the number of coefficients used for the model.

When the coefficient is zero, the influence of the input variable on the model is effectively removed
from the model prediction (0 * x = 0). This becomes relevant when you look at the regularization
methods that change the learning algorithm to reduce the complexity of the regression models by
putting pressure on the absolute size of the coefficients, driving some to zero.
Now that we understand the representation used for the linear regression model, let's look at some
ways that we can learn this representation from the data.

Figure 2: What is Linear Regression?

Linear Regression Learning Model

Learning a linear regression model means estimating the values of the coefficients used in the
representation with the data available to us.

In this section, we will take a brief look at four techniques for the preparation of a linear regression
model. This is not enough information to implement it from scratch, but enough to get a taste of the
computation and the trade-offs involved.

There are many more techniques because the model is so well studied. Take note of Ordinary Least
Squares because it is the most common method used in general. Also take note of Gradient Descent as
it is the most common technique taught in machine learning classes.

1. Simple Linear Regression

With simple linear regression, if we have a single input, we can use statistics to estimate the
coefficients. This requires you to calculate statistical properties from data such as mean, standard
deviations, correlations and covariance. All of the data must be available to traverse and calculate
the statistics.
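A hedged sketch of that statistical route on made-up data: the slope B1 is the covariance of x and y divided by the variance of x, and the intercept B0 follows from the means:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # made-up input values
y = np.array([1.2, 1.9, 3.1, 4.2, 4.8])      # made-up output values

b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # slope: covariance / variance
b0 = y.mean() - b1 * x.mean()                    # intercept from the means
print(b0, b1)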

2. Ordinary Least Squares

If we have more than one input, we can use Ordinary Least Squares to estimate the values of the
coefficients.

The Ordinary Least Squares procedure seeks to minimize the sum of the squared residuals. This
means that given a regression line through the data, we calculate the distance from each data point to
the regression line, square it, and sum all of the squared errors together. This is the quantity that
ordinary least squares seeks to minimize.

This approach treats the data as a matrix and uses linear algebra operations to estimate the optimum
coefficient values. This means that all the data must be available and you must have enough memory
to fit the data and perform the matrix operation.

It is unusual to perform the Ordinary Least Squares procedure yourself, unless it is done as a linear
algebra exercise. It's more likely you're going to call a procedure in a linear algebra library. This
procedure is very quick to calculate.

3. Gradient Descent

When one or more inputs are available, you can use the process of optimizing coefficient values by
iteratively minimizing the model error on your training data.

This operation is called Gradient Descent, starting with random values for each coefficient. The sum
of squared errors is calculated for each pair of input and output values. The learning rate is used as a
scale factor and the coefficients are updated in order to minimize the error. The process is repeated
until a minimum amount of squared error is achieved or no further improvement is possible.

When using this method, you must select the learning rate (alpha) parameter that will determine the
size of the improvement step to be taken for each iteration of the procedure.

Gradient descent is often taught using a linear regression model because it is relatively easy to
understand. In practice, it is useful when you have a very large dataset in either the number of rows
or the number of columns that may not fit into your memory.
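A toy sketch of gradient descent for the simple model y = B0 + B1 * x (the data, learning rate and iteration count are arbitrary illustrative choices):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.1, 4.2, 4.8])

b0, b1 = 0.0, 0.0          # start from initial coefficient values
alpha = 0.01               # learning rate

for _ in range(2000):      # repeat until (approximately) no further improvement
    error = (b0 + b1 * x) - y
    # move each coefficient in the direction that reduces the squared error
    b0 -= alpha * error.mean()
    b1 -= alpha * (error * x).mean()

print(b0, b1)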

4. Regularization

There are extensions of the training of a linear model called regularization methods. These seek both
to minimize the sum of the squared error of the model on the training data (using ordinary least
squares) and to reduce the complexity of the model (such as the number or absolute size of the
sum of all coefficients in the model).

Two common examples of regularization procedures for linear regression are:

1. Lasso regression: where ordinary least squares are modified to minimize the absolute sum of the
coefficients (called L1 regularization) as well.

2. Ridge Regression: where the ordinary least squares are modified to minimize the squared absolute
sum of the coefficients (called L2 regularization)

These methods are effective to use when there is collinearity in your input values and ordinary
least squares would overfit the training data.
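A brief scikit-learn sketch of both regularized variants (the tiny data set and the alpha penalty values are placeholders):

import numpy as np
from sklearn.linear_model import Lasso, Ridge

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.2, 1.9, 3.1, 4.2, 4.8])

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty on the absolute size of coefficients
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty on the squared size of coefficients

print(lasso.coef_, lasso.intercept_)
print(ridge.coef_, ridge.intercept_)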

Now that you know some techniques to learn the coefficients in a linear regression model, let's look
at how we can use a model to make new data predictions.

Making Linear Regression Predictions

Since the representation is a linear equation, making predictions is as simple as solving the equation
for a specific set of inputs.

Let's use an example to make this concrete. Imagine that we predict weight (y) from height (x). Our
linear regression model for this problem would be:

y = B0 + B1 * x1
or
weight =B0 +B1 * height

Where B0 is the bias coefficient and B1 is the coefficient for the height column. We use a
learning technique to find a good set of coefficient values. Once found, we can plug in
different height values to predict the weight.

For example, let’s use B0 = 0.1 and B1 = 0.5. Let’s plug them in and calculate the weight (in
kilograms) for an individual with the height of 182 centimeters.

weight = 0.1 + 0.5 * 182


weight = 91.1

You can see that the above equation could be plotted as a line in two dimensions. B0 is our
starting point regardless of what height we have. We can run through a bunch of heights from 100 to
250 centimeters, plug them into the equation, and get weight values, creating our line.
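That paragraph translates directly into a few lines of Python; the coefficients are the ones from the worked example above, so this is just the equation being evaluated, not a fitted model:

import matplotlib.pyplot as plt

b0, b1 = 0.1, 0.5                      # coefficients from the example above

heights = range(100, 251, 10)          # heights from 100 to 250 cm
weights = [b0 + b1 * h for h in heights]

print(b0 + b1 * 182)                   # 91.1 kg for a height of 182 cm

plt.plot(heights, weights)             # the resulting straight line
plt.xlabel("height (cm)")
plt.ylabel("weight (kg)")
plt.show()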

Figure 3: Sample Height vs Weight Linear Regression


Now that we know how to make predictions given a learned linear regression model, let’s look at
some rules of thumb for preparing our data to make the most of this type of model.

Preparing Data for Linear Regression

Linear regression has been studied extensively, and there is a lot of literature on how your data needs
to be structured to make the most of the model.

As such, there is a lot of sophistication in talking about these requirements and expectations, which can
be intimidating. In practice, these rules are better used as rules of thumb when using Ordinary Least
Squares regression, the most common linear regression implementation.

Try using these heuristics to prepare your data differently and see what works best for your problem.

1. Linear Assumption. Linear regression assumes that the relationship between input and output is
linear. It doesn't support anything else. This may be obvious, but it's a good thing to remember
when you have a lot of attributes. You may need to transform the data to make the relationship
linear (e.g. transform log for an exponential relationship).

2. Remove your noise. Linear regression assumes that the input and output variables are not noisy.
Consider using data cleaning operations that will make it easier for you to expose and clarify the
signal in your data. This is most important for the output variable and, if possible, you want to
remove outliers in the output variable (y).

3. Remove Collinearity. Linear regression will over-fit your data when you have highly
correlated input variables. Consider calculating pairwise correlations for your input data and
removing the most correlated variables.

4. The Gaussian Distribution. Linear regression makes more reliable predictions if your input and
output variables have a Gaussian distribution. You may get some benefit from transforms (e.g. log
or BoxCox) on your variables to make their distribution look more Gaussian.

5. Rescale Inputs: Linear regression will often make more reliable predictions if you rescale input
variables using standardization or normalization.

2. K-NEAREST NEIGHBORS (K-NN)


The K-nearest neighbors (KNN) algorithm is a type of supervised machine learning algorithm that can be
used for both classification and regression predictive problems. However, it is mainly used for
classification problems in industry.

The following two properties define KNN well –

• Lazy learning algorithm − KNN is a lazy learning algorithm because it does not have a specialized
training phase and uses all the data for training while classifying.

• Non-parametric learning algorithm − KNN is also a non-parametric learning algorithm because it
assumes nothing about the underlying data.

2.1 Working of KNN Algorithm

K-nearest neighbors (KNN) algorithm uses 'feature similarity' to predict the values of new data
points, which means that a value will be assigned to the new data point based on how closely the
points in the training set match. We can understand how it works with the following steps −

Step 1 − We need a data set to implement any algorithm, so we have to load the training as well as the
test data during the first step of KNN.

Step 2 − Next, we need to choose the value of K, i.e. the number of nearest data points to consider. K
can be any integer.

Step 3 − For each point of the test data, do the following −

3.1 − Calculate the distance between the test data and each row of training data using any
method, namely: Euclidean, Manhattan or Hamming distance. Euclidean is the most common
method used to calculate distance.

3.2 − Now, based on the distance value, sort it in ascending order.

3.3 − Next, the top rows of K will be selected from the sorted array.

3.4 − Now assign a class to the test point based on the most frequent class of these rows.

Step 4 − End

Example

The following is an example to understand the concept of K and the working of the KNN algorithm.
Suppose we have a dataset that can be plotted as shown in Figure 4 −

Figure 4: Dataset

Now, we would like to classify a new data point (the black dot at point 60, 60) into the blue or red
class. We are assuming K = 3, i.e. the algorithm would find the three nearest data points. This is shown
in Figure 5 −

Figure 5: Finding three nearest neighbors

We can see in Figure 5 the three nearest neighbors of the black data point. Among those three, two of
them lie in the red class, hence the black dot will also be assigned to the red class.

2.2 Implementation in Python

As we know, the K-nearest neighbors (KNN) algorithm can be used for both classification and
regression. The following are recipes in Python for using KNN as a classifier as well as a regressor –
KNN as Classifier
First, start with importing necessary python packages −

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Next, download the iris dataset from its weblink as follows −
path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
Next, we need to assign column names to the dataset as follows −
headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
Now, we need to read the dataset into a pandas DataFrame as follows −
dataset = pd.read_csv(path, names = headernames)
dataset.head()
sepal-length sepal-width petal-length petal-width Class

0 5.1 3.5 1.4 0.2 Iris-setosa

1 4.9 3.0 1.4 0.2 Iris-setosa

2 4.7 3.2 1.3 0.2 Iris-setosa

3 4.6 3.1 1.5 0.2 Iris-setosa

4 5.0 3.6 1.4 0.2 Iris-setosa

Data Preprocessing will be done with the help of following script lines.

X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
Next, we will divide the data into train and test split. Following code will split the dataset into 60%
training data and 40% of testing data −
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.40)

Next, data scaling will be done as follows −


from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

Next, train the model with the help of KNeighborsClassifier class of sklearn as follows –

from sklearn.neighbors import KNeighborsClassifier


classifier = KNeighborsClassifier(n_neighbors = 8)
classifier.fit(X_train, y_train)
At last we need to make prediction. It can be done with the help of following script −
y_pred = classifier.predict(X_test)
Next, print the results as follows −
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:")
print(result1)
result2 = accuracy_score(y_test, y_pred)
print("Accuracy:", result2)

Output

Confusion Matrix:

[[21 0 0]
[ 0 16 0]
[ 0 7 16]]

Classification Report:

precision recall f1-score support


Iris-setosa 1.00 1.00 1.00 21
Iris-versicolor 0.70 1.00 0.82 16
Iris-virginica 1.00 0.70 0.82 23
micro avg 0.88 0.88 0.88 60
macro avg 0.90 0.90 0.88 60
weighted avg 0.92 0.88 0.88 60

Accuracy: 0.8833333333333333

KNN as Regressor
First, start with importing necessary Python packages −
import numpy as np
import pandas as pd
Next, download the iris dataset from its weblink as follows −
path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
Next, we need to assign column names to the dataset as follows −
headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
Now, we need to read the dataset into a pandas DataFrame as follows −
data = pd.read_csv(path, names = headernames)
array = data.values
X = array[:, :2]
Y = array[:, 2]
data.shape
output: (150, 5)
Next, import KNeighborsRegressor from sklearn to fit the model −
from sklearn.neighbors import KNeighborsRegressor
knnr = KNeighborsRegressor(n_neighbors = 10)
knnr.fit(X, Y)
At last, we can find the MSE as follows −
print("The MSE is:", format(np.power(Y - knnr.predict(X), 2).mean()))

Output
The MSE is: 0.12226666666666669

2.3 Pros and Cons of KNN

Pros

• It is a very simple algorithm to understand and interpret.

• It is very useful for non-linear data because there is no assumption about the data in this
algorithm.

 It is a versatile algorithm that can be used for both classification and regression.
 It is comparatively accurate, but there are far better supervised learning models than KNN.

Cons

• It is a somewhat computationally expensive algorithm, because it stores all of the training data.

 High memory storage is required compared to other supervised learning algorithms.
 Prediction is slow when N is large.
 It is very sensitive to the scale of the data as well as to irrelevant features.

2.4 Applications of KNN

The following are some of the areas in which KNN can be successfully applied –

1. Banking System – KNN can be used in a banking system to predict whether an individual is
fit for loan approval. Does that individual have characteristics similar to those of
defaulters?
2. Calculating Credit Ratings – KNN algorithms can be used to find an individual's credit rating
by comparing it to persons with similar characteristics.
3. Politics – With the assistance of KNN algorithms, we can classify potential voters into
different classes like "will vote", "will not vote", "will vote for the Congress Party", "will
vote for the BJP Party".
4. Other areas where the KNN algorithm is often used are speech recognition, handwriting
detection, image recognition and video recognition.
3. K-MEANS
K-means algorithm is an iterative algorithm that attempts to divide the dataset into K pre-defined
separate non-overlapping subgroups (clusters) where each data point belongs to only one group. It
tries to make inter-cluster data points as similar as possible while keeping clusters as different (far) as
possible. It assigns data points to a cluster in such a way that the sum of the squared distance between
the data points and the centroid cluster (arithmetic mean of all data points belonging to that cluster) is
at a minimum. The less variation we have within clusters, the more homogenous (similar) the data
points are within the same cluster.

The way k-means algorithm works is as follows:

1. Specify the number of K clusters.

2. Initialize the centroids by first shuffling the dataset and then randomly selecting K data points for
the centroids without substitution.

3. Keep iterating until there is no change in the centroids. i.e. the assignment of data points to clusters
does not change.

4. Calculate the sum of the squared distance between the data points and all the centroids.

5. Assign each data point to the nearest cluster (centroid).

6. Calculate the centroids for clusters by taking the average of all data points belonging to each
cluster.

The approach k-means follows to solve the problem is called Expectation-Maximization. The E-step
assigns the data points to the closest cluster. The M-step computes the centroid of each cluster. Below
is a breakdown of how we can solve it mathematically (feel free to skip it).

The objective function is:
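In the notation defined just below, the standard k-means objective can be written as

J = \sum_{i=1}^{m} \sum_{k=1}^{K} w_{ik} \, \lVert x_i - \mu_k \rVert^2

where m is the number of data points and K the number of clusters.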

Where wik=1 for data point xi if it belongs to cluster k; otherwise, wik=0. Also, μk is the centroid of
xi’s cluster.

It’s a minimization problem of two parts. We first minimize J w.r.t. wik and treat μk fixed. Then we
minimize J w.r.t. μk and treat wik fixed. Technically speaking, we differentiate J w.r.t. wik first and
update cluster assignments (E-step). Then we differentiate J w.r.t. μk and recompute the centroids
after the cluster assignments from previous step (M-step). Therefore, E-step is:
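In this notation, the E-step can be written as

w_{ik} = \begin{cases} 1 & \text{if } k = \arg\min_{j} \lVert x_i - \mu_j \rVert^2 \\ 0 & \text{otherwise} \end{cases}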

In other words, assign the data point xi to the closest cluster judged by its sum of squared distance
from cluster’s centroid.

And M-step is:
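In this notation, the M-step can be written as

\mu_k = \frac{\sum_{i=1}^{m} w_{ik}\, x_i}{\sum_{i=1}^{m} w_{ik}}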

Which translates to recomputing the centroid of each cluster to reflect the new assignments.

A few things to note here:

• Since clustering algorithms, including k-means, use distance-based measurements to determine the
similarity between data points, it is recommended to standardize the data to have a mean of zero and a
standard deviation of one, since the features in almost any data set have different units of
measurement, such as age vs. income.

• Given the iterative nature of k-means and the random initialization of the centroids at the start of the
algorithm, different initializations may lead to different clusters, because the algorithm may get stuck
in a local optimum and not converge to the global optimum. It is therefore recommended to run the
algorithm using different centroid initializations and to choose the run that yields the lowest sum of
squared distances.

• The assignment of examples not changing is the same thing as no change in within-cluster variation:

Implementation

We will use a simple implementation of k-means here to illustrate some of the concepts. Then we will
use the sklearn implementation, which is more efficient and takes care of a lot of things for us.
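A minimal sketch of the sklearn route on made-up two-dimensional data (note the standardization advice above; the cluster count and other values are placeholders):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X = np.random.rand(200, 2)                    # made-up data with 2 features
X_std = StandardScaler().fit_transform(X)     # standardize: mean 0, std 1

# n_init runs the algorithm from several centroid initializations and keeps
# the solution with the lowest sum of squared distances (inertia)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_std)

print(kmeans.labels_[:10])       # cluster assignment of the first 10 points
print(kmeans.cluster_centers_)   # final centroids
print(kmeans.inertia_)           # sum of squared distances to the closest centroid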

Applications

The k-means algorithm is very popular and is used in a variety of applications such as market
segmentation, document clustering, image segmentation and image compression. When we undertake a
cluster analysis, the goal is usually either to:

1. Get a meaningful insight into the structure of the data we're dealing with.

2. Cluster-then-predict, where different models will be built for different subgroups if we believe
there is a wide variation in the behavior of different subgroups. An example of this is the clustering of
patients into different subgroups and the development of a model for each subgroup to predict the risk
of heart attack.

4. FILTERING SPAM
4.1 Spam

• Spam is also referred to as Unsolicited Commercial Email (UCE)

• Involves sending messages by email to numerous recipients at the same time (Mass Emailing)

• Grew exponentially since 1990 but has leveled off recently and is no longer growing
exponentially

• 80% of all spam is sent by fewer than 200 spammers

4.2 Purpose of Spam

• Advertisements

• Pyramid schemes (Multi-Level Marketing)

• Giveaways

• Chain letters

• Political email

• Stock market recommendation

Spam as a Problem

• Consumes computing resources and time


• Reduces the effectiveness of legitimate advertising
• Cost Shifting
• Fraud
• Identity theft
• Consumer Perception
• Global Implications
John Borgan [ReplyNet] – "Junk email isn't just annoying anymore. It's eating into productivity. It's
eating into time."

Some Statistics

 Cost of Spam 2009:


o $130 billion worldwide
o $42 billion in the US alone
o 30% increase from 2007 estimates
o 100% increase in 2007 from 2005
 Main components of cost:
o Productivity loss from inspecting and deleting spam missed by spam control products
(False Negatives)
o Productivity loss from searching for legitimate mails deleted in error by spam control
products (False Positives)
o Operations and helpdesk costs (Filters and Firewalls – installment and maintenance)

Email Address Harvesting - Process of obtaining email addresses through various methods:

 Purchasing /Trading lists with other spammers


 Bots
 Directory harvest attack
 Free Product or Services requiring valid email address
 News bulletins /Forums


4.3 Spam Life Cycle

Figure 6: Life cycle of spam

4.4 Types of Spam Filters

1. Header Filters
a. Look at email headers to judge if forged or not

b. Contain more information in addition to the recipient, sender and subject fields
2. Language Filters
a. filters based on email body language
b. Can be used to filter out spam written in foreign languages

3. Content Filters
a. Scan the text content of emails
b. Use fuzzy logic
4. Permission Filters
a. Based on Challenge /Response system
5. White list/blacklist Filters
a. Will only accept emails from list of “good email addresses”
b. Will block emails from “bad email addresses”
6. Community Filters
a. Work on the principle of "communal knowledge" of spam
b. These types of filters communicate with a central server.
7. Bayesian Filters
a. Statistical email filtering
b. Uses Naïve Bayes classifier

4.5 Spam Filters Properties

1. Filter must prevent spam from entering inboxes


2. Able to detect the spam without blocking the ham
a. Maximize efficiency of the filter
3. Do not require any modification to existing e-mail protocols
4. Easily incremental
a. Spam evolves continuously
b. Need to adapt to each user

Data Mining and Spam Filtering

 Spam Filtering can be seen as a specific text categorization (Classification)


 History
o Jason Rennie's iFile (1996); the first known program to use Bayesian classification for
spam filtering

4.6 Bayesian Classification

1. Specific words are likely to occur in spam emails and others in legitimate emails. For example, most
email users often encounter the word "Viagra" in spam emails, but rarely see it in other emails
2. The filter does not know these probabilities beforehand, and must be trained first so that it can
build them up
3. To train the filter, the user must manually indicate whether or not the new email is spam
4. For all of the words in each training email, the filter will adjust the probability that each word will
appear in the spam or legitimate email in its database.

For example, Bayesian spam filters will typically have learned a very high spam probability for
the words "Viagra" and "refinance," but a very low spam probability for words that are seen only
in legitimate emails, such as the names of friends and family members

5. After training, the word probabilities are used to calculate the probability that an email containing
a specific set of words belongs to either category
6. Each word in the email contributes to the spam probability of the email, or only the most
interesting words
7. This contribution is calculated using the Bayes Theorem
8. Then, the email's overall spam probability is computed over all of its words, and if it exceeds a
certain threshold the filter marks the email as spam
9. The user can correct misclassifications (false positives or false negatives), which allows the software
to dynamically adapt to the ever-evolving nature of spam
10. Some spam filters combine the results of both Bayesian spam filtering and other heuristics
(predefined content rules, envelope viewing, etc.) resulting in even higher filtering accuracy.

4.7 Computing the Probability

1. Calculation of the probability that a message containing a given word is spam:

2. Suppose the suspected message contains the word "replica" Most people who are used to receive
e-mails know that this message is likely to be spam

3. The formula used by the software for computing this is given below.
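The standard form of this computation, obtained from Bayes' theorem, is

\Pr(S \mid W) = \frac{\Pr(W \mid S)\,\Pr(S)}{\Pr(W \mid S)\,\Pr(S) + \Pr(W \mid H)\,\Pr(H)}

where S denotes spam, H denotes ham (legitimate email), Pr(W|S) is the probability that the word appears in spam messages and Pr(W|H) the probability that it appears in legitimate messages.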

Pros & Cons

Advantages

 It can be trained on a per-user basis


o The spam that a user receives is often related to the online user's activities

Disadvantages

 Bayesian spam filtering is susceptible to Bayesian poisoning


o Insertion of random innocuous words that are not normally associated with spam
o Replace text with pictures
o Google, through its Gmail email system, performs OCR on every mid-to-large-size
image, analyzing the text inside
 Spam emails not only consume computing resources, but can also be frustrating
 Numerous detection techniques exist, but none is a “good for all scenarios” technique
 Data Mining approaches for content based spam filtering seem promising

4.8 How to Design a Spam Filtering System with Machine Learning Algorithm

Exploratory Data Analysis (EDA)

Exploratory Data Analysis is a very important data science process. It helps the data scientist to
understand the data at hand and relates it to the business context.

The open source tool that I will use to visualize and analyze my data is Word Cloud.

Word Cloud is a data visualization tool used to represent text data. The size of the text in the image
represents the frequency or importance of the words in the training data.

Steps to take in this section:

1. Get the email data

2. Explore and analyze the data

3. Visualize the training data with Word Cloud & Bar Chart

Get the spam data

Data is the essential ingredient before we can develop any meaningful algorithm. Knowing where to
get your data can be very handy, especially when you are just a beginner.

Below are a few of the famous repositories where you can easily get thousands of data sets
for free:

1. UC Irvine Machine Learning Repository

2. Kaggle datasets

3. AWS datasets

You can go to this link (https://spamassassin.apache.org/old/publiccorpus/) to get this email spam
data set, which is distributed by SpamAssassin. There are a few categories of data; you can read
readme.html to get more background information about the data.

In short, there are two types of data present in this repository, namely ham (non-spam) and spam data.
In addition, the ham data is split into easy and hard, which means that some non-spam data are
very similar to spam data. This can make it difficult for our system to make a decision.

If you are using Linux or Mac, simply do this on the terminal; wget is simply a command that will
help you download a file from a URL:

Figure 7: Visualization for spam email

Figure 8: Visualization for non-spam email

From this view, you can see something interesting about the spam email. Many of them have a high
number of "spam" words, such as: free, money, product, etc. Having this awareness could help us
make a better decision when it comes to designing a spam detection system.

One important thing to note is that the word cloud displays only the frequency of words, not
necessarily the importance of words. It is therefore necessary to do some data cleaning, such as
removing stop words, punctuation and so on, from the data before visualizing it.
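
As a rough illustration of how figures like these can be produced, the sketch below uses the open-source wordcloud package together with matplotlib; the placeholder text stands in for the real concatenated spam email bodies:

# A minimal word cloud sketch (assumes: pip install wordcloud matplotlib)
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# Placeholder text; in practice, concatenate the (cleaned) spam email bodies
spam_text = "free money free offer click now limited offer money"

wc = WordCloud(stopwords=STOPWORDS, background_color="white",
               max_words=200).generate(spam_text)
plt.imshow(wc, interpolation="bilinear")  # render the generated word cloud image
plt.axis("off")
plt.title("Most frequent words in spam emails")
plt.show()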

N-grams model visualization

Another visualization technique is to use a bar chart to display the frequency of the most common words. "N-gram" refers to how many consecutive words are treated as a single unit when you calculate word frequencies.

Examples of a 1-gram and a 2-gram model are shown in Figures 9 and 10. You can definitely experiment with a larger n-gram model.
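
A possible way to compute such n-gram frequencies is sketched below with scikit-learn's CountVectorizer; the tiny placeholder corpus stands in for the real cleaned email texts, and the returned counts can be fed to any bar chart function:

# A minimal n-gram counting sketch
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder corpus; in practice use the cleaned email texts
emails = ["free money now", "meeting tomorrow morning", "free money offer"]

def top_ngrams(corpus, n=1, top_k=10):
    # build a vocabulary of n-grams of length n and count them over the corpus
    vec = CountVectorizer(ngram_range=(n, n)).fit(corpus)
    counts = vec.transform(corpus).sum(axis=0)
    freqs = [(term, counts[0, idx]) for term, idx in vec.vocabulary_.items()]
    return sorted(freqs, key=lambda x: x[1], reverse=True)[:top_k]

print(top_ngrams(emails, n=1))  # 1-gram frequencies (compare Figure 9)
print(top_ngrams(emails, n=2))  # 2-gram frequencies (compare Figure 10)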

Figure 9: Bar chart visualization of 1-gram model

Figure 10: Bar chart visualization of 2-gram model

Train Test Split

It is important to divide your data set into a training set and test set, so that you can evaluate the
performance of your model using the test set before deploying it in a production environment.
One important thing to note when doing the train test split is to ensure that the data distribution between the training set and the test set is similar.

What this means in this context is that the percentage of spam emails in the training set and the test set
should be similar.
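
A common way to achieve this is a stratified split, sketched below with scikit-learn; the placeholder DataFrame stands in for the real email texts and labels (1 = spam, 0 = ham):

# A minimal stratified train test split sketch
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder data; in practice df holds the real email texts and labels
df = pd.DataFrame({"text": ["free money", "hi mum", "cheap meds", "project update",
                            "win a prize", "lunch today", "limited offer", "weekly report",
                            "claim reward", "meeting notes"],
                   "label": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]})

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"],
    test_size=0.2,          # roughly 80% train / 20% test
    stratify=df["label"],   # keeps the spam percentage similar in both sets
    random_state=42)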

Figure 11: Target Count For Train Data

Figure 12: Train Data Distribution

Figure 13: Target Count For Test Data

Figure 14: Test Data Distribution

The spam proportion in the train data and the test data is quite similar, around 20–21%, so we are good to go and can start processing our data!

Data Preprocessing

Text Cleaning

Text cleaning is a very important step in machine learning because your data may contain a lot of noise and unwanted characters such as punctuation, white space, numbers, hyperlinks and so on.

Some standard procedures that people generally use are listed below (a minimal cleaning sketch follows the list):

 converting all letters to lower/upper case

 removing numbers

 removing punctuation

 removing white spaces

 removing hyperlinks

 removing stop words such as a, about, above, down, doing and the list goes on…

 word stemming

 word lemmatization
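
A rough sketch combining most of these cleaning steps is shown below (it assumes nltk.download('stopwords') has already been run; stemming and lemmatization are covered separately next):

# A minimal text cleaning sketch (illustrative, not the only possible pipeline)
import re
import string
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

def clean_text(text):
    text = text.lower()                                   # lower case
    text = re.sub(r"http\S+|www\.\S+", " ", text)         # remove hyperlinks
    text = re.sub(r"\d+", " ", text)                      # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    words = [w for w in text.split() if w not in stop_words]          # remove stop words
    return " ".join(words)                                # also collapses extra white space

print(clean_text("FREE money!!! Visit http://example.com now, only 100 left"))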

The two techniques that may seem foreign to most people are word stemming and word
lemmatization. Both of these techniques try to reduce words to their most basic form, but they do so
with different approaches.

 Word stemming — Stemming algorithms work by removing the end or the beginning of a word, using a list of common prefixes and suffixes found in that language. Examples of word stemming for English words are as follows:

Form Suffix Stem


running -ing run
runs -s run
consolidate -ate consolid
consolidated -ated consolid

 Word Lemmatization — Lemmatization uses the dictionary of a particular language and tries to convert words back to their base form. It takes the meaning and part of speech of the word into account and converts it to the most suitable base form. Examples of word lemmatization for English words are as follows:

Form Lemma


studies study
breaks break
mice mouse

Implementing these two algorithms from scratch might be tricky and requires a lot of thought and design to deal with different edge cases.

Luckily, the NLTK library provides implementations of both algorithms, so we can use them out of the box!

Import the library and design some helper functions to understand the basic working of these two algorithms.

# Just import them and use it
# (the WordNetLemmatizer also needs the WordNet corpus: nltk.download('wordnet'))
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

dirty_text = "He studies in the house yesterday, unluckily, the fans breaks down"

def word_stemmer(words):
    # stem every token and join the results back into a single string
    stem_words = [stemmer.stem(o) for o in words]
    return " ".join(stem_words)

def word_lemmatizer(words):
    # lemmatize every token and join the results back into a single string
    lemma_words = [lemmatizer.lemmatize(o) for o in words]
    return " ".join(lemma_words)

The output of the word stemmer is very obvious: some of the word endings have been chopped off

clean_text = word_stemmer(dirty_text.split(" "))


clean_text

#Output
'He studi in the hous yesterday, unluckily, the fan break down'

The lemmatization has converted studies -> study, breaks -> break

clean_text = word_lemmatizer(dirty_text.split(" "))


clean_text

#Output
'He study in the house yesterday, unluckily, the fan break down'

Feature Extraction
Our algorithms always expect the input to be integers or floats, so we need a feature extraction layer in the middle to convert the words into numeric features.
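
One common choice is a bag-of-words or TF-IDF representation from scikit-learn, sketched below; X_train and X_test are assumed to be the email texts from the train test split step, and the actual system may well use a different vectorizer:

# A minimal feature extraction sketch using TF-IDF (an assumption, not
# necessarily the exact vectorizer used in the final design)
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)  # learn the vocabulary on the training set only
X_test_vec = vectorizer.transform(X_test)        # reuse that vocabulary for the test set
print(X_train_vec.shape)                         # (number of emails, vocabulary size)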
