Data Science
UNIT 2
Introduction to Data Science
Syllabus
• What is data science? Applications of data science, data science life cycle, tools for data science, definition
of AI, types of machine learning (ML), list of ML algorithms for classification, clustering, and feature
selection. Probability theory, Bayes' theorem, Bayes probability; Cartesian plane, equations of lines, graphs;
exponents.
• Introduction to SQL: experimental demonstrations of SQL commands (DDL, DML, DCL, TCL, DQL). Import SQL
Outline:
Examples
Review of Previous Lecture:
Creating an Excel sheet
Topic for the Lecture:
What is data science
Probability Theorem
Bayes Theorem
Objective and Outcome of Lecture:
Data Science:
The life cycle of data science:
The business requirement step deals with the
identification of the problem and objectives.
Explanation
• 1. SAS (Statistical Analysis System)
• SAS is a data science tool designed specifically for statistical operations. It is closed-source proprietary software used by large organizations to analyze data.
• 2. Apache Spark
• Apache Spark, or simply Spark, is a powerful analytics engine and one of the most widely used data science tools. It is designed to handle both batch processing and stream processing.
• 3. BigML
• BigML provides a fully interactive, cloud-based GUI environment that you can use for running machine learning algorithms.
• 4. D3.js
• JavaScript is mainly used as a client-side scripting language. D3.js, a JavaScript library, allows you to make interactive visualizations in your web browser.
• 5. MATLAB
• MATLAB is a multi-paradigm numerical computing environment for processing mathematical information.
• It is closed-source software that facilitates matrix functions, algorithmic implementation and statistical modeling of data. MATLAB is widely used in several scientific disciplines.
• 6. Excel
• Excel is probably the most widely used data analysis tool. Microsoft developed Excel mostly for spreadsheet calculations, and today it is widely used for data processing, visualization, and complex calculations.
• 7. ggplot2
• ggplot2 is an advanced data visualization package for the R programming language. Its developers created it as a replacement for R's native graphics package, and it uses powerful commands to create polished visualizations. It is one of the most widely used libraries that data scientists use for creating visualizations from analyzed data.
• 8. Tableau
• Tableau is data visualization software packed with powerful graphics for making interactive visualizations. It is focused on industries working in the field of business intelligence.
• The most important aspect of Tableau is its ability to interface with databases, spreadsheets, OLAP (Online Analytical Processing) cubes, etc. Along with these features, Tableau can visualize geographical data and plot longitudes and latitudes on maps.
Machine Learning
“Machine learning enables a machine to automatically learn from data, improve performance from experiences, and predict things without being explicitly programmed.”
Key differences between AI and ML
Types of machine learning (ML)
Supervised learning
• Supervised learning, as the name indicates, involves the presence of a supervisor acting as a teacher.
• In supervised learning we teach or train the machine using well-labeled data, that is, data already tagged with the correct answer.
• After that, the machine is provided with a new set of examples (data); the supervised learning algorithm, having analysed the training data (the set of training examples), produces an outcome for the new data from what it learned on the labeled data.
Unsupervised learning
• Unsupervised learning is the training of a machine using information that is neither classified nor labeled, allowing the algorithm to act on that information without guidance.
• Here the task of the machine is to group unsorted information according to similarities, patterns and differences without any prior training on the data.
• Unlike supervised learning, no teacher is provided, which means no labeled training will be given to the machine.
• Therefore the machine is left to find the hidden structure in unlabeled data by itself.
Semi-supervised learning & Reinforcement learning
• Semi-supervised learning lies between supervised and unsupervised learning.
• It uses both labelled and unlabelled data for training.
• Reinforcement learning trains an algorithm with a reward system, providing feedback when an artificial intelligence agent performs the best action in a particular situation.
• In reinforcement learning, AI agents attempt to find the optimal way to accomplish a particular goal or improve performance on a specific task.
• As the agent takes actions that move it toward the goal, it receives a reward.
Examples / Applications
Difference
Regression
• Regression analysis is a statistical method for modelling the relationship between a dependent (target) variable and one or more independent (predictor) variables.
• Regression is a process of finding the correlations between dependent and independent variables.
• It helps in predicting continuous variables, such as market trends, house prices, etc.
ML Regression Algorithms
• Simple Linear Regression
• Multiple Linear Regression
• Polynomial Regression
• Support Vector Regression
• Decision Tree Regression
• Random Forest Regression
Classification
• A classification algorithm is a supervised learning technique used to identify the category of new observations on the basis of training data.
• In classification, a program learns from the given dataset or observations and then classifies each new observation into one of a number of classes or groups.
• Examples of such classes are Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc.
ML Classification Algorithms
• Logistic Regression
• K-Nearest Neighbours
• Support Vector Machines
• Kernel SVM
• Naïve Bayes
• Decision Tree Classification
• Random Forest Classification
Difference between Regression and Classification
Clustering
• Grouping similar data points together is called clustering.
• Clustering, or cluster analysis, is a machine learning technique that groups an unlabelled dataset.
Clustering Algorithms
• K-Means algorithm
• Agglomerative Hierarchical algorithm
• Mean-shift algorithm
• DBSCAN algorithm (Density-Based Spatial Clustering of Applications with Noise)
• Expectation-Maximization (EM) clustering using GMM (Gaussian Mixture Model)
Association Rule
• Association rule learning is a type of unsupervised learning technique that checks for the dependency of one data item on another data item and maps them accordingly so that the relationship can be used more profitably. It tries to find interesting relations or associations among the variables of a dataset, using different rules to discover these relations between variables in the database.
Feature selection
• In machine learning and statistics, feature selection is also known as variable selection, attribute selection or variable subset selection.
• It is the process of selecting a subset of relevant features (variables, predictors) for use in model construction.
• When the number of features is very large, you need not use every feature at your disposal for creating an algorithm.
• You can assist your algorithm by feeding in only those features that are really important.
• Machine learning works on a simple rule: if you put garbage in, you will only get garbage out (here, garbage means noise). “Sometimes, less is better!”
Top reasons to use feature selection are:
• It enables the machine learning algorithm to train faster.
• It reduces the complexity of a model and makes it easier to interpret.
• It improves the accuracy of a model if the right subset is chosen.
• It reduces overfitting.
ML Feature selection Algorithms
Filter Methods: Filter methods select features based on some criteria prior to building the model.
• Pearson’s Correlation
• Linear Discriminant Analysis (LDA)
• ANOVA (Analysis of variance)
• Chi-Square
Wrapper Methods
• Forward Selection
• Backward Elimination
• Recursive Feature Elimination
Embedded Methods: Embedded methods check the different training iterations of the machine learning model and evaluate the importance of each feature.
• Decision Tree
• ID3
• C4.5
• Classification And Regression Tree (CART)
Linear regression
y = mx + c + ε
• y = dependent variable (target variable)
• x = independent variable (predictor variable)
• c = y-intercept of the line
• m = slope
• ε = error
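To make the roles of m, c and ε concrete, here is a minimal sketch that estimates the slope and intercept by ordinary least squares. The function name fit_line and the data points are hypothetical, chosen only for illustration; least squares is one common way to fit the line, not necessarily the method intended in the lecture.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares estimates of slope m and intercept c for y = m*x + c."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    c = y_bar - m * x_bar
    return m, c

# Hypothetical data lying close to the line y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 9.0, 10.8]

m, c = fit_line(xs, ys)
# The residuals play the role of the error term ε for each observation
residuals = [y - (m * x + c) for x, y in zip(xs, ys)]
print(f"m = {m:.2f}, c = {c:.2f}")  # m = 1.95, c = 1.15
```

With a least-squares fit that includes an intercept, the residuals sum to (approximately) zero, which is a quick sanity check on the result.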
Probability theory
Probability of an event = (number of favourable outcomes) / (total number of outcomes)
This is the basic formula. There are further formulas
for different situations or events.
Probability theory
• For example,
When we toss a coin, we get either Head or Tail; only two
outcomes are possible (H, T).
Solution:
b) If there are 100 bottles in the container, how many of them are likely to be green?
Ans: The experiment implies that 450 out of 1000 bottles are green, so P(green) = 450/1000 = 0.45.
Therefore, out of 100 bottles, about 0.45 × 100 = 45 are likely to be green.
Probability Terms and Definition
Some of the important probability terms are
Question 2: Two dice are rolled. Find the probability
that the sum is:
a) equal to 1
b) equal to 4
c) less than 13
Solution:
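One way to check the three answers is to enumerate all 36 equally likely outcomes and count the favourable ones. The sketch below assumes two fair six-sided dice; the helper name probability is hypothetical.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair six-sided dice
outcomes = list(product(range(1, 7), repeat=2))

def probability(event):
    """P(event) = favourable outcomes / total outcomes."""
    favourable = sum(1 for roll in outcomes if event(roll))
    return Fraction(favourable, len(outcomes))

p_sum_is_1 = probability(lambda r: sum(r) == 1)      # impossible: the minimum sum is 2
p_sum_is_4 = probability(lambda r: sum(r) == 4)      # (1,3), (2,2), (3,1)
p_sum_below_13 = probability(lambda r: sum(r) < 13)  # certain: the maximum sum is 12

print(p_sum_is_1, p_sum_is_4, p_sum_below_13)  # 0 1/12 1
```

So the sum equal to 1 is an impossible event (probability 0), the sum equal to 4 has probability 3/36 = 1/12, and the sum less than 13 is a certain event (probability 1).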
Major Applications of Probability
Basics of Probability
Non-mutually exclusive events
• For non-mutually exclusive events, the probability of A or B is the sum of the probabilities of A and B minus the probability of A and B, i.e.
• P(A or B) = P(A) + P(B) − P(A and B), or equivalently P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
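The rule can be verified on a small sample space. The sketch below assumes one roll of a fair die; the events A (even number) and B (number greater than 3) are hypothetical choices that overlap, so they are not mutually exclusive.

```python
from fractions import Fraction

sample_space = set(range(1, 7))                 # one roll of a fair die
A = {n for n in sample_space if n % 2 == 0}     # even number: {2, 4, 6}
B = {n for n in sample_space if n > 3}          # greater than 3: {4, 5, 6}

def p(event):
    """Probability of an event as favourable / total outcomes."""
    return Fraction(len(event), len(sample_space))

lhs = p(A | B)                   # P(A or B), counted directly from the union
rhs = p(A) + p(B) - p(A & B)     # P(A) + P(B) - P(A and B)
print(lhs, rhs)  # 2/3 2/3
```

Without subtracting P(A and B), the outcomes 4 and 6 would be counted twice.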
2. Multiplication Rule
The set A ∩ B denotes the simultaneous occurrence of events A and B, that
is, the set in which both event A and event B have occurred.
Sometimes the occurrence of the first event affects the probability of the
second event. In that case A and B are dependent events and, from the theorem,
we have P(A ∩ B) = P(A) P(B | A). If A and B are independent events,
P(B | A) = P(B), so P(A ∩ B) = P(A) P(B).
Dependent Event (Conditional Probability)
• The conditional probability of an event B in relation to an event A is the probability that event B occurs
given that event A has already occurred.
• P(B | A) = P(A and B) / P(A), provided P(A) > 0.
Problem 1:
• A math teacher gave her class two tests. 25% of the class passed both tests and 42% of the class passed the
first test. What percent of those who passed the first test also passed the second test?
Answer:
P(Second | First) = P(First and Second) / P(First)
= 0.25 / 0.42 ≈ 0.60
= 60% (approximately)
Problem 2:
• A jar contains black and white marbles. Two marbles are chosen without replacement. The probability of
selecting a black marble and then a white marble is 0.34, and the probability of selecting a black marble on
the first draw is 0.47. What is the probability of selecting a white marble on the second draw, given that the
first marble drawn was black?
• Answer:
• P(White | Black) = P(Black and White) / P(Black)
= 0.34 / 0.47
≈ 0.72
= 72% (approximately)
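Both problems apply the same formula, so they can be checked with one small helper. This is a minimal sketch; the function name conditional is hypothetical, and the inputs are the figures given in Problems 1 and 2.

```python
def conditional(p_a_and_b, p_a):
    """P(B | A) = P(A and B) / P(A); requires 0 < P(A) <= 1."""
    if not 0 < p_a <= 1:
        raise ValueError("P(A) must be in (0, 1]")
    return p_a_and_b / p_a

p_second_given_first = conditional(0.25, 0.42)  # Problem 1
p_white_given_black = conditional(0.34, 0.47)   # Problem 2

print(f"{p_second_given_first:.1%}")  # 59.5%
print(f"{p_white_given_black:.1%}")   # 72.3%
```

The unrounded values show that the 60% and 72% in the worked answers are approximations.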
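Bayes Theorem
• Bayes' theorem (alternatively Bayes' law or Bayes' rule), named after
Reverend Thomas Bayes, describes the probability of an event based
on prior knowledge of conditions that might be related to the event.
Statement of theorem
• If E1, E2, ..., En are mutually exclusive and exhaustive events with non-zero probabilities, and A is any event with P(A) > 0, then
• P(Ei|A) = P(Ei)P(A|Ei) / [P(E1)P(A|E1) + P(E2)P(A|E2) + ... + P(En)P(A|En)]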
Example 1:
• Bag I contains 4 white and 6 black balls, while Bag II contains 4 white and 3 black balls. One ball is
drawn at random from one of the bags, and it is found to be black. Find the probability that it was drawn from
Bag I.
• Solution:
• Let E1 be the event of choosing Bag I, E2 the event of choosing Bag II, and A the event of drawing
a black ball.
• Then, P(E1) = P(E2) = 1/2
• Also, P(A|E1) = P(drawing a black ball from Bag I) = 6/10 = 3/5
• P(A|E2) = P(drawing a black ball from Bag II) = 3/7
• By Bayes’ theorem, the probability that the ball was drawn from Bag I, given that it is black, is
• P(E1|A) = P(E1)P(A|E1) / [P(E1)P(A|E1) + P(E2)P(A|E2)]
• = (1/2 × 3/5) / (1/2 × 3/5 + 1/2 × 3/7) = (3/10) / (36/70) = 7/12
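The calculation in Example 1 can be reproduced with exact fractions. The sketch below is one possible way to organise it; the function name bayes_posterior is hypothetical, and the priors and likelihoods are the values from the example.

```python
from fractions import Fraction

def bayes_posterior(priors, likelihoods):
    """Return P(Ei | A) for each hypothesis Ei.

    priors[i] is P(Ei) and likelihoods[i] is P(A | Ei).
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(Ei) * P(A | Ei)
    total = sum(joint)                                    # P(A), by total probability
    return [j / total for j in joint]

priors = [Fraction(1, 2), Fraction(1, 2)]         # P(E1), P(E2): either bag is equally likely
likelihoods = [Fraction(6, 10), Fraction(3, 7)]   # P(black | Bag I), P(black | Bag II)

posterior = bayes_posterior(priors, likelihoods)
print(posterior[0])  # 7/12
```

The second entry, posterior[1], is the probability that the black ball came from Bag II; the two posteriors add up to 1.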
Example 2:
Problem on Bayes Theorem
Assignment Example 2:
• A man is known to speak the truth 2 out of 3 times. He throws a die and reports that the
number obtained is a four. Find the probability that the number obtained is actually a four.
• Solution:
• Let A be the event that the man reports that a four is obtained.
• Let E1 be the event that a four is obtained and E2 be its complementary event.
• Then, P(E1) = Probability that a four occurs = 1/6
• P(E2) = Probability that a four does not occur = 1 − P(E1) = 1 − 1/6 = 5/6
• Also, P(A|E1) = Probability that the man reports a four given that it is actually a four (he tells the truth) = 2/3
• P(A|E2) = Probability that the man reports a four given that it is not a four (he lies) = 1/3
• By Bayes’ theorem, the probability that the number obtained is actually a four is
• P(E1|A) = P(E1)P(A|E1) / [P(E1)P(A|E1) + P(E2)P(A|E2)] = (1/6 × 2/3) / (1/6 × 2/3 + 5/6 × 1/3) = 2/7
Problem on Bayes Theorem
1. In a bolt factory, machines A, B and C manufacture 25%, 35% and 40% of the total output respectively. Of their outputs, 5%, 4% and 2% respectively are defective.
A bolt is drawn at random and is found to be defective. What is the
probability that it was manufactured by machine C?
Solution:
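One way to reason about this problem is to imagine a batch of bolts and count the defective ones from each machine. The sketch below is not necessarily the slide's own solution; the batch size of 10,000 is an arbitrary assumption and does not affect the answer.

```python
from fractions import Fraction

total_bolts = 10_000  # hypothetical batch size; any size gives the same ratio

share = {"A": Fraction(25, 100), "B": Fraction(35, 100), "C": Fraction(40, 100)}
defect_rate = {"A": Fraction(5, 100), "B": Fraction(4, 100), "C": Fraction(2, 100)}

# Expected number of defective bolts from each machine: A -> 125, B -> 140, C -> 80
defective = {m: total_bolts * share[m] * defect_rate[m] for m in share}

# Among all defective bolts, the fraction that came from machine C
p_c_given_defective = defective["C"] / sum(defective.values())
print(p_c_given_defective)  # 16/69
```

This matches Bayes' theorem: (0.40 × 0.02) / (0.25 × 0.05 + 0.35 × 0.04 + 0.40 × 0.02) = 0.008 / 0.0345 = 16/69, or about 0.23.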
Another Way to Solve
Assignment
Problem on Bayes Theorem
2. An insurance company insured 2000 scooter drivers, 4000 car
drivers and 6000 truck drivers. The probabilities of an accident involving
a scooter, a car and a truck are 0.01, 0.03 and 0.15 respectively. One of
the insured persons meets with an accident. What is the probability that
he is a scooter driver?
Solution:
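A possible worked solution applies Bayes' theorem directly. The symbols are assumptions introduced here: E1, E2 and E3 denote the events that the insured person is a scooter, car and truck driver respectively, and A denotes the event that the person meets with an accident. The priors come from the 12000 insured drivers in total.

```latex
\begin{align*}
P(E_1) &= \tfrac{2000}{12000} = \tfrac{1}{6}, \qquad
P(E_2) = \tfrac{4000}{12000} = \tfrac{1}{3}, \qquad
P(E_3) = \tfrac{6000}{12000} = \tfrac{1}{2} \\[4pt]
P(A \mid E_1) &= 0.01, \qquad P(A \mid E_2) = 0.03, \qquad P(A \mid E_3) = 0.15 \\[4pt]
P(E_1 \mid A)
  &= \frac{P(E_1)\,P(A \mid E_1)}
          {P(E_1)\,P(A \mid E_1) + P(E_2)\,P(A \mid E_2) + P(E_3)\,P(A \mid E_3)} \\[4pt]
  &= \frac{\tfrac{1}{6}(0.01)}
          {\tfrac{1}{6}(0.01) + \tfrac{1}{3}(0.03) + \tfrac{1}{2}(0.15)}
   = \frac{0.01}{0.01 + 0.06 + 0.45}
   = \frac{1}{52}
\end{align*}
```

The last step multiplies the numerator and denominator by 6. The result, about 0.019, is small because scooter drivers are both the smallest group and the least likely to have an accident.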
Cartesian Plane
The Cartesian plane is a two-dimensional coordinate plane formed by
the intersection of two perpendicular lines. The horizontal line is known
as the X-axis, and the vertical line is known as the Y-axis. The coordinate point
(x, y) on the Cartesian plane says that the horizontal distance of the
point from the origin is |x|, and the vertical distance is |y|. If x is
positive, the point is to the right of the origin; if x is negative, it is to the left.
Similarly, if y is positive, the point is |y| units above the
origin; if y is negative, it is |y| units below it.
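The sign rules above can be written as a small function. This is an illustrative sketch only; the name describe_point is hypothetical, and a zero coordinate is reported as lying on the corresponding axis.

```python
def describe_point(x, y):
    """Describe where (x, y) lies relative to the origin, using the signs of x and y."""
    if x > 0:
        horizontal = f"{x} units right"
    elif x < 0:
        horizontal = f"{-x} units left"
    else:
        horizontal = "on the Y-axis"

    if y > 0:
        vertical = f"{y} units above"
    elif y < 0:
        vertical = f"{-y} units below"
    else:
        vertical = "on the X-axis"

    return horizontal, vertical

print(describe_point(3, -2))   # ('3 units right', '2 units below')
print(describe_point(-4, 5))   # ('4 units left', '5 units above')
```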
Equation of lines:
Graphs:
Graphs are also used in the data science and analytics field to
model various structures and problems.
Trivial Graph:
Simple Graph:
• A simple graph is a graph that has at most one edge between any pair of vertices and no self-loops. A simple railway track connecting
different cities is an example of a simple graph.
Multi Graph:
• Any graph that contains some parallel edges but does not contain any self-loop is called a multigraph. A road map is an example.
• Parallel Edges: If two vertices are connected by more than one edge, such edges are called parallel edges, like many routes leading to one destination.
• Loop: An edge of a graph that starts from a vertex and ends at the same vertex is called a loop or a self-loop.
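The definitions of simple graph, multigraph, parallel edges and loops can be turned into a small check on an edge list. The sketch below assumes an undirected graph given as a list of (u, v) pairs; the function name classify_graph and the city names are hypothetical.

```python
from collections import Counter

def classify_graph(edges):
    """Classify an undirected graph given as a list of (u, v) edges."""
    # A loop is an edge that starts and ends at the same vertex
    has_loop = any(u == v for u, v in edges)
    # Treat (u, v) and (v, u) as the same undirected edge when counting
    counts = Counter(frozenset((u, v)) for u, v in edges if u != v)
    # Parallel edges: the same pair of vertices connected more than once
    has_parallel = any(n > 1 for n in counts.values())

    if has_loop:
        return "graph with self-loops"
    return "multigraph" if has_parallel else "simple graph"

railway = [("Pune", "Mumbai"), ("Mumbai", "Delhi")]
road_map = [("Pune", "Mumbai"), ("Mumbai", "Pune"), ("Mumbai", "Delhi")]

print(classify_graph(railway))   # simple graph
print(classify_graph(road_map))  # multigraph
```

Here the road map has two routes between Pune and Mumbai, which are parallel edges, so it is a multigraph rather than a simple graph.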
Exponents:
1. DDL – Data Definition Language: used to create and modify the structure of
objects in a database using predefined commands and a specific syntax. These
database objects include tables, sequences, locations, aliases, schemas and
indexes.
3. DCL – Data Control Language: its commands (such as GRANT and REVOKE) are administrative and
control which users have access to the database.
To create a table with multiple columns, list each column name followed by its data type.
The column parameters specify the names of the columns of the table.
The data type parameter specifies the type of data the column can hold (e.g.
varchar, integer, date, etc.).
CREATE TABLE
The LastName, FirstName, Address, and City columns are of type varchar and
hold characters; the maximum length for these fields is 255
characters.
INSERT VALUES INTO TABLE
Syntax
The first way specifies both the column names and the values to be inserted.
The second way omits the column names: if you are adding values for all the columns of the table, you do not need to specify the column
names in the SQL query. However, make sure that the values are in the same
order as the columns in the table.
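The two ways of inserting can be tried out without a MySQL server by using Python's built-in sqlite3 module as a stand-in. This is only a sketch under stated assumptions: the table name Persons and the sample rows are hypothetical, SQLite's type handling differs from MySQL's, and the ? placeholders are sqlite3's parameter style.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Table with the four varchar(255) columns described above (table name is assumed)
conn.execute(
    """
    CREATE TABLE Persons (
        LastName  varchar(255),
        FirstName varchar(255),
        Address   varchar(255),
        City      varchar(255)
    )
    """
)

# First way: specify both the column names and the values
conn.execute(
    "INSERT INTO Persons (LastName, FirstName, Address, City) VALUES (?, ?, ?, ?)",
    ("Rao", "Asha", "12 Lake Road", "Pune"),
)

# Second way: omit the column names; values must follow the table's column order
conn.execute(
    "INSERT INTO Persons VALUES (?, ?, ?, ?)",
    ("Khan", "Imran", "7 Hill Street", "Delhi"),
)

rows = conn.execute("SELECT * FROM Persons").fetchall()
print(rows)
conn.close()
```

If the values in the second form were given in a different order, they would still be stored, but in the wrong columns, which is why the ordering matters.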
SELECT
Display the contents of the table
Syntax:
Select * from table_name
Example:
Select * from tasks
DESCRIBE TABLE
To view the structure / schema of a table
Syntax:
DESCRIBE table_name
DESC table_name
Example:
DELETE
To delete rows from a table
Syntax:
DELETE FROM table_name
DELETE FROM table_name WHERE condition
Example:
DELETE FROM tasks
DELETE FROM tasks WHERE task_id=1
UPDATE
To update a value in a table
Syntax:
UPDATE table_name SET field1 = new-value1, field2 = new-value2 [WHERE Clause]
Example:
UPDATE tasks SET task_name='xyz' WHERE task_id=1
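The UPDATE and DELETE statements can be exercised end to end with Python's built-in sqlite3 module standing in for MySQL. This is a sketch under assumptions: the tasks table definition and its two sample rows are hypothetical, and the standard DELETE FROM form (without an asterisk) is used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Hypothetical tasks table with two sample rows
conn.execute("CREATE TABLE tasks (task_id INTEGER PRIMARY KEY, task_name TEXT)")
conn.executemany(
    "INSERT INTO tasks (task_id, task_name) VALUES (?, ?)",
    [(1, "abc"), (2, "def")],
)

# UPDATE changes only the rows matched by the WHERE clause
conn.execute("UPDATE tasks SET task_name = 'xyz' WHERE task_id = 1")
after_update = conn.execute("SELECT * FROM tasks ORDER BY task_id").fetchall()

# DELETE with a WHERE clause removes only the matching rows
conn.execute("DELETE FROM tasks WHERE task_id = 1")
after_delete = conn.execute("SELECT * FROM tasks ORDER BY task_id").fetchall()

# DELETE without a WHERE clause removes every row but keeps the table
conn.execute("DELETE FROM tasks")
after_delete_all = conn.execute("SELECT * FROM tasks").fetchall()

print(after_update)      # [(1, 'xyz'), (2, 'def')]
print(after_delete)      # [(2, 'def')]
print(after_delete_all)  # []
conn.close()
```

Leaving out the WHERE clause in UPDATE or DELETE affects every row, so it is worth running a SELECT with the same condition first.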
DROP
TRUNCATE
MYSQL DATA TYPES
1. NUMERIC DATA TYPE
2. DATETIME DATA TYPE
3. STRING DATA TYPE
NUMERIC DATA TYPE
DATETIME DATA TYPE
STRING DATA TYPE
HOW TO IMPORT MYSQL DATABASE INTO EXCEL
1. Create a new workbook in MS Excel
2. Click on the DATA tab
3. Select the From Other Sources button
4. Select From SQL Server as shown in the image
5. Enter the server name/IP address. For this tutorial, we connect to localhost (127.0.0.1)
6. Choose the login type. If you are on a local machine and have Windows authentication enabled, you can use Windows authentication.
7. If you are connecting to a remote server, you will need to provide user ID and password details.
8. Click on the Next button
CONTINU…