Data Science Complete Theory
Tushar B. Kute,
http://tusharkute.com
What is data modeling?
• Fact
• Dimension
• Attributes
• Fact table
• Dimension Table
Elements of multidimensional modeling
• Fact
– Facts are the measurements or metrics from your business
process. For a Sales business process, a measurement would be
the quarterly sales number.
• Dimension
– A dimension provides the context surrounding a business process
event. In simple terms, dimensions give the who, what, and where of
a fact. In the Sales business process, for the fact quarterly sales
number, the dimensions would be
• Who – Customer Names
• Where – Location
• What – Product Name
• In other words, a dimension is a window to view information in the facts.
Elements of multidimensional modeling
• Attributes
– The Attributes are the various characteristics of
the dimension in dimensional data modeling.
• In the Location dimension, the attributes can be
– State
– Country
– Zipcode etc.
• Attributes are used to search, filter, or classify
facts. Dimension Tables contain Attributes
Elements of multidimensional modeling
• Fact Table
– A fact table is the primary table in dimensional
modeling.
• A Fact Table contains
– Measurements/facts
– Foreign key to dimension table
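As a rough illustration of how a fact table and a dimension table fit together, the sketch below uses plain Python dictionaries. The table contents, the key names (`location_id`, `product_id`), and the sales figures are invented for the example and are not taken from the slides.

```python
from collections import defaultdict

# Dimension table: one row per location, keyed by a surrogate key.
# The attributes (state, country, zipcode) describe the dimension.
location_dim = {
    1: {"state": "Maharashtra", "country": "India", "zipcode": "411001"},
    2: {"state": "Karnataka", "country": "India", "zipcode": "560001"},
}

# Fact table: each row holds a measurement plus foreign keys to dimensions.
sales_fact = [
    {"location_id": 1, "product_id": 10, "quarterly_sales": 1200.0},
    {"location_id": 1, "product_id": 11, "quarterly_sales": 800.0},
    {"location_id": 2, "product_id": 10, "quarterly_sales": 500.0},
]


def sales_by_attribute(facts, dimension, attribute):
    """Sum the sales measure, grouped by one attribute of the dimension."""
    totals = defaultdict(float)
    for row in facts:
        # Follow the foreign key from the fact row to its dimension row.
        dim_row = dimension[row["location_id"]]
        totals[dim_row[attribute]] += row["quarterly_sales"]
    return dict(totals)


sales_by_state = sales_by_attribute(sales_fact, location_dim, "state")
sales_by_country = sales_by_attribute(sales_fact, location_dim, "country")
print(sales_by_state)
print(sales_by_country)
```

The same fact rows can be viewed by state or by country only because the attributes live in the dimension table, which is the sense in which a dimension is a window onto the facts.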
Elements of multidimensional modeling
• Standardization
• Covariance Matrix Computation
• Compute Eigenvectors and Eigenvalues
• Feature vector
• Recast the Data Along the Principal
Components Axes
Standardization
If we rank the eigenvalues in descending order, we get
λ1 > λ2, which means that the eigenvector corresponding
to the first principal component (PC1) is v1 and the one
corresponding to the second component (PC2) is v2.
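The five steps listed earlier (standardization, covariance matrix, eigen decomposition, feature vector, recasting the data) can be sketched with NumPy as follows. The data values are invented, and keeping only one component is an arbitrary choice made for the illustration.

```python
import numpy as np

# Toy data: 6 samples, 2 correlated features (values invented for illustration).
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

# 1. Standardization: zero mean and unit (population) variance per feature.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features.
cov = np.cov(Z, rowvar=False)

# 3. Eigenvectors and eigenvalues (eigh is meant for symmetric matrices).
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Rank the eigenvalues in descending order so that lambda1 >= lambda2.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# 4. Feature vector: keep the top-k eigenvectors as columns (here k = 1).
feature_vector = eigenvectors[:, :1]

# 5. Recast the data along the principal component axes.
projected = Z @ feature_vector

# Share of the total variance carried by each component.
explained = eigenvalues / eigenvalues.sum()
print(explained)
```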
Feature Vector
• Subspace Clustering
• Projected Clustering
• Projection Based Clustering
• Correlation Clustering
Subspace Clustering
Web Resources
https://mitu.co.in
http://tusharkute.com
Types of Data Visualization
Types of Graphs
• Column Chart
• Bar Graph
• Stacked Bar Graph
• Area Chart
• Dual Axis Chart
• Line Graph
• Pie Chart
• Waterfall Chart
• Scatter Plot Chart
• Histogram
• Funnel Chart
• Heat Map
Bar Graph
Histogram – When to use?
Dashboard
Agenda
• Definition of Dashboard,
• Their type
• Evolution of dashboard
• Dashboard design and principles
• Display media for dashboard.
Dashboard
• Operational dashboards
– tell you what is happening now.
• Strategic dashboards
– track key performance indicators.
• Analytical dashboards
– process data to identify trends.
• Tactical dashboards
– used by mid-management to track
performance.
Strategic Dashboard
• Bullet graphs
• Bar graphs (horizontal and vertical)
• Stacked bar graphs (horizontal and vertical)
• Combination bar and line graphs
• Line graphs
• Sparklines
• Box plots
• Scatter plots
• Treemaps
Icon
Data Visualization
Agenda
• Data visualization
– What? Why?
– Benefits
– Techniques
– Who uses it?
– Challenges
Data Visualization
Big Data Solutions
Traditional Enterprise Approach
Limitation
• This approach works well for applications whose
data volume can be accommodated by a standard
database server, or that stays within the limit of the
processor handling the data.
• But when the data grows to huge volumes,
pushing all of it through a single database
becomes a bottleneck.
Google's Solution
MapReduce
What is MapReduce?
Multi Layer Perceptron
Neural Network
• Weights initialization –
– It is necessary to set initial weights for the first forward
pass. Two basic options are to set all weights to zero or to
randomize them.
– However, naive choices can result in a vanishing or exploding
gradient, which makes the model difficult to train.
– To mitigate this problem, you can use a heuristic (a
formula tied to the number of inputs and outputs of each
layer) to determine the scale of the initial weights.
– A common heuristic used for the Tanh activation is
called Xavier initialization.
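A minimal sketch of the uniform variant of Xavier (Glorot) initialization, using only the standard library. The function name `xavier_uniform`, the layer sizes, and the fixed seed are assumptions made for the example.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng):
    """Return a fan_in x fan_out weight matrix drawn from U(-limit, limit).

    limit = sqrt(6 / (fan_in + fan_out)) is the Xavier/Glorot uniform rule,
    where fan_in and fan_out are the layer's input and output counts.
    """
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
weights = xavier_uniform(fan_in=4, fan_out=3, rng=rng)
limit = math.sqrt(6.0 / (4 + 3))
print(limit)
```

Wider layers get a smaller `limit`, which keeps the variance of the activations roughly stable from one layer to the next.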
Hyperparameters of training algo
• Sigmoid / Logistic
• Tanh / Hyperbolic Tangent
• ReLU (Rectified Linear Unit)
• Leaky ReLU
• Parametric ReLU
• Softmax
• Swish
Sigmoid / Logistic
• Advantages
– Smooth gradient, preventing “jumps” in output values.
– Output values bound between 0 and 1, normalizing the output of
each neuron.
– Clear predictions: for X above 2 or below -2, the Y value
(the prediction) is pushed toward the edge of the curve, very close
to 1 or 0.
• Disadvantages
– Vanishing gradient: for very high or very low values of X, there is
almost no change in the prediction, so the gradient becomes
vanishingly small. The network may then stop learning, or
become too slow to reach an accurate prediction.
– Outputs are not zero centered.
– Computationally expensive, because it requires evaluating an exponential.
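The vanishing-gradient behaviour can be seen by evaluating the sigmoid and its derivative at a few points. This is a small standard-library sketch; the sample inputs are arbitrary.

```python
import math


def sigmoid(x):
    """Logistic function: maps any real x into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x)), largest at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x={x:6.1f}  sigmoid={sigmoid(x):.5f}  gradient={sigmoid_grad(x):.5f}")
```

The gradient peaks at 0.25 for x = 0 and shrinks rapidly as |x| grows, which is why saturated neurons pass back almost no learning signal.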
Tanh
• Advantages
– Zero centered—making it easier to model inputs
that have strongly negative, neutral, and
strongly positive values.
– Otherwise like the Sigmoid function.
• Disadvantages
– Like the Sigmoid function
ReLU (Rectified Linear Unit)
• Advantages
– Computationally efficient—allows the network to
converge very quickly
– Non-linear—although it looks like a linear function,
ReLU has a derivative function and allows for
backpropagation
• Disadvantages
– The dying ReLU problem: when inputs are negative, the
gradient of the function is zero, so the affected neurons
receive no updates during backpropagation and stop
learning.
Leaky ReLU
• Advantages
– Prevents dying ReLU problem—this variation of
ReLU has a small positive slope in the negative
area, so it does enable backpropagation, even
for negative input values
– Otherwise like ReLU
• Disadvantages
– Results not consistent—leaky ReLU does not
provide consistent predictions for negative
input values.
Leaky ReLU
Parametric ReLU
• Advantages
– Allows the negative slope to be learned: unlike
leaky ReLU, this function treats the slope α of
the negative part as a parameter rather than a fixed constant.
– It is, therefore, possible to perform
backpropagation and learn the most appropriate
value of α.
– Otherwise like ReLU
• Disadvantages
– May perform differently for different problems.
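The three ReLU variants differ only in how they treat negative inputs, as the following sketch shows. The default leaky slope of 0.01 is a commonly used value, assumed here rather than taken from the slides.

```python
def relu(x):
    """Standard ReLU: negative inputs are clamped to zero."""
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    """Leaky ReLU: a small fixed slope keeps a gradient for negative inputs."""
    return x if x > 0 else slope * x


def prelu(x, alpha):
    """Parametric ReLU: same shape as leaky ReLU, but alpha is learned."""
    return x if x > 0 else alpha * x


def prelu_grad_alpha(x):
    """Derivative of prelu with respect to alpha: x for x <= 0, else 0."""
    return 0.0 if x > 0 else x


for x in (-3.0, -0.5, 0.0, 2.0):
    print(x, relu(x), leaky_relu(x), prelu(x, alpha=0.25))
```

Because `prelu_grad_alpha` is non-zero for negative inputs, backpropagation can update α alongside the ordinary weights.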
Softmax
• Advantages
– Able to handle multiple classes, where the other
activation functions handle only one: it exponentiates the
score for each class and divides by the sum, so every output
lies between 0 and 1 and gives the probability of the input
being in a specific class.
– Useful for output neurons—typically Softmax is used
only for the output layer, for neural networks that
need to classify inputs into multiple categories.
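A short sketch of Softmax over a list of class scores. Subtracting the maximum score before exponentiating is a standard numerical-stability step and does not change the result; the input scores are made up.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    largest = max(scores)  # shift for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs)
```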
Swish
Thank you
This presentation is created using LibreOffice Impress 5.1.6.2, can be used freely as per GNU General Public License
Random Forest
Random Forest
Source: medium.com
Regressor Output
Source: medium.com
Advantages
• www.pythonprogramminglanguage.com
• www.scikit-learn.org
• www.towardsdatascience.com
• www.medium.com
• www.analyticsvidhya.com
• www.kaggle.com
• www.stephacking.com
• www.github.com
Decision Tree
Let's see the example...
• When to stop
– no more input features
– all examples are classified the same
– too few examples to make an informative split
• Which test to split on
– the split that gives the smallest error.
– With multi-valued features, either
• split on all values, or
• split the values into two halves.
Which attribute is best ?
• A measure for
– uncertainty
– purity
– information content
• Information theory: an optimal-length code assigns $-\log_2 p$ bits to a
message having probability $p$
• S is a sample of training examples
– $p_{+}$ is the proportion of positive examples in S
– $p_{-}$ is the proportion of negative examples in S
• Entropy of S: the average optimal number of bits needed to encode
the class (positive or negative) of an example drawn from S
$\mathrm{Entropy}(S) = p_{+}(-\log_2 p_{+}) + p_{-}(-\log_2 p_{-}) = -p_{+}\log_2 p_{+} - p_{-}\log_2 p_{-}$
Entropy
Humidity: E = 0.985 and E = 0.592 for its two values
Gain(S, Humidity) = 0.940 - (7/14)*0.985 - (7/14)*0.592 = 0.151
Wind: E = 0.811 and E = 1.0 for its two values
Gain(S, Wind) = 0.940 - (8/14)*0.811 - (6/14)*1.0 = 0.048
Humidity provides greater information gain than Wind with respect to the target classification.
Selecting next attribute
S=[9+,5-]
E=0.940
Outlook
Gain(S,Outlook)
=0.940-(5/14)*0.971
-(4/14)*0.0 – (5/14)*0.971
=0.247
Selecting next attribute
Note: $0 \log_2 0 = 0$ (by convention)
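The entropy and gain figures above can be reproduced with a few lines of Python. The per-branch class counts used below, for example [3+, 4-] and [6+, 1-] for Humidity, are an assumption: they are the counts that match the weights and entropies shown for S = [9+, 5-], but they do not appear in the extracted text.

```python
import math


def entropy(pos, neg):
    """Entropy of a sample with pos positive and neg negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:  # 0 * log2(0) is taken to be 0
            p = count / total
            result -= p * math.log2(p)
    return result


def gain(parent, branches):
    """Information gain of a split; parent and each branch are (pos, neg)."""
    n = sum(parent)
    weighted = sum((p + q) / n * entropy(p, q) for p, q in branches)
    return entropy(*parent) - weighted


S = (9, 5)
# Assumed class counts per attribute value (not given in the text).
splits = {
    "Humidity": [(3, 4), (6, 1)],
    "Wind": [(6, 2), (3, 3)],
    "Outlook": [(2, 3), (4, 0), (3, 2)],
}
gains = {name: gain(S, branches) for name, branches in splits.items()}
best = max(gains, key=gains.get)
print(gains, best)
```

Because this sketch does not round the intermediate entropies, the Humidity gain comes out slightly above the 0.151 obtained from the rounded values; the ranking of the attributes is unchanged, with Outlook first.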
Packages needed
• Data Analytics
– sudo pip3 install pandas
• Decision Tree Algorithm
– sudo pip3 install scikit-learn
• Visualization
– sudo pip3 install ipython
– sudo pip3 install graphviz
– sudo pip3 install pydotplus
– sudo apt install graphviz
Simplified Decision Tree
Decision Tree Classification
• train_test_split(*arrays, **options)
– Split arrays or matrices into random train and
test subsets
– *arrays : sequence of indexables with same
length / shape[0]
• Allowed inputs are lists, numpy arrays, scipy-sparse
matrices or pandas dataframes.
– test_size : float, int, or None (default is None)
• If float, should be between 0.0 and 1.0 and represent
the proportion of the dataset to include in the test
split.
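A brief usage sketch of `train_test_split`. The Iris dataset, the 0.2 test fraction, and the `random_state` value are choices made for the example, not something prescribed by the slides.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Iris ships with scikit-learn: 150 samples, 4 features.
X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```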
DecisionTreeClassifier
• Fitting your model to (i.e. using the .fit() method on) the
training data is essentially the training part of the
modeling process. It learns the model parameters (for a
decision tree, the split rules) for the algorithm being used.
• Then, for a classifier, you can classify incoming data points
(from a test set, or otherwise) using the predict method.
Or, in the case of regression, your model will
interpolate/extrapolate when predict is used on incoming
data points.
• It also should be noted that sometimes the "fit"
nomenclature is used for non-machine-learning methods,
such as scalers and other preprocessing steps.
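The fit and predict steps might look like this for a decision tree. The dataset, the `criterion="entropy"` setting, and the seed are assumptions chosen to line up with the entropy discussion earlier in the deck.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# fit() learns the split rules from the training data.
clf = DecisionTreeClassifier(criterion="entropy", random_state=42)
clf.fit(X_train, y_train)

# predict() classifies data points the tree has not seen.
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
acc = accuracy_score(y_test, y_pred)
print(cm)
print(acc)
```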
Characterizing the classifier
Output
Confusion matrix
Classification Report
Accuracy Score
Visualizing the tree
Tree
Naive Bayes Classifier using Python
Naive Bayes Classifier
Defective Spanners
Bayes Theorem
That’s intuitive
Exercise
Example:
Step-1
Step-2
Step-3
Naive Bayes – Step-1
Naive Bayes – Step-2
Naive Bayes – Step-3
Combining altogether
Naive Bayes – Step-4
Naive Bayes – Step-5
Types of model
Final Classification
Probability Distribution
Advantages
Classification
What is Classification?
• Lazy Learners –
– Lazy learners simply store the training data
and wait until test data appears.
– Classification is then done using the most
closely related data in the stored training data.
– They take more time to predict than
eager learners. E.g. k-nearest neighbor, case-based
reasoning.
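To make the lazy behaviour concrete, here is a minimal k-nearest-neighbor sketch using only the standard library. The class name, the toy points, and k = 3 are invented for the illustration.

```python
import math
from collections import Counter


class KNearestNeighbors:
    """Tiny k-NN classifier: storing the data is the whole training step."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, points, labels):
        # "Training" only stores the data; no model is built.
        self.points = list(points)
        self.labels = list(labels)
        return self

    def predict(self, query):
        # All the work happens at prediction time: measure the distance
        # to every stored point and vote among the k closest ones.
        distances = sorted(
            (math.dist(p, query), label)
            for p, label in zip(self.points, self.labels)
        )
        nearest = [label for _, label in distances[: self.k]]
        return Counter(nearest).most_common(1)[0][0]


model = KNearestNeighbors(k=3).fit(
    [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)],
    ["a", "a", "a", "b", "b", "b"],
)
print(model.predict((1.1, 1.0)), model.predict((5.0, 5.1)))
```

Each call to `predict` scans the whole training set, which is why lazy learners are cheap to train and comparatively slow to query.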
Types of Learners
• Eager Learners –
– Eager learners construct a classification
model from the given training data
before receiving data for predictions.
– The model must commit to a single
hypothesis that will work for the entire input space.
– Because of this, they take a lot of time to train
and less time to predict. E.g. Decision
Tree, Naive Bayes, Artificial Neural Networks.
Types of Classification
• Binary Classification
• Multi-Class Classification
• Multi-Label Classification
• Imbalanced Classification
Types of Classification
• Linear Models
– Logistic Regression
– Support Vector Machines
• Nonlinear models
– K-nearest Neighbors (KNN)
– Kernel Support Vector Machines (SVM)
– Naïve Bayes
– Decision Tree Classification
– Random Forest Classification
Binary Classification
Linear Regression
Linear Regression
Y(pred) = b0 + b1*x
• If we don’t square the errors, positive and negative errors will
cancel each other out.
Many names of Linear Regression
• When there are one or more inputs you can use a process
of optimizing the values of the coefficients by iteratively
minimizing the error of the model on your training data.
• This operation is called Gradient Descent and works by
starting with random values for each coefficient.
• The sum of the squared errors is calculated over all pairs
of input and output values.
• A learning rate is used as a scale factor and the coefficients
are updated in the direction towards minimizing the error.
• The process is repeated until a minimum sum squared error
is achieved or no further improvement is possible.
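The procedure described above can be sketched for the single-input model Y(pred) = b0 + b1*x. Two simplifications are assumed here: the coefficients start at zero instead of random values so that the run is repeatable, and the mean of the squared errors is minimized rather than the sum, which only rescales the learning rate.

```python
def fit_line(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = b0 + b1 * x by gradient descent on the mean squared error."""
    b0, b1 = 0.0, 0.0  # zeros instead of random values, for repeatability
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every (input, output) pair.
        errors = [(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to b0 and b1.
        grad_b0 = 2.0 / n * sum(errors)
        grad_b1 = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        # Step against the gradient, scaled by the learning rate.
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 1 + 2x
b0, b1 = fit_line(xs, ys)
print(b0, b1)
```

A learning rate that is too large makes the updates overshoot and diverge, while one that is too small needs many more epochs to converge.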
Regularization
• Go practical...
Association Rule Mining
The Association Rules
• Apriori Algorithm
• Eclat Algorithm
• FP-Growth Algorithm
Apriori Algorithm
retails.csv
Non-structured transactions
groceries.csv
Reading the csv file (both)
Transaction encoding
The Transaction Encoder
• Cross Selling
• Product Placement
• Affinity Promotion
• Fraud Detection
• Customer Behavior
Clustering Techniques
Unsupervised learning flow
Clustering
• Elbow Method:
– First of all, compute the sum of squared errors (SSE) for some
values of k (for example 2, 4, 6, 8, etc.). The SSE is defined as
the sum of the squared distances between each member of a
cluster and its centroid. Mathematically:
$SSE = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$, where $C_i$ is the i-th cluster and $\mu_i$ its centroid.
– If you plot k against the SSE, you will see that the error
decreases as k gets larger; this is because as the number of
clusters increases, the clusters become smaller, so distortion is also
smaller. The idea of the elbow method is to choose the k after
which the SSE stops decreasing sharply. This produces an "elbow
effect" in the graph.
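A hedged sketch of the elbow method with scikit-learn's `KMeans`, whose `inertia_` attribute holds the SSE of a fitted model. The synthetic data (three well-separated blobs), the range of k, and the seeds are assumptions made so that the elbow is easy to see.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three tight blobs of 50 points each (invented for the demo).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 8.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

sse = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances to the closest centroid.
    sse[k] = model.inertia_

for k, value in sse.items():
    print(k, round(value, 1))
```

For data like this the SSE drops steeply up to k = 3 and only marginally afterwards, so the elbow sits at the true number of blobs.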
Find no. of clusters
In this case, k=6 is the value that the Elbow method has
selected.
Applying elbow method
The Kmeans() function
Got an elbow at 5
Silhouette Method
Probability
What is Probability?
• How can this be the case? Well, if all you know is that at least
one of the children is a girl, then it is twice as likely that the
family has one boy and one girl as that it has two girls.
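The claim can be checked by enumerating the sample space of two-child families. The sketch assumes that each child is equally likely to be a boy or a girl and that the two births are independent.

```python
from fractions import Fraction
from itertools import product

# All equally likely (older, younger) combinations: BB, BG, GB, GG.
families = list(product("BG", repeat=2))

# Condition on what we know: at least one of the children is a girl.
at_least_one_girl = [f for f in families if "G" in f]
both_girls = [f for f in at_least_one_girl if f == ("G", "G")]
one_of_each = [f for f in at_least_one_girl if set(f) == {"B", "G"}]

p_both_girls = Fraction(len(both_girls), len(at_least_one_girl))
p_one_of_each = Fraction(len(one_of_each), len(at_least_one_girl))
print(p_both_girls, p_one_of_each)
```

Three outcomes remain after conditioning (BG, GB, GG), and two of them have one boy and one girl, which is where the factor of two comes from.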
Bayes Theorem
Defective Spanners
Bayes Theorem
That’s intuitive
Exercise
Example:
Step-1
Step-2
Step-3
Naive Bayes – Step-1
Naive Bayes – Step-2
Naive Bayes – Step-3
Combining altogether
Naive Bayes – Step-4
Naive Bayes – Step-5
Types of model
Final Classification
Random Variable
Statistics
Objectives
• Practically...
Dispersion
• The Pearson correlation coefficient can take on any real value in the
range −1 ≤ r ≤ 1.
• The maximum value r = 1 corresponds to the case when there’s a
perfect positive linear relationship between x and y. In other words,
larger x values correspond to larger y values and vice versa.
• The value r > 0 indicates positive correlation between x and y.
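• The value r = 0 corresponds to the case when there is no linear
relationship between x and y. Independent variables have r = 0, but
r = 0 alone does not imply independence.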
• The value r < 0 indicates negative correlation between x and y.
• The minimal value r = −1 corresponds to the case when there’s a
perfect negative linear relationship between x and y. In other
words, larger x values correspond to smaller y values and vice versa.
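A small standard-library sketch of the Pearson coefficient, evaluated on three invented data sets: a rising line, a falling line, and a parabola in which y depends entirely on x yet has no linear relationship with it.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
r_linear = pearson_r(xs, [2 * x + 1 for x in xs])   # perfect positive line
r_negative = pearson_r(xs, [-3 * x for x in xs])    # perfect negative line
r_parabola = pearson_r(xs, [x * x for x in xs])     # dependent, not linear
print(r_linear, r_negative, r_parabola)
```

The parabola case gives r = 0 even though y is fully determined by x, which illustrates that a zero coefficient rules out only a linear relationship.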
Pearson Correlation
Linear Regression
• The left plot has a perfect positive linear relationship between x and y, so r
= 1. The central plot shows positive correlation and the right one shows
negative correlation. However, neither of them is a linear function, so r is
different than −1 or 1.
• When you look only at the orderings or ranks, all three relationships are
perfect! The left and central plots show the observations where larger x
values always correspond to larger y values. This is perfect positive rank
correlation. The right plot illustrates the opposite case, which is perfect
negative rank correlation.
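Rank correlation can be sketched as the Pearson coefficient applied to ranks instead of raw values. The helper names and the sample data (an exponential curve and a reciprocal curve) are invented, and tied values are not handled in this simplified version.

```python
import math


def ranks(values):
    """Rank values from 1 to n; ties are not handled in this sketch."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(xs), ranks(ys))


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys_increasing = [math.exp(x) for x in xs]  # monotonic but not linear
ys_decreasing = [1.0 / x for x in xs]      # monotonically decreasing

print(pearson_r(xs, ys_increasing), spearman_rho(xs, ys_increasing))
print(pearson_r(xs, ys_decreasing), spearman_rho(xs, ys_decreasing))
```

For the exponential curve the Pearson value stays below 1 while the Spearman value is exactly 1, because only the ordering of the points matters for rank correlation.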
The Spearman Correlation
Machine Learning
Objectives
Data Science Process
Objectives
• Now that you have the raw data, it’s time to prepare it.
This includes transforming the data from a raw form into
data that’s directly usable in your models. To achieve this,
you’ll detect and correct different kinds of errors in the
data, combine data from different data sources, and
transform it. If you have successfully completed this step,
you can progress to data visualization and modeling.
• The fourth step is data exploration. The goal of this step
is to gain a deep understanding of the data. You’ll look
for patterns, correlations, and deviations based on visual
and descriptive techniques. The insights you gain from
this phase will enable you to start modeling.
Steps
• Clients like to know upfront what they’re paying for, so after you have
a good understanding of the business problem, try to get a formal
agreement on the deliverables. All this information is best collected in
a project charter. For any significant project this would be mandatory.
• A project charter requires teamwork, and your input covers at least
the following:
– A clear research goal
– The project mission and context
– How you’re going to perform your analysis
– What resources you expect to use
– Proof that it’s an achievable project, or proof of concepts
– Deliverables and a measure of success
– A timeline
2. Retrieving data
Data Retrieval
• Setting the research goal—Defining the what, the why, and the how of your
project in a project charter.
• Retrieving data—Finding and getting access to data needed in your project.
This data is either found within the company or retrieved from a third party.
• Data preparation—Checking and remediating data errors, enriching the
data with data from other data sources, and transforming it into a suitable
format for your models.
• Data exploration—Diving deeper into your data using descriptive statistics
and visual techniques.
• Data modeling—Using machine learning and statistical techniques to
achieve your project goal.
• Presentation and automation—Presenting your results to the stakeholder
and industrializing your analysis process for repetitive reuse and
integration with other tools.
Data Science
Objectives
• Data Volume
• 44x increase from 2009 to 2020
• From 0.8 zettabytes to 35 ZB
Exponential increase in
collected/generated data
Computer Memory Units
Characteristics of Big Data: Variety
• Examples
• E-Promotions: based on your current location, your purchase
history, and what you like, send promotions right now for the store next to
you.
Old Model: Few companies are generating data, all others are
consuming data
• Relational Data
(Tables/Transaction/Legacy Data)
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data
• Social Network, Semantic Web (RDF), …
• Streaming Data
What to do with this data?
• Internet Search
• Digital Advertisements (Targeted Advertising and retargeting)
• Recommender Systems
• Image Recognition
• Speech Recognition
• Gaming
• Price Comparison Websites
• Airline Route Planning
• Fraud and Risk Detection
• Delivery logistics
Internet Search
Targeting Advertisement
Recommender System
Image Recognition
Speech Recognition
Computer Games
Price Comparison Website
Airline Route Planning
Fraud Detection
Delivery Logistics
Facets of Data
• Audio, image, and video are data types that pose specific
challenges to a data scientist.
• Tasks that are trivial for humans, such as recognizing
objects in pictures, turn out to be challenging for
computers. MLBAM (Major League Baseball Advanced
Media) announced in 2014 that they’ll increase video
capture to approximately 7 TB per game for the purpose
of live, in-game analytics.
• High-speed cameras at stadiums will capture ball and
athlete movements to calculate in real time, for example,
the path taken by a defender relative to two baselines.
Audio, Video and Image