Data Science Complete Theory

Data Modeling

Tushar B. Kute,
http://tusharkute.com
What is data modeling?

• Data modeling (data modelling) is the process of creating a data model for the data to be stored in a database.
• This data model is a conceptual representation of data objects, the associations between different data objects, and the rules.
• Data modeling helps in the visual representation of data and enforces business rules, regulatory compliance, and government policies on the data.
• Data models ensure consistency in naming conventions, default values, semantics, and security, while ensuring the quality of the data.
Data Model

• The Data Model is defined as an abstract model that organizes data description, data semantics, and consistency constraints of data.
• The data model emphasizes what data is needed and how it should be organized, instead of what operations will be performed on the data.
• A Data Model is like an architect's building plan, which helps to build conceptual models and set relationships between data items.
Data Models: Types

• The two types of Data Modeling Techniques are


– Entity Relationship (E-R) Model
– UML (Unified Modelling Language)
Why data modeling?

• Ensures that all data objects required by the database are accurately represented. Omission of data will lead to the creation of faulty reports and produce incorrect results.
• A data model helps design the database at the conceptual, physical and logical levels.
• The data model structure helps to define the relational tables, primary and foreign keys, and stored procedures.
Why data modeling?

• It provides a clear picture of the base data and can be used by database developers to create a physical database.
• It is also helpful for identifying missing and redundant data.
• Though the initial creation of a data model is labor- and time-consuming, in the long run it makes your IT infrastructure upgrades and maintenance cheaper and faster.
Data model types

• Types of Data Models: There are mainly three


different types of data models: conceptual data
models, logical data models, and physical data
models, and each one has a specific purpose.
• The data models are used to represent the data
and how it is stored in the database and to set
the relationship between data items.
Data model types

• Conceptual Data Model: This Data Model


defines WHAT the system contains.
• This model is typically created by Business
stakeholders and Data Architects.
• The purpose is to organize, scope and define
business concepts and rules.
Data model types

• Logical Data Model: Defines HOW the system should be implemented regardless of the DBMS.
• This model is typically created by Data Architects and Business Analysts.
• The purpose is to develop a technical map of rules and data structures.
Data model types

• Physical Data Model: This Data Model describes


HOW the system will be implemented using a
specific DBMS system.
• This model is typically created by DBA and
developers.
• The purpose is actual implementation of the
database.
Data model types
Conceptual data model

• A Conceptual Data Model is an organized view of database


concepts and their relationships. The purpose of creating a
conceptual data model is to establish entities, their
attributes, and relationships.
• In this data modeling level, there is hardly any detail available
on the actual database structure. Business stakeholders and
data architects typically create a conceptual data model.
• The 3 basic tenets of the Conceptual Data Model are
– Entity: A real-world thing
– Attribute: Characteristics or properties of an entity
– Relationship: Dependency or association between two
entities
Conceptual data model

• Data model example:


– Customer and Product are two entities. Customer
number and name are attributes of the Customer
entity
– Product name and price are attributes of product
entity
– Sale is the relationship between the customer and
product
Characteristics of conceptual data model

• Offers organisation-wide coverage of the business concepts.
• This type of data model is designed and developed for a business audience.
• The conceptual model is developed independently of hardware specifications (like data storage capacity and location) and software specifications (like DBMS vendor and technology). The focus is to represent data as a user will see it in the "real world."
• Conceptual data models, also known as Domain models, create a common vocabulary for all stakeholders by establishing basic concepts and scope.
Logical Data Model

• The Logical Data Model is used to define the structure


of data elements and to set relationships between
them.
• The logical data model adds further information to the
conceptual data model elements.
• The advantage of using a Logical data model is to
provide a foundation to form the base for the Physical
model. However, the modeling structure remains
generic.
Characteristics of a Logical data model

• Describes data needs for a single project, but could integrate with other logical data models based on the scope of the project.
• Designed and developed independently of the DBMS.
• Data attributes will have datatypes with exact precisions and lengths.
• Normalization is typically applied to the model, usually up to 3NF.
Physical Data Model

• A Physical Data Model describes a database-specific


implementation of the data model. It offers database
abstraction and helps generate the schema.
• This is because of the richness of meta-data offered by a
Physical Data Model.
• The physical data model also helps in visualizing
database structure by replicating database column keys,
constraints, indexes, triggers, and other RDBMS
features.
Characteristics

• The physical data model describes data needs for a single project or application, though it may be integrated with other physical data models based on project scope.
• The data model contains relationships between tables that address cardinality and nullability of the relationships.
• Developed for a specific version of a DBMS, location, data storage or technology to be used in the project.
• Columns should have exact datatypes, lengths and default values assigned.
• Primary and foreign keys, views, indexes, access profiles, authorizations, etc. are defined.
Advantages of Data model

• The main goal of designing a data model is to make certain that data objects offered by the functional team are represented accurately.
• The data model should be detailed enough to be used for building the physical database.
• The information in the data model can be used for defining the relationship between tables, primary and foreign keys, and stored procedures.
• A data model helps the business communicate information within and across organizations.
• The data model helps to document data mappings in the ETL process.
• It helps to recognize the correct sources of data to populate the model.
Disadvantages of Data model

• To develop a data model, one should know the characteristics of the physical data storage.
• A navigational data model produces complex application development and management, and requires detailed knowledge of how the data is physically structured.
• Even a small change made to the structure may require modification across the entire application.
• There is no standard data manipulation language across DBMSs.
Multidimensional Data Model

• Multidimensional Data Model can be defined as a method for


arranging the data in the database, with better structuring
and organization of the contents in the database.
• Unlike a system with one dimension such as a list, the
Multidimensional Data Model can have two or three
dimensions of items from the database system.
• It is typically used in organizations for drawing out analytical results and generating reports, which can be used as the main source for imperative decision-making processes.
• This model is typically applied to systems that operate with
OLAP techniques (Online Analytical Processing).
Multidimensional Data Model

• The multidimensional data model stores data in the form of a data cube. Mostly, data warehousing supports two- or three-dimensional cubes.
• A data cube allows data to be viewed in multiple dimensions.
• Dimensions are entities with respect to which an organization wants to keep records.
• For example, in a store's sales record, dimensions allow the store to keep track of things like monthly sales of items, by branch and location.
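As a rough illustration (not from the slides), the data-cube idea can be sketched with pandas: a sales table with month, branch and item as dimensions and sales as the measure, viewed along two or three dimensions at once. All names and numbers are invented.

```python
# Sketch: a tiny "data cube" built with pandas, assuming a hypothetical
# sales table with dimensions (month, branch, item) and a measure (sales).
import pandas as pd

sales = pd.DataFrame({
    "month":  ["Jan", "Jan", "Feb", "Feb", "Feb", "Mar"],
    "branch": ["Pune", "Mumbai", "Pune", "Pune", "Mumbai", "Pune"],
    "item":   ["Pen", "Pen", "Book", "Pen", "Book", "Book"],
    "sales":  [120, 90, 300, 150, 210, 260],
})

# View the measure along two dimensions at once (month x branch),
# aggregating over the remaining dimension (item).
cube_2d = sales.pivot_table(index="month", columns="branch",
                            values="sales", aggfunc="sum", fill_value=0)
print(cube_2d)

# A three-dimensional view: group by all three dimensions.
cube_3d = sales.groupby(["month", "branch", "item"])["sales"].sum()
print(cube_3d)
```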
Multidimensional Data Model

• A multidimensional database helps provide data-related answers to complex business queries quickly and accurately.
• Data warehouses and Online Analytical Processing (OLAP) tools are based on a multidimensional data model.
• OLAP in data warehousing enables users to view data from different angles and dimensions.
Multidimensional Data Model

• Schemas for Multidimensional Data Model are:-


– Star Schema
– Snowflake Schema
– Fact Constellation Schema
Dimensional Modeling

• Dimensional Modeling (DM) is a data structure


technique optimized for data storage in a Data
warehouse.
• The purpose of dimensional modeling is to
optimize the database for faster retrieval of
data.
• The concept of Dimensional Modelling was
developed by Ralph Kimball and consists of
“fact” and “dimension” tables.
Dimensional Modeling

• A dimensional model in data warehouse is


designed to read, summarize, analyze numeric
information like values, balances, counts,
weights, etc. in a data warehouse.
• In contrast, relational models are optimized for addition, updating and deletion of data in a real-time Online Transaction Processing (OLTP) system.
• These dimensional and relational models have
their unique way of data storage that has
specific advantages.
Dimensional Modeling

• For instance, in the relational model, normalization and ER modelling reduce redundancy in data.
• On the contrary, the dimensional model in a data warehouse arranges data in such a way that it is easier to retrieve information and generate reports.
• Hence, dimensional models are used in data warehouse systems and are not a good fit for relational systems.
Elements of multidimensional modeling

• Fact
• Dimension
• Attributes
• Fact table
• Dimension Table
Elements of multidimensional modeling

• Fact
– Facts are the measurements/metrics or facts from your business
process. For a Sales business process, a measurement would be
quarterly sales number
• Dimension
– Dimension provides the context surrounding a business process
event. In simple terms, they give who, what, where of a fact. In the
Sales business process, for the fact quarterly sales number,
dimensions would be
• Who – Customer Names
• Where – Location
• What – Product Name
• In other words, a dimension is a window to view information in the facts.
Elements of multidimensional modeling

• Attributes
– The Attributes are the various characteristics of
the dimension in dimensional data modeling.
• In the Location dimension, the attributes can be
– State
– Country
– Zipcode etc.
• Attributes are used to search, filter, or classify
facts. Dimension Tables contain Attributes
Elements of multidimensional modeling

• Fact Table
– A fact table is a primary table in dimension
modelling.
• A Fact Table contains
– Measurements/facts
– Foreign key to dimension table
Elements of multidimensional modeling

• A dimension table contains dimensions of a fact.
• They are joined to the fact table via a foreign key.
• Dimension tables are de-normalized tables.
• The dimension attributes are the various columns in a dimension table.
• Dimensions offer descriptive characteristics of the facts with the help of their attributes.
• There is no set limit on the number of dimensions.
• A dimension can also contain one or more hierarchical relationships.
Star Schema

• The simplest data warehouse schema is the star schema, because its structure resembles a star.
• A star schema consists of data in the form of facts and dimensions.
• The fact table sits at the center of the star, and the points of the star are the dimension tables.
• In a star schema, the fact table contains a large amount of data, with no redundancy.
• Each dimension table is joined with the fact table using a foreign key that references the dimension's primary key.
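As a rough illustration (not from the slides), the star-schema join can be sketched with pandas: a central fact table holding measures and foreign keys, merged with de-normalized dimension tables. All table and column names are invented.

```python
# Sketch of a star schema: a central fact table joined to dimension tables
# via foreign keys. All names are illustrative.
import pandas as pd

dim_customer = pd.DataFrame({"customer_id": [1, 2],
                             "customer_name": ["Asha", "Ravi"]})
dim_product = pd.DataFrame({"product_id": [10, 20],
                            "product_name": ["Pen", "Book"],
                            "price": [15, 120]})

fact_sales = pd.DataFrame({"customer_id": [1, 1, 2],
                           "product_id": [10, 20, 20],
                           "quantity": [3, 1, 2]})

# Join the fact table with its dimensions (primary key / foreign key).
report = (fact_sales
          .merge(dim_customer, on="customer_id")
          .merge(dim_product, on="product_id"))
report["revenue"] = report["quantity"] * report["price"]
print(report[["customer_name", "product_name", "revenue"]])
```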
Star Schema
Snowflake Schema

• The snowflake schema is more complex than the star schema because the dimension tables of the snowflake are normalized.
• The snowflake schema is represented by a centralized fact table which is connected to multiple dimension tables, and these dimension tables can be normalized into additional dimension tables.
• The major difference between the snowflake and star schema models is that the dimension tables of the snowflake model are normalized to reduce redundancies.
Snowflake Schema
Fact Constellation Schema

• A fact constellation can have multiple fact tables that share many dimension tables.
• This type of schema can be viewed as a collection of stars or snowflakes, and hence is also called a galaxy schema or a fact constellation.
• The main disadvantage of fact constellation schemas is their more complicated design.
Fact Constellation Schema
Visualize Multidimensional Data
Principal Component Analysis

• Large datasets are increasingly common and are often


difficult to interpret.
• Principal component analysis (PCA) is a technique for reducing
the dimensionality of such datasets, increasing interpretability
but at the same time minimizing information loss.
• It does so by creating new uncorrelated variables that
successively maximize variance.
• Finding such new variables, the principal components, reduces
to solving an eigenvalue/eigenvector problem, and the new
variables are defined by the dataset at hand, not a priori,
hence making PCA an adaptive data analysis technique.
Dimensionality Reduction

• Dimensionality reduction or dimension reduction is the


process of reducing the number of random variables
under consideration by obtaining a set of principal
variables.
• It can be divided into feature selection and feature
extraction.
– Feature selection approaches try to find a subset of the
original variables (also called features or attributes).
– Feature projection or Feature extraction transforms the
data in the high-dimensional space to a space of fewer
dimensions.
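The contrast between the two approaches can be sketched with scikit-learn, assuming it is available: SelectKBest keeps a subset of the original columns, while PCA builds new projected features.

```python
# Sketch: feature selection vs. feature extraction with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)          # 4 original features

# Feature selection: keep 2 of the original features.
X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Feature extraction: project onto 2 new (principal) components.
X_projected = PCA(n_components=2).fit_transform(X)

print(X_selected.shape, X_projected.shape)  # (150, 2) (150, 2)
```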
Large Dimensions

• Large number of features in the dataset is one of the factors


that affect both the training time as well as accuracy of machine
learning models. You have different options to deal with huge
number of features in a dataset.
– Try to train the models on the original number of features, which may take days or weeks if the number of features is too high.
– Reduce the number of variables by merging correlated
variables.
– Extract the most important features from the dataset that
are responsible for maximum variance in the output.
Different statistical techniques are used for this purpose e.g.
linear discriminant analysis, factor analysis, and principal
component analysis.
Principal Component Analysis

• Principal component analysis, or PCA, is a statistical


technique to convert high dimensional data to low
dimensional data by selecting the most important features
that capture maximum information about the dataset.
• The features are selected on the basis of variance that they
cause in the output.
• The feature that causes highest variance is the first
principal component. The feature that is responsible for
second highest variance is considered the second principal
component, and so on.
• It is important to mention that principal components do
not have any correlation with each other.
Advantages of PCA

• The training time of the algorithms reduces significantly with a smaller number of features.
• It is not always possible to analyze data in high dimensions. For instance, if there are 100 features in a dataset, the total number of scatter plots required to visualize all pairwise relationships would be 100(100-1)/2 = 4950. Practically, it is not possible to analyze data this way.
Normalization of features

• It is imperative to mention that a feature set must be


normalized before applying PCA. For instance if a feature set
has data expressed in units of Kilograms, Light years, or
Millions, the variance scale is huge in the training set. If PCA
is applied on such a feature set, the resultant loadings for
features with high variance will also be large. Hence,
principal components will be biased towards features with
high variance, leading to false results.
• Finally, the last point to remember before we start coding is
that PCA is a statistical technique and can only be applied to
numeric data. Therefore, categorical features are required to
be converted into numerical features before PCA can be
applied.
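A minimal preprocessing sketch consistent with these points, assuming pandas and scikit-learn are available: the categorical column is one-hot encoded and all features are standardized before PCA. The toy columns (weight_kg, income_millions, city) are invented for illustration.

```python
# Sketch: encode categoricals and standardize features before PCA.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

df = pd.DataFrame({
    "weight_kg": [60.0, 72.5, 80.1, 55.3],
    "income_millions": [1.2, 0.8, 2.5, 0.4],
    "city": ["Pune", "Mumbai", "Pune", "Nashik"],   # categorical
})

X = pd.get_dummies(df, columns=["city"])      # categorical -> numeric
X_std = StandardScaler().fit_transform(X)     # zero mean, unit variance

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_)
```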
Steps in PCA

• Standardization
• Covariance Matrix Computation
• Compute Eigenvectors and Eigenvalues
• Feature Vector
• Recast the Data Along the Principal Component Axes
Standardization

• The aim of this step is to standardize the range of


the continuous initial variables so that each one of
them contributes equally to the analysis.
• More specifically, the reason why it is critical to
perform standardization prior to PCA, is that the
latter is quite sensitive regarding the variances of
the initial variables.
• That is, if there are large differences between the
ranges of initial variables, those variables with larger
ranges will dominate over those with small ranges
Standardization

• Mathematically, this can be done by subtracting


the mean and dividing by the standard deviation
for each value of each variable.

• Once the standardization is done, all the


variables will be transformed to the same scale.
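As a small illustration (not the slides' own code), this z-score standardization can be done column-wise in NumPy by subtracting each column's mean and dividing by its standard deviation.

```python
# Sketch: column-wise standardization with NumPy.
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # z-score per variable
print(X_std.mean(axis=0), X_std.std(axis=0))   # ~0 means, unit std
```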
Covariance Matrix Computation

• The aim of this step is to understand how the


variables of the input data set are varying from
the mean with respect to each other, or in other
words, to see if there is any relationship
between them.
• Because sometimes, variables are highly
correlated in such a way that they contain
redundant information.
• So, in order to identify these correlations, we
compute the covariance matrix.
Covariance Matrix Computation
Covariance Matrix Computation

• What do the covariances that we have as entries of the matrix tell us about the correlations between the variables?
• It's actually the sign of the covariance that matters:
– if positive: the two variables increase or decrease together (correlated)
– if negative: one increases when the other decreases (inversely correlated)
• Now we know that the covariance matrix is no more than a table that summarizes the correlations between all the possible pairs of variables.
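Assuming standardized data like the above, the covariance matrix is one NumPy call; rowvar=False tells np.cov that columns are the variables. The numbers are illustrative.

```python
# Sketch: covariance matrix of the standardized variables.
import numpy as np

# X_std: rows = observations, columns = variables (e.g. from the previous step)
X_std = np.array([[-1.22, -1.22],
                  [ 0.00,  0.00],
                  [ 1.22,  1.22]])

cov = np.cov(X_std, rowvar=False)   # shape: (n_variables, n_variables)
print(cov)   # diagonal = variances, off-diagonal = covariances
```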
Eigenvector and eigenvalues

• Eigenvectors and eigenvalues are the linear algebra


concepts that we need to compute from the
covariance matrix in order to determine the
principal components of the data.
• Before getting to the explanation of these concepts, let's first understand what we mean by principal components.
Eigenvector and eigenvalues

• Principal components are new variables that are


constructed as linear combinations or mixtures of the
initial variables.
• These combinations are done in such a way that the new
variables (i.e., principal components) are uncorrelated and
most of the information within the initial variables is
squeezed or compressed into the first components.
• So, the idea is 10-dimensional data gives you 10 principal
components, but PCA tries to put maximum possible
information in the first component, then maximum
remaining information in the second and so on, until
having something like shown in the scree plot below.
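Continuing the same sketch, the eigenvectors and eigenvalues of the covariance matrix can be computed with np.linalg.eigh (appropriate for symmetric matrices) and sorted by decreasing eigenvalue; the normalized eigenvalues play the role of the scree-plot percentages. The covariance values below are made up.

```python
# Sketch: eigen-decomposition of the covariance matrix and variance explained.
import numpy as np

cov = np.array([[1.5, 1.2],
                [1.2, 1.5]])            # example covariance matrix

eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh: symmetric matrices

# Sort eigenpairs by eigenvalue, largest first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]              # columns are eigenvectors

explained = eigenvalues / eigenvalues.sum()        # variance per component
print(eigenvalues, explained)
```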
Eigenvector and eigenvalues
Principal Components

• Organizing information in principal components this way, will


allow you to reduce dimensionality without losing much
information, and this by discarding the components with low
information and considering the remaining components as
your new variables.
• An important thing to realize here is that, the principal
components are less interpretable and don’t have any real
meaning since they are constructed as linear combinations of
the initial variables.
• Geometrically speaking, principal components represent the
directions of the data that explain a maximal amount of
variance, that is to say, the lines that capture most
information of the data.
Principal Components

• As there are as many principal components as


there are variables in the data, principal
components are constructed in such a manner
that the first principal component accounts for
the largest possible variance in the data set.
Principal Components

• For example, let’s assume that the scatter plot of


our data set is as shown below, can we guess the
first principal component ?
• Yes, it’s approximately the line that matches the
purple marks because it goes through the origin and
it’s the line in which the projection of the points (red
dots) is the most spread out.
• Or mathematically speaking, it’s the line that
maximizes the variance (the average of the squared
distances from the projected points (red dots) to the
origin).
Principal Components
Example

• Let's suppose that our data set is 2-dimensional with 2 variables x, y and that the eigenvectors and eigenvalues of the covariance matrix are as follows:

If we rank the eigenvalues in descending order, we get λ1 > λ2, which means that the eigenvector that corresponds to the first principal component (PC1) is v1 and the one that corresponds to the second component (PC2) is v2.
Feature Vector

• As we saw in the previous step, computing the eigenvectors and


ordering them by their eigenvalues in descending order, allow us
to find the principal components in order of significance.
• In this step, what we do is, to choose whether to keep all these
components or discard those of lesser significance (of low
eigenvalues), and form with the remaining ones a matrix of
vectors that we call Feature vector.
• So, the feature vector is simply a matrix that has as columns the
eigenvectors of the components that we decide to keep.
• This makes it the first step towards dimensionality reduction,
because if we choose to keep only p eigenvectors (components)
out of n, the final data set will have only p dimensions.
Example:

• Continuing with the example from the previous step, we can


either form a feature vector with both of the eigenvectors v1
and v2:

• Or discard the eigenvector v2, which is the one of lesser


significance, and form a feature vector with v1 only:

• Discarding the eigenvector v2 will reduce dimensionality by 1,


and will consequently cause a loss of information in the final
data set.
Last step

• The aim is to use the feature vector formed using


the eigenvectors of the covariance matrix, to
reorient the data from the original axes to the ones
represented by the principal components (hence
the name Principal Components Analysis).
• This can be done by multiplying the transpose of
the original data set by the transpose of the
feature vector.
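Putting the last two steps together as a sketch: keep the top p eigenvectors as the feature vector W and recast the standardized data onto the principal-component axes. The small data matrix is illustrative only.

```python
# Sketch: form the feature vector from the top-p eigenvectors and project.
import numpy as np

X_std = np.array([[-1.22, -1.22],
                  [ 0.00,  0.00],
                  [ 1.22,  1.22]])       # standardized data (rows = samples)

cov = np.cov(X_std, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvectors = eigenvectors[:, order]

p = 1
W = eigenvectors[:, :p]                  # feature vector: top-p eigenvectors

# Recast the data along the principal component axes;
# equivalent to (W.T @ X_std.T).T
X_pca = X_std @ W
print(X_pca)
```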
Clustering high dimensional data

• Clustering high-dimensional data is the cluster


analysis of data with anywhere from a few dozen to
many thousands of dimensions.
• Such high-dimensional spaces of data are often
encountered in areas such as medicine, where DNA
microarray technology can produce many
measurements at once, and the clustering of text
documents, where, if a word-frequency vector is
used, the number of dimensions equals the size of
the vocabulary.
Problems

• Multiple dimensions are hard to think in, impossible


to visualize, and, due to the exponential growth of
the number of possible values with each dimension,
complete enumeration of all subspaces becomes
intractable with increasing dimensionality. This
problem is known as the curse of dimensionality.
• The concept of distance becomes less precise as the number of dimensions grows, since the distance between any two points in a given dataset converges. The discrimination of the nearest and the farthest point in particular becomes meaningless.
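This distance-concentration effect can be demonstrated with a short NumPy experiment (an illustration, not part of the original slides): as the dimensionality grows, the ratio between the farthest and nearest distance from a random query point to a random data set shrinks toward 1.

```python
# Sketch: nearest vs. farthest distances converge as dimensionality grows.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((500, d))                       # 500 random points
    q = rng.random(d)                              # a query point
    dist = np.linalg.norm(X - q, axis=1)
    print(d, round(dist.max() / dist.min(), 2))    # ratio shrinks toward 1
```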
Problems

• A cluster is intended to group objects that are


related, based on observations of their attribute's
values. However, given a large number of attributes
some of the attributes will usually not be
meaningful for a given cluster.
• Given a large number of attributes, it is likely that
some attributes are correlated. Hence, clusters
might exist in arbitrarily oriented affine subspaces.
Solutions

• Subspace Clustering
• Projected Clustering
• Projection Based Clustering
• Correlation Clustering
Subspace Clustering

• Subspace clustering is an extension of traditional


clustering that seeks to find clusters in different
subspaces within a dataset.
• Often in high dimensional data, many dimensions are
irrelevant and can mask existing clusters in noisy data.
• Feature selection removes irrelevant and redundant
dimensions by analyzing the entire dataset.
• Subspace clustering algorithms localize the search for
relevant dimensions allowing them to find clusters
that exist in multiple, possibly overlapping subspaces.
Projected Clustering

• Projected clustering (Aggarwal et al., 1999) was the first top-down partitioning projected clustering approach, based on the notion of k-medoid clustering.
• It determines medoids for each cluster repetitively on a sample of data using a greedy hill-climbing technique and then iteratively refines the results.
• Cluster quality in projected clustering is a function of the average distance between data points and the closest medoid.
• Also, the subspace dimensionality is an input parameter, which generates clusters of similar sizes.
Projection Based Clustering

• Projection-based clustering is based on a nonlinear projection of high-dimensional data into a two-dimensional space.
• Typical projection methods like t-distributed stochastic neighbor embedding (t-SNE), or the neighbor retrieval visualizer (NerV), are used to project data explicitly into two dimensions, disregarding the subspaces of dimension higher than two and preserving only relevant neighborhoods in the high-dimensional data.
• In the next step, the Delaunay graph between the projected points is calculated, and each edge between two projected points is weighted with the high-dimensional distance between the corresponding high-dimensional data points.
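A rough sketch of the projection idea with scikit-learn, assuming it is installed: project the high-dimensional data to 2-D with t-SNE and then cluster in the projected space. This simplifies the method described above, since the Delaunay-graph weighting step is omitted, and the DBSCAN parameters are arbitrary.

```python
# Sketch: project high-dimensional data with t-SNE, then cluster in 2-D.
# (Simplified: the Delaunay-graph weighting described above is omitted.)
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

X, _ = load_digits(return_X_y=True)          # 64-dimensional data

X_2d = TSNE(n_components=2, random_state=0).fit_transform(X)
labels = DBSCAN(eps=3.0, min_samples=10).fit_predict(X_2d)

print(len(set(labels)) - (1 if -1 in labels else 0), "clusters found")
```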
Correlation Clustering

• Clustering is the problem of partitioning data


points into groups based on their similarity.
• Correlation clustering provides a method for
clustering a set of objects into the optimum
number of clusters without specifying that number
in advance
Thank you
This presentation is created using LibreOffice Impress 5.1.6.2 and can be used freely as per the GNU General Public License.

/mITuSkillologies   @mitu_group   /company/mitu-skillologies   MITUSkillologies

Web Resources
https://mitu.co.in
http://tusharkute.com

[email protected]
[email protected]
Types of Data Visualization

Tushar B. Kute,
http://tusharkute.com
Types of Graphs

• Column Chart
• Bar Graph
• Stacked Bar Graph
• Area Chart
• Dual Axis Chart
• Line Graph
• Pie Chart
• Waterfall Chart
• Scatter Plot Chart
• Histogram
• Funnel Chart
• Heat Map
Bar Graph

• Bar charts are among the most frequently used


chart types.
• As the name suggests a bar chart is composed of
a series of bars illustrating a variable’s
development.
• Given that bar charts are such a common chart
type, people are generally familiar with them
and can understand them easily.
• Examples like this one are straightforward to
read.
Example:
Bar Graph

• However, please be aware that bar charts can


be confusing, too.
• Especially if one uses them to compare several
variables. I personally believe that a comparison
of more than two variables with a clustered bar
chart becomes too cluttered.
• Here is an example of a clustered bar chart that
is not exactly crystal clear:
Bar Graph
Bar Graph – When to use ?

• Bar charts are nice but limited. We have to consider


the type of data we want to visualize and the
number of variables that will be added to the chart.
• Bar charts are great when we want to track the
development of one or two variables over time.
• For example, one of the most frequent applications
of bar charts in corporate presentations is to show
how a company’s total revenues have developed
during a given period.
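A minimal matplotlib sketch of this use case, tracking a single variable (total revenue) over a few periods; the figures are invented.

```python
# Sketch: a simple bar chart of total revenue per year (illustrative numbers).
import matplotlib.pyplot as plt

years = ["2018", "2019", "2020", "2021"]
revenue = [4.2, 5.1, 4.8, 6.3]          # in millions, hypothetical

plt.bar(years, revenue, color="steelblue")
plt.ylabel("Revenue (millions)")
plt.title("Total revenue by year")
plt.show()
```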
Bar Chart – two variables
Stacked Bar Chart
Pie Chart

• A pie chart is a circular graph divided into slices. The


larger a slice is the bigger portion of the total quantity it
represents.
• When to use a pie chart?
– So, pie charts are best suited to depict sections of a
whole.
• What does that mean?
– If a company operates three separate divisions, at
year-end its top management would be interested in
seeing what portion of total revenue each division
accounted for.
Example:
Pie Chart – When to avoid?

• Obviously, we can’t use a pie chart in situations


when we would like to show how one or more
variables develop over time.
• Pie charts are a definite no-go in these cases.
Moreover, as mentioned earlier, a pie chart
would be misleading if we don’t consider all
values.
• In the context of our example from earlier, we
shouldn’t create a pie chart that includes
revenue of only two of the firm’s three divisions.
Line Chart

• A line chart is, as one can imagine, a line or multiple


lines showing how single, or multiple variables
develop over time. It is a great tool because we can
easily highlight the magnitude of change of one or
more variables over a period.
• When to use line charts
– Remember the awkward ‘Fiction book sales’ chart we saw
earlier? Well, a simple line chart would have been much
better in that case.
– A line chart allows us to track the development of several
variables at the same time. It is very easy to understand,
and the reader doesn’t feel overwhelmed.
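A short matplotlib sketch of a line chart tracking two variables over the same period; the book-sales numbers are invented.

```python
# Sketch: a line chart tracking two variables over the same period.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
fiction = [120, 135, 150, 148, 170, 180]       # hypothetical book sales
nonfiction = [90, 95, 110, 105, 120, 140]

plt.plot(months, fiction, marker="o", label="Fiction")
plt.plot(months, nonfiction, marker="o", label="Non-fiction")
plt.ylabel("Units sold")
plt.legend()
plt.title("Monthly book sales")
plt.show()
```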
Example:
Area Chart

• Area charts are very similar to line charts.


• The idea of an area chart is based on the line
chart. Coloured regions (areas) show us the
development of each variable over time.
• There are three types of area charts: regular
area chart, stacked area chart, and 100%
stacked area chart.
When to use Area Chart?

• Whenever we want to show how the parts of a whole change over


time, we should consider an area chart. So, for example, if the
company has three revenue generating divisions, it is very likely
that management would like to see the development of each of
these divisions.
• This is a great way to draw attention to the total value and still
emphasize an important trend – say, revenues from one division
have been growing rapidly while the other two have kept the
same level. A stacked area chart is perfect in this case.
• However, if we are interested in the portion of revenue
generated by each division and not that much of the total amount
of revenues, we can simply use a 100% stacked area chart. This
will show each division’s percentage contribution over time.
Example:
Example:
Area Chart – When to avoid?

• Obviously, similarly to line charts, area charts are not


suitable for representing parts of a whole over a single
period.
• In our example, we can’t use an area chart to show the
proportion of revenues each division generated in say,
2018 alone. So that’s a situation where we can’t use an
area chart.
• In general, I would stay away from the classical area chart too. It can be very confusing, and even Microsoft themselves recommend avoiding it and considering a simple line chart instead.
Waterfall Chart

• Waterfall, also known as bridge charts, take their


origins from consulting.
• Several decades ago top tier “24/7 at your service”
consultants at McKinsey popularized this type of
visualization among their clients. And ever since, the
popularity of bridge charts has continued to rise.
• Bridge charts are made of bars showing the
cumulative effect of a series of positive and
negative values impacting a starting and an ending
value.
Example:
Example:
Scatter Plot

• A scatter plot is a type of chart that is often used in


the fields of statistics and data science. It consists
of multiple data points plotted across two axes.
• Each variable depicted in a scatter plot would have
multiple observations. If a scatter plot includes
more than two variables, then we would use
different colours to signify that.
• When to use scatter plots
– A scatter plot chart is a great indicator that allows us to
see whether there is a pattern to be found between two
variables.
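A matplotlib sketch of such a pattern check, using hypothetical house size and price values (the same kind of data discussed on the next slide).

```python
# Sketch: scatter plot of house size vs. price (hypothetical data).
import matplotlib.pyplot as plt

size_sqft = [650, 800, 950, 1100, 1300, 1500, 1700, 2000]
price_lakh = [35, 42, 50, 58, 66, 78, 85, 100]

plt.scatter(size_sqft, price_lakh)
plt.xlabel("House size (sq. ft.)")
plt.ylabel("Price (lakh)")
plt.title("House size vs. price")
plt.show()
```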
Example:
Example:
Scatter Plot – When to avoid?

• We can’t use scatter plots when we don’t have bi-


dimensional data.
• In our example, we need information about both house
prices and house size to create a scatter plot. A scatter
plot requires at least two dimensions for our data.
• In addition, scatter plots are not suitable if we are
interested in observing time patterns.
• Finally, a scatter plot is used with numerical data, or
numbers. If we have categories such as 3 divisions, 5
products, and so on, a scatter plot would not reveal
much.
Histogram

• A series of bins showing us the frequency of


observations of a given variable. The definition
of histogram charts is short and easy.
• Here’s an example.
– An interviewer asked 267 people how much their house cost. Then a histogram was used to portray the interviewer's findings. Some prices were in the range ₹117–217k, many more in the range ₹217–317k, and the rest of the houses were classified in more expensive bins.
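A matplotlib sketch of this survey example with synthetic prices; the bin edges mirror the ranges mentioned above.

```python
# Sketch: histogram of (synthetic) house prices, binned into ranges.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
prices_k = rng.normal(loc=300, scale=90, size=267)   # 267 responses, in thousands

bins = [117, 217, 317, 417, 517, 617]                # bin edges (thousands)
plt.hist(prices_k, bins=bins, edgecolor="black")
plt.xlabel("House price (thousands)")
plt.ylabel("Number of respondents")
plt.show()
```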
Example:


Histogram – When to use?

• Histograms are great when we would like to


show the distribution of the data we are
working with.
• This allows us to group continuous data into
bins and hence, provide a useful representation
of where observations are concentrated.
Histogram – When to avoid?

• Be careful when the data you are working with


contains multiple categories or variables. Multi-
column histograms are among the chart types to be
avoided when they look like this.
Box Plot

• A box plot or boxplot is a method for graphically


depicting groups of numerical data through their
quartiles.
• Box plots may also have lines extending from the
boxes (whiskers) indicating variability outside the
upper and lower quartiles, hence the terms box-
and-whisker plot and box-and-whisker diagram.
• Outliers may be plotted as individual points.
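A matplotlib sketch of a box plot for three groups of synthetic values; outliers, if any, appear as individual points beyond the whiskers.

```python
# Sketch: box plots of three synthetic groups; quartiles, whiskers, outliers.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
groups = [rng.normal(50, 10, 200),
          rng.normal(60, 15, 200),
          rng.normal(55, 5, 200)]

plt.boxplot(groups, labels=["Division A", "Division B", "Division C"])
plt.ylabel("Value")
plt.show()
```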
Box Plot
Network Visualization

• Network Visualisation (also called Network Graph) is


often used to visualise complex relationships
between a huge amount of elements.
• A network visualisation displays undirected and
directed graph structures. This type of visualization
illuminates relationships between entities.
• Entities are displayed as round nodes and lines show
the relationships between them.
• The vivid display of network nodes can highlight non-trivial data discrepancies that may otherwise be overlooked.
Network Visualization
Hierarchy Data Visualization

• Treemaps are a data-visualization technique for


large, hierarchical data sets. They capture two types
of information in the data: (1) the value of individual
data points; (2) the structure of the hierarchy.
• Definition: Treemaps are visualizations for
hierarchical data. They are made of a series of
nested rectangles of sizes proportional to the
corresponding data value.
• A large rectangle represents a branch of a data tree,
and it is subdivided into smaller rectangles that
represent the size of each node within that branch.

Treemap Chart

• There are some chart types that are effective but


often neglected. Treemap charts are a good example.
Here is what one looks like.
Treemap – when to use?

• The company we have been looking at so far has


three divisions. And each of them has its own
products.
• This is the perfect way to provide information
about the weight divisions have with respect to
the firm’s total revenue.
• At the same time, it shows how much each
product contributes to the revenue of its
division.
Reports

• Data reporting is the process of collecting and


formatting raw data and translating it into a
digestible format to assess the ongoing
performance of your organization.
• Your data reports can answer basic questions about
the state of your business.
• They can show you the status of certain information
in an Excel file or a simple data visualization tool.
• Static data reports usually use the same format over
a period of time and pull from one source of data.
Reports
Reports

• A data report is nothing more than a recorded


list of facts and figures. Take the population
census, for example.
• This is a technical document that transmits
basic information on how many and what kind
of people live in a certain country.
• It can be displayed in the text, or in a visual
format, such as a graph or chart. But it is static
information that can be used to assess current
conditions.
Why data reporting is important?

• Data provides a path that measures progress in every area


of our lives. It informs our professional decisions as well as
our day-to-day matters.
• A data report will tell us where to spend the most time and
resources, and what needs more organization or attention.
• Accurate data reporting plays an important role in every
industry. The use of business intelligence in healthcare can
help physicians save lives by providing more effective and
efficient patient care.
• In education, data reports can be used to analyze how
attendance records relate to seasonal weather patterns, or
how acceptance rates intersect neighborhood areas.
Data reporting skills

• The most effective business analysts master certain skills. An


excellent business analyst must be able to prioritize the most
relevant information.
• They must be extremely thorough and detail-oriented; there’s
no room for error in data reports. Another useful skill is the
ability to process and collate large amounts of information.
And finally, being able to arrange the data and display it in an
easy-to-read format is key for all data reporters.
• Excellence in data reporting doesn’t mean you have to immerse
yourself in code or be an expert at analytics. Other important
skills include being able to extract essential information from
the data, keeping it simple, and avoiding data hoarding.
Thank you
This presentation is created using LibreOffice Impress 5.1.6.2 and can be used freely as per the GNU General Public License.

/mITuSkillologies   @mitu_group   /company/mitu-skillologies   MITUSkillologies

Web Resources
https://mitu.co.in
http://tusharkute.com

[email protected]
[email protected]
Dashboard

Tushar B. Kute,
http://tusharkute.com
Agenda

• Definition of Dashboard,
• Their type
• Evolution of dashboard
• Dashboard design and principles
• Display media for dashboard.
Dashboard

• A data dashboard is an information management


tool that visually tracks, analyzes and displays key
performance indicators (KPI), metrics and key data
points to monitor the health of a business,
department or specific process.
• They are customizable to meet the specific needs of
a department and company.
• Behind the scenes, a dashboard connects to your
files, attachments, services and API’s, but on the
surface displays all this data in the form of tables,
line charts, bar charts and gauges.
Dashboard

• A data dashboard is the most efficient way to


track multiple data sources because it provides
a central location for businesses to monitor and
analyze performance.
• Real-time monitoring reduces the hours of analysis and the long lines of communication that previously challenged businesses.
How dashboard works?

• Everyone uses data dashboards differently. Not


all business dashboards serve the same
purpose, which is why it’s important users
understand what KPIs to track and why.
– What kinds of business questions do
dashboards answer?
– What type of data are tracked on
dashboards?
– How are dashboards interactive?
Business Questions

• The best data dashboards answer important questions about your


business. Unlike advanced business intelligence tools, dashboards are
designed for quick analysis and informational awareness. The most
common approach to designing a business dashboard is to build it
using a question-answer format.
– What’s our Quick Ratio? Where is our Quick Ratio compared to
where it should be?
– How many calls has the call center handled this week? Was it more or less than last week?
– What are the top 5 products in Sales Revenue? Where are there
opportunities?
– What are the traffic sources to the website? Has there been an
increase in search?
– What does our marketing funnel look like? Is it on target?
Operational and Analytical Data

• The business questions a dashboard answers depends


on industry, department, process and position.
• Analytical dashboards are typically designed to help
decision makers, executives and senior leaders,
establish targets, set goals and understand what and
why something happened with the same information
they can use to implement appropriate changes.
• An analytical dashboard does this based on insights
from data collected over a period of time determined
by the user (i.e. last month, quarter or year).
Interactive Data Visualization

• Data is visualized on a dashboard as tables, line charts,


bar charts and gauges so that users can track the health
of their business against benchmarks and goals.
• Data dashboards surface the necessary data to
understand, monitor and improve your business through
visual representations.
• Depending on how you decide to design your dashboard,
even straightforward numerical data can be visually
informative by utilizing intuitive symbols, such as a red
triangle facing downward to indicate a drop in revenue
or a green triangle facing up to indicate an increase in
website traffic.
Dashboards in BI

• Dashboards are a data visualization tool that allow all


users to understand the analytics that matter to their
business, department or project.
• Even for non-technical users, dashboards allow them to
participate and understand the analytics process by
compiling data and visualizing trends and occurrences.
• Data dashboards provide an objective view of
performance metrics and serve as an effective foundation
for further dialogue.
• A dashboard is a business intelligence tool used to display
data visualizations in a way that is immediately
understood.
Why to visualize on dashboard?

• Monitor Multiple KPIs and Metrics at Once


• Easy to Read
• Cloud Accessibility - share your dashboard with everyone
• Dashboards make reporting more efficient
Monitor multiple KPIs & Metrics

• A change to any aspect of a business, whether it be in marketing, sales, support, or finance, has an impact on the business as a whole.
• People have been monitoring their businesses without dashboards for ages, but data dashboards make it a heck of a lot easier.
• With dashboards, users are able to dig deeper into the big picture to correlate this impact alongside specific KPIs and metrics to understand what works and what doesn't.
Monitor multiple KPIs & Metrics

• Whether your businesses data is stored on a web


service, attachment or API, a dashboard pulls this
information and allows you to monitor all your data
in one central location.
• Additionally, dashboards are capable of correlating data from different sources in a single visualization if the user so chooses.
• By monitoring multiple KPIs and metrics on one
central dashboard, users can make adjustments to
their business practices in real time.
Easy to read

• Dashboards display KPIs and metrics using visualizations


like tables, line charts, bar charts and gauges.
• Effective dashboard design utilizes colors, symbols and
visualizations to highlight important data points
• This allows users to quickly scan a dashboard and get the
information they need without sifting through
spreadsheets, emails, or signing into a web service.
• Data dashboards are useful because they visualize
information in a way that is accessible to everyone.
• Even if you don't work in, e.g., marketing, you can understand its numbers. This is key: you do not have to be an analyst to use and understand a data dashboard.
Cloud accessibility

• A dashboard keeps every member of a business


on the same page. Users can share dashboards
in real-time and periodically.
• Dashboards bring a business's data to the cloud, making key metrics and KPIs accessible to your entire team on desktop, mobile and tablet.
• There are a number of ways to share data
dashboards: on wallboard tv's, email reports,
printable reports or direct access.
Cloud accessibility

• Users can create a public link to their dashboard which gives anyone access, or a private link that allows only those with the link to access your data.
• It has become increasingly popular to display
dashboards on wallboard tv's in offices as a way to
keep everyone on the same page about
performance and objectives.
• Data dashboards are becoming increasingly common
because they allow for virtual work environments
and make it easier for teams to collaborate.
Efficient Reporting

• Data dashboards save time. Users no longer


need to go to multiple, disconnected, sources to
track their data.
• Getting the data, creating a spreadsheet,
generating and designing the report, and
sharing it -- dashboards do all this automatically.
• There are a number of reasons why reporting is
typically done at the end of the month, one
being because it takes up time and resources.
Efficient Reporting

• All you need to do is invest a bit of time setting


up your dashboard, which is peanuts compared
to what a typical manual reporting will take.
• Dashboards can automatically generate reports with their data, at any time, anywhere. No longer do users need to gather, analyze and format data.
data.
• You can create PDF, email and live reports using
a dashboard: choose the KPI you want to
analyze, select your report format and present.
Types of dashboard

• Operational dashboards
– tell you what is happening now.
• Strategic dashboards
– track key performance indicators.
• Analytical dashboards
– process data to identify trends.
• Tactical dashboards
– used by mid-management to track
performance.
Strategic Dashboard

• A strategic dashboard is a reporting tool for


monitoring the long-term company strategy
with the help of critical success factors.
• They’re usually complex in their creation,
provide an enterprise-wide impact to a business
and are mainly used by senior-level
management.
• Strategic dashboards are commonly used in a
wide range of business types while aligning a
company’s strategic goals.
Strategic Dashboard

• They track performance metrics against


enterprise-wide strategic goals.
• As a result, these dashboards tend to summarize
performance over set time frames: past month,
quarter, or year.
• When the strategic dashboard is properly
developed, designed, and implemented, it can
effectively reduce the amount of time needed to
accomplish a specific business key performance
indicator, while reducing operational costs.
Strategic Dashboard

• Although they can provide opportunities for specific


departments’ operations and further analysis, strategic
reports and dashboards are usually fairly high-level.
• As mentioned, senior members of a team can identify
strategic concerns fairly quickly and provide
comprehensive strategic reports with the analyzed data.
• The importance lies in analyzing management
processes, using common qualitative and quantitative
language, and identifying a specific system, which has to
be incorporated into the dashboard so that every
decision-maker understands the presented data.
Strategic Dashboard: Types

• Management strategic dashboard


• CMO strategic dashboard
• CFO dashboard for strategic planning
Management Strategic Dashboard

• This management dashboard below is one of the


best strategic dashboard examples that could
easily be displayed in a board meeting.
• It isn’t cluttered, but it quickly tells a cohesive
data story. The dashboard focuses on revenue in
total as well as at the customer level plus the
cost of acquiring new customers.
• The dashboard is set to a specific time frame and
it includes significant KPIs: customer acquisition
costs, customer lifetime value, and sales target.
Management Strategic Dashboard

• This dashboard answers the following: What is


my customer base and revenue compared to
this time last year?
• While addressing specific values, incorporating specific key performance indicators, and using a common qualitative and quantitative language, this dashboard presents the management board with clear value and a specific course of action, while using comparison metrics and analysis.
Management Strategic Dashboard
CMO Strategic Dashboard

• Another example comes from the marketing department.


• Chief Marketing Officers (CMOs) often don’t have time to
check numbers such as traffic or CTR of certain
campaigns.
• But they do need to have a closer look at a more strategic
level of marketing efforts, even cooperating with sales to
reach the best possible marketing results a business can
have, and, therefore, generate profit.
• This marketing dashboard shows these important
strategic KPIs in a visual, informative, and straightforward
way.
CMO Strategic Dashboard

• The strategic dashboard example above expounds on


the cost of acquiring each customer, leads, MQL, and
compares them to previous periods and set targets.
• A CMO must have a birds-eye view of the strategic
goals so that he/she can react promptly and keep the
department’s results under control.
• An executive can immediately see where his/her
targets are, which gives them the ability to drill down
further into these marketing KPIs and see what can be
improved in the overall marketing funnel.
CMO Strategic Dashboard
CFO Strategic Dashboard

• Chief financial officers need to keep a company's


strategy on track, monitor the financial performance
closely, and react when there are deviations from
strategic goals and objectives.
• But not only, as the finances of a company are affected
also by non-direct factors such as employee and
customer satisfaction.
• For example, if employees are not satisfied with their
working environment, they can call in sick or leave the
company which will cause financial bottlenecks. But let's
take a closer look at what kind of dashboards for
strategy CFOs need.
CFO Strategic Dashboard
CFO Strategic Dashboard

• Let's continue with more details on the right of the


dashboard.
• The costs are visualized through a percentage
breakdown depicting sales, general and admin,
marketing, and other expenses.
• Here we can see that sales use up most of the
costs, followed by general and admin. Maybe there
is space to eliminate some costs but be careful not
to cause the opposite effect.
Operational Dashboard

• An operational dashboard is one of the types of


dashboards used for monitoring and managing
operations that have a shorter time horizon.
• Since they focus on tracking operational
processes, they’re usually administrated by
junior levels of management.
• Their value in today’s digital age lies in the fact
that businesses start to realize the importance
of fast and correct data between operational
teams and departments.
Operational Dashboard

• These kinds of dashboards are arguably the most common


ones. They are mostly used for monitoring and analyzing a
company’s activities in a given business area.
• These dashboards are usually focused on alerting about
business exceptions and are based on real-time data.
Operational metrics dashboards usually end up in the
hands of the subject matter experts.
• This often leads to more direct action rather than further analysis. Because of this, operational dashboards are often more detailed than strategic dashboards.
• They can also provide operational reports with a more
detailed view of specific data sets.
Operational Dashboard Types

• Marketing operational dashboard


• LinkedIn operations dashboard
• Customer service operational metrics
dashboard
Marketing Operational Dashboard

• The marketing performance dashboard above is one of our top


operational dashboard examples.
• It shows the performance of 3 campaigns over the past 12 weeks.
• It provides important operational information and key
performance indicators for the marketing team on cost per
acquisition, the total number of clicks, total acquisitions gained,
and the total amount spent in the specific campaign.
• Any significant changes would immediately alert the marketing
team.
• Why is it useful? Because a fast-paced marketing department or
agency can adjust their operational activities based on real-time
data and teams don’t have to wait for extensive, traditional
reports and analysis presented in a spreadsheet.
Marketing Operational Dashboard
LinkedIn Operational Dashboard

• We continue our list of operations dashboard


examples with LinkedIn.
• This social media network is critical for building
business relationships, either on a profile level
or company.
• With the number of users steadily growing and
reaching more than 610 million members in
2020, LinkedIn should be on a higher priority for
companies that want to reach decision-makers
and business professionals.
LinkedIn Operational Dashboard

• To effectively manage a company's presence,


companies can use an operational data
dashboard that will solve multiple social media
problems such as automation, customization of
reports, and provide advanced analytical
features.
• Let's take a look at an operational dashboard
design example specifically created for
LinkedIn.
LinkedIn Operational Dashboard
Customer Service Dashboard

• One of our next operational dashboard examples


focuses on customer service.
• By having all the important customer service KPIs on a single screen, the team can manage its operations much more efficiently.
• Let’s see this through a visual example.
Customer Service Dashboard

• This type of a dashboard expounds on the customer


service team’s performance over a shorter
timeframe, in this case, daily, with an additional
monthly overview of the first, second, third call,
and unresolved ones.
• We can see that the customer service dashboard is
divided into 2 parts: the resolutions and the
response time. Each day of the week gives an
additional insight which helps teams to reduce the
response time metric if they track it on a regular
basis.
Customer Service Dashboard
Analytical Dashboard

• An analytical dashboard is a type of dashboard


that contains a vast amount of data created and
used by analysts to provide support to
executives.
• They supply a business with a comprehensive
overview of data, with middle management
being a crucial part of its usage.
Analytical Dashboard

• The importance of analytical dashboards lies in their use of historical data: analysts can identify trends, compare them across multiple variables, and create predictions and targets that can be implemented in a company's business intelligence strategy.
• They are especially useful when complex, categorized information is massive and broad and needs visualization to allow a clear analysis of the generated data.
Analytical Dashboard Types

• Financial performance dashboard


• Procurement cost dashboard
• Analytical retail KPI dashboard
Financial Performance Dashboard

• In the example below, the analysis of the financial


dashboard focused on performance can help decision-
makers to see how efficiently the company’s capital is
being spent and to establish a specific operational task
to structure future decisions better.
• With the important financial KPIs such as return on
assets, return on equity, working capital, and the
overview of the balance sheet, a finance department has
a clear picture of their capital structure.
• This analysis dashboard enables the department to,
consequently, set specific operational activities to
improve further.
Financial Performance Dashboard
Procurement Cost Dashboard

• Another dashboard focused on costs but, in this


case, specifically for the procurement department.
• As we know, procurement is found in most
companies as a function that connects a company
with its suppliers, contractors, freelancers,
agencies, etc.
• It's not only critical for industries such as manufacturing but for service-oriented ones as well. To see the analytical perspective of a procurement department, let's take a look at a visual example.
Procurement Cost Dashboard
Procurement Cost Dashboard

• The procurement department handles large volumes of data, and by analyzing the costs and purchases of the procurement cycle, analysts can present data that provides a building block for different units in order to save invaluable time.
• A procurement dashboard as visualized above can serve as a
tool to present data in a visual and straightforward manner.
• This kind of analysis is essential since procurement departments usually gather data from multiple sources such as ERP systems, databases, or CSV files.
• In order to optimize the cost management and increase the
overall positive results, an analytic dashboard such as this
one can prove to be beneficial.
Analytical Retail KPI Dashboard

• Another analytical dashboard example comes


from the retail industry.
• It creates an analytical parallel between
management and customer satisfaction since
the supply chain can directly affect it.
• This comprehensive dashboard shows us an
overview of important aspects of a retail
business that enable analysts to identify trends
and give management the support needed in
business processes.
Analytical Retail KPI Dashboard
Analytical Retail KPI Dashboard

• As we can see on the retail KPI dashboard above,


some of the crucial metrics such as rate of return
(also depicted by category), the total volume of
sales, customer retention rate, and the number of
new and returning customers through a set
timeframe, can give us a bigger picture on the state
of the retail business.
• These retail KPIs show how good you are at keeping your customers and developing brand loyalty, and management can clearly see which aspects of the business need to be improved.
Tactical Dashboard

• A tactical dashboard is utilized in the analysis


and monitoring of processes conducted by mid-
level management, emphasizing the analysis.
• With it, an organization effectively tracks the performance of its goals and delivers analytical recommendations for future strategies.
• Tactical dashboards are often the most
analytical dashboards. They are great for
monitoring the processes that support the
organization’s strategic initiatives.
Tactical Dashboard

• Tactical dashboards help guide users through


the decision process.
• They capitalize on the interactive nature of
dashboards by providing users the ability to
explore the data.
• The detail level of a tactical dashboard falls
between the strategic and operational
dashboards.
• A tactical sales dashboard can track your sales
target (actual revenue vs. forecasted revenue).
Tactical Dashboard Types

• IT project management dashboard


• Social media dashboard
• Supply chain management tactical
dashboard
IT Project Management

• The example below shows a detailed overview of a


project with specific timelines and efficiency of the
parties involved.
• You can define specific risks, see the overall
progress, and average times of conducting specific
tasks.
• After the project is finished, you can create a
comprehensive IT report, evaluate the results, and
make future projects more successful.
IT Project Management
IT Project Management

• The goal of every IT management effort is to increase efficiency, reduce the number of tickets, and deliver a successful project.
• By having the right tool in the form of an IT
operations dashboard, a single screen can provide a
project manager with all the data he/she needs to
analyze all the important aspects of the project.
• While there are various types of project dashboards, this particular visual above is set to monitor project management efforts and alert leaders if there are any anomalies within the process.
Social Media Dashboard

• There are different types of business intelligence dashboards that cover various purposes, and we have already expounded on LinkedIn as a separate channel that needs daily monitoring to keep companies in touch with their follower base and expand their reach.
• Now, in a tactical sense, a KPI scorecard can provide multiple benefits for managing social accounts and, consequently, ensure users have enough data to generate recommendations for the future.
• To put this into perspective, we will show a business
process dashboard focused on 4 main social media
channels: Facebook, Twitter, Instagram, and YouTube.
Social Media Dashboard
Social Media Dashboard

• The dashboard starts with Facebook as the biggest


social media network in the world with, currently,
more than 2.5 billion monthly users.
• In our example, we can see that the number of
followers did not reach the set target but it did
increase in comparison to the previous period.
• In this case, social media managers can dig deeper
to understand why and if this Facebook KPI needs
particular attention.
SCM Dashboard

• When you create a tactical dashboard strategy,


it is important to focus on the analytical and
monitoring part of the process that gives a
backbone for effective, data-driven decisions.
• Our next dashboard concentrates on the supply
chain of a logistics company.
SCM Dashboard
SCM Dashboard

• The supply chain metrics depicted in our example above show how a data-driven supply chain should be monitored to ensure healthy processes across the company.
• Additional focus on inventory management will enable the company to have a clear overview of the logistics KPIs needed to stay competitive and avoid out-of-stock merchandise.
• By fully utilizing logistics analytics, you stand to
reap great rewards in your logistics business, and,
ultimately, manage to retain customers.
Benefits of A Successful Dashboard

• A successful dashboard implementation will:


– Save time across an organization: IT, analysts,
managers, C-suite, etc.
– Save companies money by highlighting
unnecessary operational costs
– Provide insight into customer behavior
– Effectively align strategy with tactics
– Ensure a goal-driven and performance-based
data culture
– Encourage interactivity and analysis
Summary
Evolution of Dashboard

• The idea of digital dashboards followed the


study of decision support systems in the 1970s.
• Early predecessors of the modern business
dashboard were first developed in the 1980s in
the form of Executive Information Systems (EISs).
• Due to problems primarily with data refreshing
and handling, it was soon realized that the
approach wasn't practical as information was
often incomplete, unreliable, and spread across
too many disparate sources.
Evolution of Dashboard

• EISs hibernated until the 1990s when the information age


quickened pace and data warehousing, and online analytical
processing (OLAP) allowed dashboards to function adequately.
• Despite the availability of enabling technologies, the dashboard
use didn't become popular until later in that decade, with the
rise of key performance indicators (KPIs), and the introduction of
Robert S. Kaplan and David P. Norton's Balanced Scorecard.
• In the late 1990s, Microsoft promoted a concept known as the
Digital Nervous System and "digital dashboards" were described
as being one leg of that concept.
• Today, the use of dashboards forms an important part of
Business Performance Management (BPM).
Dashboard Design Principles
Consider your audience

• Concerning dashboard best practices in design, considering your audience is one of the most important principles you have to take into account. You need to know who's going to use the dashboard.
• To do so successfully, you need to put yourself in your
audience’s shoes. The context and device on which users
will regularly access their dashboards will have direct
consequences on the style in which the information is
displayed.
• Will the dashboard be viewed on-the-go, in silence at the
office desk or will it be displayed as a presentation in
front of a large audience?
Don’t place all information

• The next in our rundown of dashboard design tips is a


question of information. This most golden of
dashboard design principles refers to both precision
and the right audience targeting.
• In short, you should never create one-size-fits-all dashboards, and you shouldn't cram all the information onto the same page.
• Think about your audience as a group of individuals who have different needs – a sales manager doesn't need to see the same data as a marketing specialist, the HR department, or professionals in logistics analytics.
Choose Relevant KPIs

• For a truly effective KPI dashboard design,


selecting the right key performance indicators
(KPIs) for your business needs is a must.
• Your KPIs will help to shape the direction of your
dashboards as these metrics will display visual
representations of relevant insights based on
specific areas of the business.
• Once you’ve determined your ultimate goals and
considered your target audience, you will be able to
select the best KPIs to feature in your dashboard.
Select right type

• Remember to build responsive dashboards that will fit


all types of screens, whether it’s a smartphone, a PC, or
a tablet
• If your dashboard will be displayed as a presentation or
printed, make sure it’s possible to contain all key
information within one page.
Select right type

• Strategic: A dashboard focused on monitoring long-term


company strategies by analyzing and benchmarking a wide range
of critical trend-based information.
• Operational: A business intelligence tool that exists to monitor,
measure, and manage processes or operations with a shorter or
more immediate time scale.
• Analytical: These particular dashboards contain large streams of
comprehensive data that allow analysts to drill down and extract
insights to help the company to progress at an executive level.
• Tactical: These information-rich dashboards are best suited to mid-management and help in formulating growth strategies based on trends, strengths, and weaknesses across departments.
Provide context

• Without providing context, how will you know whether


those numbers are good or bad, or if they are typical or
unusual?
• Without comparison values, numbers on a dashboard are
meaningless for the users. And more importantly, they
won’t know whether any action is required.
• For example, a management dashboard design will focus
on high-level metrics that are easy to compare and,
subsequently, offer a visual story.
• Always try to provide maximum information: even if some of it seems obvious to you, your audience might find it perplexing.
Choose right chart

• We can’t stress enough the importance of


choosing the right data visualization types. You
can destroy all of your efforts with a missing or
incorrect chart type.
• It’s important to understand what type of
information you want to convey and choose a
data visualization that is suited to the task.
• Line charts are great when it comes to displaying patterns of change across a continuum.
• Other common choices include bar charts, scatter plots, and histograms.
Choose right chart
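• As a quick illustration (a minimal sketch using matplotlib, which is an assumed choice and not part of the original slides, with made-up data), the chart type should match the message: a line chart for change over time, a bar chart for category comparison, and a histogram for a distribution.
import matplotlib.pyplot as plt   # assumed library choice for this sketch

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
revenue = [120, 135, 128, 150, 165, 170]                  # made-up trend data
regions = ["North", "South", "East", "West"]
sales = [420, 380, 510, 290]                              # made-up category data
order_values = [23, 41, 35, 29, 52, 44, 38, 31, 27, 48]   # made-up distribution data

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].plot(months, revenue, marker="o")     # change across a continuum -> line chart
axes[0].set_title("Trend over time")
axes[1].bar(regions, sales)                   # comparing categories -> bar chart
axes[1].set_title("Category comparison")
axes[2].hist(order_values, bins=5)            # distribution of a variable -> histogram
axes[2].set_title("Distribution")
plt.tight_layout()
plt.show()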
Choose Careful layout

• Dashboard best practices in design concern


more than just good metrics and well-thought-
out charts.
• The next step is the placement of charts on a
dashboard. If your dashboard is visually
organized, users will easily find the information
they need.
• Poor layout forces users to think more before
they grasp the point, and nobody likes to look
for data in a jungle of charts and numbers.
Prioritize Simplicity

• One of the best practices for dashboard design


focuses on simplicity.
• Nowadays, we can play with a lot of options in the
chart creation and it’s tempting to use them all at
once. However, try to use those frills sparingly.
• Frames, backgrounds, effects, gridlines… Yes,
these options might be useful sometimes, but
only when there is a reason for applying them.
• Moreover, be careful with your labels or legend
and pay attention to the font, size, and color.
Round your numbers

• Continuing on simplicity, rounding the numbers in your dashboard design should also be one of the priorities, since you don't want your audience to be flooded with numerous decimal places.
• Yes, you want to present details but, sometimes, too
many details give the wrong impression.
• If you want to present your conversion rate with 5 or more decimal places, it would make sense to round the number and avoid too many number-specific factors.
• The latter may exaggerate minor elements, in this case cents, which, for an effective data story, aren't really necessary in your dashboard design process.
Be careful about colors

• Without a shadow of a doubt, this is one of the most


important of all dashboard design best practices.
• This particular point may seem incongruous to what
we have said up to this point, but there are options to
personalize and customize your creations to your
preferences.
• The interactive nature of data dashboards means that
you can let go of PowerPoint-style presentations from
the 90s.
• The modern dashboard is minimalist and clean. Flat
design is really trendy nowadays.
Be careful about colors

• Our final suggestion concerning colors is to be


mindful when using “traffic light” colors. For most
people, red means “stop” or “bad” and green
represents “good” or “go.”
• This distinction can prove very useful when
designing dashboards – but only when you use
these colors accordingly.
Don't overuse real-time data

• Next on our list of good dashboard design tips refers to


insight: don’t overuse real-time data. In some cases,
information displayed in too much detail only serves to
lead to distraction.
• Unless you’re tracking some live results, most
dashboards don’t need to be updated continually. Real-
time data serves to paint a picture of a general situation
or a trend.
• Most project management dashboards must only be
updated periodically – on a weekly, daily, or hourly basis.
After all, it is the right data that counts the most.
Consistent Labeling and formatting

• Above all else, in terms of functionality, the main aim of


a data dashboard is gaining the ability to extract
important insights at a swift glance.
• It’s critical to make sure that your labeling and
formatting is consistent across KPIs, tools, and metrics.
• If your formatting or labeling for related metrics or KPIs
is wildly different, it will cause confusion, slow down
your data analysis activities, and increase your chances
of making mistakes.
• Being 100% consistent across the board is paramount
to designing dashboards that work.
Use interactive elements

• Any comprehensive dashboard worth its salt will


allow you to dig deep into certain trends, metrics, or
insights with ease.
• When considering what makes a good dashboard,
factoring drill-downs, click-to-filter, and time interval
widgets into your design is vital.
• Drill-down is a smart interactive feature that allows
the user to drill down into more comprehensive
dashboard information related to a particular
element, variable, or key performance indicator
without overcrowding the overall design.
Use animation options

• Animation options are dashboard elements that give an additional, neat visual impression: you select the appearance of a specific element on the dashboard and assign an animation option to it.
• The result is a simple yet effective automated movement based on the desired speed (slow, medium, or fast) and type, such as linear, swing, ease-in, or ease-out.
Double up your margins

• One of the most subtle yet essential dashboard


guidelines, this principle boils down to balance.
• White space – also referred to as negative space – is
the area of blankness between elements featured
on a dashboard design.
• Users aren’t typically aware of the pivotal role that
space plays in visual composition, but designers pay
a great deal of attention to it because when
metrics, stats, and insights are unbalanced, they are
difficult to digest.
Optimize for multiple devices

• Optimization for mobile or tablet is another critical point


in the dashboard development process.
• By offering remote access to your most important
insights, you can answer critical business questions on-
the-go, without the need for a special office meeting.
• Benefits such as swift decision-making and instant access
ensure everyone has the possibility to look at the data on-
the-fly.
• Here it makes sense to keep in mind that the dashboard layout isn't the same as on a desktop. A mobile dashboard has a smaller screen and, therefore, the placement of the elements will differ.
Export vs. Digital

• In the process of dashboard designing, you also need to


think about exports.
• You can use the dashboard itself and share it, but if you plan on regularly using exports, you might want to consider optimizing towards printing bounds, fewer colors, and different types of line styles to make sure everything is readable even on a black-and-white printout.
• Hence, when you plan your data dashboard design, you also
need to look into the future uses and how to optimize
towards different exporting options or simply sharing the
dashboard itself with all its features and options.
White label and embedding

• Another critical point when considering your workflow dashboard design is the opportunity to white label and embed the dashboard into your own application or intranet. With these options in mind, you can use your own company's logos, color styles, and overall brand requirements and completely adjust the dashboard as if it were your own product.
• Embedded business intelligence ensures that access to the analytical processes and data manipulation is done completely within your existing systems and applications.
Avoid common visualization mistakes

• Data visualization has evolved from simple static presentations to modern interactive software that takes visual perception to the next level.
• It also enabled average business users and
advanced analysts to create stunning visuals that
tell a clear data-story to any potential audience
profile, from beginners in a field to seasoned
analysts and strategists.
Avoid common visualization mistakes

• Failed calculations: The numbers should add up to 100%. For example, if you conduct a survey and people have the option to choose more than one answer, you will probably need some form of visual other than a pie chart, since the numbers won't add up and the viewers might get confused.
• The wrong choice of visualizations: We have mentioned how
important it is to choose the right type of chart and
dashboard, so if you want to present a relationship between
the data, a scatter plot might be the best solution.
• Too much data: Another point to keep in mind, which we have discussed in detail: don't put too much data on a single chart, because the viewer will not recognize the point.
Never Stop Evolving

• Last but certainly not least in our collection of principles


of effective dashboards – the ability to tweak and
evolve your designs in response to the changes around
you will ensure ongoing analytical success.
• When designing dashboards, asking for feedback is
essential. By requesting regular input from your team
and asking the right questions, you’ll be able to improve
the layout, functionality, look, feel, and balance of KPIs
to ensure optimum value at all times.
Summary

• By only using the best and most balanced


dashboard design principles, you’ll ensure that
everyone within your organization can identify
key information with ease, which will accelerate
the growth, development, and evolution of your
business.
• That means a bigger audience, a greater reach,
and more profits – the key ingredients of a
successful business.
Display media for Dashboard

• Two fundamental principles have guided the selection of


each display medium in this proposed library:
– It must be the best means to display a particular type of information
that is commonly found on dashboards.
– It must be able to serve its purpose even when sized to fit into a
small space.
• The library is divided into six categories:
– Graphs
– Images
– Icons
– Drawing objects
– Text
– Organizers
Graphs

• Most dashboard display media fall into the graph


category. Given the predominance of quantitative
data on most dashboards, this isn't surprising.
• All but one of the items (treemaps) in this
category display quantitative data in the form of
a 2-D graph with X and Y axes.
• Most of these are familiar business graphs, but
one or two will probably be new to you, because
they were designed or adapted specifically for
use in dashboards.
Graphs

• Bullet graphs
• Bar graphs (horizontal and vertical)
• Stacked bar graphs (horizontal and vertical)
• Combination bar and line graphs
• Line graphs
• Sparklines
• Box plots
• Scatter plots
• Treemaps
Icon

• Icons are simple images that communicate a


clear and simple meaning. Only a few are
needed on a dashboard.
• The most useful icons are typically those that
communicate the following three meanings:
– Alert
– Up/down
– On/off
Text

• All dashboards, no matter how graphically oriented,


include some information that is encoded as text. This is
both necessary and desirable, for some information is
better communicated textually rather than graphically.
• Text is used for the categorical labels that identify what
items are on graphs, but it is often appropriate in other
places as well.
• Any time it is appropriate to report a single measure alone,
without comparing it to anything, text communicates the
number more directly and efficiently than a graph
• Note that in these instances some means to display the
text on a dashboard, such as a simple text box, is necessary.
Images

• The means to display images such as photos,


illustrations, or diagrams is sometimes useful on a
dashboard, but rarely, in my experience.
• A dashboard that is used by a trainer might include
photographs of the people scheduled to attend the day's
class, one used by a maintenance worker might highlight
the areas of the building where light bulbs need to be
replaced, or one used by a police department might use a
map to show where crimes have occurred in the last 24
hours.
• However, images will be unnecessary for most typical
business uses.
Objects

• It is sometimes useful to arrange and connect pieces of


information in relation to one another in ways that
simple drawing objects handle with clarity and ease.
• For instance, when displaying information about a
process, it can be helpful to arrange separate events in
the process sequentially and to indicate the path along
which the process flows, especially when branching
along multiple paths is possible.
• Another example is when you need to show connections
between entities, perhaps including a hierarchical
relationship, such as in an organization chart.
Organizer

• It is often the case that sets of information need to


be arranged in a particular manner to communicate
clearly.
• Three separate ways of organizing and arranging
related information stand out as particularly useful
when displaying business information on
dashboards:
– Tables
– Spatial maps
– Small multiples
Thank you
This presentation is created using LibreOffice Impress 5.1.6.2, can be used freely as per GNU General Public License

/mITuSkillologies @mitu_group /company/mitu- MITUSkillologies


skillologies

Web Resources
https://mitu.co.in
http://tusharkute.com

[email protected]
[email protected]
Data Visualization

Tushar B. Kute,
http://tusharkute.com
Agenda

• Data visualization
– What? Why?
– Benefits
– Techniques
– Who uses it?
– Challenges
Data Visualization

• Data visualization is the presentation of data in


a pictorial or graphical format.
• It enables decision makers to see analytics
presented visually, so they can grasp difficult
concepts or identify new patterns.
• With interactive visualization, you can take the
concept a step further by using technology to
drill down into charts and graphs for more
detail, interactively changing what data you see
and how it’s processed.
Data Visualization

• With so much information being collected through


data analysis in the business world today, we must
have a way to paint a picture of that data so we can
interpret it.
• Data visualization gives us a clear idea of what the
information means by giving it visual context
through maps or graphs.
• This makes the data more natural for the human
mind to comprehend and therefore makes it easier
to identify trends, patterns, and outliers within large
data sets.
Benefits of Data Visualization

• Correlations in Relationships: Without data visualization, it is


challenging to identify the correlations between the relationship
of independent variables. By making sense of those independent
variables, we can make better business decisions.
• Trends Over Time: While this seems like an obvious use of data
visualization, it is also one of the most valuable applications. It’s
impossible to make predictions without having the necessary
information from the past and present. Trends over time tell us
where we were and where we can potentially go.
• Frequency: Closely related to trends over time is frequency. Examining the rate – how often and when customers purchase – gives us a better feel for how potential new customers might act and react to different marketing and customer acquisition strategies.
Benefits of Data Visualization

• Examining the Market: Data visualization takes the information


from different markets to give you insights into which audiences
to focus your attention on and which ones to stay away from. We
get a clearer picture of the opportunities within those markets by
displaying this data on various charts and graphs.
• Risk and Reward: Looking at value and risk metrics requires
expertise because, without data visualization, we must interpret
complicated spreadsheets and numbers. Once information is
visualized, we can then pinpoint areas that may or may not require
action.
• Reacting to the Market: The ability to obtain information quickly
and easily with data displayed clearly on a functional dashboard
allows businesses to act and respond to findings swiftly and helps
to avoid making mistakes.
Data Visualization Techniques

• Infographics: Unlike a single data visualization,


infographics take an extensive collection of information
and gives you a comprehensive visual representation. An
infographic is excellent for exploring complex and
highly-subjective topics.
• Heatmap Visualization: This method uses a graph with
numerical data points highlighted in light or warm colors
to indicate whether the data is a high-value or a low-
value point. Psychologically, this data visualization
method helps the viewer to identify the information
because studies have shown that humans interpret
colors much better than numbers and letters.
Data Visualization Techniques

• Fever Charts: A fever chart shows changing data over a period of


time. As a marketing tool, we could take the performance from the
previous year and compare that to the prior year to get an accurate
projection of next year. This can help decision-makers easily
interpret wide and varying data sources.
• Area Chart (or Graph): Area charts are excellent for visualizing the
data’s time-series relationship. Whether you’re looking at the
earnings for individual departments on a month to month basis or
the popularity of a product since the 1980s, area charts can visualize
this relationship.
• Histogram: Rather than looking at the trends over time, histograms
are measuring frequencies instead. These graphs show the
distribution of numerical data using an automated data visualization
formula to display a range of values that can be easily interpreted.
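• A minimal sketch of two of these techniques, assuming matplotlib and NumPy with random, made-up data (not part of the original slides): a heatmap where warm colors flag high values, and a histogram showing how often values fall into each range.
import numpy as np                 # assumed libraries; data below is random,
import matplotlib.pyplot as plt    # purely for illustration

rng = np.random.default_rng(0)
activity = rng.integers(0, 100, size=(7, 24))        # e.g. activity per weekday/hour
purchases = rng.normal(loc=50, scale=15, size=500)   # e.g. order values

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))
im = ax1.imshow(activity, cmap="hot", aspect="auto") # heatmap: color encodes magnitude
ax1.set_title("Heatmap (weekday x hour)")
fig.colorbar(im, ax=ax1)
ax2.hist(purchases, bins=20)                         # histogram: frequencies, not trends
ax2.set_title("Histogram of order values")
plt.tight_layout()
plt.show()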
Who uses Data Visualization?

• Data visualization is used across all industries to increase sales


with existing customers and target new markets and
demographics for potential customers.
• The World Advertising and Research Center (WARC) predicts
that in 2020 half of the world’s advertising dollars will be spent
online, which means companies everywhere have discovered
the importance of web data.
• As a crucial step in data analytics, data visualization gives
companies critical insights into untapped information and
messages that would otherwise be lost.
• The days of scouring through thousands of rows of
spreadsheets are over, as now we have a visual summary of
data to identify trends and patterns.
Challenges

• Data visualization has changed our society


considerably. From the most simple projected line
across a football field through to complex graphs
outlining market fluctuations, they are changing the
way that our society is approaching and understanding
data.
– Virtual Reality (VR)
– Augmented Reality (AR)
– Development
– Differing levels of understanding
– Technical skills
Virtual Reality

• Virtual reality is going to have a huge impact on the


potential for data visualizations, allowing people to
interact with data in the third dimension for the first
time.
• Imagine being able to pick a data set and move it around on any axis to compare it to another – it isn't too far away.
• According to SAS we can process only 1 kilobit of
information per second on a flat screen, which can
be increased significantly if it’s analyzed in a 3D VR
world.
Virtual Reality

• Virtual reality is currently seen as being predominantly for entertainment, so trying to get a senior leader in a Fortune 500 company to wear a headset to look at sales data would certainly be a struggle.
• At present there are some moves to try and
make VR headsets more compact, but this is
going to take several years and data
visualization needs to stay front and centre
until then.
Augmented Reality

• Augmented reality may well be the single biggest


change that we are going to see regarding the use of
data visualizations.
• To some extent we have seen some of it already, with
HUDs like the now defunct Google Glass, overlaying
data onto what you can see in front of you.
• Bizarrely, one of the key reasons for the sudden
concentration on AR is the huge success of Pokemon
Go, which not only showed the capabilities of AR, but
also introduced it to a wide and diverse audience.
Augmented Reality

• The challenge that data visualization is going to


have is that those creating them need to make
sure they are doing so in an understandable and
non-obtrusive way.
• It creates a new dynamic, where the data
overlaid needs to be clear, concise and not
distracting.
• It’s a fine line to balance on and a real challenge
for those who are used to creating traditional
visualizations.
Development

• VR and AR are likely to be interesting technologies


in the future, but for the time being, we are still
going to be consuming the majority of our data
through traditional 2D screens.
• As the number of data visualizations increases in almost every area, the chances of yours standing out decrease too, as you're trying to get to the top of a larger and larger pile.
Development

• It means that whilst these other technologies are


developing, people working in data visualization
need to try and find a way of making their
visualizations stand out from the crowd, without
making it overly complex.
• This could mean more vivid colors, increased
interactivity or simply using the most interesting
data, but finding the correct way is certainly a
hurdle to overcome in the next few years.
Differing level of understanding

• As data has spread throughout society one of the elements


that has become evident is that there is a huge variation in
the levels of understanding.
• This could even be in a high-powered business setting, where people who are used to seeing basic Excel graphs do not understand anything more complex.
• The idea of interactivity within visualized data is not something they would ever feel is necessary.
• However, there are others who would benefit from more complex visualizations, where they can see as much as possible in as small a space as they can, through interactive design or simply more complex features.
Differing level of understanding

• It is, therefore, difficult for those designing


visualizations to match up to the wide-ranging
understanding of data and data visualizations.
• It could be that multiple visualizations are
created for different levels of data literacy, but
that’s simply wasting resources and is hardly a
practical solution.
Technical Skills

• As we move toward more interactive and complex trends for


data visualizations, we are going to be seeing an increased
need for technical skills to first understand and translate the
data then create visualizations around the results.
• We already have a shortage of data scientists and people
who can feed the right data to the right people, so this is
going to be a key challenge for the creation of decent data
visualizations that can pinpoint important data.
• Although the number of data scientists is likely to increase in the future, with more and more universities offering data science courses, they are unlikely to become prevalent for several years.
Technical Skills

• As discussed previously with VR and AR, working on new


technologies is not easy, especially for those with little
experience of similar areas.
• At the same time, these technologies are not developing
for data visualization alone, with those with the training
and qualifications having the option of working on other
popular mediums, like gaming or movies, which may also
have higher salaries given the focus of the technologies
on these markets.
• Therefore, trying to find somebody with the technical
expertise who hasn’t already joined these other
industries is going to be a huge hurdle to overcome.
Thank you
This presentation is created using LibreOffice Impress 5.1.6.2, can be used freely as per GNU General Public License

/mITuSkillologies @mitu_group /company/mitu- MITUSkillologies


skillologies

Web Resources
https://mitu.co.in
http://tusharkute.com

[email protected]
[email protected]
Big Data Solutions

Tushar B. Kute,
http://tusharkute.com
Traditional Enterprise Approach
Traditional Enterprise Approach

Limitation
• This approach works fine with those
applications that process less voluminous data
that can be accommodated by standard
database servers, or up to the limit of the
processor that is processing the data.
• But when it comes to dealing with huge
amounts of scalable data, it is a hectic task to
process such data through a single database
bottleneck.
Google's Solution

• Google solved this problem using an


algorithm called MapReduce.
• This algorithm divides the task into small
parts and assigns them to many
computers, and collects the results from
them which when integrated, form the
result dataset.
Google's Solution – MapReduce
Hadoop

• Using the solution provided by Google,


Doug Cutting and his team developed an
Open Source Project called HADOOP.
• Hadoop runs applications using the
MapReduce algorithm, where the data is
processed in parallel with others.
• In short, Hadoop is used to develop
applications that could perform complete
statistical analysis on huge amounts of data.
Hadoop
What is Hadoop ?

• Hadoop is an Apache open source framework


written in Java that allows distributed
processing of large datasets across clusters of
computers using simple programming models.
• The Hadoop framework application works in an
environment that provides distributed storage
and computation across clusters of computers.
• Hadoop is designed to scale up from single
server to thousands of machines, each offering
local computation and storage.
Hadoop Architecture
What is MapReduce?

• MapReduce is a parallel programming


model for writing distributed applications
devised at Google for efficient processing
of large amounts of data (multi-terabyte
data-sets), on large clusters (thousands of
nodes) of commodity hardware in a reliable,
fault-tolerant manner.
• The MapReduce program runs on Hadoop
which is an Apache open-source framework.
Hadoop Distributed File System

• The Hadoop Distributed File System (HDFS) is based


on the Google File System (GFS) and provides a
distributed file system that is designed to run on
commodity hardware.
• It has many similarities with existing distributed file
systems. However, the differences from other
distributed file systems are significant.
• It is highly fault-tolerant and is designed to be
deployed on low-cost hardware.
• It provides high throughput access to application data
and is suitable for applications having large datasets.
Hadoop Distributed File System

• Apart from the above-mentioned two


core components, Hadoop framework
also includes the following two modules:
– Hadoop Common: These are Java libraries
and utilities required by other Hadoop
modules.
– Hadoop YARN: This is a framework for job
scheduling and cluster resource
management.
How does Hadoop work?

• It is quite expensive to build bigger servers with


heavy configurations that handle large scale
processing, but as an alternative, you can tie together
many commodity computers with single-CPU, as a
single functional distributed system and practically,
the clustered machines can read the dataset in
parallel and provide a much higher throughput.
• Moreover, it is cheaper than one high-end server. So
this is the first motivational factor behind using
Hadoop that it runs across clustered and low-cost
machines.
How does Hadoop work?

• Hadoop runs code across a cluster of computers. This process


includes the following core tasks that Hadoop performs:
– Data is initially divided into directories and files. Files are divided into
uniform sized blocks of 128M and 64M (preferably 128M).
– These files are then distributed across various cluster nodes for
further processing.
– HDFS, being on top of the local file system, supervises the processing.
– Blocks are replicated for handling hardware failure.
– Checking that the code was executed successfully.
– Performing the sort that takes place between the map and reduce
stages.
– Sending the sorted data to a certain computer.
– Writing the debugging logs for each job.
Advantages of Hadoop

• The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines, and in turn utilizes the underlying parallelism of the CPU cores.
• Hadoop does not rely on hardware to provide fault-tolerance
and high availability (FTHA), rather Hadoop library itself has
been designed to detect and handle failures at the application
layer.
• Servers can be added or removed from the cluster dynamically
and Hadoop continues to operate without interruption.
• Another big advantage of Hadoop is that apart from being
open source, it is compatible on all the platforms since it is
Java based.
Thank you
This presentation is created using LibreOffice Impress 5.1.6.2, can be used freely as per GNU General Public License

/mITuSkillologies @mitu_group /company/mitu- MITUSkillologies


skillologies

Web Resources
https://mitu.co.in
http://tusharkute.com

[email protected]
[email protected]
MapReduce

Tushar B. Kute,
http://tusharkute.com
What is MapReduce?

• MapReduce is a framework using which we


can write applications to process huge
amounts of data, in parallel, on large clusters
of commodity hardware in a reliable manner.
• MapReduce is a processing technique and a
program model for distributed computing
based on java.
• The MapReduce algorithm contains two
important tasks, namely Map and Reduce.
Map and Reduce

• Map takes a set of data and converts it into


another set of data, where individual elements
are broken down into tuples (key/value pairs).
• Secondly, reduce task, which takes the output
from a map as an input and combines those
data tuples into a smaller set of tuples. As the
sequence of the name MapReduce implies, the
reduce task is always performed after the map
job.
Map and Reduce

• The major advantage of MapReduce is that it is easy to


scale data processing over multiple computing nodes.
• Under the MapReduce model, the data processing
primitives are called mappers and reducers.
• Decomposing a data processing application into mappers
and reducers is sometimes nontrivial. But, once we write an
application in the MapReduce form, scaling the application
to run over hundreds, thousands, or even tens of
thousands of machines in a cluster is merely a
configuration change.
• This simple scalability is what has attracted many
programmers to use the MapReduce model.
The Algorithm

• MapReduce program executes in three stages, namely map


stage, shuffle stage, and reduce stage.
• Map stage: The map or mapper’s job is to process the input
data. Generally the input data is in the form of file or
directory and is stored in the Hadoop file system (HDFS). The
input file is passed to the mapper function line by line. The
mapper processes the data and creates several small chunks
of data.
• Reduce stage: This stage is the combination of the Shuffle
stage and the Reduce stage. The Reducer’s job is to process
the data that comes from the mapper. After processing, it
produces a new set of output, which will be stored in the
HDFS.
The MapReduce
Inserting Data into HDFS

• The MapReduce framework operates on <key, value>


pairs, that is, the framework views the input to the job as
a set of <key, value> pairs and produces a set of <key,
value> pairs as the output of the job, conceivably of
different types.
• The key and value classes should be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
• Input and output types of a MapReduce job: (Input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (Output).
Input and output
Wordcount Example
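• A pure-Python sketch of the classic word-count example (an illustration of the idea only, not the Hadoop Java API): map emits (word, 1) pairs, the shuffle step groups pairs by key, and reduce sums the counts for each word.
# Illustrative sketch only, not Hadoop code
from collections import defaultdict

def map_phase(line):
    """Map: break a line into (key, value) pairs of the form (word, 1)."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(mapped_pairs):
    """Shuffle: group all values by key so each reducer sees one word."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single (word, count) pair."""
    return key, sum(values)

lines = ["Hadoop runs MapReduce", "MapReduce runs on Hadoop clusters"]
mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle(mapped)
counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(counts)   # {'hadoop': 2, 'runs': 2, 'mapreduce': 2, 'on': 1, 'clusters': 1}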
Matrix Multiplication Example
Matrix Multiplication Example
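• A pure-Python sketch of one common MapReduce formulation of matrix multiplication C = A x B (an assumed illustration, not the Hadoop API): map emits, for every cell of C, the A and B entries that contribute to it, and reduce multiplies the matching pairs and sums them.
# Illustrative sketch only, not Hadoop code
from collections import defaultdict

A = [[1, 2],
     [3, 4]]          # m x n matrix
B = [[5, 6],
     [7, 8]]          # n x p matrix
m, n, p = 2, 2, 2

def map_phase():
    # Key is the target cell (i, j); value tags the source matrix and the shared index k.
    for i in range(m):
        for k in range(n):
            for j in range(p):
                yield (i, j), ("A", k, A[i][k])
    for k in range(n):
        for j in range(p):
            for i in range(m):
                yield (i, j), ("B", k, B[k][j])

def reduce_phase(key, values):
    # Pair A and B entries by their shared index k and sum the products.
    a_vals = {k: v for tag, k, v in values if tag == "A"}
    b_vals = {k: v for tag, k, v in values if tag == "B"}
    return key, sum(a_vals[k] * b_vals[k] for k in a_vals)

grouped = defaultdict(list)
for key, value in map_phase():
    grouped[key].append(value)

C = dict(reduce_phase(k, v) for k, v in grouped.items())
print(C)   # {(0, 0): 19, (0, 1): 22, (1, 0): 43, (1, 1): 50}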
Thank you
This presentation is created using LibreOffice Impress 5.1.6.2, can be used freely as per GNU General Public License

/mITuSkillologies @mitu_group /company/mitu- MITUSkillologies


skillologies

Web Resources
https://mitu.co.in
http://tusharkute.com

[email protected]
[email protected]
Multi Layer Perceptron

Tushar B. Kute,
http://tusharkute.com
Neural Network

• Humans have an ability to identify patterns within the


accessible information with an astonishingly high degree
of accuracy.
• Whenever you see a car or a bicycle you can immediately
recognize what they are. This is because we have
learned over a period of time how a car and bicycle looks
like and what their distinguishing features are.
• Artificial neural networks are computation systems that
intend to imitate human learning capabilities via a
complex architecture that resembles the human nervous
system.
Human Nervous System
Human Nervous System

• Human nervous system consists of billions of neurons. These


neurons collectively process input received from sensory
organs, process the information, and decides what to do in
reaction to the input.
• A typical neuron in the human nervous system has three
main parts: dendrites, nucleus, and axons.
– The information passed to a neuron is received by
dendrites.
– The nucleus is responsible for processing this information.
– The output of a neuron is passed to other neurons via the
axon, which is connected to the dendrites of other
neurons further down the network.
Perceptron

• A perceptron is a simple binary classification


algorithm, proposed by Cornell scientist Frank
Rosenblatt.
• It helps to divide a set of input signals into two
parts—“yes” and “no”.
• But unlike many other classification algorithms, the
perceptron was modeled after the essential unit of
the human brain—the neuron and has an uncanny
ability to learn and solve complex problems.
Perceptron
Perceptron

• A perceptron is a very simple learning machine.


It can take in a few inputs, each of which has a
weight to signify how important it is, and
generate an output decision of “0” or “1”.
• However, when combined with many other
perceptrons, it forms an artificial neural
network.
• A neural network can, theoretically, answer any
question, given enough training data and
computing power.
Multilayer Perceptron

• A multilayer perceptron (MLP) is a perceptron


that teams up with additional perceptrons,
stacked in several layers, to solve complex
problems.
• Each perceptron in the first layer on the left
(the input layer), sends outputs to all the
perceptrons in the second layer (the hidden
layer), and all perceptrons in the second layer
send outputs to the final layer on the right (the
output layer).
Multilayer Perceptron
Multilayer Perceptron

• Each layer can have a large number of perceptrons,


and there can be multiple layers, so the multilayer
perceptron can quickly become a very complex
system.
• The multilayer perceptron has another, more
common name—a neural network.
• A three-layer MLP, like the diagram in previous slide,
is called a Non-Deep or Shallow Neural Network.
• An MLP with four or more layers is called a Deep
Neural Network.
Multilayer Perceptron

• One difference between an MLP and a neural


network is that in the classic perceptron, the
decision function is a step function and the
output is binary.
• In neural networks that evolved from MLPs,
other activation functions can be used which
result in outputs of real values, usually between
0 and 1 or between -1 and 1.
• This allows for probability-based predictions or
classification of items into multiple labels.
Structure of a Perceptron
The Percpetron Learning Process

1. Takes the inputs, multiplies them by their weights, and computes their sum
2. Adds a bias factor, the number 1 multiplied by a weight
3. Feeds the sum through the activation function
4. The result is the perceptron output
Step-1 Backpropagation

• Takes the inputs, multiplies them by their


weights, and computes their sum
• Why It’s Important ?
– The weights allow the perceptron to evaluate the
relative importance of each of the outputs.
– Neural network algorithms learn by discovering
better and better weights that result in a more
accurate prediction.
– There are several algorithms used to fine tune the
weights, the most common is called backpropagation.
Step-2 Neural Network Bias

• Adds a bias factor, the number 1 multiplied by a


weight
• This is a technical step that makes it possible to
move the activation function curve up and
down, or left and right on the number graph.
• It makes it possible to fine-tune the numeric
output of the perceptron.
Step-3 Activation Function

• Feeds the sum through the activation function


• The activation function maps the input values to
the required output values.
• For example, input values could be between 1
and 100, and outputs can be 0 or 1. The activation
function also helps the perceptron to learn, when
it is part of a multilayer perceptron (MLP).
• Certain properties of the activation function,
especially its non-linear nature, make it possible
to train complex neural networks.
Step-4 Output

• The perceptron output is a classification


decision.
• In a multilayer perceptron, the output of one
layer’s perceptrons is the input of the next
layer.
• The output of the final perceptrons, in the
“output layer”, is the final prediction of the
perceptron learning model.
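• The four steps can be sketched in a few lines of NumPy (the inputs, weights, and bias below are made-up values for illustration only).
import numpy as np

def step_activation(z):
    """Classic perceptron activation: output 1 if the sum crosses 0, else 0."""
    return 1 if z > 0 else 0

def perceptron_output(inputs, weights, bias_weight):
    # Step 1: multiply inputs by their weights and sum them up
    weighted_sum = np.dot(inputs, weights)
    # Step 2: add the bias factor (the constant input 1 times its weight)
    weighted_sum += 1.0 * bias_weight
    # Steps 3 and 4: feed the sum through the activation function;
    # the result is the perceptron's binary output
    return step_activation(weighted_sum)

x = np.array([0.7, 0.2, 0.5])        # example inputs (made up)
w = np.array([0.4, -0.6, 0.9])       # example weights (made up)
print(perceptron_output(x, w, bias_weight=-0.5))   # prints 1 for these values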
Transformation

• From the Classic Perceptron to a Full-Fledged


Deep Neural Network
• Although multilayer perceptrons (MLP) and
neural networks are essentially the same thing,
you need to add a few ingredients before an
MLP becomes a full neural network. These are:
– Backpropagation
– Hyperparameters
– Advanced structures
Backpropagation

• The backpropagation algorithm allows you to


perform a “backward pass”, which helps tune the
weights of the inputs.
• Backpropagation performs iterative backward
passes which attempt to minimize the “loss”, or the
difference between the known correct prediction
and the actual model prediction.
• With each backward pass, the weights move
towards an optimum that minimizes the loss
function and results in the most accurate prediction.
Backpropagation

• Backpropagation is an algorithm commonly used


to train neural networks.
• When the neural network is initialized, weights
are set for its individual elements, called neurons.
• Inputs are loaded, they are passed through the
network of neurons, and the network provides an
output for each one, given the initial weights.
• Backpropagation helps to adjust the weights of
the neurons so that the result comes closer and
closer to the known true result.
Backpropagation
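• The idea can be sketched for a single sigmoid neuron with NumPy (a simplified illustration of gradient-based weight updates on made-up toy data, not a full multi-layer backpropagation implementation).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Four training samples of an OR-like problem (made-up toy data)
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, 1.0, 1.0, 0.0])
w = np.zeros(2)
b = 0.0
learning_rate = 0.5

for epoch in range(1000):
    # Forward pass: current predictions
    pred = sigmoid(X @ w + b)
    # Backward pass: gradient of the cross-entropy loss w.r.t. weights and bias
    error = pred - y
    grad_w = X.T @ error / len(y)
    grad_b = error.mean()
    # Update: move the weights towards values that minimize the loss
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(np.round(sigmoid(X @ w + b)))   # predictions round to [1. 1. 1. 0.]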
Hyperparameters

• In a modern neural network, aspects of the


multilayer structure such as the number of
layers, initial weights, the type of activation
function, and details of the learning process, are
treated as parameters and tuned to improve
the performance of the neural network.
• Tuning hyperparameters is an art, and can have
a huge impact on the performance of a neural
network.
Model vs. Hyperparameters

• Model parameters are internal to the neural network – for


example, neuron weights. They are estimated or learned
automatically from training samples. These parameters
are also used to make predictions in a production model.
• Hyperparameters are external parameters set by the
operator of the neural network – for example, selecting
which activation function to use or the batch size used in
training.
• Hyperparameters have a huge impact on the accuracy of a neural network; the optimal values can differ from one problem to another, and it is non-trivial to discover those values.
Hyperparameters of Neural N/W

• Number of hidden layers


• Dropout
• Neural network activation function
• Weights initialization
Hyperparameters of Neural N/W

• Number of hidden layers –


– adding more hidden layers of neurons generally improves
accuracy, to a certain limit which can differ depending on
the problem.
• Dropout –
– what percentage of neurons should be randomly “killed”
during each epoch to prevent overfitting.
• Neural network activation function –
– which function should be used to process the inputs flowing
into each neuron. The activation function can impact the
network’s ability to converge and learn for different ranges
of input values, and also its training speed.
Hyperparameters of Neural N/W

• Weights initialization –
– it is necessary to set initial weights for the first forward
pass. Two basic options are to set weights to zero or to
randomize them.
– However, this can result in a vanishing or exploding
gradient, which will make it difficult to train the model.
– To mitigate this problem, you can use a heuristic (a
formula tied to the number of neuron layers) to
determine the weights.
– A common heuristic used for the Tanh activation is
called Xavier initialization.
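• A minimal Keras sketch (TensorFlow/Keras is an assumed choice here, and the layer sizes and input shape are made up) showing where these hyperparameters appear: number of hidden layers, dropout rate, activation function, and weight initialization (glorot_uniform is Keras' name for Xavier initialization).
import tensorflow as tf   # assumed framework for this sketch

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),                        # 20 input features (made up)
    tf.keras.layers.Dense(64, activation="tanh",               # hidden layer 1
                          kernel_initializer="glorot_uniform"),
    tf.keras.layers.Dropout(0.2),                              # randomly drop 20% of neurons
    tf.keras.layers.Dense(32, activation="tanh",               # hidden layer 2
                          kernel_initializer="glorot_uniform"),
    tf.keras.layers.Dense(1, activation="sigmoid"),            # binary output
])
model.summary()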
Hyperparameters of training algo

• Neural network learning rate


• Deep learning epoch, iterations and batch size
• Optimizer algorithm and neural network
momentum
Neural Network Learning Rate

• How fast the backpropagation algorithm performs gradient descent.
• A higher learning rate makes the network train faster but might result in overshooting and missing the minimum of the loss function.
Epoch, iterations, batch size

• Deep learning epoch, iterations and batch size – these


parameters determine the rate at which samples are fed to
the model for training.
• An epoch is one pass of the full set of training samples through the model (forward pass) and then through backpropagation (backward pass) to update the weights.
• If the epoch cannot be run all at once due to the size of the data or the complexity of the network, it is split into batches, and the epoch is run in two or more iterations.
• The number of epochs and batches per epoch can
significantly affect model fit, as shown (next slide).
Epoch, iterations, batch size
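• A minimal Keras sketch (TensorFlow/Keras assumed, with made-up random data) showing where epochs and batch size enter: with 1,000 samples and batch_size=100, each epoch consists of 10 iterations (batches).
import numpy as np
import tensorflow as tf   # assumed framework; data below is random, for illustration only

X_train = np.random.rand(1000, 20)
y_train = np.random.randint(0, 2, size=1000)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# 20 full passes over the data, fed to the model 100 samples at a time
model.fit(X_train, y_train, epochs=20, batch_size=100)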
Optimizer Algorithm

• Optimizer algorithm and neural network momentum –


when a neural network trains, it uses an algorithm to
determine the optimal weights for the model, called an
optimizer.
• The basic option is Stochastic Gradient Descent, but there
are other options.
• Another common algorithm is Momentum, which carries over a fraction of the previous weight update (a velocity term) and adds it to the current update.
• This speeds up training gradually, with a reduced risk of
oscillation. Other algorithms are Nesterov Accelerated
Gradient, AdaDelta and Adam.
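• The classical momentum update rule can be sketched in NumPy (an illustration only; real frameworks implement this inside their optimizers, and the gradients below are made up).
import numpy as np

def momentum_step(w, grad, velocity, learning_rate=0.01, momentum=0.9):
    velocity = momentum * velocity - learning_rate * grad   # carry over past updates
    w = w + velocity                                        # apply the combined step
    return w, velocity

w = np.array([0.5, -0.3])
velocity = np.zeros_like(w)
for grad in [np.array([0.2, -0.1]), np.array([0.18, -0.12]), np.array([0.15, -0.09])]:
    w, velocity = momentum_step(w, grad, velocity)
print(w)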
Hyperparameter Tuning Methods

• Manual Hyperparameter Tuning


• Grid Search
• Random Search
• Bayesian Optimization
Manual Tuning

• Traditionally, hyperparameters were tuned manually by


trial and error.
• This is still commonly done, and experienced operators
can “guess” parameter values that will achieve very high
accuracy for deep learning models.
• However, there is a constant search for better, faster
and more automatic methods to optimize
hyperparameters.
• Pros: Very simple and effective with skilled operators
• Cons: Not scientific, unknown if you have fully
optimized hyperparameters
Grid Search

• Grid search is slightly more sophisticated than manual tuning. It


involves systematically testing multiple values of each
hyperparameter, by automatically retraining the model for each
value of the parameter.
• For example, you can perform a grid search for the optimal
batch size by automatically training the model for batch sizes
between 10-100 samples, in steps of 20.
• The model will run 5 times and the batch size selected will be
the one which yields highest accuracy.
• Pros: Maps out the problem space and provides more
opportunity for optimization
• Cons: Can be slow to run for large numbers of hyperparameter
values
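• A scikit-learn sketch of the batch-size grid search described above (MLPClassifier and GridSearchCV are an assumed choice, not prescribed by the slides): each candidate value retrains the model, and the best-scoring one is reported.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Made-up synthetic data for illustration
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_grid = {"batch_size": [20, 40, 60, 80, 100]}   # 5 candidate values, steps of 20
grid = GridSearchCV(
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=0),
    param_grid,
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)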
Random Search

• According to a 2012 research study by James Bergstra and


Yoshua Bengio, testing randomized values of
hyperparameters is actually more effective than manual
search or grid search.
• In other words, instead of testing systematically to cover
“promising areas” of the problem space, it is preferable to
test random values drawn from the entire problem space.
• Pros: According to the study, provides higher accuracy with
less training cycles, for problems with high dimensionality
• Cons: Results are unintuitive, difficult to understand “why”
hyperparameter values were chosen
Comparing
Bayesian Optimization

• Bayesian optimization (described by Shahriari, et al) is


a technique which tries to approximate the trained
model with different possible hyperparameter values.
• To simplify, bayesian optimization trains the model
with different hyperparameter values, and observes
the function generated for the model by each set of
parameter values.
• It does this over and over again, each time selecting
hyperparameter values that are slightly different and
can help plot the next relevant segment of the
problem space.
Bayesian Optimization

• Similar to sampling methods in statistics, the


algorithm ends up with a list of possible
hyperparameter value sets and model functions, from
which it predicts the optimal function across the
entire problem set.
• Pros: The original study and practical experience
from the industry shows that bayesian optimization
results in significantly higher accuracy compared to
random search.
• Cons: Like random search, results are not intuitive
and difficult to improve on, even by trained operators
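The loop below is a simplified sketch of this idea built only from scikit-learn parts: a GaussianProcessRegressor as the surrogate model and a lower-confidence-bound acquisition rule. The objective (validation error of an SVM as a function of log10(C)) and the search range are hypothetical choices.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

def objective(log_c):
    # Error to minimize: 1 - cross-validated accuracy for C = 10**log_c.
    return 1.0 - cross_val_score(SVC(C=10 ** log_c), X, y, cv=3).mean()

candidates = np.linspace(-3, 3, 200).reshape(-1, 1)   # search space for log10(C)
tried = [[-3.0], [0.0], [3.0]]                        # a few initial evaluations
observed = [objective(c[0]) for c in tried]

for _ in range(10):
    surrogate = GaussianProcessRegressor(alpha=1e-6).fit(tried, observed)
    mean, std = surrogate.predict(candidates, return_std=True)
    # Pick the candidate where the surrogate is either promising or uncertain.
    next_point = candidates[int(np.argmin(mean - 1.96 * std))]
    tried.append([float(next_point[0])])
    observed.append(objective(next_point[0]))

best = tried[int(np.argmin(observed))]
print("best log10(C):", best[0], "estimated error:", min(observed))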
In real world...

• In a real neural network project, you will have three


practical options:
– Performing manual optimization
– Leveraging hyperparameter optimization
techniques in the deep learning framework of
your choice. The framework will report on
hyperparameter values discovered, their
accuracy and validation scores
– Using third party hyperparameter optimization
tools
Advanced Structures

• Many neural networks use a complex structure that


builds on the multilayer perceptron.
• For example, a bidirectional Recurrent Neural Network
(RNN) uses two recurrent networks in parallel: one runs the
training data from beginning to end, the other from the end
to the beginning, which helps with language processing.
• A Convolutional Neural Network (CNN) scans learned
filters across a three-dimensional input volume, processing
the height, width and depth of each data point together.
• This is useful for color images which have three layers
of “depth”: red, green and blue.
Neural Network in Real World

• In the real world, perceptrons work under the hood.


You will run neural networks using deep learning
frameworks such as TensorFlow, Keras, and PyTorch.
• These frameworks ask you for hyperparameters
such as the number of layers, activation function,
and type of neural network, and construct the
network of perceptrons automatically.
• When you work on real, production-scale deep
learning projects, you will find that the operations
side of things can become a bit daunting:
Neural Network in Real World

• Running experiments at scale and tracking results,


source code, metrics, and hyperparameters.
– To succeed at deep learning you need to run large
numbers of experiments and manage them
correctly to see what worked.
• Running experiments across multiple machines—
– in most cases neural networks are computationally
intensive. To work efficiently, you’ll need to run
experiments on multiple machines. This requires
provisioning these machines and distributing the
work.
Neural Network in Real World

• Manage training data—


– The more training data you provide, the
better the model will learn and perform.
– There are files to manage and copy to the
training machines.
– If your model’s input is multimedia, those
files can weigh anywhere from Gigabytes to
Petabytes.
Activation Function

• Neural network activation functions are a crucial


component of deep learning.
• Activation functions determine the output of a deep
learning model, its accuracy, and also the
computational efficiency of training a model—which
can make or break a large scale neural network.
• Activation functions also have a major effect on the
neural network’s ability to converge and the
convergence speed, or in some cases, activation
functions might prevent neural networks from
converging in the first place.
Activation Function

• Activation functions are mathematical equations


that determine the output of a neural network.
• The function is attached to each neuron in the
network, and determines whether it should be
activated (“fired”) or not, based on whether
each neuron’s input is relevant for the model’s
prediction.
• Activation functions also help normalize the
output of each neuron to a range between 0 and
1 or between -1 and 1.
Activation Function

• An additional aspect of activation functions is


that they must be computationally efficient
because they are calculated across thousands or
even millions of neurons for each data sample.
• Modern neural networks use a technique called
backpropagation to train the model, which
places an increased computational strain on the
activation function, and its derivative function.
Common Activation Function
ANN and DNN

• Artificial Neural Networks (ANN) are comprised of a


large number of simple elements, called neurons, each of
which makes simple decisions. Together, the neurons can
provide accurate answers to some complex problems,
such as natural language processing, computer vision,
and AI.
• A neural network can be “shallow”, meaning it has an
input layer of neurons, only one “hidden layer” that
processes the inputs, and an output layer that provides
the final output of the model.
• A Deep Neural Network (DNN) commonly has between 2-
8 additional layers of neurons.
Non-Deep Feed Forward Neural N/W
Deep Neural Network
Role of Activation Function

• In a neural network, numeric data points, called inputs, are


fed into the neurons in the input layer. Each neuron has a
weight, and multiplying the input number with the weight
gives the output of the neuron, which is transferred to the
next layer.
• The activation function is a mathematical “gate” in between
the input feeding the current neuron and its output going to
the next layer.
• It can be as simple as a step function that turns the neuron
output on and off, depending on a rule or threshold. Or it can
be a transformation that maps the input signals into output
signals that are needed for the neural network to function.
Role of Activation Function
Process Carried out by Neuron
Types of Activation Function

• Binary Step Function


• Linear Activation Function
• Non Linear Activation Function
Binary Step Function

• A binary step function is a threshold-based activation
function. If the input value is above a certain threshold, the
neuron is activated and sends exactly the same signal to the
next layer; if it is below the threshold, the neuron is not
activated.
• The problem with a step function is that it does not allow
multi-value outputs: for example, it cannot support
classifying the inputs into one of several categories.
Linear Activation Function

• It takes the inputs, multiplied by the weights for


each neuron, and creates an output signal
proportional to the input.
• In one sense, a linear function is better than a
step function because it allows multiple
outputs, not just yes and no.
Problems: Linear Activation Function

• Not possible to use backpropagation (gradient descent) to


train the model—
– The derivative of the function is a constant, and has no
relation to the input, X. So it’s not possible to go back and
understand which weights in the input neurons can provide
a better prediction.
• All layers of the neural network collapse into one—
– With linear activation functions, no matter how many layers
in the neural network, the last layer will be a linear function
of the first layer (because a linear combination of linear
functions is still a linear function). So a linear activation
function turns the neural network into just one layer.
Non Linear Activation Function

• Modern neural network models use non-linear


activation functions. They allow the model to
create complex mappings between the network’s
inputs and outputs, which are essential for
learning and modeling complex data, such as
images, video, audio, and data sets which are non-
linear or have high dimensionality.
• Almost any process imaginable can be represented
as a functional computation in a neural network,
provided that the activation function is non-linear.
Advantages: Non Linear Activation Function

• Non-linear functions address the problems of a


linear activation function:
– They allow backpropagation because they have a
derivative function which is related to the inputs.
– They allow “stacking” of multiple layers of
neurons to create a deep neural network.
Multiple hidden layers of neurons are needed to
learn complex data sets with high levels of
accuracy.
Common Non-Linear Functions

• Sigmoid / Logistic
• Tanh / Hyperbolic Tangent
• ReLU (Rectified Linear Unit)
• Leaky ReLU
• Parametric ReLU
• Softmax
• Swish
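Plain NumPy sketches of the functions listed above; Swish is written in the simple x * sigmoid(x) form, and parameter choices such as the 0.01 leaky slope are common defaults rather than requirements.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def parametric_relu(x, alpha):
    # Same shape as leaky ReLU, but alpha is learned during training.
    return np.where(x > 0, x, alpha * x)

def softmax(x):
    e = np.exp(x - np.max(x))      # shift for numerical stability
    return e / e.sum()

def swish(x):
    return x * sigmoid(x)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z), relu(z), softmax(z))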
Sigmoid / Logistic
Sigmoid / Logistic

• Advantages
– Smooth gradient, preventing “jumps” in output values.
– Output values bound between 0 and 1, normalizing the output of
each neuron.
– Clear predictions—For X above 2 or below -2, tends to bring the Y
value (the prediction) to the edge of the curve, very close to 1 or 0.
This enables clear predictions.
• Disadvantages
– Vanishing gradient—for very high or very low values of X, there is
almost no change to the prediction, causing a vanishing gradient
problem. This can result in the network refusing to learn further, or
being too slow to reach an accurate prediction.
– Outputs not zero centered.
– Computationally expensive
Tanh

• Advantages
– Zero centered—making it easier to model inputs
that have strongly negative, neutral, and
strongly positive values.
– Otherwise like the Sigmoid function.
• Disadvantages
– Like the Sigmoid function
ReLU (Rectified Linear Unit)
ReLU (Rectified Linear Unit)

• Advantages
– Computationally efficient—allows the network to
converge very quickly
– Non-linear—although it looks like a linear function,
ReLU has a derivative function and allows for
backpropagation
• Disadvantages
– The Dying ReLU problem—when inputs approach
zero, or are negative, the gradient of the function
becomes zero, the network cannot perform
backpropagation and cannot learn.
Leaky ReLU

• Advantages
– Prevents dying ReLU problem—this variation of
ReLU has a small positive slope in the negative
area, so it does enable backpropagation, even
for negative input values
– Otherwise like ReLU
• Disadvantages
– Results not consistent—leaky ReLU does not
provide consistent predictions for negative
input values.
Leaky ReLU
Parametric ReLU

• Advantages
– Allows the negative slope to be learned—unlike
leaky ReLU, this function provides the slope of
the negative part of the function as an argument.
– It is, therefore, possible to perform
backpropagation and learn the most appropriate
value of α.
– Otherwise like ReLU
• Disadvantages
– May perform differently for different problems.
Softmax
Softmax

• Advantages
– Able to handle multiple classes, unlike other activation
functions that handle only one class: it normalizes the
outputs for each class between 0 and 1 and divides by
their sum, giving the probability of the input value being
in a specific class.
– Useful for output neurons—typically Softmax is used
only for the output layer, for neural networks that
need to classify inputs into multiple categories.
Swish

• Swish is a new, self-gated activation function discovered


by researchers at Google.
• According to their paper, it performs better than ReLU
with a similar level of computational efficiency.
• In experiments on ImageNet with identical models
running ReLU and Swish, the new function achieved
top-1 classification accuracy 0.6-0.9% higher.
Summary
Challenges

• While selecting and switching activation


functions in deep learning frameworks is easy,
you will find that managing multiple experiments
and trying different activation functions on large
test data sets can be challenging.
• It can be difficult to:
– Track experiment progress
– Run experiments across multiple machines
– Manage training data
Let’s Start with an example

Reference: Friendly introduction to RNN by Luis Serrano


Conditional Outputs
Basic NN
Let’s do some maths
Now in NN
Conceptualizing
Adding to NN
Useful resources

• https://missinglink.ai
• https://machinelearningmastery.com
• https://www.allaboutcircuits.com
• https://medium.com
Thank you
This presentation is created using LibreOffice Impress 5.1.6.2, can be used freely as per GNU General Public License

/mITuSkillologies @mitu_group /company/mitu- MITUSkillologies


skillologies

Web Resources
http://mitu.co.in
http://tusharkute.com

[email protected]
[email protected]
Random Forest

Tushar B. Kute,
http://tusharkute.com
Random Forest

• Random forest is a type of supervised machine learning


algorithm based on ensemble learning.
• Ensemble learning is a type of learning where you join
different types of algorithms or same algorithm multiple
times to form a more powerful prediction model.
• The random forest algorithm combines multiple
algorithm of the same type i.e. multiple decision trees,
resulting in a forest of trees, hence the name "Random
Forest".
• The random forest algorithm can be used for both
regression and classification tasks.
How it works ?

• Pick N random records from the dataset.


• Build a decision tree based on these N records.
• Choose the number of trees you want in your algorithm
and repeat steps 1 and 2.
• In case of a regression problem, for a new record, each
tree in the forest predicts a value for Y (output). The final
value can be calculated by taking the average of all the
values predicted by all the trees in forest. Or, in case of a
classification problem, each tree in the forest predicts the
category to which the new record belongs. Finally, the
new record is assigned to the category that wins the
majority vote.
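A short scikit-learn sketch of the procedure above: many trees, each trained on a random sample of records, combined by majority vote (classification) or by averaging (regression). The iris data is only a stand-in dataset.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

# n_estimators is the number of trees built for the forest.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("classification accuracy:", clf.score(X_test, y_test))

# For a regression problem the forest averages the trees' predictions.
reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X_train, y_train)          # treating the labels as numeric targets here
print("regression prediction:", reg.predict(X_test[:1]))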

Majority Voting

Source: medium.com
Regressor Output

Source: medium.com
Advantages

• The random forest algorithm is not biased, since there are
multiple trees and each tree is trained on a subset of data.
Basically, the random forest algorithm relies on the power of
"the crowd"; therefore the overall biasedness of the algorithm
is reduced.
• This algorithm is very stable. Even if a new data point is
introduced in the dataset, the overall algorithm is not affected
much, since new data may impact one tree, but it is very hard
for it to impact all the trees.
• The random forest algorithm works well when you have both
categorical and numerical features.
• The random forest algorithm also works well when data has
missing values or has not been scaled well.
Disadvantages

• A major disadvantage of random forests lies in


their complexity. They required much more
computational resources, owing to the large
number of decision trees joined together.
• Due to their complexity, they require much
more time to train than other comparable
algorithms.
Useful resources

• www.pythonprogramminglanguage.com
• www.scikit-learn.org
• www.towardsdatascience.com
• www.medium.com
• www.analyticsvidhya.com
• www.kaggle.com
• www.stephacking.com
• www.github.com
Thank you
This presentation is created using LibreOffice Impress 5.1.6.2, can be used freely as per GNU General Public License

/mITuSkillologies @mitu_group /company/mitu- MITUSkillologies


skillologies

Web Resources
https://mitu.co.in
http://tusharkute.com

[email protected]
[email protected]
Decision Tree

Tushar B. Kute,
http://tusharkute.com
Lets see the example...

• Suppose a job seeker was deciding between several


offers, some closer or further from home, with various
levels of pay and benefits.
• He or she might create a list with the features of each
position. Based on these features, rules can be created to
eliminate some options.
• For instance, "if I have a commute longer than an hour,
then I will be unhappy", or "if I make less than $50k, I
won't be able to support my family."
• The difficult decision of predicting future happiness can
be reduced to a series of small, but increasingly specific
choices.
Decision tree

• Decision tree is a graph to represent choices


and their results in form of a tree.
• The nodes in the graph represent an event or
choice and the edges of the graph represent the
decision rules or conditions.
• It is mostly used in Machine Learning and Data
Mining applications using Python.
Understanding Decision tree

• As you might intuit from the name, decision tree


learners build a model in the form of a tree structure.
• The model itself comprises a series of logical
decisions, similar to a flowchart, with decision nodes
that indicate a decision to be made on an attribute.
• These split into branches that indicate the decision's
choices.
• The tree is terminated by leaf nodes (also known as
terminal nodes) that denote the result of following a
combination of decisions.
Decision tree – example

• Examples of use of decision tress is − predicting an


email as spam or not spam, predicting of a tumor is
cancerous or predicting a loan as a good or bad
credit risk based on the factors in each of these.
• Generally, a model is created with observed data
also called training data. Then a set of validation
data is used to verify and improve the model.
• For new set of predictor variable, we use this
model to arrive at a decision on the category (yes/
No, spam/not spam) of the data.
Few more applications

• Credit scoring models in which the criteria that


causes an applicant to be rejected need to be
well-specified
• Marketing studies of customer churn or
customer satisfaction that will be shared with
management or advertising agencies
• Diagnosis of medical conditions based on
laboratory measurements, symptoms, or rate of
disease progression
Divide and Conquer

• Decision trees are built using a heuristic called


recursive partitioning.
• This approach is generally known as divide and
conquer because it uses the feature values to split the
data into smaller and smaller subsets of similar classes.
• Beginning at the root node, which represents the
entire dataset, the algorithm chooses a feature that is
the most predictive of the target class.
• The examples are then partitioned into groups of
distinct values of this feature; this decision forms the
first set of tree branches.
Divide and Conquer

• To illustrate the tree building process, let's


consider a simple example.
• Imagine that you are working for a Hollywood
film studio, and your desk is piled high with
screenplays.
• Rather than read each one cover-to-cover, you
decide to develop a decision tree algorithm to
predict whether a potential movie would fall
into one of three categories: mainstream hit,
critic's choice, or box office bust.
Continuing...

• To gather data for your model, you turn to the studio


archives to examine the previous ten years of movie
releases.
• After reviewing the data for 30 different movie
scripts, a pattern emerges.
• There seems to be a relationship between the film's
proposed shooting budget, the number of A-list
celebrities lined up for starring roles, and the
categories of success.
• A scatter plot of this data might look something
like . . .
The scatterplot
Scatterplot – Phase:1
Scatterplot – Phase:2
The decision tree model
The C5.0 Algorithm

• There are numerous implementations of decision trees,


but one of the most well-known is the C5.0 algorithm.
• This algorithm was developed by computer scientist J.
Ross Quinlan as an improved version of his prior
algorithm, C4.5, which itself is an improvement over his
ID3 (Iterative Dichotomiser 3) algorithm.
• Although Quinlan markets C5.0 to commercial clients
(see http://www.rulequest.com/ for details), the source
code for a single-threaded version of the algorithm was
made publically available, and has therefore been
incorporated into programs such as R.
The C4.5 Algorithm

• To further confuse matters, a popular Java-


based open-source alternative to C4.5, titled
J48, is included in the RWeka package.
• Because the differences among C5.0, C4.5, and
J48 are minor, the principles in this
presentation will apply to any of these three
methods and the algorithms should be
considered synonymous.
The Decision tree algorithm
Example:
Example:
Gini index

• Gini index and information gain are both methods used to
select, from the n attributes of the dataset, which attribute
should be placed at the root node or an internal node.

• Gini Index is a metric to measure how often a randomly


chosen element would be incorrectly identified.
• It means an attribute with lower gini index should be
preferred.
• Sklearn supports “gini” criteria for Gini Index and by default,
it takes “gini” value.
Entropy

• Entropy is the measure of uncertainty of a


random variable, it characterizes the impurity of
an arbitrary collection of examples. The higher
the entropy the more the information content.
Search for a good tree

• How should you go about building a decision tree?


• The space of decision trees is too big for systematic
search.
• Stop and
– return a value for the target feature, or
– a distribution over target feature values
• Choose a test (e.g. an input feature) to split on.
– For each value of the test, build a subtree for those
examples with this value for the test.
Top down induction

1. Which node to proceed with?
• Choose A, the “best” decision attribute for the next node
• Assign A as the decision attribute for the node
• For each value of A, create a new descendant
• Sort training examples to leaf nodes according to the
attribute value of the branch
• If all training examples are perfectly classified (same
value of target attribute) stop, else iterate over new
leaf nodes.
2. When to stop?
Choices

• When to stop
– no more input features
– all examples are classified the same
– too few examples to make an informative split
• Which test to split on
– split gives smallest error.
– With multi-valued features
– split on all values or
– split values into half.
Which attribute is best ?

• Attribute A1 splits S = [29+,35-] into [21+,5-] (True) and [8+,30-] (False).
• Attribute A2 splits S = [29+,35-] into [18+,33-] (True) and [11+,2-] (False).
Principle Criterion

• Selection of an attribute to test at each node -


choosing the most useful attribute for classifying
examples.
• Information gain
– measures how well a given attribute separates the training
examples according to their target classification
– This measure is used to select among the candidate
attributes at each step while growing the tree
– Gain is measure of how much we can reduce
uncertainty (Value lies between 0,1)
Entropy

• A measure for
– uncertainty
– purity
– information content
• Information theory: optimal length code assigns (- log2p) bits to
message having probability p
• S is a sample of training examples
– p+ is the proportion of positive examples in S
– p- is the proportion of negative examples in S
• Entropy of S: average optimal number of bits to encode
information about certainty/uncertainty about S
Entropy(S) = p+·(-log2 p+) + p-·(-log2 p-) = -p+·log2 p+ - p-·log2 p-
Entropy

• The entropy is 0 if the outcome


is ``certain”.
• The entropy is maximum if we
have no knowledge of the
system (or any outcome is
equally possible).

• S is a sample of training examples


• p+ is the proportion of positive examples
• p- is the proportion of negative examples
• Entropy measures the impurity of S
Entropy(S) = -p+·log2 p+ - p-·log2 p-
Information Gain

Gain(S,A): expected reduction in entropy due to partitioning S
on attribute A

Gain(S,A) = Entropy(S) − Σ v∈values(A) (|Sv| / |S|) · Entropy(Sv)

Entropy([29+,35-]) = −(29/64)·log2(29/64) − (35/64)·log2(35/64) = 0.99

The two candidate splits from the previous slide:
• A1 splits [29+,35-] into [21+,5-] (True) and [8+,30-] (False)
• A2 splits [29+,35-] into [18+,33-] (True) and [11+,2-] (False)
Information Gain
Entropy([21+,5-]) = 0.71     Entropy([18+,33-]) = 0.94
Entropy([8+,30-]) = 0.74     Entropy([11+,2-]) = 0.62

Gain(S,A1) = Entropy(S) − (26/64)·Entropy([21+,5-]) − (38/64)·Entropy([8+,30-]) = 0.27
Gain(S,A2) = Entropy(S) − (51/64)·Entropy([18+,33-]) − (13/64)·Entropy([11+,2-]) = 0.12

A1 gives the larger information gain, so it is the better attribute to split on.
Selecting next attribute
For S = [9+,5-] with E = 0.940:

• Humidity: High → [3+,4-] (E = 0.985), Normal → [6+,1-] (E = 0.592)
  Gain(S,Humidity) = 0.940 − (7/14)·0.985 − (7/14)·0.592 = 0.151
• Wind: Weak → [6+,2-] (E = 0.811), Strong → [3+,3-] (E = 1.0)
  Gain(S,Wind) = 0.940 − (8/14)·0.811 − (6/14)·1.0 = 0.048

Humidity provides greater information gain than Wind, w.r.t. the target classification.
Selecting next attribute

For S = [9+,5-] with E = 0.940:

• Outlook: Sunny → [2+,3-] (E = 0.971), Overcast → [4+,0-] (E = 0.0),
  Rain → [3+,2-] (E = 0.971)

Gain(S,Outlook) = 0.940 − (5/14)·0.971 − (4/14)·0.0 − (5/14)·0.971 = 0.247
Selecting next attribute

The information gain values for the 4 attributes are:


• Gain(S,Outlook) =0.247
• Gain(S,Humidity) =0.151
• Gain(S,Wind) =0.048
• Gain(S,Temperature) =0.029

where S denotes the collection of training examples

Note: 0 · log2(0) = 0 (by convention)
Packages needed

• Data Analytics
– sudo pip3 install pandas
• Decision Tree Algorithm
– sudo pip3 install sklearn
• Visualization
– sudo pip3 install ipython
– sudo pip3 install graphviz
– sudo pip3 install pydotplus
– sudo apt install graphviz
Simplified Decision Tree
Decision Tree Classification

• We will predict whether a bank note is authentic


or fake depending upon the four different
attributes of the image of the note.
• The attributes are Variance of wavelet
transformed image, curtosis of the image,
entropy, and skewness of the image.
• Dataset:
– https://archive.ics.uci.edu/ml/datasets/bankn
ote+authentication
Dataset
Reading the dataset
Training the classifier
Splitting training and testing

X_train: (80%) y_train: (80%)

X_test: (20%) y_test: (20%)


train_test_split

• train_test_split(*arrays, **options)
– Split arrays or matrices into random train and
test subsets
– *arrays : sequence of indexables with same
length / shape[0]
• Allowed inputs are lists, numpy arrays, scipy-sparse
matrices or pandas dataframes.
– test_size : float, int, or None (default is None)
• If float, should be between 0.0 and 1.0 and represent
the proportion of the dataset to include in the test
split.
DecisionTreeClassifier

• DecisionTreeClassifier (criterion=’gini’, splitter=’best’,


max_depth=None, min_samples_split=2,
min_samples_leaf=1, min_weight_fraction_leaf=0.0,
max_features=None, random_state=None,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, class_weight=None,
presort=False)
The fit function

• Fitting your model to (i.e. using the .fit() method on) the
training data is essentially the training part of the
modeling process. It finds the coefficients for the
equation specified via the algorithm being used.
• Then, for a classifier, you can classify incoming data points
(from a test set, or otherwise) using the predict method.
Or, in the case of regression, your model will
interpolate/extrapolate when predict is used on incoming
data points.
• It also should be noted that sometimes the "fit"
nomenclature is used for non-machine-learning methods,
such as scalers and other preprocessing steps.
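A condensed sketch of the banknote-authentication workflow these slides describe. It assumes the UCI file has been saved locally as "bill_authentication.csv" with columns Variance, Skewness, Curtosis, Entropy and Class; the file name and headers are assumptions.

import pandas as pd
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = pd.read_csv("bill_authentication.csv")
X = data.drop("Class", axis=1)          # the four image features
y = data["Class"]                       # 0 = fake, 1 = authentic

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,
                                                    random_state=0)

classifier = DecisionTreeClassifier(criterion="gini")
classifier.fit(X_train, y_train)        # the training ("fit") step
y_pred = classifier.predict(X_test)     # classify the held-out 20%

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))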
Characterizing the classifier
Output

Confusion matrix

Classification Report

Accuracy Score
Visualizing the tree
Tree
Resources

• https://stackabuse.com/
• http://people.sc.fsu.edu
• https://www.geeksforgeeks.org
• http://scikit-learn.org/
• https://machinelearningmastery.com
Thank you
This presentation is created using LibreOffice Impress 5.1.6.2, can be used freely as per GNU General Public License

/mITuSkillologies @mitu_group /company/mitu- MITUSkillologies


skillologies

Web Resources
https://mitu.co.in
http://tusharkute.com

[email protected]
[email protected]
Naive Bayes Classifier using Python

Tushar B. Kute,
http://tusharkute.com
Naive Bayes Classifier

• Naive Bayes classifiers are a collection of


classification algorithms based on Bayes’
Theorem.
• It is not a single algorithm but a family of
algorithms where all of them share a common
principle, i.e. every pair of features being
classified is independent of each other.
Bayes Theorem

Example Reference: Super Data Science


Bayes Theorem

Defective Spanners
Bayes Theorem
Bayes Theorem
Bayes Theorem
Bayes Theorem
That’s intuitive
Exercise
Example:
Step-1
Step-1
Step-1
Step-2
Step-3
Naive Bayes – Step-1
Naive Bayes – Step-2
Naive Bayes – Step-3
Combining altogether
Naive Bayes – Step-4
Naive Bayes – Step-5
Types of model
Final Classification
Probability Distribution
Advantages

• When assumption of independent predictors


holds true, a Naive Bayes classifier performs
better as compared to other models.
• Naive Bayes requires a small amount of
training data to estimate the test data. So, the
training period is less.
• Naive Bayes is also easy to implement.
Disadvantages

• The main limitation of Naive Bayes is the assumption of


independent predictors. Naive Bayes implicitly assumes
that all the attributes are mutually independent. In real life,
it is almost impossible that we get a set of predictors which
are completely independent.
• If categorical variable has a category in test data set, which
was not observed in training data set, then model will
assign a 0 (zero) probability and will be unable to make a
prediction. This is often known as Zero Frequency. To solve
this, we can use the smoothing technique. One of the
simplest smoothing techniques is called Laplace estimation.
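A brief scikit-learn sketch: GaussianNB for numeric features, and MultinomialNB whose alpha parameter applies Laplace (add-one) smoothing to avoid the zero-frequency problem described above. The iris data is used purely as a placeholder.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

gnb = GaussianNB()
gnb.fit(X_train, y_train)
print("GaussianNB accuracy:", gnb.score(X_test, y_test))

# alpha=1.0 is Laplace smoothing: it prevents a zero probability for feature
# values never seen with a given class in the training data.
mnb = MultinomialNB(alpha=1.0)
mnb.fit(X_train, y_train)               # iris features are non-negative, so this runs
print("MultinomialNB accuracy:", mnb.score(X_test, y_test))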
Useful resources

• www.datacamp.com
• www.scikit-learn.org
• www.towardsdatascience.com
• www.medium.com
• www.analyticsvidhya.com
• www.kaggle.com
• www.stephacking.com
• www.github.com
Thank you
This presentation is created using LibreOffice Impress 5.1.6.2, can be used freely as per GNU General Public License

/mITuSkillologies @mitu_group /company/mitu- MITUSkillologies


skillologies

Web Resources
https://mitu.co.in
http://tusharkute.com

[email protected]
[email protected]
Classification

Tushar B. Kute,
http://tusharkute.com
What is Classification?

• Classification is a process of categorizing a given set


of data into classes, It can be performed on both
structured or unstructured data.
• The process starts with predicting the class of given
data points. The classes are often referred to as
target, label or categories.
• The classification predictive modeling is the task of
approximating the mapping function from input
variables to discrete output variables.
• The main goal is to identify which class/category the
new data will fall into.
Example:
Example:

• Heart disease detection can be identified as a


classification problem, this is a binary classification
since there can be only two classes i.e has heart
disease or does not have heart disease.
• The classifier, in this case, needs training data to
understand how the given input variables are related
to the class. And once the classifier is trained
accurately, it can be used to detect whether heart
disease is there or not for a particular patient.
• Since classification is a type of supervised learning,
even the targets are also provided with the input data.
Basic Terminologies used

• Classifier – It is an algorithm that is used to map


the input data to a specific category.
• Classification Model – The model predicts or
draws a conclusion to the input data given for
training, it will predict the class or category for
the data.
• Feature – A feature is an individual measurable
property of the phenomenon being observed.
• Label- Output variable
Types of Learners

• Lazy Learners –
– Lazy learners simply store the training data
and wait until a testing data appears.
– The classification is done using the most
related data in the stored training data.
– They have more predicting time compared to
eager learners. Eg – k-nearest neighbor, case-
based reasoning.
Types of Learners

• Eager Learners –
– Eager learners construct a classification
model based on the given training data
before getting data for predictions.
– It must be able to commit to a single
hypothesis that will work for the entire space.
– Due to this, they take a lot of time in training
and less time for a prediction. Eg – Decision
Tree, Naive Bayes, Artificial Neural Networks.
Types of Classification

• Binary Classification
• Multi-Class Classification
• Multi-Label Classification
• Imbalanced Classification
Types of Classification

• Linear Models
– Logistic Regression
– Support Vector Machines
• Nonlinear models
– K-nearest Neighbors (KNN)
– Kernel Support Vector Machines (SVM)
– Naïve Bayes
– Decision Tree Classification
– Random Forest Classification
Binary Classification

• Binary classification refers to those


classification tasks that have two class labels.
• Examples include:
– Email spam detection (spam or not)
– Churn prediction (churn or not).
– Conversion prediction (buy or not).
• Typically, binary classification tasks involve one
class that is the normal state and another class
that is the abnormal state.
Binary Classification – Example

• For example “not spam” is the normal state and


“spam” is the abnormal state. Another example is
“cancer not detected” is the normal state of a task
that involves a medical test and “cancer detected” is
the abnormal state.
• The class for the normal state is assigned the class
label 0 and the class with the abnormal state is
assigned the class label 1.
• It is common to model a binary classification task
with a model that predicts a Bernoulli probability
distribution for each example.
Binary Classification – Algorithms

• Popular algorithms that can be used for binary


classification include:
– Logistic Regression
– k-Nearest Neighbors
– Decision Trees
– Support Vector Machine
– Naive Bayes
Evaluation of Binary Classifier

• There are many metrics that can be used to measure the


performance of a classifier or predictor; different fields
have different preferences for specific metrics due to
different goals.
• In medicine sensitivity and specificity are often used,
while in information retrieval precision and recall are
preferred.
• An important distinction is between metrics that are
independent of how often each category occurs in the
population (the prevalence), and metrics that depend on
the prevalence – both types are useful, but they have
very different properties.
Evaluation of Binary Classifier

• Given a classification of a specific data set, there


are four basic combinations of actual data category
and assigned category: true positives TP (correct
positive assignments), true negatives TN (correct
negative assignments), false positives FP (incorrect
positive assignments), and false negatives FN
(incorrect negative assignments).
Confusion Matrix

• In the field of machine learning and specifically the


problem of statistical classification, a confusion matrix,
also known as an error matrix, is a specific table layout that
allows visualization of the performance of an algorithm,
typically a supervised learning one (in unsupervised
learning it is usually called a matching matrix).
• Each row of the matrix represents the instances in a
predicted class, while each column represents the
instances in an actual class (or vice versa).
• The name stems from the fact that it makes it easy to see
whether the system is confusing two classes (i.e. commonly
mislabeling one as another).
Confusion Matrix

• Given a sample of 13 pictures, 8 of cats and 5 of dogs,


where cats belong to class 1 and dogs belong to class
0,
– actual = [1,1,1,1,1,1,1,1,0,0,0,0,0],
• assume that a classifier that distinguishes between
cats and dogs is trained, and we take the 13 pictures
and run them through the classifier, and the classifier
makes 8 accurate predictions and misses 5: 3 cats
wrongly predicted as dogs (first 3 predictions) and 2
dogs wrongly predicted as cats (last 2 predictions).
– prediction = [0,0,0,1,1,1,1,1,0,0,0,1,1]
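The cats-and-dogs example above, run through scikit-learn's metrics functions.

from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

actual     = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
prediction = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1]

# Rows are actual classes, columns are predicted classes; with labels ordered
# [0, 1] the layout is [[TN, FP], [FN, TP]].
print(confusion_matrix(actual, prediction))
print("accuracy:", accuracy_score(actual, prediction))   # 8 correct out of 13
print("F1 score:", f1_score(actual, prediction))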
Confusion Matrix
F1 Score / Harmonic Mean
Multi Class Classification

• Multi-class classification refers to those classification


tasks that have more than two class labels.
• Examples include:
– Face classification.
– Plant species classification.
– Optical character recognition.
• Unlike binary classification, multi-class classification
does not have the notion of normal and abnormal
outcomes. Instead, examples are classified as
belonging to one among a range of known classes.
Multi Class Classification

• The number of class labels may be very large on some


problems. For example, a model may predict a photo as
belonging to one among thousands or tens of thousands
of faces in a face recognition system.
• Problems that involve predicting a sequence of words,
such as text translation models, may also be considered a
special type of multi-class classification.
• Each word in the sequence of words to be predicted
involves a multi-class classification where the size of the
vocabulary defines the number of possible classes that
may be predicted and could be tens or hundreds of
thousands of words in size.
Multi Class Classification - Examples

• Many algorithms used for binary classification


can be used for multi-class classification.
• Popular algorithms that can be used for multi-
class classification include:
– k-Nearest Neighbors.
– Decision Trees.
– Naive Bayes.
– Random Forest.
– Gradient Boosting.
Multi Class Classification

• This involves using a strategy of fitting multiple binary


classification models for each class vs. all other classes
(called one-vs-rest) or one model for each pair of classes
(called one-vs-one).
– One-vs-Rest: Fit one binary classification model for each
class vs. all other classes.
– One-vs-One: Fit one binary classification model for each
pair of classes.
• Binary classification algorithms that can use these
strategies for multi-class classification include:
– Logistic Regression.
– Support Vector Machine.
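A compact sketch of both strategies using scikit-learn's wrappers around a binary classifier (logistic regression); the three-class iris data stands in for a real problem.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print("one-vs-rest models fitted:", len(ovr.estimators_))   # one per class -> 3
print("one-vs-one models fitted:", len(ovo.estimators_))    # one per pair of classes -> 3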
Multi-Label Classification?

• Multi-label classification refers to those classification


tasks that have two or more class labels, where one or
more class labels may be predicted for each example.
• Consider the example of photo classification, where a
given photo may have multiple objects in the scene
and a model may predict the presence of multiple
known objects in the photo, such as “bicycle,” “apple,”
“person,” etc.
• This is unlike binary classification and multi-class
classification, where a single class label is predicted
for each example.
Imbalanced Classification

• Imbalanced classification refers to classification tasks


where the number of examples in each class is
unequally distributed.
• Typically, imbalanced classification tasks are binary
classification tasks where the majority of examples in
the training dataset belong to the normal class and a
minority of examples belong to the abnormal class.
• Examples include:
– Fraud detection.
– Outlier detection.
– Medical diagnostic tests.
Imbalanced Classification

• These problems are modeled as binary


classification tasks, although may require
specialized techniques.
• Specialized techniques may be used to change
the composition of samples in the training
dataset by undersampling the majority class or
oversampling the minority class.
• Examples include:
– Random Undersampling.
– SMOTE Oversampling.
Imbalanced Classification

• Specialized modeling algorithms may be used


that pay more attention to the minority class
when fitting the model on the training dataset,
such as cost-sensitive machine learning
algorithms.
• Examples include:
– Cost-sensitive Logistic Regression.
– Cost-sensitive Decision Trees.
– Cost-sensitive Support Vector Machines.
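A minimal sketch of cost-sensitive learning in scikit-learn: the class_weight argument makes mistakes on the minority class more expensive during fitting. The imbalanced dataset here is synthetic.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# "balanced" re-weights classes inversely to their frequency; an explicit dict
# such as {0: 1, 1: 10} can encode custom misclassification costs instead.
models = [
    LogisticRegression(class_weight="balanced", max_iter=1000),
    DecisionTreeClassifier(class_weight="balanced"),
    SVC(class_weight="balanced"),
]
for model in models:
    model.fit(X, y)
    print(type(model).__name__, "fitted on the imbalanced data")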
Useful resources

• https://missinglink.ai
• https://machinelearningmastery.com
• https://www.allaboutcircuits.com
• https://medium.com
Thank you
This presentation is created using LibreOffice Impress 5.1.6.2, can be used freely as per GNU General Public License

/mITuSkillologies @mitu_group /company/mitu- MITUSkillologies


skillologies

Web Resources
https://mitu.co.in
http://tusharkute.com

[email protected]
[email protected]
Linear Regression

Tushar B. Kute,
http://tusharkute.com
Linear Regression

• Linear regression is used for finding linear


relationship between target and one or more
predictors.
• There are two types of linear regression-
– Simple and
– Multiple.
Simple Linear Regression

• Simple linear regression is useful for finding relationship


between two continuous variables. One is predictor or
independent variable and other is response or dependent
variable.
• It looks for statistical relationship but not deterministic
relationship. Relationship between two variables is said to be
deterministic if one variable can be accurately expressed by the
other.
• For example, using temperature in degree Celsius it is possible
to accurately predict Fahrenheit.
• Statistical relationship is not accurate in determining
relationship between two variables. For example, relationship
between height and weight.
Core Idea

• The core idea is to obtain a line that best fits


the data.
• The best fit line is the one for which total
prediction error (all data points) are as small as
possible.
• Error is the distance between the point to the
regression line.
Linear Regression
Real Life Example

• We have a dataset which contains information about


relationship between ‘number of hours studied’ and
‘marks obtained’.
• Many students have been observed and their hours
of study and grade are recorded. This will be our
training data.
• Goal is to design a model that can predict marks if
given the number of hours studied. Using the training
data, a regression line is obtained which will give
minimum error.
• This linear equation is then used for any new data.
Real Life Example

• if we give number of hours studied by a student as an input, our


model should predict their mark with minimum error.

Y(pred) = b0 + b1*x

• The values b0 and b1 must be chosen so that they minimize the


error. If sum of squared error is taken as a metric to evaluate the
model, then goal to obtain a line that best reduces the error.

• If we don’t square the error, then positive and negative point will
cancel out each other.
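A small sketch of the hours-studied example with scikit-learn; the numbers in the toy dataset are made up for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])   # X must be 2-D
marks = np.array([35, 42, 50, 55, 61, 68, 74, 82])           # y

model = LinearRegression()
model.fit(hours, marks)

print("b0 (intercept):", model.intercept_)
print("b1 (slope):", model.coef_[0])
print("predicted marks for 6.5 hours:", model.predict([[6.5]])[0])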
Many names of Linear Regression

• The reason is because linear regression has been around for


so long (more than 200 years). It has been studied from every
possible angle and often each angle has a new and different
name.
• Linear regression is a linear model, e.g. a model that assumes
a linear relationship between the input variables (x) and the
single output variable (y). More specifically, that y can be
calculated from a linear combination of the input variables (x).
• When there is a single input variable (x), the method is
referred to as simple linear regression. When there are
multiple input variables, literature from statistics often refers
to the method as multiple linear regression.
Applications

• Trend lines: A trend line represents the variation in some


quantitative data with passage of time (like GDP, oil prices,
etc.). These trends usually follow a linear relationship. Hence,
linear regression can be applied to predict future values.
• Economics: To predict consumption spending, fixed
investment spending, inventory investment, purchases of a
country’s exports, spending on imports, the demand to hold
liquid assets, labor demand, and labor supply.
• Finance: The capital asset pricing model uses linear regression to
analyze and quantify the systematic risks of an investment.
• Biology: Linear regression is used to model causal
relationships between parameters in biological systems.
Ordinary Least Square

• When we have more than one input we can use


Ordinary Least Squares to estimate the values of the
coefficients.
• The Ordinary Least Squares procedure seeks to
minimize the sum of the squared residuals.
• This means that given a regression line through the
data we calculate the distance from each data point
to the regression line, square it, and sum all of the
squared errors together.
• This is the quantity that ordinary least squares seeks
to minimize.
Ordinary Least Square

• This approach treats the data as a matrix and uses


linear algebra operations to estimate the optimal
values for the coefficients.
• It means that all of the data must be available and you
must have enough memory to fit the data and perform
matrix operations.
• It is unusual to implement the Ordinary Least Squares
procedure yourself unless as an exercise in linear
algebra.
• It is more likely that you will call a procedure in a linear
algebra library. This procedure is very fast to calculate.
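A sketch of the least-squares estimate computed directly with linear algebra (NumPy's lstsq); the hours/marks numbers are the same made-up toy data as before.

import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
marks = np.array([35, 42, 50, 55, 61, 68, 74, 82], dtype=float)

# Design matrix with a column of ones so the intercept b0 is estimated too.
A = np.column_stack([np.ones_like(hours), hours])
coeffs, residuals, rank, _ = np.linalg.lstsq(A, marks, rcond=None)
b0, b1 = coeffs
print("b0:", b0, "b1:", b1, "sum of squared residuals:", residuals)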
Ordinary Least Square
Gradient Descent

• When there are one or more inputs you can use a process
of optimizing the values of the coefficients by iteratively
minimizing the error of the model on your training data.
• This operation is called Gradient Descent and works by
starting with random values for each coefficient.
• The sum of the squared errors are calculated for each pair
of input and output values.
• A learning rate is used as a scale factor and the coefficients
are updated in the direction towards minimizing the error.
• The process is repeated until a minimum sum squared error
is achieved or no further improvement is possible.
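A bare-bones gradient-descent sketch for simple linear regression: start from random coefficients and repeatedly step opposite to the gradient of the squared error. The learning rate and iteration count are arbitrary choices for this synthetic data.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 3.0 + 2.0 * x + rng.normal(0, 1, 50)    # synthetic data: y is roughly 3 + 2x

b0, b1 = rng.normal(size=2)                 # random starting coefficients
learning_rate = 0.01

for _ in range(2000):
    error = (b0 + b1 * x) - y               # prediction error for every sample
    b0 -= learning_rate * error.mean()              # gradient step for b0
    b1 -= learning_rate * (error * x).mean()        # gradient step for b1

print("estimated b0:", round(b0, 2), "estimated b1:", round(b1, 2))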
Regularization

• There are extensions of the training of the linear model called


regularization methods. These seek to both minimize the sum of the
squared error of the model on the training data (using ordinary least
squares) but also to reduce the complexity of the model (like the
number or absolute size of the sum of all coefficients in the model).
• Two popular examples of regularization procedures for linear
regression are:
• Lasso Regression:
– where Ordinary Least Squares is modified to also minimize the
absolute sum of the coefficients (called L1 regularization).
• Ridge Regression:
– where Ordinary Least Squares is modified to also minimize the
squared absolute sum of the coefficients (called L2 regularization).
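A quick sketch of the two regularized variants in scikit-learn; alpha controls the strength of the penalty and its value here is arbitrary.

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)      # L1 penalty: can drive coefficients to 0
ridge = Ridge(alpha=1.0).fit(X, y)      # L2 penalty: shrinks coefficients

print("OLS coefficients:  ", ols.coef_.round(1))
print("Lasso coefficients:", lasso.coef_.round(1))
print("Ridge coefficients:", ridge.coef_.round(1))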
Example:

• Go practical...
Thank you
This presentation is created using LibreOffice Impress 5.1.6.2, can be used freely as per GNU General Public License

/mITuSkillologies @mitu_group /company/mitu- MITUSkillologies


skillologies

Web Resources
https://mitu.co.in
http://tusharkute.com

[email protected]
[email protected]
Association Rule Mining

Tushar B. Kute,
http://tusharkute.com
The Association Rules

• There are many ways to see the similarities between items.


• These are techniques that fall under the general umbrella
of association.
• The outcome of this type of technique, in simple terms, is a
set of rules that can be understood as “if this, then that”.
• Association Rule Mining is a Data Mining technique that
finds patterns in data.
• The patterns found by Association Rule Mining represent
relationships between items. When this is used with sales
data, it is referred to as Market Basket Analysis.
Market Basket Analysis
Market Basket Analysis
Applications

• So what kind of items are we talking about? There are many


applications of association:
– Product recommendation – like Amazon’s “customers
who bought that, also bought this”
– Music recommendations – like Last FM’s artist
recommendations
– Medical diagnosis – like with diabetes really cool stuff
– Content optimization – like in magazine websites or
blogs
• Here, we will focus on the retail application – it is simple,
intuitive, and the dataset comes packaged with R making it
repeatable.
Example
Why Association Rules?

• It helps businesses build sales strategies.


– Ultimately, the main objective of any
business is to become profitable. This means,
attracting more customers and improving
their sales.
– By identifying products that sell better
together, they can build better strategies.
For instance, knowing that people who buy
fries almost always buy Coke can be
exploited to drive up sales.
Why Association Rules?

• It helps businesses build marketing strategies.


– Attracting customers is a very important part of
any business. Knowledge of what products sell
together and which products don’t is key in
building marketing strategies.
– This includes the planning of sales and
advertisements as well as targeted marketing.
For example, the knowledge that some
ornaments do not sell as well as others during
Christmas may help the manager offer a sale on
the non-frequent ornaments.
Why Association Rules?

• It helps shelf-life planning.


– Knowledge of association rules can enable store
managers to plan their inventory as well as ensure
that they don’t lose out by overstocking low-selling
perishables.
– For instance, if olives don’t sell very often, the
manager will not stock up on it. But he still wants to
ensure that the existing stock sells before the
expiration date. With the knowledge that people who
buy pizza dough tend to buy olives, the olives can be
offered at a lower price in combination with the pizza
dough.
Why Association Rules?

• It helps the in-store organization.


– Products which are known to drive the sales
of other products can be moved closer
together in the store.
– For instance, if the sale of butter is driven by
the sale of bread, they can be moved to the
same aisle in the store.
The conceptualization

• We already discussed the concept of Items and Item


Sets.
• We can represent our items as an item set as
follows:
i = { i1,i2,…,in }
• Therefore a transaction is represented as follows:
tn = { ij, ik,…,in }
• This gives us our rules which are represented as
follows:
{ i1, i2} => { ik }
The associations

• Which can be read as “if a user buys an item in


the item set on the left hand side, then the user
will likely buy the item on the right hand side
too”. A more human readable example is:
{coffee,sugar} => {milk}
• If a customer buys coffee and sugar, then they
are also likely to buy milk.
• With this we can understand three important
ratios; the support, confidence and lift.
Steps

• Step 1: Find all frequent itemsets.


• Step 2: Generate strong association rules from
the frequent itemsets.
Find all frequent itemsets.

• An itemset is a set of items that occurs in a shopping


basket.
• A set of items in a shopping basket can be referred to
as an itemset. It can consist of any number of
products. For example, [bread, butter, eggs] is an
itemset from a supermarket database.
• A frequent itemset is one that occurs frequently in a
database. This begs the question of how frequency is
defined. This is where support count comes in.
• The support count of an item is defined as the
frequency of the item in the dataset.
Support

• This says how popular an itemset is, as measured by the


proportion of transactions in which an itemset appears.
• In Table 1 below, the support of {apple} is 4 out of 8, or
50%. Itemsets can also contain multiple items. For
instance, the support of {apple, beer, rice} is 2 out of 8,
or 25%.
Defining Support

• Defining support as percentage helps us set a threshold


for frequency called min_support. If we set support at
50%, this means that we define a frequent itemset as one
that occurs at least 50 times in 100 transactions. For
instance, for the above dataset, we set threshold_support
at 60%.

• We always eliminate those items whose support is less


than min_support as is seen from the greyed-out parts of
the table above. The generation of frequent itemsets
depends on the algorithm used.
Generate Rules

• Generate strong association rules from the


frequent itemsets.
• Association rules are generated by building
associations from frequent itemsets generated
in step 1.
• This uses a measure called confidence to find
strong associations.
The associations properties

• Support: The fraction of which our item set


occurs in our dataset.
• Confidence: probability that a rule is correct for
a new transaction with items on the left.
• Lift: The ratio by which by the confidence of a
rule exceeds the expected confidence.
• Note: if the lift is 1 it indicates that the items on
the left and right are independent.
Confidence

• This says how likely item Y is purchased when item


X is purchased, expressed as {X -> Y}. This is
measured by the proportion of transactions with
item X, in which item Y also appears.
• In Table 1, the confidence of {apple -> beer} is 3 out
of 4, or 75%.
Lift

• This says how likely item Y is purchased when item X is


purchased, while controlling for how popular item Y is.
• In Table 1, the lift of {apple -> beer} is 1, which implies no
association between items.
• A lift value greater than 1 means that item Y is likely to
be bought if item X is bought, while a value less than 1
means that item Y is unlikely to be bought if item X is
bought.
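A tiny worked example of the three measures in plain Python; the five transactions below are hypothetical and are not the Table 1 referred to in the slides.

transactions = [
    {"apple", "beer", "rice"},
    {"apple", "beer"},
    {"apple", "rice"},
    {"beer", "rice"},
    {"apple", "beer", "milk"},
]
n = len(transactions)

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / n

def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)

def lift(lhs, rhs):
    return confidence(lhs, rhs) / support(rhs)

print("support({apple}):", support({"apple"}))
print("confidence({apple} -> {beer}):", confidence({"apple"}, {"beer"}))
print("lift({apple} -> {beer}):", lift({"apple"}, {"beer"}))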
Types of algorithm

• Apriori Algorithm
• Eclat Algorithm
• F-P Growth Algorithm
Apriori Algorithm

• This algorithm uses frequent datasets to


generate association rules. It is designed to work
on the databases that contain transactions.
• This algorithm uses a breadth-first search and
Hash Tree to calculate the itemset efficiently.
• It is mainly used for market basket analysis and
helps to understand the products that can be
bought together.
• It can also be used in the healthcare field to find
drug reactions for patients.
Eclat algorithm

• Eclat algorithm stands for Equivalence Class


Transformation.
• This algorithm uses a depth-first search
technique to find frequent itemsets in a
transaction database.
• It performs faster execution than Apriori
Algorithm.
F-P Growth algorithm

• The F-P growth algorithm stands for Frequent


Pattern, and it is the improved version of the
Apriori Algorithm.
• It represents the database in the form of a tree
structure that is known as a frequent pattern tree
(FP-tree).
• The purpose of this frequent tree is to extract
the most frequent patterns.
Apriori Algorithm

• The Apriori algorithm is considered one of the most basic


Association Rule Mining algorithms. It works on the principle
that “ Having prior knowledge of frequent itemsets can
generate strong association rules. ” The word Apriori means
prior knowledge.
• Apriori finds the frequent itemsets by a process called
candidate itemset generation. This is an iterative approach,
where k-itemsets are used to explore (k+1)-itemsets. First, the
set of frequent 1-itemsets is found, then, frequent 2-itemsets,
and so on, until no more frequent k-itemsets can be found.
• A Candidate k-itemset is an itemset with k items in it.
Example: Candidate 2-itemset can be [bread, butter].
Apriori Algorithm

• To improve the efficiency of the level-wise


generation of frequent itemsets, an important
property called the Apriori property, is used to
reduce the search space.
• The Apriori Property states that “All non-empty
subsets of a frequent itemset must also be
frequent.”
• This means that if there is a frequent item then, its
subsets will also be frequent. For instance, if [Bread,
Butter] is a frequent itemset, it means that [Bread]
and [Butter] must individually be frequent too.
Example:
Apriori Algorithm Steps

• Step 1: Set a minimum support and confidence


threshold.
• Step 2: Generate candidate itemsets.
• Step 3: Mine Association Rules
Apriori Algorithm Steps

• Step 1: Set a minimum support and confidence


threshold.
• We set the threshold as 50% implying that we
define an itemset as frequent if it occurs at
least once in 2 transactions. Confidence is
introduced before.
Apriori Algorithm Steps

• Step 2: Generate candidate


itemsets.
• The candidate 1-itemsets
consist of all individual products
and their support counts
respectively. For instance, [A]
occurs in 3 out of 4 transactions.
• The greyed out rows represent
itemsets whose support counts
do not meet the threshold
requirement.
• L1: [A], [B], [C]
Apriori Algorithm Steps

• The candidate 2-itemsets


consists of all possible 2 item set
combinations of L1 and their
respective support counts. For
instance, [A, C] occur together in
2 out of 4 transactions.
L2: [A,C]
• Candidate 3-itemsets are to be
generated from L2, containing all
3-item combinations. However,
at L2 we are left with just 2 items
and we cannot generate
candidate 3-itemsets.
Apriori Algorithm Steps

• To mine Association Rules from candidate itemsets, a


measure called confidence is used. It is simply defined
as an association rule between items.
• Consider an itemset [ Bread, Butter ]. Two possible
scenarios can be considered here:
• 1.People who buy bread, also buy butter. The sale of
butter is driven by that of bread. This makes sense
because any dish involving bread often involves butter.
• 2. People who buy butter, also buy bread. The sale of
bread is driven by butter. This does not make sense as
butter can be used for anything and not just with bread.
Apriori Algorithm Steps

• The confidence measure helps identify which


product drives the sale of which other product.
• For any two products, A drives B represented as
{A ⇒ B} is not the same as B drives A, {B ⇒ A}. If
the confidence of an association rule {A⇒B} is
60%, it means that 60% of the transactions
containing A also contain B together.
Apriori Algorithm Steps

• Consider the candidate itemset output at L2: [A, C]. This is the frequent itemset with support of 50%. Two rules can be formed from it: {A ⇒ C} and {C ⇒ A} (a worked sketch follows below).

• Since both rules have confidence greater than


50%, both are accepted. However, {C ⇒ A} occurs
with confidence 100% implying that on most
occasions, C drives the sale of A.
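• A minimal sketch of these support and confidence calculations in Python, using a small hypothetical set of four transactions chosen to be consistent with the counts above (A in 3 of 4 transactions, C in 2, and [A, C] together in 2); it is an illustration, not the dataset used on the slides.

transactions = [
    {"A", "C"},
    {"A", "C"},
    {"A", "B"},
    {"B"},
]

def support(itemset):
    # Fraction of transactions that contain every item in `itemset`.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Confidence of the rule antecedent => consequent.
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(confidence({"A"}, {"C"}))   # ~0.67: 67% of transactions with A also contain C
print(confidence({"C"}, {"A"}))   # 1.0: every transaction with C also contains A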
Drawbacks

• Apriori suffers from two main drawbacks which


restrict its usage in real-world use-cases:
• 1. It is computationally intensive.
– Apriori requires repeated scans of the
database for itemset generation. This is very
resource-intensive and time-consuming.
• 2. It can mine misleading patterns.
– Apriori and other Association Rule Mining
algorithms are known to produce rules that are
a product of chance.
Practical: Packages Needed

• The Apriori algorithm and Association Rules:


– sudo pip install mlxtend
• The basic data analytics:
– sudo pip install pandas
• The visualizations:
– sudo pip install matplotlib
• Importing csv file formats (already installed)
– csv
Structured Transactions

retails.csv
Non-structured transactions

groceries.csv
Reading the csv file (both)
Transaction encoding
The Transaction Encoder

• Encodes database transaction data in form of a


Python list of lists into a NumPy array.
• Using a TransactionEncoder object, we can transform this dataset into an array format suitable for typical machine learning APIs.
• Via the fit method, the TransactionEncoder
learns the unique labels in the dataset, and via
the transform method, it transforms the input
dataset (a Python list of lists) into a one-hot
encoded NumPy boolean array:
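• A minimal sketch of this workflow with the documented mlxtend functions; the transactions list here is made up, and in practice the retails.csv / groceries.csv files mentioned earlier would be loaded with pandas or the csv module instead.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical transactions (list of lists).
dataset = [['milk', 'bread', 'butter'],
           ['bread', 'butter'],
           ['milk', 'bread'],
           ['milk', 'eggs']]

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)        # one-hot encoded boolean array
df = pd.DataFrame(te_ary, columns=te.columns_)

frequent_itemsets = apriori(df, min_support=0.5, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
print(rules[['antecedents', 'consequents', 'support', 'confidence']])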
Example:
Transaction Encoder
Getting the association rules
Output:
Conditional Rules
Operations on Rules
Applications

• Cross Selling
• Product Placement
• Affinity Promotion
• Fraud Detection
• Customer Behavior
Useful resources

• https://rasbt.github.io
• https://www.kdnuggets.com
• http://intelligentonlinetools.com
• http://pbpython.com
• www.towardsdatascience.com
• www.analyticsvidhya.com
• www.kaggle.com
• www.github.com
Thank you
This presentation is created using LibreOffice Impress 5.1.6.2, can be used freely as per GNU General Public License

/mITuSkillologies    @mitu_group    /company/mitu-skillologies    c/MITUSkillologies

Web Resources
http://mitu.co.in
http://tusharkute.com

[email protected]
[email protected]
Clustering Techniques

Tushar B. Kute,
http://tusharkute.com
Unsupervised learning flow
Clustering

• Clustering is an unsupervised learning problem.


• Key objective is to identify distinct groups
(called clusters) based on some notion of
similarity within a given dataset.
• Clustering analysis origins can be traced to the areas of Anthropology and Psychology in the 1930s.
• The most popularly used clustering techniques
are k-means (divisive) and hierarchical
(agglomerative).
K-means clustering

• The key objective of a k-means algorithm is to organize data into


clusters such that there is high intra-cluster similarity and low
inter-cluster similarity. An item will only belong to one cluster, not
several, that is, it generates a specific number of disjoint, non-
hierarchical clusters.
• K-means uses the strategy of divide and conquer, and it is a classic example of an expectation maximization (EM) algorithm. EM algorithms are made up of two steps:
– The first step is known as expectation (E) and is used to find the expected point associated with a cluster; and
– The second step is known as maximization (M) and is used to improve the estimation of the cluster using knowledge from the first step.
• The two steps are processed repeatedly until convergence is
reached.
K-means clustering
Generalized algorithm

• The algorithm works as follows, assuming we have inputs x1, x2, x3, …, xn and a value of K (a minimal code sketch follows below):
– Step 1 - Pick K random points as cluster centers called centroids.
– Step 2 - Assign each xi to the nearest cluster by calculating its distance to each centroid.
– Step 3 - Find the new cluster center by taking the average of the assigned points.
– Step 4 - Repeat Steps 2 and 3 until none of the cluster assignments change.
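• A minimal NumPy sketch of these four steps, run on made-up blob data; it is illustrative only and does not handle corner cases such as empty clusters.

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to the nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids (and hence assignments) no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])  # two blobs
labels, centroids = kmeans(X, k=2)
print(centroids)   # roughly (0, 0) and (5, 5)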
Algorithm work flow

• Step 1: In the first step, k centroids (in the above case k=3) are randomly picked (only in the first iteration) and all the points that are nearest to each centroid point are assigned to that specific cluster. The centroid is the arithmetic mean or average position of all the points.
• Step 2: Here the centroid point is recalculated using the average of the coordinates of all the points in that cluster. Then step one is repeated (assign nearest points) until the clusters converge.
Algorithm work flow
Limitations

• K-means clustering needs the number of


clusters to be specified.
• K-means has problems when clusters are of differing sizes, densities, and non-globular shapes.
• The presence of outliers can skew the results.
Use cases

• K-Means is widely used for many


applications.
– Image Segmentation
– Clustering Gene Segmentation Data
– News Article Clustering
– Clustering Languages
– Species Clustering
– Anomaly Detection
Example:

• Let’s assume a dataset of mall customers. It contains information about clients that subscribe to a membership card: their purchase history, a spending score that depends on income, the number of times per week they show up in the mall, and their total expense in the same mall.
Read the dataset
How many clusters?

• There are two commonly used methods


to determine the ideal number of clusters
possible in K-means –
– Elbow Method
– Silhouette Method
How many clusters?

• Elbow Method:
– First of all, compute the sum of squared error (SSE) for some values of k (for example 2, 4, 6, 8, etc.). The SSE is defined as the sum of the squared distance between each member of the cluster and its centroid. Mathematically:

  SSE = Σₖ Σᵢ∈Cₖ ‖xᵢ − μₖ‖², where μₖ is the centroid of cluster Cₖ

– If you plot k against the SSE, you will see that the error decreases as k gets larger; this is because when the number of clusters increases, they should be smaller, so distortion is also smaller. The idea of the elbow method is to choose the k at which the SSE decreases abruptly. This produces an "elbow effect" in the graph (a code sketch follows below).
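• A short sketch of this procedure with scikit-learn, using synthetic blob data (any feature matrix X could be used instead); a fitted KMeans model exposes its SSE through the inertia_ attribute.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)   # made-up data

sse = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sse.append(km.inertia_)            # SSE for this value of k

plt.plot(k_values, sse, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('SSE')
plt.title('Elbow method')
plt.show()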
Find no. of clusters

In this case, k=6 is the value that the Elbow method has
selected.
Applying elbow method
The Kmeans() function

• KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1, algorithm='auto')
– n_clusters : int, optional, default: 8
• The number of clusters to form as well as the number
of centroids to generate.
– random_state : int, RandomState instance or None,
optional, default: None
• If int, random_state is the seed used by the random
number generator; If RandomState instance,
random_state is the random number generator;
The Kmeans() function attributes

• cluster_centers_ : array, [n_clusters, n_features]


– Coordinates of cluster centers
• labels_ :
– Labels of each point
• inertia_ : float
– Sum of squared distances of samples to their
closest cluster center.
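• A short usage sketch of these attributes on a tiny made-up array of points.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])   # hypothetical points

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)            # coordinates of the two cluster centers
print(km.labels_)                     # cluster index assigned to each point
print(km.inertia_)                    # sum of squared distances to closest centers
print(km.predict([[0, 0], [12, 3]]))  # assign new, unseen points to clusters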
Visualizing elbow

Got an elbow at 5
Silhouette Method

• Silhouette analysis is a way to measure how close each point in a cluster is to the points in its neighboring clusters. It is a way to find the optimum value of k during k-means clustering.
• Silhouette values lie in the range [-1, 1]. A value of +1 indicates that the sample is far away from its neighboring cluster and very close to the cluster it is assigned to.
• Similarly, a value of -1 indicates that the point is closer to its neighboring cluster than to the cluster it is assigned to.
• A value of 0 means it lies on the boundary between the two clusters. A value of +1 is ideal and -1 is least preferred. Hence, the higher the value, the better the cluster configuration.
Silhouette Method

• In the data, let's define a(i) to be the mean distance of point (i) w.r.t. all the other points in the cluster it is assigned to (A). We can interpret a(i) as how well the point is assigned to the cluster. The smaller the value, the better the assignment.
• Similarly, let's define b(i) to be the mean distance of point (i) w.r.t. the points of its closest neighboring cluster (B). The cluster (B) is the cluster to which point (i) is not assigned, but whose distance is the closest amongst all other clusters.
• Thus, the silhouette s(i) can be calculated as:

  s(i) = (b(i) − a(i)) / max(a(i), b(i))
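• A sketch of silhouette analysis with scikit-learn's silhouette_score, again on synthetic blob data; the k with the highest average silhouette is the preferred configuration.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)   # made-up data

for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, silhouette_score(X, labels))   # mean s(i) over all points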
Applying Silhouette Method
Find optimal value of k
Finding and visualizing
Cluster labels
Visualizing clusters
Thank you
This presentation is created using LibreOffice Impress 5.1.6.2, can be used freely as per GNU General Public License

/mITuSkillologies    @mitu_group    /company/mitu-skillologies

Web Resources
http://mitu.co.in
http://tusharkute.com

[email protected]
[email protected]
Probability

Tushar B. Kute,
http://tusharkute.com
What is Probability?

• Probability is a measure of the likelihood of a


random phenomenon or chance behavior.
• Probability describes the long-term proportion with
which a certain outcome will occur in situations
with short-term uncertainty.
• Example:
– Simulate flipping a coin 100 times. Plot the
proportion of heads against the number of flips.
Repeat the simulation.
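• A sketch of this simulation in Python, plotting the running proportion of heads; rerunning it gives a different short-term path but the same long-term behaviour.

import random
import matplotlib.pyplot as plt

random.seed(1)
flips = [random.choice(['H', 'T']) for _ in range(100)]

heads = 0
proportions = []
for i, flip in enumerate(flips, start=1):
    heads += (flip == 'H')
    proportions.append(heads / i)         # proportion of heads after i flips

plt.plot(range(1, 101), proportions)
plt.axhline(0.5, linestyle='--')          # the long-run probability of heads
plt.xlabel('Number of flips')
plt.ylabel('Proportion of heads')
plt.show()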
Probability

• Probability deals with experiments that yield


random short-term results or outcomes, yet
reveal long-term predictability.
• The long-term proportion with which a certain
outcome is observed is the probability of that
outcome.
Law of large numbers

• As the number of repetitions of a probability


experiment increases, the proportion with
which a certain outcome is observed gets closer
to the probability of the outcome.
Probability and event

• In probability, an experiment is any process that can


be repeated in which the results are uncertain.
• A simple event is any single outcome from a
probability experiment. Each simple event is denoted
ei.

• The sample space, S, of a probability


experiment is the collection of all possible
simple events. In other words, the sample space
is a list of all possible outcomes of a probability
experiment.
The event

• An event is any collection of outcomes


from a probability experiment.
• An event may consist of one or more
simple events.
• Events are denoted using capital letters
such as E.
Example:

• Consider the probability experiment of


having two children.
• (a) Identify the simple events of the
probability experiment.
• (b) Determine the sample space.
• (c) Define the event E = “have one boy”.
Denoting probability

• The probability of an event, denoted


P(E), is the likelihood of that event
occurring.
Properties of probabilities

• The probability of any event E, P(E), must be between 0


and 1 inclusive. That is,

0 ≤ P(E) ≤ 1.

• If an event is impossible, the probability of the event is 0.


• If an event is a certainty, the probability of the event is 1.
• If S = {e1, e2, …, en}, then

P(e1) + P(e2) + … + P(en) = 1.


Unusual Event

• An unusual event is an event that has a


low probability of occurring.
Method of probability

• Three methods for determining the


probability of an event:
(1) the classical method
(2) the empirical method
(3) the subjective method
Dependence and Independence

• Roughly speaking, we say that two events E and F are


dependent if knowing something about whether E
happens gives us information about whether F happens
(and vice versa). Otherwise they are independent.
• For instance, if we flip a fair coin twice, knowing whether
the first flip is Heads gives us no information about
whether the second flip is Heads. These events are
independent. On the other hand, knowing whether the
first flip is Heads certainly gives us information about
whether both flips are Tails. (If the first flip is Heads, then
definitely it’s not the case that both flips are Tails.) These
two events are dependent.
Dependence and Independence

• Mathematically, we say that two events E and F are independent if the probability that they both happen is the product of the probabilities that each one happens:

  P(E, F) = P(E) P(F)

• In the example above, the probability of “first flip


Heads” is 1/2, and the probability of “both flips Tails”
is 1/4, but the probability of “first flip Heads and
both flips Tails” is 0.
Conditional Probability

• When two events E and F are independent, then by definition we have:

  P(E, F) = P(E) P(F)

• If they are not necessarily independent (and if the probability of F is not zero), then we define the probability of E “conditional on F” as:

  P(E | F) = P(E, F) / P(F)
Conditional Probability

• You should think of this as the probability that E happens, given that we know that F happens.
• We often rewrite this as:

  P(E, F) = P(E | F) P(F)

• When E and F are independent, you can check that this gives:

  P(E | F) = P(E)
Example:

• One common tricky example involves a family


with two (unknown) children.
• If we assume that:
1. Each child is equally likely to be a boy or a girl
2. The gender of the second child is
independent of the gender of the first child
then the event “no girls” has probability 1/4, the
event “one girl, one boy” has probability 1/2, and
the event “two girls” has probability 1/4.
Example:

• Now we can ask what is the probability of the event “both children are girls” (B) conditional on the event “the older child is a girl” (G)? Using the definition of conditional probability:

  P(B | G) = P(B, G) / P(G) = P(B) / P(G) = (1/4) / (1/2) = 1/2

• since the event B and G (“both children are girls and the older child is a girl”) is just the event B. (Once you know that both children are girls, it’s necessarily true that the older child is a girl.)
Example:

• We could also ask about the probability of the event “both children are girls” conditional on the event “at least one of the children is a girl” (L). Surprisingly, the answer is different from before!
• As before, the event B and L (“both children are girls and at least one of the children is a girl”) is just the event B. This means we have:

  P(B | L) = P(B, L) / P(L) = P(B) / P(L) = (1/4) / (3/4) = 1/3

• How can this be the case? Well, if all you know is that at least one of the children is a girl, then it is twice as likely that the family has one boy and one girl than that it has both girls.
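• A quick simulation that checks both answers, assuming each child is independently a girl with probability 1/2.

import random

random.seed(0)
older_girl = at_least_one_girl = both_given_older = both_given_either = 0

for _ in range(100_000):
    younger = random.choice(['girl', 'boy'])
    older = random.choice(['girl', 'boy'])
    both = (younger == 'girl') and (older == 'girl')
    if older == 'girl':
        older_girl += 1
        both_given_older += both
    if older == 'girl' or younger == 'girl':
        at_least_one_girl += 1
        both_given_either += both

print(both_given_older / older_girl)          # close to 1/2
print(both_given_either / at_least_one_girl)  # close to 1/3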
Bayes Theorem

Example Reference: Super Data Science


Bayes Theorem

Defective Spanners
Bayes Theorem
Bayes Theorem
Bayes Theorem
Bayes Theorem
That’s intuitive
Exercise
Example:
Step-1
Step-1
Step-1
Step-2
Step-3
Naive Bayes – Step-1
Naive Bayes – Step-2
Naive Bayes – Step-3
Combining altogether
Naive Bayes – Step-4
Naive Bayes – Step-5
Types of model
Final Classification
Random Variable

• A random variable is a numerical description of the


outcome of a statistical experiment.
• A random variable that may assume only a finite
number or an infinite sequence of values is said to be
discrete; one that may assume any value in some
interval on the real number line is said to be continuous.
• For instance, a random variable representing the
number of automobiles sold at a particular dealership
on one day would be discrete, while a random variable
representing the weight of a person in kilograms (or
pounds) would be continuous.
Random Variable

• The probability distribution for a random variable describes how


the probabilities are distributed over the values of the random
variable.
• For a discrete random variable, x, the probability distribution is
defined by a probability mass function, denoted by f(x).
• This function provides the probability for each value of the
random variable. In the development of the probability function
for a discrete random variable, two conditions must be satisfied:
– (1) f(x) must be nonnegative for each value of the random
variable, and
– (2) the sum of the probabilities for each value of the random
variable must equal one.
Random Variable

• A random variable is a variable whose possible


values have an associated probability
distribution.
• A very simple random variable equals 1 if a coin
flip turns up heads and 0 if the flip turns up tails.
• A more complicated one might measure the
number of heads observed when flipping a coin
10 times or a value picked from range(10) where
each number is equally likely.
Random Variable
Random Variable

• The associated distribution gives the probabilities that the


variable realizes each of its possible values. The coin flip
variable equals 0 with probability 0.5 and 1 with probability
0.5.
• The range(10) variable has a distribution that assigns
probability 0.1 to each of the numbers from 0 to 9.
• We will sometimes talk about the expected value of a random
variable, which is the average of its values weighted by their
probabilities.
• The coin flip variable has an expected value of 1/2 (= 0 * 1/2 + 1
* 1/2), and the range(10) variable has an expected value of 4.5.
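• A tiny sketch computing the two expected values just mentioned, representing each distribution as a dict mapping value to probability.

def expected_value(dist):
    # Average of the values weighted by their probabilities.
    return sum(value * p for value, p in dist.items())

coin = {0: 0.5, 1: 0.5}                          # fair coin: tails = 0, heads = 1
uniform10 = {value: 0.1 for value in range(10)}  # value picked from range(10)

print(expected_value(coin))        # 0.5
print(expected_value(uniform10))   # 4.5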
Random Variable

• Random variables can be conditioned on events just as


other events can. Going back to the two-child example
from “Conditional Probability”, if X is the random variable
representing the number of girls, X equals 0 with
probability 1/4, 1 with probability 1/2, and 2 with
probability 1/4.
• We can define a new random variable Y that gives the
number of girls conditional on at least one of the children
being a girl. Then Y equals 1 with probability 2/3 and 2
with probability 1/3. And a variable Z that’s the number of
girls conditional on the older child being a girl equals 1
with probability 1/2 and 2 with probability 1/2.
Probability Distribution

• A probability distribution is a function that describes


the likelihood of obtaining the possible values that a
random variable can assume. In other words, the
values of the variable vary based on the underlying
probability distribution.
• Suppose you draw a random sample and measure the
heights of the subjects. As you measure heights, you
can create a distribution of heights. This type of
distribution is useful when you need to know which
outcomes are most likely, the spread of potential
values, and the likelihood of different results.
Probability Distribution
General Properties

• Probability distributions indicate the likelihood


of an event or outcome. Statisticians use the
following notation to describe probabilities:
p(x) = the likelihood that random variable takes
a specific value of x.
• The sum of all probabilities for all possible
values must equal 1. Furthermore, the
probability for a particular value or range of
values must be between 0 and 1.
Types

• Probability distributions describe the dispersion of the


values of a random variable. Consequently, the kind of
variable determines the type of probability distribution.
For a single random variable, statisticians divide
distributions into the following two types:
– Discrete probability distributions for discrete
variables
– Probability density functions for continuous variables
• You can use equations and tables of variable values and
probabilities to represent a probability distribution.
Discrete Probability Distribution

• Discrete probability functions are also known as


probability mass functions and can assume a
discrete number of values.
• For example, coin tosses and counts of events are
discrete functions. These are discrete distributions
because there are no in-between values.
• For example, you can have only heads or tails in a
coin toss. Similarly, if you’re counting the number of
books that a library checks out per hour, you can
count 21 or 22 books, but nothing in between.
Discrete Probability Distribution

• For discrete probability distribution functions,


each possible value has a non-zero likelihood.
Furthermore, the probabilities for all possible
values must sum to one. Because the total
probability is 1, one of the values must occur for
each opportunity.
• For example, the likelihood of rolling a specific
number on a die is 1/6. The total probability for
all six values equals one. When you roll a die,
you inevitably obtain one of the possible values.
Discrete Probability Distribution

• If the discrete distribution has a finite number of


values, you can display all the values with their corresponding probabilities in a table. For example, according to a study, the likelihood for the number of cars in a Pune household is the following:
Continuous Probability Distribution

• Continuous probability functions are also known as


probability density functions. You know that you have
a continuous distribution if the variable can assume
an infinite number of values between any two values.
Continuous variables are often measurements on a
scale, such as height, weight, and temperature.
• Unlike discrete probability distributions where each
particular value has a non-zero likelihood, specific
values in continuous distributions have a zero
probability. For example, the likelihood of measuring
a temperature that is exactly 32 degrees is zero.
Continuous Probability Distribution
How to find ?

• Probabilities for continuous distributions are measured


over ranges of values rather than single points. A
probability indicates the likelihood that a value will fall
within an interval. This property is straightforward to
demonstrate using a probability distribution plot.
• On a probability plot, the entire area under the
distribution curve equals 1. This fact is equivalent to how
the sum of all probabilities must equal one for discrete
distributions. The proportion of the area under the curve
that falls within a range of values along the X-axis
represents the likelihood that a value will fall within that
range.
Characteristics

• Just as there are different types of discrete distributions


for different kinds of discrete data, there are different
distributions for continuous data.
• Each probability distribution has parameters that define its
shape. Most distributions have between 1-3 parameters.
• Specifying these parameters establishes the shape of the
distribution and all of its probabilities entirely.
• These parameters represent essential properties of the
distribution, such as the central tendency and the
variability.
Characteristics

• The most well-known continuous distribution is the


normal distribution, which is also known as the
Gaussian distribution or the “bell curve.”
• This symmetric distribution fits a wide variety of
phenomena, such as human height and IQ scores. It has
two parameters—the mean and the standard
deviation.
• The Weibull distribution and the lognormal distribution
are other common continuous distributions. Both of
these distributions can fit skewed data.
Characteristics

• Distribution parameters are values that apply to


entire populations.
• Unfortunately, population parameters are generally
unknown because it’s usually impossible to
measure an entire population.
• However, you can use random samples to calculate
estimates of these parameters.
Normal Distribution

• Normal distribution represents the behavior of


most of the situations in the universe (That is why
it’s called a “normal” distribution. I guess!).
• The large sum of (small) random variables often
turns out to be normally distributed, contributing
to its widespread application.
Normal Distribution

• Any distribution is known as Normal distribution if


it has the following characteristics:
– The mean, median and mode of the distribution
coincide.
– The curve of the distribution is bell-shaped and
symmetrical about the line x=μ.
– The total area under the curve is 1.
– Exactly half of the values are to the left of the
center and the other half to the right.
Normal Distribution : Example

• Let’s start off with the normal distribution to show


how to use continuous probability distributions.
• The distribution of IQ scores is defined as a normal
distribution with a mean of 100 and a standard
deviation of 15. We’ll create the probability plot of
this distribution.
• Additionally, let’s determine the likelihood that an
IQ score will be between 120-140.
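• A sketch of this calculation with scipy.stats, for a normal distribution with mean 100 and standard deviation 15.

from scipy.stats import norm

iq = norm(loc=100, scale=15)          # the IQ distribution described above
p = iq.cdf(140) - iq.cdf(120)         # P(120 < IQ < 140)
print(p)                              # approximately 0.0874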
Normal Distribution : Example
Normal Distribution : Example

• We can see that it is a symmetric distribution where values occur most frequently around 100, which is the mean. The probabilities drop off as you move away from the mean in both directions.
• The shaded area for the range of IQ scores between
120-140 contains 8.738% of the total area under
the curve.
• Therefore, the likelihood that an IQ score falls
within this range is 0.08738.
Normal Distribution : Shape
Central Limit Theorem

• The Central Limit Theorem states that the


sampling distribution of the sample means
approaches a normal distribution as the
sample size gets larger — no matter what the
shape of the population distribution. This fact
holds especially true for sample sizes over 30.
• All this is saying is that as you take more
samples, especially large ones, your graph of
the sample means will look more like a normal
distribution.
Central Limit Theorem

• Here’s what the Central Limit


Theorem is saying,
graphically. The picture
below shows one of the
simplest types of test: rolling
a fair die.
• The more times you roll the
die, the more likely the shape
of the distribution of the
means tends to look like a
normal distribution graph.
Central Limit Theorem

• An essential component of the Central Limit Theorem is


that the average of your sample means will be the
population mean.
• In other words, add up the means from all of your
samples, find the average and that average will be your
actual population mean.
• Similarly, if you find the average of all of the standard
deviations in your sample, you’ll find the actual standard
deviation for your population.
• It’s a pretty useful phenomenon that can help accurately
predict characteristics of a population.
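• A sketch of the dice illustration: histograms of the mean of n rolls, for increasing n, look more and more like a normal curve (10,000 simulated samples for each n).

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

for i, n in enumerate([1, 2, 10, 30], start=1):
    # Mean of n fair-die rolls, repeated 10,000 times.
    sample_means = rng.integers(1, 7, size=(10_000, n)).mean(axis=1)
    plt.subplot(2, 2, i)
    plt.hist(sample_means, bins=30)
    plt.title(f'mean of {n} rolls')

plt.tight_layout()
plt.show()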
Examples:

• A Central Limit Theorem word problem will most likely


contain the phrase “assume the variable is normally
distributed”, or one like it. With these central limit
theorem examples, you will be given:
– A population (i.e. 29-year-old males, seniors between
72 and 76, all registered vehicles, all cat owners)
– An average (i.e. 125 pounds, 24 hours, 15 years, $15.74)
– A standard deviation (i.e. 14.4lbs, 3 hours, 120 months,
$196.42)
– A sample size (i.e. 15 males, 10 seniors, 79 cars, 100
households)
Thank you
This presentation is created using LibreOffice Impress 5.1.6.2, can be used freely as per GNU General Public License

/mITuSkillologies    @mitu_group    /company/mitu-skillologies    MITUSkillologies

Web Resources
https://mitu.co.in
http://tusharkute.com

[email protected]
[email protected]
Statistics

Tushar B. Kute,
http://tusharkute.com
Objectives

• Statistics: Describing a Single Set of Data,


• Correlation,
• Simpson’s Paradox,
• Some Other Correlational Caveats,
• Correlation and Causation
What is statistics?

• Statistics is the discipline that concerns the collection,


organization, analysis, interpretation, and presentation
of data.
• In applying statistics to a scientific, industrial, or social
problem, it is conventional to begin with a statistical
population or a statistical model to be studied.
• Populations can be diverse groups of people or objects
such as "all people living in a country" or "every atom
composing a crystal". Statistics deals with every aspect
of data, including the planning of data collection in
terms of the design of surveys and experiments
What is statistics?

• When census data cannot be collected, statisticians collect data


by developing specific experiment designs and survey samples.
• Representative sampling assures that inferences and
conclusions can reasonably extend from the sample to the
population as a whole.
• An experimental study involves taking measurements of the
system under study, manipulating the system, and then taking
additional measurements using the same procedure to
determine if the manipulation has modified the values of the
measurements.
• In contrast, an observational study does not involve
experimental manipulation.
Descriptive Statistics

• A descriptive statistic (in the count noun sense) is a


summary statistic that quantitatively describes or
summarizes features of a collection of information,
while descriptive statistics in the mass noun sense is
the process of using and analyzing those statistics.
• Descriptive statistics is distinguished from inferential
statistics (or inductive statistics), in that descriptive
statistics aims to summarize a sample, rather than use
the data to learn about the population that the
sample of data is thought to represent.
Inferential Statistics

• Statistical inference is the process of using data analysis


to deduce properties of an underlying probability
distribution.
• Inferential statistical analysis infers properties of a
population, for example by testing hypotheses and
deriving estimates.
• It is assumed that the observed data set is sampled from a
larger population. Inferential statistics can be contrasted
with descriptive statistics.
• Descriptive statistics is solely concerned with properties
of the observed data, and it does not rest on the
assumption that the data come from a larger population.
Comparing
Why Statistics?

• What features are the most important?


• How should we design the experiment to develop
our product strategy?
• What performance metrics should we measure?
• What is the most common and expected outcome?
• How do we differentiate between noise and valid
data?
From data to knowledge

• In isolation, raw observations are just data. We use


descriptive statistics to transform these
observations into insights that make sense.
• Then we can use inferential statistics to study small
samples of data and extrapolate our findings to the
entire population.
Terminologies of statistics

• Population: It is an entire pool of data from where a


statistical sample is extracted. It can be visualized as a
complete data set of items that are similar in nature.
• Sample: It is a subset of the population, i.e. it is an
integral part of the population that has been
collected for analysis.
• Variable: A value whose characteristics such as
quantity can be measured, it can also be addressed as
a data point, or a data item.
Terminologies of statistics

• Distribution: The sample data that is spread over a


specific range of values.
• Parameter: It is a value that is used to describe the
attributes of a complete data set (also known as
‘population’). Example: Average, Percentage
• Quantitative analysis: It deals with specific
characteristics of data- summarizing some part of
data, such as its mean, variance, and so on.
• Qualitative analysis: This deals with generic
information about the type of data, and how clean or
structured it is.
Statistical Machine Learning

• The methods used in statistics are important to train and test


the data that is used as input to the machine learning model.
Some of these include outlier/anomaly detection, sampling of
data, data scaling, variable encoding, dealing with missing
values, and so on.
• Statistics is also essential to evaluate the model that has been
used, i.e. see how well the machine learning model performs on
a test dataset, or on data that it has never seen before.
• Statistics is essential in selecting the final and appropriate
model to deal with that specific data in a predictive modelling
situation.
• It is also needed to show how well the model has performed, by
taking various metrics and showing how the model has fared.
Describing single set of data

• Practically...
Dispersion

• Dispersion refers to measures of how spread


out our data is.
• Typically they’re statistics for which values near
zero signify not spread out at all and for which
large values (whatever that means) signify very
spread out.
Variance

• In statistics, the variance is a measure of how far individual


(numeric) values in a dataset are from the mean or average
value.
• The variance is often used to quantify spread or dispersion.
Spread is a characteristic of a sample or population that
describes how much variability there is in it.
• A high variance tells us that the values in our dataset are
far from their mean. So, our data will have high levels of
variability.
• On the other hand, a low variance tells us that the values
are quite close to the mean. In this case, the data will have
low levels of variability.
Variance

• To calculate the variance in a dataset, we first need to find the difference between each individual value and the mean. The variance is the average of the squares of those differences. We can express the variance with the following math expression:

  σ² = (1/n) Σᵢ (xᵢ − μ)²

• In this equation, xᵢ stands for individual values or observations in a dataset. μ stands for the mean or average of those values. n is the number of values in the dataset.
• The term xᵢ − μ is called the deviation from the mean. So, the variance is the mean of squared deviations. That's why we denoted it as σ².
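• A small sketch computing the variance directly from this definition and with Python's statistics module; the data list here is made up, not the dataset behind the numbers quoted on the next slide.

import statistics

data = [2, 7, 3, 12, 9, 15]                                 # hypothetical observations
mean = sum(data) / len(data)
variance = sum((x - mean) ** 2 for x in data) / len(data)   # population variance
print(variance)
print(statistics.pvariance(data))   # same value, population variance
print(statistics.variance(data))    # sample variance (divides by n - 1 instead)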
Variance
Variance

• That's all. The variance of our data is 3.916666667. The variance can be difficult to understand and interpret, particularly because of how strange its units are.
• For example, if the observations in our dataset are measured in pounds, then the variance will be measured in square pounds.
• So, we can say that the observations are, on average, 3.916666667 square pounds away from the mean of 3.5.
• Fortunately, the standard deviation comes to fix this problem.
Standard Deviation

• The standard deviation measures the amount of variation or


dispersion of a set of numeric values. Standard deviation is the
square root of variance σ2 and is denoted as σ. So, if we want
to calculate the standard deviation, then all we just have to do
is to take the square root of the variance as follows:
σ=√σ2
• Again, we need to distinguish between the population
standard deviation, which is the square root of the population
variance (σ2) and the sample standard deviation, which is the
square root of the sample variance (S2). We'll denote the
sample standard deviation as S:
S=√S2
Standard Deviation

• Low values of standard deviation tell us that


individual values are closer to the mean. High
values, on the other hand, tell us that individual
observations are far away from the mean of the
data.
• Values that are within one standard deviation of
the mean can be thought of as fairly typical,
whereas values that are three or more standard
deviations away from the mean can be considered
much more atypical. They're also known as outliers.
Standard Deviation
Correlation Coefficients

• Correlation coefficients quantify the association between


variables or features of a dataset. These statistics are of high
importance for science and technology, and Python has great
tools that you can use to calculate them. SciPy, NumPy, and
Pandas correlation methods are fast, comprehensive, and
well-documented.
• We will learn:
– What Pearson, Spearman, and Kendall correlation
coefficients are
– How to use SciPy, NumPy, and Pandas correlation functions
– How to visualize data, regression lines, and correlation
matrices with Matplotlib
What is correlation ?

• Statistics and data science are often concerned about the


relationships between two or more variables (or features) of a
dataset. Each data point in the dataset is an observation, and the
features are the properties or attributes of those observations.
• Every dataset you work with uses variables and observations. For
example, you might be interested in understanding the following:
– How the height of basketball players is correlated to their
shooting accuracy
– Whether there’s a relationship between employee work
experience and salary
– What mathematical dependence exists between the population
density and the gross domestic product of different countries
What is correlation ?

• In this table, each row represents one observation, or


the data about one employee (either Ann, Rob, Tom, or
Ivy). Each column shows one property or feature (name,
experience, or salary) for all the employees.
Forms of correlation
Forms of correlation

• Negative correlation (red dots): In the plot on the left, the y


values tend to decrease as the x values increase. This shows
strong negative correlation, which occurs when large values of
one feature correspond to small values of the other, and vice
versa.
• Weak or no correlation (green dots): The plot in the middle shows
no obvious trend. This is a form of weak correlation, which occurs
when an association between two features is not obvious or is
hardly observable.
• Positive correlation (blue dots): In the plot on the right, the y
values tend to increase as the x values increase. This illustrates
strong positive correlation, which occurs when large values of one
feature correspond to large values of the other, and vice versa.
Example: Employee table
Correlation Techniques

• There are several statistics that you can use to quantify


correlation. We will be learning about three correlation
coefficients:
– Pearson’s r
– Spearman’s rho
– Kendall’s tau
• Pearson’s coefficient measures linear correlation, while the
Spearman and Kendall coefficients compare the ranks of data.
• There are several NumPy, SciPy, and Pandas correlation
functions and methods that you can use to calculate these
coefficients.
• You can also use Matplotlib to conveniently illustrate the
results.
Basic with numpy
What sort of correlation ?

• The values on the main diagonal of the correlation


matrix (upper left and lower right) are equal to 1.
• The upper left value corresponds to the correlation
coefficient for x and x, while the lower right value is
the correlation coefficient for y and y. They are always
equal to 1.
• However, what you usually need are the lower left and
upper right values of the correlation matrix.
• These values are equal and both represent the
Pearson correlation coefficient for x and y. In this case,
it’s approximately 0.76.
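• A sketch of such a correlation matrix with NumPy; the two arrays below are made-up examples whose off-diagonal Pearson coefficient comes out at roughly the 0.76 mentioned above.

import numpy as np

x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

r = np.corrcoef(x, y)
print(r)
# r[0, 0] and r[1, 1] are exactly 1; r[0, 1] == r[1, 0] is the
# Pearson correlation coefficient for x and y (about 0.76 here).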
Correlation with scipy
Correlation with scipy
Correlation with Pandas
Linear Correlation

• Linear correlation measures the proximity of


the mathematical relationship between
variables or dataset features to a linear
function.
• If the relationship between the two features is
closer to some linear function, then their linear
correlation is stronger and the absolute value of
the correlation coefficient is higher.
Pearson Correlation

• Consider a dataset with two features: x and y. Each feature


has n values, so x and y are n-tuples. Say that the first value x₁
from x corresponds to the first value y₁ from y, the second
value x₂ from x to the second value y₂ from y, and so on. Then,
there are n pairs of corresponding values: (x₁, y₁), (x₂, y₂), and
so on. Each of these x-y pairs represents a single observation.
• The Pearson (product-moment) correlation coefficient is a
measure of the linear relationship between two features. It’s
the ratio of the covariance of x and y to the product of their
standard deviations. It’s often denoted with the letter r and
called Pearson’s r. You can express this value mathematically
with this equation:
• r = Σᵢ((xᵢ − mean(x))(yᵢ − mean(y))) / (√Σᵢ(xᵢ − mean(x))² · √Σᵢ(yᵢ − mean(y))²)
Pearson Correlation

• The Pearson correlation coefficient can take on any real value in the
range −1 ≤ r ≤ 1.
• The maximum value r = 1 corresponds to the case when there’s a
perfect positive linear relationship between x and y. In other words,
larger x values correspond to larger y values and vice versa.
• The value r > 0 indicates positive correlation between x and y.
• The value r = 0 corresponds to the case when x and y are
independent.
• The value r < 0 indicates negative correlation between x and y.
• The minimal value r = −1 corresponds to the case when there’s a
perfect negative linear relationship between x and y. In other
words, larger x values correspond to smaller y values and vice versa.
Pearson Correlation
Linear Regression

• Linear regression is the process of finding the linear


function that is as close as possible to the actual
relationship between features.
• In other words, you determine the linear function
that best describes the association between the
features. This linear function is also called the
regression line.
• You can implement linear regression with SciPy.
You’ll get the linear function that best approximates
the relationship between two arrays, as well as the
Pearson correlation coefficient.
Practical Linear Regression
More on Regression
Using Multidimensional Data
Using Pandas
Using Pandas
Using Pandas
Using corrwith
Rank Correlation

• Rank correlation compares the ranks or the


orderings of the data related to two variables or
dataset features.
• If the orderings are similar, then the correlation is
strong, positive, and high. However, if the orderings
are close to reversed, then the correlation is strong,
negative, and low.
• In other words, rank correlation is concerned only
with the order of values, not with the particular
values from the dataset.
Rank Correlation

• The left plot has a perfect positive linear relationship between x and y, so r
= 1. The central plot shows positive correlation and the right one shows
negative correlation. However, neither of them is a linear function, so r is
different than −1 or 1.
• When you look only at the orderings or ranks, all three relationships are
perfect! The left and central plots show the observations where larger x
values always correspond to larger y values. This is perfect positive rank
correlation. The right plot illustrates the opposite case, which is perfect
negative rank correlation.
The Spearman Correlation

• The Spearman correlation coefficient between two


features is the Pearson correlation coefficient between
their rank values.
• It’s calculated the same way as the Pearson correlation
coefficient but takes into account their ranks instead of
their values. It’s often denoted with the Greek letter rho
(ρ) and called Spearman’s rho.
• Say you have two n-tuples, x and y, where (x₁, y₁), (x₂, y₂), …
are the observations as pairs of corresponding values.
• You can calculate the Spearman correlation coefficient ρ
the same way as the Pearson coefficient. You’ll use the
ranks instead of the actual values from x and y.
The Spearman Correlation

• It can take a real value in the range −1 ≤ ρ ≤ 1.


• Its maximum value ρ = 1 corresponds to the case
when there’s a monotonically increasing function
between x and y. In other words, larger x values
correspond to larger y values and vice versa.
• Its minimum value ρ = −1 corresponds to the case
when there’s a monotonically decreasing function
between x and y. In other words, larger x values
correspond to smaller y values and vice versa.
Kendall Correlation Coefficient

• Let’s start again by considering two n-tuples, x and y. Each of the


x-y pairs (x₁, y₁), (x₂, y₂), … is a single observation. A pair of
observations (xᵢ, yᵢ) and (xⱼ, yⱼ), where i < j, will be one of three
things:
– concordant if either (xᵢ > xⱼ and yᵢ > yⱼ) or (xᵢ < xⱼ and yᵢ < yⱼ)
– discordant if either (xᵢ < xⱼ and yᵢ > yⱼ) or (xᵢ > xⱼ and yᵢ < yⱼ)
– neither if there’s a tie in x (xᵢ = xⱼ) or a tie in y (yᵢ = yⱼ)
• The Kendall correlation coefficient compares the number of
concordant and discordant pairs of data.
• This coefficient is based on the difference in the counts of
concordant and discordant pairs relative to the number of x-y
pairs. It’s often denoted with the Greek letter tau (τ) and called
Kendall’s tau.
Kendall Correlation Coefficient

• It can take a real value in the range −1 ≤ τ ≤ 1.


• Its maximum value τ = 1 corresponds to the case
when the ranks of the corresponding values in x
and y are the same. In other words, all pairs are
concordant.
• Its minimum value τ = −1 corresponds to the
case when the rankings in x are the reverse of
the rankings in y. In other words, all pairs are
discordant.
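• A sketch of both rank correlation coefficients with SciPy, on the same assumed example arrays as before; each function also returns a p-value.

import numpy as np
from scipy import stats

x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

rho, rho_p = stats.spearmanr(x, y)   # Pearson's r applied to the ranks of x and y
tau, tau_p = stats.kendalltau(x, y)  # based on concordant vs. discordant pairs
print(rho)
print(tau)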
Ranking with scipy
Implementation
Scipy with 2D
Scipy with 2D
Using pandas
Visualization of Regression Line
Simpson’s Paradox

• Simpson’s Paradox refers to a situation where you


believe you understand the direction of a
relationship between two variables, but when you
consider an additional variable, that direction
appears to reverse.
Simpson’s Paradox : Why?

• Simpson’s Paradox happens because


disaggregation of the data (e.g., splitting it into
subgroups) can cause certain subgroups to have an
imbalanced representation compared to other
subgroups.
• This might be due to the relationship between the
variables, or simply due to the way that the data
has been partitioned into subgroups.
Example #1: Admissions

• A famous example of Simpson’s Paradox appears in


the admissions data for graduate school at UC
Berkeley in 1973.
• In this example, when looking at the graduate
admissions data overall, it appeared that men were
more likely to be admitted than women (gender
discrimination!), but when looking at the data for
each department individually, men were less likely
to be admitted than women in most of the
departments.
Example #1: Admissions
Why?

• Here is an explanation of why this happens:


– Different departments had very different acceptance
rates (some were much “harder” to get into than others)
– More females applied to the “harder" departments
– Therefore, females had a lower acceptance rate in
aggregate
• This leads us to ask: which view is the correct view? Do men
or women have a higher acceptance rate? Is there a gender
bias in admissions at this university?
• In this case, it seems most reasonable to conclude that
looking at the admissions rates by department makes more
sense, and the disaggregated view is correct.
Example #2 : Baseball

• Another example of Simpson’s Paradox can be


found in the batting averages of two famous
baseball players, Derek Jeter and David Justice,
from 1995 and 1996.
• David Justice had a higher batting average in both
1995 and 1996 individually, but Derek Jeter had a
higher batting average over the two years
combined.
Example #2
Why?

• Here is an explanation of why this happens:


– Both players had significantly higher batting
averages in 1996 than in 1995
– Derek Jeter had significantly more at-bats in
1996; David Justice had significantly more in
1995
– Therefore, Derek Jeter had a higher batting
average in aggregate
What to do?

• Without enough domain knowledge, it’s hard to know which


view of the relationship between two variables makes more
sense – the one with or without the third variable.
• But before we think about how to deal with Simpson’s
Paradox, we need to find a way to efficiently detect it in a
dataset.
• As mentioned earlier, it’s possible to find an instance of
Simpson’s Paradox (a “Simpson’s Pair”) simply by
disaggregating a contingency table or a plot of data points
and studying the results.
• However, there are other ways we can find Simpson’s Pairs
using models
What to do?

– By building decision trees and comparing the


distributions, or
– By building regression models and comparing the signs
of the coefficients
• There are benefits to both, however, this can get difficult
very quickly, especially when working with big datasets.
• It’s hard to know which variables in the dataset may reverse
the relationship between two other variables, and it can be
hard to check all possible pairs of variables manually.
• Imagine we have a dataset with only 20 variables: we’d
need to check almost 400 pairs to be sure to find all cases
of Simpson’s Paradox.
Some other correlation caveats

• A correlation of zero indicates that there is no linear


relationship between the two variables. However, there may
be other sorts of relationships. For example, if:
x = [-2, -1, 0, 1, 2]
y = [ 2, 1, 0, 1, 2]
• then x and y have zero correlation. But they certainly have a
relationship — each element of y equals the absolute value
of the corresponding element of x .
• What they don’t have is a relationship in which knowing how
x_i compares to mean(x) gives us information about how y_i
compares to mean(y) . That is the sort of relationship that
correlation looks for.
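• A quick check of this caveat: y is completely determined by x (y equals |x|), yet the Pearson correlation is zero.

import numpy as np

x = [-2, -1, 0, 1, 2]
y = [ 2,  1, 0, 1, 2]
print(np.corrcoef(x, y)[0, 1])   # 0.0: no linear relationship detected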
Some other correlation caveats

• In addition, correlation tells you nothing about how


large the relationship is. The variables:
x = [-2, -1, 0, 1, 2]
y = [99.98, 99.99, 100, 100.01, 100.02]
• are perfectly correlated, but (depending on what
you’re measuring) it’s quite possible that this
relationship isn’t all that interesting.
Causation

• Causation means that one event causes another


event to occur.
• Causation can only be determined from an
appropriately designed experiment.
• In such experiments, similar groups receive
different treatments, and the outcomes of each
group are studied.
• We can only conclude that a treatment causes
an effect if the groups have noticeably different
outcomes.
Why?

• Describing a relationship between variables


• Identifying statements consistent with the
relationship between variables
• Identifying valid conclusions about correlation and
causation for data shown in a scatterplot
• Identifying a factor that could explain why a
correlation does not imply a causal relationship
Example:

• My mother-in-law recently complained to me:


“Whenever I try to text message, my phone
freezes.”
• A quick look at her smartphone confirmed my
suspicion: she had five game apps open at the same
time plus Facebook and YouTube.
• The act of trying to send a text message wasn’t
causing the freeze, the lack of RAM was.
• But she immediately connected it with the last
action she was doing before the freeze.
Example:
Causation

• Causation is implying that A and B have a cause-


and-effect relationship with one another. You’re
saying A causes B.
• Causation is also known as causality.
– Firstly, causation means that two events appear
at the same time or one after the other.
– And secondly, it means these two variables not
only appear together, the existence of one
causes the other to manifest.
Why doesn't correlation imply causation?

• Even if there is a correlation between two variables, we cannot


conclude that one variable causes a change in the other. This
relationship could be coincidental, or a third factor may be causing
both variables to change.
• For example, Liam collected data on the sales of ice cream cones and
air conditioners in his hometown. He found that when ice cream
sales were low, air conditioner sales tended to be low and that when
ice cream sales were high, air conditioner sales tended to be high.
– Liam can conclude that sales of ice cream cones and air
conditioner are positively correlated.
– Liam can't conclude that selling more ice cream cones causes
more air conditioners to be sold. It is likely that the increases in
the sales of both ice cream cones and air conditioners are caused
by a third factor, an increase in temperature!
Thank you
This presentation is created using LibreOffice Impress 5.1.6.2, can be used freely as per GNU General Public License

/mITuSkillologies    @mitu_group    /company/mitu-skillologies    MITUSkillologies

Web Resources
https://mitu.co.in
http://tusharkute.com

[email protected]
[email protected]
Machine Learning

Tushar B. Kute,
http://tusharkute.com
Objectives

• Machine Learning: definition
• Relation with Data Science
Machine Learning

• Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.
• The process of learning begins with observations or data, such
as examples, direct experience, or instruction, in order to look
for patterns in data and make better decisions in the future
based on the examples that we provide.
• The primary aim is to allow the computers learn automatically
without human intervention or assistance and adjust actions
accordingly.
Origins of Machine Learning

• The earliest databases recorded information from


the observable environment.
• Astronomers recorded patterns of planets and stars;
biologists noted results from experiments
crossbreeding plants and animals; and cities
recorded tax payments, disease outbreaks, and
populations. Each of these required a human being
to first observe and second, record the observation.
• Today, such observations are increasingly
automated and recorded systematically in ever-
growing computerized databases.
Machine Learning

• The field of study interested in the development of


computer algorithms for transforming data into
intelligent action is known as machine learning.
Data Mining

• A closely related sibling of machine learning, data


mining, is concerned with the generation of novel
insight from large databases (not to be confused with
the pejorative term "data mining," describing the
practice of cherry-picking data to support a theory).
• Although there is some disagreement over how
widely the two fields overlap, a potential point of
distinction is that machine learning tends to be
focused on performing a known task, whereas data
mining is about the search for hidden nuggets of
information.
Uses and Abuses

• Predict the outcomes of elections


• Identify and filter spam messages from e-mail
• Foresee criminal activity
• Automate traffic signals according to road conditions
• Produce financial estimates of storms and natural
disasters
• Examine customer churn
• Create auto-piloting planes and auto-driving cars
• Identify individuals with the capacity to donate
• Target advertising to specific types of consumers
Case Study
Recognizing patterns

• A machine learning algorithm takes data and


identifies patterns that can be used for action.
• In some cases, the results are so successful that
they seem to reach near-legendary status.
How do machines learn?

• A commonly cited formal definition of machine


learning, proposed by computer scientist Tom
M. Mitchell, says that a machine is said to learn if it is able to take experience and utilize it such that its performance improves upon similar experiences in the future.
• This definition is fairly exact, yet says little
about how machine learning techniques actually
learn to transform data into actionable
knowledge.
Types of Machine Learning
Supervised Machine Learning
Unsupervised Machine Learning
Training a dataset

• The process of fitting a particular model to a


dataset is known as training.
• Why is this not called learning? First, note that
the learning process does not end with the step
of data abstraction.
• Learning requires an additional step to
generalize the knowledge to future data.
• Second, the term training more accurately
describes the actual process undertaken when
the model is fitted to the data.
Training a dataset
Predictive Analytics

• Predictive Analytics helps an organization know what might happen next; it predicts the future based on the data presently available.
• It analyzes the data and provides statements about events that have not happened yet.
• It makes all kinds of predictions that you want to know, and all predictions are probabilistic in nature.
Descriptive Analytics

• Descriptive Analytics will help an organization to


know what has happened in the past, it would give
you the past analytics using the data that are
stored.
• For a company, it is necessary to know the past
events that help them to make decisions based on
the statistics using historical data.
• For example, you might want to know how much
money you lost due to fraud and many more.
Data Scientist Skillset
Machine Learning Skillset
Summary
Data Science vs. Machine Learning

• Because data science is a broad term for multiple


disciplines, machine learning fits within data
science.
• Machine learning uses various techniques, such as regression and supervised clustering. On the other hand, the ‘data’ in data science may or may not evolve from a machine or a mechanical process.
• The main difference between the two is that data
science as a broader term not only focuses on
algorithms and statistics but also takes care of the
entire data processing methodology.
Data Science Disciplines
Applications of ML in Data Science

• Regression and classification are of primary importance


to a data scientist. To achieve these goals, one of the
main tools a data scientist uses is machine learning. The
uses for regression and automatic classification are wide
ranging, such as the following:
– Finding oil fields, gold mines, or archeological sites
based on existing sites (classification and regression)
– Finding place names or persons in text (classification)
– Identifying people based on pictures or voice
recordings (classification)
– Recognizing birds based on their whistle (classification)
Applications of ML in Data Science

• Identifying profitable customers (regression and


classification)
• Proactively identifying car parts that are likely to fail
(regression)
• Identifying tumors and diseases (classification)
• Predicting the amount of money a person will spend on
product X (regression)
• Predicting the number of eruptions of a volcano in a period
(regression)
• Predicting your company’s yearly revenue (regression)
• Predicting which team will win the Champions League in
soccer (classification)
Applications of ML in Data Science
Python for Machine Learning
Thank you
This presentation is created using LibreOffice Impress 5.1.6.2, can be used freely as per GNU General Public License

/mITuSkillologies @mitu_group /company/mitu-skillologies MITUSkillologies

Web Resources
https://mitu.co.in
http://tusharkute.com

[email protected]
[email protected]
Data Science Process

Tushar B. Kute,
http://tusharkute.com
Objectives

• Understanding the flow of a data science


process
• Discussing the steps in a data science process
Data Science Process

• A structured approach to data science helps you to


maximize your chances of success in a data science
project at the lowest cost.
• It also makes it possible to take up a project as a
team, with each team member focusing on what
they do best.
• However, this approach may not be suitable for
every type of project or be the only way to do good
data science.
• The typical data science process consists of six steps
Data Science Process
Steps

• The first step of this process is setting a research


goal. The main purpose here is making sure all the
stakeholders understand the what, how, and why of
the project.
• The second phase is data retrieval. You want to
have data available for analysis, so this step
includes finding suitable data and getting access to
the data from the data owner. The result is data in
its raw form, which probably needs polishing and
transformation before it becomes usable.
Steps

• Now that you have the raw data, it’s time to prepare it.
This includes transforming the data from a raw form into
data that’s directly usable in your models. To achieve this,
you’ll detect and correct different kinds of errors in the
data, combine data from different data sources, and
transform it. If you have successfully completed this step,
you can progress to data visualization and modeling.
• The fourth step is data exploration. The goal of this step
is to gain a deep understanding of the data. You’ll look
for patterns, correlations, and deviations based on visual
and descriptive techniques. The insights you gain from
this phase will enable you to start modeling.
Steps

• Finally, we get to the main part: model building (or


“data modeling”).
• It is now that you attempt to gain the insights or
make the predictions stated in your project charter.
• Now is the time to bring out the heavy guns, but
remember research has taught us that often (but
not always) a combination of simple models tends
to outperform one complicated model.
• If you’ve done this phase right, you’re almost done.
Steps

• The last step of the data science model is presenting your


results and automating the analysis, if needed.
• One goal of a project is to change a process and/or make
better decisions. You may still need to convince the
business that your findings will indeed change the
business process as expected.
• This is where you can shine in your influencer role. The
importance of this step is more apparent in projects on a
strategic and tactical level.
• Certain projects require you to perform the business
process over and over again, so automating the project
will save time.
Don’t be a slave to this process

• Not every project will follow this blueprint, because


your process is subject to the preferences of the
data scientist, the company, and the nature of the
project you work on.
• Some companies may require you to follow a strict
protocol, whereas others have a more informal
manner of working.
• In general, you’ll need a structured approach when
you work on a complex project or when many
people or resources are involved.
1. Setting research goal
Goal and context of research

• An essential outcome is the research goal that


states the purpose of your assignment in a clear and
focused manner.
• Understanding the business goals and context is
critical for project success.
• Continue asking questions and devising examples
until you grasp the exact business expectations,
identify how your project fits in the bigger picture,
appreciate how your research is going to change the
business, and understand how they’ll use your
results.
Create project charter

• Clients like to know upfront what they’re paying for, so after you have
a good understanding of the business problem, try to get a formal
agreement on the deliverables. All this information is best collected in
a project charter. For any significant project this would be mandatory.
• A project charter requires teamwork, and your input covers at least
the following:
– A clear research goal
– The project mission and context
– How you’re going to perform your analysis
– What resources you expect to use
– Proof that it’s an achievable project, or proof of concepts
– Deliverables and a measure of success
– A timeline
2. Retrieving data
Data Retrieval

• The next step in data science is to retrieve the


required data. Sometimes you need to go into the
field and design a data collection process yourself, but
most of the time you won’t be involved in this step.
• Many companies will have already collected and
stored the data for you, and what they don’t have can
often be bought from third parties.
• Don’t be afraid to look outside your organization for
data, because more and more organizations are
making even high-quality data freely available for
public and commercial use.
Data Stored in company

• Your first act should be to assess the relevance and


quality of the data that’s readily available within your
company.
• Most companies have a program for maintaining key
data, so much of the cleaning work may already be done.
• This data can be stored in official data repositories such
as databases, data marts, data warehouses, and data
lakes maintained by a team of IT professionals.
• The primary goal of a database is data storage, while a
data warehouse is designed for reading and analyzing
that data.
Data Stored in company

• Getting access to data is another difficult task.


• Organizations understand the value and sensitivity of data
and often have policies in place so everyone has access to
what they need and nothing more.
• These policies translate into physical and digital barriers
called Chinese walls. These “walls” are mandatory and
well-regulated for customer data in most countries.
• This is for good reasons, too; imagine everybody in a credit
card company having access to your spending habits.
• Getting access to the data may take time and involve
company politics.
Data Sources
Data Quality Test

• Expect to spend a good portion of your project time


doing data correction and cleansing, sometimes up to
80%.
• The retrieval of data is the first time you’ll inspect the
data in the data science process. Most of the errors you’ll
encounter during the data-gathering phase are easy to
spot, but being too careless will make you spend many
hours solving data issues that could have been prevented
during data import.
• You’ll investigate the data during the import, data
preparation, and exploratory phases. The difference is in
the goal and the depth of the investigation.
3. Data Preparation
Data Preparation

• The data received from the data retrieval phase is


likely to be “a diamond in the rough.”
• Your task now is to sanitize and prepare it for use in
the modeling and reporting phase.
• Doing so is tremendously important because your
models will perform better and you’ll lose less time
trying to fix strange output.
• It can’t be mentioned nearly enough times: garbage in
equals garbage out.
• Your model needs the data in a specific format, so data
transformation will always come into play.
Data Cleansing

• Data cleansing is a subprocess of the data science process


that focuses on removing errors in your data so your data
becomes a true and consistent representation of the
processes it originates from.
• By “true and consistent representation” we imply that at
least two types of errors exist.
• The first type is the interpretation error, such as when you
take the value in your data for granted, like saying that a
person’s age is greater than 300 years.
• The second type of error points to inconsistencies
between data sources or against your company’s
standardized values.
Overview of common errors
Example: Outliers
Data Entry Errors

• Data collection and data entry are error-prone processes.


• They often require human intervention, and because
humans are only human, they make typos or lose their
concentration for a second and introduce an error into the
chain.
• But data collected by machines or computers isn’t free
from errors either. Errors can arise from human
sloppiness, whereas others are due to machine or
hardware failure.
• Examples of errors originating from machines are
transmission errors or bugs in the extract, transform, and
load phase (ETL).
Example: Frequency Table
Error: Redundant Whitespaces

• Whitespaces tend to be hard to detect but cause errors


like other redundant characters would.
• Who hasn’t lost a few days in a project because of a bug
that was caused by whitespaces at the end of a string?
• You ask the program to join two keys and notice that
observations are missing from the output file. After
looking for days through the code, you finally find the bug.
• Then comes the hardest part: explaining the delay to the
project stakeholders. The cleaning during the ETL phase
wasn’t well executed, and keys in one table contained a
whitespace at the end of a string.
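As a minimal illustration (the tables and the column name "key" are hypothetical, not from the slides), stripping redundant whitespace from the join keys with pandas avoids this class of bug:

import pandas as pd

# Hypothetical tables whose join keys may carry trailing whitespace
left = pd.DataFrame({"key": ["A ", "B", "C "], "sales": [10, 20, 30]})
right = pd.DataFrame({"key": ["A", "B", "C"], "region": ["N", "S", "E"]})

# Strip leading/trailing whitespace from the keys before joining
left["key"] = left["key"].str.strip()
joined = left.merge(right, on="key", how="inner")
print(joined)  # all three observations survive the join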
Impossible values / Sanity Check

• Sanity checks are another valuable type of data


check.
• Here you check the value against physically or
theoretically impossible values such as people
taller than 3 meters or someone with an age of
299 years.
• Sanity checks can be directly expressed with
rules:
check = 0 <= age <= 120
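A minimal sketch of the same rule applied to a whole column with pandas (the ages are made up for illustration):

import pandas as pd

age = pd.Series([23, 45, 299, 31, -2])  # hypothetical ages

# Flag values outside the physically plausible range 0..120
valid = age.between(0, 120)
print(age[~valid])  # rows that fail the sanity check: 299 and -2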
Outliers

• An outlier is an observation that seems to be distant


from other observations or, more specifically, one
observation that follows a different logic or
generative process than the other observations.
• The easiest way to find outliers is to use a plot or a
table with the minimum and maximum values.
• The plot on the top shows no outliers, whereas the
plot on the bottom shows possible outliers on the
upper side when a normal distribution is expected.
• The normal distribution, or Gaussian distribution, is
the most common distribution in natural sciences.
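A quick way to look for such outliers, sketched here with pandas and matplotlib on a made-up column (the data and injected values are illustrative only):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = pd.Series(np.append(rng.normal(50, 5, 500), [120, 130]))  # two injected outliers

# A table with the minimum and maximum values is often enough to spot problems
print(x.describe()[["min", "max"]])

# A boxplot gives a visual impression of observations far from the rest
x.plot.box()
plt.show()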
Example:
Example:
Dealing with missing values

• Missing values aren’t necessarily wrong, but you


still need to handle them separately; certain
modeling techniques can’t handle missing
values.
• They might be an indicator that something went
wrong in your data collection or that an error
happened in the ETL process.
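A minimal sketch of the most common ways to handle them with pandas (the column names are hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40], "income": [3000, 4200, np.nan]})

dropped = df.dropna()                               # omit observations with missing values
filled = df.fillna(df.mean(numeric_only=True))      # impute with the column mean
flagged = df.assign(age_missing=df["age"].isna())   # keep a flag that the value was missing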
Handling missing values
Error: deviation from code book

• Detecting errors in larger data sets against a code


book or against standardized values can be done with
the help of set operations.
• A code book is a description of your data, a form of
metadata. It contains things such as the number of
variables per observation, the number of observations,
and what each encoding within a variable means.
• (For instance “0” equals “negative”, “5” stands for “very
positive”.) A code book also tells the type of data
you’re looking at: is it hierarchical, graph, something
else?
Error: different units of measurement

• When integrating two data sets, you have to pay


attention to their respective units of measurement.
• An example of this would be when you study the
prices of gasoline in the world. To do this you
gather data from different data providers.
• Data sets can contain prices per gallon and others
can contain prices per liter. A simple conversion will
do the trick in this case.
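For the gasoline example, a one-line conversion (with a hypothetical column name) brings both sources to the same unit before they are combined:

import pandas as pd

us = pd.DataFrame({"country": ["US"], "price_per_gallon": [3.50]})

# Convert to price per liter (1 US gallon is roughly 3.785 liters)
us["price_per_liter"] = us["price_per_gallon"] / 3.785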
Having different levels of aggregation

• Having different levels of aggregation is similar to


having different types of measurement.
• An example of this would be a data set containing data
per week versus one containing data per work week.
• This type of error is generally easy to detect, and
summarizing (or the inverse, expanding) the data sets
will fix it.
• After cleaning the data errors, you combine
information from different data sources. But before
we tackle this topic we’ll take a little detour and stress
the importance of cleaning data as early as possible.
Correct Errors

• A good practice is to mediate data errors as early as


possible in the data collection chain and to fix as
little as possible inside your program while fixing
the origin of the problem.
• Retrieving data is a difficult task, and organizations
spend millions of dollars on it in the hope of making
better decisions.
• The data collection process is error-prone, and in a
big organization it involves many steps and teams.
Correct Errors

• Data should be cleansed when acquired for many


reasons:
– Not everyone spots the data anomalies.
Decision-makers may make costly mistakes on
information based on incorrect data from
applications that fail to correct for the faulty
data.
– If errors are not corrected early on in the
process, the cleansing will have to be done for
every project that uses that data.
Correct Errors

• Data errors may point to a business process that


isn’t working as designed. For instance, both
authors worked at a retailer in the past, and they
designed a couponing system to attract more
people and make a higher profit.
• Data errors may point to defective equipment, such
as broken transmission lines and defective sensors.
• Data errors can point to bugs in software or in the
integration of software that may be critical to the
company.
Combine Data

• Your data comes from several different places, and in this


substep we focus on integrating these different sources.
• Data varies in size, type, and structure, ranging from
databases and Excel files to text documents.
• It’s easy to fill entire books on this topic alone, and we
choose to focus on the data science process instead of
presenting scenarios for every type of data.
• But keep in mind that other types of data sources exist,
such as key-value stores, document stores, and so on,
which we’ll handle in more appropriate places in the
book.
Different ways to combine data

• You can perform two operations to combine


information from different data sets.
• The first operation is joining: enriching an observation
from one table with information from another table.
• The second operation is appending or stacking: adding
the observations of one table to those of another
table.
• When you combine data, you have the option to
create a new physical table or a virtual table by
creating a view. The advantage of a view is that it
doesn’t consume more disk space.
Joining tables

• Joining tables allows you to combine the


information of one observation found in one table
with the information that you find in another table.
The focus is on enriching a single observation.
• Let’s say that the first table contains information
about the purchases of a customer and the other
table contains information about the region where
your customer lives.
• Joining the tables allows you to combine the
information so that you can use it for your model
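A minimal sketch of such a join with pandas, using hypothetical purchase and region tables keyed on a customer id:

import pandas as pd

purchases = pd.DataFrame({"customer_id": [1, 2, 3], "amount": [120, 80, 45]})
regions = pd.DataFrame({"customer_id": [1, 2, 3], "region": ["East", "West", "North"]})

# Enrich each purchase observation with the customer's region
enriched = purchases.merge(regions, on="customer_id", how="left")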
Joining tables
Appending tables

• Appending or stacking tables is effectively adding


observations from one table to another table. One table
contains the observations from the month January and
the second table contains observations from the month
February.
• The result of appending these tables is a larger one with
the observations from January as well as February.
• The equivalent operation in set theory would be the
union, and this is also the command in SQL, the common
language of relational databases.
• Other set operators are also used in data science, such as
set difference and intersection.
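In pandas, appending (stacking) is a concat; in SQL the equivalent is UNION. A minimal sketch with hypothetical January and February tables:

import pandas as pd

january = pd.DataFrame({"customer_id": [1, 2], "amount": [120, 80]})
february = pd.DataFrame({"customer_id": [3, 4], "amount": [45, 60]})

# Stack the observations of both months into one larger table
both_months = pd.concat([january, february], ignore_index=True)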
Appending tables
View: without replication
Aggregating measures

• Data enrichment can also be done by adding


calculated information to the table, such as the
total number of sales or what percentage of total
stock has been sold in a certain region.
• Extra measures such as these can add perspective.
Looking at figure, we now have an aggregated
data set, which in turn can be used to calculate the
participation of each product within its category.
• This could be useful during data exploration but
more so when creating data models.
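A minimal aggregation sketch with pandas, assuming hypothetical category and sales columns:

import pandas as pd

sales = pd.DataFrame({
    "category": ["A", "A", "B", "B"],
    "product": ["p1", "p2", "p3", "p4"],
    "units_sold": [10, 30, 20, 20],
})

# Total sales per category, then each product's share within its category
totals = sales.groupby("category")["units_sold"].transform("sum")
sales["share_in_category"] = sales["units_sold"] / totals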
Example:
Data Transformation

• Certain models require their data to be in a certain


shape.
• Now that you’ve cleansed and integrated the data,
this is the next task you’ll perform: transforming
your data so it takes a suitable form for data
modeling.
Data Transformation

• Relationships between an input variable and an


output variable aren’t always linear.
• Take, for instance, a relationship of the form y =
aebx .
• Taking the log of the independent variables
simplifies the estimation problem dramatically.
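A small sketch of this linearization: taking the log of y = a·e^(bx) gives log(y) = log(a) + b·x, which an ordinary straight-line fit can handle (the numbers below are made up):

import numpy as np

a, b = 2.0, 0.3
x = np.linspace(0, 10, 50)
y = a * np.exp(b * x)

# Fit a straight line to (x, log y); the slope estimates b, the intercept estimates log(a)
slope, intercept = np.polyfit(x, np.log(y), 1)
print(slope, np.exp(intercept))  # approximately 0.3 and 2.0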
Data Transformation
Reducing number of variables

• Sometimes you have too many variables and


need to reduce the number because they don’t
add new information to the model.
• Having too many variables in your model makes
the model difficult to handle, and certain
techniques don’t perform well when you
overload them with too many input variables.
Reducing number of variables

• For instance, all the techniques based on a


Euclidean distance perform well only up to 10
variables.
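One common option for reducing the number of variables is principal component analysis; a minimal sketch with scikit-learn (the data and the choice of 10 components are illustrative, not prescribed by the slides):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 25)   # hypothetical data set with 25 input variables

pca = PCA(n_components=10)    # keep the 10 strongest components
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)        # (100, 10)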
Dummy Variables

• Variables can be turned into dummy variables (figure). Dummy


variables can only take two values: true(1) or false(0).
• They’re used to indicate the absence of a categorical effect
that may explain the observation. In this case you’ll make
separate columns for the classes stored in one variable and
indicate it with 1 if the class is present and 0 otherwise.
• An example is turning one column named Weekdays into the
columns Monday through Sunday. You use an indicator to show
if the observation was on a Monday; you put 1 on Monday and 0
elsewhere.
• Turning variables into dummies is a technique that’s used in
modeling and is popular with, but not exclusive to, economists.
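A minimal sketch of the weekday example with pandas:

import pandas as pd

df = pd.DataFrame({"weekday": ["Monday", "Tuesday", "Monday", "Sunday"]})

# One column per weekday: 1 if the observation fell on that day, 0 otherwise
dummies = pd.get_dummies(df["weekday"], prefix="is", dtype=int)
print(dummies.head())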
Dummy Variables
4. Exploratory Data Analysis

• During exploratory data analysis you take a deep dive into


the data (see figure).
• Information becomes much easier to grasp when shown in
a picture, therefore you mainly use graphical techniques
to gain an understanding of your data and the interactions
between variables.
• This phase is about exploring data, so keeping your mind
open and your eyes peeled is essential during the
exploratory data analysis phase.
• The goal isn’t to cleanse the data, but it’s common that
you’ll still discover anomalies you missed before, forcing
you to take a step back and fix them.
4. Exploratory Data Analysis
Exploratory Data Analysis
Brushing and linking

• With brushing and linking you combine and link


different graphs and tables (or views) so changes in
one graph are automatically transferred to the
other graphs.
Brushing and linking
Brushing and linking
Histogram

• In a histogram a variable is cut into discrete


categories and the number of occurrences in each
category are summed up and shown in the graph.
• The boxplot, on the other hand, doesn’t show how
many observations are present but does offer an
impression of the distribution within categories.
• It can show the maximum, minimum, median, and
other characterizing measures at the same time.
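A minimal sketch of both plots with pandas and matplotlib on a made-up numeric column:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

values = pd.Series(np.random.default_rng(1).normal(100, 15, 1000))

fig, (ax1, ax2) = plt.subplots(1, 2)
values.plot.hist(bins=30, ax=ax1, title="Histogram")  # counts per discrete bin
values.plot.box(ax=ax2, title="Boxplot")              # median, quartiles, possible outliers
plt.show()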
Histogram
Boxplot
5. Build the model
Building a model

• With clean data in place and a good understanding


of the content, you’re ready to build models with
the goal of making better predictions, classifying
objects, or gaining an understanding of the system
that you’re modeling.
• This phase is much more focused than the
exploratory analysis step, because you know what
you’re looking for and what you want the outcome
to be.
Building a model

• Building a model is an iterative process. The way you


build your model depends on whether you go with
classic statistics or the somewhat more recent
machine learning school, and the type of technique
you want to use.
• Either way, most models consist of the following main
steps:
– Selection of a modeling technique and variables to
enter in the model
– Execution of the model
– Diagnosis and model comparison
Build a model

• You’ll need to select the variables you want to


include in your model and a modeling technique.
• Your findings from the exploratory analysis should
already give a fair idea of what variables will help
you construct a good model.
• Many modeling techniques are available, and
choosing the right model for a problem requires
judgment on your part.
Build a model

• You’ll need to consider model performance and


whether your project meets all the requirements to
use your model, as well as other factors:
– Must the model be moved to a production
environment and, if so, would it be easy to
implement?
– How difficult is the maintenance on the model:
how long will it remain relevant if left
untouched?
– Does the model need to be easy to explain?
Model Execution

• Luckily, most programming languages, such as Python,


already have libraries such as StatsModels or Scikit-learn.
These packages use several of the most popular
techniques.
• Coding a model is a nontrivial task in most cases, so
having these libraries available can speed up the process.
• As you can see in the following code, it’s fairly easy to
use linear regression (figure) with StatsModels or Scikit-
learn.
• Doing this yourself would require much more effort
even for the simple techniques.
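A minimal version of such a fit, here with StatsModels on made-up data (the coefficients are chosen only to mimic the kind of output discussed on the next slides):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.5 + 0.77 * x + rng.normal(scale=0.1, size=100)  # hypothetical linear relationship

X = sm.add_constant(x)      # add the intercept term
model = sm.OLS(y, X).fit()  # ordinary least squares regression
print(model.summary())      # coefficients, R-squared, p-values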
Model Execution
Coding
Evaluation
Evaluation

• Model fit—For this the R-squared or adjusted R-squared


is used. This measure is an indication of the amount of
variation in the data that gets captured by the model.
• Predictor variables have a coefficient—For a linear
model this is easy to interpret. In our example if you add
“1” to x1, it will change y by “0.7658”. It’s easy to see how
finding a good predictor can be your route to a Nobel
Prize even though your model as a whole is rubbish.
• Predictor significance—Coefficients are great, but
sometimes not enough evidence exists to show that the
influence is there. This is what the p-value is about.
Example: KNN Model
Code
Evaluation
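The slides above reference a KNN model example; a comparable minimal sketch with scikit-learn (data and parameters are made up, not those from the slides) looks roughly like this:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print(knn.score(X_test, y_test))                      # accuracy on unseen data
print(confusion_matrix(y_test, knn.predict(X_test)))  # evaluation via a confusion matrix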
Model diagnostic and comparison

• You’ll be building multiple models from which you then choose the best one based on multiple criteria. Working with a holdout sample helps you pick the best-performing model.
• A holdout sample is a part of the data you leave out of the model building so it can be used to evaluate the model afterward. The principle here is simple: the model should work on unseen data.
• You use only a fraction of your data to estimate the model and the other part, the holdout sample, is kept out of the equation. The model is then unleashed on the unseen data and error measures are calculated to evaluate it (see the sketch below).
Cross Validation
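A minimal sketch of a holdout split and k-fold cross-validation with scikit-learn (the model and data are placeholders):

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# Holdout: keep 20% of the data out of model building for later evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))  # performance on unseen data

# Cross-validation: repeat the holdout idea k times over different folds
print(cross_val_score(LinearRegression(), X, y, cv=5))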
6. Presentation and automation

• After you’ve successfully analyzed the data and


built a well-performing model, you’re ready to
present your findings to the world (figure).
• This is an exciting part; all your hours of hard work
have paid off and you can explain what you found
to the stakeholders.
6. Presentation and automation
Presentation

• Sometimes people get so excited about your work that you’ll


need to repeat it over and over again because they value the
predictions of your models or the insights that you produced.
For this reason, you need to automate your models.
• This doesn’t always mean that you have to redo all of your
analysis all the time.
• Sometimes it’s sufficient that you implement only the model
scoring; other times you might build an application that
automatically updates reports, Excel spreadsheets, or
PowerPoint presentations.
• The last stage of the data science process is where your soft
skills will be most useful, and yes, they’re extremely important.
Summary

• Setting the research goal—Defining the what, the why, and the how of your
project in a project charter.
• Retrieving data—Finding and getting access to data needed in your project.
This data is either found within the company or retrieved from a third party.
• Data preparation—Checking and remediating data errors, enriching the
data with data from other data sources, and transforming it into a suitable
format for your models.
• Data exploration—Diving deeper into your data using descriptive statistics
and visual techniques.
• Data modeling—Using machine learning and statistical techniques to
achieve your project goal.
• Presentation and automation—Presenting your results to the stakeholder
and industrializing your analysis process for repetitive reuse and
integration with other tools.
Thank you
This presentation is created using LibreOffice Impress 5.1.6.2, can be used freely as per GNU General Public License

/mITuSkillologies @mitu_group /company/mitu-skillologies MITUSkillologies

Web Resources
https://mitu.co.in
http://tusharkute.com

[email protected]
[email protected]
Data Science

Tushar B. Kute,
http://tusharkute.com
Objectives

• Defining data science and big data


• Recognizing the different types of data
• Gaining insight into the data science
process
Data All Around

• Lots of data is being collected and warehoused


– Web data, e-commerce
– Financial transactions, bank/credit
transactions
– Online trading and purchasing
– Social Network
– Cloud
Data and Big Data

• “90% of the world’s data was generated in the last few


years.”
• Due to the advent of new technologies, devices, and
communication means like social networking sites, the
amount of data produced by mankind is growing rapidly
every year.
• The amount of data produced by us from the beginning of
time till 2003 was 5 billion gigabytes. If you pile up the
data in the form of disks it may fill an entire football field.
• The same amount was created every two days in 2011,
and every six minutes in 2016. This rate is still growing
enormously.
Big Data Definition

• No single standard definition…
• “Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…
What is Big Data

• Big Data is a collection of large datasets


that cannot be processed using
traditional computing techniques.
• It is not a single technique or a tool,
rather it involves many areas of business
and technology.
Big Data

• Big Data is any data that is expensive to manage


and hard to extract value from
– Volume
• The size of the data
– Velocity
• The latency of data processing relative to the
growing demand for interactivity
– Variety and Complexity
• The diversity of sources, formats, quality,
structures.
Big Data
Characteristics of Big Data: Volume

• Data Volume
– 44x increase from 2009 to 2020
– From 0.8 zettabytes to 35 zettabytes
• Data volume is increasing exponentially: an exponential increase in collected/generated data
Computer Memory Units
Characteristics of Big Data: Variety

• Various formats, types, and structures
• Text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc.
• Static data vs. streaming data
• A single application can be generating/collecting many types of data
• To extract knowledge, all these types of data need to be linked together
Characteristics of Big Data: Velocity

• Data is being generated fast and needs to be processed fast
• Online Data Analytics
• Late decisions mean missed opportunities
• Examples
– E-Promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you.
– Healthcare monitoring: sensors monitoring your activities and body; any abnormal measurements require immediate reaction.
Big Data: 3 Vs
Big Data: The 4th V
What Comes Under Big Data?

• Black Box Data: It is a component of helicopters, airplanes, jets, etc. It captures the voices of the flight crew, recordings of microphones and earphones, and the performance information of the aircraft.
• Social Media Data: Social media such as Facebook and Twitter
hold information and the views posted by millions of people
across the globe.
• Stock Exchange Data: The stock exchange data holds
information about the ‘buy’ and ‘sell’ decisions made on a
share of different companies made by the customers.
• Power Grid Data: The power grid data holds information
consumed by a particular node with respect to a base station.
What Comes Under Big Data?

• Transport Data: Transport data includes


model, capacity, distance and availability of a
vehicle.
• Search Engine Data: Search engines retrieve
lots of data from different databases.
• Structured data: Relational data.
• Semi Structured data: XML data.
• Unstructured data: Word, PDF, Text, Media
Logs.
Benefits of Big Data

• Using the information kept in social networks like Facebook, marketing agencies are learning about the response to their campaigns, promotions, and other advertising mediums.
• Using the information in the social media like
preferences and product perception of their
consumers, product companies and retail
organizations are planning their production.
• Using the data regarding the previous medical
history of patients, hospitals are providing better
and quick service.
Big Data Technologies

• Operational Big data


• Analytical Big data
Operational Big Data

• These include systems like MongoDB that


provide operational capabilities for real-
time, interactive workloads where data is
primarily captured and stored.
• NoSQL Big Data systems are designed to
take advantage of new cloud computing
architectures that have emerged over the
past decade to allow massive computations
to be run inexpensively and efficiently.
Analytical Big Data

• These include systems like Massively Parallel


Processing (MPP) database systems and
MapReduce that provide analytical capabilities for
retrospective and complex analysis that may touch
most or all of the data.
• MapReduce provides a new method of analyzing
data that is complementary to the capabilities
provided by SQL, and a system based on
MapReduce that can be scaled up from single
servers to thousands of high and low end machines.
Who generates Big Data?

• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Sensor technology and networks (measuring all kinds of data)
• Mobile devices (tracking all objects all the time)
Big Data generation models

• The model of generating/consuming data has changed:
– Old model: few companies are generating data, all others are consuming data
– New model: all of us are generating data, and all of us are consuming data
Challenges in Big Data

• The major challenges associated with big data


are as follows:
– Capturing data
– Curation
– Storage
– Searching
– Sharing
– Transfer
– Analysis
– Presentation
Types of Data

• Relational Data
(Tables/Transaction/Legacy Data)
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data
• Social Network, Semantic Web (RDF), …
• Streaming Data
What to do with this data?

• Aggregation and Statistics


– Data warehousing and OLAP
• Indexing, Searching, and Querying
– Keyword based search
– Pattern matching (XML/RDF)
• Knowledge discovery
– Data Mining
– Statistical Modeling
What is Data Science ?

• An area that manages, manipulates, extracts,


and interprets knowledge from a tremendous amount of data.
• Data science (DS) is a multidisciplinary field of
study with goal to address the challenges in big
data.
• Data science principles apply to all data – big
and small.
What is Data Science ?

• Theories and techniques from many fields and disciplines


are used to investigate and analyze a large amount of data
to help decision makers in many industries such as science,
engineering, economics, politics, finance, and education
– Computer Science
• Pattern recognition, visualization, data warehousing,
High performance computing, Databases, AI
– Mathematics
• Mathematical Modeling
– Statistics
• Statistical and Stochastic modeling, Probability.
Data Science Disciplines
Real Life Examples

• Internet Search
• Digital Advertisements (Targeted Advertising and re-
targeting)
• Recommender Systems
• Image Recognition
• Speech Recognition
• Gaming
• Price Comparison Websites
• Airline Route Planning
• Fraud and Risk Detection
• Delivery logistics
Internet Search
Targeting Advertisement
Recommender System
Image Recognition
Speech Recognition
Computer Games
Price Comparison Website
Airline Route Planning
Fraud Detection
Delivery Logistics
Facets of Data

• In data science and big data you’ll come across many


different types of data, and each of them tends to
require different tools and techniques. The main
categories of data are these:
– Structured
– Unstructured
– Natural language
– Machine-generated
– Graph-based
– Audio, video, and images
– Streaming
Structured Data

• Structured data is data that depends on a data


model and resides in a fixed field within a record.
• As such, it’s often easy to store structured data in tables within databases or Excel files. SQL, or Structured Query Language, is the preferred way to manage and query data that resides in databases.
• You may also come across structured data that
might give you a hard time storing it in a
traditional relational database.
Structured Data
Unstructured Data

• Unstructured data is data that isn’t easy to fit into a data


model because the content is context-specific or varying.
One example of unstructured data is your regular email.
• Although email contains structured elements such as the
sender, title, and body text, it’s a challenge to find the
number of people who have written an email complaint
about a specific employee because so many ways exist to
refer to a person, for example.
• The thousands of different languages and dialects out
there further complicate this.
• A human-written email, as shown in next figure, is also a
perfect example of natural language data.
Unstructured Data
Natural Language

• Natural language is a special type of unstructured data;


it’s challenging to process because it requires knowledge
of specific data science techniques and linguistics.
• The natural language processing community has had
success in entity recognition, topic recognition,
summarization, text completion, and sentiment analysis,
but models trained in one domain don’t generalize well to
other domains.
• Even state-of-the-art techniques aren’t able to decipher
the meaning of every piece of text. This shouldn’t be a
surprise though: humans struggle with natural language
as well. It’s ambiguous by nature.
Machine Generated Data

• Machine-generated data is information that’s automatically


created by a computer, process, application, or other machine
without human intervention.
• Machine-generated data is becoming a major data resource
and will continue to do so. Wikibon has forecast that the
market value of the industrial Internet (a term coined by
Frost & Sullivan to refer to the integration of complex
physical machinery with networked sensors and software)
will be approximately $540 billion in 2020.
• IDC (International Data Corporation) has estimated there will
be 26 times more connected things than people in 2020. This
network is commonly referred to as the internet of things.
Machine Generated Data
Graph or Network Data

• “Graph data” can be a confusing term because any data can


be shown in a graph.
• “Graph” in this case points to mathematical graph theory. In
graph theory, a graph is a mathematical structure to model
pair-wise relationships between objects. Graph or network
data is, in short, data that focuses on the relationship or
adjacency of objects.
• The graph structures use nodes, edges, and properties to
represent and store graphical data. Graph-based data is a
natural way to represent social networks, and its structure
allows you to calculate specific metrics such as the influence
of a person and the shortest path between two people.
Graph or Network Data

• Examples of graph-based data can be found on many social media websites. For instance, on LinkedIn you can see who you know at which company.
• Your follower list on Twitter is another example of graph-based
data. The power and sophistication comes from multiple,
overlapping graphs of the same nodes. For example, imagine
the connecting edges here to show “friends” on Facebook.
• Imagine another graph with the same people which connects
business colleagues via LinkedIn.
• Imagine a third graph based on movie interests on Netflix.
Overlapping the three different-looking graphs makes more
interesting questions possible.
Graph or Network Data
Audio, Video and Image

• Audio, image, and video are data types that pose specific
challenges to a data scientist.
• Tasks that are trivial for humans, such as recognizing
objects in pictures, turn out to be challenging for
computers. MLBAM (Major League Baseball Advanced
Media) announced in 2014 that they’ll increase video
capture to approximately 7 TB per game for the purpose
of live, in-game analytics.
• High-speed cameras at stadiums will capture ball and
athlete movements to calculate in real time, for example,
the path taken by a defender relative to two baselines.
Audio, Video and Image

• Recently a company called DeepMind succeeded at


creating an algorithm that’s capable of learning how
to play video games.
• This algorithm takes the video screen as input and
learns to interpret everything via a complex process
of deep learning. It’s a remarkable feat that
prompted Google to buy the company for their own
Artificial Intelligence ( AI ) development plans.
• The learning algorithm takes in data as it’s produced
by the computer game; it’s streaming data.
Streaming Data

• While streaming data can take almost any of the


previous forms, it has an extra property.
• The data flows into the system when an event
happens instead of being loaded into a data store
in a batch.
• Although this isn’t really a different type of data,
we treat it here as such because you need to adapt
your process to deal with this type of information.
• Examples are the “What’s trending” on Twitter, live
sporting or music events, and the stock market.
Thank you
This presentation is created using LibreOffice Impress 5.1.6.2, can be used freely as per GNU General Public License

/mITuSkillologies @mitu_group /company/mitu- MITUSkillologies


skillologies

Web Resources
https://mitu.co.in
http://tusharkute.com

[email protected]
[email protected]

You might also like