
UNIT 2 DATA SCIENCE

Overview of Data Science


• Data science is the entire process of gathering actionable insights from raw data; it involves concepts such as statistical analysis, data analysis, machine learning algorithms, data modeling, and data preprocessing.

• Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.

• It involves the use of techniques from statistics, data analysis, machine learning, and computer science to extract insights and knowledge from data.

• Data science can be applied in a wide range of fields, including business, healthcare, finance, and government, among others.

• The goal of data science is to turn raw data into actionable insights that can
inform decision-making and improve outcomes.

• These ideas map directly onto how data science works in practice, as described below.


History of Data Science
• 1962 – Inception - The Future of Data Analysis (argued for the importance of data analysis as a science in its own right, rather than as a branch of mathematics)

• 1974 - Concise Survey of Computer Methods (surveys the contemporary methods of data processing in various applications)

• 1974 – 1980 - International Association for Statistical Computing (links traditional statistical methodology with modern computer technology to extract useful information and knowledge from the data)

• 1980-1990 - Knowledge Discovery in Databases (annual conference on knowledge discovery and data mining)
• 1990-2000
– a. Database Marketing (explains how big organizations use customer data to predict the likelihood of a customer buying a specific product or not)
– b. International Federation of Classification Societies (the term "Data Science" was used at a conference held in Japan)

• 2000-2010
– a. Data Science (proposed as an expansion of the field of statistics, coining the term Data Science)
– b. Statistical Modeling (two cultures in the use of statistical modeling to reach conclusions from data)
– c. Data Science Journal (focused on management of data and databases in science and technology)

• 2010-Present
– a. Data Everywhere (new professional has arrived – a data scientist)
How does Data Science Work?
• The working of data science can be explained as
follows:
– Raw data is gathered from various sources that explain
the business problem.
– Using various statistical analysis, and machine
learning approaches, data modeling is performed to
get the optimum solutions that best explain the
business problem.
– Actionable insights that serve as a solution for the business problem are gathered through data science.
• For example, a sales team can follow this approach to get an optimal solution using Data Science:
– Gather the previous data on the sales that were
closed.
– Use statistical analysis to find out the patterns that
were followed by the leads that were closed.
– Use machine learning to get actionable insights for
finding out potential leads.
– Use the new data on sales leads to segregate the potential leads that are highly likely to be closed.
Data Science Life Cycle
• Formulating a Business Problem - reducing the
wastage of products
• Data Extraction, Transformation, Loading - create a
data pipeline
• Data Preprocessing - create meaningful data
• Data Modeling - Choosing the best algorithms based on
evidence
• Gathering Actionable Insights - gathering insights from
the said problem statement
• Solutions for the Business Problem - solving the problem using evidence-based information
USES OF DATA SCIENCE:
• Business: Data science can be used to analyze customer data, predict
market trends, and optimize business operations.
• Healthcare: Data science can be used to analyze medical data and
identify patterns that can aid in diagnosis, treatment, and drug discovery.
• Finance: Data science can be used to identify fraud, analyze financial
markets, and make investment decisions.
• Social Media: Data science can be used to understand user behavior,
recommend content, and identify influencers.
• Internet of things: Data science can be used to analyze sensor data from IoT
devices and make predictions about equipment failures, traffic patterns, and
more.
• Natural Language Processing: Data science can be used to make computers
understand human language, process large amounts of text or speech data
and make predictions.
Applications of Data Science:
– Internet Search Results (Google)
– Recommendation Engine (Spotify)
– Intelligent Digital Assistants (Google Assistant)
– Autonomous Driving Vehicle (Waymo)
– Spam Filter (Gmail)
– Abusive Content and Hate Speech Filter (Facebook)
– Robotics (Boston Dynamics)
– Automatic Piracy Detection (YouTube)
Use cases in organizations
• customer analytics
• fraud detection
• risk management
• stock trading
• targeted advertising
• website personalization
• customer service
• predictive maintenance
• logistics and supply chain management
• image recognition
• speech recognition
• natural language processing
• cybersecurity
• medical diagnosis
Data science team
• Data engineer: Responsibilities include setting up data pipelines and
aiding in data preparation and model deployment, working closely with data
scientists.
• Data analyst: This is a lower-level position for analytics professionals who
don't have the experience level or advanced skills that data scientists do.
• Machine learning engineer: This programming-oriented job involves
developing the machine learning models needed for data science applications.
• Data visualization developer: This person works with data scientists to create
visualizations and dashboards used to present analytics results to business
users.
• Data translator: Also called an analytics translator, it's an emerging role that
serves as a liaison to business units and helps plan projects and communicate
results.
• Data architect: A data architect designs and oversees the implementation
of the underlying systems used to store and manage data for analytics uses.
Data science tools and platforms
• Numerous tools are available for data scientists to use in
the analytics process, including both commercial and open
source options:
– Data platforms and analytics engines, such as Spark, Hadoop and
NoSQL databases;
– Programming languages, such as Python, R, Julia, Scala and SQL;
– Statistical analysis tools like SAS and IBM SPSS;
– Machine learning platforms and libraries, including
TensorFlow, Weka, Scikit-learn, Keras and PyTorch;
– Jupyter Notebook, a web application for sharing documents with code, equations and other information; and
– Data visualization tools and libraries, such as Tableau, D3.js and Matplotlib.
List of Data Science Tools
Algorithms.io
• This tool is a machine learning (ML) resource that takes raw data and shapes it into real-time insights and actionable events.
• Advantages:
– It's on a cloud platform, so it has all the SaaS
advantages of scalability, security, and infrastructure
– Makes machine learning simple and accessible to
developers and companies
Apache Hadoop
• This open-source framework creates simple programming models and
distributes extensive dataset processing across thousands of computer
clusters. Hadoop works equally well for research and production
purposes. Hadoop is perfect for high-level computations.
• Advantages:
– Open-source
– Highly scalable
– It has many modules available
– Failures are handled at the application layer
Apache Spark
• Also called "Spark," this is an all-powerful analytics engine and one of the most widely used data science tools. It is known for offering lightning-fast cluster computing. Spark accesses varied data sources such as Cassandra, HDFS, HBase, and S3. It can also easily handle large datasets.
• Advantages:
– Over 80 high-level operators simplify the process of parallel app
building
– Can be used interactively from the Scala, Python, and R shells
– Advanced DAG execution engine supports in-memory computing
and acyclic data flow
BigML
• This tool is another top-rated data science resource that
provides users with a fully interactive, cloud-based GUI
environment, ideal for processing ML algorithms. You
can create a free or premium account depending on
your needs, and the web interface is easy to use.
• Advantages:
– An affordable resource for building complex machine
learning solutions
– Takes predictive data patterns and turns them into
intelligent, practical applications usable by anyone
– It can run in the cloud or on-premises
D3.js
• D3.js is an open-source JavaScript library that lets
you make interactive visualizations on your web
browser. It emphasizes web standards to take full
advantage of all of the features of modern
browsers, without being bogged down with a
proprietary framework.
• Advantages:
– D3.js is based on the very popular JavaScript
– Ideal for client-side Internet of Things (IoT) interactions
– Useful for creating interactive visualizations
Data Robot
• This tool is described as an advanced platform for
automated machine learning. Data scientists,
executives, IT professionals, and software
engineers use it to help them build better quality
predictive models, and do it faster.
• Advantages:
– With just a single click or line of code, you can train,
test, and compare many different models.
– It features Python SDK and APIs
– It comes with a simple model deployment process
Excel
• Originally developed by Microsoft for spreadsheet calculations, it has gained widespread use as a tool for data processing, visualization, and sophisticated calculations.
• Advantages:
– You can sort and filter your data with one click
– Advanced Filtering function lets you filter data
based on your favorite criteria
– Well-known and found everywhere
ForecastThis
• This helps investment managers, data
scientists, and quantitative analysts to use
their in-house data to optimize their complex
future objectives and create robust forecasts.
• Advantages:
– Easily scalable to fit any size challenge
– Includes robust optimization algorithms
– Simple spreadsheet and API plugins
Google BigQuery
• This is a very scalable, serverless data
warehouse tool created for productive data
analysis. It uses Google's infrastructure-based
processing power to run super-fast SQL
queries against append-only tables.
• Advantages:
– Extremely fast
– Keeps costs down since users only pay for storage and compute usage
– Easily scalable
Java
• Java is the classic object-oriented programming language that's been around for years. It's simple, architecture-neutral, secure, platform-independent, and object-oriented.
• Advantages:
– Suitable for large data science projects if used with Java 8 and lambdas
– Java has an extensive suite of tools and libraries that
are perfect for machine learning and data science
– Easy to understand
MATLAB
• MATLAB is a high-level language coupled with an
interactive environment for numerical computation,
programming, and visualization. MATLAB is a powerful
tool, a language used in technical computing, and ideal
for graphics, math, and programming.
• Advantages:
– Intuitive use
– It analyzes data, creates models, and develops algorithms
– With just a few simple code changes, it scales analyses
to run on clouds, clusters, and GPUs
MySQL
• Another familiar tool that enjoys widespread
popularity, MySQL is one of the most popular
open-source databases available today. It's ideal
for accessing data from databases.
• Advantages:
– Users can easily store and access data in a structured
manner
– Works with programming languages like Java
– It's an open-source relational database management system
NLTK
• Short for Natural Language Toolkit, this open-source Python library works with human language data and is well liked for building language-processing programs. NLTK is ideal for rookie data scientists and students.
• Advantages:
– Comes with a suite of text processing libraries
– Offers easy-to-use interfaces to over 50 corpora and lexical resources
– It has an active discussion forum that provides a
wealth of new information
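• A minimal, illustrative Python sketch of NLTK's text-processing interfaces is shown below (it assumes NLTK is installed and that the tokenizer data has been downloaded once via nltk.download):

import nltk
# nltk.download('punkt')  # one-time download of the tokenizer data (name may vary by NLTK version)

text = "Data science turns raw data into actionable insights."
tokens = nltk.word_tokenize(text)    # split the text into word tokens
freq = nltk.FreqDist(tokens)         # count how often each token appears
print(tokens)
print(freq.most_common(3))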
Rapid Miner
• This data science tool is a unified platform that
incorporates data prep, machine learning, and model
deployment for making data science processes easy and
fast. It enjoys heavy use in the manufacturing,
telecommunication, utility, and banking industries.
• Advantages:
– All of the resources are located on one platform.
– GUI is based on a block-diagram process, simplifying these
blocks into a plug-and-play environment.
– Uses a visual workflow designer to model machine learning
algorithms.
SAS
• This data science tool is designed especially for statistical
operations. It is a closed-source proprietary software tool that
specializes in handling and analyzing massive amounts of data for
large organizations. It's well-supported by its company and very reliable. Still, it's a case of getting what you pay for, because SAS is expensive and best suited for large companies and organizations.
• Advantages:
– Numerous analytics functions covering everything from social media
to automated forecasting to location data
– It features interactive dashboards and reports, letting the user go
straight from reporting to analysis
– Contains advanced data visualization techniques such as auto
charting to present compelling results and data
Tableau
• Tableau is a Data Visualization software that is
packed with powerful graphics to make
interactive visualizations. It is focused on industries
working in the field of business intelligence.
• The most important aspect of Tableau is its
ability to interface with databases, spreadsheets,
OLAP (Online Analytical Processing) cubes, etc.
Along with these features, Tableau can also visualize geographical data, plotting longitudes and latitudes on maps.
TensorFlow
• TensorFlow has become a standard tool for Machine
Learning. It is widely used for advanced machine learning
algorithms like Deep Learning. Developers named
TensorFlow after Tensors which are multidimensional arrays.
• It is an open-source and ever-evolving toolkit which is
known for its performance and high computational
abilities. TensorFlow can run on both CPUs and GPUs and
has recently emerged on more powerful TPU platforms.
• This gives it an unprecedented edge in terms of the
processing power of advanced machine learning
algorithms.
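• The short Python sketch below is only illustrative of the idea that TensorFlow operates on tensors (multidimensional arrays); the same code runs on a CPU or, if one is available, a GPU:

import tensorflow as tf

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])   # a 2x2 tensor
b = tf.constant([[5.0, 6.0], [7.0, 8.0]])
c = tf.matmul(a, b)                         # matrix multiplication of the two tensors
print(c.numpy())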
Weka
• Weka or Waikato Environment for Knowledge
Analysis is a machine learning software written in
Java. It is a collection of various Machine Learning
algorithms for data mining. Weka consists of
various machine learning tools like classification,
clustering, regression, visualization and data
preparation.
• It is open-source GUI software that allows easier implementation of machine learning algorithms through an interactive platform.
• Examples of the supported algorithm types include segmentation, classification, validation, regression, and recommendation.
Introduction
• The implementation of Data Science to any
problem requires a set of skills. Machine Learning is
an integral part of this skill set.
• For doing Data Science, you must know the various
Machine Learning algorithms used for solving
different types of problems, as a single algorithm
cannot be the best for all types of use cases. These
algorithms find an application in various tasks like
prediction, classification, clustering, etc. from the
dataset under consideration.
3 main categories
• Supervised Algorithms: The training data set has inputs as well
as the desired output. During the training session, the model will
adjust its variables to map inputs to the corresponding output.
• Unsupervised Algorithms: In this category, there is no target outcome. The algorithms cluster the data set into different groups.
• Reinforcement Algorithms: These algorithms are trained to take decisions. Based on the success or error of each decision's outcome, the algorithm trains itself further.
• Eventually, through experience, the algorithm is able to give good predictions.
Linear Regression
• The linear regression method is used for predicting the value of the dependent variable by using the values of the independent variable.
• The linear regression model is suitable for predicting the value of a
continuous quantity.
• The linear regression model represents the relationship between the
input variables (x) and the output variable (y) of a dataset in terms
of a line given by the equation,
y = m*x + c
• Where y is the dependent variable and x is the independent variable. Basic calculus is applied to find the values of m (the slope) and c (the intercept) from the given data set. The main aim of this method is to find the values of m and c that give the best-fit line, i.e. the line that covers or is nearest to most of the data points.
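• A minimal Python sketch of this idea, assuming scikit-learn is installed (the data points are made up so that y = 2x + 1; coef_ corresponds to m and intercept_ to c):

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1], [2], [3], [4], [5]])   # independent variable
y = np.array([3, 5, 7, 9, 11])            # dependent variable (here y = 2x + 1)

model = LinearRegression().fit(x, y)
print("m (slope):", model.coef_[0])        # close to 2.0
print("c (intercept):", model.intercept_)  # close to 1.0
print("prediction for x = 6:", model.predict([[6]])[0])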
Logistic Regression
• Linear regression is used for representing the relationship between continuous values. Logistic regression, in contrast, works on discrete values.
• Logistic regression finds the most common application in
solving binary classification problems, that is, when there
are only two possibilities of an event, either the event will
occur or it will not occur (0 or 1).
• Thus, in Logistic Regression, we convert the predicted
values into such values that lie in the range of 0 to 1 by
using a non-linear transform function which is called a
logistic function.
• We generate this with the help of the logistic function:

1 / (1 + e^-x)
• Here, e represents the base of the natural logarithm, and we obtain an S-shaped curve with values between 0 and 1. The equation for logistic regression is written as:
y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
• Here, b0 and b1 are the coefficients of the input
x. These coefficients are estimated using the data
through “maximum likelihood estimation”.
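• A minimal Python sketch of the logistic function, with illustrative (not estimated) coefficients b0 and b1, showing how a real-valued input is squashed into the 0-1 range:

import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))   # the S-shaped logistic function

b0, b1 = -3.0, 1.5     # illustrative coefficients; in practice they are found
                       # by maximum likelihood estimation from data
x = 2.0
p = logistic(b0 + b1 * x)             # interpreted as P(y = 1 | x)
print(round(p, 3))                    # classify as 1 if p >= 0.5, else 0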
Decision Trees
• This algorithm categorizes the population into several sets based on some chosen properties (independent variables) of the population. Usually, this algorithm is used to solve classification problems. Categorization is done by using techniques such as Gini, Chi-square, entropy, etc.
• Decision trees help in solving both classification and prediction
problems. It makes it easy to understand the data for better
accuracy of the predictions. Each node of the Decision tree
represents a feature or an attribute, each link represents a decision
and each leaf node holds a class label, that is, the outcome.
• The drawback of decision trees is that they suffer from the problem of overfitting.
• Two algorithms are most commonly used for implementing decision trees:
– ID3 (Iterative Dichotomiser 3) Algorithm: uses entropy and information gain as the decision metric.
– CART (Classification and Regression Tree) Algorithm: uses the Gini index as the decision metric.
• As an example, consider a decision tree that evaluates scenarios where people want to play football; a small code sketch of this idea follows.
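• A minimal Python sketch of the play-football idea with scikit-learn (the toy data and feature encoding are invented for illustration; criterion="entropy" mirrors ID3, while the default "gini" mirrors CART):

from sklearn.tree import DecisionTreeClassifier

# Features: [outlook (0 = sunny, 1 = overcast, 2 = rainy), windy (0 = no, 1 = yes)]
X = [[0, 0], [0, 1], [1, 0], [2, 0], [2, 1], [1, 1]]
y = [1, 0, 1, 1, 0, 1]                  # 1 = play football, 0 = do not play

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)
tree.fit(X, y)
print(tree.predict([[0, 0]]))           # prediction for a sunny, non-windy day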
Support Vector Machine (SVM)
• Support Vector Machine or SVM comes under the category
of supervised Machine Learning algorithms and finds an
application in both classification and regression problems. It
is most commonly used for classification problems and
classifies the data points by using a hyperplane.
• The first step of this Data Science algorithm involves plotting all
the data items as individual points in an n-dimensional graph.
• Here, n is the number of features and the value of each
individual feature is the value of a specific coordinate. Then
we find the hyperplane that best separates the two classes
for classifying them.
• Finding the correct hyperplane plays the most
important role in classification. The data points which
are closest to the separating hyperplane are the support
vectors.
• Let us consider the following example scenarios to understand how you can identify the right hyperplane. The basic principle for selecting the best hyperplane is that you have to choose the hyperplane that separates the two classes very well.
• In the first scenario, hyperplane B classifies the data points very well. Thus, B will be the right hyperplane.
• In the second scenario, all three hyperplanes separate the two classes properly. In such cases, we have to select the hyperplane with the maximum margin. Since hyperplane B has the maximum margin, it will be the right hyperplane.
• In the third scenario, hyperplane B has the maximum margin but it is not classifying the two classes accurately. Thus, A will be the right hyperplane: accurate separation of the classes comes before maximizing the margin.
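• A minimal Python sketch of a linear SVM with scikit-learn (the two-class points are made up; support_vectors_ holds the points closest to the separating hyperplane):

from sklearn.svm import SVC

X = [[1, 2], [2, 3], [3, 3],      # class 0
     [6, 5], [7, 7], [8, 6]]      # class 1
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([[4, 4]]))      # which side of the hyperplane the new point falls on
print(clf.support_vectors_)       # the support vectors chosen from the data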
Naive Bayes
• The Naive Bayes algorithm helps in building
predictive models. We use this Data Science
algorithm when we want to calculate the probability
of the occurrence of an event in the future. Here, we
have prior knowledge that another event has already
occurred.
• The Naive Bayes algorithm works on the assumption
that each feature is independent and has an
individual contribution to the final prediction.
• Naive Bayes is based on Bayes' theorem, which is represented by:
• P(A|B) = P(B|A) P(A) / P(B)
• Where A and B are two events.
– P(A|B) is the posterior probability i.e. the
probability of A given that B has already occurred.
– P(B|A) is the likelihood i.e. the probability of B
given that A has already occurred.
– P(A) is the class prior probability.
– P(B) is the predictor prior probability.
• Example: Let's understand it using an example. Consider a training data set of weather observations and the corresponding target variable 'Play'. Now, we need to classify whether players will play or not based on the weather conditions. Let's follow the steps below to perform it.
– Step 1: Convert the data set to a frequency table.
– Step 2: Create a Likelihood table by finding the probabilities like Overcast probability =
0.29 and probability of playing is 0.64.
– Step 3: Now, use the Naive Bayesian equation to calculate the posterior
probability for each class. The class with the highest posterior probability is the
outcome of the prediction.
• Problem: Players will play if the weather is sunny. Is this statement correct?
• We can solve it using above discussed method, so
• P(Yes | Sunny) = P( Sunny | Yes) *P(Yes) / P (Sunny)
• Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, and P(Yes) = 9/14 = 0.64.
• Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 ≈ 0.60, which is the higher probability, so the prediction is Yes.
• Naive Bayes uses a similar method to predict the probability of different classes based on various attributes. This algorithm is mostly used in text classification and with problems having multiple classes.
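• A minimal Python sketch that reproduces the arithmetic of the worked example above by applying Bayes' theorem directly to the given frequencies:

p_sunny_given_yes = 3 / 9      # P(Sunny | Yes), the likelihood
p_yes = 9 / 14                 # P(Yes), the class prior probability
p_sunny = 5 / 14               # P(Sunny), the predictor prior probability

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny   # posterior probability
print(round(p_yes_given_sunny, 2))   # 0.6, so "play" is the more likely class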
KNN
• KNN stands for K-Nearest Neighbors. This Data Science algorithm can be applied to both classification and regression problems.
• This is a simple algorithm which predicts the class of an unknown data point from its k nearest neighbors. The value of k is a critical factor here for the accuracy of prediction. The nearest neighbors are determined by calculating distances using basic distance functions like the Euclidean distance.
• The KNN algorithm considers the complete dataset as the training
dataset. After training the model using the KNN algorithm, we try to
predict the outcome of a new data point.
• Here, the KNN algorithm searches the entire data set for identifying the k
most similar or nearest neighbors of that data point. It then predicts the
outcome based on these k instances.
• For finding the nearest neighbors of a data instance, we can use
various distance measures like Euclidean distance, Hamming distance,
etc. To better understand, let us consider the following example.
• Suppose we represent the two classes A and B by circles and squares respectively.
• Let us assume the value of k is 3.
• Now we will first find three data points that are closest to the new data
item and enclose them in a dotted circle. Here the three closest points of
the new data item belong to class A. Thus, we can say that the new data
point will also belong to class A.
• Now you might be wondering how we chose k = 3.
• The selection of the value of k is a very critical task. You should choose a value of k that is neither too small nor too large. One simple approach is to take k = √n, where n is the number of data points.
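• A minimal Python sketch of KNN with scikit-learn, using k = 3 as in the example above (the two-class points are illustrative; the distance measure defaults to Euclidean):

from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 2],      # class A
     [6, 6], [7, 7], [6, 7]]      # class B
y = ["A", "A", "A", "B", "B", "B"]

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[2, 1]]))      # its 3 nearest neighbours are all class A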
K-Means Clustering
• K-means clustering is a type of unsupervised Machine
Learning algorithm.
• Clustering basically means dividing the data set into
groups of similar data items called clusters.
• K means clustering categorizes the data items into k
groups with similar data items.
• For measuring this similarity, we use the Euclidean distance, which is given by
D = √((x1 - x2)^2 + (y1 - y2)^2)
• K means clustering is iterative in nature.
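• A minimal Python sketch of k-means with scikit-learn (illustrative points, k = 2; the cluster centres are refined iteratively using Euclidean distance):

from sklearn.cluster import KMeans

X = [[1, 1], [1, 2], [2, 1],
     [8, 8], [9, 8], [8, 9]]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # cluster label assigned to each data item
print(km.cluster_centers_)   # the k cluster centres found by the iterations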
Artificial Neural Networks
• Neural networks are modeled after the neurons in the human brain. A neural network comprises many layers of neurons that are structured to transmit information from the input layer to the output layer.
• Between the input and the output layer, there are
hidden layers present.
• These hidden layers can be many or just one. A network with no hidden layer is known as a perceptron; a simple network with a single hidden layer is a multilayer perceptron.
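• A minimal Python sketch of a feed-forward network with one hidden layer, using scikit-learn's MLPClassifier on made-up two-class data:

from sklearn.neural_network import MLPClassifier

X = [[1, 1], [1, 2], [2, 1],      # class 0
     [8, 8], [9, 8], [8, 9]]      # class 1
y = [0, 0, 0, 1, 1, 1]

net = MLPClassifier(hidden_layer_sizes=(4,),   # one hidden layer of 4 neurons
                    solver="lbfgs", random_state=1)
net.fit(X, y)                     # information flows input -> hidden -> output
print(net.predict([[2, 2], [9, 9]]))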
Principal Component Analysis (PCA)
• PCA is basically a technique for performing dimensionality reduction of
the datasets with the least effect on the variance of the datasets.
This means removing the redundant features but keeping the
important ones.
• To achieve this, PCA transforms the variables of the dataset into a new
set of variables. This new set of variables represents the principal
components.
• The most important features of these principal components are:
– All the PCs are orthogonal (i.e. they are at a right angle to each other).
– They are created in such a way that each successive component retains a decreasing amount of the variation in the data.
– This means the 1st principal component retains the maximum amount of the variation that was present in the original variables.
• PCA is basically used for summarizing data.
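• A minimal Python sketch of PCA with scikit-learn: three correlated features (generated artificially for illustration) are projected onto two orthogonal principal components, and the first component retains the most variation:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)   # third feature is nearly redundant

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)                 # the new set of variables (PCs)
print(X_reduced.shape)                           # (100, 2)
print(pca.explained_variance_ratio_)             # PC1 explains the largest share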
Random Forests
• Random Forests overcome the overfitting problem of decision trees and help in solving both classification and regression problems. They work on the principle of ensemble learning.
• Ensemble learning methods are based on the idea that a large number of weak learners can work together to give high-accuracy predictions.
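• A minimal Python sketch of a random forest with scikit-learn on the same toy play-football data used earlier (each tree is a weak learner and the forest combines their votes):

from sklearn.ensemble import RandomForestClassifier

X = [[0, 0], [0, 1], [1, 0], [2, 0], [2, 1], [1, 1]]
y = [1, 0, 1, 1, 0, 1]

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)                  # an ensemble of many decision trees
print(forest.predict([[0, 0]]))   # prediction by majority vote of the trees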
