Unit 2 Data Science
• The goal of data science is to turn raw data into actionable insights that can
inform decision-making and improve outcomes.
• 2000-2010
– a. Data Science proposed as a field of statistics; the term "Data Science" was coined
– b. Statistical Modeling: two cultures emerged in the use of statistical modeling to reach conclusions from data
– c. Data Science Journal launched, focused on the management of data and databases in science and technology
• 2010-Present
– a. Data everywhere: a new professional has arrived – the data scientist
How does Data Science Work?
• The working of data science can be explained as follows:
– Raw data is gathered from various sources that describe the business problem.
– Using various statistical analysis and machine learning approaches, data modeling is performed to obtain the optimum solutions that best explain the business problem.
– Actionable insights are gathered that serve as a solution to the business problem.
• For example, a sales team can follow this approach to get an optimal solution using Data Science:
– Gather the previous data on the sales leads that were closed.
– Use statistical analysis to find the patterns followed by the leads that were closed.
– Use machine learning to get actionable insights for finding potential leads.
– Use the new data on sales leads to segregate potential leads that are highly likely to be closed (a minimal sketch of this workflow follows the list).
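As an illustration, here is a minimal Python sketch of that workflow using pandas and scikit-learn. The file names ("historical_leads.csv", "new_leads.csv"), the "closed" label column, and the 0.7 probability cut-off are hypothetical assumptions for the sketch, not part of the original material.

```python
# Minimal sketch: score new sales leads using historical closed/lost leads.
# File names, the "closed" label column, and the 0.7 cut-off are illustrative only.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

history = pd.read_csv("historical_leads.csv")        # past leads with known outcomes
X = history.drop(columns=["closed"])                 # lead attributes (numeric features)
y = history["closed"]                                # 1 = closed, 0 = lost

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print("Hold-out accuracy:", model.score(X_test, y_test))

# Segregate new leads that are highly likely to be closed.
new_leads = pd.read_csv("new_leads.csv")             # same feature columns as X
probabilities = model.predict_proba(new_leads)[:, 1]
hot_leads = new_leads[probabilities > 0.7]
print(hot_leads)
```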
Data Science Life Cycle
• Formulating a Business Problem - e.g., reducing the wastage of products
• Data Extraction, Transformation, Loading - create a
data pipeline
• Data Preprocessing - create meaningful data
• Data Modeling - Choosing the best algorithms based on
evidence
• Gathering Actionable Insights - gathering insights that address the stated problem
• Solutions For the Business Problem - solving the problem using evidence-based information
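A minimal pandas sketch of the extraction, transformation, and preprocessing stages is shown below; the file name and the column names (units_wasted, units_produced, report_date) are hypothetical placeholders tied to the wastage example above.

```python
# Minimal ETL / preprocessing sketch with pandas; file and column names are hypothetical.
import pandas as pd

# Extract: pull the raw data that describes the business problem.
raw = pd.read_csv("product_wastage_raw.csv")

# Transform / preprocess: turn raw records into meaningful, model-ready data.
clean = (
    raw.drop_duplicates()
       .dropna(subset=["units_wasted"])                        # drop rows missing the key measurement
       .assign(report_date=lambda d: pd.to_datetime(d["report_date"]))
)
clean["waste_rate"] = clean["units_wasted"] / clean["units_produced"]

# Load: persist the cleaned table for the modeling stage of the life cycle.
clean.to_csv("product_wastage_clean.csv", index=False)
```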
USES OF DATA SCIENCE:
• Business: Data science can be used to analyze customer data, predict
market trends, and optimize business operations.
• Healthcare: Data science can be used to analyze medical data and
identify patterns that can aid in diagnosis, treatment, and drug discovery.
• Finance: Data science can be used to identify fraud, analyze financial
markets, and make investment decisions.
• Social Media: Data science can be used to understand user behavior,
recommend content, and identify influencers.
• Internet of things: Data science can be used to analyze sensor data from IoT
devices and make predictions about equipment failures, traffic patterns, and
more.
• Natural Language Processing: Data science can be used to make computers
understand human language, process large amounts of text or speech data
and make predictions.
Applications of Data Science:
– Internet Search Results (Google)
– Recommendation Engine (Spotify)
– Intelligent Digital Assistants (Google Assistant)
– Autonomous Driving Vehicle (Waymo)
– Spam Filter (Gmail)
– Abusive Content and Hate Speech Filter (Facebook)
– Robotics (Boston Dynamics)
– Automatic Piracy Detection (YouTube)
Use cases in organizations
• customer analytics
• fraud detection
• risk management
• stock trading
• targeted advertising
• website personalization
• customer service
• predictive maintenance
• logistics and supply chain management
• image recognition
• speech recognition
• natural language processing
• cybersecurity
• medical diagnosis
Data science team
• Data engineer: Responsibilities include setting up data pipelines and
aiding in data preparation and model deployment, working closely with data
scientists.
• Data analyst: This is a lower-level position for analytics professionals who
don't have the experience level or advanced skills that data scientists do.
• Machine learning engineer: This programming-oriented job involves
developing the machine learning models needed for data science applications.
• Data visualization developer: This person works with data scientists to create
visualizations and dashboards used to present analytics results to business
users.
• Data translator: Also called an analytics translator, it's an emerging role that
serves as a liaison to business units and helps plan projects and communicate
results.
• Data architect: A data architect designs and oversees the implementation
of the underlying systems used to store and manage data for analytics uses.
What is Data Science?
• Data science is a field that involves using scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
• It can be used in a variety of industries and applications, as described in the sections above.
Data science tools and platforms
• Numerous tools are available for data scientists to use in
the analytics process, including both commercial and open
source options:
– Data platforms and analytics engines, such as Spark, Hadoop and
NoSQL databases;
– Programming languages, such as Python, R, Julia, Scala and SQL;
– Statistical analysis tools like SAS and IBM SPSS;
– Machine learning platforms and libraries, including
TensorFlow, Weka, Scikit-learn, Keras and PyTorch;
– Jupyter Notebook, a web application for sharing documents with code, equations and other information; and
– Data visualization tools and libraries, such as Tableau, D3.js and Matplotlib.
List of Data Science Tools
Algorithms.io
• This tool is a machine-learning (ML) resource that takes raw data and shapes it into real-time insights and actionable events.
• Advantages:
– It's on a cloud platform, so it has all the SaaS advantages of scalability, security, and infrastructure
– Makes machine learning simple and accessible to
developers and companies
Apache Hadoop
• This open-source framework creates simple programming models and
distributes extensive dataset processing across thousands of computer
clusters. Hadoop works equally well for research and production
purposes. Hadoop is perfect for high-level computations.
• Advantages:
– Open-source
– Highly scalable
– It has many modules available
– Failures are handled at the application layer
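As a sketch of how Hadoop's programming model is typically used, below is the classic word-count example written as Hadoop Streaming scripts in Python. The streaming jar name and job options vary by installation, so the run command in the comment is only indicative.

```python
# mapper.py - Hadoop Streaming word-count mapper (illustrative sketch).
# A typical run looks roughly like:
#   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input in/ -output out/
# (the jar path and options depend on the cluster and are assumptions here).
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")                  # emit one (word, 1) pair per word


# reducer.py - sums the counts for each word; input arrives sorted by key.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")
```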
Apache Spark
• Also called "Spark," this is a powerful analytics engine and one of the most widely used data science tools. It is known for offering lightning-fast cluster computing. Spark accesses varied data sources such as Cassandra, HDFS, HBase, and S3. It can also easily handle large datasets.
• Advantages:
– Over 80 high-level operators simplify the process of parallel app
building
– Can be used interactively from the Scala, Python, and R shells
– Advanced DAG execution engine supports in-memory computing
and acyclic data flow
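A minimal PySpark sketch is shown below; the CSV path and the region/amount column names are hypothetical placeholders.

```python
# Minimal PySpark sketch; the CSV path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-analysis").getOrCreate()

# Spark can read from HDFS, S3, Cassandra, etc.; a local CSV is used here for illustration.
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Lazily-evaluated transformations are executed in parallel across the cluster.
revenue_by_region = (
    sales.groupBy("region")
         .agg(F.sum("amount").alias("total_revenue"))
         .orderBy(F.desc("total_revenue"))
)
revenue_by_region.show()
```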
BigML
• This tool is another top-rated data science resource that provides users with a fully interactive, cloud-based GUI environment, ideal for processing ML algorithms. You can create a free or premium account depending on your needs, and the web interface is easy to use.
• Advantages:
– An affordable resource for building complex machine
learning solutions
– Takes predictive data patterns and turns them into
intelligent, practical applications usable by anyone
– It can run in the cloud or on-premises
D3.js
• D3.js is an open-source JavaScript library that lets
you make interactive visualizations on your web
browser. It emphasizes web standards to take full
advantage of all of the features of modern
browsers, without being bogged down with a
proprietary framework.
• Advantages:
– D3.js is based on the very popular JavaScript
– Ideal for client-side Internet of Things (IoT) interactions
– Useful for creating interactive visualizations
Data Robot
• This tool is described as an advanced platform for
automated machine learning. Data scientists,
executives, IT professionals, and software
engineers use it to help them build better quality
predictive models, and do it faster.
• Advantages:
– With just a single click or line of code, you can train,
test, and compare many different models.
– It features Python SDK and APIs
– It comes with a simple model deployment process
Excel
• Originally developed by Microsoft for
spreadsheet calculations, it has gained
widespread use as a tool for data processing,
visualization, and sophisticated calculations.
• Advantages:
– You can sort and filter your data with one click
– Advanced Filtering function lets you filter data
based on your favorite criteria
– Well-known and found everywhere
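When the same spreadsheet steps need to be repeatable, they can also be scripted; the pandas sketch below mirrors a sort-and-filter workflow. The workbook name, the "amount" column, and the 1000 threshold are hypothetical, and reading or writing .xlsx files with pandas requires the openpyxl package.

```python
# Sorting and filtering spreadsheet data programmatically with pandas (illustrative sketch).
# Reading/writing .xlsx files with pandas requires the openpyxl package.
import pandas as pd

orders = pd.read_excel("orders.xlsx")                     # hypothetical workbook
filtered = orders[orders["amount"] > 1000]                # "advanced filter": amount over 1000
ranked = filtered.sort_values("amount", ascending=False)  # sort descending by amount
ranked.to_excel("large_orders.xlsx", index=False)
```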
ForecastThis
• This helps investment managers, data
scientists, and quantitative analysts to use
their in-house data to optimize their complex
future objectives and create robust forecasts.
• Advantages:
– Easily scalable to fit any size challenge
– Includes robust optimization algorithms
– Simple spreadsheet and API plugins
Google BigQuery
• This is a very scalable, serverless data
warehouse tool created for productive data
analysis. It uses Google's infrastructure-based
processing power to run super-fast SQL
queries against append-only tables.
• Advantages:
– Extremely fast
– Keeps costs down since users need only pay for storage and compute usage
– Easily scalable
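A minimal sketch using the google-cloud-bigquery Python client is shown below; the project, dataset, and table names are placeholders, and it assumes Google Cloud default credentials are already configured in the environment.

```python
# Minimal BigQuery sketch; project/dataset/table names are placeholders.
# Assumes Google Cloud default credentials are configured.
from google.cloud import bigquery

client = bigquery.Client()                                 # uses the default project
sql = """
    SELECT station_id, COUNT(*) AS trips
    FROM `my-project.my_dataset.bike_trips`
    GROUP BY station_id
    ORDER BY trips DESC
    LIMIT 10
"""
for row in client.query(sql).result():                     # runs the query and waits for results
    print(row.station_id, row.trips)
```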
Java
• Java is the classic object-oriented programming language that's been around for years. It's simple, architecture-neutral, secure, platform-independent, and object-oriented.
• Advantages:
– Suitable for large science projects if used with Java 8
with Lambdas
– Java has an extensive suite of tools and libraries that
are perfect for machine learning and data science
– Easy to understand
MATLAB
• MATLAB is a high-level language coupled with an
interactive environment for numerical computation,
programming, and visualization. MATLAB is a powerful
tool, a language used in technical computing, and ideal
for graphics, math, and programming.
• Advantages:
– Intuitive use
– It analyzes data, creates models, and develops algorithms
– With just a few simple code changes, it scales analyses
to run on clouds, clusters, and GPUs
MySQL
• Another familiar tool that enjoys widespread
popularity, MySQL is one of the most popular
open-source databases available today. It's ideal
for accessing data from databases.
• Advantages:
– Users can easily store and access data in a structured
manner
– Works with programming languages like Java
– It's an open-source relational database management
system
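A minimal sketch of querying MySQL from Python with the mysql-connector-python driver follows; the connection details, database, and orders table are placeholders.

```python
# Minimal MySQL access sketch; connection details and table name are placeholders.
# Requires the mysql-connector-python package.
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="analyst", password="secret", database="sales_db"
)
cursor = conn.cursor()
cursor.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
for region, total in cursor.fetchall():
    print(region, total)
cursor.close()
conn.close()
```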
NLTK
• Short for Natural Language Toolkit, this open-source tool is a popular platform for building Python programs that work with human language data. NLTK is ideal for rookie data scientists and students.
• Advantages:
– Comes with a suite of text processing libraries
– Offers over 50 easy-to-use interfaces
– It has an active discussion forum that provides a
wealth of new information
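A minimal NLTK sketch of tokenization and part-of-speech tagging follows; the resource names passed to nltk.download may differ slightly between NLTK versions.

```python
# Minimal NLTK sketch: tokenize text and tag parts of speech.
import nltk

nltk.download("punkt")                          # tokenizer models (one-time download)
nltk.download("averaged_perceptron_tagger")     # POS tagger model

text = "Data science turns raw data into actionable insights."
tokens = nltk.word_tokenize(text)
print(tokens)
print(nltk.pos_tag(tokens))
```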
Rapid Miner
• This data science tool is a unified platform that
incorporates data prep, machine learning, and model
deployment for making data science processes easy and
fast. It enjoys heavy use in the manufacturing,
telecommunication, utility, and banking industries.
• Advantages:
– All of the resources are located on one platform.
– GUI is based on a block-diagram process, simplifying these
blocks into a plug-and-play environment.
– Uses a visual workflow designer to model machine learning
algorithms.
SAS
• This data science tool is designed especially for statistical
operations. It is a closed-source proprietary software tool that
specializes in handling and analyzing massive amounts of data for
large organizations. It's well-supported by its company and very reliable. Still, it's a case of getting what you pay for because SAS is
expensive and best suited for large companies and organizations.
• Advantages:
– Numerous analytics functions covering everything from social media
to automated forecasting to location data
– It features interactive dashboards and reports, letting the user go
straight from reporting to analysis
– Contains advanced data visualization techniques such as auto
charting to present compelling results and data
Tableau
• Tableau is a Data Visualization software that is
packed with powerful graphics to make
interactive visualizations. It is focused on industries
working in the field of business intelligence.
• The most important aspect of Tableau is its ability to interface with databases, spreadsheets, OLAP (Online Analytical Processing) cubes, etc. Along with these features, Tableau can visualize geographical data by plotting longitudes and latitudes on maps.
TensorFlow
• TensorFlow has become a standard tool for Machine
Learning. It is widely used for advanced machine learning
algorithms like Deep Learning. Developers named
TensorFlow after Tensors which are multidimensional arrays.
• It is an open-source and ever-evolving toolkit which is
known for its performance and high computational
abilities. TensorFlow can run on both CPUs and GPUs and
has recently emerged on more powerful TPU platforms.
• This gives it an unprecedented edge in terms of the
processing power of advanced machine learning
algorithms.
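A minimal TensorFlow/Keras sketch of a small binary classifier follows; the data is synthetic and the layer sizes are arbitrary choices for illustration only.

```python
# Minimal TensorFlow/Keras sketch: a small binary classifier on synthetic data.
import numpy as np
import tensorflow as tf

X = np.random.rand(500, 10).astype("float32")    # 500 samples, 10 features
y = (X.sum(axis=1) > 5).astype("float32")        # synthetic binary labels

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))           # [loss, accuracy] on the training data
```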
Weka
• Weka or Waikato Environment for Knowledge
Analysis is a machine learning software written in
Java. It is a collection of various Machine Learning
algorithms for data mining. Weka consists of
various machine learning tools like classification,
clustering, regression, visualization and data
preparation.
• It is an open-source GUI software that allows easier implementation of machine learning algorithms through an interactive platform. Examples of the algorithms available include segmentation, classification, validation, regression, and recommendation.
Introduction
• The implementation of Data Science to any
problem requires a set of skills. Machine Learning is
an integral part of this skill set.
• For doing Data Science, you must know the various
Machine Learning algorithms used for solving
different types of problems, as a single algorithm
cannot be the best for all types of use cases. These
algorithms find an application in various tasks like
prediction, classification, clustering, etc. from the
dataset under consideration.
3 main categories
• Supervised Algorithms: The training data set has inputs as well
as the desired output. During the training session, the model will
adjust its variables to map inputs to the corresponding output.
• Unsupervised Algorithms: In this category, there is no target outcome. The algorithms will cluster the data set into different groups.
• Reinforcement Algorithms: These algorithms are trained to take decisions. Based on the success or error of each decision's outcome, the algorithm trains itself.
• Eventually, through experience, the algorithm will be able to give good predictions.
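The sketch below contrasts the first two categories with scikit-learn: a supervised classifier trained on known labels versus unsupervised clustering with no labels. Reinforcement learning is omitted because it needs an interactive environment; the iris dataset is used purely for illustration.

```python
# Supervised vs. unsupervised learning in one minimal scikit-learn sketch.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the model adjusts its variables to map inputs to the known target labels.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Supervised accuracy:", clf.score(X, y))

# Unsupervised: no target labels; the algorithm groups the data set itself.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("Cluster assignments:", clusters[:10])
```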
Linear Regression
• The linear regression method is used for predicting the value of the dependent variable by using the values of the independent variable.
• The linear regression model is suitable for predicting the value of a
continuous quantity.
• The linear regression model represents the relationship between the
input variables (x) and the output variable (y) of a dataset in terms
of a line given by the equation,
y = m*x + c
• Where y is the dependent variable and x is the independent variable. Basic calculus theories are applied to find the values for m and c using the given data set. The main aim of this method is to find the values of m (the slope) and c (the intercept) that give the best-fit line, i.e., the line that covers or comes nearest to most of the data points.
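A minimal sketch of fitting y = m*x + c with scikit-learn follows; the synthetic data, with true slope 3 and intercept 7, is an assumption made only for illustration.

```python
# Minimal linear regression sketch: fit y = m*x + c to synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.arange(50, dtype=float).reshape(-1, 1)             # independent variable
y = 3.0 * x.ravel() + 7.0 + np.random.normal(0, 2, 50)    # y = m*x + c plus noise

model = LinearRegression().fit(x, y)
print("estimated m:", model.coef_[0])        # slope, close to 3
print("estimated c:", model.intercept_)      # intercept, close to 7
print("prediction at x=100:", model.predict([[100.0]])[0])
```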
Logistic Regression
• Linear Regression is used for representing the relationship between continuous values. Logistic Regression, by contrast, works on discrete values.
• Logistic regression finds the most common application in
solving binary classification problems, that is, when there
are only two possibilities of an event, either the event will
occur or it will not occur (0 or 1).
• Thus, in Logistic Regression, we convert the predicted
values into such values that lie in the range of 0 to 1 by
using a non-linear transform function which is called a
logistic function.
• We generate this with the help of the logistic function:
f(x) = 1 / (1 + e^-x)
• Here, e represents the base of the natural log, and we obtain the S-shaped curve with values between 0 and 1. The equation for logistic regression is written as:
y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
• Here, b0 and b1 are the coefficients of the input
x. These coefficients are estimated using the data
through “maximum likelihood estimation”.
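The sketch below shows the logistic (sigmoid) function and a fitted scikit-learn logistic regression on synthetic one-dimensional data, so the estimated b0 and b1 can be inspected; the data itself is an illustrative assumption.

```python
# Minimal logistic regression sketch: the logistic function and a fitted model.
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Logistic function 1 / (1 + e^-z); squashes any value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-4.0, 0.0, 4.0])))   # ~0.018, 0.5, ~0.982

# Binary classification on synthetic one-dimensional data.
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = (X.ravel() > 0).astype(int)              # event occurs (1) when x > 0
model = LogisticRegression().fit(X, y)
print("b1 (coefficient):", model.coef_[0][0])
print("b0 (intercept):", model.intercept_[0])
print("P(y=1 | x=1.5):", model.predict_proba([[1.5]])[0, 1])
```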
Decision Trees
• This algorithm categorizes the population into several sets based on some chosen properties (independent variables) of the population. Usually, this algorithm is used to solve classification problems. Categorization is done using techniques such as Gini, Chi-square, entropy, etc.
• Decision trees help in solving both classification and prediction problems. They make it easy to understand the data, leading to better accuracy of the predictions. Each node of the decision tree represents a feature or an attribute, each link represents a decision, and each leaf node holds a class label, that is, the outcome.
• The drawback of decision trees is that they suffer from the problem of overfitting (one common mitigation, limiting tree depth, appears in the sketch below).
• Two Data Science algorithms are most commonly used for implementing decision trees:
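Before those algorithms, here is a minimal decision tree sketch with scikit-learn (whose implementation is CART-based rather than ID3); limiting max_depth is shown as one common way to curb the overfitting noted above, and the iris data is used only for illustration.

```python
# Minimal decision tree sketch; limiting max_depth is one common way to reduce overfitting.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X_train, y_train)      # internal nodes test features, leaves hold class labels
print("train accuracy:", tree.score(X_train, y_train))
print("test accuracy:", tree.score(X_test, y_test))
```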
ID3 ( Iterative Dichotomiser 3) Algorithm