
Honours* in Data Science

Ms. Reshma Jadhav

DYPCOE, Akurdi
Unit I Introduction to Data Science
 Defining data science and big data
 Recognizing the different types of data
 Gaining insight into the data science process
 Data Science Process: Overview
 Different steps
 Machine Learning Definition and Relation with Data Science

Data All Around
• Lots of data is being collected and warehoused:
  – Web data, e-commerce
  – Financial transactions, bank/credit transactions
  – Online trading and purchasing
  – Social networks
  – Cloud
Defining data science and big data

 What is Data Science?

 Data Science is about data gathering, analysis, and decision-making.
 Data Science is about finding patterns in data through analysis, and making future predictions.
 By using Data Science, companies are able to make:
 Better decisions (should we choose A or B?)
 Predictive analysis (what will happen next?)
 Pattern discoveries (find patterns, or maybe hidden information, in the data)
Big Data Definition

• No single standard definition…

• "Big Data" is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it.
Big data

Who generates Big Data?

• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Sensor technology and networks (measuring all kinds of data)
• Mobile devices (tracking all objects all the time)
What is Data Science?
• An area that manages, manipulates, extracts, and interprets knowledge from tremendous amounts of data.
• Data science (DS) is a multidisciplinary field of study with the goal of addressing the challenges in big data.
• Data science principles apply to all data – big and small.
What is Data Science?
• Theories and techniques from many fields and disciplines are used to investigate and analyze large amounts of data to help decision makers in many industries such as science, engineering, economics, politics, finance, and education:
  – Computer Science
    • Pattern recognition, visualization, data warehousing, high-performance computing, databases, AI
  – Mathematics
    • Mathematical modeling
  – Statistics
    • Statistical and stochastic modeling, probability
Data Science Disciplines
Real Life Examples
• Internet Search
• Digital Advertisements (Targeted Advertising and re-targeting)
• Recommender Systems
• Image Recognition
• Speech Recognition
• Gaming
• Price Comparison Websites
• Airline Route Planning
• Fraud and Risk Detection
• Delivery logistics
Internet Search
Targeting Advertisement
Recommender System
Image Recognition
• Speech Recognition
Price Comparison Website
Airline Route Planning
Fraud Detection
Facets of Data
• In data science and big data you'll come across many different types of data, and each of them tends to require different tools and techniques. The main categories of data are these:
  – Structured
  – Unstructured
  – Natural language
  – Machine-generated
  – Graph-based
  – Audio, video, and images
  – Streaming
Structured Data
• Structured data is data that depends on a data model and resides in a fixed field within a record.
• As such, it's often easy to store structured data in tables within databases or Excel files. SQL, or Structured Query Language, is the preferred way to manage and query data that resides in databases.
• You may also come across structured data that might give you a hard time storing it in a traditional relational database.
Structured Data
Unstructured Data
• Unstructured data is data that isn't easy to fit into a data model because the content is context-specific or varying. One example of unstructured data is your regular email.
• Although email contains structured elements such as the sender, title, and body text, it's a challenge to find the number of people who have written an email complaint about a specific employee, because so many ways exist to refer to a person, for example.
• The thousands of different languages and dialects out there further complicate this.
• A human-written email, as shown in the next figure, is also a perfect example of natural language data.
Unstructured Data
Natural Language
• Natural language is a special type of unstructured data; it's challenging to process because it requires knowledge of specific data science techniques and linguistics.
• The natural language processing community has had success in entity recognition, topic recognition, summarization, text completion, and sentiment analysis, but models trained in one domain don't generalize well to other domains.
• Even state-of-the-art techniques aren't able to decipher the meaning of every piece of text. This shouldn't be a surprise, though: humans struggle with natural language as well. It's ambiguous by nature.
Machine Generated Data
• Machine-generated data is information that’s
automatically created by a computer, process,
application, or other machine without human
intervention.
• Machine-generated data is becoming a major data
resource and will continue to do so. Wikibon has
forecast that the market value of the industrial
Internet (a term coined by Frost & Sullivan to refer
to the integration of complex physical machinery
with networked sensors and software) will be
approximately $540 billion in 2020.
• IDC (International Data Corporation) has estimated
there will be 26 times more connected things than
people in 2020. This network is commonly referred
to as the internet of things.
Machine Generated Data
Graph or Network Data
• “Graph data” can be a confusing term because any
data can be shown in a graph.
• “Graph” in this case points to mathematical graph
theory. In graph theory, a graph is a mathematical
structure to model pair-wise relationships between
objects. Graph or network data is, in short, data
that focuses on the relationship or adjacency of
objects.
• The graph structures use nodes, edges, and
properties to represent and store graphical data.
Graph-based data is a natural way to represent
social networks, and its structure allows you to
calculate specific metrics such as the influence of a
person and the shortest path between two people.
Graph or Network Data
• Examples of graph-based data can be found on many social media websites. (For instance, on LinkedIn you can see who you know at which company.)
• Your follower list on Twitter is another example of graph-based data. The power and sophistication comes from multiple, overlapping graphs of the same nodes. For example, imagine the connecting edges here to show "friends" on Facebook.
• Imagine another graph with the same people which
connects business colleagues via LinkedIn.
• Imagine a third graph based on movie interests on
Netflix. Overlapping the three different-looking
graphs makes more interesting questions possible.
Graph or Network Data
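
The idea of querying relationships directly can be made concrete with a small sketch. The following is a minimal, illustrative example (not from the slides) using the networkx library; the people and edges are made up.

# Minimal sketch of graph data: nodes are people, edges are relationships.
# All names below are illustrative.
import networkx as nx

social = nx.Graph()
social.add_edges_from([
    ("Alice", "Bob"), ("Bob", "Carol"),
    ("Carol", "Dave"), ("Alice", "Eve"), ("Eve", "Dave"),
])

# Shortest path between two people
print(nx.shortest_path(social, "Alice", "Dave"))   # e.g. ['Alice', 'Eve', 'Dave']

# Degree centrality as a simple proxy for a person's influence
print(nx.degree_centrality(social))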
Audio, Video and Image
• Audio, image, and video are data types that pose
specific challenges to a data scientist.
• Tasks that are trivial for humans, such as
recognizing objects in pictures, turn out to be
challenging for computers. MLBAM (Major
League Baseball Advanced Media) announced in
2014 that they’ll increase video capture to
approximately 7 TB per game for the purpose of
live, in-game analytics.
• High-speed cameras at stadiums will capture ball
and athlete movements to calculate in real time,
for example, the path taken by a defender
relative to two baselines.
Audio, Video and Image
• Recently a company called DeepMind
succeeded at creating an algorithm that’s
capable of learning how to play video
games.
• This algorithm takes the video screen as
input and learns to interpret everything via
a complex process of deep learning. It’s a
remarkable feat that prompted Google to
buy the company for their own Artificial Intelligence (AI) development plans.
• The learning algorithm takes in data as it’s
produced by the computer game; it’s
streaming data.
Streaming Data
• While streaming data can take almost any
of the previous forms, it has an extra
property.
• The data flows into the system when an event happens, instead of being loaded into a data store in a batch.
• Although this isn't really a different type of data, we treat it here as such because you need to adapt your process to deal with this type of information.
• Examples are the "What's trending" topics on Twitter, live sporting or music events, and the stock market.
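
To make the contrast with batch loading concrete, the following minimal sketch (illustrative, not from the slides) simulates events arriving one at a time and updates a running count as each event happens.

# Minimal sketch: process each event as it arrives instead of loading a batch.
import time

def event_stream():
    # Simulate events arriving over time (e.g. tweets or sensor readings).
    for event in ({"tag": "#python"}, {"tag": "#data"}, {"tag": "#python"}):
        yield event
        time.sleep(0.1)   # events do not all arrive at once

counts = {}
for event in event_stream():            # handle the event when it happens
    counts[event["tag"]] = counts.get(event["tag"], 0) + 1
    print("running counts:", counts)    # state is updated incrementally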
Data Science Process
Objectives
• Understanding the flow of a data science process
• Discussing the steps in a data science process
1. Setting research goal
Goal and context of research
• An essential outcome is the research goal that states the purpose of your assignment in a clear and focused manner.
• Understanding the business goals and context is critical for project success.
• Continue asking questions and devising examples until you grasp the exact business expectations, identify how your project fits in the bigger picture, appreciate how your research is going to change the business, and understand how they'll use your results.
Create project charter
• Clients like to know upfront what they’re paying for, so after
you have a good understanding of the business problem, try
to get a formal agreement on the deliverables. All this
information is best collected in a project charter. For any
significant project this would be mandatory.
• A project charter requires teamwork, and your input
covers at least the following:
– A clear research goal
– The project mission and context
– How you’re going to perform your analysis
– What resources you expect to use
– Proof that it's an achievable project, or a proof of concept
– Deliverables and a measure of success
– A timeline
2. Retrieving data
Data Retrieval
• The next step in data science is to retrieve the required data. Sometimes you need to go into the field and design a data collection process yourself, but most of the time you won't be involved in this step.
• Many companies will have already collected and stored the data for you, and what they don't have can often be bought from third parties.
• Don't be afraid to look outside your organization for data, because more and more organizations are making even high-quality data freely available for public and commercial use.
Data Stored in company
• Your first act should be to assess the relevance and quality of the data that's readily available within your company.
• Most companies have a program for maintaining key data, so much of the cleaning work may already be done.
• This data can be stored in official data repositories such as databases, data marts, data warehouses, and data lakes maintained by a team of IT professionals.
• The primary goal of a database is data storage, while a data warehouse is designed for reading and analyzing that data.
Data Stored in company
• Getting access to data is another difficult task.
• Organizations understand the value and sensitivity of data and often have policies in place so everyone has access to what they need and nothing more.
• These policies translate into physical and digital barriers called Chinese walls. These "walls" are mandatory and well regulated for customer data in most countries.
• This is for good reason, too; imagine everybody in a credit card company having access to your spending habits.
• Getting access to the data may take time and involve company politics.
Data Sources
Data Quality Test
• Expect to spend a good portion of your project time doing data correction and cleansing, sometimes up to 80%.
• The retrieval of data is the first time you'll inspect the data in the data science process. Most of the errors you'll encounter during the data-gathering phase are easy to spot, but being too careless will make you spend many hours solving data issues that could have been prevented during data import.
• You'll investigate the data during the import, data preparation, and exploratory phases. The difference is in the goal and the depth of the investigation.
3. Data Preparation
Data Preparation
• The data received from the data retrieval phase is likely to be "a diamond in the rough."
• Your task now is to sanitize and prepare it for use in the modeling and reporting phase.
• Doing so is tremendously important because your models will perform better and you'll lose less time trying to fix strange output.
• It can't be mentioned nearly enough times: garbage in equals garbage out.
• Your model needs the data in a specific format, so data transformation will always come into play.
Data Cleansing
• Data cleansing is a subprocess of the data science process that focuses on removing errors in your data so your data becomes a true and consistent representation of the processes it originates from.
• By "true and consistent representation" we imply that at least two types of errors exist.
• The first type is the interpretation error, such as when you take the value in your data for granted, like saying that a person's age is greater than 300 years.
• The second type of error points to inconsistencies between data sources or against your company's standardized values.
Overview of common errors
Example: Outliers
Data Entry Errors
• Data collection and data entry are error-prone processes.
• They often require human intervention, and because humans are only human, they make typos or lose their concentration for a second and introduce an error into the chain.
• But data collected by machines or computers isn't free from errors either. Errors can arise from human sloppiness, whereas others are due to machine or hardware failure.
• Examples of errors originating from machines are transmission errors or bugs in the extract, transform, and load (ETL) phase.
Example: Frequency Table
Error: Redundant Whitespaces
• Whitespaces tend to be hard to detect but cause errors like other redundant characters would.
• Who hasn't lost a few days in a project because of a bug that was caused by whitespaces at the end of a string?
• You ask the program to join two keys and notice that observations are missing from the output file. After looking for days through the code, you finally find the bug.
• Then comes the hardest part: explaining the delay to the project stakeholders. The cleaning during the ETL phase wasn't well executed, and keys in one table contained a whitespace at the end of a string.
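
The whitespace bug described above can be reproduced and fixed in a few lines. The sketch below is illustrative (made-up column names and data) and uses pandas to strip the keys before joining.

# Minimal sketch of the whitespace bug and its one-line fix.
import pandas as pd

orders    = pd.DataFrame({"customer_id": ["C1 ", "C2"], "amount": [100, 250]})
customers = pd.DataFrame({"customer_id": ["C1", "C2"], "region": ["North", "South"]})

# Naive join silently drops "C1 " because of the trailing whitespace
print(orders.merge(customers, on="customer_id").shape[0])   # 1 row

# Fix: normalize the key column before joining
orders["customer_id"] = orders["customer_id"].str.strip()
print(orders.merge(customers, on="customer_id").shape[0])   # 2 rows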
Impossible values / Sanity Check
• Sanity checks are another valuable type of data check.
• Here you check the value against physically or theoretically impossible values, such as people taller than 3 meters or someone with an age of 299 years.
• Sanity checks can be directly expressed with rules:
  check = 0 <= age <= 120
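
The rule above can be applied directly to a whole column of data. A minimal sketch with pandas and illustrative values:

# Flag physically impossible ages using the sanity-check rule 0 <= age <= 120.
import pandas as pd

ages = pd.Series([25, 42, 299, 31, -3])
valid = ages.between(0, 120)       # the rule, applied element-wise
print(ages[~valid])                # rows that fail the check: 299 and -3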
Outliers
• An outlier is an observation that seems to be distant from other observations or, more specifically, one observation that follows a different logic or generative process than the other observations.
• The easiest way to find outliers is to use a plot or a table with the minimum and maximum values.
• The plot on the top shows no outliers, whereas the plot on the bottom shows possible outliers on the upper side when a normal distribution is expected.
• The normal distribution, or Gaussian distribution, is the most common distribution in natural sciences.
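
A minimal sketch of the two checks mentioned above, using made-up values: inspect the minimum and maximum, and draw a plot that makes the suspicious value stand out.

# Spot a possible outlier via summary statistics and a boxplot.
import pandas as pd
import matplotlib.pyplot as plt

values = pd.Series([5.1, 4.8, 5.3, 5.0, 4.9, 21.7])   # 21.7 looks suspicious

print(values.describe())     # min and max expose the extreme value

values.plot(kind="box")      # a boxplot makes the outlier visible at a glance
plt.show()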
Example:
Example:
Dealing with missing values
• Missing values aren't necessarily wrong, but you still need to handle them separately; certain modeling techniques can't handle missing values.
• They might be an indicator that something went wrong in your data collection or that an error happened in the ETL process.
Handling missing values
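
Common handling strategies (omit the observation, impute a value, or substitute a fixed value) can be expressed in a few pandas calls. The sketch below uses illustrative data; which strategy is appropriate depends on the data and the model.

# Three common ways of handling missing values on a toy data frame.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 42], "income": [50_000, 62_000, np.nan]})

print(df.dropna())                             # option 1: omit incomplete observations
print(df.fillna(df.mean(numeric_only=True)))   # option 2: impute with the column mean
print(df.fillna(0))                            # option 3: substitute a fixed value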
Error: deviation from code book
• Detecting errors in larger data sets against a code book or against standardized values can be done with the help of set operations.
• A code book is a description of your data, a form of metadata. It contains things such as the number of variables per observation, the number of observations, and what each encoding within a variable means. (For instance, "0" equals "negative" and "5" stands for "very positive".)
• A code book also tells the type of data you're looking at: is it hierarchical, graph, or something else?
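
A minimal sketch of the set-operation idea: compare the values observed in the data with the encodings the code book allows. The codes below are illustrative.

# Values present in the data but not allowed by the code book.
allowed_codes = {"0", "1", "2", "3", "4", "5"}     # encodings from the code book
observed      = {"0", "2", "5", "9", "NA"}         # values found in the data

print(observed - allowed_codes)    # {'9', 'NA'}: deviations from the code book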
Error: different units of measurement
• When integrating two data sets, you have to pay attention to their respective units of measurement.
• An example of this would be when you study the prices of gasoline in the world. To do this you gather data from different data providers.
• Some data sets can contain prices per gallon and others can contain prices per liter. A simple conversion will do the trick in this case.
Having different levels of aggregation
• Having different levels of aggregation is similar to having different types of measurement.
• An example of this would be a data set containing data per week versus one containing data per work week.
• This type of error is generally easy to detect, and summarizing (or the inverse, expanding) the data sets will fix it.
• After cleaning the data errors, you combine information from different data sources. But before we tackle this topic, we'll take a little detour and stress the importance of cleaning the data as early as possible.
Correct Errors
• A good practice is to mediate data errors as early as possible in the data collection chain and to fix as little as possible inside your program while fixing the origin of the problem.
• Retrieving data is a difficult task, and organizations spend millions of dollars on it in the hope of making better decisions.
• The data collection process is error-prone, and in a big organization it involves many steps and teams.
Correct Errors
• Data should be cleansed when acquired, for many reasons:
  – Not everyone spots the data anomalies. Decision makers may make costly mistakes when relying on information from applications that fail to correct for the faulty data.
  – If errors are not corrected early on in the process, the cleansing will have to be done for every project that uses that data.
Correct Errors
• Data errors may point to a business process that isn't working as designed. For instance, both authors worked at a retailer in the past, and they designed a couponing system to attract more people and make a higher profit.
• Data errors may point to defective equipment, such as broken transmission lines and defective sensors.
• Data errors can point to bugs in software or in the integration of software that may be critical to the company.
Combine Data
• Your data comes from several different places, and in this substep we focus on integrating these different sources.
• Data varies in size, type, and structure, ranging from databases and Excel files to text documents.
• It's easy to fill entire books on this topic alone, and we choose to focus on the data science process instead of presenting scenarios for every type of data.
• But keep in mind that other types of data sources exist, such as key-value stores, document stores, and so on, which we'll handle in more appropriate places in the book.
Different ways to combine data
• You can perform two operations to combine information from different data sets.
• The first operation is joining: enriching an observation from one table with information from another table.
• The second operation is appending or stacking: adding the observations of one table to those of another table.
• When you combine data, you have the option to create a new physical table or a virtual table by creating a view. The advantage of a view is that it doesn't require extra storage space, because the data isn't physically replicated.
Joining tables
• Joining tables allows you to combine the information of one observation found in one table with the information that you find in another table. The focus is on enriching a single observation.
• Let's say that the first table contains information about the purchases of a customer and the other table contains information about the region where your customer lives.
• Joining the tables allows you to combine the information so that you can use it for your model.
Joining tables
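
A minimal sketch of this join with pandas, using made-up purchase and region tables: each purchase is enriched with the region of the customer who made it.

# Join: enrich purchase records with the customer's region.
import pandas as pd

purchases = pd.DataFrame({"customer": ["C1", "C2", "C1"], "amount": [20, 35, 15]})
regions   = pd.DataFrame({"customer": ["C1", "C2"], "region": ["East", "West"]})

enriched = purchases.merge(regions, on="customer", how="left")
print(enriched)   # each purchase now also carries the customer's region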
Appending tables
• Appending or stacking tables is effectively
adding observations from one table to another
table. One table contains the observations from
the month of January and the second table contains observations from the month of February.
• The result of appending these tables is a larger
one with the observations from January as well
as February.
• The equivalent operation in set theory would be
the union, and this is also the command in SQL,
the common language of relational databases.
• Other set operators are also used in data science,
such as set difference and intersection.
Appending tables
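
A minimal sketch of appending with pandas (illustrative data): the January and February observations are stacked into one larger table, the equivalent of SQL's UNION ALL.

# Append/stack: put the observations of two tables into one larger table.
import pandas as pd

january  = pd.DataFrame({"month": ["Jan", "Jan"], "sales": [100, 120]})
february = pd.DataFrame({"month": ["Feb", "Feb"], "sales": [90, 130]})

combined = pd.concat([january, february], ignore_index=True)
print(combined)   # four observations: January stacked on top of February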
View: without replication
Aggregating measures
• Data enrichment can also be done by adding calculated information to the table, such as the total number of sales or what percentage of total stock has been sold in a certain region.
• Extra measures such as these can add perspective. Looking at the figure, we now have an aggregated data set, which in turn can be used to calculate the participation of each product within its category.
• This could be useful during data exploration but more so when creating data models.
Example:
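
A minimal sketch of this enrichment with pandas, using made-up sales figures: a category total is added to each row and used to compute each product's share within its category.

# Add an aggregated measure (category total) and a derived share per product.
import pandas as pd

sales = pd.DataFrame({
    "category": ["A", "A", "B", "B"],
    "product":  ["p1", "p2", "p3", "p4"],
    "units":    [30, 70, 40, 60],
})

sales["category_total"] = sales.groupby("category")["units"].transform("sum")
sales["share_in_category"] = sales["units"] / sales["category_total"]
print(sales)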
Data Transformation
• Certain models require their data to be in a certain shape.
• Now that you've cleansed and integrated the data, this is the next task you'll perform: transforming your data so it takes a suitable form for data modeling.
Data Transformation
• Relationships between an input variable and an output variable aren't always linear.
• Take, for instance, a relationship of the form y = a·e^(bx).
• Taking the log of the output variable gives log(y) = log(a) + b·x, which is linear in x and simplifies the estimation problem dramatically.
Data Transformation
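
A minimal sketch of why the transformation helps: for y = a·e^(bx), taking the logarithm gives log(y) = log(a) + b·x, which an ordinary linear fit can estimate. The data below is simulated.

# Recover a and b from y = a*exp(b*x) by fitting a straight line to log(y).
import numpy as np

a, b = 2.0, 0.5
x = np.linspace(0, 10, 50)
y = a * np.exp(b * x)

slope, intercept = np.polyfit(x, np.log(y), 1)   # linear fit on the log scale
print(slope, np.exp(intercept))                  # recovers b ~ 0.5 and a ~ 2.0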
Reducing number of variables
• Sometimes you have too many variables and need to reduce the number because they don't add new information to the model.
• Having too many variables in your model makes the model difficult to handle, and certain techniques don't perform well when you overload them with too many input variables.
Reducing number of variables
• For instance, all the techniques based
on a Euclidean distance perform well
only up to 10 variables.
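
One common way to reduce the number of input variables is principal component analysis (PCA); the slides do not prescribe a specific technique, so the sketch below with scikit-learn and random stand-in data is purely illustrative.

# Reduce 25 input variables to 10 derived variables with PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 25))               # 200 observations, 25 variables

pca = PCA(n_components=10)                   # keep only 10 components
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                       # (200, 10)
print(pca.explained_variance_ratio_.sum())   # share of variation retained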
Dummy Variables
• Variables can be turned into dummy variables (figure). Dummy variables can take only two values: true (1) or false (0).
• They're used to indicate the presence or absence of a categorical effect that may explain the observation. In this case you make separate columns for the classes stored in one variable and indicate it with 1 if the class is present and 0 otherwise.
• An example is turning one column named Weekdays into the columns Monday through Sunday. You use an indicator to show whether the observation was on a Monday; you put 1 on Monday and 0 elsewhere.
• Turning variables into dummies is a technique that's used in modeling and is popular with, but not exclusive to, economists.
Dummy Variables
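
A minimal sketch of the weekday example with pandas; the data is made up.

# Turn one categorical column into 0/1 dummy columns.
import pandas as pd

df = pd.DataFrame({"weekday": ["Monday", "Tuesday", "Monday", "Sunday"]})
dummies = pd.get_dummies(df["weekday"], dtype=int)
print(dummies)   # one 0/1 column per weekday; a Monday row has 1 under Monday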
4. Exploratory Data Analysis
• During exploratory data analysis you take a deep dive into the data (see figure).
• Information becomes much easier to grasp when shown in a picture; therefore you mainly use graphical techniques to gain an understanding of your data and the interactions between variables.
• This phase is about exploring data, so keeping your mind open and your eyes peeled is essential during the exploratory data analysis phase.
• The goal isn't to cleanse the data, but it's common that you'll still discover anomalies you missed before, forcing you to take a step back and fix them.
4. Exploratory Data Analysis
Exploratory Data Analysis
Brushing and linking
• With brushing and linking you combine and link different graphs and tables (or views) so changes in one graph are automatically transferred to the other graphs.
Brushing and linking
Brushing and linking
Histogram
• In a histogram a variable is cut into discrete categories and the number of occurrences in each category is summed up and shown in the graph.
• The boxplot, on the other hand, doesn't show how many observations are present but does offer an impression of the distribution within categories.
• It can show the maximum, minimum, median, and other characterizing measures at the same time.
Histogram
Boxplot
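
A minimal sketch producing both plots with matplotlib on simulated data.

# Histogram (counts per bin) and boxplot (distribution summary) side by side.
import numpy as np
import matplotlib.pyplot as plt

values = np.random.default_rng(1).normal(loc=50, scale=10, size=500)

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(values, bins=20)       # occurrences per category/bin
ax1.set_title("Histogram")
ax2.boxplot(values)             # median, quartiles, possible outliers
ax2.set_title("Boxplot")
plt.show()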
5. Build the model
Building a model
• With clean data in place and a good understanding of the content, you're ready to build models with the goal of making better predictions, classifying objects, or gaining an understanding of the system that you're modeling.
• This phase is much more focused than the exploratory analysis step, because you know what you're looking for and what you want the outcome to be.
Building a model
• Building a model is an iterative process. The way you build your model depends on whether you go with classic statistics or the somewhat more recent machine learning school, and on the type of technique you want to use.
• Either way, most models consist of the following main steps:
  – Selection of a modeling technique and variables to enter into the model
  – Execution of the model
  – Diagnosis and model comparison
Build a model
• You'll need to select the variables you want to include in your model and a modeling technique.
• Your findings from the exploratory analysis should already give a fair idea of what variables will help you construct a good model.
• Many modeling techniques are available, and choosing the right model for a problem requires judgment on your part.
Build a model
• You'll need to consider model performance and whether your project meets all the requirements to use your model, as well as other factors:
  – Must the model be moved to a production environment and, if so, would it be easy to implement?
  – How difficult is the maintenance of the model: how long will it remain relevant if left untouched?
  – Does the model need to be easy to explain?
Model Execution
• Luckily, most programming languages, such as Python, already have libraries such as StatsModels or Scikit-learn. These packages implement several of the most popular techniques.
• Coding a model is a nontrivial task in most cases, so having these libraries available can speed up the process.
• As you can see in the following code, it's fairly easy to use linear regression (figure) with StatsModels or Scikit-learn.
• Doing this yourself would require much more effort, even for the simple techniques.
Model Execution
Coding
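
In the spirit of the code figure referenced above, the following minimal sketch fits a linear regression with StatsModels on simulated data; the numbers are illustrative and do not reproduce the slide's own output.

# Fit an ordinary least squares regression and inspect its summary.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 0.75 * x + rng.normal(scale=0.2, size=100)

X = sm.add_constant(x)        # add the intercept term
model = sm.OLS(y, X).fit()
print(model.summary())        # R-squared, coefficients, p-values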
Evaluation
Evaluation
• Model fit: For this the R-squared or adjusted R-squared is used. This measure is an indication of the amount of variation in the data that gets captured by the model.
• Predictor variables have a coefficient: For a linear model this is easy to interpret. In our example, if you add "1" to x1, it will change y by "0.7658". It's easy to see how finding a good predictor can be your route to a Nobel Prize even though your model as a whole is rubbish.
• Predictor significance: Coefficients are great, but sometimes not enough evidence exists to show that the influence is there. This is what the p-value is about.
Example: KNN Model
Code
Evaluation
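
The slides' own KNN example and figures are not reproduced here; the following is a minimal, illustrative k-nearest-neighbours sketch with scikit-learn on a generated toy data set.

# Train a KNN classifier and evaluate it on held-out data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))    # accuracy on unseen data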
Model diagnostics and comparison
• You'll be building multiple models from which you then choose the best one based on multiple criteria. Working with a holdout sample helps you pick the best-performing model.
• A holdout sample is a part of the data you leave out of the model building so it can be used to evaluate the model afterward. The principle here is simple: the model should work on unseen data.
• You use only a fraction of your data to estimate the model, and the other part, the holdout sample, is kept out of the equation. The model is then unleashed on the unseen data and error measures are calculated to evaluate it.
Cross Validation
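
A minimal sketch of the holdout idea and of k-fold cross-validation with scikit-learn; the data set is a generated toy example.

# Holdout evaluation and 5-fold cross-validation on a toy data set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=300, random_state=1)

# Holdout: keep part of the data out of model building for evaluation
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("holdout score:", model.score(X_hold, y_hold))

# Cross-validation: repeat the holdout idea over several folds
print("cv scores:", cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5))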
6. Presentation and automation
• After you've successfully analyzed the data and built a well-performing model, you're ready to present your findings to the world (figure).
• This is an exciting part; all your hours of hard work have paid off and you can explain what you found to the stakeholders.
6. Presentation and automation
Presentation
• Sometimes people get so excited about your work
that you’ll need to repeat it over and over again
because they value the predictions of your models or
the insights that you produced. For this reason, you
need to automate your models.
• Thisdoesn’t always mean that you have to
redo all of your analysis all the time.

 Sometimes it’s sufficient that you
implement only the model scoring; other
times you might build an application that
• automatically updates reports, Excel
spreadsheets, or PowerPoint presentations.
 The last stage of the data science process is
where your soft skills will be most useful,
Machine Learning Definition and Relation with Data Science

 Machine learning is a subset of artificial intelligence (AI) that involves the development of algorithms and models that enable computers to learn from and make predictions or decisions based on data, without being explicitly programmed. In other words, machine learning algorithms are designed to improve their performance over time by learning from the patterns and relationships present in the data they are given.
 There are several key components of machine learning:
 Data: Data serves as the foundation for machine learning. Algorithms
learn from historical data to identify patterns, relationships, and trends.
 Algorithm: Machine learning algorithms are mathematical models that
are designed to learn from data. These algorithms adjust their
parameters based on the input data to improve their performance over
time.
 Training: The process of training involves feeding historical data into a
machine learning algorithm. The algorithm adjusts its internal parameters
iteratively to minimize the difference between its predictions and the
actual outcomes in the training data.
 Testing and Validation: After training, the algorithm is tested on new,
unseen data to evaluate its performance. Validation techniques are used
to ensure that the algorithm generalizes well to new data and doesn't just
memorize the training data.
 Prediction or Inference: Once trained, the machine learning model can
be used to make predictions or decisions on new, unseen data. It uses
the patterns it has learned during training to generate outputs or
classifications.
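
The components listed above map directly onto a few scikit-learn calls. The sketch below is illustrative, using a standard toy data set.

# Data, algorithm, training, testing/validation, and prediction in one script.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                        # data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier()                         # algorithm
model.fit(X_train, y_train)                              # training
print("test accuracy:", model.score(X_test, y_test))     # testing/validation
print("prediction:", model.predict(X_test[:1]))          # inference on new data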
 Data science, on the other hand, is a broader field that
encompasses various processes and techniques for
extracting knowledge and insights from data. It involves a
combination of skills from various domains, including
statistics, domain expertise, programming, and machine
learning. Data science includes activities such as data
collection, data cleaning and preprocessing, exploratory
data analysis, feature engineering, model selection, and the
interpretation of results.
 Machine learning is a crucial component of data science.
Data scientists use machine learning techniques to build
predictive models, classify data, cluster similar data points,
recommend products or actions, and more. Machine
learning provides the tools and methodologies that enable
data scientists to automate and enhance decision-making
processes based on data-driven insights.
Difference between ML and DS

Thank you
