
Fundamentals of Data Science

UNIT I

Syllabus

Need for data science – benefits and uses – facets of data – data science
process – setting the research goal – retrieving data – cleansing, integrating,
and transforming data – exploratory data analysis – build the models –
presenting and building applications

INTRODUCTION

1.1.1 Data

Data is a collection of discrete values that convey information, describing quantity, quality, facts, statistics, etc. Data is information, such as facts and numbers, used to analyze something or make decisions.
The characteristics of big data are often referred to as the three Vs:

 Volume - How much data is there?

 Variety - How diverse are different types of data?

 Velocity - At what speed is new data generated?

1.1.2 Why is data so important?

Data is a precious asset of any organization. It helps firms understand and enhance
their processes, thereby saving time and money. Waste of time and money, such as
a huge spend on advertisements or improper inventory management, may deplete
resources and severely impact a business. The efficient use of data enables
businesses to reduce such waste by analyzing the performance of different
marketing channels and focusing on those offering the highest Return on
Investment (ROI). Thus, a company can generate more leads without increasing its
advertising spend.
1.1.3 Characteristics of quality data
Determining the quality of data requires an examination of its characteristics,
then weighing those characteristics according to what is most important to your
organization and the application(s) for which the data will be used.

 Validity -
The degree to which your data conforms to defined business rules or constraints.

 Accuracy -
Ensure your data is close to the true values.

 Completeness -
The degree to which all required data is known.

 Consistency -
Ensure your data is consistent within the same dataset and/or across multiple
data sets.

 Uniformity -
The degree to which the data is specified using the same unit of measure.

1.1.4 Data Science

Data science is the domain of study that deals with vast volumes of data using
modern tools and techniques to find hidden patterns, derive meaningful
information, and make business decisions. Data science can be explained as the
entire process of gathering actionable insights from raw data
that involves concepts like pre-processing of data, data modeling, statistical
analysis, data analysis, machine learning algorithms, etc.
The main purpose of data science is to enable better decision making. It uses
complex machine learning algorithms to build predictive models. The data used for
analysis can come from many different sources and be presented in various formats.

 The working of data science can be explained as follows:

 Raw data is gathered from various sources that explain the business problem.

 Using various statistical analysis and machine learning approaches, data modeling is performed to get the optimum solutions that best explain the business problem.

 Actionable insights that serve as a solution for the business problem are gathered through data science.

1.1.5 Why is Data Science so important? Or: Need for Data Science
Data is meaningless until it is converted into valuable information. Data Science
involves mining large datasets containing structured and unstructured data and
identifying hidden patterns to extract actionable insights. The importance of Data
Science lies in its numerous uses, which range from daily activities like asking Siri
or Alexa for recommendations to more complex applications like operating a self-
driving car.
Handling a huge amount of data is a challenging task for every organization. To
handle, process, and analyze this data, we require complex, powerful, and efficient
algorithms, and the technology that emerged for this purpose is Data Science.
Following are some main reasons for using data science technology:

 With the help of data science technology, we can convert the massive amount of raw and unstructured data into meaningful insights.

 Data science technology is being adopted by various companies, whether big brands or startups. Google, Amazon, Netflix, etc., which handle huge amounts of data, are using data science algorithms for a better customer experience.

 Data science is working to automate transportation, such as creating self-driving cars, which are the future of transportation.

 Data science can help in different predictions, such as surveys, elections, flight ticket confirmation, etc.

1.2 Components of Data Science:

The main components of Data Science are given below:


1. Statistics:
Statistics is one of the most important components of data science.
Statistics is a way to collect and analyze numerical data in large
amounts and to find meaningful insights from it.

2. Domain Expertise:
Domain expertise binds data science together. Domain expertise
means specialized knowledge or skills in a particular area. In data
science, there are various areas for which we need domain experts.

3. Data engineering:
Data engineering is a part of data science that involves acquiring,
storing, retrieving, and transforming data. Data engineering also includes
adding metadata (data about data) to the data.

4. Visualization:
Data visualization means representing data in a visual context so
that people can easily understand the significance of the data. Data
visualization makes huge amounts of data easy to grasp at a glance.

5. Advanced computing:
Advanced computing does the heavy lifting of data science. It
involves designing, writing, debugging, and maintaining the
source code of computer programs.

6. Mathematics:
Mathematics is a critical part of data science. Mathematics involves the
study of quantity, structure, space, and change. For a data scientist,
a good knowledge of mathematics is essential.

7. Machine learning:
Machine learning is the backbone of data science. Machine learning is all
about training a machine so that it can act like a human brain. In
data science, we use various machine learning algorithms to solve
problems.

1.2.1 Tools for Data Science

Following are some tools required for data science:


 Data Analysis tools:
Python, Statistics, SAS, Jupyter, R Studio, MATLAB, Excel,
RapidMiner.
 Data Warehousing:
ETL, SQL, Hadoop, Informatica/Talend, AWS Redshift
 Data Visualization tools:
R, Jupyter, Tableau, Cognos.
 Machine learning tools:
Spark, Mahout, Azure ML studio.

1.3 Data Science Life Cycle

The Data Science Life Cycle defines how information is carried through various
phases by professionals working on a project.
It is a step-by-step procedure arranged in a circular structure. Each phase has
its own characteristics and importance. The Data Science life cycle comprises
the following:

 Formulating a Business Problem


Any data science problem starts its journey with the formulation of a
business problem. A business problem explains the issues that may be
fixed with insights gathered from an efficient data science solution.

A simple example of a business problem is: you have the past 1 year's sales
data for a retail store. Using machine learning approaches, you have to
predict or forecast the sales for the next 3 months, which will help the store
create an inventory that reduces the wastage of products with a shorter shelf
life than other products.

 Data Extraction, Transformation, Loading


The next step in the data science life cycle is to create a data pipeline where
the relevant data is extracted from the source, transformed into a machine-
readable format, and eventually loaded into the program or the machine
learning pipeline to get things started. For the above example: to forecast the
sales, we will need data from the store that is useful for formulating an
efficient machine learning model. Keeping this in mind, we would create
separate data points that may or may not affect the sales of that particular
store. A minimal sketch of such a pipeline appears below.
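As a rough illustration, the following Python sketch (pandas plus the standard sqlite3 module; the file name and column names such as store_sales.csv, units_sold, and unit_price are hypothetical assumptions) extracts raw sales records, transforms them into a monthly revenue series, and loads the result into a database table:

```python
import sqlite3

import pandas as pd

# Extract: read raw sales records from a CSV export.
raw = pd.read_csv("store_sales.csv", parse_dates=["date"])

# Transform: derive revenue and aggregate it to monthly totals.
monthly = (
    raw.assign(revenue=lambda d: d["units_sold"] * d["unit_price"])
       .set_index("date")
       .resample("M")["revenue"]
       .sum()
       .reset_index()
)

# Load: write the model-ready table into a local SQLite database.
with sqlite3.connect("sales.db") as conn:
    monthly.to_sql("monthly_sales", conn, if_exists="replace", index=False)
```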

 Data Preprocessing
The third step is where the magic happens. Using statistical analysis,
exploratory data analysis, and data wrangling and manipulation, we create
meaningful data. Preprocessing is done to assess the various data points
and formulate hypotheses that best explain the relationships between the
various features in the data.
For example:
The store sales problem will require the data to be in a time series format to
be able to forecast the sales. Hypothesis testing will check the stationarity of
the series, and further computations will reveal trends, seasonality, and
other relationship patterns in the data.

 Data Modeling
This step involves advanced machine learning concepts that are used for
feature selection, feature transformation, standardization of the data, data
normalization, etc. Choosing the best algorithm based on evidence from the
above steps will help you create a model that efficiently produces a forecast
for the months in question. For example:
 We can use a time series forecasting approach for this business problem,
where high-dimensional data may be present. We would apply various
dimensionality reduction techniques, create a forecasting model using an
AR, MA, or ARIMA model, and forecast the sales for the next quarter, as
the sketch below illustrates.
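As a hedged sketch of that forecasting step, the snippet below fits an ARIMA model with statsmodels and forecasts three months ahead. The file name and the (1, 1, 1) order are illustrative assumptions; in practice the order would be chosen from diagnostics such as ACF/PACF plots:

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Assumed input: a monthly sales series, e.g. the output of the
# ETL sketch above, saved with a "date" index column.
sales = pd.read_csv(
    "monthly_sales.csv", index_col="date", parse_dates=True
).squeeze("columns")

# Fit an ARIMA(p, d, q) model; (1, 1, 1) is only a placeholder order.
model = ARIMA(sales, order=(1, 1, 1)).fit()

# Forecast sales for the next 3 months (the next quarter).
print(model.forecast(steps=3))
```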
 Gathering Actionable Insights
The final step of the data science life cycle is gathering insights for the
stated problem. We create inferences and findings from the entire process
that best explain the business problem. For example, from the above time
series model we will get the monthly or weekly sales for the next 3 months.
These insights will in turn help professionals create a strategic plan to
overcome the problem at hand.

 Solutions For the Business Problem


The solutions to the business problem are nothing but actionable insights
that solve the problem using evidence-based information.
For example:
 Our forecast from the time series model will give an efficient estimate of
the store's sales in the next 3 months. Using those insights, the store can
plan its inventory to reduce the wastage of perishable goods.

1.3.1 Prerequisite for Data Science

Technical Prerequisite:

 Machine learning:
To understand data science, one needs to understand the concept of machine
learning. Data science uses machine learning algorithms to solve various
problems.

 Mathematical modeling:
Mathematical modeling is required to make fast mathematical calculations
and predictions from the available data.

 Statistics:
Basic understanding of statistics is required, such as mean, median, or
standard deviation. It is needed to extract knowledge and obtain better
results from the data.

 Computer programming:
For data science, knowledge of at least one programming language is
required. R, Python, and Spark are some commonly used programming
languages for data science.
 Databases:
A deep understanding of databases such as SQL is essential for data
science, in order to retrieve and work with the data.

Non-Technical Prerequisite:

 Curiosity
To learn data science, one must have curiosity. When you are curious
and ask various questions, you can understand the business problem
easily.
 Critical Thinking:
It is also required for a data scientist so that you can find multiple new ways
to solve the problem with efficiency.
 Communication skills:
Communication skills are most important for a data scientist because, after
solving a business problem, you need to communicate it to the team.

1.4 Benefits of data science / application of data science

 Data Science is widely used in the banking and finance sectors for fraud
detection and personalized financial advice.

 Retailers use Data Science to enhance customer experience and retention.

 In the healthcare industry, physicians use Data Science to analyze data from
wearable trackers to ensure their patients’ well-being and make vital
decisions. Data Science also enables hospital managers to reduce waiting
times and enhance care.

 Transportation providers use Data Science to enhance the journeys of their customers. For instance, Transport for London maps customer journeys, offers personalized transportation details, and manages unexpected circumstances using statistical data.

 Construction companies use Data Science for better decision making by tracking activities, including the average time for completing tasks, materials-based expenses, and more.

 Data Science enables tapping and analyzing massive data from manufacturing processes, much of which has gone untapped so far.

 With Data Science, one can analyze massive graphical data, temporal data,
and geospatial data to draw insights. It also helps in seismic interpretation
and reservoir characterization.

 Data Science enables firms to leverage social media content to obtain real-
time media-content usage patterns. This enables firms to create target-
audience-specific content, measure content performance, and recommend
on-demand content.

 Data Science helps study utility consumption in the energy and utility
domain. This study allows for better control of utility use and enhanced
consumer feedback.

 Data Science applications in the public service field include health-related research, financial market analysis, fraud detection, energy exploration, environmental protection, and more.
1.5 Difference between Business Intelligence and Data Science

1.6 Difference between Data Mining and Data Science


1.7 Machine learning in Data Science
To become a data scientist, one should also be aware of machine learning and
its algorithms, as various machine learning algorithms are broadly used in data
science. Following are the names of some machine learning algorithms used in
data science:

 Regression

 Decision tree

 Clustering

 Classification

 Outlier Analysis

1. Linear Regression Algorithm:


Linear regression is the most popular machine learning algorithm and is based on
supervised learning. The algorithm performs regression, which is a method of
modeling a target value based on independent variables. It takes the form of a
linear equation relating a set of inputs to a predicted output. This algorithm is
mostly used in forecasting and prediction. Since it models the linear relationship
between the input and output variables, it is called linear regression.

The equation below describes the relationship between the x and y variables:

y = mx + c

where: y => dependent variable
x => independent variable
m => slope
c => intercept
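To make the equation concrete, here is a minimal sketch using scikit-learn (a library choice of mine, not prescribed by the text) that fits y = mx + c to a few sample points and recovers the slope and intercept:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated roughly from y = 2x + 1 with a little noise.
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

model = LinearRegression().fit(x, y)

print("slope m:", model.coef_[0])        # close to 2
print("intercept c:", model.intercept_)  # close to 1
print("prediction for x = 6:", model.predict([[6.0]])[0])
```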

2. Decision Tree:
The Decision Tree algorithm is another machine learning algorithm, which belongs
to supervised learning. It is one of the most popular machine learning algorithms
and can be used for both classification and regression problems. In the decision
tree algorithm, we solve the problem using a tree representation in which each
internal node represents a feature, each branch represents a decision, and each leaf
represents an outcome. A common illustration is a job offer problem.
In a decision tree, we start from the root of the tree and compare the value of the
root attribute with the record's attribute. On the basis of this comparison, we follow
the corresponding branch and move to the next node. We continue comparing
these values until we reach a leaf node with the predicted class value.
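A small hedged sketch of the job-offer idea with scikit-learn's DecisionTreeClassifier; the two features (salary and commute time) and the sample data are illustrative assumptions:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training data: [salary_in_lakhs, commute_minutes].
X = [[3, 60], [9, 20], [7, 90], [10, 15], [4, 30], [8, 25]]
y = [0, 1, 0, 1, 0, 1]  # 1 = accept the offer, 0 = decline

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Inspect the learned feature/branch/leaf structure described above.
print(export_text(tree, feature_names=["salary", "commute"]))

# Predict for a new offer: good salary, short commute.
print(tree.predict([[9, 10]]))  # expected -> [1]
```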

3. K-Means Clustering:
K-means clustering is one of the most popular machine learning algorithms and
belongs to unsupervised learning. It solves the clustering problem: given a data set
of items with certain features and values, we need to categorize those items into
groups, and such problems can be solved using the k-means clustering algorithm.
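A minimal sketch with scikit-learn's KMeans (the library and the toy points are assumptions), grouping 2-D items into two clusters:

```python
import numpy as np
from sklearn.cluster import KMeans

# Six 2-D points that visually form two groups.
points = np.array([[1, 2], [1, 4], [1, 0],
                   [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(kmeans.labels_)            # cluster index assigned to each point
print(kmeans.cluster_centers_)   # the two learned group centers
print(kmeans.predict([[0, 0]]))  # assign a brand-new point to a cluster
```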
4. Classification
Classification is the act or process of dividing things into groups according to their
type. In statistics, classification is the problem of identifying which of a set of
categories (sub-populations) an observation (or observations) belongs to. There are
two types of classification: binary classification and multi-class classification.
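As a hedged illustration of binary classification (logistic regression is my choice of classifier here; the hours-studied example is an assumption):

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied -> pass (1) or fail (0).
X = [[1], [2], [3], [4], [5], [6], [7], [8]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

clf = LogisticRegression().fit(X, y)

print(clf.predict([[2.5], [6.5]]))  # expected -> [0 1]
print(clf.predict_proba([[4.5]]))   # class probabilities near the boundary
```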

5. Outlier Analysis
Outlier analysis is a process that involves identifying anomalous observations in a
dataset. Outliers are extreme values that deviate from the other observations in the
dataset; they are observations that differ strongly (have different properties) from
the other data points in a sample of a population. Outliers are classified into three
types, namely global outliers, contextual outliers, and collective outliers.
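One simple, widely used way to flag global outliers is the interquartile-range (IQR) rule; the sketch below is an illustration with made-up numbers, not the only method:

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 95, 12, 10, 13])  # 95 looks anomalous

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Classic rule of thumb: flag points beyond 1.5 * IQR from the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(data[(data < lower) | (data > upper)])  # -> [95]
```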
So in data science, problems are solved using algorithms, and the algorithm is
chosen to match the kind of question being asked.

1.8 Facets of Data


There are different types of data handled in data science domain and big data, and
each of them tends to require different tools and techniques.
The main categories of data are these:

 Structured Data

 Unstructured Data

 Natural language Data

 Machine-generated Data

 Graph-based Data

 Audio, video, and images

 Streaming Data

Structured Data
Structured data is data in a standardized format that has a well-defined structure,
complies with a data model, follows a persistent order, and is easily accessed by
humans and programs. This data type is generally stored in a database. Structured
Query Language (SQL) is the preferred way to manage and query data that resides
in databases. A typical example of structured data is a relational table of customer
records, as in the sketch below.
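A minimal, self-contained sketch using Python's built-in sqlite3 module (the table and column names are illustrative assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Structured data: fixed fields, rows and columns, explicit types.
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Asha", "Chennai"), (2, "Ravi", "Madurai"), (3, "Meena", "Salem")],
)

# SQL is the preferred way to query data residing in databases.
for row in conn.execute("SELECT name FROM customers WHERE city = 'Chennai'"):
    print(row)  # -> ('Asha',)
```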

Characteristics of Structured Data


Good structured data will have a range of characteristics, regardless of how the
data is stored or what the information is about. Structured data:

 has an identifiable structure that conforms to a data model and is in fixed fields

 is presented in rows and columns, such as in a database

 is organized so that the definition, format, and meaning of the data are explicitly understood

 has similar groups of data clustered together in classes

 is easy for humans and other programs to access and query

 has addressable elements, enabling efficient analysis and processing

Advantages of Structured Data

 Easy Storage and Access
 Ease of Updating and Deleting
 Easily Scalable
 Data Mining is Simple
 Better Business Intelligence

Disadvantages of Structured Data

 Storage Inflexibility
 Limited Use Cases

Unstructured data
Unstructured data is data that isn’t easy to fit into a data model because the content
is context-specific or varying. Unstructured data is information that either does not
have a predefined data model or is not organized in a pre-defined manner.
Unstructured information is typically text-heavy, but may contain data such as
dates, numbers, and facts as well. This results in irregularities and ambiguities that
make it difficult to understand using traditional programs, compared to data stored
in structured databases. Common examples of unstructured data include audio
files, video files, and NoSQL databases.
Examples of unstructured data are:

 Rich media: media and entertainment data, surveillance data, geo-spatial data, audio, weather data

 Document collections: invoices, records, emails, productivity applications

 Internet of Things (IoT): sensor data, ticker data

Natural Language Data:

Natural language is a special type of unstructured data; it’s challenging to process
because it requires knowledge of specific data science techniques and linguistics.
Natural language refers to the way we humans communicate with each other,
namely through speech and text.

Machine-generated data
Machine-generated data is information that’s automatically created by a computer,
process, application, or other machine without human intervention. Examples of
machine data are web server logs, call detail records, network event logs, and
telemetry.
Graph-based or network data
In graph theory, a graph is a mathematical structure used to model pair-wise
relationships between objects. Graph or network data is, in short, data that focuses
on the relationships or adjacency of objects. Graph structures use nodes, edges,
and properties to represent and store graph data. Graph-based data is a natural
way to represent social networks, and its structure allows you to calculate specific
metrics such as the influence of a person and the shortest path between two people.
metrics such as the influence of a person and the shortest path between two people

Audio, image, and video


Audio, image, and video are data types that pose specific challenges to a data
scientist. Tasks that are trivial for humans, such as recognizing objects in pictures,
turn out to be challenging for computers.
Streaming data
While streaming data can take almost any of the previous forms, it has an extra
property: the data flows into the system when an event happens, instead of being
loaded into a data store in a batch. Although this isn’t really a different type of
data, we treat it here as such because you need to adapt your process to deal with
this type of information. Examples are the “What’s trending” list on Twitter, live
sporting or music events, and the stock market.

1.9 The Data Science Process

The data science process typically consists of six steps, as follows:

1. Setting the research goal
- Defining the what, the why, and the how of your project in a project charter.

2. Retrieving data
- Finding and getting access to the data needed in your project. This data is either found within the company or retrieved from a third party.

3. Data preparation
- Checking and remediating data errors, enriching the data with data from other data sources, and transforming it into a suitable format for your models.

4. Data exploration
- Diving deeper into your data using descriptive statistics and visual techniques.

5. Data modeling
- Using machine learning and statistical techniques to achieve your project goal.

6. Presentation and automation
- Presenting your results to the stakeholders and industrializing your analysis process for repetitive reuse and integration with other tools.

Setting the research goal


Data science is mostly applied in the context of an organization. When the business
asks you to perform a data science project, you’ll first prepare a project charter.
This charter contains information such as what you’re going to research, how the
company benefits from that, what data and resources you need, a timetable, and
deliverables.

Retrieving data
The second step is to collect data. You’ve stated in the project charter which data
you need and where you can find it. In this step you ensure that you can use the
data in your program, which means checking the existence of, quality of, and
access to the data. Data can also be delivered by third-party companies and takes
many forms, ranging from Excel spreadsheets to different types of databases.
Data preparation
Data collection is an error-prone process; in this phase you enhance the quality
of the data and prepare it for use in subsequent steps. This phase consists of three
subphases: data cleansing removes false values from a data source
and inconsistencies across data sources, data integration enriches data sources by
combining information from multiple data sources, and data transformation
ensures that the data is in a suitable format for use in your models.

Data exploration
Data exploration is concerned with building a deeper understanding of your data.
You try to understand how variables interact with each other, the distribution of the
data, and whether there are outliers. To achieve this you mainly use descriptive
statistics, visual techniques, and simple modeling. This step often goes by
the abbreviation EDA, for Exploratory Data Analysis.
Data modeling or model building
In this phase you use models, domain knowledge, and insights about the data you
found in the previous steps to answer the research question. You select a technique
from the fields of statistics, machine learning, operations research, and so on.
Building a model is an iterative process that involves selecting the variables for the
model, executing the model, and performing model diagnostics. The way you
build your model depends on whether you go with classic statistics or the
somewhat more recent machine learning school, and on the type of technique you
want to use. Either way, most models consist of the following main steps:

i) Selection of a modeling technique and variables to enter in the model.

ii) Execution of the model

iii) Diagnosis and model comparison

Presentation and automation


Finally, you present the results to your business. These results can take many
forms, ranging from presentations to research reports. Sometimes you’ll need to
automate the execution of the process because the business will want to use the
insights you gained in another project or enable an operational process to use the
outcome of your model. The following recap summarizes the data science process
and the main steps and actions taken during a project.

1. The first step of this process is setting a research goal. The main purpose here is
making sure all the stakeholders understand the what, how, and why of the project.
In every serious project this will result in a project charter.

2. The second phase is data retrieval. You want to have data available for analysis,
so this step includes finding suitable data and getting access to it from the data
owner. The result is data in its raw form, which probably needs polishing and
transformation before it becomes usable.

3. The third phase is data preparation. Now that you have the raw data, it’s time to
prepare it. This includes transforming the data from its raw form into data that’s
directly usable in your models. To achieve this, you’ll detect and correct different
kinds of errors in the data, combine data from different data sources, and
transform it. If you have successfully completed this step, you can progress to
data visualization and modelling.

4. The fourth step is data exploration. The goal of this step is to gain a deep
understanding of the data. You’ll look for patterns, correlations, and deviations
based on visual and descriptive techniques. The insights you gain from this phase
will enable you to start modelling.

5. The fifth step in the data science process is model building. It is now that you
attempt to gain the insights or make the predictions stated in your project charter.

6. The last step of the data science process is presenting your results and
automating the analysis. One goal of a project is to change a process and/or make
better decisions. You may still need to convince the business that your findings
will indeed change the business process as expected. The importance of this step is
more apparent in projects on a strategic and tactical level. Certain projects require
you to perform the business process over and over again, so automating the project
will save time.

1.10 A detailed view on Data Science Process

1.10.1 Data Science Process: Defining research goals and creating a project
charter

A project starts by understanding the what, the why, and the how of your project:
“What does the company expect you to do?”, “Why does management place such a
value on your research?”, “Is it part of a bigger strategic picture, or a project
originating from an opportunity someone detected?” Answering these three
questions (what, why, how) is the goal of the first phase.
The entire process is divided into two parts:

i) Understanding the goals and context of your research.

ii) Creating a project charter.


A project charter requires teamwork, and your input covers at least the following:

i) A clear research goal.

ii) The project mission and context.

iii) How you’re going to perform your analysis.

iv) What resources you expect to use.

v) Proof that it’s an achievable project, or proof of concepts

vi) Deliverables and a measure of success.


vii) A timeline.

The company can use this information to estimate the project costs and the
data and people required for your project to become a success.

1.10.2 Data Science Process: Retrieving data

 The next step in data science is to retrieve the required data. Data required
for the analysis process will be collected from the company directly (Internal
data) or collected from outside sources (External Data).

 Data can be stored in many forms, ranging from simple text files to tables in
a database. The objective now is to acquire all the data you need.
 The main challenge in data collection is identifying the data sources where
the required data is actually stored, because a company may have stored the
data in many places.

 Another challenge is to extract the useful data from the collection and
remove the noise and unwanted data from it. Many manual and automated
tools are used to refine the data.
Many companies publish their data in open forums for public access.

1.10.3 Data Science Process: Cleansing, integrating, and transforming data

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly
formatted, duplicate, or incomplete data within a dataset. When combining
multiple data sources, there are many opportunities for data to be duplicated or
mislabeled. If data is incorrect, outcomes and algorithms are unreliable, even
though they may look correct.

The data cleaning process includes the removal of duplicate and irrelevant
observations, fixing structural errors, filtering unwanted outliers, handling missing
data, data validation, etc.
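The following pandas sketch illustrates a few of these cleaning steps on a hypothetical dataset (the file and column names are assumptions, not from the text):

```python
import pandas as pd

df = pd.read_csv("customers_raw.csv")  # hypothetical input file

# Remove duplicate and irrelevant observations.
df = df.drop_duplicates().drop(columns=["internal_notes"])

# Fix a structural error: inconsistent capitalization in a category.
df["city"] = df["city"].str.strip().str.title()

# Handle missing data: drop rows missing the key field, fill the rest.
df = df.dropna(subset=["customer_id"])
df["age"] = df["age"].fillna(df["age"].median())

# Simple validation: keep only plausible ages.
df = df[df["age"].between(0, 120)]
```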

Data integration is the process of combining data from multiple heterogeneous
data sources into a coherent data store and providing a unified view of the data.
These sources may include multiple data cubes, databases, or flat files.

There are two major approaches to data integration: the “tight coupling approach”
and the “loose coupling approach”.

Some of the challenges in data integration are schema integration, data
redundancy, and the detection and resolution of data value conflicts.
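As a minimal illustration of integration and schema reconciliation (file and column names are hypothetical), pandas can merge two sources on a shared key into a unified view:

```python
import pandas as pd

# Two heterogeneous sources describing the same customers.
crm = pd.read_csv("crm_export.csv")        # columns: customer_id, name
billing = pd.read_csv("billing_dump.csv")  # columns: cust_id, total_spend

# Schema integration: reconcile differing column names for the same key.
billing = billing.rename(columns={"cust_id": "customer_id"})

# Combine both sources into one coherent, unified view.
unified = crm.merge(billing, on="customer_id", how="outer")
print(unified.head())
```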
Data transformation:
Raw data is difficult to trace or understand, which is why it needs to be pre-
processed before any information is retrieved from it. Data transformation is a
technique used to convert raw data into a suitable format that eases data mining
and the retrieval of strategic information.

Data transformation includes data cleaning techniques and data reduction
techniques to convert the data into the appropriate form. It changes the format,
structure, or values of the data and converts it into clean, usable data. Data may be
transformed at two stages of the data pipeline in data analytics projects.

Data integration, migration, data warehousing, and data wrangling may all involve
data transformation. Data transformation increases the efficiency of business and
analytic processes, and it enables businesses to make better data-driven decisions.
During the data transformation process, an analyst determines the structure of
the data.
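One common transformation is rescaling numeric features to a shared range so that no single feature dominates downstream models. This sketch uses scikit-learn's MinMaxScaler (the library choice and the sample values are assumptions):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Raw values on very different scales (e.g. income vs. age).
raw = np.array([[30_000, 25],
                [90_000, 40],
                [60_000, 33]])

# Rescale each column to the [0, 1] range.
scaled = MinMaxScaler().fit_transform(raw)
print(scaled.round(2))
```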
1.10.4 Data Science Process: Exploratory data analysis [EDA]

Exploratory data analysis (EDA) is an approach to analysing data sets in order to
summarize their main characteristics, often using statistical graphics and other
data visualization methods.

Exploratory data analysis refers to the critical process of performing initial
investigations on data so as to discover patterns, spot anomalies, test hypotheses,
and check assumptions with the help of summary statistics and graphical
representations.
The objectives of EDA are to:

 Enable unexpected discoveries in the data

 Suggest hypotheses about the causes of observed phenomena

 Assess assumptions on which statistical inference will be based

 Support the selection of appropriate statistical tools and techniques

 Provide a basis for further data collection through surveys or experiments

Typical graphical techniques used in EDA are the box plot, histogram, multi-vari
chart, run chart, Pareto chart, scatter plot (2D/3D), stem-and-leaf plot, parallel
coordinates, odds ratio, targeted projection pursuit, heat map, bar chart, etc.
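A compact sketch of a first EDA pass with pandas and matplotlib (the dataset file and column names are hypothetical placeholders):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales_clean.csv")  # hypothetical cleaned dataset

# Summary statistics: central tendency, spread, and range per column.
print(df.describe())

# Pairwise correlations between numeric variables.
print(df.corr(numeric_only=True))

# Two of the typical graphical techniques listed above.
df["revenue"].plot(kind="hist", title="Revenue distribution")
plt.show()
df.plot(kind="scatter", x="ad_spend", y="revenue", title="Spend vs. revenue")
plt.show()
```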

1.10.5 Data Science Process: Build the models


The model building process involves setting up ways of collecting data,
understanding and paying attention to what is important in the data to answer the
questions you are asking, and finding a statistical, mathematical, or simulation
model to gain understanding and make predictions.

 In regression analysis, model building is the process of developing a


probabilistic model that best describes the relationship between the
dependent and independent variables.

 The Model Building Process consists of the following three steps:

i) Selection of a modeling technique and variables to enter in the model.

ii) Execution of the model

iii) Diagnosis and model comparison

1.10.6 Data Science Process: Presenting findings and building applications

 Once the data has been successfully analyzed and a well-performing model has
been built, the findings need to be presented to the world. This involves
presenting your results to the stakeholders and industrializing your analysis
process for repetitive reuse and integration with other tools.
 The entire process needs to be documented in a professional way, with clear
visualizations, so that it is easy for the audience to understand.
