Fundamentals of Data Science unit 1
UNIT I
Syllabus
Need for data science – benefits and uses – facets of data – data science
process – setting the research goal – retrieving data – cleansing, integrating,
and transforming data – exploratory data analysis – build the models –
presenting and building applications
INTRODUCTION
1.1.1 Data
Data is a precious asset of any organization. It helps firms understand and enhance
their processes, thereby saving time and money. Waste of time and money, such as heavy
spending on advertisements or improper inventory management, can deplete resources and
severely impact a business. The efficient use of data enables businesses to reduce such
waste by analyzing the performance of different marketing channels and focusing on those
offering the highest Return on Investment (ROI). Thus, a company can generate more leads
without increasing its advertising spend.
1.1.3 Characteristics of quality data
Determining the quality of data requires an examination of its characteristics, then
weighing those characteristics according to what is most important to your organization
and the application(s) for which the data will be used.
Validity -
The degree to which your data conforms to defined business rules or
constraints.
Accuracy -
Ensure your data is close to the true values.
Completeness -
The degree to which all required data is known.
Consistency -
Ensure your data is consistent within the same dataset and/or across multiple
data sets.
Uniformity -
The degree to which the data is specified using the same unit of measure.
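These characteristics can also be checked programmatically. Below is a minimal sketch (not from the original text) showing how a few of them, validity, completeness, and consistency, might be assessed with pandas on a hypothetical dataset; the column names and business rules are illustrative assumptions.

```python
import pandas as pd

# Hypothetical dataset; the columns and rules below are illustrative only.
df = pd.DataFrame({
    "customer_id": [101, 102, 102, 104],
    "age": [34, -5, 29, None],               # -5 breaks a validity rule, None is incomplete
    "country": ["IN", "India", "IN", "IN"],  # mixed representations hurt consistency/uniformity
})

# Validity: ages must respect a defined business rule (0-120).
invalid_age = df[(df["age"] < 0) | (df["age"] > 120)]

# Completeness: share of non-missing values per column.
completeness = 1 - df.isna().mean()

# Consistency/uniformity: the same country should use a single representation.
country_variants = df["country"].value_counts()

print(invalid_age, completeness, country_variants, sep="\n\n")
```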
1.1.4 Data Science
Data science is the domain of study that deals with vast volumes of data using modern
tools and techniques to find hidden patterns, derive meaningful information, and make
business decisions. Data science can be explained as the entire process of gathering
actionable insights from raw data, involving concepts such as data pre-processing, data
modeling, statistical analysis, data analysis, machine learning algorithms, etc.
The main purpose of data science is to enable better decision making. It uses complex
machine learning algorithms to build predictive models. The data used for analysis can
come from many different sources and may be presented in various formats.
Raw data is gathered from various sources to explain the business problem.
Actionable insights gathered through data science serve as a solution to the business
problem.
1.1.5 Why is Data Science so important? (Need for Data Science)
Data is meaningless until its conversion into valuable information. Data Science
involves mining large datasets containing structured and unstructured data and
identifying hidden patterns to extract actionable insights. The importance of Data
Science lies in its numerous uses that range from daily activities like asking Siri or
Alexa for recommendations to more complex applications like operating a self-
driving car.
Handling huge amounts of data is a challenging task for every organization. To handle,
process, and analyze this data, we require complex, powerful, and efficient algorithms,
and the technology that emerged for this purpose is Data Science. The following are some
of the main reasons for using data science technology:
2. Domain Expertise:
Domain expertise binds data science together. It means specialized knowledge or
skill in a particular area. In data science, there are various areas for which we
need domain experts.
3. Data engineering:
Data engineering is a part of data science that involves acquiring, storing,
retrieving, and transforming the data. Data engineering also includes adding
metadata (data about data) to the data.
4. Visualization:
Data visualization means representing data in a visual context so that people can
easily understand its significance. Data visualization makes it easy to grasp huge
amounts of data through visuals.
5. Advanced computing:
Advanced computing does the heavy lifting of data science. It involves designing,
writing, debugging, and maintaining the source code of computer programs.
6. Mathematics:
Mathematics is a critical part of data science. It involves the study of quantity,
structure, space, and change. For a data scientist, a good knowledge of mathematics
is essential.
7. Machine learning:
Machine learning is the backbone of data science. It is all about training a machine
so that it can act like a human brain. In data science, we use various machine
learning algorithms to solve problems.
The Data Science Life Cycle defines the process by which information is carried through
various phases by professionals working on a project.
It is a step-by-step procedure arranged in a circular structure. Each phase has its own
characteristics and importance. The Data Science life cycle comprises the following:
Data Preprocessing
The third step is where the magic happens. Using statistical analysis,
exploratory data analysis, data wrangling and manipulation, we will create
meaningful data. The preprocessing is done to assess the various data points
and formulate hypotheses that best explain the relationship between the
various features in the data.
For example –
The store sales problem will require the data to be in a time series format to be able
to forecast the sales. Hypothesis testing will check the stationarity of the series, and
further computations will reveal trends, seasonality, and other relationship patterns in
the data.
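As a concrete illustration of this step, the sketch below tests the stationarity of a sales series with the Augmented Dickey-Fuller test from statsmodels; the monthly data here is synthetic, not taken from the text.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Synthetic monthly sales series with an upward trend (likely non-stationary).
idx = pd.date_range("2020-01-01", periods=36, freq="MS")
sales = pd.Series(100 + np.arange(36) * 2 + np.random.normal(0, 5, 36), index=idx)

adf_stat, p_value, *_ = adfuller(sales)
print(f"ADF statistic = {adf_stat:.3f}, p-value = {p_value:.3f}")
# A large p-value (> 0.05) suggests the series is non-stationary and may need
# differencing before modeling its trend and seasonality.
```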
Data Modeling
This step involves advanced machine learning concepts that will be used for
feature selection, feature transformation, standardization of the data, data
normalization, etc. Choosing the best algorithms based on evidence from the
above steps will help you create a model that will efficiently create a
forecast for the said months in the above example. For example –
We can use the time series forecasting approach for the business problem; since
high-dimensional data may be present, we will apply various dimensionality reduction
techniques, create a forecasting model using an AR, MA, or ARIMA model, and forecast the
sales for the next quarter.
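A minimal sketch of this modeling step is given below: it fits an ARIMA model with statsmodels and forecasts the next quarter (three months). The order (1, 1, 1) and the synthetic data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic monthly sales history.
idx = pd.date_range("2020-01-01", periods=36, freq="MS")
sales = pd.Series(200 + np.arange(36) * 3 + np.random.normal(0, 10, 36), index=idx)

model = ARIMA(sales, order=(1, 1, 1))   # (AR terms, differencing, MA terms)
fitted = model.fit()
forecast = fitted.forecast(steps=3)     # sales forecast for the next 3 months
print(forecast)
```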
Gathering Actionable Insights
The final step of the data science life cycle is gathering insights for the given
problem statement. We draw inferences and findings from the entire process that best
explain the business problem. For example:
From the above Time series model, we will get the monthly or weekly sales
for the next 3 months. These insights will in turn help the professionals
create a strategy plan to overcome the problem at hand.
Technical Prerequisites:
Machine learning:
To understand data science, one needs to understand the concept of machine
learning. Data science uses machine learning algorithms to solve various
problems.
Mathematical modeling:
Mathematical modeling is required to make fast mathematical calculations
and predictions from the available data.
Statistics:
Basic understanding of statistics is required, such as mean, median, or
standard deviation. It is needed to extract knowledge and obtain better
results from the data.
Computer programming:
For data science, knowledge of at least one programming language is required. R, Python,
and Spark are some of the languages and tools commonly used in data science.
Databases:
A deep understanding of databases such as SQL is essential for data science, in order to
retrieve and work with the data.
Non-Technical Prerequisites:
Curiosity:
To learn data science, one must have curiosity. When you are curious and ask various
questions, you can understand the business problem easily.
Critical Thinking:
Critical thinking is also required for a data scientist, so that you can find multiple
new ways to solve a problem efficiently.
Communication skills:
Communication skills are very important for a data scientist because, after solving a
business problem, you need to communicate it to the team.
Applications of Data Science
Data Science is widely used in the banking and finance sectors for fraud detection and
personalized financial advice.
With Data Science, one can analyze massive graphical data, temporal data,
and geospatial data to draw insights. It also helps in seismic interpretation
and reservoir characterization.
Data Science enables firms to leverage social media content to obtain real-time media
content usage patterns. This helps firms create target
audience-specific content, measure content performance, and recommend
on-demand content.
Data Science helps study utility consumption in the energy and utility
domain. This study allows for better control of utility use and enhanced
consumer feedback.
The machine learning algorithms commonly used in data science include:
Regression
Decision tree
Clustering
Classification
Outlier Analysis
1. Regression:
Regression is a supervised learning technique that models the relationship between a
dependent variable y and an independent variable x. For simple linear regression, the
relationship between the x and y variables can be described by the equation
y = mx + c
where m is the slope of the line and c is the intercept.
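The sketch below fits this line from data using scikit-learn; the numbers are made up for illustration, and coef_ corresponds to the slope m while intercept_ corresponds to c.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1], [2], [3], [4], [5]])   # independent variable
y = np.array([3, 5, 7, 9, 11])            # dependent variable (here y = 2x + 1)

reg = LinearRegression().fit(x, y)
print("m =", reg.coef_[0], "c =", reg.intercept_)
print("prediction for x = 6:", reg.predict([[6]])[0])
```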
2. Decision Tree:
Decision Tree algorithm is another machine learning algorithm, which belongs to the
supervised learning category. It is one of the most popular machine learning algorithms
and can be used for both classification and regression problems.
In the decision tree algorithm, we solve the problem using a tree representation in which
each internal node represents a feature, each branch represents a decision, and each leaf
represents an outcome. The following is an example for a job-offer problem.
In the decision tree, we start from the root of the tree and compare the value of the root
attribute with the record's attribute. On the basis of this comparison, we follow the
corresponding branch and move to the next node. We continue comparing these values until
we reach a leaf node with the predicted class value.
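A hedged sketch of this idea with scikit-learn follows; the "job offer" features (salary level, commute time, free coffee) are illustrative assumptions, not taken from the text.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Each row: [salary_above_50k, commute_under_1hr, offers_free_coffee]
X = [[1, 1, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1], [0, 0, 0]]
y = ["accept", "accept", "decline", "decline", "decline"]   # target decision

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["salary>50k", "commute<1hr", "coffee"]))

# Prediction follows the root -> branch -> leaf comparisons described above.
print(tree.predict([[1, 1, 1]]))
```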
3. K-Means Clustering:
K-means clustering is one of the most popular machine learning algorithms and belongs to
the unsupervised learning category. It solves the clustering problem. If we are given a
dataset of items with certain features and values, and we need to categorize those items
into groups, such problems can be solved using the k-means clustering algorithm.
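The following is a minimal k-means sketch with scikit-learn; the points and the choice of k = 2 are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

items = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
                  [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(items)
print("cluster labels:", kmeans.labels_)          # group assigned to each item
print("cluster centres:", kmeans.cluster_centers_)
```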
4. Classification
It is the act or process of dividing things into groups according to their type. In
statistics, classification is the problem of identifying which of a set of categories
(sub-populations) an observation (or observations) belongs to. There are two types of
classification: binary classification and multi-class classification.
5. Outlier Analysis
Outlier Analysis is the process of identifying anomalous observations in a dataset.
Outliers are extreme values that deviate strongly (have different properties) from the
other observations in the sample of a population. Outliers are classified into three
types, namely Global Outliers, Contextual Outliers, and Collective Outliers.
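One simple way to flag global outliers is the interquartile-range (IQR) rule, sketched below on made-up values; this is only one of many possible outlier-detection techniques.

```python
import numpy as np

values = np.array([10, 12, 11, 13, 12, 11, 95])   # 95 is an extreme value

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # common 1.5 * IQR fences

outliers = values[(values < lower) | (values > upper)]
print("outliers:", outliers)
```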
So, in data science, problems are solved using algorithms, and the choice of algorithm
depends on the type of question being asked.
Facets of Data
The main categories (facets) of data are:
Structured Data
Unstructured Data
Machine-generated Data
Graph-based Data
Streaming Data
Structured Data
Structured data is when data is in a standardized format, has a well-defined
structure, complies with a data model, follows a persistent order, and is easily
accessed by humans and programs. This data type is generally stored in a database.
Structured Query Language (SQL) is the preferred way to manage and query data
that resides in databases. An example of structured data is a database or spreadsheet
table, in which the information is easy to access and query for humans and other
programs.
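As a small illustration of querying structured data with SQL, the sketch below uses Python's built-in sqlite3 module and a hypothetical customers table; the table and values are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, "Asha", "Chennai"), (2, "Ravi", "Mumbai")])

# Because the data follows a well-defined structure, a simple SQL query suffices.
for row in conn.execute("SELECT name FROM customers WHERE city = 'Chennai'"):
    print(row)
```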
Unstructured data
Unstructured data is data that is not easy to fit into a data model because its content
is context-specific or varying. Unstructured data is information that either does not
have a predefined data model or is not organized in a predefined manner. Unstructured
information is typically text-heavy, but may also contain data such as dates, numbers,
and facts. This results in irregularities and ambiguities that make it difficult to
understand using traditional programs, compared to data stored in structured databases.
Common examples of unstructured data include audio files, video files, and NoSQL
databases.
Machine-generated data
Machine-generated data is information that’s automatically created by a computer,
process, application, or other machine without human intervention. Examples of
machine data are web server logs, call detail records, network event logs, and
telemetry.
Graph-based or network data
In graph theory, a graph is a mathematical structure to model pair-wise
relationships between objects. Graph or network data is, in short, data that focuses
on the relationship or adjacency of objects. The graph structures use nodes, edges,
and properties to represent and store graph data. Graph-based data is a natural way to
represent social networks, and its structure allows you to calculate specific metrics
such as the influence of a person and the shortest path between two people.
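A minimal sketch of graph/network data is shown below using the networkx library (an assumption, since the text names no specific tool): people are nodes, friendships are edges, and the metrics mentioned above can be computed directly.

```python
import networkx as nx

g = nx.Graph()
g.add_edges_from([("Ann", "Bob"), ("Bob", "Cara"), ("Cara", "Dev"), ("Ann", "Cara")])

# "Influence" of a person approximated by degree centrality.
print(nx.degree_centrality(g))

# Shortest path between two people.
print(nx.shortest_path(g, "Ann", "Dev"))
```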
The data science process consists of the following steps:
1. Setting the research goal
- Understanding the what, the why, and the how of your project, and capturing them in a
project charter.
2. Retrieving data
- Finding and getting access to the data needed in your project. This data is either
found within the company or retrieved from a third party.
3. Data preparation
- Checking and remediating data errors, enriching the data with data from other data
sources, and transforming it into a suitable format for your models.
4. Data exploration
- Diving deeper into your data using descriptive statistics and visual
techniques.
5. Data modeling
- Using machine learning and statistical techniques to achieve your project goal.
6. Presentation and automation
- Presenting the results and, where needed, building applications that automate the
analysis.
Retrieving data
The second step is to collect data. You have stated in the project charter which data you
need and where you can find it. In this step you ensure that you can use the data in your
program, which means checking its existence, its quality, and your access to it.
Data can also be delivered by third-party companies and takes many forms ranging
from Excel spreadsheets to different types of databases.
Data preparation
Data collection is an error-prone process; in this phase you enhance the quality
of the data and prepare it for use in subsequent steps. This phase consists of three
sub phases: data cleansing removes false values from a data source
and inconsistencies across data sources, data integration enriches data sources by
combining information from multiple data sources, and data transformation
ensures that the data is in a suitable format for use in your models.
Data exploration
Data exploration is concerned with building a deeper understanding of your data.
You try to understand how variables interact with each other, the distribution of the
data, and whether there are outliers. To achieve this you mainly use descriptive
statistics, visual techniques, and simple modeling. This step often goes by
the abbreviation EDA, for Exploratory Data Analysis.
Data modeling or model building
In this phase you use models, domain knowledge, and insights about the data you
found in the previous steps to answer the research question. You select a technique
from the fields of statistics, machine learning, operations research, and so on.
Building a model is an iterative process that involves selecting the variables for the
model, executing the model, and performing model diagnostics. The way you build your
model depends on whether you go with classic statistics or the somewhat more recent
machine learning school, and on the type of technique you want to use.
Either way, most models consist of the following main steps: (1) selection of a modeling
technique and the variables to enter the model, (2) execution of the model, and (3) model
diagnostics.
Preparing the data for modeling includes transforming the data from a raw form into data
that is directly usable in your models. To achieve this, you detect and correct different
kinds of errors in the data, combine data from different data sources, and transform it.
If you have successfully completed this step, you can progress to data visualization and
modeling.
1.10.1 Data Science Process: Defining research goals and creating a project
charter
A project starts by understanding the what, the why, and the how of your project. "What
does the company expect you to do?", "Why does management place such a value on your
research?", "Is it part of a bigger strategic picture, or a project originating from an
opportunity someone detected?" Answering these three questions (what, why, how) is the
goal of the first phase.
This first phase is divided into two parts: defining the research goals and creating a
project charter.
The next step in data science is to retrieve the required data. Data required
for the analysis process will be collected from the company directly (Internal
data) or collected from outside sources (External Data).
Data can be stored in many forms, ranging from simple text files to tables in
a database. The objective now is acquiring all the data you need.
The main challenge in data collection is identifying the data sources where the required
data is actually stored, because a company may have stored the data across many places.
Another challenge is extracting the useful data from the collected data and removing the
noise and unwanted data from it. Many manual and automated tools are used to refine the
data.
Many companies and government agencies publish their data in open forums (open data
repositories) for public access.
Data cleaning
is the process of fixing or removing incorrect, corrupted, incorrectly formatted,
duplicate, or incomplete data within a dataset. When combining multiple data
sources, there are many opportunities for data to be duplicated or mislabeled. If
data is incorrect, outcomes and algorithms are unreliable, even though they may
look correct.
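A hedged sketch of typical cleaning steps in pandas follows; the dataset, column names, and chosen fixes (dropping duplicates, correcting a data type, filling a missing value) are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "amount": ["100", "250", "250", None],   # stored as text, one value missing
})

df = df.drop_duplicates()                                   # remove duplicated rows
df["amount"] = pd.to_numeric(df["amount"])                  # fix incorrect format/type
df["amount"] = df["amount"].fillna(df["amount"].median())   # handle incomplete data
print(df)
```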
Data Integration
is a process of combining data from multiple heterogeneous data sources into a
coherent data store and provides a unified view of the data. These sources may
include multiple data cubes, databases, or flat files.
There are two major approaches to data integration: the "tight coupling" approach and
the "loose coupling" approach.
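The sketch below shows the basic idea of integration with pandas: two hypothetical sources (a CRM table and a billing table) are combined into a single unified view; the table names and keys are assumptions.

```python
import pandas as pd

crm = pd.DataFrame({"customer_id": [1, 2], "name": ["Asha", "Ravi"]})
billing = pd.DataFrame({"cust_id": [1, 2], "total_spent": [1200.0, 450.0]})

# Align the key names, then join the sources into one coherent table.
unified = crm.merge(billing.rename(columns={"cust_id": "customer_id"}),
                    on="customer_id", how="left")
print(unified)
```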
Data Transformation
is a technique used to convert raw data into a suitable format that efficiently eases
data mining and the retrieval of strategic information.
Data integration, migration, data warehousing, data wrangling may all involve data
transformation. Data transformation increases the efficiency of business and
analytic processes, and it enables businesses to make better data-driven decisions.
During the data transformation process, an analyst will determine the structure of
the data.
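One common transformation is min-max scaling, sketched below on made-up income values; it rescales a feature to the [0, 1] range so that features become comparable before modeling.

```python
import numpy as np

income = np.array([25000.0, 40000.0, 120000.0, 60000.0])
scaled = (income - income.min()) / (income.max() - income.min())
print(scaled)   # raw incomes rescaled to the 0-1 range
```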
Data Science Process: Exploratory data analysis [EDA]
Typical graphical techniques used in EDA are Box plot, Histogram, Multi-vari
chart, Run chart, Pareto chart, Scatter plot (2D/3D), Stem-and-leaf plot, Parallel
coordinates, Odds ratio, Targeted projection pursuit, Heat map, Bar chart, etc.,
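A minimal EDA sketch on synthetic data is given below; it prints descriptive statistics and draws two of the plots named above (a histogram and a box plot) using pandas and matplotlib.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"sales": np.random.normal(200, 30, 500)})

print(df.describe())   # descriptive statistics: mean, std, quartiles, etc.

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
df["sales"].plot.hist(ax=axes[0], bins=30, title="Histogram")
df["sales"].plot.box(ax=axes[1], title="Box plot")
plt.tight_layout()
plt.show()
```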