Unit-2 - DS Notes

UNIT-2

Syllabus:
Data Science Process: Overview of the Data Science Process, defining research goals
and creating a project charter, Retrieving data, Cleansing, integrating and
transforming data, exploratory data analysis, build the models, presenting findings
and building applications on top of them.

Overview of the Data Science Process:


 The typical data science process consists of six steps through which you will iterate.



 The first step of this process is setting a research goal. The main purpose here is
making sure all the stakeholders understand the what, how, and why of the
project.
 The second phase is data retrieval. This step includes finding suitable data and
getting access to the data from the data owner. The result is data in its raw
form, which probably needs polishing and transformation before it becomes
usable.
 Now we have the raw data and it’s time to prepare it. This includes
transforming the data from a raw form into data that’s directly usable in our
models. To achieve this, you’ll detect and correct different kinds of errors in the
data, combine data from different data sources, and transform it. If you have
successfully completed this step, we can progress to data visualization and
modeling.
 The fourth step is data exploration. The goal of this step is to gain a deep
understanding of the data. You’ll look for patterns, correlations, and deviations
based on visual and descriptive techniques. The insights you gain from this
phase will enable you to start modeling.
 The fifth step is model building. Here you use the cleansed data and the insights from the exploration phase to build models that make predictions, classify objects, or explain the system being studied.
 The last step of the data science model is presenting your results and
automating the analysis, if needed. One goal of a project is to change a process
and/or make better decisions.

Step 1: Defining research goals and creating a project charter:


 A project starts by understanding the what, the why, and the how of your
project.
 Answering these three questions (what, why, how) is the goal of the first phase,
so that everybody knows what to do and can agree on the best course of action.
 The outcome should be a clear research goal, a good understanding of the
context, well-defined deliverables, and a plan of action with a timetable. This
information is then best placed in a project charter.



Step 1: Setting the research goal

 Create a project charter: After you have a good understanding of the business
problem, try to get a formal agreement on the deliverables. All this information
is best collected in a project charter. For any significant project this would be
mandatory.
 A project charter requires teamwork, and your input covers at least the
following:
1. A clear research goal
2. The project mission and context
3. How you’re going to perform your analysis
4. What resources you expect to use
5. Proof that it’s an achievable project, or proof of concepts
6. Deliverables and a measure of success
7. A timeline
Your client can use this information to make an estimation of the project costs and the data and people required for your project to become a success.



Step 2: Retrieving data:
 The next step in data science is to retrieve the required data. Sometimes you
need to go into the field and design a data collection process yourself, but most
of the time you won’t be involved in this step.

Step 2: Retrieving data

 Data can be stored in many forms, ranging from simple text files to tables in a
database. The objective now is acquiring all the data you need.
 Start with data stored within the company: first you should assess the relevance
and quality of the data that’s readily available within your company.
 Most companies have a program for maintaining key data, so much of the
cleaning work may already be done. This data can be stored in official data
repositories such as databases, data marts, data warehouses, and data
lakes maintained by a team of IT professionals.
 The primary goal of a database is data storage, while a data warehouse is
designed for reading and analyzing that data.
 A data mart is a subset of the data warehouse that serves a specific business
unit. Data lakes contain data in its natural or raw format.



 Finding data even within your own company can sometimes be a challenge. As
companies grow, their data becomes scattered around many places. Knowledge
of the data may be dispersed as people change positions and leave the company.
 Getting access to data is another difficult task. Organizations understand the
value and sensitivity of data and often have policies in place so everyone has
access to what they need and nothing more. These policies translate into
physical and digital barriers called Chinese walls. These “walls” are mandatory
and well-regulated for customer data in most countries. This is for good
reasons, too; imagine everybody in a credit card company having access to your
spending habits. Getting access to the data may take time and involve company
politics.

Table 2.1. A list of open-data providers that should get you started

Open data site: Description
Data.gov: The home of the US Government's open data
Open-data.europa.eu: The home of the European Commission's open data
Freebase.org: An open database that retrieves its information from sites like Wikipedia, MusicBrainz, and the SEC archive
Data.worldbank.org: Open data initiative from the World Bank
Aiddata.org: Open data for international development
Open.fda.gov: Open data from the US Food and Drug Administration



Step 3: Cleansing, integrating, and transforming data (Data
Preparation):
 The data received from the data retrieval phase is likely to be “a diamond in the
rough.”
 Now our task is to clean and prepare it for use in the modeling and reporting
phase. Our model needs the data in a specific format, so data transformation
will always come into play. It's a good habit to correct data errors as early on in
the process as possible. However, this isn't always possible in a realistic setting, so you will also need to take corrective actions in your own program.
 Figure 2.4 shows the most common actions to take during the data cleansing,
integration, and transformation phase.

Step 3: Data preparation



Cleansing data:
 Data cleansing is a sub process of the data science process that focuses on
removing errors in our data, so our data becomes a true and consistent
representation of the processes.
 By “true and consistent representation” we imply that at least two types of
errors exist. The first type is the interpretation error, such as when you take the
value in your data for granted, like saying that a person’s age is greater than
300 years.
 The second type of error points to inconsistencies between data sources or
against your company’s standardized values. An example of this class of errors
is putting “Female” in one table and “F” in another when they represent the
same thing: that the person is female.
 Another example is that you use Pounds in one table and Dollars in another.
Too many possible errors exist for this list to be exhaustive, but table 2.2 shows
an overview of the types of errors that can be detected with easy checks—the
“low hanging fruit,” as it were.

An overview of common errors:

General solution: Try to fix the problem early in the data acquisition chain or else fix it in the program.

Errors pointing to false values within one data set (error description and possible solution):
Mistakes during data entry: Manual overrules
Redundant white space: Use string functions
Impossible values: Manual overrules
Missing values: Remove observation or value
Outliers: Validate and, if erroneous, treat as missing value (remove or insert)



Errors pointing to inconsistencies between data sets (error description and possible solution):
Deviations from a code book: Match on keys or else use manual overrules
Different units of measurement: Recalculate
Different levels of aggregation: Bring to the same level of measurement by aggregation or extrapolation

Data entry errors:


 Data collection and data entry are error-prone processes. They often require
human intervention, and because humans are only human, they make typos or
lose their concentration for a second and introduce an error into the chain.
 But data collected by machines or computers isn’t free from errors either.
Errors can arise from human negligence, whereas others are due to machine or
hardware failure.
 Examples of errors originating from machines are transmission errors or bugs
in the extract, transform, and load phase (ETL).
 For small data sets we can check every value by hand. Detecting data errors
when the variables you study don’t have many classes can be done by
tabulating the data with counts.
 When you have a variable that can take only two values: “Good” and “Bad”,
you can create a frequency table and see if those are truly the only two values
present. In table 2.3, the values “Godo” and “Bade” point out something went
wrong in at least 16 cases.
Table 2.3. Detecting outliers on simple variables with a frequency table
Value Count
Good 1598647
Bad 1354468
Godo 15
Bade 1
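A minimal sketch of this kind of frequency check, assuming the values sit in a pandas DataFrame column named "status" (the library and column name are illustrative, not part of the notes):

import pandas as pd

df = pd.DataFrame({"status": ["Good", "Bad", "Good", "Godo", "Bade", "Good"]})

# Tabulate the values; unexpected categories such as "Godo" or "Bade"
# stand out next to the legitimate "Good" and "Bad" counts.
print(df["status"].value_counts())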



Redundant whitespace:
 Redundant whitespace is hard to detect, but like other redundant characters it causes errors.
 For example, if the cleaning during the ETL phase isn't well executed, keys in one table may contain a whitespace at the end of a string ("FR " instead of "FR"). This causes a mismatch of keys, dropping the observations that can't be matched.
 Most programming languages provide string functions that will remove the
leading and trailing whitespaces. For instance, in Python you can use
the strip() function to remove leading and trailing spaces.
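A small sketch of this fix, assuming the keys live in a pandas DataFrame column named "country_code" (an illustrative name):

import pandas as pd

df = pd.DataFrame({"country_code": ["FR ", " FR", "FR"]})

# Plain Python strings: strip() removes leading and trailing whitespace.
print("FR ".strip() == "FR")            # True

# For a whole column, the vectorized equivalent is .str.strip().
df["country_code"] = df["country_code"].str.strip()
print(df["country_code"].unique())      # ['FR']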
FIXING CAPITAL LETTER MISMATCHES:
 Capital letter mismatches are common. Most programming languages make a
distinction between “Brazil” and “brazil”.
 In this case you can solve the problem by applying a function that returns both
strings in lowercase, such as .lower() in Python.
 "Brazil".lower() == "brazil".lower() should result in true.
Impossible values and sanity checks:
 Sanity checks are another valuable type of data check. Here we check the value
against physically or theoretically impossible values such as people taller than 3
meters or someone with an age of 299 years.
 Sanity checks can be directly expressed with rules: check = 0 <= age <= 120
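A minimal sketch of such a sanity check, assuming an "age" column in a pandas DataFrame (both are illustrative):

import pandas as pd

df = pd.DataFrame({"age": [25, 42, 299, -3, 61]})

# Flag values outside the physically plausible range 0..120.
valid = df["age"].between(0, 120)
print(df[~valid])                       # rows with impossible ages (299 and -3)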
Outliers:
 An outlier is an observation that seems to be distant from other observations or,
more specifically, one observation that follows a different logic or generative
process than the other observations.
 The easiest way to find outliers is to use a plot or a table with the minimum and
maximum values.
 An example is shown in figure 2.6. Distribution plots are helpful in detecting
outliers and helping you understand the variable.



 The plot on the top shows no outliers, whereas the plot on the bottom shows
possible outliers on the upper side when a normal distribution is expected. The
normal distribution, or Gaussian distribution, is the most common distribution
in natural sciences. It shows most cases occurring around the average of the
distribution and the occurrences decrease when further away from it. The high
values in the bottom graph can point to outliers when assuming a normal
distribution. As we saw earlier with the regression example, outliers can
gravely influence your data modeling, so investigate them first.
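A sketch of spotting a possible outlier with summary statistics and a histogram; the synthetic "income" data, pandas, and matplotlib are assumptions for illustration:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
income = pd.Series(np.append(rng.normal(50_000, 8_000, 1_000), [400_000]))

print(income.describe())                # the max already hints at the extreme value

income.plot(kind="hist", bins=50, title="Income distribution")
plt.show()                              # the isolated bar far to the right is the suspect outlier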
Dealing with missing values:
 Missing values aren’t necessarily wrong, but still we need to handle them
separately; certain modeling techniques can’t handle missing values. They
might be an indicator that something went wrong in your data collection or
that an error happened in the ETL process.



 Common techniques data scientists use are listed in table 2.4.
Table 2.4. An overview of techniques to handle missing data

Omit the values
  Advantage: Easy to perform
  Disadvantage: You lose the information from an observation
Set value to null
  Advantage: Easy to perform
  Disadvantage: Not every modeling technique and/or implementation can handle null values
Impute a static value such as 0 or the mean
  Advantage: Easy to perform; you don't lose information from the other variables in the observation
  Disadvantage: Can lead to false estimations from a model
Impute a value from an estimated or theoretical distribution
  Advantage: Does not disturb the model as much
  Disadvantage: Harder to execute; you make data assumptions
Modeling the value (nondependent)
  Advantage: Does not disturb the model too much
  Disadvantage: Can lead to too much confidence in the model; can artificially raise dependence among the variables; harder to execute; you make data assumptions
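A minimal sketch of a few techniques from table 2.4, assuming a pandas DataFrame with a numeric "weight" column that has missing values (illustrative names):

import numpy as np
import pandas as pd

df = pd.DataFrame({"weight": [70.0, np.nan, 82.5, np.nan, 65.0]})

dropped = df.dropna(subset=["weight"])                   # omit the observations
filled_zero = df["weight"].fillna(0)                     # impute a static value
filled_mean = df["weight"].fillna(df["weight"].mean())   # impute the column mean
print(dropped, filled_zero, filled_mean, sep="\n")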

Deviations from a code book:


 Detecting errors in larger data sets against a code book or against standardized
values can be done with the help of set operations.
 A code book is a description of your data, a form of metadata. It contains things
such as the number of variables per observation, the number of observations,



and what each encoding within a variable means. (For instance “0” equals
“negative”, “5” stands for “very positive”.)
 A code book also tells us the type of data we are looking at: is it hierarchical, a
graph, or something else?
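A small sketch of checking observed encodings against a code book with set operations; the variable and values are illustrative:

codebook_values = {"0", "1", "2", "3", "4", "5"}   # encodings the code book allows
observed_values = {"0", "2", "5", "9", "N/A"}      # encodings found in the data

# Set difference: everything observed that the code book does not allow.
unknown = observed_values - codebook_values
print(unknown)                                     # {'9', 'N/A'} -> values to investigate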
Correct errors as early as possible:
 A good practice is to correct data errors as early as possible in the data
collection. Retrieving data is a difficult task, and organizations spend millions of
dollars on it in the hope of making better decisions.
 The data collection process is error-prone, and in a big organization it involves
many steps and teams.
 Data should be cleansed when acquired for many reasons:
1. Not everyone spots the data anomalies. Decision-makers may make costly
mistakes on information based on incorrect data from applications that fail to
correct for the faulty data.
2. If errors are not corrected early on in the process, the cleansing will have to be
done for every project that uses that data.
3. Data errors may point to a business process that isn’t working as designed.
4. Data errors may point to defective equipment, such as broken transmission
lines and defective sensors.
5. Data errors can point to bugs in software or in the integration of software that
may be critical to the company.
As a final remark: always keep a copy of your original data (if possible). Sometimes you start cleaning data but you'll make mistakes: impute variables in the wrong way, delete outliers that had interesting additional information, or alter data as the result of an initial misinterpretation. If you keep a copy, you get to try again. For "flowing data" that's manipulated at the time of arrival, this isn't always possible and you'll have to accept a period of tweaking before you get to use the data you are capturing. One of the more difficult things, however, isn't the cleansing of individual data sets; it's combining different sources into a whole that makes more sense.



Combining data from different data sources:
 Your data comes from several different places, and in this substep we focus on
integrating these different sources. Data varies in size, type, and structure,
ranging from databases and Excel files to text documents.

The different ways of combining data:


 We can perform two operations to combine information from different data
sets. The first operation is joining: enriching an observation from one table with
information from another table.
 The second operation is appending or stacking: adding the observations of one
table to those of another table.
 When you combine data, you have the option to create a new physical table or
a virtual table by creating a view. The advantage of a view is that it doesn’t
consume more disk space.
 Example: Joining tables
 Joining tables allows you to combine the information of one observation found
in one table with the information that you find in another table.
 Let’s say that the first table contains information about the purchases of a
customer and the other table contains information about the region where your
customer lives.
 Joining the tables allows you to combine the information so that you can use it
for your model, as shown in figure 2.7.

Figure 2.7. Joining two tables on the Item and Region keys



 To join tables, you use variables that represent the same object in both tables,
such as a date, a country name, or a Social Security number. These common
fields are known as keys.
 When these keys also uniquely define the records in the table they are
called primary keys. One table may have buying behavior and the other table
may have demographic information on a person.
 In figure 2.7 both tables contain the client name, and this makes it easy to
enrich the client expenditures with the region of the client. People who are
acquainted with Excel will notice the similarity with using a lookup function.
 The number of resulting rows in the output table depends on the exact join type
that you use.
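A minimal sketch of such a join in pandas, assuming the shared key column is called "client" (names are illustrative):

import pandas as pd

purchases = pd.DataFrame({"client": ["Ann", "Bob"], "spent": [120, 80]})
regions = pd.DataFrame({"client": ["Ann", "Bob"], "region": ["North", "South"]})

# An inner join keeps only clients present in both tables; other join types
# (left, right, outer) change the number of resulting rows.
enriched = purchases.merge(regions, on="client", how="inner")
print(enriched)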
Appending tables:
 Appending or stacking tables is effectively adding observations from one table
to another table. Figure 2.8 shows an example of appending tables. One table
contains the observations from the month January and the second table
contains observations from the month February.
 The result of appending these tables is a larger one with the observations from
January as well as February. The equivalent operation in set theory would be
the union, and this is also the command in SQL, the common language of
relational databases. Other set operators are also used in data science, such as
set difference and intersection.
 Figure 2.8. Appending data from tables is a common operation but requires an
equal structure in the tables being appended.
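A minimal sketch of appending two equally structured monthly tables with pandas (illustrative data):

import pandas as pd

january = pd.DataFrame({"client": ["Ann"], "sales": [100]})
february = pd.DataFrame({"client": ["Bob"], "sales": [150]})

# Equivalent to a UNION ALL in SQL: the observations of both months in one table.
year_to_date = pd.concat([january, february], ignore_index=True)
print(year_to_date)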



Using views to simulate data joins and appends:
 To avoid duplication of data, we virtually combine data with views. In the
previous example we took the monthly data and combined it in a new physical
table. The problem is that we duplicated the data and therefore needed more
storage space.
 Imagine that every table consists of terabytes of data; then it becomes
problematic to duplicate the data. For this reason, the concept of a view was
invented.
 A view behaves as if you’re working on a table, but this table is nothing but a
virtual layer that combines the tables for us.
 Figure 2.9 shows how the sales data from the different months is combined
virtually into a yearly sales table instead of duplicating the data.
 Figure 2.9. A view helps you combine data without replication.
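A sketch of simulating the append with a view instead of a new physical table, using Python's built-in sqlite3 module; the table and column names are illustrative:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales_jan (client TEXT, amount REAL)")
con.execute("CREATE TABLE sales_feb (client TEXT, amount REAL)")
con.execute("INSERT INTO sales_jan VALUES ('Ann', 100)")
con.execute("INSERT INTO sales_feb VALUES ('Bob', 150)")

# The view combines both months virtually; no data is duplicated on disk.
con.execute("""
    CREATE VIEW sales_year AS
    SELECT * FROM sales_jan
    UNION ALL
    SELECT * FROM sales_feb
""")
print(con.execute("SELECT * FROM sales_year").fetchall())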

Transforming data:
 Certain models require their data to be in a certain shape. Now that you’ve
cleansed and integrated the data, this is the next task you’ll perform:
transforming your data so it takes a suitable form for data modeling.
 Transforming data: Relationships between an input variable and an output
variable aren’t always linear.



 Figure 2.11 shows how transforming the input variables greatly simplifies the estimation problem.
Figure 2.11. Transforming x to log x makes the relationship between x and y linear (right), compared with the non-log x (left).
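A small sketch of this log transformation with NumPy (the x and y values are made up to show the idea):

import numpy as np

x = np.array([1, 10, 100, 1000, 10000], dtype=float)
y = np.array([0.1, 1.1, 2.0, 2.9, 4.1])

x_log = np.log(x)                            # transform the input variable
slope, intercept = np.polyfit(x_log, y, 1)   # a simple linear fit now works well
print(slope, intercept)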

Reducing the number of variables:


 Sometimes you have too many variables and need to reduce the number
because they don’t add new information to the model. Having too many
variables in your model makes the model difficult to handle, and certain
techniques don’t perform well when you overload them with too many input
variables.
 For instance, all the techniques based on a Euclidean distance perform well only
up to 10 variables.
EUCLIDEAN DISTANCE:
 The Euclidean distance between two points in a two-dimensional plane is calculated as distance = sqrt((x1 - x2)^2 + (y1 - y2)^2).
 If you want to expand this distance calculation to more dimensions, add the coordinates of the point within those higher dimensions to the formula. For three dimensions we get distance = sqrt((x1 - x2)^2 + (y1 - y2)^2 + (z1 - z2)^2).
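A small sketch of the formula in Python, written so it works for any number of dimensions:

import math

def euclidean(p, q):
    # Square root of the summed squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean((0, 0), (3, 4)))         # 5.0 in two dimensions
print(euclidean((0, 0, 0), (1, 2, 2)))   # 3.0 in three dimensions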



 Data scientists use special methods to reduce the number of variables but retain
the maximum amount of data.
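One example of such a method is principal component analysis (PCA); the notes don't name a specific technique, so this Scikit-learn sketch on random data is purely illustrative:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))           # 200 observations, 10 variables

pca = PCA(n_components=3)                # keep 3 components instead of 10 variables
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_)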
Turning variables into dummies:
 Variables can be turned into dummy variables. Dummy variables can take only two values: true (1) or false (0).
 They're used to indicate the absence or presence of a categorical effect that may explain the observation.
 In this case we will make separate columns for the classes stored in one
variable and indicate it with 1 if the class is present and 0 otherwise.
 An example is turning one column named Weekdays into the columns Monday
through Sunday. You use an indicator to show if the observation was on a
Monday; you put 1 on Monday and 0 elsewhere.
 Turning variables into dummies is a technique that’s used in modeling and is
popular with, but not exclusive to, economists.
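A minimal sketch of creating dummies with pandas, assuming a "weekday" column (illustrative):

import pandas as pd

df = pd.DataFrame({"weekday": ["Monday", "Tuesday", "Monday"]})

# One column per class, with 1 where the class is present and 0 elsewhere.
dummies = pd.get_dummies(df["weekday"], dtype=int)
print(dummies)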



Step 4: Exploratory data analysis:
 During exploratory data analysis we take a deep dive into the data. Information becomes much easier to grasp when shown in a picture; therefore we mainly use graphical techniques to gain an understanding of the data and the interactions between variables.
 This phase is about exploring data, so keeping your mind open and your eyes
peeled is essential during the exploratory data analysis phase.
 The goal isn’t to cleanse the data, but it’s common that you’ll still discover
anomalies you missed before, forcing you to take a step back and fix them.

Figure 2.14. Step 4: Data exploration

 The visualization techniques we use in this phase range from simple line graphs
or histograms, as shown in figure 2.15, to more complex diagrams such as
Sankey and network graphs.
 Sometimes it's useful to combine simple graphs into a composite graph to get even more insight into the data. Other times the graphs can be animated or made interactive to make them easier to understand.



 Figure 2.15. From top to bottom, a bar chart, a line plot, and a distribution are
some of the graphs used in exploratory analysis.
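A sketch of producing these simple exploratory graphs, assuming pandas and matplotlib and a small made-up sales table:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
df = pd.DataFrame({"month": range(1, 13), "sales": rng.integers(80, 120, 12)})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
df.plot(kind="bar", x="month", y="sales", ax=axes[0], title="Bar chart")
df.plot(kind="line", x="month", y="sales", ax=axes[1], title="Line plot")
df["sales"].plot(kind="hist", ax=axes[2], title="Distribution")
plt.tight_layout()
plt.show()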



Step 5: Build the models:
 With clean data in place and a good understanding of the content, we are ready
to build models with the goal of making better predictions, classifying objects,
or gaining an understanding of the system that we are modeling.
 This phase is much more focused than the exploratory analysis step, because
you know what you're looking for and what you want the outcome to be.

 Figure 2.21. Step 5: Data modeling

 The techniques we will use now are borrowed from the field of machine
learning, data mining, and/or statistics.
 Building a model is an iterative process. The way we build our model depends
on whether we go with classic statistics or the somewhat more recent machine
learning school, and on the type of technique we want to use. Most models consist
of the following main steps:
a. Selection of a modeling technique and variables to enter in the model
b. Execution of the model
c. Diagnosis and model comparison



1. Model and variable selection:
 We will need to select the variables we want to include in our model and a
modeling technique. Our findings from the exploratory analysis should already
give a fair idea of what variables will help us to construct a good model.
 Many modeling techniques are available, and choosing the right model for a
problem requires judgment on our part.
 We will need to consider model performance and whether your project meets
all the requirements to use our model, as well as other factors:
 Must the model be moved to a production environment and, if so, would it be
easy to implement?
 How difficult is the maintenance on the model: how long will it remain relevant
if left untouched?
 Does the model need to be easy to explain?
 When the thinking is done, it’s time for action.
2. Model execution
 Once we have chosen a model we will need to implement it in code.
 Luckily, most programming languages have ready-made libraries; in Python, for example, StatsModels and Scikit-learn implement several of the most popular techniques.
 Coding a model is a nontrivial task in most cases, so having these libraries
available can speed up the process.
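A minimal sketch of model execution with Scikit-learn; the synthetic data and the choice of linear regression are assumptions standing in for whichever technique was selected:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)     # estimated parameters of the fitted model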
3. Model diagnostics and model comparison:
 You’ll be building multiple models from which you then choose the best one
based on multiple criteria.
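A sketch of comparing two candidate models on held-out data; the synthetic data and the use of mean squared error as the criterion are illustrative assumptions:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
for model in (LinearRegression(), DecisionTreeRegressor(random_state=0)):
    model.fit(X_train, y_train)
    error = mean_squared_error(y_test, model.predict(X_test))
    print(type(model).__name__, error)   # choose the model with the lowest error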



Step 6: Presenting findings and building applications on top of
them
 After we have successfully analyzed the data and built a well-performing
model, we are ready to present our findings to the world.
 This is an exciting part; all our hours of hard work have paid off and we can
explain what we found to the stakeholders.
 Step 6: Presentation and automation

 Sometimes people get so excited about our work that we will need to repeat it
over and over again because they value the predictions of our models or the
insights that we produced.
 For this reason, we need to automate our models. This doesn’t always mean
that we have to redo all of our analysis all the time.
 Sometimes it’s sufficient that we implement only the model scoring; other times
we might build an application that automatically updates reports, Excel
spreadsheets, or PowerPoint presentations.
 The last stage of the data science process is where our soft skills will be most
useful.

