Unit-2 - DS Notes
Syllabus:
Data Science Process: Overview of the Data Science Process, defining research goals
and creating a project charter, Retrieving data, Cleansing, integrating and
transforming data, exploratory data analysis, build the models, presenting findings
and building applications on top of them.
Create a project charter: After you have a good understanding of the business
problem, try to get a formal agreement on the deliverables. All this information
is best collected in a project charter; for any significant project, a charter
is mandatory.
A project charter requires teamwork, and your input covers at least the
following:
1. A clear research goal
2. The project mission and context
3. How you’re going to perform your analysis
4. What resources you expect to use
5. Proof that it’s an achievable project, or proof of concepts
6. Deliverables and a measure of success
7. A timeline
Your client can use this information to make an estimation of the project
costs and the data and people required for your project to become a
success.
Data can be stored in many forms, ranging from simple text files to tables in a
database. The objective now is acquiring all the data you need.
Start with data stored within the company: first you should assess the relevance
and quality of the data that’s readily available within your company.
Most companies have a program for maintaining key data, so much of the
cleaning work may already be done. This data can be stored in official data
repositories such as databases, data marts, data warehouses, and data
lakes maintained by a team of IT professionals.
The primary goal of a database is data storage, while a data warehouse is
designed for reading and analyzing that data.
A data mart is a subset of a data warehouse that serves a specific business
unit. Data lakes contain data in its natural or raw format.
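Retrieving data from such a repository is often just a query against a database. A minimal sketch using Python's standard sqlite3 module, with a hypothetical sales table and column names invented for illustration:

```python
import sqlite3

# Build a tiny stand-in "official data repository" in memory
# (table and column names here are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("pen", "North", 120.0), ("ink", "South", 80.0)],
)

# Acquiring the data you need is often a simple query against such a store.
rows = conn.execute("SELECT item, region, amount FROM sales").fetchall()
print(rows)
conn.close()
```

In practice the connection string would point to the company database rather than an in-memory store, but the acquisition step looks the same.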
Table 2.1. A list of open-data providers that should get you started
General solution: try to fix the problem early in the data acquisition chain,
or else fix it in the program.
Errors pointing to false values within one data set, with possible solutions:
1. Mistakes during data entry: manual overrules
2. Redundant white space: use string functions
3. Impossible values: manual overrules
4. Missing values: remove the observation or value
5. Outliers: validate and, if erroneous, treat as a missing value (remove or insert)
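A short sketch of how several of these fixes look in practice, assuming the pandas library and a small made-up data set whose column names and thresholds are purely illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data illustrating the error types listed above.
df = pd.DataFrame({
    "name": ["  Alice", "Bob  ", "Carol"],    # redundant white space
    "age": [34, -2, 29],                      # -2 is an impossible value
    "income": [52000.0, np.nan, 1_000_000.0], # missing value and an outlier
})

# Redundant white space -> use string functions
df["name"] = df["name"].str.strip()

# Impossible values -> manual overrule (treated as missing here)
df.loc[df["age"] < 0, "age"] = np.nan

# Outliers -> validate and, if erroneous, treat as a missing value
df.loc[df["income"] > 500_000, "income"] = np.nan

# Missing values -> remove the observation (one of several options)
clean = df.dropna()
print(clean)
```

Removing observations is only one option for missing values; imputing a plausible value is often preferable when data is scarce.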
Figure 2.7. Joining two tables on the Item and Region keys
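The join in figure 2.7 can be sketched with pandas, using two small hypothetical tables keyed on Item and Region:

```python
import pandas as pd

# Hypothetical tables mirroring figure 2.7, joined on the Item and Region keys.
sales = pd.DataFrame({
    "Item": ["pen", "ink", "pen"],
    "Region": ["North", "South", "South"],
    "Units": [10, 5, 7],
})
prices = pd.DataFrame({
    "Item": ["pen", "ink", "pen", "ink"],
    "Region": ["North", "North", "South", "South"],
    "Price": [1.2, 3.0, 1.1, 2.8],
})

# An inner join keeps only key combinations present in both tables.
joined = sales.merge(prices, on=["Item", "Region"], how="inner")
print(joined)
```

Switching `how` to `"left"` or `"outer"` changes which non-matching rows survive the join.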
Transforming data:
Certain models require their data to be in a certain shape. Now that you've
cleansed and integrated the data, the next task is transforming your data so
it takes a form suitable for data modeling.
Relationships between an input variable and an output variable aren't always
linear; transforming one of the variables can sometimes make the relationship
linear so that simpler models apply.
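A minimal sketch of such a transformation, assuming NumPy and synthetic data: when y grows exponentially with x, taking log(y) linearizes the relationship, so a straight-line fit recovers the coefficient.

```python
import numpy as np

# Synthetic data: y = e^(2x) is non-linear in x.
x = np.arange(1, 6)
y = np.exp(2.0 * x)

# After the transform, log(y) = 2x, which is linear in x.
log_y = np.log(y)
slope = np.polyfit(x, log_y, 1)[0]  # straight-line fit recovers 2.0
print(round(slope, 6))
```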
The visualization techniques we use in this phase range from simple line graphs
or histograms, as shown in figure 2.15, to more complex diagrams such as
Sankey and network graphs.
Sometimes it's useful to compose a composite graph from simple graphs to get
even more insight into the data. Other times the graphs can be animated or
made interactive to make exploration easier.
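A simple histogram, one of the basic EDA graphs mentioned above, can be sketched with matplotlib (assuming it is installed; the data here is synthetic):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data for illustration: 1000 draws from a normal distribution.
rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=1000)

fig, ax = plt.subplots()
counts, bins, _ = ax.hist(values, bins=20)  # histogram of the sample
ax.set_xlabel("value")
ax.set_ylabel("frequency")
fig.savefig("histogram.png")
```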
The techniques we will use now are borrowed from the fields of machine
learning, data mining, and statistics.
Building a model is an iterative process. The way we build our model depends
on whether we follow classic statistics or the somewhat more recent machine
learning school, and on the type of technique we want to use. Most models
consist of the following main steps:
a. Selection of a modeling technique and variables to enter in the model
b. Execution of the model
c. Diagnosis and model comparison
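The three steps above can be sketched with a classic-statistics technique, ordinary least squares, using NumPy and synthetic data (all values here are made up for illustration):

```python
import numpy as np

# (a) Selection of a technique and variables: simple linear regression
#     with one input variable, on synthetic data y = 3x + 5 + noise.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 3.0 * x + 5.0 + rng.normal(0, 0.5, 100)

# (b) Execution of the model: fit the coefficients via least squares.
slope, intercept = np.polyfit(x, y, 1)

# (c) Diagnosis and model comparison: the residual spread gives a simple
#     measure for comparing this model against alternatives.
residuals = y - (slope * x + intercept)
rse = residuals.std()
print(round(slope, 2), round(intercept, 2), round(rse, 2))
```

In the machine learning school the same loop holds, only with different techniques (for example, tree ensembles) and diagnostics (for example, hold-out accuracy).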
Sometimes others value the predictions of our models or the insights we
produced so much that we need to repeat the work over and over again.
For this reason, we need to automate our models. This doesn't always mean
that we have to redo the entire analysis every time.
Sometimes it's sufficient to implement only the model scoring; other times
we might build an application that automatically updates reports, Excel
spreadsheets, or PowerPoint presentations.
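"Implementing only the model scoring" can be sketched as: train once, persist the fitted parameters, and let a later scoring job reload them to score new records. A minimal sketch with NumPy and Python's pickle module, on synthetic data:

```python
import pickle

import numpy as np

# Training happens once, offline (synthetic data: y = 2x + 1 + noise).
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, 50)
params = np.polyfit(x, y, 1)

# Persist the fitted parameters (here to bytes; normally to a file).
blob = pickle.dumps(params)

# Later, the automated scoring job only reloads and predicts.
loaded = pickle.loads(blob)
new_record = 4.0
score = np.polyval(loaded, new_record)
print(round(score, 1))
```

The scoring job never repeats the training step, which is what keeps the automated pipeline cheap to run.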
The last stage of the data science process is where our soft skills will be most
useful.