UNIT I
INTRODUCTION TO DATA SCIENCE
Syllabus
Introduction to Data Science – Basic Data Analytics using R – R Graphical User Interfaces – Data Import and Export – Attribute and Data Types – Descriptive Statistics – Exploratory Data Analysis – Visualization Before Analysis – Dirty Data – Visualizing a Single Variable – Examining Multiple Variables – Data Exploration Versus Presentation.
Several industries have led the way in developing their ability to gather and exploit data:
● Credit card companies monitor every purchase their customers make and can identify
fraudulent purchases with a high degree of accuracy using rules derived by processing billions of
transactions.
● Mobile phone companies analyze subscribers’ calling patterns to determine, for example,
whether a caller’s frequent contacts are on a rival network. If that rival network is offering an
attractive promotion that might cause the subscriber to defect, the mobile phone company can
proactively offer the subscriber an incentive to remain in her contract.
● For companies such as LinkedIn and Facebook, data itself is their primary product. The
valuations of these companies are heavily derived from the data they gather and host, which
contains more and more intrinsic value as the data grows.
Definition
A widely cited definition of Big Data comes from the 2011 McKinsey Global Institute report:
Big Data is data whose scale, distribution, diversity, and/or timeliness require the use of new
technical architectures and analytics to enable insights that unlock new sources of business
value.
McKinsey’s definition of Big Data implies that organizations will need new data architectures and analytic sandboxes, new tools, new analytical methods, and an integration of multiple skills into the new role of the data scientist. Figure 1-1 highlights several sources of the Big Data deluge. Big Data can arrive in several forms; two of these categories are described below:
● Semi-structured data: Textual data files with a discernible pattern that enables parsing (such as Extensible Markup Language [XML] data files that are self-describing and defined by an XML schema).
● Quasi-structured data: Textual data with erratic data formats that can be formatted with effort, tools, and time (for instance, web clickstream data that may contain inconsistencies in data values and formats).
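The distinction can be made concrete in R, the language used throughout this unit. The snippet below is a minimal sketch, assuming the xml2 package is available; the XML string and the clickstream line are made-up illustrations.

library(xml2)

# Semi-structured: XML is self-describing, so a parser can rely on its tags.
order_xml <- read_xml("<order><item>book</item><qty>2</qty></order>")
xml_text(xml_find_first(order_xml, "//item"))    # "book"
xml_integer(xml_find_first(order_xml, "//qty"))  # 2

# Quasi-structured: a clickstream line has a pattern, but the format is
# erratic and needs effort (string handling, cleanup) to extract fields.
click <- "2023-05-01 10:15:22|user42 ; /products/view?id=381"
trimws(strsplit(click, "[|;]")[[1]])             # timestamp, user id, URL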
The figure shows a typical data architecture and several of the challenges it presents to data scientists and others trying to do advanced analytics. This section examines the flow of data to the data scientist and how this individual fits into the process of getting data to analyze on projects.
1. For data sources to be loaded into the enterprise data warehouse (EDW), data needs to be well understood,
structured, and normalized with the appropriate data type definitions. Although this kind
of centralization enables security, backup, and failover of highly critical data, it also
means that data typically must go through significant preprocessing and checkpoints
before it can enter this sort of controlled environment, which does not lend itself to data
exploration and iterative analytics.
2. As a result of this level of control on the EDW, additional local systems may emerge in the
form of departmental warehouses and local data marts that business users create to
accommodate their need for flexible analysis. These local data marts may not have the
same constraints for security and structure as the main EDW and allow users to do some
level of more in-depth analysis. However, these one-off systems reside in isolation, often
are not synchronized or integrated with other data stores, and may not be backed up.
3. Once in the data warehouse, data is read by additional applications across the enterprise
for BI and reporting purposes. These are high-priority operational processes getting
critical data feeds from the data warehouses and repositories.
4. At the end of this workflow, analysts get data provisioned for their downstream analytics.
Because users generally are not allowed to run custom or intensive analytics on
production databases, analysts create data extracts from the EDW to analyze data offline
in R or other local analytical tools. Many times these tools are limited to in-memory
analytics on desktops analyzing samples of data, rather than the entire population of a
dataset. Because these analyses are based on data extracts, they reside in a separate
location, and the results of the analysis—and any insights on the quality of the data or
anomalies—rarely are fed back into the main data repository.
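The offline-extract workflow in step 4 can be sketched in R. This is only an illustration; the file name edw_extract.csv and its columns are assumptions, not part of any particular EDW.

# Read a data extract exported from the EDW into local memory.
extract <- read.csv("edw_extract.csv", stringsAsFactors = FALSE)

# Desktop tools often cannot hold the full population, so analysts work
# on a reproducible sample of the rows instead.
set.seed(42)
rows <- sample(nrow(extract), size = min(10000, nrow(extract)))
extract_sample <- extract[rows, ]

summary(extract_sample)                          # quick offline profiling
write.csv(extract_sample, "extract_sample.csv", row.names = FALSE)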
Data acquisition
Representing, transforming, grouping, and linking the data are all tasks that need to occur before
the data can be profitably analyzed, and these are all tasks in which the data scientist is actively
involved.
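As a small sketch of the grouping and linking tasks mentioned above, consider two made-up data frames (the column names are illustrative assumptions):

customers <- data.frame(cust_id = 1:3,
                        region  = c("North", "South", "North"))
orders    <- data.frame(cust_id = c(1, 1, 2, 3, 3, 3),
                        amount  = c(20, 35, 15, 50, 10, 5))

# Linking: join each order to its customer on the shared key.
linked <- merge(orders, customers, by = "cust_id")

# Grouping: total order amount per region.
aggregate(amount ~ region, data = linked, FUN = sum)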
Data analysis
The analysis phase is where data scientists are most heavily involved. In this context we are
using analysis to include summarization of the data, using portions of data (samples) to make
inferences about the larger context, and visualization of the data by presenting it in tables,
graphs, and even animations.
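These three activities can be illustrated in R on the built-in mtcars dataset; this is a minimal sketch, not a full analysis.

data(mtcars)

# Summarization: descriptive statistics for every variable.
summary(mtcars)

# Inference: use a sample of cars to estimate the mean fuel efficiency of
# the larger population, with a confidence interval.
set.seed(1)
mpg_sample <- sample(mtcars$mpg, size = 15)
t.test(mpg_sample)$conf.int

# Visualization: tables and graphs.
table(mtcars$cyl)                                # frequency table of cylinders
hist(mtcars$mpg, main = "Miles per gallon", xlab = "mpg")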
Data archiving
Finally, the data scientist must become involved in the archiving of the data. Preservation of collected data in a form that makes it highly reusable - what you might think of as "data curation" - is a difficult challenge because it is so hard to anticipate all of the future uses of the data.
2. Retrieving data
Retrieving data means checking the existence, quality, and access to the data. Data can also be delivered by third-party companies and can take many forms, ranging from Excel spreadsheets to different types of databases.
3. Data cleansing
Data collection is an error-prone process; in this phase you enhance the quality of the data and
prepare it for use in subsequent steps. This phase consists of three subphases: data cleansing
removes false values from a data source and inconsistencies across data sources, data
integration enriches data sources by combining information from multiple data sources, and
data transformation ensures that the data is in a suitable format for use in your models.
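The three subphases can be sketched in R on a small made-up sales table; the column names and correction rules below are illustrative assumptions.

sales <- data.frame(cust_id = c(1, 2, 2, 3),
                    country = c("USA", "usa", "usa", "U.S.A"),
                    amount  = c(120, -5, 80, 300))

# Data cleansing: harmonize inconsistent spellings and drop impossible values.
sales$country <- toupper(gsub("[^A-Za-z]", "", sales$country))
sales <- sales[sales$amount >= 0, ]

# Data integration: enrich the sales with attributes from a second source.
customers <- data.frame(cust_id = 1:3,
                        segment = c("retail", "retail", "corporate"))
sales <- merge(sales, customers, by = "cust_id")

# Data transformation: put the amount on a log scale for later modeling.
sales$log_amount <- log(sales$amount + 1)
sales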
4. Data exploration
Data exploration is concerned with building a deeper understanding of your data. You try to
understand how variables interact with each other, the distribution of the data, and whether
there are outliers. To achieve this you mainly use descriptive statistics, visual techniques, and
simple modeling. This step often goes under the abbreviation EDA for Exploratory Data Analysis.
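A minimal EDA sketch in R, again using the built-in mtcars dataset:

data(mtcars)

summary(mtcars$mpg)                        # distribution of a single variable
boxplot(mpg ~ cyl, data = mtcars,          # interaction with a grouping variable
        xlab = "cylinders", ylab = "mpg")  # and a quick visual outlier check
cor(mtcars[, c("mpg", "wt", "hp")])        # how key variables move together
hist(mtcars$wt, main = "Car weight", xlab = "weight (1000 lbs)")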
5. Data modeling or model building
In this phase you use models, domain knowledge, and insights about the data you found in the
previous steps to answer the research question. You select a technique from the fields of
statistics, machine learning, operations research, and so on. Building a model is an iterative step
between selecting the variables for the model, executing the model, and model diagnostics.
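The iterative loop of selecting variables, executing the model, and checking diagnostics can be sketched in R with a simple linear model on mtcars; the choice of variables here is purely illustrative.

data(mtcars)

# Variable selection and execution: predict fuel efficiency from weight
# and horsepower.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Diagnostics: coefficient estimates, fit quality, residual plots.
summary(fit)
par(mfrow = c(2, 2))
plot(fit)

# If the diagnostics look poor, go back, change the variables, and refit.
fit2 <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)
anova(fit, fit2)    # does the extra variable improve the model?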
6. Presentation and automation
Finally, you present the results to your business. These results can take many forms, ranging
from presentations to research reports. Sometimes you’ll need to automate the execution of the
process because the business will want to use the insights you gained in another project or
enable an operational process to use the outcome from your model.
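One simple way to automate reuse of a model in R is to persist the fitted object and wrap scoring in a function; the file name and the scoring function below are hypothetical.

fit <- lm(mpg ~ wt + hp, data = mtcars)
saveRDS(fit, "mpg_model.rds")              # persist the fitted model

# An operational process (or another project) can later reload the model
# and score new records without rerunning the whole analysis.
score_new_cars <- function(new_data, model_path = "mpg_model.rds") {
  model <- readRDS(model_path)
  predict(model, newdata = new_data)
}
score_new_cars(data.frame(wt = 2.8, hp = 150))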
AN ITERATIVE PROCESS
The previous description of the data science process gives you the impression that you walk
through this process in a linear way, but in reality you often have to step back and rework
certain findings. For instance, you might find outliers in the data exploration phase that point to
data import errors. As part of the data science process you gain incremental insights, which may
lead to new questions. To prevent rework, make sure that you scope the business question
clearly and thoroughly at the start.
****************************