
COS 2402 - ESSENTIALS OF DATA SCIENCE
Unit I

UNIT 1: Introduction to Big Data and Data Science

Big Data is a collection of data that is huge in volume and growing exponentially over
time. Its size and complexity are so great that no traditional data management tool can
store or process it efficiently. In short, Big Data is still data, but at an enormous scale.

Big Data Overview


Data is created constantly, and at an ever-increasing rate. Mobile phones, social
media, and imaging technologies used for medical diagnosis all create new data that
must be stored somewhere for some purpose. Devices and sensors
automatically generate diagnostic information that needs to be stored and processed in
real time. Merely keeping up with this huge influx of data is difficult, but substantially
more challenging is analyzing vast amounts of it, especially when it does not conform to
traditional notions of data structure, to identify meaningful patterns and extract useful
information. These challenges of the data deluge present the opportunity to transform
business, government, science, and everyday life.

Several industries have led the way in developing their ability to gather and exploit data:

• Credit card companies monitor every purchase their customers make and can
identify fraudulent purchases with a high degree of accuracy using rules derived
by processing billions of transactions.

• Mobile phone companies analyze subscribers' calling patterns to determine, for
example, whether a caller's frequent contacts are on a rival network. If that rival
network is offering an attractive promotion that might cause the subscriber to
defect, the mobile phone company can proactively offer the subscriber an
incentive to remain in her contract.

• For companies such as LinkedIn and Facebook, data itself is their primary
product. The valuations of these companies are heavily derived from the data they
gather and host, which contains more and more intrinsic value as the data grows.
Three attributes stand out as defining Big Data characteristics:

1. Huge volume of data: Rather than thousands or millions of rows, Big Data can be
billions of rows and millions of columns.
2. Complexity of data types and structures: Big Data reflects the variety of new data
sources, formats, and structures, including digital traces being left on the web and
other digital repositories for subsequent analysis.

3. Speed of new data creation and growth: Big Data can describe high velocity data,
with rapid data ingestion and near real time analysis.

Data Structures
Big data can come in multiple forms, including structured and non-structured data
such as financial data, text files, multimedia files, and genetic mappings. Contrary to
much of the traditional data analysis performed by organizations, most of the Big Data is
unstructured or semi-structured in nature, which requires different techniques and tools
to process and analyze. [2] Distributed computing environments and massively parallel
processing (MPP) architectures that enable parallelized data ingest and analysis are the
preferred approach to process such complex data.
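As a rough illustration of the parallelization idea (a toy sketch, not any particular MPP product), the following Python snippet partitions a dataset into chunks and analyzes the chunks concurrently before combining the partial results; the file name transactions.csv and the per-chunk sum are hypothetical.

# Toy sketch of parallelized analysis: split the data into chunks and process
# them concurrently, the way an MPP system spreads work across nodes.
# Assumes a hypothetical "transactions.csv" with a numeric "amount" column.
import csv
from multiprocessing import Pool

def chunk_total(rows):
    # Per-chunk computation; each worker handles one partition of the data.
    return sum(float(r["amount"]) for r in rows)

if __name__ == "__main__":
    with open("transactions.csv", newline="") as f:
        rows = list(csv.DictReader(f))
    size = max(1, len(rows) // 4)                       # four roughly equal partitions
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    with Pool(processes=4) as pool:
        partial_totals = pool.map(chunk_total, chunks)  # analyze chunks in parallel
    print("total amount:", sum(partial_totals))         # combine partial results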

There are four types of data structures, with 80-90% of future data growth expected to
come from unstructured data types. [2] Though different, the four are commonly mixed.
For example, a classic Relational Database Management System (RDBMS) may store call
logs for a software support call center. The RDBMS may store characteristics of the
support calls as typical structured data, with attributes such as time stamps, machine type,
problem type, and operating system. In addition, the system will likely have unstructured,
quasi-, or semi-structured data, such as free-form call log information taken from an e-mail
ticket describing the problem, customer chat history, a transcript of a phone call describing
the technical problem and its solution, or an audio file of the phone call conversation. Many
insights could be extracted from this unstructured, quasi-, and semi-structured call center
data.

Although analyzing structured data tends to be the most familiar technique, a different
technique is required to meet the challenges of analyzing semi-structured data (shown as
XML), quasi-structured data (shown as a clickstream), and unstructured data.
Here are examples of how each of the four main types of data structures may look.

Structured data: Data containing a defined data type, format, and structure (that is,
transaction data, online analytical processing [OLAP] data cubes, traditional RDBMS,
CSV files, and even simple spreadsheets). See below figure.
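A minimal Python sketch of working with structured data, assuming a hypothetical calls.csv extract whose columns are known in advance:

# Structured data: the schema (column names and types) is fixed and known,
# so every row can be parsed the same way. "calls.csv" is a hypothetical file.
import csv

with open("calls.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Each record exposes the same named fields.
        print(row["timestamp"], row["machine_type"], row["problem_type"])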

Semi-structured data: Textual data files with a discernible pattern that enables parsing
(such as Extensible Markup Language [XML] data files that are self-describing and
defined by an XML schema).
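Because such files are self-describing, they can be parsed by walking their tags rather than by fixed column positions. A minimal sketch, using an invented support-ticket document:

# Semi-structured data: XML carries its structure in its own tags, which the
# parser discovers at read time. The document below is invented for illustration.
import xml.etree.ElementTree as ET

doc = """
<tickets>
  <ticket id="101">
    <machine_type>laptop</machine_type>
    <problem>Screen flickers after resume</problem>
  </ticket>
</tickets>
"""
root = ET.fromstring(doc)
for ticket in root.findall("ticket"):
    # Fields are located by tag name, not by a predefined column order.
    print(ticket.get("id"), ticket.findtext("problem"))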

Quasi-structured data: Textual data with erratic data formats that can be formatted with
effort, tools, and time (for instance, web clickstream data that may contain inconsistencies
in data values and formats). See below figure.
Quasi-structured data is a common phenomenon that bears closer scrutiny. Consider the
following example. A user attends the EMC World conference and subsequently runs a
Google search online to find information related to EMC and Data Science. This would
produce a URL such as https://www.google.com/#q=EMC+data+science and a list of
results, such as in the first graphic of above figure.
After doing this search, the user may choose the second link, to read more about the
headline "Data Scientist - EMC Education, Training, and Certification." This brings the user
to an emc.com site focused on this topic and a new URL,
https://education.emc.com/guest/campaign/data_science.aspx, that displays the page
shown as (2) in below figure. Arriving at this site, the user may decide to click to learn
more about the process of becoming certified in data science. The user chooses a link
toward the top of the page on Certifications, bringing the user to a new URL:
https://education.emc.com/guest/certification/framework/stf/data_science.aspx, which is
(3) in below figure.

Visiting these three websites adds three URLs to the log files monitoring the user's
computer or network use. These three URLs are:

https://www.google.com/#q=EMC+data+science

https://education.emc.com/guest/campaign/data_science.aspx

https://education.emc.com/guest/certification/framework/stf/data_science.aspx
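A hedged sketch of how clickstream entries like these might be wrangled into something analyzable: it tolerates the inconsistent formats, pulling the host, path, and any search terms out of each URL:

# Quasi-structured data: clickstream URLs have some regularity (scheme, host,
# path, fragment) but inconsistent conventions, so parsing needs extra care.
from urllib.parse import urlparse, parse_qs

clickstream = [
    "https://www.google.com/#q=EMC+data+science",
    "https://education.emc.com/guest/campaign/data_science.aspx",
    "https://education.emc.com/guest/certification/framework/stf/data_science.aspx",
]

for url in clickstream:
    parts = urlparse(url)
    # Google places the search terms in the fragment (after '#'), not the query string.
    terms = parse_qs(parts.fragment).get("q", [""])[0]
    print(parts.netloc, parts.path, terms or "(no search terms)")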

Unstructured data: Data that has no inherent structure, which may include text
documents, PDFs, images, and video.
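Since unstructured text offers no fields to query, analysis usually begins by imposing some structure on it, for example tokenizing the text and counting terms. A minimal sketch over an invented call transcript:

# Unstructured data: free text has no predefined fields, so a common first step
# is to derive simple structure such as token counts. The transcript is invented.
from collections import Counter
import re

transcript = (
    "Customer reports the laptop screen flickers after resume. "
    "Agent suggests updating the display driver; customer confirms the fix."
)
tokens = re.findall(r"[a-z']+", transcript.lower())  # crude tokenization
print(Counter(tokens).most_common(5))                # most frequent terms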

Current Analytical Architecture

As described earlier, Data Science projects need workspaces that are purpose-built for
experimenting with data, with flexible and agile data architectures. Most organizations
still have data warehouses that provide excellent support for traditional reporting and
simple data analysis activities but unfortunately have a more difficult time supporting
more robust analyses. This section examines a typical analytical data architecture that
may exist within an organization.

The following figure shows a typical data architecture and several of the challenges it
presents to data scientists and others trying to do advanced analytics. This section
examines the data flow to the Data Scientist and how this individual fits into the process
of getting data to analyze on projects.

1. For data sources to be loaded into the data warehouse, data needs to be well
understood, structured, and normalized with the appropriate data type definitions.
Although this kind of centralization enables security, backup, and failover of
highly critical data, it also means that data typically must go through significant
preprocessing and checkpoints before it can enter this sort of controlled
environment, which does not lend itself to data exploration and iterative analytics.

2. As a result of this level of control on the EDW (Enterprise Data Warehouse), additional local systems may
emerge in the form of departmental warehouses and local data marts that business
users create to accommodate their need for flexible analysis. These local data
marts may not have the same constraints for security and structure as the main
EDW and allow users to do some level of more in-depth analysis. However, these
one-off systems reside in isolation, often are not synchronized or integrated with
other data stores, and may not be backed up.

3. Once in the data warehouse, data is read by additional applications across the
enterprise for BI and reporting purposes. These are high-priority operational
processes getting critical data feeds from the data warehouses and repositories.
4. At the end of this workflow, analysts get data provisioned for their downstream
analytics.

Because users generally are not allowed to run custom or intensive analytics on
production databases, analysts create data extracts from the EDW to analyze data offline
in R or other local analytical tools. Many times these tools are limited to in-memory
analytics on desktops, analyzing samples of data rather than the entire population of a
dataset. Because these analyses are based on data extracts, they reside in a separate
location, and the results of the analysis, along with any insights on the quality of the data
or anomalies, are rarely fed back into the main data repository.
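A hedged sketch of this extract-and-sample workflow. SQLite is used purely as a stand-in for whatever warehouse the analyst actually queries, and the sales table and its columns are invented:

# Typical offline workflow: pull a bounded extract from the warehouse, then
# analyze a sample of it entirely in local memory.
# SQLite stands in for the real EDW; "sales" and its columns are hypothetical.
import random
import sqlite3
import statistics

conn = sqlite3.connect("warehouse_extract.db")
rows = conn.execute(
    "SELECT customer_id, amount FROM sales LIMIT 100000"   # bounded extract, not the full table
).fetchall()
conn.close()

sample = random.sample(rows, min(5000, len(rows)))          # in-memory sample of the extract
amounts = [amount for _, amount in sample]
print("sample mean:", statistics.mean(amounts))
print("sample stdev:", statistics.stdev(amounts))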
Because of the rigorous validation and data structuring process, new data sources are
slow to enter the EDW, and the data schema is slow to change.

Departmental data warehouses may have been originally designed for a specific purpose
and set of business needs, but over time evolved to house more and more data, some of
which may be forced into existing schemas to enable BI and the creation of OLAP cubes
for analysis and reporting. Although the EDW achieves the objective of reporting and
sometimes the creation of dashboards, EDWs generally limit the ability of analysts to
iterate on the data in a separate nonproduction environment where they can conduct
in-depth analytics or perform analysis on unstructured data.

The typical data architectures just described are designed for storing and processing
mission-critical data, supporting enterprise applications, and enabling corporate reporting
activities. Although reports and dashboards are still important for organizations, most
traditional data architectures inhibit data exploration and more sophisticated analysis.
Moreover, traditional data architectures have several additional implications for data
scientists.

• High-value data is hard to reach and leverage, and predictive analytics and data
mining activities are last in line for data. Because the EDWs are designed for
central data management and reporting, those wanting data for analysis are
generally prioritized after operational processes.

• Data moves in batches from EDW to local analytical tools. This workflow means
that data scientists are limited to performing in-memory analytics (such as with
R, SAS, SPSS, or Excel), which will restrict the size of the data sets they can use.
As such, analysis may be subject to constraints of sampling, which can skew
model accuracy.

• Data Science projects will remain isolated and ad hoc, rather than centrally
managed. The implication of this isolation is that the organization can never
harness the power of advanced analytics in a scalable way, and Data Science
projects will exist as nonstandard initiatives, which are frequently not aligned
with corporate business goals or strategy.
All these symptoms of the traditional data architecture result in a slow "time-to-insight"
and lower business impact than could be achieved if the data were more readily accessible
and supported by an environment that promoted advanced analytics. As stated earlier,
one solution to this problem is to introduce analytic sandboxes to enable data scientists
to perform advanced analytics in a controlled and sanctioned way. Meanwhile, the current
Data Warehousing solutions continue offering reporting and BI services to support
management and mission-critical operations.

Emerging Big Data Ecosystem and a New Approach to Analytics


Organizations and data collectors are realizing that the data they can gather from
individuals contains intrinsic value and, as a result, a new economy is emerging. As this
new digital economy continues to evolve, the market sees the introduction of data vendors
and data cleaners that use crowdsourcing (such as Mechanical Turk and GalaxyZoo) to
test the outcomes of machine learning techniques. Other vendors offer added value by
repackaging open source tools in a simpler way and bringing the tools to market. Vendors
such as Cloudera, Hortonworks, and Pivotal have provided this value-add for the open
source framework Hadoop.

As the new ecosystem takes shape, there are four main groups of players within this
interconnected web. These are shown in below figure.

• Data devices [shown in the (1) section of Figure] and the "Sensornet" gather data
from multiple locations and continuously generate new data about this data. For
each gigabyte of new data created, an additional petabyte of data is created about
that data. [2]

▪ For example, consider someone playing an online video game through a
PC, game console, or smartphone. In this case, the video game provider
captures data about the skill and levels attained by the player. Intelligent
systems monitor and log how and when the user plays the game. As a
consequence, the game provider can fine-tune the difficulty of the game,
suggest other related games that would most likely interest the user, and
offer additional equipment and enhancements for the character based on the user's age,
gender, and interests. This information may get stored locally or uploaded
to the game provider's cloud to analyze the gaming habits and
opportunities for upsell and cross-sell, and identify archetypical profiles
of specific kinds of users.
▪ Smartphones provide another rich source of data. In addition to messaging
and basic phone usage, they store and transmit data about Internet usage,
SMS usage, and real-time location. This metadata can be used for
analyzing traffic patterns by scanning the density of smartphones in
locations to track the speed of cars or the relative traffic congestion on
busy roads. In this way, GPS devices in cars can give drivers real-time
updates and offer alternative routes to avoid traffic delays.

▪ Retail shopping loyalty cards record not just the amount an individual
spends, but the locations of stores that person visits, the kinds of products
purchased, the stores where goods are purchased most often, and the
combinations of products purchased together. Collecting this data
provides insights into shopping and travel habits and the likelihood of
successful advertisement targeting for certain types of retail promotions.

• Data collectors [the blue ovals, identified as (2) within Figure] include sample
entities that collect data from the device and users.

▪ Data results from a cable TV provider tracking the shows a person
watches, which TV channels someone will and will not pay for to watch
on demand, and the prices someone is willing to pay for premium TV
content.

▪ Retail stores tracking the path a customer takes through their store while
pushing a shopping cart with an RFID chip so they can gauge which
products get the most foot traffic using geospatial data collected from the
RFID chips.

• Data aggregators (the dark gray ovals in Figure, marked as (3)) make sense of the
data collected from the various entities from the "Sensor Net" or the "Internet of
Things." These organizations compile data from the devices and usage patterns
collected by government agencies, retail stores, and websites. In turn, they can
choose to transform and package the data as products to sell to list brokers, who
may want to generate marketing lists of people who may be good targets for
specific ad campaigns.

• Data users and buyers are denoted by (4) in Figure. These groups directly benefit
from the data collected and aggregated by others within the data value chain.

▪ Retail banks, acting as a data buyer, may want to know which customers
have the highest likelihood to apply for a second mortgage or a home
equity line of credit. To provide input for this analysis, retail banks may
purchase data from a data aggregator. This kind of data may include
demographic information about people living in specific locations; people
who appear to have a specific level of debt, yet still have solid credit scores
(or other characteristics such as paying bills on time and having savings
accounts) that can be used to infer credit worthiness; and those who are
searching the web for information about paying off debts or doing home
remodeling projects. Obtaining data from these various sources and
aggregators will enable a more targeted marketing campaign, which would
have been more challenging before Big Data due to the lack of information
or high-performing technologies.

▪ Using technologies such as Hadoop to perform natural language
processing on unstructured, textual data from social media websites, users
can gauge the reaction to events such as presidential campaigns. People
may, for example, want to determine public sentiments toward a candidate
by analyzing related blogs and online comments; a minimal sketch of such a
tally appears just after this list. Similarly, data users may want to track and
prepare for natural disasters by identifying which areas a hurricane affects
first and how it moves, based on which geographic areas are tweeting about it
or discussing it via social media.
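A minimal sketch of that sentiment-tallying idea, written in the style of a Hadoop Streaming mapper: it reads one social media post per line from standard input and emits counts that a summing reducer would aggregate. The tiny word lists are invented placeholders, not a real sentiment model.

#!/usr/bin/env python3
# Toy sentiment mapper in the Hadoop Streaming style: read one post per line
# from stdin and emit "label<TAB>1" pairs for a downstream reducer to sum.
import re
import sys

POSITIVE = {"great", "win", "support", "hope", "strong"}   # invented lexicon
NEGATIVE = {"bad", "lose", "fail", "weak", "angry"}        # invented lexicon

for line in sys.stdin:
    words = set(re.findall(r"[a-z']+", line.lower()))
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    print(f"{label}\t1")

Run locally, roughly the same behavior can be approximated with a pipeline such as cat posts.txt | python3 mapper.py | sort | uniq -c, which plays the role the reducer would play on a cluster.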

As illustrated by this emerging Big Data ecosystem, the kinds of data and the related
market dynamics vary greatly. These data sets can include sensor data, text, structured
datasets, and social media. With this in mind, it is worth recalling that these data sets will
not work well within traditional EDWs, which were architected to streamline reporting
and dashboards and be centrally managed. Instead, Big Data problems and projects
require different approaches to succeed.

Key Roles for the New Big Data Ecosystem


As explained in the context of the Big Data ecosystem above, new players have
emerged to curate, store, produce, clean, and transact data. In addition, the need for
applying more advanced analytical techniques to increasingly complex business
problems has driven the emergence of new roles, new technology platforms, and new
analytical methods. This section explores the new roles that address these needs, and
subsequent chapters explore some of the analytical methods and technology platforms.

The Big Data ecosystem demands three categories of roles, as shown in below figure.
These roles were described in the McKinsey Global study on Big Data, from May 2011
[1].

The first group, Deep Analytical Talent, is technically savvy, with strong analytical
skills. Members possess a combination of skills to handle raw, unstructured data and to
apply complex analytical techniques at massive scales. This group has advanced training
in quantitative disciplines, such as mathematics, statistics, and machine learning. To do
their jobs, members need access to a robust analytic sandbox or workspace where they
can perform large-scale analytical data experiments. Examples of current professions
fitting into this group include statisticians, economists, mathematicians, and the new role
of the Data Scientist.

The second group, Data Savvy Professionals, has less technical depth but has a basic
knowledge of statistics or machine learning and can define key questions that can be
answered using advanced analytics. These people tend to have a base knowledge of
working with data, or an appreciation for some of the work being performed by data
scientists and others with deep analytical talent. Examples of data savvy professionals
include financial analysts, market research analysts, life scientists, operations managers,
and business and functional managers.

The third category of people mentioned in the study is Technology and Data Enablers.
This group represents people providing technical expertise to support analytical projects,
such as provisioning and administrating analytical sandboxes, and managing large-scale
data architectures that enable widespread analytics within companies and other
organizations. This role requires skills related to computer engineering, programming,
and database administration.

These three groups must work together closely to solve complex Big Data challenges.
Most organizations are familiar with people in the latter two groups mentioned, but the
first group, Deep Analytical Talent, tends to be the newest role for most and the least
understood. For simplicity, this discussion focuses on the emerging role of the Data
Scientist. It describes the kinds of activities that role performs and provides a more
detailed view of the skills needed to fulfill that role.

There are three recurring sets of activities that data scientists perform:

• Reframe business challenges as analytics challenges. Specifically, this is a skill
to diagnose business problems, consider the core of a given problem, and
determine which kinds of candidate analytical methods can be applied to solve it.
This concept is explored further in Chapter 2, "Data Analytics Lifecycle."

• Design, implement, and deploy statistical models and data mining techniques on
Big Data. This set of activities is mainly what people think about when they
consider the role of the Data Scientist: namely, applying complex or advanced
analytical methods to a variety of business problems using data.

• Develop insights that lead to actionable recommendations. It is critical to note that
applying advanced methods to data problems does not necessarily drive new
business value. Instead, it is important to learn how to draw insights out of the
data and communicate them effectively. "The Endgame, or Putting It All
Together" has a brief overview of techniques for doing this.
Data scientists are generally thought of as having five main sets of skills and behavioral
characteristics, as shown in below figure:

• Quantitative skill: such as mathematics or statistics

• Technical aptitude: namely, software engineering, machine learning, and
programming skills

• Skeptical mind-set and critical thinking: It is important that data scientists can
examine their work critically rather than in a one-sided way.

• Curious and creative: Data scientists are passionate about data and finding creative
ways to solve problems and portray information.

• Communicative and collaborative: Data scientists must be able to articulate the
business value in a clear way and collaboratively work with other groups,
including project sponsors and key stakeholders.

Data scientists are generally comfortable using this blend of skills to acquire, manage,
analyze, and visualize data and tell compelling stories about it.

Data Science - Introduction:
