Introduction
1.1 Introduction
The 21st century has seen rapid growth in the field of Information Technology (IT). IT plays an important role in every part of day-to-day life, such as health, business and industry, education, and finance. To survive in today's competitive world, we need to use IT and its various applications. Information Technology is essentially a computer-based information system built through the study, design, development, application, implementation, support, and management needed to serve the requirements of current business processes.
Because of the widespread use of such information technology systems, a huge amount of data is generated every day at an alarming rate. This huge amount of data is called BIG DATA.
This data is generated through various sources such as mobile phones, social media (Facebook, Twitter, and Instagram), video surveillance, medical imaging, gene sequencing, and geographical data, as shown in Figure 1.1.
Big Data is creating good opportunities for IT industries and other businesses to improve quality, efficiency, product services, customer satisfaction, and profit. It is also a rich domain for academia and researchers to contribute to the field of data analysis.
Imagine that every day, every hour, every minute, and every second, data is generated by various sources. For example:
a. Every day, around 1.5 million payments are made using PayPal.
b. Every hour, Walmart handles more than 1 million customer transactions.
c. Every minute, millions of people write comments, update statuses, and upload photos on Facebook.
d. Every second, thousands of tweets are sent on Twitter.
This shows that data is growing from terabytes to exabytes. This data is characterized by Volume, Variety, Velocity, and Veracity, as shown in Figure 1.2. It comes in different forms: structured, semi-structured, quasi-structured, and unstructured. It is both homogeneous and heterogeneous in nature, and such data is called BIG DATA.
Several industries use Big Data for various applications, such as identifying credit card fraud, targeting attractive promotional offers at gold and platinum customers, and recommending products based on individuals' browsing history on social networking websites.
1. Volume:
Big Data is huge in volume. It can have billions of rows and millions of columns.
2. Data Complexity and Structures:
Big Data comes from a variety of sources and in a variety of formats and structures. It also includes digital traces left on the web and in other digital repositories.
3. New data creation speed and growth:
Big Data is high-velocity data, i.e., it grows rapidly.
[Figure: Big Data and its related fields (data storage, parallel processing, data science/data analytics, data mining, distributed systems, and artificial intelligence).]
Big Data can take various forms: structured, semi-structured, quasi-structured, and unstructured, as shown in Figure 1.4. Examples include textual data, multimedia data (audio/video), financial sheets, and genetic mappings. Processing such unstructured or semi-structured data requires a distributed and massively parallel processing environment.
[Figure 1.4: Forms of Big Data (structured, semi-structured, quasi-structured, and unstructured).]
A. Structured Data
Structured data follows a predefined format, data type, and schema (e.g., transaction data, Online Analytical Processing [OLAP] data cubes, traditional RDBMS tables, CSV files, and simple spreadsheets).
Example:
Roll_No FName MName LName Mathematics Science English Marathi Hindi
1 Suresh Nagesh Kulkarni 80 89 87 91 92
2 Kalish Kumar Joshi 99 98 92 92 95
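As a brief illustration (a minimal sketch assuming the marks above are stored in a hypothetical file named student_marks.csv), structured data such as this can be loaded and queried directly because its schema is fixed:

```python
import pandas as pd

# Hypothetical CSV holding the structured student-marks table shown above;
# every row follows the same predefined schema.
df = pd.read_csv("student_marks.csv")

# Because the columns and their types are known in advance,
# column-wise operations are straightforward.
subjects = ["Mathematics", "Science", "English", "Marathi", "Hindi"]
df["Total"] = df[subjects].sum(axis=1)
print(df[["Roll_No", "FName", "Total"]])
```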
B. Semi Structured Data
Semi-structured data has a self-describing structure, such as Extensible Markup Language (XML) data files, which are self-describing and defined by an XML schema.
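A minimal sketch of what semi-structured data looks like: a hypothetical XML record parsed with Python's standard library. The tags describe the structure themselves, so the record can be read without a separate fixed table schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical self-describing XML record; the tags carry the structure.
xml_record = """<student>
    <roll_no>1</roll_no>
    <name>Suresh Kulkarni</name>
    <marks subject="Mathematics">80</marks>
    <marks subject="Science">89</marks>
</student>"""

root = ET.fromstring(xml_record)
print(root.findtext("name"))               # Suresh Kulkarni
for mark in root.findall("marks"):
    print(mark.get("subject"), mark.text)  # Mathematics 80 / Science 89
```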
C. Quasi Structured Data
Quasi-structured data is textual data with erratic formats that can be processed only with effort and the right tools (for example, web clickstream data with inconsistent values and formats).
D. Unstructured Data
Unstructured data has no predefined structure or schema. It may include text documents, PDFs, images, and video.
Characteristics of Data Science:
• Data sources: Data Science works with flexible data structures, and data grows rapidly because of this flexible nature.
• Storage: Data Science uses less-structured storage (logs, blogs, SQL, NoSQL, cloud data, etc.).
• Data quality: Data Science provides precision, confidence levels, and much wider probabilities with its findings.
• Tools: Data Science uses statistics, machine learning, NLP, and graph analysis tools.
1.3 Overview of the Data Analytics Lifecycle
• The Data Analytics Lifecycle is specifically designed for Big Data problems and data science projects.
• The Data Analytics Lifecycle has six phases, and project work can occur in several phases at once.
• Within the lifecycle, movement between phases can be either forward or backward.
The various roles and key stakeholders of an analytics project are shown in Figure 1.8. Each stakeholder plays an important role in a successful analytics project. These seven stakeholders are as follows.
[Figure 1.8: Key stakeholders for a Data Analytics project: Business User, Project Sponsor, Project Manager, Business Intelligence Analyst, Database Administrator (DBA), Data Engineer, and Data Scientist.]
1. Business User:
• The business user understands the domain area and usually benefits from the results.
• The business user can consult and advise the project team on the context of the project.
• The business user can decide the value of the results and how the outputs will be operationalized.
• A business user is typically a business analyst, line manager, or deep subject matter expert in the project domain.
2. Project sponsors:
• Project Sponsors are responsible for the genesis of the project.
• They provide the need and requirements for the project.
• They define the actual business problem.
• They provide the funding of the project.
• They set the priorities for the project and clarify the desired outputs.
3. Project Manager:
• The project manager ensures that key milestones and objectives are achieved within the defined time limits.
• The project manager also ensures the quality of the project.
4. Business Intelligence Analyst:
• Business Intelligence Analyst provides business domain expertise based on a deep
understanding of the data, key performance indicators (KPIs), key metrics, and
business intelligence from a reporting perspective.
• Business Intelligence Analysts generally create dashboards and reports and have
knowledge of the data feeds and sources.
5. Database Administrator (DBA):
• Database Administrator (DBA) configures the database environment to support the
analytics needs of the working team.
• DBA responsibilities may include
o Providing access to key databases or tables
o Ensuring the appropriate security levels related to the data repositories.
6. Data Engineer:
• The data engineer has deep technical skills to assist with tuning SQL queries for data management and data extraction.
• This person provides support for data ingestion into the analytic sandbox.
• The DBA sets up and configures the databases to be used; the data engineer executes the actual data extractions and performs substantial data manipulation to facilitate the analytics.
• The data engineer works closely with the data scientist to help shape data in the right ways for analyses.
7. Data Scientist:
• The data scientist provides expertise in analytical techniques and data modeling, and applies valid analytical techniques to given business problems.
• The data scientist ensures that the overall analytics objectives are met.
• The data scientist designs and executes analytical methods and approaches with the data available to the project.
[Figure 1.9: The six phases of the Data Analytics Lifecycle: 1. Discovery, 2. Data Preparation, 3. Model Planning, 4. Model Building, 5. Communicate Results, 6. Operationalize.]
The Data Analytics Lifecycle has six phases in total, as shown in Figure 1.9. The iterative nature of the lifecycle closely portrays a real project: as the project moves forward, the team may return to earlier stages as new information is uncovered and team members learn more about various stages of the project. An overview of these phases follows.
Phase 1 - Discovery: Do I have enough information to draft an analytic plan and share it for peer review?
• In the Discovery phase, the team learns the business domain and its relevant history. They learn about similar projects previously implemented by the organization or business unit.
• They verify the resources available to support the project in terms of people, technology, time, and data.
• The main activities of this phase are framing the business problem as an analytics challenge that can be addressed in subsequent phases, and formulating initial hypotheses (IHs) to test and to begin learning the data.
Phase 2 - Data Preparation: Do I have enough good-quality data to start building the model?
• The Data Preparation phase requires an analytic sandbox in which the team can work with data and perform analytics for the duration of the project.
• The team executes extract, load, and transform (ELT) or extract, transform, and load (ETL) to get data into the sandbox. The combination of ELT and ETL is referred to as ETLT.
• To work with the data and analyze it, the data should be transformed as part of the ETLT process.
• The team studies the data in depth and applies various conditions and constraints to it.
Phase 3 - Model Planning: Do I have a good idea about the type of model to try? Can I refine the analytic plan?
• In the Model Planning phase, the team identifies the methods, techniques, and workflow it intends to follow in the subsequent model building phase.
• The team identifies relationships between variables, which helps it select the key variables and the most suitable models.
Phase 4 - Model Building: Is the model robust enough? Have we failed for sure?
• In the Model Building phase, the team develops datasets for testing, training, and production purposes (a minimal illustrative sketch appears after this overview).
• In addition, the team builds and executes models based on the work done in the model planning phase.
• The team also considers whether its existing tools will suffice or whether a more robust environment, such as faster hardware or parallel programming, is needed to execute the models and workflows.
Phase 5 - Communicate Results
• In the Communicate Results phase, the team compares the outcomes of the modeling to the criteria established for success and failure, identifies the key findings, quantifies the business value, and communicates the results to the stakeholders.
Phase 6 - Operationalize
• In the Operationalize phase, the team delivers final reports, briefings, code, and technical documents.
• The team also runs a pilot project to implement the models in a production environment.
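As mentioned in the Phase 4 (Model Building) overview above, the team prepares separate training and test datasets. The following is a minimal sketch with scikit-learn on synthetic, hypothetical data, not a method prescribed by the lifecycle itself:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the project's prepared dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Split into training and test sets, as described for the Model Building phase.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate on held-out data to judge whether the model is robust enough.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```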
In the Discovery phase, the team learns the business domain and its relevant history. In this phase, the data science team needs to perform the activities shown in Figure 1.10.
[Figure 1.10: Discovery phase activities: Learn the Business Domain, Learn Resources, Identify Potential Data Sources, Problem Framing, Initial Hypothesis Development, Interview the Analytics Sponsor, and Identify Key Stakeholders.]
• Learn Resources
• The team needs to learn about all available resources, such as technology, tools, systems, data, and people.
• The team also needs to identify the types of systems needed in later phases to operationalize the models.
• The team also needs to identify gaps between existing tools, technologies, and skills.
• The team needs to determine whether sufficient data is available or whether additional data needs to be collected or purchased from outside sources.
• The team needs to ensure the project team has the right mix of domain experts, customers, analytic talent, and project management to be effective.
• The team needs to evaluate how much time is required and whether the team has the right breadth and depth of skills.
• Problem Framing
• Problem framing is the process of stating the analytics problem to be solved.
• A best practice is to write down the problem statement and share it with the key stakeholders.
• Each team member may have their own perspective on the problem and may have different solutions for it.
• Essentially, the team needs to consider the current situation and its main challenges.
• In this process, the team needs to identify:
• What are the main objectives of the project?
• What needs to be achieved in business terms?
• What needs to be done to meet those needs?
• What will be the outcome of the project?
• The main task of the Discovery phase is developing a set of IHs. These are ideas that the team can test with data. The team can come up with a few primary hypotheses to test and then be creative about developing several more. Hypothesis testing from a statistical perspective can also be done in later phases (a small sketch follows this list).
• The team can compare its answers with the outcome of an experiment or test and can generate additional possible solutions to problems.
• This process also involves gathering and assessing hypotheses from stakeholders and domain experts, as they may have their own perspectives on what the problem is, what the solution should be, and how to arrive at a solution.
• These stakeholders know the domain area well and can offer suggestions on ideas to test as the team formulates hypotheses during this phase. These ideas will also give the team opportunities to expand the project scope into adjacent spaces.
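As noted above, hypothesis testing from a statistical perspective can be done in later phases. The following is a minimal sketch with SciPy on hypothetical delivery-time samples, illustrating how one initial hypothesis might eventually be checked:

```python
from scipy import stats

# Hypothetical samples: idea-delivery times (in days) with and without
# global knowledge transfer, standing in for real project data.
with_transfer = [30, 28, 35, 32, 27, 31, 29, 33]
without_transfer = [40, 38, 45, 36, 42, 39, 41, 44]

# Two-sample t-test: do the two groups differ in mean delivery time?
t_stat, p_value = stats.ttest_ind(with_transfer, without_transfer)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# A small p-value (for example, below 0.05) would support the hypothesis
# that knowledge transfer is associated with faster idea delivery.
```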
Together, these are the main activities the data science team should perform during the Discovery phase.
The Data Preparation phase of the Data Analytics Lifecycle includes the steps to explore, preprocess, and condition data prior to modeling and analysis. The team needs to create a robust environment in which it can explore the data, separate from the production environment. In this phase, the data science team needs to perform the activities shown in Figure 1.11.
[Figure 1.11: Data Preparation phase activities: Prepare the Analytic Sandbox, Perform ETLT, Identify Tools for Data Preparation, and Perform Data Conditioning.]
[Figure: ETL (Extract, Transform, Load), ELT (Extract, Load, Transform), and their combination, ETLT.]
• What is ETL?
o In ETL, users perform extract, transform, and load processes to extract data from a datastore, perform data transformations, and load the transformed data back into the datastore, as shown in Figure 1.14.
• What is ELT?
o In ELT, users extract the data from a datastore and load it into the target environment in its raw form first; the transformations are then performed afterward, within that environment.
• What is the sandbox approach?
o The sandbox approach suggests extract, load, and then transform. The data is extracted in its raw form and loaded into the datastore, where analysts can choose to transform the data into a new state or leave it in its original, raw condition. There is significant value in keeping the raw data in the sandbox before performing any transformations on it.
o Consider the example of fraud detection on credit card usage. Outliers in this data population can represent higher-risk transactions, which may be fraudulent credit card activity. With ETL, these outliers may get filtered out; with ELT, all the data is present in the sandbox, so the fraud-detection analysis can still be performed.
• What is the ETLT process?
o Consider a scenario in which the team wants both cleaned, aggregated data and the original data.
o The team also needs to keep a copy of the original data to compare against, or to look for hidden patterns that may have existed in the data before the cleaning stage. This combined process is called ETLT (a minimal sketch follows this list).
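Below is a minimal pandas sketch of the ETLT idea, assuming a hypothetical transactions.csv with amount and transaction_date columns: the raw extract is loaded into the sandbox unchanged, and a cleaned, aggregated copy is produced alongside it, so outliers such as potentially fraudulent transactions remain available in the raw data:

```python
import pandas as pd

# Extract + Load: bring the raw data into the analytic sandbox unchanged.
raw = pd.read_csv("transactions.csv")                      # hypothetical extract
raw.to_csv("sandbox/transactions_raw.csv", index=False)    # keep the raw copy

# Transform: build a cleaned, aggregated copy for reporting, dropping extreme
# amounts that a traditional ETL flow might have filtered out entirely.
clean = raw[raw["amount"] < raw["amount"].quantile(0.99)]
daily = clean.groupby("transaction_date", as_index=False)["amount"].sum()
daily.to_csv("sandbox/transactions_daily.csv", index=False)

# Both versions live in the sandbox: the raw copy for outlier/fraud analysis,
# the transformed copy for aggregates. This combination is the ETLT process.
```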
• Learn about the Data in Depth
• Learning the data in depth is a critical aspect of data preparation. This activity accomplishes the following goals:
1. It clarifies what data the data science team has access to at the start of the project.
2. It highlights gaps in the organization's existing datasets, which can trigger new data-collection activities within the organization.
3. It identifies datasets outside the organization that the team can obtain through open APIs, data sharing, or purchase to supplement the existing datasets (a brief sketch of the open-API option follows).
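The following is a minimal sketch (using a purely hypothetical endpoint URL) of obtaining a supplementary external dataset through an open API with the requests library:

```python
import pandas as pd
import requests

# Hypothetical open-API endpoint; a real project would use the actual
# provider's URL, parameters, and authentication details.
url = "https://example.com/api/v1/external-data"

response = requests.get(url, params={"region": "IN"}, timeout=30)
response.raise_for_status()

# Load the JSON payload into a DataFrame to supplement existing datasets.
external_df = pd.DataFrame(response.json())
print(external_df.head())
```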
In the Model Planning phase, as shown in Figure 1.15, the data science team decides on candidate models to apply to the data for clustering, classifying, or finding relationships, depending on the goal of the project. During this phase, the team refers to the hypotheses developed in Phase 1. These hypotheses help the team frame the analytics to execute in Phase 4 and select the right methods to achieve its objectives.
• Assess the structure of the dataset; it dictates the tools and analytical techniques for the model building phase. Different tools and approaches are required for different types of data.
• Ensure that the analytical techniques enable the team to meet the business objectives and to accept or reject the working hypotheses.
• Determine whether the situation needs a single model or a series of techniques as part of a larger analytic workflow.
• Research how other analysts have approached the same or a similar problem, and find out which methods and techniques they used to solve it.
• Model Selection
• During model selection, the team makes a list of suitable analytical techniques to fulfill the end goal of the project. The team can observe real-world situations and try to map them to the current problem for model construction.
• In machine learning and data mining, several techniques such as classification, association rules, and clustering are available. The team also needs to identify techniques suitable for Big Data, whether for structured data, unstructured data, or a hybrid approach (a brief sketch of comparing candidate techniques follows).
• Initially, these models can be created using a statistical software package such as R, SAS, or Matlab. Although these tools are designed for data mining and machine learning algorithms, they may have limitations with Big Data, so the team may need to redesign algorithms as required.
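Below is a minimal scikit-learn sketch, on synthetic hypothetical data, of short-listing candidate techniques during model planning by comparing them with cross-validation; as noted above, a real project might instead use R, SAS, or Matlab:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared project data.
X, y = make_classification(n_samples=500, n_features=12, random_state=0)

# Candidate classification techniques identified during model planning.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "naive_bayes": GaussianNB(),
}

# Compare candidates with 5-fold cross-validation to inform model selection.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```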
[Figure 1.15: Model Planning phase activities: Data Exploration and Variable Selection, Model Selection, and Common Tools for the Model Planning Phase.]
After the model building phase, the team evaluates and communicates the results.
• After executing the model, the team needs to compare the outcomes of the modeling to the criteria established for success and failure. The team considers how best to convey the findings and outcomes to the various team members and stakeholders.
• The team needs to determine whether it succeeded or failed in its objectives. A best practice in this phase is to record all the findings, then select the three most significant ones and share them with the stakeholders. The team should reflect on the implications of these findings and measure the business value.
• The team also needs to consider possible improvements and suggest them for future work or for the existing process.
Case Study: GINA
• In 2012, EMC’s new director wanted to improve the company’s engagement of employees across the global centers of excellence (GCE) to drive innovation, research, and university partnerships.
• This project was created to accomplish the following:
o Store formal and informal data.
o Track research from global technologists.
o Mine the data for patterns and insights to improve the team’s operations and strategy.
Phase 1: Discovery
• Team members and roles
o Business user, project sponsor, project manager – Vice President from Office of
CTO
o BI analyst – person from IT
o Data engineer and DBA – people from IT
o Data scientist – distinguished engineer
• The data fell into two categories
o Five years of idea submissions from internal innovation contests
o Minutes and notes representing innovation and research activity from around the
world
• Hypotheses grouped into two categories
o Descriptive analytics of what is happening to spark further creativity, collaboration,
and asset generation
o Predictive analytics to advise executive management of where it should be
investing in the future
o The 10 main IHs that the GINA team developed were as follows:
1. Innovation activity in different geographic regions can be mapped to
corporate strategic directions.
2. The length of time it takes to deliver ideas decreases when global
knowledge transfer occurs as part of the idea delivery process.
3. Innovators who participate in global knowledge transfer deliver ideas
more quickly than those who do not.
4. An idea submission can be analyzed and evaluated for the likelihood of
receiving funding.
5. Knowledge discovery and growth for a particular topic can be measured
and compared across geographic regions.
6. Knowledge transfer activity can identify research-specific boundary
spanners in disparate regions.
7. Strategic corporate themes can be mapped to geographic regions.
8. Frequent knowledge expansion and transfer events reduce the time it
takes to generate a corporate asset from an idea.
9. Lineage maps can reveal when knowledge expansion and transfer did not
(or has not) resulted in a corporate asset.
10. Emerging research topics can be classified and mapped to specific
ideators, innovators, boundary spanners, and assets.
Phase 2: Data Preparation
• Set up an analytics sandbox.
• Discovered that certain data needed conditioning and normalization and that missing
datasets were critical.
• The team recognized that poor-quality data could affect subsequent steps.
• They discovered that many names were misspelled and that there were problems with extra spaces.
• These seemingly small problems had to be addressed (a minimal cleaning sketch follows).
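The following is a minimal pandas sketch, with made-up example values, of the kind of conditioning described above: trimming extra spaces and normalizing name casing so that the same person is not counted twice:

```python
import pandas as pd

# Hypothetical records showing the data-quality problems the team found:
# extra spaces and inconsistent spelling/casing of the same name.
ideas = pd.DataFrame({
    "submitter": ["  Asha Rao", "asha rao ", "Rahul  Mehta", "Rahul Mehta"],
    "idea_id": [101, 102, 103, 104],
})

# Condition the name field: strip whitespace, collapse repeated spaces,
# and normalize casing.
ideas["submitter"] = (
    ideas["submitter"]
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)
    .str.title()
)

print(ideas["submitter"].unique())   # ['Asha Rao' 'Rahul Mehta']
```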
Phase 3: Model Planning
• The study included the following considerations:
o Identify the right milestones to achieve the goals.
o Trace how people move ideas from each milestone toward the goal.
o Track ideas that die and others that reach the goal.
o Compare times and outcomes using a few different methods.
Phase 4: Model Building
• Several analytic methods were employed:
o NLP on textual descriptions.
o Social network analysis using R and RStudio.
o Developing social graphs and visualizations.
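The GINA team performed this analysis in R and RStudio; the following is only an illustrative Python sketch with networkx and hypothetical names, showing the general idea of a social graph of idea submitters and their collaborators:

```python
import networkx as nx

# Hypothetical collaboration pairs (submitter, co-contributor) drawn from
# idea submissions; the real analysis used R/RStudio on GINA data.
edges = [
    ("Asha", "Rahul"), ("Asha", "Mei"), ("Rahul", "Mei"),
    ("Mei", "Carlos"), ("Carlos", "Diego"), ("Diego", "Asha"),
]

G = nx.Graph()
G.add_edges_from(edges)

# Degree centrality highlights well-connected innovators,
# i.e. potential boundary spanners across regions.
centrality = nx.degree_centrality(G)
for person, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{person}: {score:.2f}")
```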
[Figure: Social graph of data submitters and finalists.]