Introduction
1.1 Introduction
The 21st century has seen rapid growth in the field of Information Technology (IT). IT plays an important role in every part of day-to-day life, such as health, business and industry, education, and finance. To survive in today's competitive world, we need to use IT and its various applications. Information Technology is essentially a computer-based information system built through the study, design, development, application, implementation, support, and management needed to serve the requirements of current business processes.
Because of the widespread use of such information technology systems, a huge amount of data is generated every day at an alarming rate. This huge amount of data is called BIG DATA.
This data is generated through various sources such as mobile phones, social media (Facebook, Twitter, and Instagram), video surveillance, medical imaging, gene sequencing, and geographical data, as shown in Figure 1.1.
Big Data is creating good opportunities for IT industries and other businesses to improve quality, efficiency, product services, customer satisfaction, and profit. It is also a rich domain for academia and researchers to contribute to the field of data analysis.
Imagine that every day, every hour, every minute, and every second, data is generated by various sources. For example:
a. Every day, around 1.5 million payments are made using PayPal.
b. Every hour, Walmart handles more than 1 million customer transactions.
c. Every minute, millions of people write comments, update statuses, and upload photos on Facebook.
d. Every second, thousands of tweets are sent on Twitter.
This shows that data is growing from terabytes to exabytes. This data is characterized by Volume, Variety, Velocity, and Veracity, as shown in Figure 1.2. It comes in different forms: structured, semi-structured, quasi-structured, and unstructured. It is both homogeneous and heterogeneous in nature, and such data is called BIG DATA.
Several industries use Big Data for various applications, such as identifying credit card fraud, targeting attractive promotional offers at gold and platinum customers, and recommending products based on individuals' browsing history on social networking websites.
1. Volume:
Big Data is huge in volume. It can have billions of rows and millions of columns.
2. Data Complexity and Structures:
Big Data comes from a variety of sources and in a variety of formats and structures. It also includes digital traces left on the web and in other digital repositories.
3. New data creation speed and growth:
Big Data is high-velocity data, i.e., it grows rapidly.
[Figure: Big Data and its related fields (data storage, parallel processing, data science/data analytics, data mining, distributed systems, and artificial intelligence).]
Big Data can take various forms: structured, semi-structured, quasi-structured, and unstructured, as shown in Figure 1.4. Examples include textual data, multimedia data (audio/video), financial sheets, and genetic mappings. Processing such unstructured or semi-structured data requires a distributed and massively parallel processing environment.
[Figure 1.4: Forms of Big Data (structured, semi-structured, quasi-structured, and unstructured).]
A. Structured Data
Structured data follows a predefined format, data type, and schema (e.g., transaction data, Online Analytical Processing [OLAP] data cubes, traditional RDBMS tables, CSV files, and simple spreadsheets).
Example:
Roll_No FName MName LName Mathematics Science English Marathi Hindi
1 Suresh Nagesh Kulkarni 80 89 87 91 92
2 Kalish Kumar Joshi 99 98 92 92 95
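As a brief illustration (a minimal sketch assuming the marks above are stored in a hypothetical file named student_marks.csv), structured data such as this can be loaded and queried directly because its schema is fixed:

```python
import pandas as pd

# Hypothetical CSV holding the structured student-marks table shown above;
# every row follows the same predefined schema.
df = pd.read_csv("student_marks.csv")

# Because the columns and their types are known in advance,
# column-wise operations are straightforward.
subjects = ["Mathematics", "Science", "English", "Marathi", "Hindi"]
df["Total"] = df[subjects].sum(axis=1)
print(df[["Roll_No", "FName", "Total"]])
```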
B. Semi Structured Data
Semi-structured data has a self-describing structure, such as Extensible Markup Language (XML) data files, which are self-describing and defined by an XML schema.
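A minimal sketch of what semi-structured data looks like: a hypothetical XML record parsed with Python's standard library. The tags describe the structure themselves, so the record can be read without a separate fixed table schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical self-describing XML record; the tags carry the structure.
xml_record = """<student>
    <roll_no>1</roll_no>
    <name>Suresh Kulkarni</name>
    <marks subject="Mathematics">80</marks>
    <marks subject="Science">89</marks>
</student>"""

root = ET.fromstring(xml_record)
print(root.findtext("name"))               # Suresh Kulkarni
for mark in root.findall("marks"):
    print(mark.get("subject"), mark.text)  # Mathematics 80 / Science 89
```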
C. Quasi Structured Data
Quasi-structured data is textual data with erratic formats that can be processed only with effort and the right tools (for example, web clickstream data with inconsistent values and formats).
D. Unstructured Data
Unstructured data has no predefined structure or schema. It may include text documents, PDFs, images, and video.
Characteristics of Data Science:
• Data sources: Data Science works with flexible data structures, and data grows rapidly because of this flexible nature.
• Storage: Data Science uses less-structured storage (logs, blogs, SQL, NoSQL, cloud data, etc.).
• Data quality: Data Science provides precision, confidence levels, and much wider probabilities with its findings.
• Tools: Data Science uses statistics, machine learning, NLP, and graph analysis tools.
1.3 Overview of the Data Analytics Lifecycle
• The Data Analytics Lifecycle is specifically designed for Big Data problems and data science projects.
• The Data Analytics Lifecycle has six phases, and project work can occur in several phases at once.
• Within the lifecycle, movement between phases can be either forward or backward.
The various roles and key stakeholders of an analytics project are shown in Figure 1.8. Each stakeholder plays an important role in a successful analytics project. These seven stakeholders are as follows.
[Figure 1.8: Key stakeholders for a Data Analytics project: Business User, Project Sponsor, Project Manager, Business Intelligence Analyst, Database Administrator (DBA), Data Engineer, and Data Scientist.]
1. Business User:
• The business user understands the domain area and usually benefits from the results.
• The business user can consult and advise the project team on the context of the project.
• The business user can decide the value of the results and how the outputs will be operationalized.
• A business user is typically a business analyst, line manager, or deep subject matter expert in the project domain.
2. Project sponsors:
• Project Sponsors are responsible for the genesis of the project.
• They provide the need and requirements for the project.
• They define the actual business problem.
• They provide the funding of the project.
• They set the priorities for the project and clarify the desired outputs.
3. Project Manager:
• The project manager ensures that key milestones and objectives are achieved within the defined time limits.
• The project manager also ensures the quality of the project.
4. Business Intelligence Analyst:
• Business Intelligence Analyst provides business domain expertise based on a deep
understanding of the data, key performance indicators (KPIs), key metrics, and
business intelligence from a reporting perspective.
• Business Intelligence Analysts generally create dashboards and reports and have
knowledge of the data feeds and sources.
5. Database Administrator (DBA):
• Database Administrator (DBA) configures the database environment to support the
analytics needs of the working team.
• DBA responsibilities may include
o Providing access to key databases or tables
o Ensuring the appropriate security levels related to the data repositories.
6. Data Engineer:
• The data engineer has deep technical skills to assist with tuning SQL queries for data management and data extraction.
• This person provides support for data ingestion into the analytic sandbox.
• The DBA sets up and configures the databases to be used; the data engineer executes the actual data extractions and performs substantial data manipulation to facilitate the analytics.
• The data engineer works closely with the data scientist to help shape data in the right ways for analyses.
7. Data Scientist:
• The data scientist provides expertise in analytical techniques and data modeling, and applies valid analytical techniques to given business problems.
• The data scientist ensures that the overall analytics objectives are met.
• The data scientist designs and executes analytical methods and approaches with the data available to the project.
[Figure 1.9: The six phases of the Data Analytics Lifecycle: 1. Discovery, 2. Data Preparation, 3. Model Planning, 4. Model Building, 5. Communicate Results, 6. Operationalize.]
The Data Analytics Lifecycle has six phases in total, as shown in Figure 1.9. The iterative nature of the lifecycle closely portrays a real project: as the project moves forward, the team may return to earlier stages as new information is uncovered and team members learn more about various stages of the project. An overview of these phases follows.
Phase 1 - Discovery: Do I have enough information to draft an analytic plan and share it for peer review?
• In the Discovery phase, the team learns the business domain and its relevant history. They learn about similar projects previously implemented by the organization or business unit.
• They verify the resources available to support the project in terms of people, technology, time, and data.
• The main activities of this phase are framing the business problem as an analytics challenge that can be addressed in subsequent phases, and formulating initial hypotheses (IHs) to test and to begin learning the data.
Phase 2 - Data Preparation: Do I have enough good-quality data to start building the model?
• The Data Preparation phase requires an analytic sandbox in which the team can work with data and perform analytics for the duration of the project.
• The team executes extract, load, and transform (ELT) or extract, transform, and load (ETL) to get data into the sandbox. The combination of ELT and ETL is referred to as ETLT.
• To work with the data and analyze it, the data should be transformed as part of the ETLT process.
• The team studies the data in depth and applies various conditions and constraints to it.
Phase 3 - Model Planning: Do I have a good idea about the type of model to try? Can I refine the analytic plan?
• In the Model Planning phase, the team identifies the methods, techniques, and workflow it intends to follow in the subsequent model building phase.
• The team identifies relationships between variables, which helps it select the key variables and the most suitable models.
Phase 4 - Model Building: Is the model robust enough? Have we failed for sure?
• In the Model Building phase, the team develops datasets for testing, training, and production purposes (a minimal illustrative sketch appears after this overview).
• In addition, the team builds and executes models based on the work done in the model planning phase.
• The team also considers whether its existing tools will suffice or whether a more robust environment, such as faster hardware or parallel programming, is needed to execute the models and workflows.
Phase 5 - Communicate Results
• In the Communicate Results phase, the team compares the outcomes of the modeling to the criteria established for success and failure, identifies the key findings, quantifies the business value, and communicates the results to the stakeholders.
Phase 6 - Operationalize
• In the Operationalize phase, the team delivers final reports, briefings, code, and technical documents.
• The team also runs a pilot project to implement the models in a production environment.
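As mentioned in the Phase 4 (Model Building) overview above, the team prepares separate training and test datasets. The following is a minimal sketch with scikit-learn on synthetic, hypothetical data, not a method prescribed by the lifecycle itself:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the project's prepared dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Split into training and test sets, as described for the Model Building phase.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate on held-out data to judge whether the model is robust enough.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```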
In the Discovery phase, the team learns the business domain and its relevant history. In this phase, the data science team needs to perform the activities shown in Figure 1.10.
[Figure 1.10: Discovery phase activities: Learn the Business Domain, Learn Resources, Identify Potential Data Sources, Problem Framing, Initial Hypothesis Development, Interview the Analytics Sponsor, and Identify Key Stakeholders.]
• Learn Resources
• The team needs to learn about all available resources, such as technology, tools, systems, data, and people.
• The team also needs to identify the types of systems needed in later phases to operationalize the models.
• The team also needs to identify gaps between existing tools, technologies, and skills.
• The team needs to determine whether sufficient data is available or whether additional data needs to be collected or purchased from outside sources.
• The team needs to ensure the project team has the right mix of domain experts, customers, analytic talent, and project management to be effective.
• The team needs to evaluate how much time is required and whether the team has the right breadth and depth of skills.
• Problem Framing
• Problem framing is the process of stating the analytics problem to be solved.
• A best practice is to write down the problem statement and share it with the key stakeholders.
• Each team member may have their own perspective on the problem and may have different solutions for it.
• Essentially, the team needs to consider the current situation and its main challenges.
• In this process, the team needs to identify:
• What are the main objectives of the project?
• What needs to be achieved in business terms?
• What needs to be done to meet those needs?
• What will be the outcome of the project?
• The main task of the Discovery phase is developing a set of IHs. These are ideas that the team can test with data. The team can come up with a few primary hypotheses to test and then be creative about developing several more. Hypothesis testing from a statistical perspective can also be done in later phases (a small sketch follows this list).
• The team can compare its answers with the outcome of an experiment or test and can generate additional possible solutions to problems.
• This process also involves gathering and assessing hypotheses from stakeholders and domain experts, as they may have their own perspectives on what the problem is, what the solution should be, and how to arrive at a solution.
• These stakeholders know the domain area well and can offer suggestions on ideas to test as the team formulates hypotheses during this phase. These ideas will also give the team opportunities to expand the project scope into adjacent spaces.
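As noted above, hypothesis testing from a statistical perspective can be done in later phases. The following is a minimal sketch with SciPy on hypothetical delivery-time samples, illustrating how one initial hypothesis might eventually be checked:

```python
from scipy import stats

# Hypothetical samples: idea-delivery times (in days) with and without
# global knowledge transfer, standing in for real project data.
with_transfer = [30, 28, 35, 32, 27, 31, 29, 33]
without_transfer = [40, 38, 45, 36, 42, 39, 41, 44]

# Two-sample t-test: do the two groups differ in mean delivery time?
t_stat, p_value = stats.ttest_ind(with_transfer, without_transfer)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# A small p-value (for example, below 0.05) would support the hypothesis
# that knowledge transfer is associated with faster idea delivery.
```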
Together, these are the main activities the data science team should perform during the Discovery phase.
The Data Preparation phase of the Data Analytics Lifecycle includes the steps to explore, preprocess, and condition data prior to modeling and analysis. The team needs to create a robust environment in which it can explore the data, separate from the production environment. In this phase, the data science team needs to perform the activities shown in Figure 1.11.
[Figure 1.11: Data Preparation phase activities: Prepare the Analytic Sandbox, Perform ETLT, Identify Tools for Data Preparation, and Perform Data Conditioning.]
[Figure: ETL (Extract, Transform, Load), ELT (Extract, Load, Transform), and their combination, ETLT.]
• What is ETL?
o In ETL, users perform extract, transform, and load processes to extract data from a datastore, perform data transformations, and load the transformed data back into the datastore, as shown in Figure 1.14.
• What is ELT?
o In ELT, users extract the data from a datastore and load it into the target environment in its raw form first; the transformations are then performed afterward, within that environment.
• What is the sandbox approach?
o The sandbox approach suggests extract, load, and then transform. The data is extracted in its raw form and loaded into the datastore, where analysts can choose to transform the data into a new state or leave it in its original, raw condition. There is significant value in keeping the raw data in the sandbox before performing any transformations on it.
o Consider the example of fraud detection on credit card usage. Outliers in this data population can represent higher-risk transactions, which may be fraudulent credit card activity. With ETL, these outliers may get filtered out; with ELT, all the data is present in the sandbox, so the fraud-detection analysis can still be performed.
• What is the ETLT process?
o Consider a scenario in which the team wants both cleaned, aggregated data and the original data.
o The team also needs to keep a copy of the original data to compare against, or to look for hidden patterns that may have existed in the data before the cleaning stage. This combined process is called ETLT (a minimal sketch follows this list).
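Below is a minimal pandas sketch of the ETLT idea, assuming a hypothetical transactions.csv with amount and transaction_date columns: the raw extract is loaded into the sandbox unchanged, and a cleaned, aggregated copy is produced alongside it, so outliers such as potentially fraudulent transactions remain available in the raw data:

```python
import pandas as pd

# Extract + Load: bring the raw data into the analytic sandbox unchanged.
raw = pd.read_csv("transactions.csv")                      # hypothetical extract
raw.to_csv("sandbox/transactions_raw.csv", index=False)    # keep the raw copy

# Transform: build a cleaned, aggregated copy for reporting, dropping extreme
# amounts that a traditional ETL flow might have filtered out entirely.
clean = raw[raw["amount"] < raw["amount"].quantile(0.99)]
daily = clean.groupby("transaction_date", as_index=False)["amount"].sum()
daily.to_csv("sandbox/transactions_daily.csv", index=False)

# Both versions live in the sandbox: the raw copy for outlier/fraud analysis,
# the transformed copy for aggregates. This combination is the ETLT process.
```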
• Learn about the Data in Depth
• Learning the data in depth is a critical aspect of data preparation. This activity accomplishes the following goals:
1. It clarifies what data the data science team has access to at the start of the project.
2. It highlights gaps in the organization's existing datasets, which can trigger new data-collection activities within the organization.
3. It identifies datasets outside the organization that the team can obtain through open APIs, data sharing, or purchase to supplement the existing datasets (a brief sketch of the open-API option follows).
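The following is a minimal sketch (using a purely hypothetical endpoint URL) of obtaining a supplementary external dataset through an open API with the requests library:

```python
import pandas as pd
import requests

# Hypothetical open-API endpoint; a real project would use the actual
# provider's URL, parameters, and authentication details.
url = "https://example.com/api/v1/external-data"

response = requests.get(url, params={"region": "IN"}, timeout=30)
response.raise_for_status()

# Load the JSON payload into a DataFrame to supplement existing datasets.
external_df = pd.DataFrame(response.json())
print(external_df.head())
```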
In the Model Planning phase, as shown in Figure 1.15, the data science team decides on candidate models to apply to the data for clustering, classifying, or finding relationships, depending on the goal of the project. During this phase, the team refers to the hypotheses developed in Phase 1. These hypotheses help the team frame the analytics to execute in Phase 4 and select the right methods to achieve its objectives.
• Assess the structure of the dataset; it dictates the tools and analytical techniques for the model building phase. Different tools and approaches are required for different types of data.
• Ensure that the analytical techniques enable the team to meet the business objectives and to accept or reject the working hypotheses.
• Determine whether the situation needs a single model or a series of techniques as part of a larger analytic workflow.
• Research how other analysts have approached the same or a similar problem, and find out which methods and techniques they used to solve it.
• Model Selection
• During model selection, the team makes a list of suitable analytical techniques to fulfill the end goal of the project. The team can observe real-world situations and try to map them to the current problem for model construction.
• In machine learning and data mining, several techniques such as classification, association rules, and clustering are available. The team also needs to identify techniques suitable for Big Data, whether for structured data, unstructured data, or a hybrid approach (a brief sketch of comparing candidate techniques follows).
• Initially, these models can be created using a statistical software package such as R, SAS, or Matlab. Although these tools are designed for data mining and machine learning algorithms, they may have limitations with Big Data, so the team may need to redesign algorithms as required.
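Below is a minimal scikit-learn sketch, on synthetic hypothetical data, of short-listing candidate techniques during model planning by comparing them with cross-validation; as noted above, a real project might instead use R, SAS, or Matlab:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared project data.
X, y = make_classification(n_samples=500, n_features=12, random_state=0)

# Candidate classification techniques identified during model planning.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "naive_bayes": GaussianNB(),
}

# Compare candidates with 5-fold cross-validation to inform model selection.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```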
[Figure 1.15: Model Planning phase activities: Data Exploration and Variable Selection, Model Selection, and Common Tools for the Model Planning Phase.]
After the model building phase, the team evaluates and communicates the results.
• After executing the model, the team needs to compare the outcomes of the modeling to the criteria established for success and failure. The team considers how best to convey the findings and outcomes to the various team members and stakeholders.
• The team needs to determine whether it succeeded or failed in its objectives. A best practice in this phase is to record all the findings, then select the three most significant ones and share them with the stakeholders. The team should reflect on the implications of these findings and measure the business value.
• The team also needs to consider possible improvements and suggest them for future work or for the existing process.
Case Study: GINA
• In 2012, EMC’s new director wanted to improve the company’s engagement of employees across the global centers of excellence (GCE) to drive innovation, research, and university partnerships.
• This project was created to accomplish the following:
o Store formal and informal data.
o Track research from global technologists.
o Mine the data for patterns and insights to improve the team’s operations and strategy.
Phase 1: Discovery
• Team members and roles
o Business user, project sponsor, project manager – Vice President from Office of
CTO
o BI analyst – person from IT
o Data engineer and DBA – people from IT
o Data scientist – distinguished engineer
• The data fell into two categories
o Five years of idea submissions from internal innovation contests
o Minutes and notes representing innovation and research activity from around the
world
• Hypotheses grouped into two categories
o Descriptive analytics of what is happening to spark further creativity, collaboration,
and asset generation
o Predictive analytics to advise executive management of where it should be
investing in the future
o The 10 main IHs that the GINA team developed were as follows:
1. Innovation activity in different geographic regions can be mapped to
corporate strategic directions.
2. The length of time it takes to deliver ideas decreases when global
knowledge transfer occurs as part of the idea delivery process.
3. Innovators who participate in global knowledge transfer deliver ideas
more quickly than those who do not.
4. An idea submission can be analyzed and evaluated for the likelihood of
receiving funding.
5. Knowledge discovery and growth for a particular topic can be measured
and compared across geographic regions.
6. Knowledge transfer activity can identify research-specific boundary
spanners in disparate regions.
7. Strategic corporate themes can be mapped to geographic regions.
8. Frequent knowledge expansion and transfer events reduce the time it
takes to generate a corporate asset from an idea.
9. Lineage maps can reveal when knowledge expansion and transfer did not
(or has not) resulted in a corporate asset.
10. Emerging research topics can be classified and mapped to specific
ideators, innovators, boundary spanners, and assets.
Phase 2: Data Preparation
• Set up an analytics sandbox.
• Discovered that certain data needed conditioning and normalization and that missing
datasets were critical.
• The team recognized that poor-quality data could affect subsequent steps.
• They discovered that many names were misspelled and that there were problems with extra spaces.
• These seemingly small problems had to be addressed (a minimal cleaning sketch follows).
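The following is a minimal pandas sketch, with made-up example values, of the kind of conditioning described above: trimming extra spaces and normalizing name casing so that the same person is not counted twice:

```python
import pandas as pd

# Hypothetical records showing the data-quality problems the team found:
# extra spaces and inconsistent spelling/casing of the same name.
ideas = pd.DataFrame({
    "submitter": ["  Asha Rao", "asha rao ", "Rahul  Mehta", "Rahul Mehta"],
    "idea_id": [101, 102, 103, 104],
})

# Condition the name field: strip whitespace, collapse repeated spaces,
# and normalize casing.
ideas["submitter"] = (
    ideas["submitter"]
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)
    .str.title()
)

print(ideas["submitter"].unique())   # ['Asha Rao' 'Rahul Mehta']
```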
Phase 3: Model Planning
• The study included the following considerations:
o Identify the right milestones to achieve the goals.
o Trace how people move ideas from each milestone toward the goal.
o Track ideas that die and others that reach the goal.
o Compare times and outcomes using a few different methods.
Phase 4: Model Building
• Several analytic methods were employed:
o NLP on textual descriptions.
o Social network analysis using R and RStudio.
o Developing social graphs and visualizations.
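The GINA team performed this analysis in R and RStudio; the following is only an illustrative Python sketch with networkx and hypothetical names, showing the general idea of a social graph of idea submitters and their collaborators:

```python
import networkx as nx

# Hypothetical collaboration pairs (submitter, co-contributor) drawn from
# idea submissions; the real analysis used R/RStudio on GINA data.
edges = [
    ("Asha", "Rahul"), ("Asha", "Mei"), ("Rahul", "Mei"),
    ("Mei", "Carlos"), ("Carlos", "Diego"), ("Diego", "Asha"),
]

G = nx.Graph()
G.add_edges_from(edges)

# Degree centrality highlights well-connected innovators,
# i.e. potential boundary spanners across regions.
centrality = nx.degree_centrality(G)
for person, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{person}: {score:.2f}")
```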
[Figure: Social graph of data submitters and finalists.]