
Introduction to Big Data Analytics

By

Prof Shibdas Dutta


Associate Professor,

DCG Data Core Systems India Pvt Ltd, Kolkata
Overview

• Big Data Overview

• State of practice in analytics

• Role of Data Scientists

• Examples of Big Data Analytics

• Data Analytics Lifecycle.


What is Big Data?

• No standard definition! Here is one from Wikipedia:


• Big data is a term for data sets that are so large or complex
that traditional data processing applications are inadequate.

• Challenges include analysis, capture, data curation, search,


sharing, storage, transfer, visualization, querying, updating
and information privacy.

• Analysis of data sets can find new correlations to "spot


business trends, prevent diseases, combat crime and so on."
Who is generating Big Data?

• Social media
• User tracking & engagement
• Homeland security
• eCommerce
• Financial services
• Real-time search

What is Big Data?
• The total amount of data created, captured, copied, and consumed globally is increasing
rapidly; it reached 64.2 zettabytes in 2020.

• It’s not easy to measure the total volume of data stored electronically, but an
estimate is that over the next five years up to 2025, global data creation is
projected to grow to more than 180 zettabytes.

Consider the following:


• The New York Stock Exchange generates about 4-5 terabytes of data per day.
• Facebook hosts more than 240 billion photos, growing at 7 petabytes per
month.
• Ancestry.com, the genealogy site, stores around 10 petabytes of data.
• The Internet Archive stores around 18.5 petabytes of data.
Data Storage and Analysis

• Although the storage capacities of hard drives have increased massively over
the years, access speeds (the rate at which data can be read from drives)
have not kept up.

• The size, speed, and complexity of big data necessitate the use of specialist
software, which in turn relies on significant processing power and storage
capabilities. While costly, embracing big data analytics enables organizations
to derive powerful insights and gain a competitive edge.

• The value of the big data analytics market is expected to grow from around
15 billion U.S. dollars in 2019 to:

• 68 billion U.S. dollars by 2025
• 655 billion U.S. dollars by 2029
Big Data Characteristics: 3V
Volume (Scale)

• Data Volume
• Growth 40% per year
• From 8 zettabytes (2016) to 44 zettabytes (2020)
• Data volume is increasing exponentially

[Figure: "How much data?" - the exponential increase in collected/generated data, illustrated
with examples such as: processing 20 PB a day (2008); crawling 20B web pages a day (2012);
a search index of 100+ PB covering 400B pages (5/2014); Bigtable serving 2+ EB at 600M QPS
(5/2014); Hadoop clusters of 365 PB on 330K nodes (6/2014) and of 10K nodes, 150K cores,
150 PB (4/2014); 150 PB on 50K+ servers; 300 PB of data in Hive with 600 TB/day ingested
(4/2014); S3 holding 2 trillion objects at 1.1M requests/second (4/2013); LHC: ~15 PB a year;
LSST: 6-10 PB a year (~2020); SKA: 0.3-1.5 EB per year (~2020).]


Variety (Complexity)

• Different Types:
• Relational Data (Tables/Transaction/Legacy Data)
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data
• Social Network, Semantic Web (RDF), …
• Streaming Data
• You can only scan the data once
• A single application can be generating/collecting many types of data
• Different Sources :
• Movie reviews from IMDB and Rotten Tomatoes
• Product reviews from different provider websites
To extract knowledge, all these types of data need to be linked
together
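As a small illustration (not from the source) of linking data from different sources, the sketch below joins movie reviews from two providers on the shared movie title; the data frames and column names are assumptions standing in for exports from IMDB and Rotten Tomatoes.

```python
# Sketch: linking reviews from two different sources on a shared key.
import pandas as pd

imdb = pd.DataFrame({
    "title": ["Inception", "Arrival"],
    "imdb_rating": [8.8, 7.9],
})
rotten = pd.DataFrame({
    "title": ["Inception", "Arrival"],
    "tomatometer": [87, 94],
})

# An inner join on the title gives one linked view per movie,
# which can then be analyzed as a single data set.
linked = imdb.merge(rotten, on="title", how="inner")
print(linked)
```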
A Single View to the Customer

[Figure: a single view of the customer - data from social media, banking, finance, gaming,
entertainment, purchases, and known customer history linked around "Our Customer".]
A Global View of Linked Big Data

[Figure: a heterogeneous information network linking doctors, patients, prescriptions,
diagnoses, drugs, mutations (e.g., "Ebola"), tissue, genes, and proteins, shown alongside
a diversified social network.]
Velocity (Speed)

• Data is being generated fast and needs to be processed fast

• Online Data Analytics
• Late decisions mean missing opportunities
• Examples (see the sketch after this list)
• E-Promotions: based on your current location, your purchase history, and
what you like, send promotions right now for the store next to you
• Healthcare monitoring: sensors monitoring your activities and body;
any abnormal measurements require immediate reaction
• Disaster management and response
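To make the velocity idea concrete, here is a minimal sketch (not from the source) of threshold-based alerting on a stream of sensor readings, in the spirit of the healthcare monitoring example; the readings and the alert threshold are invented for illustration.

```python
# Minimal sketch: reacting to streaming sensor data as it arrives
# (hypothetical readings and threshold, for illustration only).
import random
import time

HEART_RATE_ALERT = 120  # assumed threshold for an "abnormal" reading

def sensor_stream(n_readings=10):
    """Simulate a stream of heart-rate readings arriving over time."""
    for _ in range(n_readings):
        yield random.randint(60, 140)
        time.sleep(0.1)  # data arrives continuously, not in a batch

for reading in sensor_stream():
    if reading > HEART_RATE_ALERT:
        # In a real system this would trigger an immediate reaction
        # (notify a clinician, raise an alarm) rather than a print.
        print(f"ALERT: abnormal reading {reading}")
    else:
        print(f"ok: {reading}")
```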
Real-Time Analytics/Decision Requirement

• Product recommendations that are relevant & compelling
• Learning why customers switch to competitors and their offers, in time to counter
• Influencing customer behavior
• Friend invitations to join a game or activity that expands a business
• Improving the marketing effectiveness of a promotion while it is still in play
• Preventing fraud as it is occurring & preventing more proactively
Extended Big Data Characteristics: 6V

• Volume: In a big data environment, the amounts of data collected and processed are
much larger than those stored in typical relational databases.

• Variety: Big data consists of a rich variety of data types.

• Velocity: Big data arrives to the organization at high speeds and from multiple sources
simultaneously.

• Veracity: Data quality issues are particularly challenging in a big data context.

• Visibility/Visualization: After big data has been processed, we need a way of presenting
the data in a manner that is readable and accessible.

• Value: Ultimately, big data is meaningless if it does not provide value toward some
meaningful goal.
Veracity (Quality & Trust)
• Data = quantity + quality

• When we talk about big data, we typically mean its quantity:


• What capacity does a system provide to cope with the sheer size of the data?
• Is a query feasible on big data within our available resources?
• How can we make our queries tractable on big data?
• ...
• Can we trust the answers to our queries?
• Dirty data routinely leads to misleading financial reports and flawed strategic business
planning decisions, resulting in loss of revenue, credibility, and customers, with potentially
disastrous consequences

• The study of data quality is as important as data quantity.


Data in real life is often dirty:

• 81 million National Insurance numbers, but only 60 million eligible citizens
• 98,000 deaths each year caused by errors in medical data
• 500,000 dead people retain active Medicare cards
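A minimal sketch (not from the source) of routine veracity checks before trusting query results; the column names, values, and plausibility rule are assumptions for illustration.

```python
# Sketch: simple data-quality (veracity) checks on a small table.
import pandas as pd

records = pd.DataFrame({
    "insurance_no": ["A1", "A2", "A2", None],
    "age": [34, 52, 52, 130],
})

# Duplicate identifiers: the same insurance number appearing twice.
duplicates = records[records.duplicated("insurance_no", keep=False)]

# Missing identifiers and implausible values are also "dirty data".
missing = records["insurance_no"].isna().sum()
implausible = records[records["age"] > 110]

print(f"duplicate ids:\n{duplicates}")
print(f"missing ids: {missing}")
print(f"implausible ages:\n{implausible}")
```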
Visibility/Visualization
• Visible to the process of big data management
• Big Data – visibility = Black Hole?

A visualization of Divvy bike rides across Chicago

• Big data visualization tools:


Value

• Big data is meaningless if it does not provide value toward some


meaningful goal
Big Data: 6V in Summary

Transforming Energy and Utilities through Big Data & Analytics. By Anders
Quitzau@IBM
Other V’s

• Variability
Variability refers to data whose meaning is constantly changing. This is
particularly the case when gathering data relies on language processing.
• Viscosity
This term is sometimes used to describe the latency or lag time in the
data relative to the event being described. We found that this is just as
easily understood as an element of Velocity.
• Virality
Defined by some users as the rate at which the data spreads; how often
it is picked up and repeated by other users or events.
• Volatility
Big data volatility refers to how long data is valid and how long it should
be stored. You need to determine at what point data is no longer relevant to
the current analysis.
• More V’s in the future …
Big Data Overview

Several industries have led the way in developing their ability to gather and exploit data:

• Credit card companies monitor every purchase their customers make and can identify
fraudulent purchases with a high degree of accuracy using rules derived by processing
billions of transactions.

• Mobile phone companies analyze subscribers’ calling patterns to determine, for


example, whether a caller’s frequent contacts are on a rival network. If that rival
network is offering an attractive promotion that might cause the subscriber to defect,
the mobile phone company can proactively offer the subscriber an incentive to remain
in her contract.

• For companies such as LinkedIn and Facebook, data itself is their primary product. The
valuations of these companies are heavily derived from the data they gather and host,
which contains more and more intrinsic value as the data grows.
Big Data Overview
McKinsey’s definition of Big Data implies that organizations will need new data
architectures and analytic sandboxes, new tools, new analytical methods, and an
integration of multiple skills into the new role of the data scientist.
Big Data Overview
• Social media and genetic sequencing are among the fastest-growing sources of Big
Data and examples of untraditional sources of data being used for analysis.

• For example, in 2012 Facebook users posted 700 status updates per second worldwide,
which can be leveraged to deduce latent interests or political views of users and show
relevant ads. For instance, an update in which a woman changes her relationship
status from “single” to “engaged” would trigger ads on bridal dresses, wedding
planning, or name-changing services.

• Facebook can also construct social graphs to analyze which users are connected to
each other as an interconnected network. In March 2013, Facebook released a new
feature called “Graph Search,” enabling users and developers to search social graphs
for people with similar interests, hobbies, and shared locations.
Big Data Overview

• Another example comes from genomics. Genetic sequencing and human genome
mapping provide a detailed understanding of genetic makeup and lineage. The health
care industry is looking toward these advances to help predict which illnesses a person
is likely to get in his lifetime and take steps to avoid these maladies or reduce their
impact through the use of personalized medicine and treatment.

• Such tests also highlight typical responses to different medications and pharmaceutical
drugs, heightening risk awareness of specific drug treatments.
Data Structures
• Big data can come in multiple forms, including structured and non-structured data such
as financial data, text files, multimedia files, and genetic mappings.

• Contrary to much of the traditional data analysis performed by organizations, most of


the Big Data is unstructured or semi-structured in nature, which requires different
techniques and tools to process and analyze.

• Distributed computing environments and massively parallel processing (MPP)


architectures that enable parallelized data ingest and analysis are the preferred
approach to process such complex data.
Data Structures
• 80-90% of future data growth is expected to come from
non-structured data types

• For example, a classic Relational Database


Management System (RDBMS) may store call logs for
a software support call center. The RDBMS may store
characteristics of the support calls as typical
structured data, with attributes such as time stamps,
machine type, problem type, and operating system.

• The system will also likely have unstructured, quasi-structured, or
semi-structured data, such as free-form call log
information taken from an e-mail ticket of the
problem, customer chat history, or a transcript of a
phone call describing the technical problem and its
solution, or an audio file of the phone call conversation.
Data Structures

• Structured data: Data containing a


defined data type, format, and
structure (that is, transaction data,
online analytical processing [OLAP]
data cubes, traditional RDBMS, CSV
files, and even simple spreadsheets).
Data Structures

• Semi-structured data: Textual data


files with a discernible pattern that
enables parsing (such as Extensible
Markup Language [XML] data files
that are self-describing and defined
by an XML schema).
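To show what "a discernible pattern that enables parsing" means in practice, here is a small sketch (not from the source) that parses a self-describing XML fragment with Python's standard library; the element names are invented for illustration.

```python
# Sketch: parsing semi-structured (XML) data with the standard library.
import xml.etree.ElementTree as ET

xml_data = """
<calls>
  <call id="1"><machine>laptop</machine><problem>boot failure</problem></call>
  <call id="2"><machine>server</machine><problem>disk full</problem></call>
</calls>
"""

root = ET.fromstring(xml_data)
for call in root.findall("call"):
    # The tags act as a self-describing schema, so each field can be
    # extracted by name even without a fixed table layout.
    print(call.get("id"), call.findtext("machine"), call.findtext("problem"))
```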
Data Structures

• Quasi-structured data: Textual data


with erratic data formats that can be
formatted with effort, tools, and time (for
instance, web clickstream data that may
contain inconsistencies in data values
and formats).

• Quasi-structured data is a common

phenomenon that bears closer scrutiny.
For example, visiting three websites in
succession adds three URLs to the log files
monitoring the user's computer or network
use.

• This set of three URLs reflects the


websites and actions taken to find Data
Science information related to EMC.
Together, this comprises a clickstream
that can be parsed and mined by data
scientists to discover usage patterns and
uncover relationships among clicks and
areas of interest on a website or group of
sites.
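The clickstream idea can be sketched as follows (not from the source): raw log lines with inconsistent formats are parsed, with some effort, into an ordered list of hosts and paths. The log lines below are invented for illustration.

```python
# Sketch: extracting structure from quasi-structured clickstream logs.
from urllib.parse import urlparse

log_lines = [
    "2024-01-01 10:00:01 GET https://www.example.com/data-science",
    "10:00:05 visit -> https://www.example.com/search?q=data+science",
    "https://www.example.com/products/analytics  (referrer: search)",
]

clickstream = []
for line in log_lines:
    # Formats are inconsistent, so locate the URL token first.
    url = next(tok for tok in line.split() if tok.startswith("http"))
    parsed = urlparse(url)
    clickstream.append((parsed.netloc, parsed.path))

# The ordered list of pages is the clickstream that can be mined
# for usage patterns and areas of interest.
print(clickstream)
```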
Data Structures

• Unstructured data: Data that has


no inherent structure, which may
include text documents, PDFs,
images, and video.
State of the Practice in Analytics

Current business problems provide many opportunities for organizations to become more
analytical and data driven, as shown

Business Driver: Examples

• Optimize business operations: sales, pricing, profitability, efficiency
• Identify business risk: customer churn, fraud, default
• Predict new business opportunities: upsell, cross-sell, best new customer prospects
• Comply with laws or regulatory requirements: Anti-Money Laundering, Fair Lending,
Basel II-III, Sarbanes-Oxley (SOX)

These are four categories of common business problems that organizations contend with
and where they have an opportunity to leverage advanced analytics to create competitive
advantage. Rather than only performing standard reporting on these areas,
organizations can apply advanced analytical techniques to optimize processes and derive
more value from these common tasks.
BI Versus Data Science

• BI tends to provide reports, dashboards,


and queries on business questions for the
current period or in the past. BI systems make
it easy to answer questions related to quarter-
to-date revenue, progress toward quarterly
targets, and understand how much of a given
product was sold in a prior quarter or year.

• Data Science tends to use disaggregated data


in a more forward-looking, exploratory way,
focusing on analyzing the present and
enabling informed decisions about the
future. Rather than aggregating historical
data to look at how many of a given product
sold in the previous quarter, a team may
employ time series analysis to forecast
future product sales and revenue more
accurately than extending a simple trend line.
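As a rough sketch (not from the source) of this contrast, the code below compares extending a straight trend line with a very simple forecast that also accounts for a seasonal pattern; the quarterly sales figures are invented.

```python
# Sketch: extending a trend line vs. a simple trend-plus-season forecast.
import numpy as np

# Two years of invented quarterly sales with an upward trend + seasonality.
sales = np.array([100, 120, 90, 150, 110, 130, 100, 165], dtype=float)
quarters = np.arange(len(sales))

# BI-style view: fit a straight trend line and extend it one quarter.
slope, intercept = np.polyfit(quarters, sales, 1)
trend_forecast = slope * len(sales) + intercept

# Data-science-style view: add the average seasonal deviation for Q1
# (indices 0 and 4) on top of the trend before forecasting quarter 8.
trend_fit = slope * quarters + intercept
seasonal_dev = (sales - trend_fit)[[0, 4]].mean()
seasonal_forecast = trend_forecast + seasonal_dev

print(f"trend-only forecast:   {trend_forecast:.1f}")
print(f"trend+season forecast: {seasonal_forecast:.1f}")
```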
Current Analytical Architecture

• Data Science projects need workspaces that are purpose-built for experimenting
with data, with flexible and agile data architectures. Most organizations still have data
warehouses that provide excellent support for traditional reporting and simple data
analysis activities but unfortunately have a more difficult time supporting more robust
analyses.

• The data flow to the Data Scientist and how the individual fits into the process of
getting data to analyze on projects:

1. For data sources to be loaded into the data warehouse, data needs to be well
understood, structured, and normalized with the appropriate data type definitions.

2. As a result of this level of control on the Enterprise Data Warehouse (EDW),


additional local systems may emerge in the form of departmental warehouses
and local data marts that business users create to accommodate their need for
flexible analysis. These local data marts may not have the same constraints for
security and structure as the main EDW and allow users to do some level of more
in-depth analysis
Current Analytical Architecture

Inside a Google data center: https://www.youtube.com/watch?v=XZmGGAbHqa0
Inside The World's Largest Data Center: https://www.youtube.com/watch?v=g7JaN3rTK2A
Current Analytical Architecture

3. Once in the data warehouse, data is read by additional applications across the
enterprise for BI and reporting purposes. These are high-priority operational
processes getting critical data feeds from the data warehouses and repositories.

4. At the end of this workflow, analysts get data provisioned for their downstream
analytics. Because users generally are not allowed to run custom or intensive
analytics on production databases, analysts create data extracts from the EDW to
analyze data offline in R or other local analytical tools.

Because new data sources slowly accumulate in the EDW due to the
rigorous validation and data structuring process, data is slow to move into the EDW,
and the data schema is slow to change.

The typical data architectures just described are designed for storing and processing
mission-critical data, supporting enterprise applications, and enabling corporate
reporting activities, and thus limit the ability of analysts to iterate on the data in a
separate nonproduction environment.
Key Roles for the New Big Data Ecosystem

The need for applying more advanced analytical techniques to increasingly complex
business problems has driven the emergence of new roles, new technology platforms,
and new analytical methods.
Three Roles for the New Big Data
Ecosystem
Key Roles for the New Big Data Ecosystem

• Deep Analytical Talent — is technically savvy, with strong analytical skills. Members
possess a combination of skills to handle raw, unstructured data and to apply complex
analytical techniques at massive scales. This group has advanced training in quantitative
disciplines, such as mathematics, statistics, and machine learning.

• Data Savvy Professionals — has less technical depth but has a basic knowledge of
statistics or machine learning and can define key questions that can be answered using
advanced analytics. These people tend to have a base knowledge of working with data, or
an appreciation for some of the work being performed by data scientists and others with
deep analytical talent.

• Technology and Data Enablers — this group represents people providing technical
expertise to support analytical projects, such as provisioning and administering analytical
sandboxes, and managing large-scale data architectures that enable widespread analytics
within companies and other organizations. This role requires skills related to computer
engineering, programming, and database administration.
Role of Data Scientists

Following are the activities that data scientists perform:

• Reframe business challenges as analytics challenges - Specifically, this is a skill


to diagnose business problems, consider the core of a given problem, and determine
which kinds of candidate analytical methods can be applied to solve it.

• Design, implement, and deploy statistical models and data mining


techniques on Big Data - This set of activities is mainly what people think about
when they consider the role of the Data Scientist: namely, applying complex or
advanced analytical methods to a variety of business problems using data.

• Develop insights that lead to actionable recommendations. It is critical to note


that applying advanced methods to data problems does not necessarily drive new
business value. Instead, it is important to learn how to draw insights out of the data
and communicate them effectively.
Role of Data Scientists

Data scientists are generally thought of as having five main sets of skills and behavioral
characteristics:

• Quantitative skills: such as mathematics or statistics

• Technical aptitude: namely, software engineering, machine learning, and programming


skills

• Skeptical mind-set and critical thinking: It is important that data scientists can
examine their work critically rather than in a one-sided way.

• Curious and creative: Data scientists are passionate about data and finding creative
ways to solve problems and portray information.

• Communicative and collaborative: Data scientists must be able to articulate the


business value in a clear way and collaboratively work with other groups, including project
sponsors and key stakeholders
Role of Data Scientists

Data scientists are generally


comfortable using this blend of skills
to acquire, manage, analyze, and
visualize data and tell compelling
stories about it.
Examples of Big Data Analytics

Big Data presents many opportunities to improve sales and marketing analytics. An example of
this is the U.S. retailer Target. Charles Duhigg's book The Power of Habit discusses how the
retailer used Big Data and advanced analytical methods to drive new revenue. After analyzing
consumer-purchasing behavior, the retailer made a great deal of money from three main life-event situations:

1. Marriage, when people tend to buy many new products
2. Divorce, when people buy new products and change their spending habits
3. Pregnancy, when people have many new things to buy and have an urgency to buy them

The retailer determined that the most lucrative of these life events is the third situation: pregnancy.
Using data collected from shoppers, the retailer was able to identify this fact and
predict which of its shoppers were pregnant. This kind of knowledge allowed the retailer to
offer specific coupons and incentives to its pregnant shoppers. In fact, the retailer could not
only determine whether a shopper was pregnant, but also in which month of pregnancy she
might be. This enabled the retailer to manage its inventory, knowing that there would be demand
for specific products and that it would likely vary by month over the coming nine- to ten-month cycle.
Examples of Big Data Analytics

Social media represents a tremendous opportunity to leverage social and professional


interactions to derive new insights. LinkedIn exemplifies a company in which data itself is
the product. Early on, LinkedIn founder Reid Hoffman saw the opportunity to create a social
network for working professionals.

As of 2014, LinkedIn had more than 250 million user accounts (current data??) and
has added many additional features and data-related products, such as recruiting, job
seeker tools, advertising, and InMaps, which show a social graph of a user’s professional
network.

• Write 2 examples of Big Data Analytics (approx. 250 words each)


Key Roles for a Successful Analytics Project

There may be roughly seven key roles that need to be fulfilled for a high functioning data
science team to execute analytic projects successfully.

• Business User: Someone who understands the domain area and usually benefits from the
results. This person can consult and advise the project team on the context of the project,
the value of the results, and how the outputs will be operationalized.

• Project Sponsor: Responsible for the genesis of the project. Provides the impetus and
requirements for the project and defines the core business problem. Generally provides the
funding and gauges the degree of value from the final outputs of the working team. This
person sets the priorities for the project and clarifies the desired outputs.

• Project Manager: Ensures that key milestones and objectives are met on time and at the
expected quality

• Business Intelligence Analyst: Provides business domain expertise based on a deep


understanding of the data, key performance indicators (KPIs), key metrics, and business
intelligence from a reporting perspective. Business Intelligence Analysts generally create
dashboards and reports and have knowledge of the data feeds and sources.
Key Roles for a Successful Analytics Project
• Database Administrator (DBA): Provisions and configures the database environment to support the
analytics needs of the working team. These responsibilities may include providing access to key
databases or tables and ensuring the appropriate security levels are in place related to the data
repositories.
• Data Engineer: Leverages deep technical skills
to assist with tuning SQL queries for data
management and data extraction, and provides
support for data ingestion into the analytic
sandbox. Whereas the DBA sets up and
configures the databases to be used, the data
engineer executes the actual data extractions
and performs substantial data manipulation to
facilitate the analytics. The data engineer works
closely with the data scientist to help shape
data in the right ways for analyses.

• Data Scientist: Provides subject matter


expertise for analytical techniques, data
modeling, and applying valid analytical
techniques to given business problems. Ensures
overall analytics objectives are met. Designs
and executes analytical methods and
approaches with the data available to the
project.
Data Analytics Lifecycle
• The scientific method has been in use for centuries and still provides a solid framework for
thinking about and deconstructing problems into their principal parts. One of the most valuable
ideas of the scientific method relates to forming hypotheses and finding ways to test ideas.
https://en.wikipedia.org/wiki/Scientific_method

• CRISP-DM provides useful input on ways to frame analytics problems and is a popular approach for
data mining.
https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining

• Tom Davenport’s DELTA framework : The DELTA framework offers an approach for data analytics
projects, including the context of the organization’s skills, datasets, and leadership engagement.
(Analytics at Work: Smarter Decisions, Better Results, 2010, Harvard Business Review Press )

• Doug Hubbard’s Applied Information Economics (AIE) approach : AIE provides a framework for
measuring intangibles and provides guidance on developing decision models, calibrating expert
estimates, and deriving the expected value of information.
(How to Measure Anything: Finding the Value of Intangibles in Business, 2010, Hoboken, NJ: John
Wiley & Sons)

• “MAD Skills” by Cohen et al. offers input for several of the techniques mentioned in Phases 2–4 that
focus on model planning, execution, and key findings.
(MAD Skills: New Analysis Practices for Big Data, Watertown, MA 2009)
Phase 1: Discovery

In this phase, the data science team must learn and investigate the problem, develop
context and understanding, and learn about the data sources needed and available for the
project. In addition, the team formulates initial hypotheses that can later be tested with
data.

• Learning the Business Domain


• Resources
• Framing the Problem
• Identifying Key Stakeholders
• Interviewing the Analytics Sponsor
• Developing Initial Hypotheses
• Identifying Potential Data Sources
Phase 2: Data Preparation

This phase includes the steps to explore, preprocess, and condition data prior to modeling and
analysis.

• In this phase, the team needs to create a robust environment in which it can explore the data
that is separate from a production environment. Usually, this is done by preparing an
analytics sandbox.

• To get the data into the sandbox, the team needs to perform ETL, by a combination of
extracting, transforming, and loading data into the sandbox. Once the data is in the sandbox,
the team needs to learn about the data and become familiar with it.

• The team also must decide how to condition and transform data to get it into a format to
facilitate subsequent analysis.

• The team may perform data visualizations to help team members understand the data,
including its trends, outliers, and relationships among data variables.
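A minimal ETL sketch (not from the source): extract a raw export, apply light conditioning, and load it into a local analytics sandbox separate from production. The file name, table name, and columns are assumptions for illustration.

```python
# Sketch: a small ETL step into an analytics sandbox (a local SQLite file),
# kept separate from any production database. Paths/columns are assumed.
import sqlite3
import pandas as pd

# Extract: read a raw export (e.g., provided by the data engineer).
raw = pd.read_csv("raw_support_calls.csv")

# Transform: basic conditioning - normalize column names, drop bad rows.
raw.columns = [c.strip().lower() for c in raw.columns]
clean = raw.dropna(subset=["call_id"]).drop_duplicates("call_id")

# Load: write into the sandbox where the team can explore freely.
with sqlite3.connect("analytics_sandbox.db") as conn:
    clean.to_sql("support_calls", conn, if_exists="replace", index=False)
```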
Phase 2: Data Preparation

• Preparing the Analytic Sandbox (workspace)


• Performing ETL
• Learning About the Data
• Data Conditioning
• Survey and Visualize
Phase 3: Model Planning

• The data science team identifies candidate models to apply to the data for clustering,
classifying, or finding relationships in the data depending on the goal of the project.

• Given the kind of data and resources that are available, evaluate whether similar,
existing approaches will work or if the team will need to create something new.

Market Sector: Analytic Techniques/Methods Used

• Consumer Packaged Goods: multiple linear regression, automatic relevance
determination (ARD), and decision tree
• Retail Banking: multiple regression
• Retail Business: logistic regression, decision tree
• Wireless Telecom: neural network, decision tree, hierarchical neurofuzzy
systems, rule evolver, logistic regression
Phase 3: Model Planning

Some of the activities to consider in this phase include the following:

• Assess the structure of the datasets


• Ensure that the analytical techniques enable the team to meet the business
objectives and accept or reject the working hypotheses.
• Determine if the situation warrants a single model or a series of techniques as part
of a larger analytic workflow
• Data Exploration and Variable Selection
• Model Selection
• Common Tools for the Model Planning Phase (R, SQL Analysis Service, SAS/Access)
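As a sketch (not from the source) of evaluating candidate models during planning, the code below cross-validates two of the common techniques from the table above on a synthetic dataset; in a real project the candidates and data would come from the business problem and the prepared sandbox.

```python
# Sketch: comparing candidate models with cross-validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the project's prepared data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```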
Phase 4: Model Building

• The data science team needs to develop datasets for training, testing, and production
purposes. These datasets enable the data scientist to develop the analytical model and
train it (“training data”), while holding aside some of the data (“hold-out data” or “test
data”) for testing the model.

• It is critical to ensure that the training and test datasets are sufficiently robust for the
model and analytical techniques
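A minimal sketch (not from the source) of holding aside test data and checking the fitted model against it; the dataset is synthetic and the 70/30 split is an assumption.

```python
# Sketch: train / hold-out split and evaluation in model building.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=1)

# Hold aside 30% of the data as "hold-out" / test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Validity check: does the model appear accurate on data it never saw?
print(f"hold-out accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```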
Phase 4: Model Building

Questions to consider include these:

• Does the model appear valid and accurate on the test data?
• Does the model output/behavior make sense to the domain experts?
• Do the parameter values of the fitted model make sense in the context of the domain?
• Is the model sufficiently accurate to meet the goal?
• Does the model avoid intolerable mistakes?
• Are more data or more inputs needed? Do any of the inputs need to be transformed or
eliminated?
• Will the kind of model chosen support the runtime requirements?
• Is a different form of the model required to address the business problem? If so, go back
to the model planning phase and revise the modeling approach.

Write some common tools for the Model Building Phase???


Phase 5: Communicate Results

• After executing the model, the team needs to compare the outcomes of the modeling to
the criteria established for success and failure.

• The team considers how best to articulate the findings and outcomes to the various
team members and stakeholders, considering caveats, assumptions, and any limitations
of the results

• As a result of this phase, the team will have documented the key findings and major
insights derived from the analysis.

• The deliverable of this phase will be the most visible portion of the process to the
outside stakeholders and sponsors, so take care to clearly articulate the results,
methodology, and business value of the findings
Phase 6: Operationalize

• The team communicates the benefits of the project more broadly and sets up a pilot
project to deploy the work in a controlled way before broadening the work to a full
enterprise or ecosystem of users.

• This approach enables the team to learn about the performance and related constraints
of the model in a production environment on a small scale and make adjustments before
a full deployment.

• While scoping the effort involved in conducting a pilot project, consider running the
model in a production environment for a discrete set of products or a single line of
business, which tests the model in a live setting.

• Part of the operationalizing phase includes creating a mechanism for performing


ongoing monitoring of model accuracy and, if accuracy degrades, finding ways to retrain
the model.
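A rough sketch (not from the source) of such a monitoring mechanism: score a recent batch of labeled production data, compare accuracy to an agreed floor, and flag the model for retraining when it degrades. The threshold and the stand-in model/data are assumptions.

```python
# Sketch: ongoing monitoring of model accuracy on recent production data,
# with a retraining trigger when accuracy degrades (threshold assumed).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

ACCURACY_FLOOR = 0.80  # assumed success criterion carried over from Phase 5

# Stand-ins for the deployed model and a recent batch of labeled data.
X, y = make_classification(n_samples=400, n_features=8, random_state=2)
model = LogisticRegression(max_iter=1000).fit(X[:300], y[:300])
recent_X, recent_y = X[300:], y[300:]

accuracy = accuracy_score(recent_y, model.predict(recent_X))
print(f"accuracy on recent data: {accuracy:.3f}")
if accuracy < ACCURACY_FLOOR:
    # In the operationalize phase this would kick off retraining
    # with fresh data rather than just printing a message.
    print("accuracy degraded - schedule model retraining")
```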
Happy Learning
