Draw The Data Analytics Life Cycle and Explain Each Phase With Examples

The document outlines the Data Analytics Life Cycle, detailing six phases: Discovery, Data Preparation, Model Planning, Model Building, Communicate Results, and Operationalize, each with specific activities and examples. It also differentiates between Business Intelligence and Data Science, discusses the importance of model selection, and highlights tools used in various phases of analytics. Additionally, it addresses the causes of data deluge and its implications with a real-life example of a social media platform.



1. Draw the Data Analytics Life Cycle and explain each phase with
examples.

Phases:

[Diagram: Data Analytics Life Cycle showing the six phases, Discovery through Operationalize (figure in the accompanying slides)]

1. Discovery: In Phase 1, the team learns the business domain, including relevant history such as whether the organization or business unit has attempted similar projects in the past from which it can learn. In short: understand business goals and identify key resources.
2. Data Preparation: Phase 2 requires an analytic sandbox in which the team can work with data and perform analytics for the duration of the project. The team executes extract, load, and transform (ELT) or extract, transform, and load (ETL), together abbreviated ETLT, to get data into the sandbox. In short: ETLT data into an analytics sandbox.
3. Model Planning: In Phase 3 the team determines the methods, techniques, and workflow it intends to follow for the subsequent model building phase. In short: select appropriate modeling techniques.
4. Model Building: In Phase 4, the team develops datasets for testing, training, and production purposes. In addition, the team builds and executes models based on the work done in the model planning phase. In short: develop datasets and build models.
5. Communicate Results: In Phase 5, the team, in collaboration with major stakeholders, determines whether the results of the project are a success or a failure based on the criteria developed in Phase 1. In short: visualize findings and present insights to stakeholders.
6. Operationalize: In Phase 6, the team delivers final reports, briefings, code, and technical documents. In short: deploy the model into production for decision-making.

2. What is the Data Preparation phase? Explain ETLT process and the
role of the Analytics Sandbox.
Data Preparation: Phase 2 requires the presence of an analytic sandbox, in which the team can work with data and perform analytics for the duration of the project.
The team needs to execute extract, load, and transform (ELT) or extract, transform, and load (ETL) to get data into the sandbox. ELT and ETL are sometimes combined and abbreviated as ETLT. Data should be transformed in the ETLT process so the team can work with it and analyze it.

Activities:
● Prepare the analytic sandbox (commonly referred to as a workspace)
● Perform ETLT; ETL software is used to manage all aspects of data preparation
● Learn about the data: explore, pre-process, and condition it
● Survey and visualize the data
● Ask: do I have enough good-quality data to start building the model?
● Data preparation typically consumes about 50% of the project's time
● Tools: Hadoop, Alpine Miner, OpenRefine, Data Wrangler

ETLT steps (used to manage all aspects of data preparation):
1. Extract data from sources.
2. Transform it to a suitable format.
3. Load it into the sandbox.
4. Transform again as needed for analysis.

Analytics Sandbox: an isolated environment where analysts can safely manipulate and model data without affecting live systems.
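A minimal sketch of the ETLT flow using pandas, with a local SQLite file standing in for the analytics sandbox; the file name, table name, and column names are hypothetical:

```python
import sqlite3

import pandas as pd

# Extract: pull raw data from a source system (hypothetical CSV export)
raw = pd.read_csv("sales_export.csv")

# Transform (first pass): light cleaning so the data is loadable
raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")
raw = raw.dropna(subset=["order_date", "amount"])

# Load: land the cleaned data in the sandbox (SQLite used as a stand-in)
sandbox = sqlite3.connect("analytics_sandbox.db")
raw.to_sql("sales_raw", sandbox, if_exists="replace", index=False)

# Transform (second pass): reshape inside the sandbox for a specific analysis
monthly = pd.read_sql(
    "SELECT strftime('%Y-%m', order_date) AS month, SUM(amount) AS revenue "
    "FROM sales_raw GROUP BY month",
    sandbox,
)
print(monthly)
```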

3. What is the Discovery phase in the Data Analytics Life Cycle? What are the activities carried out in this phase?

In Phase 1, the team learns the business domain, including relevant history such as whether the organization or business unit has attempted similar projects in the past from which it can learn.
The team assesses the resources available to support the project in terms of people, technology, time, and data.
Important activities in this phase include framing the business problem as an analytics challenge that can be addressed in subsequent phases and formulating initial hypotheses (IHs) to test and begin learning the data.

Activities:
● Learning the business domain
● Identifying resources
● Framing the problem
● Identifying key stakeholders
● Interviewing the analytics sponsor
● Developing initial hypotheses
● Identifying potential data sources

4. Explain the key roles involved in a successful analytics project. What are their responsibilities and expectations?

Business Analyst: understands business needs
Data Scientist: builds and validates models
Data Engineer: prepares and pipelines data
Project Manager: ensures timely delivery
Stakeholders: define expectations and act on insights

Expectations: actionable insights, improved decision-making, ROI

5. Differentiate between Business Intelligence (BI) and Data Science.

Focus: BI focuses on reporting and dashboards; Data Science focuses on predictive modeling and ML.
Data: BI works with structured, historical data; Data Science works with both structured and unstructured data.
Tools: BI uses Tableau and Power BI; Data Science uses Python, R, and TensorFlow.
Outcome: BI answers "what happened"; Data Science answers "what will happen" and "why it happened".
6. What is Model Planning and Model Building? List activities and
common tools used.

Model Planning: Phase 3 is model planning, where the team determines the methods, techniques, and workflow it intends to follow for the subsequent model building phase.
The team explores the data to learn about the relationships between variables and subsequently selects key variables and the most suitable models.
Activities:
● Data exploration and variable selection
● Model selection
● Ask: do I have a good idea about the type of model to try? Can I refine the analytic plan?
Tools: R, SAS, Excel

Model Building: In Phase 4, the team develops datasets for testing, training, and production purposes.
In addition, in this phase the team builds and executes models based on the work done in the model planning phase.
Activities:
● Develop the analytical model and train it
● Build the model on the training data, fit it to the training data, and evaluate it with the test data
● Ask: is the model robust? Have we failed for sure?
Tools: Python, R, Scikit-learn, TensorFlow
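A minimal model building sketch in Python with scikit-learn, using a synthetic dataset so it runs on its own; in a real project the training and test sets would come from the prepared sandbox data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Develop separate datasets for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Build the model on the training data, then evaluate it on the held-out test data
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```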

7. Explain Model Selection in Data Analytics. How do you choose the right model for your data?

Factors to consider:
Type of problem (classification, regression)
Data size and structure
Performance metrics (accuracy, RMSE)
Interpretability
Tools support

Example: Use logistic regression for binary classification, decision trees for
interpretable models.
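One common way to compare candidate models on the same data is cross-validation. This sketch (synthetic data, scikit-learn) contrasts the logistic regression and decision tree mentioned in the example above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}

# Score each candidate with 5-fold cross-validation on the chosen metric (accuracy here)
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```

The candidate with the better cross-validated score on the metric that matters for the problem, and acceptable interpretability, is then carried forward into model building.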

8. What are the main sources of Big Data? Give three examples and explain each.

Digital devices and sensors (IoT): smartphones, smartwatches, industrial sensors, and connected vehicles continuously generate streams of machine data.
Social media and online activity: posts, comments, likes, shares, searches, and streaming on platforms such as Instagram or YouTube produce enormous volumes of user-generated data.
Enterprise and transactional systems: e-commerce orders, banking transactions, and customer interaction logs are collected and stored by organizations for analysis.

9. What are the 3 V’s of Big Data? Discuss main considerations when
processing Big Data.
The 3 V's of Big Data are Volume (the sheer scale of data generated), Velocity (the speed at which data is created and must be processed), and Variety (the mix of structured, semi-structured, and unstructured formats).

When processing Big Data, important considerations include:

Scalability and Elasticity: Your infrastructure and processing frameworks should be able to scale up or down dynamically based on the data volume and processing demands. Cloud platforms often provide this elasticity.
Fault Tolerance and Reliability: Given the distributed nature of Big Data
processing, systems should be designed to handle failures gracefully
without losing data or interrupting processing.
Security and Privacy: Protecting sensitive data is paramount.
Implementing appropriate security measures, access controls, and
adhering to privacy regulations are critical.
Cost-Effectiveness: Processing large volumes of data can be expensive.
Optimizing your infrastructure, choosing cost-efficient technologies,
and managing resource utilization are important considerations.
Data Governance and Management: Establishing clear policies and
procedures for data acquisition, storage, processing, and retention is
essential for maintaining data quality and compliance.
Skills and Expertise: Processing Big Data requires specialized skills in
areas like data engineering, data science, and distributed computing.
Having the right team with the necessary expertise is crucial for
success.
Choice of Tools and Technologies: A wide range of Big Data tools and
frameworks are available (e.g., Hadoop, Spark, Kafka, NoSQL databases,
cloud-based services). Selecting the right tools for your specific needs
and use cases is a critical decision.
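As a small illustration of the scalability and tooling points above, a PySpark sketch that aggregates a large event log in a distributed, fault-tolerant way; the bucket path and column names are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Spark distributes the work across a cluster and retries failed tasks automatically
spark = SparkSession.builder.appName("clickstream-aggregation").getOrCreate()

# Hypothetical clickstream export, read in parallel across partitions
events = spark.read.csv("s3://example-bucket/clickstream/*.csv", header=True, inferSchema=True)

# Aggregate events per user per day without collecting the data onto one machine
daily_counts = (
    events.groupBy("user_id", F.to_date("event_time").alias("day"))
          .agg(F.count("*").alias("events"))
)

daily_counts.write.mode("overwrite").parquet("s3://example-bucket/aggregates/daily_counts")
spark.stop()
```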

10. Write a short note on Big Data Analytics Architecture with a neat
diagram.
11. What is Linear Regression? Difference between Simple and
Multiple Linear Regression. How is performance evaluated?


✅ What is Linear Regression?


Linear Regression is a statistical and machine learning technique used
to model the relationship between a dependent variable (target) and
one or more independent variables (features) by fitting a linear
equation to the observed data.

General form of the equation:

Y=β0+β1X1+β2X2+...+βnXn+ϵ

Where:
Y = predicted value (dependent variable)
Xn = input features (independent variables)
βn = coefficients
ϵ = error term

✅ Difference Between Simple and Multiple Linear Regression


Number of independent variables
  Simple Linear Regression: 1
  Multiple Linear Regression: 2 or more
Equation
  Simple: Y = β0 + β1X + ϵ
  Multiple: Y = β0 + β1X1 + β2X2 + ... + βnXn + ϵ
Use case
  Simple: predicting salary based on years of experience
  Multiple: predicting house price using area, location, and age
Visualization
  Simple: a straight line on a 2D graph
  Multiple: a multidimensional plane (not easily visualizable)
Complexity
  Simple: low
  Multiple: higher

✅ How is Performance Evaluated?


Common metrics used to evaluate a linear regression model:

1. R² (R-squared)
Measures the proportion of variance in the dependent variable
explained by the model.
Value ranges from 0 to 1 (closer to 1 is better).
2. Adjusted R²
Modified R² that adjusts for the number of predictors in the
model.
Useful in multiple regression.
3. Mean Absolute Error (MAE)
Average of absolute differences between predicted and actual
values.
Easy to interpret.
4. Mean Squared Error (MSE)
Average of squared differences. Penalizes larger errors more.
5. Root Mean Squared Error (RMSE)
Square root of MSE. Same units as the output variable.
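A minimal scikit-learn sketch that fits a multiple linear regression on synthetic data and reports the metrics listed above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: y depends linearly on two features plus noise
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mse = mean_squared_error(y_test, pred)
print("R^2 :", r2_score(y_test, pred))
print("MAE :", mean_absolute_error(y_test, pred))
print("MSE :", mse)
print("RMSE:", mse ** 0.5)
```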

12. Define Descriptive, Diagnostic, and Predictive Analytics with examples.

Descriptive Analytics
● Purpose: summarize and describe historical data
● Question answered: What happened?
● Data used: historical, aggregated data
● Techniques/tools: data aggregation, reporting, dashboards, statistics
● Complexity: low (simple summaries)
● Outcome: a clear view of past performance
● Examples: total monthly sales, customer count trends
● Tools: Excel, Tableau, Power BI
● Output format: reports, bar charts, line graphs
● Decision support: basic insight for historical tracking

Diagnostic Analytics
● Purpose: understand the causes of events or patterns
● Question answered: Why did it happen?
● Data used: historical data with segmentation and comparison
● Techniques/tools: drill-down, data mining, correlation analysis, cause-effect analysis
● Complexity: moderate (requires deeper analysis)
● Outcome: identifies factors influencing past outcomes
● Examples: sales dropped due to a price increase or market competition
● Tools: SQL with analytics, R, Python with visualization tools
● Output format: root cause diagrams, comparison dashboards
● Decision support: informs corrective action

Predictive Analytics
● Purpose: predict future outcomes based on historical data
● Question answered: What is likely to happen?
● Data used: historical data used to train predictive models
● Techniques/tools: machine learning, regression, time series, classification models
● Complexity: high (involves modeling and algorithm selection)
● Outcome: forecasts and probabilities of future trends
● Examples: predicting next quarter's sales or customer churn
● Tools: Python (scikit-learn), R, SAS, IBM SPSS, ML platforms
● Output format: prediction scores, risk levels, probability-based decisions
● Decision support: enables proactive strategies and resource planning
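For instance, a descriptive summary like the ones in the table above can be produced with a simple pandas aggregation; the records here are made up purely for illustration:

```python
import pandas as pd

# Hypothetical transaction records
sales = pd.DataFrame({
    "month":  ["Jan", "Jan", "Feb", "Feb", "Mar"],
    "region": ["North", "South", "North", "South", "North"],
    "amount": [1200, 950, 1100, 1300, 900],
})

# Descriptive analytics: summarize what happened (total and average sales per month)
summary = sales.groupby("month")["amount"].agg(total="sum", average="mean")
print(summary)
```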

13. Explain common tools used for Model Building in analytics. Mention both open-source and commercial tools.

Open source: Python (Scikit-learn, TensorFlow), R
Commercial: SAS, IBM Watson, RapidMiner

🔧 1. Tools for Data Preparation

These tools help clean, format, and organize raw data before analysis:
● Python (Pandas, NumPy): data wrangling, handling missing values, merging datasets, and reshaping data
● R (tidyverse): data cleaning using dplyr, tidyr, etc.
● Talend / Alteryx: visual ETL tools that connect to sources, then clean and transform data without coding
● Apache NiFi: real-time data ingestion and transformation pipelines for big data
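A short pandas sketch of the preparation steps listed above: handling missing values, merging datasets, and reshaping; the tables and column names are hypothetical:

```python
import pandas as pd

# Hypothetical raw extracts from two source systems
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 11, 11, None],
    "amount": [250.0, None, 99.0, 400.0],
})
customers = pd.DataFrame({"customer_id": [10, 11], "segment": ["retail", "wholesale"]})

# Handle missing values: drop rows without a customer, impute missing amounts
orders = orders.dropna(subset=["customer_id"])
orders["customer_id"] = orders["customer_id"].astype(int)
orders["amount"] = orders["amount"].fillna(orders["amount"].median())

# Merge the datasets and reshape into a per-segment summary
merged = orders.merge(customers, on="customer_id", how="left")
print(merged.pivot_table(index="segment", values="amount", aggfunc="sum"))
```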

🧠 2. Tools for Model Planning

Used to explore data and determine the best modeling techniques:
● R (ggplot2, stats): data visualization, correlation analysis, summary statistics
● Python (Matplotlib, Seaborn, Scikit-learn): plotting distributions, feature importance, and choosing models
● SAS: offers statistical modeling and planning tools
● IBM SPSS: GUI-based planning of statistical models
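A small sketch of the exploration typically done during model planning with the Python tools above: summary statistics, a correlation check, and a quick plot on synthetic data:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic dataset standing in for the prepared sandbox data
rng = np.random.default_rng(7)
df = pd.DataFrame({"ad_spend": rng.normal(100, 20, 200)})
df["revenue"] = 5 * df["ad_spend"] + rng.normal(0, 40, 200)

# Summary statistics and correlations help select variables and candidate models
print(df.describe())
print(df.corr())

# A scatter plot suggests whether a linear model is a reasonable first choice
df.plot.scatter(x="ad_spend", y="revenue")
plt.savefig("ad_spend_vs_revenue.png")
```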

⚙️ 3. Tools for Model Building

Used to build, train, and evaluate predictive models:
● Python (Scikit-learn, TensorFlow, XGBoost): most widely used for regression, classification, and deep learning models
● R (caret, mlr, randomForest): easy to build and tune predictive models with good visualization
● RapidMiner: drag-and-drop environment for model building and evaluation
● SAS Enterprise Miner: visual environment for predictive modeling
● H2O.ai: open-source platform supporting large-scale ML (works with R, Python, and Spark)

14. What is causing the data deluge? Explain with a real-life example.

The "data deluge" refers to the exponentially increasing volume of data being created and stored worldwide. This surge is driven by a confluence of factors:

1. Proliferation of Digital Devices and Sensors: We are surrounded by devices that constantly generate data. Smartphones, laptops, tablets, smartwatches, and the rapidly growing Internet of Things (IoT) devices (smart home appliances, industrial sensors, connected vehicles) all contribute massive streams of information.
2. Increased Internet Usage and Online Activities: Our daily online
activities leave digital footprints. Social media interactions (posts,
comments, likes, shares), online shopping, streaming videos and music,
browsing websites, and sending emails all generate vast amounts of
data.
3. Growth of Multimedia Content: Images, videos, and audio files are
inherently large in size. The increasing creation and sharing of such
content on social media platforms and other online services
significantly contribute to the data deluge. High-resolution videos and
the popularity of platforms like YouTube and TikTok amplify this effect.
4. Rise of Big Data Technologies and Data Collection: Organizations are
increasingly recognizing the value of data and are implementing
sophisticated technologies to collect and store information from
various sources. This includes customer interactions, business
processes, sensor data, and publicly available information. The ability
to store and process this data more efficiently further encourages its
collection.
5. Scientific Research and Simulations: Fields like genomics, astronomy,
climate science, and high-energy physics generate enormous datasets
through experiments and simulations. Advances in these areas lead to
increasingly complex and data-intensive research.
6. Regulatory and Compliance Requirements: Many industries are
subject to regulations that mandate the long-term storage of various
types of data, contributing to the overall volume.

Real-life Example: Social Media Platform

Consider a popular social media platform like Instagram. Millions of users worldwide are constantly:
Uploading photos and videos: Each high-resolution image or video adds
significantly to the platform's storage needs.
Posting text updates and comments: While smaller in size individually,
the sheer volume of these interactions across millions of users
generates a massive amount of textual data.
Sending direct messages: Private conversations contribute to the
overall data volume.
Interacting with content: Likes, shares, saves, and story views are all
recorded as data points.
Generating usage data: Information about how users navigate the app,
their preferences, and their activity patterns is also collected.

Impact:
Over a single day, Instagram (and similar platforms) generates
terabytes, if not petabytes, of new data. This constant influx requires
massive and scalable infrastructure for storage, processing, and
analysis. The platform needs to efficiently manage this "data deluge" to:
Provide core services: Ensure users can upload, view, and interact with
content smoothly.
Personalize user experience: Recommend relevant content, suggest
connections, and tailor advertisements based on user data.
Detect and prevent abuse: Identify and remove harmful content or
malicious accounts by analyzing patterns in the data.
Gain business insights: Understand user behavior, trends, and
preferences to improve the platform and its offerings.
Without the ability to handle this massive and continuous flow of data,
the social media platform would become slow, unreliable, and unable
to deliver a relevant experience to its users. This example illustrates
how the combination of user activity, multimedia content, and the
platform's need to understand and manage this information leads to a
significant data deluge.

15. List and explain a few applications of Big Data Analytics in industries such as healthcare, retail, or finance.

Healthcare: disease prediction, patient monitoring
Retail: customer segmentation, personalized marketing
Finance: fraud detection, credit scoring
Manufacturing: predictive maintenance, supply chain optimization
