Draw The Data Analytics Life Cycle and Explain Each Phase With Examples
1. Draw the Data Analytics Life Cycle and explain each phase with
examples.
Phases: see the life cycle diagram (screenshot) in the PPT slides.
2. What is the Data Preparation phase? Explain ETLT process and the
role of the Analytics Sandbox.
Data Preparation: Phase 2 requires the presence of an analytic
sandbox, in which the team can work with data and perform analytics
for the duration of the project.
The team needs to perform extract, load, and transform (ELT) or extract,
transform, and load (ETL) to get data into the sandbox.
Because a project may use both approaches, ELT and ETL are sometimes
abbreviated together as ETLT. Data should be transformed in the ETLT
process so the team can work with it and analyze it.
Activities:
● Explore, pre-process, and condition the data.
● Key question: Do I have enough good-quality data to start building the model?
● ETLT: ETL (Extract-Transform-Load) software is used to manage all
aspects of data preparation.
● This phase takes roughly 50% of the project time.
Steps in Phase 2 (Data Preparation):
● Preparing the analytic sandbox (commonly referred to as a workspace)
● Performing ETLT
● Learning about the data
● Data conditioning
● Survey and visualize
Tools: Hadoop, Alpine Miner, OpenRefine, Data Wrangler
ETLT is used to manage all aspects of data preparation. The steps are:
Extract data from sources.
Transform to suitable format.
Load into sandbox.
Transform again as needed for analysis.
Analytics Sandbox: An isolated environment where analysts can safely
manipulate and model data without affecting live systems.
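The ETLT steps and the sandbox idea can be sketched in a few lines of Python. This is only a minimal illustration, not a production pipeline: the CSV rows, the table names (customers, spend_by_month), and the helper functions are all made up for the example, and an in-memory SQLite database stands in for the analytics sandbox.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system (made-up sample rows).
RAW_CSV = """customer_id,signup_date,monthly_spend
 101 ,2024-01-15,49.90
102,2024-02-03,
103,2024-02-20,120.00
"""


def extract(raw_text):
    """Extract: read rows from the source (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """First transform: trim whitespace, fix types, drop rows with no spend."""
    cleaned = []
    for row in rows:
        spend = row["monthly_spend"].strip()
        if not spend:
            continue  # data conditioning: skip incomplete records
        cleaned.append(
            (int(row["customer_id"].strip()), row["signup_date"].strip(), float(spend))
        )
    return cleaned


def load(rows, sandbox):
    """Load: write the cleaned rows into the sandbox database."""
    sandbox.execute(
        "CREATE TABLE customers "
        "(customer_id INTEGER, signup_date TEXT, monthly_spend REAL)"
    )
    sandbox.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    sandbox.commit()


def transform_in_sandbox(sandbox):
    """Second transform: build an analysis-ready table inside the sandbox."""
    sandbox.execute(
        "CREATE TABLE spend_by_month AS "
        "SELECT substr(signup_date, 1, 7) AS signup_month, "
        "SUM(monthly_spend) AS total_spend "
        "FROM customers GROUP BY signup_month"
    )
    query = "SELECT signup_month, total_spend FROM spend_by_month ORDER BY signup_month"
    return sandbox.execute(query).fetchall()


# The in-memory database plays the role of the sandbox: nothing here
# touches a live system.
sandbox = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), sandbox)
monthly = transform_in_sandbox(sandbox)
print(monthly)
```

The second transform runs inside the sandbox on purpose: analysts can reshape the loaded data as often as needed without going back to the source.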
Factors to consider:
Type of problem (classification, regression)
Data size and structure
Performance metrics (accuracy, RMSE)
Interpretability
Tools support
Example: Use logistic regression for binary classification, decision trees for
interpretable models.
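As a rough illustration of the logistic regression example, the sketch below fits a one-feature binary classifier with plain gradient descent. The data (hours studied vs. pass/fail), the learning rate, and the function names are assumptions chosen for the example; in practice a library implementation would normally be used.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by batch gradient descent on log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error (p - y) for every sample at the current w, b.
        residuals = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = sum(residuals) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict(x, w, b, threshold=0.5):
    """Return class 1 when the predicted probability reaches the threshold."""
    return 1 if sigmoid(w * x + b) >= threshold else 0


# Made-up, linearly separable sample: hours studied and pass (1) / fail (0).
hours = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
passed = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic(hours, passed)
predictions = [predict(x, w, b) for x in hours]
accuracy = sum(p == y for p, y in zip(predictions, passed)) / len(passed)
print(w, b, accuracy)
```

Accuracy is used here because the problem is classification; for a regression problem the metric would be something like RMSE instead.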
8. What are the main sources of Big Data? Give three examples and
explain each.
9. What are the 3 V’s of Big Data? Discuss main considerations when
processing Big Data.
Important considerations include:
10. Write a short note on Big Data Analytics Architecture with a neat
diagram.
11. What is Linear Regression? Difference between Simple and
Multiple Linear Regression. How is performance evaluated?
Y = β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ + ϵ
Where:
Number of independent variables: 1 (Simple) vs. 2 or more (Multiple)
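The difference between simple and multiple regression shows up only in the number of predictor columns. The sketch below, assuming made-up noise-free data and hypothetical helper names, fits both with the same least-squares routine by solving the normal equations (XᵀX)β = Xᵀy.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b] so row operations update both sides together.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_linear_regression(rows, y):
    """Least-squares fit; returns [b0, b1, ..., bn] for n predictor columns."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 is the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Simple linear regression: one predictor, data generated from Y = 1 + 2*X1.
simple_coeffs = fit_linear_regression([[1], [2], [3], [4]], [3, 5, 7, 9])

# Multiple linear regression: two predictors, data from Y = 1 + 2*X1 + 3*X2.
multiple_coeffs = fit_linear_regression(
    [[1, 0], [0, 1], [1, 1], [2, 1], [1, 3]], [3, 4, 6, 8, 12]
)
print(simple_coeffs, multiple_coeffs)
```

Because the sample data has no noise term, the fitted coefficients match the generating equations up to floating-point rounding; real data would leave a residual error.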
1. R² (R-squared)
Measures the proportion of variance in the dependent variable
explained by the model.
Value typically ranges from 0 to 1 (closer to 1 is better); it can be
negative when the model fits worse than simply predicting the mean.
2. Adjusted R²
Modified R² that adjusts for the number of predictors in the
model.
Useful in multiple regression.
3. Mean Absolute Error (MAE)
Average of absolute differences between predicted and actual
values.
Easy to interpret.
4. Mean Squared Error (MSE)
Average of squared differences. Penalizes larger errors more.
5. Root Mean Squared Error (RMSE)
Square root of MSE. Same units as the output variable.
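The five metrics above can be computed directly from actual and predicted values. The following is a small sketch with made-up numbers; the function name regression_metrics and its n_predictors argument are illustrative, not taken from any library.

```python
import math


def regression_metrics(y_true, y_pred, n_predictors):
    """Return R², adjusted R², MAE, MSE, and RMSE for a fitted model."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)

    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)              # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)  # total variation
    r2 = 1.0 - ss_res / ss_tot

    # Adjusted R² penalizes extra predictors; requires n > n_predictors + 1.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)

    return {"r2": r2, "adj_r2": adj_r2, "mae": mae, "mse": mse, "rmse": rmse}


# Made-up actual and predicted values from a one-predictor model.
metrics = regression_metrics([3, 5, 7, 9], [2.5, 5.5, 7, 8], n_predictors=1)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Here the largest single error (1.0) dominates MSE and RMSE more than MAE, which is what "penalizes larger errors more" means in practice.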
14. What is causing the data deluge? Explain with a real-life example.
Impact:
Over a single day, Instagram (and similar platforms) generates
terabytes, if not petabytes, of new data. This constant influx requires
massive and scalable infrastructure for storage, processing, and
analysis. The platform needs to efficiently manage this "data deluge" to:
Provide core services: Ensure users can upload, view, and interact with
content smoothly.
Personalize user experience: Recommend relevant content, suggest
connections, and tailor advertisements based on user data.
Detect and prevent abuse: Identify and remove harmful content or
malicious accounts by analyzing patterns in the data.
Gain business insights: Understand user behavior, trends, and
preferences to improve the platform and its offerings.
Without the ability to handle this massive and continuous flow of data,
the social media platform would become slow, unreliable, and unable
to deliver a relevant experience to its users. This example illustrates
how the combination of user activity, multimedia content, and the
platform's need to understand and manage this information leads to a
significant data deluge.