Quantum DA Review

Unit01 (Introduction to Data Analytics)

Sources and Nature of Data:


Data Sources
In data analytics, data can be gathered from various sources. These sources are
broadly categorized as:

1. Primary Sources

o Data collected directly from the source or by conducting experiments and surveys.

o Examples:

 Surveys conducted by organizations.

 Observations recorded by researchers.

 Interviews.

2. Secondary Sources

o Data collected by someone else but used for analysis by a third party.

o Examples:

 Government databases (e.g., census data).

 Industry reports.

 Historical records.

3. Internal Data

o Data that originates within an organization.

o Examples:

 Sales records.

 Customer feedback forms.

 Financial transactions.

4. External Data

o Data collected from outside the organization.

o Examples:

 Market trends.

 Competitor analysis.

 Social media activity.

Nature of Data
Data can be classified based on its characteristics:

1. Quantitative Data

o Numerical data that can be measured or counted.

o Examples:

 Sales figures.

 Employee performance metrics.

2. Qualitative Data

o Descriptive data that provides insights or patterns but cannot be measured numerically.

o Examples:

 Customer feedback.

 User reviews.

3. Continuous Data

o Data that can take any value within a range.

o Examples:

 Temperature readings.

 Time measurements.

4. Discrete Data

o Data that can only take certain values.

o Examples:

 Number of employees.

 Customer complaints.

5. Big Data

o Large, complex datasets that require advanced tools to process.

o Examples:

 Social media analytics.

 E-commerce transactions.

Question 1: What is the difference between structured and unstructured data? Provide examples.

Solution: Structured data is organized in a fixed format, such as rows and columns in
databases.

 Example: A database of customer names and phone numbers.


Unstructured data lacks a predefined format and is more challenging to
analyze.

 Example: Social media posts, images, and videos.

Question 2: How can secondary data sources be validated for reliability?

Solution:

 Verify the credibility of the source (e.g., government or reputable research organizations).

 Cross-check data with other reliable sources to ensure consistency.

 Evaluate the methodology used to collect the data (e.g., sample size, data collection process).

Question 3: Explain the difference between qualitative and quantitative data with examples.

Solution:

 Qualitative Data: Descriptive and non-numerical.


Example: Customer feedback like "The product is excellent."

 Quantitative Data: Measurable and numerical.


Example: 75% of customers rated the product 5 out of 5.

Question 4: What are the main challenges in handling big data, and how
can they be addressed?

Solution:
Challenges:

1. Storage: Large datasets require significant storage capacity.

2. Processing: Analyzing large data takes time and resources.

3. Security: Protecting sensitive data from breaches.

Solutions:

1. Use cloud-based storage solutions for scalability.

2. Implement distributed computing tools like Hadoop.

3. Apply robust encryption and access control mechanisms.

Question 5: How can semi-structured data be converted into structured
data for analysis?

Solution:

 Use data parsing techniques to extract key elements from semi-structured formats (e.g., JSON, XML).

 Employ data transformation tools like ETL (Extract, Transform, Load) pipelines.

 Store the parsed data in structured databases like SQL for further analysis.
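The ETL steps above can be sketched with Python's standard library. This is a minimal illustration, not a production pipeline: the JSON payload and the field names (`id`, `name`, `contact.email`) are invented for the example. A nested record is parsed, flattened into rows, and loaded into a SQLite table where it becomes queryable with ordinary SQL.

```python
import json
import sqlite3

# Hypothetical semi-structured input, e.g. an API response (illustrative data).
raw = """[
  {"id": 1, "name": "Asha", "contact": {"email": "asha@example.com"}},
  {"id": 2, "name": "Ravi", "contact": {"email": "ravi@example.com"}}
]"""

# Extract: parse the JSON. Transform: flatten the nested field into flat rows.
records = json.loads(raw)
rows = [(r["id"], r["name"], r["contact"]["email"]) for r in records]

# Load: store the flattened rows in a structured (relational) table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, email TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)

# The data is now queryable with ordinary SQL.
result = conn.execute("SELECT name FROM customers WHERE id = 2").fetchone()
```

A real pipeline would add schema validation and error handling, but the extract-flatten-load shape stays the same.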

Classification of Data (Structured, Semi-Structured and Unstructured)

Data can be classified into three main categories based on its structure and organization:
Structured Data, Semi-Structured Data, and Unstructured Data. These
classifications help determine the methods and tools used for storage, processing, and
analysis.

1. Structured Data :

Definition: Structured data is highly organized and stored in a predefined format, such
as rows and columns in relational databases. It is easily searchable and analyzable using
traditional tools like SQL.

Key Characteristics:

 Predefined schema or format.

 Easily stored and retrieved from relational databases.

 Typically numerical or categorical in nature.

Examples:

 Sales data in a spreadsheet (e.g., columns for date, product, quantity, and price).

 Customer information in a database (e.g., name, email, and phone number).

 Financial transactions (e.g., transaction ID, amount, and timestamp).

Advantages:

 Easy to query and analyze.

 High compatibility with traditional analytics tools.

 Efficient storage in databases.

Challenges:

 Limited flexibility to adapt to new data formats or fields.

 Not suitable for handling complex, multimedia, or irregular data.
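As a small illustration of why structured data is easy to query, the sketch below reads a hypothetical sales table (columns and figures invented for the example) with Python's built-in `csv` module. Because the schema is predefined, every field can be accessed by name and aggregated in one line.

```python
import csv
import io

# Hypothetical structured sales data: predefined columns, one record per row.
data = """date,product,quantity,price
2024-01-05,Widget,3,9.99
2024-01-06,Gadget,1,24.50
2024-01-06,Widget,2,9.99
"""

reader = csv.DictReader(io.StringIO(data))
# The fixed schema makes aggregation straightforward.
total_revenue = sum(int(r["quantity"]) * float(r["price"]) for r in reader)
```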

2. Semi-Structured Data

Definition:
Semi-structured data does not have a fixed schema but has some organizational
properties, such as tags or markers, that make it easier to process than unstructured data.

Key Characteristics:

 No fixed schema, but has a clear structure or metadata.

 Flexible and adaptable to changing data formats.

 Requires specialized tools for analysis.

Examples:

 JSON and XML files (e.g., configuration files for applications).

 Email messages (structured headers like "To" and "From" with unstructured
message content).

 Social media posts with hashtags, mentions, or metadata.

Advantages:

 Combines the flexibility of unstructured data with some organization.

 Can store diverse data types (e.g., text, numbers, and multimedia).

 Easy integration with modern big data tools.

Challenges:

 Requires parsing to extract meaningful insights.

 More complex to analyze compared to structured data.
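The email example above can be demonstrated with Python's standard `email` module (the addresses and message text are invented): the tagged headers parse into clean key/value pairs, while the body remains unstructured text that would need further processing.

```python
from email import message_from_string

# A hypothetical email: structured headers plus a free-form body.
raw_email = """From: alice@example.com
To: bob@example.com
Subject: Quarterly report

Hi Bob, please find the numbers attached. Thanks!"""

msg = message_from_string(raw_email)
sender = msg["From"]        # structured: accessible by tag
subject = msg["Subject"]    # structured: accessible by tag
body = msg.get_payload()    # unstructured: free text, needs further analysis
```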

3. Unstructured Data

Definition:
Unstructured data lacks a predefined format or organization. It represents the largest
portion of available data and requires advanced tools and techniques for processing.

Key Characteristics:

 No predefined schema or format.

 Highly flexible but difficult to analyze.

 Typically multimedia or text-heavy in nature.

Examples:

 Images, videos, and audio files.

 Social media content (tweets, posts, comments).

 Medical imaging data (e.g., X-rays, CT scans).

 Website logs and raw sensor data.

Advantages:

 Provides rich, detailed information.

 Can offer deep insights when analyzed effectively.

 Accommodates a wide range of data types.

Challenges:

 Requires advanced tools (e.g., AI, NLP, and computer vision) for analysis.

 High storage and processing requirements.

 Difficult to query and index.
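Even a minimal analysis of unstructured text requires an extraction step first. The sketch below (the comments are invented) tokenizes free-form feedback before counting word frequencies, a toy version of what NLP pipelines do at scale.

```python
import re
from collections import Counter

# Hypothetical unstructured text, e.g. social media comments.
comments = [
    "Love the new update, great work!",
    "The update broke my login, not great.",
    "Great support team, fixed my login fast.",
]

# No schema exists, so tokens (lowercase words) must be extracted
# before any aggregation is possible.
tokens = re.findall(r"[a-z]+", " ".join(comments).lower())
freq = Counter(tokens)
top_word, top_count = freq.most_common(1)[0]
```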

Comparison Table

Aspect           | Structured Data         | Semi-Structured Data          | Unstructured Data
-----------------|-------------------------|-------------------------------|--------------------------
Schema           | Fixed and predefined    | Flexible or partially defined | None
Ease of Analysis | High                    | Moderate                      | Low
Storage          | Relational databases    | NoSQL databases, files        | Cloud storage, data lakes
Examples         | Spreadsheets, databases | JSON, XML, emails             | Images, videos, texts
Tools            | SQL, BI tools           | Hadoop, NoSQL databases       | AI, NLP, deep learning

 Characteristics of Data in Data Analytics:

Understanding the characteristics of data is crucial for effective data analysis. Here’s a
detailed and easy-to-understand explanation of the key characteristics of data in
analytics:

1. Accuracy

 Definition: Data should be correct, free from errors, and represent reality.

 Importance: Inaccurate data can lead to misleading insights and poor decision-
making.

 Example: A dataset with incorrect customer names or transaction amounts compromises the validity of analysis.

2. Completeness

 Definition: Data should be complete, with no missing values or gaps.

 Importance: Missing data can result in biased outcomes or incomplete analysis.

 Example: A survey dataset missing responses for key questions affects the reliability of conclusions.

3. Consistency

 Definition: Data should remain uniform across different sources or systems.

 Importance: Inconsistent data creates confusion and hinders integration.

 Example: A customer's address should match across the CRM, billing, and
shipping databases.

4. Timeliness

 Definition: Data should be up-to-date and available when needed.

 Importance: Outdated data might result in irrelevant or ineffective insights.

 Example: Real-time stock market data is crucial for timely financial decisions.

5. Relevance

 Definition: Data should be related to the specific problem or question being


addressed.

 Importance: Irrelevant data can waste resources and distract from the primary
analysis.

 Example: Analysing weather data when assessing website traffic trends would
not be relevant.

 Need for Data Analytics: Understanding Its Importance:
Data analytics is the process of examining raw data to identify patterns, draw
conclusions, and make informed decisions. It is essential in today's digital age, where
businesses, governments, and individuals generate and use vast amounts of data daily.

Key Reasons for the Need for Data Analytics

1. Better Decision-Making

 Data analytics provides insights to help organizations make evidence-based decisions rather than relying on intuition or guesswork.
 Example: An e-commerce company uses analytics to decide which products to
promote during a sale based on past customer behaviour.

2. Improving Efficiency

 Identifying inefficiencies or bottlenecks in operations can help organizations optimize their processes.
 Example: A logistics company uses route optimization to save time and reduce
costs.

3. Understanding Customers

 Analytics helps in understanding customer preferences, behaviour, and needs, enabling personalized experiences.
 Example: Streaming platforms recommend shows based on viewers' past
preferences.

4. Enhancing Marketing Strategies

 By analysing data, companies can create targeted marketing campaigns, improving conversion rates and reducing advertising costs.

 Example: Social media platforms analyse user engagement data to run
personalized ad campaigns.

5. Identifying Trends

 Data analytics helps organizations stay ahead by spotting emerging trends, such
as changes in consumer preferences or market demands.
 Example: Fashion brands use analytics to predict seasonal trends and stock
products accordingly.

 Analytic Process and Tools in Data Analytics :

Each step has its own process and tools to make overall conclusions based on
the data.

1. Define the Problem or Research Question


In the first step of the process, the data analyst is given a problem or business task. The analyst has to understand the task and the stakeholders' expectations for the solution. A stakeholder is a person who has invested money and resources in the project. The analyst must be able to ask different questions in order to find the right solution, and must identify the root cause of the problem in order to understand it fully. The analyst should avoid distractions while analyzing the problem and communicate effectively with stakeholders and colleagues to completely understand the underlying problem. Questions to ask yourself in the Ask phase are:

 What are the problems that are being mentioned by my stakeholders?

 What are their expectations for the solutions?

2. Collect Data
The second step is to prepare or collect the data. This step includes collecting data and storing it for further analysis. The analyst has to collect the data for the given task from multiple sources, internal or external. Internal data is the data available within the organization you work for, while external data is the data available in sources outside your organization. Data collected by an individual from their own resources is called first-party data. Data that is collected and sold is called second-party data. Data collected from outside sources is called third-party data. Common sources from which data is collected are interviews, surveys, feedback, and questionnaires. The collected data can be stored in a spreadsheet or a SQL database. A spreadsheet is a digital worksheet that contains rows and columns, while a database contains tables with functions to manipulate the data. Spreadsheets suit datasets of up to a few thousand or tens of thousands of rows, while databases are used when there are too many rows for a spreadsheet. The most common spreadsheet tools are MS Excel and Google Sheets, and there are many databases, such as Oracle and Microsoft SQL Server, to store the data.

3. Data Cleaning
The third step is to clean and process the data. After the data is collected from multiple sources, it is time to clean it. Clean data means data that is free from misspellings, redundancies, and irrelevance. Clean data largely depends on data integrity. There might be duplicate data, or the data might not be in a consistent format; such unnecessary data is removed and cleaned. SQL and Excel provide different functions to clean the data. This is one of the most important steps in data analysis, as clean and well-formatted data helps in finding trends and solutions. The most important part of the Process phase is to check whether your data is biased or not. Bias is the act of favoring a particular group or community while ignoring the rest. Bias must be avoided, as it can skew the overall analysis, so the data analyst must make sure that every group is represented while the data is being collected.
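A minimal sketch of the cleaning step, assuming a small set of invented survey rows: whitespace and capitalization are normalized, and exact duplicates are dropped, the same operations that Excel functions or SQL `DISTINCT` would perform.

```python
# Hypothetical raw survey rows with stray whitespace and a duplicate.
raw_rows = [
    {"name": " Alice ", "city": "delhi"},
    {"name": "Bob", "city": "Mumbai"},
    {"name": "Alice", "city": "Delhi"},  # duplicate after normalization
]

seen = set()
clean_rows = []
for row in raw_rows:
    # Normalize: strip whitespace and standardize capitalization.
    normalized = {k: v.strip().title() for k, v in row.items()}
    # De-duplicate: keep only the first occurrence of each record.
    key = tuple(sorted(normalized.items()))
    if key not in seen:
        seen.add(key)
        clean_rows.append(normalized)
```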

4. Analyzing the Data


The fourth step is to analyze. The cleaned data is used for analyzing and identifying trends; this also involves performing calculations and combining data for better results. The tools used for performing calculations are Excel or SQL: these tools provide built-in functions, or sample code is written in SQL to perform the calculations. Using Excel, we can create pivot tables and perform calculations, while SQL can create temporary tables to perform calculations. Programming languages are another way of solving problems; they make it much easier by providing packages. The most widely used programming languages for data analysis are R and Python.
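The pivot-table-style calculation described above can be sketched in plain Python (the sales figures are invented for the example); it is the same group-and-aggregate operation that an Excel pivot table or a SQL `GROUP BY` performs.

```python
from collections import defaultdict

# Hypothetical cleaned sales records.
sales = [
    {"region": "North", "amount": 120},
    {"region": "South", "amount": 80},
    {"region": "North", "amount": 200},
]

# Group by region and sum amounts, pivot-table style.
totals = defaultdict(int)
for sale in sales:
    totals[sale["region"]] += sale["amount"]
```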

5. Data Visualization
The fifth step is visualizing the data. Nothing is more compelling than a visualization. The transformed data now has to be made into a visual (chart, graph). The reason for making data visualizations is that some of the audience, mostly stakeholders, are non-technical: visualizations make complex data simple to understand. Tableau and Looker are two popular tools for creating compelling data visualizations. Tableau is a simple drag-and-drop tool that helps in creating compelling visualizations, while Looker is a data visualization tool that connects directly to the database and creates visualizations. Both are widely used by data analysts. R and Python also have packages that provide beautiful data visualizations; R, for example, has a package named ggplot2 that offers a wide variety of visualizations. A presentation is then given based on the data findings. Sharing the insights with team members and stakeholders helps in making more informed decisions and leads to better outcomes.

6. Presenting the Data


Presenting the data involves transforming raw information into a format that is easily
comprehensible and meaningful for various stakeholders. This process encompasses the
creation of visual representations, such as charts, graphs, and tables, to effectively
communicate patterns, trends, and insights gleaned from the data analysis. The goal is
to facilitate a clear understanding of complex information, making it accessible to both
technical and non-technical audiences. Effective data presentation involves thoughtful
selection of visualization techniques based on the nature of the data and the specific
message intended. It goes beyond mere display to storytelling, where the presenter
interprets the findings, emphasizes key points, and guides the audience through the
narrative that the data unfolds. Whether through reports, presentations, or interactive
dashboards, the art of presenting data involves balancing simplicity with depth,
ensuring that the audience can easily grasp the significance of the information presented
and use it for informed decision-making.

 Analysis vs. Reporting in Data Analytics:


Data analytics involves both analysis and reporting, but they serve distinct
purposes and involve different processes. Here's a breakdown to help differentiate
them in a simple and detailed way:

1. What is Analysis?

Definition:
Analysis is the process of exploring data to discover patterns, trends, correlations, and
insights. It focuses on answering "Why?" and "What will happen?" questions.

Purpose:

 To generate insights for decision-making.


 To predict outcomes or identify hidden opportunities.
 To diagnose problems or root causes.

Processes in Analysis:

1. Exploratory Data Analysis (EDA): Understanding the data, cleaning it, and
summarizing key characteristics.
2. Statistical Analysis: Using statistical methods to uncover relationships or
validate hypotheses.
3. Predictive Analysis: Leveraging machine learning models to forecast future
outcomes.
4. Diagnostic Analysis: Identifying the reasons behind certain outcomes.

Example:

 Analyzing customer purchase data to predict future buying behavior.


 Identifying why sales dropped in a particular region by examining market trends
and consumer feedback.
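A tiny EDA sketch using Python's standard `statistics` module (the monthly figures are invented): summarizing the center and spread of a variable is typically the first statistical step taken before any predictive modeling.

```python
import statistics

# Hypothetical monthly sales figures for a quick exploratory summary.
monthly_sales = [110, 95, 130, 150, 90, 140]

mean_sales = statistics.mean(monthly_sales)      # central tendency
median_sales = statistics.median(monthly_sales)  # robust center
spread = statistics.stdev(monthly_sales)         # variability
```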

2. What is Reporting?

Definition:
Reporting is the process of organizing data into a structured format, typically in
dashboards or documents, to summarize past and current performance. It answers
"What happened?" questions.

Purpose:

 To communicate findings in a clear, accessible way.


 To monitor key metrics and performance indicators.
 To provide stakeholders with regular updates.

Processes in Reporting:

1. Data Visualization: Creating charts, graphs, or dashboards to present data.


2. Summarization: Highlighting key metrics and trends without deep exploration.
3. Automation: Using tools like Power BI or Tableau to generate recurring reports.

Example:

 Weekly sales reports showing revenue trends.


 A dashboard summarizing website traffic and user engagement.

 Applications of Data Analytics:


Data analytics has a wide range of applications across industries, enabling organizations
to make data-driven decisions, optimize operations, and gain competitive advantages.
Below are some key applications of data analytics, explained in an easy-to-understand
way:

1. Business Intelligence (BI) and Decision Making

 What it means: Data analytics helps businesses understand their performance and identify trends. BI tools use analytics to visualize and report data in user-friendly formats like dashboards.
 Examples:
o Sales trend analysis to predict future demand.
o Identifying underperforming regions or products.
 Benefits:
o Improves strategic decision-making.
o Optimizes resource allocation.

2. Customer Analytics and Personalization

 What it means: By analyzing customer behavior, businesses can provide personalized experiences to enhance customer satisfaction and loyalty.
 Examples:
o E-commerce platforms recommending products based on past purchases.
o Streaming services like Netflix suggesting shows based on viewing
history.
 Benefits:
o Increases customer retention.
o Boosts sales through targeted marketing.

3. Healthcare and Medical Research

 What it means: Data analytics supports disease diagnosis, treatment planning, and efficient hospital management.
 Examples:
o Predicting patient outcomes using historical health data.
o Managing hospital resources, such as optimizing bed allocation.
 Benefits:
o Enhances patient care quality.
o Reduces healthcare costs.

4. Marketing Analytics

 What it means: Analyzing marketing campaigns and customer responses helps companies improve their marketing strategies.
 Examples:
o Measuring the effectiveness of digital ads.
o Analyzing social media sentiment to understand public opinion about a
brand.
 Benefits:
o Improves return on marketing investments.
o Helps in targeting the right audience.

5. Supply Chain Optimization

 What it means: Analytics enhances supply chain operations by optimizing inventory, logistics, and demand forecasting.
 Examples:
o Predicting inventory needs to avoid overstocking or stockouts.
o Monitoring real-time shipment data to ensure timely deliveries.
 Benefits:
o Reduces operational costs.
o Improves delivery efficiency.

Previous Year Asked Questions in AKTU Exams

Very Short Answer Questions (1-2 Marks)

1. What is the primary difference between structured and unstructured data? (AKTU 2022)

Answer: Structured data is organized in a predefined format like rows and columns,
whereas unstructured data lacks a specific format, such as social media posts or images.

2. Define Big Data. (AKTU 2021)

Answer: Big Data refers to large, complex datasets that traditional data processing tools
cannot handle effectively.

3. What is the role of data analytics in modern businesses? (AKTU 2023)

Answer: Data analytics helps businesses make informed decisions by identifying
trends, patterns, and insights from data.

4. Mention any two characteristics of data. (AKTU 2020)

Answer: Examples:

1. Volume: Refers to the quantity of data.


2. Variety: Refers to the different types of data formats.

5. Name one tool used for data analysis and one tool for reporting. (AKTU 2022)

Answer:

 Data analysis tool: Python


 Reporting tool: Tableau

Short Answer Questions (4-5 Marks)

1. Explain the classification of data into structured, semi-structured, and unstructured data with examples. (AKTU 2021)

Answer:

 Structured Data: Organized in rows and columns, e.g., SQL databases.


 Semi-structured Data: Partially organized with tags or metadata, e.g., JSON or
XML files.
 Unstructured Data: Lacks a predefined format, e.g., images, videos, social
media posts.

2. What is the need for data analytics in modern organizations? (AKTU 2020)

Answer:

 Data analytics enables informed decision-making.

 It helps predict trends, optimize operations, and improve customer satisfaction.
 It provides a competitive edge through actionable insights.

3. Differentiate between analysis and reporting in data analytics. (AKTU 2022)

Answer:

 Analysis: Focuses on discovering patterns and generating insights from data.


 Reporting: Summarizes and visualizes data to present findings, often in
dashboards or reports.

4. What are the main characteristics of Big Data, and why is scalability important?
(AKTU 2023)

Answer:
Characteristics of Big Data:

1. Volume: Large size of data.


2. Velocity: Speed of data generation.
3. Variety: Diverse formats of data.
4. Veracity: Accuracy of data.

Importance of Scalability: Scalability ensures that data systems can handle increasing
data loads effectively without compromising performance.

5. Explain the evolution of analytic scalability with reference to modern tools. (AKTU 2021)

Answer:

 Initial analytics relied on single-server processing.


 As data volumes grew, distributed systems like Hadoop emerged.
 Today, cloud-based platforms (e.g., AWS, Google BigQuery) offer on-demand
scalability for real-time analytics.

UNIT 01: PART II (Data Analytics Lifecycle)

 Key Roles for Successful Analytic Projects:


There are certain key roles required for the complete and effective functioning of a data science team executing analytics projects successfully. The key roles are seven in number.

Each role plays a crucial part in developing a successful analytics project. There is no hard and fast rule for filling exactly these seven roles; they can be covered by fewer or more people depending on the scope of the project, the skills of the participants, and the organizational structure.

Example –
For a small, versatile team, these seven roles may be fulfilled by only three to four people, but a large project, on the contrary, may require 20 or more people to fill them.
1. Business User :
 The business user is the one who understands the main domain area of the project and benefits directly from the results.

 This user advises and consults the team working on the project about the value of the results obtained and how the outputs will be used in operations.

 A business manager, line manager, or deep subject matter expert in the project domain usually fulfills this role.

2. Project Sponsor :
 The Project Sponsor is the one responsible for initiating the project. The Project Sponsor provides the actual requirements for the project and presents the core business issue.

 He generally provides the funding and gauges the degree of value delivered by the final output of the team working on the project.

 This person sets out the prime concern and frames the desired output.

3. Project Manager :
 This person ensures that the key milestones and purpose of the project are met on time and to the expected quality.

4. Business Intelligence Analyst :


 The Business Intelligence Analyst provides business domain expertise based on a detailed and deep understanding of the data, key performance indicators (KPIs), key metrics, and business intelligence from a reporting point of view.

 This person generally creates dashboards and reports and knows about the data feeds and sources.

5. Database Administrator (DBA) :


 The DBA provisions and configures the database environment to support the analytics needs of the team working on the project.

 His responsibilities may include providing access to key databases or tables and making sure that the appropriate security levels are in place for the relevant data repositories.

6. Data Engineer :
 The data engineer brings deep technical skills to assist with tuning SQL queries for data management and data extraction, and provides support for data ingestion into the analytic sandbox.

 The data engineer works jointly with the data scientist to help shape data in the correct ways for analysis.

7. Data Scientist :
 The data scientist provides subject matter expertise for analytical techniques, data modelling, and applying the correct analytical techniques to given business issues.

 He ensures that the overall analytical objectives are met.

 Data scientists design and apply analytical methods and approaches for the data available to the concerned project.

 Various Phases of Data Analytics Life Cycle :

The data analytics lifecycle is designed for Big Data problems and data science projects. The cycle is iterative, to reflect how real projects unfold. To address the distinct requirements for performing analysis on Big Data, a step-by-step methodology is needed to organize the activities and tasks involved with acquiring, processing, analyzing, and repurposing data.

Phase 1: Discovery –
 The data science team learns and investigates the problem.

 Develop context and understanding.

 Come to know about data sources needed and available for the project.

 The team formulates the initial hypothesis that can be later tested with data.

Phase 2: Data Preparation –


 Steps to explore, preprocess, and condition data before modeling and analysis.

 It requires the presence of an analytic sandbox; the team extracts, loads, and transforms data to get it into the sandbox.

 Data preparation tasks are likely to be performed multiple times, and not in a predefined order.

 Several tools commonly used for this phase are Hadoop, Alpine Miner, OpenRefine, etc.

Phase 3: Model Planning –


 The team explores data to learn about relationships between variables and
subsequently, selects key variables and the most suitable models.

 In this phase, the data science team develops data sets for training, testing, and
production purposes.

 Team builds and executes models based on the work done in the model planning
phase.

Phase 4: Model Building –
 Team develops datasets for testing, training, and production purposes.

 The team also considers whether its existing tools will suffice for running the models or if a more robust environment is needed for executing them.

 Free or open-source tools – R and PL/R, Octave, WEKA.

 Commercial tools – MATLAB and STATISTICA.

Phase 5: Communication Results –


 After executing the model, the team needs to compare the outcomes of modeling to the criteria established for success and failure.

 The team considers how best to articulate findings and outcomes to various team members and stakeholders, taking caveats and assumptions into account.

 The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey the findings to stakeholders.

Phase 6: Operationalize –
 The team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work in a controlled way before broadening it to a full enterprise of users.

 This approach enables the team to learn about the performance and related constraints of the model in a production environment on a small scale, and to make adjustments before full deployment.

 The team delivers final reports, briefings, and code.

Previous Year Asked Questions in AKTU Exams

Question 1: Describe the key roles required for the successful execution of an
analytics project. Discuss their responsibilities and how they contribute to project
success. (AKTU 2019)
Answer: Successful analytics projects rely on a cohesive team with distinct roles,
each playing a crucial part in the project's success. Key roles include:
1. Project Manager
Responsible for overseeing the project, ensuring it meets deadlines, stays
within budget, and aligns with objectives. The project manager also facilitates
communication between team members.
2. Data Scientist
They analyze and interpret complex data to provide insights. Their expertise in
statistical methods, machine learning, and programming tools (e.g., Python, R)
is critical for model development and problem-solving.
3. Data Engineer
They handle data infrastructure, ensuring smooth data collection, storage, and
preprocessing. They design robust data pipelines and ensure data quality and
accessibility.
4. Business Analyst
Acts as a bridge between the technical team and stakeholders. They define
business problems, gather requirements, and ensure the solution aligns with
organizational goals.
5. Subject Matter Expert (SME)
Provides domain-specific knowledge to guide data interpretation and ensure the
project’s relevance to industry or organizational needs.
6. Visualization Expert
Designs clear and effective visual representations of data insights, making
results accessible to stakeholders.
Each role ensures that the project progresses seamlessly from problem definition to
actionable results. Collaboration, communication, and clear role definition are critical
to the team’s success. Without these key roles, an analytics project may face technical,
organizational, or strategic failures.

Question 2: Explain the discovery phase of the data analytics lifecycle. Discuss its
importance and the key steps involved. (AKTU 2018)
Answer: The discovery phase is the first step in the data analytics lifecycle, focusing
on understanding the project’s objectives, scope, and feasibility. Its importance lies in
establishing a strong foundation for the entire project.
Key Steps in the Discovery Phase:
1. Understanding Business Objectives
The team collaborates with stakeholders to define the problem and identify
goals. A clear understanding of the desired outcome ensures alignment with
business needs.
2. Assessing Resources
This involves evaluating the availability of data, tools, infrastructure, and
expertise. Identifying gaps early helps in planning and avoids delays later.
3. Defining Success Criteria
Quantifiable metrics are established to measure the success of the project. For
instance, success in a marketing analytics project may mean achieving a 20%
increase in lead conversion.
4. Exploratory Data Assessment
Preliminary data exploration helps in identifying potential issues, such as
missing values, inconsistencies, or bias, and determines the data's relevance to
the problem.
5. Creating a Project Charter
A project charter is developed to outline the objectives, deliverables, timelines,
and responsibilities.
The discovery phase ensures that all stakeholders have a shared understanding of the
project, reducing the risk of misalignment. It sets the stage for the subsequent phases
by providing clarity, identifying constraints, and ensuring that the team is equipped to
proceed.

Question 3:
Describe the data preparation phase in detail. Why is it critical in the data
analytics lifecycle? (AKTU 2020)
Answer:
The data preparation phase is a critical step in the data analytics lifecycle that focuses
on cleaning, transforming, and organizing data for analysis. Poorly prepared data can
lead to inaccurate results, making this phase essential.
Key Steps in the Data Preparation Phase:
1. Data Cleaning
Removing inconsistencies, handling missing values, and correcting errors in
the dataset. For instance, missing values can be imputed using techniques like
mean substitution or regression methods.
2. Data Transformation
Converting raw data into a usable format, such as normalizing numerical data
or encoding categorical variables into numerical ones.
3. Data Integration
Combining data from multiple sources to create a unified dataset. For example,
merging sales data from various regional databases into a single dataset.
4. Feature Selection
Identifying and selecting relevant features that contribute to the analysis while
discarding redundant or irrelevant ones. This reduces computational complexity
and improves model accuracy.
5. Data Sampling
If the dataset is too large, representative subsets are created to speed up
analysis without compromising accuracy.
The data preparation phase is critical because it ensures data quality, which directly
impacts the reliability of the insights and models. High-quality, well-structured data
minimizes errors during analysis and helps in building robust and accurate predictive
models.
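Two of the steps above, data cleaning (mean imputation) and data transformation (normalization), can be sketched in plain Python. This is a minimal illustration only; the function names and the sample sales figures are hypothetical, and real projects would typically use libraries such as pandas:

```python
import statistics

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]

def min_max_normalize(values):
    """Scale values linearly into the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical sales figures with one missing value
raw = [200.0, None, 300.0, 250.0]
clean = impute_mean(raw)           # missing value replaced by the mean, 250.0
scaled = min_max_normalize(clean)  # all values now lie between 0 and 1
```

Mean substitution is only one of the imputation techniques the notes mention; regression-based imputation follows the same pattern but predicts each missing value from the other features.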

Question 4:
What are the key activities in the model planning and model building phases of
the data analytics lifecycle? (AKTU 2021)
Answer:
The model planning and model building phases are pivotal in the analytics lifecycle,
focusing on selecting techniques and developing models to address business problems.
Model Planning Phase:
This phase involves:
1. Selecting Techniques
Based on the problem type, appropriate statistical and machine learning
techniques are chosen. For example, regression analysis for prediction or
clustering for segmentation.
2. Identifying Algorithms
Algorithms such as decision trees, random forests, or neural networks are
evaluated for their suitability.
3. Creating Data Splits
The dataset is divided into training, validation, and testing sets to ensure
unbiased evaluation.
4. Initial Hypotheses Testing
Preliminary analysis is conducted to identify trends or patterns.
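The data split in step 3 can be sketched in plain Python. This is an illustrative sketch, not a prescribed method; the function name and the 60/20/20 proportions are assumptions:

```python
import random

def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle the rows and carve out validation and test subsets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    n = len(rows)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test

# 100 hypothetical record IDs -> 60 training, 20 validation, 20 test
train, val, test = train_val_test_split(range(100))
```

Shuffling before splitting matters: if the data is ordered (for example, by date or region), taking contiguous slices without shuffling would bias all three subsets.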
Model Building Phase:
This phase includes:
1. Developing Models
Algorithms are applied to the training data to build predictive or descriptive
models.
2. Parameter Tuning
Hyperparameters are fine-tuned to optimize model performance.
3. Validation and Iteration
Models are tested on the validation set to assess performance. Iterative
improvements are made based on results.
4. Testing
The final model is evaluated on the test dataset to measure its effectiveness in
real-world scenarios.
Both phases are integral to creating a solution tailored to the business problem. Model
planning ensures a structured approach, while model building translates theory into
practical application, yielding actionable insights.

Question 5:
How are results communicated in data analytics, and why is operationalization
crucial for project success? (AKTU 2022)
Answer:
Communicating results and operationalizing solutions are critical steps in ensuring the
success of a data analytics project.
Communicating Results:
1. Tailoring the Presentation
Insights are presented in a manner understandable to stakeholders. For
example, executives may prefer high-level summaries, while technical teams
require detailed findings.
2. Using Visualization Tools
Tools like Tableau, Power BI, or Python libraries (e.g., Matplotlib, Seaborn) are
used to create charts, graphs, and dashboards. These help convey insights
effectively.
3. Highlighting Key Findings
Results are linked directly to business objectives. For instance, “Customer
retention improved by 15% due to targeted marketing campaigns.”
4. Actionable Recommendations
Specific, data-driven recommendations are provided, ensuring clarity in
decision-making.
Operationalization:
This involves integrating the model or analytics solution into daily business processes:
1. Deploying Models
Predictive models are implemented in production systems for real-time use,
such as fraud detection in banking.
2. Monitoring Performance
Continuous monitoring ensures that the solution remains effective over time.
For example, retraining machine learning models periodically.
3. User Training
End-users are trained to use the tools or systems effectively.
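The monitoring idea in step 2 can be sketched as a simple drift check: track recent prediction outcomes and flag the model for retraining when its rolling accuracy falls below a threshold. The function names, the window size, and the threshold are illustrative assumptions:

```python
def rolling_accuracy(outcomes, window=5):
    """Accuracy over the most recent `window` predictions (1 = correct, 0 = wrong)."""
    recent = outcomes[-window:]
    return sum(recent) / len(recent)

def needs_retraining(outcomes, threshold=0.8, window=5):
    """Flag the model for retraining when recent accuracy drops below threshold."""
    return rolling_accuracy(outcomes, window) < threshold

# Hypothetical log of production outcomes, degrading over time
log = [1, 1, 1, 1, 1, 1, 0, 1, 0, 0]
flag = needs_retraining(log)  # recent accuracy 0.4 is below 0.8, so retrain
```

Production systems would typically attach such a check to an alerting pipeline, but the core logic, comparing recent performance against an agreed threshold, is the same.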
Operationalization bridges the gap between insights and action, ensuring that the
analytical solutions deliver measurable business value. Without this phase, analytics
projects risk becoming theoretical exercises with no real-world impact.

