Unlike a centralized database system, a distributed database spreads an organization's data across multiple
database systems that are connected by communication links. These links let end users access the data
easily. Examples of distributed databases are Apache Cassandra, HBase, Ignite, etc.
o Homogeneous DDB: The participating database systems run on the same operating system, use the same
application processes, and run on the same type of hardware.
o Heterogeneous DDB: The participating database systems run on different operating systems, under different
application procedures, and on different hardware.
o Modular development is possible in a distributed database, i.e., the system can be expanded
by including new computers and connecting them to the distributed system.
o One server failure will not affect the entire data set.
3) Relational Database
This database is based on the relational data model, which stores data in the form of rows (tuples) and
columns (attributes), which together form a table (relation). A relational database uses SQL for storing,
manipulating, and maintaining the data. E.F. Codd proposed the relational model in 1970. Each table in
the database carries a key that makes each record unique. Examples of relational databases
are MySQL, Microsoft SQL Server, Oracle, etc.
A relational model has four commonly known transaction properties, known as the ACID properties:
o Atomicity: a transaction either completes fully or has no effect at all.
o Consistency: a transaction moves the database from one valid state to another.
o Isolation: concurrent transactions do not interfere with each other.
o Durability: once committed, a transaction's changes survive system failures.
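As a minimal, hedged illustration of atomicity, the sketch below uses Python's built-in sqlite3 module; the table name and the account values are hypothetical and chosen only for the example:

    import sqlite3

    conn = sqlite3.connect(":memory:")          # throwaway in-memory database
    conn.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")
    conn.execute("INSERT INTO accounts VALUES ('A', 100), ('B', 100)")
    conn.commit()

    try:
        # Transfer 50 from A to B as one transaction: both updates happen or neither does.
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'A'")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'B'")
        conn.commit()                            # atomic commit of both updates
    except Exception:
        conn.rollback()                          # on any error, neither update is applied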
UNIT – 1 Introduction to Data Science
1. What is Data Science
Data science is a field that combines statistical analysis, machine learning, data visualization, and domain
expertise to understand and interpret complex data. It involves various processes, including data collection,
cleaning, analysis, modelling, and communication of insights. The goal is to extract valuable information
from data to inform decision-making and strategy.
Multiple-Choice Questions
1. What is the primary goal of data science?
- A) To collect data
- B) To analyze data
- C) To extract insights from data
- D) To visualize data
Answer: C) To extract insights from data
Data science has become essential in today's data-driven world for several reasons:
1. Data-Driven Decision Making : Organizations use data science to make informed decisions based on
empirical evidence rather than intuition or guesswork.
2. Understanding Customer Behaviour: Data science helps businesses analyse customer data to
understand preferences and behaviours, leading to personalized marketing and improved customer
experiences.
3. Predictive Analytics : Companies leverage data science to forecast trends and outcomes, allowing them
to be proactive rather than reactive in their strategies.
4. Operational Efficiency : Data science can identify inefficiencies in processes and suggest improvements,
leading to cost savings and better resource allocation.
5. Competitive Advantage : Organizations that utilize data science effectively can gain insights that provide
a competitive edge in their industry.
6. Innovation : Data science enables organizations to explore new business models, products, and services
based on data insights.
1. Definition: Focuses on analyzing historical data to provide actionable insights and improve decision-
making.
3. Techniques: Dashboards, SQL queries, data visualization tools (e.g., Power BI, Tableau).
1. Definition: Extracts insights and predictions from large, complex datasets using advanced algorithms.
2. Key Goal: Predictive and prescriptive analytics.
3. Techniques: Machine learning, AI, statistical modelling, coding (e.g., Python, R).
4. Usage: Forecasting trends, customer segmentation, anomaly detection.
5. Output: Predictive models, recommendations, insights for innovation.
1. Data Collection
Definition: The process of gathering raw data from various sources such as databases, APIs, web scraping,
or manual entry.
Tools: SQL, Python (libraries like requests, Beautiful Soup), APIs.
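As a small, hedged sketch of collecting data from a web page with the tools named above (the URL is a placeholder, and this assumes the requests and beautifulsoup4 packages are installed):

    import requests
    from bs4 import BeautifulSoup

    url = "https://example.com/products"         # hypothetical page to scrape
    response = requests.get(url, timeout=10)
    response.raise_for_status()                  # stop if the request failed

    soup = BeautifulSoup(response.text, "html.parser")
    titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
    print(titles)                                # raw data gathered for later cleaning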
5. Model Evaluation
Definition: Assessing the performance of a model using metrics to ensure accuracy and reliability.
Metrics: Accuracy, precision, recall, F1 score, RMSE.
Tools: Python (Scikit-learn), R, cross-validation techniques.
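A minimal sketch of computing some of these metrics with scikit-learn; the true and predicted labels below are made-up values used only to show the function calls:

    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    y_true = [1, 0, 1, 1, 0, 1]      # hypothetical ground-truth labels
    y_pred = [1, 0, 0, 1, 0, 1]      # hypothetical model predictions

    print("Accuracy :", accuracy_score(y_true, y_pred))
    print("Precision:", precision_score(y_true, y_pred))
    print("Recall   :", recall_score(y_true, y_pred))
    print("F1 score :", f1_score(y_true, y_pred))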
6. Data Visualization
Definition: Representing data insights through charts, graphs, and dashboards.
Purpose: Communicating results effectively to stakeholders.
Tools: Matplotlib, Seaborn, Power BI, Tableau.
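A small, hedged example of a simple chart with Matplotlib; the monthly sales figures are invented purely for illustration:

    import matplotlib.pyplot as plt

    months = ["Jan", "Feb", "Mar", "Apr"]
    sales = [120, 135, 150, 160]                 # hypothetical values

    plt.plot(months, sales, marker="o")
    plt.title("Monthly Sales Trend")
    plt.xlabel("Month")
    plt.ylabel("Sales")
    plt.show()                                   # a dashboard tool could present the same insight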
1. Problem Definition
Understanding the business problem and defining the goals.
2. Data Collection
Gathering relevant data from various sources.
5. Model Building
Selecting algorithms and training machine learning models.
6. Model Evaluation
Testing the model's performance using metrics (e.g., accuracy, precision).
7. Deployment
Integrating the model into a production environment.
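To make steps 5 and 6 of the life cycle concrete, here is a hedged sketch using scikit-learn; the data split, the choice of logistic regression, and the accuracy check are illustrative defaults, not a prescribed recipe:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)            # a small built-in dataset stands in for real data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = LogisticRegression(max_iter=200)     # step 5: select an algorithm and train it
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)               # step 6: evaluate on held-out data
    print("Test accuracy:", accuracy_score(y_test, y_pred))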
MCQs
1. Which is the first step in the Data Science Life Cycle?
A. Data Collection
B. Problem Definition
C. Model Building
D. Data Visualization
Answer: B. Problem Definition
2. What is the main objective of Data Preparation?
A. Building machine learning models
B. Cleaning and organizing data for analysis
C. Testing the model’s accuracy
D. Deploying the model
Answer: B. Cleaning and organizing data for analysis
3. Which process helps identify patterns and trends in data?
A. Model Deployment
B. Data Collection
C. Exploratory Data Analysis (EDA)
D. Data Cleaning
Answer: C. Exploratory Data Analysis (EDA)
4. What step involves splitting data into training and testing datasets?
A. Model Deployment
B. Model Evaluation
C. Model Building
D. Data Preparation
Answer: C. Model Building
5. Which of the following is NOT a model evaluation metric?
A. Accuracy
B. Precision
C. Data Cleaning
D. Recall
Answer: C. Data Cleaning
6. After deploying a model, what is the next step?
A. Model Building
B. Data Wrangling
C. Monitoring and Maintenance
D. EDA
Answer: C. Monitoring and Maintenance
7. What is the purpose of data visualization during EDA?
A. Build predictive models
B. Clean raw data
C. Identify patterns and insights
D. Test model performance
Answer: C. Identify patterns and insights
1. Volume
The sheer size of data generated daily, often measured in terabytes or petabytes. Examples include social media
posts, transaction records, and sensor data.
2. Velocity
The speed at which data is generated and processed. For instance, real-time data streams from IoT devices or
financial transactions.
3. Variety
The diversity of data types, such as structured (databases), semi-structured (XML, JSON), and unstructured (videos,
images, text).
4. Veracity
The quality and accuracy of data, which can often be incomplete, noisy, or misleading.
5. Value
The actionable insights and benefits derived from analyzing Big Data.
4. IoT (Internet of Things): Data from connected devices like sensors, cameras, and wearables.
5. Government & Public Services: Census data, transportation data, and utility records.
1. Storage
2. Processing
o Apache Hadoop
o Apache Spark
3. Database Systems
Big Data can be classified into different types based on its nature, source, and structure:
1. Based on Data Type
Structured Data: Organized in rows and columns, stored in relational databases (e.g., SQL).
Unstructured Data: Lacks a predefined format, difficult to process (e.g., videos, images, social media posts).
Semi-structured Data: Does not follow strict schema but has some organizational properties.
2. Based on Source
Batch Data: Processed in chunks over time (e.g., transaction data at the end of the day).
Stream Data: Real-time data processed as it is generated (e.g., stock market feeds).
4. Based on Domain
A) Structured
B) Unstructured
C) Semi-structured
D) Machine-generated
Answer: C) Semi-structured
C) Audio files
A) Human-generated
B) Semi-structured
C) Machine-generated
D) Batch data
Answer: C) Machine-generated
A) Batch Data
B) Stream Data
C) Structured Data
D) Scientific Data
A) Structured B) Unstructured
C) Semi-structured D) Machine-generated
Answer: C) Semi-structured
Tools like SQL were used to manage and query relational databases.
Social media platforms and IoT devices started producing massive volumes of unstructured data.
Limitations of traditional databases led to the development of new frameworks like Hadoop.
Advancements in cloud computing, distributed systems, and machine learning enhanced Big Data analytics.
Big Data became critical in industries such as finance, healthcare, and retail.
3. 2010s: NoSQL databases and real-time processing tools like Kafka and Spark emerge.
4. 2020s: Integration of Big Data with AI, IoT, and edge computing.
The origin of the data, which can include:
Structured data (databases, spreadsheets).
A) MapReduce
B) Apache Spark
C) HDFS
D) MongoDB
A) Centralized storage
B) Distributed storage
C) Local storage
D) Cloud-based storage
4. Which layer in Big Data architecture is responsible for analyzing and visualizing the data?
Sqoop: For importing and exporting data between Hadoop and relational databases.
Flume: Designed to collect, aggregate, and move large amounts of log data.
2. Fault Tolerance: Automatically replicates data across nodes to prevent data loss.
a) To manage relational databases b) To process and store large datasets in a distributed manner
a) HDFS
b) MapReduce
c) SQL Server
d) YARN
a) 1
b) 2
c) 3
d) 4
Answer: c) 3
a) Pig b) Hive
c) Flume d) Sqoop
Answer: b) Hive
Hadoop Architecture
The Hadoop architecture is designed to process and store massive datasets efficiently in a distributed and fault-
tolerant manner. It is based on the Master-Slave architecture and includes the following major components:
NameNode: Keeps track of where data blocks are stored across the cluster.
NodeManager: Manages individual nodes in the cluster and monitors resource usage.
3. MapReduce Framework
Map Phase: Processes input data and produces intermediate key-value pairs.
Reduce Phase: Aggregates the output from the Map phase into meaningful results.
Hadoop Ecosystem
The Hadoop ecosystem consists of various tools that extend Hadoop's functionality, enabling it to manage, process,
and analyze diverse data types effectively.
Key Components:
1. Hive: SQL-like querying for structured data.
2. Pig: High-level scripting for data transformation.
3. HBase: A NoSQL database for real-time data access.
4. Sqoop: Import/export data between Hadoop and relational databases.
5. Flume: Collects and moves log data into HDFS.
6. Spark: An in-memory data processing engine for real-time analytics.
7. Zookeeper: Coordination and synchronization for distributed systems.
8. Mahout: Machine learning and recommendation system libraries.
MCQs
1. What is the primary function of HDFS in Hadoop?
a) To process data
b) To store data in a distributed manner
c) To query data
d) To manage resources
Answer: b) To store data in a distributed manner
2. What is the responsibility of the NameNode in HDFS?
a) Store data blocks
b) Manage metadata and block locations
c) Schedule jobs in the cluster
d) Monitor resource usage
Answer: b) Manage metadata and block locations
3. Which component is responsible for resource allocation in YARN?
a) DataNode
b) ResourceManager
c) NameNode
d) NodeManager
Answer: b) ResourceManager
4. What are the two phases of a MapReduce job?
a) Store and Process
b) Map and Reduce
c) Fetch and Aggregate d) Query and Execute
Answer: b) Map and Reduce
5. Which tool in the Hadoop ecosystem is used for real-time data processing?
a) Sqoop
b) Spark
c) Pig
d) Hive
Answer: b) Spark
6. What is the default block size in HDFS for storing data?
a) 32MB
b) 64MB
c) 128MB
d) 256MB
Answer: c) 128MB
7. Which Hadoop ecosystem component is used for machine learning?
a) Flume
b) Mahout
c) Hive
d) HBase
Answer: b) Mahout
8. Which of the following tools is used for importing and exporting data in Hadoop?
a) Hive
b) Pig
c) Sqoop
d) Flume
Answer: c) Sqoop
9. In the Hadoop architecture, which node performs data storage tasks?
a) NameNode
b) DataNode
c) ResourceManager
d) NodeManager
Answer: b) DataNode
10. What is the role of Zookeeper in the Hadoop ecosystem?
a) Data storage
b) Log management
c) Coordination and synchronization
d) Machine learning
Answer: c) Coordination and synchronization
Hadoop Ecosystem Components
The Hadoop ecosystem includes a variety of tools and frameworks that complement Hadoop's core functionality
(HDFS, MapReduce, and YARN) to handle diverse big data tasks like storage, processing, analysis, and real-time
computation.
1. Core Components
HDFS (Hadoop Distributed File System): Distributed storage for large datasets.
MapReduce: Programming model for batch data processing.
YARN (Yet Another Resource Negotiator): Resource management and task scheduling.
2. Ecosystem Tools
1. Hive
Data warehousing and SQL-like querying for large datasets.
Suitable for structured data.
Converts SQL queries into MapReduce jobs.
2. Pig
High-level scripting language for data transformation.
Converts Pig scripts into MapReduce tasks.
Suitable for semi-structured and unstructured data.
3. HBase
A distributed NoSQL database for real-time read/write access.
Built on HDFS, optimized for random access.
4. Spark
An in-memory processing engine for fast analytics.
Supports batch processing, machine learning, graph processing, and stream processing.
5. Sqoop
Transfers data between Hadoop and relational databases like MySQL or Oracle.
Ideal for ETL (Extract, Transform, Load) operations.
6. Flume
Collects, aggregates, and moves large amounts of log data into HDFS.
Best for streaming data.
7. Zookeeper
Manages and coordinates distributed systems.
Ensures synchronization across nodes.
8. Oozie
Workflow scheduler for managing Hadoop jobs.
Executes workflows involving Hive, Pig, and MapReduce tasks.
9. Mahout
Provides machine learning algorithms for clustering, classification, and recommendations.
10. Kafka
A distributed messaging system for real-time data streams.
Often used with Spark or Flume.
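As a hedged illustration of Spark's role in the ecosystem, the sketch below is a classic word count written with PySpark; the HDFS input and output paths are placeholders, and it assumes a working Spark installation:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("WordCount").getOrCreate()

    # Read lines from a (hypothetical) file in HDFS and split them into words.
    lines = spark.read.text("hdfs:///data/input.txt").rdd.map(lambda row: row[0])
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))    # same map/reduce idea, but in memory

    counts.saveAsTextFile("hdfs:///data/wordcount_output")
    spark.stop()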
MCQs
1. Which Hadoop tool provides SQL-like querying capabilities?
a) Pig b) Hive
c) Sqoop d) Oozie
Answer: b) Hive
2. What is the purpose of HBase in the Hadoop ecosystem?
a) To process data in real-time
b) To provide NoSQL database storage
c) To collect log data
d) To manage workflows
Answer: b) To provide NoSQL database storage
3. Which component is used for data import/export between Hadoop and relational databases?
a) Flume
b) Sqoop
c) Spark
d) Mahout
Answer: b) Sqoop
4. What is the main use of Flume in the Hadoop ecosystem?
a) To manage workflows
b) To process structured data
c) To collect and move log data
d) To provide SQL queries
Answer: c) To collect and move log data
5. Which of the following is a workflow management tool in Hadoop?
a) Zookeeper
b) Oozie
c) Pig
d) Hive
Answer: b) Oozie
6. What is the primary function of Spark in the Hadoop ecosystem?
a) Data storage
b) Batch processing
c) In-memory data processing
d) Real-time log collection
Answer: c) In-memory data processing
7. Which tool is used for machine learning tasks in Hadoop?
a) Oozie
b) Mahout
c) Hive
d) Sqoop
Answer: b) Mahout
8. What does Zookeeper provide in the Hadoop ecosystem?
a) Data synchronization and coordination
b) SQL-like querying
c) Real-time data streaming
d) Workflow scheduling
Answer: a) Data synchronization and coordination
9. Which component handles the streaming of real-time data in Hadoop?
a) Kafka
b) Pig
c) Hive
d) Spark
Answer: a) Kafka
10. Which tool in the Hadoop ecosystem is ideal for large-scale graph processing?
a) Hive
b) Spark
c) HBase
d) Flume
Answer: b) Spark
MapReduce Overview
MapReduce is a programming model and processing framework in Hadoop for processing large datasets in a
distributed and parallel manner. It consists of two key phases:
1. Map Phase:
Splits the input data into smaller chunks and processes them independently.
Converts data into key-value pairs.
2. Reduce Phase:
Aggregates, filters, or summarizes the intermediate key-value pairs produced by the Map phase.
Produces the final output.
Key Features:
Works with HDFS for fault tolerance and distributed processing.
Suitable for batch processing of large-scale datasets.
Workflow of MapReduce
1. Input data is split into chunks.
3. The Shuffle and Sort phase organizes these key-value pairs by key.
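A toy, single-machine sketch of the same map, shuffle/sort, and reduce flow in plain Python; the input lines are invented, and a real job would run distributed across HDFS blocks:

    from collections import defaultdict

    lines = ["big data tools", "big data analytics"]    # hypothetical input split

    # Map phase: emit (key, value) pairs, here (word, 1).
    mapped = [(word, 1) for line in lines for word in line.split()]

    # Shuffle and sort: group all values belonging to the same key.
    grouped = defaultdict(list)
    for word, count in mapped:
        grouped[word].append(count)

    # Reduce phase: aggregate the values for each key.
    reduced = {word: sum(counts) for word, counts in grouped.items()}
    print(reduced)    # {'big': 2, 'data': 2, 'tools': 1, 'analytics': 1}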
4. Block Storage
4. Write Once, Read Many: Data is written once and read multiple times, making it ideal for analytics.
HDFS Workflow
1. Data Input: Data is divided into blocks and sent to DataNodes for storage.
3. Fault Tolerance: If a DataNode fails, the system retrieves the data from replicas stored on other nodes.
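A quick, hedged back-of-the-envelope calculation of how a file is laid out under these rules, using a hypothetical 1 GB file and the common defaults of 128 MB blocks and a replication factor of 3:

    import math

    file_size_mb = 1024        # hypothetical 1 GB file
    block_size_mb = 128        # common HDFS default block size
    replication_factor = 3     # common HDFS default replication factor

    blocks = math.ceil(file_size_mb / block_size_mb)     # 8 blocks
    stored_copies = blocks * replication_factor          # 24 block copies spread across DataNodes
    print(blocks, stored_copies)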
MCQs
1. What is the primary function of HDFS?
a) DataNode
b) NameNode
c) Secondary NameNode
d) ResourceManager
Answer: b) NameNode
a) 64MB
b) 128MB
c) 256MB d) 512MB
Answer: b) 128MB
4. What does the Secondary NameNode do?
a) Replaces the NameNode in case of failure
b) Stores actual data blocks
c) Periodically merges and checkpoints metadata
d) Manages task scheduling
Answer: c) Periodically merges and checkpoints metadata
5. How does HDFS ensure fault tolerance?
a) By using expensive hardware
b) By replicating data blocks across multiple nodes
c) By compressing data
d) By storing all data on a single node
Answer: b) By replicating data blocks across multiple nodes
6. What is the default replication factor in HDFS?
a) 1
b) 2
c) 3
d) 4
Answer: c) 3
7. Which of the following is NOT a characteristic of HDFS?
a) Write Once, Read Many
b) Real-time data updates
c) Fault tolerance
d) Distributed storage
Answer: b) Real-time data updates
8. What happens if a DataNode fails in HDFS?
a) The system shuts down.
b) Data is lost permanently.
c) Data is retrieved from replicated blocks on other nodes.
d) The NameNode fails.
Answer: c) Data is retrieved from replicated blocks on other nodes.
9. Which node in HDFS stores actual data?
a) NameNode
b) DataNode
c) Secondary NameNode
d) ResourceManager
Answer: b) DataNode
YARN (Yet Another Resource Negotiator)
YARN is a core component of Hadoop responsible for cluster resource management and task scheduling. It enables
multiple data processing engines like MapReduce, Spark, and others to run on Hadoop, making the system more
efficient and versatile.
3. ApplicationMaster: Manages the lifecycle of a single application and negotiates the resources it needs.
4. Container: A bundle of resources (CPU, memory) on a node in which an application's tasks run.
Features of YARN
1. Scalability: Efficiently handles large clusters.
4. Flexibility: Supports various processing frameworks like MapReduce, Spark, and Tez.
YARN Workflow
1. The client submits an application to the ResourceManager.
3. The ApplicationMaster requests containers from the ResourceManager for task execution.
MCQs
1. What is the primary function of YARN in Hadoop?
a) Data storage
b) Resource management and task scheduling
c) Query execution
d) Metadata management
Answer: b) Resource management and task scheduling
a) Data blocks
b) Metadata
c) Containers
d) Files
Answer: c) Containers
a) Only MapReduce
b) Only Spark
2. NoSQL Databases
3. Distributed Databases
4. Cloud Databases
5. Object-Oriented Databases
6. Hierarchical Databases
7. Network Databases
MCQs
1. Which database is based on a table structure of rows and columns?
a) NoSQL Database
b) Relational Database
c) Hierarchical Database
d) Object-Oriented Database
Answer: b) Relational Database
a) Relational Database
b) Hierarchical Database
a) Hierarchical Database
b) Graph Database
c) Relational Database
d) Object-Oriented Database
a) Relational Database
b) Cloud Database
c) Hierarchical Database
d) Network Database
4. Distributed Architecture: Often built for distributed systems and fault tolerance.
2. Document Stores:
4. Graph Databases:
MCQs
1. Which of the following best describes NoSQL databases?
a) They store data only in relational tables.
b) They provide support for unstructured or semi-structured data.
c) They are always slower than relational databases.
d) They cannot scale horizontally.
Answer: b) They provide support for unstructured or semi-structured data.
2. Which type of NoSQL database is optimized for managing relationships between data?
a) Document Store
b) Key-Value Store
c) Graph Database
d) Column-Family Store
Answer: c) Graph Database
3. MongoDB is an example of which type of NoSQL database?
a) Key-Value Store
b) Document Store
c) Column-Family Store
d) Graph Database
Answer: b) Document Store
4. Which of the following is a feature of NoSQL databases?
a) Fixed schema b) Vertical scaling
c) Support for distributed data d) Mandatory use of SQL
Answer: c) Support for distributed data
5. Cassandra is an example of which type of NoSQL database?
a) Document Store
b) Key-Value Store
c) Graph Database
d) Column-Family Store
Answer: d) Column-Family Store
Relational databases struggle with the massive data generated by IoT, social media, and web applications. NoSQL
databases can efficiently process and store large volumes of data.
2. Scalability:
Traditional databases rely on vertical scaling (adding more resources to a single server), which can be expensive.
NoSQL databases support horizontal scaling (adding more servers to a cluster), providing cost-effective scalability.
3. Flexibility:
Relational databases require a fixed schema. In contrast, NoSQL databases allow schema-less designs, which are
more adaptable for dynamic and evolving data structures (a small sketch of this appears after this list).
With the rise of unstructured and semi-structured data (e.g., images, videos, JSON), NoSQL databases are better
equipped to handle such data types.
5. High Performance:
NoSQL databases are optimized for high-speed read and write operations, making them ideal for applications
requiring real-time data processing.
6. Distributed Systems:
Modern applications are often distributed globally, and NoSQL databases are designed for distributed architectures,
ensuring reliability and fault tolerance.
7. Cost-Effectiveness:
Many NoSQL solutions are open-source and can run on commodity hardware, reducing costs.
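To illustrate the schema flexibility described in point 3, here is a minimal, hedged sketch using the pymongo client; the connection string, database, and collection names are hypothetical, and it assumes a MongoDB server is reachable:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")    # hypothetical local server
    products = client["shop_db"]["products"]             # database and collection are created on first use

    # Two documents with different fields can live in the same collection: no fixed schema.
    products.insert_one({"name": "Laptop", "price": 55000, "specs": {"ram_gb": 16}})
    products.insert_one({"name": "Notebook", "price": 40, "pages": 200, "tags": ["stationery"]})

    for doc in products.find({"price": {"$lt": 1000}}):   # query without predefined columns
        print(doc)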
MCQs
1. Why are NoSQL databases preferred for handling big data?
c) They can process large volumes of unstructured data. d) They cannot scale horizontally.
Answer: c) They can process large volumes of unstructured data.
d) Vertical scalability
c) Static scalability
d) Single-node scalability
5. For real-time applications requiring high-speed data access, which type of database is suitable?
NoSQL databases are designed for horizontal scaling, allowing the addition of more servers to handle increasing data
and traffic.
2. Flexibility:
Schema-less design enables dynamic changes to data structures without disrupting operations.
Supports structured, semi-structured, and unstructured data such as JSON, XML, videos, and images.
4. High Performance:
Optimized for fast read and write operations, suitable for real-time applications.
Built for distributed architecture, ensuring fault tolerance and high availability.
6. Cost-Effectiveness:
Many NoSQL solutions are open-source and can run on commodity hardware, reducing costs.
Efficiently processes large volumes of data and provides insights in real time.
8. Easy Integration:
9. No Complex Joins:
10. Cloud-Friendly:
MCQs
1. What is the main advantage of NoSQL databases in terms of scalability?
a) Vertical scaling
b) Horizontal scaling
c) Limited scaling
d) No scaling
Answer: b) Horizontal scaling
MCQs
1. Which type of database uses a predefined schema?
a) SQL
b) NoSQL
c) Both SQL and NoSQL
d) Neither SQL nor NoSQL
Answer: a) SQL
2. What is a major advantage of NoSQL databases over SQL databases?
a) Strong support for joins
b) Fixed schema
c) Horizontal scalability
d) Limited support for unstructured data
Answer: c) Horizontal scalability
3. Which type of database is better suited for handling structured data?
a) SQL
b) NoSQL
c) Both SQL and NoSQL
d) None of the above
Answer: a) SQL
4. Which of the following is an example of a NoSQL database?
a) MySQL
b) PostgreSQL
c) MongoDB
d) Oracle
Answer: c) MongoDB
5. What makes NoSQL databases more flexible than SQL databases?
a) Predefined schema
b) Schema-less design
c) Complex relationships
d) Dependence on SQL
Answer: b) Schema-less design
6. SQL databases are typically scaled by:
a) Adding more servers (horizontal scaling)
b) Adding more resources to the same server (vertical scaling)
c) Both horizontal and vertical scaling equally d) Neither horizontal nor vertical scaling
Answer: b) Adding more resources to the same server (vertical scaling)
7. Which type of database is ideal for applications with rapidly changing data models?
a) SQL
b) NoSQL
c) Both SQL and NoSQL
d) None of the above
Answer: b) NoSQL
8. Which query language is used by SQL databases?
a) Structured Query Language (SQL)
b) JSON Query Language (JQL)
c) NoSQL Query Language
d) Custom APIs
Answer: a) Structured Query Language (SQL)
MCQs
1. Which type of NoSQL database stores data in key-value pairs?
a) Document Store b) Column-Family Store
c) Key-Value Store d) Graph Database
Answer: c) Key-Value Store
2. MongoDB is an example of which type of NoSQL database?
a) Document Store b) Key-Value Store
c) Column-Family Store d) Graph Database
Answer: a) Document Store
3. Which type of NoSQL database is optimized for managing relationships between entities?
a) Document Store b) Graph Database
c) Column-Family Store d) Key-Value Store
Answer: b) Graph Database
4. What type of NoSQL database is best suited for analytics and time-series data?
a) Key-Value Store b) Column-Family Store
c) Graph Database d) Document Store
Answer: b) Column-Family Store
5. Redis is an example of which type of NoSQL database?
a) Document Store b) Key-Value Store
c) Column-Family Store d) Graph Database
Answer: b) Key-Value Store
6. Which type of NoSQL database is most suitable for storing hierarchical data in JSON format?
Answer: b) Neo4j
8. Which NoSQL database type is best for storing large-scale tabular data?
MCQs
Question: What is the primary goal of data analytics?
A) To store large amounts of data
B) To create complex algorithms
C) To extract valuable insights from data
D) To visualize data in graphs and charts
Answer: C) To extract valuable insights from data
1. Which of the following is a key step in the data analytics process?
A) Data collection
B) Data visualization
C) Data cleaning
D) All of the above
Answer: D) All of the above
2. What type of analysis is used to make predictions based on historical data?
A) Descriptive analysis
B) Diagnostic analysis
C) Predictive analysis
D) Prescriptive analysis
Answer: C) Predictive analysis
3. Which of the following tools is commonly used for data visualization?
A) Excel
B) Power BI
C) Tableau
D) All of the above
Answer: D) All of the above
4. Which of these is a type of unstructured data?
A) A database table
B) A text document
C) A CSV file
D) A spreadsheet
Answer: B) A text document
5. What is the purpose of data cleaning in the analytics process?
The use of data analytics spans across various fields and industries, enabling organizations to gain valuable insights,
make data-driven decisions, and optimize processes.
Use: Data analytics helps monitor energy usage, optimize supply distribution, and improve sustainability practices.
Example: Smart meters in homes use analytics to track energy consumption, helping users optimize usage and
reduce costs.
MCQs
1. What is the first step in the Data Analytics Life Cycle?
A) Data Collection B) Problem Definition
C) Model Building D) Data Cleaning and Preprocessing
Answer: B) Problem Definition
2. In which stage of the Data Analytics Life Cycle is data cleaned and transformed for analysis?
A) Model Evaluation B) Exploratory Data Analysis
C) Data Collection D) Data Cleaning and Preprocessing
Answer: D) Data Cleaning and Preprocessing
3. Which of the following is the primary goal of Exploratory Data Analysis (EDA)?
A) To collect data from external sources B) To build a predictive model
C) To visualize and summarize the data to find patterns and insights
D) To deploy the model into production
Answer: C) To visualize and summarize the data to find patterns and insights
4. At which stage of the Data Analytics Life Cycle is the model’s performance evaluated?
A) Data Collection B) Model Building C) Model Evaluation D) Problem Definition
Answer: C) Model Evaluation
5. Which of the following describes the deployment and monitoring stage of the Data Analytics Life Cycle?
A) Applying models to historical data B) Collecting and cleaning data
C) Putting the model into use and tracking its performance D) Visualizing the data
Answer: C) Putting the model into use and tracking its performance
6. What is the purpose of the "Problem Definition" stage in the Data Analytics Life Cycle?
A) To collect data from different sources B) To identify the specific questions to answer and goals to achieve
C) To build a predictive model D) To evaluate the model's accuracy
Answer: B) To identify the specific questions to answer and goals to achieve
Types of Analytics:
1. Descriptive Analytics
Focuses on understanding historical data and summarizing what has happened. It answers questions like
"What happened?" through methods like data aggregation and visualization.
2. Diagnostic Analytics
Goes a step further than descriptive analytics by investigating the reasons behind past outcomes. It answers
"Why did it happen?" through techniques like correlation analysis and root cause analysis.
3. Predictive Analytics
Uses statistical models and machine learning techniques to forecast future outcomes. It answers "What
could happen?" by analyzing patterns in historical data to make predictions.
4. Prescriptive Analytics
Recommends actions to achieve desired outcomes by using optimization and simulation techniques. It
answers "What should we do?" to optimize decisions and strategies.
5. Cognitive Analytics
Involves advanced AI techniques that simulate human thought processes. It focuses on improving decision-
making by learning from experience, often using natural language processing (NLP) and machine learning.
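To make the descriptive and predictive categories concrete, the hedged sketch below first summarizes invented monthly sales (descriptive) and then fits a simple linear trend to project the next month (predictive); the numbers and the choice of linear regression are illustrative only:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    months = np.array([[1], [2], [3], [4], [5], [6]])
    sales = np.array([100, 110, 125, 130, 150, 160])     # hypothetical monthly sales

    # Descriptive analytics: summarize what has happened.
    print("Average monthly sales:", sales.mean())
    print("Best month:", int(months[sales.argmax()][0]))

    # Predictive analytics: use the historical pattern to forecast month 7.
    model = LinearRegression().fit(months, sales)
    forecast = model.predict([[7]])
    print("Forecast for month 7:", round(float(forecast[0]), 1))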
MCQs
1. Which type of analytics answers the question, "What happened?"
A) Predictive Analytics B) Prescriptive Analytics C) Descriptive Analytics D) Cognitive Analytics
Answer: C) Descriptive Analytics
2. What is the primary goal of diagnostic analytics?
A) To summarize past events B) To predict future outcomes
C) To understand why something happened D) To recommend the best course of action
Answer: C) To understand why something happened
3. Which type of analytics is used to predict future outcomes based on historical data?
A) Prescriptive Analytics B) Cognitive Analytics
C) Predictive Analytics D) Descriptive Analytics
Answer: C) Predictive Analytics
4. Which of the following is a feature of prescriptive analytics?
A) It predicts future trends based on past data. B) It answers why something occurred in the past.
C) It recommends the best course of action to optimize outcomes.
D) It visualizes and summarizes historical data.
Answer: C) It recommends the best course of action to optimize outcomes.
5. Which type of analytics involves using AI to simulate human thought processes?
A) Predictive Analytics B) Descriptive Analytics C) Cognitive Analytics D) Diagnostic Analytics
Answer: C) Cognitive Analytics
6. What is the key difference between descriptive and diagnostic analytics?
A) Descriptive analytics focuses on predicting future events, while diagnostic analytics focuses on past events.
B) Descriptive analytics summarizes historical data, while diagnostic analytics explores the reasons behind past
events.
C) Descriptive analytics makes recommendations, while diagnostic analytics predicts outcomes.
D) Descriptive analytics uses AI, while diagnostic analytics does not.
Answer: B) Descriptive analytics summarizes historical data, while diagnostic analytics explores the reasons behind
past events.
Thank You.