Big Data Quiz for Final

The document outlines key components of Apache Spark, detailing the roles of the Driver and Executor processes, as well as the architecture involving the Cluster Manager and RDDs. It also reviews various data wrangling tools such as OpenRefine and Google DataPrep, and compares KNIME with Spark MLlib in terms of their interfaces and functionalities. Additionally, it discusses the pros and cons of pie charts in data visualization and explains the Apriori algorithm used for association rule learning.


Slide 8:

1) What are the two main processes associated with an Apache Spark application? Describe them in detail.
2) Explain the Apache Spark architecture.

1) Two Main Processes in an Apache Spark Application:


● Driver Process:
○ The Driver is the central coordinator of a Spark application.
○ It is responsible for translating the user's code into tasks, distributing them
across the cluster, and collecting the results.
○ It maintains information about the Spark application and responds to the
user’s program.
● Executor Processes:
○ Executors are distributed processes on the cluster nodes.
○ They run the tasks assigned by the driver and return the results.
○ Executors also provide in-memory storage for RDDs that are cached by user
programs through Spark’s APIs.

2) Apache Spark Architecture:


● Driver: The heart of the application, responsible for task scheduling, result collection,
and communicating with the cluster manager.
● Cluster Manager: Manages the cluster resources and works with the driver to
schedule tasks. Common cluster managers include Standalone, YARN, and Mesos.
● Executors: Launched by the cluster manager, they execute tasks and store data for
the application. Each application has its own set of executors.
● Tasks: Units of work sent to the executors by the driver. Each task performs
operations on a partition of the data.
● RDD (Resilient Distributed Dataset): The fundamental data structure in Spark, representing distributed collections of objects. Operations on RDDs are transformed into a directed acyclic graph (DAG) of stages; a short sketch follows this list.
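
To make the architecture concrete, here is a minimal PySpark sketch (the app name, data, and the local[*] master are illustrative assumptions, not from the slides): the driver builds an RDD and a DAG of lazy transformations, and an action triggers tasks on the executors.

# A minimal PySpark sketch of the architecture above. The driver runs
# this script; the cluster manager (here "local[*]", an in-process
# local cluster) launches the executors.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("architecture-demo")
         .master("local[*]")
         .getOrCreate())
sc = spark.sparkContext  # driver-side entry point

# The driver defines an RDD split into 4 partitions; each partition
# is processed by a task running inside an executor.
rdd = sc.parallelize(range(1_000_000), numSlices=4)

# Transformations only extend the DAG; nothing executes yet.
even_squares = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
even_squares.cache()  # ask executors to keep these partitions in memory

# An action makes the driver schedule tasks on the executors and
# collect the result.
print(even_squares.count())

spark.stop()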

Slide 9:
Research several tools for data wrangling:

● OpenRefine
● Google DataPrep
● Watson Studio Refinery
● Trifacta Wrangler

OpenRefine: Open-source tool for cleaning and transforming data. Works with formats like
CSV, TSV, and JSON. User-friendly with menu-based operations.
Google DataPrep: Cloud-based service for exploring, cleaning, and preparing data.
Provides automated transformation suggestions and anomaly detection.
Watson Studio Refinery: Part of IBM's data platform, it handles large datasets with features
like automatic type detection and compliance with data policies.
Trifacta Wrangler: Cloud-based tool for data cleaning and transformation, supporting
platforms like Excel and Tableau. It offers automated type detection and a collaborative
environment.
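
These tools automate cleanup steps that can also be written by hand. As a point of comparison, here is a small pandas sketch (pandas, the file name, and the column names are assumptions for illustration, not features of any tool above) of typical wrangling operations: text normalization, type coercion, and deduplication.

# Illustrative only: hand-written versions of the cleanup steps the
# GUI wrangling tools automate. File and column names are hypothetical.
import pandas as pd

df = pd.read_csv("sales.csv")

# Normalize a text column: trim whitespace, standardize casing.
df["city"] = df["city"].str.strip().str.title()

# Coerce a column that should be numeric; bad values become NaN.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Drop exact duplicate rows and rows missing the key field.
df = df.drop_duplicates().dropna(subset=["order_id"])

df.to_csv("sales_clean.csv", index=False)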

Slide 10:
What are KNIME and Spark MLlib?
The main difference between KNIME and Spark MLlib is that KNIME is a graphical user interface-based machine learning tool, while Spark MLlib provides a programming-based distributed platform for scalable machine learning algorithms.
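
To make the contrast concrete, here is a minimal Spark MLlib sketch (the toy data is made up): the kind of pipeline a KNIME user assembles by dragging nodes onto a canvas is expressed here entirely in code.

# A minimal Spark MLlib sketch. What KNIME builds as a visual workflow
# is written as code; the training data is made up for illustration.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

train = spark.createDataFrame(
    [(1.0, 2.0, 0.0), (2.0, 1.0, 0.0), (5.0, 6.0, 1.0), (6.0, 5.0, 1.0)],
    ["x1", "x2", "label"],
)

# Assemble the feature columns into a vector, then fit a classifier;
# both stages run distributed across the executors.
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(train)

model.transform(train).select("x1", "x2", "prediction").show()
spark.stop()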
For the quiz questions:
1. NOT machine learning: Explicit, step-by-step programming
2. NOT a category of machine learning: Algorithm Prediction
3. Supervised machine learning categories: Classification and regression
4. In unsupervised approaches: The target is unlabeled
5. Machine learning process sequence: Acquire -> Prepare -> Analyze -> Report -> Act
6. Process type: The first two steps, Acquire and Prepare, are apply-once, and the other
steps are iterative
7. Phase 2 of CRISP-DM Data Understanding: We acquire as well as explore the data
related to the problem
8. Already addressed in the opening statement above.

Slide 11:
What's Wrong with Pie Charts? One type of plot that we did not cover in the Data Exploration module is the pie chart. Pie charts are commonly used, and we see them often in newspaper articles and business reports. However, some people think that the pie chart is fundamentally flawed. What are some problems with the pie chart? What are some good things about pie charts? Do you agree with the statement that pie charts should never be used? Why or why not?

Domain Knowledge in Data Preparation: Using domain knowledge to guide the data preparation process is important. What are some specific examples where domain knowledge would be useful in preparing data for analysis?

Problems with Pie Charts:


● Hard to Compare Slices: Difficult to distinguish similar-sized slices.
● Limited Categories: Not suitable for many categories, causing clutter.
● Misleading: Can exaggerate or minimize differences.

Good Things About Pie Charts:


● Simple: Easy to understand proportions at a glance.
● Familiar: Commonly recognized by general audiences.

Should Pie Charts Be Avoided?


● Not Always: Pie charts are effective for simple proportions with few categories, but bar charts are better for more complex comparisons (see the sketch below).
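
A quick matplotlib sketch (the values are made up) plots the same data as a pie chart and as a bar chart; the side-by-side comparison makes the "hard to compare slices" problem easy to see.

# Illustrative only: the same made-up shares as a pie and a bar chart.
# Near-equal slices look alike on the pie; the bar chart separates them.
import matplotlib.pyplot as plt

labels = ["A", "B", "C", "D"]
shares = [27, 25, 24, 24]  # deliberately similar values

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
ax1.pie(shares, labels=labels, autopct="%1.0f%%")
ax1.set_title("Pie: slices look alike")
ax2.bar(labels, shares)
ax2.set_title("Bar: differences are visible")
plt.tight_layout()
plt.show()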

Domain Knowledge in Data Preparation:


● Missing Data: Guides whether to impute or exclude values.
● Transformations: Informs appropriate data transformations.
● Outlier Detection: Helps identify true outliers.
● Feature Engineering: Enables creating meaningful features (a short sketch follows this list).
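
As a concrete illustration, here is a small pandas sketch (the dataset, column names, and thresholds are hypothetical) showing domain knowledge encoded as data-preparation rules:

# Illustrative only: domain knowledge written as data-prep rules.
# The file, columns, and thresholds are hypothetical.
import numpy as np
import pandas as pd

df = pd.read_csv("patients.csv")

# Domain rule: a resting heart rate outside 20-250 bpm is a recording
# error, not a genuine outlier, so treat it as missing.
df.loc[~df["heart_rate"].between(20, 250), "heart_rate"] = np.nan

# Domain rule: income is typically right-skewed, so log-transform it.
df["log_income"] = np.log1p(df["income"])

# Domain-driven feature engineering: derive BMI from weight and height.
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2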

Slide 12:
1) What is the Apriori algorithm?
2) Describe the Apriori algorithm

1) What is the Apriori Algorithm?

The Apriori algorithm is a popular method used in association rule learning to identify frequent itemsets in a large dataset. It is commonly used in market basket analysis to find associations between items purchased together.

2) Describe the Apriori Algorithm:

● Step 1: Identify frequent individual items (itemsets) in the dataset that meet a minimum
support threshold.
● Step 2: Generate candidate itemsets by combining frequent itemsets from the previous
step.
● Step 3: Filter out candidate itemsets that do not meet the minimum support.
● Step 4: Repeat the process of generating and filtering until no more frequent itemsets are
found.
● Step 5: Use these frequent itemsets to generate association rules that meet a minimum confidence threshold. (A code sketch of Steps 1-4 follows this list.)
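
Here is a compact, illustrative Python sketch of Steps 1-4 (the transactions and support threshold are made up, and rule generation in Step 5 is omitted for brevity); it is a teaching sketch, not an optimized implementation.

# A compact sketch of the Apriori steps above. Transactions and the
# support threshold are made up; Step 5 (rule generation) is omitted.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
min_support = 3  # minimum number of transactions containing an itemset

def support(itemset):
    # Count transactions that contain every item in the itemset.
    return sum(1 for t in transactions if itemset <= t)

# Step 1: frequent 1-itemsets that meet the minimum support.
items = {item for t in transactions for item in t}
levels = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

# Steps 2-4: combine frequent itemsets into larger candidates, keep
# those meeting minimum support, repeat until no new frequent itemsets.
k = 2
while levels[-1]:
    candidates = {a | b for a in levels[-1] for b in levels[-1] if len(a | b) == k}
    levels.append({c for c in candidates if support(c) >= min_support})
    k += 1

for level in levels[:-1]:  # the last level is empty
    for itemset in sorted(level, key=sorted):
        print(sorted(itemset), "support =", support(itemset))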
