big data quiz for final
1) What are the two main processes associated with an Apache Spark application? Describe
them in detail.
2) Explain the Apache Spark architecture.
slide 9:
Research several tools for data wrangling:
• OpenRefine
• Google DataPrep
• Trifacta Wrangler
OpenRefine: Open-source tool for cleaning and transforming data. Works with formats like
CSV, TSV, and JSON. User-friendly with menu-based operations.
Google DataPrep: Cloud-based service for exploring, cleaning, and preparing data.
Provides automated transformation suggestions and anomaly detection.
Watson Studio Refinery: Part of IBM's data platform, it handles large datasets with features
like automatic type detection and compliance with data policies.
Trifacta Wrangler: Cloud-based tool for data cleaning and transformation, supporting
platforms like Excel and Tableau. It offers automated type detection and a collaborative
environment.
slide 10:
What are KNIME and Spark MLlib?
KNIME is a graphical user interface-based machine learning tool, while Spark MLlib provides a
programming-based distributed platform for scalable machine learning algorithms. This is also the
main difference between the two: KNIME is operated through a graphical interface, whereas Spark
MLlib is used through code.
For the quiz questions:
1. NOT machine learning: Explicit, step-by-step programming
2. NOT a category of machine learning: Algorithm Prediction
3. Supervised machine learning categories: Classification and regression
4. In unsupervised approaches: The target is unlabeled
5. Machine learning process sequence: Acquire -> Prepare -> Analyze -> Report -> Act
6. Process type: The first two steps, Acquire and Prepare, are apply-once, and the other
steps are iterative
7. Phase 2 of CRISP-DM (Data Understanding): We acquire as well as explore the data
related to the problem
8. Already addressed in opening statement
slide 11:
What's Wrong with Pie Charts? One type of plot that we did not cover in the Data Exploration module is
the pie chart. Pie charts are commonly used, and we see them often in newspaper articles and
business reports. However, some people think that the pie chart is fundamentally flawed. What are
some problems with the pie chart? What are some good things about pie charts? Do you agree with
the statement that pie charts should never be used? Why or why not?
Domain Knowledge in Data Preparation: Using domain knowledge to guide the data preparation
process is important. What are some specific examples where domain knowledge would be useful in
preparing data for analysis?
slide 12:
1) What is the Apriori algorithm?
2) Describe the Apriori algorithm
The Apriori algorithm is a popular method used in association rule learning to identify frequent
itemsets in a large dataset. It is commonly used in market basket analysis to find associations
between items purchased together.
● Step 1: Identify frequent individual items (1-itemsets) in the dataset that meet a minimum
support threshold.
● Step 2: Generate candidate itemsets by combining frequent itemsets from the previous
step.
● Step 3: Filter out candidate itemsets that do not meet the minimum support.
● Step 4: Repeat the process of generating and filtering until no more frequent itemsets are
found.
● Step 5: Use these frequent itemsets to generate association rules that meet a minimum
confidence threshold.
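The five steps above can be sketched in plain Python. This is a minimal illustration for study
purposes, not a reference implementation: the function names (apriori, association_rules), the
sample baskets, and the threshold values are all assumptions made up for this example. Support is
taken here as the fraction of transactions containing an itemset, and confidence of a rule A -> B
as support(A and B together) / support(A).

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    # Step 1: frequent individual items (1-itemsets).
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = {}
    k = 1
    # Step 4: repeat until a pass finds no frequent itemsets.
    while current:
        frequent.update(current)
        k += 1
        # Step 2: combine frequent (k-1)-itemsets into candidate k-itemsets.
        prev = list(current)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Step 3: keep only candidates that meet the minimum support.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
    return frequent


def association_rules(frequent, min_confidence):
    """Step 5: list (antecedent, consequent, confidence) for strong rules."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in combinations(itemset, r):
                antecedent = frozenset(antecedent)
                # Every subset of a frequent itemset is itself frequent,
                # so its support is already stored in `frequent`.
                confidence = sup / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules


# Hypothetical market-basket data, invented for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

frequent = apriori(baskets, min_support=0.6)
rules = association_rules(frequent, min_confidence=0.8)
for antecedent, consequent, confidence in rules:
    print(sorted(antecedent), "->", sorted(consequent), round(confidence, 2))
```

The extra subset check inside Step 2 relies on the property that gives Apriori its efficiency: an
itemset can only be frequent if all of its subsets are frequent, so many candidates can be
discarded before their support is ever counted.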