Unit 6
4. Misleading Visuals
Visuals can be designed in a way that misrepresents the data, either intentionally or
accidentally.
Example: Manipulating the scale of axes to exaggerate trends.
5. Lack of Context
Without proper titles, labels, or legends, users may not understand what the chart is about.
A beautiful graph is useless if people can't figure out what it's showing.
7. Technical Limitations
Software or tools may not support advanced visuals or large datasets.
Rendering issues can also arise in low-resource systems or browsers.
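To make item 4 (Misleading Visuals) concrete, the small sketch below works out how much taller one bar looks than another when the y-axis starts at zero, and again when the axis is cut off just below the data. The values 50 and 52 and the axis start of 49 are made-up numbers for illustration only.

```python
def drawn_height(value, axis_start):
    """Height of a bar as drawn when the y-axis begins at axis_start."""
    return value - axis_start


def visual_ratio(a, b, axis_start=0.0):
    """How many times taller bar b looks than bar a on the chart."""
    return drawn_height(b, axis_start) / drawn_height(a, axis_start)


# Hypothetical values: 50 and 52 units (a real increase of only 4%).
honest = visual_ratio(50, 52, axis_start=0)      # y-axis starts at 0
truncated = visual_ratio(50, 52, axis_start=49)  # y-axis starts at 49

print(f"Axis from 0:  second bar looks {honest:.2f}x as tall")     # 1.04x
print(f"Axis from 49: second bar looks {truncated:.2f}x as tall")  # 3.00x
```

The data is the same in both cases; only the axis scale changed, yet the second chart makes a 4% difference look like a threefold jump.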
Aspect | Histogram | Density Plot
What it is | A bar graph showing how many values fall into each range (bin). | A smooth curve showing how data is distributed across values.
Customization | You can change bin size (number of bars). | You can change bandwidth (controls smoothness).
Comparison use | Not ideal for overlapping comparisons. | Great for comparing multiple datasets.
Easy for beginners | Yes, very easy to understand. | Can be harder to understand at first.
Use cases | When you want to know how many values fall in each range. | When you want a smooth view of data spread or compare multiple groups.
Example | How many students scored between 60–70? | What's the overall pattern of students' scores?
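The difference between the two can also be seen in a few lines of code. The sketch below is only an illustration: the scores are made up, the function names are hypothetical, and the density curve uses a simple Gaussian kernel, which is one common choice among several. It counts values per bin (histogram) and then evaluates a smooth curve at a few points (density plot).

```python
import math


def histogram_counts(values, low, high, bins):
    """Count how many values fall into each of `bins` equal-width ranges."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if low <= v <= high:
            # Clamp so that v == high lands in the last bin.
            index = min(int((v - low) / width), bins - 1)
            counts[index] += 1
    return counts


def density_estimate(values, x, bandwidth):
    """Gaussian kernel density estimate at point x.

    A larger bandwidth gives a smoother (flatter) curve.
    """
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)


# Hypothetical exam scores.
scores = [52, 58, 61, 64, 65, 67, 69, 71, 73, 74, 78, 82, 85, 91]

# Histogram view: "How many students scored between 60 and 70?"
counts = histogram_counts(scores, low=50, high=100, bins=5)
print(counts)  # [2, 5, 4, 2, 1] -> 5 students in the 60-70 bin

# Density view: "What is the overall pattern of the scores?"
for x in range(50, 101, 10):
    print(x, round(density_estimate(scores, x, bandwidth=5.0), 4))
```

Changing `bins` changes the number of bars; changing `bandwidth` changes how smooth the curve is.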
6) Overview of Pentaho
Pentaho is a comprehensive open-source data integration and business analytics platform. It helps
collect data from various sources, process it, and visualize it to support business decision-making.
Data is split into blocks and stored across many computers, which keeps storage cheap and reliable.
HDFS is the storage core of Hadoop: it manages where and how the data is stored.
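The idea of splitting data and keeping copies on several computers can be sketched as follows. This is a simplified illustration, not how HDFS is implemented: the node names, the tiny block size, and the round-robin placement are all assumptions made for the example. Real HDFS uses much larger blocks (128 MB by default in recent versions), keeps 3 copies by default, and uses a rack-aware placement policy.

```python
def split_into_blocks(data, block_size):
    """Cut the data into fixed-size blocks (the last one may be smaller)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes (simple round-robin)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


data = b"x" * 1000                      # hypothetical 1000-byte file
blocks = split_into_blocks(data, block_size=256)
nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical machines
placement = place_replicas(len(blocks), nodes, replication=3)

for block_id, holders in placement.items():
    print(f"block {block_id} ({len(blocks[block_id])} bytes) -> {holders}")
```

Because every block lives on three different machines, losing any single machine still leaves at least two copies of each block, which is what makes the storage reliable.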
o Node Manager: Monitors how much of each computer's resources (such as memory and CPU) is in use and reports it to the Resource Manager.
Helps in breaking big tasks into smaller tasks and solving them faster.
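Breaking a big task into smaller tasks is the idea behind the MapReduce model that the Pig and Spark entries below refer to. The sketch below shows that idea with a word count in plain Python; it runs on one machine and only imitates the three phases (map, shuffle, reduce). The input text and function names are made up for illustration.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values for each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical input, already split into chunks (the "smaller tasks").
chunks = ["big data is big", "data is everywhere"]

mapped = []
for chunk in chunks:  # on a cluster, each chunk could be mapped on a different machine
    mapped.extend(map_phase(chunk))

word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```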
🐷 Pig
Developed at Yahoo to make data processing easier through a scripting language called Pig Latin (similar in spirit to SQL).
You write commands in Pig Latin, and Pig translates them into MapReduce jobs that run in the background.
Simple to use, great for analyzing large data.
🐝 Hive
A data warehouse system that uses HQL (Hive Query Language), which is like SQL.
Helps in querying and analyzing large data stored in Hadoop.
Geared mainly toward batch processing; interactive queries are possible, but Hive is not designed for low-latency real-time workloads.
Can be accessed through JDBC drivers and command-line interfaces.
🧠 Mahout
Adds machine learning to Hadoop.
Can do tasks like:
o Clustering (grouping things),
⚡ Apache Spark
A fast engine for processing large data.
Keeps working data in memory, which makes it much faster than MapReduce for many workloads.
Can handle:
o Real-time data,
o Graph data,
o Machine learning,
o Batch jobs.
HBase
A NoSQL database that works on top of HDFS.
Good for quick searching and retrieving small bits of data from big datasets.
Useful when you need fast access to specific information in a huge database.
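The kind of access HBase is good at (fetch one row by its key, or a small range of neighbouring keys) can be sketched as below. This is only a toy in-memory model, not the HBase API: the class name, method names, and row keys are hypothetical. It relies on one fact about HBase, which is that rows are kept sorted by row key.

```python
import bisect


class SortedKeyStore:
    """Toy key-value store that keeps row keys sorted, loosely like HBase."""

    def __init__(self):
        self._keys = []    # row keys in sorted order
        self._values = {}  # row key -> stored value

    def put(self, row_key, value):
        if row_key not in self._values:
            bisect.insort(self._keys, row_key)
        self._values[row_key] = value

    def get(self, row_key):
        """Fetch a single row directly by key."""
        return self._values.get(row_key)

    def scan(self, start_key, stop_key):
        """Return rows with start_key <= key < stop_key, found by binary search."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


store = SortedKeyStore()
for user_id in (7, 42, 43, 900):  # hypothetical rows
    store.put(f"user#{user_id:04d}", {"id": user_id})

print(store.get("user#0042"))
print(store.scan("user#0040", "user#0100"))
```

Neither lookup has to read the whole dataset, which is the point of using a store like this for small, specific reads.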
🐘 ZooKeeper
Keeps the distributed parts of a Hadoop cluster in sync.
Handles coordination between the many components of the system, such as shared configuration and synchronization.
Avoids confusion and errors when lots of components work together.
⏰ Oozie
A job scheduler for Hadoop.
It runs jobs in a defined order (workflow jobs) or when a trigger fires, such as a scheduled time or newly available data (coordinator jobs).
Helps to organize and manage multiple tasks easily.
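Running jobs "in order" means that a job starts only after the jobs it depends on have finished. The sketch below shows that idea using Python's standard `graphlib` module (Python 3.9+). It is not Oozie's API (real Oozie workflows are defined in XML files); the job names and the `run_workflow` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs that must finish first.
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "report": {"aggregate"},
    "archive": {"clean"},
}


def run_workflow(dependencies, run_job):
    """Run every job once, only after all of its prerequisites have run."""
    order = []
    for job in TopologicalSorter(dependencies).static_order():
        run_job(job)
        order.append(job)
    return order


executed = run_workflow(workflow, run_job=lambda name: print("running", name))
```

A coordinator-style trigger would sit one level above this: it decides when to start the whole workflow, for example at a fixed time each day.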
3. Parser
o This reads and checks your script to make sure it is written correctly.
o It breaks the script into smaller parts and makes a logical plan of what to do.
4. Optimizer
o Makes your task faster and better by improving the plan.
o It tries to use the best way to run your Pig script (this is called optimization).
5. Compiler
o Converts the optimized plan into a series of MapReduce jobs that Hadoop can run.
6. Execution Engine
o Submits the compiled jobs to Hadoop and runs them.
7. MapReduce
o Pig doesn’t process data directly. Instead, it sends jobs to MapReduce, which does
the actual heavy work.
8. HDFS (Hadoop Distributed File System)
o The final results are saved in HDFS, which is where Hadoop stores all big data.
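The stages above (parser, optimizer, compiler, execution engine) can be imitated with a very small toy pipeline. Everything here is an assumption made for illustration: the two-command mini-language (`FILTER_MIN`, `SORT`) is not Pig Latin, the function names are hypothetical, and the "execution engine" simply runs Python functions instead of submitting MapReduce jobs.

```python
def parse(script):
    """Parser: check each statement and build a logical plan (list of steps)."""
    plan = []
    for line in script.strip().splitlines():
        op, _, arg = line.strip().rstrip(";").partition(" ")
        if op not in {"FILTER_MIN", "SORT"}:
            raise SyntaxError(f"unknown operation: {op!r}")
        plan.append((op, arg))
    return plan


def optimize(plan):
    """Optimizer: move filters ahead of sorts so that less data gets sorted."""
    filters = [step for step in plan if step[0] == "FILTER_MIN"]
    others = [step for step in plan if step[0] != "FILTER_MIN"]
    return filters + others


def compile_plan(plan):
    """Compiler: turn each logical step into an executable function."""
    stages = []
    for op, arg in plan:
        if op == "FILTER_MIN":
            threshold = int(arg)
            stages.append(lambda rows, t=threshold: [r for r in rows if r >= t])
        else:  # SORT
            reverse = arg == "DESC"
            stages.append(lambda rows, rev=reverse: sorted(rows, reverse=rev))
    return stages


def execute(stages, rows):
    """Execution engine: run the stages in order (stand-in for MapReduce jobs)."""
    for stage in stages:
        rows = stage(rows)
    return rows


script = """
SORT DESC;
FILTER_MIN 60;
"""
plan = optimize(parse(script))
result = execute(compile_plan(plan), [45, 72, 60, 88, 31])
print(plan)    # [('FILTER_MIN', '60'), ('SORT', 'DESC')]
print(result)  # [88, 72, 60]
```

The optimizer changes only the order of the steps, not the answer: the filtered-then-sorted result is the same as the sorted-then-filtered one, but fewer rows need sorting.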
9) Apache Hive
Apache Hive is a tool that helps you store, read, and analyze big data that is saved in Hadoop’s
storage system (HDFS).
🔹 Features of Hive:
Easy to use: Uses an SQL-like language (HiveQL)
Can handle huge datasets
Supports partitions to break big tables into smaller parts for faster queries
Allows custom functions (UDFs)
Optimizes queries to run faster
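Why partitions make queries faster can be shown with a small sketch. It is a plain-Python imitation, not HiveQL: the sales records, the choice of `year` as the partition column, and the function names are all made up. In Hive each partition is stored as its own directory in HDFS, so a query that filters on the partition column can skip the other directories entirely.

```python
from collections import defaultdict


def build_partitions(rows, partition_key):
    """Group rows by the partition column, like one directory per value."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)
    return partitions


def query_full_scan(rows, year):
    """Without partitions: every row must be inspected."""
    scanned = len(rows)
    matches = [r for r in rows if r["year"] == year]
    return matches, scanned


def query_partitioned(partitions, year):
    """With partitions: only the matching partition is read."""
    matches = list(partitions.get(year, []))
    return matches, len(matches)


# Hypothetical sales records.
sales = [
    {"year": 2022, "amount": 120},
    {"year": 2022, "amount": 80},
    {"year": 2023, "amount": 200},
    {"year": 2024, "amount": 150},
    {"year": 2024, "amount": 95},
    {"year": 2024, "amount": 60},
]
by_year = build_partitions(sales, "year")

full_result, full_scanned = query_full_scan(sales, 2023)
part_result, part_scanned = query_partitioned(by_year, 2023)
print(f"full scan read {full_scanned} rows, partitioned read {part_scanned}")
```

Both queries return the same rows, but the partitioned one reads only the data for the requested year.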