
Unit – 6

1) Challenges of Data Visualization


1. Poor Data Quality
 If the data is incomplete, incorrect, or outdated, the visualization will also be misleading.
 Example: A sales chart showing wrong revenue figures due to missing data.

2. Choosing the Wrong Chart Type


 Using the wrong type of graph can confuse the viewer or hide important insights.
 Example: Showing a pie chart for time-series data, where a line graph would be better.

3. Too Much Information (Cluttered Visuals)


 Adding too many elements like colors, labels, or datasets can overwhelm the viewer.
 It becomes hard to focus on the important part of the data.

4. Misleading Visuals
 Visuals can be designed in a way that misrepresents the data, either intentionally or
accidentally.
 Example: Manipulating the scale of axes to exaggerate trends.
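To make the axis-scale trick concrete, here is a minimal matplotlib sketch (all numbers are invented for illustration) showing how truncating the y-axis makes a nearly flat trend look dramatic:

```python
# A minimal sketch of a misleading truncated y-axis (synthetic data).
import matplotlib.pyplot as plt

years = [2021, 2022, 2023, 2024]
revenue = [100, 102, 104, 106]  # roughly flat: about 2% growth per year

fig, (honest, misleading) = plt.subplots(1, 2, figsize=(8, 3))

honest.bar(years, revenue)
honest.set_ylim(0, 120)          # axis starts at zero: growth looks modest
honest.set_title("Honest scale")

misleading.bar(years, revenue)
misleading.set_ylim(99, 107)     # truncated axis: same data looks dramatic
misleading.set_title("Truncated axis")

plt.tight_layout()
plt.show()
```

Both panels plot the same data; only the y-axis limits differ.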

5. Lack of Context
 Without proper titles, labels, or legends, users may not understand what the chart is about.
 A beautiful graph is useless if people can't figure out what it's showing.

6. Limited Audience Understanding


 Viewers may lack data literacy, so if the visualization is too complex, it loses meaning.
 Simpler visuals often work better for general audiences.

7. Technical Limitations
 Software or tools may not support advanced visuals or large datasets.
 Rendering issues can also arise in low-resource systems or browsers.

8. Data Privacy and Security


 Visualizing sensitive or personal data can lead to privacy issues if not handled properly.

2) Uses of the Histogram

 Understand Data Distribution
o See if the data is normally distributed, skewed, or has any gaps or spikes.
 Spot Patterns and Trends
o Easily identify where most values are concentrated (e.g., most students scoring between 60–79).
 Detect Outliers or Anomalies
o If a bar is unusually tall or short, it might indicate unusual data.
 Compare Spread and Central Tendency
o Helps visualize how spread out the values are and where the center (like the average) is.
 Support Decision Making
o Useful in education, business, healthcare, and research for data-driven decisions.
 Summarize Large Datasets Simply
o Gives a quick overview of big data without listing every number.
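As a quick illustration of these uses, here is a minimal matplotlib/NumPy sketch (the scores are synthetic) that plots an exam-score histogram with 10-mark bins:

```python
# A minimal histogram sketch with synthetic exam scores.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
scores = rng.normal(loc=68, scale=10, size=300).clip(0, 100)  # fake scores

# 10-mark bins make it easy to see where most students fall (e.g. 60-79)
plt.hist(scores, bins=range(0, 101, 10), edgecolor="black")
plt.xlabel("Score")
plt.ylabel("Number of students")
plt.title("Distribution of exam scores")
plt.show()
```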

3) Difference between Histogram and Density Plot

| Aspect | Histogram | Density Plot |
|---|---|---|
| What it is | A bar graph showing how many values fall into each range (bin). | A smooth curve showing how data is distributed across values. |
| Best for | Discrete or grouped data. | Continuous data. |
| Look | Bars (like a column chart). | A smooth wave-like curve. |
| Smoothing | No smoothing — raw count per range. | Smoothing is applied to estimate distribution. |
| Values shown | Shows actual counts (frequencies). | Shows probability density (not exact counts). |
| Customization | You can change bin size (number of bars). | You can change bandwidth (controls smoothness). |
| Comparison use | Not ideal for overlapping comparisons. | Great for comparing multiple datasets on the same plot. |
| Easy for beginners | Yes, very easy to understand. | Can be harder to understand at first. |
| Use cases | When you want to know how many values fall in each range. | When you want a smooth view of data spread or compare multiple groups. |
| Example | How many students scored between 60–70? | What’s the overall pattern of students’ scores? |
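The difference is easy to see side by side. Below is a minimal sketch (synthetic data) that draws a histogram with matplotlib and a density curve with SciPy's gaussian_kde:

```python
# A minimal histogram-vs-density sketch with synthetic data.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
scores = rng.normal(70, 12, 300)

fig, (left, right) = plt.subplots(1, 2, figsize=(8, 3))

# Histogram: raw counts per bin; shape depends on the bin size chosen
left.hist(scores, bins=15, edgecolor="black")
left.set_title("Histogram (counts per bin)")

# Density plot: smooth estimate; shape depends on the bandwidth chosen
xs = np.linspace(scores.min(), scores.max(), 200)
right.plot(xs, gaussian_kde(scores)(xs))
right.set_title("Density plot (smoothed)")

plt.tight_layout()
plt.show()
```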

4) Types of Data Visualization


1. Multidimensional: 2D Area (Maps)
These types use maps to show data from different locations (like cities, states, countries).
 Cartogram: A type of map where the size of the country or state changes based on some
value.
For example, a country with a large population will look bigger on the map.
 Choropleth Map: A map with different colors to show values.
For example, dark red areas might mean more people, light colors mean fewer.
 Dot Distribution Map: Dots are placed on a map to show where something is happening.
For example, one dot = 100 people or 1 hospital.

2. Temporal (Time-based Data)


These charts show how data changes over time.
 Pie Chart: A circular chart divided into slices to show part-to-whole relationships.
🔹 Not great for showing changes over time.
 Histogram: Bars grouped into ranges (called bins) to show frequency of values.
🔹 E.g., how many people are aged between 10–20, 20–30, etc.
 Scatter Plot: Shows all data points on a graph—great for seeing relationships between two
values.
🔹 E.g., Study hours vs marks.
📝 Note: The Pie Chart listed under temporal here is not technically temporal—it's more about
proportions. Histograms and scatter plots are better examples of temporal or continuous data
visualizations.

3. Hierarchical (Tree-like Structures)


Used to represent data with levels, like folders, family trees, or company structure.
 Dendrogram: Looks like a tree with branches showing clusters or groups of related data.
🔹 Used in data science and biology.
 Ring Chart: Like a Pie Chart, but with multiple levels, showing deeper categories.
🔹 Also called Sunburst chart.
 Tree Diagram: Shows data with parent-child relationships from top to bottom or left to right.
🔹 Used for decision trees, family trees.

4. Network (Connections between items)


These show relationships or links between data points.
 Alluvial Diagram: A flow diagram that shows how things change over time.
🔹 E.g., How students move from one subject group to another over years.
 Node-Link Diagram: Circles (nodes) connected by lines (links), showing relationships.
🔹 Used in social networks, IT networks.
 Matrix: A table-like chart that shows how different items are connected to each other.
🔹 E.g., Person A is connected to Person B and C.
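As an illustration of node-link diagrams and matrices, here is a minimal sketch using the networkx library (the people and connections are invented):

```python
# A minimal node-link diagram sketch with networkx and matplotlib.
import networkx as nx
import matplotlib.pyplot as plt

G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")])

# Node-link view: circles (nodes) connected by lines (links)
nx.draw(G, with_labels=True, node_color="lightblue", node_size=800)
plt.show()

# The same relationships as a matrix: 1 means the two people are connected
print(nx.to_numpy_array(G))
```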

5) Applications of Data Visualization

 Understanding Data Easily
o Helps turn large datasets into simple visuals like charts and graphs.
o Makes it easier to see patterns, trends, and outliers.
o 🧠 Example: A line graph showing rising temperatures over years.
 Detecting Patterns and Trends
o Visual tools like line graphs and scatter plots show how values are increasing, decreasing, or staying the same.
o 📈 Example: Sales growth over the last 12 months.
 Finding Errors or Outliers
o Helps spot incorrect or extreme data points quickly.
o ⚠️ Example: A bar in a chart that is unusually high or low compared to others.
 Exploring Data Before Modeling (EDA)
o Exploratory Data Analysis (EDA) uses visuals to understand the data before creating any models.
o 🔍 Example: Box plots used to see how data is spread and where the median lies.
 Making Machine Learning Results Understandable
o After building a machine learning model, visualizations can show how accurate it is.
o 🧪 Example: A confusion matrix to show classification accuracy (see the sketch after this list).
 Comparing Multiple Variables
o Visuals help compare more than one factor at a time (multivariate analysis).
o 🧮 Example: Bubble chart comparing sales, profit, and region.
 Making Reports and Dashboards
o Used to create dashboards for business users and clients to make data-driven decisions.
o Example: Power BI or Tableau dashboards showing key performance indicators (KPIs).
 Telling a Story
o Helps present a "story" from raw data so that non-technical people can understand it.
o 🎯 Example: A story of how customer satisfaction improved after a service change.
 Decision Making
o Visuals help managers, scientists, and engineers make better decisions by understanding data faster.
o 📌 Example: Choosing the best product based on customer review data.
 Monitoring Real-Time Data
o Dashboards can show live data updates in industries like finance, health, and transport.
o Example: Live COVID-19 tracker map showing current cases by location.
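As referenced above, here is a minimal scikit-learn sketch of a confusion matrix (the labels are invented):

```python
# A minimal confusion-matrix sketch with invented labels.
from sklearn.metrics import confusion_matrix

y_true = ["spam", "spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham",  "ham", "ham", "spam", "spam"]

# Rows = actual class, columns = predicted class
print(confusion_matrix(y_true, y_pred, labels=["spam", "ham"]))
# [[2 1]
#  [1 2]]
```

The diagonal counts correct predictions; off-diagonal cells show where the model confused one class for the other.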

6) Overview of Pentaho
Pentaho is a comprehensive open-source data integration and business analytics platform. It helps
collect data from various sources, process it, and visualize it to support business decision-making.

🔧 Key Components


1. Data Sources (Bottom Layer)
Pentaho can pull data from multiple sources:
 Operational Data: Traditional databases used in day-to-day operations.
 Big Data: Large-scale data stored in distributed systems like Hadoop.
 Public/Private Clouds: Cloud storage and applications.
 Data Streams: Real-time data flows from sources like sensors or IoT devices.
These sources are supported because Pentaho is:
 ✅ 100% Java-based (cross-platform compatibility),
 ✅ Has open web-based APIs (easy integration with other systems),
 ✅ Uses a pluggable architecture (modular and extendable).

2. Core Functions (Middle Layer)


 Data Integration: Combines data from different sources using ETL (Extract, Transform, Load) processes (see the sketch at the end of this section).
 Visual Analytics: Allows interactive visual representation of data (charts, maps, dashboards) for better insights.
 Predictive Analytics: Uses statistical models and machine learning to predict future trends based on historical data.
These three components work together and support each other, enabling users to:
 Integrate,
 Analyze visually,
 Predict and plan effectively.
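As referenced above, here is a minimal sketch of the ETL idea using pandas. It illustrates the concept only; it is not Pentaho's own API, and the file names and columns are hypothetical:

```python
# A minimal Extract-Transform-Load sketch with pandas (not Pentaho's API;
# file names and columns are hypothetical).
import pandas as pd

# Extract: read raw data from an operational source (here, a CSV file)
raw = pd.read_csv("daily_sales.csv")          # columns: date, region, amount

# Transform: clean and reshape it for analysis
raw["date"] = pd.to_datetime(raw["date"])
summary = (raw.dropna(subset=["amount"])
              .groupby("region", as_index=False)["amount"].sum())

# Load: write the result to the target store (here, another file)
summary.to_csv("sales_by_region.csv", index=False)
```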
7) Hadoop Ecosystem
 HDFS: Hadoop Distributed File System
 YARN: Yet Another Resource Negotiator
 MapReduce: Programming based Data Processing
 Spark: In-Memory data processing
 PIG, HIVE: Query based processing of data services
 HBase: NoSQL Database
 Mahout, Spark MLLib: Machine Learning algorithm libraries
 Solr, Lucene: Searching and Indexing
 Zookeeper: Managing cluster
 Oozie: Job Scheduling
HDFS (Hadoop Distributed File System)
 Main job: Store huge amounts of data across many computers.
 Two parts:
o NameNode: Stores information about files (like file names, sizes, where they’re
stored).
o DataNode: Stores the actual data.

 Data is split and saved in many computers to make storage cheap and reliable.
 HDFS is the core of Hadoop—it manages where and how the data is stored.

⚙️YARN (Yet Another Resource Negotiator)


 Main job: Manages CPU, memory, and other resources across the system.
 Three parts:
o Resource Manager: Decides who gets how much memory or CPU.

o Node Manager: Checks how much is used in each computer and reports it.

o Application Manager: Connects the other two and handles tasks.

 Basically, YARN helps Hadoop run smoothly by managing system resources.


🧠 MapReduce
 Main job: Process big data using two steps:
o Map(): Filters, sorts, and groups the data into key-value pairs.

o Reduce(): Takes those pairs and summarizes them.

 Helps in breaking big tasks into smaller tasks and solving them faster.
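A minimal Python sketch of the same idea (no real cluster; the shuffle step is simulated with a dictionary, in the style of a Hadoop Streaming word count):

```python
# A minimal word-count sketch of the MapReduce idea (simulated locally).
from collections import defaultdict

def map_phase(line):
    # Map(): emit (key, value) pairs -- here, (word, 1) for every word
    for word in line.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    # Reduce(): summarize all values that share the same key
    return word, sum(counts)

lines = ["big data is big", "data is everywhere"]

shuffled = defaultdict(list)
for line in lines:
    for word, one in map_phase(line):
        shuffled[word].append(one)              # the "shuffle" groups by key

for word in shuffled:
    print(reduce_phase(word, shuffled[word]))   # e.g. ('big', 2)
```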

🐷 Pig
 Made by Yahoo to make data processing easier using a language called Pig Latin (like
SQL).
 You write commands in Pig Latin, and it runs MapReduce in the background.
 Simple to use, great for analyzing large data.

🐝 Hive
 A data warehouse system that uses HQL (Hive Query Language), which is like SQL.
 Helps in querying and analyzing large data stored in Hadoop.
 Designed for batch processing and large-scale analysis rather than real-time transactions.
 Works with tools like JDBC and command-line interfaces.

🧠 Mahout
 Adds machine learning to Hadoop.
 Can do tasks like:
o Clustering (grouping things),

o Classification (labeling things),

o Recommendation (like YouTube suggesting videos).

 It has built-in libraries to make learning from data easy.

⚡ Apache Spark
 A fast engine for processing large data.
 Uses memory to work faster than MapReduce.
 Can handle:
o Real-time data,
o Graph data,

o Machine learning,

o Batch jobs.

 Often used with Hadoop.
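A minimal PySpark sketch (assuming the pyspark package is installed; the input file and its columns are hypothetical):

```python
# A minimal PySpark sketch (assumes `pyspark` is installed; the file
# sales.csv and its columns are hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()

# Spark keeps data in memory across steps, which is what makes it
# faster than disk-based MapReduce for iterative work.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)
df.groupBy("region").sum("amount").show()

spark.stop()
```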

HBase
 A NoSQL database that works on top of HDFS.
 Good for quick searching and retrieving small bits of data from big datasets.
 Useful when you need fast access to specific information in a huge database.

🔎 Solr and Lucene


 Tools used for searching data.
 Lucene is a Java library that helps with search, spell check, and indexing.
 Solr is built on Lucene and adds more features.

🐘 Zookeeper
 Keeps all parts of Hadoop in sync and coordinated.
 Handles communication and coordination between the many parts of the system.
 Avoids confusion and errors when lots of components work together.

⏰ Oozie
 A job scheduler for Hadoop.
 It runs jobs in order (workflow jobs) or based on a trigger (coordinator jobs).
 Helps to organize and manage multiple tasks easily.

8) Apache Pig Architecture


Apache Pig is used to process large sets of data using a special language called Pig Latin.
🔄 Step-by-Step Flow:
1. Pig Latin Scripts
o You write instructions using Pig Latin (a language like SQL).

o These scripts tell Pig what to do with the data.

2. Grunt Shell / Pig Server


o Grunt Shell: It's a command-line interface where you can type Pig commands
directly.
o Pig Server: Used when you run Pig from a program (like Java or a script).

3. Parser
o This reads and checks your script to make sure it is written correctly.

o It breaks the script into smaller parts and makes a logical plan of what to do.
4. Optimizer
o Makes your task faster and better by improving the plan.

o It tries to use the best way to run your Pig script (this is called optimization).

5. Compiler
o Converts the optimized plan into a format that the computer understands.

o This format is based on MapReduce (a way to process big data in Hadoop).

6. Execution Engine
o Now, the plan is ready and it's time to run it.

o The engine executes the plan using MapReduce jobs.

7. MapReduce
o Pig doesn’t process data directly. Instead, it sends jobs to MapReduce, which does
the actual heavy work.
8. HDFS (Hadoop Distributed File System)
o The final results are saved in HDFS, which is where Hadoop stores all big data.
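To tie the flow together, here is a hedged sketch that writes a classic word-count script in Pig Latin and runs it from Python in local mode (it assumes Apache Pig is installed and on the PATH; the file names are hypothetical):

```python
# A minimal sketch: write a Pig Latin word-count script and run it in
# local mode (assumes the `pig` command is installed and on the PATH).
import pathlib
import subprocess

# Hypothetical Pig Latin script: load a text file and count words.
script = """
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group, COUNT(words);
STORE counts INTO 'word_counts';
"""
pathlib.Path("wordcount.pig").write_text(script)

# Run with -x local so no Hadoop cluster is required; Pig parses,
# optimizes, compiles, and executes the script for us.
subprocess.run(["pig", "-x", "local", "wordcount.pig"], check=True)
```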

9) Apache Hive
Apache Hive is a tool that helps you store, read, and analyze big data that is saved in Hadoop’s
storage system (HDFS).

🔹 Why Use Hive?


 It makes it easy to work with large data using SQL-like commands (called HiveQL).
 Even if the data is very big and spread across many machines (using Hadoop), Hive helps you
handle it like you're using a normal database.
 It is not made for daily transactions (like banking), but perfect for analyzing large amounts
of data.

🔹 Who Created It?


 Hive was first developed by Facebook.
 Companies like Amazon and Netflix also use it to handle huge datasets.

🔹 What Can You Do With Hive?


 Query Big Data using simple SQL commands.
 ETL (Extract, Transform, Load): Bring data from different sources, clean and change it,
then store it properly.
 Do data analysis and create summaries, charts, or reports.

🔹 How Does It Work?


1. You write a query using HiveQL (similar to SQL).
2. Hive converts your query into MapReduce jobs.
3. Hadoop runs these jobs to fetch and process the data stored in HDFS.
4. You get the result, just like using SQL on a database.
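A minimal sketch of this workflow from Python, assuming the third-party pyhive package and a HiveServer2 instance reachable at localhost:10000 (both assumptions; the sales table is hypothetical):

```python
# A minimal sketch (assumes the `pyhive` package and a HiveServer2
# instance at localhost:10000; the `sales` table is hypothetical).
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, username="analyst")
cur = conn.cursor()

# HiveQL looks like SQL; Hive translates it into jobs that run on Hadoop
cur.execute("""
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region
""")
for region, total in cur.fetchall():
    print(region, total)

cur.close()
conn.close()
```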

🔹 Features of Hive:
 Easy to use: Uses SQL-like language (HiveQL)
 Can handle huge datasets
 Supports partitions to break big tables into smaller parts for faster queries
 Allows custom functions (UDFs)
 Optimizes queries to run faster
