DSBDA Unit 6

The document provides an overview of data visualization, including its definition, challenges, applications, and various types of visualizations such as bar charts, line charts, and scatter plots. It also discusses common data visualization tools like Tableau and Power BI, and outlines analytical techniques used in big data visualization, including descriptive, diagnostic, and predictive analytics. Additionally, it covers the Hadoop ecosystem, detailing its core components and tools like MapReduce, Pig, and Hive.


Rohit

Unit 6: Imp + PYQs


1. Data Visualization and its Challenges, Applications, Types of Visualization.
1. What is Data Visualization?
Data Visualization is the graphical representation of data and information using visual elements like charts, graphs,
maps, and dashboards. It helps in understanding trends, outliers, and patterns in data.

2. Challenges in Data Visualization:


• Data Quality Issues: Inaccurate or incomplete data can make visuals misleading.
• Overcrowded Visuals: Too much information can confuse the audience.
• Choosing the Wrong Chart Type: Inappropriate visuals can misrepresent insights.
• Scalability: Visualizing large datasets in real-time can be performance-intensive.
• User Interpretation: Poor design or lack of clarity can lead to wrong conclusions.
• Tool Limitations: Some tools may not support specific types of data or interactivity.

3. Applications of Data Visualization:


• Business Intelligence: Dashboards for sales, marketing, and financial metrics.
• Healthcare: Tracking disease spread, patient outcomes, and treatment efficiency.
• Education: Student performance tracking, learning analytics.
• Social Media Analytics: Sentiment analysis and engagement trends.
• Data Journalism: Visual storytelling with data.
• Machine Learning: Understanding model outputs, feature importance, etc.

4. Types of Data Visualization:

Type of Visualization | Description | Example Use Case
Bar Chart | Represents categorical data with rectangular bars | Compare product sales
Line Chart | Displays trends over time | Stock price over months
Pie Chart | Shows proportions of a whole | Market share by company
Histogram | Visualizes distribution of numerical data | Distribution of ages
Scatter Plot | Shows relationships between two variables | Height vs weight
Heatmap | Uses color to show data density | Correlation matrix
Box Plot | Visualizes spread and outliers in data | Exam score analysis
Map Visualization | Geographic data representation | COVID-19 spread by region

Common Tools for Visualization:


• Tableau, Power BI, Google Data Studio
• Python Libraries: Matplotlib, Seaborn, Plotly, Altair
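As a quick illustration of the tools listed above, here is a minimal bar chart built with Matplotlib; the product names and sales figures are invented for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt

# Invented sample data: units sold per product
products = ["A", "B", "C"]
sales = [120, 95, 150]

fig, ax = plt.subplots()
ax.bar(products, sales, color="steelblue")
ax.set_xlabel("Product")
ax.set_ylabel("Units sold")
ax.set_title("Product sales comparison")
fig.savefig("sales_bar.png")  # write the chart to an image file
```

The same few lines adapt to the other chart types in the table by swapping `ax.bar` for `ax.plot` (line chart), `ax.hist` (histogram), or `ax.scatter` (scatter plot).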
2. Architecture of Apache Pig.
Apache Pig runs Pig Latin scripts on top of Hadoop. Its architecture has four main stages:
• Parser: Checks the syntax of the Pig Latin script and produces a logical plan, represented as a DAG (directed acyclic graph) of operations.
• Optimizer: Applies logical optimizations, such as projection and filter pushdown, to the plan.
• Compiler: Compiles the optimized logical plan into a series of MapReduce jobs.
• Execution Engine: Submits the MapReduce jobs to Hadoop, where they run on the cluster and produce the final output.
Scripts can run in local mode or MapReduce mode, and interactively through the Grunt shell.
3. List and explain Data Visualization tools, describe Tableau.
Common Data Visualization Tools:

Tool | Description
Tableau | A powerful, interactive data visualization and business intelligence tool.
Power BI | A Microsoft product that connects data sources and builds dashboards.
QlikView/Qlik Sense | Offers associative data models for dynamic dashboards and reports.
Google Data Studio | Free Google tool for simple, shareable reports using Google data.
D3.js | A JavaScript library for creating interactive, web-based visualizations.
Plotly | Python and JavaScript library for interactive plots and dashboards.
Looker | A cloud-based platform by Google Cloud for BI and analytics.
Matplotlib / Seaborn (Python) | Libraries used for 2D plotting and statistical visualizations.

Detailed Overview of Tableau:


• What is Tableau?
Tableau is a leading data visualization tool used to convert raw data into interactive and shareable
dashboards. It is widely adopted in business intelligence for its ease of use and powerful analytics
capabilities.
• Key Features:
o Drag-and-drop interface for building dashboards quickly.
o Real-time data analysis with live connections to data sources.
o Supports a wide range of data sources including Excel, SQL databases, Google Analytics,
and cloud services.
o Interactive dashboards that allow filtering, zooming, and drilling down into data.
o Calculated fields and LOD expressions for advanced analytics.
• Advantages:
o User-friendly for non-programmers.
o Highly customizable visualizations.
o Strong community support and resources.
• Use Cases:
o Business reporting
o Sales and marketing analytics
o Operational dashboards
o Healthcare and public sector insights
o Financial performance tracking
4. Data Visualization Techniques.
Data visualization techniques are methods used to graphically represent data in a way that helps users
understand trends, patterns, and insights. These techniques vary based on the type of data (categorical,
numerical, time-series, etc.) and the purpose of analysis.

1. Bar Charts
• Purpose: Compare values across categories.
• Use Case: Comparing sales across regions or product categories.
2. Line Charts
• Purpose: Display trends over time.
• Use Case: Monthly revenue growth or stock price fluctuations.
3. Pie Charts
• Purpose: Show proportions of a whole.
• Use Case: Market share distribution among companies.
4. Histograms
• Purpose: Show the distribution of a single variable.
• Use Case: Frequency of scores in an exam.
5. Scatter Plots
• Purpose: Visualize the relationship between two numeric variables.
• Use Case: Analyzing the correlation between advertising spend and revenue.
6. Heatmaps
• Purpose: Show values in a matrix with color intensity.
• Use Case: Correlation matrix of features in a dataset.
7. Box Plots (Box-and-Whisker)
• Purpose: Summarize distribution using median, quartiles, and outliers.
• Use Case: Compare test scores across different classes.
8. Area Charts
• Purpose: Similar to line charts but emphasize the magnitude of change.
• Use Case: Visualizing the cumulative growth of users over time.
9. Tree Maps
• Purpose: Display hierarchical data as nested rectangles.
• Use Case: Visualizing budget allocation in an organization.
5. Analytical techniques used in Big data visualization.
Big data visualization involves representing massive, complex datasets visually to reveal trends, patterns,
and insights. Analytical techniques are essential to summarize, process, and present this data effectively.
Below are the key techniques used:

1. Descriptive Analytics
• Purpose: Summarizes past data to understand what happened.
• Example: Visualizing average monthly sales or website traffic trends.
• Tools: Bar charts, line charts, pie charts.

2. Diagnostic Analytics
• Purpose: Explores data to understand the reasons behind events or trends.
• Example: Analyzing why sales dropped in a region by comparing KPIs.
• Tools: Heatmaps, correlation matrices, drill-down charts.

3. Predictive Analytics
• Purpose: Uses historical data and machine learning to forecast future outcomes.
• Example: Forecasting product demand or customer churn.
• Tools: Time series charts, regression lines, prediction intervals.

4. Prescriptive Analytics
• Purpose: Recommends actions based on predictive insights and optimization.
• Example: Recommending pricing strategies based on customer segmentation.
• Tools: Decision trees, optimization dashboards.

5. Correlation and Regression Analysis


• Purpose: Identifies relationships between variables.
• Example: Visualizing how marketing spend affects sales.
• Tools: Scatter plots, bubble charts with regression lines.
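The correlation coefficient and regression slope behind such charts can be computed directly; the following pure-Python sketch uses invented marketing-spend and sales figures that happen to be perfectly linear.

```python
from math import sqrt

def pearson_r(x, y):
    # Pearson correlation: covariance divided by the product of standard deviations
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def regression_slope(x, y):
    # Least-squares slope of y on x
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
           sum((a - mx) ** 2 for a in x)

spend = [10, 20, 30, 40, 50]
sales = [25, 45, 65, 85, 105]  # invented data: exactly sales = 2*spend + 5

print(pearson_r(spend, sales))       # ≈ 1.0 for an exact linear relationship
print(regression_slope(spend, sales))
```

A scatter plot of `spend` against `sales` with this slope drawn as a line is the standard way to visualize the relationship.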

6. Cluster Analysis
• Purpose: Groups similar data points together to identify patterns or segments.
• Example: Customer segmentation based on behavior.
• Tools: Cluster heatmaps, 3D scatter plots, dendrograms.
7. Anomaly Detection
• Purpose: Identifies unusual data points or outliers.
• Example: Detecting fraudulent transactions or spikes in sensor data.
• Tools: Line charts with threshold bands, box plots.

8. Time Series Analysis


• Purpose: Analyzes data over time to identify trends and seasonality.
• Example: Monthly sales forecasting or website traffic analysis.
• Tools: Time series line charts, moving average plots.
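A moving average, the basis of the smoothing plots mentioned above, can be sketched in a few lines of plain Python; the monthly sales numbers are invented for illustration.

```python
def moving_average(series, window):
    # Average of each consecutive window of the given size;
    # the result has len(series) - window + 1 points.
    if window > len(series):
        raise ValueError("window larger than series")
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

monthly_sales = [100, 120, 90, 110, 130, 150]
print(moving_average(monthly_sales, 3))
```

Plotting the smoothed values alongside the raw series makes the underlying trend easier to see than the noisy monthly figures alone.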

9. Geospatial Analysis
• Purpose: Analyzes data related to geographic locations.
• Example: Mapping customer density or delivery routes.
• Tools: Choropleth maps, geo heatmaps.

10. Sentiment Analysis


• Purpose: Analyzes text data to determine sentiment (positive, neutral, negative).
• Example: Visualizing customer reviews or social media feedback.
• Tools: Word clouds, bar charts, polarity graphs.
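A toy sketch of lexicon-based sentiment scoring follows; real systems use trained models, and the word lists and review texts here are invented examples.

```python
# Naive lexicon approach: count positive vs negative words in the text.
POSITIVE = {"good", "great", "excellent", "love", "happy"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "slow"}

def sentiment(text):
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("great product love it"))      # positive
print(sentiment("terrible and slow service"))  # negative
```

Counting the labels across a batch of reviews gives exactly the totals a sentiment bar chart or polarity graph would display.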
6. Line plot, Scatter plot, Histogram, Density plot, Box plot and their uses.
1. Line Plot
• Description: A graph that connects data points with a continuous line.
• Use Cases:
o Visualizing trends over time.
o Monitoring time series data (e.g., stock prices, temperature).
• Example: Plotting monthly revenue growth over a year.

2. Scatter Plot
• Description: A graph of plotted points that show the relationship between two variables.
• Use Cases:
o Identifying correlations or patterns between variables.
o Detecting clusters or outliers.
• Example: Relationship between hours studied and exam scores.
3. Histogram
• Description: A bar graph representing the frequency distribution of a dataset.
• Use Cases:
o Showing the distribution of a single variable.
o Understanding how data is spread (e.g., normal, skewed).
• Example: Distribution of ages in a customer dataset.

4. Density Plot
• Description: A smoothed version of a histogram that shows the probability density function of a
continuous variable.
• Use Cases:
o Comparing distributions between groups.
o Understanding data distribution in a continuous manner.
• Example: Comparing test scores between two student groups.

5. Box Plot (Box-and-Whisker Plot)


• Description: A plot that displays the distribution of data based on five summary statistics: minimum,
Q1, median, Q3, and maximum.
• Use Cases:
o Detecting outliers.
o Comparing distribution between multiple groups.
• Example: Comparing income ranges across different departments.
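The five summary statistics behind a box plot can be computed with the standard-library statistics module; the score list below is invented and includes a deliberate outlier.

```python
import statistics

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # note the outlier 100

# Five-number summary: minimum, Q1, median, Q3, maximum
q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
summary = (min(scores), q1, median, q3, max(scores))
print(summary)  # (1, 3.25, 5.5, 7.75, 100)

# Tukey's rule: points beyond 1.5 * IQR from the quartiles are outliers,
# which a box plot draws as individual points past the whiskers.
iqr = q3 - q1
outliers = [x for x in scores if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print(outliers)  # [100]
```

These are exactly the values a plotting library uses to draw the box, whiskers, and outlier points.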
7. Hadoop ecosystem in detail with diagram and its components.
The Hadoop ecosystem is a suite of tools and frameworks that work together to store, process, and analyze
big data efficiently. It is built on the Hadoop Distributed File System (HDFS) and the MapReduce
processing framework but includes many other complementary components.

Core Components of Hadoop Ecosystem


1. HDFS (Hadoop Distributed File System)
• Function: Storage layer of Hadoop.
• Purpose: Stores large volumes of data across multiple machines.
• Features: Fault-tolerant, scalable, handles structured and unstructured data.
2. MapReduce
• Function: Processing layer of Hadoop.
• Purpose: Processes data in parallel using a map and reduce approach.
• Example: Mapping customer data and reducing to get total sales by region.
3. YARN (Yet Another Resource Negotiator)
• Function: Resource management and job scheduling.
• Purpose: Manages compute resources in clusters and assigns them to various applications.

Hadoop Ecosystem Tools


1. Hive
• Purpose: Data warehousing and SQL-like querying on HDFS.
• Language Used: HiveQL (similar to SQL).
• Best For: Users familiar with SQL who want to query large datasets.
2. Pig
• Purpose: High-level platform for data processing using a scripting language.
• Language Used: Pig Latin.
• Best For: Complex data transformations and procedural workflows.
3. HBase
• Purpose: NoSQL database built on top of HDFS.
• Best For: Real-time read/write access to large datasets.
4. Sqoop
• Purpose: Transfers data between Hadoop and relational databases.
• Best For: Importing data from relational databases such as MySQL or Oracle into Hadoop, and vice versa.
5. Flume
• Purpose: Collects, aggregates, and moves large volumes of log data to HDFS.
• Best For: Streaming data like logs or events from various sources.
6. Oozie
• Purpose: Workflow scheduler for Hadoop jobs.
• Best For: Managing complex job dependencies and timing.
7. Zookeeper
• Purpose: Centralized service for maintaining configuration information, naming, and
synchronization.
• Best For: Distributed coordination between Hadoop services.
8. Mahout
• Purpose: Machine learning library for Hadoop.
• Best For: Building scalable ML algorithms like clustering and classification.
9. Ambari
• Purpose: Web-based tool for managing, monitoring, and provisioning Hadoop clusters.
10. Spark
• Purpose: In-memory processing engine that works with HDFS and YARN.
• Best For: Faster batch and real-time analytics compared to MapReduce.
8. MapReduce, Pig, Hive, Apache Spark.

1. MapReduce
• Definition: A programming model used for processing and generating large datasets in a distributed
manner.
• Working:
o Map step: Converts input data into key-value pairs.
o Reduce step: Aggregates results based on keys.
• Use Case: Counting the number of occurrences of words in a large document.
• Advantages:
o Handles huge datasets efficiently.
o Fault-tolerant and scalable.
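The word-count use case above can be simulated in plain Python to show the three phases; real MapReduce distributes these steps across a cluster, and the documents here are invented.

```python
from collections import defaultdict

def map_phase(documents):
    # Map step: emit a (word, 1) key-value pair for every word.
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle step: group values by key, as the framework does between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce step: aggregate the list of values for each key.
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data big insights", "data drives insights"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'insights': 2, 'drives': 1}
```

In a real job, many mappers and reducers run this logic in parallel on different splits of the input, which is what makes the model scale to huge datasets.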
2. Pig
• Definition: A high-level platform that uses a scripting language (Pig Latin) to process large datasets.
• Components:
o Pig Latin: Language used to express data flows.
o Execution Engine: Converts Pig Latin into MapReduce jobs.
• Use Case: Data transformation tasks like filtering, joining, and grouping.
• Advantages:
o Easier to write than raw MapReduce.
o Suitable for ETL (Extract, Transform, Load) operations.
3. Hive
• Definition: A data warehouse system for Hadoop that allows querying of large datasets using a SQL-
like language called HiveQL.
• Working: Hive queries are internally converted to MapReduce jobs.
• Use Case: Running SQL-like queries on big data stored in HDFS.
• Advantages:
o Ideal for users familiar with SQL.
o Schema flexibility and partitioning support.
o Good for summarization and analysis.
4. Apache Spark
• Definition: A fast, general-purpose big data processing engine that performs in-memory computing
for increased speed.
• Components:
o Spark Core: The base engine for large-scale computation.
o Spark SQL: SQL queries.
o Spark Streaming: Real-time data processing.
o MLlib: Machine learning library.
o GraphX: Graph processing.
• Use Case: Real-time data analysis, iterative machine learning tasks.
• Advantages:
o Faster than MapReduce due to in-memory processing.
o Supports multiple languages (Python, Scala, Java, R).
o Compatible with Hadoop and HDFS.
