
UNIT-6

GRAPH ANALYTICS AND DATA VISUALIZATION
Prepared By:
Aayushi Chaudhari,
Assistant Professor, CE, CSPIT,
CHARUSAT

November 11, 2024| Department of Computer Engineering 1


Agenda
• What is data visualization?
• Benefits of using data visualization
• Why is it required?
• Apache Spark GraphX: Property Graph
• Graph Operator
• SubGraph, Triplet
• Neo4j: Modeling data with Neo4j
• Cypher
• Query Language: General clauses
• Read and Write clauses.
• Big Data Visualization with Power BI
• Apache Superset



What is data visualization?
• Data visualization is the practice of translating information into a visual
context, such as a map or graph, to make data easier for the human brain
to understand and pull insights from.
• The main goal of data visualization is to make it easier to identify patterns,
trends and outliers in large data sets.
• The term is often used interchangeably with others, including information
graphics, information visualization and statistical graphics.
• Data visualization is one of the steps of the data science process, which states
that after data has been collected, processed and modeled, it must be
visualized for conclusions to be made.
What is data visualization? Cont..
• Data visualization is important for almost every career.
• It can be used by teachers to display student test results, by computer
scientists exploring advancements in artificial intelligence (AI) or by
executives looking to share information with stakeholders.
• It also plays an important role in big data projects.
• As businesses accumulated massive collections of data during the early years
of the big data trend, they needed a way to quickly and easily get an
overview of their data.
• Visualization tools were a natural fit.



Need of Data Visualization
• When a data scientist is writing advanced predictive analytics or machine
learning (ML) algorithms, it becomes important to visualize the outputs
to monitor results and ensure that models are performing as intended.
• This is because visualizations of complex algorithms are generally easier
to interpret than numerical outputs.



Example



Importance of Data Visualization
• Data visualization provides a quick and effective way to communicate information in a
universal manner using visual information.
• The practice can also help businesses identify which factors affect customer behavior;
pinpoint areas that need to be improved or need more attention; make data more
memorable for stakeholders; understand when and where to place specific products;
and predict sales volumes.
• It improves the ability to absorb information quickly, gain insights and make faster
decisions.
• It provides an increased understanding of the next steps that must be taken to improve
the organization.
• Provides an improved ability to maintain the audience's interest with information they
can understand.
Importance of Data Visualization cont..
• Provides an easy distribution of information that increases the
opportunity to share insights with everyone involved.
• It reduces the need for specialist interpretation, since data becomes more
accessible and understandable.
• Provides an increased ability to act on findings quickly and,
therefore, achieve success with greater speed and fewer mistakes.



Data Visualization for Big data
• Data analysis projects have made visualization more important than ever.
• Companies are increasingly using machine learning to gather massive amounts of data that
can be difficult and slow to sort through, comprehend and explain.
• Visualization offers a means to speed this up and present information to business owners
and stakeholders in ways they can understand.
• Big data visualization often goes beyond the typical techniques used in normal
visualization, such as pie charts, histograms and corporate graphs.
• It instead uses more complex representations, such as heat maps and fever charts.
• Big data visualization requires powerful computer systems to collect raw data, process it
and turn it into graphical representations that humans can use to quickly draw insights.



Needs of Organizations to use Data Visualization
A visualization specialist is required for the organization, who can apply the appropriate data sets and
visual styles so that the organization is guaranteed to optimize its use of the data.
Involvement of IT specialists is required, as the organization may need powerful computer
hardware, efficient storage systems and even a move to the cloud.
The data to be used needs to be accurate and under the control of a governing person.



Example of Various Visualization Styles
In the early days of visualization, the most common visualization technique was using a
Microsoft Excel spreadsheet to transform the information into a table, bar graph or pie
chart. While these visualization methods are still commonly used, more intricate
techniques are now available, including the following:
 infographics
 bubble clouds
 bullet graphs
 heat maps
 fever charts



Example of Infographics



Example of bubble clouds



Example of Bullet chart



Example of heat map



Fever chart example



Apache Spark GraphX
• GraphX is the graph processing library built into Apache Spark.
• It makes use of the property graph abstraction and Spark RDDs.
• GraphX is a hybrid technology that combines two approaches: data-parallel
systems, such as Hadoop and Spark, which focus on distributing
data across multiple nodes.
• Graph-parallel systems, such as Pregel, GraphLab and Giraph, efficiently
execute graph algorithms through partitioning and distribution
techniques.
• GraphX unifies the data-parallel and graph-parallel approaches.



Table View v/s Graph view



Data parallel v/s Graph parallel



GraphX
• GraphX is a collection of graph classes that extend the Spark
RDD (Resilient Distributed Dataset), which is an
immutable distributed collection of objects.
• Basically, there are two types of graphs:
• Directed Graph: edges have a direction associated with them.
• Regular Graph: a graph where each vertex has the same number of
edges.



GraphX property graph
• It is a directed multigraph, which can have multiple edges in
parallel.
• Every edge and vertex has user-defined properties
associated with it.
• Parallel edges allow multiple relationships between
the same vertices.



Example of Property Graph



Example
In this scenario, we will analyze three flights; the information for them is given in the tables below:
• Airports will act as vertices
• Routes will act as edges
• Each vertex has an ID and an Airport Name as properties.

Vertex Table for Airports (ID: Long, Airport Name: String)

ID  Airport Name
1   Ahmedabad
2   Surat
3   Mumbai

Edges Table for Routes (SrcID, DestID: Long; Distance: Double)

SrcID  DestID  Distance
1      2       263.3
2      3       279.4
3      1       524.2
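The tables above can be sketched in plain Python, without Spark; the variable names below are illustrative only and not the GraphX API:

```python
# Plain-Python sketch of the airports property graph above. In GraphX these
# would be an RDD of (VertexId, name) pairs and an RDD of Edge objects.
airports = {1: "Ahmedabad", 2: "Surat", 3: "Mumbai"}

# Each route: (source vertex id, destination vertex id, distance property)
routes = [(1, 2, 263.3), (2, 3, 279.4), (3, 1, 524.2)]

# Join each route back to the airport names, much like GraphX triplets do.
legs = [f"{airports[s]} -> {airports[d]} ({dist})" for s, d, dist in routes]
for leg in legs:
    print(leg)
```

This makes the vertex/edge split concrete: vertex properties live with the IDs, edge properties live on the routes, and a lookup joins the two.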



Graph Operator
• Big data comes in different shapes and sizes. It can be batch data that needs to be
processed offline, processing large sets of records and generating the results and insights
at a later time.
• Or the data can be real-time streams which need to be processed on the fly to create
data insights almost instantaneously.
• Apache Spark can be used for processing batch (Spark Core) as well as real-time data
(Spark Streaming).



Graph Operator
GraphX makes it easier to run analytics on graph data with the built-in operators and
algorithms.
It also allows us to cache and uncache the graph data to avoid recomputation when we
need to call a graph multiple times.
Basically, there are four types of graph operators:
1. Basic
2. Property
3. Structural
4. Join



Graph Operators

• Basic: numEdges, numVertices, inDegrees, outDegrees, degrees
• Property: mapVertices, mapEdges, mapTriplets
• Structural: reverse, subgraph, mask, groupEdges
• Join: joinVertices, outerJoinVertices
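A minimal plain-Python sketch of what the basic operators report, using the same three-route airport graph as before (in GraphX these would be graph.numVertices, graph.numEdges, graph.inDegrees and graph.outDegrees):

```python
# Illustrative plain-Python version of GraphX's basic operators.
from collections import Counter

edge_list = [(1, 2), (2, 3), (3, 1)]  # (srcId, dstId) pairs

num_vertices = len({v for edge in edge_list for v in edge})
num_edges = len(edge_list)

# Degree maps: vertex id -> number of incoming / outgoing edges.
in_degrees = Counter(dst for _, dst in edge_list)
out_degrees = Counter(src for src, _ in edge_list)
```

Each airport sits on one incoming and one outgoing route, so every in- and out-degree here is 1.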



SubGraph
• A subgraph refers to a smaller, derived graph that is extracted from a larger graph based
on certain criteria or conditions.
• GraphX allows for the manipulation and analysis of large-scale graphs, and the subgraph
operation is a fundamental part of this functionality.
• A subgraph is created by selecting a subset of the vertices and edges from the original
graph. The selection can be based on vertex properties, edge properties, or both.
• Subgraphs are useful for focusing on specific parts of a larger graph to perform more
detailed analysis, to filter out noise, or to study particular relationships or structures within
the graph.



Syntax to create subgraph
In GraphX, a subgraph can be created using the subgraph method, which takes predicate
functions specifying which vertices and edges to include.
// Define the predicates (attribute types here are illustrative)
val vertexPredicate = (id: VertexId, attr: String) => true
val edgePredicate = (triplet: EdgeTriplet[String, Double]) => triplet.attr > 270.0

// Create the subgraph (in GraphX, the edge predicate is the first parameter)
val subgraph = graph.subgraph(epred = edgePredicate, vpred = vertexPredicate)

// The subgraph now contains only the vertices and edges that match the predicates
subgraph.vertices.collect.foreach(println)
subgraph.edges.collect.foreach(println)
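To make the filtering concrete, here is a runnable plain-Python sketch of the edge-predicate half of the operation, reusing the earlier flight distances with a made-up cutoff of 270 (GraphX additionally applies a vertex predicate and drops edges whose endpoints are filtered out):

```python
# Plain-Python sketch of the subgraph idea: keep only edges that satisfy a
# predicate, then derive the surviving vertex set.
routes = [(1, 2, 263.3), (2, 3, 279.4), (3, 1, 524.2)]

def edge_predicate(src, dst, dist):
    return dist > 270.0  # hypothetical cutoff for illustration

long_routes = [e for e in routes if edge_predicate(*e)]

# As in GraphX, the result is itself a (smaller) graph of the same shape.
sub_vertices = {v for src, dst, _ in long_routes for v in (src, dst)}
```

Only the Surat-to-Mumbai and Mumbai-to-Ahmedabad routes survive, illustrating how the derived graph focuses on the relationships of interest.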



Triplet
• In GraphX, a triplet is a structure that provides a convenient way to access the information
of an edge along with its source and destination vertices.
• This structure is particularly useful for algorithms that need to consider both the
relationships (edges) and the entities (vertices) they connect.
• Triplet Structure
• A triplet in GraphX consists of three components:
• Source Vertex: The vertex where the edge originates.
• Destination Vertex: The vertex where the edge terminates.
• Edge: The edge connecting the source and destination vertices.
The triplet allows access to both the vertex attributes and the edge attribute simultaneously,
making it easier to perform computations that involve both.
Steps to use triplets using Scala
// Create the graph
val graph: Graph[String, String] = Graph(vertices, edges)

// Define a function to process triplets
def processTriplet(triplet: EdgeTriplet[String, String]): String =
  s"${triplet.srcAttr} ${triplet.attr} ${triplet.dstAttr}"

// Use the triplets method to access triplets and apply the function
val tripletDescriptions: RDD[String] = graph.triplets.map(processTriplet)

// Collect and print the results
tripletDescriptions.collect.foreach(println)
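The same idea can be run without Spark; this plain-Python sketch mirrors the srcAttr / attr / dstAttr fields of the Scala code above, using hypothetical data:

```python
# Plain-Python version of the triplet idea: each triplet pairs an edge
# attribute with its source and destination vertex attributes.
vertex_attrs = {1: "Alice", 2: "Bob"}   # vertex id -> attribute
edges = [(1, 2, "follows")]             # (srcId, dstId, edge attribute)

# Build (srcAttr, attr, dstAttr) triplets by joining edges to vertices.
triplets = [(vertex_attrs[s], rel, vertex_attrs[d]) for s, d, rel in edges]
descriptions = [f"{src} {rel} {dst}" for src, rel, dst in triplets]
for line in descriptions:
    print(line)
```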



What is Neo4j?
Neo4j is a graph database that uses graph structures with nodes, relationships, and properties
to represent and store data.
This type of data modeling is well suited for complex relationships, like social
networks, recommender systems and network topologies.
Key Concepts of Neo4j:
Nodes: Entities or objects in the graph. Each node can have labels to categorize it and
properties to store data.
Relationships: Directed connections between nodes, representing how entities are related.
Relationships can also have properties.
Properties: Key-value pairs associated with nodes and relationships, storing data.
Labels: Tags used to group nodes into sets, similar to types or classes in other databases.
Features of Neo4j
Graph-based Data Model:
Neo4j stores data as nodes, relationships, and properties. This allows for a highly flexible and intuitive way to model complex, interconnected data.

ACID Compliance:
Neo4j ensures ACID (Atomicity, Consistency, Isolation, Durability) properties for transactions, ensuring reliable and predictable transaction processing.
Cypher Query Language:
Neo4j uses Cypher, a powerful and expressive query language designed for querying and updating graph data. It is similar to SQL but optimized for graph traversal.
High Performance:
It can efficiently handle complex queries over large datasets, making it suitable for applications that require real-time data processing and analytics.
Indexing and Constraints:

Neo4j supports indexing for fast lookups and constraints to ensure data integrity, such as uniqueness constraints on node properties.
Graph Algorithms and Analytics:
Neo4j includes a library of graph algorithms, such as shortest path, PageRank, and community detection, which are useful for advanced analytics and machine learning tasks.

Visualization Tools:
Tools like Neo4j Bloom and the Neo4j Browser provide interactive graph exploration capabilities.



Example of creating Graph schema using cypher
// Create User nodes
CREATE (alice:User {username: 'alice', name: 'Alice', email: '[email protected]'})
CREATE (bob:User {username: 'bob', name: 'Bob', email: '[email protected]'})
CREATE (carol:User {username: 'carol', name: 'Carol', email: '[email protected]'})
// Create FOLLOWS relationships
CREATE (alice)-[:FOLLOWS {since: '2023-01-01'}]->(bob)
CREATE (bob)-[:FOLLOWS {since: '2023-02-01'}]->(carol)
// Create Post nodes
CREATE (post1:Post {content: 'Hello world!', timestamp: '2023-07-31T10:00:00'})
CREATE (post2:Post {content: 'Learning Neo4j', timestamp: '2023-07-31T12:00:00'})
// Create CREATED relationships
CREATE (alice)-[:CREATED]->(post1)
CREATE (bob)-[:CREATED]->(post2)
Querying the data created above
Find all users who follow Alice
MATCH (alice:User {username: 'alice'})<-[:FOLLOWS]-(follower) RETURN follower

Find all posts created by Bob


MATCH (bob:User {username: 'bob'})-[:CREATED]->(post:Post) RETURN post

Find all users and their followers


MATCH (user:User)-[:FOLLOWS]->(follower:User) RETURN user, follower



General Clauses in a Cypher Query
MATCH: Used to specify patterns to search for in the graph.
RETURN: Specifies what to return from the query.
WHERE: Adds conditions to the patterns specified in the MATCH clause.
WITH: Chains multiple parts of a query together, passing results from one part to the next.
Example:
MATCH (alice:User {username: 'alice'})-[:FOLLOWS]->(friend:User) WHERE
friend.username = 'bob' RETURN alice, friend



Read Clauses
MATCH: Retrieves nodes and relationships based on patterns.
OPTIONAL MATCH: Similar to MATCH, but returns null if no matches are found, ensuring the query
continues.
RETURN: Specifies what to return from the query.
WHERE: Adds conditions to filter results.
ORDER BY: Sorts the results.
LIMIT: Limits the number of results returned.
SKIP: Skips a specified number of results.
UNWIND: Expands a list into individual rows.
Example:
MATCH (user:User)-[:CREATED]->(post:Post) WHERE user.username = 'alice' RETURN post ORDER BY
post.timestamp DESC LIMIT 5
Write Clauses
CREATE: Creates nodes and relationships.
MERGE: Ensures that a pattern exists in the graph, either by finding or creating it.
SET: Updates properties on nodes or relationships.
DELETE: Deletes nodes and relationships.
REMOVE: Removes properties or labels from nodes and relationships.
FOREACH: Iterates over lists to perform write operations.
Examples of CREATE and MATCH are given on the earlier Cypher slides.
MATCH (alice:User {username: 'alice'})
MERGE (alice)-[:LIKES]->(post:Post {content: 'Hello world!', timestamp: '2023-07-31T10:00:00'})



Steps to visualize data in Power BI
-> Collect the data you want to visualize. This could be from Excel files, databases, web services,
etc. Ensure your data is clean and formatted correctly.
-> Use the Power Query Editor to transform and shape your data. This could involve: merging or
appending queries, creating calculated columns, filtering rows, pivoting/unpivoting columns.
-> Create or edit relationships between tables to ensure accurate data analysis and reporting.
-> Choose the type of visualization you want to create from the Visualizations pane (e.g., bar chart,
line chart, pie chart).
-> Use the Format pane to customize your visualizations. Adjust properties such as colors, labels,
titles, and tooltips.
-> Arrange your visuals on the report canvas to create a coherent and aesthetically pleasing layout.
-> Click on the Publish button on the Home ribbon to publish your report to the Power BI Service.



Apache Superset
• Apache Superset is an open-source data exploration and visualization platform designed to help users create
interactive and customizable dashboards and reports. It provides a rich set of features that make it a powerful tool
for data analysis and business intelligence. Here are some key aspects of Apache Superset:
 Use Cases
• Business Intelligence:
Creating dashboards to monitor key performance indicators (KPIs) and metrics.
Analyzing business trends and making data-driven decisions.
• Data Exploration:
Enabling data scientists and analysts to explore and visualize data interactively.
Facilitating the discovery of insights and patterns in complex datasets.
• Reporting:
Generating reports for stakeholders with interactive visualizations.
Automating the reporting process with scheduled queries and updates.
Thank You.

