BDA Unit 5 Notes

This document provides an overview of Apache Spark and its capabilities in big data analytics, focusing on its features, benefits, and applications in text, web content, and link analytics. It covers techniques for text mining, web mining, and web usage analytics, detailing workflows, tools, and real-world applications. It also discusses PageRank and the structure of the web as a graph, emphasizing the importance of link analysis and connectivity in understanding web dynamics.

UNIT-5

Spark and Big Data Analytics: Spark, Introduction to Data Analysis with Spark. Text, Web
Content and Link Analytics: Introduction, Text Mining, Web Mining, Web Content and Web
Usage Analytics, Page Rank, Structure of Web and Analyzing a Web Graph.

Spark and Big Data Analytics: Spark, Introduction to Data Analysis with Spark.
Apache Spark is a powerful, open-source, distributed processing system designed for big data
analytics. It excels at handling both batch and real-time data workloads, providing a platform for
interactive queries, machine learning, and more. Unlike Hadoop's MapReduce, Spark uses in-memory caching to accelerate data processing, making it significantly faster for many tasks. It
offers a suite of libraries for various analytics needs, including data analysis, machine learning
(MLlib), and graph processing (GraphX). Spark's core uses Resilient Distributed Datasets (RDDs),
a fundamental data structure for handling distributed data.
Key Features and Benefits:
Speed: Spark's in-memory processing and optimized query execution significantly speed up data
processing compared to traditional Hadoop-based approaches.
Scalability: Spark is designed to handle large datasets, processing them across a cluster of
machines for efficient analytics.
Flexibility: Spark supports various programming languages (Java, Scala, Python, R) and offers
a wide range of libraries for diverse analytics tasks.
Fault Tolerance: Spark's RDDs and distributed nature ensure that if a node in the cluster fails,
data and computation can continue without interruption.
Interactive Querying: Spark's interactive shells (like the PySpark shell) allow data scientists to
explore data and perform ad-hoc analysis quickly and easily.
Machine Learning: Spark's MLlib library provides a rich set of machine learning algorithms,
making it suitable for building predictive models and performing analysis.
How it Works:
RDDs (Resilient Distributed Datasets): Spark uses RDDs as its fundamental data structure for
distributed data, allowing for parallel processing and efficient data reuse.
DataFrames: Spark's DataFrames provide a structured, tabular representation of data, building
upon RDDs and offering a more intuitive interface for data manipulation.
Spark SQL: Spark SQL allows users to query and manipulate data using SQL-like syntax,
making it accessible to data analysts and database professionals.
Spark Streaming: Spark Streaming allows real-time data analysis and processing, enabling
applications like data ingestion and stream analytics.
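
To make these pieces concrete, here is a minimal PySpark sketch (assuming a local Spark installation; the data and names are made up) that takes the same records from an RDD to a DataFrame and then queries them with Spark SQL:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("IntroDemo").getOrCreate()

# RDD: the low-level distributed collection
rdd = spark.sparkContext.parallelize([("alice", 34), ("bob", 29)])

# DataFrame: a structured, tabular view over the same data
df = rdd.toDF(["name", "age"])

# Spark SQL: query the DataFrame with SQL syntax
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()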

Text, Web Content and Link Analytics: Introduction


Text, web content, and link analytics are distinct but related fields that focus on extracting
meaningful information from textual data, website content, and the relationships between web
pages. Text analytics analyzes unstructured text to uncover trends, sentiment, and themes, while
web content analysis focuses on understanding user behavior and website performance. Link
analytics, a subset of web analytics, examines the flow of traffic between websites, including the
effectiveness of backlinks and internal linking.
Text Analytics:
Purpose: To convert large volumes of unstructured text into structured data for analysis.
Methods: Includes lexical analysis, pattern recognition, parsing, and part-of-speech tagging.
Applications: Identifying trends in social media, extracting customer feedback from surveys,
and analyzing customer support tickets.
Tools: Text analytics software and Natural Language Processing (NLP) techniques.
Web Content Analysis:
Purpose: To understand user behavior, website performance, and marketing campaign effectiveness.
Methods: Tracking visitor demographics, traffic sources, page views, bounce rates, and
conversion rates.
Applications: Optimizing website design, improving content strategy, and measuring ROI on
marketing efforts.
Tools: Web analytics platforms like Google Analytics, Optimizely, and others.
Link Analytics:
Purpose: To analyze the flow of traffic between websites and assess the effectiveness of links.
Methods: Tracking backlinks, internal linking, and the impact of different types of links on
traffic and SEO.
Applications: Improving SEO, building authority, and understanding the effectiveness of link
building strategies.
Tools: SEO tools, web analytics platforms, and third-party link analysis services.
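
As a small illustration, link analysis starts from extracting the links themselves. The sketch below uses BeautifulSoup on a hypothetical HTML snippet to separate internal from external links, the raw input for backlink and internal-linking studies:

from bs4 import BeautifulSoup

html = '<a href="https://example.com/a">A</a> <a href="/pricing">Pricing</a>'
soup = BeautifulSoup(html, "html.parser")

# Classify each anchor by whether its href stays on the same site
for a in soup.find_all("a", href=True):
    kind = "internal" if a["href"].startswith("/") else "external"
    print(kind, a["href"])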

Text Mining
What is Text Mining?
Text mining (also called text data mining or text analytics) refers to techniques for analyzing
textual data to derive high-quality information. It involves:
Preprocessing (cleaning, tokenization, etc.)
Information extraction
Pattern recognition
Sentiment analysis
Topic modeling
Why Text Mining in Big Data?
In the Big Data context, much of the data is unstructured, especially text from:
Social media
News articles
Reviews (e.g., product, service)
Emails and customer feedback
Logs and reports
Traditional structured data analysis tools fall short here, which is where big data platforms like
Apache Spark help.
Text Mining Workflow in Big Data Analytics
1. Data Collection
Collect data from:
Web scraping
APIs (Twitter, Reddit, etc.)
Databases
Log files
2. Text Preprocessing
Tokenization: Splitting text into words
Stopword Removal: Removing common words (e.g., "the", "is")
Stemming/Lemmatization: Reducing words to root forms
Lowercasing and punctuation removal
Example (in PySpark):
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, StopWordsRemover

spark = SparkSession.builder.appName("TextMining").getOrCreate()
# Toy corpus; in practice df comes from the collection step above
df = spark.createDataFrame([(0, "Spark makes big data processing simple")], ["id", "text"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")            # split text into words
wordsData = tokenizer.transform(df)
remover = StopWordsRemover(inputCol="words", outputCol="filtered")   # drop "the", "is", ...
cleanedData = remover.transform(wordsData)
3. Feature Extraction
TF-IDF (Term Frequency-Inverse Document Frequency)
Word2Vec or CountVectorizer
Example:
from pyspark.ml.feature import HashingTF, IDF

# Hash filtered tokens into fixed-size term-frequency vectors
hashingTF = HashingTF(inputCol="filtered", outputCol="rawFeatures", numFeatures=1000)
featurizedData = hashingTF.transform(cleanedData)

# Down-weight terms that occur across many documents
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)
4. Text Classification / Clustering
Sentiment Analysis (Positive/Negative/Neutral)
Spam Detection
Topic Modeling (e.g., LDA - Latent Dirichlet Allocation)
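As an example of the topic-modeling option, here is a minimal LDA sketch with Spark MLlib, reusing cleanedData from step 2 (LDA expects raw term counts, so CountVectorizer is used here rather than the TF-IDF vectors above):

from pyspark.ml.clustering import LDA
from pyspark.ml.feature import CountVectorizer

# LDA works on term counts, so vectorize the filtered tokens first
cv = CountVectorizer(inputCol="filtered", outputCol="counts", vocabSize=1000)
countData = cv.fit(cleanedData).transform(cleanedData)

# Fit a 3-topic model and inspect the top terms per topic
lda = LDA(k=3, maxIter=10, featuresCol="counts")
model = lda.fit(countData)
model.describeTopics(5).show(truncate=False)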
5. Visualization and Insights
Use tools like:
Spark DataFrames
Tableau / Power BI
Jupyter Notebooks (with Pandas for small-scale subsets)
Tools Commonly Used
Apache Spark (PySpark): Scalable processing of text
NLTK / spaCy: NLP preprocessing
MLlib (Spark): Machine learning on text
Gensim: Topic modeling (LDA, Word2Vec)
Hadoop / HDFS: Distributed storage
Real-World Applications
Customer sentiment analysis from product reviews
Social media monitoring for brands
Fraud detection using textual patterns in insurance claims
Healthcare: mining medical reports or EHR notes
Legal & Compliance: scanning legal documents for risk
Web Mining
Web Mining in Big Data Analytics refers to the process of extracting useful patterns, knowledge,
and insights from massive web data. It combines techniques from data mining, machine learning,
natural language processing (NLP), and big data platforms to handle the scale, variety, and
unstructured nature of web-based information.
What is Web Mining?
Web Mining is broadly divided into three categories:
1. Web Content Mining
Extracting and analyzing data from web page content (text, images, videos).
Example: Sentiment analysis of product reviews on e-commerce websites.
2. Web Structure Mining
Analyzing the link structure of websites (like Google’s PageRank).
Example: Discovering authoritative sources or communities.
3. Web Usage Mining
Analyzing user behavior via web logs, clickstreams, and browsing patterns.
Example: Recommender systems based on user navigation.
Role of Web Mining in Big Data Analytics
The web is a massive source of unstructured data, and mining it at scale requires Big Data
technologies like:
Apache Spark
Hadoop
Kafka (for real-time stream processing)
NoSQL databases like MongoDB or Cassandra
These platforms allow processing of:
Social media feeds
News websites
Forums, blogs
E-commerce product pages
Clickstream data
Web Mining Workflow in Big Data
1. Data Collection
Web scraping (using tools like Scrapy, BeautifulSoup, or Selenium)
APIs (Twitter, Reddit, YouTube, etc.)
Server logs / clickstream data
2. Data Storage
Use of distributed storage (e.g., HDFS, Amazon S3, Google Cloud Storage)
Structured and semi-structured formats: JSON, XML, HTML, CSV
3. Data Preprocessing
Remove HTML tags, scripts, duplicates
Tokenization, stop-word removal, stemming
Metadata extraction (titles, tags, timestamps)
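A small sketch of this cleanup step with BeautifulSoup (the html string stands in for a scraped page):

from bs4 import BeautifulSoup

html = "<html><head><title>Sample</title></head><body><script>track()</script><p>Hello, web mining!</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Strip script/style tags so only visible text remains
for tag in soup(["script", "style"]):
    tag.decompose()

title = soup.title.string if soup.title else None   # metadata extraction
text = soup.get_text(separator=" ", strip=True)
print(title, "->", text)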
4. Feature Extraction & Analysis
Text mining (TF-IDF, word embeddings)
Link mining (PageRank, graph analytics)
Pattern discovery (frequent patterns, trends)
5. Modeling & Prediction
Sentiment analysis
Recommendation systems
Trend detection
Anomaly detection (e.g., fraud)
Tools & Frameworks
Apache Spark: Distributed processing
Scrapy, BeautifulSoup: Web scraping
Kafka: Real-time data ingestion
MLlib: Machine learning on mined data
ElasticSearch + Kibana: Indexing and visualizing logs
GraphX (Spark): Link analysis
Real-World Applications
Search engines: Ranking and indexing web pages
Social media monitoring: Brand reputation, event detection
E-commerce: Competitor price tracking, product review analysis
Cybersecurity: Monitoring for phishing or malicious sites
AdTech: Behavioral targeting and personalization

Web Content and Web Usage Analytics


1. Web Content Analytics
Web Content Analytics involves extracting and analyzing data from the actual content on
websites — primarily unstructured or semi-structured data like text, images, videos, or structured
elements like HTML tags.
Goals:
• Understand what is on the web.
• Derive insights from web page content (e.g., blogs, articles, product reviews).
Techniques:
• Text mining & NLP: Topic modeling, sentiment analysis, keyword extraction.
• Multimedia analytics: Image or video content tagging and classification.
• Entity recognition: Identifying people, places, products, etc.
Tools:
• Scrapy, BeautifulSoup, Selenium: Web scraping
• Apache Spark + MLlib/NLP: Scalable content processing
• Gensim, spaCy, NLTK: Advanced text mining
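For example, sentiment scoring with NLTK's VADER analyzer might look like the following sketch (assumes the vader_lexicon resource can be downloaded on first run):

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)   # one-time lexicon fetch
sia = SentimentIntensityAnalyzer()

# Compound score ranges from -1 (negative) to +1 (positive)
for review in ["Great product, works perfectly!", "Terrible battery life."]:
    print(review, "->", sia.polarity_scores(review)["compound"])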
Use Cases:
• Analyzing product reviews or feedback
• Monitoring online news sentiment
• Classifying articles or blog posts
• Extracting product details or specs from e-commerce sites
2. Web Usage Analytics
Web Usage Analytics focuses on analyzing user behavior through data collected from web logs,
clickstreams, and interaction patterns.
Goals:
• Understand how users interact with a website.
• Improve usability, personalization, and conversion rates.
Data Sources:
• Web server logs
• Google Analytics or similar tools
• User session data
• Event tracking (e.g., clicks, scrolls, form submissions)
Techniques:
• Clickstream analysis: Track user navigation paths
• Session analysis: Study time spent, bounce rate, etc.
• User segmentation: Group users based on behavior
• A/B testing: Measure changes in behavior from design tweaks
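A minimal PySpark sketch of clickstream analysis; the column names and sample rows below are hypothetical:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("UsageAnalytics").getOrCreate()

# Hypothetical clickstream: one row per page view
clicks = spark.createDataFrame(
    [("u1", "/home", "2024-01-01 10:00:00"),
     ("u1", "/pricing", "2024-01-01 10:01:30"),
     ("u2", "/home", "2024-01-01 10:02:00")],
    ["user_id", "url", "ts"],
)

# Page views per URL (navigation hot spots)
clicks.groupBy("url").count().orderBy(F.desc("count")).show()

# Single-event users as a rough bounce-rate proxy
per_user = clicks.groupBy("user_id").count()
bounce_rate = per_user.filter("count = 1").count() / per_user.count()
print(f"approx. bounce rate: {bounce_rate:.0%}")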
Tools:
• Apache Spark (for large-scale log analysis)
• Elastic Stack (ELK): For real-time monitoring and dashboards
• Google Analytics / Matomo: Prebuilt usage analytics
• Hadoop + Hive: Storage and querying of raw logs
Use Cases:
• Optimizing website structure and layout
• Recommending content/products based on behavior
• Detecting unusual patterns (e.g., bot traffic or fraud)
• Improving customer journey and retention

Page Rank
PageRank is a link analysis algorithm originally developed by Larry Page and Sergey Brin
(founders of Google) to rank web pages in search engine results. In web analytics, PageRank is a
powerful tool used to assess the importance or influence of web pages based on their link
structure.
What is PageRank?
PageRank measures the authority of a webpage by analyzing how many other pages link to it,
and how authoritative those linking pages are.
Core idea: A page is important if it is linked to by other important pages.
Formula (simplified):
PR(A) = (1 - d) + d * ( PR(T1)/C(T1) + ... + PR(Tn)/C(Tn) )
where T1, ..., Tn are the pages that link to A, C(Ti) is the number of outbound links on page Ti, and d is the damping factor (typically 0.85).
PageRank in Web Analytics


In the context of web analytics, PageRank can be used to:
1. Determine Page Importance: Identify which pages are central or authoritative within your
site or across the web.
2. Optimize Internal Linking: Improve site navigation and SEO by distributing link equity
smartly across high- and low-traffic pages.
3. Detect Spam or Link Farms: Abnormal PageRank distributions can highlight unnatural
linking practices.
4. Analyze Site Structure: Understand how deeply or hierarchically content is connected.
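Before turning to distributed tools, the update rule above fits in a few lines of plain Python. This is an illustrative from-scratch sketch on a toy four-page graph (the same graph as the examples below), not a production implementation:

# Plain-Python PageRank iteration (damping factor d = 0.85)
links = {1: [2, 3], 2: [3], 3: [1]}                  # page -> outbound links
pages = set(links) | {p for outs in links.values() for p in outs}
pr = {p: 1.0 / len(pages) for p in pages}            # uniform starting scores

d = 0.85
for _ in range(50):                                  # fixed number of sweeps
    new = {p: 1 - d for p in pages}
    for src, outs in links.items():
        share = pr[src] / len(outs)                  # split rank over out-links
        for dst in outs:
            new[dst] += d * share
    pr = new

print(pr)   # page 3 accumulates the most rank in this toy graph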
How to Compute PageRank with Big Data Tools
Using Apache Spark (GraphX), in Scala (run in spark-shell, where sc is already defined):
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Example edges (fromNodeId, toNodeId)
val edges: RDD[Edge[Int]] = sc.parallelize(Seq(
  Edge(1L, 2L, 1),
  Edge(1L, 3L, 1),
  Edge(2L, 3L, 1),
  Edge(3L, 1L, 1)
))

// Build the graph, assigning every vertex a default attribute
val graph = Graph.fromEdges(edges, defaultValue = 1)

// Iterate until ranks change by less than the tolerance (0.0001)
val ranks = graph.pageRank(0.0001).vertices

ranks.collect().foreach(println)
Using NetworkX (Python, small scale):
import networkx as nx

# Same four-page graph; alpha is the damping factor
G = nx.DiGraph()
G.add_edges_from([(1, 2), (1, 3), (2, 3), (3, 1)])
pagerank = nx.pagerank(G, alpha=0.85)
print(pagerank)
Real-World Applications
Search Engines: Rank and retrieve relevant results
SEO Analysis: Improve site visibility
Social Network Analysis: Identify influential users/pages
Recommendation Systems: Recommend popular or authoritative items
Academic Citation Networks: Rank influential papers or authors

Structure of Web and Analyzing a Web Graph.


The Web can be structured and analyzed as a graph where web pages are nodes and hyperlinks are
directed edges. This structure allows for the application of graph-based algorithms to understand
the web's evolution, discover communities, and improve search and crawling. Analyzing the web
graph involves characterizing its structure, including examining the distribution of links and the
connectivity of pages.
Web Graph Structure:
The web can be represented as a directed graph, where each web page is a node and a hyperlink
between two pages creates a directed edge from the source page to the target page. This structure
captures the interlinked nature of the web.
Nodes and Edges: Web pages are represented as nodes in the graph, and hyperlinks between
them form directed edges. This representation allows for the analysis of the web's network of
interconnected documents.
Analyzing the Web Graph:
Connectivity: Examining the connections between different web pages and the existence of
paths between them.
Link Distribution: Analyzing how links are distributed across the web, including the number
of incoming and outgoing links for each page.
Evolution of the Web: Studying how the graph changes over time as new pages and links are
added.
Applications of Web Graph Analysis:
Crawling: Improving the efficiency and effectiveness of web crawlers by identifying important
pages and links.
Search Engines: Developing more sophisticated search algorithms that leverage the web's
structure to rank search results.
Community Discovery: Identifying groups of related web pages that share common interests or
topics.
Understanding Human Dynamics: Analyzing link patterns to understand how humans interact
with the web and how information spreads.
Key Concepts:
Nodes: Web pages are represented as nodes in the graph.
Edges: Hyperlinks between web pages are represented as directed edges.
Directed Graph: The web graph is a directed graph, meaning the edges have a direction from one
node to another.
Connectivity: The way in which nodes are connected in the graph.
Diameter: The longest shortest path between any two nodes in the graph.
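
These concepts can be tried out on a toy web graph with NetworkX (the pages and links below are invented):

import networkx as nx

# Toy web graph: nodes are pages, directed edges are hyperlinks
G = nx.DiGraph()
G.add_edges_from([("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"), ("D", "A")])

# Link distribution: in-links and out-links per page
print(dict(G.in_degree()), dict(G.out_degree()))

# Connectivity: is there a path between every ordered pair of pages?
print(nx.is_strongly_connected(G))

# Diameter of the largest strongly connected component
core = G.subgraph(max(nx.strongly_connected_components(G), key=len))
print(nx.diameter(core))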
