BDA Unit 5 Notes
Spark and Big Data Analytics: Spark, Introduction to Data Analysis with Spark. Text, Web
Content and Link Analytics: Introduction, Text Mining, Web Mining, Web Content and Web
Usage Analytics, Page Rank, Structure of Web and Analyzing a Web Graph.
Spark and Big Data Analytics: Spark, Introduction to Data Analysis with Spark.
Apache Spark is a powerful, open-source, distributed processing system designed for big data
analytics. It excels at handling both batch and real-time data workloads, providing a platform for
interactive queries, machine learning, and more. Unlike Hadoop's MapReduce, Spark uses in-
memory caching to accelerate data processing, making it significantly faster for many tasks. It
offers a suite of libraries for various analytics needs, including data analysis, machine learning
(MLlib), and graph processing (GraphX). Spark's core uses Resilient Distributed Datasets (RDDs),
a fundamental data structure for handling distributed data.
Key Features and Benefits:
Speed: Spark's in-memory processing and optimized query execution significantly speed up data
processing compared to traditional Hadoop-based approaches.
Scalability: Spark is designed to handle large datasets, processing them across a cluster of
machines for efficient analytics.
Flexibility: Spark supports various programming languages (Java, Scala, Python, R) and offers
a wide range of libraries for diverse analytics tasks.
Fault Tolerance: RDDs record the lineage of operations used to build them, so if a node in the
cluster fails, lost partitions can be recomputed and the job can continue.
Interactive Querying: Spark's interactive shells (like the PySpark shell) allow data scientists to
explore data and perform ad-hoc analysis quickly and easily.
Machine Learning: Spark's MLlib library provides a rich set of machine learning algorithms,
making it suitable for building predictive models and performing analysis.
How it Works:
RDDs (Resilient Distributed Datasets): Spark uses RDDs as its fundamental data structure for
distributed data, allowing for parallel processing and efficient data reuse.
DataFrames: Spark's DataFrames provide a structured, tabular representation of data, building
upon RDDs and offering a more intuitive interface for data manipulation.
Spark SQL: Spark SQL allows users to query and manipulate data using SQL-like syntax,
making it accessible to data analysts and database professionals.
Spark Streaming: Spark Streaming allows real-time data analysis and processing, enabling
applications like data ingestion and stream analytics.
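A minimal PySpark sketch tying these pieces together (the app name, sample data, and column
names are illustrative assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkIntro").getOrCreate()

# RDD: the low-level distributed collection
rdd = spark.sparkContext.parallelize([("alice", 34), ("bob", 45)])

# DataFrame: a structured, tabular view built on top of the RDD
df = rdd.toDF(["name", "age"])

# Spark SQL: query the same data with SQL-like syntax
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()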
Text Mining
What is Text Mining?
Text mining (also called text data mining or text analytics) refers to techniques for analyzing
textual data to derive high-quality information. It involves:
Preprocessing (cleaning, tokenization, etc.)
Information extraction
Pattern recognition
Sentiment analysis
Topic modeling
Why Text Mining in Big Data?
In the Big Data context, much of the data is unstructured, especially text from:
Social media
News articles
Reviews (e.g., product, service)
Emails and customer feedback
Logs and reports
Traditional structured data analysis tools fall short here, which is where big data platforms like
Apache Spark help.
Text Mining Workflow in Big Data Analytics
1. Data Collection
Collect data from:
Web scraping
APIs (Twitter, Reddit, etc.)
Databases
Log files
2. Text Preprocessing
Tokenization: Splitting text into words
Stopword Removal: Removing common words (e.g., "the", "is")
Stemming/Lemmatization: Reducing words to root forms
Lowercasing and punctuation removal
Example (in PySpark):
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, StopWordsRemover

spark = SparkSession.builder.appName("TextMining").getOrCreate()
df = spark.createDataFrame([(0, "Spark makes text mining scalable")], ["id", "text"])  # sample input

tokenizer = Tokenizer(inputCol="text", outputCol="words")  # split each document into words
wordsData = tokenizer.transform(df)
remover = StopWordsRemover(inputCol="words", outputCol="filtered")  # drop common words
cleanedData = remover.transform(wordsData)
3. Feature Extraction
TF-IDF (Term Frequency-Inverse Document Frequency)
Word2Vec or CountVectorizer
Example:
from pyspark.ml.feature import HashingTF, IDF

# Hash the filtered tokens into fixed-size term-frequency vectors
hashingTF = HashingTF(inputCol="filtered", outputCol="rawFeatures", numFeatures=1000)
featurizedData = hashingTF.transform(cleanedData)

# Rescale term frequencies by inverse document frequency
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)
4. Text Classification / Clustering
Sentiment Analysis (Positive/Negative/Neutral)
Spam Detection
Topic Modeling (e.g., LDA - Latent Dirichlet Allocation)
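As a hedged sketch, topic modeling with MLlib's LDA on the term-frequency vectors from the
previous step (the topic count k=5 and iteration count are illustrative assumptions):

from pyspark.ml.clustering import LDA

# LDA expects term counts, so use the rawFeatures column produced by HashingTF
lda = LDA(k=5, maxIter=10, featuresCol="rawFeatures")
ldaModel = lda.fit(featurizedData)
ldaModel.describeTopics(3).show()  # top 3 term indices and weights per topic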
5. Visualization and Insights
Use tools like:
Spark DataFrames
Tableau / Power BI
Jupyter Notebooks (with Pandas for small-scale subsets)
Tools Commonly Used
Tool/Library: Purpose
Apache Spark (PySpark): scalable processing of text
NLTK / spaCy: NLP preprocessing
MLlib (Spark): machine learning on text
Gensim: topic modeling (LDA, Word2Vec)
Hadoop / HDFS: distributed storage
Real-World Applications
Customer sentiment analysis from product reviews
Social media monitoring for brands
Fraud detection using textual patterns in insurance claims
Healthcare: mining medical reports or EHR notes
Legal & Compliance: scanning legal documents for risk
Web Mining
Web Mining in Big Data Analytics refers to the process of extracting useful patterns, knowledge,
and insights from massive web data. It combines techniques from data mining, machine learning,
natural language processing (NLP), and big data platforms to handle the scale, variety, and
unstructured nature of web-based information.
What is Web Mining?
Web Mining is broadly divided into three categories:
1. Web Content Mining
Extracting and analyzing data from web page content (text, images, videos).
Example: Sentiment analysis of product reviews on e-commerce websites.
2. Web Structure Mining
Analyzing the link structure of websites (like Google’s PageRank).
Example: Discovering authoritative sources or communities.
3. Web Usage Mining
Analyzing user behavior via web logs, clickstreams, and browsing patterns.
Example: Recommender systems based on user navigation.
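As a small illustration of usage mining, a PySpark sketch that counts page visits from
clickstream records (the record format and values are hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("UsageMining").getOrCreate()

# Hypothetical clickstream records: (user, page visited)
clicks = spark.createDataFrame(
    [("u1", "/home"), ("u1", "/cart"), ("u2", "/home")],
    ["user", "page"])

# Most-visited pages, a basic web usage statistic
clicks.groupBy("page").count().orderBy(F.desc("count")).show()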
Role of Web Mining in Big Data Analytics
The web is a massive source of unstructured data, and mining it at scale requires Big Data
technologies like:
Apache Spark
Hadoop
Kafka (for real-time stream processing)
NoSQL databases like MongoDB or Cassandra
These platforms allow processing of:
Social media feeds
News websites
Forums, blogs
E-commerce product pages
Clickstream data
Web Mining Workflow in Big Data
1. Data Collection
Web scraping (using tools like Scrapy, BeautifulSoup, or Selenium)
APIs (Twitter, Reddit, YouTube, etc.)
Server logs / clickstream data
2. Data Storage
Use of distributed storage (e.g., HDFS, Amazon S3, Google Cloud Storage)
Structured and semi-structured formats: JSON, XML, HTML, CSV
3. Data Preprocessing
Remove HTML tags, scripts, duplicates
Tokenization, stop-word removal, stemming
Metadata extraction (titles, tags, timestamps)
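For instance, a minimal sketch of stripping HTML tags and scripts before tokenization, using
BeautifulSoup (the sample markup is illustrative):

from bs4 import BeautifulSoup

html = "<html><body><script>var x = 1;</script><h1>Title</h1><p>Body text.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")
for tag in soup(["script", "style"]):  # remove script and style blocks
    tag.decompose()
print(soup.get_text(separator=" ", strip=True))  # -> "Title Body text."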
4. Feature Extraction & Analysis
Text mining (TF-IDF, word embeddings)
Link mining (PageRank, graph analytics)
Pattern discovery (frequent patterns, trends)
5. Modeling & Prediction
Sentiment analysis
Recommendation systems
Trend detection
Anomaly detection (e.g., fraud)
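As one hedged example, a sentiment classifier using MLlib's logistic regression (trainingData
and testData are hypothetical DataFrames with a "features" vector column and a 0/1 "label"
column, e.g. produced by a TF-IDF pipeline like the one shown earlier):

from pyspark.ml.classification import LogisticRegression

# Fit on labeled examples, then predict sentiment for unseen documents
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(trainingData)  # trainingData: hypothetical labeled DataFrame
model.transform(testData).select("prediction").show()  # testData: hypothetical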
Tools & Frameworks
Tool/Framework: Purpose
Apache Spark: distributed processing
Scrapy, BeautifulSoup: web scraping
Kafka: real-time data ingestion
MLlib: machine learning on mined data
ElasticSearch + Kibana: indexing and visualizing logs
GraphX (Spark): link analysis
Real-World Applications
Search engines: Ranking and indexing web pages
Social media monitoring: Brand reputation, event detection
E-commerce: Competitor price tracking, product review analysis
Cybersecurity: Monitoring for phishing or malicious sites
AdTech: Behavioral targeting and personalization
Page Rank
PageRank is a link analysis algorithm originally developed by Larry Page and Sergey Brin
(founders of Google) to rank web pages in search engine results. In web analytics, PageRank is a
powerful tool used to assess the importance or influence of web pages based on their link
structure.
What is PageRank?
PageRank measures the authority of a webpage by analyzing how many other pages link to it,
and how authoritative those linking pages are.
Core idea: A page is important if it is linked to by other important pages.
Formula (simplified):
PR(A) = (1 - d) / N + d * ( PR(T1)/C(T1) + ... + PR(Tn)/C(Tn) )
where d is the damping factor (typically 0.85), N is the total number of pages, T1, ..., Tn are
the pages that link to A, and C(Ti) is the number of outbound links on page Ti.
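Using Spark (PySpark, large scale):
A hedged sketch of the same iterative computation over RDDs (the tiny graph, 10 iterations, and
damping factor 0.85 below are illustrative assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PageRank").getOrCreate()
sc = spark.sparkContext

# Tiny web graph as (source page, destination page) edges
links = sc.parallelize([(1, 2), (1, 3), (2, 3), (3, 1)]).groupByKey().cache()
ranks = links.mapValues(lambda _: 1.0)  # start every page at rank 1.0

def contribute(pair):
    # Each page splits its rank evenly among the pages it links to
    page, (neighbors, rank) = pair
    return [(dest, rank / len(neighbors)) for dest in neighbors]

for _ in range(10):  # power iteration
    contribs = links.join(ranks).flatMap(contribute)
    ranks = contribs.reduceByKey(lambda a, b: a + b) \
                    .mapValues(lambda r: 0.15 + 0.85 * r)

for page, rank in ranks.collect():
    print(page, rank)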
Using NetworkX (Python, small scale):
import networkx as nx

# Build a small directed graph: nodes are pages, edges are hyperlinks
G = nx.DiGraph()
G.add_edges_from([(1, 2), (1, 3), (2, 3), (3, 1)])

# alpha is the damping factor d from the formula above
pagerank = nx.pagerank(G, alpha=0.85)
print(pagerank)
Real-World Applications
Application: Purpose
Search Engines: rank and retrieve relevant results
SEO Analysis: improve site visibility
Social Network Analysis: identify influential users/pages
Recommendation Systems: recommend popular or authoritative items
Academic Citation Networks: rank influential papers or authors