Big Data - Unit 5
Data analysis plays a crucial role in business decision-making by providing insights into past
performance and future trends. There are four main types of data analysis techniques used across
industries, each serving a different purpose and operating at a different level of complexity.
1. Descriptive Analysis
Definition: Descriptive analysis focuses on summarizing historical data to understand what has
happened.
Key Questions: What happened? What are the key trends and patterns?
Examples:
o KPI Dashboards: Visual summaries of key metrics like sales, revenue, and customer
acquisition.
o Monthly Revenue Reports: Summaries detailing revenue performance across different
months or quarters.
o Sales Leads Overview: Analysis of leads generated and their conversion rates.
2. Diagnostic Analysis
Definition: Diagnostic analysis seeks to understand why certain events occurred by drilling
deeper into data.
Key Questions: Why did it happen? What were the causes behind the outcomes observed in
descriptive analysis?
Examples:
o Investigating Slow Shipments: Identifying factors contributing to delays in specific
regions.
o Marketing Effectiveness: Analyzing which campaigns or channels contributed most to
customer trials in a SaaS company.
3. Predictive Analysis
Definition: Predictive analysis forecasts future outcomes based on historical data and statistical
modeling.
Key Questions: What is likely to happen? What can we expect in the future based on past
trends?
Examples:
o Risk Assessment: Predicting the likelihood of default for loans or credit risk.
o Sales Forecasting: Estimating future sales based on historical sales data and market
trends (see the sketch after this list).
o Customer Segmentation: Identifying which customer segments are likely to respond
positively to marketing campaigns.
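As a minimal illustration of the forecasting idea referenced above, the sketch below fits a linear trend to made-up monthly sales figures with NumPy and extrapolates one month ahead; the data and the choice of a simple linear model are assumptions for demonstration only.

import numpy as np

# Hypothetical monthly sales (units) for the past six months -- illustration only
sales = np.array([120, 132, 141, 150, 158, 171], dtype=float)
months = np.arange(len(sales))

# Fit a straight-line trend to the historical data
slope, intercept = np.polyfit(months, sales, deg=1)

# Extrapolate the trend to the next month
forecast = slope * len(sales) + intercept
print(f"Forecast for next month: {forecast:.1f} units")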
4. Prescriptive Analysis
Definition: Prescriptive analysis recommends the actions to take, building on predictive insights
to suggest the best course of action.
Key Questions: What should we do? Which action will produce the best outcome?
Examples:
o Recommendation Engines: Suggesting products or content that individual customers are
most likely to want.
o Route Optimization: Choosing delivery routes that minimize time and cost.
Text Analysis
With text analysis, you can get accurate information from your sources more quickly. The process
is fully automated and consistent, and it produces data you can act on. For example, text analysis
software lets you immediately detect negative sentiment in social media posts so you can work to
solve the problem.
Sentiment analysis
Sentiment analysis, or opinion mining, uses text analysis methods to understand the opinion
conveyed in a piece of text. You can apply sentiment analysis to reviews, blogs, forums, and other
online media to determine whether your customers are happy with their purchases. Sentiment
analysis helps you spot new trends, track sentiment changes, and tackle PR issues. By combining
sentiment analysis with keyword tracking, you can follow changes in customer opinion and
identify the root cause of a problem.
Record management
Text analysis enables efficient management, categorization, and searching of documents. This
includes automating patient record management, monitoring brand mentions, and detecting
insurance fraud. For example, LexisNexis Legal & Professional uses text extraction to identify
specific records among 200 million documents.
Personalized customer experience
You can use text analysis software to process emails, reviews, chats, and other text-based
correspondence. With insights about customers' preferences, buying habits, and overall brand
perception, you can tailor personalized experiences for different customer segments.
Text analysis software works on the principles of deep learning and natural language processing.
Deep learning
Artificial intelligence is the field of data science that teaches computers to think like humans.
Machine learning is a technique within artificial intelligence that uses specific methods to teach or
train computers. Deep learning is a highly specialized machine learning method that uses neural
networks, software structures that mimic the human brain. Deep learning technology powers text
analysis software so these networks can read text in a way similar to the human brain.
Text Analysis Techniques
1. Text Classification
o Definition: Assigning predefined tags or categories to unstructured text.
o Applications: Sentiment analysis, topic modeling, language detection, intent
detection.
o Example: Classifying customer reviews as positive, negative, or neutral to gauge
sentiment.
2. Text Extraction
o Definition: Extracting specific pieces of data (e.g., keywords, prices, names) from
text.
o Applications: Populating spreadsheets, extracting product specifications from
reviews.
o Example: Extracting customer names and complaint details from support tickets.
3. Word Frequency
o Definition: Measuring how often words occur in a text, often weighted with TF-IDF
(term frequency-inverse document frequency); see the sketch after this list.
o Applications: Analyzing common topics or issues in customer feedback.
o Example: Identifying frequently mentioned topics like 'delivery' in negative
customer reviews.
4. Collocation
o Definition: Identifying words that frequently occur together (bigrams and
trigrams).
o Applications: Finding related terms in customer feedback or product reviews.
o Example: Identifying common phrases like 'customer support' in customer
reviews.
5. Concordance
o Definition: Showing the context and instances of words or phrases in a text.
o Applications: Understanding how specific terms are used across different
contexts.
o Example: Analyzing how the word 'simple' is used in app reviews to understand
user perceptions.
6. Word Sense Disambiguation
o Definition: Resolving ambiguity in word meanings based on context.
o Applications: Understanding multiple meanings of words like 'light' (weight,
color, etc.).
o Example: Distinguishing between different senses of 'bank' (financial institution
vs. river bank).
7. Clustering
o Definition: Grouping similar documents or texts into clusters based on similarity.
o Applications: Organizing search results, grouping related articles or documents.
o Example: Google clustering search results based on relevance to search queries.
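As a minimal sketch of three of the techniques above (word frequency, collocation, and concordance), the snippet below uses NLTK; the sample feedback is made up, and NLTK with its 'punkt' tokenizer data is assumed to be installed.

import nltk
from nltk import FreqDist, Text, word_tokenize
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Hypothetical customer feedback -- illustration only
reviews = ("The delivery was late and the delivery box was damaged. "
           "Customer support was helpful, but customer support took two days to reply.")

tokens = [t.lower() for t in word_tokenize(reviews) if t.isalpha()]

# Word frequency: count how often each word occurs
freq = FreqDist(tokens)
print(freq.most_common(5))

# Collocation: find word pairs (bigrams) that frequently occur together
finder = BigramCollocationFinder.from_words(tokens)
print(finder.nbest(BigramAssocMeasures.raw_freq, 3))

# Concordance: show each occurrence of a word in context
Text(tokens).concordance("delivery")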
Stage 1—Data gathering
Internal data
Internal data is text content that is internal to your business and is readily available—for example,
emails, chats, invoices, and employee surveys.
External data
You can find external data in sources such as social media posts, online reviews, news articles, and
online forums. It is harder to acquire external data because it is beyond your control. You might need
to use web scraping tools or integrate with third-party solutions to extract external data.
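As a minimal sketch of acquiring external data, the snippet below downloads a page and strips its HTML to plain text using the requests and BeautifulSoup libraries; the URL is a hypothetical placeholder, and both libraries are assumed to be installed.

import requests
from bs4 import BeautifulSoup

# Hypothetical review page -- replace with a real, scrape-permitted URL
url = "https://example.com/reviews"

response = requests.get(url, timeout=10)
response.raise_for_status()

# Parse the HTML and keep only the visible text for later analysis
soup = BeautifulSoup(response.text, "html.parser")
page_text = soup.get_text(separator=" ", strip=True)
print(page_text[:200])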
Stage 2—Data preparation
Tokenization
Tokenization splits the raw text into multiple parts that make semantic sense. For example, the
phrase text analytics benefits businesses tokenizes to the words text, analytics, benefits,
and businesses.
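A minimal sketch of this step with NLTK, assuming the 'punkt' tokenizer data has been downloaded:

import nltk
from nltk import word_tokenize

# nltk.download('punkt')  # one-time download of tokenizer data, if needed
tokens = word_tokenize("text analytics benefits businesses")
print(tokens)  # ['text', 'analytics', 'benefits', 'businesses']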
Part-of-speech tagging
Part-of-speech tagging assigns grammatical tags to the tokenized text. For example, applying this
step to the previously mentioned tokens results in text: Noun; analytics: Noun; benefits: Verb;
businesses: Noun.
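Continuing the same example with NLTK's part-of-speech tagger, assuming its tagger data is available:

import nltk
from nltk import pos_tag, word_tokenize

# nltk.download('averaged_perceptron_tagger')  # one-time download, if needed
tokens = word_tokenize("text analytics benefits businesses")
print(pos_tag(tokens))  # pairs each token with a tag; exact tags depend on the model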
Parsing
Parsing establishes meaningful connections between the tokenized words using the rules of
English grammar. It helps the text analysis software visualize the relationships between words.
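One way to inspect these relationships is a dependency parse; the sketch below uses spaCy, assuming the library and its small English model en_core_web_sm are installed.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("text analytics benefits businesses")

# Print each word, its grammatical relation, and the word it depends on
for token in doc:
    print(token.text, token.dep_, token.head.text)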
Lemmatization
Lemmatization is a linguistic process that simplifies words into their dictionary form, or lemma. For
example, the dictionary form of visualizing is visualize.
Stop word removal
Stop words are words that offer little or no semantic context to a sentence, such as and, or, and for.
Depending on the use case, the software might remove them from the structured text.
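A minimal sketch of lemmatization and stop word removal with NLTK, assuming the 'wordnet' and 'stopwords' corpora have been downloaded:

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# nltk.download('wordnet'); nltk.download('stopwords')  # one-time downloads, if needed
lemmatizer = WordNetLemmatizer()
tokens = ["visualizing", "benefits", "and", "businesses", "for", "analytics"]

# Reduce each token to its lemma (treated as a verb where one exists), then drop stop words
lemmas = [lemmatizer.lemmatize(t, pos="v") for t in tokens]
filtered = [t for t in lemmas if t not in stopwords.words("english")]
print(filtered)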
Stage 3—Text analysis
Text classification
Classification is the process of assigning tags to the text data using rule-based or machine
learning-based systems.
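As a minimal machine learning sketch, the pipeline below trains a TF-IDF plus logistic regression classifier on a few made-up labeled reviews using scikit-learn; the training data is an assumption for illustration, and a real system would need far more examples.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny, made-up training set -- illustration only
texts = ["great product, works perfectly", "terrible quality, broke quickly",
         "love it, highly recommend", "waste of money, very disappointed"]
labels = ["positive", "negative", "positive", "negative"]

# TF-IDF features feeding a logistic regression classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["the product broke and I want a refund"]))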
Text extraction
Extraction involves identifying the presence of specific keywords in the text and associating them
with tags. The software uses methods such as regular expressions and conditional random fields
(CRFs) to do this.
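A minimal regular expression sketch of the extraction idea; the ticket text and patterns are made up for illustration, and trained extractors such as CRFs replace such hand-written rules in practice.

import re

ticket = "Order #48213: customer reports the item arrived damaged; refund of $29.99 requested."

# Hand-written patterns standing in for a trained extractor
order_id = re.search(r"Order #(\d+)", ticket)
price = re.search(r"\$(\d+\.\d{2})", ticket)

print(order_id.group(1))  # 48213
print(price.group(1))     # 29.99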
Stage 4—Visualization
Visualization turns the text analysis results into an easily understandable format. You will find
text analytics results in graphs, charts, and tables. The visualized results help you identify
patterns and trends and build action plans. For example, suppose you're getting a spike in product
returns but have trouble finding the causes. With visualization, you look for words such
as defects, wrong size, or not a good fit in the feedback and tabulate them into a chart. Then you'll
know which issue is the biggest and deserves top priority.
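A minimal sketch of that example: counting the return-reason keywords in made-up feedback and charting them with matplotlib.

import matplotlib.pyplot as plt

# Hypothetical return feedback -- illustration only
feedback = ["arrived with defects", "wrong size, had to return",
            "not a good fit for me", "wrong size again", "defects on the screen"]

# Count how often each return reason appears in the feedback
reasons = ["defects", "wrong size", "not a good fit"]
counts = [sum(reason in f for f in feedback) for reason in reasons]

# A bar chart makes the dominant issue obvious at a glance
plt.bar(reasons, counts)
plt.ylabel("Mentions")
plt.title("Return reasons in customer feedback")
plt.show()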
Ensemble methods fall into two broad categories: sequential ensemble techniques and parallel
ensemble techniques. Sequential ensemble techniques generate base learners one after another,
e.g., Adaptive Boosting (AdaBoost); the sequential generation creates dependence between the
base learners, and performance is improved by assigning higher weights to previously
misclassified examples. Parallel ensemble techniques generate base learners independently, e.g.,
bagging, and combine their predictions. The three main ensemble techniques are:
1. Bagging
2. Boosting
3. Stacking
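A minimal sketch of a sequential ensemble using scikit-learn's AdaBoostClassifier on a synthetic dataset; the data and parameters are assumptions for illustration.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data -- illustration only
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# AdaBoost builds weak learners one after another, reweighting misclassified examples
model = AdaBoostClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))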
Variance Reduction
Ensemble methods are ideal for reducing the variance in models, thereby increasing the accuracy
of predictions. Variance is reduced when multiple models are combined into a single prediction,
typically by averaging or voting over the individual models' outputs. Because an ensemble
considers the predictions of all its base models, the resulting prediction is more stable and closer
to the best possible than the prediction of any single model.
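A minimal sketch of variance reduction through a parallel ensemble: scikit-learn's BaggingClassifier combines many decision trees trained on bootstrap samples; the data and parameters are assumptions for illustration.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data -- illustration only
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Compare a single high-variance tree with a bagged ensemble of such trees
single_tree = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

print("Single tree:", cross_val_score(single_tree, X, y, cv=5).mean())
print("Bagged trees:", cross_val_score(bagged, X, y, cv=5).mean())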