Big Data - Unit 5

The document outlines four main types of data analysis techniques: Descriptive, Diagnostic, Predictive, and Prescriptive, each serving distinct purposes in business decision-making. It also discusses the importance of text analysis for extracting actionable insights from unstructured data, emphasizing methods like sentiment analysis and various text analysis techniques. Additionally, it introduces ensemble methods in machine learning, highlighting their ability to improve model accuracy through techniques such as bagging, boosting, and stacking.

Unit 5

Types of Data Analysis

Data analysis plays a crucial role in business decision-making by providing insights into past
performance and future trends. There are four main types of data analysis techniques used across
industries, each serving different purposes and levels of complexity.

1. Descriptive Analysis

- Definition: Descriptive analysis focuses on summarizing historical data to understand what has happened.
- Key Questions: What happened? What are the key trends and patterns?
- Examples:
  - KPI Dashboards: Visual summaries of key metrics like sales, revenue, and customer acquisition.
  - Monthly Revenue Reports: Summaries detailing revenue performance across different months or quarters (see the sketch after this list).
  - Sales Leads Overview: Analysis of leads generated and their conversion rates.
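
As a minimal illustration of descriptive analysis, the pandas sketch below summarizes revenue month by month; the figures and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical daily sales records (invented data for illustration)
sales = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03",
                            "2024-02-25", "2024-03-10"]),
    "revenue": [1200.0, 950.0, 1430.0, 1100.0, 1680.0],
})

# Descriptive analysis: summarize what happened, month by month
monthly = sales.groupby(sales["date"].dt.to_period("M"))["revenue"] \
               .agg(["sum", "mean", "count"])
print(monthly)
```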

2. Diagnostic Analysis

- Definition: Diagnostic analysis seeks to understand why certain events occurred by drilling deeper into data.
- Key Questions: Why did it happen? What were the causes behind the outcomes observed in descriptive analysis?
- Examples:
  - Investigating Slow Shipments: Identifying factors contributing to delays in specific regions (see the drill-down sketch after this list).
  - Marketing Effectiveness: Analyzing which campaigns or channels contributed most to customer trials in a SaaS company.
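
A diagnostic drill-down can be sketched with pandas group-bys; the shipment data below is invented for illustration.

```python
import pandas as pd

# Hypothetical shipment records (invented data for illustration)
shipments = pd.DataFrame({
    "region":     ["North", "North", "South", "South", "South", "East"],
    "carrier":    ["A", "B", "A", "B", "B", "A"],
    "delay_days": [0, 1, 4, 6, 5, 1],
})

# Step 1: where are deliveries slow? (average delay by region)
by_region = shipments.groupby("region")["delay_days"].mean()
print(by_region)

# Step 2: drill into the worst region to see which carrier drives the delay
worst = by_region.idxmax()
print(shipments[shipments["region"] == worst]
      .groupby("carrier")["delay_days"].mean())
```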

3. Predictive Analysis

- Definition: Predictive analysis forecasts future outcomes based on historical data and statistical modeling.
- Key Questions: What is likely to happen? What can we expect in the future based on past trends?
- Examples:
  - Risk Assessment: Predicting the likelihood of default for loans or credit risk.
  - Sales Forecasting: Estimating future sales based on historical sales data and market trends (see the sketch after this list).
  - Customer Segmentation: Identifying which customer segments are likely to respond positively to marketing campaigns.
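
A minimal predictive sketch, using scikit-learn's LinearRegression to extrapolate a sales trend; the monthly figures are invented.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical monthly sales history (invented numbers for illustration)
months = np.arange(1, 13).reshape(-1, 1)           # months 1..12 as the only feature
sales = np.array([100, 104, 110, 115, 118, 124,
                  130, 133, 140, 146, 150, 157])   # units sold each month

# Fit a simple linear trend on the historical data
model = LinearRegression().fit(months, sales)

# Forecast the next three months from the learned trend
print(model.predict(np.array([[13], [14], [15]])))
```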

4. Prescriptive Analysis

- Definition: Prescriptive analysis goes beyond predicting future outcomes by recommending actions to optimize a given outcome.
- Key Questions: What should we do? What actions should be taken to achieve desired outcomes?
- Examples:
  - AI-Based Decision Making: Using AI systems to recommend personalized actions in customer service or logistics.
  - Optimization Strategies: Recommending optimal pricing strategies based on demand forecasting and competitor analysis (see the sketch after this list).
  - Operational Efficiency: Suggesting process improvements based on real-time data insights.
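
A toy prescriptive sketch: given an assumed demand model (the coefficients are invented), recommend the price that maximizes predicted revenue.

```python
import numpy as np

# Hypothetical linear demand model: units sold fall as price rises
# (coefficients are invented for illustration)
def predicted_demand(price):
    return 500 - 20 * price

# Evaluate predicted revenue over a grid of candidate prices
prices = np.arange(5.0, 20.0, 0.5)
revenue = prices * predicted_demand(prices)

# Prescriptive step: recommend the action (price) that optimizes the outcome
print(f"Recommended price: {prices[np.argmax(revenue)]:.2f}")
```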
Points to Consider During Analysis

- Data Quality: Ensure data is accurate, complete, and consistent.
- Relevance: Use data that is relevant to the analysis goals.
- Context: Understand the business context and the data source.
- Bias: Be aware of any biases that might affect the analysis.
- Tools and Techniques: Choose appropriate tools and methods for analysis.

Developing an Analytic Team


1. Clarify your people analytics goals.
2. Decide what skills you need on your team.
3. Foster business acumen in your team members.
4. Empower your team with the right tools.
5. Consider where your people analytics team should be located within the organization.
6. Set your team up for success.

What is text analysis?


Text analysis is the process of using computer systems to read and understand human-written
text for business insights. Text analysis software can independently classify, sort, and extract
information from text to identify patterns, relationships, sentiments, and other actionable
knowledge. You can use text analysis to efficiently and accurately process multiple text-based
sources such as emails, documents, social media content, and product reviews, like a human
would.

Why is text analysis important?


Businesses use text analysis to extract actionable insights from various unstructured data sources.
They depend on feedback from sources like emails, social media, and customer survey responses
to aid decision making. However, the immense volume of text from such sources proves to be
overwhelming without text analytics software.

With text analysis, you can get accurate information from the sources more quickly. The process
is fully automated and consistent, and it displays data you can act on. For example, using text
analysis software allows you to immediately detect negative sentiment on social media posts so
you can work to solve the problem.

Sentiment analysis
Sentiment analysis or opinion mining uses text analysis methods to understand the opinion
conveyed in a piece of text. You can use sentiment analysis of reviews, blogs, forums, and other
online media to determine if your customers are happy with their purchases. Sentiment analysis
helps you spot new trends, track sentiment changes, and tackle PR issues. By using sentiment
analysis and identifying specific keywords, you can track changes in customer opinion and
identify the root cause of the problem.
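
As a minimal sketch of sentiment analysis, the example below uses NLTK's VADER analyzer; the review texts are invented.

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

# Invented example reviews
reviews = [
    "The checkout was quick and the support team was wonderful!",
    "Delivery took three weeks and nobody answered my emails.",
]
for review in reviews:
    scores = sia.polarity_scores(review)  # returns neg/neu/pos plus a compound score
    label = "positive" if scores["compound"] >= 0 else "negative"
    print(label, round(scores["compound"], 3), "-", review)
```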

Record management

Text analysis enables efficient management, categorization, and search of documents. This
includes automating patient record management, monitoring brand mentions, and detecting
insurance fraud. For example, LexisNexis Legal & Professional uses text extraction to identify
specific records among 200 million documents.

Personalizing customer experience

You can use text analysis software to process emails, reviews, chats, and other text-based
correspondence. With insights about customers’ preferences, buying habits, and overall brand
perception, you can tailor personalized experiences for different customer segments.

How does text analysis work?


The core of text analysis is training computer software to associate words with specific meanings
and to understand the semantic context of unstructured data. This is similar to how humans learn a
new language by associating words with objects, actions, and emotions.

Text analysis software works on the principles of deep learning and natural language processing.

Deep learning
Artificial intelligence is the field of computer science that teaches computers to think like humans. Machine learning is a technique within artificial intelligence that uses specific methods to teach or train computers. Deep learning is a highly specialized machine learning method that uses neural networks: software structures that loosely mimic the human brain. Deep learning powers text analysis software so that these networks can read text in a way similar to how a human would.

Natural language processing


Natural language processing (NLP) is a branch of artificial intelligence that gives computers the
ability to automatically derive meaning from natural, human-created text. It uses linguistic models
and statistics to train the deep learning technology to process and analyze text data, including
handwritten text images. NLP methods such as optical character recognition (OCR) convert text
images into text documents by finding and understanding the words in the images.

Text Analysis Methods & Techniques

1. Text Classification
   - Definition: Assigning predefined tags or categories to unstructured text.
   - Applications: Sentiment analysis, topic modeling, language detection, intent detection.
   - Example: Classifying customer reviews as positive, negative, or neutral to gauge sentiment.
2. Text Extraction
   - Definition: Extracting specific pieces of data (e.g., keywords, prices, names) from text.
   - Applications: Populating spreadsheets, extracting product specifications from reviews.
   - Example: Extracting customer names and complaint details from support tickets.
3. Word Frequency
   - Definition: Measuring how often words occur in a text, typically weighted with TF-IDF (term frequency-inverse document frequency).
   - Applications: Analyzing common topics or issues in customer feedback.
   - Example: Identifying frequently mentioned topics like 'delivery' in negative customer reviews (see the TF-IDF sketch after this list).
4. Collocation
   - Definition: Identifying words that frequently occur together (bigrams and trigrams).
   - Applications: Finding related terms in customer feedback or product reviews.
   - Example: Identifying common phrases like 'customer support' in customer reviews.
5. Concordance
   - Definition: Showing the context and instances of words or phrases in a text.
   - Applications: Understanding how specific terms are used across different contexts.
   - Example: Analyzing how the word 'simple' is used in app reviews to understand user perceptions.
6. Word Sense Disambiguation
   - Definition: Resolving ambiguity in word meanings based on context.
   - Applications: Understanding multiple meanings of words like 'light' (weight, color, etc.).
   - Example: Distinguishing between different senses of 'bank' (financial institution vs. river bank).
7. Clustering
   - Definition: Grouping similar documents or texts into clusters based on similarity.
   - Applications: Organizing search results, grouping related articles or documents.
   - Example: Google clustering search results based on relevance to search queries.
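
To make the word-frequency technique (item 3) concrete, here is a minimal TF-IDF sketch using scikit-learn; the review texts are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented customer reviews
reviews = [
    "Late delivery and the delivery driver was rude.",
    "Product arrived broken and delivery took two weeks.",
    "Great quality but the app keeps crashing.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(reviews)

# Average TF-IDF weight per term across all reviews, highlighting
# recurring topics such as 'delivery'
means = np.asarray(tfidf.mean(axis=0)).ravel()
top = sorted(zip(vectorizer.get_feature_names_out(), means), key=lambda p: -p[1])
for term, score in top[:5]:
    print(f"{term}: {score:.3f}")
```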

What are the stages in text analysis?


To implement text analysis, you need to follow a systematic process that goes through four stages.

Stage 1—Data gathering


In this stage, you gather text data from internal or external sources.

Internal data

Internal data is text content that is internal to your business and is readily available—for example,
emails, chats, invoices, and employee surveys.

External data

You can find external data in sources such as social media posts, online reviews, news articles, and
online forums. It is harder to acquire external data because it is beyond your control. You might need
to use web scraping tools or integrate with third-party solutions to extract external data.

Stage 2—Data preparation


Data preparation is an essential part of text analysis. It involves structuring raw text data in an
acceptable format for analysis. Text analysis software automates the process using the following common natural language processing (NLP) methods.

Tokenization

Tokenization segments the raw text into smaller parts (tokens) that make semantic sense. For example, the phrase "text analytics benefits businesses" tokenizes to the words "text", "analytics", "benefits", and "businesses".

Part-of-speech tagging

Part-of-speech tagging assigns grammatical tags to the tokenized text. For example, applying this
step to the previously mentioned tokens results in text: Noun; analytics: Noun; benefits: Verb;
businesses: Noun.
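
A minimal sketch of tokenization and part-of-speech tagging using NLTK, applied to the example phrase above; the resource names follow NLTK's long-standing conventions, and the exact tags produced depend on the tagger model.

```python
import nltk

nltk.download("punkt")                       # tokenizer models
nltk.download("averaged_perceptron_tagger")  # POS tagger model

text = "text analytics benefits businesses"
tokens = nltk.word_tokenize(text)  # -> ['text', 'analytics', 'benefits', 'businesses']
print(tokens)
print(nltk.pos_tag(tokens))        # grammatical tags; exact tags depend on the model
```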

Parsing

Parsing establishes meaningful connections between the tokenized words using the rules of English grammar. It helps the text analysis software visualize the relationships between words.

Lemmatization

Lemmatization is a linguistic process that simplifies words into their dictionary form, or lemma. For
example, the dictionary form of visualizing is visualize.

Stop words removal

Stop words are words that offer little or no semantic context to a sentence, such as and, or, and for.
Depending on the use case, the software might remove them from the structured text.
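
A minimal sketch of lemmatization and stop word removal with NLTK's WordNet lemmatizer, reproducing the "visualizing" to "visualize" example above; the token list is invented.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")     # lexicon used by the lemmatizer
nltk.download("stopwords")   # common English stop words

lemmatizer = WordNetLemmatizer()
# The pos="v" hint tells WordNet to treat the token as a verb
print(lemmatizer.lemmatize("visualizing", pos="v"))  # -> 'visualize'

# Stop word removal: drop low-content words such as 'and', 'or', 'for'
tokens = ["text", "analytics", "benefits", "businesses", "and", "customers"]
stop_words = set(stopwords.words("english"))
print([t for t in tokens if t not in stop_words])
```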

Stage 3—Text analysis


Text analysis is the core part of the process, in which text analysis software processes the text by
using different methods.

Text classification
Classification is the process of assigning tags to the text data, based on rules or on machine learning-based systems.

Text extraction

Extraction involves identifying the presence of specific keywords in the text and associating them
with tags. The software uses methods such as regular expressions and conditional random fields
(CRFs) to do this.
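
As a minimal sketch of rule-based extraction with regular expressions (one of the methods named above); the ticket text and patterns are invented for illustration.

```python
import re

# Invented support ticket text
ticket = "Customer Jane Doe reports order #48213 arrived damaged; refund of $39.99 requested."

# Rule-based extraction with regular expressions
order_ids = re.findall(r"#(\d+)", ticket)
amounts = re.findall(r"\$(\d+(?:\.\d{2})?)", ticket)
print(order_ids, amounts)  # -> ['48213'] ['39.99']
```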

Stage 4—Visualization
Visualization is about turning the text analysis results into an easily understandable format. You will
find text analytics results in graphs, charts, and tables. The visualized results help you identify
patterns and trends and build action plans. For example, suppose you’re getting a spike in product
returns, but you have trouble finding the causes. With visualization, you look for words such
as defects, wrong size, or not a good fit in the feedback and tabulate them into a chart. Then you'll know which issue is the most common and should take top priority.

What is text analytics?


Text analytics is the quantitative data that you can obtain by analyzing patterns in multiple samples
of text. It is presented in charts, tables, or graphs.

Text analysis vs. text analytics


Text analytics helps you determine if there’s a particular trend or pattern from the results of
analyzing thousands of pieces of feedback. Meanwhile, you can use text analysis to determine
whether a customer’s feedback is positive or negative.

What are Ensemble Methods?


Ensemble methods are techniques that aim at improving the accuracy of
results in models by combining multiple models instead of using a single
model. The combined models increase the accuracy of the results
significantly. This has boosted the popularity of ensemble methods
in machine learning.
Summary

- Ensemble methods aim at improving predictability in models by combining several models to make one very reliable model.
- The most popular ensemble methods are boosting, bagging, and stacking.
- Ensemble methods are ideal for regression and classification, where they reduce bias and variance to boost the accuracy of models.

Categories of Ensemble Methods

Ensemble methods fall into two broad categories: sequential ensemble techniques and parallel ensemble techniques. Sequential ensemble techniques generate base learners in a sequence, e.g., Adaptive Boosting (AdaBoost). Sequential generation encourages dependence between the base learners: the performance of the model is improved by assigning higher weights to examples that earlier learners misclassified.

In parallel ensemble techniques, base learners are generated in parallel, e.g., random forest. Parallel methods exploit the independence between base learners: averaging the predictions of many independent learners significantly reduces error.

The majority of ensemble techniques apply a single base learning algorithm, which results in homogeneity across all base learners. Homogeneous base learners are base learners of the same type, with similar qualities. Other methods apply heterogeneous base learners, giving rise to heterogeneous ensembles, in which the base learners are of distinct types.
Main Types of Ensemble Methods

1. Bagging

Bagging, short for bootstrap aggregating, is mainly applied in classification and regression. It increases the accuracy of models, most commonly using decision trees as base learners, by reducing variance to a large extent. The reduction of variance increases accuracy and curbs overfitting, which is a challenge for many predictive models.

Bagging consists of two steps: bootstrapping and aggregation. Bootstrapping is a sampling technique in which samples are drawn from the whole population (set) with replacement. Sampling with replacement randomizes the selection procedure. The base learning algorithm is then run on each sample.

Aggregation is done to incorporate all possible outcomes of the prediction and average out the randomness. Without aggregation, predictions would be less accurate because not all outcomes are taken into consideration. The aggregation is therefore based on the bootstrapped probabilities or on all outcomes of the predictive models.

Bagging is advantageous because weak base learners are combined into a single strong learner that is more stable than the individual learners. It also reduces variance, thereby limiting the overfitting of models. One limitation of bagging is that it is computationally expensive, and it can introduce more bias into models when the proper bagging procedure is ignored.
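
A minimal bagging sketch with scikit-learn's BaggingClassifier over decision trees; the dataset is synthetic, and the parameter names follow recent scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset standing in for real business data
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Bagging: train many trees on bootstrap samples and aggregate their votes
bagged = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # single base algorithm (homogeneous learners)
    n_estimators=50,                     # number of bootstrap samples / base learners
    random_state=0,
)
print("Bagging accuracy:", cross_val_score(bagged, X, y, cv=5).mean())
```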

2. Boosting

Boosting is an ensemble technique that learns from previous predictors' mistakes to make better predictions in the future. The technique combines several weak base learners to form one strong learner, significantly improving the predictability of models. Boosting works by arranging weak learners in a sequence, such that each weak learner learns from the mistakes of the learner before it, creating progressively better predictive models.

Boosting takes many forms, including gradient boosting, Adaptive Boosting (AdaBoost), and XGBoost (Extreme Gradient Boosting). AdaBoost uses weak learners in the form of decision trees, mostly with a single split, popularly known as decision stumps. AdaBoost starts by assigning equal weights to all observations, then increases the weights of observations that earlier stumps misclassified.

Gradient boosting adds predictors sequentially to the ensemble, with each new predictor correcting the errors of its predecessors, thereby increasing the model's accuracy. New predictors are fit to counter the effects of errors made by the previous predictors. Gradient descent helps the gradient booster identify problems in the learners' predictions and counter them accordingly.

XGBoost applies gradient boosting to decision trees with a strong focus on computational speed and model performance. Because model training must follow a sequence, gradient boosted machines are relatively slow to train; XGBoost's optimized implementation provides improved speed and performance.
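
A minimal boosting sketch comparing scikit-learn's AdaBoostClassifier and GradientBoostingClassifier on a synthetic dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# AdaBoost: sequential decision stumps that re-weight misclassified samples
ada = AdaBoostClassifier(n_estimators=100, random_state=0)
print("AdaBoost:", cross_val_score(ada, X, y, cv=5).mean())

# Gradient boosting: each new tree fits the residual errors of the ensemble so far
gbm = GradientBoostingClassifier(n_estimators=100, random_state=0)
print("Gradient boosting:", cross_val_score(gbm, X, y, cv=5).mean())
```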

3. Stacking

Stacking, another ensemble method, is often referred to as stacked generalization. The technique works by training a meta-learner to combine the predictions of several other learning algorithms. Stacking has been successfully applied in regression, density estimation, distance learning, and classification. It can also be used to measure the error rate involved during bagging.
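
A minimal stacking sketch with scikit-learn's StackingClassifier, combining heterogeneous base learners through a logistic regression meta-model; the dataset is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Stacking: a meta-model learns how to combine heterogeneous base learners
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(random_state=0)),
                ("svm", SVC(random_state=0))],
    final_estimator=LogisticRegression(),
)
print("Stacking:", cross_val_score(stack, X, y, cv=5).mean())
```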

Variance Reduction

Ensemble methods are ideal for reducing the variance in models, thereby increasing the accuracy of predictions. Variance is reduced when multiple models are combined to form a single prediction that draws on all of the possible predictions from the combined models. An ensemble of models combines various models so that the resulting prediction is the best possible, based on the consideration of all of their predictions.
