Organized

The document provides a comprehensive guide to learning data analytics with a focus on text analysis techniques such as DTM word clouds, TF-IDF, and bigrams. It outlines various tabs and functionalities for data input, cleaning, and analysis, emphasizing the transformation of unstructured text into structured data. Additionally, it discusses methods for identifying patterns, relationships, and key themes within text data using co-occurrence and frequency analysis.

Uploaded by

Rakshitha Mohan

Learning Data Analytics Made Easy

TABLE OF CONTENTS

1. MODEL-TEXT ANALYSIS
2. ALL ABOUT THE LEFT PANEL
3. RAW DATA INPUT / OVERVIEW TAB
4. CLEANED DATA INPUT TAB
5. DTM WORD CLOUD AND CO-OCCURRENCE TAB
6. SEARCH WORD
7. BIGRAM
8. TF-IDF WORD CLOUD TAB
9. TF-IDF CO-OCCURRENCE TAB


TEXT ANALYSIS
DTM word clouds can be used to quickly get a sense of the most common words in a
corpus of text. This can be useful for exploratory data analysis or for identifying patterns
in the data.

DTM Co-occurrence can help to identify the main topic of the document or to classify it
into a particular category. Co-occurring terms can also be used to generate
recommendations, predict the likelihood of certain events, or to identify relationships
between different concepts.

Bigrams can be used in language modelling (estimating word probabilities), information
retrieval (improving word search), sentiment analysis (identifying the sentiment of a
text), and many other tasks.

TF-IDF word clouds are commonly used in text analysis and visualization to quickly
identify important themes and concepts in a corpus. They can be helpful in identifying key
topics in large document collections, such as news articles or academic papers.

TF-IDF co-occurrence analysis represents the co-occurrence data as a co-occurrence
matrix, where each row and each column represents a word in the corpus, and each cell
holds the co-occurrence frequency of the two corresponding words. This matrix can be
visualized using techniques such as network analysis, which can reveal the
relationships between different words in the corpus.

Figure panels: DTM word cloud, DTM co-occurrence, bigram, TF-IDF word cloud, TF-IDF co-occurrence.

TEXT ANALYSIS 01
Text analysis is about parsing texts in order to extract machine-readable facts
from them. The purpose of text analysis is to create structured data out of
free-text content. The process can be thought of as slicing and dicing heaps of
unstructured, heterogeneous documents into data pieces that are easy to manage
and interpret.

The screen has two areas: the left panel (input area) and the operational analysis tab (main panel).


LEFT PANEL (INPUT AREA) 02
The left panel has four controls:
• Upload your dataset here.
• Select the ID column and the text column required for the analysis.
• Apply any changes required for cleaning the raw data.
• Vary the appearance of words and their requirement based on the analysis.
RAW DATA INPUT (UPLOADING DATA) 03
• Click on Browse.
• Select a data file in CSV format (e.g. program.csv).
• Browse to the file and select the data used to train your model for prediction.
• The first row of the dataset should contain the variable names.
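
The upload requirements above can be checked before the file is handed to the tool. The following is a minimal sketch using Python's standard csv module. The column names id and text, and the inline sample standing in for a file such as program.csv, are assumptions made for illustration and are not part of the tool itself.

```python
import csv
import io

# Hypothetical sample standing in for a file such as program.csv.
# The first row holds the variable names, as the upload step expects.
SAMPLE_CSV = """id,text
1,Sales increase in the first quarter
2,Costs increase while sales fall
"""

def load_rows(handle):
    """Read CSV rows as dictionaries keyed by the header row."""
    reader = csv.DictReader(handle)
    rows = list(reader)
    return reader.fieldnames, rows

columns, rows = load_rows(io.StringIO(SAMPLE_CSV))
print(columns)           # variable names taken from the first row
print(rows[0]["text"])   # text of the first record
```

For a file on disk, the same function could be called with open("program.csv", newline="") in place of the in-memory sample.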

OVERVIEW AND EXAMPLE DATASET TAB


This tab provides relevant study resources, tutorials, sample datasets, and a
short overview to start with, which help you understand your data correctly.
It also gives the basic idea of text analysis, along with sample data and a
description of the analysis.
INPUT DATA TAB (CLEANED DATA) 04
The ‘Input Data’ tab lets you load the cleaned data into the model for
analysis. It also shows the converted data alongside the reviewed and selected
data, so that you can compare them and see how the data was separated and
cleaned.

The picture highlights two main elements: the data input and the selected
clean data for text analysis. This tab provides a summary of the uploaded data
as well as the data segregation required to make tags and tokens for analysis.
You can also review the data and its other forms before or after cleaning, and
select the data and the text required for analysis.

Use the left panel to transform the selected variables as the analysis requires;
the data summary changes accordingly.
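
To make the idea of turning raw text into tokens concrete, here is a minimal sketch of one possible cleaning step. The source does not say which cleaning rules the tool applies, so the lowercasing, the letters-only rule, and the small stop-word list below are all assumptions chosen for illustration.

```python
import re

# Hypothetical stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "in", "of", "and", "is"}

def clean_and_tokenize(text):
    """Lowercase the text, keep runs of letters only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]

tokens = clean_and_tokenize("The sales INCREASE in the first quarter, and the costs fall!")
print(tokens)
```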
DTM WORD CLOUD TAB 05
A word cloud is a visual representation of a text in which words appear
bigger the more often they are mentioned. Word clouds are great for
visualizing unstructured text data and getting insights into trends and
patterns.
Text mining methods allow us to highlight the most frequently used keywords
in a paragraph of text.

A word cloud, also referred to as a text cloud or tag cloud, is a visual
representation of text data.

Use the left panel to modify or deal with the outliers identified here.
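
The counts behind a DTM word cloud can be sketched as follows. This is an illustration only, not the tool's implementation: the two tiny documents are invented, and the drawing of the cloud itself is left out, since only the word frequencies that determine each word's size are computed here.

```python
from collections import Counter

# Hypothetical, already-cleaned documents used only for illustration.
documents = {
    "d1": ["sales", "increase", "sales"],
    "d2": ["costs", "increase"],
}

# Vocabulary: one DTM column per distinct word, in sorted order.
vocabulary = sorted({word for tokens in documents.values() for word in tokens})

# Document-term matrix (DTM): one row of word counts per document.
dtm = {}
for doc_id, tokens in documents.items():
    counts = Counter(tokens)
    dtm[doc_id] = [counts[word] for word in vocabulary]

# Column sums give the corpus-wide frequency that sets each word's size.
totals = {
    word: sum(row[i] for row in dtm.values())
    for i, word in enumerate(vocabulary)
}
for word, count in sorted(totals.items(), key=lambda item: -item[1]):
    print(word, count)
```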

DTM CO-OCCURRENCE TAB

Co-occurrence analysis is simply the counting of paired data within a
collection unit. Co-occurrence can be described quantitatively using measures
such as correlation or mutual information.
Co-occurrence networks have been found particularly useful for analyzing
large text collections and big data: identifying the main themes and topics
(such as in a large number of social media posts), revealing biases in the
text (such as biases in news coverage), or even mapping an entire research
field.
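
Counting paired data within a collection unit can be sketched as below, taking the document as the collection unit. That choice, and the sample documents, are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tokenized documents; each document is one collection unit.
documents = [
    ["sales", "increase", "profit"],
    ["costs", "increase"],
    ["sales", "profit"],
]

pair_counts = Counter()
for tokens in documents:
    # Each unordered pair of distinct words counts once per document.
    for pair in combinations(sorted(set(tokens)), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(3))
```

Sorting the words before pairing them keeps each pair in a single canonical order, so ("profit", "sales") and ("sales", "profit") are not counted separately.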

We can take the weighted sum of each j, with p_j as the weights, to find the
expected co-occurrence. Mathematically, this is
$$\sum_{j=\max\{0,\, N_1 + N_2 - N\}}^{\min\{N_1,\, N_2\}} p_j \cdot j$$
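
The source does not define the symbols in this sum, so the following sketch rests on stated assumptions: N is taken to be the total number of documents, N1 and N2 the numbers of documents containing each of the two words, and p_j the hypergeometric probability that the two words appear together in exactly j documents when they are placed independently. Under those assumptions the weighted sum can be evaluated directly.

```python
from math import comb

def expected_cooccurrence(n_total, n1, n2):
    """Weighted sum of j with weights p_j over the feasible range of j.

    Assumption: p_j is the hypergeometric probability of exactly j
    shared documents when the two words are placed independently.
    """
    lowest = max(0, n1 + n2 - n_total)
    highest = min(n1, n2)
    arrangements = comb(n_total, n2)
    expected = 0.0
    for j in range(lowest, highest + 1):
        p_j = comb(n1, j) * comb(n_total - n1, n2 - j) / arrangements
        expected += p_j * j
    return expected

# Hypothetical corpus: 10 documents, word 1 in 4 of them, word 2 in 5.
print(expected_cooccurrence(10, 4, 5))
```

With this choice of p_j the sum equals N1 × N2 / N up to floating-point rounding, which gives a quick check on the result.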

SEARCH WORD 06
Search Word finds a particular word in the selected text or in the entire
data. It reports how often the word is repeated, lets you vary the concordance
window size, and also shows words similar to the search word.
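
A concordance search with an adjustable window can be sketched as follows. This is an illustration under assumptions, not the tool's own code: the function name, the sample sentence, and the choice to measure the window in tokens on each side are all hypothetical, and the similar-word feature is not covered.

```python
def concordance(tokens, target, window=2):
    """Return each occurrence of target with up to `window` tokens on each side."""
    hits = []
    for i, token in enumerate(tokens):
        if token == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((left, token, right))
    return hits

tokens = "sales increase in spring and costs increase in winter".split()
hits = concordance(tokens, "increase", window=2)

print(len(hits))  # number of times the word occurs
for left, word, right in hits:
    print(" ".join(left), "[" + word + "]", " ".join(right))
```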
BIGRAM 07
A bigram is a sequence of two adjacent words. The frequency distribution of
the bigrams in a string is commonly used for simple statistical analysis of
text in many applications.
A bigram model rests on the assumption that the probability of a word depends
only on the previous word. This is a Markov assumption: Markov models are the
class of probabilistic models that assume that we can predict the probability
of some future unit without looking too far into the past.
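
The bigram assumption can be illustrated with a short sketch that counts bigrams and estimates the probability of a word given the previous word. The sample sentence and the function name are hypothetical, and the estimate is the plain relative frequency with no smoothing.

```python
from collections import Counter

tokens = "the cat sat on the mat the cat ran".split()

# Count every pair of adjacent words.
bigram_counts = Counter(zip(tokens, tokens[1:]))
# Count each word in its role as the first word of a bigram.
first_word_counts = Counter(tokens[:-1])

def bigram_probability(previous, word):
    """Relative-frequency estimate of P(word | previous)."""
    if first_word_counts[previous] == 0:
        return 0.0
    return bigram_counts[(previous, word)] / first_word_counts[previous]

print(bigram_counts.most_common(1))
print(bigram_probability("the", "cat"))
```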

TF-IDF WORD CLOUD 08
TF-IDF (Term Frequency - Inverse Document Frequency) is a handy algorithm
that uses the frequency of words to determine how relevant those words are
to a given document. It's a relatively simple but intuitive approach to weighting
words, allowing it to act as a great jumping-off point for a variety of tasks.

TF-IDF gives us a way to associate each word in a document with a number that
represents how relevant that word is in that document. Documents with similar
relevant words will then have similar vectors, which is what we are looking
for in a machine learning algorithm.
For example, consider a document of 100 words in which the word cat appears
3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now,
assume we have 10 million documents and the word cat appears in one thousand
of these. Then, the inverse document frequency (i.e., idf), using a base-10
logarithm, is calculated as log(10,000,000 / 1,000) = 4. The TF-IDF weight is
the product of the two: 0.03 × 4 = 0.12.
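
The arithmetic of the cat example can be reproduced with a short sketch. It assumes a 100-word document in which cat appears 3 times (as implied by 3 / 100) and a base-10 logarithm (as implied by the result 4); the function names are hypothetical.

```python
import math

def term_frequency(term_count, document_length):
    """Share of the document's words taken up by the term."""
    return term_count / document_length

def inverse_document_frequency(total_documents, documents_with_term):
    """Base-10 log of how rare the term is across the corpus."""
    return math.log10(total_documents / documents_with_term)

tf_cat = term_frequency(3, 100)
idf_cat = inverse_document_frequency(10_000_000, 1_000)
tfidf_cat = tf_cat * idf_cat  # the TF-IDF weight is the product of tf and idf

print(tf_cat, idf_cat, tfidf_cat)
```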

Use the left panel to impute or drop the missing values identified here

TF-IDF CO-OCCURRENCE 09

The co-occurrence of two words W1 and W2 is the number of times the two words
occur together within the context window.
We can then build the co-occurrence matrix, an N x N matrix where N is the
number of distinct words in the vocabulary of the entire corpus. The matrix
for each document therefore has size N x N.
We can zoom in and click on the nodes of the graph to see details of the
various aspects included in the analysis. This gives more detail about the
words and their occurrences.
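
Building an N x N co-occurrence matrix from a context window can be sketched as below. This is an illustration under assumptions: the window counts only the tokens that follow each word, both symmetric cells are updated, and raw counts are used, so no TF-IDF weighting is applied. The function name and sample sentence are hypothetical.

```python
def cooccurrence_matrix(tokens, window=2):
    """Count how often two distinct words fall within `window` tokens of each other."""
    vocabulary = sorted(set(tokens))
    index = {word: i for i, word in enumerate(vocabulary)}
    size = len(vocabulary)
    matrix = [[0] * size for _ in range(size)]
    for i, word in enumerate(tokens):
        # Look only ahead, then update both cells so the matrix stays symmetric.
        for other in tokens[i + 1:i + 1 + window]:
            if other != word:
                matrix[index[word]][index[other]] += 1
                matrix[index[other]][index[word]] += 1
    return vocabulary, matrix

tokens = "sales increase costs increase profit".split()
vocabulary, matrix = cooccurrence_matrix(tokens, window=1)

print(vocabulary)
for word, row in zip(vocabulary, matrix):
    print(word, row)
```

Each non-zero cell corresponds to an edge between two word nodes in a network graph, with the count as a possible edge weight.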

The example shown highlights the word "increase" together with the words it
co-occurs with.
