Lec 2 Getting To Know Data EDA

The document discusses Exploratory Data Analysis (EDA), highlighting its purpose and benefits, such as understanding data characteristics, identifying missing values, and discovering patterns. It covers the types of data and attributes, including univariate, bivariate, and multivariate data, as well as nominal, ordinal, and numeric attributes. Additionally, it explains statistical measures like central tendencies and spread, which are essential for summarizing and interpreting data effectively.


Big Data Analytics

Getting to Know Data & Exploratory Data Analysis

EDA: Purpose & Benefits


Size, Dimension, and Resolution of Data
Types of Attributes
Statistical EDA
Measures of Central Tendencies and Spread
Bivariate EDA: Correlation, Contingency Table
Graphical EDA
Types of Diagrams

Imdad ullah Khan

Imdad ullah Khan (LUMS) Getting to know Data & EDA 1 / 64


Exploratory Data Analysis (EDA): Purpose and Benefits

EDA: Initial investigation of data using summary statistics and diagrams

Objectives of EDA are to

understand the data (what it is, where it comes from, what it represents, the kinds of values, specific characteristics of the data)
find out whether there are missing values (and how to deal with them)
spot anomalies (are there outliers?)
discover patterns (what does the data look like?)
understand relationships between features (measure similarity, distance and relationship type)
check our assumptions
visually describe the data



EDA: Purpose and Benefits

Preliminary exploration and inspection of data is essential for analysis


It guides preprocessing steps
It gives a clear picture of data sizes, which helps in selecting the right
data structures, tools and even modeling strategies
It can help reduce data size (number of dimensions or records)



Data object and Attribute

Data object
represents an entity in the data set
also called data item, point, instance, example, sample, row, observation
e.g. a patient, movie, student, customer, product, book, tweet
described by a set of attributes

Attribute
is a data field, representing a feature/characteristic of data objects
also called variable, feature, dimension, column, coordinate, field
e.g. reaction to a test, genre/director, course, address, price/category,
author, publisher, word



Size and dimensions of data

Size of Data refers to number of data objects


Dimension of Data refers to number of attributes

Sparsity in Data
If most of the feature values are missing, then the data is called sparse

Missing values could be represented as NaN, blank, -, 0


This could be a problem for many statistical methods
For efficient computation, we can use libraries for sparse data,
e.g. sparse matrix multiplication, sparse storage schemes



Resolution of Data

Different resolutions reveal different patterns

If the resolution is too fine, a pattern may be buried in noise
If the resolution is too coarse, a pattern may disappear
See the number of bins in histograms below



Types of Data

Types of data based on number of attributes

Univariate Data
Bivariate Data
Multivariate Data



Types of Data

Univariate: Consists of only one feature per observation. Analysis deals with only one quantity that changes
Heights (cm)
164
167.3
170
174.2
178
180
186

What is the average height?
How much do the values deviate from the average height?



Types of Data

Bivariate: Involves two different features per observation


Analysis of this type of data deals with comparisons, relationships,
causes and explanations

Temperature (°C) Ice Cream Sales


20 2000
25 2500
35 5000
43 7800

Are the temperature and ice cream sales related/dependent?


As temperature increases, sales also increase



Types of Data

Multivariate: Objects are described by more than 2 features


To see if one or more of them are predictive of a certain outcome
The predictive variables are independent variables and the outcome is
the dependent variable

Roll Num CS100 SS101 MT200 MGMT240 Major


19100115 A B B C CS
19100120 B A B C PHY
19100122 B B C A CS
19100126 C A C A EE
19100127 B A C C CS
19100133 C B A B PHY
19100135 C C A C Maths



Types of Attributes



Types of Attributes

Roll Num Gender Grade Age Major


19100115 Male B 23 CS
19100120 Male A 22 PHY
19100122 Female B 21 CS
19100126 Male C 19 EE
19100127 Female A 21 CS
19100133 Female B 20 PHY
19100135 Male C 22 Maths

Nominal/Categorical Attributes
Ordinal Attributes
Numeric Attributes



Types of Attributes: Nominal/Categorical
Possible values are symbols, labels or names of things, categories
e.g. gender, major, state, color
Describe a feature qualitatively and values have no order
Not quantitative, arithmetic operations can’t be performed on them
male − female = ?? green + blue = ??
Can be coded by numbers (numeric symbols), e.g. postal codes, roll numbers
Can compute: frequency of values and the most frequent value (mode)
Cannot compute: middle value (median), average value (mean) of an attribute
Binary Attribute: special case of nominal, e.g. true/false, Pass/Fail, 0/1
Symmetric: Both symbols carry the same weight, e.g. gender
Asymmetric: The two symbols are not equally important, e.g. Pass/Fail



Types of Attributes: Ordinal Attributes

Possible values have a meaningful order
Grades: A, B, C, D
Serving Sizes: Small, Medium, Large
Ratings: poor, average, excellent
No quantified difference between two levels
A is higher/better than B, but
we cannot quantify how much higher A is than B, or
whether the difference between A and B is the same as the difference between B and C
Can be obtained by discretizing numeric quantities (data reduction)

Can compute: frequency of values, the most frequent value (mode) and the middle value (median)
Cannot compute: average value (mean) of an attribute



Types of Attributes: Numeric Attributes
Quantitative and measurable
can quantify the difference between two values
e.g. temperature, age, number of courses, height, years of experience
Can compute: frequency of values and the most frequent value (mode), middle value (median), average value (mean) of an attribute
Discrete Numeric Attributes
values come from a finite or countably infinite set
Continuous Numeric Attributes
values are real (continuous)
Interval-Scaled: No true zero point, ratios have no meaning
e.g. Temperature in Celsius: 30 °C is not twice as hot as 15 °C
Ratio-Scaled: Well-defined zero point, ratios are meaningful
e.g. Temperature in Kelvin: 30 K is twice as hot as 15 K
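
A small sketch of the interval vs. ratio distinction, using the temperature example. The helper name `celsius_to_kelvin` and the chosen values are illustrative assumptions; the only fact used is that 0 °C = 273.15 K.

```python
# Offset between the Celsius and Kelvin scales (0 °C = 273.15 K).
KELVIN_OFFSET = 273.15

def celsius_to_kelvin(c):
    """Convert a temperature from Celsius (interval scale) to Kelvin (ratio scale)."""
    return c + KELVIN_OFFSET

c_hot, c_cold = 30.0, 15.0

# Differences are meaningful on both scales and agree with each other.
difference_c = c_hot - c_cold
difference_k = celsius_to_kelvin(c_hot) - celsius_to_kelvin(c_cold)

# The ratio of Celsius readings is 2.0, but it has no physical meaning;
# the ratio on the Kelvin scale is only about 1.05.
naive_ratio = c_hot / c_cold
kelvin_ratio = celsius_to_kelvin(c_hot) / celsius_to_kelvin(c_cold)

print(difference_c, round(difference_k, 2))
print(naive_ratio, round(kelvin_ratio, 3))
```

The differences match on both scales, while the ratio changes with the choice of zero point, which is why ratios are only interpreted on ratio-scaled attributes.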



Statistical EDA



Statistical Description of Data

Estimates that give an overall picture of data


Summary statistics are numbers that summarize properties of data
Typical values of variables (features/attributes)
Spread and distribution of values
Dependencies and correlations among variables



Measures of Central Tendencies

These measures describe the location of data


location of concentration or middle of data

Data is “distributed” around this “center”


Computed for each attribute
Three common types of locations
Mode
Mean
Median

These measures do not give information regarding


extreme values in data
distribution or spread of the data



Frequency

Nominal and ordinal attributes are generally described with frequencies

The frequency of a value is the number of times the value occurs in the dataset

Sometimes we use the fraction or percentage of times the value appears (relative frequency)
The relative frequencies form the (empirical) probability mass function
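
As a minimal sketch, frequencies and relative frequencies can be computed with the standard library. The list `majors` reuses the Major column of the student table from the earlier slides; the variable names are assumptions made for illustration.

```python
from collections import Counter

# Major column of the example student table (a nominal attribute).
majors = ["CS", "PHY", "CS", "EE", "CS", "PHY", "Maths"]

# Frequency: number of times each value occurs.
freq = Counter(majors)

# Relative frequency: fraction of observations taking each value.
n = len(majors)
rel_freq = {value: count / n for value, count in freq.items()}

for value, count in freq.most_common():
    print(f"{value:6s} {count}  {rel_freq[value]:.1%}")
```

The relative frequencies sum to 1, which is what lets them be read as an empirical probability mass function.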



Measures of Central Tendencies: Mode

For the location of nominal and ordinal attributes one can use the most frequent value

Mode is the most frequent element

A dataset can have more than one mode
unimodal (one mode in data)
multi-modal (bimodal, trimodal): more than one mode in data

Not the same as the majority element (a value with frequency more than 50%)
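
The difference between mode and majority element can be sketched as follows. `majority_element` is a hypothetical helper written for this illustration (it assumes a non-empty input); `multimode` is from Python's `statistics` module (Python 3.8+). The grades are the Grade column of the example student table.

```python
from collections import Counter
from statistics import multimode

def majority_element(values):
    """Return the value whose frequency exceeds 50%, or None if there is none.

    Assumes `values` is non-empty.
    """
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else None

grades = ["B", "A", "B", "C", "A", "B", "C"]

modes = multimode(grades)            # all most-frequent values
majority = majority_element(grades)  # B occurs 3 times out of 7, which is below 50%
bimodal = multimode([1, 1, 2, 2, 3])

print(modes, majority, bimodal)
```

Here B is the mode but not a majority element, and the second list is bimodal.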



Measures of Central Tendencies: Mean

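For a dataset $X = \{x_1, x_2, \ldots, x_n\}$

(Arithmetic) Mean is the average of the data set
▷ This definition readily extends to higher dimensional data
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Weighted Mean
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
Harmonic Mean
$$\bar{x} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$
Geometric Mean
$$\bar{x} = \left( \prod_{i=1}^{n} x_i \right)^{1/n}$$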
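
A short sketch of the four means on a toy dataset. The data and weights are made up for illustration, and `weighted_mean` is a hypothetical helper; `fmean`, `harmonic_mean` and `geometric_mean` are from Python's `statistics` module (Python 3.8+).

```python
from statistics import fmean, geometric_mean, harmonic_mean

def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

x = [1.0, 2.0, 4.0]        # illustrative positive values
w = [1, 1, 2]              # illustrative weights

arithmetic = fmean(x)              # (1 + 2 + 4) / 3
weighted = weighted_mean(x, w)     # (1 + 2 + 8) / 4
harmonic = harmonic_mean(x)        # 3 / (1/1 + 1/2 + 1/4)
geometric = geometric_mean(x)      # (1 * 2 * 4) ** (1/3)

print(arithmetic, weighted, harmonic, geometric)
```

For positive data the harmonic mean is at most the geometric mean, which is at most the arithmetic mean; the toy values follow that ordering.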



Other Types of Mean
Arithmetic mean is sensitive to outliers ▷ unstable statistic
Just one very high/low value (think ±∞) makes mean very high/low
2.5 2.5 3 3.5 3.5 3.5 3.5 4 4 4 4.5 4.5 4.5 5 5 5.5 5.5 6 98 99

Mean = 13.57

Trimmed Mean: Ignore k% of values at both extremes to compute the mean

2.5 2.5 3 3.5 3.5 3.5 3.5 4 4 4 4.5 4.5 4.5 5 5 5.5 5.5 6 98 99

With k = 10% (the two smallest and the two largest of the 20 values are dropped): Mean = 4.34
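
A minimal sketch of the trimmed mean on the 20 values above. The function name `trimmed_mean` and the choice k = 10% are assumptions; dropping two values at each end is what reproduces the trimmed value of about 4.34.

```python
from statistics import fmean

def trimmed_mean(values, k_percent):
    """Mean after dropping k_percent of the values at each extreme."""
    data = sorted(values)
    drop = int(len(data) * k_percent / 100)  # number of values cut per side
    kept = data[drop:len(data) - drop]
    return fmean(kept)

data = [2.5, 2.5, 3, 3.5, 3.5, 3.5, 3.5, 4, 4, 4,
        4.5, 4.5, 4.5, 5, 5, 5.5, 5.5, 6, 98, 99]

plain = fmean(data)                # pulled up by the two outliers 98 and 99
trimmed = trimmed_mean(data, 10)   # drops 2.5, 2.5 and 98, 99

print(plain, trimmed)
```

The two outliers move the plain mean far away from where most of the data lies, while the trimmed mean stays close to the bulk of the values.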



Measures of Central Tendencies: Median

Median is the middle value of a sorted dataset

For an odd number of values it is the middle element; for an even number of values it is the average of the two middle elements
Median is less sensitive to outliers than the mean
Median is good for asymmetric distributions and for data with outliers

For the 20 values used for the trimmed mean: Median = 4.25, Mean = 13.57

Various definitions are possible for the median of higher dimensional data
Mean together with variance (see below) has nice properties
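
The odd/even rule can be sketched directly. The function below is an illustrative implementation (the standard library's `statistics.median` behaves the same way); the data are the 20 values from the trimmed-mean example.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2

data = [2.5, 2.5, 3, 3.5, 3.5, 3.5, 3.5, 4, 4, 4,
        4.5, 4.5, 4.5, 5, 5, 5.5, 5.5, 6, 98, 99]

with_outliers = median(data)          # even n: average of the 10th and 11th values
without_outliers = median(data[:-2])  # same data without 98 and 99
odd_case = median([3, 1, 2])          # odd n: the middle element

print(with_outliers, without_outliers, odd_case)
```

Removing the two outliers changes the median only slightly, in contrast to the mean.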



Measures of Spread

Location measures do not tell anything about extremes or spread (how extreme are the extremes)
Measures of spread describe the distribution of data

Max
Min
Range := max - min
Midrange := average of min and max
Inter-Quartile Range := 3rd quartile - 1st quartile
Figure: distributions with low spread, mid-spread and high spread
Variance and Standard Deviation



Quantile

Quantiles are points taken at regular intervals so that the data is divided into roughly equal-sized consecutive subsets

The ith q-quantile is a data point x such that ∼ i/q fraction of points
are less than x and ∼ (q−i)/q fraction of points are greater than x
Median is the first 2-quantile
3rd quartile := 3rd 4-quantile := 75th percentile



Measures of Spread



Five-Number Summary

Five-number summary (elementary EDA of numeric univariate data)

minimum (0th percentile)
lower quartile (25th percentile)
median (50th percentile)
upper quartile (75th percentile)
maximum (100th percentile)

interquartile range := upper quartile − lower quartile
data range := maximum − minimum
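
A sketch of the five-number summary for the heights data from the univariate example. `five_number_summary` is a hypothetical helper; it uses `statistics.quantiles` with `method="inclusive"`, which is one of several quartile conventions, so other tools may report slightly different quartiles.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile convention."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

heights = [164, 167.3, 170, 174.2, 178, 180, 186]  # heights in cm

lo, q1, med, q3, hi = five_number_summary(heights)
iqr = q3 - q1          # interquartile range
data_range = hi - lo   # data range

print(lo, q1, med, q3, hi)
print(iqr, data_range)
```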



EDA: Measures of Spread

Variance: Measures the deviation in values relative to the mean
$$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$$
Variance is the mean squared deviation from the mean
Squared to avoid cancellation of +ve and −ve deviations
Mean deviation could be 0 for data with significant spread
The mean and the average (signed) deviation from the mean are both 0 for each of
{−5, −10, 5, 10} and {−100, −50, 50, 100}
▷ There is significantly more spread in the latter data
Mean Absolute Deviation:
$$\mathrm{MAD} := \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$$

Variance is easy to compute and has useful mathematical properties
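
The two example datasets can be checked numerically. This is a sketch: `mean_deviation` and `mean_absolute_deviation` are hypothetical helper names, and `pvariance`/`pstdev` are the population (divide by n) versions from Python's `statistics` module, matching the formulas on the slides.

```python
from statistics import fmean, pstdev, pvariance

def mean_deviation(values):
    """Average signed deviation from the mean (always 0 up to rounding)."""
    m = fmean(values)
    return fmean(v - m for v in values)

def mean_absolute_deviation(values):
    """MAD: average absolute deviation from the mean."""
    m = fmean(values)
    return fmean(abs(v - m) for v in values)

small = [-5.0, -10.0, 5.0, 10.0]
large = [-100.0, -50.0, 50.0, 100.0]

for data in (small, large):
    print(fmean(data), mean_deviation(data),
          pvariance(data), pstdev(data), mean_absolute_deviation(data))
```

Both datasets have mean 0 and mean deviation 0, but variance, standard deviation and MAD are all much larger for the second one.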



Measures of Spread

Standard Deviation

Variance has a different unit from that of the original data (the square of the original unit)

Standard deviation also measures deviation in values relative to the mean
Standard deviation is the square root of variance
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$$
Standard deviation restores the measure to the original unit of data



Normal Distribution (Bell-Curve)

For a normal distribution, a known fraction of values falls within k st-dev of the mean

About 68% lie within k = 1 st-dev ($\bar{x} \pm 1\sigma$)
About 95% lie within k = 2 st-dev ($\bar{x} \pm 2\sigma$)
About 99.7% lie within k = 3 st-dev ($\bar{x} \pm 3\sigma$)



EDA: Three-Sigma Rule - The Empirical Rule

For any distribution of data (with finite variance), there are guarantees that a certain fraction of
values must fall within k st-dev of the mean (Chebyshev's inequality: at least $1 - 1/k^2$)
At least 75% must lie within k = 2 st-dev ($\bar{x} \pm 2\sigma$)
At least ∼ 89% must lie within k = 3 st-dev ($\bar{x} \pm 3\sigma$)
At least ∼ 93% must lie within k = 4 st-dev ($\bar{x} \pm 4\sigma$)
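
The distribution-free guarantees above match the bound 1 − 1/k² (Chebyshev's inequality), while the normal-distribution percentages are exact coverage probabilities. A small sketch comparing the two; the function names are assumptions, and `NormalDist` is from Python's `statistics` module (Python 3.8+).

```python
from statistics import NormalDist

def chebyshev_lower_bound(k):
    """Minimum fraction of values within k standard deviations, for any distribution."""
    return 1 - 1 / k**2

def normal_coverage(k):
    """Fraction of a normal distribution within k standard deviations of its mean."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3, 4):
    print(k, round(chebyshev_lower_bound(k), 4), round(normal_coverage(k), 4))
```

The normal coverage is always above the Chebyshev bound, since the bound has to hold for every distribution, including heavy-tailed ones.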



Bivariate Measures

Used for bivariate data or pairs of attributes, more detail later

Nominal or Ordinal Attributes


Contingency Table
χ² statistic

Numeric Attributes
Covariance
Correlation
Correlation Matrix



Contingency Table
Contingency table summarizes data with two nominal or ordinal features
Used to determine whether the variable pair is correlated (χ²-test)

(nominal) A and B taking values in $\{a_1, a_2, \ldots, a_p\}$ and $\{b_1, b_2, \ldots, b_q\}$

$f_{ij}$ : frequency of joint occurrence of $(a_i, b_j)$
▷ observed frequency of the joint event $(A = a_i, B = b_j)$

Contingency Table C (the entry in row b_j and column a_i is f_ij):

         a_1    a_2    ...    a_p
  b_1    f_11   f_21   ...    f_p1
  b_2    f_12   f_22   ...    f_p2
  ...    ...    ...    f_ij   ...
  b_q    f_1q   f_2q   ...    f_pq
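
A minimal sketch of building a contingency table from raw observations. The (Gender, Grade) pairs are taken from the example student table; `contingency_table` is a hypothetical helper, and here the values of A index the rows (the transpose of the layout drawn on the slide).

```python
from collections import Counter

def contingency_table(pairs):
    """Return (values of A, values of B, table) where table[i][j] = f_ij."""
    counts = Counter(pairs)
    a_values = sorted({a for a, _ in pairs})
    b_values = sorted({b for _, b in pairs})
    table = [[counts[(a, b)] for b in b_values] for a in a_values]
    return a_values, b_values, table

# (Gender, Grade) for each student in the example table.
records = [("Male", "B"), ("Male", "A"), ("Female", "B"), ("Male", "C"),
           ("Female", "A"), ("Female", "B"), ("Male", "C")]

genders, grades, table = contingency_table(records)
print(grades)
for gender, row in zip(genders, table):
    print(gender, row)
```

Pairs that never occur (here Female with grade C) get frequency 0, because `Counter` returns 0 for missing keys.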



χ²-test for two attributes A and B

χ²-statistic: A “correlation” between two nominal attributes A and B taking values in $\{a_1, a_2, \ldots, a_p\}$ and $\{b_1, b_2, \ldots, b_q\}$

$f_{ij}$ : frequency of joint occurrence of $(a_i, b_j)$
▷ observed frequency of the joint event $(A = a_i, B = b_j)$
The expected frequency $e_{ij}$ of the joint event $(A = a_i, B = b_j)$,
under the independence assumption
Estimating probability (and similarly $P_{b_j}$), where N is the total number of observations:
$$P_{a_i} = \Pr\{A = a_i\} = \frac{\sum_{j=1}^{q} f_{ij}}{N}, \qquad N = \sum_{i=1}^{p} \sum_{j=1}^{q} f_{ij}$$
$$e_{ij} = P_{a_i} \cdot P_{b_j} \cdot N$$
The χ² value (Pearson’s χ²-statistic) is
$$\chi^2 = \sum_{i=1}^{p} \sum_{j=1}^{q} \frac{(f_{ij} - e_{ij})^2}{e_{ij}}$$

Large χ² values indicate that the variables are related
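
The computation can be sketched in a few lines. Assumptions: N is taken to be the total number of observations, every row and column total is non-zero, and `chi_square` is a hypothetical helper that returns only the statistic (no p-value). The second table is the Gender by Grade table of the example students.

```python
def chi_square(table):
    """Pearson's chi-square statistic for a table of observed counts f_ij."""
    n = sum(sum(row) for row in table)              # total number of observations N
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, f_ij in enumerate(row):
            # Expected count under independence: P_ai * P_bj * N.
            e_ij = row_totals[i] * col_totals[j] / n
            chi2 += (f_ij - e_ij) ** 2 / e_ij
    return chi2

independent = [[10, 20], [20, 40]]    # rows are proportional: no association
gender_by_grade = [[1, 2, 0],         # Female: grades A, B, C
                   [1, 1, 2]]         # Male:   grades A, B, C

chi2_independent = chi_square(independent)
chi2_students = chi_square(gender_by_grade)
print(chi2_independent, round(chi2_students, 3))
```

With only seven students the expected counts are far too small for a reliable test; the second table only illustrates the arithmetic.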



Covariance and Correlation

Covariance and correlation are helpful in understanding the dependency/relationship between two numeric variables

Covariance between two variables $x = \{x_1, x_2, \ldots, x_n\}$ and $y = \{y_1, y_2, \ldots, y_n\}$ with means $\bar{x}$ and $\bar{y}$, resp., is defined as
$$\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n}$$
▷ Covariance reveals the “proportionality” between variables
Note when $x_i$ and $y_i$ both are greater or smaller than their respective
means, $(x_i - \bar{x})(y_i - \bar{y})$ is positive and vice-versa
cov(x, y) < 0 ⟹ inverse proportionality
cov(x, y) > 0 ⟹ direct proportionality
cov(x, y) = 0 ⟹ no linear relation



Covariance and Correlation

Some properties of covariance that readily follow from definition

cov(x, y) = cov(y, x)
cov(x, x) = var(x)
If x and y are independent, then cov(x, y) = 0
For constant a and b
cov(x, a) = 0
cov(ax, by) = ab cov(x, y)
cov(x + a, y + b) = cov(x, y)

cov(x, y + z) = cov(x, y) + cov(x, z)
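
Some of these properties can be checked numerically. This is a sketch using the temperature and ice cream sales data from the bivariate example; `cov` is an illustrative implementation of the population covariance (divide by n), and the constants 2, 3, 5 and 7 are arbitrary.

```python
from math import isclose
from statistics import fmean, pvariance

def cov(x, y):
    """Population covariance: mean of (x_i - mean(x)) * (y_i - mean(y))."""
    mx, my = fmean(x), fmean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

temperature = [20.0, 25.0, 35.0, 43.0]
sales = [2000.0, 2500.0, 5000.0, 7800.0]

c = cov(temperature, sales)
symmetric = isclose(c, cov(sales, temperature))
self_cov_is_var = isclose(cov(temperature, temperature), pvariance(temperature))
scaling = isclose(cov([2 * t for t in temperature], [3 * s for s in sales]), 2 * 3 * c)
shifting = isclose(cov([t + 5 for t in temperature], [s - 7 for s in sales]), c)

print(c, symmetric, self_cov_is_var, scaling, shifting)
```

The covariance is positive, in line with sales increasing with temperature, but its magnitude depends on the units of both variables.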



Covariance and Correlation

Correlation
Covariance depends on magnitude and scale of variable x and y
Correlation quantifies how strongly two variables are linearly related

$$r_{xy} = \mathrm{corr}(x, y) = \frac{\mathrm{cov}(x, y)}{\sigma_x \, \sigma_y}$$
$$-1 \le \mathrm{corr}(x, y) \le 1$$

It is not affected by changes in scale of variables x and y

corr (x, y) = −1 =⇒ perfect negative linear association


corr (x, y) = 1 =⇒ perfect positive linear association
corr (x, y) = 0 =⇒ no linear association
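
A sketch of the correlation coefficient on the temperature and ice cream sales data, also checking that it does not change when a variable is rescaled. `corr` is an illustrative implementation (the 1/n factors of covariance and standard deviations cancel); the Fahrenheit conversion is just an example of a change of scale.

```python
from math import isclose, sqrt
from statistics import fmean

def corr(x, y):
    """Pearson correlation: cov(x, y) / (sigma_x * sigma_y)."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

temperature_c = [20.0, 25.0, 35.0, 43.0]
sales = [2000.0, 2500.0, 5000.0, 7800.0]
temperature_f = [t * 9 / 5 + 32 for t in temperature_c]  # same data, different scale

r = corr(temperature_c, sales)
r_rescaled = corr(temperature_f, sales)
r_perfect = corr(temperature_c, [2 * t + 1 for t in temperature_c])
r_negative = corr(temperature_c, [-t for t in temperature_c])

print(round(r, 3), isclose(r, r_rescaled), r_perfect, r_negative)
```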



Correlation
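Figure: scatter plots of pairs of variables (x- and y-axes represent the variables), with the correlation shown above each panel. First row: 1, 0.8, 0.4, 0, −0.4, −0.8, −1; second row: 1, 1, 1, 1, −1, −1, −1; third row: 0 in every panel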


Correlation matrix

For multivariate numeric data, the correlation matrix is a table of pairwise correlation coefficients between variables
Each cell shows the correlation between two variables
Used to summarize data, as an input into a more advanced analysis,
and as a diagnostic for advanced analyses
Also used to remove redundant variables



Graphical EDA



Diagrammatic Representations of Data

Easy to understand: Numbers do not tell the whole story. Diagrammatic representation of data makes it easier to understand
Simplified Presentation: Large volumes of complex data can be
represented in a simplified and intelligible diagram
Reveals hidden facts: Diagrams help in bringing out the facts and
relationships between data not noticeable in raw/tabular form
Easy to compare: Diagrams make it easier to compare data



Visualizing Data for Insight

Purpose of Graphical EDA: To reveal underlying structures, detect outliers and anomalies, and understand patterns within the data through visual methods.
Simplifies complex quantitative information.
Facilitates faster comprehension and decision-making.
Helps in spotting trends, patterns, and outliers.
Common Tools: Histograms, Box plots, Scatter plots, etc.



Types of Diagrams

We will briefly discuss and use the following types of diagrams


▷ More on importance of visualization later

Bar Charts
Histogram ▷ and also overlapping histogram
Box Plot ▷ and also side-by-side box-plots
Scatter Plot ▷ and scatter plot matrix
Heat map
Line Graph
Parallel Axis Plot
Word-Cloud



Bar charts

Generally used for nominal and ordinal variables

Different bars (usually colored/shaded differently) for distinct values
(levels, categories, symbols) of the variable
The height of each bar represents the frequency of the corresponding symbol (value)
Can reveal variables that have no or limited information, e.g. constants
Note that we can use pie charts for the same purpose too
Humans perceive differences in lengths better than in angles



Histograms
Represent distribution of data in a numeric/continuous variable
(estimates probability distribution of a numeric variable)
Group values by a series of intervals (bins - usually consecutive
non-overlapping subintervals covering range of data)
Plot the number of values falling in each bin (represented by the
height of the bar)
Normalized histogram shows proportion of values in each bin



Histograms

A histogram with appropriate number/length of bins reveals

Where is the data located


Where/what are the extremes
What is the distribution of the data
How the data is spread out
Whether the distribution is symmetric or skewed (left or right)
Whether the data is unimodal, bimodal or more
Can also detect outliers in the data if any



Histograms

Number and sizes of bins are important considerations


Bins do not have to be of equal sizes
For unequal bin sizes, the height of the bar is not the frequency of values
in the bin; it is the frequency density
Area of the bar is proportional to the frequency
Frequency density is the number of items per unit of the x-axis variable

Too many bins give too much unnecessary detail (show too much noise)
Too few bins reveal almost nothing and obscure the underlying patterns
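
The binning step behind a histogram can be sketched without any plotting. `histogram` is a hypothetical helper, and the data and bin edges are made up for illustration; it returns both the count and the frequency density (count divided by bin width) for each bin.

```python
from bisect import bisect_right

def histogram(values, edges):
    """Count values per bin and compute frequency density (count / bin width).

    Bins are [edges[i], edges[i+1]); the last bin also includes its right edge.
    Values outside [edges[0], edges[-1]] are ignored.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue
        i = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[i] += 1
    densities = [c / (hi - lo) for c, lo, hi in zip(counts, edges, edges[1:])]
    return counts, densities

values = [1, 2, 2, 3, 5, 8, 9]

counts, densities = histogram(values, [0, 4, 10])          # two unequal bins
fine_counts, _ = histogram(values, list(range(0, 11)))     # ten unit-width bins
coarse_counts, _ = histogram(values, [0, 10])              # a single bin

print(counts, densities)
print(fine_counts)
print(coarse_counts)
```

With unequal bins the first bin holds more values and also has the higher density; the ten-bin version is mostly isolated spikes, and the single bin hides everything except the total count.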



Histograms



Overlapping Histograms

Useful in observing distribution of values with respect to a nominal variable



Box Plots

Another way of displaying the distribution of data (somewhat)


Box-Plots or Box and Whisker diagrams


Top and bottom lines of the box are the 3rd and 1st quartiles of data
Length of the box is the inter-quartile range (midspread)
The line in the middle of the box is the median of data
The top whisker denotes the largest value in the data that is within
1.5 times the midspread above the 3rd quartile (at most Q3 + 1.5 · IQR)
Similarly, the bottom whisker denotes the smallest value that is at least Q1 − 1.5 · IQR
Anything above or below the whiskers is considered an outlier
Relative location of the median within the box tells us about the data
distribution
We find out at which end the outliers are, if any
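
The quantities drawn in a box plot can be computed directly. This is a sketch: `box_plot_stats` is a hypothetical helper, it uses the 1.5 · IQR whisker rule with the `"inclusive"` quartile convention of `statistics.quantiles` (plotting libraries may use a different convention), and the data are the 20 values from the trimmed-mean example.

```python
from statistics import quantiles

def box_plot_stats(values):
    """Quartiles, whisker ends (1.5 * IQR rule) and outliers of a dataset."""
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1, "median": med, "q3": q3,
        "whisker_low": inside[0],     # smallest value not below Q1 - 1.5 * IQR
        "whisker_high": inside[-1],   # largest value not above Q3 + 1.5 * IQR
        "outliers": outliers,
    }

data = [2.5, 2.5, 3, 3.5, 3.5, 3.5, 3.5, 4, 4, 4,
        4.5, 4.5, 4.5, 5, 5, 5.5, 5.5, 6, 98, 99]

stats = box_plot_stats(data)
print(stats)
```

Both outliers lie above the top whisker, so the box plot would show them at the upper end only.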



Box Plots

Can get some idea of skew by observing the shorter whisker
Conventions for whiskers vary; sometimes the top whisker is the 90th percentile
Uni-modality and multi-modality type information is generally not
clear from box plots



Side-by-side Box Plots

Extremely useful for comparisons of two or more variables.


To compare numeric variables, we draw their box-plots in parallel



Side-by-side Box Plots

Side by side groupwise box plots are extremely useful


Groups are based on values of a categorical variable
It reveals whether a factor (the categorical variable) is important
It addresses whether the location of data differs between groups
To some extent it also reveals whether distribution and variation
differ between groups
Overlapping histograms are more suitable for the latter question,
unless there is too much overlap



Scatter Plot

A scatter plot is the best way to visualize two-dimensional numeric data

It directly represents the two-dimensional observations as points in ℝ².
Plot one variable on the x-axis and the other on the y-axis

It shows how the two variables are related to each other


▷ reveals correlations between the variables
If one or both variables are highly skewed, then scatter plots are hard
to examine, as the bulk of the data is concentrated in a small part of the plot
For this we should apply some kind of transformation (explained later)
to one or both of the variables
log-scaled plots can also be used in such cases



Scatter Plot



Scatter Plot Matrix
Pairwise scatter plots, pairwise correlations and individual histograms
or density plots
Summarize the relationships of all pairs of numerical attributes



Scatter Plot Matrix
Scatter plot (matrix) can be combined with information in a nominal
attribute encoded through color or marker shape



Heat Map

Presents pairwise relationship between attributes of multivariate data


Provides a numerical value of the correlation between each pair of variables
Also provides an easy-to-understand visual representation of those
numbers (color shades)
Darker red showing high correlation
Dark blue showing no or negative correlation
Can be used to visualize any matrix



Line graphs
Line graphs are used for time series e.g. player’s yearly average,
student’s semester gpa or hourly energy consumption
Two or more time series can be compared in different colors or
markers (legend should be provided)



Parallel Axis Plot



Word-Cloud

Very useful in text analytics


A word cloud shows words used in a text corpus (collection of documents)
with size of words proportional to their importance (e.g. tf-idf)

Quite clear that the word cloud on left is for a collection of articles about US politics,
political news, while that on the right seems a corpus of astronomy/astrophysics
