Unit 4

Uploaded by

abernakumari87

AD3301 DATA EXPLORATION AND VISUALIZATION

UNIT - 4
BIVARIATE ANALYSIS
Presented By
Dr R Murugadoss
Professor
Artificial Intelligence & Data Science
- Relationships between Two Variables
- Percentage Tables
- Analyzing Contingency Tables
- Handling Several Batches
- Scatterplots and Resistant Lines
- Transformations
Two variables are related if knowing one gives you information
about the other. For example, height and weight are related;
people who are taller tend to be heavier.

Association Examples:
◦ Smoking is associated with heart disease.
◦ Weight is associated with height.
◦ Income is associated with education.
Both variables are categorical. We analyze the association by comparing conditional
proportions (percentages) and summarize the data in a contingency table.
Examples of categorical variables are gender and class standing.
Both variables are quantitative. To analyze this situation we consider how one
variable, called a response variable, changes in relation to changes in the other variable
called an explanatory variable. Graphically we use scatterplots to display two
quantitative variables. Examples are age, height, weight (i.e. things that are measured).
One variable is categorical and the other is quantitative, for instance height and
gender. These are best compared by using side-by-side boxplots to display any
differences or similarities in the center and variability of the quantitative variable (e.g.
height) across the categories (e.g. Male and Female).
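The comparison that side-by-side boxplots make can also be sketched numerically. The example below is a minimal illustration on made-up heights (the data and the column names gender and height are assumptions, not taken from the slides); it computes the center (median) and variability (interquartile range) of the quantitative variable within each category.

```python
import pandas as pd

# Made-up illustrative data: heights in cm for two categories.
people = pd.DataFrame({
    "gender": ["Male", "Male", "Male", "Male",
               "Female", "Female", "Female", "Female"],
    "height": [170, 175, 180, 185, 158, 162, 166, 170],
})

# Center (median) and spread (interquartile range) of height per category.
grouped = people.groupby("gender")["height"]
summary = pd.DataFrame({
    "median": grouped.median(),
    "iqr": grouped.quantile(0.75) - grouped.quantile(0.25),
})
print(summary)
```

With matplotlib installed, people.boxplot(column="height", by="gender") should draw the corresponding side-by-side boxplots.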
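The correlation coefficient measures the strength and direction of the linear
relationship between two quantitative variables. It ranges from -1 to 1:
-1 indicates a perfect negative linear correlation between two variables
0 indicates no linear correlation between two variables
1 indicates a perfect positive linear correlation between two variables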
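The three reference values can be checked on small made-up series. This is a minimal sketch using numpy.corrcoef, which returns the Pearson correlation matrix; the arrays are illustrative assumptions. Note that the third series is clearly related to x (it is V-shaped) yet has zero linear correlation.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

y_up = 2.0 * x + 1.0                          # rises exactly linearly with x
y_down = -3.0 * x + 10.0                      # falls exactly linearly with x
y_vee = np.array([2.0, 1.0, 0.0, 1.0, 2.0])   # V-shaped: related, but not linearly

# corrcoef returns a 2x2 matrix; the off-diagonal entry is r.
r_up = np.corrcoef(x, y_up)[0, 1]
r_down = np.corrcoef(x, y_down)[0, 1]
r_vee = np.corrcoef(x, y_vee)[0, 1]

print(r_up, r_down, r_vee)
```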
A percentage is calculated by dividing a value by the sum of all the
values and then multiplying the result by 100. The same calculation
applies to Pandas DataFrames. Here, the built-in sum() method of
a pandas Series is used to compute the sum of all the values of
a column.
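As a minimal sketch of that calculation, assuming a small made-up DataFrame with hypothetical columns category and count:

```python
import pandas as pd

# Made-up illustrative counts.
df = pd.DataFrame({
    "category": ["A", "B", "C", "D"],
    "count": [10, 20, 30, 40],
})

# Each value divided by the column total, then multiplied by 100.
df["percent"] = df["count"] / df["count"].sum() * 100
print(df)
```

The percent column should add up to 100, which is a quick sanity check on any percentage table.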
Contingency Table is one of the techniques for exploring two or even
more variables. It is basically a tally of counts between two or more
categorical variables.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the loan data and show the first 10 rows
data = pd.read_csv("loan_status.csv")
print(data.head(10))
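The columns of loan_status.csv are not shown in the slides, so the sketch below builds a small made-up DataFrame instead; the column names gender and loan_status are assumptions for illustration only. It uses pandas.crosstab to tally the counts and, with normalize="index", to turn each row into percentages.

```python
import pandas as pd

# Made-up illustrative data; these column names are hypothetical.
loans = pd.DataFrame({
    "gender": ["Male", "Male", "Male", "Female",
               "Female", "Female", "Male", "Female"],
    "loan_status": ["Approved", "Rejected", "Approved", "Approved",
                    "Approved", "Rejected", "Approved", "Rejected"],
})

# Contingency table: counts for each combination of the two variables.
counts = pd.crosstab(loans["gender"], loans["loan_status"])

# Percentage table: each row rescaled so that it sums to 100.
row_pct = pd.crosstab(loans["gender"], loans["loan_status"],
                      normalize="index") * 100

print(counts)
print(row_pct)
```

Row percentages answer "of each gender, what share was approved?"; normalize="columns" would instead condition on the loan status.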
Most analytics applications rely on batch processing, in which
data is processed in batches at regular or varying intervals.
For example, aggregating daily sales by individual store and
writing the result to the data warehouse on a nightly basis
can allow business intelligence (BI) reporting queries to run
faster. Batch systems must scale seamlessly to the size of the
dataset processed by each job run.
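The daily sales aggregation by store mentioned above can be sketched with a pandas groupby. The data and the column names date, store and amount are made up for illustration, and the step that writes the result to a data warehouse is left out.

```python
import pandas as pd

# Made-up illustrative transactions for one batch.
sales = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-01",
             "2024-01-02", "2024-01-02"],
    "store": ["S1", "S1", "S2", "S1", "S2"],
    "amount": [100.0, 50.0, 200.0, 75.0, 125.0],
})

# One row per (date, store) holding that day's total sales.
daily = sales.groupby(["date", "store"], as_index=False)["amount"].sum()
print(daily)
```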
Scatterplots and Resistant Lines
Scatter Plot
In a scatter plot, the values of two variables are plotted along two axes and the resulting pattern can
reveal correlation present between the variables if any.
A scatter plot is also useful for assessing the strength of the relationship and to find if there are any
outliers in the data.

import numpy
import matplotlib.pyplot as plt
x = numpy.random.normal(5.0, 1.0, 1000)
y = numpy.random.normal(10.0, 2.0, 1000)
plt.scatter(x, y)
plt.show()

The scatter() function of matplotlib.pyplot draws the scatter plot; it takes the two variables as
its x and y arguments.
The resistant line basics

The eda_rline function fits a robust line through a bivariate dataset.


It does so by first breaking the data into three roughly equal sized
batches following the x-axis variable. It then uses the batches’ median
values to compute the slope and intercept.
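The three-batch idea can be sketched in plain Python. This is a simplified illustration of the classic three-group resistant line, not the implementation of eda_rline, which is not part of the Python libraries used in this unit; the function name resistant_line and the sample data are assumptions. The slope comes from the medians of the two outer batches, and the intercept is taken here as the average of the three batch intercepts.

```python
from statistics import median

def resistant_line(x, y):
    """Three-group resistant line; returns (slope, intercept)."""
    pairs = sorted(zip(x, y))  # order the points by x
    n = len(pairs)
    if n < 3:
        raise ValueError("need at least three points")
    k, r = divmod(n, 3)
    # With 2 leftover points the outer batches each take one;
    # with 1 leftover point the middle batch takes it.
    outer = k + (1 if r == 2 else 0)
    batches = [pairs[:outer], pairs[outer:n - outer], pairs[n - outer:]]
    # Median x and median y of each batch, taken separately.
    meds = [(median(p[0] for p in b), median(p[1] for p in b))
            for b in batches]
    (x_left, y_left), _, (x_right, y_right) = meds
    # Assumes the outer batches have different median x values.
    slope = (y_right - y_left) / (x_right - x_left)
    # Average of the intercepts implied by the three batch medians.
    intercept = sum(my - slope * mx for mx, my in meds) / 3
    return slope, intercept

# Points on y = 2x + 1, with the last y replaced by an outlier.
xs = list(range(1, 10))
ys = [2 * v + 1 for v in xs]
ys[-1] = 100
slope, intercept = resistant_line(xs, ys)
print(slope, intercept)
```

Because only medians enter the calculation, the single outlier in this example leaves the fitted line at slope 2 and intercept 1, whereas an ordinary least-squares fit would be pulled toward it.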
What is a Resistance Line? A resistance line, sometimes also known
as a speed line, helps identify stock trends and levels of support and
resistance. Resistance lines are technical indicators used by
equity analysts and investors to determine the price trend of a specific
stock. It is a different concept from the resistant line described above,
which is a robust fit through bivariate data.
Data transformation is the process of converting raw data into a format
or structure that would be more suitable for model building and also
data discovery in general. It is an imperative step in feature engineering
that facilitates discovering insights.
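One common transformation in bivariate analysis is taking logarithms to straighten a curved relationship. The sketch below is a minimal illustration on made-up data that follows y = 3x^2 exactly; the data and variable names are assumptions. After a log transform of both variables the relationship becomes a straight line whose slope is the exponent.

```python
import numpy as np

# Made-up data following a power law: y = 3 * x**2.
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = 3.0 * x ** 2

# Linear correlation before and after the log transform.
r_raw = np.corrcoef(x, y)[0, 1]
r_log = np.corrcoef(np.log(x), np.log(y))[0, 1]

# Straight-line fit on the log-log scale: log(y) = slope * log(x) + intercept.
slope, intercept = np.polyfit(np.log(x), np.log(y), 1)

print(r_raw, r_log)
print(slope, np.exp(intercept))  # recovers the exponent and the multiplier
```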
