Unit 4
BIVARIATE ANALYSIS
Presented By
Dr R Murugadoss
Professor
Artificial Intelligence & Data Science
- Relationships between Two Variables
- Percentage Tables
- Analyzing Contingency Tables
- Handling Several Batches
- Scatterplots and Resistant Lines
- Transformations
Two variables are related if knowing one gives you information
about the other. For example, height and weight are related;
people who are taller tend to be heavier.
Association Examples:
◦ Smoking is associated with heart disease.
◦ Weight is associated with height.
◦ Income is associated with education.
Both variables are categorical. We analyze an association by comparing
conditional probabilities, and we summarize the data in contingency tables.
Examples of categorical variables are gender and class standing.
Both variables are quantitative. To analyze this situation we consider how one
variable, called the response variable, changes in relation to changes in the other
variable, called the explanatory variable. Graphically, we use scatterplots to display two
quantitative variables. Examples are age, height, and weight (i.e. things that are measured).
One variable is categorical and the other is quantitative, for instance height and
gender. These are best compared by using side-by-side boxplots to display any
differences or similarities in the center and variability of the quantitative variable (e.g.
height) across the categories (e.g. Male and Female).
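The comparison of center and variability across categories can also be done numerically. The following is a minimal sketch, assuming a small made-up data set (the column names `gender` and `height` and all values are hypothetical), that computes for each category the five numbers a boxplot would draw.

```python
import pandas as pd

# Hypothetical sample: heights in centimetres for two groups.
df = pd.DataFrame({
    "gender": ["Male", "Male", "Male", "Male",
               "Female", "Female", "Female", "Female"],
    "height": [170.0, 175.0, 180.0, 185.0,
               155.0, 160.0, 165.0, 170.0],
})

# Five-number summary per category: the values a boxplot shows for each group.
summary = df.groupby("gender")["height"].describe()[
    ["min", "25%", "50%", "75%", "max"]
]

# Interquartile range as a measure of variability within each group.
summary["iqr"] = summary["75%"] - summary["25%"]

print(summary)
```

In this made-up sample the medians differ while the interquartile ranges are equal. To draw the side-by-side boxplots themselves, `df.boxplot(column="height", by="gender")` can be used when matplotlib is installed.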
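The strength of a linear relationship between two quantitative variables is measured by the
correlation coefficient, which takes values between -1 and 1:
-1 indicates a perfectly negative linear correlation between two variables.
0 indicates no linear correlation between two variables.
1 indicates a perfectly positive linear correlation between two variables.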
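The three cases can be illustrated with a short sketch, assuming the coefficient meant here is Pearson's r as computed by `numpy.corrcoef`. The data values and the helper name `pearson_r` are made up for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient of two 1-D arrays (illustrative helper)."""
    return np.corrcoef(x, y)[0, 1]

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

y_up = 3.0 * x + 1.0     # exact increasing linear function of x
y_down = -2.0 * x + 5.0  # exact decreasing linear function of x
y_curve = x ** 2         # related to x, but not linearly

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
r_curve = pearson_r(x, y_curve)

print(r_up, r_down, r_curve)
```

The last case shows why the wording "no linear correlation" matters: `y_curve` is completely determined by `x`, yet its correlation coefficient is zero because the relationship is not a straight line.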
A percentage is calculated by dividing a value by the sum of
all the values and then multiplying the result by 100. The
same calculation applies to Pandas DataFrames. Here, the
built-in sum() method of a pandas Series is used to compute
the sum of all the values of a column.
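As a minimal sketch of this calculation, assuming a small made-up DataFrame (the column names `category` and `count` are hypothetical):

```python
import pandas as pd

# Hypothetical counts per category.
df = pd.DataFrame({
    "category": ["A", "B", "C", "D"],
    "count": [20, 30, 40, 10],
})

# Divide each value by the column total, then multiply by 100.
df["percent"] = df["count"] / df["count"].sum() * 100

print(df)
```

The percentages of a column computed this way always add up to 100.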
A contingency table is one of the techniques for exploring two or
more variables. It is a tally of counts for each combination of
categories of two or more categorical variables.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the loan data and preview the first 10 rows
data = pd.read_csv("loan_status.csv")
print(data.head(10))
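A contingency table can be built with `pandas.crosstab`. The sketch below uses a small made-up data set, because the columns of loan_status.csv are not shown here. The column names `gender` and `loan_status` and all values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical records; column names are illustrative, not taken from loan_status.csv.
df = pd.DataFrame({
    "gender": ["Male", "Male", "Male", "Female",
               "Female", "Female", "Male", "Female"],
    "loan_status": ["Approved", "Rejected", "Approved", "Approved",
                    "Rejected", "Rejected", "Approved", "Approved"],
})

# Tally of counts for each combination of categories, with row and column totals.
counts = pd.crosstab(df["gender"], df["loan_status"], margins=True)

# Row percentages: the distribution of loan_status within each gender.
row_pct = pd.crosstab(df["gender"], df["loan_status"], normalize="index") * 100

print(counts)
print(row_pct)
```

Comparing the rows of `row_pct` is a comparison of conditional proportions: if the rows differ noticeably, the two variables are associated in the sample.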
Most analytics applications require frequent batch processing,
in which data is processed in batches at varying intervals.
For example, aggregating daily sales by individual store and
then writing the result to the data warehouse every night
lets business intelligence (BI) reporting queries run faster.
Batch systems must scale to data of any size and adapt
seamlessly to the size of the dataset processed in each job
run.
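The daily sales example can be sketched in pandas as follows. The data, the column names, and the helper `aggregate_batch` are hypothetical, and the step that writes to the data warehouse is left out.

```python
import pandas as pd

# Hypothetical raw sales records covering two days.
sales = pd.DataFrame({
    "date": pd.to_datetime([
        "2024-01-01", "2024-01-01", "2024-01-01",
        "2024-01-02", "2024-01-02",
    ]),
    "store": ["S1", "S1", "S2", "S1", "S2"],
    "amount": [100.0, 50.0, 80.0, 70.0, 30.0],
})

def aggregate_batch(batch):
    """Total sales per store for one batch of records (illustrative helper)."""
    return batch.groupby(["date", "store"], as_index=False)["amount"].sum()

# Process one day's records at a time, as a nightly job would.
daily_parts = [aggregate_batch(day_rows) for _, day_rows in sales.groupby("date")]

# Combine the per-day results into one summary table.
daily_sales = pd.concat(daily_parts, ignore_index=True)

print(daily_sales)
```

Reports can then query the small summary table instead of scanning every raw sales record.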
Scatterplots and Resistant Lines
Scatter Plot
In a scatter plot, the values of two variables are plotted along two axes, and the resulting pattern
reveals any correlation present between the variables.
A scatter plot is also useful for assessing the strength of the relationship and for spotting
outliers in the data.
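import numpy
import matplotlib.pyplot as plt

# 1000 random values each: x ~ Normal(mean 5, sd 1), y ~ Normal(mean 10, sd 2)
x = numpy.random.normal(5.0, 1.0, 1000)
y = numpy.random.normal(10.0, 2.0, 1000)

# Draw the scatter plot and display it
plt.scatter(x, y)
plt.show()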
The scatter() function of matplotlib.pyplot draws the scatter plot; it takes the two variables as
its x and y arguments. Because x and y are generated independently here, the plot shows no
correlation.
The resistant line basics
The resistant line, also known as a Speed Line, helps identify stock trends and levels of
support and resistance in a stock.
Data transformation is the process of converting raw data into a format
or structure that would be more suitable for model building and also
data discovery in general. It is an imperative step in feature engineering
that facilitates discovering insights.
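One common transformation, chosen here only as an illustration, is taking logarithms of a strongly right-skewed variable. The sketch below uses made-up values that grow by a factor of ten, so the effect is easy to see.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values, each ten times the previous one.
raw = pd.Series([1.0, 10.0, 100.0, 1000.0, 10000.0])

# Base-10 logarithm turns multiplicative steps into equal additive steps.
transformed = np.log10(raw)

# Sample skewness before and after the transformation.
raw_skew = raw.skew()
transformed_skew = transformed.skew()

print(raw_skew, transformed_skew)
```

After the transformation the values are evenly spaced, so the skewness drops to zero (up to rounding), while the raw values are dominated by the largest observation.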