Assignment 2 B
Assignment No:2 B
Problem Statement:
Use the diabetes data set from UCI and Pima Indians Diabetes data set for performing the
following:
a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation,
Skewness and Kurtosis
b. Bivariate analysis: Linear and logistic regression modeling
c. Multiple Regression analysis
d. Also compare the results of the above analysis for the two data sets
Theory:
Multiple linear regression models the relationship between two or more
features and a response by fitting a linear equation to observed data. The steps are
almost the same as for simple linear regression; the difference lies in the evaluation.
The fitted model can be used to find out which factor has the highest impact
on the predicted output and how the different variables relate to each other.
Here: Y = b0 + b1 * x1 + b2 * x2 + b3 * x3 + ... + bn * xn
where Y is the dependent variable, x1, x2, x3, ..., xn are the independent variables,
b0 is the intercept, and b1, ..., bn are the regression coefficients.
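As a rough illustration of the equation above, the coefficients b0, ..., bn can be estimated by ordinary least squares. The sketch below uses NumPy on a small synthetic data set; the data, the variable names, and the use of `np.linalg.lstsq` are assumptions made for illustration and are not taken from the assignment notebook.

```python
import numpy as np

# Synthetic data (assumed for illustration): 50 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_b = np.array([2.0, 0.5, -1.0, 3.0])   # b0, b1, b2, b3
y = true_b[0] + X @ true_b[1:]              # noise-free, so the fit is exact

# Prepend a column of ones so that b0 is estimated as the intercept.
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of X_design @ b = y.
b_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Predictions follow Y = b0 + b1*x1 + b2*x2 + b3*x3.
y_pred = X_design @ b_hat
```

With real, noisy data the recovered coefficients would only approximate the underlying ones, and the size of each coefficient is comparable across features only if the features are on similar scales.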
Assumptions of the Regression Model:
Linearity: The relationship between dependent and independent variables should be linear.
Homoscedasticity: Constant variance of the errors should be maintained.
Multivariate normality: Multiple Regression assumes that the residuals are normally
distributed.
Lack of Multicollinearity: It is assumed that there is little or no multicollinearity in the data.
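One common way to check the multicollinearity assumption is the variance inflation factor (VIF), which is not mentioned in the assignment itself and is shown here only as one possible diagnostic. The sketch below computes it with NumPy on synthetic columns; the function name `vif` and the data are assumptions.

```python
import numpy as np

def vif(X):
    """Return the variance inflation factor of each column of X.

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on the remaining columns (with an intercept).
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        ss_res = resid @ resid
        ss_tot = ((target - target.mean()) ** 2).sum()
        r2 = 1.0 - ss_res / ss_tot
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic example: column c is nearly a copy of column a.
rng = np.random.default_rng(1)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.05 * rng.normal(size=200)
vifs = vif(np.column_stack([a, b, c]))
```

A VIF close to 1 suggests a feature is not explained by the others; values above roughly 5 to 10 are often treated as a sign of problematic collinearity, although the threshold is a rule of thumb.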
Dummy Variable:
Multiple regression models often need to include categorical data. Categorical data
refers to values that represent categories, that is, values drawn from a fixed,
unordered set, for instance gender (male/female). Because the regression equation needs
numeric inputs, each category is encoded as a dummy variable that takes the value 1
when an observation belongs to that category and 0 otherwise.
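A minimal sketch of dummy encoding with pandas follows. The data frame and its column names are hypothetical; `pd.get_dummies` is one way to build the 0/1 columns, not necessarily the one used in the lab.

```python
import pandas as pd

# Hypothetical data frame; the column names are assumptions.
people = pd.DataFrame({
    "gender": ["male", "female", "female", "male"],
    "age": [31, 45, 28, 52],
})

# One 0/1 column per category; drop_first=True drops one level so the
# dummy columns are not perfectly collinear with the intercept
# (the so-called dummy variable trap).
encoded = pd.get_dummies(people, columns=["gender"], drop_first=True, dtype=int)
```

Here the "female" level is dropped and becomes the baseline, so the single column `gender_male` carries the whole gender variable.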
Computer Laboratory-I Class: BE (AI &DS)
Univariate Analysis:
Univariate analysis examines a single variable at a time in order to describe its
distribution. Common techniques and tools used in univariate analysis include:
a. Descriptive Statistics: This includes measures like mean, median, mode, range, variance,
and standard deviation, which provide a summary of the central tendency and variability of
the data.
b. Histograms: A histogram is a graphical representation of the frequency distribution of a
continuous variable. It displays data as bars or bins to visualize the shape of the distribution.
c. Bar Charts: Bar charts are used to visualize the frequency distribution of a categorical
variable. They show the frequency of each category or class.
d. Box Plots: A box plot, also known as a box-and-whisker plot, displays the summary of a
continuous variable's distribution, including the median, quartiles, and potential outliers.
e. Frequency Tables: Frequency tables provide a tabular summary of the counts or
percentages of different categories or values within a variable.
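The descriptive measures listed above, together with skewness and kurtosis from the problem statement, can be computed with the Python standard library alone. This is a sketch on a hypothetical sample; the function name `univariate_summary` is an assumption, and skewness and kurtosis use the plain population moments.

```python
import statistics
from collections import Counter

def univariate_summary(values):
    """Summarise one numeric variable (population moments for skew/kurtosis)."""
    n = len(values)
    mean = statistics.fmean(values)
    # Central moments of order 2, 3 and 4.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "frequency": dict(Counter(values)),
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "variance": statistics.variance(values),   # sample variance (n - 1)
        "std": statistics.stdev(values),           # sample standard deviation
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,
    }

sample = [1, 2, 2, 3, 3, 3, 4, 4, 5]   # hypothetical values
summary = univariate_summary(sample)
```

The pandas methods `Series.skew()` and `Series.kurt()` apply bias corrections, so their results differ slightly from these moment-based values on small samples.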
Bivariate Analysis:
Bivariate analysis, on the other hand, involves analyzing the relationships and interactions
between two variables. It is used to explore how changes in one variable affect another and to
identify patterns, associations, or correlations. Common techniques and tools used in bivariate
analysis include:
a. Scatter Plots: Scatter plots are used to visualize the relationship between two continuous
variables. Each data point is represented as a point on the graph, allowing you to observe
patterns and trends.
b. Correlation Analysis: Correlation measures the strength and direction of the relationship
between two continuous variables. Common correlation coefficients include Pearson's
correlation coefficient (for linear relationships) and Spearman's rank correlation (for monotonic
relationships).
c. Contingency Tables: Contingency tables are used to analyze the relationships between two
categorical variables. They show how the variables are distributed with respect to each other.
d. Regression Analysis: Regression analysis is used to model and quantify the relationship
between a dependent variable and one or more independent variables. Simple linear regression,
with a single independent variable, is the bivariate case; multiple linear regression extends
it to several independent variables.
e. Chi-Square Test: The chi-square test is a statistical test used to determine if there is an
association between two categorical variables. It helps assess the independence of variables.
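Three of the measures in the list above can be sketched in a few lines of NumPy. The helper names and the data are assumptions; the rank helper ignores ties for simplicity, and only the chi-square statistic is computed, not its p-value.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def ranks(v):
    """Ranks starting at 1; assumes no tied values for simplicity."""
    v = np.asarray(v)
    r = np.empty(len(v))
    r[np.argsort(v)] = np.arange(1, len(v) + 1)
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table of counts."""
    obs = np.asarray(table, dtype=float)
    expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()
    return float(((obs - expected) ** 2 / expected).sum())

x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]          # monotonic but not linear in x
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)

table = [[10, 20], [20, 10]]     # hypothetical 2x2 counts
chi2 = chi_square_statistic(table)
```

Because y increases monotonically with x but not linearly, the Spearman coefficient reaches 1 while the Pearson coefficient stays below 1, which illustrates the distinction drawn in item b.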
Univariate and bivariate analysis are crucial for understanding data: they help identify
outliers, trends, and patterns, and they support initial observations before more advanced
analyses are conducted. They provide the foundation for more complex multivariate analysis
and hypothesis testing in statistics and data science.
Conclusion:
Students will be able to apply linear regression and design ML models that make
predictions using the linear regression technique.
18/10/2023, 19:59 Assignment 2-B - Jupyter Notebook
In [21]: df = pd.read_csv("diabetes.csv")
In [22]: df.shape
Out[22]: (768, 9)
In [23]: df.head()
Out[23]:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outco
1            1       85             66             29        0  26.6                     0.351   31
3            1       89             66             23       94  28.1                     0.167   21
B. Use the diabetes data set from UCI and Pima Indians Diabetes data set for performing the following:
a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation, Skewness and
Kurtosis
b. Bivariate analysis: Linear and logistic regression modeling
c. Multiple Regression analysis
d. Also compare the results of the above analysis for the two data sets
In [24]: df.describe()
Out[24]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunc
Column: Pregnancies
Frequency:
1 135
0 111
2 103
3 75
4 68
5 57
6 50
7 45
8 38
9 28
10 24
11 11
13 10
12 9
14 2
15 1
17 1
Name: Pregnancies, dtype: int64
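The cell that produced the per-column output above is not shown in this export. A loop of roughly the following shape could print such statistics; this is a sketch that assumes a pandas DataFrame and uses a small stand-in frame rather than diabetes.csv, and the helper name `describe_column` is hypothetical.

```python
import pandas as pd

def describe_column(series):
    """Collect the univariate statistics named in the problem statement."""
    return {
        "frequency": series.value_counts(),
        "mean": series.mean(),
        "median": series.median(),
        "mode": series.mode().tolist(),
        "variance": series.var(),
        "std": series.std(),
        "skewness": series.skew(),
        "kurtosis": series.kurt(),
    }

# Stand-in data (hypothetical); in the notebook this would be df.
demo = pd.DataFrame({"Pregnancies": [1, 0, 1, 2, 1, 0, 3, 1]})

for name in demo.columns:
    stats = describe_column(demo[name])
    print(f"Column: {name}")
    print("Frequency:")
    print(stats["frequency"])
```

`value_counts()` sorts by descending count, which matches the ordering of the frequency listing above.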
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.1e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
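A large condition number does not by itself prove multicollinearity; features on very different scales can produce one as well. The sketch below is not the notebook's model. It uses synthetic, independent columns (all names and ranges are assumptions) to show how standardising the features changes the condition number of a design matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Hypothetical, independent features on very different scales.
small = rng.uniform(0.0, 2.5, size=n)       # e.g. a ratio-like feature
large = rng.uniform(0.0, 800.0, size=n)     # e.g. a lab measurement
X = np.column_stack([np.ones(n), small, large])

cond_raw = np.linalg.cond(X)

# Standardise the non-constant columns (zero mean, unit variance).
Z = X.copy()
Z[:, 1:] = (X[:, 1:] - X[:, 1:].mean(axis=0)) / X[:, 1:].std(axis=0)
cond_scaled = np.linalg.cond(Z)
```

If the condition number stays large after scaling, correlated predictors become the more likely explanation, and the correlation matrix is a reasonable next place to look.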
In [39]: df.corr()
Out[39]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI Diabete