Graphical Representation

The document outlines instructions for a data visualization assignment, including requirements for submission and guidelines for creating various plots. It covers univariate, bivariate, and multivariate graphs, explaining the necessary plots and their interpretations. Additionally, it discusses skewness and probability distributions, emphasizing the relationships between mean, median, and mode in different distributions.

Uploaded by

cikihi9288

2a.

Graphical Representation
Instructions:
Please share your answers filled in-line in the word document. Submit code
separately wherever applicable.

Please ensure you update all the details:


Name: _Naveen M_____ Batch ID: _11/09/2023-10AM________
Topic: Data Visualization

Guidelines:
1. An assignment submission is considered complete only when correct, executable code is
submitted along with documentation explaining the method and results. A submission that is
missing either of these is invalid.

2. Ensure that you submit your assignments correctly. Resubmission is not allowed.

3. After submission, you can evaluate your work by referring to the keys provided (available
only after submission).

Hints: Follow the CRISP-ML(Q) methodology steps wherever appropriate.


1. Data Understanding: work on each feature of the dataset to create a data
dictionary as displayed in the image below:

Make a table as shown above and provide information about the features such as its data
type and its relevance to the model building. And if not relevant, provide reasons and a
description of the feature.
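The skeleton of such a data dictionary can be generated from the data itself and the relevance column filled in by hand afterwards. The sketch below is only an illustration: the toy frame, its values, and the column names of the resulting table are assumptions, not the layout of the original image.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (values are assumed)
toy = pd.DataFrame({
    "age": [25, 32, 41],
    "Sex": ["M", "F", "F"],
    "Salaries": [30000.0, 45000.0, 52000.0],
})

# One row per feature: name, data type, distinct values, missing values
data_dictionary = pd.DataFrame({
    "feature": list(toy.columns),
    "data_type": toy.dtypes.astype(str).values,
    "unique_values": toy.nunique().values,
    "missing_values": toy.isna().sum().values,
    "relevance": "",  # to be filled in by hand for each feature
})
print(data_dictionary)
```

The relevance column is left empty on purpose, since judging whether a feature helps model building is a manual step.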
Problem Statements:

1. Univariate plots for UNIV data (Plot must have Title, X & Y label)
A) Plot numerical column with 3 different plots ?
B) What are bin parameters? What are the methods to define the number of bins and
bin sizes ?

© 360DigiTMG. All Rights Reserved.


C) Why do density plots exceed the range values of the column ?
D) Plot categorical columns by taking unique values ?
ANS) A) #Required libraries
import pandas as pd
import matplotlib.pyplot as plt

#Reading the data
df=pd.read_csv('C:/Users/Naveen/Desktop/Data Preprocessing Dataset/education.csv')

#Histogram
plt.hist(df.workex)
plt.title('Histogram of workex')
plt.xlabel('workex')
plt.ylabel('Frequency')
plt.show()

#Boxplot
plt.boxplot(df.workex)
plt.title('Boxplot of workex')
plt.xlabel('workex')
plt.ylabel('Value')
plt.show()

#Violin plot
plt.violinplot(df.workex)
plt.title('Violin plot of workex')
plt.xlabel('workex')
plt.ylabel('Value')
plt.show()

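B) Bins divide the range of the data into intervals in a histogram. The bin parameters are the
number of bins and the bin width (size). For equal-width bins the two are linked: number of
bins = (max - min) / bin width, rounded up. Choosing them appropriately is important for
visualizing the data effectively. Common methods, where n is the number of observations:
Square Root Choice: number of bins = sqrt(n), rounded up.

Sturges' Formula: number of bins = log2(n) + 1, rounded up.

Scott's Rule: bin width = 3.49 * s / n^(1/3), where s is the standard deviation.

Freedman-Diaconis Rule: bin width = 2 * IQR / n^(1/3), where IQR is the interquartile range.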
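The four rules named above can be compared with NumPy, whose `numpy.histogram_bin_edges` accepts the rule names 'sqrt', 'sturges', 'scott' and 'fd'. This is a minimal sketch on synthetic data (an assumption for illustration, not the assignment dataset):

```python
import numpy as np

# Synthetic sample (assumed for illustration; not the assignment data)
rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=200)

# NumPy implements each rule under a short string name
rules = ["sqrt", "sturges", "scott", "fd"]
bin_counts = {}
for rule in rules:
    edges = np.histogram_bin_edges(values, bins=rule)
    bin_counts[rule] = len(edges) - 1  # k + 1 edges define k bins
    width = edges[1] - edges[0]
    print(rule, "->", bin_counts[rule], "bins of width", round(width, 2))
```

The same rule names can be passed directly as `bins=` to `plt.hist`, which forwards them to NumPy.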

C) Why do density plots exceed the range values of the column?

Density plots created using kernel density estimation (KDE) can extend beyond the range of the
data. KDE estimates the underlying probability density function by placing a smooth kernel
(typically a Gaussian) on every observation and averaging the kernels. The kernels centred on
the smallest and largest observations spread part of their weight past those values, so the
estimated curve has tails outside the observed range. The density plot does not show the
actual data; it shows a smoothed estimate of the data distribution.
This extension is a feature of the KDE smoothing process and is not an error. However, you
can limit the x-axis range of the plot to match the observed data range if you want to focus
on the data within that range.
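The effect can be reproduced with a hand-written Gaussian KDE. The sketch below is not the implementation used by any plotting library; the sample values, the bandwidth and the helper name `kde_estimate` are assumptions chosen for illustration.

```python
import numpy as np

# Small illustrative sample (assumed values, not the assignment data)
data = np.array([2.0, 3.0, 3.5, 4.0, 5.0])
bandwidth = 0.5  # assumed smoothing width for this sketch

def kde_estimate(x, sample, h):
    """Average of Gaussian bumps of width h centred on each sample point."""
    z = (x[:, None] - sample[None, :]) / h
    bumps = np.exp(-0.5 * z**2) / (h * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

# Evaluate on a grid that is wider than the observed range [2, 5]
grid = np.linspace(0.0, 7.0, 701)
density = kde_estimate(grid, data, bandwidth)

# The estimate is still positive to the right of the largest observation
beyond_max = density[grid > data.max()]
print("largest density beyond the maximum:", beyond_max.max())

# The curve still integrates to about 1 (simple Riemann sum on the grid)
area = np.sum(density) * (grid[1] - grid[0])
print("area under the curve on the grid:", round(area, 3))
```

A smaller bandwidth shortens the tails, which is why heavily smoothed density plots overshoot the data range more visibly.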
D) #categorical_data is a categorical column of the dataset (a pandas Series)
value_counts = categorical_data.value_counts()

#Use the index of value_counts for the bar positions so that each label matches its count
plt.bar(value_counts.index, value_counts.values)
plt.title("Bar Plot of Categorical Column")
plt.xlabel("Categories")
plt.ylabel("Counts")
plt.xticks(rotation=45) # Rotate x-axis labels for better readability
plt.show()
2. Bivariate graphs for UNIV data (Plot must be readable [use rotation], have all labels)
A) Plot 2 numerical columns with scatter plot [use grid] ?
B) 2 Different plots for plotting a numerical column with a categorical column (bar,
line) ?
C) How are bar plots different from histogram?
ANS) A) #Required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#Reading the data


df=pd.read_csv('C:/Users/Naveen/Desktop/Data.csv')

#Plotting
plt.scatter(df.age,df.Salaries)
plt.title('Scatterplot of numerical columns')
plt.xlabel('age')
plt.ylabel('Salaries')
plt.grid()
plt.show()

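B) #Bar graph: mean of the numerical column for each category
mean_salary=df.groupby('Sex')['Salaries'].mean()
plt.bar(mean_salary.index,mean_salary.values)
plt.title('Mean Salaries by Sex')
plt.xlabel('Sex')
plt.ylabel('Mean Salaries')
plt.xticks(rotation=45)
plt.show()

#Line graph: seaborn plots the mean of Salaries for each category
sns.lineplot(data=df,x='Sex',y='Salaries')
plt.title('Salaries by Sex')
plt.xticks(rotation=45)
plt.show()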
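C) A bar chart compares values across the categories of a categorical variable, while a
histogram shows the distribution of a numerical variable by counting the observations that
fall into consecutive intervals (bins). In a bar chart the bars are separated by gaps and can
be reordered freely, because the categories have no numerical spacing. In a histogram the bars
are adjacent to each other, because the bins cover a continuous range, and their order is
fixed. Both are useful in statistics for summarizing large amounts of data.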
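The difference is easiest to see in the numbers that feed each plot. The sketch below uses a hypothetical toy frame (the column names and values are assumptions): a bar chart is drawn from one count per category, a histogram from counts per numeric interval.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (column names and values are assumptions)
toy = pd.DataFrame({
    "dept": ["HR", "IT", "IT", "Sales", "HR", "IT"],
    "age": [23, 35, 41, 29, 52, 38],
})

# Bar chart input: one count per distinct category, in any order
category_counts = toy["dept"].value_counts()
print(category_counts)

# Histogram input: counts per contiguous numeric interval, in fixed order
counts, edges = np.histogram(toy["age"], bins=3)
print(counts, edges)
```

The histogram returns one more edge than it has bins because neighbouring bins share an edge, which is why its bars touch.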

3. Plot multivariate graphs (correlation heatmap, pairplot)

A) Plot for only numerical data ?


B) Plot multivariate graphs for both numerical and categorical columns ?
C) What does it mean when a correlation value says 1? When it is negative? When it is
zero?
ANS) A) #Required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#Reading the data
df=pd.read_csv('C:/Users/Naveen/Desktop/Data.csv')
numerical_column=df[['Salaries','age']]
numerical_column_cor=numerical_column.corr()

#Correlation heatmap (annot=True prints the value in each cell)
sns.heatmap(numerical_column_cor,annot=True)
plt.title('Correlation heatmap of numerical columns')
plt.show()

#Pairplot
sns.pairplot(numerical_column)
plt.show()

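B) #Required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#Reading the data
df=pd.read_csv('C:/Users/Naveen/Desktop/Data.csv')

#Encode the categorical column as 0/1 so that it can enter the correlation matrix
df_encoded=pd.get_dummies(df,columns=['Sex'],dtype=int)
df_cor=df_encoded.corr(numeric_only=True)

#Correlation heatmap
sns.heatmap(df_cor,annot=True)
plt.show()

#Pairplot of the numerical columns, coloured by the categorical column
sns.pairplot(df,hue='Sex')
plt.show()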
C) A correlation value of 1: This indicates a perfect positive linear correlation. The two
variables move in the same direction, and the points lie exactly on a straight line with a
positive slope.
A negative correlation value: This indicates that the two variables tend to move in opposite
directions: when one increases, the other tends to decrease. The closer the value is to -1,
the stronger the relationship. A value of exactly -1 is a perfect negative linear correlation,
with all points on a straight line with a negative slope.
A correlation value of 0: This means there is no linear relationship between the two variables.
However, a correlation coefficient of 0 only rules out a linear relationship; there could
still be other types of relationships or associations that the coefficient does not capture.
Therefore, it is good practice to explore the data further, for example with a scatter plot.
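The three cases can be checked numerically with `numpy.corrcoef`. This is a small sketch on constructed values (the arrays are assumptions for illustration); the last case shows a variable that depends completely on x and still has a correlation of about 0.

```python
import numpy as np

x = np.arange(-5, 6, dtype=float)  # symmetric values -5 ... 5

r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]   # exact increasing straight line
r_neg = np.corrcoef(x, -3 * x + 4)[0, 1]  # exact decreasing straight line
r_sq = np.corrcoef(x, x**2)[0, 1]         # exact but non-linear dependence

print("increasing line:", round(r_pos, 6))
print("decreasing line:", round(r_neg, 6))
print("parabola:", round(abs(r_sq), 6))
```

The parabola case is why a value near 0 should be read as "no linear relationship" and not as "no relationship".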

4. Plot Skewness & Probability distribution for each column of marks data. (Hist, box, density)
A) What is a normal distribution, and what is the relationship between mean,
median & mode?
B) Which data variables are positively skewed, and what is the relationship
between mean, median & mode?
C) Which are negatively skewed, and what is the relationship between
mean, median & mode?
D) What are the distinctive differences between skewness and distribution?

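ANS) #Required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#Reading the data
df=pd.read_csv('C:/Users/Naveen/Desktop/education.csv')

#Plotting each column separately so that each plot gets its own axis scale
for col in ['workex','gmat']:
    plt.hist(df[col])
    plt.title('Histogram of '+col)
    plt.show()

    plt.boxplot(df[col])
    plt.title('Boxplot of '+col)
    plt.show()

    sns.kdeplot(df[col])
    plt.title('Density plot of '+col)
    plt.show()

    #Sample skewness: positive = right-skewed, negative = left-skewed
    print(col,'skewness:',df[col].skew())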

A) The normal distribution is a symmetrical, bell-shaped distribution in which the mean,
median and mode are all equal.
B) In a positively skewed (right-skewed) frequency distribution, the longer tail is on the
right. The mean is typically greater than the median, and the median is typically greater
than the mode (mean > median > mode).
C) In a negatively skewed (left-skewed) frequency distribution, the longer tail is on the
left. The mean is typically less than the median, and the median is typically less than the
mode (mean < median < mode).
D) Skewness is a measure of the asymmetry in the distribution of data. It helps us
understand whether the data is skewed to the left (negatively skewed), skewed to the right
(positively skewed), or approximately symmetric (no skew).
Distribution, in statistics, refers to the way data values are spread or organized. It describes
the set of all possible values that a random variable can take and how often each value
occurs. Skewness is therefore a single summary number describing one property of a
distribution, while the distribution is the complete description of the data.
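The ordering of mean, median and mode can be checked on a small example. The marks below are hypothetical values chosen to have a long right tail (an assumption, not the assignment's marks data); mirroring them gives a left-skewed series.

```python
import pandas as pd

# Hypothetical marks with a long right tail (assumed values)
right_skewed = pd.Series([40, 42, 42, 45, 47, 50, 55, 70, 95])
# Mirror image of the same values: long left tail
left_skewed = 135 - right_skewed

for name, s in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    print(name,
          "mean:", round(s.mean(), 2),
          "median:", s.median(),
          "mode:", s.mode().iloc[0],
          "skewness:", round(s.skew(), 2))
```

In this example the right-skewed series gives mean > median > mode and a positive skewness, and the mirrored series gives the reverse ordering and a negative skewness. Real data can break the ordering, which is why it is a typical pattern and not a guarantee.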
