
Descriptive Analytics.ipynb - Colab

The document outlines a descriptive analytics project using a CSV dataset of household income and expenditure. It details the loading of data into a pandas DataFrame, the exploration of data characteristics, and the application of descriptive statistics to analyze central tendencies and variations. Additionally, it includes visualizations such as scatter plots, line plots, pie charts, and histograms to illustrate relationships and distributions within the data.

Uploaded by

lsivakum

8/23/24, 11:47 AM descriptive analytics.ipynb - Colab

Income and expenditure CSV dataset from Kaggle

Load the dataset into a DataFrame (table)

import pandas as pd
data = pd.read_csv('/content/sample_data/Inc_Exp_Data (1).csv')

data.head()

   Mthly_HH_Income  Mthly_HH_Expense  No_of_Fly_Members  Emi_or_Rent_Amt  Annual_HH_I
0             5000              8000                  3             2000
1             6000              7000                  2             3000
2            10000              4500                  2                0  1
3            10000              2000                  1                0
4            12500             12000                  2             3000  1

data.shape

(50, 7)

data.columns

Index(['Mthly_HH_Income', 'Mthly_HH_Expense', 'No_of_Fly_Members',
       'Emi_or_Rent_Amt', 'Annual_HH_Income', 'Highest_Qualified_Member',
       'No_of_Earning_Members'],
      dtype='object')
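
The file path used above belongs to one Colab session. As a minimal sketch for trying the later steps without the CSV, the five rows printed by data.head() can be typed in by hand. Only the four columns that are fully visible in that printout are included, and the name `sample` is an assumption of this sketch, not part of the notebook.

```python
import pandas as pd

# First five rows as printed by data.head(); only the fully visible columns.
sample = pd.DataFrame({
    "Mthly_HH_Income": [5000, 6000, 10000, 10000, 12500],
    "Mthly_HH_Expense": [8000, 7000, 4500, 2000, 12000],
    "No_of_Fly_Members": [3, 2, 2, 1, 2],
    "Emi_or_Rent_Amt": [2000, 3000, 0, 0, 3000],
})

print(sample.shape)          # (rows, columns)
print(list(sample.columns))  # column names as a plain list
```

Five rows are enough to check that code runs, but far too few to draw conclusions about the 50 households.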

Descriptive statistics uses the following measures:

1. Central tendency: mean, median, mode
2. Frequency measures: how frequently events occur
3. Measures of variation: range, variance, standard deviation (SD)
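
The three groups of measures can be sketched with pandas on a handful of values. The five incomes below are the ones shown by data.head(); the variable names are assumptions made for this illustration.

```python
import pandas as pd

income = pd.Series([5000, 6000, 10000, 10000, 12500], name="Mthly_HH_Income")

# 1. Central tendency
mean_income = income.mean()
median_income = income.median()
mode_income = income.mode().iloc[0]   # mode() returns a Series; take the first value

# 2. Frequency: how often each value occurs
frequency = income.value_counts()

# 3. Variation
value_range = income.max() - income.min()
variance = income.var()   # sample variance (divides by n - 1)
std_dev = income.std()    # sample standard deviation

print(mean_income, median_income, mode_income)
print(frequency)
print(value_range, variance, std_dev)
```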

info(): number of rows, number of columns, column names, data type of each column, etc.

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 7 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Mthly_HH_Income           50 non-null     int64
 1   Mthly_HH_Expense          50 non-null     int64
 2   No_of_Fly_Members         50 non-null     int64
 3   Emi_or_Rent_Amt           50 non-null     int64
 4   Annual_HH_Income          50 non-null     int64
 5   Highest_Qualified_Member  50 non-null     object
 6   No_of_Earning_Members     50 non-null     int64
dtypes: int64(6), object(1)
memory usage: 2.9+ KB

describe(): summary statistics for the numeric columns/attributes

data.describe()

       Mthly_HH_Income  Mthly_HH_Expense  No_of_Fly_Members  Emi_or_Rent_Amt  Annual_
count        50.000000         50.000000          50.000000        50.000000      5.0
mean      41558.000000      18818.000000           4.060000      3060.000000      4.9
std       26097.908979      12090.216824           1.517382      6241.434948      3.2
min        5000.000000       2000.000000           1.000000         0.000000      6.4
25%       23550.000000      10000.000000           3.000000         0.000000      2.5
50%       35000.000000      15500.000000           4.000000         0.000000      4.4
75%       50375.000000      25000.000000           5.000000      3500.000000      5.9
max      100000.000000      50000.000000           7.000000     35000.000000      1.4
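
A few derived figures help when reading this table. The sketch below copies the Mthly_HH_Income column from the output above and computes the interquartile range, the coefficient of variation, and a mean-versus-median comparison. These derived quantities are additions for illustration and are not computed in the notebook.

```python
# Values copied from the Mthly_HH_Income column of data.describe() above.
mean, std = 41558.0, 26097.908979
q1, median, q3 = 23550.0, 35000.0, 50375.0

iqr = q3 - q1                      # spread of the middle 50% of households
cv = std / mean                    # coefficient of variation (relative spread)
mean_above_median = mean > median  # a hint that the distribution is right-skewed

print(iqr, round(cv, 3), mean_above_median)
```

A mean well above the median, as here, usually indicates that a few high-income households pull the average upward.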

Central tendency and variation using the statistics module

import statistics as st

st.mean(data['Mthly_HH_Income'])

41558

st.variance(data['Mthly_HH_Income'])

681100853.0612245

st.stdev(data['Mthly_HH_Income'])

26097.908978713687
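
The two figures above are linked: the standard deviation is the square root of the variance. The value also matches the std row of data.describe(), which is consistent with both using the sample (n - 1) formula. The sketch below checks the square-root relationship with the printed numbers and contrasts the sample and population formulas on a small hand-made list, which is an assumption of this example.

```python
import math
import statistics as st

# Figures printed above for Mthly_HH_Income.
variance = 681100853.0612245
stdev = 26097.908978713687

# The standard deviation is the square root of the variance.
roots_match = math.isclose(math.sqrt(variance), stdev, rel_tol=1e-9)

# Sample vs population versions on a tiny hand-made list.
values = [5000, 6000, 10000, 10000, 12500]
sample_var = st.variance(values)       # divides by n - 1
population_var = st.pvariance(values)  # divides by n

print(roots_match, sample_var, population_var)
```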

data['No_of_Fly_Members'].unique()

array([3, 2, 1, 5, 4, 6, 7])

https://colab.research.google.com/drive/1yFLS5fSuCYx2dUpVf3vYKqOb0zfT2epy#printMode=true 2/9

st.mode(data['No_of_Fly_Members'])

data['No_of_Fly_Members'].value_counts()

No_of_Fly_Members
4 15
6 10
3 9
2 8
5 5
7 2
1 1
Name: count, dtype: int64
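
The mode is simply the value with the highest count in this table. As a sketch, the column can be rebuilt from the printed counts (the row order is lost, which does not affect the mode) and the two approaches compared. The variable names are assumptions of this example.

```python
import statistics as st

import pandas as pd

# Rebuild the column from the counts printed above; order does not matter here.
counts = {4: 15, 6: 10, 3: 9, 2: 8, 5: 5, 7: 2, 1: 1}
members = pd.Series([size for size, n in counts.items() for _ in range(n)])

mode_from_statistics = st.mode(members)
mode_from_counts = members.value_counts().idxmax()  # label with the largest count

print(len(members), mode_from_statistics, mode_from_counts)
```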

st.mode(data['No_of_Earning_Members'])

The Highest_Qualified_Member column is categorical: it has only a few distinct values

data['Highest_Qualified_Member'].value_counts()

Highest_Qualified_Member
Graduate 19
Under-Graduate 10
Professional 10
Post-Graduate 6
Illiterate 5
Name: count, dtype: int64
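
Counts are often easier to compare as shares of the total. The sketch below starts from the counts printed above and converts them to percentages. On the original column, `value_counts(normalize=True)` gives the same relative frequencies directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series(
    {"Graduate": 19, "Under-Graduate": 10, "Professional": 10,
     "Post-Graduate": 6, "Illiterate": 5},
    name="count",
)

share = counts / counts.sum()     # relative frequency of each category
percent = (share * 100).round(1)  # the same figures as percentages

print(percent)
```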

Data visualizations: graphs and charts

Python provides packages for visualization:

1. matplotlib.pyplot
2. seaborn

Chart types: line, bar, pie, histogram, box, scatter

import matplotlib.pyplot as plt

Scatter plot: visualizes the relationship between two variables (attributes/columns)

1. Data points are represented as dots

Trend: expenditure increases as income increases


# size of chart
plt.figure(figsize=(3,3))
plt.scatter(data['Mthly_HH_Income'], data['Mthly_HH_Expense'])
# x & y axis labels
plt.xlabel('Income')
plt.ylabel('Expenditure')
plt.title('Income vs expenditure')
plt.show()
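
A trend read off a scatter plot can be backed with a number. As a sketch, the Pearson correlation coefficient is computed below for the five rows shown by data.head(). That subset is an assumption made to keep the example self-contained, and it is far too small to represent the full dataset.

```python
import pandas as pd

# Income and expense for the first five households (from data.head()).
income = pd.Series([5000, 6000, 10000, 10000, 12500])
expense = pd.Series([8000, 7000, 4500, 2000, 12000])

# Pearson correlation: +1 is a perfect rising line, 0 means no linear trend.
r = income.corr(expense)

print(round(r, 3))
```

On the full table the same call would be data['Mthly_HH_Income'].corr(data['Mthly_HH_Expense']).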

Line plot: each column is drawn as a line against the row index

In general, a family's monthly expenditure is less than its income

plt.figure(figsize=(3,3))
plt.plot(data['Mthly_HH_Income'],label='income' )
plt.plot(data['Mthly_HH_Expense'], label='expenditure')
plt.legend() # giving labels to graphs
plt.show()
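
The observation that expenditure is generally below income can also be counted rather than read from the lines. The sketch below does this for the five rows shown by data.head(); the names `sample` and `spends_less` are assumptions of this example.

```python
import pandas as pd

sample = pd.DataFrame({
    "Mthly_HH_Income": [5000, 6000, 10000, 10000, 12500],
    "Mthly_HH_Expense": [8000, 7000, 4500, 2000, 12000],
})

# True where a household spends less than it earns.
spends_less = sample["Mthly_HH_Expense"] < sample["Mthly_HH_Income"]

# The mean of a boolean column is the share of True values.
share_spending_less = spends_less.mean()

print(spends_less.sum(), share_spending_less)
```

In these five rows the two lowest-income households spend more than they earn, so "generally" does not mean "always".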


Pie chart: for categorical variables (few unique values), shows the proportion of each category

1. A circular figure showing the proportions

x = data['No_of_Earning_Members'].value_counts()
print(x)

No_of_Earning_Members
1 33
2 12
3 4
4 1
Name: count, dtype: int64

plt.figure(figsize=(3,3))
plt.pie(x,labels=x.index, autopct='%.0f%%' )
plt.show()
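
The autopct='%.0f%%' argument prints each wedge's share of the total as a whole-number percentage. The sketch below reproduces that arithmetic by hand from the counts printed above, as a way to check the labels on the chart.

```python
# Counts copied from the value_counts() output above.
counts = {1: 33, 2: 12, 3: 4, 4: 1}
total = sum(counts.values())

# Share of each category, rounded to whole percent like '%.0f%%'.
percentages = {members: round(100 * n / total) for members, n in counts.items()}

for members, pct in percentages.items():
    print(f"{members} earning member(s): {pct}%")
```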

Histogram: used for a single variable; its values are divided into intervals (bins)

1. Bars represent the count in each bin

print(data['Mthly_HH_Income'].min())
print(data['Mthly_HH_Income'].max())

5000
100000

plt.figure(figsize=(3,3))
plt.hist(data['Mthly_HH_Income'], bins = 10)

plt.show()
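
With bins=10, the range from the minimum to the maximum is split into ten equal-width intervals. The sketch below computes those edges with NumPy from the two extremes printed above, so only the minimum and maximum are needed, not the full column.

```python
import numpy as np

# Minimum and maximum income printed above.
low, high = 5000, 100000

# Ten equal-width bins need eleven edges.
edges = np.histogram_bin_edges([low, high], bins=10)
width = edges[1] - edges[0]

print(edges)
print(width)
```

Each bar in the income histogram therefore covers a span of 9,500.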


earning = data['No_of_Earning_Members'].unique()
#print(earning)
plt.hist(data['No_of_Earning_Members'])
plt.xlabel('No. of earning members')
plt.ylabel('Count')
plt.xticks(earning)
plt.show()
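
For a column with only a few whole-number values, the default ten bins make narrow bars that do not sit on the tick marks. One option, sketched here as an assumption rather than the notebook's method, is to place the bin edges halfway between the integers. The column is rebuilt from the counts printed earlier for No_of_Earning_Members.

```python
import numpy as np

# Rebuilt from the value_counts() output shown earlier.
counts = {1: 33, 2: 12, 3: 4, 4: 1}
earning = np.repeat(list(counts.keys()), list(counts.values()))

# Edges at 0.5, 1.5, ..., 4.5 put each whole number in the middle of its bin.
edges = np.arange(earning.min() - 0.5, earning.max() + 1.5, 1.0)
per_bin, _ = np.histogram(earning, bins=edges)

print(edges)
print(per_bin)
```

Passing bins=edges to plt.hist should give one bar per value, centered on its tick.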


plt.figure(figsize= (3,3))
plt.scatter(data['Mthly_HH_Income'], data['Mthly_HH_Expense'])
plt.xlabel('income')
plt.ylabel('expenditure')
plt.show()

# Raw values give one wedge per row (50 wedges), not one per family size
plt.pie(data['No_of_Fly_Members'])
plt.show()

data['No_of_Fly_Members'].unique()

array([3, 2, 1, 5, 4, 6, 7])


x = data['No_of_Fly_Members'].value_counts()
print(x)

No_of_Fly_Members
4 15
6 10
3 9
2 8
5 5
7 2
1 1
Name: count, dtype: int64

plt.pie(x, labels=x.index)
plt.show()
