
2.basic Statistics - Jupyter Notebook

This document demonstrates basic data analysis and manipulation techniques using the pandas library in Python. It first loads an MBA dataset (mba.csv) and computes descriptive statistics such as the mean, median, mode, variance, standard deviation, range, skewness, and kurtosis. It then loads breast cancer data (wbcd.csv) for categorical analysis: deleting an unnecessary column, computing the frequencies and proportions of the diagnosis types, and replacing text labels and column names. The goal is to prepare and understand the data for further analysis.

Uploaded by

venkatesh m

In [3]: import pandas as pd

In [4]: import numpy as np

In [3]: mba = pd.read_csv("D:\\Course PPTS\\R Codes\\1 Basis staistics\\mba.csv")



#C:\\Users\\Rohit\\Desktop\\Course PPTS\\R Codes\\1 Basis staistics\\mba.csv

In [4]: mba

...

In [5]: # number of Rows


len(mba)

...

In [6]: # check the number of columns


len(mba.columns)

...

In [7]: mba.shape

...
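The three size checks above can be compared side by side. This is a minimal sketch on a small made-up DataFrame (the values are hypothetical stand-ins, not the contents of mba.csv); it only illustrates that `shape` returns the row and column counts together.

```python
import pandas as pd

# Hypothetical stand-in for mba.csv; the values are invented for illustration.
df = pd.DataFrame({
    "Datasrno": [1, 2, 3, 4],
    "workex": [21, 107, 57, 99],
    "gmat": [720, 640, 740, 690],
})

n_rows = len(df)          # number of rows
n_cols = len(df.columns)  # number of columns
shape = df.shape          # (rows, columns) as one tuple

print(n_rows, n_cols, shape)
```

Unpacking also works when both numbers are needed: `n_rows, n_cols = df.shape`.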

In [8]: # column Names


mba.columns

#in R mba$gmat
#in python mba['gmat']

...

In [9]: # Top rows


mba.head() # will give top 5 rows

...

In [10]: # mention number of rows to display


mba.head(10)

...

In [11]: # tail() returns the bottom rows


mba.tail(6)

...
In [12]: # column information and dataset structure
mba.info()

...

In [13]: # Get stats on the columns


mba.describe()
#summary(mba)
#summary(mba$gmat)

...

In [14]: mba.describe().transpose()

...

In [15]: mba

# del mba['Datasrno'] deletes the column in place
# mba.drop(columns='Datasrno') returns a copy without the column
# mba.drop(0) would instead remove the row whose index label is 0

...
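The difference between dropping a row and dropping a column is easy to mix up, so here is a small sketch. The DataFrame is a made-up stand-in (only the column names come from the notebook), and it assumes the default integer index.

```python
import pandas as pd

# Hypothetical stand-in data; only the column names follow the notebook.
df = pd.DataFrame({
    "Datasrno": [1, 2, 3],
    "workex": [21, 107, 57],
    "gmat": [720, 640, 740],
})

# drop(columns=...) returns a new DataFrame without that column.
without_col = df.drop(columns="Datasrno")

# drop(0) / drop(index=0) removes the ROW labelled 0, not a column.
without_row = df.drop(index=0)

print(without_col.columns.tolist())
print(without_row.index.tolist())
```

Neither call changes `df` itself; `del df['Datasrno']` is the in-place alternative used later in the notebook.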

In [16]: # Rows information


mba[23:27]

...

In [17]: #mba$workex

mba['workex']


...

In [18]: mba1 = mba[['workex','gmat']]


mba1

...
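As a rough illustration of the selections above: slicing with `[23:27]` picks rows by position, a single column name returns a Series, and a list of names returns a DataFrame. The data below is hypothetical and only mirrors the column names.

```python
import pandas as pd

# Hypothetical data with 30 rows so that positions 23..26 exist.
df = pd.DataFrame({
    "workex": range(10, 40),
    "gmat": range(600, 630),
})

rows = df[23:27]               # rows at positions 23, 24, 25, 26 (end excluded)
same_rows = df.iloc[23:27]     # explicit positional form of the same slice
one_col = df["gmat"]           # a single column is a Series
two_cols = df[["workex", "gmat"]]  # a list of names gives a DataFrame

print(rows)
print(type(one_col).__name__, type(two_cols).__name__)
```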

In [19]: del mba['Datasrno']

mba

...

In [20]: #In R mean(mba)


mba.mean()

...
In [21]: mba.std()

...

In [22]: mba.describe()

...

In [1]: # in R mean(mba$gmat)

mba['gmat'].mean()

...

In [24]: mba['gmat'].median()

#mba['gmat'].mean()
#mba['gmat'].mode()
#mba['gmat'].var()
#mba['gmat'].std()
#mba['gmat'].max()
#mba['gmat'].min()
# Range = mba['gmat'].max() - mba['gmat'].min()
...

In [10]: mba['workex'].mode()

...

In [11]: mba['gmat'].var()

...

In [12]: mba['gmat'].std()

...

In [17]: max(mba['gmat'])

...

In [18]: min(mba['gmat'])
...

In [19]: gmat_range = max(mba['gmat'])-min(mba['gmat'])  # named gmat_range so the built-in range() is not shadowed


gmat_range

...
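The measures computed cell by cell above can be gathered in one place. This is a sketch on a short made-up Series (not the real gmat column); one point worth noting is that pandas `var()` and `std()` use the sample formula (`ddof=1`) by default.

```python
import pandas as pd

# Hypothetical scores, chosen so the results are easy to check by hand.
gmat = pd.Series([600, 650, 650, 700, 750], name="gmat")

summary = {
    "mean": gmat.mean(),
    "median": gmat.median(),
    "mode": gmat.mode().tolist(),   # mode() returns a Series; there can be ties
    "var": gmat.var(),              # sample variance, ddof=1 by default
    "std": gmat.std(),              # sample standard deviation, ddof=1
    "range": gmat.max() - gmat.min(),
}
population_var = gmat.var(ddof=0)   # divide by n instead of n - 1

print(summary)
print(population_var)
```

Using the Series methods `max()` and `min()` rather than the Python built-ins keeps the computation inside pandas and skips missing values by default.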
In [13]: # In R skewness and kurtosis - we have installed e1071 package

from scipy.stats import skew

gmat_skew = skew(mba['gmat'])  # separate name so the imported skew() function is not overwritten
gmat_skew
#print("skewness value of gmat:",gmat_skew)

...

In [13]: from scipy.stats import kurtosis


kurtosis(mba['gmat'])

...

In [14]: from scipy.stats import mode


mode(mba['gmat'])

...
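A note on the skewness and kurtosis cells: by default `scipy.stats.skew` and `scipy.stats.kurtosis` return the biased estimates (`bias=True`), and `kurtosis` reports excess kurtosis (Fisher's definition, where a normal distribution gives 0). The pandas methods `skew()` and `kurt()` apply a bias correction, so the two libraries can disagree on small samples. The sketch below uses made-up values to show how to line them up.

```python
import pandas as pd
from scipy.stats import skew, kurtosis

# Hypothetical scores; not taken from mba.csv.
gmat = pd.Series([600, 650, 650, 700, 750, 780])

biased_skew = skew(gmat)          # scipy default: bias=True
biased_kurt = kurtosis(gmat)      # excess kurtosis, bias=True

unbiased_skew = skew(gmat, bias=False)      # bias-corrected
unbiased_kurt = kurtosis(gmat, bias=False)  # bias-corrected excess kurtosis

pandas_skew = gmat.skew()   # pandas applies the bias correction
pandas_kurt = gmat.kurt()

print(biased_skew, unbiased_skew, pandas_skew)
print(biased_kurt, unbiased_kurt, pandas_kurt)
```

Keeping the results in names such as `biased_skew` leaves the imported `skew` function available for later cells.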

In [1]: from scipy import stats


Categorical Analysis
In [5]: import pandas as pd
In [6]: wbcd = pd.read_csv("D:\\Course\\Python\\Datasets\\wbcd.csv")
wbcd

Out[6]:
            id diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  smoothness_m
0     87139402         B        12.32         12.39           78.85      464.1           0.1
1      8910251         B        10.60         18.95           69.28      346.4           0.0
2       905520         B        11.04         16.83           70.92      373.2           0.1
3       868871         B        11.28         13.39           73.00      384.8           0.1
4      9012568         B        15.19         13.21           97.65      711.8           0.0
...        ...       ...          ...           ...             ...        ...           ...
564  911320502         B        13.17         18.22           84.28      537.3           0.0
565     898677         B        10.26         14.71           66.20      321.6           0.0
566     873885         M        15.28         22.41           98.92      710.6           0.0
567     911201         B        14.53         13.98           93.86      644.2           0.1
568    9012795         M        21.37         15.10          141.30     1386.0           0.1

569 rows × 32 columns

In [3]: wbcd

del wbcd['id']

In [4]: wbcd

Out[4]:
     diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  smoothness_mean  comp
0            B        12.32         12.39           78.85      464.1          0.10280
1            B        10.60         18.95           69.28      346.4          0.09688
2            B        11.04         16.83           70.92      373.2          0.10770
3            B        11.28         13.39           73.00      384.8          0.11640
4            B        15.19         13.21           97.65      711.8          0.07963
...        ...          ...           ...             ...        ...              ...
564          B        13.17         18.22           84.28      537.3          0.07466
565          B        10.26         14.71           66.20      321.6          0.09882
566          M        15.28         22.41           98.92      710.6          0.09057
567          B        14.53         13.98           93.86      644.2          0.10990
568          M        21.37         15.10          141.30     1386.0          0.10010

569 rows × 31 columns


In [27]: wbcd['diagnosis'].value_counts()

Out[27]:
B    357
M    212
Name: diagnosis, dtype: int64

In [28]: freq = pd.crosstab(index=wbcd['diagnosis'],  # Make a crosstab
                   columns="count")
freq

# Number of B are 357
# number of M are 212

Out[28]:
col_0      count
diagnosis
B            357
M            212

In [29]: freq/freq.sum()

# proportion of each diagnosis (multiply by 100 for a percentage)

# B = 357 / (357 + 212) = 357/569 ≈ 0.627 (about 63%)
# M = 212 / (357 + 212) = 212/569 ≈ 0.373 (about 37%)

Out[29]:
col_0         count
diagnosis
B          0.627417
M          0.372583

In [ ]: # About 63% of the cases have a benign diagnosis


# About 37% of the cases have a malignant diagnosis
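The proportions can also be obtained without a crosstab. The sketch below rebuilds a diagnosis column from the counts reported above (357 B and 212 M) instead of reading wbcd.csv, and uses `value_counts(normalize=True)`, which divides each count by the total.

```python
import pandas as pd

# Rebuilt from the counts shown above; the real column comes from wbcd.csv.
diagnosis = pd.Series(["B"] * 357 + ["M"] * 212, name="diagnosis")

counts = diagnosis.value_counts()                     # absolute frequencies
proportions = diagnosis.value_counts(normalize=True)  # counts / total
percent = (proportions * 100).round(1)                # as percentages, one decimal

print(counts)
print(proportions)
print(percent)
```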

In [30]: # replace() changes the label values in the rows

wbcd['diagnosis'] = wbcd['diagnosis'].replace({"B":"Benign","M":"Malignant"})

In [31]: wbcd

...

In [7]: # rename() changes column names in the dataset



wbcd.rename(columns={'radius_mean':'Mean Radius'},inplace=True)


In [38]: wbcd

Out[38]:
     diagnosis  Mean Radius  texture_mean  perimeter_mean  area_mean  smoothness_mean  compactnes
0       Benign        12.32         12.39           78.85      464.1          0.10280
1       Benign        10.60         18.95           69.28      346.4          0.09688
2       Benign        11.04         16.83           70.92      373.2          0.10770
3       Benign        11.28         13.39           73.00      384.8          0.11640
4       Benign        15.19         13.21           97.65      711.8          0.07963
...        ...          ...           ...             ...        ...              ...
564     Benign        13.17         18.22           84.28      537.3          0.07466
565     Benign        10.26         14.71           66.20      321.6          0.09882
566  Malignant        15.28         22.41           98.92      710.6          0.09057
567     Benign        14.53         13.98           93.86      644.2          0.10990
568  Malignant        21.37         15.10          141.30     1386.0          0.10010

569 rows × 31 columns
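Relabelling values and renaming columns can both be written as plain assignments, which avoids relying on `inplace=True` applied to a selected column (recent pandas versions warn that this pattern may not update the original DataFrame). This is a minimal sketch on a three-row made-up frame; only the column names and labels follow the notebook.

```python
import pandas as pd

# Hypothetical three-row frame; values are illustrative only.
df = pd.DataFrame({
    "diagnosis": ["B", "M", "B"],
    "radius_mean": [12.32, 15.28, 10.60],
})

# Map the short codes to full labels and assign the result back.
df["diagnosis"] = df["diagnosis"].replace({"B": "Benign", "M": "Malignant"})

# rename() returns a new DataFrame with the column renamed.
df = df.rename(columns={"radius_mean": "Mean Radius"})

print(df)
```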
