
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
(Bhilai Institute of Technology, Durg)

CERTIFICATE OF COMPLETION

This is to certify that Mr/Ms ………………………………………………… is a bonafide
student of …………… Sem during the Academic Session ……………………. in the
Department of Computer Science & Engineering, Bhilai Institute of
Technology, Durg, Chhattisgarh, and has successfully completed all the
experiments of the laboratory ………………………………………………………. within the
specified time of the academic session.

Approved By:

Prof. In-charge                              Head of the Deptt.

(                    )                       (                    )
INDEX

1. File I/O: Write a program for opening, closing, reading, writing, seeking and exception handling of a file.
2. NumPy: Write a program demonstrating array creation and basic operations such as indexing, slicing, shape manipulation, stacking and splitting of arrays.
3. Pandas: Write a program to import data (CSV, Excel, text, etc.) using pandas data frames and perform data preparation, filtering and sorting.
4. Matplotlib: Write a program to understand the use of Matplotlib for simple interactive charts (line chart, histogram, bar chart, pie chart), subplots with the functional method, working with multiple figures and axes, adding text, adding a grid, adding a legend, and saving the charts.
5. Seaborn: Write a program to understand the use of Seaborn for visualising statistical relationships, importing and preparing data, plotting with categorical data and visualising linear relationships.
6. Perform different data pre-processing methods.
7. Perform data cleaning, handling of missing values, and imputation techniques (cleaning/filling/dropping/replacing).
8. Perform exploratory analysis for any dataset.
9. Perform basic statistical analysis: counting (mean, median, mode, SD, etc.), probability, probability distributions and sampling distributions.
10. Perform statistical analysis by estimation and hypothesis testing.
#LAB-1#
[ ]: # The name of the file we want to work with
filename = "example.txt"

try:
    # Writing to the file
    file = open(filename, 'w+')
    file.write("Adding some new text.\n")
    print("New text added to the file.")
    file.close()  # close the write handle before reopening

    # Explicitly opening the file for reading and writing
    file = open(filename, 'r+')
    print("File opened successfully.")

    # Reading from the file
    content = file.read()
    print("Current file content:", content)

    # Seeking to a specific position in the file
    file.seek(0)
    print("Moved file pointer to the beginning.")

    # Reading the updated content
    updated_content = file.read()
    print("Updated file content:", updated_content)

except FileNotFoundError:
    print(f"The file {filename} was not found.")
except IOError:
    print(f"Error occurred while accessing the file {filename}.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")
finally:
    # Closing the file explicitly
    try:
        file.close()
        print("File closed successfully.")
    except NameError:
        # File was never opened, no need to close
        pass
    except Exception as e:
        print(f"An error occurred while closing the file: {e}")

New text added to the file.
File opened successfully.
Current file content: Adding some new text.

Moved file pointer to the beginning.
Updated file content: Adding some new text.

File closed successfully.
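The same open/close bookkeeping is usually handled by a context manager, which closes the file automatically even when an exception is raised. A minimal sketch of the `with` idiom, reusing the same example.txt from the cell above:

[ ]: # Sketch (not part of the original lab): 'with' closes the file
# automatically, even if an exception occurs inside the block.
try:
    with open(filename, 'r+') as f:
        content = f.read()
        f.seek(0)  # rewind to the beginning
        print("Read via context manager:", content)
except FileNotFoundError:
    print(f"The file {filename} was not found.")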


#LAB-2#
[ ]: import numpy as np

[ ]: arr = np.array([1, 2, 3, 4, 5])

[ ]: print(arr)

[1 2 3 4 5]

[ ]: arr1d = np.array([1, 2, 3, 4, 5])
arr2d = np.array([[1, 2, 3], [4, 5, 6]])

[ ]: arr_range = np.arange(1, 11, 1)
arr_range

[ ]: array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

[ ]: arr_random = np.random.rand(3, 3)
arr_random

[ ]: array([[0.50535839, 0.79636261, 0.41437649],
       [0.39710658, 0.09276994, 0.83068921],
       [0.69531753, 0.76831111, 0.42289229]])

[ ]: arr1d + 10

[ ]: array([11, 12, 13, 14, 15])

[ ]: arr1d * 2

[ ]: array([ 2, 4, 6, 8, 10])

[ ]: np.sqrt(arr1d)

[ ]: array([1. , 1.41421356, 1.73205081, 2. , 2.23606798])

[ ]: np.exp(arr1d)

[ ]: array([  2.71828183,   7.3890561 ,  20.08553692,  54.59815003, 148.4131591 ])

[ ]: arr2d.ndim, arr2d.size

[ ]: (2, 6)

[ ]: a = np.random.rand(3, 3)

[ ]: a = np.array([[5, 3], [5, 2]])
b = np.array([[3, 3], [2, 7]])

[ ]: a,b

[ ]: (array([[5, 3],
[5, 2]]),
array([[3, 3],
[2, 7]]))

[ ]: np.sort(a, axis = -1)

[ ]: array([[3, 5],
[2, 5]])

[ ]: a[1][1] #Indexing

[ ]: 2

[ ]: stacked_arr = np.vstack((a, b))  # Stack vertically
print("Stacked array vertically:\n", stacked_arr)

Stacked array vertically:
 [[5 3]
 [5 2]
 [3 3]
 [2 7]]

[ ]: # Splitting
split_arr = np.split(a, [1])  # Split at index 1
print("Split array:", split_arr)

Split array: [array([[5, 3]]), array([[5, 2]])]

[ ]: x = np.stack((a, b))
x

[ ]: array([[[5, 3],
[5, 2]],

[[3, 3],
[2, 7]]])
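np.vstack and np.split have horizontal counterparts that work along columns; a small sketch, reusing the same a and b from above:

[ ]: # Sketch: horizontal stacking and splitting of the same 2x2 arrays
h = np.hstack((a, b))          # shape (2, 4): columns of a, then columns of b
left, right = np.hsplit(h, 2)  # split back into two (2, 2) arrays
print("Stacked array horizontally:\n", h)
print("Split halves:", left, right)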

[ ]: print("addition\n", a + b)
print("subbtraction\n", a + b)
print("division\n", a + b)
print("multiply 1\n", a * b)
print("multiply 2\n", a @ b)

addition
[[8 6]
[7 9]]
subbtraction
[[8 6]
[7 9]]
division
[[8 6]
[7 9]]
multiply 1
[[15 9]
[10 14]]
multiply 2
[[21 36]
[19 29]]

[ ]: c = np.random.rand(4, 4)

[ ]: c

[ ]: array([[0.38479275, 0.84841358, 0.24102487, 0.80679517],
       [0.51642219, 0.71366856, 0.26205537, 0.28291745],
       [0.84690828, 0.33984087, 0.18054944, 0.53132507],
       [0.01385921, 0.87446289, 0.11849415, 0.26600576]])

[ ]: c[0:2, 0:2] #Slicing

[ ]: array([[0.38479275, 0.84841358],
[0.51642219, 0.71366856]])

[ ]: c[0:-2, 0:-2]

[ ]: array([[0.38479275, 0.84841358],
[0.51642219, 0.71366856]])

[ ]: print("mean\n", np.mean(a))
print("sum\n", np.sum(a))
print("min\n", np.min(a))
print("max 1\n", np.max(a))
print("cumsum 2\n", np.cumsum(a))

mean
3.75
sum
15
min
2
max 1
5
cumsum 2
[ 5 8 13 15]

[ ]: a

[ ]: array([[5, 3],
[5, 2]])


[ ]: a.T

[ ]: array([[5, 5],
[3, 2]])

[ ]: a.reshape(4, 1) #Shape manipulation

[ ]: array([[5],
[3],
[5],
[2]])

[ ]: a.max(axis = 0)

[ ]: array([5, 3])

[ ]: x.ndim, x.size

[ ]: (3, 8)

#LAB-3#
[ ]: import pandas as pd
data = pd.read_csv("/content/gapminder-FiveYearData.csv")

[ ]: #Sorting
data.sort_values(by=["gdpPercap"]).head(5)

[ ]: country year pop continent lifeExp gdpPercap
334 Congo Dem. Rep. 2002 55379852.0 Africa 44.966 241.165876
335 Congo Dem. Rep. 2007 64606759.0 Africa 46.462 277.551859
876 Lesotho 1952 748747.0 Africa 42.138 298.846212
624 Guinea-Bissau 1952 580653.0 Africa 32.500 299.850319
333 Congo Dem. Rep. 1997 47798986.0 Africa 42.587 312.188423

[ ]: #filtering
data_2007 = data[data["year"] == 2007]
data_2007.head(5)

[ ]: country year pop continent lifeExp gdpPercap
11 Afghanistan 2007 31889923.0 Asia 43.828 974.580338
23 Albania 2007 3600523.0 Europe 76.423 5937.029526
35 Algeria 2007 33333216.0 Africa 72.301 6223.367465
47 Angola 2007 12420476.0 Africa 42.731 4797.231267
59 Argentina 2007 40301927.0 Americas 75.320 12779.379640

[ ]: max_gdp = max(data["gdpPercap"])
country = data[data["gdpPercap"] == max_gdp]
country # country with max gdp Per Capita

[ ]: country year pop continent lifeExp gdpPercap
853 Kuwait 1957 212846.0 Asia 58.033 113523.1329
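The same row can be found in one step with idxmax, which returns the index label of the maximum value; a short sketch against the same data frame:

[ ]: # Sketch: row with the maximum gdpPercap, without a boolean filter
data.loc[[data["gdpPercap"].idxmax()]]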

[ ]: year_wise_lifeExp_dict = {}
years = data["year"].unique()  # each distinct year once, rather than every row
for year in years:
    x = data[data["year"] == year].lifeExp.mean()
    year_wise_lifeExp_dict[year] = x

year_wise_lifeExp = pd.Series(year_wise_lifeExp_dict)

[ ]: year_wise_lifeExp

[ ]: 1952 49.057620
1957 51.507401
1962 53.609249
1967 55.678290
1972 57.647386
1977 59.570157
1982 61.533197
1987 63.212613
1992 64.160338
1997 65.014676
2002 65.694923
2007 67.007423
dtype: float64

[ ]: year_wise_lifeExp_sr = data.groupby("year")["lifeExp"].mean()
year_wise_lifeExp_sr

[ ]: year
1952 49.057620
1957 51.507401
1962 53.609249
1967 55.678290
1972 57.647386
1977 59.570157
1982 61.533197
1987 63.212613
1992 64.160338
1997 65.014676
2002 65.694923
2007 67.007423
Name: lifeExp, dtype: float64

#LAB-4#
[ ]: import matplotlib.pyplot as plt
import numpy as np

x = [10, 20, 25, 15]
y = [5, 13, 6, 7]

data = np.random.randn(1000)

sizes = [15, 30, 45, 10]
labels = ['A', 'B', 'C', 'D']

# Working with Multiple Figures and Axes
# Subplots
fig, axs = plt.subplots(2, 2, figsize=(10, 8))

axs[0, 0].plot(x, y, 'r', label="RED Line")  # Line chart
axs[0, 0].set_title('Line Chart')
axs[0, 0].grid(True)    # Adding a grid
axs[0, 0].legend()      # Adding a legend

axs[0, 1].hist(data, bins=30, color='skyblue', edgecolor='black')  # Histogram
axs[0, 1].set_title('Histogram')

axs[1, 0].bar(x, y, color='green')  # Bar chart
axs[1, 0].set_title('Bar Chart')

# Pie chart
axs[1, 1].pie(sizes, labels=labels, autopct='%1.1f%%',
              colors=['gold', 'yellowgreen', 'lightcoral', 'lightskyblue'])
axs[1, 1].set_title('Pie Chart')

plt.tight_layout()
# Saving the chart
plt.savefig('figure chart.png')
plt.show()
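The lab outline also lists "Adding Text", which the cell above does not show; a minimal sketch using the standard ax.text and ax.annotate calls (the coordinates are chosen only for illustration):

[ ]: # Sketch: adding free text and an annotation to an axes
fig, ax = plt.subplots()
ax.plot(x, y, 'r')
ax.text(11, 6, 'text at data coordinates (11, 6)')   # Adding text
ax.annotate('peak', xy=(20, 13), xytext=(22, 10),
            arrowprops=dict(arrowstyle='->'))        # Arrowed annotation
plt.show()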

#LAB-5#
[ ]: import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Load the example tips dataset
tips = sns.load_dataset("tips")
fig, axes = plt.subplots(1, 3, figsize=(17, 4))

sns.scatterplot(data=tips, x="total_bill", y="tip", ax=axes[0])
axes[0].set_title('Scatterplot of Total Bill vs. Tip')

tips['tip_percentage'] = tips['tip'] / tips['total_bill'] * 100

sns.barplot(data=tips, x="day", y="tip_percentage", ax=axes[1])
axes[1].set_title('Bar Plot of Tip Percentage by Day')

sns.regplot(data=tips, x="total_bill", y="tip_percentage", ax=axes[2])
axes[2].set_title('Regression Plot of Total Bill vs. Tip Percentage')

g = sns.FacetGrid(tips, col="day", height=4, aspect=.5)
g.map(sns.regplot, "total_bill", "tip_percentage")
plt.show()
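Seaborn's figure-level catplot covers the "plotting with categorical data" part of this experiment more directly; a small sketch on the same tips dataset (kind="box" and kind="violin" are illustrative choices):

[ ]: # Sketch: categorical plots of total_bill by day
sns.catplot(data=tips, x="day", y="total_bill", kind="box")
sns.catplot(data=tips, x="day", y="total_bill", hue="sex", kind="violin")
plt.show()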

#LAB-6#
[ ]: import numpy as np
import pandas as pd

data = pd.read_excel("/content/Case study_Dataset.xlsx")

[ ]: data.head()

[ ]: CREATED_DATE CREATED_DATE minus Hour \
0 2016-01-09 00:18:14 2016-01-09
1 2016-01-09 02:28:34 2016-01-09
2 2016-01-09 04:00:34 2016-01-09
3 2016-01-09 10:26:27 2016-01-09
4 2016-01-09 11:37:59 2016-01-09

USER_ID TRANSACTION_ID \
0 45e3c222-38ac-4fdb-b092-ff1639e4438c 27d7fd11-d885-4d2c-9ed1-daa89b7bda1d
1 57c11728-b979-4856-bada-1d268726cfe9 2e1ee26c-0d24-4931-a7f9-0caa0d07eb2e
2 1319cca9-02a7-4a15-8abb-48d4e08e5aa3 bfd20e6f-ddb3-4237-bcd2-f7f8d967e36e
3 3f6bb28c-f945-4027-9178-747956c3ea58 85037186-039a-4ae5-9fea-e87f30822218
4 f54baeeb-7282-4d23-9bb7-e8396ce1b159 8e1e938a-1916-4d5e-b261-82c61a6979d6

TYPE CURRENCY AMOUNT
0 TOPUP EUR 177.38
1 BANK_TRANSFER EUR 310.27
2 CARD_PAYMENT EUR 96.44
3 BANK_TRANSFER EUR 288.51
4 CARD_PAYMENT GBP 88.45

[ ]: data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CREATED_DATE 10000 non-null datetime64[ns]
1 CREATED_DATE minus Hour 10000 non-null datetime64[ns]
2 USER_ID 10000 non-null object
3 TRANSACTION_ID 10000 non-null object
4 TYPE 10000 non-null object
5 CURRENCY 10000 non-null object
6 AMOUNT 10000 non-null float64
dtypes: datetime64[ns](2), float64(1), object(4)
memory usage: 547.0+ KB

[ ]: data.describe()

[ ]: CREATED_DATE CREATED_DATE minus Hour AMOUNT
count 10000 10000 10000.000000
mean 2016-08-23 00:01:29.126000128 2016-08-22 10:24:14.400000 175.768253
min 2016-01-09 00:18:14 2016-01-09 00:00:00 0.020000
25% 2016-06-19 18:20:33 2016-06-19 00:00:00 88.675000
50% 2016-09-03 16:29:08.500000 2016-09-03 00:00:00 177.455000
75% 2016-11-09 18:34:07.500000 2016-11-09 00:00:00 263.540000

max 2017-01-08 23:50:18 2017-01-08 00:00:00 349.980000
std NaN NaN 101.406464

[ ]: data["year"] = pd.DatetimeIndex(data.CREATED_DATE).year
data["month"] = pd.DatetimeIndex(data.CREATED_DATE).month
data["weekdays"] = pd.DatetimeIndex(data.CREATED_DATE).weekday

[ ]: EUR = []

for i in range(len(data)):
    if data.iloc[i]["CURRENCY"] == "EUR":
        EUR.append(data.iloc[i]["AMOUNT"])
    else:
        EUR.append(data.iloc[i]["AMOUNT"] * 1.17)

data["AMT_EUR"] = EUR

[ ]: data.head()

[ ]: CREATED_DATE CREATED_DATE minus Hour \
0 2016-01-09 00:18:14 2016-01-09
1 2016-01-09 02:28:34 2016-01-09
2 2016-01-09 04:00:34 2016-01-09
3 2016-01-09 10:26:27 2016-01-09
4 2016-01-09 11:37:59 2016-01-09

USER_ID TRANSACTION_ID \
0 45e3c222-38ac-4fdb-b092-ff1639e4438c 27d7fd11-d885-4d2c-9ed1-daa89b7bda1d
1 57c11728-b979-4856-bada-1d268726cfe9 2e1ee26c-0d24-4931-a7f9-0caa0d07eb2e
2 1319cca9-02a7-4a15-8abb-48d4e08e5aa3 bfd20e6f-ddb3-4237-bcd2-f7f8d967e36e
3 3f6bb28c-f945-4027-9178-747956c3ea58 85037186-039a-4ae5-9fea-e87f30822218
4 f54baeeb-7282-4d23-9bb7-e8396ce1b159 8e1e938a-1916-4d5e-b261-82c61a6979d6

TYPE CURRENCY AMOUNT year month weekdays AMT_EUR
0 TOPUP EUR 177.38 2016 1 5 177.3800
1 BANK_TRANSFER EUR 310.27 2016 1 5 310.2700
2 CARD_PAYMENT EUR 96.44 2016 1 5 96.4400
3 BANK_TRANSFER EUR 288.51 2016 1 5 288.5100
4 CARD_PAYMENT GBP 88.45 2016 1 5 103.4865

[ ]: data[["TYPE"]].value_counts()

[ ]: TYPE
TOPUP 2373
BANK_TRANSFER 2371
ATM 2357
CARD_PAYMENT 2325
P2P_TRANSFER 574
Name: count, dtype: int64

#LAB-7#
[ ]: import numpy as np
import pandas as pd

data = pd.read_csv("/content/Titanic.csv")

[ ]: data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pclass 1309 non-null int64
1 survived 1309 non-null int64
2 name 1309 non-null object
3 sex 1309 non-null object
4 age 1046 non-null float64
5 sibsp 1309 non-null int64
6 parch 1309 non-null int64
7 ticket 1309 non-null object
8 fare 1308 non-null float64
9 cabin 295 non-null object
10 embarked 1307 non-null object
11 boat 486 non-null object
12 body 121 non-null float64
13 home.dest 745 non-null object
dtypes: float64(3), int64(4), object(7)
memory usage: 143.3+ KB

[ ]: data.head()

[ ]: pclass survived sex age sibsp parch fare embarked body
0 1 1 0 29.00 0 0 211.3375 2 NaN
1 1 1 1 0.92 1 2 151.5500 2 NaN
2 1 0 0 2.00 1 2 151.5500 2 NaN
3 1 0 1 30.00 1 2 151.5500 2 135.0
4 1 0 0 25.00 1 2 151.5500 2 NaN

[ ]: data.describe()

[ ]: pclass survived sex age sibsp \
count 1309.000000 1309.000000 1309.000000 1309.000000 1309.000000
mean 2.294882 0.381971 0.644003 29.881138 0.498854
std 0.837836 0.486055 0.478997 12.883193 1.041658
min 1.000000 0.000000 0.000000 0.170000 0.000000
25% 2.000000 0.000000 0.000000 22.000000 0.000000
50% 3.000000 0.000000 1.000000 29.881138 0.000000
75% 3.000000 1.000000 1.000000 35.000000 1.000000
max 3.000000 1.000000 1.000000 80.000000 8.000000

parch fare embarked body
count 1309.000000 1309.000000 1309.000000 121.000000
mean 0.385027 33.295479 1.605806 160.809917
std 0.865560 51.738879 0.653499 97.696922
min 0.000000 0.000000 0.000000 1.000000
25% 0.000000 7.895800 1.000000 72.000000
50% 0.000000 14.454200 2.000000 155.000000
75% 0.000000 31.275000 2.000000 256.000000
max 9.000000 512.329200 2.000000 328.000000

[ ]: data = data.drop(['cabin', 'name', 'ticket', 'home.dest', 'boat'], axis=1)
# Dropping these columns as they have a high number of null values

[ ]: data.head()

[ ]: pclass survived sex age sibsp parch fare embarked body
0 1 1 0 29.00 0 0 211.3375 2 NaN
1 1 1 1 0.92 1 2 151.5500 2 NaN
2 1 0 0 2.00 1 2 151.5500 2 NaN
3 1 0 1 30.00 1 2 151.5500 2 135.0
4 1 0 0 25.00 1 2 151.5500 2 NaN

[ ]: data.fillna({'age': data['age'].mean()}, inplace=True)    # Filling null values with mean of age
data.fillna({'fare': data['fare'].mean()}, inplace=True)   # Filling null values with mean of fare
data.fillna({'embarked': 'S'}, inplace=True)

[ ]: data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pclass 1309 non-null int64
1 survived 1309 non-null int64

2 sex 1309 non-null int64
3 age 1309 non-null float64
4 sibsp 1309 non-null int64
5 parch 1309 non-null int64
6 fare 1309 non-null float64
7 embarked 1309 non-null int64
8 body 121 non-null float64
dtypes: float64(3), int64(6)
memory usage: 92.2 KB

[ ]: data.replace({'sex':{'male':1,'female':0}},inplace=True)

[ ]: data.replace({'embarked':{'S':2 ,'C':1,'Q':0}},inplace=True)
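scikit-learn wraps the same imputation logic in a reusable object; a sketch with SimpleImputer, shown here with the median strategy for contrast with the means used above:

[ ]: # Sketch: median imputation of numeric columns via scikit-learn
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
data[['age', 'fare']] = imputer.fit_transform(data[['age', 'fare']])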

[ ]: data.corr()

[ ]: pclass survived sex age sibsp parch \
pclass 1.000000 -0.312469 0.124617 -0.366371 0.060832 0.018322
survived -0.312469 1.000000 -0.528693 -0.050198 -0.027825 0.082660
sex 0.124617 -0.528693 1.000000 0.057397 -0.109609 -0.213125
age -0.366371 -0.050198 0.057397 1.000000 -0.190747 -0.130872
sibsp 0.060832 -0.027825 -0.109609 -0.190747 1.000000 0.373587
parch 0.018322 0.082660 -0.213125 -0.130872 0.373587 1.000000
fare -0.558477 0.244208 -0.185484 0.171521 0.160224 0.221522
embarked -0.038875 -0.098450 0.120423 -0.035824 0.073461 0.095523
body -0.034642 NaN -0.015903 0.059059 -0.099961 0.051099

fare embarked body
pclass -0.558477 -0.038875 -0.034642
survived 0.244208 -0.098450 NaN
sex -0.185484 0.120423 -0.015903
age 0.171521 -0.035824 0.059059
sibsp 0.160224 0.073461 -0.099961
parch 0.221522 0.095523 0.051099
fare 1.000000 -0.061118 -0.042665
embarked -0.061118 1.000000 -0.033860
body -0.042665 -0.033860 1.000000


#LAB-8#

[ ]: import numpy as np
import pandas as pd

data = pd.read_excel("/content/Case study_Dataset.xlsx")

[ ]: data.head()

[ ]: CREATED_DATE CREATED_DATE minus Hour \
0 2016-01-09 00:18:14 2016-01-09
1 2016-01-09 02:28:34 2016-01-09
2 2016-01-09 04:00:34 2016-01-09
3 2016-01-09 10:26:27 2016-01-09
4 2016-01-09 11:37:59 2016-01-09

USER_ID TRANSACTION_ID \
0 45e3c222-38ac-4fdb-b092-ff1639e4438c 27d7fd11-d885-4d2c-9ed1-daa89b7bda1d
1 57c11728-b979-4856-bada-1d268726cfe9 2e1ee26c-0d24-4931-a7f9-0caa0d07eb2e
2 1319cca9-02a7-4a15-8abb-48d4e08e5aa3 bfd20e6f-ddb3-4237-bcd2-f7f8d967e36e
3 3f6bb28c-f945-4027-9178-747956c3ea58 85037186-039a-4ae5-9fea-e87f30822218
4 f54baeeb-7282-4d23-9bb7-e8396ce1b159 8e1e938a-1916-4d5e-b261-82c61a6979d6

TYPE CURRENCY AMOUNT
0 TOPUP EUR 177.38
1 BANK_TRANSFER EUR 310.27
2 CARD_PAYMENT EUR 96.44
3 BANK_TRANSFER EUR 288.51
4 CARD_PAYMENT GBP 88.45

[ ]: data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CREATED_DATE 10000 non-null datetime64[ns]
1 CREATED_DATE minus Hour 10000 non-null datetime64[ns]
2 USER_ID 10000 non-null object
3 TRANSACTION_ID 10000 non-null object
4 TYPE 10000 non-null object
5 CURRENCY 10000 non-null object
6 AMOUNT 10000 non-null float64
dtypes: datetime64[ns](2), float64(1), object(4)
memory usage: 547.0+ KB

[ ]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn as sk
import seaborn as sns
gdp_missing_values_data = pd.read_csv('./Datasets/GDP_missing_data.csv')
gdp_complete_data = pd.read_csv('./Datasets/GDP_complete_data.csv')

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-1-a1d39de8ca53> in <cell line: 6>()
      4 import sklearn as sk
      5 import seaborn as sns
----> 6 gdp_missing_values_data = pd.read_csv('./Datasets/GDP_missing_data.csv')
      7 gdp_complete_data = pd.read_csv('./Datasets/GDP_complete_data.csv')

FileNotFoundError: [Errno 2] No such file or directory: './Datasets/GDP_missing_data.csv'

[ ]: data.describe()

[ ]: CREATED_DATE CREATED_DATE minus Hour AMOUNT
count 10000 10000 10000.000000
mean 2016-08-23 00:01:29.126000128 2016-08-22 10:24:14.400000 175.768253
min 2016-01-09 00:18:14 2016-01-09 00:00:00 0.020000
25% 2016-06-19 18:20:33 2016-06-19 00:00:00 88.675000
50% 2016-09-03 16:29:08.500000 2016-09-03 00:00:00 177.455000
75% 2016-11-09 18:34:07.500000 2016-11-09 00:00:00 263.540000
max 2017-01-08 23:50:18 2017-01-08 00:00:00 349.980000
std NaN NaN 101.406464

[ ]: data["year"] = pd.DatetimeIndex(data.CREATED_DATE).year
data["month"] = pd.DatetimeIndex(data.CREATED_DATE).month
data["weekdays"] = pd.DatetimeIndex(data.CREATED_DATE).weekday

[ ]: EUR = []

for i in range(len(data)):
    if data.iloc[i]["CURRENCY"] == "EUR":
        EUR.append(data.iloc[i]["AMOUNT"])
    else:
        EUR.append(data.iloc[i]["AMOUNT"] * 1.17)

data["AMT_EUR"] = EUR

[ ]: data.head()

[ ]: CREATED_DATE CREATED_DATE minus Hour \
0 2016-01-09 00:18:14 2016-01-09
1 2016-01-09 02:28:34 2016-01-09
2 2016-01-09 04:00:34 2016-01-09
3 2016-01-09 10:26:27 2016-01-09
4 2016-01-09 11:37:59 2016-01-09

USER_ID TRANSACTION_ID \
0 45e3c222-38ac-4fdb-b092-ff1639e4438c 27d7fd11-d885-4d2c-9ed1-daa89b7bda1d
1 57c11728-b979-4856-bada-1d268726cfe9 2e1ee26c-0d24-4931-a7f9-0caa0d07eb2e
2 1319cca9-02a7-4a15-8abb-48d4e08e5aa3 bfd20e6f-ddb3-4237-bcd2-f7f8d967e36e
3 3f6bb28c-f945-4027-9178-747956c3ea58 85037186-039a-4ae5-9fea-e87f30822218
4 f54baeeb-7282-4d23-9bb7-e8396ce1b159 8e1e938a-1916-4d5e-b261-82c61a6979d6

TYPE CURRENCY AMOUNT year month weekdays AMT_EUR
0 TOPUP EUR 177.38 2016 1 5 177.3800
1 BANK_TRANSFER EUR 310.27 2016 1 5 310.2700
2 CARD_PAYMENT EUR 96.44 2016 1 5 96.4400
3 BANK_TRANSFER EUR 288.51 2016 1 5 288.5100
4 CARD_PAYMENT GBP 88.45 2016 1 5 103.4865

[ ]: data[["TYPE"]].value_counts()

[ ]: TYPE
TOPUP 2373
BANK_TRANSFER 2371
ATM 2357
CARD_PAYMENT 2325
P2P_TRANSFER 574
Name: count, dtype: int64

[ ]: data["year"].unique()

[ ]: array([2016, 2017], dtype=int32)

[ ]: data.groupby(["CURRENCY"])["AMOUNT"].sum()

[ ]: CURRENCY
EUR 852363.35
GBP 905319.18
Name: AMOUNT, dtype: float64

[ ]: data.groupby(["year", "month", "CURRENCY"])["AMOUNT"].sum()

[ ]: year month CURRENCY
2016 1 EUR 19615.42
GBP 20155.34

2 EUR 22249.70
GBP 26937.35
3 EUR 44099.57
GBP 45814.22
4 EUR 43964.14
GBP 45241.07
5 EUR 49489.32
GBP 51630.61
6 EUR 53965.12
GBP 58219.62
7 EUR 81995.70
GBP 82271.76
8 EUR 100820.63
GBP 114643.94
9 EUR 90419.37
GBP 95699.41
10 EUR 101629.15
GBP 115582.59
11 EUR 105934.72
GBP 105177.93
12 EUR 110733.82
GBP 112710.05
2017 1 EUR 27446.69
GBP 31235.29
Name: AMOUNT, dtype: float64

[ ]: data.groupby(["weekdays", "CURRENCY"])["AMOUNT"].sum()

[ ]: weekdays CURRENCY
0 EUR 107370.90
GBP 129305.04
1 EUR 125032.02
GBP 118797.33
2 EUR 121888.83
GBP 129554.67
3 EUR 119865.46
GBP 131812.35
4 EUR 138228.10
GBP 150998.18
5 EUR 132238.72
GBP 135012.44
6 EUR 107739.32
GBP 109839.17
Name: AMOUNT, dtype: float64

[ ]: data.groupby(["TYPE", "CURRENCY"])["AMOUNT"].sum()

[ ]: TYPE CURRENCY
ATM EUR 213140.45
GBP 198558.25
BANK_TRANSFER EUR 205127.11
GBP 213737.72
CARD_PAYMENT EUR 210115.77
GBP 204736.58
P2P_TRANSFER EUR 19905.82
GBP 82075.52
TOPUP EUR 204074.20
GBP 206211.11
Name: AMOUNT, dtype: float64
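The long year/month/currency breakdown above reads more easily as a two-dimensional table; a sketch with pivot_table:

[ ]: # Sketch: the monthly sums reshaped with currencies as columns
data.pivot_table(values="AMOUNT", index=["year", "month"],
                 columns="CURRENCY", aggfunc="sum")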

[ ]: data.groupby(["weekdays"])["AMT_EUR"].sum().plot()

[ ]: <Axes: xlabel='weekdays'>

[ ]: data.groupby(["USER_ID"])["TRANSACTION_ID"].count().sort_values(ascending=False)

[ ]: USER_ID
06bb2d68-bf61-4030-8447-9de64d3ce490 132
d35f19f3-d9ad-48bf-bd1e-90f3ba4f0b98 103
d1bc3cd6-154e-479f-8957-a69cdf414462 95
0fe472c9-cf3e-4e43-90f3-a0cfb6a4f1f0 85
65ac0928-e17d-4636-96f4-ebe6bdb9c98d 84

dcf8d6c6-9fb6-4b0b-a190-013d220b33d7 1
2d6259b3-5a22-4b4b-b616-c22d9d7677c2 1
2d518cf9-d853-443d-a3d8-bda56f373901 1
5a99fa7a-72e5-4dbe-ae51-f0fd3bc8a717 1
2588d6c8-1a2e-4a54-a191-3b3111f9658e 1
Name: TRANSACTION_ID, Length: 1134, dtype: int64
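When only the heaviest users matter, nlargest avoids sorting the whole series; a one-line sketch:

[ ]: # Sketch: five most active users by transaction count
data.groupby("USER_ID")["TRANSACTION_ID"].count().nlargest(5)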

#LAB-9#
[ ]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_excel("/content/Case study_Dataset.xlsx")

[ ]: data.head()

[ ]: CREATED_DATE CREATED_DATE minus Hour \
0 2016-01-09 00:18:14 2016-01-09
1 2016-01-09 02:28:34 2016-01-09
2 2016-01-09 04:00:34 2016-01-09
3 2016-01-09 10:26:27 2016-01-09
4 2016-01-09 11:37:59 2016-01-09

USER_ID TRANSACTION_ID \
0 45e3c222-38ac-4fdb-b092-ff1639e4438c 27d7fd11-d885-4d2c-9ed1-daa89b7bda1d
1 57c11728-b979-4856-bada-1d268726cfe9 2e1ee26c-0d24-4931-a7f9-0caa0d07eb2e
2 1319cca9-02a7-4a15-8abb-48d4e08e5aa3 bfd20e6f-ddb3-4237-bcd2-f7f8d967e36e
3 3f6bb28c-f945-4027-9178-747956c3ea58 85037186-039a-4ae5-9fea-e87f30822218
4 f54baeeb-7282-4d23-9bb7-e8396ce1b159 8e1e938a-1916-4d5e-b261-82c61a6979d6

TYPE CURRENCY AMOUNT
0 TOPUP EUR 177.38
1 BANK_TRANSFER EUR 310.27
2 CARD_PAYMENT EUR 96.44
3 BANK_TRANSFER EUR 288.51
4 CARD_PAYMENT GBP 88.45

[ ]: data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CREATED_DATE 10000 non-null datetime64[ns]
1 CREATED_DATE minus Hour 10000 non-null datetime64[ns]
2 USER_ID 10000 non-null object
3 TRANSACTION_ID 10000 non-null object
4 TYPE 10000 non-null object
5 CURRENCY 10000 non-null object
6 AMOUNT 10000 non-null float64
dtypes: datetime64[ns](2), float64(1), object(4)
memory usage: 547.0+ KB

[ ]: data.describe()

[ ]: CREATED_DATE CREATED_DATE minus Hour AMOUNT
count 10000 10000 10000.000000
mean 2016-08-23 00:01:29.126000128 2016-08-22 10:24:14.400000 175.768253
min 2016-01-09 00:18:14 2016-01-09 00:00:00 0.020000
25% 2016-06-19 18:20:33 2016-06-19 00:00:00 88.675000
50% 2016-09-03 16:29:08.500000 2016-09-03 00:00:00 177.455000
75% 2016-11-09 18:34:07.500000 2016-11-09 00:00:00 263.540000
max 2017-01-08 23:50:18 2017-01-08 00:00:00 349.980000
std NaN NaN 101.406464

[ ]: data[["TYPE"]].value_counts()

[ ]: TYPE
TOPUP 2373
BANK_TRANSFER 2371
ATM 2357
CARD_PAYMENT 2325
P2P_TRANSFER 574
Name: count, dtype: int64

[ ]: data[["AMOUNT"]].mean()

[ ]: AMOUNT 175.768253
dtype: float64

[ ]: data[["AMOUNT"]].median()

[ ]: AMOUNT 177.455
dtype: float64

[ ]: data[["AMOUNT"]].mode()

[ ]: AMOUNT
0 124.01

[ ]: data[["AMOUNT"]].std()

[ ]: AMOUNT 101.406464
dtype: float64

[ ]: data[["AMOUNT"]].gt(200).mean()

[ ]: AMOUNT 0.4322
dtype: float64

[ ]: data["AMOUNT"].unique()

[ ]: array([177.38, 310.27, 96.44, …, 285.68, 17.32, 228.9 ])

[ ]: data['AMOUNT'].value_counts()

[ ]: AMOUNT
124.01 6
140.59 4
284.25 4
53.96 3
13.63 3
..
292.14 1
52.09 1
110.65 1
307.05 1
228.90 1
Name: count, Length: 8746, dtype: int64

[ ]: # Plot the empirical probability distribution of AMOUNT
plt.hist(data['AMOUNT'], bins=10, density=True)  # density=True normalises the counts
plt.xlabel('AMOUNT')
plt.ylabel('Probability density')
plt.title('Probability distribution of AMOUNT')
plt.show()

[ ]: # Draw one random sample of 100 transactions and plot its distribution
sample = data['AMOUNT'].sample(100, replace=True)
plt.hist(sample, bins=10, density=True)
plt.xlabel('AMOUNT')
plt.ylabel('Probability density')
plt.title('Distribution of a random sample of AMOUNT')
plt.show()
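A sampling distribution in the strict sense is the distribution of a statistic across many repeated samples; a sketch for the sample mean of AMOUNT (1000 resamples of size 100 are illustrative choices):

[ ]: # Sketch: sampling distribution of the sample mean of AMOUNT
sample_means = [data['AMOUNT'].sample(100, replace=True).mean()
                for _ in range(1000)]
plt.hist(sample_means, bins=30, density=True)
plt.xlabel('Sample mean of AMOUNT')
plt.ylabel('Probability density')
plt.title('Sampling distribution of the mean (n = 100)')
plt.show()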

#LAB-10#
[1]: import numpy as np
import pandas as pd

data = pd.read_excel("/content/Case study_Dataset.xlsx")

[2]: data.head()

[2]: CREATED_DATE CREATED_DATE minus Hour \
0 2016-01-09 00:18:14 2016-01-09
1 2016-01-09 02:28:34 2016-01-09
2 2016-01-09 04:00:34 2016-01-09
3 2016-01-09 10:26:27 2016-01-09
4 2016-01-09 11:37:59 2016-01-09

USER_ID TRANSACTION_ID \
0 45e3c222-38ac-4fdb-b092-ff1639e4438c 27d7fd11-d885-4d2c-9ed1-daa89b7bda1d
1 57c11728-b979-4856-bada-1d268726cfe9 2e1ee26c-0d24-4931-a7f9-0caa0d07eb2e
2 1319cca9-02a7-4a15-8abb-48d4e08e5aa3 bfd20e6f-ddb3-4237-bcd2-f7f8d967e36e

3 3f6bb28c-f945-4027-9178-747956c3ea58 85037186-039a-4ae5-9fea-e87f30822218
4 f54baeeb-7282-4d23-9bb7-e8396ce1b159 8e1e938a-1916-4d5e-b261-82c61a6979d6

TYPE CURRENCY AMOUNT
0 TOPUP EUR 177.38
1 BANK_TRANSFER EUR 310.27
2 CARD_PAYMENT EUR 96.44
3 BANK_TRANSFER EUR 288.51
4 CARD_PAYMENT GBP 88.45

[3]: data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CREATED_DATE 10000 non-null datetime64[ns]
1 CREATED_DATE minus Hour 10000 non-null datetime64[ns]
2 USER_ID 10000 non-null object
3 TRANSACTION_ID 10000 non-null object
4 TYPE 10000 non-null object
5 CURRENCY 10000 non-null object
6 AMOUNT 10000 non-null float64
dtypes: datetime64[ns](2), float64(1), object(4)
memory usage: 547.0+ KB

[4]: data.describe()

[4]: CREATED_DATE CREATED_DATE minus Hour AMOUNT
count 10000 10000 10000.000000
mean 2016-08-23 00:01:29.126000128 2016-08-22 10:24:14.400000 175.768253
min 2016-01-09 00:18:14 2016-01-09 00:00:00 0.020000
25% 2016-06-19 18:20:33 2016-06-19 00:00:00 88.675000
50% 2016-09-03 16:29:08.500000 2016-09-03 00:00:00 177.455000
75% 2016-11-09 18:34:07.500000 2016-11-09 00:00:00 263.540000
max 2017-01-08 23:50:18 2017-01-08 00:00:00 349.980000
std NaN NaN 101.406464

[5]: data["year"] = pd.DatetimeIndex(data.CREATED_DATE).year


data["month"] = pd.DatetimeIndex(data.CREATED_DATE).month
data["weekdays"] = pd.DatetimeIndex(data.CREATED_DATE).weekday

[6]: EUR = []

for i in range(len(data)):
    if data.iloc[i]["CURRENCY"] == "EUR":
        EUR.append(data.iloc[i]["AMOUNT"])
    else:
        EUR.append(data.iloc[i]["AMOUNT"] * 1.17)

data["AMT_EUR"] = EUR

[7]: data.head()

[7]: CREATED_DATE CREATED_DATE minus Hour \
0 2016-01-09 00:18:14 2016-01-09
1 2016-01-09 02:28:34 2016-01-09
2 2016-01-09 04:00:34 2016-01-09
3 2016-01-09 10:26:27 2016-01-09
4 2016-01-09 11:37:59 2016-01-09

USER_ID TRANSACTION_ID \
0 45e3c222-38ac-4fdb-b092-ff1639e4438c 27d7fd11-d885-4d2c-9ed1-daa89b7bda1d
1 57c11728-b979-4856-bada-1d268726cfe9 2e1ee26c-0d24-4931-a7f9-0caa0d07eb2e
2 1319cca9-02a7-4a15-8abb-48d4e08e5aa3 bfd20e6f-ddb3-4237-bcd2-f7f8d967e36e
3 3f6bb28c-f945-4027-9178-747956c3ea58 85037186-039a-4ae5-9fea-e87f30822218
4 f54baeeb-7282-4d23-9bb7-e8396ce1b159 8e1e938a-1916-4d5e-b261-82c61a6979d6

TYPE CURRENCY AMOUNT year month weekdays AMT_EUR
0 TOPUP EUR 177.38 2016 1 5 177.3800
1 BANK_TRANSFER EUR 310.27 2016 1 5 310.2700
2 CARD_PAYMENT EUR 96.44 2016 1 5 96.4400
3 BANK_TRANSFER EUR 288.51 2016 1 5 288.5100
4 CARD_PAYMENT GBP 88.45 2016 1 5 103.4865

[8]: data[["TYPE"]].value_counts()

[8]: TYPE
TOPUP 2373
BANK_TRANSFER 2371
ATM 2357
CARD_PAYMENT 2325
P2P_TRANSFER 574
Name: count, dtype: int64

Hypothesis: the top 3% of users drive about the same total transaction value as the bottom 60% of users, for both EUR and GBP (amounts converted to EUR).
[9]: top_users = data.groupby(["USER_ID"])["AMT_EUR"].sum().sort_values(ascending=False)
bottom_users = data.groupby(["USER_ID"])["AMT_EUR"].sum().sort_values()

[10]: top_users_count = len(top_users)

[11]: top_amt = top_users[:int(top_users_count * 0.03)]
bot_amt = bottom_users[:int(top_users_count * 0.636)]

print("Top 3% amt:", top_amt.sum())
print("Bottom 63.6% amt:", bot_amt.sum())

Top 3% amt: 394176.2463
Bottom 63.6% amt: 394327.79819999996

[12]: top_amt = top_users[:int(top_users_count * 0.13)]
total_50_amt = top_users.sum() * 0.5

print("Top 13% amt:", top_amt.sum())
print("50% of total amt:", total_50_amt)

Top 13% amt: 960763.8478
50% of total amt: 955793.3953
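The experiment title also calls for formal hypothesis testing; a sketch of a two-sample Welch t-test with scipy.stats, checking whether mean transaction amounts differ between EUR and GBP (the 0.05 level is a conventional choice, not taken from the original):

[ ]: # Sketch: Welch two-sample t-test of AMOUNT between currencies
from scipy import stats

eur = data.loc[data["CURRENCY"] == "EUR", "AMOUNT"]
gbp = data.loc[data["CURRENCY"] == "GBP", "AMOUNT"]

t_stat, p_value = stats.ttest_ind(eur, gbp, equal_var=False)
print("t =", t_stat, "p =", p_value)
if p_value < 0.05:
    print("Reject H0: mean amounts differ between currencies.")
else:
    print("Fail to reject H0: no evidence of a difference in means.")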
