
DMV - 1 - Jupyter Notebook

Uploaded by

Anushka Jadhav

10/6/24, 6:55 PM DMV_1 - Jupyter Notebook

In [1]: import pandas as pd

In [2]: csv_data = pd.read_csv('sales_data.csv', encoding='ISO-8859-1')

In [3]: excel_data = pd.read_excel('sales_data.xlsx')

In [4]: json_data = pd.read_json('sales_data.json')
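The CSV and JSON readers above can be tried without the original files. The sketch below is an illustration only: the in-memory buffers and the sample records are made up and stand in for sales_data.csv and sales_data.json. pd.read_excel is left out because it needs an Excel engine such as openpyxl to be installed.

```python
import io

import pandas as pd

# Hypothetical sample bytes standing in for sales_data.csv (not the real file).
# In ISO-8859-1 the character "é" is the single byte 0xE9, which is not valid
# UTF-8 on its own, so the encoding argument matters here.
raw_csv = "CUSTOMERNAME,SALES\nAtelier graphique,2871.0\nCafé Lyon,3746.7\n".encode("ISO-8859-1")
csv_demo = pd.read_csv(io.BytesIO(raw_csv), encoding="ISO-8859-1")

# Hypothetical JSON records standing in for sales_data.json.
raw_json = '[{"id": 1, "country": "Chad"}, {"id": 2, "country": "Tunisia"}]'
json_demo = pd.read_json(io.StringIO(raw_json))

print(csv_demo)
print(json_demo)
```

Reading the same bytes with the default UTF-8 decoding would raise a UnicodeDecodeError, which is the usual reason for passing encoding='ISO-8859-1'.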

In [5]: print(csv_data.info())
print(excel_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2823 entries, 0 to 2822
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ORDERNUMBER 2823 non-null int64
1 QUANTITYORDERED 2823 non-null int64
2 PRICEEACH 2823 non-null float64
3 ORDERLINENUMBER 2823 non-null int64
4 SALES 2823 non-null float64
5 ORDERDATE 2823 non-null object
6 STATUS 2823 non-null object
7 QTR_ID 2823 non-null int64
8 MONTH_ID 2823 non-null int64
9 YEAR_ID 2823 non-null int64
10 PRODUCTLINE 2823 non-null object
11 MSRP 2823 non-null int64
12 PRODUCTCODE 2823 non-null object
13 CUSTOMERNAME 2823 non-null object
14 PHONE 2823 non-null object
15 ADDRESSLINE1 2823 non-null object
16 ADDRESSLINE2 302 non-null object
17 CITY 2823 non-null object
18 STATE 1337 non-null object
19 POSTALCODE 2747 non-null object
20 COUNTRY 2823 non-null object
21 TERRITORY 1749 non-null object
22 CONTACTLASTNAME 2823 non-null object
23 CONTACTFIRSTNAME 2823 non-null object
24 DEALSIZE 2823 non-null object
dtypes: float64(2), int64(7), object(16)
memory usage: 551.5+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 390 entries, 0 to 389
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Postcode 390 non-null int64
1 Sales_Rep_ID 390 non-null int64
2 Sales_Rep_Name 390 non-null object
3 Year 390 non-null int64
4 Value 390 non-null float64
dtypes: float64(1), int64(3), object(1)
memory usage: 15.4+ KB
None

In [6]: print(json_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 9999 non-null int64
1 email 9999 non-null object
2 first 9999 non-null object
3 last 9999 non-null object
4 company 9999 non-null object
5 created_at 9999 non-null datetime64[ns, UTC]
6 country 9999 non-null object
dtypes: datetime64[ns, UTC](1), int64(1), object(5)
memory usage: 546.9+ KB
None

localhost:8888/notebooks/BE_PRACTICALS/DMV_1.ipynb 1/4

In [7]: csv_data.head()

Out[7]:
ORDERNUMBER QUANTITYORDERED PRICEEACH ORDERLINENUMBER SALES ORDERDATE STATUS QTR_ID MONTH_ID YEAR_ID ...

0 10107 30 95.70 2 2871.00 2/24/2003 0:00 Shipped 1 2 2003 ...

1 10121 34 81.35 5 2765.90 5/7/2003 0:00 Shipped 2 5 2003 ...

2 10134 41 94.74 2 3884.34 7/1/2003 0:00 Shipped 3 7 2003 ...

3 10145 45 83.26 6 3746.70 8/25/2003 0:00 Shipped 3 8 2003 ...

4 10159 49 100.00 14 5205.27 10/10/2003 0:00 Shipped 4 10 2003 ...

(The columns after the ellipsis, beginning "ADDRESS", are cut off at the page edge in this export.)

5 rows × 25 columns

In [8]: csv_data.columns

Out[8]: Index(['ORDERNUMBER', 'QUANTITYORDERED', 'PRICEEACH', 'ORDERLINENUMBER',
       'SALES', 'ORDERDATE', 'STATUS', 'QTR_ID', 'MONTH_ID', 'YEAR_ID',
       'PRODUCTLINE', 'MSRP', 'PRODUCTCODE', 'CUSTOMERNAME', 'PHONE',
       'ADDRESSLINE1', 'ADDRESSLINE2', 'CITY', 'STATE', 'POSTALCODE',
       'COUNTRY', 'TERRITORY', 'CONTACTLASTNAME', 'CONTACTFIRSTNAME',
       'DEALSIZE'],
      dtype='object')

In [9]: excel_data.head()

Out[9]:
Postcode Sales_Rep_ID Sales_Rep_Name Year Value

0 2121 456 Jane 2011 84219.497311

1 2092 789 Ashish 2012 28322.192268

2 2128 456 Jane 2013 81878.997241

3 2073 123 John 2011 44491.142121

4 2134 789 Ashish 2012 71837.720959

In [15]: excel_data.columns

Out[15]: Index(['Postcode', 'Sales_Rep_ID', 'Sales_Rep_Name', 'Year', 'Value'], dtype='object')

In [10]: json_data.head()

Out[10]:
id email first last company created_at country

0 1 [email protected] Torrey Veum Hilll, Mayert and Wolf 2014-12-25 04:06:27.981000+00:00 Switzerland

1 2 [email protected] Micah Sanford Stokes-Reichel 2014-07-03 16:08:17.044000+00:00 Democratic People's Republic of Korea

2 3 [email protected] Hollis Swift Rodriguez, Cartwright and Kuhn 2014-08-18 06:15:16.731000+00:00 Tunisia

3 4 [email protected] Perry Leffler Sipes, Feeney and Hansen 2014-07-10 11:31:40.235000+00:00 Chad

4 5 [email protected] Janelle Hagenes Lesch and Daughters 2014-04-21 15:05:43.229000+00:00 Swaziland

In [14]: json_data.columns

Out[14]: Index(['id', 'email', 'first', 'last', 'company', 'created_at', 'country'], dtype='object')

In [23]: csv_data['COUNTRY'] = csv_data['COUNTRY'].astype(str)
excel_data['Postcode'] = excel_data['Postcode'].astype(str) # Assuming Postcode is analogous to Country
json_data['country'] = json_data['country'].astype(str)

In [44]: csv_selected = csv_data[['COUNTRY', 'ORDERNUMBER', 'SALES', 'YEAR_ID']].rename(columns={'COUNTRY': 'Country', 'YEAR_ID':'Yea


excel_selected = excel_data[['Postcode', 'Year', 'Value']].rename(columns={'Postcode': 'Country'})
json_selected = json_data[['country', 'email', 'first']].rename(columns={'country': 'Country'})
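The select-and-rename pattern used in In [44] can be sketched on a small made-up frame. The first line of that cell is cut off in this export, so the target name 'Year' below is an assumption inferred from the column names that appear in the merged output; the sample values and the extra STATUS column are hypothetical.

```python
import pandas as pd

# Made-up rows; only the column names follow the notebook's CSV file.
csv_demo = pd.DataFrame({
    "COUNTRY": ["USA", "France"],
    "ORDERNUMBER": [10107, 10121],
    "SALES": [2871.0, 2765.9],
    "YEAR_ID": [2003, 2003],
    "STATUS": ["Shipped", "Shipped"],   # not selected below
})

# Pick the columns of interest, then give them the names shared by all sources.
# "Year" as the new name for YEAR_ID is an assumption (the original line is truncated).
selected = (
    csv_demo[["COUNTRY", "ORDERNUMBER", "SALES", "YEAR_ID"]]
    .rename(columns={"COUNTRY": "Country", "YEAR_ID": "Year"})
)

print(selected.columns.tolist())
```

Renaming before merging keeps the later pd.merge calls short, because the key columns then have identical names in every frame.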

In [55]: combined_data = pd.merge(csv_selected, excel_selected, on=['Year', 'Country'], how='outer') # Pass the merge keys as a list
combined_data = pd.merge(combined_data, json_selected, on='Country', how='outer')
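Because the Excel file's Postcode values are used as if they were country names, it is worth checking whether any keys actually match. A minimal sketch of one way to do that, using pd.merge's indicator option on made-up frames (only the column names follow the notebook):

```python
import pandas as pd

# Toy frames with made-up values.
csv_demo = pd.DataFrame({
    "Country": ["USA", "USA", "France"],
    "Year": [2003, 2004, 2003],
    "SALES": [2871.0, 3746.7, 2765.9],
})
excel_demo = pd.DataFrame({
    "Country": ["2121", "2092"],   # postcodes cast to str, as in In [23]
    "Year": [2011, 2012],
    "Value": [84219.5, 28322.2],
})

merged = pd.merge(csv_demo, excel_demo, on=["Year", "Country"],
                  how="outer", indicator=True)

# The "_merge" column records where each row came from:
# left_only, right_only, or both.
print(merged["_merge"].value_counts())
```

If no row is marked "both", the outer merge has only stacked the two tables, which would explain a Value column that is almost entirely NaN.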


In [56]: combined_data.head()

Out[56]:
Country ORDERNUMBER SALES Year Value email first

0 USA 10107.0 2871.00 2003.0 NaN NaN NaN

1 USA 10145.0 3746.70 2003.0 NaN NaN NaN

2 USA 10159.0 5205.27 2003.0 NaN NaN NaN

3 USA 10168.0 3479.76 2003.0 NaN NaN NaN

4 USA 10201.0 2168.54 2003.0 NaN NaN NaN

In [57]: combined_data.tail()

Out[57]:
Country ORDERNUMBER SALES Year Value email first

82151 China NaN NaN NaN NaN [email protected] Christopher

82152 China NaN NaN NaN NaN [email protected] Hermann

82153 China NaN NaN NaN NaN [email protected] Leann

82154 China NaN NaN NaN NaN [email protected] Cierra

82155 China NaN NaN NaN NaN [email protected] Juliana

In [58]: combined_data.shape

Out[58]: (82156, 7)
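The merged frame has far more rows than any of the inputs (2823, 390 and 9999). That is what a many-to-many merge does: each row on the left is paired with every row on the right that shares its key. The sketch below reproduces the effect on made-up data and shows one possible remedy, collapsing the right-hand table to one row per country first. The frame names, the n_contacts column and the choice to aggregate are all assumptions for illustration, not the notebook's method.

```python
import pandas as pd

# Made-up data: three orders and two contacts for the same country.
orders = pd.DataFrame({
    "Country": ["USA", "USA", "USA", "Chad"],
    "SALES": [100.0, 200.0, 300.0, 50.0],
})
contacts = pd.DataFrame({
    "Country": ["USA", "USA", "China"],
    "email": ["a@example.com", "b@example.com", "c@example.com"],
})

# Many-to-many: the 3 USA orders pair with the 2 USA contacts (6 rows),
# plus one unmatched row each for Chad and China.
exploded = pd.merge(orders, contacts, on="Country", how="outer")
print(exploded.shape)

# One option: reduce contacts to one row per country before merging.
per_country = contacts.groupby("Country", as_index=False).agg(
    n_contacts=("email", "count")
)
# validate raises MergeError if the right-hand keys are not unique.
compact = pd.merge(orders, per_country, on="Country",
                   how="outer", validate="many_to_one")
print(compact.shape)
```

With the duplicated keys removed, every order appears once, so sums and means of SALES are no longer inflated by repeated rows.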

In [59]: combined_data.isna().sum()

Out[59]: Country 0
ORDERNUMBER 9700
SALES 9700
Year 9310
Value 81766
email 1538
first 1538
dtype: int64
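Raw counts are easier to judge as a share of all rows: here Value is missing in 81766 of 82156 rows, about 99.5%. A small sketch of computing that share per column, on a made-up frame that reuses the notebook's column names:

```python
import numpy as np
import pandas as pd

# Made-up frame with some gaps.
demo = pd.DataFrame({
    "Country": ["USA", "China", "Chad", "USA"],
    "SALES": [2871.0, np.nan, np.nan, 3746.7],
    "Value": [np.nan, np.nan, np.nan, 84219.5],
})

# isna() gives booleans; their mean is the fraction of missing values.
missing_share = demo.isna().mean().sort_values(ascending=False)
print(missing_share)
```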

In [60]: combined_data.dtypes

Out[60]: Country object
ORDERNUMBER float64
SALES float64
Year float64
Value float64
email object
first object
dtype: object
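ORDERNUMBER and Year were int64 in the source files but are float64 here, because NaN cannot be stored in a plain integer column. If integer values are wanted while the gaps are kept, pandas' nullable "Int64" dtype is one option. The sketch below uses made-up values and is only an illustration of that dtype, not something the notebook does.

```python
import numpy as np
import pandas as pd

year = pd.Series([2003.0, 2004.0, np.nan], name="Year")
print(year.dtype)   # float64, because NaN forces a float column

# Capital "I": the nullable integer dtype, which stores missing values as <NA>.
year_nullable = year.astype("Int64")
print(year_nullable)
```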

In [64]: combined_data.describe()

Out[64]:
ORDERNUMBER SALES Year

count 72456.000000 72456.000000 72846.000000

mean 10262.038727 3535.301342 2003.891854

std 94.405512 1831.479392 0.926602

min 10100.000000 482.130000 2003.000000

25% 10180.000000 2184.000000 2003.000000

50% 10262.000000 3160.250000 2004.000000

75% 10347.000000 4496.800000 2004.000000

max 10425.000000 14082.800000 2013.000000

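In [70]: mean_year = combined_data['Year'].mean() # Mean of the non-missing Year values
combined_data['Year'] = combined_data['Year'].fillna(mean_year) # Assign back instead of a chained inplace=True

In [71]: combined_data['Year'] = combined_data['Year'].astype(int) # Cast only after the NaNs are filled; astype(int) fails on NaN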
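The mean of Year reported by describe() is about 2003.89, which is not a real year. One alternative, shown here as a sketch on made-up data rather than as the notebook's approach, is to fill the gaps with the most frequent year.

```python
import numpy as np
import pandas as pd

year = pd.Series([2003, 2004, 2004, np.nan, 2005, np.nan], name="Year")

# mode() ignores NaN by default and returns a Series; take its first value.
most_common = year.mode().iloc[0]

# Fill first, then cast: astype(int) cannot convert NaN.
filled = year.fillna(most_common).astype(int)
print(filled.tolist())
```

Whether imputing a year is appropriate at all depends on the analysis; dropping the rows with no year is another reasonable choice.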

In [72]: mean_sales = combined_data['SALES'].mean() # Mean of the non-missing SALES values
combined_data['SALES'] = combined_data['SALES'].fillna(mean_sales) # Assign back instead of a chained inplace=True


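In [73]: import matplotlib.pyplot as plt
import seaborn as sns

# Bar plot of total sales per year
plt.figure(figsize=(10, 6))
# barplot aggregates with the mean by default; estimator='sum' (seaborn 0.12+) matches the title
sns.barplot(data=combined_data, x='Year', y='SALES', estimator='sum')
plt.title('Total Sales by Year')
plt.xticks(rotation=45)
plt.show()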
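The yearly totals behind such a chart can also be computed explicitly with groupby, which makes the plotted numbers easy to inspect. A minimal sketch on made-up data (the values are hypothetical; only the column names follow the notebook):

```python
import pandas as pd

demo = pd.DataFrame({
    "Year": [2003, 2003, 2004, 2005],
    "SALES": [100.0, 250.0, 300.0, 50.0],
})

# One total per year, indexed by Year.
totals = demo.groupby("Year")["SALES"].sum()
print(totals)

# totals.plot(kind="bar", title="Total Sales by Year") would draw the same
# chart through pandas' plotting interface (requires matplotlib).
```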
