Pandas PD Scipy Matplotlib - Pyplot PLT Matplotlib - Ticker TK Numpy NP

The document analyzes data from Korean clothing shops on Shopee. It cleans and processes the data, then generates several visualizations, including the number of new shops per year, the relationship between response rate and rating, and the relationship between response time and rating.

Uploaded by

nguyenanhbim6


2-1-9

September 20, 2023

[3]: import pandas as pd

from scipy import stats
import matplotlib.pyplot as plt
import matplotlib.ticker as tk
import numpy as np

[4]: df = pd.read_csv('shopeep_koreantop_clothing_shop_data.csv')
df.info()
df.tail(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 746 entries, 0 to 745
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   pk_shop             746 non-null    int64
 1   date_collected      746 non-null    object
 2   shopid              746 non-null    int64
 3   name                746 non-null    object
 4   join_month          746 non-null    object
 5   join_day            746 non-null    int64
 6   join_year           746 non-null    int64
 7   item_count          746 non-null    int64
 8   follower_count      746 non-null    int64
 9   response_time       746 non-null    object
 10  response_rate       746 non-null    int64
 11  shop_location       428 non-null    object
 12  rating_bad          746 non-null    int64
 13  rating_good         746 non-null    int64
 14  rating_normal       746 non-null    int64
 15  rating_star         740 non-null    float64
 16  is_shopee_verified  746 non-null    int64
 17  is_official_shop    746 non-null    int64
dtypes: float64(1), int64(12), object(5)
memory usage: 105.0+ KB

[4]: pk_shop date_collected shopid name \
736 20210706325618926 2021-07-06 325618926 Be Young Life
737 20210706416886409 2021-07-06 416886409 vaapo.ph
738 20210706419954100 2021-07-06 419954100 Fall in love with you
739 2021070664360491 2021-07-06 64360491 Yzkzks.ph
740 2021070616590993 2021-07-06 16590993 Adol Janet
741 20210706449182992 2021-07-06 449182992 Yacent_thrift_Clo
742 20210706396605392 2021-07-06 396605392 Akistore.ph
743 20210706360379308 2021-07-06 360379308 Yzanice Shop
744 2021070629392066 2021-07-06 29392066 Clairecvc Shop
745 2021070625811092 2021-07-06 25811092 angelcity.

join_month join_day join_year item_count follower_count response_time \

736 October 21 2020 120 14578 09:40:53
737 April 4 2021 620 228 09:44:25
738 April 9 2021 662 12968 09:33:28
739 April 9 2018 650 80591 10:41:36
740 February 14 2017 473 513469 12:55:27
741 May 22 2021 16 115 08:45:30
742 March 3 2021 84 84 08:01:23
743 December 20 2020 78 5982 08:46:30
744 August 2 2017 964 44029 12:19:44
745 June 17 2017 272 868370 10:02:42

response_rate shop_location rating_bad rating_good \

736 93 NaN 29 2830
737 94 NaN 0 36
738 59 NaN 21 1092
739 92 Pasay City,Metro Manila 385 55669
740 78 Binondo,Metro Manila 2506 297528
741 86 Legazpi City,Albay 0 32
742 91 NaN 1 9
743 96 NaN 16 463
744 73 Binondo,Metro Manila 1960 103289
745 36 Pasay City,Metro Manila 13401 708666

rating_normal rating_star is_shopee_verified is_official_shop

736 93 4.85 1 0
737 1 4.89 1 0
738 63 4.75 0 0
739 1161 4.89 0 0
740 9597 4.84 0 0
741 0 5.00 0 0
742 0 4.60 0 0
743 16 4.75 1 0
744 3982 4.78 0 0
745 30799 4.77 0 0

[5]: # Compute the linear-regression coefficients. Parameters: the x column name,
# the y column name, and the DataFrame holding the x and y values.

def linear_reg(p1, p2, p3):
    lg = [0, 0]
    lg[0], lg[1] = np.polyfit(p3[p1], p3[p2], deg=1)
    return lg # Returns a two-element list: slope and intercept

# Outlier cleaner: remove outliers using Z-scores. Parameters: the two columns
# to clean and the DataFrame to clean.

def o_c(str1, str2, str3):
    df_1 = str3.dropna(subset=[str1, str2])
    df_2 = df_1[[str1, str2]]
    z = np.abs(stats.zscore(df_2))
    return df_2[(z < 0.3).all(axis=1)]
    # The Z-score cutoff is set to < 0.3 because there seemed to be too many outliers.
    # The usual cutoff for large data is < 3, but here I wanted the regression line
    # to slope upward as intended. Try < 3 after finishing this exercise and
    # the regression line in the plot changes right away (it slopes downward, because the slope is negative).
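
How strongly the cutoff in o_c prunes the data can be checked on a small made-up example. The sketch below is an illustration only: the toy DataFrame and the helper name zscore_filter are assumptions, not part of the notebook. It applies the same per-column Z-score filter with two cutoffs and then fits a line with np.polyfit on the loosely filtered rows.

```python
import numpy as np
import pandas as pd
from scipy import stats

def zscore_filter(frame, cols, threshold):
    """Keep rows whose absolute Z-score is below `threshold` in every column of `cols`."""
    sub = frame.dropna(subset=cols)[cols]
    z = np.abs(stats.zscore(sub))
    return sub[(z < threshold).all(axis=1)]

# Toy data (hypothetical): y = 2x + 1 for 19 points, plus one extreme y value
x = np.arange(1, 21, dtype=float)
y = 2 * x + 1
y[-1] = 500.0
toy = pd.DataFrame({'x': x, 'y': y})

loose = zscore_filter(toy, ['x', 'y'], threshold=3)     # the conventional cutoff
strict = zscore_filter(toy, ['x', 'y'], threshold=0.3)  # the cutoff used in o_c

# Fit a straight line to the loosely filtered rows
slope, intercept = np.polyfit(loose['x'], loose['y'], deg=1)
print(len(toy), len(loose), len(strict))
print(round(slope, 3), round(intercept, 3))
```

With a cutoff of 0.3 only rows very close to the mean of both columns survive, so the fitted line describes a narrow slice of the data rather than the whole sample.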

[6]: fig, axs = plt.subplots(2, 3, figsize=(20,10), dpi=80) # Set up the Figure and its Axes

# Requirement 1
count_shop = df.groupby(['join_year'])[['join_year']].count() # Number of shops per year

axs[0,0].bar(count_shop.index, count_shop.join_year, width=0.5)

plt.ioff()
axs[0,0].set_title('Number of New Shops Over Years', fontsize=16)
axs[0,0].set_xlabel('Year', fontsize=14)
axs[0,0].set_ylabel('Number of Shops', fontsize=14)

[6]: Text(0, 0.5, 'Number of Shops')
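
As a cross-check on the per-year count, the sketch below uses the ten join_year values visible in the df.tail(10) output above. The variable names toy, via_groupby and via_value_counts are assumptions for illustration. It compares the groupby count used in the notebook with value_counts, a common alternative.

```python
import pandas as pd

# join_year values of rows 736-745, copied from the tail output
toy = pd.DataFrame({'join_year': [2020, 2021, 2021, 2018, 2017,
                                  2021, 2021, 2020, 2017, 2017]})

# Same pattern as the notebook: group by year and count the rows in each group
via_groupby = toy.groupby(['join_year'])[['join_year']].count()

# Alternative: value_counts, sorted by year so the bars appear in time order
via_value_counts = toy['join_year'].value_counts().sort_index()

print(via_groupby['join_year'].tolist())
print(via_value_counts.tolist())
```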

[7]: # Requirement 2
df_sub = o_c('response_rate','rating_good', df) # Pass the two columns and the DataFrame to get a new, cleaned DataFrame

colors = np.random.randint(10, 20, size=df_sub.shape[0]) # Add a bit of color to the data points

axs[0,1].scatter(df_sub['response_rate'], df_sub['rating_good'], c=colors)

axs[0,1].set_title('Response rate with Rating good', fontsize=16)
axs[0,1].set_xlabel('Response rate', fontsize=14)
axs[0,1].set_ylabel('Rating good', fontsize=14)
a1 = linear_reg('response_rate', 'rating_good', df_sub) # Compute the regression-line coefficients

axs[0,1].plot(df_sub.response_rate, a1[0]*df_sub.response_rate+a1[1], color='r')

[7]: [<matplotlib.lines.Line2D at 0x1ef3cb36d60>]

[8]: # Requirement 3

# The time format must be converted to seconds

df_sub = df.copy() # Work on an explicit copy so the original DataFrame is not modified

df_sub['response_time'] = [e.strip() for e in df.response_time] # Strip whitespace from the response_time values

df_sub['response_time'] = pd.to_datetime(df_sub['response_time'], format='%H:%M:%S').dt.time # Parse with pandas and keep the time of day

df_sub['response_time'] = [(int(e.strftime('%H'))*3600+int(e.strftime('%M'))*60+int(e.strftime('%S'))) for e in df_sub.response_time]

# strftime extracts the hour, minute and second as text
# Cast to int because strftime returns str. In seconds: 1 hour = 3600 seconds, 1 minute = 60 seconds -> hour*3600 + minute*60 + second

df_sub2 = o_c('response_time','rating_bad',df_sub)

colors = np.random.randint(10, 20, size=df_sub2.shape[0]) # Add a bit of color to liven up the plot

axs[0,2].scatter(df_sub2['response_time'], df_sub2['rating_bad'], c=colors)

axs[0,2].set_title('Response time with Rating bad', fontsize=16)
axs[0,2].set_xlabel('Response time (seconds)', fontsize=14)
axs[0,2].set_ylabel('Rating bad', fontsize=14)
a1 = linear_reg('response_time', 'rating_bad', df_sub2) # Compute the regression-line coefficients

axs[0,2].plot(df_sub2.response_time, a1[0]*df_sub2.response_time+a1[1], color='r')

[8]: [<matplotlib.lines.Line2D at 0x1ef3d5c8340>]
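
The conversion of an HH:MM:SS string to seconds can also be sketched with pd.to_timedelta, which avoids the round trip through strftime. The three sample strings are taken from the response_time column in the tail output above; the surrounding whitespace is added here as an assumption, to mimic the padding the notebook strips.

```python
import pandas as pd

# Sample response_time strings (padding added for illustration)
times = pd.Series([' 09:40:53', '12:55:27 ', '08:01:23'])

# Strip whitespace, parse each string as a duration, then read off total seconds
seconds = pd.to_timedelta(times.str.strip()).dt.total_seconds().astype(int)

# Hand calculation for the first value: hours*3600 + minutes*60 + seconds
manual_first = 9 * 3600 + 40 * 60 + 53

print(seconds.tolist())
print(manual_first)
```

Parsing the strings as durations gives the same totals as hour*3600 + minute*60 + second, without going through datetime.time objects.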

[9]: # Requirement 4

from datetime import datetime # Library for working with dates and times

year = df_sub['join_year']
month = df_sub['join_month']
day = df_sub['join_day']
combin = ['{} {} {}'.format(year[i], month[i], day[i]) for i in range(len(df_sub.index))] # Combine the date parts into one string for processing

df['join_time'] = combin # Add a new column to the dataset holding the combined date strings

df_sub = df.copy() # Work on an explicit copy of the DataFrame

df_sub['join_time'] = [datetime.strptime(e, '%Y %B %d') for e in df_sub['join_time']]

# Convert the date text to datetime objects; see the links below for the format codes
# References: https://www.programiz.com/python-programming/datetime/strptime
# https://www.geeksforgeeks.org/python-datetime-strptime-function/
# https://stackoverflow.com/questions/25146121/extracting-just-month-and-year-separately-from-pandas-datetime-column

# df_sub['join_time'] = pd.to_datetime(df_sub.join_time) # Must be cast to the pandas type to extract year, month, day. P.S.: genuinely confusing

count_join = df_sub.groupby(df_sub.join_time.dt.to_period('M'))[['join_time']].count() # Group by month; change M -> D to group by day, or Y to group by year
axs[1,0].plot(np.asarray([str(e) for e in count_join.index]), count_join.join_time, linewidth=3, marker='*', markersize=10, markerfacecolor='red')

axs[1,0].set_title('New Vendors by Months', fontsize=16)

axs[1,0].set_xlabel('Months', fontsize=14)
axs[1,0].set_ylabel('Number of Vendors', fontsize=14)
axs[1,0].xaxis.set_major_locator(tk.MaxNLocator(8)) # Limit the number of x-axis ticks so the month labels stay readable
# Reference: https://saturncloud.io/blog/optimizing-tick-label-text-and-frequency-in-matplotlib-plots/

axs[1,0].tick_params(axis='x', labelrotation=45) # Rotate the x-axis tick labels
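
The date handling in this cell (join three columns into text, parse it, then group by month) can be sketched in a vectorized form. The four rows below reuse join dates from the tail output above; the names toy, text and per_month are assumptions for illustration. Parsing the English month names with %B assumes an English locale.

```python
import pandas as pd

# Join dates of rows 736-739, copied from the tail output
toy = pd.DataFrame({
    'join_year': [2020, 2021, 2021, 2018],
    'join_month': ['October', 'April', 'April', 'April'],
    'join_day': [21, 4, 9, 9],
})

# Build 'YYYY Month D' strings column-wise instead of looping over the rows
text = (toy['join_year'].astype(str) + ' '
        + toy['join_month'] + ' '
        + toy['join_day'].astype(str))

# Parse with the same format codes the notebook passes to strptime
toy['join_time'] = pd.to_datetime(text, format='%Y %B %d')

# Count new shops per calendar month
per_month = toy.groupby(toy['join_time'].dt.to_period('M')).size()
print(per_month)
```

Because pd.to_datetime returns a datetime64 column, the .dt accessor is available directly for to_period.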

[10]: # Requirement 5

# The outlier removal needs slightly different handling here (a single column)
df_sub = df[['rating_normal']].dropna()
z = np.abs(stats.zscore(df_sub.rating_normal))
df_sub_2 = df_sub[z<0.3]
axs[1,1].hist(df_sub_2.rating_normal, bins=5, density=True)
axs[1,1].set_title('Histogram: Frequency of Normal Rating', fontsize=16)
axs[1,1].set_xlabel('Normal Rating Score', fontsize=14)
fig.delaxes(axs[1][2]) # Remove the sixth Axes, since there are only 5 requirements :)
plt.tight_layout()
plt.show()

<Figure size 640x480 with 0 Axes>

[ ]:
