Lecture 12 - Art and Science of Data Visualization
Visualizing 1D relations
eduwa.head() — first 5 rows of the WA schools data (ID columns and most of the 24 columns omitted; column names abbreviated):

   School                            District                           County        Low  High
0  10th Street School                Marysville School District         Snohomish     6    8
1  49th Street Academy               Evergreen School District (Clark)  Clark         KG   12
2  A G West Black Hills High School  Tumwater School District           Thurston      9    12
3  A J West Elementary               Aberdeen School District           Grays Harbor  PK   6
4  A-3 Multiagency Adolescent Prog   Spokane School District            Spokane       9    12

5 rows × 24 columns
   Location       Count
0  Suburb           798
1  City             714
2  Rural            505
3  Town             338
4  Uncategorized     72
In [7]: # Are schools distributed equally across the various localities?
        fig = so.Plot(FTloc, x='Location', y='Count').add(so.Bar())
        fig.show()
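The frequency table FTloc used above can be rebuilt from the raw locale column with value_counts(); a minimal sketch on toy data (the column name 'Location' stands in for the actual eduwa column):

```python
import pandas as pd

# Toy stand-in for the schools' locale column
schools = pd.DataFrame({'Location': ['Suburb', 'City', 'Suburb',
                                     'Rural', 'Suburb', 'City']})

# value_counts() sorts descending; reset_index() turns the Series into a table
FTloc = schools['Location'].value_counts().reset_index()
FTloc.columns = ['Location', 'Count']
print(FTloc)
```

The resulting frame feeds directly into so.Plot(FTloc, x='Location', y='Count').add(so.Bar()) as in the cell above.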
In [5]: import pandas as pd

        linkRepo = 'https://github.com/resourcesbookvisual/data/'
        linkFile = 'raw/master/crime.csv'
        fullLink = linkRepo + linkFile
        crime = pd.read_csv(fullLink)
In [12]: display(crime.head())
         crime.columns

(crime columns include: ReportNumber, OccurredDate, year, month, weekday, OccurredTime, OccurredDayTime, ReportedDate, ReportedTime, DaysToReport, ...)

Most frequent crime categories:
0  THEFT       170946
2  BURGLARY     76630
4  NARCOTIC     16864
pandas warns that `df[col].method(value, inplace=True)` may not operate on the original object; prefer `df[col] = df[col].method(value)`:

FTcri['Crimes'] = FTcri['Crimes'].fillna('UNCATEGORIZED')
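The warning comes from calling fillna(..., inplace=True) on a column selection, which may act on a copy. A sketch of the assignment pattern pandas recommends (toy data, hypothetical counts):

```python
import pandas as pd

# Frequency table with a missing category label
FTcri = pd.DataFrame({'Crimes': ['THEFT', None, 'BURGLARY'],
                      'Count': [170946, 400, 76630]})

# Assign the result back instead of using inplace=True on a selection
FTcri['Crimes'] = FTcri['Crimes'].fillna('UNCATEGORIZED')
print(FTcri['Crimes'].tolist())  # ['THEFT', 'UNCATEGORIZED', 'BURGLARY']
```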
In [44]: # value_counts() on the Series itself; pd.value_counts is deprecated
         hg = eduwa['High.Grade'].value_counts().reset_index()
         so.Plot(hg, 'High.Grade', 'count').add(so.Bar())
In [48]: ordLabels = ["PK","KG","1","2","3","4","5","6","7","8","9","10","11","12","13"]
         HGtype = pd.CategoricalDtype(categories=ordLabels, ordered=True)
         display(HGtype)
         # apply that format to the column
         eduwa['High.Grade-O'] = eduwa['High.Grade'].astype(HGtype)
         display(eduwa['High.Grade-O'])

CategoricalDtype(categories=['PK', 'KG', '1', '2', '3', '4', '5', '6', '7', '8', '9',
                             '10', '11', '12', '13'],
                 ordered=True, categories_dtype=object)
0 8
1 12
2 12
3 6
4 12
..
2422 12
2423 6
2424 12
2425 6
2426 8
Name: High.Grade-O, Length: 2427, dtype: category
Categories (15, object): ['PK' < 'KG' < '1' < '2' ... '10' < '11' < '12' < '13']
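The payoff of the ordered dtype is that sorting and comparisons follow grade order rather than string order; a small check (shortened category list, an assumption):

```python
import pandas as pd

# Ordered categories: 'PK' < 'KG' < '1' < '2' < '10'
HGtype = pd.CategoricalDtype(categories=['PK', 'KG', '1', '2', '10'],
                             ordered=True)
grades = pd.Series(['10', 'KG', '2', 'PK'], dtype=HGtype)

# Sorting respects category order; a plain string sort would put '10' before '2'
print(grades.sort_values().tolist())  # ['PK', 'KG', '2', '10']
```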
In [49]: # Frequency table (value_counts() on the Series; pd.value_counts is deprecated)
         FThg = eduwa['High.Grade-O'].value_counts(ascending=False, sort=False, dropna=False).reset_index()
         FThg.columns = ['MaxOffer', 'Counts']
         # adding the cumulative-percent column
         FThg['CumPercent'] = 100 * FThg.Counts.cumsum() / FThg.Counts.sum()
         display(FThg)
    MaxOffer  Counts  CumPercent
0         PK      82    3.378657
1         KG       7    3.667079
2          1       6    3.914297
3          2      16    4.573548
4          3      19    5.356407
5          4      45    7.210548
6          5     755   38.318912
7          6     266   49.278945
8          7      11   49.732180
9          8     427   67.325917
10         9      15   67.943964
11        10       7   68.232386
12        11       5   68.438401
13        12     757   99.629172
14        13       9  100.000000
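The CumPercent column is just a cumulative sum scaled by the grand total; a compact check on three rows taken from the table above:

```python
import pandas as pd

FThg = pd.DataFrame({'MaxOffer': ['PK', '5', '12'],
                     'Counts': [82, 755, 757]})

# Running total as a percentage of all rows; the last entry must reach 100
FThg['CumPercent'] = 100 * FThg.Counts.cumsum() / FThg.Counts.sum()
print(FThg)
```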
In [49]: eduwa['Reduced.Lunch']
Out[49]: 0 3.0
1 9.0
2 40.0
3 10.0
4 4.0
...
2422 0.0
2423 57.0
2424 51.0
2425 35.0
2426 38.0
Name: Reduced.Lunch, Length: 2427, dtype: float64
Bin edges from np.arange(0, 320, 10):

array([  0,  10,  20,  30,  40,  50,  60,  70,  80,  90, 100, 110, 120,
       130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250,
       260, 270, 280, 290, 300, 310])

pd.cut with these edges assigns each Reduced.Lunch value to an interval:
0 (-0.001, 10.0]
1 (-0.001, 10.0]
2 (30.0, 40.0]
3 (-0.001, 10.0]
4 (-0.001, 10.0]
...
2422 (-0.001, 10.0]
2423 (50.0, 60.0]
2424 (50.0, 60.0]
2425 (30.0, 40.0]
2426 (30.0, 40.0]
Name: Reduced.Lunch, Length: 2427, dtype: category
Categories (31, interval[float64, right]): [(-0.001, 10.0] < (10.0, 20.0] < (20.0, 30.0] < (30.0, 40.0]
... (270.0, 280.0] < (280.0, 290.0] < (290.0, 300.0] < (300.0, 310.0]]
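The interval column above is produced by pd.cut; a minimal sketch with the same width-10 edges (the four sample values are taken from the series shown earlier):

```python
import numpy as np
import pandas as pd

lunch = pd.Series([3.0, 9.0, 40.0, 57.0])
bins = np.arange(0, 320, 10)  # edges 0, 10, ..., 310

# include_lowest=True stretches the first interval to (-0.001, 10.0]
# so that 0 itself is captured; intervals are right-closed by default
binned = pd.cut(lunch, bins=bins, include_lowest=True)
print(binned.tolist())
```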
In [66]: fig, ax = plt.subplots()
         p = so.Plot(eduwa, x='Reduced.Lunch').add(so.Bar(), so.Hist("density")).on(ax)  # alternative overlay: .add(so.Line(), so.KDE())
         ax.plot(np.arange(0, 300), NormalHist(np.arange(0, 300)) / sum(NormalHist(np.arange(0, 300))), label="Ideal")
         p.show()
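The density scaling used by so.Hist("density") can be verified without a plot: np.histogram with density=True returns bar heights whose areas sum to 1, which is what makes them comparable to a normal pdf (the data here are simulated, an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=15, size=2000)

# density=True rescales counts so that sum(height * width) == 1
heights, edges = np.histogram(x, bins=30, density=True)
widths = np.diff(edges)
print(float((heights * widths).sum()))  # 1.0 (up to float rounding)

# Normal pdf at the bin centers, for overlaying on the bars
centers = (edges[:-1] + edges[1:]) / 2
pdf = np.exp(-((centers - 50) ** 2) / (2 * 15**2)) / (15 * np.sqrt(2 * np.pi))
```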
Visualizing 2D relations
Categorical-Categorical relations
In [81]: # Relation between Precinct and time of occurrence of crime
         crime[['Precinct', 'OccurredDayTime']]
         PrecinctDaytime = pd.crosstab(crime.Precinct, crime.OccurredDayTime, margins=True)
         P1 = PrecinctDaytime.sort_values('All', ascending=False).drop("All").drop("All", axis=1)
         P1

Out[81]: crosstab of Precinct (rows) × OccurredDayTime (columns: afternoon, day, evening, night); row values omitted
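pd.crosstab with margins=True appends an 'All' row and column of totals, which is why the cell drops them again before plotting; a toy version:

```python
import pandas as pd

df = pd.DataFrame({'Precinct': ['N', 'N', 'S', 'S', 'S'],
                   'OccurredDayTime': ['day', 'night', 'day', 'day', 'night']})

# margins=True adds row/column totals labelled 'All'
ct = pd.crosstab(df.Precinct, df.OccurredDayTime, margins=True)
print(ct)

# Sorting by the totals and dropping 'All' recovers a plain counts table
plain = ct.sort_values('All', ascending=False).drop('All').drop('All', axis=1)
```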
In [94]: # Better option: side-by-side (dodged) bars; so.Bars(width=0.9) leaves spacing between groups
         so.Plot(P, x='Precinct', y='Count', color='OccurredDayTime-O').add(so.Bars(width=0.9), so.Dodge())
crimecat            afternoon  day  evening  night
DISORDERLY CONDUCT         81   41       67     79
GAMBLE                      4    4        7      2
HOMICIDE                   46   41       49    131
LOITERING                  31   20       25      9
PORNOGRAPHY                53   65       17     31
In [72]: # Flipped bar chart
         CrimeDay_df = CrimeDay_df.sort_values(by='NormalizedCounts', ascending=False)  # sort the df in descending order
         so.Plot(CrimeDay_df, y='Crime', x='NormalizedCounts').facet('DayTime-O').add(so.Bars(width=1))
In [73]: # Heatmap
         # Graphical representation of data where values are depicted by color.
         sns.heatmap(CrimeDay_df.pivot(index='Crime', columns='DayTime-O', values='NormalizedCounts'))

Out[73]: <Axes: xlabel='DayTime-O', ylabel='Crime'>
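sns.heatmap expects a matrix, so the long-format table is reshaped first with pivot; a sketch of that reshaping step on toy counts:

```python
import pandas as pd

long = pd.DataFrame({'Crime':   ['THEFT', 'THEFT', 'BURGLARY', 'BURGLARY'],
                     'DayTime': ['day', 'night', 'day', 'night'],
                     'Counts':  [10, 4, 3, 7]})

# pivot: rows = Crime, columns = DayTime, cell values = Counts
wide = long.pivot(index='Crime', columns='DayTime', values='Counts')
print(wide)
# sns.heatmap(wide) would then map each count to a color
```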
Categorical-Numerical relation (2 variables)
In [6]: crime.year.value_counts()
        crime2 = crime[crime.year > 2007].copy()
        crime2.dropna(subset=['DaysToReport'], inplace=True)
        crime2.fillna(value={'crimecat': 'Uncategorized'}, inplace=True)
        display(crime2[['crimecat', 'DaysToReport']].head(10))
        maxD = crime2.groupby('crimecat').describe()['DaysToReport'][['max', 'mean', 'std']]
        maxD.head(5)
   crimecat  DaysToReport
0  NARCOTIC           1.0
1  BURGLARY           0.0
3  THEFT              1.0
5  BURGLARY           0.0
6  THEFT              0.0
7  THEFT              0.0
9  THEFT              0.0

(maxD.head(5) — max, mean, std of DaysToReport per crimecat — output omitted)
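Selecting ['max', 'mean', 'std'] works because groupby().describe() returns hierarchical columns: the first level is the variable, the second the statistic. A small check:

```python
import pandas as pd

crime2 = pd.DataFrame({'crimecat': ['THEFT', 'THEFT', 'BURGLARY'],
                       'DaysToReport': [1.0, 3.0, 0.0]})

# describe() per group -> MultiIndex columns; ['DaysToReport'] picks the
# variable, then [['max', 'mean', 'std']] keeps three of its statistics
maxD = crime2.groupby('crimecat').describe()['DaysToReport'][['max', 'mean', 'std']]
print(maxD)
```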
$\sigma_x = \sqrt{\sum_i (x_i - \bar{x})^2}$

$\sigma_y = \sqrt{\sum_i (y_i - \bar{y})^2}$
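As written (no division by n), these are root-sum-of-squared deviations; computed directly on a tiny example:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0])

# sigma_x = sqrt( sum_i (x_i - xbar)^2 ), exactly as in the formula above
sigma_x = np.sqrt(np.sum((x - x.mean()) ** 2))
print(sigma_x)
```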
In [82]: crime_d = crime[pd.to_datetime(crime.year, format="%Y") > pd.to_datetime('2015', format="%Y")]
         operations = {'DaysToReport': 'mean', 'Neighborhood': 'count'}
         crime_neigh = crime_d.groupby('Neighborhood').agg(operations)
         display(crime_neigh.head(5))
         crime_sum = crime_neigh.Neighborhood.sum()
         crime_neigh['Neighborhood'] = crime_neigh.Neighborhood / crime_sum * 100
         display(crime_neigh.head(5))
(two previews of crime_neigh, indexed by Neighborhood: mean DaysToReport with raw counts, then with counts as percentages; rows omitted)
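The dict passed to agg applies a different function per column. Counting the grouping column itself is fragile in recent pandas, so here is a sketch using named aggregation instead (the output column name 'Crimes' is mine; the neighborhoods are toy data):

```python
import pandas as pd

crime_d = pd.DataFrame({'Neighborhood': ['QUEEN ANNE', 'QUEEN ANNE', 'CAPITOL HILL'],
                        'DaysToReport': [2.0, 4.0, 1.0]})

# Mean reporting delay and crime count per neighborhood in one pass
crime_neigh = crime_d.groupby('Neighborhood').agg(
    DaysToReport=('DaysToReport', 'mean'),
    Crimes=('DaysToReport', 'count'))

# Turn raw counts into a percentage of all crimes
crime_neigh['Crimes'] = 100 * crime_neigh.Crimes / crime_neigh.Crimes.sum()
print(crime_neigh)
```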