Lecture 12 - Art and Science of Data Visualization

The document discusses the art and science of data visualization, focusing on techniques for visualizing categorical data and 1D relations using Python libraries like Pandas and Seaborn. It includes examples of visualizing school data across different localities and crime data, showcasing frequency tables, percentages, and cumulative percentages. Additionally, it demonstrates various plotting techniques such as bar plots, box plots, and violin plots to effectively represent the data.

Art and Science of Data Visualization

Visualizing 1D relations

Visualizing Categorical Data


In [1]: import pandas as pd
        # link to the data repository
        linkRepo = 'https://github.com/resourcesbookvisual/data/'
        linkFile = 'raw/master/eduwa.csv'
        fullLink = linkRepo + linkFile
        eduwa = pd.read_csv(fullLink)
        import seaborn.objects as so

In [2]: # List of public schools in Washington State
        display(eduwa.head())
        eduwa.columns

   NCES.School.ID State.School.ID NCES.District.ID State.District.ID Low.Grade High.Grade                          School.Name                           District        County
0    530486002475   WA-31025-1656          5304860          WA-31025         6          8                   10th Street School         Marysville School District     Snohomish
1    530270001270   WA-06114-1646          5302700          WA-06114        KG         12                  49th Street Academy  Evergreen School District (Clark)         Clark
2    530910002602   WA-34033-4500          5309100          WA-34033         9         12     A G West Black Hills High School           Tumwater School District      Thurston
3    530003000001   WA-14005-2834          5300030          WA-14005        PK          6                  A J West Elementary           Aberdeen School District  Grays Harbor
4    530825002361   WA-32081-1533          5308250          WA-32081         9         12       A-3 Multiagency Adolescent Prog            Spokane School District       Spokane

5 rows × 24 columns (remaining columns truncated in the display)

Out[2]: Index(['NCES.School.ID', 'State.School.ID', 'NCES.District.ID',


'State.District.ID', 'Low.Grade', 'High.Grade', 'School.Name',
'District', 'County', 'Street.Address', 'City', 'State', 'ZIP',
'ZIP.4-digit', 'Phone', 'Locale.Code', 'LocaleType', 'LocaleSub',
'Charter', 'Title.I.School', 'Title.1.School.Wide',
'Student.Teacher.Ratio', 'Free.Lunch', 'Reduced.Lunch'],
dtype='object')

In [3]: # Let us look at the locality type
        # value_counts() is called on the Series directly (pd.value_counts is deprecated)
        FTloc = eduwa.LocaleType.value_counts(dropna=False).reset_index()
        FTloc.columns = ['Location', 'Count']
        # Fill NA as Uncategorized
        FTloc.fillna('Uncategorized', inplace=True)
        FTloc

Out[3]:    Location       Count
        0  Suburb           798
        1  City             714
        2  Rural            505
        3  Town             338
        4  Uncategorized     72
In [7]: # Are schools distributed equally across the localities?
        fig = so.Plot(FTloc, x='Location', y='Count').add(so.Bar())
        so.Plot.show(fig)

In [24]: # Visualizing gaps: each locality's deviation from an equal share
         # (25% assumes an even split across the four named locality types)
         FTloc['Percentage'] = 100*(FTloc.Count/FTloc.Count.sum()).round(4)
         FTloc['Gap'] = FTloc['Percentage'] - 25
         fig = so.Plot(FTloc, x='Location', y='Gap').add(so.Bar())
         so.Plot.show(fig)

In [5]: linkRepo = 'https://github.com/resourcesbookvisual/data/'
        linkFile = 'raw/master/crime.csv'
        fullLink = linkRepo + linkFile
        crime = pd.read_csv(fullLink)
In [12]: display(crime.head())
         crime.columns

   ReportNumber OccurredDate    year  month  weekday  OccurredTime OccurredDayTime ReportedDate ReportedTime
0  2.013000e+13   2013-07-09  2013.0    7.0  Tuesday        1930.0         evening   2013-07-10       1722.0
1  2.013000e+13   2013-07-09  2013.0    7.0  Tuesday        1917.0         evening   2013-07-09       2052.0
2  2.013000e+13   2013-07-09  2013.0    7.0  Tuesday        1900.0         evening   2013-07-10         35.0
3  2.013000e+13   2013-07-09  2013.0    7.0  Tuesday        1900.0         evening   2013-07-10       1258.0
4  2.013000e+13   2013-07-09  2013.0    7.0  Tuesday        1846.0         evening   2013-07-09       1846.0

(remaining columns, including DaysToReport, truncated in the display)

Out[12]: Index(['ReportNumber', 'OccurredDate', 'year', 'month', 'weekday',


'OccurredTime', 'OccurredDayTime', 'ReportedDate', 'ReportedTime',
'DaysToReport', 'crimecat', 'CrimeSubcategory',
'PrimaryOffense.Description', 'Precinct', 'Sector', 'Beat',
'Neighborhood'],
dtype='object')

In [13]: # Frequency table
         FTcri = crime.crimecat.value_counts(dropna=False).reset_index()
         FTcri.columns = ['Crimes','Counts']
         FTcri.head()

Out[13]:    Crimes              Counts
         0  THEFT               170946
         1  CAR PROWL           142447
         2  BURGLARY             76630
         3  AGGRAVATED ASSAULT   21315
         4  NARCOTIC             16864

In [15]: # adding Percentage
         FTcri['Percent'] = 100*FTcri.Counts/FTcri.Counts.sum()
         # adding Cumulative Percentage
         FTcri['CumPercent'] = 100*FTcri.Counts.cumsum()/FTcri.Counts.sum()
         # renaming missing values; plain assignment avoids pandas' chained-assignment warning
         FTcri['Crimes'] = FTcri['Crimes'].fillna('UNCATEGORIZED')
         FTcri.head()

Out[15]:    Crimes              Counts    Percent  CumPercent
         0  THEFT               170946  34.209863   34.209863
         1  CAR PROWL           142447  28.506618   62.716481
         2  BURGLARY             76630  15.335262   78.051743
         3  AGGRAVATED ASSAULT   21315   4.265576   82.317320
         4  NARCOTIC             16864   3.374838   85.692158


In [16]: p = so.Plot(FTcri, y='CumPercent', x='Crimes').add(so.Bar())
         p  # the x tick labels overlap and are unreadable; how do we rotate them?
Out[16]:

In [23]: # Use matplotlib to rotate the tick labels
         import matplotlib.pyplot as plt
         fig, ax = plt.subplots()
         p = so.Plot(FTcri, y='CumPercent', x='Crimes').add(so.Bar()).on(ax)
         ax.xaxis.set_tick_params(rotation=90)
         p.show()
         # How to show the major crimes, i.e. those contributing 80% of the total?
In [41]: fig, ax = plt.subplots()
         FTcri['threshold'] = 80
         # truncated line completed: the dashed line marks the 80% cutoff
         p = (so.Plot(FTcri)
                .add(so.Bar(), y='CumPercent', x='Crimes')
                .add(so.Line(linestyle='dashed'), y='threshold', x='Crimes')
                .on(ax))
         ax.xaxis.set_tick_params(rotation=90)
         p.show()
In [38]: # Horizontal bar plot
         fig = so.Plot(FTcri, x='CumPercent', y='Crimes').add(so.Bar())
         so.Plot.show(fig)
In [42]: # Highest grade offered by each school
         eduwa['High.Grade']
         # Can we visualize the number of schools by highest grade offered?
Out[42]: 0 8
1 12
2 12
3 6
4 12
..
2422 12
2423 6
2424 12
2425 6
2426 8
Name: High.Grade, Length: 2427, dtype: object

In [44]: hg = eduwa['High.Grade'].value_counts().reset_index()
         so.Plot(hg, 'High.Grade', 'count').add(so.Bar())

Out[44]:

In [48]: ordLabels = ["PK","KG","1","2","3","4","5","6","7","8","9","10","11","12","13"]
         HGtype = pd.CategoricalDtype(categories=ordLabels, ordered=True)
         display(HGtype)
         # apply that ordered categorical type to the column
         eduwa['High.Grade-O'] = eduwa['High.Grade'].astype(HGtype)
         display(eduwa['High.Grade-O'])
CategoricalDtype(categories=['PK', 'KG', '1', '2', '3', '4', '5', '6', '7', '8', '9',
                             '10', '11', '12', '13'],
                 ordered=True, categories_dtype=object)

0 8
1 12
2 12
3 6
4 12
..
2422 12
2423 6
2424 12
2425 6
2426 8
Name: High.Grade-O, Length: 2427, dtype: category
Categories (15, object): ['PK' < 'KG' < '1' < '2' ... '10' < '11' < '12' < '13']
In [49]: # Frequency table, keeping the grade order (sort=False)
         FThg = eduwa['High.Grade-O'].value_counts(sort=False, dropna=False).reset_index()
         FThg.columns = ['MaxOffer','Counts']
         # adding a cumulative percentage column
         FThg['CumPercent'] = 100*FThg.Counts.cumsum()/FThg.Counts.sum()
         display(FThg)

    MaxOffer  Counts  CumPercent
0         PK      82    3.378657
1         KG       7    3.667079
2          1       6    3.914297
3          2      16    4.573548
4          3      19    5.356407
5          4      45    7.210548
6          5     755   38.318912
7          6     266   49.278945
8          7      11   49.732180
9          8     427   67.325917
10         9      15   67.943964
11        10       7   68.232386
12        11       5   68.438401
13        12     757   99.629172
14        13       9  100.000000

In [50]: # Visualize using a bar plot
         fig = so.Plot(FThg, x='MaxOffer', y='Counts').add(so.Bar())
         so.Plot.show(fig)
In [55]: # Visualize using a boxplot over the ordered category codes
         import seaborn as sns
         import numpy as np
         eduwa['High.Grade-N'] = eduwa['High.Grade-O'].cat.codes
         sns.boxplot(eduwa, x='High.Grade-N')
         # relabel the numeric ticks with the original grade labels
         plt.xticks(np.arange(0,14), ['PK','KG',1,2,3,4,5,6,7,8,9,10,11,12]);

In [58]: # Combining a box plot with a density plot: the violin plot
         sns.boxplot(eduwa, x='High.Grade-N', fill=False)
         sns.violinplot(eduwa, x='High.Grade-N', width=1.2, fill=True)
         plt.ylim([-0.6, 0.6])
         plt.xticks(np.arange(0,14), ['PK','KG',1,2,3,4,5,6,7,8,9,10,11,12]);
Visualizing Numerical Data

In [49]: eduwa['Reduced.Lunch']
Out[49]: 0 3.0
1 9.0
2 40.0
3 10.0
4 4.0
...
2422 0.0
2423 57.0
2424 51.0
2425 35.0
2426 38.0
Name: Reduced.Lunch, Length: 2427, dtype: float64

In [59]: # Visualize using a bar plot of counts
         so.Plot(eduwa, x='Reduced.Lunch').add(so.Bar(), so.Count()).label(y=r"$count (\mathbf{N})$").show()
In [52]: # Visualize using a boxplot
         sns.boxplot(eduwa, x='Reduced.Lunch')
         import matplotlib.pyplot as plt
         plt.yticks([-0.4,-0.2,0,0.2,0.4], [-0.4,-0.2,0,0.2,0.4])
         plt.grid()
In [64]: # Visualize using a histogram and compare against a normal density
         import numpy as np
         import scipy.stats as stats
         statVals = eduwa['Reduced.Lunch'].describe().to_dict()
         display(statVals)
         Start = 0
         width = 10
         newMax = 310
         TheBreaks = np.arange(Start, newMax+width, width)
         display(TheBreaks)
         intervals = pd.cut(eduwa['Reduced.Lunch'], bins=TheBreaks, include_lowest=True)
         display(intervals)
         topCount = intervals.value_counts().max()
         print(topCount)
         widthY = 50
         reminderY = topCount % widthY
         # round topCount up to the next multiple of widthY for the y-axis
         top_Y = topCount if reminderY == 0 else topCount + widthY - reminderY
         vertiVals = list(range(0, top_Y+widthY, widthY))
         N = statVals['count']
         MEAN = statVals['mean']
         STD = statVals['std']
         # expected count in a bin of width w near x: pdf(x) * n * w
         def NormalHist(x, m=MEAN, s=STD, n=N, w=width):
             return stats.norm.pdf(x, m, s)*n*w
{'count': 2296.0,
'mean': 33.53440766550523,
'std': 36.556836589555466,
'min': 0.0,
'25%': 5.0,
'50%': 25.5,
'75%': 47.0,
'max': 301.0}

array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120,
130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250,
260, 270, 280, 290, 300, 310])

0 (-0.001, 10.0]
1 (-0.001, 10.0]
2 (30.0, 40.0]
3 (-0.001, 10.0]
4 (-0.001, 10.0]
...
2422 (-0.001, 10.0]
2423 (50.0, 60.0]
2424 (50.0, 60.0]
2425 (30.0, 40.0]
2426 (30.0, 40.0]
Name: Reduced.Lunch, Length: 2427, dtype: category
Categories (31, interval[float64, right]): [(-0.001, 10.0] < (10.0, 20.0] < (20.0, 30.0] < (30.0, 40.0]
... (270.0, 280.0] < (280.0, 290.0] < (290.0, 300.0] < (300.0, 310.0]]

731
In [66]: fig, ax = plt.subplots()
         # truncated comment completed: a KDE overlay could be added via .add(so.Line(), so.KDE())
         p = so.Plot(eduwa, x='Reduced.Lunch').add(so.Bar(), so.Hist("density")).on(ax)
         ax.plot(np.arange(0,300), NormalHist(np.arange(0,300))/sum(NormalHist(np.arange(0,300))), label="Ideal")
         p.show()
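The commented-out hint in the cell above points at an alternative smoother. A minimal sketch that overlays seaborn's built-in KDE stat instead of the hand-rolled normal curve (the default so.KDE bandwidth is assumed):

fig, ax = plt.subplots()
p = (so.Plot(eduwa, x='Reduced.Lunch')
       .add(so.Bar(), so.Hist("density"))  # histogram scaled to integrate to 1
       .add(so.Line(), so.KDE())           # kernel density estimate overlay
       .on(ax))
p.show()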

Visualizing 2D relations

Categorical-Categorical relations

In [81]: # Looking at the relation between precinct and time of occurrence of crime
         crime[['Precinct', 'OccurredDayTime']]
         PrecinctDaytime = pd.crosstab(crime.Precinct, crime.OccurredDayTime, margins=True)
         P1 = PrecinctDaytime.sort_values('All', ascending=False).drop("All").drop("All", axis=1)
         P1
Out[81]: OccurredDayTime  afternoon    day  evening  night
         Precinct
         NORTH                48754  33744    39867  37942
         WEST                 48931  30366    33766  30925
         EAST                 20774  15976    17380  19880
         SOUTH                22147  17322    16240  15497
         SOUTHWEST            14221  10595    11169  11034


In [105]: P = P1.stack().reset_index()
          P.columns = ['Precinct', 'OccurredDayTime', 'Count']
          so.Plot(P, x='Precinct', y='Count', color='OccurredDayTime').add(so.Bars(), so.Stack())
Out[105]:

In [94]: # A better option is side-by-side (dodged) bars; width=0.9 leaves spacing between groups
         so.Plot(P, x='Precinct', y='Count', color='OccurredDayTime').add(so.Bars(width=0.9), so.Dodge())
Out[94]:

In [119]: # Relative contribution of each precinct within each time of day
          PrecinctDaytime = pd.crosstab(crime.Precinct, crime.OccurredDayTime, normalize='columns')
          # display(PrecinctDaytime)
          P2 = PrecinctDaytime.stack().reset_index()
          P2.columns = ['Precinct', 'OccurredDayTime', 'Count']
          display(P2.head())
   Precinct OccurredDayTime     Count
0      EAST       afternoon  0.134176
1      EAST             day  0.147922
2      EAST         evening  0.146763
3      EAST           night  0.172453
4     NORTH       afternoon  0.314893


In [121]: # Stacked bars of precinct shares within each time of day
          so.Plot(P2, x='OccurredDayTime', y='Count', color='Precinct').add(so.Bar(width=0.9), so.Stack())
Out[121]:

In [68]: # What about a big crosstab?
         CrimeDay = pd.crosstab(crime.crimecat, crime.OccurredDayTime)
         CrimeDay
Out[68]: OccurredDayTime            afternoon    day  evening  night
         crimecat
         AGGRAVATED ASSAULT              5366   3564     4884   7501
         ARSON                            167    196      191    486
         BURGLARY                       22288  24139    14121  16082
         CAR PROWL                      38273  26740    42595  34839
         DISORDERLY CONDUCT                81     41       67     79
         DUI                              939    706     2038   8522
         FAMILY OFFENSE-NONVIOLENT       2516   1748     1217   1120
         GAMBLE                             4      4        7      2
         HOMICIDE                          46     41       49    131
         LIQUOR LAW VIOLATION             491    112      410    606
         LOITERING                         31     20       25      9
         NARCOTIC                        6416   2415     3924   4109
         PORNOGRAPHY                       53     65       17     31
         PROSTITUTION                     675    115     1425   1340
         RAPE                             318    332      354    854
         ROBBERY                         4737   2584     4139   5372
         SEX OFFENSE-OTHER               1759   1501     1014   1776
         THEFT                          64868  38687    38980  28410
         TRESPASS                        5184   4848     2598   3289
         WEAPON                          1445    735      947   1624

In [123]: CrimeDay_n = pd.crosstab(crime.crimecat, crime.OccurredDayTime, normalize='columns')
          CrimeDay_df = CrimeDay_n.stack().reset_index()
          CrimeDay_df.columns = ["Crime", "DayTime", "NormalizedCounts"]
          dayorder = pd.CategoricalDtype(["day", "afternoon", "evening", "night"], ordered=True)
          CrimeDay_df['DayTime-O'] = CrimeDay_df.DayTime.astype(dayorder)
          display(CrimeDay_df.head(5))
                Crime    DayTime  NormalizedCounts  DayTime-O
0  AGGRAVATED ASSAULT  afternoon          0.034473  afternoon
1  AGGRAVATED ASSAULT        day          0.032820        day
2  AGGRAVATED ASSAULT    evening          0.041041    evening
3  AGGRAVATED ASSAULT      night          0.064562      night
4               ARSON  afternoon          0.001073  afternoon


In [156]: # Dot-matrix view: point size encodes the normalized count (truncated line completed)
          so.Plot(CrimeDay_df, x='DayTime-O', y='Crime', pointsize='NormalizedCounts').add(so.Dot(), legend=False)
Out[156]:

In [198]: CrimeDay1 = pd.crosstab(crime.crimecat, crime.OccurredDayTime, normalize='columns', margins=True)
          CrimeDay1.sort_values("All", ascending=False, inplace=True)
          # assign the result: drop() with inplace=False returns a new frame
          CrimeDay1 = CrimeDay1.drop("All", axis=1)
          CrimeDay_df1 = CrimeDay1.stack().reset_index()
          CrimeDay_df1.columns = ["Crime", "DayTime", "NormalizedCounts"]
          CrimeDay_df1['DayTime-O'] = CrimeDay_df1.DayTime.astype(dayorder)
          CrimeDay_df1['Percent'] = 100*CrimeDay_df1["NormalizedCounts"].round(1)
          # display(CrimeDay_df1.head(5))
          # truncated line completed: dots sized by count, labeled with the percentage
          so.Plot(CrimeDay_df1, x='DayTime-O', y='Crime', pointsize='NormalizedCounts',
                  text='Percent').add(so.Dot(), legend=False).add(so.Text())
Out[198]:
In [72]: # Flipped bar chart
         # sort the dataframe in descending order of normalized counts
         CrimeDay_df = CrimeDay_df.sort_values(by='NormalizedCounts', ascending=False)
         # truncated line completed with an assumed axis label
         so.Plot(CrimeDay_df, y='Crime', x='NormalizedCounts').facet('DayTime-O').add(so.Bars(width=1)).label(x="Share")
Out[72]:

In [73]: # Heatmap: a graphical representation of data where values are depicted by color
         sns.heatmap(CrimeDay_df.pivot(index='Crime', columns='DayTime-O', values='NormalizedCounts'))
Out[73]: <Axes: xlabel='DayTime-O', ylabel='Crime'>
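The default heatmap leaves the reader to decode color alone. A minimal sketch of a more legible variant (the annotation, colormap, and colorbar-label choices here are illustrative, not from the lecture):

# same pivoted matrix as above; annot writes each value into its cell
mat = CrimeDay_df.pivot(index='Crime', columns='DayTime-O', values='NormalizedCounts')
sns.heatmap(mat, annot=True, fmt='.2f', cmap='viridis',
            cbar_kws={'label': 'share within time of day'})
plt.tight_layout()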
Categorical-Numerical relation (2 variables)

In [6]: crime.year.value_counts()
        crime2 = crime[crime.year > 2007].copy()
        crime2.dropna(subset=['DaysToReport'], inplace=True)
        crime2.fillna(value={'crimecat': 'Uncategorized'}, inplace=True)
        display(crime2[['crimecat','DaysToReport']].head(10))
        maxD = crime2.groupby('crimecat').describe()['DaysToReport'][['max', 'mean', 'std']]
        maxD.head(5)

                    crimecat  DaysToReport
0                   NARCOTIC           1.0
1                   BURGLARY           0.0
2                  CAR PROWL           1.0
3                      THEFT           1.0
4  FAMILY OFFENSE-NONVIOLENT           0.0
5                   BURGLARY           0.0
6                      THEFT           0.0
7                      THEFT           0.0
8                  CAR PROWL           1.0
9                      THEFT           0.0

Out[6]:                        max      mean        std
        crimecat
        AGGRAVATED ASSAULT  2136.0  2.457019  34.881287
        ARSON                151.0  0.901734   7.616278
        BURGLARY            3653.0  4.225830  34.548476
        CAR PROWL           2923.0  3.282945  31.155756
        DISORDERLY CONDUCT    95.0  0.417910   5.810746

In [8]: import seaborn as sns
        # order crime categories by their maximum reporting delay (truncated line completed)
        sns.boxplot(crime2, y='crimecat', x=crime2['DaysToReport']/365,
                    order=maxD.sort_values(by='max', ascending=False).index)
Out[8]: <Axes: xlabel='DaysToReport', ylabel='crimecat'>


Numerical-Numerical relations

In [79]: Crimebyday = crime2.OccurredDate.value_counts().reset_index()
         Crimebyday.columns = ['Dates', 'Count']
         # display(Crimebyday)
         Crimebyday['Dates-F'] = pd.to_datetime(Crimebyday.Dates, format='%Y-%m-%d')
         display(Crimebyday)

           Dates  Count    Dates-F
0     2017-07-01    199 2017-07-01
1     2017-05-26    192 2017-05-26
2     2016-01-20    186 2016-01-20
3     2015-12-01    184 2015-12-01
4     2018-07-19    183 2018-07-19
...          ...    ...        ...
3958  2011-12-25     58 2011-12-25
3959  2008-12-25     52 2008-12-25
3960  2018-11-06     48 2018-11-06
3961  2012-01-19     47 2012-01-19
3962  2012-01-18     43 2012-01-18

3963 rows × 3 columns

In [78]: so.Plot(Crimebyday, x='Dates-F', y='Count').add(so.Line())  # x ticks may need adjusting
Out[78]:
In [80]: # Adding a smooth trend line (truncated line completed with so.PolyFit, a polynomial fit)
         so.Plot(Crimebyday, x='Dates-F', y='Count').add(so.Dots(marker='+')).add(so.Line(color="0.2"), so.PolyFit())
Out[80]:

In [81]: # Ridge-style plots: visualize the density of daily counts for each year
         so.Plot(Crimebyday, x='Count').add(so.Area(), so.KDE()).facet(Crimebyday['Dates-F'].dt.year, wrap=3)
Out[81]:

Correlation (Bivariate analysis)


$$\rho_{x,y} = \frac{\sigma_{x,y}}{\sigma_x\,\sigma_y}$$

$$\sigma_{x,y} = \sum_i (x_i - \bar{x})(y_i - \bar{y})$$

$$\sigma_x = \sqrt{\sum_i (x_i - \bar{x})^2}, \qquad \sigma_y = \sqrt{\sum_i (y_i - \bar{y})^2}$$
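As a sanity check, the formula above can be evaluated directly and compared against scipy. A minimal sketch on toy vectors (x and y here are made-up illustrative data, not the lecture's):

import numpy as np
import scipy.stats as stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# numerator: sum of co-deviations; denominators: root sums of squared deviations
sxy = np.sum((x - x.mean()) * (y - y.mean()))
sx = np.sqrt(np.sum((x - x.mean())**2))
sy = np.sqrt(np.sum((y - y.mean())**2))
rho = sxy / (sx * sy)

print(rho)                      # manual Pearson correlation
print(stats.pearsonr(x, y)[0])  # matches the library value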
In [82]: crime_d = crime[pd.to_datetime(crime.year, format="%Y") > pd.to_datetime('2015', format="%Y")]
         operations = {'DaysToReport': 'mean', 'Neighborhood': 'count'}
         crime_neigh = crime_d.groupby('Neighborhood').agg(operations)
         display(crime_neigh.head(5))
         crime_sum = crime_neigh.Neighborhood.sum()
         crime_neigh['Neighborhood'] = crime_neigh.Neighborhood/crime_sum * 100
         display(crime_neigh.head(5))

                 DaysToReport  Neighborhood
Neighborhood
ALASKA JUNCTION      3.265236          2330
ALKI                 3.798742           636
BALLARD NORTH        3.876259          3079
BALLARD SOUTH        3.725234          4815
BELLTOWN             2.343525          4023

                 DaysToReport  Neighborhood
Neighborhood
ALASKA JUNCTION      3.265236      1.644563
ALKI                 3.798742      0.448902
BALLARD NORTH        3.876259      2.173223
BALLARD SOUTH        3.725234      3.398528
BELLTOWN             2.343525      2.839518

In [83]: so.Plot(crime_neigh, x='DaysToReport', y='Neighborhood').add(so.Dots())
Out[83]:

In [84]: import scipy.stats as stats
         cor, pval = stats.spearmanr(crime_neigh.Neighborhood, crime_neigh.DaysToReport)
         display(cor)
-0.20139657939884478
In [85]: # truncated line completed with so.PolyFit as the trend smoother
         so.Plot(crime_neigh, x='DaysToReport', y='Neighborhood').add(so.Dots()).add(so.Line(color="0.2"), so.PolyFit())
Out[85]:

What about tertiary (or more) relationships?


Use a facet grid on a single variable (see the sketch below)
Heatmaps
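A hedged sketch of the facet-grid idea, reusing the crime data loaded above (the crime subset and column picks are illustrative choices, not from the lecture): a 2D precinct-by-daytime bar chart is repeated across panels of a third categorical variable.

# assumes crime, pd, and so are in scope from the cells above
CDP = pd.crosstab([crime.crimecat, crime.Precinct], crime.OccurredDayTime).stack().reset_index()
CDP.columns = ['Crime', 'Precinct', 'DayTime', 'Count']
# keep a handful of categories so the grid stays readable (illustrative subset)
subset = CDP[CDP.Crime.isin(['THEFT', 'BURGLARY', 'CAR PROWL', 'ROBBERY'])]
(so.Plot(subset, x='DayTime', y='Count', color='Precinct')
   .add(so.Bars(), so.Dodge())
   .facet('Crime', wrap=2)
   .show())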
