Module 4 - Data Exploration and Visualization

Uploaded by

Rachell Ann Uson
Exploring and Visualizing Data

Stephen F Elston | Principal Consultant, Quantia Analytics, LLC


Module Outline

• Exploring Data
• Visualizing Data
Exploring Data

• Introduction to R and Python for Data Science


• Working with Data Frames in R and Python
• Working with Data Frames in Azure ML
• Working with Metadata
Data Frames

• Available in R and Python Pandas
– Map to and from Azure ML tables
• Rectangular tables
– Each column of one type
• Common Tasks:
– Subsetting by rows and columns
– Logical filtering of rows and columns

Column1 Column2 … ColumnN
1 ABC … 12.2
2 XYZ … 13.1
3 ABC … 12.8
4 XYZ … 10.9
5 ABC … 3.75
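
The common tasks listed above can be sketched in pandas as follows. This is a minimal illustration, not part of the original slides; the frame reuses the small example table that appears throughout this module, and the variable names (cols, rows, recent, subset) are assumptions chosen for readability.

```python
import pandas as pd

# Small illustrative frame; the values mirror the example table used in these slides.
frame = pd.DataFrame({
    "Col1": [2012, 2013, 2013, 2014],
    "Col2": [14, 13, 34, 23],
    "Col3": [45, 76, 65, 47],
})

# Subset by columns: a list of names returns a DataFrame with only those columns.
cols = frame[["Col1", "Col3"]]

# Subset by rows: positional slice, end-exclusive (rows 0 and 1).
rows = frame.iloc[0:2]

# Logical filtering: a Boolean mask keeps only the rows where the condition holds.
recent = frame[frame["Col1"] == 2013]

# Combine both with .loc: a mask for the rows and a list of names for the columns.
subset = frame.loc[frame["Col2"] > 13, ["Col1", "Col2"]]
```

Using .iloc for positions and .loc for labels or masks keeps the intent of each selection explicit.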
Dplyr
library(dplyr)
Col1 Col2 Col3
2012 14 45
2013 13 76
2013 34 65
2014 23 47

dir <- "C:/data"  # forward slashes: "\d" is not a valid escape in an R string
file <- "values.csv"
path <- file.path(dir, file)
frame1 <- read.csv(path, header=TRUE, stringsAsFactors = FALSE)
Col1 Col2 Col3
2013 13 76
2013 34 65

frame1 <- filter(frame1, Col1 == 2013)


Col1 Col3
2012 45
2013 76
2013 65
2014 47

frame1 <- select(frame1, Col1, Col3)


Col1 Col2 Col3 Col4
2012 14 45 59
2013 13 76 89
2013 34 65 99
2014 23 47 70

frame1 <- mutate(frame1, Col4 = Col2 + Col3)


Other useful dplyr verbs include:

frame1 <- group_by(frame1, Col1)


frame1 <- distinct(frame1, Col1)
frame1 <- sample_frac(frame1, 0.5)
frame1 <- sample_n(frame1, 500)
frame1 <- summarize(frame1, m1 = mean(Col1))
Col1 Col2 Col3 Col4
2013 13 76 89
2013 34 65 99

frame1 <- frame1 %>%
  filter(Col1 == 2013) %>%
  mutate(Col4 = Col2 + Col3)
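
For comparison, a rough pandas counterpart of the same filter-then-mutate chain might look like the sketch below. It is not from the original slides; method chaining with .loc and .assign is one of several ways to express this, and the name result is an assumption.

```python
import pandas as pd

# Example table from the slides.
frame1 = pd.DataFrame({
    "Col1": [2012, 2013, 2013, 2014],
    "Col2": [14, 13, 34, 23],
    "Col3": [45, 76, 65, 47],
})

result = (
    frame1
    .loc[lambda df: df["Col1"] == 2013]                 # like filter(Col1 == 2013)
    .assign(Col4=lambda df: df["Col2"] + df["Col3"])    # like mutate(Col4 = Col2 + Col3)
)
```

Each step returns a new data frame, so frame1 itself is left unchanged, much as in the dplyr pipeline before reassignment.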
Pandas
Col1 Col2 Col3
2012 14 45
2013 13 76
2013 34 65
2014 23 47

import os
import pandas as pd

data_dir = r"c:\data"  # raw string so the backslash is not read as an escape
file_name = "values.csv"
path = os.path.join(data_dir, file_name)
frame1 = pd.read_csv(path)
Col1 Col2 Col3
2012 14 45
2013 13 76
2013 34 65
2014 23 47

frame1 = frame1["Col2"]
Col1 Col2 Col3
2012 14 45
2013 13 76
2013 34 65
2014 23 47

frame1 = frame1[["Col1", "Col2"]]


Col1 Col2 Col3
2013 13 76
2013 34 65

frame1 = frame1[1:3:1]
Col1 Col2 Col3
2012 14 45
2013 13 76
2013 34 65
2014 23 47

frame1 = frame1[:3]
Col1 Col2 Col3
2012 14 45
2013 13 76
2013 34 65
2014 23 47

frame1 = frame1["Col2"][1:2]
      Col1     Col2     Col3
count 4        4        4
mean  2013     21       58.25
std   0.816497 9.763879 14.863266
min   2012     13       45
25%   2012.75  13.75    46.5

frame1 = frame1.describe()
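
As a quick check of the summary statistics shown above, the sketch below rebuilds the example table and calls describe(). It is an illustration rather than part of the original slides; the name summary is an assumption, and the result is kept separate so that frame1 is not overwritten.

```python
import pandas as pd

# Example table from the slides.
frame1 = pd.DataFrame({
    "Col1": [2012, 2013, 2013, 2014],
    "Col2": [14, 13, 34, 23],
    "Col3": [45, 76, 65, 47],
})

# describe() returns a new DataFrame indexed by statistic name
# (count, mean, std, min, 25%, 50%, 75%, max), one column per numeric column.
summary = frame1.describe()

# Individual statistics can be read with .loc[row_label, column_label].
col2_mean = summary.loc["mean", "Col2"]
col2_std = summary.loc["std", "Col2"]   # sample standard deviation (n - 1 denominator)
```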
Col1 Col2 Col3 Col4
2012 14 45 59
2013 13 76 89
2013 34 65 99
2014 23 47 70

frame1["Col4"] = frame1["Col2"] + frame1["Col3"]


Col1 Col2 Col3
2012 14 45
2013 13 76
2013 34 65
2014 23 47

frame1.drop("Col3", axis=1, inplace=True)


Other Useful Methods

isnull()
groupby(key|expression, axis)
copy()
where(Boolean)
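
A minimal sketch of the four methods listed above, assuming a small frame with one missing value. The data and the names missing, backup, masked and means are illustrative and not from the original slides.

```python
import pandas as pd

# Illustrative frame with one missing value in Col2.
frame1 = pd.DataFrame({
    "Col1": [2012, 2013, 2013, 2014],
    "Col2": [14.0, float("nan"), 34.0, 23.0],
})

# isnull(): Boolean Series that is True where a value is missing.
missing = frame1["Col2"].isnull()

# copy(): an independent copy, so later edits to frame1 do not affect it.
backup = frame1.copy()
frame1.loc[0, "Col2"] = 0.0

# where(Boolean): keep values where the condition is True, NaN elsewhere.
masked = backup["Col2"].where(backup["Col2"] > 20)

# groupby(key): split rows by key, then aggregate each group (NaN is skipped by mean).
means = backup.groupby("Col1")["Col2"].mean()
```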
Other Operations

pandas.DataFrame.apply(function, axis)
pandas.Series.map(function | dictionary | series)
pandas.DataFrame.applymap(function)
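
The sketch below illustrates the difference between these operations on the example columns. It is not from the original slides; the label dictionary is hypothetical. Because DataFrame.applymap is deprecated in recent pandas releases (2.1 and later) in favor of DataFrame.map, the element-wise step here is written with apply plus Series.map, which should run on older and newer versions alike.

```python
import pandas as pd

frame1 = pd.DataFrame({
    "Col2": [14, 13, 34, 23],
    "Col3": [45, 76, 65, 47],
})

# DataFrame.apply: call a function once per column (axis=0) or once per row (axis=1).
col_range = frame1.apply(lambda col: col.max() - col.min(), axis=0)
row_sum = frame1.apply(lambda row: row["Col2"] + row["Col3"], axis=1)

# Series.map: transform each element with a function, or look it up in a dictionary.
years = pd.Series([2012, 2013, 2013, 2014])
labels = years.map({2012: "old", 2013: "mid", 2014: "new"})  # hypothetical labels
doubled = frame1["Col2"].map(lambda v: v * 2)

# Element-wise transform of a whole frame (what applymap does),
# expressed as a per-column Series.map.
halved = frame1.apply(lambda col: col.map(lambda v: v / 2))
```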
Col1 Col2 Col3
2012 14 45
2013 47 141
2014 23 47

frame1 = frame1.groupby("Col1").sum()
R Data Frames in Azure ML
[Diagram: Azure ML dataset → Azure ML table → Execute R Script module (input ports 1 and 2) → R data frame]

frame1 <- maml.mapInputPort(1)


frame2 <- maml.mapInputPort(2)
source("src/myScript.R")
print("Hello world")
maml.mapOutputPort("frame1")

R Device Port
Python Data Frames in Azure ML
[Diagram: Azure ML dataset → Azure ML table → Execute Python Script module (input ports 1 and 2) → pandas data frame]

def azureml_main(frame1, frame2):
    import myModule as mm
    print("Hello world")
    return frame1

Device Port
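
Before pasting a script into the Execute Python Script module, it can help to exercise the entry point locally. The sketch below is an assumption-laden illustration: only the function name azureml_main and the data-frame-in, data-frame-out contract come from the slide, while the Col4 computation, the default arguments and the local call are hypothetical.

```python
import pandas as pd

def azureml_main(frame1=None, frame2=None):
    # Entry point: receives up to two pandas data frames from the input ports.
    print("Hello world")                               # console output
    result = frame1.copy()                             # leave the caller's frame untouched
    result["Col4"] = result["Col2"] + result["Col3"]   # illustrative transformation
    return result

# Local smoke test, with the example table standing in for an Azure ML table.
sample = pd.DataFrame({
    "Col1": [2012, 2013, 2013, 2014],
    "Col2": [14, 13, 34, 23],
    "Col3": [45, 76, 65, 47],
})
out = azureml_main(sample)
```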
Data Types and Metadata

Stephen F Elston | Principal Consultant, Quantia Analytics, LLC


Chapter Overview

• Data types
• Continuous and discrete values
• Categorical variables
• Azure ML tools
• Quantization of continuous variables
Azure ML Table Data Types
• Numeric: Floating Point
• Numeric: Integer
• Boolean
• String
• Categorical
• Date-time
• Time-Span
• Image

Data type is Metadata


Continuous vs discrete variables

• Continuous variables can take on any value in a range, limited only by the measurement resolution


– Temperature
– Distance
– Weight
• Discrete variables have fixed values
– Number of people
– Number of wheels on a vehicle
Categorical variables

• Categories are metadata


• Too many categories can lead to problems
– Not enough data per category
– Too many dimensions in a model
• Often need to combine categories
– Reduce number of categories
– Group like categories
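
Combining like categories can be sketched in pandas with a lookup dictionary. The category names and the grouping below are hypothetical, chosen only to illustrate the idea; they are not from the original slides.

```python
import pandas as pd

# Hypothetical vehicle body styles: six distinct categories.
body = pd.Series(
    ["sedan", "coupe", "hatchback", "wagon", "convertible", "sedan", "hardtop"]
)

# Hypothetical grouping that maps like categories onto a smaller set.
grouping = {
    "sedan": "sedan",
    "wagon": "sedan",
    "hatchback": "hatchback",
    "coupe": "sporty",
    "convertible": "sporty",
    "hardtop": "sporty",
}

# Series.map looks each value up in the dictionary; astype("category")
# records the reduced set of categories as metadata on the column.
combined = body.map(grouping).astype("category")
```

After the mapping, each remaining category has more rows behind it, and a model built on the column needs fewer dimensions.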
The Azure ML Metadata Editor

• Metadata includes:
– Data type
– Categories of categorical data
– Field type: feature, label, etc.
– Column name
• Editor enables manipulation of metadata
Quantizing Continuous Variables

• Convert continuous variable to categorical


• Bin values into categories
– Small, medium, large
– Hot, cold
– Income groups
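
Binning a continuous variable into categories can be sketched with pandas.cut. The temperatures, bin edges and the extra "mild" label are assumptions made for illustration; they are not from the original slides.

```python
import pandas as pd

# Hypothetical temperature readings (continuous values).
temps = pd.Series([3.0, 12.5, 18.0, 24.5, 31.0])

# Bin edges are right-inclusive by default: (-inf, 10], (10, 25], (25, inf).
binned = pd.cut(
    temps,
    bins=[-float("inf"), 10, 25, float("inf")],
    labels=["cold", "mild", "hot"],
)
```

The result is a categorical Series, so the chosen bins become part of the column's metadata.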
Visualizing Data
Overview

• Exploratory data analysis through visualization


• The R ggplot2 package
• Python pandas plotting and the matplotlib package
Exploratory data analysis

• Explore the data with visualization


• Understand the relationships in the data
• Create multiple views of data
• Aesthetics to project multiple dimensions
• Conditioning to project multiple dimensions
• Understand sources of model errors
John Tukey, Exploratory Data Analysis, 1977, Addison-Wesley
Views of data

• Relationships in data can be complex


• Data exploration requires multiple views
• Views reveal different aspects of the relationships
• Different plots highlight different relationships
Different plots for different views

• Scatter
• Scatter plot matrix
• Line plots
• Bar plots
• Histograms
• Box plots
• Violin plots
• Q-Q plots
Aesthetics for visualization
• Allow projection of additional dimensions
• But don’t overdo it!
• Color
• Shape
• Size
• Transparency
• Aesthetics specific to plot type
Scatter plot
Scatter plot (larger point size)
Scatter plot (+ color by category)
Scatter plot (+ shape by category)
Scatter plot (+ alpha = 0.3)
Scatter plot matrix
Line plot
Bar Plot - unordered
Bar Plot - ordered
Histogram
Box Plot (group by category)
Violin Plot (group by category)
Q-Q Normal Plot
Conditioned Plots
Conditioned plots

• How can you project multiple dimensions?


• Analogous to conditional probability: p(d | g)
• Plots of subsets (group by)
• Also known as faceted plots

William S. Cleveland, Visualizing Data, 1993, Hobart Press
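
A conditioned (faceted) plot can be sketched directly in matplotlib by drawing one panel per value of the conditioning variable. The data, column names and layout below are assumptions for illustration, not the plots from the original slides; the Agg backend is selected so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: y against x, conditioned on a categorical column "group".
data = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "x": [1, 2, 3, 1, 2, 3],
    "y": [2.0, 4.1, 6.2, 1.0, 0.8, 1.1],
})

groups = sorted(data["group"].unique())

# One panel per group; shared axes make the panels directly comparable.
fig, axes = plt.subplots(1, len(groups), figsize=(8, 4), sharex=True, sharey=True)
for ax, name in zip(axes, groups):
    subset = data[data["group"] == name]   # the "group by" subset for this panel
    ax.scatter(subset["x"], subset["y"])
    ax.set_title("group = " + name)
    ax.set_xlabel("x")
axes[0].set_ylabel("y")
```

Sharing the axes across panels is the key design choice: differences between subsets then show up as differences in the plotted points rather than in the scales.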


Conditioned plots (faceting)
One conditioning variable
Conditioned plots (faceting)
With two dimensions of conditioning
Conditioning (faceting)
With scatter plot
Conditioning (faceting)
With two conditioning categorical variables
Conditioning (faceting)
With three conditioning categorical variables
Another view
Different views reveal different relationships
Introduction to ggplot2
Overview of ggplot2

• Produces presentation quality charts


• Uses grammar of graphics
• Operators define graphics properties
• Operators chained to create complex plots
The Grammar of Graphics

1. Import library
library(ggplot2)

2. Chain methods to define plot


ggplot(dataframe, aes(x = xcol, y = ycol, by = opt)) +
  geom_plottype(arguments)

3. Add attributes to chain

  + xlab("X label") + ylab("Y label") + ggtitle("Title") +
  other_properties()
ggplot2 Types

geom_bar
geom_boxplot
geom_histogram
geom_line
geom_point
stat_smooth
stat_hexbin
ggplot2 Options and Aesthetics

facet_grid()
xlab(), ylab()
ggtitle()
shape
color
alpha
size
Execute R Script
Azure ML Tables zip file

myFrame <- maml.mapInputPort(1)  # input port 1 or 2

source("src/myScript.R")

maml.mapOutputPort("myFrame")

plots

Azure ML Table R Device Port


Introduction to pandas plotting and
matplotlib
Python plotting
• matplotlib underpins plotting in Python
e.g. matplotlib.pyplot
• pandas.DataFrame.plot built on matplotlib.pyplot
• Other libraries built on matplotlib
• For some plot types, or for more control, use matplotlib.pyplot directly
Pandas Plotting
1. Import libraries
import matplotlib.pyplot as plt

2. Define and clear a figure


fig1 = plt.figure(figsize=(9, 9))
fig1.clf()

3. Define one or more axes


ax = fig1.gca()

4. Apply plot method


pandas.DataFrame.plot(kind = 'someType', ax = ax, ….)

5. Save figure

fig1.savefig('scatter2.png')
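
The five steps above can be put together as in the sketch below. It is an illustration rather than the original slide code: the scatter plot type, the example data, the title, the temporary output path and the Agg backend (chosen so the code runs without a display) are all assumptions.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt   # 1. Import libraries
import pandas as pd

# Example table from the slides.
frame1 = pd.DataFrame({
    "Col1": [2012, 2013, 2013, 2014],
    "Col2": [14, 13, 34, 23],
    "Col3": [45, 76, 65, 47],
})

fig1 = plt.figure(figsize=(9, 9))   # 2. Define and clear a figure
fig1.clf()

ax = fig1.gca()                     # 3. Define the axes to draw on

# 4. Apply the plot method, passing the axes so pandas draws on this figure.
frame1.plot(kind="scatter", x="Col2", y="Col3", ax=ax)
ax.set_title("Col3 vs Col2")

# 5. Save the figure (written to the temp directory in this sketch).
out_path = os.path.join(tempfile.gettempdir(), "scatter2.png")
fig1.savefig(out_path)
```

Passing ax explicitly is what ties the pandas plot to the figure that is later saved.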
Python Plotting in Azure ML
def azureml_main(frame1):
    import matplotlib.pyplot as plt  ## Import libraries

    fig1 = plt.figure(figsize=(9, 9))  ## Define a figure
    fig1.clf()  ## Clear the current figure
    ax = fig1.gca()  ## Define axis to plot

    frame1.plot(kind = 'someType', ax = ax, ….)

    fig1.savefig('scatter2.png')  ## Save figure in a file for output

    return frame1  ## Must return a Pandas dataframe
Types for pandas.DataFrame.plot()
• ‘line’ : line plot (default)
• ‘bar’ : vertical bar plot
• ‘barh’ : horizontal bar plot
• ‘kde’ or ‘density’: Kernel Density Estimation plot
• ‘scatter’ : scatter plot
Options and Aesthetics for pandas.DataFrame.plot()

• ax – pyplot axis
• x, y – coordinates
• color – line or symbol color
• s – size by value
• shape
• alpha – transparency
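
A sketch of how several of these options combine in one scatter plot. The specific color, marker, size scaling and alpha value are illustrative assumptions, not settings from the original slides; marker is the underlying matplotlib argument that controls point shape.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Example table from the slides.
frame1 = pd.DataFrame({
    "Col1": [2012, 2013, 2013, 2014],
    "Col2": [14, 13, 34, 23],
    "Col3": [45, 76, 65, 47],
})

fig = plt.figure(figsize=(6, 6))
ax = fig.gca()

frame1.plot(
    kind="scatter",
    x="Col2", y="Col3",                    # x, y: columns used as coordinates
    ax=ax,                                 # ax: axes to draw on
    color="red",                           # color: symbol color
    s=(frame1["Col3"] * 2).to_numpy(),     # s: point size scaled by a column's values
    marker="^",                            # shape of the points
    alpha=0.3,                             # alpha: transparency
)
points = ax.collections[0]   # the scatter points drawn on the axes
```

Low alpha values help when points overlap, since denser regions then appear darker.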
Execute Python Script
Azure ML Tables zip file

def azureml_main(inFrame1, inFrame2):
    import my_package

    fig.savefig('fig.png')

    return myFrame

Azure ML Table Python Device Port


©2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, Office, Azure, System Center, Dynamics and other product names are or may be registered trademarks and/or trademarks in the
U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft
must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after
the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
