1.Data Handling and Visualization Module 1 Slides
Data collection
Module 1 Introduction to Data Visualization
Data Abstraction
Task Abstraction
Task abstraction: Actions and targets
• {action, target} pairs
– discover distribution
– compare trends
– locate outliers
– browse topology
• very high-level pattern
• actions
– analyze
• high-level choices
– search
• find a known/unknown item
– query
• find out about characteristics of item
• targets
– what is being acted on
Actions: Analyze
• consume
– discover vs present
• classic split
• aka explore vs explain
– enjoy
• newcomer
• aka casual, social
• produce
– annotate, record
– derive
• crucial design choice
Actions: Search
• what does user know?
– target, location
• lookup
– ex: word in dictionary
• alphabetical order
• locate
– ex: keys in your house
– ex: node in network
• browse
– ex: books in bookstore
• explore
– ex: find cool neighborhood in new city
Actions: Query
• how much of the data matters?
– one: identify
– some: compare
– all: summarize
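The three query scopes can be sketched with pandas. This is only an illustration: the table of city temperatures and its column names are made up for the example and do not come from the slides.

```python
import pandas as pd

# Hypothetical data, invented purely for illustration.
temps = pd.DataFrame(
    {
        "city": ["Oslo", "Cairo", "Lima", "Pune"],
        "temp_c": [4.0, 31.0, 19.0, 27.0],
    }
).set_index("city")

# identify: the query concerns ONE item
one = temps.loc["Cairo", "temp_c"]

# compare: the query concerns SOME items, viewed against each other
some = temps.loc[["Oslo", "Lima"], "temp_c"]
warmer = some.idxmax()  # label of the warmer of the two cities

# summarize: the query concerns ALL items
overall_mean = temps["temp_c"].mean()

print(one, warmer, overall_mean)
```

The code only varies how many rows each query touches, which is the distinction the slide draws.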
• abstraction
– translate from specifics of domain to vocabulary of vis
• what is shown? data abstraction
• why is the user looking at it? task abstraction
• idiom
– how is it shown?
• visual encoding idiom: how to draw
• interaction idiom: how to manipulate
• algorithm
[A Multi-Level Typology of Abstract Visualization Tasks. Brehmer and
Nested model
• downstream: cascading effects
• Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission error
• incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
• noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
• inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
• intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
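A few of these problems can be flagged with simple pandas checks. The sketch below is a minimal example under stated assumptions: the records, column names, and the choice of checks are hypothetical and only echo the slide's examples (blank occupation, negative salary, duplicate records, Jan. 1 birthdays).

```python
import pandas as pd

# Hypothetical records, invented to mirror the examples on the slide.
df = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cy", "Bob"],
        "occupation": ["nurse", " ", "chef", " "],
        "salary": [5200, -10, 4100, -10],
        "birthday": ["1990-05-14", "2010-01-01", "2010-01-01", "2010-01-01"],
    }
)

# Duplicate records: count rows that exactly repeat an earlier row.
n_duplicates = int(df.duplicated().sum())

# Incomplete: a blank string is a missing value in disguise, so turn it into NaN.
occupation = df["occupation"].str.strip()
df["occupation"] = occupation.where(occupation != "")
n_missing = int(df["occupation"].isna().sum())

# Noisy: a negative salary is an error.
bad_salary = df[df["salary"] < 0]

# Disguised missing data: an unusually high share of Jan. 1 birthdays is suspicious.
jan1_share = float(df["birthday"].str.endswith("-01-01").mean())

print(n_duplicates, n_missing, len(bad_salary), jan1_share)
```

These checks only detect the issues. Deciding whether to drop, correct, or impute the affected values is a separate step.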
Data Transformation
• A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
• Methods
• Smoothing: Remove noise from data
• Attribute/feature construction
• New attributes constructed from the given ones
• Aggregation: Summarization, data cube construction
• Normalization: Scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
• Discretization: Concept hierarchy climbing
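The three normalization methods listed above can be sketched in NumPy. The sample values are made up, min-max here targets the range [0, 1], and the z-score uses the population standard deviation. All three are assumptions of this example, not choices stated on the slide.

```python
import numpy as np

# Hypothetical attribute values, invented for illustration.
values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# min-max normalization: rescale linearly so min -> 0 and max -> 1
min_max = (values - values.min()) / (values.max() - values.min())

# z-score normalization: subtract the mean, divide by the standard deviation
# (np.std defaults to the population standard deviation, ddof=0)
z_score = (values - values.mean()) / values.std()

# normalization by decimal scaling: divide by 10**j, where j is the
# smallest integer such that every scaled absolute value is below 1
j = 0
while np.abs(values).max() / 10 ** j >= 1:
    j += 1
decimal_scaled = values / 10 ** j

print(min_max)
print(z_score)
print(j, decimal_scaled)
```

Min-max is sensitive to outliers because a single extreme value sets the range. The z-score is the usual alternative when the minimum and maximum are unknown or unreliable.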
• Whenever you see “array,” “NumPy array,” or “ndarray” in the text, with few
exceptions they all refer to the same thing: the ndarray object.
• NumPy-based algorithms are generally 10 to 100 times faster (or more) than
their pure Python counterparts and use significantly less memory.
import numpy as np
my_arr = np.arange(1000000)
my_list = list(range(1000000))
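One way to check the speed claim on your own machine is to time the same operation on both containers. This is a rough sketch using the standard `timeit` module. The helper names `double_array` and `double_list` are made up for this example, and the measured ratio will vary with hardware and library versions.

```python
import timeit

import numpy as np

my_arr = np.arange(1_000_000)
my_list = list(range(1_000_000))


def double_array():
    # Vectorized: one NumPy operation over the whole array.
    return my_arr * 2


def double_list():
    # Pure Python: one multiplication per element in an interpreted loop.
    return [x * 2 for x in my_list]


# Total time in seconds for 10 repetitions of each version.
numpy_seconds = timeit.timeit(double_array, number=10)
list_seconds = timeit.timeit(double_list, number=10)

print(f"NumPy: {numpy_seconds:.4f} s, list: {list_seconds:.4f} s")
```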
pandas
Why pandas?
• One of the most popular libraries that data scientists use
• Labeled axes to avoid misalignment of data
• When merging two tables, some rows may be different

        salary  Credit score
Alice   5000    700
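Label-based alignment can be sketched as follows. Only the Alice row (salary 5000, credit score 700) comes from the slide. The other names and numbers are hypothetical, added so that the two tables do not share all their rows.

```python
import pandas as pd

# Two tables indexed by person. Bob and Carol are invented for illustration.
salary = pd.Series({"Alice": 5000, "Bob": 4000}, name="salary")
credit = pd.Series({"Carol": 650, "Alice": 700}, name="credit_score")

# Combining along columns aligns values by index label, not by position.
# Rows present in only one table get NaN in the other column.
table = pd.concat([salary, credit], axis=1)

print(table)
```

Alice's two values land on the same row even though she appears at different positions in the two inputs, which is the misalignment that labeled axes guard against.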
Overview
• Created by Wes McKinney in 2008, now maintained by many others.
• Author of one of the textbooks: Python for Data Analysis
• Powerful and productive Python data analysis and management library
• Panel Data System
• The name is derived from the term "panel data", an econometrics term for
data sets that include both time-series and cross-sectional data
• It is an open source product.
Overview - 2
• Python library providing data analysis features similar to R, MATLAB, and SAS
• Rich data structures and functions to make working with structured data fast, easy, and expressive
• It is built on top of NumPy
• Key components provided by pandas:
• Series
• DataFrame
From now on:
from pandas import Series, DataFrame
import pandas as pd
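With those imports in place, the two key components can be created as in this minimal sketch. The values, index labels, and column names are made up for illustration.

```python
import pandas as pd
from pandas import Series, DataFrame

# Series: a one-dimensional array of values with an index of labels.
s = Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

# DataFrame: a table of columns that share one row index.
df = DataFrame(
    {
        "state": ["Ohio", "Ohio", "Nevada"],
        "year": [2000, 2001, 2001],
        "pop": [1.5, 1.7, 2.4],
    }
)

print(s["a"])          # select a Series value by label
print(df["pop"].max())  # select a column, then reduce it
```

Each column of a DataFrame is itself a Series, so the same label-based selection and reductions apply to both structures.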
matplotlib
• Matplotlib is one of the most popular Python packages used for data
visualization.
• It is a cross-platform library for making 2D plots from data in arrays.
• Matplotlib is written in Python and makes use of NumPy, the numerical
mathematics extension of Python.
• It can be used in Python and IPython shells, Jupyter notebooks, and web application servers.
• Matplotlib has a procedural interface named Pylab, which is designed to resemble MATLAB, a proprietary programming language developed by MathWorks.
• Matplotlib along with NumPy can be considered as the open source equivalent of
MATLAB. Matplotlib was originally written by John D. Hunter in 2003.
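A minimal sketch of a 2D plot from NumPy arrays is shown below. It uses the `pyplot` module with explicit figure and axes objects rather than the Pylab interface named above, and it selects the non-interactive `Agg` backend so that it runs without a display. The plotted function and the labels are arbitrary choices for this example.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend: render to memory, no window

import matplotlib.pyplot as plt
import numpy as np

# Data in arrays: 100 points of one sine period.
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

fig, ax = plt.subplots()
(line,) = ax.plot(x, y, label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.legend()

# Write the figure as PNG bytes into an in-memory buffer.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```

In a Jupyter notebook the backend line is normally unnecessary, and the figure is displayed inline instead of being saved.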
GGplot
ggplot2
• ggplot2: probably the most important visualization library in R.
• Enables most basic plot types.
• Implementation of the layered Grammar of Graphics (2010) by Hadley Wickham, the guru of R.
• http://vita.had.co.nz/papers/layered-grammar.pdf
• The Grammar of Graphics is a philosophical outlook on exploratory
visualization expressed in Wilkinson, L., Anand, A., and Grossman, R.
(2005), “Graph-Theoretic Scagnostics”.
• http://papers.rgrossman.com/proc-094.pdf