1.Data Handling and Visualization Module 1 Slides

Module 1 provides an introduction to data visualization, focusing on data collection strategies, preparation, and visualization techniques. It emphasizes the importance of using multiple data collection methods, cleaning and labeling data, and utilizing tools like NumPy and pandas for data analysis. The module also covers the interaction with databases and the significance of data transformation and visualization libraries such as Matplotlib and ggplot2.

Uploaded by

varshinipd1345
Module 1

Introduction to Data Visualization


Data collection

Data Collection Strategies


• No one best way: decision depends on:
  – What you need to know: numbers or stories
  – Where the data reside: environment, files, people
  – Resources and time available
  – Complexity of the data to be collected
  – Frequency of data collection
  – Intended forms of data analysis

Rules for Collecting Data


• Use multiple data collection methods
• Use available data, but first find out:
  – how the measures were defined
  – how the data were collected and cleaned
  – the extent of missing data
  – how accuracy of the data was ensured
Data Collection Tools
• Participatory Methods
• Records and Secondary Data
• Observation
• Surveys and Interviews
• Focus Groups
• Diaries, Journals, Self-reported Checklists
• Expert Judgment
• Delphi Technique
• Other Tools

Data Preparation Basic Models


Data Preparation
• Data preparation is the process of preparing raw data so that it is
suitable for further processing and analysis.
• Key steps include collecting, cleaning, and labeling raw data into a
form suitable for machine learning (ML) algorithms and then
exploring and visualizing the data.
• Data preparation can take up to 80% of the time spent on an ML
project.
• Using specialized data preparation tools is important to optimize this
process.
• Data preparation follows a series of steps that starts with collecting
the right data, followed by cleaning, labeling, and then validation and
visualization.
1) Collect data
Collecting data is the process of assembling all the data you need for
ML.
2) Clean data
Cleaning data corrects errors and fills in missing data as a step to
ensure data quality.
3) Label data
Data labeling is the process of identifying raw data (images, text files,
videos, and so on) and adding one or more meaningful and informative
labels to provide context so an ML model can learn from it.
4) Validate and visualize
After data is cleaned and labeled, ML teams often explore the data to
make sure it is correct and ready for ML.
Visualizations like histograms, scatter plots, box and whisker plots, line
plots, and bar charts are all useful tools to confirm data is correct.
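The clean and validate steps above can be sketched with pandas. This is only an illustration under stated assumptions: the table, its column names, the choice of median imputation, and the 0 to 120 age range are all hypothetical and not taken from the slides.

```python
import numpy as np
import pandas as pd

# Collect: hypothetical raw data with one duplicated record and one missing age.
raw = pd.DataFrame({
    "name": ["Amy", "Bob", "Bob", "Chris", "David"],
    "age": [32.0, 41.0, 41.0, np.nan, 42.0],
    "weight": [125.0, 167.0, 167.0, 143.0, 182.0],
})

# Clean: drop exact duplicate rows, then fill the missing age with the median
# (median imputation is an assumption; other strategies are equally valid).
clean = raw.drop_duplicates().reset_index(drop=True)
clean["age"] = clean["age"].fillna(clean["age"].median())

# Validate: check that nothing is missing and that ages are plausible.
no_missing = bool(clean.notna().all().all())
ages_plausible = bool(clean["age"].between(0, 120).all())
print(no_missing, ages_plausible)

# Explore: summary statistics and the bin counts behind a histogram.
print(clean.describe())
counts, bin_edges = np.histogram(clean["age"], bins=3)
print(counts, bin_edges)
```

The bin counts are the numbers a histogram would draw; plotting them is a separate step.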

Overview of Data Visualization



Defining visualization (vis)


• Computer-based visualization systems provide visual
representations of datasets designed to help people carry
out tasks more effectively.

• Visualization is suitable when there is a need to augment
human capabilities rather than replace people with
computational decision-making methods.
• external representation: replace cognition with perception

[Cerebral: Visualizing Multiple Experimental Conditions on a
Graph with Biological Context. Barsky, Munzner, Gardy, and
Kincaid. IEEE TVCG (Proc. InfoVis) 14(6):1253-1260, 2008.]

Data Abstraction

Data abstraction: Three operations


• translate from domain-specific language to generic visualization language
• identify dataset type(s), attribute types
• identify cardinality
  – how many items in the dataset?
  – what is cardinality of each attribute?
    • number of levels for categorical data
    • range for quantitative data
• consider whether to transform data
  – guided by understanding of task
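As a rough illustration of the cardinality questions above, the following pandas sketch counts items, categorical levels, and the quantitative range. The dataset and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical dataset: one categorical and one quantitative attribute.
df = pd.DataFrame({
    "gender": ["F", "M", "M", "M", "F", "M"],
    "height_cm": [160, 170, 168, 190, 175, 172],
})

# How many items are in the dataset?
n_items = len(df)

# Cardinality of a categorical attribute: its number of levels.
n_levels = int(df["gender"].nunique())

# Cardinality of a quantitative attribute: its range.
value_range = (int(df["height_cm"].min()), int(df["height_cm"].max()))

print(n_items, n_levels, value_range)
```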

Task Abstraction
Task abstraction: Actions and targets
• very high-level pattern
• actions
  – analyze
    • high-level choices
  – search
    • find a known/unknown item
  – query
    • find out about characteristics of item
• targets
  – what is being acted on
• {action, target} pairs
  – discover distribution
  – compare trends
  – locate outliers
  – browse topology
Actions: Analyze
• consume
– discover vs present
• classic split
• aka explore vs explain
– enjoy
• newcomer
• aka casual, social

• produce
– annotate, record
– derive
• crucial design choice

Actions: Search
• what does user know?
– target, location
• lookup
– ex: word in dictionary
• alphabetical order

• locate
– ex: keys in your house
– ex: node in network
• browse
– ex: books in bookstore
• explore
– ex: find cool neighborhood in new city

Actions: Query
• how much of the data matters?
– one: identify
– some: compare
– all: summarize


Analysis: Four Levels for Validation
Analysis framework: Four levels, three questions
• domain situation
  – who are the target users?
• abstraction
  – translate from specifics of domain to vocabulary of vis
    • what is shown? data abstraction
    • why is the user looking at it? task abstraction
• idiom
  – how is it shown?
    • visual encoding idiom: how to draw
    • interaction idiom: how to manipulate
• algorithm
  – efficient computation

[A Nested Model of Visualization Design and Validation. Munzner. IEEE TVCG 15(6):921-928, 2009 (Proc. InfoVis 2009).]
[A Multi-Level Typology of Abstract Visualization Tasks. Brehmer and Munzner. IEEE TVCG 19(12):2376-2385, 2013 (Proc. InfoVis 2013).]
Nested model
• downstream: cascading effects

[A Nested Model of Visualization Design and Validation. Munzner. IEEE TVCG 15(6):921-928, 2009 (Proc. InfoVis 2009).]

Interacting with Databases


• In many applications data rarely comes from text files, that being a fairly
inefficient way to store large amounts of data.
• SQL-based relational databases (such as SQL Server, PostgreSQL, and MySQL)
are in wide use, and many alternative non-SQL (so-called NoSQL) databases have
become quite popular.
• The choice of database is usually dependent on the performance, data integrity,
and scalability needs of an application.
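One common way to pull query results from a SQL database into pandas is sketched below. It uses Python's built-in sqlite3 module with an in-memory database so that it runs anywhere; the table name, column names, and rows are hypothetical, and a production database such as PostgreSQL or MySQL would need its own driver.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database; the table and its contents are hypothetical.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (customer TEXT, income REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Alice", 50000.0), ("Bob", None), ("Chris", 60000.0)],
)
con.commit()

# Load the result of a SQL query straight into a DataFrame.
df = pd.read_sql_query(
    "SELECT customer, income FROM sales ORDER BY customer", con
)
con.close()

print(df)
```

The SQL NULL stored for Bob arrives in the DataFrame as a missing value, which is where the cleaning steps in the next sections come in.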

Data Cleaning and Preparation
Dirty Data

• The Statistics View:
  – There is a process that produces data
  – Any dataset is a sample of the output of that process
  – Results are probabilistic
  – You can correct bias in your sample
• The Database View:
  – I got my hands on this data set
  – Some of the values are missing, corrupted, wrong, duplicated
  – Results are absolute (relational model)
  – You get a better answer by improving the quality of the values in your dataset
• The Domain Expert’s View:
  – This data doesn’t look right
  – This answer doesn’t look right
  – What happened?
• The Data Scientist’s View:
  – Some combination of all of the above

Data Cleaning Makes Everything Okay?


The appearance of a hole in the earth's ozone
layer over Antarctica, first detected in 1976,
was so unexpected that scientists didn't pay
attention to what their instruments were telling
them; they thought their instruments were
malfunctioning.
National Center for Atmospheric Research

In fact, the data were rejected as unreasonable
by data quality control algorithms
How Clean is “clean-enough”?
• How much cleaning is too much?
• Answers are likely to be:
• domain-specific
• data source-specific
• application-specific
• user-specific
• all of the above?
How to split between shared and application-specific cleaning?

• Data in the real world is dirty: lots of potentially incorrect data, e.g., from faulty instruments,
human or computer error, or transmission error
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only
  aggregate data
    • e.g., Occupation=“ ” (missing data)
  – noisy: containing noise, errors, or outliers
    • e.g., Salary=“−10” (an error)
  – inconsistent: containing discrepancies in codes or names, e.g.,
    • Age=“42”, Birthday=“03/07/2010”
    • Was rating “1, 2, 3”, now rating “A, B, C”
    • discrepancy between duplicate records
  – intentional (e.g., disguised missing data)
    • Jan. 1 as everyone’s birthday?
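The three kinds of dirtiness above can be flagged with simple rules. The sketch below is a hedged illustration: the records are hypothetical, and it assumes day/month/year birthdays, a reference year of 2024, and a one-year tolerance when comparing recorded and implied age.

```python
import pandas as pd

# Hypothetical records that mirror the examples above.
df = pd.DataFrame({
    "occupation": ["engineer", "", "teacher"],
    "salary": [52000, -10, 48000],
    "age": [42, 35, 29],
    "birthday": ["03/07/2010", "15/02/1989", "21/09/1995"],
})

# Incomplete: an empty or whitespace-only occupation counts as missing.
incomplete = df["occupation"].str.strip() == ""

# Noisy: a negative salary is treated as an error.
noisy = df["salary"] < 0

# Inconsistent: the recorded age disagrees with the age implied by the birthday.
# Assumptions: day/month/year format and a reference year of 2024.
reference_year = 2024
birth = pd.to_datetime(df["birthday"], format="%d/%m/%Y")
implied_age = reference_year - birth.dt.year
inconsistent = (df["age"] - implied_age).abs() > 1

report = pd.DataFrame(
    {"incomplete": incomplete, "noisy": noisy, "inconsistent": inconsistent}
)
print(report)
```

Flagging rows rather than deleting them keeps the decision about what counts as "clean enough" with the analyst.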

Handling Missing Data


Incomplete (Missing) Data
• Data is not always available
• E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
  – equipment malfunction
  – data that were inconsistent with other recorded data and thus deleted
  – data not entered due to misunderstanding
  – certain data not being considered important at the
  time of entry
  – history or changes of the data not being registered
• Missing data may need to be inferred
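A minimal pandas sketch of measuring and handling missing values follows. The customer table is hypothetical, and filling with the column mean is just one simple way to infer a value, chosen here as an assumption rather than as the method the slides prescribe.

```python
import numpy as np
import pandas as pd

# Hypothetical sales data where customer income is sometimes unrecorded.
sales = pd.DataFrame({
    "customer": ["Alice", "Bob", "Chris", "David"],
    "income": [50000.0, np.nan, 60000.0, np.nan],
})

# Measure the extent of missing data.
n_missing = int(sales["income"].isna().sum())

# Option 1: drop the tuples that have no recorded income.
dropped = sales.dropna(subset=["income"])

# Option 2: infer the missing values, here with the mean of the observed ones
# (Series.mean skips NaN by default).
mean_income = sales["income"].mean()
filled = sales.fillna({"income": mean_income})

print(n_missing)
print(dropped)
print(filled)
```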

Data Transformation

• A function that maps the entire set of values of a given attribute to a new set of replacement
values such that each old value can be identified with one of the new values
• Methods
• Smoothing: Remove noise from data
• Attribute/feature construction
• New attributes constructed from the given ones
• Aggregation: Summarization, data cube construction
• Normalization: Scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
• Discretization: Concept hierarchy climbing
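The three normalization methods listed above can be written in a few lines of NumPy. This sketch uses the usual textbook definitions, which the slides name but do not spell out: min-max maps values linearly onto a target range (here [0, 1]), z-score subtracts the mean and divides by the standard deviation (the population standard deviation is assumed), and decimal scaling divides by the smallest power of ten that brings every absolute value below 1. The sample values are hypothetical.

```python
import numpy as np

# Hypothetical attribute values.
values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization onto [new_min, new_max].
new_min, new_max = 0.0, 1.0
min_max = (values - values.min()) / (values.max() - values.min())
min_max = min_max * (new_max - new_min) + new_min

# Z-score normalization (population standard deviation, ddof=0).
z_score = (values - values.mean()) / values.std()

# Decimal scaling: divide by 10**j, with j the smallest integer
# such that the largest absolute scaled value is below 1.
j = 0
while np.abs(values).max() / 10**j >= 1:
    j += 1
decimal_scaled = values / 10**j

print(min_max)
print(z_score)
print(j, decimal_scaled)
```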

Python Libraries: NumPy


NumPy
• Stands for Numerical Python
• Is the fundamental package required for high-performance computing and
data analysis
• NumPy is so important for numerical computations in Python because it
is designed for efficiency on large arrays of data.
• It provides
  – ndarray for creating multidimensional arrays
  – Internal storage in a contiguous block of memory, independent of other built-in
  Python objects, which uses much less memory than built-in Python sequences
  – Standard math functions for fast operations on entire arrays of data without having
  to write loops
• NumPy arrays are important because they enable you to express batch operations
on data without writing any for loops. We call this vectorization.
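To make vectorization concrete, the sketch below computes the same result with an explicit loop and with a single array expression. The array contents are hypothetical.

```python
import numpy as np

prices = np.array([10.0, 20.0, 30.0, 40.0])

# Loop version: one element at a time.
doubled_loop = np.empty_like(prices)
for i in range(len(prices)):
    doubled_loop[i] = prices[i] * 2

# Vectorized version: one expression over the whole array, no for loop.
doubled_vec = prices * 2

# Standard math functions also operate on entire arrays.
roots = np.sqrt(prices)

# ndarray supports multiple dimensions.
matrix = np.arange(6).reshape(2, 3)

print(doubled_vec)
print(roots)
print(matrix.ndim, matrix.shape, matrix.dtype)
```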
NumPy ndarray vs list
• One of the key features of NumPy is its N-dimensional array object, or
ndarray, which is a fast, flexible container for large datasets in Python.

• Whenever you see “array,” “NumPy array,” or “ndarray” in the text, with few
exceptions they all refer to the same thing: the ndarray object.

• NumPy-based algorithms are generally 10 to 100 times faster (or more) than
their pure Python counterparts and use significantly less memory.
import numpy as np
my_arr = np.arange(1000000)
my_list = list(range(1000000))
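One way to compare the two containers is to time the same operation on each. The sketch below doubles every element ten times using the standard timeit module; the choice of operation and repeat count is an assumption, and the measured times depend on the machine, so no particular speedup is claimed here.

```python
import timeit

import numpy as np

my_arr = np.arange(1_000_000)
my_list = list(range(1_000_000))

# Time ten repetitions of doubling every element.
arr_time = timeit.timeit(lambda: my_arr * 2, number=10)
list_time = timeit.timeit(lambda: [x * 2 for x in my_list], number=10)
print(f"ndarray: {arr_time:.4f} s, list: {list_time:.4f} s")

# Both approaches produce the same values.
doubled_arr = my_arr * 2
doubled_list = [x * 2 for x in my_list]
print(doubled_arr[:5], doubled_list[:5])
```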

pandas
Why pandas?
• One of the most popular libraries that data scientists use
• Labeled axes to avoid misalignment of data
  – When merging two tables, some rows may be different
• Missing values or special values may need to be removed or replaced

        salary   Credit score
Alice   50000    700
Bob     NA       670
Chris   60000    NA
David   -99999   750
Ella    70000    685
Tom     45000    660

        height   Weight   Weight2   age   Gender
Amy     160      125      126       32    2
Bob     170      167      155       -1    1
Chris   168      143      150       28    1
David   190      182      NA        42    1
Ella    175      133      138       23    2
Frank   172      150      148       45    1
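Both points can be sketched with pandas, using values in the style of the example tables. Two assumptions are made loudly here: -99999 and -1 are treated as placeholder codes for unknown values, and the two tables are combined with an outer join on their row labels. The column names are simplified for the illustration.

```python
import numpy as np
import pandas as pd

# Two tables labeled by person name; the rows only partly overlap.
finance = pd.DataFrame(
    {
        "salary": [50000, np.nan, 60000, -99999, 70000, 45000],
        "credit_score": [700, 670, np.nan, 750, 685, 660],
    },
    index=["Alice", "Bob", "Chris", "David", "Ella", "Tom"],
)
health = pd.DataFrame(
    {
        "height": [160, 170, 168, 190, 175, 172],
        "age": [32, -1, 28, 42, 23, 45],
    },
    index=["Amy", "Bob", "Chris", "David", "Ella", "Frank"],
)

# Assumption: -99999 and -1 are special values meaning "unknown".
finance["salary"] = finance["salary"].replace(-99999, np.nan)
health["age"] = health["age"].replace(-1, np.nan)

# Outer join on the row labels: rows are aligned by name, not by position,
# and names present in only one table get missing values for the other.
merged = finance.join(health, how="outer")
print(merged)
```

Because alignment happens on the labels, Bob's salary can never end up next to Amy's height, even though both sit in the second row of their tables.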

Overview
• Created by Wes McKinney in 2008, now maintained by many others.
  – He is the author of one of the textbooks: Python for Data Analysis
• Powerful and productive Python data analysis and management
library
• Panel Data System
  – The name is derived from the term "panel data", an econometrics term for
  data sets that include both time-series and cross-sectional data
• It is an open source product.

Overview - 2
• Python library that provides data analysis features similar to R,
MATLAB, and SAS
• Rich data structures and functions to make working with structured
data fast, easy, and expressive.
• It is built on top of NumPy
• Key components provided by pandas:
  – Series
  – DataFrame
From now on:
from pandas import Series, DataFrame
import pandas as pd

matplotlib
• Matplotlib is one of the most popular Python packages used for data
visualization.
• It is a cross-platform library for making 2D plots from data in arrays.
• Matplotlib is written in Python and makes use of NumPy, the numerical
mathematics extension of Python.
• It can be used in Python and IPython shells, Jupyter notebooks, and web
application servers.
• Matplotlib has a procedural interface named pylab, which is designed to
resemble MATLAB, a proprietary programming language developed by
MathWorks.
• Matplotlib along with NumPy can be considered as the open source equivalent of
MATLAB. Matplotlib was originally written by John D. Hunter in 2003.
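A minimal sketch of a 2D plot built from NumPy arrays is shown below. It uses the pyplot module with explicit figure and axes objects rather than pylab, and it selects the non-interactive Agg backend and writes the image to an in-memory buffer so that it runs without a display; those choices, and the plotted function, are assumptions made for the illustration.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Data held in NumPy arrays.
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

# Create a figure with one set of axes and draw a line plot.
fig, ax = plt.subplots()
(line,) = ax.plot(x, y, label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.set_title("A simple 2D line plot")
ax.legend()

# Render the figure as PNG into an in-memory buffer.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```

In an interactive session or a Jupyter notebook, the backend line can be dropped and `plt.show()` used instead of saving to a buffer.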

GGplot
ggplot2
• ggplot2: probably the most important visualization library in R.
• Enables most basic plot types.
• Implementation of the Grammar of Graphics (2010) by Hadley
Wickham, the guru of R.
• http://vita.had.co.nz/papers/layered-grammar.pdf
• The Grammar of Graphics is a philosophical outlook on exploratory
visualization expressed in Wilkinson, L., Anand, A., and Grossman, R.
(2005), “Graph-Theoretic Scagnostics”.
• http://papers.rgrossman.com/proc-094.pdf

Plotting figures and graphs with ggplot


• ggplot is the plotting library for tidyverse
  – Powerful
  – Flexible
• Follows the same conventions as the rest of tidyverse
  – Data stored in tibbles
  – Data is arranged in 'tidy' format
  – Tibble is the first argument to each function

Code structure of a ggplot graph


• Start with a call to ggplot()
  – Pass the tibble of data (normally via a pipe)
  – Say which columns you want to use via a call to aes()
• Say which graphical representation (geometry) you want to use
  – Points, lines, barplots etc.
• Customise labels, colours, annotations etc.



Introduction to pandas Data Structures
Series
• One dimensional array-like object
• It contains an array of data (of any NumPy data type) with associated
indexes. (Indexes can be strings or integers or other data types.)
• By default, the series is indexed from 0 to N, where N = size - 1
from pandas import Series, DataFrame
import pandas as pd

obj = Series([4, 7, -5, 3])
print(obj)
print(obj.index)
print(obj.values)
#Output
0    4
1    7
2   -5
3    3
dtype: int64
RangeIndex(start=0, stop=4, step=1)
[ 4  7 -5  3]
Series – referencing elements
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj2)
#Output
d    4
b    7
a   -5
c    3
dtype: int64
print(obj2.index)
#Output
Index(['d', 'b', 'a', 'c'], dtype='object')
print(obj2.values)
#Output
[ 4  7 -5  3]
print(obj2['a'])
#Output
-5
print(obj2.a)
#Output
-5
obj2['d'] = 10
print(obj2[['d', 'c', 'a']])
#Output
d    10
c     3
a    -5
dtype: int64
print(obj2[:2])
#Output
d    10
b     7
dtype: int64
Series – array/dict operations
Can be thought of as a dict.
Can be constructed from a dict directly.
obj3 = Series({'d': 4, 'b': 7, 'a': -5, 'c': 3})
print(obj3)
#output
d    4
b    7
a   -5
c    3
dtype: int64

numpy array operations can also be applied, which will preserve the index-value link
obj4 = obj3[obj3 > 0]
print(obj4)
#output
d    4
b    7
c    3
dtype: int64
print(obj3**2)
#output
d    16
b    49
a    25
c     9
dtype: int64
print('b' in obj3)
#output
True
