Python Data Analysis Visualization
Python Data Analysis Visualization
[email protected]
3
IPython
An
Interac4ve
Compu4ng
and
Development
Environment
It
provides
an
execute-explore
workow
instead
of
typical
edit-compile-run
workow
of
many
other
programming
languages
It
provides
very
4ght
integra4on
with
the
opera4ng
systems
shell
and
le
system
It
also
includes:
A
rich
GUI
console
with
inline
ploTng
A
web-based
interac4ve
notebook
format
A
lightweight,
fast
parallel
compu4ng
engine
[email protected]
4
Why
use
Python
for
Data
Analysis
The
Python
language
is
easy
to
fall
in
love
with
Python
is
dis4nguished
by
its
large
and
ac4ve
scien4c
compu4ng
community
Adop4on
of
Python
for
scien4c
compu4ng
in
both
industry
applica4ons
and
academic
research
has
increased
signicantly
since
the
early
2000s
Pythons
improved
library
support
(pandas)
made
it
a
strong
tool
for
data
manipula4on
tasks
[email protected]
5
Example:
US
Baby
Names
1880-2012
The
United
States
Social
Security
Administra4on
(SSA)
has
mad
available
data
on
the
frequency
of
baby
names
from
1880
through
2012,
this
data
set
is
ofen
used
in
illustra4ng
data
manipula4on
in
R,
Python,
etc.
The
data
can
be
obtained
at:
hWp://www.ssa.gov/oact/babynames/limits.html
Things
can
be
done
with
this
data
set
Visualize
the
propor4on
of
babies
given
a
par4cular
name
Determine
the
naming
trend
Determine
the
most
popular
names
in
each
year
[email protected]
6
Check
the
Data
In
IPython,
MacOS
or
Linux:
use
the
UNIX
head
to
look
at
the
rst
10
lines
of
the
one
of
the
les.
Windows:
download
the
les,
and
click
to
open
the
les
This
is
nicely
comma-separated
form.
[email protected]
7
Load
Data
Using
csv
module
from
the
standard
library,
CSV
means
Comma
Separated
Values,
and
any
delimiter
can
be
chosen.
[email protected]
9
Anonymous
(lamda)
Func4ons
Anonymous
or
lambda
func4ons
are
simple
func4ons
consis4ng
of
a
single
statement,
the
result
is
the
return
value.
Lamda
func4ons
are
convenient
in
data
analysis
since
there
are
many
cases
where
data
transforma4on
func4ons
will
take
func4ons
as
arguments.
[email protected]
10
Aggregate
the
data
at
the
year
and
sex
Since
the
level
data
set
is
split
into
les
by
year,
one
need
to
traverse
all
the
les
to
get
the
total
number
of
births
per
year
per
sex
[email protected]
11
The
result
list
(Lef)
rst
10
records
in
pieces
list
(Right)
last
10
records
in
pieces
list
[email protected]
12
Matplotlib
review
Before
we
start
ploTng
the
result,
lets
review
the
plot
rst
[email protected]
13
Prepare
the
data
for
plot
Currently,
the
result
is
a
list
of
list,
each
internal
list
include
three
values,
[year,
female
births,
male
births],
to
plot
the
births
according
to
year
and
sex,
the
plot
needs
to
have
year
as
x-axis,
and
births
as
y-axis,
while
two
lines
will
be
showing
to
represent
female
and
male
birth.
[email protected]
14
Plot
the
total
births
by
sex
and
year
Plot
[email protected]
15
Reorganize
the
data
Concatenate
the
all
les
together
to
prepare
the
further
analysis.
[email protected]
16
Extract
a
subset
of
the
data
Find
the
top
1000
names
for
each
sex/year
combina4on,
further
narrow
down
the
data
set
to
facilitate
further
analysis,
the
sor4ng
is
ignored
here
since
the
input
les
are
already
in
descending
order
[email protected]
17
Compare
the
subset
data
with
original
data
The
subset
data
has
much
less
records
than
the
original
data
set,
but
represents
the
majority
informa4on
[email protected]
18
Analyzing
Naming
Trends
With
the
full
data
set
and
Top
1,000
data
set
in
hand,
we
can
start
analyzing
various
naming
trends
of
interest.
SpliTng
the
Top
1,000
names
into
the
boy
and
girl
por4ons:
[email protected]
19
Analyzing
Naming
Trends
(Cont.)
Plot
for
a
handful
of
names
in
a
subplot,
John,
Harry,
Marry,
to
compare
their
trends
over
the
years,
rst
prepare
data
set
for
each
chosen
name.
[email protected]
20
Analyzing
Naming
Trends
(Cont.)
Plot
three
curves
ver4cally,
with
x-axis
as
years,
y-axis
as
births,
the
result
shows
that
those
names
have
grown
out
of
favor
with
American
popula4on
[email protected]
21
Measuring
the
increase
in
naming
diversity
To
explain
why
there
is
a
decrease
in
the
previous
plots,
we
can
measure
the
propor4on
of
births
represented
by
the
top
1000
most
popular
names
by
year
and
sex
Step
1:
nd
total
of
birth
per
year
for
each
sex
[email protected]
22
Measuring
the
increase
in
naming
diversity
(Cont.)
Step
2:
compute
the
propor4on
of
top
1000
births
to
the
total
births
per
year
per
sex
For
boys:
[email protected]
23
Measuring
the
increase
in
naming
diversity
(Cont.)
For
girls:
[email protected]
24
Measuring
the
increase
in
naming
diversity
(Cont.)
Plot
the
result
shows
that
fewer
parents
are
choosing
the
popular
names
for
their
children
over
the
years
[email protected]
25
Measuring
the
increase
in
naming
diversity
(Cont.)
Another
interest
metric
is
the
number
of
dis4nct
popular
names,
taken
in
order
of
popularity
from
highest
to
lowest
in
the
top
50%
of
births.
Step
1:
Add
the
fourth
column
to
girls1000
and
boys1000
list,
to
represent
the
birth
propor4on
to
the
total
birth
of
the
given
year,
then
sort
the
list
in
descending
order
on
propor4on,
sort
the
list
again
in
ascending
order
on
years.
The
result
list
will
have
each
years
records
in
a
chunk
with
propor4on
number
in
decreasing
order.
[email protected]
26
Measuring
the
increase
in
naming
diversity
(Cont.)
For
girls:
[email protected]
27
Measuring
the
increase
in
naming
diversity
(Cont.)
For
boys:
[email protected]
28
Measuring
the
increase
in
naming
diversity
(Cont.)
Step
2:
Adding
the
propor4on
for
each
year
from
highest
un4l
the
total
propor4on
reaches
50%,
recording
the
number
of
individual
names
[email protected]
29
Measuring
the
increase
in
naming
diversity
(Cont.)
For
girls:
[email protected]
30
Measuring
the
increase
in
naming
diversity
(Cont.)
For
boys:
[email protected]
31
Measuring
the
increase
in
naming
diversity
(Cont.)
Step
3:
Plot
the
result,
as
you
can
see,
girl
names
has
always
been
more
diverse
than
boy
names,
and
the
dis4nguished
names
become
more
over
4me.
[email protected]
32
Python
Library
for
Data
Analysis
Pandas
wriWen
by
Wes
McKinney
hWp://pandas.pydata.org/
provides
rich
data
structures
and
func4ons
working
with
structured
data
It
is
one
of
the
cri4cal
ingredients
enabling
Python
to
be
a
powerful
and
produc4ve
data
analysis
environment.
The
primary
object
in
pandas
is
called
DataFrame
a
two-dimensional
tabular,
column-oriented
data
structure
with
both
row
and
column
labels
Pandas
combines
the
features
of
NumPy,
spreadsheets
and
rela4onal
databases
[email protected]
33
Useful
Links
Python
Scien4c
Lecture
Notes
hWp://scipy-lectures.github.io/
Matplotlib
hWp://matplotlib.org/
Documenta4on
hWp://docs.python.org