05 Data Loading, Storage and Wrangling-1
05 Data Loading, Storage and Wrangling-1
1|Page
UECM 1534 Programming Techniques for Data Processing Jan 18/19
In [2]: df
Out[2]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
When reading the csv file, make sure the csv file is located in your current working directory.
To get current working directory and set a new working directory, we can use os package.
In [3]: import os
In [4]: currentpath = os.getcwd()
In [5]: currentpath
Out[5]: 'C:\\Users\\YourUserName'
In [6]: os.chdir('C:\\Users\\UECM1534')
Once you successfully change the working directory, you may read the csv file in the working
directory you have set by typing:
A file will not always have a header row. To read this file, you have a couple of options. You can
allow pandas to assign default column names, or you can specify names yourself:
2|Page
UECM 1534 Programming Techniques for Data Processing Jan 18/19
Suppose you wanted the message column to be the index of the returned DataFrame. You can
either indicate you want the column at index 4 or named 'message' using the index_col
argument:
In the event that you want to form a hierarchical index from multiple columns, pass a list of
column numbers or names:
In these cases, you can pass a regular expression as a delimiter for .read_table. This can be
expressed by the regular expression \s+, so we have then:
3|Page
UECM 1534 Programming Techniques for Data Processing Jan 18/19
In [20]: result
Out[20]:
A B C
aaa -0.264438 -1.026059 -0.619500
bbb 0.927272 0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382 1.100491
Because there was one fewer column name than the number of data rows, .read_table infers
that the first column should be the DataFrame’s index in this special case.
Handling missing values is an important. Missing data is usually either not present (empty
string) or marked by some sentinel value. By default, pandas uses a set of commonly occurring
sentinels, such as NA and NULL:
In [25]: result
Out[25]:
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo
In [26]: pd.isnull(result)
Out[26]:
something a b c d message
0 False False False False False True
1 False False False True False False
2 False False False False False False
4|Page
UECM 1534 Programming Techniques for Data Processing Jan 18/19
Before we look at a large file, we make the pandas display settings more compact:
In [29]: pd.options.display.max_rows = 10
Now we have:
In [32]: result
Out[32]:
one two three four key
0 0.467976 -0.038649 -0.295344 -1.824726 L
1 -0.358893 1.404453 0.704965 -0.200638 B
2 -0.501840 0.659254 -0.421691 -0.057688 G
3 0.204886 1.074134 1.388361 -0.982404 R
4 0.354628 -0.133116 0.283763 -0.837063 Q
... ... ... ... ... ..
9995 2.311896 -0.417070 -1.409599 -0.515821 L
9996 -0.479893 -0.650419 0.745152 -0.646038 E
9997 0.523331 0.787112 0.486066 1.093156 K
9998 -0.362559 0.598894 -1.843201 0.887292 G
9999 -0.096376 -1.012999 -0.657431 -0.573315 0
[10000 rows x 5 columns]
If you want to only read a small number of rows (avoiding reading the entire file), specify
that with nrows:
5|Page
UECM 1534 Programming Techniques for Data Processing Jan 18/19
Data can also be exported to a delimited format. Let’s consider one of the CSV files read
before:
In [35]: result
Out[35]:
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo
Using DataFrame’s .to_csv method, we can write the data out to a comma-separated file:
In [37]: url =
‘https://www.fdic.gov/bank/individual/failed/banklist.html'
In [38]: tables=pd.read_html('https://www.fdic.gov/bank/individual
....: /failed/bankl ist.html')
In [39]: len(tables)
Out[39]: 1
6|Page
UECM 1534 Programming Techniques for Data Processing Jan 18/19
From here we could proceed to do some data cleaning and analysis, like computing the
number of bank failures by year:
In [43]: close_time.dt.year.value_counts()
Out[43]:
2010 157
2009 140
2011 92
2012 51
2008 25
...
2004 4
2001 4
2007 3
2003 3
2000 2
Name: Closing Date, Length: 15, dtype: int64
7|Page
UECM 1534 Programming Techniques for Data Processing Jan 18/19
Data stored in a sheet can then be read into DataFrame with parse:
If you are reading multiple sheets in a file, then it is faster to create the ExcelFile, but you can
also simply pass the filename to pandas.read_excel:
In [47]: frame
Out[47]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
To write pandas data to Excel format, you must first create an ExcelWriter, then write data to
it using pandas objects’ to_excel method:
In [50]: writer.save()
You can also pass a file path to to_excel and avoid the ExcelWriter:
The way that missing data is represented in pandas objects is somewhat imperfect, but it is
functional for a lot of users. For numeric data, pandas uses the floating-point value NaN (Not a
Number) to represent missing data. We call this a sentinel value that can be easily detected:
8|Page
UECM 1534 Programming Techniques for Data Processing Jan 18/19
In [53]: string_data
Out[53]:
0 aardvark
1 artichoke
2 NaN
3 avocado
dtype: object
In [54]: string_data.isnull()
Out[54]:
0 False
1 False
2 True
3 False
dtype: bool
In [56]: string_data.isnull()
Out[56]:
0 True
1 False
2 True
3 False
dtype: bool
There is work ongoing in the pandas project to improve the internal details of how missing data
is handled, but the user API functions, like pandas.isnull, abstract away many of the annoying
details. See table below for a list of some functions related to missing data handling.
9|Page
UECM 1534 Programming Techniques for Data Processing Jan 18/19
In [59]: data.dropna()
Out[59]:
0 1.0
2 3.5
4 7.0
dtype: float64
In [60]: data[data.notnull()]
Out[60]:
0 1.0
2 3.5
4 7.0
dtype: float64
With DataFrame objects, things are a bit more complex. You may want to drop rows or
columns that are all NA or only those containing any NAs. Method .dropna by default
drops any row containing a missing value:
In [63]: data
Out[63]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
In [64]: cleaned
Out[64]:
0 1 2
0 1.0 6.5 3.0
10 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19
In [65]: data.dropna(how='all')
Out[65]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0
In [66]: data[4] = NA
In [67]: data
Out[67]:
0 1 2 4
0 1.0 6.5 3.0 NaN
1 1.0 NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN 6.5 3.0 NaN
Suppose you want to keep only rows containing a certain number of observations. You
can indicate this with the thresh argument:
In [70]: df.iloc[:4, 1] = NA
In [71]: df.iloc[:2, 2] = NA
In [72]: df
Out[72]:
0 1 2
0 -0.204708 NaN NaN
1 -0.555730 NaN NaN
2 0.092908 NaN 0.769023
3 1.246435 NaN -1.296221
4 0.274992 0.228913 1.352917
5 0.886429 -2.001637 -0.371843
11 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19
In [73]: df.dropna()
Out[73]:
0 1 2
4 0.274992 0.228913 1.352917
5 0.886429 -2.001637 -0.371843
In [74]: df.dropna(thresh=2)
Out[74]:
0 1 2
2 0.092908 NaN 0.769023
3 1.246435 NaN -1.296221
4 0.274992 0.228913 1.352917
5 0.886429 -2.001637 -0.371843
In [75]: df.fillna(0)
Out[75]:
0 1 2
0 -0.204708 0.000000 0.000000
1 -0.555730 0.000000 0.000000
2 0.092908 0.000000 0.769023
3 1.246435 0.000000 -1.296221
4 0.274992 0.228913 1.352917
5 0.886429 -2.001637 -0.371843
Calling fillna with a dict, you can use a different fill value for each column:
In [76]: df.fillna({1: 0.5, 2: 0})
Out[76]:
0 1 2
0 -0.204708 0.500000 0.000000
1 -0.555730 0.500000 0.000000
2 0.092908 0.500000 0.769023
3 1.246435 0.500000 -1.296221
4 0.274992 0.228913 1.352917
5 0.886429 -2.001637 -0.371843
With fillna you can do lots of other things with a little creativity. For example, you
might pass the mean or median value of a Series:
12 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19
In [78]: data.fillna(data.mean())
Out[78]:
0 1.000000
1 3.833333
2 3.500000
3 3.833333
4 7.000000
dtype: float64
pandas.merge connects rows in DataFrames based on one or more keys. This will be familiar
to users of SQL or other relational databases, as it implements database join operations.
pandas.concat concatenates or “stacks” together objects along an axis.
The combine_first instance method enables splicing together overlapping data to fill in
missing values in one object with values from another.
In [81]: df1
Out[81]:
data1 key
0 0 b
1 1 b
2 2 a
3 3 c
4 4 a
5 5 a
6 6 b
13 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19
In [82]: df2
Out[82]:
data2 key
0 0 a
1 1 b
2 2 d
This is an example of a many-to-one join; the data in df1 has multiple rows labeled a
and b, whereas df2 has only one row for each value in the key column. Calling .merge
with these objects we obtain:
Note that no specification which column to join on. If that information is not specified,
merge uses the overlapping column names as the keys. It’s a good practice to specify
explicitly, though:
If the column names are different in each object, you can specify them separately:
14 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19
Other possible options are 'left', 'right', and 'outer'. The outer join takes the
union of the keys, combining the effect of applying both left and right joins:
15 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19
In [91]: df1
Out[91]:
data1 key
0 0 b
1 1 b
2 2 a
3 3 c
4 4 a
5 5 b
In [92]: df2
Out[92]:
data2 key
0 0 a
1 1 b
2 2 a
3 3 b
4 4 d
16 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19
A last issue to consider in merge operations is the treatment of overlapping column names.
17 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19
In [101]: left1
Out[101]:
key value
0 a 0
1 b 1
2 a 2
3 a 3
4 b 4
5 c 5
In [102]: right1
Out[102]:
group_val
a 3.5
b 7.0
Since the default merge method is to intersect the join keys, you can instead form the
union of them with an outer join:
18 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19
In [107]: left2
Out[107]:
Ohio Nevada
a 1.0 2.0
c 3.0 4.0
e 5.0 6.0
In [108]: right2
Out[108]:
Missouri Alabama
b 7.0 8.0
c 9.0 10.0
d 11.0 12.0
e 13.0 14.0
DataFrame has a convenient .join instance for merging by index. It can also be used to
combine together many DataFrame objects having the same or similar indexes but non-
overlapping columns. In the prior example, we could have written:
19 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19
Calling .concat with these objects in a list glues together the values and indexes:
By default .concat works along axis=0, producing another Series. If you pass axis=1,
the result will instead be a DataFrame (axis=1is the columns):
20 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19
In [119]: df1
Out[119]:
one two
a 0 1
b 2 3
c 4 5
In [120]: df2
Out[120]:
three four
a 5 6
c 7 8
A last consideration concerns DataFrames in which the row index does not contain
any relevant data:
In [124]: df1
Out[124]:
a b c d
0 1.246435 1.007189 -1.296221 0.274992
1 0.228913 1.352917 0.886429 -2.001637
2 -0.371843 1.669025 -0.438570 -0.539741
In [125]: df2
Out[125]:
b d a
0 0.476985 3.248944 -1.021228
1 -0.577087 0.124121 0.302614
21 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19
22 | P a g e