0% found this document useful (0 votes)

90 views22 pages

05 Data Loading, Storage and Wrangling-1

The document discusses techniques for loading, storing, and manipulating data in Python using the pandas library. It covers: 1) Reading and writing tabular data from text files like CSV and HTML tables, as well as handling missing data values. 2) Loading Excel files using pandas and the xlrd/openpyxl libraries. 3) Techniques for working with large datasets in chunks, as well as examples of accessing specific rows or columns of data.

Uploaded by

FucKerWengie

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

90 views22 pages

05 Data Loading, Storage and Wrangling-1

Uploaded by

FucKerWengie

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 22

UECM 1534 Programming Techniques for Data Processing Jan 18/19

Chapter 05: Data Loading,

Storage and Wrangling
Accessing data is a necessary first step for using most of the tools. In this chapter, student will
learn to reading and writing in text format, reading and rewriting Microsoft excel file,
handling missing data, and combining and merging datasets.

1. Reading and Writing Data in Text Format

pandas features a number of functions for reading tabular data as a DataFrame object. Table
below summarizes some of them, though .read_csv and .read_table are likely the ones
you’ll use the most.

We can use .read_csv to read it into a DataFrame:

In [1]: df = pd.read_csv('your path/5_1.csv')

1|Page
UECM 1534 Programming Techniques for Data Processing Jan 18/19

In [2]: df
Out[2]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

When reading the csv file, make sure the csv file is located in your current working directory.
To get current working directory and set a new working directory, we can use os package.

In [3]: import os
In [4]: currentpath = os.getcwd()

In [5]: currentpath
Out[5]: 'C:\\Users\\YourUserName'

In [6]: os.chdir('C:\\Users\\UECM1534')

In [7]: currentpath = os.getcwd()

Once you successfully change the working directory, you may read the csv file in the working
directory you have set by typing:

In [8]: mypath = currentpath + '/5_1.csv'

In [9]: df = pd.read_csv(mypath)

We could also have used .read_table and specified the delimiter:

In [10]: pd.read_table(mypath, sep=',')

Out[10]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

A file will not always have a header row. To read this file, you have a couple of options. You can
allow pandas to assign default column names, or you can specify names yourself:

In [11]: mypath1 = currentpath + '/5_2.csv'

In [12]: pd.read_csv(mypath1, header=None)

Out[12]:
0 1 2 3 4
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

2|Page
UECM 1534 Programming Techniques for Data Processing Jan 18/19

In [13]: pd.read_csv(mypath1, names=['a', 'b', 'c', 'd', 'message'])

Out[13]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

Suppose you wanted the message column to be the index of the returned DataFrame. You can
either indicate you want the column at index 4 or named 'message' using the index_col
argument:

In [14]: names = ['a', 'b', 'c', 'd', 'message']

In [15]: pd.read_csv(mypath1, names=names, index_col='message')

Out[15]:
a b c d
message
hello 1 2 3 4
world 5 6 7 8
foo 9 10 11 12

In the event that you want to form a hierarchical index from multiple columns, pass a list of
column numbers or names:

In [16]: mypath2 = currentpath + '/5_3.csv'

In [17]: pd.read_csv(mypath2, index_col=[0,1])

Out[17]:
value1 value2
key1 key2
one a 1 2
b 3 4
c 5 6
d 7 8
two a 9 10
b 11 12
c 13 14
d 15 16

In these cases, you can pass a regular expression as a delimiter for .read_table. This can be
expressed by the regular expression \s+, so we have then:

In [18]: mypath3 = currentpath + '/5_4.txt'

In [19]: result = pd.read_table(mypath3, sep='\s+')

3|Page
UECM 1534 Programming Techniques for Data Processing Jan 18/19

In [20]: result
Out[20]:
A B C
aaa -0.264438 -1.026059 -0.619500
bbb 0.927272 0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382 1.100491

Because there was one fewer column name than the number of data rows, .read_table infers
that the first column should be the DataFrame’s index in this special case.

You can also skip certain rows of a file with skiprows:

In [21]: mypath4 = currentpath + '/5_5.csv'

In [22]: pd.read_csv(mypath4, skiprows=[0, 2, 3])

Out[22]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

Handling missing values is an important. Missing data is usually either not present (empty
string) or marked by some sentinel value. By default, pandas uses a set of commonly occurring
sentinels, such as NA and NULL:

In [23]: mypath5 = currentpath + '/5_6.csv'

In [24]: result = pd.read_csv(mypath5)

In [25]: result
Out[25]:
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo

In [26]: pd.isnull(result)
Out[26]:
something a b c d message
0 False False False False False True
1 False False False True False False
2 False False False False False False

Different NA sentinels can be specified for each column in a dict:

In [27]: sentinels = {'message': ['foo', 'NA'], 'something': ['two']}

4|Page
UECM 1534 Programming Techniques for Data Processing Jan 18/19

In [28]: pd.read_csv(mypath5, na_values=sentinels)

Out[28]:
something a b c d message
0 one 1 2 3.0 4 NaN
1 NaN 5 6 NaN 8 world
2 three 9 10 11.0 12 NaN

1.1 Reading Text Files in Pieces

When processing very large files or figuring out the right set of arguments to correctly
process a large file, you may only want to read in a small piece of a file or iterate through
smaller chunks of the file.

Before we look at a large file, we make the pandas display settings more compact:

In [29]: pd.options.display.max_rows = 10

Now we have:

In [30]: mypath6 = currentpath + '/5_7.csv'

In [31]: result = pd.read_csv(mypath6)

In [32]: result
Out[32]:
one two three four key
0 0.467976 -0.038649 -0.295344 -1.824726 L
1 -0.358893 1.404453 0.704965 -0.200638 B
2 -0.501840 0.659254 -0.421691 -0.057688 G
3 0.204886 1.074134 1.388361 -0.982404 R
4 0.354628 -0.133116 0.283763 -0.837063 Q
... ... ... ... ... ..
9995 2.311896 -0.417070 -1.409599 -0.515821 L
9996 -0.479893 -0.650419 0.745152 -0.646038 E
9997 0.523331 0.787112 0.486066 1.093156 K
9998 -0.362559 0.598894 -1.843201 0.887292 G
9999 -0.096376 -1.012999 -0.657431 -0.573315 0
[10000 rows x 5 columns]

If you want to only read a small number of rows (avoiding reading the entire file), specify
that with nrows:

5|Page
UECM 1534 Programming Techniques for Data Processing Jan 18/19

In [33]: pd.read_csv(mypath6, nrows=5)

Out[33]:
one two three four key
0 0.467976 -0.038649 -0.295344 -1.824726 L
1 -0.358893 1.404453 0.704965 -0.200638 B
2 -0.501840 0.659254 -0.421691 -0.057688 G
3 0.204886 1.074134 1.388361 -0.982404 R
4 0.354628 -0.133116 0.283763 -0.837063 Q

1.2 Writing Data to Text Format

Data can also be exported to a delimited format. Let’s consider one of the CSV files read
before:

In [34]: result = pd.read_csv(mypath5)

In [35]: result
Out[35]:
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo

Using DataFrame’s .to_csv method, we can write the data out to a comma-separated file:

In [36]: data.to_csv(currentpath +'/out.csv')

1.3 Reading html Table

By using the .to_csv method, you can now save any table you created or read from other
file to excel. To read html file, the pandas.read_html function has a number of options,
but by default it searches for and attempts to parse all tabular data contained within
<table> tags. The result is a list of DataFrame objects:

In [37]: url =
‘https://www.fdic.gov/bank/individual/failed/banklist.html'

In [38]: tables=pd.read_html('https://www.fdic.gov/bank/individual
....: /failed/bankl ist.html')

In [39]: len(tables)
Out[39]: 1

6|Page
UECM 1534 Programming Techniques for Data Processing Jan 18/19

In [40]: failures = tables[0]

In [41]: failures.sort_values(by = 'Bank Name', ascending = False)

Out[41]:

From here we could proceed to do some data cleaning and analysis, like computing the
number of bank failures by year:

In [42]: close_time = pd.to_datetime(failures['Closing Date'])

In [43]: close_time.dt.year.value_counts()
Out[43]:
2010 157
2009 140
2011 92
2012 51
2008 25
...
2004 4
2001 4
2007 3
2003 3
2000 2
Name: Closing Date, Length: 15, dtype: int64

2. Binary Data Formats – Microsoft Excel Files

pandas also supports reading tabular data stored in Excel 2003 (and higher) files using either
the ExcelFile class or pandas.read_excel function. Internally these tools use the add-on
packages xlrd and openpyxl to read XLS and XLSX files, respectively. You may need to install these
manually with pip or conda.

To use pd.ExcelFile, create an instance by passing a path to an xls or xlsx file:

In [44]: xlsx = pd.ExcelFile(mypath + '/5_8.xlsx')

7|Page
UECM 1534 Programming Techniques for Data Processing Jan 18/19

Data stored in a sheet can then be read into DataFrame with parse:

In [45]: pd.read_excel(xlsx, 'Sheet1')

Out[45]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

If you are reading multiple sheets in a file, then it is faster to create the ExcelFile, but you can
also simply pass the filename to pandas.read_excel:

In [46]: frame = pd.read_excel(mypath + '/5_8.xlsx', 'Sheet1')

In [47]: frame
Out[47]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

To write pandas data to Excel format, you must first create an ExcelWriter, then write data to
it using pandas objects’ to_excel method:

In [48]: writer = pd.ExcelWriter(mypath + '/test.xlsx')

In [49]: frame.to_excel(writer, 'Sheet2')

In [50]: writer.save()

You can also pass a file path to to_excel and avoid the ExcelWriter:

In [51]: frame.to_excel(mypath + '/test2.xlsx')

3. Handling Missing Data

Missing data occurs commonly in many data analysis applications. One of the goals of pandas
is to make working with missing data as painless as possible. For example, all of the descriptive
statistics on pandas objects exclude missing data by default.

The way that missing data is represented in pandas objects is somewhat imperfect, but it is
functional for a lot of users. For numeric data, pandas uses the floating-point value NaN (Not a
Number) to represent missing data. We call this a sentinel value that can be easily detected:

8|Page
UECM 1534 Programming Techniques for Data Processing Jan 18/19

In [52]: string_data = pd.Series(['aardvark', 'artichoke', np.nan,

....: 'avocado'])

In [53]: string_data
Out[53]:
0 aardvark
1 artichoke
2 NaN
3 avocado
dtype: object

In [54]: string_data.isnull()
Out[54]:
0 False
1 False
2 True
3 False
dtype: bool

The built-in Python None value is also treated as NA in object arrays:

In [55]: string_data[0] = None

In [56]: string_data.isnull()
Out[56]:
0 True
1 False
2 True
3 False
dtype: bool

There is work ongoing in the pandas project to improve the internal details of how missing data
is handled, but the user API functions, like pandas.isnull, abstract away many of the annoying
details. See table below for a list of some functions related to missing data handling.

9|Page
UECM 1534 Programming Techniques for Data Processing Jan 18/19

3.1 Filtering Out Missing Data

There are a few ways to filter out missing data. While you always have the option to do it
by hand using pandas.isnull and boolean indexing, the .dropna can be helpful. On a
Series, it returns the Series with only the non-null data and index values:

In [57]: from numpy import nan as NA

In [58]: data = pd.Series([1, NA, 3.5, NA, 7])

In [59]: data.dropna()
Out[59]:
0 1.0
2 3.5
4 7.0
dtype: float64

This is equivalent to:

In [60]: data[data.notnull()]
Out[60]:
0 1.0
2 3.5
4 7.0
dtype: float64

With DataFrame objects, things are a bit more complex. You may want to drop rows or
columns that are all NA or only those containing any NAs. Method .dropna by default
drops any row containing a missing value:

In [61]: data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],

....: [NA, NA, NA], [NA, 6.5, 3.]])

In [62]: cleaned = data.dropna()

In [63]: data
Out[63]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0

In [64]: cleaned
Out[64]:
0 1 2
0 1.0 6.5 3.0

10 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19

Passing how='all'will only drop rows that are all NA:

In [65]: data.dropna(how='all')
Out[65]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0

To drop columns in the same way, pass axis=1:

In [66]: data[4] = NA

In [67]: data
Out[67]:
0 1 2 4
0 1.0 6.5 3.0 NaN
1 1.0 NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN 6.5 3.0 NaN

In [68]: data.dropna(axis=1, how='all')

Out[68]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0

Suppose you want to keep only rows containing a certain number of observations. You
can indicate this with the thresh argument:

In [69]: df = pd.DataFrame(np.random.randn(6, 3))

In [70]: df.iloc[:4, 1] = NA

In [71]: df.iloc[:2, 2] = NA

In [72]: df
Out[72]:
0 1 2
0 -0.204708 NaN NaN
1 -0.555730 NaN NaN
2 0.092908 NaN 0.769023
3 1.246435 NaN -1.296221
4 0.274992 0.228913 1.352917
5 0.886429 -2.001637 -0.371843

11 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19

In [73]: df.dropna()
Out[73]:
0 1 2
4 0.274992 0.228913 1.352917
5 0.886429 -2.001637 -0.371843
In [74]: df.dropna(thresh=2)
Out[74]:
0 1 2
2 0.092908 NaN 0.769023
3 1.246435 NaN -1.296221
4 0.274992 0.228913 1.352917
5 0.886429 -2.001637 -0.371843

3.2 Filling In Missing Data

Rather than filtering out missing data (and potentially discarding other data along with it),
you may want to fill in the “holes” in any number of ways. For most purposes, the .fillna
method is the workhorse function to use. Calling method .fillna with a constant replaces
missing values with that value:

In [75]: df.fillna(0)
Out[75]:
0 1 2
0 -0.204708 0.000000 0.000000
1 -0.555730 0.000000 0.000000
2 0.092908 0.000000 0.769023
3 1.246435 0.000000 -1.296221
4 0.274992 0.228913 1.352917
5 0.886429 -2.001637 -0.371843

Calling fillna with a dict, you can use a different fill value for each column:
In [76]: df.fillna({1: 0.5, 2: 0})
Out[76]:
0 1 2
0 -0.204708 0.500000 0.000000
1 -0.555730 0.500000 0.000000
2 0.092908 0.500000 0.769023
3 1.246435 0.500000 -1.296221
4 0.274992 0.228913 1.352917
5 0.886429 -2.001637 -0.371843

With fillna you can do lots of other things with a little creativity. For example, you
might pass the mean or median value of a Series:

In [77]: data = pd.Series([1., NA, 3.5, NA, 7])

12 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19

In [78]: data.fillna(data.mean())
Out[78]:
0 1.000000
1 3.833333
2 3.500000
3 3.833333
4 7.000000
dtype: float64

4. Combining and Merging Dataset

Data contained in pandas objects can be combined together in a number of ways:

 pandas.merge connects rows in DataFrames based on one or more keys. This will be familiar
to users of SQL or other relational databases, as it implements database join operations.
 pandas.concat concatenates or “stacks” together objects along an axis.
 The combine_first instance method enables splicing together overlapping data to fill in
missing values in one object with values from another.

They’ll be utilized in examples throughout the rest of the chapter.

4.1 Database-Style DataFrame Joins

Merge or join operations combine datasets by linking rows using one or more keys. These
operations are central to relational databases (e.g., SQL-based). The merge function in
pandas is the main entry point for using these algorithms on your data.

In [79]: df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a',

....: 'b'],'data1': range(7)})

In [80]: df2 = pd.DataFrame({'key': ['a', 'b', 'd'],

....: 'data2': range(3)})

In [81]: df1
Out[81]:
data1 key
0 0 b
1 1 b
2 2 a
3 3 c
4 4 a
5 5 a
6 6 b

13 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19

In [82]: df2
Out[82]:
data2 key
0 0 a
1 1 b
2 2 d

This is an example of a many-to-one join; the data in df1 has multiple rows labeled a
and b, whereas df2 has only one row for each value in the key column. Calling .merge
with these objects we obtain:

In [83]: pd.merge(df1, df2)

Out[83]:
data1 key data2
0 0 b 1
1 1 b 1
2 6 b 1
3 2 a 0
4 4 a 0
5 5 a 0

Note that no specification which column to join on. If that information is not specified,
merge uses the overlapping column names as the keys. It’s a good practice to specify
explicitly, though:

In [84]: pd.merge(df1, df2, on='key')

Out[84]:
data1 key data2
0 0 b 1
1 1 b 1
2 6 b 1
3 2 a 0
4 4 a 0
5 5 a 0

If the column names are different in each object, you can specify them separately:

In [85]: df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a',

....: 'a', 'b'],'data1': range(7)})

In [86]: df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],

....: 'data2': range(3)})

14 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19

In [87]: pd.merge(df3, df4, left_on='lkey', right_on='rkey')

Out[87]:
data1 lkey data2 rkey
0 0 b 1 b
1 1 b 1 b
2 6 b 1 b
3 2 a 0 a
4 4 a 0 a
5 5 a 0 a

Other possible options are 'left', 'right', and 'outer'. The outer join takes the
union of the keys, combining the effect of applying both left and right joins:

In [88]: pd.merge(df1, df2, how='outer')

Out[88]:
data1 key data2
0 0.0 b 1.0
1 1.0 b 1.0
2 6.0 b 1.0
3 2.0 a 0.0
4 4.0 a 0.0
5 5.0 a 0.0
6 3.0 c NaN
7 NaN d 2.0

See table below for a summary of the options for how=.

Many-to-many merges have well-defined, though not necessarily intuitive, behavior.

Here’s an example:

In [89]: df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a',

....: 'b'], 'data1': range(6)})

In [90]: df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],

....: 'data2': range(5)})

15 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19

In [91]: df1
Out[91]:
data1 key
0 0 b
1 1 b
2 2 a
3 3 c
4 4 a
5 5 b

In [92]: df2
Out[92]:
data2 key
0 0 a
1 1 b
2 2 a
3 3 b
4 4 d

In [93]: pd.merge(df1, df2, on='key', how='left')

Out[93]:
data1 key data2
0 0 b 1.0
1 0 b 3.0
2 1 b 1.0
3 1 b 3.0
4 2 a 0.0
5 2 a 2.0
6 3 c NaN
7 4 a 0.0
8 4 a 2.0
9 5 b 1.0
10 5 b 3.0

In [94]: pd.merge(df1, df2, how='inner')

Out[94]:
data1 key data2
0 0 b 1
1 0 b 3
2 1 b 1
3 1 b 3
4 5 b 1
5 5 b 3
6 2 a 0
7 2 a 2
8 4 a 0
9 4 a 2

16 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19

To merge with multiple keys, pass a list of column names:

In [95]: left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],

....: 'key2': ['one', 'two', 'one'],
....: 'lval': [1, 2, 3]})

In [96]: right = pd.DataFrame({'key1': ['foo', 'foo', 'bar',

....: 'bar'], 'key2': ['one', 'one', 'one', 'two'],
....: 'rval': [4, 5, 6, 7]})

In [97]: pd.merge(left, right, on=['key1', 'key2'], how='outer')

Out[97]:
key1 key2 lval rval
0 foo one 1.0 4.0
1 foo one 1.0 5.0
2 foo two 2.0 NaN
3 bar one 3.0 6.0
4 bar two NaN 7.0

A last issue to consider in merge operations is the treatment of overlapping column names.

In [98]: pd.merge(left, right, on='key1')

Out[98]:
key1 key2_x lval key2_y rval
0 foo one 1 one 4
1 foo one 1 one 5
2 foo two 2 one 4
3 foo two 2 one 5
4 bar one 3 one 6
5 bar one 3 two 7

4.2 Merging on Index

In some cases, the merge key(s) in a DataFrame will be found in its index. In this case, you
can pass left_index=True or right_index=True (or both) to indicate that the index
should be used as the merge key:

In [99]: left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b',

....: 'c'],'value': range(6)})

In [100]: right1 = pd.DataFrame({'group_val': [3.5, 7]},

....: index=['a', 'b'])

17 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19

In [101]: left1
Out[101]:
key value
0 a 0
1 b 1
2 a 2
3 a 3
4 b 4
5 c 5

In [102]: right1
Out[102]:
group_val
a 3.5
b 7.0

In [103]: pd.merge(left1, right1, left_on='key', right_index=True)

Out[103]:
key value group_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 7.0
4 b 4 7.0

Since the default merge method is to intersect the join keys, you can instead form the
union of them with an outer join:

In [104]: pd.merge(left1, right1, left_on='key', right_index=True,

....: how='outer')
Out[104]:
key value group_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 7.0
4 b 4 7.0
5 c 5 NaN

Using the indexes of both sides of the merge is also possible:

In [105]: left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],

....: index=['a', 'c', 'e'],
....: columns=['Ohio', 'Nevada'])

In [106]: right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.],

....: [13, 14]], index=['b', 'c', 'd', 'e'],
....: columns=['Missouri', 'Alabama'])

18 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19

In [107]: left2
Out[107]:
Ohio Nevada
a 1.0 2.0
c 3.0 4.0
e 5.0 6.0

In [108]: right2
Out[108]:
Missouri Alabama
b 7.0 8.0
c 9.0 10.0
d 11.0 12.0
e 13.0 14.0

In [109]: pd.merge(left2, right2, how='outer', left_index=True,

....: right_index=True)
Out[109]:
Ohio Nevada Missouri Alabama
a 1.0 2.0 NaN NaN
b NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0
d NaN NaN 11.0 12.0
e 5.0 6.0 13.0 14.0

DataFrame has a convenient .join instance for merging by index. It can also be used to
combine together many DataFrame objects having the same or similar indexes but non-
overlapping columns. In the prior example, we could have written:

In [110]: left2.join(right2, how='outer')

Out[110]:
Ohio Nevada Missouri Alabama
a 1.0 2.0 NaN NaN
b NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0
d NaN NaN 11.0 12.0
e 5.0 6.0 13.0 14.0

DataFrame’s join method preserving the left frame’s row index.

In [111]: left1.join(right1, on='key')

Out[111]:
key value group_val
0 a 0 3.5
1 b 1 7.0
2 a 2 3.5
3 a 3 3.5
4 b 4 7.0
5 c 5 NaN

19 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19

4.3 Concatenating Along an Axis

The .concat function in pandas provides a consistent way to address each of these
concerns. I’ll give a number of examples to illustrate how it works. Suppose we have
three Series with no index overlap:

In [112]: s1 = pd.Series([0, 1], index=['a', 'b'])

In [113]: s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])

In [114]: s3 = pd.Series([5, 6], index=['f', 'g'])

Calling .concat with these objects in a list glues together the values and indexes:

In [115]: pd.concat([s1, s2, s3])

Out[115]:
a 0
b 1
c 2
d 3
e 4
f 5
g 6
dtype: int64

By default .concat works along axis=0, producing another Series. If you pass axis=1,
the result will instead be a DataFrame (axis=1is the columns):

In [116]: pd.concat([s1, s2, s3], axis=1)

Out[116]:
0 1 2
a 0.0 NaN NaN
b 1.0 NaN NaN
c NaN 2.0 NaN
d NaN 3.0 NaN
e NaN 4.0 NaN
f NaN NaN 5.0
g NaN NaN 6.0

The same logic extends to DataFrame objects:

In [117]: df1 = pd.DataFrame(np.arange(6).reshape(3, 2),

....: index=['a', 'b', 'c'],
....: columns=['one', 'two'])

In [118]: df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2),

....: index=['a', 'c'],
....: columns=['three', 'four'])

20 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19

In [119]: df1
Out[119]:
one two
a 0 1
b 2 3
c 4 5

In [120]: df2
Out[120]:
three four
a 5 6
c 7 8

In [121]: pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])

Out[121]:
level1 level2
one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0

A last consideration concerns DataFrames in which the row index does not contain
any relevant data:

In [122]: df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a',

....: 'b', 'c', 'd'])

In [123]: df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b',

....: 'd', 'a'])

In [124]: df1
Out[124]:
a b c d
0 1.246435 1.007189 -1.296221 0.274992
1 0.228913 1.352917 0.886429 -2.001637
2 -0.371843 1.669025 -0.438570 -0.539741

In [125]: df2
Out[125]:
b d a
0 0.476985 3.248944 -1.021228
1 -0.577087 0.124121 0.302614

In this case, you can pass ignore_index=True:

21 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19

In [126]: pd.concat([df1, df2], ignore_index=True)

Out[126]:
a b c d
0 1.246435 1.007189 -1.296221 0.274992
1 0.228913 1.352917 0.886429 -2.001637
2 -0.371843 1.669025 -0.438570 -0.539741
3 -1.021228 0.476985 NaN 3.248944
4 0.302614 -0.577087 NaN 0.124121

22 | P a g e

DAP Module4 Notes
No ratings yet
DAP Module4 Notes
17 pages
Predictive Model - Linear and Logistic Models PDF
100% (1)
Predictive Model - Linear and Logistic Models PDF
108 pages
Autodesk Revit 2015-0-15 ReleaseNotes
No ratings yet
Autodesk Revit 2015-0-15 ReleaseNotes
10 pages
Salesforce
No ratings yet
Salesforce
8 pages
Mod 4
No ratings yet
Mod 4
63 pages
SPiiPlus C Library Reference Programmer's Guide
No ratings yet
SPiiPlus C Library Reference Programmer's Guide
436 pages
MySQL Crash Course: A Hands-on Introduction to Database Development
From Everand
MySQL Crash Course: A Hands-on Introduction to Database Development
Rick Silva
No ratings yet
Balck Book Meadical
No ratings yet
Balck Book Meadical
70 pages
OS Sheet4
No ratings yet
OS Sheet4
3 pages
016) Pandas - Batch 2 - Day 016
No ratings yet
016) Pandas - Batch 2 - Day 016
66 pages
MOD-3 Dap
No ratings yet
MOD-3 Dap
41 pages
On Data Handling Using Pandas-I
100% (2)
On Data Handling Using Pandas-I
63 pages
#4 UML Viva Q&A by AK-1 (E-Next - In) PDF
No ratings yet
#4 UML Viva Q&A by AK-1 (E-Next - In) PDF
20 pages
Bhavan Resume - Latest
No ratings yet
Bhavan Resume - Latest
3 pages
Unit 4.2
No ratings yet
Unit 4.2
24 pages
RM - Pandas - Importing Data
No ratings yet
RM - Pandas - Importing Data
15 pages
Pandas 1
No ratings yet
Pandas 1
64 pages
Python Presentation Assignment
No ratings yet
Python Presentation Assignment
5 pages
UNIT-4 Important Q-A
No ratings yet
UNIT-4 Important Q-A
28 pages
Employee Data Analysis System (Ip Class 12) (2024-25)
No ratings yet
Employee Data Analysis System (Ip Class 12) (2024-25)
30 pages
Pandas - Read Table (Filepath or Buffe
No ratings yet
Pandas - Read Table (Filepath or Buffe
7 pages
CH-6 Data Loading, Storage, and File Formats
No ratings yet
CH-6 Data Loading, Storage, and File Formats
163 pages
2
No ratings yet
2
5 pages
C# Practice Exercises 2
No ratings yet
C# Practice Exercises 2
2 pages
Introduction To Data Science Using Python Part2
No ratings yet
Introduction To Data Science Using Python Part2
45 pages
Data Frame
No ratings yet
Data Frame
95 pages
Class Xii Information Practices PPT On Data Handling Using Pandas-I
No ratings yet
Class Xii Information Practices PPT On Data Handling Using Pandas-I
64 pages
Chapter 4 - Import-Export Data
No ratings yet
Chapter 4 - Import-Export Data
30 pages
Data Handling Using Pandas-1
No ratings yet
Data Handling Using Pandas-1
60 pages
Linked List: - Linked List Is A Set of Nodes Where Each Node Has Two Fields Data and Next
No ratings yet
Linked List: - Linked List Is A Set of Nodes Where Each Node Has Two Fields Data and Next
39 pages
Week 2 Laboratory Activity
No ratings yet
Week 2 Laboratory Activity
7 pages
Vxworks 5.5 Migration Guide 6.6
No ratings yet
Vxworks 5.5 Migration Guide 6.6
101 pages
Software Architecture Quality Attributes
No ratings yet
Software Architecture Quality Attributes
10 pages
IICS Overview
No ratings yet
IICS Overview
6 pages
Pandas 1
No ratings yet
Pandas 1
89 pages
ABAP Statements Sap
No ratings yet
ABAP Statements Sap
21 pages
Python 3rd Unit Question and Answer
No ratings yet
Python 3rd Unit Question and Answer
25 pages
SET-2 Python Practical (3-5
No ratings yet
SET-2 Python Practical (3-5
4 pages
Pandas
No ratings yet
Pandas
57 pages
Postalcode Hashcode Age Gender Payment Method Rownumber Lasttransaction
No ratings yet
Postalcode Hashcode Age Gender Payment Method Rownumber Lasttransaction
44 pages
Experiment 678910
No ratings yet
Experiment 678910
12 pages
CSL 410 L16
No ratings yet
CSL 410 L16
22 pages
Data Frame Notes2
No ratings yet
Data Frame Notes2
4 pages
EMC Networker Nsradmin PDF
No ratings yet
EMC Networker Nsradmin PDF
60 pages
BTech 5 CSE Data Analytics Using Python Unit 4 Notes
No ratings yet
BTech 5 CSE Data Analytics Using Python Unit 4 Notes
25 pages
Python For Data Analysis (1) - 171-192
No ratings yet
Python For Data Analysis (1) - 171-192
24 pages
Roboy SW Documentation Template
No ratings yet
Roboy SW Documentation Template
33 pages
III Unit Fds
No ratings yet
III Unit Fds
24 pages
Experiment No 3 Importing and Exporting Data in Python Using Pandas Student
No ratings yet
Experiment No 3 Importing and Exporting Data in Python Using Pandas Student
6 pages
L32, 33 Pandas
No ratings yet
L32, 33 Pandas
7 pages
B161753 - Shaik Sameer - Oops - Week3 PDF
No ratings yet
B161753 - Shaik Sameer - Oops - Week3 PDF
7 pages
Rownumber Area Age Sex Payment Status Nodays Status
No ratings yet
Rownumber Area Age Sex Payment Status Nodays Status
22 pages
Rownumber Area Age Sex Payment Nodays Status
No ratings yet
Rownumber Area Age Sex Payment Nodays Status
22 pages
Descriptive and Predictive Analytics
0% (1)
Descriptive and Predictive Analytics
45 pages
Employee Data Analysis System (Ip Class Xii)
No ratings yet
Employee Data Analysis System (Ip Class Xii)
26 pages
VBA Programming Workshop - Tic Tac Toe: Mr. Chew Chun Yong, Ms. Choo Ley Ya
No ratings yet
VBA Programming Workshop - Tic Tac Toe: Mr. Chew Chun Yong, Ms. Choo Ley Ya
8 pages
Create Varchar Varchar Varchar Int
No ratings yet
Create Varchar Varchar Varchar Int
3 pages
Pizza
No ratings yet
Pizza
36 pages
Archit Key Principles
No ratings yet
Archit Key Principles
8 pages
Platform - Google Form
No ratings yet
Platform - Google Form
4 pages
Computer Science
No ratings yet
Computer Science
35 pages
3 Reasons Why You Are Seeing 2:22 - The Meaning of 222 ... : People Also Ask
No ratings yet
3 Reasons Why You Are Seeing 2:22 - The Meaning of 222 ... : People Also Ask
3 pages
DDD - Google Search
No ratings yet
DDD - Google Search
3 pages
Jupyter Notebook Commands
No ratings yet
Jupyter Notebook Commands
10 pages
Google Search
No ratings yet
Google Search
2 pages
GGG - Google Search
No ratings yet
GGG - Google Search
2 pages
908 - Google Search
No ratings yet
908 - Google Search
2 pages
Google Search
No ratings yet
Google Search
2 pages
Dwe - Google Search
No ratings yet
Dwe - Google Search
2 pages
Buying Decision
No ratings yet
Buying Decision
2 pages
Nonmandatory Appendix E Example Calculations: Record #19-943
No ratings yet
Nonmandatory Appendix E Example Calculations: Record #19-943
1 page
11.0 Introducción 11.1 Escritorio y Explorador de Archivos de Windows
No ratings yet
11.0 Introducción 11.1 Escritorio y Explorador de Archivos de Windows
3 pages
Pandas DataFrame Notes
No ratings yet
Pandas DataFrame Notes
10 pages
File Handling
No ratings yet
File Handling
6 pages
Cheat Sheet
No ratings yet
Cheat Sheet
10 pages
User Guide: Co-Curricular Fee Payment System
No ratings yet
User Guide: Co-Curricular Fee Payment System
11 pages
Exp - 1 - Introduction To Data Analytics and Python Fundamentals - SDK - Ok
No ratings yet
Exp - 1 - Introduction To Data Analytics and Python Fundamentals - SDK - Ok
9 pages
CSV File
No ratings yet
CSV File
30 pages
Arbaj Momin Viit
No ratings yet
Arbaj Momin Viit
1 page
CSV File Handling
No ratings yet
CSV File Handling
8 pages
Pandas 1705297450
No ratings yet
Pandas 1705297450
21 pages
Lab-3 Pandas Library
No ratings yet
Lab-3 Pandas Library
14 pages
Topic 4
No ratings yet
Topic 4
10 pages
MEME19403 Data Analytics and Visualization: Module Three: Visualization With Dashboard
No ratings yet
MEME19403 Data Analytics and Visualization: Module Three: Visualization With Dashboard
51 pages
Uncork
No ratings yet
Uncork
6 pages
單元操作PPT- chapter 20 PDF
No ratings yet
單元操作PPT- chapter 20 PDF
52 pages
CCCC - Google Search
No ratings yet
CCCC - Google Search
2 pages
Pandas 1
No ratings yet
Pandas 1
2 pages
04 Introduction To Python-1
No ratings yet
04 Introduction To Python-1
29 pages
Fds Unit - III
No ratings yet
Fds Unit - III
58 pages
Topic 1 T
No ratings yet
Topic 1 T
20 pages
Linux Installation Instructions
No ratings yet
Linux Installation Instructions
2 pages
Cheat Sheet: The Pandas Dataframe Object: Preliminaries Get Your Data Into A Dataframe
100% (1)
Cheat Sheet: The Pandas Dataframe Object: Preliminaries Get Your Data Into A Dataframe
10 pages
Cheat Sheet - Pandas
No ratings yet
Cheat Sheet - Pandas
12 pages
Pandas DataFrame Notes
100% (1)
Pandas DataFrame Notes
10 pages
Pandas DataFrame Notes
67% (3)
Pandas DataFrame Notes
13 pages
Advance Java Prgramming
No ratings yet
Advance Java Prgramming
5 pages
Revision Point - Dataframe
No ratings yet
Revision Point - Dataframe
11 pages
DWV Unit1
No ratings yet
DWV Unit1
102 pages
12 Information Practices Text Book Preeti Arora
No ratings yet
12 Information Practices Text Book Preeti Arora
45 pages
Pandas DataFrameObject
No ratings yet
Pandas DataFrameObject
4 pages
Facebook Video Downloader - Download Facebook Videos Online
No ratings yet
Facebook Video Downloader - Download Facebook Videos Online
2 pages
Pandas Tutorial 1: Pandas Basics (Reading Data Files, Dataframes, Data Selection)
No ratings yet
Pandas Tutorial 1: Pandas Basics (Reading Data Files, Dataframes, Data Selection)
15 pages
Docker Mastery - 5
No ratings yet
Docker Mastery - 5
87 pages
Continuous Reaction
100% (1)
Continuous Reaction
28 pages
Cyclone Excel Calculation - Google Search
100% (1)
Cyclone Excel Calculation - Google Search
2 pages
Importing Data Python Cheat Sheet PDF
No ratings yet
Importing Data Python Cheat Sheet PDF
1 page
Cheat Sheet: The Pandas Dataframe Object I: Preliminaries Get Your Data Into A Dataframe
No ratings yet
Cheat Sheet: The Pandas Dataframe Object I: Preliminaries Get Your Data Into A Dataframe
12 pages
IOS Developer
No ratings yet
IOS Developer
3 pages
Introduction and ETL
No ratings yet
Introduction and ETL
125 pages
Api 650
No ratings yet
Api 650
14 pages
Api 650
No ratings yet
Api 650
14 pages
Pandas DataFrame Notes
No ratings yet
Pandas DataFrame Notes
13 pages
Cheat Sheet: The Pandas Dataframe Object: Preliminaries Get Your Data Into A Dataframe
100% (1)
Cheat Sheet: The Pandas Dataframe Object: Preliminaries Get Your Data Into A Dataframe
12 pages
Google - Real Exams - Professional Cloud DevOps Engineer.v2021!03!27.by - Louie.20q
No ratings yet
Google - Real Exams - Professional Cloud DevOps Engineer.v2021!03!27.by - Louie.20q
11 pages
Learning Open Office: Calc & Base
From Everand
Learning Open Office: Calc & Base
Durgesh
No ratings yet
Real Time ABAP Interview Questions and
No ratings yet
Real Time ABAP Interview Questions and
10 pages
IGNOU BCA Introduction to Algorithm Design Previous Year Unsolved Papers BCS 042
From Everand
IGNOU BCA Introduction to Algorithm Design Previous Year Unsolved Papers BCS 042
Manish Soni
No ratings yet