
Pandas

Pandas
• Pandas contains data structures and data-manipulation tools designed to make data cleaning and analysis fast and easy
• Difference between a NumPy array and a Pandas Series:
• Pandas is designed for working with tabular or heterogeneous data.
• NumPy, by contrast, is best suited for working with homogeneous numerical array data.
Pandas

• Difference between a Python list, a Pandas Series, and a Pandas DataFrame:
• A Python list is a one-dimensional array of heterogeneous data
• A Pandas Series is also a one-dimensional array of heterogeneous data, along with labels (an index)
• A Pandas DataFrame is a spreadsheet-like data structure containing an ordered collection of columns
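• A minimal sketch of the three structures (the values and labels below are illustrative):
• import pandas as pd
• items = [10, 'hello', 3.5]                       # Python list: positional access only
• s = pd.Series([10, 'hello', 3.5], index=['x', 'y', 'z'])
• s['y']                                           # Series adds label-based access -> 'hello'
• df = pd.DataFrame({'a': [1, 5, 9], 'message': ['hello', 'world', 'foo']})
• df['message']                                    # each DataFrame column is itself a Series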
Accessing Data
• Reading and Writing Data in Text Format
• Reading Text Files in Pieces
• Writing Data to Text Format
• Working with Delimited Formats
• JSON Data
• XML and HTML: Web Scraping
• Binary Data Formats
• Using HDF5 Format
• Reading Microsoft Excel Files
• Interacting with Web APIs
• Interacting with Databases
Accessing Data
• pandas features a number of functions for reading tabular data as a DataFrame object
• Each function has a number of different options
• Some of the functions have grown very complex; e.g., read_csv has over 50 parameters at this point
• The online pandas documentation has many examples of how each of them works
• A summary of parsing functions in pandas is given on the next slide
Parsing functions in pandas
Function        Description
read_csv        Load delimited data from a file, URL, or file-like object; use comma as default delimiter
read_table      Load delimited data from a file, URL, or file-like object; use tab ('\t') as default delimiter
read_fwf        Read data in fixed-width column format (i.e., no delimiters)
read_clipboard  Version of read_table that reads data from the clipboard; useful for converting tables from web pages
read_excel      Read tabular data from an Excel XLS or XLSX file
read_hdf        Read HDF5 files written by pandas
read_html       Read all tables found in the given HTML document
read_json       Read data from a JSON (JavaScript Object Notation) string representation
read_msgpack    Read pandas data encoded using the MessagePack binary format
read_pickle     Read an arbitrary object stored in Python pickle format
read_sas        Read a SAS dataset stored in one of the SAS system's custom storage formats
read_sql        Read the results of a SQL query (using SQLAlchemy) as a pandas DataFrame
read_stata      Read a dataset from Stata file format
read_feather    Read the Feather binary file format
Accessing Data
• Reading and Writing Data in Text Format
examples/ex1.csv
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
• import pandas as pd
• df = pd.read_csv('examples/ex1.csv')
Out[10]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
Accessing Data
• Reading and Writing Data in Text Format
• pd.read_table('examples/ex1.csv', sep=',')
Out[10]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
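• sep can also be a regular expression for files with irregular delimiters; a sketch, assuming a whitespace-separated file examples/ex3.txt (not shown in these slides):
• result = pd.read_table('examples/ex3.txt', sep='\s+')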
Accessing Data
• Reading and Writing Data in Text Format
• A file will not always have a header row. You can use default column names
examples/ex2.csv
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
• pd.read_csv('examples/ex2.csv', header=None)
Out[13]:
  0  1  2  3  4
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
Accessing Data
• Reading and Writing Data in Text Format
• Or you can specify names yourself:
• pd.read_csv('examples/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])
Out[14]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
• names = ['a', 'b', 'c', 'd', 'message']
• pd.read_csv('examples/ex2.csv', names=names, index_col='message')
Out[16]:
         a   b   c   d
message
hello    1   2   3   4
world    5   6   7   8
foo      9  10  11  12
Accessing Data
• Reading and Writing Data in Text Format
• You can skip the first, third, and fourth rows of a file using the skiprows option
examples/ex4.csv
# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
• pd.read_csv('examples/ex4.csv', skiprows=[0, 2, 3])
Out[24]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
Accessing Data
• Reading and Writing Data in Text Format
• Handling missing values
examples/ex5.csv
something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo
• result = pd.read_csv('examples/ex5.csv')
Out[27]:
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo
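• pandas recognizes sentinels such as NA and empty fields as missing by default; the na_values option adds others (the sentinel values below are illustrative):
• pd.read_csv('examples/ex5.csv', na_values=['NULL'])
• # different sentinels can be specified per column with a dict
• sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
• pd.read_csv('examples/ex5.csv', na_values=sentinels)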
Accessing Data
• Reading and Writing Data in Text Format
• Reading Text Files in Pieces

• pd.options.display.max_rows = 10
• result = pd.read_csv('examples/ex6.csv')
Out[35]:
           one       two     three      four key
0     0.467976 -0.038649 -0.295344 -1.824726   L
1    -0.358893  1.404453  0.704965 -0.200638   B
2    -0.501840  0.659254 -0.421691 -0.057688   G
3     0.204886  1.074134  1.388361 -0.982404   R
4     0.354628 -0.133116  0.283763 -0.837063   Q
...        ...       ...       ...       ...  ..
9995  2.311896 -0.417070 -1.409599 -0.515821   L
9996 -0.479893 -0.650419  0.745152 -0.646038   E
9997  0.523331  0.787112  0.486066  1.093156   K
9998 -0.362559  0.598894 -1.843201  0.887292   G
9999 -0.096376 -1.012999 -0.657431 -0.573315   0
[10000 rows x 5 columns]
Accessing Data
• Reading and Writing Data in Text Format
• Reading Text Files in Pieces
• pd.read_csv('examples/ex6.csv', nrows=5)
Out[36]:
one two three four key
0 0.467976 -0.038649 -0.295344 -1.824726 L
1 -0.358893 1.404453 0.704965 -0.200638 B
2 -0.501840 0.659254 -0.421691 -0.057688 G
3 0.204886 1.074134 1.388361 -0.982404 R
4 0.354628 -0.133116 0.283763 -0.837063 Q
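• To actually read the file in pieces, pass chunksize; read_csv then returns an iterator of DataFrame chunks. A sketch that aggregates value counts of the key column across chunks:
• chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)
• tot = pd.Series([], dtype=float)
• for piece in chunker:
•     tot = tot.add(piece['key'].value_counts(), fill_value=0)
• tot = tot.sort_values(ascending=False)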
Accessing Data
• Reading and Writing Data in Text Format
• Writing Data to Text Format

• data = pd.read_csv('examples/ex5.csv')
Out[42]:
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo
• data.to_csv('examples/out.csv')
examples/out.csv
,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo
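• to_csv takes options controlling the output; a sketch (sys.stdout prints the text instead of writing a file):
• import sys
• data.to_csv(sys.stdout, sep='|')               # use another delimiter
• data.to_csv(sys.stdout, na_rep='NULL')         # write missing values as a sentinel
• data.to_csv(sys.stdout, index=False, columns=['a', 'b', 'c'])  # no index, subset of columns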
Accessing Data
• Reading and Writing Data in Text Format
• Working with Delimited Formats

examples/ex7.csv
"a","b","c"
"1","2","3"
"1","2","3"
• import csv
• f = open('examples/ex7.csv')
• reader = csv.reader(f)
• for line in reader:
• ....: print(line)
• ['a', 'b', 'c']
• ['1', '2', '3']
• ['1', '2', '3']
Accessing Data
• Reading and Writing Data in Text Format
• Working with Delimited Formats

• class my_dialect(csv.Dialect):
• lineterminator = '\n'
• delimiter = ';'
• quotechar = '"'
• quoting = csv.QUOTE_MINIMAL
• reader = csv.reader(f, dialect=my_dialect)
• with open('mydata.csv', 'w') as f:
• writer = csv.writer(f, dialect=my_dialect)
• writer.writerow(('one', 'two', 'three'))
• writer.writerow(('1', '2', '3'))
• writer.writerow(('4', '5', '6'))
• writer.writerow(('7', '8', '9'))
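• Dialect parameters can also be passed as individual keywords to csv.reader without defining a dialect class:
• reader = csv.reader(f, delimiter='|')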
Accessing Data
• Reading and Writing Data in Text Format
• JSON Data
• JSON (short for JavaScript Object Notation)
• Standard format for sending data by HTTP request between web browsers and other applications
• The json library is built into the Python standard library
• import json
• obj = """
• {"name": "Wes", "pet": null,
•  "places_lived": ["United States", "Spain", "Germany"],
•  "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
•               {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]}]}
• """
• result = json.loads(obj)
• Out[64]:
• {'name': 'Wes',
• 'pet': None,
• 'places_lived': ['United States', 'Spain', 'Germany'],
• 'siblings': [{'age': 30, 'name': 'Scott', 'pets': ['Zeus', 'Zuko']},
• {'age': 38, 'name': 'Katie', 'pets': ['Sixes', 'Stache', 'Cisco']}]}
Accessing Data
• Reading and Writing Data in Text Format
• JSON Data
• siblings = pd.DataFrame(result['siblings'], columns=['name', 'age'])
Out[67]:
name age
0 Scott 30
1 Katie 38
• json.dumps converts a Python object back to JSON:
• asjson = json.dumps(result)
Accessing Data
• Reading and Writing Data in Text Format
• JSON Data

• pandas.read_json can automatically convert JSON datasets in specific arrangements into a Series or
DataFrame.
examples/example.json
[{"a": 1, "b": 2, "c": 3},
{"a": 4, "b": 5, "c": 6},
{"a": 7, "b": 8, "c": 9}]
• data = pd.read_json('examples/example.json')
Out[70]:
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9
Accessing Data
• Reading and Writing Data in Text Format
• JSON Data

• To export data from pandas to JSON, use the to_json method


• print(data.to_json())
{"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}
Accessing Data
• Reading and Writing Data in Text Format
• XML and HTML: Web Scraping

• Python has libraries for reading and writing data in the ubiquitous HTML and XML formats, including lxml, Beautiful Soup, and html5lib

• pandas has a built-in function, read_html, which uses libraries like lxml and Beautiful Soup to automatically parse tables
out of HTML files as DataFrame objects.
• tables = pd.read_html('examples/fdic_failed_bank_list.html')
• In [74]: len(tables)
• Out[74]: 1
Accessing Data
• Reading and Writing Data in Text Format
• XML and HTML: Web Scraping

• In [75]: failures = tables[0]


• In [76]: failures.head()
Out[76]:
Bank Name City ST CERT \
0 Allied Bank Mulberry AR 91
1 The Woodbury Banking Company Woodbury GA 11297
2 First CornerStone Bank King of Prussia PA 35312
3 Trust Company Bank Memphis TN 9956
4 North Milwaukee State Bank Milwaukee WI 20364
Acquiring Institution Closing Date Updated Date
0 Today's Bank September 23, 2016 November 17, 2016
1 United Bank August 19, 2016 November 17, 2016
2 First-Citizens Bank & Trust Company May 6, 2016 September 6, 2016
3 The Bank of Fayette County April 29, 2016 September 6, 2016
4 First-Citizens Bank & Trust Company March 11, 2016 June 16, 2016
Accessing Data
• Reading and Writing Data in Text Format
• XML and HTML: Web Scraping

• In [77]: close_timestamps = pd.to_datetime(failures['Closing Date'])


• In [78]: close_timestamps.dt.year.value_counts()
Out[78]:
2010 157
2009 140
2011 92
2012 51
2008 25
...
2004 4
2001 4
2007 3
2003 3
2000 2
Name: Closing Date, Length: 15, dtype: int64
Accessing Data
• Reading and Writing Data in Text Format
• XML and HTML: Web Scraping
• Parsing XML with lxml.objectify

• Using lxml.objectify, we parse the file and get a reference to the root node of the XML file with getroot:
• from lxml import objectify
• path = 'examples/mta_perf/Performance_MNR.xml'
• parsed = objectify.parse(open(path))
• root = parsed.getroot()
• root.INDICATOR returns a generator yielding each <INDICATOR> XML element. For each record, we can populate a dict of tag names (like YTD_ACTUAL) to data values (excluding a few tags):
• data = []
• skip_fields = ['PARENT_SEQ', 'INDICATOR_SEQ', 'DESIRED_CHANGE', 'DECIMAL_PLACES']
• for elt in root.INDICATOR:
• el_data = {}
• for child in elt.getchildren():
• if child.tag in skip_fields:
• continue
• el_data[child.tag] = child.pyval
• data.append(el_data)
Accessing Data
• Reading and Writing Data in Text Format
• XML and HTML: Web Scraping
• Parsing XML with lxml.objectify

Lastly, convert this list of dicts into a DataFrame:


• In [81]: perf = pd.DataFrame(data)
• In [82]: perf.head()
• Out[82]:
• Empty DataFrame
• Columns: []
• Index: []
• XML data can get much more complicated than this example. Each tag can have metadata, too. Consider an HTML link tag, which is also valid XML:
• from io import StringIO
• tag = '<a href="http://www.google.com">Google</a>'
• root = objectify.parse(StringIO(tag)).getroot()
• You can now access any of the fields (like href) in the tag or the link text:
• In [84]: root
• Out[84]: <Element a at 0x7f6b15817748>
• In [85]: root.get('href')
• Out[85]: 'http://www.google.com'
• In [86]: root.text
• Out[86]: 'Google'
Accessing Data
• Binary Data Formats
• One of the easiest ways to store data in binary format is Python’s built-in pickle serialization
• The to_pickle method writes the data to disk in pickle format:
• frame = pd.read_csv('examples/ex1.csv')
Out[88]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
• frame.to_pickle('examples/frame_pickle')
Accessing Data
• Binary Data Formats
• Any pickled pandas object can be read back with pandas.read_pickle:
• pd.read_pickle('examples/frame_pickle')
Out[90]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
Accessing Data
• Binary Data Formats
• Using HDF5 Format
• for storing large quantities of scientific array data
• Each HDF5 file can store multiple datasets and supporting metadata
• HDF5 supports on-the-fly compression
• HDF5 can be a good choice for working with very large datasets that don’t fit
into memory, as you can efficiently read and write small sections of much
larger arrays
• pandas provides a high-level interface that simplifies storing Series and
DataFrame objects
Accessing Data
• Using HDF5 Format
• import numpy as np
• frame = pd.DataFrame({'a': np.random.randn(100)})
• store = pd.HDFStore('mydata.h5')
• store['obj1'] = frame
• store['obj1_col'] = frame['a']
• Objects contained in the HDF5 file can then be retrieved with the
same dict-like API:
• store['obj1']
Accessing Data
• Using HDF5 Format
• Out[97]:
• a
• 0 -0.204708
• 1 0.478943
• 2 -0.519439
• 3 -0.555730
• 4 1.965781
• .. ...
• 95 0.795253
• 96 0.118110
• 97 -0.748532
• 98 0.584970
• 99 0.152677
• [100 rows x 1 columns]
Accessing Data
• Using HDF5 Format
• HDFStore supports two storage schemas, 'fixed' and 'table'. The latter is generally slower, but it supports query operations
using a special syntax:
• store.put('obj2', frame, format='table')
• store.select('obj2', where=['index >= 10 and index <= 15'])
• Out[99]:
a
10 1.007189
11 -1.296221
12 0.274992
13 0.228913
14 1.352917
15 0.886429
In [100]: store.close()
• put is an explicit version of the store['obj2'] = frame assignment, but allows us to set other options like the storage format.
Accessing Data
• Using HDF5 Format
• The pandas.read_hdf function gives you a shortcut to these tools:
• frame.to_hdf('mydata.h5', 'obj3', format='table')
• pd.read_hdf('mydata.h5', 'obj3', where=['index < 5'])
• Out[102]:
a
0 -0.204708
1 0.478943
2 -0.519439
3 -0.555730
4 1.965781
Accessing Data
• Reading Microsoft Excel Files
• pandas supports reading tabular data stored in Excel 2003 (and higher) files using either
the ExcelFile class or pandas.read_excel function.
• xlsx = pd.ExcelFile('examples/ex1.xlsx')
• Data stored in a sheet can then be read into a DataFrame with pandas.read_excel:
• pd.read_excel(xlsx, 'Sheet1')
Out[105]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
Accessing Data
• Reading Microsoft Excel Files
• you can also simply pass the filename to pandas.read_excel:
• frame = pd.read_excel('examples/ex1.xlsx', 'Sheet1')
Out[107]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
Accessing Data
• Writing Microsoft Excel Files
• To write pandas data to Excel format, you must first create an
ExcelWriter, then write data to it using pandas objects’ to_excel
method:
• writer = pd.ExcelWriter('examples/ex2.xlsx')
• frame.to_excel(writer, 'Sheet1')
• writer.save()
• You can also pass a file path to to_excel and avoid the ExcelWriter:
• frame.to_excel('examples/ex2.xlsx')
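• In recent pandas versions, ExcelWriter can also be used as a context manager, which saves the file on exit (a sketch, assuming an Excel engine such as openpyxl is installed):
• with pd.ExcelWriter('examples/ex2.xlsx') as writer:
•     frame.to_excel(writer, sheet_name='Sheet1')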
Accessing Data
• Interacting with Web APIs
• Many websites have public APIs providing data feeds
• An easy-to-use way to access them from Python is the requests package
• import requests
• url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
• resp = requests.get(url)
• data = resp.json()
• data[0]['title']
'Period does not round down for frequencies less that 1 hour'
• We can pass data directly to DataFrame and extract fields of interest:
Accessing Data
• Interacting with Web APIs
• issues = pd.DataFrame(data, columns=['number', 'title', 'labels', 'state'])

• issues
Out[120]:
    number                                              title \
0    17666  Period does not round down for frequencies les...
1    17665          DOC: improve docstring of function where
2    17664               COMPAT: skip 32-bit test on int repr
3    17662                          implement Delegator class
4    17654  BUG: Fix series rename called with str alterin...
..     ...                                                ...
25   17603  BUG: Correctly localize naive datetime strings...
26   17599                     core.dtypes.generic --> cython
27   17596   Merge cdate_range functionality into bdate_range
28   17587  Time Grouper bug fix when applied for list gro...
29   17583  BUG: fix tz-aware DatetimeIndex + TimedeltaInd...
                                               labels state
0                                                  []  open
1   [{'id': 134699, 'url': 'https://api.github.com...  open
...
[30 rows x 4 columns]
Accessing Data
• Interacting with Databases
• SQL-based relational databases (such as SQL Server, PostgreSQL, and MySQL) are in wide use
• Loading data from SQL into a DataFrame is fairly straightforward
• import sqlite3
• con = sqlite3.connect('mydata.sqlite')
• query = """ CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d
INTEGER);"""
• con.execute(query)
• con.commit()
Accessing Data
• Interacting with Databases
• Then, insert a few rows of data:
• data = [('Atlanta', 'Georgia', 1.25, 6), ('Tallahassee', 'Florida', 2.6, 3), ('Sacramento', 'California',
1.7, 5)]
• stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"
• con.executemany(stmt, data)
• con.commit()
• cursor = con.execute('select * from test')
• rows = cursor.fetchall()
Out[132]:
[('Atlanta', 'Georgia', 1.25, 6),
('Tallahassee', 'Florida', 2.6, 3),
('Sacramento', 'California', 1.7, 5)]
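• Column names are available in cursor.description, so the rows can be labeled when building a DataFrame:
• pd.DataFrame(rows, columns=[x[0] for x in cursor.description])
             a           b     c  d
0      Atlanta     Georgia  1.25  6
1  Tallahassee     Florida  2.60  3
2   Sacramento  California  1.70  5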
Accessing Data
• Interacting with Databases
• SQLAlchemy project is a popular Python SQL toolkit that makes it easy to connect with databases
• import sqlalchemy as sqla
• db = sqla.create_engine('sqlite:///mydata.sqlite')
• pd.read_sql('select * from test', db)
Out[137]:
             a           b     c  d
0      Atlanta     Georgia  1.25  6
1  Tallahassee     Florida  2.60  3
2   Sacramento  California  1.70  5
