Pandas
• Pandas contains data structures and data
manipulation tools designed to make data cleaning
and analysis fast and easy
• Difference between NumPy arrays and pandas data
structures
• Pandas is designed for working with tabular or heterogeneous
data.
• NumPy, by contrast, is best suited for working with homogeneous
numerical array data.
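A minimal sketch of that difference: NumPy coerces mixed values to one dtype, while a pandas DataFrame keeps a separate dtype per labeled column (the column names below are illustrative).

```python
import numpy as np
import pandas as pd

# A NumPy array is homogeneous: mixing ints and floats upcasts everything.
arr = np.array([1, 2.5, 3])
print(arr.dtype)  # float64

# A pandas DataFrame holds heterogeneous, labeled columns side by side.
df = pd.DataFrame({'count': [1, 2, 3], 'label': ['a', 'b', 'c']})
print(df.dtypes)  # count is integer, label is object (string)
```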
Pandas
read_csv Load delimited data from a file, URL, or file-like object; use comma as default delimiter
read_table Load delimited data from a file, URL, or file-like object; use tab ('\t') as default delimiter
read_fwf Read data in fixed-width column format (i.e., no delimiters)
read_clipboard Version of read_table that reads data from the clipboard; useful for converting tables from web
pages
read_excel Read tabular data from an Excel XLS or XLSX file
read_hdf Read HDF5 files written by pandas
read_html Read all tables found in the given HTML document
read_json Read data from a JSON (JavaScript Object Notation) string representation
read_msgpack Read pandas data encoded using the MessagePack binary format
read_pickle Read an arbitrary object stored in Python pickle format
read_sas Read a SAS dataset stored in one of the SAS system’s custom storage formats
read_sql Read the results of a SQL query (using SQLAlchemy) as a pandas DataFrame
read_stata Read a dataset from Stata file format
read_feather Read the Feather binary file format
Accessing Data
• Reading and Writing Data in Text Format
examples/ex1.csv
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
• df = pd.read_csv('examples/ex1.csv')
Out[10]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
Accessing Data
• Reading and Writing Data in Text Format
• pd.read_table('examples/ex1.csv', sep=',')
Out[10]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
Accessing Data
• Reading and Writing Data in Text Format
• A file will not always have a header row; pandas can assign default column names
examples/ex2.csv
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
• pd.read_csv('examples/ex2.csv', header=None)
Out[13]:
0 1 2 3 4
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
Accessing Data
• Reading and Writing Data in Text Format
• Or you can specify names yourself:
• pd.read_csv('examples/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])
Out[14]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
• names = ['a', 'b', 'c', 'd', 'message']
• pd.read_csv('examples/ex2.csv', names=names, index_col='message')
Out[16]:
a b c d
message
hello 1 2 3 4
world 5 6 7 8
foo 9 10 11 12
Accessing Data
• Reading and Writing Data in Text Format
• You can skip the first, third, and fourth rows of the file using the skiprows option
examples/ex4.csv
# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
• pd.read_csv('examples/ex4.csv', skiprows=[0, 2, 3])
Out[24]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
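Since the skipped lines here all begin with '#', the same result can also be had with read_csv's comment parameter. A sketch, with the ex4.csv content inlined so it runs on its own:

```python
import io
import pandas as pd

# Same content as examples/ex4.csv, inlined for a self-contained example.
raw = """# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
"""

# comment='#' drops every line starting with '#',
# equivalent here to skiprows=[0, 2, 3].
df = pd.read_csv(io.StringIO(raw), comment='#')
print(df)
```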
Accessing Data
• Reading and Writing Data in Text Format
• Handling missing values
examples/ex5.csv
something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo
• result = pd.read_csv('examples/ex5.csv')
Out[27]:
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo
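Both the NA sentinel and the empty field above became NaN. The na_values option lets you declare extra sentinels, per column if needed; a sketch with the ex5.csv content inlined:

```python
import io
import pandas as pd

raw = """something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo
"""

result = pd.read_csv(io.StringIO(raw))
print(result.isnull().sum())  # 'NA' and the empty field both read as NaN

# A dict of na_values adds sentinels per column,
# e.g. also treat 'foo' in the message column as missing:
result2 = pd.read_csv(io.StringIO(raw),
                      na_values={'message': ['foo', 'NA']})
print(result2)
```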
Accessing Data
• Reading and Writing Data in Text Format
• Reading Text Files in Pieces
• pd.options.display.max_rows = 10
• result = pd.read_csv('examples/ex6.csv')
• Writing Data to Text Format
• data = pd.read_csv('examples/ex5.csv')
Out[42]:
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo
• data.to_csv('examples/out.csv')
examples/out.csv
,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo
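The "reading in pieces" idea above can be sketched with read_csv's chunksize option, which returns an iterator of DataFrames instead of one big frame. Since examples/ex6.csv isn't reproduced here, the sketch uses small inlined data:

```python
import io
import pandas as pd

# Ten rows with a repeating key column, standing in for a large file.
raw = "key,value\n" + "".join(f"{i % 3},{i}\n" for i in range(10))

# chunksize=4 yields the data in pieces of up to four rows each.
chunker = pd.read_csv(io.StringIO(raw), chunksize=4)

tot = pd.Series(dtype=float)
for piece in chunker:
    # Aggregate piece by piece, e.g. count occurrences of each key.
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
print(tot)
```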
Accessing Data
• Reading and Writing Data in Text Format
• Working with Delimited Formats
examples/ex7.csv
"a","b","c"
"1","2","3"
"1","2","3"
• import csv
• f = open('examples/ex7.csv')
• reader = csv.reader(f)
• for line in reader:
•     print(line)
• ['a', 'b', 'c']
• ['1', '2', '3']
• ['1', '2', '3']
Accessing Data
• Reading and Writing Data in Text Format
• Working with Delimited Formats
• class my_dialect(csv.Dialect):
•     lineterminator = '\n'
•     delimiter = ';'
•     quotechar = '"'
•     quoting = csv.QUOTE_MINIMAL
• reader = csv.reader(f, dialect=my_dialect)
• with open('mydata.csv', 'w') as f:
•     writer = csv.writer(f, dialect=my_dialect)
•     writer.writerow(('one', 'two', 'three'))
•     writer.writerow(('1', '2', '3'))
•     writer.writerow(('4', '5', '6'))
•     writer.writerow(('7', '8', '9'))
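The same dialect can be handed back to csv.reader to parse the semicolon-delimited file it wrote. A self-contained sketch of that round trip (the temporary path is illustrative):

```python
import csv
import os
import tempfile

class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL

path = os.path.join(tempfile.mkdtemp(), 'mydata.csv')

# Write two rows using the custom dialect (semicolon-delimited).
with open(path, 'w') as f:
    writer = csv.writer(f, dialect=my_dialect)
    writer.writerow(('one', 'two', 'three'))
    writer.writerow(('1', '2', '3'))

# Reading with the same dialect recovers the original fields.
with open(path) as f:
    rows = list(csv.reader(f, dialect=my_dialect))
print(rows)
```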
Accessing Data
• Reading and Writing Data in Text Format
• JSON Data
• JSON (short for JavaScript Object Notation)
• Standard format for sending data by HTTP request between web browsers and other applications
• The json library is built into the Python standard library
• import json
• result = json.loads(obj)   # obj holds a JSON string
• Out[64]:
• {'name': 'Wes',
• 'pet': None,
• 'places_lived': ['United States', 'Spain', 'Germany'],
• 'siblings': [{'age': 30, 'name': 'Scott', 'pets': ['Zeus', 'Zuko']},
• {'age': 38, 'name': 'Katie', 'pets': ['Sixes', 'Stache', 'Cisco']}]}
Accessing Data
• Reading and Writing Data in Text Format
• JSON Data
• siblings = pd.DataFrame(result['siblings'], columns=['name', 'age'])
Out[67]:
name age
0 Scott 30
1 Katie 38
• json.dumps converts a Python object back to JSON:
• asjson = json.dumps(result)
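A self-contained round trip: the obj used above is a JSON string, so a small stand-in is defined here to make the loads/dumps pair runnable.

```python
import json

# Stand-in for the JSON string obj from the example above.
obj = '{"name": "Wes", "pet": null, "places_lived": ["United States", "Spain"]}'

result = json.loads(obj)     # JSON string -> Python dict (null becomes None)
asjson = json.dumps(result)  # Python dict -> JSON string
print(result['places_lived'])
```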
Accessing Data
• Reading and Writing Data in Text Format
• JSON Data
• pandas.read_json can automatically convert JSON datasets in specific arrangements into a Series or
DataFrame.
examples/example.json
[{"a": 1, "b": 2, "c": 3},
{"a": 4, "b": 5, "c": 6},
{"a": 7, "b": 8, "c": 9}]
• data = pd.read_json('examples/example.json')
Out[70]:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
Accessing Data
• Reading and Writing Data in Text Format
• XML and HTML: Web Scraping
• Python has several libraries for reading and writing data in the ubiquitous HTML and XML formats:
• lxml, Beautiful Soup, and html5lib
• pandas has a built-in function, read_html, which uses libraries like lxml and Beautiful Soup to automatically parse tables
out of HTML files as DataFrame objects.
• tables = pd.read_html('examples/fdic_failed_bank_list.html')
• In [74]: len(tables)
• Out[74]: 1
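A self-contained sketch of read_html: the fdic_failed_bank_list.html file isn't reproduced here, so a tiny inline table stands in (read_html needs lxml or Beautiful Soup installed, and the column names below are illustrative).

```python
import io
import pandas as pd

html = """
<table>
  <tr><th>Bank Name</th><th>City</th></tr>
  <tr><td>First Example Bank</td><td>Springfield</td></tr>
  <tr><td>Second Example Bank</td><td>Shelbyville</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> it finds.
tables = pd.read_html(io.StringIO(html))
failures = tables[0]
print(len(tables), failures.shape)
```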
Accessing Data
• Reading and Writing Data in Text Format
• XML and HTML: Web Scraping
• Using lxml.objectify, we parse the file and get a reference to the root node of the XML file with getroot:
• from lxml import objectify
• path = 'examples/mta_perf/Performance_MNR.xml'
• parsed = objectify.parse(open(path))
• root = parsed.getroot()
root.INDICATOR returns a generator yielding each <INDICATOR> XML element. For each record, we can populate a dict of tag names (like YTD_ACTUAL) to data values
(excluding a few tags):
• data = []
• skip_fields = ['PARENT_SEQ', 'INDICATOR_SEQ', 'DESIRED_CHANGE', 'DECIMAL_PLACES']
• for elt in root.INDICATOR:
• el_data = {}
• for child in elt.getchildren():
• if child.tag in skip_fields:
• continue
• el_data[child.tag] = child.pyval
• data.append(el_data)
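The list of dicts built above converts directly into a DataFrame. Since Performance_MNR.xml isn't reproduced here, the sketch uses a small stand-in list whose keys are illustrative:

```python
import pandas as pd

# Stand-in for the `data` list built from the <INDICATOR> records.
data = [
    {'AGENCY_NAME': 'Metro-North Railroad', 'PERIOD_YEAR': 2008, 'YTD_ACTUAL': 96.9},
    {'AGENCY_NAME': 'Metro-North Railroad', 'PERIOD_YEAR': 2009, 'YTD_ACTUAL': 96.0},
]

# Each dict becomes a row; keys become column names.
perf = pd.DataFrame(data)
print(perf.head())
```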
Accessing Data
• Reading and Writing Data in Text Format
• XML and HTML: Web Scraping
• Parsing XML with lxml.objectify
• issues
Out[120]:
    number                                              title  \
..     ...                                                ...
28   17587  Time Grouper bug fix when applied for list gro...

   labels state
0      []  open
..    ...   ...