
Reading json directly into pandas

Andy Hayden
12 Jun 2013
http://hayd.github.io/2013/pandas-json/

New to the pandas 0.12 release is a read_json function (which uses the speedy ujson under the hood).

It's as easy as whacking in the path/url/string of a valid json:

In [1]: df = pd.read_json('https://api.github.com/repos/pydata/pandas/issues?per_page=5')

Let's inspect a few columns to see how we've done:


In [2]: df[['created_at', 'title', 'body', 'comments']]
Out[2]:
           created_at                                   title                                  body comments
0 2013-06-12 02:54:37          DOC add to_datetime to api.rst      Either I'm being thick or `to_da...      ...
1 2013-06-12 01:16:19             ci/after_script.sh missing?                      https://travis...      ...
2 2013-06-11 23:07:52        ENH Prefer requests over urllib2                   At the moment we...      ...
3 2013-06-11 21:12:45           Nothing in docs about io.data      There's nothing on the docs abou...      ...
4 2013-06-11 19:50:17  DOC: Clarify quote behavior parameters      I've been bit many times recentl...      ...

The convert_dates argument has a good crack at parsing any columns which look like they're dates, and
it's worked in this example (converting created_at to Timestamps). It looks carefully at the datatype and at
column names (you can also pass a column name explicitly to ensure it gets converted) to choose
which to parse.
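A minimal sketch of forcing that conversion explicitly (the tiny payload and the `opened` column name here are invented for illustration; by default only date-looking names like `created_at` get parsed):

```python
import pandas as pd
from io import StringIO

# "opened" doesn't look like a date column by name, so read_json won't
# convert it on its own -- naming it in convert_dates forces the parse.
payload = '{"opened":{"0":1370695148000,"1":1370665875000},"title":{"0":"a","1":"b"}}'
df = pd.read_json(StringIO(payload), convert_dates=['opened'])
print(df['opened'].dtype)  # datetime64[ns]
```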
After you've done some analysis in your favourite data analysis library, the corresponding to_json lets
you export the results to valid json.
In [4]: res = df[['created_at', 'title', 'body', 'comments']].head()
In [5]: res.to_json()
Out[5]: '{"created_at":{"0":1370695148000000000,"1":1370665875000000000,"2":1370656273000000000
Here, orient decides how we should layout the data:
orient : {'split', 'records', 'index', 'columns', 'values'},
default is 'index' for Series, 'columns' for DataFrame
The format of the JSON string
split : dict like
{index -> [index], columns -> [columns], data -> [values]}
records : list like [{column -> value}, ... , {column -> value}]
index : dict like {index -> {column -> value}}
columns : dict like {column -> {index -> value}}
values : just the values array
For example (note times have been exported as epoch, but we could have used ISO via date_format='iso'):
In [6]: res.to_json(orient='records')

Out[6]: '[{"created_at":1370695148000000000,"title":"CLN: refactored url accessing and filepath
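To see the orients side by side, here's a toy frame (invented for illustration) run through a few of them:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

print(df.to_json(orient='columns'))  # {"a":{"0":1,"1":2},"b":{"0":3,"1":4}} (the DataFrame default)
print(df.to_json(orient='records'))  # [{"a":1,"b":3},{"a":2,"b":4}]
print(df.to_json(orient='values'))   # [[1,3],[2,4]]
print(df.to_json(orient='split'))    # {"columns":["a","b"],"index":[0,1],"data":[[1,3],[2,4]]}
```

'records' and 'values' drop the index entirely, so pick 'split' or 'columns' if you need to reconstruct the frame exactly.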


Note, our times have been converted to unix timestamps (which also means we'd need to use the same
pd.to_datetime trick when we read_json it back in). Also NaNs, NaTs and Nones will be converted to JSON's
null.
And save it to a file:
In [7]: res.to_json(file_name)
Useful.
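A sketch of that round trip, with date_format='iso' as the alternative to epoch (the column names are made up):

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({'when': pd.to_datetime(['2013-06-12']), 'x': [None]})

# The epoch default writes unix timestamps and turns None into null;
# date_format='iso' writes ISO 8601 strings instead.
iso = df.to_json(date_format='iso')
print(iso)

# Reading back in: name the column so it gets parsed to Timestamps again.
back = pd.read_json(StringIO(iso), convert_dates=['when'])
print(back['when'].dtype)
```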
Warning: read_json requires valid JSON, so doing something like the following will raise an exception:
In [8]: pd.read_json("{'0':{'0':1,'1':3},'1':{'0':2,'1':4}}")
# ValueError, since this isn't valid JSON (single quotes)
In [9]: pd.read_json('{"0":{"0":1,"1":3},"1":{"0":2,"1":4}}')
Out[9]:
   0  1
0  1  2
1  3  4
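So if your input isn't guaranteed clean, guard the parse (this hypothetical payload uses single quotes, which JSON forbids):

```python
import pandas as pd
from io import StringIO

bad = "{'0': {'0': 1, '1': 3}}"  # single quotes: not valid JSON
try:
    pd.read_json(StringIO(bad))
except ValueError as exc:
    # the parser rejects the string before a frame is ever built
    print('rejected:', exc)
```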

As a further example, here I can get all the issues from github (there's a limit of 100 per request); this
is how easy it is to extract data in pandas:
In [10]: dfs = {}
         page = 1
         df = pd.read_json('https://api.github.com/repos/pydata/pandas/issues?page=%d' % page)
         while len(df):
             dfs[page] = df
             page += 1
             df = pd.read_json('https://api.github.com/repos/pydata/pandas/issues?page=%d' % page)
In [11]: dfs.keys()  # 7 requests come back with issues
Out[11]: [1, 2, 3, 4, 5, 6, 7]
In [12]: df = pd.concat(dfs, ignore_index=True).set_index('number')
In [13]: df
Out[13]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 613 entries, 3813 to 39
Data columns (total 18 columns):
assignee         27   non-null values
body             613  non-null values
closed_at        ...  non-null values
comments         613  non-null values
comments_url     613  non-null values
created_at       613  non-null values
events_url       613  non-null values
html_url         613  non-null values
id               613  non-null values
labels           613  non-null values
labels_url       613  non-null values
milestone        586  non-null values
pull_request     613  non-null values
state            613  non-null values
title            613  non-null values
updated_at       613  non-null values
url              613  non-null values
user             613  non-null values
dtypes: datetime64[ns](1), int64(2), object(15)


In [14]: df.comments.describe()
Out[14]:
count    613.000000
mean       3.590538
std        9.641128
min        0.000000
25%        0.000000
50%        1.000000
75%        4.000000
max      185.000000
dtype: float64
It deals with moderately sized files fairly efficiently; here's a 200MB file (this is on my 2009 macbook
air, I'd expect times to be faster on better hardware, e.g. an SSD):
In [15]: %time pd.read_json('citylots.json')
CPU times: user 4.78 s, sys: 684 ms, total: 5.46 s
Wall time: 5.89 s
Thanks to wesm, jreback and Komnomnomnom for putting it together.