All Pandas Json - Normalize
All Pandas json_normalize() you should know for flattening JSON (Image by Author using canva.com)
https://towardsdatascience.com/all-pandas-json-normalize-you-should-know-for-flattening-json-13eae1dfb7dd 1/17
5/11/2021 All Pandas json_normalize() you should know for flattening JSON | by B. Chen | Towards Data Science
Reading data is the first step in any data science project. As a machine learning
practitioner or a data scientist, you would have surely come across JSON (JavaScript
Object Notation) data. JSON is a widely used format for storing and exchanging data.
For example, NoSQL databases like MongoDB store data in a JSON-like format, and REST
API responses are mostly delivered as JSON.
Although this format works well for storing and exchanging data, it needs to be
converted into a tabular form for further analysis. You are likely to deal with 2 types of
JSON structure, a JSON object or a list of JSON objects. In internal Python lingo, you are
most likely to deal with a dict or a list of dicts.
In this article, you’ll learn how to use Pandas’s built-in function json_normalize() to
flatten those 2 types of JSON into Pandas DataFrames. This article is structured as
follows:
import pandas as pd
a_dict = {
    'school': 'ABC primary school',
    'location': 'London',
    'ranking': 2,
}
df = pd.json_normalize(a_dict)
(image by author)
The result looks great. Let’s take a look at the data types with df.info(). We can see that the numerical column ranking is cast to a numeric type (int64):
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 school 1 non-null object
1 location 1 non-null object
2 ranking 1 non-null int64
dtypes: int64(1), object(2)
memory usage: 152.0+ bytes
json_list = [
{ 'class': 'Year 1', 'student number': 20, 'room': 'Yellow' },
{ 'class': 'Year 2', 'student number': 25, 'room': 'Blue' },
]
pd.json_normalize(json_list)
(image by author)
The result looks great. The json_normalize() function converts each record in the
list into a row of the table.
What about keys that are not always present? For example, num_of_students is not
available in the 2nd record below.
json_list = [
{ 'class': 'Year 1', 'num_of_students': 20, 'room': 'Yellow' },
{ 'class': 'Year 2', 'room': 'Blue' }, # no num_of_students
]
pd.json_normalize(json_list)
(image by author)
We can see that no error is thrown and the missing keys are shown as NaN.
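As a minimal sketch of the behaviour described above, the snippet below rebuilds the same two records and checks where the NaN lands. The variable names records, missing and dtype are illustrative only and do not come from the original notebook.

```python
import pandas as pd

# Same shape as the example above: the second record lacks num_of_students.
records = [
    {'class': 'Year 1', 'num_of_students': 20, 'room': 'Yellow'},
    {'class': 'Year 2', 'room': 'Blue'},
]

df = pd.json_normalize(records)

# True marks rows where the key was absent in the source record.
missing = df['num_of_students'].isna().tolist()

# Because NaN is a float, the column is no longer an integer column.
dtype = str(df['num_of_students'].dtype)

print(missing)
print(dtype)
```

One side effect worth knowing: a column that would otherwise hold integers becomes a float column once a NaN is introduced.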
json_obj = {
'school': 'ABC primary school',
'location': 'London',
'ranking': 2,
'info': {
'president': 'John Kasich',
'contacts': {
'email': {
'admission': '[email protected]',
'general': '[email protected]'
},
'tel': '123456789',
}
}
}
pd.json_normalize(json_obj)
The result looks great. All nested values are flattened and converted into separate
columns.
If you don’t want to dig all the way down to each value, use the max_level argument.
With the argument max_level=1, we can see that our nested value contacts is put into a single column, info.contacts:
pd.json_normalize(json_obj, max_level=1)
(image by author)
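To make the effect of max_level concrete, here is a small sketch that compares the column names produced at different depths. The dict nested is a trimmed, hypothetical variant of the object above (the email value is a placeholder), not the author's exact data.

```python
import pandas as pd

# Hypothetical, trimmed-down version of the nested school object.
nested = {
    'school': 'ABC primary school',
    'info': {
        'president': 'John Kasich',
        'contacts': {
            'email': {'general': 'placeholder-address'},
            'tel': '123456789',
        },
    },
}

full = pd.json_normalize(nested)                 # flatten every level
level1 = pd.json_normalize(nested, max_level=1)  # stop after one level
level0 = pd.json_normalize(nested, max_level=0)  # do not flatten at all

print(sorted(full.columns))
print(sorted(level1.columns))
print(sorted(level0.columns))

# With max_level=1 the contacts value stays a plain Python dict in one cell.
contacts_cell = level1.loc[0, 'info.contacts']
```

Anything below the cut-off depth is kept as a dict inside a single cell, so it can still be normalized later if needed.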
json_list = [
{
'class': 'Year 1',
'student count': 20,
'room': 'Yellow',
'info': {
'teachers': {
'math': 'Rick Scott',
'physics': 'Elon Mask'
}
}
},
{
'class': 'Year 2',
'student count': 25,
'room': 'Blue',
'info': {
'teachers': {
'math': 'Alan Turing',
'physics': 'Albert Einstein'
}
}
},
]
pd.json_normalize(json_list)
(image by author)
We can see that all nested values in each record of the list are flattened and converted
into separate columns. Similarly, we can use the max_level argument to limit the
number of levels, for example:
pd.json_normalize(json_list, max_level=1)
(image by author)
json_obj = {
'school': 'ABC primary school',
'location': 'London',
'ranking': 2,
'info': {
'president': 'John Kasich',
'contacts': {
'email': {
'admission': '[email protected]',
'general': '[email protected]'
},
'tel': '123456789',
}
},
'students': [
{ 'name': 'Tom' },
{ 'name': 'James' },
{ 'name': 'Jacqueline' }
],
}
Calling pd.json_normalize(json_obj) on this object, we get:
(image by author)
We can see that our nested list is placed in a single column, students, and the other values
are flattened. How can we flatten the nested list? To do that, we can set the argument
record_path to ['students']:
# Flatten students
pd.json_normalize(json_obj, record_path=['students'])
(image by author)
The result looks great but doesn’t include school and tel. To include them, we can use
the argument meta to specify a list of metadata we want in the result.
pd.json_normalize(
json_obj,
    record_path=['students'],
meta=['school', ['info', 'contacts', 'tel']],
)
(image by author)
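The following sketch shows what record_path and meta produce together: one row per student, with the selected metadata repeated on every row. The dict school_obj is a shortened, hypothetical stand-in for json_obj.

```python
import pandas as pd

# Hypothetical, shortened stand-in for json_obj.
school_obj = {
    'school': 'ABC primary school',
    'info': {'contacts': {'tel': '123456789'}},
    'students': [
        {'name': 'Tom'},
        {'name': 'James'},
        {'name': 'Jacqueline'},
    ],
}

df = pd.json_normalize(
    school_obj,
    record_path=['students'],                       # one row per list item
    meta=['school', ['info', 'contacts', 'tel']],   # values copied onto each row
)

print(df)
```

A nested meta key is given as a list of keys, and its column name is those keys joined with the separator (info.contacts.tel here).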
json_list = [
{
'class': 'Year 1',
pd.json_normalize(json_list)
(image by author)
All nested lists are placed in a single column, students, and the other values are flattened.
To flatten the nested list, we can set the argument record_path to ['students']. Notice
that not all records have math and physics, and those missing values are shown as NaN.
pd.json_normalize(json_list, record_path=['students'])
(image by author)
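Here is a hedged sketch of the same idea on a list of dicts. The list classes and its grade values are invented for illustration; they are assumptions, not the article's data. The point is that ragged student records still line up, with gaps filled by NaN.

```python
import pandas as pd

# Hypothetical data: each class holds a list of students,
# and not every student has both a math and a physics grade.
classes = [
    {
        'class': 'Year 1',
        'students': [
            {'name': 'Tom', 'grade': {'math': 66}},
            {'name': 'James', 'grade': {'math': 88, 'physics': 70}},
        ],
    },
    {
        'class': 'Year 2',
        'students': [
            {'name': 'Tony'},  # no grades at all
        ],
    },
]

# One row per student across all classes; nested grade dicts are flattened too.
df = pd.json_normalize(classes, record_path=['students'])

print(df)
```

Note that dicts nested inside each student record are flattened as well, which is where the grade.math and grade.physics columns come from.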
If you would like to include other metadata, use the argument meta:
pd.json_normalize(
    json_list,
    record_path=['students'],
    meta=['class', 'room', ['info', 'teachers', 'math']]
)
(image by author)
The errors argument defaults to 'raise', so json_normalize() raises a KeyError if a key listed in meta
is not present in every record. For example, the math teacher is missing from the second
record.
data = [
{
'class': 'Year 1',
'student count': 20,
'room': 'Yellow',
'info': {
'teachers': {
'math': 'Rick Scott',
'physics': 'Elon Mask',
}
},
'students': [
{ 'name': 'Tom', 'sex': 'M' },
{ 'name': 'James', 'sex': 'M' },
]
},
{
'class': 'Year 2',
'student count': 25,
'room': 'Blue',
'info': {
'teachers': {
# no math teacher
'physics': 'Albert Einstein'
}
},
'students': [
{ 'name': 'Tony', 'sex': 'M' },
{ 'name': 'Jacqueline', 'sex': 'F' },
]
},
]
pd.json_normalize(
data,
    record_path=['students'],
meta=['class', 'room', ['info', 'teachers', 'math']],
)
(image by author)
To work around it, set the argument errors to 'ignore', and the missing values are
filled with NaN.
pd.json_normalize(
    data,
    record_path=['students'],
    meta=['class', 'room', ['info', 'teachers', 'math']],
    errors='ignore'
)
(image by author)
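The difference between the two errors modes can be checked side by side. The sketch below uses a cut-down, hypothetical version of data (named small_data) in which the second record has no math teacher.

```python
import pandas as pd

# Hypothetical cut-down data: the second record has no math teacher.
small_data = [
    {
        'class': 'Year 1',
        'info': {'teachers': {'math': 'Rick Scott'}},
        'students': [{'name': 'Tom'}],
    },
    {
        'class': 'Year 2',
        'info': {'teachers': {'physics': 'Albert Einstein'}},
        'students': [{'name': 'Tony'}],
    },
]
meta = ['class', ['info', 'teachers', 'math']]

# Default errors='raise': the missing meta key raises a KeyError.
try:
    pd.json_normalize(small_data, record_path=['students'], meta=meta)
    raised = False
except KeyError:
    raised = True

# errors='ignore': the missing meta value becomes NaN instead.
df = pd.json_normalize(
    small_data, record_path=['students'], meta=meta, errors='ignore'
)

print(raised)
print(df)
```

Using 'ignore' trades a loud failure for silent gaps, so it is worth checking the resulting NaN values afterwards.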
By default, nested values generate column names whose parts are separated by a dot (.), for example
info.teachers.math. To separate column names with something else, you can use the
sep argument.
pd.json_normalize(
    data,
    record_path=['students'],
    meta=['class', 'room', ['info', 'teachers', 'math']],
    errors='ignore',  # the second record in data has no math teacher
    sep='->'
)
(image by author)
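A quick way to see what sep changes is to compare the column names with and without it. The dict record below is a hypothetical one-record example, not taken from the article.

```python
import pandas as pd

# Hypothetical single record with two levels of nesting.
record = {
    'class': 'Year 1',
    'info': {'teachers': {'math': 'Rick Scott'}},
}

default_cols = set(pd.json_normalize(record).columns)
arrow_cols = set(pd.json_normalize(record, sep='->').columns)

print(default_cols)
print(arrow_cols)
```

Only the generated column names change; the values in the DataFrame are the same either way.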
pd.json_normalize(
data,
record_path=['students'],
meta=['class'],
meta_prefix='meta-',
record_prefix='student-'
)
(image by author)
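As a small sketch of what the two prefix arguments do to column names, the snippet below runs the same call on a one-record, hypothetical list named prefix_data.

```python
import pandas as pd

# Hypothetical one-class example.
prefix_data = [
    {
        'class': 'Year 1',
        'students': [
            {'name': 'Tom', 'sex': 'M'},
            {'name': 'James', 'sex': 'M'},
        ],
    },
]

df = pd.json_normalize(
    prefix_data,
    record_path=['students'],
    meta=['class'],
    meta_prefix='meta-',       # prepended to columns that come from meta
    record_prefix='student-',  # prepended to columns that come from record_path
)

columns = list(df.columns)
print(columns)
```

Prefixes are handy when a record field and a meta field share the same name, since they keep the two columns apart.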
import json
import pandas as pd
# load data using Python JSON module
with open('data/simple.json', 'r') as f:
    data = json.loads(f.read())
# flatten the loaded data into a DataFrame
df = pd.json_normalize(data)
The line data = json.loads(f.read()) loads the data using Python's json module. After that,
json_normalize() is called on the data to flatten it into a DataFrame.
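To try the file-loading step without the article's data/simple.json, the sketch below writes a small hypothetical payload to a temporary file first and then reads it back. It uses json.load(f), which is equivalent to json.loads(f.read()) for a file handle.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical payload standing in for the contents of simple.json.
payload = [
    {'class': 'Year 1', 'room': 'Yellow'},
    {'class': 'Year 2', 'room': 'Blue'},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'simple.json')

    # Write the payload to disk so there is a JSON file to read.
    with open(path, 'w') as f:
        json.dump(payload, f)

    # Parse the file back into Python objects (a list of dicts here).
    with open(path, 'r') as f:
        data = json.load(f)

# Flatten the parsed objects into a DataFrame.
df = pd.json_normalize(data)
print(df)
```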
import json
import requests
URL = 'http://raw.githubusercontent.com/BindiChen/machine-learning/master/data-analysis/027-pandas-convert-json/data/simple.json'
data = json.loads(requests.get(URL).text)
Conclusion
I hope this article helps you save time when flattening JSON data. I recommend
checking out the documentation for the json_normalize() API to learn about the other
things you can do.
Thanks for reading. Please check out the notebook for the source code and stay tuned if
you are interested in the practical aspect of machine learning.