Data Science Formats Beyond CSV and Hdfs

This document discusses various data formats for data science including textual, binary, JSON, HDF5, HDFS, and columnar databases. It provides examples of using NumPy, Pandas, pickle, HDF5, PyROOT, and JSON for different types of data. The document emphasizes choosing a data format based on the type and size of data as well as the need for storage, sharing, or analysis. It also briefly introduces out-of-core processing and complicated data formats like OPeNDAP that are required for very large or complex datasets.

Data Formats for Data Science

Valerio Maggio
Data Scientist and Researcher
Fondazione Bruno Kessler (FBK)

Trento, Italy

@leriomaggio
About me kidding, that’s me!-)

• Post Doc Researcher @ FBK


• Complex Data Analytics Unit (MPBA)
• Interested in Machine Learning, Text
and Data Processing
• with “Deep” divergences recently
• Fellow Pythonista since 2006
• scientific Python ecosystem
• PyData Italy Chair
• http://pydata.it
• @pydatait
worthwhile mentioning…

The Program is online: https://www.euroscipy.org/2016/program/

End of early-bird: Jul 21, 2016 (that’s today! 😱)
Data Formats 4 Data Science
• Data Processing

• Q: What’s the best way to process data?

• Q+: What’s the most Pythonic Way to do that?

• Data Sharing

• Q: What’s the best way to share (and present) data?

• A: [Interactive] Charts - Data Visualisation

• OMG, Bokeh is better than ever! by Fabio Pliger (after this session!)
Jupyter Notebook for Data and Documentation Sharing
1. Textual Data format
More Pythonic
Numpy to the rescue
csv files
csv Module (in standard library)
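The code from this slide is not part of the text. As a minimal sketch of what reading and writing with the standard-library csv module can look like, the example below writes a small file and reads it back using context managers. The file name, column names and values are made up for illustration.

```python
import csv
import os
import tempfile

# Hypothetical sample rows; csv always hands values back as strings.
rows = [
    {"name": "alpha", "value": "1.5"},
    {"name": "beta", "value": "2.5"},
]

path = os.path.join(tempfile.mkdtemp(), "sample.csv")

# Write: newline="" lets the csv module control line endings itself.
with open(path, "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["name", "value"])
    writer.writeheader()
    writer.writerows(rows)

# Read: the context manager closes the file even if parsing fails.
with open(path, newline="") as fh:
    records = list(csv.DictReader(fh))

# Numeric columns must be converted explicitly.
values = [float(rec["value"]) for rec in records]
total = sum(values)
```

Every field comes back as a string, which is one reason NumPy and pandas are more convenient for numerical data.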
Textual Data format
• Be Pythonic: use context managers (with)

• numpy (mostly numerical) and pandas (csv) to the rescue

• np.loadtxt and pd.read_csv

• (+) Very easy to (re)create and share

• very easy to process

• (-) Not storage friendly but highly compressible!

• (-) No structured information
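A minimal sketch of the two loaders named above, np.loadtxt and pd.read_csv, applied to the same small file. The file contents and column names are invented for the example.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical two-column CSV file with a header row.
path = os.path.join(tempfile.mkdtemp(), "measurements.csv")
with open(path, "w") as fh:
    fh.write("x,y\n1.0,2.0\n3.0,4.0\n5.0,6.0\n")

# NumPy: purely numerical data; the header row has to be skipped by hand.
arr = np.loadtxt(path, delimiter=",", skiprows=1)

# pandas: the header row becomes the column names and dtypes are inferred.
df = pd.read_csv(path)

col_means = arr.mean(axis=0)
y_total = df["y"].sum()
```

np.loadtxt returns a plain array with no column names, while pd.read_csv keeps the header, which partly offsets the lack of structured information in the file itself.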


2. Binary Data format
Binary format
Integers and floats in native and string representations
*

• Space is not the only concern (for text). Speed matters!


• Python conversions with int() and float() are slow
• costly atoi()/atof() C functions

A. Scopatz, K.D. Huff - Effective Computations in Physics - Field Guide to Research in Python, O’Reilly 2015
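To make the text versus binary trade-off concrete, here is a minimal sketch that stores the same array both ways with NumPy and compares the file sizes. The array contents and file names are arbitrary. Timing is left out on purpose, since it depends on the machine; timeit could be used to measure it.

```python
import os
import tempfile

import numpy as np

# Arbitrary sample data: 1000 double-precision floats.
rng = np.random.default_rng(0)
data = rng.random(1000)

tmp = tempfile.mkdtemp()
text_path = os.path.join(tmp, "data.txt")
bin_path = os.path.join(tmp, "data.npy")

np.savetxt(text_path, data)  # one decimal string per value
np.save(bin_path, data)      # raw 8-byte floats plus a small header

text_size = os.path.getsize(text_path)
bin_size = os.path.getsize(bin_path)

# Loading text parses every number from a string;
# loading binary copies the bytes back into an array.
from_text = np.loadtxt(text_path)
from_bin = np.load(bin_path)
```

With the default savetxt format each value takes roughly three times the space of its 8-byte binary form.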
import pickle
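The slide's code is not in the text. As a minimal sketch, pickle serialises an arbitrary Python object to a binary file and restores it. The record below is invented for illustration.

```python
import os
import pickle
import tempfile

# Hypothetical record mixing several built-in types.
record = {"run": 42, "samples": [0.1, 0.2, 0.3], "tags": ("raw", "calibrated")}

path = os.path.join(tempfile.mkdtemp(), "record.pkl")

# Pickle files are binary, so open them in "wb" / "rb" mode.
with open(path, "wb") as fh:
    pickle.dump(record, fh, protocol=pickle.HIGHEST_PROTOCOL)

with open(path, "rb") as fh:
    restored = pickle.load(fh)
```

Python types such as tuples survive the round trip, but the file is only readable from Python, and unpickling data from an untrusted source can execute arbitrary code.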

Still, it is often desirable to have something more than a binary chunk of data in a file.
Hierarchical Data Format 5 (a.k.a. hdf5)
• Free and open source file format specification

• HDF Group - Univ. of Illinois at Urbana-Champaign

• (+) Works great with both big and tiny datasets

• (+) Storage friendly

• Allows for Compression

• (+) Dev. Friendly

• Query DSL + Multiple-language support

• Python: PyTables, hdf5, h5py


NumPy Arrays tight integration with PyTables

Accessing the table
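The code from these slides is missing from the text. The sketch below shows one way the NumPy integration, table access and query language can look in PyTables, assuming the tables package is installed. The group name, table name and column names are hypothetical.

```python
import os
import tempfile

import numpy as np
import tables as tb

# Hypothetical structured array standing in for real measurements.
data = np.array(
    [(i, float(i) ** 2) for i in range(10)],
    dtype=[("idx", "i4"), ("energy", "f8")],
)

path = os.path.join(tempfile.mkdtemp(), "demo.h5")

with tb.open_file(path, mode="w") as h5:
    group = h5.create_group("/", "detector", title="Detector data")
    # Passing the array as obj creates the table and fills it in one step.
    h5.create_table(group, "readout", obj=data, title="Readout")

with tb.open_file(path, mode="r") as h5:
    table = h5.root.detector.readout        # attribute-style access to nodes
    n_rows = table.nrows
    energies = table.col("energy")          # one column as a NumPy array
    hits = table.read_where("energy > 50")  # in-kernel query on a column
    hit_ids = hits["idx"].tolist()
```

The query string is evaluated inside PyTables, so only the matching rows are returned as a structured array.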


Hierarchy and Groups
Data Chunking

A. Scopatz, K.D. Huff - Effective Computations in Physics - Field Guide to Research in Python, O’Reilly 2015
Data Chunking

• Small chunks are good for accessing only some of the data at a time.


• Large chunks are good for accessing lots of data at a time. 


• Reading and writing chunks may happen in parallel

A. Scopatz, K.D. Huff - Effective Computations in Physics - Field Guide to Research in Python, O’Reilly 2015
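As a minimal sketch of groups, chunking and compression together, the example below uses h5py (assuming it is installed) rather than PyTables. The group path, dataset name, chunk shape and attribute are arbitrary choices for illustration, not recommendations.

```python
import os
import tempfile

import h5py
import numpy as np

# Arbitrary 2-D sample data.
data = np.arange(400 * 400, dtype="f8").reshape(400, 400)

path = os.path.join(tempfile.mkdtemp(), "chunked.h5")

with h5py.File(path, "w") as f:
    # Groups behave like directories; intermediate groups are created as needed.
    run = f.create_group("experiment/run1")
    dset = run.create_dataset(
        "signal",
        data=data,
        chunks=(100, 100),    # the dataset is stored as 100x100 tiles
        compression="gzip",   # each chunk is compressed independently
    )
    dset.attrs["units"] = "mV"  # hypothetical metadata attribute

with h5py.File(path, "r") as f:
    dset = f["experiment/run1/signal"]
    chunk_shape = dset.chunks
    # This slice lines up with a single chunk, so only that tile is read.
    block = dset[200:300, 300:400]
```

A slice that matches the chunk layout touches few chunks, while a slice that cuts across many chunks forces each of them to be read and decompressed.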
Parallel HDF5
MPI (mpi4py) integration
Learn More

• How to migrate from PostgreSQL to HDF5 and live happily ever after by Michele Simionato @PyData Track on Friday
ROOT Data Format
• Data Analysis Framework (and tool) dev. @CERN

• written in C++;

• native extension in Python (aka PyROOT)

• ROOT6 also ships a Jupyter Kernel

• Definition of a new Binary Data Format (.root)

• based on the serialisation of C++ Objects


C++ style

rootpy rootpy.github.io/
root_numpy rootpy.github.io/root_numpy/
root_numpy examples

Tight integration with PyROOT objects


root2hdf5 (included in rootpy)

http://www.rootpy.org/commands/root2hdf5.html
3. JSON Data format
Jupyter Notebook
Data Format
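A Jupyter notebook file is a JSON document, so the standard json module is enough to inspect one. The sketch below builds a minimal hand-written document in the nbformat version 4 layout and extracts its code cells. Real notebooks carry more metadata than shown here.

```python
import json

# Minimal hand-written notebook document (nbformat v4 layout).
notebook_text = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["print('hello')"],
        },
    ],
})

# Parsing gives plain dicts and lists that can be walked directly.
notebook = json.loads(notebook_text)
code_cells = [
    "".join(cell["source"])
    for cell in notebook["cells"]
    if cell["cell_type"] == "code"
]
```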
JSON is the format of choice for Document Oriented DBs (a.k.a. NoSQL DBs)
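As a minimal sketch of why JSON suits document stores, a nested record can be serialised and restored without declaring a schema first. The field names and values below are invented.

```python
import json

# Hypothetical nested document: no schema is declared anywhere.
document = {
    "_id": "doc-001",
    "title": "Data Formats for Data Science",
    "tags": ["hdf5", "json"],
    "stats": {"entries": 3, "values": [1.5, 2.5, 3.5]},
}

encoded = json.dumps(document, sort_keys=True)  # plain text, easy to share
decoded = json.loads(encoded)

mean_value = sum(decoded["stats"]["values"]) / decoded["stats"]["entries"]
```

JSON only knows objects, arrays, strings, numbers, booleans and null, so richer Python types such as tuples or NumPy arrays have to be converted before dumping.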
HDF5 vs MongoDB
Dataset: 100,000 documents, 8,755,882 entries, 319,970 calls
[Bar chart: average time per single call (sec.), axis ticks from 0.001 to 0.005, comparing HDF5 (blosc filter), MongoDB (flat storage) and MongoDB (compact storage)]


HDF5 vs MongoDB: storage
Dataset: 100,000 documents, 8,755,882 entries, 319,970 calls

System                       Storage (MB)
HDF5 (blosc filter)               922,528
MongoDB (flat storage)          3,952,148
MongoDB (compact storage)       1,953,125


4. HDFS Data format

matthewrocklin.com/blog/work/2016/02/22/dask-distributed-part-2
HDFS
• HDFS: Hadoop Distributed File System

• Distributed Filesystem on top of Hadoop

• Data can be organised in shards and distributed among several machines (cluster config)

• (de facto) Big Data Data Format

• Python: hdfs3

• Native implementation of HDFS in C++

• No Java along the way!


HDFS + CSV

Opening a Single File on the HDFS


HDFS + CSV

Wildcard opening of CSVs on the HDFS
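The code from these two slides is not in the text, and running against HDFS needs a live cluster. As a local stand-in only, the sketch below uses a temporary directory in place of HDFS and glob in place of the cluster client, to show the pattern of opening many CSV shards with a wildcard and treating them as one table. File names and values are invented.

```python
import glob
import os
import tempfile

import pandas as pd

# Local directory standing in for an HDFS path (assumption for this sketch).
shard_dir = tempfile.mkdtemp()

# Write three hypothetical monthly shards of different lengths.
monthly_fares = [[10.0, 12.5], [8.0], [20.0, 5.5, 7.0]]
for month, fares in enumerate(monthly_fares, start=1):
    shard = pd.DataFrame({"month": month, "fare": fares})
    shard.to_csv(os.path.join(shard_dir, f"trips_2014-{month:02d}.csv"), index=False)

# Single file: open one shard by its full path.
first = pd.read_csv(os.path.join(shard_dir, "trips_2014-01.csv"))

# Wildcard: expand the pattern, read every shard, concatenate the pieces.
paths = sorted(glob.glob(os.path.join(shard_dir, "trips_2014-*.csv")))
trips = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)

total_fare = trips["fare"].sum()
```

On a real cluster the shards would be opened through an HDFS client such as hdfs3 instead of the local filesystem, and the reads could be spread across workers.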


Big Data and Columnar DBs
• Big Data World is shifting towards columnar DBs
• better oriented to OLAP (analytics) rather than OLTP
• In-Database analytics with Python and MonetDB by G. Emireni @PyData Italy 2016
A format has no name
http://xarray.pydata.org/en/stable/index.html

http://blaze.pydata.org
Out-of-Core Processing
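A minimal sketch of the out-of-core idea using plain pandas rather than the libraries linked above: the file is read in fixed-size chunks and only running totals are kept, so the whole dataset never has to fit in memory. The file, column name and chunk size are arbitrary.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical dataset; in practice it would be larger than memory.
path = os.path.join(tempfile.mkdtemp(), "big.csv")
pd.DataFrame({"value": np.arange(10_000, dtype="f8")}).to_csv(path, index=False)

total = 0.0
count = 0

# chunksize makes read_csv return an iterator of DataFrames.
for chunk in pd.read_csv(path, chunksize=1_000):
    total += chunk["value"].sum()
    count += len(chunk)

mean = total / count
```

Only aggregates that can be combined across chunks (sums, counts, minima, maxima) work this directly; statistics such as the median need a different strategy.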
Complicated data require complicated formats
Complicated formats require good tools

OPeNDAP: http://goo.gl/fMehjh
Thanks a lot for your kind attention

@leriomaggio [email protected]

+ValerioMaggio it.linkedin.com/in/valeriomaggio
