Data Science Formats Beyond CSV and HDFS
Valerio Maggio
Data Scientist and Researcher
Fondazione Bruno Kessler (FBK)
Trento, Italy
@leriomaggio
About me (kidding, that’s me! :-)
End of early-bird:
Jul 21, 2016
(that’s today! 😱)
Data Formats 4 Data Science
• Data Processing
• Data Sharing
Textual
Data format
More Pythonic
NumPy to the rescue
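A minimal sketch of what loading textual data with NumPy can look like. The CSV content, column names and values are hypothetical, and an in-memory buffer stands in for a file on disk.

```python
import io

import numpy as np

# Hypothetical CSV content; in practice this would be a file on disk.
csv_text = "x,y\n1.0,2.0\n3.0,4.0\n5.0,6.0\n"

# np.loadtxt parses delimited text straight into an ndarray;
# skiprows=1 drops the header line.
data = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)

# Column-wise statistics are then a one-liner.
column_means = data.mean(axis=0)
```

Compared with parsing each row by hand, the whole table lands in a typed array in one call.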
csv files
csv Module (in standard library)
Textual Data format
• Be Pythonic: use context managers (with)
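As an illustration of the advice above, here is a small sketch that writes and reads a CSV file with the standard-library csv module inside with blocks. The file name and rows are made up for the example.

```python
import csv
import os
import tempfile

# Hypothetical rows; the first one acts as the header.
rows = [["name", "value"], ["alpha", "1"], ["beta", "2"]]

# Temporary location used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "example.csv")

# The with statement closes the file even if an error is raised.
# newline="" is what the csv documentation recommends for csv files.
with open(path, "w", newline="") as fp:
    csv.writer(fp).writerows(rows)

with open(path, newline="") as fp:
    loaded = list(csv.reader(fp))
```

Note that csv.reader returns every field as a string, so numeric columns still need an explicit conversion.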
Binary
Data format
Binary format
Integers and floats in native and string representations
* A. Scopatz, K.D. Huff - Effective Computation in Physics: Field Guide to Research with Python, O’Reilly 2015
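A short sketch of the native versus string representation point, using the standard-library struct module. The sample values are arbitrary and the little-endian layout is an assumption made for the example.

```python
import struct

int_value = 1234567890
float_value = 3.141592653589793

# String representation: one byte per character, so size depends on the digits.
int_text = str(int_value).encode("ascii")
float_text = repr(float_value).encode("ascii")

# Native representation: fixed size regardless of the value.
# "<i" is a little-endian 4-byte integer, "<d" a little-endian 8-byte double.
int_binary = struct.pack("<i", int_value)
float_binary = struct.pack("<d", float_value)

# Unpacking the binary form restores the exact value, with no parsing step.
restored_float = struct.unpack("<d", float_binary)[0]
```

The binary form is both smaller and exact, which is the main argument for binary formats on numeric data.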
import pickle
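A minimal round trip with pickle, assuming a small hypothetical record. Only the standard library is used.

```python
import pickle

# Hypothetical Python object mixing several built-in types.
record = {"name": "sample", "values": [1, 2.5, 3], "nested": {"ok": True}}

# dumps serialises the object to bytes; loads rebuilds an equal object.
payload = pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(payload)

# Caveat: unpickling can execute arbitrary code, so only load trusted data.
```

pickle is convenient for Python-to-Python exchange, but the format is Python-specific, which limits data sharing.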
Data Chunking
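A sketch of chunked storage with h5py, assuming h5py and NumPy are installed. The dataset name, shapes and chunk size are illustrative choices, not recommendations.

```python
import os
import tempfile

import h5py
import numpy as np

# Temporary HDF5 file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "chunked.h5")

with h5py.File(path, "w") as f:
    # chunks=(100, 100) asks HDF5 to store the array as 100x100 tiles.
    dset = f.create_dataset(
        "data", shape=(1000, 1000), dtype="f4", chunks=(100, 100)
    )
    # Writing a slice aligned with the tiling touches exactly one chunk.
    dset[0:100, 0:100] = np.ones((100, 100), dtype="f4")

with h5py.File(path, "r") as f:
    chunk_shape = f["data"].chunks
    # Slicing reads only the requested region from disk.
    block_sum = float(f["data"][0:100, 0:100].sum())
    untouched = float(f["data"][500:600, 500:600].sum())
```

Chunk shape should match the expected access pattern: reading along a dimension that cuts across many chunks is slower than reading whole chunks.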
Parallel HDF5
MPI (mpi4py) integration
Learn More
• ROOT: written in C++;
rootpy rootpy.github.io/
root_numpy rootpy.github.io/root_numpy/
root_numpy examples
http://www.rootpy.org/commands/root2hdf5.html
JSON
Data format
Jupyter Notebook
Data Format
JSON is the format of choice for document-oriented DBs (a.k.a. NoSQL DBs)
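A small sketch of why JSON maps so naturally onto documents: a nested Python dict round-trips through the standard-library json module. The document fields here are hypothetical.

```python
import json

# Hypothetical document, shaped like a record in a document store.
document = {
    "title": "Data formats",
    "tags": ["csv", "hdf5", "json"],
    "meta": {"draft": False, "reviewers": None},
}

# dumps produces human-readable text; loads rebuilds the same structure.
text = json.dumps(document, indent=2, sort_keys=True)
restored = json.loads(text)
```

JSON covers only strings, numbers, booleans, null, lists and objects, so types such as dates or NumPy arrays need an explicit conversion first.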
HDF5 vs MongoDB
[Charts: Total Number of Documents, Total Number of Entries, Total Number of Calls, Storage (MB)]
HDFS
Data format
matthewrocklin.com/blog/work/2016/02/22/dask-distributed-part-2
HDFS
• HDFS: Hadoop Distributed File System
• Python: hdfs3
http://blaze.pydata.org
Out-of-Core Processing
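The idea behind out-of-core processing can be sketched in plain Python: read the data in fixed-size chunks and keep only running aggregates in memory. This is a generic illustration, not how Blaze or Dask are implemented; the helper iter_chunks, the chunk size and the data are hypothetical.

```python
import csv
import io
from itertools import islice


def iter_chunks(reader, size):
    """Yield lists of at most `size` rows until the reader is exhausted."""
    while True:
        chunk = list(islice(reader, size))
        if not chunk:
            return
        yield chunk


# Hypothetical single-column data set; a real one would be a large file.
csv_text = "value\n" + "\n".join(str(i) for i in range(1, 11)) + "\n"
reader = csv.reader(io.StringIO(csv_text))
next(reader)  # skip the header row

# Only one chunk and two running totals are held in memory at a time.
total = 0.0
count = 0
for chunk in iter_chunks(reader, size=3):
    total += sum(float(row[0]) for row in chunk)
    count += len(chunk)

mean = total / count
```

Libraries built for this purpose apply the same pattern to arrays and data frames, and add scheduling and parallelism on top.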
Complicated data require complicated formats
Complicated formats require good tools
OPeNDAP: http://goo.gl/fMehjh
Thanks a lot for your
kind attention
@leriomaggio [email protected]
+ValerioMaggio it.linkedin.com/in/valeriomaggio