Data Science Formats Beyond CSV and Hdfs

This document discusses various data formats for data science including textual, binary, JSON, HDF5, HDFS, and columnar databases. It provides examples of using NumPy, Pandas, pickle, HDF5, PyROOT, and JSON for different types of data. The document emphasizes choosing a data format based on the type and size of data as well as the need for storage, sharing, or analysis. It also briefly introduces out-of-core processing and complicated data formats like OPeNDAP that are required for very large or complex datasets.



Data Formats for Data Science

Valerio Maggio
Data Scientist and Researcher
Fondazione Bruno Kessler (FBK) 
Trento, Italy

@leriomaggio
About me

• Post Doc Researcher @ FBK

• Complex Data Analytics Unit (MPBA)
• Interested in Machine Learning, Text
and Data Processing
• with “Deep” divergences recently
• Fellow Pythonista since 2006
• scientific Python ecosystem
• PyData Italy Chair
• http://pydata.it
• @pydatait
worthwhile mentioning…

The Program is online: https://www.euroscipy.org/2016/program/

End of early-bird: Jul 21, 2016 (that’s today! 😱)
Data Formats 4 Data Science
• Data Processing

• Q: What’s the best way to process data?

• Q+: What’s the most Pythonic Way to do that?

• Data Sharing

• Q: What’s the best way to share (and to present) data?

• A: [Interactive] Charts - Data Visualisation

• OMG, Bokeh is better than ever! by Fabio Pliger (after this session!)
Jupyter Notebook for  
Data and Documentation Sharing
1. Textual Data format
More Pythonic
Numpy to the rescue
csv files
csv Module (in standard library)
Textual Data format
• Be Pythonic: use context managers (with)

• numpy (mostly numerical) and pandas (csv) to the rescue

• np.loadtxt and pd.read_csv

• (+) Very easy to (re)create and share

• very easy to process

• (-) Not storage friendly but highly compressible!

• (-) No structured information
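The code shown on these slides is not part of the extracted text. As a minimal sketch of the three approaches named above (the csv module with a context manager, np.loadtxt and pd.read_csv), assuming a small hypothetical file data.csv with two numeric columns x and y:

```python
import csv
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical sample file, created here only so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "data.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["x", "y"])
    writer.writerows([[1, 2.5], [2, 3.5], [3, 4.5]])

# 1. Standard library: the context manager closes the file,
#    and every field comes back as a string.
with open(path, newline="") as f:
    rows = list(csv.DictReader(f))

# 2. NumPy: numerical data only, so the header line is skipped explicitly.
arr = np.loadtxt(path, delimiter=",", skiprows=1)

# 3. pandas: the header and the column types are inferred.
df = pd.read_csv(path)
```

The standard-library reader leaves type conversion to the caller, which is part of the "no structured information" drawback listed above.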

2. Binary Data format
Binary format
Integers and floats in native and string representations
*

• Space is not the only concern (for text). Speed matters!

• Python conversions with int() and float() are slow
• costly atoi()/atof() C functions
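To make the text-versus-binary trade-off concrete, here is a small sketch that stores the same array both ways in memory buffers. The array contents are an arbitrary assumption, and NumPy's .npy format stands in for "a binary format" in general:

```python
import io

import numpy as np

values = np.arange(1000, dtype=np.float64) / 7.0

# Text: each number is formatted on write and parsed again on read.
text_buf = io.StringIO()
np.savetxt(text_buf, values)
text_size = len(text_buf.getvalue().encode("ascii"))
text_buf.seek(0)
from_text = np.loadtxt(text_buf)

# Binary: the in-memory bytes are written as they are, after a small header.
bin_buf = io.BytesIO()
np.save(bin_buf, values)
bin_size = bin_buf.getbuffer().nbytes
bin_buf.seek(0)
from_binary = np.load(bin_buf)
```

Comparing text_size with bin_size shows the space cost of the string representation; timing the two load calls on a larger array would show the parsing cost.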

A. Scopatz, K.D. Huff - Effective Computations in Physics - Field Guide to Research in Python, O’Reilly 2015
import pickle
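A minimal pickle round trip, as a sketch; the record below is a made-up example, not data from the talk:

```python
import pickle

# Hypothetical record mixing strings, numbers, a list and a boolean.
record = {"name": "run-1", "values": [1, 2.5, 3], "ok": True}

# Serialise to bytes; pickle.dump writes the same bytes to a file opened in "wb" mode.
blob = pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL)

# Deserialise. Only unpickle data that comes from a trusted source.
restored = pickle.loads(blob)
```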

Still, it is often desirable to have something more than a binary chunk of data in a file.
Hierarchical Data Format 5 (a.k.a. hdf5)
• Free and open source file format specification

• HDF Group - University of Illinois at Urbana-Champaign

• (+) Works great with both big and tiny datasets

• (+) Storage friendly

• Allows for Compression

• (+) Dev. Friendly

• Query DSL + Multiple-language support

• Python: PyTables, hdf5, h5py

Numpy Arrays tight integration with PyTables

Accessing the table

Hierarchy and Groups
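The PyTables code from these slides is not in the extracted text. The sketch below uses h5py instead (one of the Python libraries listed earlier) to illustrate the same ideas: groups as a hierarchy, compressed datasets, and reading only a slice. The group, dataset and attribute names are assumptions made for the example.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "example.h5")
data = np.arange(100, dtype=np.float64).reshape(10, 10)

with h5py.File(path, "w") as f:
    # Groups behave like directories; intermediate groups are created as needed.
    grp = f.create_group("experiment/run1")
    # Datasets hold array data and can be compressed transparently.
    dset = grp.create_dataset("signal", data=data, compression="gzip")
    # Attributes attach small pieces of metadata to a group or dataset.
    dset.attrs["units"] = "mV"

with h5py.File(path, "r") as f:
    dset = f["experiment/run1/signal"]
    first_row = dset[0, :]                 # slicing returns a NumPy array
    units = dset.attrs["units"]
    names = list(f["experiment"].keys())   # children of the "experiment" group
```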
Data Chunking

A. Scopatz, K.D. Huff - Effective Computations in Physics - Field Guide to Research in Python, O’Reilly 2015
Data Chunking

• Small chunks are good for accessing only some of the data at a time.

• Large chunks are good for accessing lots of data at a time.  

• Reading and writing chunks may happen in parallel
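The trade-off above can be checked with simple arithmetic. The sketch below assumes that a chunked store must load every chunk that overlaps a selection in full; the dataset size and chunk shapes are hypothetical.

```python
def chunks_touched(selection, chunk_shape):
    """Count the chunks overlapped by a rectangular selection.

    selection:   one (start, stop) pair per dimension, stop exclusive
    chunk_shape: chunk size along each dimension
    """
    count = 1
    for (start, stop), size in zip(selection, chunk_shape):
        first = start // size
        last = (stop - 1) // size
        count *= last - first + 1
    return count


def elements_read(selection, chunk_shape):
    """Elements loaded when every overlapped chunk is read in full."""
    per_chunk = 1
    for size in chunk_shape:
        per_chunk *= size
    return chunks_touched(selection, chunk_shape) * per_chunk


# Hypothetical 1000 x 1000 dataset with two candidate chunk shapes.
one_row = [(0, 1), (0, 1000)]
everything = [(0, 1000), (0, 1000)]
small, large = (10, 10), (500, 500)

row_small = elements_read(one_row, small)       # 100 chunks of 100 elements
row_large = elements_read(one_row, large)       # 2 chunks of 250000 elements
all_small = chunks_touched(everything, small)   # many small reads
all_large = chunks_touched(everything, large)   # few large reads
```

With small chunks a single row costs far fewer elements read; with large chunks the full dataset needs far fewer separate chunk reads.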

A. Scopatz, K.D. Huff - Effective Computations in Physics - Field Guide to Research in Python, O’Reilly 2015
Parallel HDF5
MPI (mpi4py) integration
Learn More

• How to migrate from PostgreSQL to HDF5 and live happily ever after by Michele Simionato @PyData Track on Friday
ROOT Data Format
• Data Analysis Framework (and tool) dev. @CERN

• written in C++;

• native extension in Python (aka PyROOT)

• ROOT6 also ships a Jupyter Kernel

• Definition of a new Binary Data Format (.root)

• based on the serialisation of C++ Objects

C++ style

rootpy rootpy.github.io/
root_numpy rootpy.github.io/root_numpy/
root_numpy examples

Tight integration with PyROOT objects

root2hdf5 (included in rootpy)

http://www.rootpy.org/commands/root2hdf5.html
3. JSON Data format
Jupyter Notebook
Data Format
JSON is the format of choice for  
Document Oriented DBs  
(a.k.a. NOSQL DBs)
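As a sketch of why JSON suits both notebooks and document stores, the example below serialises a nested document and reads it back with the standard library. The field names only loosely imitate a notebook file and are not the full notebook schema.

```python
import json

# Nested dicts and lists map directly onto JSON objects and arrays.
doc = {
    "metadata": {"language": "python"},
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {"cell_type": "code", "source": ["print(1 + 1)"]},
    ],
}

text = json.dumps(doc, indent=2)   # human-readable text, easy to share and diff
restored = json.loads(text)

# Once loaded, the document is queried with ordinary Python.
code_cells = [c for c in restored["cells"] if c["cell_type"] == "code"]
```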
HDF5 vs MongoDB
Total Number of Documents: 100.000
Total Number of Entries: 8.755.882
Total Number of Calls: 319.970

[Chart: Average time per Single Call (sec.), comparing HDF5 (blosc filter), MongoDB (flat storage) and MongoDB (compact storage)]

HDF5 vs MongoDB
Total Number of Documents: 100.000
Total Number of Entries: 8.755.882
Total Number of Calls: 319.970

System                       Storage (MB)
HDF5 (blosc filter)          922.528
MongoDB (flat storage)       3.952.148
MongoDB (compact storage)    1.953.125

HDFS Data format

matthewrocklin.com/blog/work/2016/02/22/dask-distributed-part-2
HDFS
• HDFS: Hadoop Distributed File System

• Distributed Filesystem on top of Hadoop

• Data can be organised in shards and distributed among several machines (cluster config)

• (de facto) Big Data Data Format

• Python: hdfs3

• Wraps libhdfs3, a native C++ implementation of the HDFS client

• No Java along the way!

HDFS + CSV

Opening a Single File on the HDFS

HDFS + CSV

Wildcard opening of CSVs on the HDFS
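The code for these two slides is not in the extracted text. The sketch below shows one way to do both with hdfs3 and pandas; it is not necessarily what the talk showed. The host, port and paths are hypothetical, and the helper functions are written against any object that exposes open() and glob(), which hdfs3's HDFileSystem does.

```python
import pandas as pd


def connect(host, port):
    """Return an hdfs3 handle (needs hdfs3 installed and a reachable namenode)."""
    from hdfs3 import HDFileSystem  # imported lazily: optional dependency
    return HDFileSystem(host=host, port=port)


def read_csv(fs, path):
    """Read a single CSV file through a filesystem object exposing open()."""
    with fs.open(path, "rb") as f:
        return pd.read_csv(f)


def read_csv_glob(fs, pattern):
    """Read every CSV matching a wildcard pattern and concatenate the rows."""
    frames = [read_csv(fs, p) for p in sorted(fs.glob(pattern))]
    return pd.concat(frames, ignore_index=True)


# Usage, assuming a hypothetical cluster and paths:
# fs = connect("localhost", 8020)
# df = read_csv(fs, "/data/2016/part-0.csv")
# df_all = read_csv_glob(fs, "/data/2016/*.csv")
```

Concatenating everything in memory only works while the matched files fit in RAM; beyond that, a chunked or distributed reader is the better fit.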

Big Data and Columnar DBs
• Big Data World is shifting towards columnar DBs
• better oriented to OLAP (analytics) rather than OLTP
• In-Database analytics with python and MonetDB by G. Emireni @PyData Italy 2016
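A toy illustration of the row versus column distinction, in plain Python with made-up records. It only shows the layout idea and says nothing about how a real columnar database stores data on disk.

```python
# The same three records in two layouts.
rows = [
    {"id": 1, "city": "Trento", "amount": 10.0},
    {"id": 2, "city": "Rome", "amount": 20.0},
    {"id": 3, "city": "Trento", "amount": 5.0},
]

# Row-oriented (OLTP-friendly): a whole record is fetched as one unit.
record = rows[1]

# Column-oriented (OLAP-friendly): all values of a column are kept together.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# An aggregate reads a single column instead of every full record.
total = sum(columns["amount"])
```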
A format has no name
http://xarray.pydata.org/en/stable/index.html

http://blaze.pydata.org
Out-of-Core Processing
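The idea of out-of-core processing can be sketched with pandas alone: read the data in fixed-size pieces and keep only running totals in memory. The in-memory buffer below stands in for a file too large to load at once, and the column name is an assumption.

```python
import io

import pandas as pd

# Stand-in for a large file on disk (hypothetical data: the integers 1..100).
source = io.StringIO("value\n" + "\n".join(str(i) for i in range(1, 101)))

total = 0
count = 0
# With chunksize set, read_csv yields DataFrames of at most 25 rows each.
for chunk in pd.read_csv(source, chunksize=25):
    total += chunk["value"].sum()
    count += len(chunk)

mean = total / count
```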
Complicated data require complicated formats
Complicated formats require good tools

OPeNDAP: http://goo.gl/fMehjh
Thanks a lot for your kind attention

@leriomaggio [email protected]

+ValerioMaggio it.linkedin.com/in/valeriomaggio
