Julia for Data Analysis
About this ebook
In Julia for Data Analysis you will learn how to:
Read and write data in various formats
Work with tabular data, including subsetting, grouping, and transforming
Visualize your data
Build predictive models
Create data processing pipelines
Create web services sharing results of data analysis
Write readable and efficient Julia programs
Julia was designed for the unique needs of data scientists: it's expressive and easy to use while also delivering super-fast code execution. Julia for Data Analysis shows you how to take full advantage of this amazing language to read, write, transform, analyze, and visualize data, which is everything you need for an effective data pipeline. It’s written by Bogumil Kaminski, one of the top contributors to Julia, #1 Julia answerer on StackOverflow, and a lead developer of Julia’s core data package DataFrames.jl. Its engaging hands-on projects get you into the action quickly. Plus, you’ll even be able to turn your new Julia skills to general-purpose programming!
Foreword by Viral Shah.
About the technology
Julia is a great language for data analysis. It’s easy to learn, fast, and it works well for everything from one-off calculations to full-on data processing pipelines. Whether you’re looking for a better way to crunch everyday business data or you’re just starting your data science journey, learning Julia will give you a valuable skill.
About the book
Julia for Data Analysis teaches you how to handle core data analysis tasks with the Julia programming language. You’ll start by reviewing language fundamentals as you practice techniques for data transformation, visualizations, and more. Then, you’ll master essential data analysis skills through engaging examples like examining currency exchange, interpreting time series data, and even exploring chess puzzles. Along the way, you’ll learn to easily transfer existing data pipelines to Julia.
What's inside
Read and write data in various formats
Work with tabular data, including subsetting, grouping, and transforming
Create data processing pipelines
Create web services sharing results of data analysis
Write readable and efficient Julia programs
About the reader
For data scientists familiar with Python or R. No experience with Julia required.
About the author
Bogumil Kaminski is one of the lead developers of DataFrames.jl, the core package for data manipulation in the Julia ecosystem. He has over 20 years of experience delivering data science projects.
Table of Contents
1 Introduction
PART 1 ESSENTIAL JULIA SKILLS
2 Getting started with Julia
3 Julia’s support for scaling projects
4 Working with collections in Julia
5 Advanced topics on handling collections
6 Working with strings
7 Handling time-series data and missing values
PART 2 TOOLBOX FOR DATA ANALYSIS
8 First steps with data frames
9 Getting data from a data frame
10 Creating data frame objects
11 Converting and grouping data frames
12 Mutating and transforming data frames
13 Advanced transformations of data frames
14 Creating web services for sharing data analysis results
Julia for Data Analysis
Bogumił Kaminski
Foreword by VIRAL SHAH
Manning
Shelter Island
For more information on this and other Manning titles go to
www.manning.com
Copyright
For online information and ordering of these and other Manning books, please visit www.manning.com. The publisher offers discounts on these books when ordered in quantity.
For more information, please contact
Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
©2023 by Manning Publications Co. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.
♾ Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.
ISBN: 9781633439368
contents
front matter
foreword
preface
acknowledgments
about this book
about the author
about the cover illustration
1 Introduction
1.1 What is Julia and why is it useful?
1.2 Key features of Julia from a data scientist’s perspective
Julia is fast because it is a compiled language
Julia provides full support for interactive workflows
Julia programs are highly reusable and easy to compose together
Julia has a built-in state-of-the-art package manager
It is easy to integrate existing code with Julia
1.3 Usage scenarios of tools presented in the book
1.4 Julia’s drawbacks
1.5 What data analysis skills will you learn?
1.6 How can Julia be used for data analysis?
Part 1 Essential Julia skills
2 Getting started with Julia
2.1 Representing values
2.2 Defining variables
2.3 Using the most important control-flow constructs
Computations depending on a Boolean condition
Loops
Compound expressions
A first approach to calculating the winsorized mean
2.4 Defining functions
Defining functions using the function keyword
Positional and keyword arguments of functions
Rules for passing arguments to functions
Short syntax for defining simple functions
Anonymous functions
Do blocks
Function-naming convention in Julia
A simplified definition of a function computing the winsorized mean
2.5 Understanding variable scoping rules
3 Julia’s support for scaling projects
3.1 Understanding Julia’s type system
A single function in Julia may have multiple methods
Types in Julia are arranged in a hierarchy
Finding all supertypes of a type
Finding all subtypes of a type
Union of types
Deciding what type restrictions to put in method signature
3.2 Using multiple dispatch in Julia
Rules for defining methods of a function
Method ambiguity problem
Improved implementation of winsorized mean
3.3 Working with packages and modules
What is a module in Julia?
How can packages be used in Julia?
Using StatsBase.jl to compute the winsorized mean
3.4 Using macros
4 Working with collections in Julia
4.1 Working with arrays
Getting the data into a matrix
Computing basic statistics of the data stored in a matrix
Indexing into arrays
Performance considerations of copying vs. making a view
Calculating correlations between variables
Fitting a linear regression
Plotting the Anscombe’s quartet data
4.2 Mapping key-value pairs with dictionaries
4.3 Structuring your data by using named tuples
Defining named tuples and accessing their contents
Analyzing Anscombe’s quartet data stored in a named tuple
Understanding composite types and mutability of values in Julia
5 Advanced topics on handling collections
5.1 Vectorizing your code using broadcasting
Understanding syntax and meaning of broadcasting in Julia
Expanding length-1 dimensions in broadcasting
Protecting collections from being broadcasted over
Analyzing Anscombe’s quartet data using broadcasting
5.2 Defining methods with parametric types
Most collection types in Julia are parametric
Rules for subtyping of parametric types
Using subtyping rules to define the covariance function
5.3 Integrating with Python
Preparing data for dimensionality reduction using t-SNE
Calling Python from Julia
Visualizing the results of the t-SNE algorithm
6 Working with strings
6.1 Getting and inspecting the data
Downloading files from the web
Using common techniques of string construction
Reading the contents of a file
6.2 Splitting strings
6.3 Using regular expressions to work with strings
Working with regular expressions
Writing a parser of a single line of movies.dat file
6.4 Extracting a subset from a string with indexing
UTF-8 encoding of strings in Julia
Character vs. byte indexing of strings
ASCII strings
The Char type
6.5 Analyzing genre frequency in movies.dat
Finding common movie genres
Understanding genre popularity evolution over the years
6.6 Introducing symbols
Creating symbols
Using symbols
6.7 Using fixed-width string types to improve performance
Available fixed-width strings
Performance of fixed-width strings
6.8 Compressing vectors of strings with PooledArrays.jl
Creating a file containing flower names
Reading in the data to a vector and compressing it
Understanding the internal design of PooledArray
6.9 Choosing appropriate storage for collections of strings
7 Handling time-series data and missing values
7.1 Understanding the NBP Web API
Getting the data via a web browser
Getting the data by using Julia
Handling cases when an NBP Web API query fails
7.2 Working with missing data in Julia
Definition of the missing value
Working with missing values
7.3 Getting time-series data from the NBP Web API
Working with dates
Fetching data from the NBP Web API for a range of dates
7.4 Analyzing data fetched from the NBP Web API
Computing summary statistics
Finding which days of the week have the most missing values
Plotting the PLN/USD exchange rate
Part 2 Toolbox for data analysis
8 First steps with data frames
8.1 Fetching, unpacking, and inspecting the data
Downloading the file from the web
Working with bzip2 archives
Inspecting the CSV file
8.2 Loading the data to a data frame
Reading a CSV file into a data frame
Inspecting the contents of a data frame
Saving a data frame to a CSV file
8.3 Getting a column out of a data frame
Understanding the data frame’s storage model
Treating a data frame column as a property
Getting a column by using data frame indexing
Visualizing data stored in columns of a data frame
8.4 Reading and writing data frames using different formats
Apache Arrow
SQLite
9 Getting data from a data frame
9.1 Advanced data frame indexing
Getting a reduced puzzles data frame
Overview of allowed column selectors
Overview of allowed row-subsetting values
Making views of data frame objects
9.2 Analyzing the relationship between puzzle difficulty and popularity
Calculating mean puzzle popularity by its rating
Fitting LOESS regression
10 Creating data frame objects
10.1 Reviewing the most important ways to create a data frame
Creating a data frame from a matrix
Creating a data frame from vectors
Creating a data frame using a Tables.jl interface
Plotting a correlation matrix of data stored in a data frame
10.2 Creating data frames incrementally
Vertically concatenating data frames
Appending a table to a data frame
Adding a new row to an existing data frame
Storing simulation results in a data frame
11 Converting and grouping data frames
11.1 Converting a data frame to other value types
Conversion to a matrix
Conversion to a named tuple of vectors
Other common conversions
11.2 Grouping data frame objects
Preparing the source data frame
Grouping a data frame
Getting group keys of a grouped data frame
Indexing a grouped data frame with a single value
Comparing performance of indexing methods
Indexing a grouped data frame with multiple values
Iterating a grouped data frame
12 Mutating and transforming data frames
12.1 Getting and loading the GitHub developers data set
Understanding graphs
Fetching GitHub developer data from the web
Implementing a function that extracts data from a ZIP file
Reading the GitHub developer data into a data frame
12.2 Computing additional node features
Creating a SimpleGraph object
Computing features of nodes by using the Graphs.jl package
Counting a node’s web and machine learning neighbors
12.3 Using the split-apply-combine approach to predict the developer’s type
Computing summary statistics of web and machine learning developer features
Visualizing the relationship between the number of web and machine learning neighbors of a node
Fitting a logistic regression model predicting developer type
12.4 Reviewing data frame mutation operations
Performing low-level API operations
Using the insertcols! function to mutate a data frame
13 Advanced transformations of data frames
13.1 Getting and preprocessing the police stop data set
Loading all required packages
Introducing the @chain macro
Getting the police stop data set
Comparing functions that perform operations on columns
Using short forms of operation specification syntax
13.2 Investigating the violation column
Finding the most frequent violations
Vectorizing functions by using the ByRow wrapper
Flattening data frames
Using convenience syntax to get the number of rows of a data frame
Sorting data frames
Using advanced functionalities of DataFramesMeta.jl
13.3 Preparing data for making predictions
Performing initial transformation of the data
Working with categorical data
Joining data frames
Reshaping data frames
Dropping rows of a data frame that hold missing values
13.4 Building a predictive model of arrest probability
Splitting the data into train and test data sets
Fitting a logistic regression model
Evaluating the quality of a model’s predictions
13.5 Reviewing functionalities provided by DataFrames.jl
14 Creating web services for sharing data analysis results
14.1 Pricing financial options by using a Monte Carlo simulation
Calculating the payoff of an Asian option definition
Computing the value of an Asian option
Understanding GBM
Using a numerical approach to computing the Asian option value
14.2 Implementing the option pricing simulator
Starting Julia with multiple-thread support
Computing the option payoff for a single sample of stock prices
Computing the option value
14.3 Creating a web service serving the Asian option valuation
A general approach to building a web service
Creating a web service using Genie.jl
Running the web service
14.4 Using the Asian option pricing web service
Sending a single request to the web service
Collecting responses to multiple requests from a web service in a data frame
Unnesting a column of a data frame
Plotting the results of Asian option pricing
appendix A First steps with Julia
appendix B Solutions to exercises
appendix C Julia packages for data science
index
front matter
foreword
Today, the world is awash with lots of software tools for data analysis. The reader may wonder, why Julia for Data Analysis? This book answers both the why and the how.
Since the reader may not be familiar with me, I would like to introduce myself. I am one of the creators of the Julia language and co-founder and CEO of Julia Computing. We started the Julia language with a simple idea—build a language that is as fast as C, but as easy as R and Python. This simple idea has had an immense impact in a lot of different areas as the Julia community has built a wonderful set of abstractions and infrastructure surrounding it. Bogumił, along with many co-contributors, has built a high performance and easy-to-use package ecosystem for data analysis.
Now, you may wonder, why one more library? Julia’s data analysis ecosystem is built from the ground up leveraging some of the fundamental ideas in Julia itself. These libraries are Julia all the way down, meaning they have been implemented fully in Julia: the DataFrames.jl library for working with data, the CSV.jl library for reading data, the JuliaStats ecosystem for statistical analysis, and so on. These libraries have built on ideas specifically developed in R and taken forward. For example, the infrastructure for working with missing data in Julia is a core part of the Julia ecosystem. It took many years to get it right and to make the Julia compiler efficient in order to reduce the overhead of working with missing data. A completely Julia native DataFrames.jl library means that you no longer have to be restricted to vectorized coding style for high performance data analysis. You can simply write for loops over multi-gigabyte datasets, use multi-threading for parallel data processing, integrate with computational libraries in the Julia ecosystem, and even deploy these as web APIs to be consumed by other systems. All these features are presented in the book. One of the things I really enjoyed in this book is that the examples that Bogumił introduces to the reader are not just neat, small, tabular datasets, but real-world data, for instance, a set of chess puzzles with 2 million rows!
The book is divided into two parts. The first part introduces the basic concepts of the Julia language, introducing the type system, multiple dispatch, data structures, etc. The second part then builds on these concepts and presents data analysis—reading data, selecting, creating a DataFrame, split-apply-combine, sorting, joining, and reshaping—and finally finishes with a complete application. There is also a discussion of the Arrow data exchange format that allows Julia programs to co-exist with data analysis tools in R, Python, and Spark, to mention a few. The code patterns in all the chapters teach the reader good practices that result in high-performance data analysis.
Bogumił is not only a major contributor to Julia’s data analysis and statistical ecosystem, but also has built several courses (like the one on JuliaAcademy) and has blogged extensively about the internals of these packages. Thus, he is one of the best authors to present how Julia can effectively be used for data analysis.
Viral Shah, Co-founder and CEO of Julia Computing
preface
I have been using the Julia language since 2014. Before that, I mainly used R for data analysis (Python was not then mature enough in the field). However, in addition to exploring data and building machine learning models, I often needed to implement custom compute-intensive code, which required days to finish the computations. I mostly worked with C or Java for such applications. Constantly switching between programming languages was a pain.
After I learned about Julia, I immediately felt that it was an exciting technology matching my needs. Even in its early days (before its 1.0 release), I was able to successfully use it in my projects. However, as with every new tool, it still needed to be polished.
Then I decided to start contributing to the Julia language and to packages related to data management functionalities. Over the years, my focus evolved, and I ended up as one of the main maintainers of the DataFrames.jl package. I am convinced that Julia is now ready for serious applications, and DataFrames.jl has reached a state of stability and is feature rich. Therefore, I decided to write this book sharing my experiences with using Julia for data analysis.
I have always believed that it’s important for software to not only provide great functionality, but to also offer adequate documentation. For this reason, for several years I have maintained these online resources: The Julia Express (https://github.com/bkamins/The-Julia-Express), a tutorial giving a quick introduction to the Julia language; An Introduction to DataFrames.jl (https://github.com/bkamins/Julia-DataFrames-Tutorial), a collection of Jupyter notebooks; and a weekly blog about Julia (https://bkamins.github.io/). Additionally, last year Manning invited me to prepare the Hands-On Data Science with Julia liveProject (https://www.manning.com/liveprojectseries/data-science-with-julia-ser), a set of exercises covering common data science tasks.
Having written all these teaching materials, I felt strongly that a piece of the puzzle was still missing. People who wanted to start doing data science with Julia had a hard time finding a book that would gradually introduce them to the fundamentals required in order to perform data analysis using Julia. This book fills this gap.
The Julia ecosystem has hundreds of packages that can be used in your data science projects, and new ones are being registered daily. My objective for this book is to teach Julia’s most important features and selected popular packages that any user will find useful when doing data analysis. After reading the book, you should be ready to do the following on your own:
Perform data analysis with Julia.
Learn the functionalities provided by specialized packages that go beyond data analysis and are useful when doing data science projects. Appendix C provides an overview of tools I recommend that are available in the Julia ecosystem, categorized by application area.
Comfortably study more advanced aspects of Julia that are relevant for package developers.
Benefit from discussions about Julia on social media such as Discourse (https://discourse.julialang.org/), Slack (https://julialang.org/slack/), and Zulip (https://julialang.zulipchat.com/register/), confident that you understand the key concepts and terminology that other users reference in their comments.
acknowledgments
This book is an important part of my journey with the Julia language. Therefore, I would like to thank many people for helping me.
Let me start by thanking the Julia community members from whom I’ve both learned a lot and taken inspiration for my contributions. There are too many of them to name, so I had the hard choice of picking a few. In my early days, Stefan Karpinski helped me a lot in getting started as a Julia contributor when I supported his efforts toward shaping the string-processing functionalities in Julia. In the data science ecosystem, Milan Bouchet-Valat has been my most important partner for many years now. His custodianship efforts on the Julia data and statistics ecosystem are invaluable. The most important thing I learned from him is attention to detail and consideration of the long-term consequences of design decisions that package maintainers make. The next key person is Jacob Quinn, who designed and implemented a large part of the functionalities I discuss in this book. Finally, I would like to mention Peter Deffebach and Frames Catherine White, who are both significant contributors to the Julia data analysis ecosystem and are always ready to provide invaluable comments and advice from the package users’ perspective.
I would also like to acknowledge my editor at Manning, Marina Michaels, technical editor Chad Scherrer, and technical proofreader German Gonzalez-Morris, as well as the reviewers who took the time to read my manuscript at various stages during its development and who provided invaluable feedback: Ben McNamara, Carlos Aya-Moreno, Clemens Baader, David Cronkite, Dr. Mike Williams, Floris Bouchot, Guillaume Alleon, Joel Holmes, Jose Luis Manners, Kai Gellien, Kay Engelhardt, Kevin Cheung, Laud Bentil, Marco Carnini, Marvin Schwarze, Mattia Di Gangi, Maureen Metzger, Maxim Volgin, Milan Mulji, Neumann Chew, Nikos Tzortzis Kanakaris, Nitin Gode, Orlando Méndez Morales, Patrice Maldague, Patrick Goetz, Peter Henstock, Rafael Guerra, Samuel Bosch, Satej Kumar Sahu, Shiroshica Kulatilake, Sonja Krause-Harder, Stefan Pinnow, Steve Rogers, Tom Heiman, Tony Dubitsky, Wei Luo, Wolf Thomsen, and Yongming Han. I also thank the entire Manning team that worked with me on the production and promotion of the book: Deirdre Hiam, my project manager; Sharon Wilkey, my copyeditor; and Melody Dolab, my page proofer.
Finally, I would like to express my gratitude to my scientific collaborators, especially Tomasz Olczak, Paweł Prałat, Przemysław Szufel, and François Théberge, with whom I’ve published multiple papers using the Julia language.
about this book
This book was written in two parts to help you get started using Julia for data analysis. It begins by explaining Julia’s most important features that are useful in such applications. Next, it discusses the functionalities of selected core packages used in data science projects.
The material is built around complete data analysis projects, starting from data collection, through data transformation, and finishing with visualization and building basic predictive models. My objective is to teach you the fundamental concepts and skills that are useful in any data science project.
This book does not require prior knowledge of advanced machine learning algorithms. This knowledge is not necessary for understanding the fundamentals of data analysis in Julia, and I do not discuss such models in this book. I do assume that you have knowledge of basic data science tools and techniques such as generalized linear regression or LOESS regression. Similarly, from a data engineering perspective, I cover the most common operations, including fetching data from the web, writing a web service, working with compressed files, and using basic data storage formats. I left out functionalities that require either additional complex configuration that is not Julia related or specialist software engineering knowledge.
Appendix C reviews the Julia packages that provide advanced functionalities in the data engineering and data science domains. Using the knowledge you glean from this book, you should be able to confidently learn to use these packages on your own.
Who should read this book
This book is for data scientists or data engineers who would like to learn how Julia can be used for data analysis. I assume that you have some experience in doing data analysis using a programming language such as R, Python, or MATLAB.
How this book is organized: A roadmap
The book, which is divided into two parts, has 14 chapters and three appendices.
Chapter 1 provides an overview of Julia and explains why it is an excellent language for data science projects.
The chapters in part 1 follow, teaching you essential Julia skills that are most useful in data analysis projects. These chapters are essential for readers who do not know the Julia language well. However, I expect that even people who use Julia will find useful information here, as I have selected the topics for discussion based on issues commonly reported as difficult. This part is not meant to be a complete introduction to the Julia language, but rather is written from the perspective of usefulness in data science projects. The part 1 chapters are as follows:
Chapter 2 discusses the basics of Julia’s syntax and common language constructs and the most important aspects of variable scoping rules.
Chapter 3 introduces Julia’s type system and methods. It also introduces working with packages and modules. Finally, it discusses using macros.
Chapter 4 covers working with arrays, dictionaries, tuples, and named tuples.
Chapter 5 discusses advanced topics related to working with collections in Julia, including broadcasting and subtyping rules for parametric types. It also covers integrating Julia with Python.
Chapter 6 teaches you how to work with strings in Julia. Additionally, it covers the topics of using symbols, working with fixed-width strings, and compressing vectors by using the PooledArrays.jl package.
Chapter 7 concentrates on working with time-series data and missing values. It also covers fetching data by using HTTP queries and parsing JSON data.
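As a rough taste of three of the part 1 topics listed above (methods and multiple dispatch, broadcasting, and missing values), here is a minimal sketch that uses only Julia's Base and the Statistics standard library. The function name kind and all sample values are made up for illustration and are not taken from the book.

```julia
using Statistics  # standard library; provides mean

# Multiple dispatch: one function with several methods,
# selected by the type of the argument.
kind(x::Number) = "number"
kind(x::AbstractString) = "string"
kind(::Missing) = "missing value"

items = [1, "a", missing]

# Broadcasting: the dot applies kind to each element of items.
labels = kind.(items)

# Missing values: skipmissing drops them before aggregating.
rates = [4.0, missing, 4.5]
avg = mean(skipmissing(rates))

println(labels)
println(avg)
```

Calling mean(rates) directly would return missing, because missing values propagate through most computations unless they are skipped explicitly.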
In part 2, you’ll learn how to build data analysis pipelines with the help of the DataFrames.jl package. While, in general, you could perform data analysis using only the data structures you will learn in part 1, building your data analysis workflows by using data frames will be easier and at the same time will ensure that your code is efficient. Here’s what you’ll learn in part 2:
Chapter 8 teaches you how to create a data frame from a CSV file and perform basic operations on data frames. It also shows how to process data stored in the Apache Arrow format and in SQLite databases, work with compressed files, and do basic data visualization.
Chapter 9 shows you how to select rows and columns from a data frame. You will also learn how to build and visualize locally estimated scatterplot smoothing (LOESS) regression models.
Chapter 10 covers various ways of creating new data frames and populating existing data frames with new data. It discusses the Tables.jl interface, an implementation-independent abstraction of a table concept. You will also learn to integrate Julia with R and to serialize Julia objects.
Chapter 11 teaches you how to convert data frames into objects of other types. One of the fundamental types is the grouped data frame. You will also learn about the important general concepts of type-stable code and type piracy.
Chapter 12 focuses on transformation and mutation of data frame objects—in particular, using the split-apply-combine strategy. Additionally, this chapter covers the basics of using the Graphs.jl package to work with graph data.
Chapter 13 discusses advanced data frame transformation options provided by the DataFrames.jl package, as well as data frame sorting, joining, and reshaping. It also teaches you how to chain multiple operations in data processing pipelines. From a data science perspective, this chapter shows you how to work with categorical data and evaluate classification models in Julia.
Chapter 14 shows you how to build a web service in Julia that serves data produced by an analytical algorithm. Additionally, it shows you how to implement Monte Carlo simulations and make them run faster by taking advantage of Julia’s multithreading capabilities.
The book ends with three appendices. Appendix A provides essential information about Julia’s installation and configuration, as well as common tasks related to working with Julia—in particular, package management. Appendix B contains solutions to the exercises presented in the chapters. Appendix C gives a review of the Julia package ecosystem that you will find useful in your data science and data engineering projects.
About the code
This book contains many examples of source code, both in numbered listings and in line with normal text. In both cases, source code is formatted in a fixed-width font like this to separate it from ordinary text. Sometimes code is also in bold to highlight code that has changed from previous steps in the chapter, such as when a new feature adds to an existing line of code.
Additionally, comments in the source code have often been removed from the listings when the code is described in the text. Code annotations accompany many of the listings, highlighting important concepts.
All the code used in this book is available on GitHub at https://github.com/bkamins/JuliaForDataAnalysis. The code examples are intended to be executed in an interactive session in the terminal. Therefore, in the book, in most cases, the code blocks show both Julia input prefixed with the julia> prompt and the produced output below the command. This style matches the display in your terminal. Here is an example:
julia> 1 + 2 ❶
3 ❷
❶ 1 + 2 is the Julia code executed by the user.
❷ 3 is the output printed by Julia in the terminal.
All the material presented in this book can be run on Windows, macOS, or Linux. You should be able to run all examples on a machine with 8 GB of RAM. However, some code listings require more RAM; in those cases, I give a warning in the book.
How to run the code presented in the book
To ensure that all code presented in the book runs correctly on your machine, it is essential that you first follow the configuration steps described in appendix A.
This book was written and tested with Julia 1.7.
An especially important point is that before running example code, you should always activate the project environment provided in the book’s GitHub repository at https://github.com/bkamins/JuliaForDataAnalysis.
In particular, in the book, we use the DataFrames.jl package a lot. All the code is written and tested in version 1.3 of this package. You can find versions of all other packages used in the book in the Manifest.toml file available in the book’s GitHub repository.
The code presented in the book is not meant to be executed by copying and pasting it to your Julia session. Always use the code that you can find in the book’s GitHub repository. For each chapter, the repository has a separate file containing all code from that chapter.
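As a rough sketch of what activating the project environment can look like, the following uses Julia's standard Pkg API. It assumes that your Julia session was started in the folder holding your local copy of the book's repository (the "." path below stands for that folder and is an assumption, not something the book prescribes here); appendix A remains the authoritative description of the setup.

```julia
using Pkg

# Assumption: the current working directory is your local copy of the
# book's repository, which contains Project.toml and Manifest.toml.
Pkg.activate(".")    # switch to the project environment in this folder
Pkg.instantiate()    # install the package versions recorded in Manifest.toml
Pkg.status()         # list the packages available in the active environment
```

After these steps, packages such as DataFrames.jl should load in the versions recorded in the repository rather than in whatever versions happen to be installed globally.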
liveBook discussion forum
Purchase of Julia for Data Analysis includes free access to liveBook, Manning’s online reading platform. Using liveBook’s exclusive discussion features, you can attach comments to the book globally or to specific sections or paragraphs. It’s a snap to make notes for yourself, ask and answer technical questions, and receive help from the author and other users. To access the forum, go to https://livebook.manning.com/book/julia-for-data-analysis/discussion. You can also learn more about Manning's forums and the rules of conduct at https://livebook.manning.com/discussion.
Manning’s commitment to our readers is to provide a venue where a meaningful dialogue between individual readers and between readers and the author can take place. It is not a commitment to any specific amount of participation on the part of the author, whose contribution to the forum remains voluntary (and unpaid). We suggest you try asking the author some challenging questions lest his interest stray! The forum and the archives of previous discussions will be accessible from the publisher’s website as long as the book is in print.
Other online resources
Here is a list of selected online resources that you might find useful when reading this book:
DataFrames.jl documentation (https://dataframes.juliadata.org/stable/) with links to tutorials
Hands-on Data Science with Julia liveProject (https://www.manning.com/liveprojectseries/data-science-with-julia-ser), designed as a follow-up resource you can use after reading this book to test your skills and learn how to use advanced machine learning models with Julia
My weekly blog (https://bkamins.github.io/), where I write about the Julia language
In addition, there are numerous valuable sources of general information on Julia. Here is a selection of some of the most popular ones:
The Julia language website (https://julialang.org)
JuliaCon conference (https://juliacon.org)
Discourse (https://discourse.julialang.org)
Slack (https://julialang.org/slack/)
Zulip (https://julialang.zulipchat.com/register/)
Forem (https://forem.julialang.org)
Stack Overflow (https://stackoverflow.com/questions/tagged/julia)
Julia YouTube channel (www.youtube.com/user/julialanguage)
Talk Julia podcasts (www.talkjulia.com)
JuliaBloggers blog aggregator (https://www.juliabloggers.com)
about the author
Bogumił Kamiński is a lead developer of DataFrames.jl, the core package for data manipulation in the Julia ecosystem. He has over 20 years of experience delivering data science projects for corporate customers. Bogumił also has over 20 years of experience teaching data science at the undergraduate and graduate levels.
about the cover illustration
The figure on the cover of Julia for Data Analysis is “Prussienne de Silésie,” or “Prussian of Silesia,” taken from a collection by Jacques Grasset de Saint-Sauveur, published in 1797. Each illustration is finely drawn and colored by hand.
In those days, it was easy to identify where people lived and what their trade or station in life was just by their dress. Manning celebrates the inventiveness and initiative of the computer business with book covers based on the rich diversity of regional culture centuries ago, brought back to life by pictures from collections such as this one.
1 Introduction
This chapter covers
Julia’s key features
Why do data science with Julia?
Patterns for data analysis in Julia
Data analysis has become one of the core processes in virtually any professional activity. The collection of data has become easier and less expensive, so we have easy access to it. The crucial aspect is that data analysis allows us to make better decisions more cheaply and quickly.
The need for data analysis has given rise to several new professions, among which a data scientist often comes to mind first. A data scientist is a person skilled at collecting data, analyzing it, and producing actionable insights. As with all craftsmen, data scientists need tools that will help them deliver their products efficiently and reliably.
Various software tools can help data scientists do their jobs. Some of those tools use a graphical interface and thus are easy to work with, but also usually have limitations on how they can be used. The vast array of tasks that data scientists need to do typically leads them to quickly conclude that they need to use a programming language to achieve the required flexibility and expressiveness.
Developers have come up with many programming languages that data scientists commonly use. One is Julia, which was designed to address challenges that data scientists face when using other tools. Quoting the Julia creators, it “runs like C, but reads like Python.”
Julia, like Python, supports an efficient and convenient development process. At the same time, programs developed in Julia have performance comparable to C.
In section 1.1, we will discuss the results of example benchmarks supporting these claims. Notably, in 2017, a program written in Julia achieved a peak performance of 1.54 petaflops (quadrillions of floating-point operations per second) using 1.3 million threads when processing astronomical image data. Previously, only software implemented in C, C++, and Fortran had achieved processing speeds of over 1 petaflop (https://juliacomputing.com/case-studies/celeste/).
In this book, you’ll learn how to use the Julia language to perform tasks that data scientists need to do routinely: reading and writing data in different formats, as well as transforming, visualizing, and analyzing it.
1.1 What is Julia and why is it useful?
Julia is a high-level programming language that also offers high execution speed. Julia programs are fast to both create and run. In this section, I discuss the reasons why Julia is becoming increasingly popular among data scientists.
Various programming languages are commonly used for data analysis, such as (in alphabetical order) C++, Java, MATLAB, Python, R, and SAS. Some of these languages—for instance, R—were designed to be very expressive and easy to use in data science tasks; however, this typically comes at a cost of slower execution times of their programs. Other languages, like C++, are more low level, which allows them to process data quickly; unfortunately, the user usually must pay the price of writing more verbose code with a lower level of abstraction.
Figure 1.1 compares the execution speed and code size (one of the possible measures of programming language expressiveness) of C, Java, Python, and Julia for 10 selected problems. Since these comparisons are always hard to do objectively, I have chosen the Computer Language Benchmarks Game (http://mng.bz/19Ay), which has a long history of development and maintainers who have tried, in my opinion, to make it as objective as possible.
On both subplots in figure 1.1, C has a reference value of 1 for each problem; values smaller than 1 show that the code runs faster (left plot) or is smaller (right plot) than C. On the left plot, the y-axis representing execution time has a logarithmic scale. Code size on the right plot is the size of the gzip archive of the program written in each language.
In terms of execution speed (left plot), C is fastest, and Julia (represented with circles) comes in second. Notably, Python (represented with diamonds) is, in many tasks, orders of magnitude slower than all other displayed languages (I had to plot the y-axis on a log scale to make the left plot legible).
When considering the code size (right plot), Julia leads in 8 of 10 tasks, while for C and Java, we see the largest measurements. In addition to code size, a language’s ease of use is also relevant. I prepared the plots in figure 1.1 in Julia in an interactive session that allowed me to easily tune it; you can check the source code in the GitHub repository accompanying the book (https://github.com/bkamins/JuliaForDataAnalysis). This would also be convenient in Python, but more challenging with Java or C.
Figure 1.1 Comparing code size and execution speed of C, Python, Java, and Julia for 10 selected computational problems
In the past, developers faced a tradeoff between language expressiveness and speed. However, in practice, they wanted both. The ideal programming language should be easy to learn and use, like Python, but at the same time allow high-speed data processing like C.
This often required data scientists to use two languages in their projects. They prototyped their algorithms in an easy-to-code language (for example, Python) and then identified performance bottlenecks and ported selected parts of the code to a fast language (for example, C). This translation takes time and can introduce bugs. Maintaining a codebase that has significant parts written in two programming languages can be challenging and introduces the complications of integrating several technologies. Finally, when working on challenging and novel problems, having code written in two programming languages makes quick experimentation difficult, which increases the time from the product’s concept to its market availability.
Timeline case study
Let me give you an example from my experience of working with Julia. Timeline is a web app that helps financial advisers with retirement financial planning. Such an application, to supply reliable recommendations, requires a lot of on-demand calculations. Initially, Timeline’s creators began prototyping in MATLAB, switching to Elixir for online deployment. I was involved in migrating the solution to Julia.
After the code rewrite, the system’s online query time was reduced from 40 seconds to 0.6 seconds. To assess the business value of such a speedup, imagine you are a Timeline user having to wait for 40 seconds for your web browser’s response. Now assume the wait is 0.6 seconds. Apart from increased customer satisfaction, faster processing time also decreases the cost and complexity of the technical infrastructure required to operate this system.
However, execution speed is only one aspect of the change. The other is that Timeline reports that switching to Julia saved tens of thousands of dollars in programming time and debugging. Software developers have less code to write, while data scientists who communicate with them now use the same tool. You can find out more about this use case at https://juliacomputing.com/case-studies/timeline/.
In my opinion, the Timeline example is especially relevant for managers of data science teams that deploy the results of their work to production. Even a single developer will appreciate the productivity boost of using a single language for prototyping and writing high-performance production code. However, the real gains in time to production and development cost are visible when you have a mixed team of data scientists, data engineers, and software developers that can use a single tool when collaborating.
The Timeline case study shows how Julia was used to replace the combination of MATLAB and Elixir languages in a real-life business application. To complement this example, it’s instructive to check which languages are used to develop popular open source software projects that data scientists routinely use (statistics collected on October 11, 2021). Table 1.1 shows the top two programming languages used (in percentages of lines of source code) to implement three R and Python packages.
Table 1.1 Languages used to implement selected popular open source packages
All these examples share a common feature: data scientists want to use a high-level language, like Python or R, but because parts of the code are too slow, the package writer must switch to a lower-level language, like C or C++.
To solve this challenge, a group of developers created the Julia language. In their manifesto, “Why We Created Julia,” Julia’s developers call this issue the two-language problem (http://mng.bz/Poag).
The beauty of Julia is that we do not have to make such a choice. It offers data scientists a language that is high level, easy to use, and fast. This fact is reflected by the source code structure of Julia and its packages. Table 1.2 lists packages approximately matching the functionality of those in table 1.1.
Table 1.2 Julia packages matching functionality of packages listed in table 1.1
All of these packages are written purely in Julia. But is this important for users?
As I also did several years ago, you might think that this feature is more relevant for package developers than for end-user data scientists. Python and R have mature package ecosystems, and you can expect that most compute-intensive algorithms are already implemented in a library that you can use. This is indeed true, but we quickly hit three significant limitations when moving from implementing toy examples to complex production solutions:
“Most algorithms” is different from “all algorithms.” While in most of your code you can rely on the packages, once you start doing more advanced projects, you quickly realize that you’ll write your own code that needs to be fast. Most likely, you do not want to switch the programming language you use for such tasks.
Many libraries providing implementations of data science algorithms allow users to pass custom functions that are meant to perform computations as a part of the main algorithm. An example is passing an objective function (also called a loss function) to an algorithm that performs training of a neural network. Typically, during this training, the objective function is evaluated many times. If you want your computations to be fast, you need to make sure that evaluation of the objective function is fast.
If you are using Julia, you have the flexibility of defining custom functions the way you want and can be sure that the whole program will run fast. The reason is that Julia compiles code (both library code and your custom code) together, thus allowing optimizations that are not possible when precompiled binaries are used or when a custom function is written in an interpreted language. Examples of such optimizations are function inlining (https://compileroptimizations.com/category/function_inlining.htm) and constant propagation (https://compileroptimizations.com/category/constant_propagation.htm). I do not discuss these topics in detail as you will not need to know exactly how the Julia compiler works in order to use it efficiently; you can refer to the preceding links for more information about compiler design.
As a user, you will want to analyze the source code of packages you use, because you’ll often need to understand in detail how something is implemented. This is much easier to do if the package is implemented in a high-level language. What is more, in some cases, you’ll want to use the package’s source code—for example, as a starting point for implementing a feature that its designers have not envisioned. That is simpler to do if the package is written in the same language as the language you use to call it.
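The second limitation, passing a custom function to a library algorithm, can be illustrated with a small sketch. The golden_section_minimize function below is a hypothetical stand-in for a library routine (it is not taken from any package or from the book), and loss is a made-up objective function; the point is only that both are ordinary Julia functions, so Julia can compile them together.

```julia
# Hypothetical "library" routine: golden-section search for the minimum
# of a unimodal function f on the interval [a, b].
function golden_section_minimize(f, a, b; tol=1e-8)
    invphi = (sqrt(5) - 1) / 2        # inverse of the golden ratio, about 0.618
    c = b - invphi * (b - a)
    d = a + invphi * (b - a)
    while abs(b - a) > tol
        if f(c) < f(d)
            b = d                     # the minimum lies in [a, d]
        else
            a = c                     # the minimum lies in [c, b]
        end
        c = b - invphi * (b - a)
        d = a + invphi * (b - a)
    end
    return (a + b) / 2
end

# User-defined objective (loss) function, written in plain Julia.
loss(x) = (x - 2.0)^2

xmin = golden_section_minimize(loss, 0.0, 5.0)
println(xmin)
```

The objective function is evaluated twice per loop iteration here, so its cost dominates the run time, which is the situation described above for training algorithms that call a loss function many times.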
To explain the claims presented here in more detail, the next section presents the key features of Julia that data scientists typically find essential.
1.2 Key features of Julia from a data scientist’s perspective
Julia and its package ecosystem have five key characteristics that are relevant for a data scientist:
Speed of code execution
Designed for interactive use
Composability, leading to highly reusable code that is easy to maintain
Package management
Ease of integration with other languages
Let’s dive into each of these features in more detail.
1.2.1 Julia is fast because it is a compiled language
We start with execution speed, as this is the first promise Julia makes. The key design element that enables this feature is that Julia is a compiled language. In general, before Julia code is executed, it is compiled to native assembly instructions, using the LLVM technology (https://llvm.org/). The choice to use LLVM ensures that Julia programs are easily portable across various computing environments and that their execution speed is highly optimized. Other programming languages, like Rust and Swift, also use LLVM for the same reasons.
The fact that Julia is compiled has one major benefit from a performance perspective. The trick is that the compiler can perform many optimizations that do not change the result of running the code but improve its performance. Let’s see this at work. The following example code should be easy to understand, even for those of you without prior experience with Julia:
julia> function sum_n(n)
           s = 0
           for i in 1:n
               s += i
           end
           return s
       end
sum_n (generic function with 1 method)

julia> @time sum_n(1_000_000_000)
  0.000001 seconds
500000000500000000
Note You can find an introduction to Julia syntax in chapter 2, and appendix A will guide you through the process of Julia’s installation and configuration.
In this example, we define the function sum_n that takes one parameter, n, and calculates the sum of numbers from 1 to n. Next, we call this function, asking to produce a sum for n equal to one billion. The @time annotation in front of the function call asks Julia to print the execution time of our code (technically, it is a macro, which I explain in chapter 3). As you can see, the result is produced very fast.
You can probably imagine that executing one billion iterations of the loop defined in the body of the sum_n function in this time frame would be impossible; it surely would have taken much more time. Indeed, this is the case. What the Julia compiler did is realize that we are taking a sum of a sequence of numbers, so it applied a well-known formula for a sum of numbers from 1 to n, which is n(n + 1)/2. This allows Julia to drastically reduce the computation time.
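As a small check of the formula mentioned above, the following sketch compares the loop-based sum_n with the closed form n(n + 1)/2. The closed_form helper is introduced here only for illustration, and the example assumes a 64-bit Julia installation, where the result fits in the default integer type.

```julia
# The loop-based definition from the listing above.
function sum_n(n)
    s = 0
    for i in 1:n
        s += i
    end
    return s
end

# Illustrative helper: the closed-form sum of the integers from 1 to n.
# Integer division (÷) is exact here because n * (n + 1) is always even.
closed_form(n) = n * (n + 1) ÷ 2

n = 1_000_000_000
println(sum_n(n) == closed_form(n))
```

In an interactive session, you can also run @code_llvm sum_n(10) to print the LLVM code that Julia generates for this function; if the optimization was applied, you would expect that code not to contain a loop.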
This is only one example of an optimization that the Julia compiler can perform. Admittedly, implementations of languages like R or Python also try to perform optimizations to speed up code execution. However, in Julia, more information about the types of processed values and the structure of the executed code is available during compilation, and therefore many more optimizations are possible. “Julia: A Fresh Approach to Numerical Computing” by Jeff Bezanson et al. (the creators of the language; see http://mng.bz/JVvP) provides more detailed explanations about the design of Julia.
This is just one example of how the fact that Julia is compiled can speed up code execution. If you are interested in analyzing the source code of carefully designed benchmarks comparing different programming languages, I recommend you check out the Computer Language Benchmarks Game (http://mng.bz/19Ay) that I used to create figure 1.1.
Another related aspect of Julia is that it has built-in support for multithreading (using several processors of your machine in computations) and distributed computing (being able to use several machines in computations). Also, by using additional packages like CUDA.jl (https://github.com/JuliaGPU/CUDA.jl), you can run Julia code on GPUs (have I mentioned that this package is 100% written in Julia?). This essentially means that Julia allows you to fully use the computing resources you have available to reduce the time you need to wait for the results of your computations.
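To give a feel for the built-in multithreading support, here is a minimal sketch that uses the Threads module shipped with Julia. The squares vector and the loop body are made up for illustration; the code also runs correctly in a session that has only a single thread.

```julia
using Base.Threads

# Number of threads available in this session (1 unless Julia was started
# with more, for example by using the -t command-line option).
println("threads: ", nthreads())

squares = zeros(Int, 8)

# Each iteration writes to its own element of the vector, so the
# iterations do not interfere with one another.
@threads for i in eachindex(squares)
    squares[i] = i^2
end

println(squares)
```

The loop body is deliberately trivial. In practice, you would put a compute-intensive task there, such as the Monte Carlo simulations that chapter 14 speeds up with multithreading.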
1.2.2 Julia provides full support for interactive workflows
A natural question you might now ask is this: Since Julia is compiled to native machine code, how is it possible that data scientists, who do most of their work in an exploratory and interactive manner, find it convenient to use? Typically, when we use compiled languages, we have an explicit separation of compilation and execution phases, which does not play well with the need for a responsive environment.
But here comes the second feature of the Julia language: it is designed for interactive use. In addition to running Julia scripts, you can use the following:
An interactive shell, typically called a read-eval-print loop (REPL).
Jupyter Notebook (you might have heard that Jupyter’s name is a reference to the three core programming languages that are supported: Julia, Python and R).
Pluto.jl notebooks (https://github.com/fonsp/Pluto.jl), which, using the speed of Julia, take the concept of a notebook to the next level. When you change something in your code, Pluto.jl automatically updates all affected computation results in the entire notebook.
In all these scenarios, the Julia code is compiled when the user tries to execute it. Therefore, the compilation and execution phases are blended and hidden away from the user, ensuring an experience that is like using an interpreted language.
The similarity does not end at this point; like R or Python, Julia is dynamically typed. Therefore, when writing your code, you do not have to (but can) specify the types of variables you use. The beauty of the Julia design is that because it is compiled, this dynamism still allows Julia programs to run fast.
It is important to highlight here that it is only the user who does not have to annotate the types of variables used. When running the code, Julia is aware of these types. This not only ensures the speed of code execution but also allows for writing highly composable software. Most Julia programs try to follow the well-known UNIX principle: do one thing and do it well. You’ll see one example in the next section and will learn many more throughout this book.
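The point that type annotations are optional for the user, while Julia still knows the types at run time, can be sketched as follows. The function names double and double_int are hypothetical and serve only as an illustration.

```julia
# No type annotation: the same definition works for any argument type
# that supports multiplication by a number.
double(x) = 2x

# Optional annotation: this variant accepts only integer arguments.
double_int(x::Integer) = 2x

a = double(3)      # called with an integer
b = double(1.5)    # called with a floating-point number

# Julia tracks the concrete type of every value, even though the
# definition of double does not mention any types.
println(typeof(a))
println(typeof(b))
println(double_int(4))
```

Calling double_int(1.5) would raise a MethodError, because no method of double_int accepts a floating-point argument.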
1.2.3 Julia programs are highly reusable and easy to compose together
When writing a function in Python, you often must think about whether the user will pass a standard list, a NumPy ndarray,