Ultimate Parallel and Distributed Computing with Julia For Data Science: Excel in Data Analysis, Statistical Modeling and Machine Learning by leveraging MLBase.jl and MLJ.jl to optimize workflows
About this ebook
Key Features
● Comprehensive Learning Journey from fundamentals of Julia ML to advanced techniques.
● Immersive practical approach with real-world examples, exercises, and scenarios, ensuring immediate application of acquired knowledge.
● Delve into the unique features of Julia and unlock its true potential to excel in modern ML applications.
Book Description
This book takes you through a step-by-step learning journey, starting with the essentials of Julia's syntax, variables, and functions. You'll unlock the power of efficient data handling by leveraging Julia arrays and DataFrames.jl for insightful analysis. Develop expertise in both basic and advanced statistical models, providing a robust toolkit for deriving meaningful data-driven insights. The journey continues with machine learning proficiency, where you'll implement algorithms confidently using MLJ.jl and MLBase.jl, paving the way for advanced data-driven solutions. Explore the realm of Bayesian inference skills through practical applications using Turing.jl, enhancing your ability to extract valuable insights. The book also introduces crucial Julia packages such as Plots.jl for visualizing data and results.
The handbook culminates in optimizing workflows with Julia's parallel and distributed computing capabilities, ensuring efficient and scalable data processing using Distributions.jl, Distributed.jl and SharedArrays.jl. This comprehensive guide equips you with the knowledge and practical insights needed to excel in the dynamic field of data science and machine learning.
What you will learn
● Master Julia ML Basics to gain a deep understanding of Julia's syntax, variables, and functions.
● Efficient Data Handling with Julia arrays and DataFrames for streamlined and insightful analysis.
● Develop expertise in both basic and advanced statistical models for informed decision-making through Statistical Modeling.
● Achieve Machine Learning Proficiency by confidently implementing ML algorithms using MLJ.jl and MLBase.jl.
● Apply Bayesian Inference Skills with Turing.jl for advanced modeling techniques.
● Optimize workflows using Julia's Parallel Processing Capabilities and Distributed Computing for efficient and scalable data processing.
Table of Contents
1. Julia In Data Science Arena
2. Getting Started with Julia
3. Features Assisting Scaling ML Projects
4. Data Structures in Julia
5. Working With Datasets In Julia
6. Basics of Statistics
7. Probability Data Distributions
8. Framing Data in Julia
9. Working on Data in DataFrames
10. Visualizing Data in Julia
11. Introducing Machine Learning in Julia
12. Data and Models
13. Bayesian Statistics and Modeling
14. Parallel Computation in Julia
15. Distributed Computation in Julia
Index
Book preview
Ultimate Parallel and Distributed Computing with Julia For Data Science - Nabanita Dash
CHAPTER 1
Julia In Data Science Arena
Introduction
This chapter introduces the key terms in the title of the book: Julia, data science, and machine learning. We will discuss what data science and machine learning are and why they are needed today to solve and automate challenging problems, and we will look at the technologies that help us tackle these problems. You will gain a solid understanding of the Julia programming language and of the reasons for using it to address data-related problems. Julia is a young, actively developed language, and like every programming language it has gaps. In most cases its advantages outweigh its flaws, and at the end of the chapter we will go through some of its shortcomings.
Structure
In this chapter, we will discuss the following topics:
Introducing data science
Defining data science
The need for task automation
Introducing statistics
Introducing machine learning
Drawing correlations from raw data
Explaining the need for data analysis
Introducing Julia
Astounding Julia language!!
Julia: ideal for data analysis
Drawbacks of Julia
Introducing Data Science
Data is everywhere, and it will be even more abundant in the near future. According to a late-2012 assessment, the quantity of digital data stored would increase by a factor of 300 between 2005 and 2020, from approximately 130 exabytes to 40,000 exabytes (Gantz & Reinsel, 2012). That total equates to 40 trillion gigabytes, or more than 5.2 terabytes for every individual! The rising availability and value of data will influence every industry (Chen et al., 2014; Khan et al., 2014). We need data to understand processes and to make decisions based on them, so we must learn skills that help us interpret data better, cheaper, and faster. A data scientist, data analyst, or researcher is familiar with the fundamental data management techniques, which are summarized in Figure 1.1.
Figure 1.1: The data science workflow
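The figures quoted above can be cross-checked with a few lines of Julia. This is only a sanity check of the arithmetic, not part of the cited assessment; it assumes decimal units (1 exabyte = 10^9 gigabytes), and the variable names are illustrative.

```julia
# Sanity check of the data-volume figures quoted in the text.
# Assumption: decimal units, so 1 exabyte = 10^9 gigabytes.
eb_2005 = 130          # exabytes stored in 2005
eb_2020 = 40_000       # exabytes projected for 2020

growth_factor = eb_2020 / eb_2005        # close to the quoted factor of 300
total_gb = eb_2020 * 10^9                # 40 trillion gigabytes
gb_per_person = 5_200                    # 5.2 terabytes per individual
implied_population = total_gb / gb_per_person

println("growth factor      ≈ ", round(growth_factor; digits = 1))
println("total gigabytes    = ", total_gb)
println("implied population ≈ ", round(implied_population / 1e9; digits = 2), " billion")
```

The ratio comes out near 308, which is consistent with "a factor of 300", and the per-person figure implies a world population of roughly 7.7 billion.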
Defining Data Science
Data science is the study of massive data using advanced tools and methodologies to discover previously unknown patterns, extract valuable information, and make better decisions. We employ science and math to help us use data to make better decisions. We can safely say that data science brings together data acquisition, data warehousing (maintaining data by cleaning, processing, and creating an architecture), data mining (clustering and modeling), data analysis, and data reporting techniques. It is humanly impossible to browse through an entire large dataset and derive useful insights from it by hand, so we use various statistical techniques to gain an understanding of our data, and we use advanced technologies such as machine learning to analyze the data and automate the entire process. Although most of you are familiar with the concepts of machine learning and statistics, let us refresh them. Before going into machine learning, let us see why we need to automate these data processes.
The need for Task Automation
Artificial intelligence emerged as a field in the 1950s, when computer scientists began asking whether computers could be programmed to think on their own, a question that is still debated and investigated today. Artificial intelligence may be defined as a machine’s ability to accomplish activities that would normally need human-level intelligence and judgment. We should keep in mind that AI is a fairly wide science that covers not just machine learning and deep learning but also various approaches that do not involve learning.
One of the first significant successes of AI came in chess. Deep Blue (https://www.chess.com/terms/deep-blue-chess-computer) was created to compete with humans in chess, and it relied on about 8,000 handcrafted features. In 1997, Deep Blue defeated the reigning world chess champion, Garry Kasparov. For a long time, most academics and scientists felt that achieving human-level intelligence required handcrafting a sufficiently wide set of specialized instructions.
The disadvantage of handcrafted features is that only domain experts (for example, checkers or chess professionals) can design them. Even experts may fail to craft features for complex tasks such as image classification, speech recognition, and natural language translation. Analyzing huge amounts of data and working out suitable rules by hand is humanly impossible, so we need a way to automate the process. Machine learning, a subset of artificial intelligence, provides one, and we discuss it after a brief look at statistics.
Introducing Statistics
Statistics is a highly diverse discipline with applications in almost all domains of scientific research. Scientific research and data analysis, in fact, drive the development of novel statistical methods. Statistics is the branch of science that deals with creating and researching ways to gather, analyze, interpret, and display empirical data, and it helps us understand the facts behind raw data. Statisticians use a range of mathematical and computational techniques to develop procedures. Statistical computing demands an easy-to-read language that scales well, so that statisticians can focus on developing theories rather than worrying about the computational side of computing, testing, and verifying the data.
It is humanly impossible to formulate a statistical equation for every set of inputs and outputs when the data is large. Here, machine learning, a paradigm that devises statistical relationships between inputs and outputs on its own, comes to our aid. A simple machine learning system is shown in Figure 1.2: multiple inputs and their outputs are fed into the machine, and the machine devises a model on its own. The model captures the correlation between the inputs and outputs of the data, and this correlation is then applied to test inputs to predict their outputs.
Figure 1.2: Basic machine learning workflow
Introducing machine learning
Machine learning, a novel programming method, takes in input data together with the corresponding answers and works out the rules that link the inputs to those answers. The machine is given many samples and their related outputs, and it attempts to learn the statistical association between them on its own. To categorize photographs of cats and dogs, for example, we feed in images that have previously been labeled by humans, and the system learns the statistical relationships between the images and their labels. Machine learning is closely connected to computational statistics, and its fundamental goal is to make the machine predict outcomes correctly. Now that we know we can use data science to derive insights, let us go step by step to understand how data is analyzed to derive meaningful information.
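The idea of learning a rule from labeled samples can be sketched in Julia with an ordinary least-squares fit. This is a minimal illustration rather than a full machine learning workflow: the training data, the hidden rule y = 2x + 1, and the names x_train, y_train, and predict are all assumptions made for the example.

```julia
using LinearAlgebra

# Labeled training samples: inputs x with their known outputs y.
# The hidden rule (y = 2x + 1) is an illustrative assumption.
x_train = [1.0, 2.0, 3.0, 4.0, 5.0]
y_train = 2 .* x_train .+ 1

# Design matrix with a column of ones for the intercept.
X = hcat(ones(length(x_train)), x_train)

# Least-squares fit: the "learning" step that recovers the rule from the data.
intercept, slope = X \ y_train

# Apply the learned rule to an input that was not in the training set.
predict(x) = intercept + slope * x

println("learned slope:     ", round(slope; digits = 3))
println("learned intercept: ", round(intercept; digits = 3))
println("prediction at 10:  ", round(predict(10.0); digits = 3))
```

The program is never told the rule; it recovers the slope and intercept from the samples alone and then uses them on new input, which is the pattern the paragraph above describes.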
Drawing Correlations from Raw Data
Before collecting or processing data, we will understand the problem at hand. If we face a complex problem, then we can divide it into smaller subproblems and solve them one at a time. We will examine the subset of data pertaining to the smaller problem we are attempting to address. Otherwise, if our initial problem is not that complex, we will use the entire data for addressing the issue. While solving a real-world problem, we don’t get a customized dataset. We have to figure out the variables concerning the problem that can provide us valuable information about the data. We have to curate and create our own data pertaining to the problem. We compile data from numerous sources to create a unified dataset.
The first step, after collecting the data, is to process the raw data into a format that is suitable for data analysis. At this stage, we delete any misplaced or incorrect data and, if necessary, consolidate some data or eliminate unneeded data, so that correlations can be drawn from it easily.
The second step is to create a machine learning or statistical model which will learn the data correlations on its own.
The third step is to interpret the information acquired from the data. At this stage, we look for the solution to our problem; if we do not find one, we keep changing metrics and techniques until we do.
The fourth step is to prepare the results and summarize the insights to share.
Figure 1.3 shows a pictorial description of the steps to follow while drawing correlation from raw data.
Figure 1.3: The process of drawing correlations from the raw data
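The steps above can be sketched end to end in a few lines of Julia using only the Statistics standard library. The records below are made-up illustrative values, and the field names hours and score are hypothetical; a real project would load data from files or databases.

```julia
using Statistics

# Step 1: collect and clean. `missing` marks an incomplete record,
# which is dropped before analysis.
raw = [(hours = 1.0, score = 52.0), (hours = 2.0, score = missing),
       (hours = 3.0, score = 61.0), (hours = 4.0, score = 68.0),
       (hours = 5.0, score = 71.0)]
clean = filter(r -> !ismissing(r.score), raw)
hours = Float64[r.hours for r in clean]
score = Float64[r.score for r in clean]

# Step 2: build a simple statistical model (a straight-line fit).
slope = cov(hours, score) / var(hours)
intercept = mean(score) - slope * mean(hours)

# Step 3: interpret. The correlation coefficient measures how strongly
# the two variables move together.
r = cor(hours, score)

# Step 4: summarize the insights to share.
println("records kept:  ", length(clean), " of ", length(raw))
println("fitted line:   score ≈ ", round(intercept; digits = 2),
        " + ", round(slope; digits = 2), " * hours")
println("correlation r: ", round(r; digits = 3))
```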
A data scientist gains insights and finds patterns and trends in datasets by building data models and forecasting algorithms, which are written by applying machine learning techniques to the data. The product is improved continually by sharing ideas with other teams and upper management. Several data analytics tools, such as SAS and SQL, and programming languages, such as R, Python, and Julia, lead developments in the field of data science. Some of these tools have a graphical interface and are thus simple to use, but they typically constrain how they can be used. The wide diversity of tasks that a data scientist must perform means that a programming language is essential for maximum flexibility and clarity.
Explaining the Need for Data Analysis
Data analytics is critical because it allows firms to improve their performance. Companies that include it in their business models can cut expenses by discovering more efficient ways of doing business. Data analytics may also help a firm improve its business decisions and assess consumer patterns. Let’s see some features of data analysis:
Data analysis is a straightforward way to examine a research question, and it shows the reader what was gained at each stage of data handling and interpretation.
You can understand your clients, tailor the model to each client, and support their specific needs.
Data analysis helps reduce enormous datasets to manageable size through the application of new tools and technologies.
Sellers rely on large quantities of data, studied and explored through data mining, to demonstrate their value.
Data analysis entails comprehending and interpreting data systematically, which limits human bias.
Data analysis increases the legitimacy of existing data or the latest research by providing solid references that depend on a conceptual framework.
Data analysis enables you to adapt your approach and technique, focus on saving money, and improve your initial research.
Introducing Julia
Julia is a fast, high-level, dynamic programming language that is developing rapidly. A large number of machine learning libraries and frameworks are being built for it and are used by data analysts, data scientists, and machine learning researchers in both industry and academia. Julia is a free and open-source programming language that was created by a team working at MIT and first released publicly in 2012. It is extensively used in statistical computing, data analytics, scientific research, data modeling, graphical representation, and reporting. Its ecosystem includes the statistical modeling and plotting libraries required for data science.
Julia has a large ecosystem of libraries that focus on many machine learning-related areas. This book gives a broad overview of these libraries and their functions, along with hands-on exercises and code, so that readers can start working on their own models with Julia right away. It also aims to interest readers in contributing to Julia libraries. This chapter gives an overview of the use and efficiency of Julia in handling statistics, data science, and machine learning. Figure 1.4 compares the execution times of various languages, including Julia, relative to the static language C.
Figure 1.4: Comparison between speeds of various languages with Julia
In this book, we focus mainly on the use of Julia in machine learning and data science. Since Julia is open source, many packages and libraries have been developed for data science and machine learning, and we cover the most important and most recently developed of them. The book focuses mainly on machine learning, and the packages designed for classical ML are explained thoroughly. Statistical analysis and its use in data science with Julia are also covered, and a few bonus chapters introduce parallel and distributed computing in Julia. This book can assist data scientists and analysts who wish to transition to Julia for their daily work, and it will demonstrate how and why using Julia can be a game-changer for them. Let us look at some of the major features of the Julia language that make it stand out; Figure 1.5 shows some of the most important ones.
Figure 1.5: Features of Julia language
Astounding Julia language!!
Julia was designed to combine the strengths of several languages in a single language, with scientific computing in mind from the start. It set out to match the most effective features of languages such as Python, R, and MATLAB while achieving better performance than they offer.
Machine learning programmers and scientists want a language that can do quantitative, probabilistic, differentiable, tree-structured, and parallelized calculations. Julia is a multi-paradigm language with strong support for functional programming, and it uses the LLVM compiler framework with just-in-time compilation, which makes it possible to perform complex deep learning tasks quickly. Unlike C++, which is statically typed and often verbose, Julia is dynamically typed, making it well suited to designing, training, and tweaking data-related code.
There are several tools that you can use to write Julia code. You can use the Julia REPL (read-eval-print loop), a command line built into the Julia executable, or Pluto notebooks, which are like Jupyter notebooks with additional functionality for code execution and sequencing. If you have installed Julia (we will learn about it in the next chapter), you can open your terminal and type in julia. A Julia prompt will appear on your terminal screen. Figure 1.6 shows the Julia prompt.
Figure 1.6: Julia prompt in REPL
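Let us look at a sample function written in Julia that returns an integer raised to the power of 7. We can measure the performance of our code using the @time macro, paying attention to both the elapsed time and the memory allocations. Before the result, @time prints a line of the form "0.008357 seconds (977 allocations: 62.519 KiB, 66.14% compilation time)"; the exact figures vary by machine and are omitted below. The first call includes compilation time, so Julia generally runs faster from the second call onward.
julia> x = 7
7
julia> power7(n) = n^7
power7 (generic function with 1 method)
julia> @time power7(x)  # first call: the time includes compilation
823543
julia> @time power7(x)  # second call: reuses the compiled code
823543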
Julia is open source and widely available, which makes it simple to examine high-level code rather than delving into low-level implementations. Julia is faster than other dynamic languages such as Python, and its speed approaches that of statically compiled languages.
Now, let us look into some of the main features of Julia language.
Julia includes the features needed for complicated mathematical computations, so you can write your mathematical equations almost directly as code. Julia syntax is close to that of MATLAB, which makes the transition to Julia easy for MATLAB users. You can express equations in Julia using the same mathematical operators and expressions that you would use on paper.
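A short sketch shows how closely Julia code can follow mathematical notation. The function names f, hypotenuse, and circle_area are illustrative choices, not taken from the book.

```julia
# A polynomial written almost as it would appear on paper:
# a numeric literal placed before a variable means multiplication (2x is 2 * x).
f(x) = 2x^2 + 3x - 1

# Unicode operators and constants are part of the language.
hypotenuse(a, b) = √(a^2 + b^2)
circle_area(r) = π * r^2

# Matrix-vector products use ordinary multiplication syntax.
A = [1 2; 3 4]
v = [1, 1]

println(f(2))               # 2*4 + 3*2 - 1
println(hypotenuse(3, 4))
println(circle_area(1.0))
println(A * v)
```

In the REPL, the symbols √ and π can be typed as \sqrt and \pi followed by the Tab key.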
Julia and Python are both dynamically typed languages in which prototype code looks similar and is easy to write. For Python developers, a Python interface to Julia is available at JuliaPy/pyjulia. Like Common Lisp, Julia is easy to write, and it borrows many of Lisp’s programming paradigms and components. Well-written Julia code runs at speeds comparable to C and FORTRAN.
Julia uses just-in-time compilation: the first time a function is called with a given combination of argument types, Julia compiles specialized native code for those types. This takes some time on the first call, but later calls are very fast. Julia joined the “petaflop club” in 2017, when the Celeste astronomy project exceeded one petaflop per second, that is, 10^15 floating-point operations per second, on a supercomputer. Because types are inferred, you can write generic code once without redefining the same function for every data type.
Because most of Julia, including its standard library and the majority of its packages, is written in Julia itself, programmers can easily read the source, make modifications, or fix problems. For pure-Julia packages, there is no need to set up separate compilers or worry about linking against external libraries.
Many mature, highly optimized numerical libraries are written in C and FORTRAN, and because Julia is a young language, its own libraries do not yet replace all of them. Julia makes it simple to call C and FORTRAN functions, allowing you to reuse existing code. No glue code, code generation, or compilation step is required; a regular call using the ccall syntax, even from an interactive Julia prompt, is sufficient.
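As a minimal sketch of such a call, the snippet below invokes strlen from the C standard library. It assumes a platform where that symbol can be resolved in the running process (typically Linux or macOS); on other systems a library name may need to be given explicitly.

```julia
# Call the C standard library function `strlen` directly; no wrapper code is needed.
# The arguments are: function name, return type, tuple of argument types, values.
len = ccall(:strlen, Csize_t, (Cstring,), "Julia")
println(Int(len))

# The same call written with the `@ccall` macro (available in Julia 1.5 and later).
len2 = @ccall strlen("Julia"::Cstring)::Csize_t
println(Int(len2))
```

Julia converts the String argument to a C string automatically, and the result comes back as an unsigned integer of type Csize_t.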
Julia supports multiple dispatch: when a function is called, Julia selects the method to run based on the run-time types of all of the arguments, not just the first one. Idiomatic Julia code is also written with type stability and memory allocation in mind. Multiple dispatch allows Julia code to be reused and extended across many packages.
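The following sketch illustrates multiple dispatch with a small, hypothetical shape hierarchy. The types Shape, Circle, and Rectangle and the functions area and describe are invented for this example.

```julia
# Hypothetical types used only for illustration.
abstract type Shape end

struct Circle <: Shape
    radius::Float64
end

struct Rectangle <: Shape
    width::Float64
    height::Float64
end

# One generic function with several methods: Julia picks the method
# from the run-time type of the argument.
area(c::Circle) = π * c.radius^2
area(r::Rectangle) = r.width * r.height

# Dispatch on the types of both arguments; the most specific method wins.
describe(a::Circle, b::Circle) = "two circles"
describe(a::Circle, b::Rectangle) = "a circle and a rectangle"
describe(a::Shape, b::Shape) = "two shapes"

shapes = Shape[Circle(1.0), Rectangle(2.0, 3.0)]
println(sum(area, shapes))
println(describe(shapes[1], shapes[2]))
println(describe(shapes[2], shapes[1]))   # falls back to the generic method
```

Another package could add a new Shape subtype and its own area method without modifying this code, which is how dispatch supports reuse across packages.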
Julia supports coroutines (tasks), multi-threading, distributed computing, and GPU computing. It provides a message-passing multiprocessing environment that lets programs run in several processes with separate memory domains at the same time. Julia’s message passing differs from that of environments such as MPI in that communication is generally one-sided: the programmer explicitly manages only one of the processes involved in an operation. Together with GPU support, these capabilities help speed up the execution of deep learning jobs.
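As a small taste of these capabilities, here is a hedged sketch of task-based multi-threading. The function parallel_sum_of_squares is a made-up example, and it assumes a non-empty input; it also runs correctly, though without a speedup, when Julia is started with a single thread.

```julia
using Base.Threads: @spawn, nthreads

# Split the input into chunks, sum each chunk in its own task, then combine.
# Assumption: `xs` is non-empty.
function parallel_sum_of_squares(xs; ntasks = nthreads())
    chunk_size = max(1, cld(length(xs), ntasks))
    chunks = Iterators.partition(xs, chunk_size)
    tasks = [@spawn sum(abs2, chunk) for chunk in chunks]
    return sum(fetch, tasks)    # wait for every task and add the partial sums
end

println("threads available: ", nthreads())
println(parallel_sum_of_squares(1:1000))
```

Starting Julia with, for example, julia --threads 4 makes four threads available to the tasks. Each task works on its own chunk and returns a partial sum, so no shared variable is modified concurrently.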
The REPL (read-eval-print loop) is an interactive command-line interface that is built into the Julia executable. REPL provides simple and quick execution, a searchable history, tab completion, various handy key bindings, help, and shell mode. This feature makes it easier to write and run Julia code in the interactive Julia shell.
Julia’s package manager, Pkg, handles package dependencies entirely on its own: when a package is added to a project, Pkg upgrades or downgrades other dependencies as needed to keep the versions compatible. Packages are precompiled, and the cached result is reused until it becomes stale, so loading them again is quick. Pkg also manages project environments by using two files, Project.toml and Manifest.toml, to keep track of the packages and the exact versions used in a specific project. With these two files, you can easily reproduce the full project environment on another machine.
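A minimal sketch of working with a project environment follows. A temporary directory stands in for a real project folder, and the package name in the comments is only an example; the commented-out calls need network access, so they are not executed here.

```julia
using Pkg

# Create and activate a fresh project environment.
# A temporary directory stands in for a real project folder.
project_dir = mktempdir()
Pkg.activate(project_dir)

# The active environment now points at this project's Project.toml.
println(Base.active_project())

# Adding a package (requires network access) would record it in Project.toml
# and pin the exact resolved versions in Manifest.toml:
#   Pkg.add("DataFrames")
# On another machine, the same environment can be restored with:
#   Pkg.instantiate()
```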
Julia: Ideal for Data Analysis
An important feature of any language is its extensibility: the language should be designed so that programmers can easily add the functionality they need. Julia’s open-source machine learning frameworks make it simple for programmers to use and build specialized functionality as needed. Flux is a machine learning package intended for use in high-performance pipelines; it meets these expectations through a combination of existing and novel compiler techniques, and it is widely used for writing differentiable algorithms in Julia. Pluto.jl, a simple reactive notebook for Julia, can be used to visualize data with plots and graphs alongside Flux models. CUDA.jl is used to work with NVIDIA CUDA GPUs from Julia; it provides an array abstraction, support for writing CUDA kernels in Julia, and wrappers for other CUDA libraries. Numerical computation libraries, such as the differential equation solvers of SciML and the optimization library JuMP, have made it feasible to address complicated real-world problems with ease.
In many frameworks, programmers who train machine learning algorithms in parallel see only a restricted set of mathematical operations and cannot program the GPU directly. Julia GPU programming lets you write CUDA kernels and launch them directly from a script or notebook. This GPU stack contributes to machine learning capabilities such as scalability, because frameworks can supply the kernels they need.
Julia contributes to the development of many tools and technologies. Many major institutions, both technical and academic, employ Julia to perform various tasks more efficiently and quickly. You can go to https://juliahub.com/case-studies/ to learn about more use cases of Julia. You can see some of the major use cases of Julia in Figure 1.7.
Figure 1.7: Break-throughs in various fields using Julia language
Some of the industrial and academic real-world use cases of Julia and its related libraries and frameworks are:
Apart from these classical libraries and packages, Julia is progressing in the fields of modeling and simulation, which has aided Pumas’ predictive healthcare analytics. Pumas is a comprehensive platform for pharmacological modeling and simulation from Julia Computing and Pumas.ai. It is a tool for the whole drug development pipeline.
Raj Dandekar, Chris Rackauckas, Emma Wang, and George Barbastathis of MIT devised a model to help reduce and stop COVID-19’s spread.
Marius Millea, a cosmologist, used Julia to study gravitational lensing in the Cosmic Microwave Background from the South Pole Telescope.
Jean-Christophe B. Loiseau uses Julia and highly recommends it for modeling turbulence because it lets him write fewer lines of code.
JuliaHub is the simplest way to use the cloud to scale enormous, distributed Julia jobs.
Julia Computing’s JuliaHub makes it easy for Julia users to manage their packages, locate documentation, contribute to open-source projects, and run large compute intensive workloads.
Julia is taught at many institutions, including MIT and Stanford.
Stipple is a new tool for developing, deploying, and scaling Julia applications. Stipple is a reactive UI library for building interactive data applications in pure Julia.
JuliaTeam from Julia Computing allows your organization to collaborate, build, and manage private and public packages, manage open-source licenses, and take advantage of Julia’s continuous integration, deployment, security, indemnity, and enterprise governance capabilities.
The Julia community is also working to help heal the planet: JuliaClimate is devoted to climate research and to solving climate change problems using Julia tools.
Pfizer and Moderna have used Julia Computing to simulate new pharmaceuticals.
Julia Computing helps AstraZeneca with AI-based toxicity prediction.
Julia Computing aids European insurance giant Aviva in compliance issues.
Julia Computing has provided energy provider Fugro Roames with an AI-based grid network failure prediction system and the FAA with its airborne collision prevention program.
Julia Computing helped Cisco with ML-based network security and several national labs and academic institutions with ML-based network security research programs.
Julia Computing also made headlines because of a DARPA award to modernize semiconductor codes for more efficient, current simulation codes.
The language is used by over 10,000 firms throughout the world, including AstraZeneca PLC, BlackRock, and Microsoft Corp. Julia is also used by NASA, the Federal Aviation Administration (FAA), and the Federal Reserve Bank of New York.
JuliaHub has also contributed to advances in Williams racing cars.
The Julia Computing team will be designing digital phased array systems using Julia and GPUs. Eventually, Julia is expected to become an integral part of next-generation wireless systems.
Limitations of Julia
Julia’s just-in-time compilation and dynamism are noteworthy strengths, but they also bring limitations. The first time any piece of code runs in a fresh Julia session, including in the REPL, there is a noticeable delay.
This is because Julia must compile code for the data types involved the first time a program is run. Although this compilation delay has decreased with each Julia release, it remains considerable, which is inconvenient when running a large, complex machine learning program for the first time. Executables built from Julia code also consume extra RAM, which makes it inconvenient to deploy complex data science or machine learning programs. Because Julia relies on its own runtime and just-in-time compiler, embedding it in other languages is tricky; statically compiled languages such as C, C++, and Rust are easier to incorporate since they generate binaries that other languages can link against. Since the Julia ecosystem is less mature than those of older languages, it is prone to a few common issues, and several Julia packages are not yet ready for production use. To access R and Python libraries from Julia, you can use packages such as RCall.jl or PyCall.jl. Only production-ready libraries will be covered in the book.
Conclusion
Machine learning and data analysis lay the groundwork for data science tasks. Julia is a high-level, general-purpose language for writing scientific computing code that is quick to execute and simple to implement. The language is intended to address the requirements of scientific researchers and data scientists so that they can experiment and implement designs efficiently. If you are a data aficionado, you have probably heard Julia described as the future programming language of data science. Some speculate that Julia will eventually replace Python and R in the data science arena because of its considerable improvements in performance, efficiency, and usability.
In the next chapter, we will learn to install Julia on our system. We will learn about various tools and technologies to write Julia code. We will learn to write basic Julia code in Pluto notebooks. We will go through the fundamentals of programming such as using data types, writing loops, functions and some other functionalities in Julia language.
Points to Remember
Data science is the study of massive data using advanced tools and methodologies to discover previously unknown patterns, extract valuable information, and make better decisions.
Artificial intelligence uses computers and technology to emulate humans’ problem-solving and decision-making skills.
Machine learning is the usage and development of computer systems that can learn and adapt without explicit instructions, by analyzing and drawing conclusions from data patterns utilizing algorithms and statistical models.
Statistics is the branch of science that deals with creating and researching ways to gather, analyze, interpret, and display empirical data.
Julia is a high-level dynamic programming language. Julia is known for its high speed and performance compared to other languages used for machine learning, such as Python.
Julia is an open-source language that was designed mainly for scientific computing, but it also supports web development, databases, game development, and more.
Julia’s package manager and REPL are robust. Julia supports multiple dispatch.
Julia uses the LLVM compiler with just-in-time compilation and is dynamically typed, which makes it well suited to writing, training, and tuning machine learning programs.
Julia’s extensible code, which makes it flexible to write a wide variety of programs, together with the simplified design of its machine learning frameworks, makes it an ideal language for machine learning.
References
Gantz and Reinsel, 2012:
https://datastorageasean.com/sites/default/files/idc-the-digital-universe-in-2020.pdf
Chen et al., 2014:
https://www.sciencedirect.com/science/article/abs/pii/S1551741113001277
Khan et al., 2014:
https://link.springer.com/book/10.1007/978-3-319-06710-0
CHAPTER 2
Getting Started with Julia
Introduction
This chapter will walk you through installing and configuring Julia on your PC. We will practice coding in several Julia IDEs (integrated development environments), learn about package management in Julia, and define Julia environments for developing project code. If you are new to the Julia language, this chapter will teach you the basic syntax and the most significant ideas. Even if you are already familiar with Julia, we recommend that you quickly review the topics covered here to make sure you have a thorough knowledge of the fundamental principles. We will go through the fundamentals and basic functionalities of Julia and learn how to express values, variables, loops, functions, and so on. For a thorough introduction to Julia programming, we recommend the books listed at https://julialang.org/learning/books/ or the Julia manual at https://docs.julialang.org/en/v1/.
Structure
In this chapter, we will discuss the following topics:
Installing and setting up Julia
Working with Julia IDEs
Julia package manager: Pkg
Handling numerical data in Julia
Declaring variables
Constructing control flows
Defining functions
Scope rules for variables
Installing and setting up Julia
Let us go through the process of downloading, installing, and configuring Julia. You can replicate these steps on your own system. Once Julia is correctly configured, it will be simple to follow along with the examples in the chapters.
To begin, visit https://julialang.org/downloads/ to download Julia to your local PC. The site offers builds for numerous operating systems and Julia versions; download the one that matches your operating system and follow the operating-system-specific setup instructions.
We use Julia version 1.7 throughout the book because it was the most recent release at the time of writing, and all the code in this book was tested with it. You can follow along with version 1.7 or any later 1.x release; the code should behave as expected, although later versions may introduce minor changes in execution or output. Versions older than 1.7 may cause problems. If you need Julia 1.7 specifically, go to the older unmaintained releases page (https://julialang.org/downloads/oldreleases/).
Follow the instructions for your operating system at https://julialang.org/downloads/platform/ to install the Julia executable on your system. Linux users can either use the Julia Installer script (https://github.com/abelsiqueira/jill) to download and install Julia on their machine, or follow the platform-specific methods.
sudo bash -ci "$(curl -fsSL https://raw.githubusercontent.com/abelsiqueira/jill/main/jill.sh)"
The above command downloads and installs Julia using jill, an unofficial Julia installer script for Linux.
After installing Julia, make sure that the Julia executable is on the PATH environment variable. The way to do this differs by operating system, so follow the platform-specific installation instructions. Once Julia is on the PATH, you can start it simply by typing julia in a terminal and pressing Enter.
Figure 2.1 depicts a basic Julia session running on the terminal.
Launch your system’s terminal.
The $ sign represents the operating system’s command prompt. At the $ prompt, type julia and press Enter.
A Julia session will begin, and you will see the Julia logo along with pointers to the documentation, the built-in help, and the official release.
Below the Julia logo, the $ prompt changes to the julia> prompt. This means that you can now execute Julia commands.
To leave the Julia session and return to the terminal, enter exit() at the julia> prompt. The $ prompt then returns to your terminal screen.
Figure 2.1: The Julia prompt in the Linux terminal
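Once the julia> prompt appears, you can confirm that the installation matches the version recommended earlier and that the executable is on the PATH. The lines below are a small sketch that uses only built-in functionality; the printed messages are our own wording, not output produced by Julia itself.

```julia
# VERSION is a built-in constant holding the running Julia version.
println("Running Julia ", VERSION)

# Compare against the version used in this book.
if VERSION < v"1.7"
    println("Older than 1.7: some examples may not work as shown.")
else
    println("Version 1.7 or newer: the examples should run as expected.")
end

# Sys.which returns the full path of an executable found on the PATH,
# or nothing when it cannot be found.
julia_on_path = Sys.which("julia")
if julia_on_path === nothing
    println("julia was not found on the PATH.")
else
    println("julia found on the PATH at ", julia_on_path)
end
```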
Working with Julia IDEs
In the previous section, we learned how to install and configure Julia locally and start it from the terminal. In this section, we’ll look at different ways to run Julia commands locally. You can use any of the techniques listed to write and execute Julia code.
Throughout the book, the code examples were written in Julia’s REPL, but it is up to you where you run the code; the output will be the same in all three modes. An integrated development environment (IDE) offers comprehensive facilities for software development: programmers can edit, build, and debug code in one place. Some typical methods for creating and modifying Julia code include:
using the Julia read-eval-print loop (REPL)
using an integrated development environment (IDE) like Visual Studio Code (VSCode) or Atom
using notebooks like Jupyter Notebook or Pluto notebook
Using Julia REPL
After adding the Julia executable to the PATH environment variable, you can run Julia directly from the terminal. There are two ways to do so. The first is the Julia executable’s interactive in-terminal session, also known as the Julia read-eval-print loop (REPL).
The Julia REPL is the interactive command-line interface for Julia programming.
In the previous section, we saw how to start the Julia REPL and used Julia’s exit() command to leave it. In addition to starting interactive sessions, the julia executable accepts a number of command-line options, described at https://docs.julialang.org/en/v1/manual/command-line-options/#command-line-options.
The second way is to execute Julia scripts from the terminal. You can write Julia code in a *.jl file, where * is the filename, and run it by typing julia followed by the filename into the terminal. This executes the code in the file and then exits.
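As an illustration of running a script, the following sketch shows what such a file might contain. The filename hello.jl and the function main are hypothetical names chosen for this example; ARGS is Julia’s built-in vector of command-line arguments.

```julia
# hello.jl (hypothetical filename)

# Greet the name passed on the command line, or "world" if none is given.
function main(args)
    name = isempty(args) ? "world" : args[1]
    println("Hello, ", name, "!")
    return name
end

# ARGS holds the arguments that follow the filename on the command line.
main(ARGS)
```

Saved as hello.jl, the script can be run from the terminal with julia hello.jl Julia; it prints a greeting and then returns to the terminal prompt.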
Using IDEs
Aside from the Julia REPL, we can write and execute Julia code in an integrated development environment (IDE). We may write a .jl script in the IDE and then execute it on the terminal to see the results. A Julia plugin can be installed in the IDE of your choice, such as Atom or VSCode. To learn how to access and configure the settings, refer to https://code.visualstudio.com/docs/languages/julia for VSCode and https://docs.junolab.org/stable/man/installation/ for Atom. The VSCode extension offers features such as autocompletion, an integrated REPL, inline results, code navigation, a plot pane, a debugger, and many more.
Using Notebook
Pluto (https://github.com/fonsp/Pluto.jl) is a reactive notebook designed by Julia developers that includes capabilities not found in Jupyter notebooks. Being reactive means that if you update a variable, Pluto automatically re-runs all the cells that reference that variable.
Because it automatically discovers relationships between cells, you can arrange them in whatever order you like. Pluto also detects which packages a notebook requires and installs and manages them automatically.
Julia code can also be written and run in Jupyter notebooks. In a Jupyter notebook, you must select the Julia kernel and then type Julia commands in a cell. Both Jupyter and Pluto notebooks display output as well as graphical visualizations. A notebook combines code, mathematics, graphs, prose, and other multimedia in a single formatted document, which simplifies project collaboration.
To use Julia in Jupyter, install the IJulia.jl package; installation options are described in the documentation at https://julialang.github.io/IJulia.jl/stable/. Then launch your Jupyter notebook and change the kernel to Julia by choosing it from the drop-down menu.
You can also launch a Jupyter notebook directly from Julia with the IJulia.jl package (https://julialang.github.io/IJulia.jl/stable/). Start an interactive Julia session; then, in the REPL, load IJulia and call its notebook() function to open a notebook with a Julia kernel in your browser.
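The REPL commands for this step can be sketched as follows. This is a minimal example that assumes an internet connection for the one-time installation; on first use, IJulia may also ask for permission to install Jupyter itself.

```julia
using Pkg
Pkg.add("IJulia")   # one-time installation; requires internet access

using IJulia
notebook()          # starts Jupyter and opens it in the default browser
```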
Pluto makes it easy to develop and manage project environments. When we run Julia code from the terminal instead, we have to establish and maintain project environments ourselves; we will learn how in the following section.
Julia package manager: Pkg
Julia has built-in package management. It allows you to install, manage, and use a wide range of packages, together with their dependencies, in your projects. In this section, we’ll go through the most important components of package management in Julia. More thorough explanations may be found at https://pkgdocs.julialang.org/v1/.
Project environments
Julia’s built-in package manager is called Pkg. Pkg handles packages, registries, and artifacts. Pkg is based on environments, which are independent sets of packages that can be local to a single project or shared and selected by name. By default, Pkg keeps the default environment in ~/.julia/environments/v1.6 (for Julia 1.6; the directory name follows the installed Julia version), registries in ~/.julia/registries, and artifacts in ~/.julia/artifacts, where ~ is the user’s home directory on the local machine. You can work in other locations by creating new environments, registries, or artifacts. In this section, we will only learn how to set up and manage new project environments. You can learn more about managing registries at https://pkgdocs.julialang.org/v1/registries/ and about artifacts at https://pkgdocs.julialang.org/v1/artifacts/.
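The locations and environments described above can be inspected from Julia itself. The following is a minimal sketch that assumes a default installation; the temporary directory is a hypothetical project location chosen only for this example.

```julia
using Pkg

# DEPOT_PATH lists the directories where Julia keeps packages, registries,
# and artifacts; the first entry is typically ~/.julia.
println("Depot: ", first(DEPOT_PATH))

# Create a new, empty project environment in a temporary directory
# (a hypothetical location chosen only for this example).
project_dir = mktempdir()
Pkg.activate(project_dir)

# The active environment is identified by the path of its project file.
println("Active project: ", Base.active_project())
```

Packages added after this point with Pkg.add are recorded in the newly activated environment rather than in the default one.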
When you create a new project environment and add packages to it, two new files are created: Manifest.toml and Project.toml. The Manifest.toml file records the exact set of packages and versions in an environment; checking it into the project repository and managing it under version control considerably increases project reproducibility.
Figure 2.2: The Manifest.toml file, which contains information about all the packages
Project.toml is a file in the root directory of a project that contains metadata about the project, such as