Clojure, The Essential Reference
— Renzo Borgatti
Software development is often compared to a craft, despite the fact that it’s
predominantly an intellectual activity. While software development is abstract in
nature, there are many craft-oriented aspects to it:
• The keyboard requires time and dedication to operate correctly. There are endless
discussions on the best keyboard layout for programmers, for example to speed up
typing 1.
• The development environment is a key aspect of programmers productivity and
another source of debate (almost reaching a religious connotation). Mastering a
development environment often translates into learning useful key combinations
and ways to customize the most common operations.
• Libraries, tools and idioms surrounding the language. Almost everything above
the pure syntax rules.
• Proficiency in several programming languages is definitely a plus in the job
marketplace, and the way to achieve it is by practicing them on a regular basis,
including getting familiar with the APIs and libraries each language offers.
1
Dvorak users often claim huge benefits compared to QWERTY users. Here’s one comparison, including other kinds of
layouts: lifehacker.com/should-i-use-an-alternative-keyboard-layout-like-dvorak-1447772004
• Many other aspects require specific skills depending on the area of application:
teaching, presenting or leadership.
The focus on mastering programming skills is so important that it became one of the
key objectives of the Software Craftsmanship Movement 2. Software Craftsmanship
advocates learning through practice and promotes an apprenticeship process similar to
other professions.
The standard library is definitely one of the most important tools for mastering a language.
One aspect that characterizes the standard library is the fact that it is already packaged
with a language when you first experiment with it. Interestingly, it doesn’t get the
amount of attention you would expect for such an easy-to-reach tool. This book will
show you how much wisdom and potential is hidden inside the Clojure standard
library.
2
manifesto.softwarecraftsmanship.org
terms of documentation: the language and the standard library are described in a very
essential style, which is often considered beginner-unfriendly 3 . A lot of effort has been
put lately into improving Clojure documentation although, at the time of this writing,
the standard library is still lacking a comprehensive and centralized reference.
This book puts a great deal of effort into illustrating functions in a readable and pleasant
way, using many real-life examples and visual structure to draw attention to the
essential parts. Despite not being designed as a book to read cover-to-cover, each
function is a pleasant and interesting read on its own that also offers insight into
functional (and general) programming. The following is a simplified version of the
function “fnil”, very similar to how it appears in the book. It has been annotated to
show the purpose of each section:
3
See the latest "State of Clojure" survey 2015: blog.cognitect.com/blog/2016/1/28/state-of-clojure-2015-survey-results.
Documentation still ranks high in the list of major problems with the language.
Figure 1.1. The template for a function as it is illustrated in the book, with ovals explaining
what each section is about.
4
The interested reader can see the extent of the effort by checking out the Clojure project from Github and using the
following git command: git rev-list --reverse --format="- %B %cd" -n 1 HEAD — src/cli/runtime.
The C# files were finally removed from the project sometime in 2007 with commit
b6db84aea2db2ddebcef58918971258464cbf46f
5
David Miller speaks about the history of ClojureCLR on this episode of the "Defn" podcast: soundcloud.com/defn-
771544745/48-david-miller-and-clojure-on-the-clr
6
The ClojureScript effort can be traced back to IRC discussions in May 2008 clojure-log.n01se.net/date/2008-05-
29.html#15:26
7
The original ClojureScript release announcement was captured on video and available
at www.youtube.com/watch?v=tVooR-dF_Ag
and manipulate other functions, conditionals. Core currently contains around 700
definitions between functions and macros. Functions in core are always available
without any explicit reference from any namespace.
2. Namespaces other than "core" (still shipped as part of Clojure). These are
usually prefixed with clojure followed by a descriptive name,
like clojure.test, clojure.zip or clojure.string. Functions in these
namespaces are sometimes available just by prefixing their namespace
(like clojure.string/upper-case) but in other cases they need to be imported into
the current namespace using “refer, refer-clojure, require, loaded-libs, use,
import” 8 .
3. The content of the Java SDK, which is easily available as part of Clojure’s Java
interoperability features. This book shows many examples of use of the Java
standard library from Clojure, but doesn’t go into the details of the Java APIs
used in the examples.
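As a quick illustration of the second category, requiring a non-core namespace at the REPL and calling one of its functions looks like this (a minimal sketch):

```clojure
;; Load a non-core namespace shipped with Clojure and alias it.
(require '[clojure.string :as str])

;; Functions become available under the alias.
(str/upper-case "clojure")
;; => "CLOJURE"
```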
In this book we will refer to the Clojure standard library as the first two parts described
above, basically everything that you get by just downloading the Clojure package and
without downloading other libraries. In general, items in the standard library are
marked as public, although some functions are marked as "alpha" in their Clojure
documentation string and are subject to change. The book will warn the reader about
functions that can be used but are not guaranteed to stay in the library.
The standard library content can be roughly categorized by looking at the major
features Clojure introduces and by the most common programming tasks. There are,
for example, big groups of functions dedicated to Software Transactional Memory 9,
concurrency and persistent collections. Of course Clojure also adds all the necessary
support for common tasks like IO, sequence processing, math operations, XML, strings
and many others. Apparently missing from the Clojure standard library are solutions
already provided by the Java SDK, for example cryptography, low-level networking,
HTTP, 2D graphics and so on. For all practical purposes those features are not missing,
but just usable as they are from Java without the need to re-write them in Clojure. Java
interoperability is one of the big strengths of Clojure, opening the possibility to easily
use the Java SDK (Software Development Kit) from a Clojure program.
This book will cover both clojure.core (the vast majority of functions in the standard
library) as well as the additional namespaces described in the following diagram and
broadly grouped by area of application.
8
This is due to the fact that, while bootstrapping, Clojure already imports several namespaces that are automatically available
to the end user. Very popular tools like nREPL or CIDER also load libraries while bootstrapping, which are then available at
the prompt. It is good practice to always require what is needed in a namespace explicitly.
9
For a good introduction to STM see Wikipedia: en.wikipedia.org/wiki/Software_transactional_memory
• Java: namespaces dedicated to Java interop beyond what core already has to
offer. clojure.java.browser and clojure.java.javadoc offer the possibility to
open a native browser to display generic web pages or javadoc documentation
respectively. clojure.reflect wraps the Java reflection APIs offering an
idiomatic Clojure layer on top of it. clojure.java.io offers a sane approach
to java.io, removing all the idiosyncrasies that made Java IO so confusing, like
knowing the correct combination of constructors to transform a Stream into a
Reader and vice-versa. Finally, clojure.inspector offers a simple UI to
navigate data structures.
• Data Serialization: ways in which Clojure data can be encoded as strings
as an exchange format. clojure.edn is the main entry point into EDN 10 format
serialization. clojure.data contains only one user-dedicated
function, "clojure.data/diff", to compute differences between data
structures. clojure.instant defines encoding of time-related types.
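For instance, reading EDN text back into native Clojure data is a one-liner with clojure.edn (a minimal sketch; the sample string is an assumption):

```clojure
(require '[clojure.edn :as edn])

;; Parse an EDN string into native Clojure data structures.
(edn/read-string "{:balance 80.12 :ids [1 2 3]}")
;; => {:balance 80.12, :ids [1 2 3]}
```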
Although the classification above gives a nice overview of what’s available beyond
core functions, the book is structured so that clojure.core functions and non-core
functions are re-grouped when necessary to reflect their area of application. A couple
of notable examples are:
• clojure.reflect/reflect appears in the "Java Interop" chapter along
with “proxy”, “gen-class and gen-interface” or “".", ".." and doto” which are
instead core functions.
• clojure.walk/stringify-keys appears along with other core hash-map functions.
The book assumes that readers are relatively interested in knowing where
exactly a function lives (if only to “refer, refer-clojure, require, loaded-libs, use,
import” it at the top of the namespace to use it), but that they are more interested in knowing
that the function exists when they have a particular problem to solve.
Although the vast majority of items in the standard library are either functions or
macros, the book also describes some dynamic variables. Dynamic variables are a
special kind of reference type that can be re-bound on a thread-local basis (see the
great description of dynamic variables from "Joy of Clojure" for a detailed
explanation 11 ). Dynamic variables are also described in this book
because they are often the way other functions in the standard library are configured.
10
The EDN format is described here: github.com/edn-format/edn
11
The "Joy of Clojure" is available on the Manning website: www.manning.com/books/the-joy-of-clojure-second-edition
a human readable form. Information is coming from an external system and a library
already takes care of that communication. All you know is that the input arrives
structured as the following XML (here saved as a local balance var definition):
(def balance
"<balance>
<accountId>3764882</accountId>
<lastAccess>20120121</lastAccess>
<currentBalance>80.12389</currentBalance>
</balance>")
(defn- to-double [k m]
(update-in m [k] #(Double/valueOf %)))
(print-balance balance)
;; {"Account Id" 3764882, "Last Access" "20120121", "Current Balance" "80.12"}
❶ parse takes the XML input string and parses it into a “hash-map” containing just the necessary
keys. parse also converts :currentBalance into a double.
❷ clean-key solves the problem of removing the ":" at the beginning of each attribute name. It checks
the beginning of the attribute before removing potentially unwanted characters.
❸ separate-words takes care of searching for upper-case letters and pre-pending a space. reduce is used
here to store the accumulation of changes so far while we read the original string as the input. up-
first was extracted as a handy helper to upper-case the first letter.
❹ format-decimals handles floating point number formatting. It searches for digits with re-find and then
either appends (padding zeros) or truncates the decimal digits.
❺ Finally print-balance puts all the transformations together. Again reduce is used to create a new
map with the transformations while we read the original one. The reducing function was big enough to
suggest an anonymous function in a letfn form. The core of the function is to “assoc, assoc-in and
dissoc” the new formatted attribute with the formatted value in the new map to display.
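The helper functions described by the callouts are only partially shown in this excerpt. A hypothetical sketch of the reduce-based separate-words step, together with the up-first helper mentioned in ❸ (names taken from the description; the book's actual code may differ):

```clojure
(require '[clojure.string :as string])

(defn- up-first
  "Upper-cases the first letter of a string, leaving the rest untouched."
  [s]
  (str (string/upper-case (subs s 0 1)) (subs s 1)))

(defn- separate-words
  "Prepends a space to each upper-case letter, accumulating the result with reduce."
  [s]
  (up-first
    (reduce (fn [acc c]
              (if (Character/isUpperCase c)
                (str acc " " c)
                (str acc c)))
            "" s)))

(separate-words "currentBalance")
;; => "Current Balance"
```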
While being relatively easy to read (the 3 formatting rules are somewhat separated into
functions) the example shows minimal use of what the standard library has to offer. It
contains map, reduce, “apply” and a few others including XML parsing, which are of
course important functions (and usually what beginners learn first). But there are
definitely other functions in the standard library that would make the same code more
concise and readable.
Let’s have a second look at the requirements to see if we can do a better job. The
source of complexity in the code above can be tracked down to the following:
• String processing: strings need to be analyzed and de-composed.
The clojure.string namespace comes to mind.
• Hash-map related computations: both keys and values need specific
processing. reduce is used here because we want to gradually mutate both the key
and the value at the same time. But “zipmap” sounds like a viable alternative worth
exploring.
• Formatting rules of the final output: things like string padding of numerals or
rounding of decimals. There is an interesting "clojure.pprint/cl-format" function
that might come in handy.
• Other details like nested forms and IO side effects. In the first case threading
macros can be used to improve readability. Finally, macros like “with-open”
remove the need for developers to remember to initialize the correct Java
IO type and close it at the end.
©Manning Publications Co. To comment go to liveBook
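These candidates can be probed quickly at the REPL before committing to the rewrite. For example, cl-format's ~,2f directive already covers the two-decimal formatting rule (a sketch; the sample values come from the XML above):

```clojure
(require '[clojure.pprint :refer [cl-format]])

;; "~,2f": format a float with exactly two decimal digits,
;; padding with zeros or rounding as needed. nil as the first
;; argument makes cl-format return the string instead of printing it.
(cl-format nil "~,2f" 80.12389)
;; => "80.12"
(cl-format nil "~,2f" 80.1)
;; => "80.10"
```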
By reasoning on the aspects of the problem we need to solve, we listed a few functions
or macros that might be helpful. The next step is to verify our assumptions and rewrite
the example:
(require '[clojure.java.io :as io])
(require '[clojure.xml :as xml])
(require '[clojure.string :refer [split capitalize join]])
(defn- to-double [k m]
(update-in m [k] #(Double/valueOf %)))
(print-balance balance)
;; {"Account Id" 3764882, "Last Access" "20120121", "Current Balance" "80.12"}
❶ parse now avoids the let block, including removing the need to close the input stream. This is
achieved by “with-open”. The ->> threading macro has been used to give a more linear flow to the
previously nested XML processing.
❷ separate-words now uses a few functions from clojure.string. split takes a regular expression
that we can use to divide the string by upper case letters. Compare this version with the previous one
using reduce: this is easier to read and understand.
❸ We now capitalize each word and finally join everything together in a new string.
❹ format-decimals delegates almost completely to "clojure.pprint/cl-format" which does all the work of
formatting decimals.
❺ “zipmap” brings in another dramatic change in the way we process the map. We can isolate changes
to the keys (composing word separation and removing the unwanted ":") and changes to the values
into two separate map operations. “zipmap” conveniently combines them back into a new map
without the need for reduce or “assoc, assoc-in and dissoc”.
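The zipmap step from ❺ can be sketched in isolation, assuming a parsed map like the one in the example (a simplified illustration, not the book's exact code):

```clojure
(let [m {:accountId "3764882" :currentBalance "80.12389"}]
  ;; Keys and values are transformed independently, then zipmap
  ;; recombines them into a new map.
  (zipmap (map name (keys m))   ; key transformation: drop the leading ":"
          (vals m)))            ; values pass through unchanged here
;; => {"accountId" "3764882", "currentBalance" "80.12389"}
```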
The second example shows an important fact about "knowing your tools" (in this case
the Clojure standard library): the use of a different set of functions not only cuts the
number of lines from 45 to 30, but also opens up the design to completely different
decisions. Apart from the cases where we delegated entire sub-tasks to other functions
(like cl-format for decimals or name to clean a key), the main algorithmic logic took a
different approach that does not use reduce or “assoc, assoc-in and dissoc”. A solution
that is shorter and more expressive is clearly easier to evolve and maintain.
"Returns a lazy seq of the first item in each coll, then the second etc." is precise and
essential. It assumes you understand what a "lazy seq" is and leaves out details like
what happens with unevenly sized collections. You could further explore interleave by
typing examples at the REPL or, lacking ideas about what to type, search for snippets
on the Internet. Some of the background concepts are documented on the Clojure
website under the "reference" section (clojure.org/reference). The reference
documentation has been there since the beginning and follows the same essential
style as doc at the REPL. If you are a seasoned programmer with some functional
experience you’ll definitely be comfortable with that, but that’s not always the case for
Clojure starters. The recently introduced Clojure-Doc website at clojure-doc.org is the
beginning of a community-contributed effort more directed at "getting started".
Although clojure-doc.org is now here, multiple efforts started over the years to fill the
12
The first survey for 2010 is available here: cemerick.com/2010/06/07/results-from-the-state-of-clojure-summer-2010-
survey/. The last is available on the Cognitect blog:clojure.org/news/2019/02/04/state-of-clojure-2019
13
Here’s the request for help related to the open source release of the Clojure.org
website: clojure.org/news/2016/01/14/clojure-org-live
gaps left by the original documentation. The following is a summary of the other
resources available at the time of this writing:
• clojuredocs.org is a community-powered documentation engine. It basically offers
examples and notes on top of the standard library documentation including cross-
links. The quality of the documentation for a function varies from nothing to many
examples and comments.
• groups.google.com/forum/#!forum/clojure is the main Clojure mailing list.
Absolutely great threads are recorded in there, including topics discussing the
overall Clojure vision and design by Rich Hickey himself and the rest of the core
team.
• clojure-log.n01se.net contains the IRC Clojure channel logs. Like the mailing list,
it records some important discussions shaping the design of future Clojure releases.
• Books. The number of Clojure books written so far is impressive. People really
like to write books on Clojure and this book is no exception!
• stackoverflow.com/search?q=clojure Clojure-related questions are an amazing
source of great information. Almost any conceivable problem, philosophical or
practical, has been answered there.
• Blogs: too many good blogs to enumerate all here. Google is your entry point for
those, but a couple of always useful ones are "Jay Fields' Thoughts on Clojure"
at blog.jayfields.com/ and "Aphyr’s Clojure From the Ground Up" series
at aphyr.com/posts/301-clojure-from-the-ground-up-welcome.
As you can see, documentation exists in many forms and is overall very valuable, but it
is fragmented: jumping between all the different sources is time consuming, and
finding the right place to search is not always obvious. One of the main goals of
this book is to do that work on your behalf: bringing together all the valuable sources
of information in a single accessible place.
The following is a trick that works wonders for becoming a true Clojure master. Along
with learning tools like tutorials, books, or exercises like the Clojure Koans 15 ,
consider adding the following:
• Select a function from this book’s table of content every day. It could be lunch or
commuting time for example. Another option is to have this book on your desk
and randomly open up a page every once in a while.
• Study the details of the function sitting in front of you. Look at the official docs
first, try out examples at the REPL, search the web or www.github.com for
Clojure projects using it.
• Try to find where the function breaks or other special corner cases. Pass nil or
unexpected types as arguments and see what happens.
• Repeat the next day or regularly.
Don’t forget to open up the source for the function, especially if it belongs to the
"core" Clojure namespace. By looking at the Clojure sources, you have the unique
opportunity to learn from the work of Rich Hickey and the core team. You’ll be
surprised to see how much design and thinking goes into a function in the standard
library. You could even find the history of a function intriguing, especially if it goes
back to the origins of Lisp: “apply”, for example, links directly to the MIT AI labs
where Lisp was born in 1958! 16 Only by expanding your knowledge of the content
of the standard library will you be able to fully appreciate the power of Clojure.
15
github.com/functional-koans/clojure-koans
16
“eval” and “apply” are at the core of the meta-circular interpreter of Lisp fame. The whole Lisp history is another
fascinating reading on its own. See any paper from Herbert Stoyan on that matter
1.9 Summary
• The standard library is the collection of functions and macros that comes out of
the box by installing Clojure.
• The Clojure Standard Library is rich and robust, allowing developers to
concentrate on core business aspects of an application.
• Information about the Standard Library tends to be fragmented, but this book
collects everything in a single accessible place.
• Deep knowledge of the content of the Standard Library improves code
expressiveness exponentially.
• While the standard library is considered by many a passive resource to access in
case of a specific need, this book suggests the more interesting approach of
learning it systematically.
• A lot of effort has been put in this book to make what follows in Part II an
interesting and enriching experience, not just a dry list of specifications.
2
Unsurprisingly, a functional language is specifically good at providing
developers with tools and syntax support for creating and composing functions. This
chapter groups together the functions in the Clojure standard library that are dedicated
to manipulating or generating other functions. The chapter splits them into 4 broad
categories:
1. Function Definition. A function is the fundamental unit of composition in Clojure.
This section contains the main macros dedicated to declaring new functions.
2. Higher order functions. This section describes functions and macros whose main
goal is to produce new functions guided by a user-defined computation or other
existing definitions.
3. Threading macros. This important group of macros gives Clojure a visually
appealing syntax to describe processing pipelines.
4. Function execution. Finally, another group of functions dedicated to managing the
execution of other functions.
Other functions and macros exist that can be categorized using the same criteria, but in
this initial "fundamental" chapter, we concentrate on the most important ones while
others are described in other parts of the book.
standard library to define a function is defn. Additionally, Clojure offers other ways to
help modularize applications: “definline” improves performance during Java interop,
while fn is embeddable in other functions. There is an overlap with the macros described
later, but considering they introduce a small language of their own, they have been
given a dedicated chapter.
2.1.1 defn and defn-
macro since 1.0
defn (and its private version defn-) is one of the fundamental constructs and main
entry point for function creation in Clojure. It supports a rich set of features like
destructuring, multiple arities, type hinting, :pre and :post conditions and more
(via fn, which is closely related). The calling contract is like a small language in itself,
and defn is dedicated to parsing this little grammar. The most used form of defn is
probably the simple single-arity case:
(defn hello [person] ; ❶
(str "hello " person))
❶ A simple function definition. The function hello takes a string and returns a string.
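Beyond the single-arity case, defn also supports multiple arities and argument destructuring directly. A small illustrative sketch (not from the book; the function name is made up):

```clojure
(defn hello2
  ([] (hello2 "world"))              ; zero-arity delegates to the one-arity version
  ([person] (str "hello " person))
  ([{:keys [name]} greeting]         ; destructuring a map argument
   (str greeting " " name)))

(hello2)
;; => "hello world"
(hello2 {:name "Ada"} "hi")
;; => "hi Ada"
```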
defn works in conjunction with def (for interning its name in the current namespace)
and fn (for pre/post conditions and destructuring). Since defn is a macro, we can
call macroexpand on it to understand how it works:
(macroexpand ; ❶
'(defn hello [person]
(str "hello " person)))
;; (def hello
;; (clojure.core/fn ; ❷
;; ([person] (str "hello " person))))
(hello "people") ; ❸
;; "hello people"
❶ We can call macroexpand on the previous function definition to see how Clojure assembles the
creation of an anonymous function with a var definition in the current namespace.
❷ The lambda just created via “fn” is assigned to a new Var object "hello".
❸ The "hello" symbol is available for execution in the current namespace using surrounding parentheses.
CONTRACT
The contract for defn is quite elaborate. "fdecl", which comes after the function name,
can be further expanded into a list of arities (the different groups of arguments the
function can be called with), which in turn support type hinting (and
surrounding metadata). We are going to use a little (informal) grammar syntax to
describe it. Terms in angle brackets <> are further explained below:
(defn <tags> name <docstring> <metamap> ([arity1]) ([arity2]) .. ([arityN]) <metamap>)
• "tags" is an optional list of tags (in the form of ^:tagname1 ^:tagname2 separated
by spaces). Tags are stored along with the var created by the function definition.
• "name" is mandatory and must be a valid symbol 17 .
• "docstring" is an optional string that describes the function. The documentation
string is also stored in the var object resulting from the function definition. You
can see the doc string using the doc function.
• "metamap" is an optional map of key-value pairs. You can later use
the meta function to print metadata. For example (meta #'name) shows the
metadata attached to the var object "name". A similar "metamap" is also allowed
at the end of the function signature and before each argument vector.
• ([arity1]) ([arity2]) .. ([arityN]) are argument vectors of different
lengths. In the case of a single [arity], the wrapping parentheses are optional.
When we look inside an argument vector, we can see the following:
• "ret-typehint" is an optional type hint that applies to the return value for the arity.
"ret-typehint" can appear inside the "metamap" for that arity with equivalent
results.
• "arg-typehint" is an optional type hint for an argument in the argument vector.
• "body" contains the actual implementation of the function.
defn returns a clojure.lang.Var referencing the function object that was just created.
The function name becomes available in the current namespace without any additional
prefixing.
It’s worth noting that there are three places in defn to specify metadata. The
resulting var definition is going to merge all of them. We can see how it works in the
following (admittedly contrived) example:
17
See the main Clojure Reader documentation at clojure.org/reader for the definition of a valid Clojure symbol.
(meta #'foo) ; ❹
❶ The first place for metadata is right after the defn declaration. In this position, it requires the
caret ^ character.
❷ The second place comes after the documentation string and before the first arity declaration.
❸ The third and final option comes after all definitions.
❹ We can see that :t1, :t2, :t3 appear in the var metadata. Other context-dependent information
(like the namespace object or column/line information) might differ when printed from another REPL.
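Since the example code is only partially visible in this excerpt, a hypothetical foo matching the three metadata positions the callouts describe could be written as follows (a sketch; the book's original may differ):

```clojure
(defn ^:t1 foo        ; ❶-style: metadata before the name needs the caret ^
  "Adds two numbers."
  {:t2 true}          ; ❷-style: metadata map after the documentation string
  ([a b] (+ a b))
  {:t3 true})         ; ❸-style: metadata map after all (wrapped) arities

;; All three maps are merged into the var metadata.
(select-keys (meta #'foo) [:t1 :t2 :t3])
;; => {:t1 true, :t2 true, :t3 true}
```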
WARNING metadata at the end of the function only works if all arities (the argument vector followed by
the body) are wrapped in parentheses. It’s important to remember this aspect when the
function has a single arity, as this is commonly written without the pair of surrounding
parentheses. For instance, the example above works because ([a b] (+ a b)) is wrapped in
parentheses.
Examples
defn usage is of course widespread. The following examples illustrate some of its most
important aspects.
Documenting
It is good practice to attach a short documentation string to a function to describe its
purpose. Clojure provides a specific position for the documentation string, so the
compiler can store this information appropriately. You can then use
the clojure.repl/doc macro to print useful information about the function, including
the documentation string:
(defn hello
"A function to say hello" ; ❶
[person]
(str "Hello " person))
(clojure.repl/doc hello) ; ❷
;; ([person])
;; A function to say hello ; ❸
;; nil
❶ The documentation string appears just after the name of the function.
❷ We use doc, passing the name hello as a parameter.
❸ The documentation string prints on screen along with the function signature.
❹ Alternatively, we can extract the key :doc from the metadata.
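The alternative mentioned in the last callout is a one-liner on the var's metadata (hello repeated here to keep the snippet self-contained):

```clojure
(defn hello
  "A function to say hello"
  [person]
  (str "Hello " person))

;; The :doc key in the var metadata holds the documentation string.
(:doc (meta #'hello))
;; => "A function to say hello"
```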
(ns user)
(profilable/profile-me 500) ; ❸
(prepare-bench 'profilable)
(profilable/profile-me 500) ; ❹
;; Crunching bits for 500 ms
;; "Elapsed time: 502.422309 msecs"
(profilable/dont-profile-me 0) ; ❺
;; not expecting profiling
❶ The function profile-me in the "profilable" namespace has a :bench annotation that enters the
metadata map.
❷ prepare-bench searches for all the functions tagged with :bench in the given
namespace and wraps them into a new function that does profiling.
❸ Before prepare-bench is invoked, profile-me prints the expected message.
❹ But after invoking prepare-bench, profile-me also prints the elapsed time along with the
message.
❺ Other functions that were not tagged are unaffected.
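The prepare-bench function itself is not shown in this excerpt. A minimal sketch of how it might be implemented, assuming the behavior the callouts describe (the name and details are taken from the description, not the book's actual code):

```clojure
(defn prepare-bench
  "Wraps every public var tagged with :bench in the namespace named by
  ns-sym, so that each call to it is timed with clojure.core/time."
  [ns-sym]
  (doseq [[_ v] (ns-publics ns-sym)
          :when (:bench (meta v))]
    (alter-var-root v (fn [f]
                        (fn [& args]
                          (time (apply f args)))))))
```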
Pre/post conditions
The next example shows how to use pre- and post-conditions. Conditions are functions
with access to the arguments and, in the case of post-conditions, the return value. Clojure inspects the
metadata map for the argument vector (or the body, see the contract section) in search
of :pre or :post keys. When :pre or :post keys exist, their value must be a collection
of predicates. Predicates are invoked before or after function execution, respectively.
The following save! function saves an item to some storage. Before pushing it to
storage, it checks a few facts about the input using pre-conditions. After saving to
storage, it verifies that the item has the correct ":id":
(defn save! [item]
{:pre [(clojure.test/are [x] x ; ❶
(map? item) ; ❷
(integer? (:mult item)) ; ❸
(#{:double :triple} (:width item)))] ; ❹
:post [(clojure.test/is (= 10 (:id %)))]} ; ❺
(assoc item :id (* (:mult item) 2)))
;; FAIL in () (form-init828.clj:2) ; ❻
;; expected: (integer? (:mult item))
;; actual: (not (integer? "4"))
;;
;; FAIL in () (form-init828.clj:2)
;; expected: (#{:double :triple} (:width item))
;; actual: nil
;;
;; AssertionError Assert failed:
;; (clojure.test/are [x] x (map? item) (integer? (:mult item))
;; (#{:double :triple} (:width item))) user/save!
;; FAIL in () (form-init8288562343337105678.clj:6)
;; expected: (= 10 (:id %))
;; actual: (not (= 10 8))
;;
;; AssertionError Assert failed:
;; (clojure.test/is (= 10 (:id %)))
❶ "clojure.test/are" groups together multiple assertions. The assertions in this example are all expected
to return logical false (including nil) if the assertion fails.
❷ This predicate checks that item is of type map. Note that the argument "item" is available in pre- and
post-conditions.
❸ Similarly, this predicate checks that the value for the key :mult is of type integer.
❹ Set inclusion is used to verify the value of the :width key belongs to a small enumeration of allowed
values.
❺ Post-conditions work similarly, with the addition of the % placeholder to access the
returned value from the function. In this case we check that the returned map contains an id equal to 10.
❻ Failing assertions are nicely printed because of the clojure.test functions. clojure.test is part of the
standard library.
❼ In the next attempt, we fix pre-conditions, but we have a problem with post-conditions.
❽ We can finally see a successful call to save!.
NOTE The example demonstrates a useful trick: wrapping pre- and post-conditions in the
clojure.test/is or clojure.test/are macros. The conditions still fail with
java.lang.AssertionError, but the clojure.test wrappers show a much nicer message.
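Without the clojure.test wrappers, plain predicates work the same way, just with terser error messages (a simplified sketch; the relaxed :post condition is an assumption for illustration):

```clojure
(defn save2! [item]
  {:pre  [(map? item) (integer? (:mult item))]  ; checked before the body runs
   :post [(integer? (:id %))]}                  ; % is the return value
  (assoc item :id (* (:mult item) 2)))

(save2! {:mult 4})
;; => {:mult 4, :id 8}

;; (save2! {:mult "4"})
;; throws AssertionError: Assert failed: (integer? (:mult item))
```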
Type hinting
Type hints are the bridge between the dynamically-typed world of Clojure (where
almost everything is treated as a generic java.lang.Object) and the statically-typed
world of Java. Type hints in Clojure are optional in most cases, but they are required
when speed is important (other common tips include disabling checked math, using
primitive unboxed types, using transients and many other techniques depending on the
specific case).
Type hints are usually required when Clojure functions call into Java methods. The
Clojure compiler uses type information to avoid reflection in generated Java bytecode.
Reflection is a very useful (but slow) Java API to discover and invoke Java methods
needed by the Clojure runtime.
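A tiny demonstration of the mechanism, independent of the signing example below (function names are made up for illustration):

```clojure
;; Ask the compiler to warn whenever it falls back to reflection.
(set! *warn-on-reflection* true)

(defn shout [s]
  (.toUpperCase s))        ; warns: type of s is unknown, reflection is used

(defn shout-hinted [^String s]
  (.toUpperCase s))        ; the ^String hint lets the compiler emit a direct call

(shout-hinted "fast")
;; => "FAST"
```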
To illustrate the point, the following example is about signing a request using a secret
key. The Java standard library contains everything we need for this task, so no external
libraries are required. The idea of signing a request is the following:
1. There is some unique string representation of the event we want to sign. We are
(sign-request "http://example.com/tx/1")
;;
"http://example.com/tx/1?signature=EtUPpQpumBqQ5c6aCclS8xDIItfP6cINNkKJXtlP1pc%3D"
❶ Clojure provides the *warn-on-reflection* dynamic var to show where the compiler was unable to
infer the types.
❷ The sign function shows the steps required for the signature. We don’t need to go deep into the details
of the algorithm, but creating a sha256 hmac is a pretty common procedure 18
❸ sign-request takes a URL representing the transaction. The function returns the same URL with
the signature appended as one of the request parameters, ready to be sent across the wire.
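The warning mechanism mentioned in the first callout is easy to try at the REPL. The following is a minimal standalone sketch (the function names are illustrative, not from the book's listings):

```clojure
(set! *warn-on-reflection* true) ; ask the compiler to report reflective calls

;; Untagged parameter: the compiler cannot resolve .length at compile time
;; and emits a reflection warning for this line.
(defn strlen [s]
  (.length s))

;; Tagged parameter: .length resolves directly to String.length(), no warning.
(defn strlen' [^String s]
  (.length s))
```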
When we look at the output during compilation, Clojure prints something similar to the
following:
Reflection warning, crypto.clj:12:3 - call to method getBytes can't be resolved
(target class is unknown).
Reflection warning, crypto.clj:21:3 - call to method doFinal can't be resolved
(target class is unknown).
Source lines/column references might be different, but the message says that there are
at least two places where the compiler is unable to infer the types and is using
reflection. If in our example we assume peaks of 100k transactions per second, we
might want to review how sign-request is performing. Advanced tools like
Criterium 19 are always the suggested choice for benchmarking, but in this specific
case we can clearly see what happens just by using time:
(time (dotimes [i 100000] ; ❶
(sign-request (str "http://example.com/tx/" i))))
NOTE The elapsed time displayed here (and in other parts of the book) depends on the hardware
the benchmark is executed on, so it could display a different number on other machines. What
matters is the relative difference between instances of the benchmark, which should
be roughly the same independently of the hardware.
Let’s now add type hints to the function definitions highlighted by the compiler
warnings:
(defn get-bytes [^String s] ; ❶
(.getBytes s (StandardCharsets/UTF_8)))
❶ The parameter "s" was tagged as ^String so the following .getBytes is fully qualified.
18
More info about how to create a signature with sha256 can be found
here: security.stackexchange.com/questions/20129/how-and-when-do-i-use-hmac
19
Criterium is the de-facto benchmarking tool for Clojure: github.com/hugoduncan/criterium
❷ The parameter "mac" was tagged to be of type ^Mac. The other warnings from the compiler also
disappear, as .doFinal becomes fully qualified as well by inference.
❸ After adding the two type hints, we are able to cut the processing time down by 50%.
As shown by the new measured time, we achieve better performance once the
reflection calls have been removed.
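The hinted listing for sign is not shown in this excerpt; the hint the second callout describes might look like the following sketch (the function name and body are illustrative):

```clojure
(import 'javax.crypto.Mac)

;; Illustrative only: tagging the mac parameter as ^Mac lets the compiler
;; resolve .doFinal to Mac.doFinal(byte[]) instead of using reflection.
(defn hmac-bytes [^Mac mac ^bytes data]
  (.doFinal mac data))
```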
(defn a [a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15 a16 a17 a18 a19 a20 a21])
The above results in a compile-time exception. The limit might seem arbitrary or restrictive, but the
rationale behind the choice is simple: Clojure pays a great deal of attention to speed, and there are
compiler optimizations that greatly benefit from having a specific Java method for each parameter
count. There are several places in the Clojure codebase where this is visible 20 and of course such code is not
easy to read, maintain or evolve.
Apart from the compiler implementation details, any function with more than three or four parameters
should look suspicious. Too many parameters should raise the question of whether there is a missing abstraction
that groups them together.
See also
• fn is used under the hood by defn to generate the body of the function and
implement destructuring. Unlike defn, fn does not create a var object or
alter the current namespace as a side effect. Thus fn is a better choice for local use
of functions that don't need an external name. fn is often used with
sequential operations such as reduce to create an anonymous function of two
arguments.
• definline creates a defn definition but also includes an inlined version of the
function body to improve Java interoperation. Consider using definline for
performance-sensitive functions if the function body does little more than
wrap a Java method call.
• letfn is syntactic sugar for an anonymous function definition associated with
a let binding. Prefer letfn to create one or more named local functions.
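As a quick illustration of the letfn suggestion above (a minimal sketch, not from the book's listings):

```clojure
;; Two named local functions; quadruple can call twice by name.
(letfn [(twice [x] (* 2 x))
        (quadruple [x] (twice (twice x)))]
  (quadruple 10))
;; => 40
```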
Performance considerations and implementation details
defn is a macro with an impact primarily on Clojure compilation time. Common usage
of defn should not generate concerns during program runtime. The definition of defn
happens quite early during the bootstrap of the standard library, when most of the common
Clojure facilities are not yet defined. This aspect, along with the complexity related to
inlining and type hints, makes the defn sources not easy to follow.
20
For places in the Clojure source where the group of 20 Java methods is visible see for example: clojure.lang.IFn
2.1.2 fn
macro since 1.0
❶ Please refer to defn for an extended description of the features supported while declaring a new function.
fn creates a new function and supports important features like destructuring, type hints,
pre- and post-conditions (illustrated in defn) and multiple signatures based on the
number of parameters (or "arities" as they are commonly called in Clojure
documentation). fn functions are available immediately: you can pass them as
arguments or bind them locally.
Function objects (also known as lambdas) are so common in functional programming
that Clojure offers a special reader syntax for them (the reader macro #()). The
following example shows the same function created with fn and with the shortcut reader
syntax:
((fn [x] (* (Math/random) x)) ; ❶
(System/currentTimeMillis)) ; ❷
;; 1.314465483718698E12
(#(* (Math/random) %) ; ❸
(System/currentTimeMillis))
;; 1.2215726280027874E12
CONTRACT
Along with a few other functions and macros in the standard library, fn has quite an
articulated signature that resembles a little grammar on its own. The following
informal contract shows the most important features of fn (check the examples below
21
The Unix Epoch time is a system to measure relative time: en.wikipedia.org/wiki/Unix_time
arities :=>
<metamap> [arity] <body>
OR
(<metamap> [arity1] <body>)
(<metamap> [arity2] <body>)
[..]
(<metamap> [arityN] <body>)
arity :=>
[<arg1-typehint> arg1
[..]
<argN-typehint> argN]
body :=>
<body-metamap> <forms>
body-metamap :=>
{:pre f1 :post f2 :tag tag1 :k :v}
• "<name>" is an optional symbol that binds the generated function inside the local
scope of the function itself. The name allows the function to be recursive (see
examples below).
• "arities" is a list of 1 or more arity declarations (for example, the function (fn ([]
"a") ([x] "x")) contains two "arities", of zero and one argument). Each arity
allows for an optional metadata map, followed by a mandatory vector of
arguments and an optional body. In the case of a single arity, the wrapping parentheses
can be omitted. The content of each vector can be plain symbols or more
complex destructuring expressions.
• "<metamap>" is an optional map of keyword-value pairs that is merged into the
function metadata. It might contain type hints, pre-/post-conditions or custom
metadata. When attached to the arguments vector, the metadata needs to use the
special reader syntax ^{:k :v}.
• "<body-metamap>" optionally appears at the beginning of the body and is similar
to the other <metamap> (although this one doesn’t need the initial "^" caret
symbol).
• "arity" is the content of the argument vector. Except for the name, each argument can
be individually type-hinted.
• <body>, when present, contains the actual function instructions. It is implicitly
wrapped in a do block. It is assumed to be nil when there is no body. When the
body contains forms at the same level (not nested) and the first is a Clojure map,
the map is used as metadata. When both the argument vector and the body contain
the metadata map, the last one in the body takes precedence in case of clashing
keys.
• returns: the function object that was just created.
Examples
fn is the minimum common denominator for all function-declaring functions and macros. For
instance, type hints given for arguments in a function declared with defn are processed
by fn under the hood. Although implemented in fn, type hints and pre- and post-
conditions usually appear in defn declarations. The reader is invited to
check the defn examples as well for what is not present in this section.
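For instance, a body-level metadata map carrying a pre-condition can be sketched as follows (the names are illustrative):

```clojure
(def positive-double
  (fn [x]
    {:pre [(pos? x)]} ; body metamap: verified before the body runs
    (* 2 x)))

(positive-double 3)
;; => 6
;; (positive-double -1) throws java.lang.AssertionError
```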
Named recursion
The first example demonstrates a possible use of the optional name, which binds the
function inside its own innermost scope. It can be used, for example, in the
recursive definition of a basic Fibonacci 22:
((fn fibo [n] ; ❶
(if (< n 2)
n
(+ (fibo (- n 1))
(fibo (- n 2)))))
10)
;; 55
FN AND DESTRUCTURING
The function literal syntax #() is quite idiomatic in Clojure, but there are cases in
which the features it provides are not sufficient: destructuring, for example, is not
available with function literal syntax. The following example shows a hash-map being
transformed into another by applying a mix of key and value changes. Instead of using
the concise but limited #() function reader literal, we make the lambda explicit
with fn to introduce destructuring:
(def sample-person
  {:person_id 1234567
   :person_name "John Doe"
   :image {:url "http://focus.on/me.jpg"
           :preview "http://corporate.com/me.png"}
   :person_short_name "John"})
22
The popular Fibonacci series is often used to show implementations of recursive calls. For more information see
Wikipedia: en.wikipedia.org/wiki/Fibonacci_number
(def cleanup ; ❶
{:person_id [:id str]
:person_name [:name (memfn toLowerCase)]
:image [:avatar :url]})
❶ cleanup is a mapping between input key names and a vector pair. The pair contains the new name of
the key in the output map and a function to apply to transform the value. For example the first key
says that :person_id should be renamed into :id and the str function should be applied to the value.
❷ The transform function takes an input map orig and the mapping rules as arguments (sample-
person and cleanup are the instances used in the example). The map function is used here to apply
all the transformation rules. By using fn we can destructure the content of cleanup, which would not be
possible if we used the special reader form #().
Without destructuring, the fn lambda would be polluted with first or second calls to
access the vector elements, as shown by the following rewrite of the transform
function:
(defn transform [orig mapping] ; ❶
(apply merge
;; prefer destructuring instead of this
(map (fn [rules]
(let [k (first rules)
k' (first (second rules))
f (second (second rules))]
{k' (f (k orig))}))
mapping)))
❶ Rewrite of the transform function to illustrate how many repetitions of first and second are
necessary when not using destructuring.
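The destructuring version of transform the callouts describe is not shown in this excerpt; based on that description, it might look like the following sketch:

```clojure
;; Illustrative reconstruction: each entry of mapping is destructured
;; as [k [k' f]] directly in the fn argument vector.
(defn transform [orig mapping]
  (apply merge
         (map (fn [[k [k' f]]]
                {k' (f (k orig))})
              mapping)))
```

With sample-person and cleanup defined as above, (transform sample-person cleanup) yields a map with the renamed keys and the transformed values.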
Higher-order functions are functions that can accept other functions as parameters or return functions to
their callers. A language needs to support functions as first-class objects so they can be
sent around as "data" to other functions. The way function objects are created differs from language
to language, but historically they have been named lambdas (from the Lambda Calculus, the first widely
adopted formalized notation for mathematical functions, introduced by Alonzo Church around 1930 23).
Some languages even use lambda as a keyword to stress that connection. Clojure doesn’t have
a lambda keyword, but fn is definitely Clojure’s lambda implementation.
Referential transparency guarantees that the return value of a function is only dependent on its
parameters and nothing else. Functional languages that enforce referential transparency at some level,
often get a number of other features as a consequence: laziness, immutable values, infinite sequences
and so on. Clojure is definitely part of the group of mainstream functional languages supporting all of the
above.
See also
• fn* is a slight variation of fn that also performs "locals clearing" after the first
invocation. Please refer to the documentation of fn* for more information.
• defn is obviously related to fn. The main difference is that defn is designed to
"intern" the function object to the enclosing namespace through a var object. You
should probably think of refactoring an fn definition out into a defn every time
there is some chance of reuse by other functions.
• identity is an example of a standard library function equivalent to the anonymous one-argument function (fn [x] x).
Performance considerations and implementation details
Similarly to defn, fn processing mainly happens at compile time, so it’s not usually a
concern in terms of runtime performance. Unlike defn, fn does not have the side
effect of creating a var definition that is then added to the mappings of the current
namespace.
2.1.3 fn*
(Thanks Nicola Mometto for contributing this section)
special form since 1.0
❶ Please refer to defn for an extended description of the features supported while declaring a new function.
fn* is the special form underlying the “fn” macro. It supports fewer features, lacking for
example support for pre- and post-conditions (or destructuring). The main goal
of fn* is memory optimization: fn* has unique support for creating closure objects
with run-only-once guarantees.
A normal lambda created by fn could be referenced in multiple places (which is
usually the case in large applications) and re-used as needed. The Clojure compiler
cannot keep track of all the references to the lambda, so after an execution the lambda
(and its internal state in the related generated Java class) needs to stay around
for potential new executions. But there is a certain class of lambdas that are known in
advance to run just once: this is the case, for example, of the bodies passed to the delay
or future macros (future runs its body in an external thread).
23
From the abundant literature available on the subject, I suggest this gentle introduction to the Lambda
Calculus: www.cs.bham.ac.uk/~axj/pub/papers/lambda-calculus.pdf
These threads are often kept alive in a thread pool and, with them, the function objects
they ran. The function object, in turn, could be holding a reference to arbitrarily large
data, even after the function has returned its result. fn* ensures that the references
the function holds are set to nil after the result returns. This is also an important feature
when writing macros that delegate to wrapping functions (quite a common
idiomatic pattern), in order to avoid retaining memory for longer than is actually needed.
CONTRACT
Refer to the contract of “fn”, keeping in mind the only two differences:
• It has no support for the various metadata maps that fn accepts.
• It assigns special compile-time meaning to forms where the fn* symbol
has ^:once metadata (while fn doesn’t support this feature).
Examples
We’ll only showcase the unique "once-only" feature of fn*; for all other examples and
usages, refer to “fn” and refrain from using fn* directly.
It is both a common pattern and a good practice to implement macros by delegating to
a function version, wrapping the unevaluated bodies in an anonymous
function 24. This has several advantages:
• It makes it easier to understand the implementation of the macro
• By providing a function version it improves its composability and power, since it
makes the functionality also available for runtime use rather than just as a
compile-time feature.
This exact pattern appears in several places in clojure.core itself: future is a macro
that delegates to the future-call function using the same technique just described:
(defmacro future [& body] ; ❶
`(future-call (^{:once true} fn* [] ~@body))) ; ❷
future is going to execute the body at some later point in time in a separate thread. The
other important aspect of future's design is that the body is meant to be executed only
once (that is, the thread is supposed to run once and never be re-scheduled). So, as the
authors of a macro like future, we know already that the resources used by the function,
once executed, can be claimed back by the JVM. We basically have the power to tell
the Clojure compiler that once the body has executed, every reference to the lambda in
the compiled code can be set to null, allowing the JVM to claim resources back as
soon as possible. This is an important memory optimization done by the Clojure
compiler called "locals clearing" 25.
24
This is called creating a thunk: en.wikipedia.org/wiki/Thunk
By simply replacing fn with ^:once fn* (thus promising the compiler that the body will
never be executed more than once), we allow the compiler to perform the locals-
clearing optimization and avoid potential memory leaks 26.
The reader is invited to review future-call where the book explores an example
showing the effect of locals clearing.
See also
• “fn” is the macro that should always be used over fn* unless you need
the ^:once feature.
• future wraps an expression in a fn* function of no arguments with once-only
semantics.
Performance considerations and implementation details
Similarly to defn and fn, fn* has very little impact at runtime, as the actual
generation of the function happens at compile time. For this reason the user
should not be concerned with fn* when searching for performance improvements.
fn* is a special form, which means that its implementation is a "given" while the
compiler is executing. For Clojure in particular, this means that the fn* implementation
only exists as Java code.
Listing 2.4. fnil: function generation, parameter handling, default argument values
(fnil
([f default1])
([f default1 default2])
([f default1 default2 default3]))
fnil generates a new function starting from another input function "f". The main use
case for fnil is to decorate "f" so that it can default to optional values in case the input
is nil. fnil operates positionally: "default1" will be used for a nil passed as the first
argument, "default2" for a nil passed as the second argument and "default3" for a nil
passed as the third argument. fnil doesn’t support more than 3 defaults, so (fnil + 1 2 3 4)
causes an exception to be thrown.
25
Rich Hickey describes this feature quite extensively in the following mailing list
post: groups.google.com/forum/#!topic/clojure/FLrtjyYJdRU
26
See also Christophe Grand, who describes this type of memory leak in his blog: clj-me.cgrand.net/2013/09/11/macros-
closures-and-unexpected-object-retention/
CONTRACT
• "f" can be a function of any number of arguments returning any type.
• "default1,default2,default3" are the default values that should be used if the
generated function receives a nil as its first, second or third argument
respectively.
Examples
fnil's main use case is to wrap an existing function that doesn’t handle nil arguments
the way we want (for example, it could even throw an exception). fnil replaces
the nil input with a given default, and the default is handed to the original function in
turn.
One example of exceptional behavior in the presence of nil is inc, the simple function
that increments a number. We could use fnil to define an alternative behavior if, for
any reason, inc is given nil as input. In the following example, we want to update the
numerical values in a map with update 27:
(update {:a 1 :b 2} :c inc) ; ❶
;; NullPointerException
❶ We try to update the ":c" key in a map, but without knowing the content in advance, we don’t
know if the map contains the key or not. inc fails badly if the input is nil, which is what happens in this
case.
❷ We can use fnil to wrap the nil argument case for inc. If inc is given a nil, fnil replaces
the nil with 0, which is then given to inc.
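The call the second callout describes might be as simple as this sketch:

```clojure
;; fnil replaces the nil found at :c with 0 before inc runs.
(update {:a 1 :b 2} :c (fnil inc 0))
;; => {:a 1, :b 2, :c 1}
```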
A typical unpredictable value (especially for a web application) is a string coming from
an input form. fnil can be handy in this case. In this example, an input form is
transformed into the request-params map:
(require '[clojure.string :refer [split]])
(def request-params ; ❶
{:name "Jack"
:selection nil})
27
This is the original use case for fnil as documented by this thread in the Clojure mailing list:
groups.google.com/d/msg/clojure/mcxKa_5mWm4/CkSrutnPUfIJ
❶ request-params simulates the content of a web form already transformed into a Clojure data
structure. Some parameters are structured, like ":selection" which is a comma separated string.
❷ as-nums is designed to take the ":selection" parameter, split it into a list of strings and convert those
strings into numbers.
❸ Unfortunately, the user on the web page didn’t fill out ":selection" as expected (or something else went
wrong), producing a nil selection.
The :selection key is normally a comma separated list of numbers, but it could result
in nil if the user doesn’t fill in the related input field. as-nums throws an exception in
case of a nil selection, because it calls split on a null string. We can wrap as-
nums with fnil to solve this problem:
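The as-nums listing is elided here; one possible shape consistent with the description (the parsing details are an assumption):

```clojure
(require '[clojure.string :refer [split]])

;; Illustrative: split a comma separated string and parse the numbers.
(defn as-nums [s]
  (map #(Long/parseLong %) (split s #",")))

;; Wrapping with fnil: a nil selection falls back to the string "0,1,2".
(def as-nums+ (fnil as-nums "0,1,2"))

(as-nums+ nil)
;; => (0 1 2)
```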
The new function as-nums+ handles the case by replacing nil (the result of retrieving
the :selection key from the parameters) with the string "0,1,2" (for this particular
example we are assuming that "0,1,2" is equivalent to "no selection"). Once defined,
the new as-nums+ can safely replace any old use of the normal, exception-throwing,
as-nums. fnil can operate similarly on the 2nd and 3rd arguments, for
example:
(require '[clojure.string :as string])
(def greetings
(fnil string/replace "Nothing to replace" "Morning" "Evening"))
❶ The example shows fnil handling nil arguments for replace, covering 3 potential exception-
throwing nil invocations.
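Here is another standalone sketch using all three defaults (the names are illustrative):

```clojure
;; Each nil positional argument is swapped for its default before subs runs.
(def safe-subs (fnil subs "" 0 0))

(safe-subs "hello" 1 3)
;; => "el"
(safe-subs nil nil nil)
;; => ""
```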
(+ 1 2 nil 4 5 nil) ; ❸
;; NullPointerException
(def zero-defaulting-sum ; ❹
(apply fnil+ + (repeat 0)))
The new function fnil+ accepts default values for nil in any position. map can take any number of
sequential collections to map over, which is handy in many situations. The first sequence "args" is the list
of proper arguments to the function. The second sequence passed to map is a concatenation of the
given "defaults" passed to fnil+ and any number of additional nil arguments to map over "args".
We also take advantage of map's laziness to cover a potentially infinite number of default arguments,
as illustrated by zero-defaulting-sum. The infinite sequence of zeroes created by (repeat
0) covers nil values for all the (potentially infinite) arguments to +.
The other important aspect to consider with this use of map is that it automatically stops mapping
when reaching the end of the shortest sequence. This is a great example of a function that in just 3 lines
shows much of the power available in Clojure. 28
See also
• some-> can be used to achieve an effect similar to fnil. Consider for
example (some-> nil clojure.string/upper-case): the form correctly
returns nil without throwing an exception. some-> might be a better choice if you
need to prevent a function of a single argument from throwing an exception.
However the some-> "default" is fixed and can’t be changed (it always
returns nil).
28
There is already an improved version of fnil proposed in this patch ready to be added to Clojure core.
(comp
([])
([f])
([f g])
([f g & fs]))
comp accepts zero or more functions and returns another function. The new function is
the composition of its inputs. Given for example the functions f1, f2 and f3, comp creates a new
function such that ((comp f1 f2 f3) x) is equivalent to (f1 (f2 (f3 x))). This
equivalence is the reason why comp apparently reads backwards, for example:
((comp inc +) 2 2) ; ❶
;; 5
In the example above, + appears last in the arguments but is the first one to apply.
NOTE when invoked with no arguments comp returns the “identity” function. This is helpful in
situations where the list of functions to compose is dynamically generated at runtime and
potentially empty. Instead of dealing with the error case, comp will gladly accept an empty list
of arguments.
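The zero-argument behavior can be sketched like this:

```clojure
;; A dynamically built (and here empty) list of transformations.
(def pipeline (apply comp []))

(pipeline {:a 1})
;; => {:a 1}   ((comp) with no arguments returns identity)
```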
CONTRACT
Input
With the exception of the rightmost function (which can take any number of arguments),
all other argument functions must accept a single argument. In the case of (comp f g
h) for example, "h" is the only function that can accept multiple parameters, while "f"
and "g" receive a single argument.
Notable exceptions
IllegalArgumentException when any of the input functions (except the rightmost) does not
support a single-argument call.
Output
comp returns a function accepting the same number of arguments as the rightmost input
function, representing the functional composition of all the input functions.
Examples
A concatenation of functions is the main use case for comp. Consider the following
example, where we work out how many stamps we need to buy to send letters to
different destinations:
(require '[clojure.string :refer [split-lines]])
(def mailing ; ❶
[{:name "Mark", :label "12 High St\nAnchorage\n99501"}
{:name "John", :label "1 Low ln\nWales\n99783"}
{:name "Jack", :label "4 The Plaza\nAntioch\n43793"}
{:name "Mike", :label "30 Garden pl\nDallas\n75395"}
{:name "Anna", :label "1 Blind Alley\nDallas\n75395"}])
(postcodes mailing)
;; ("99501" "99783" "43793" "75395" "75395")
❶ The input is in the form of a vector of maps, a common format to transfer data with similar structure
but different values.
❷ The function postcodes returns a list of (potentially repeating) postcodes after parsing the content of
the :label value. Note that the body of the function contains 4 nested calls to other functions
(map, last, split-lines and the key ":label" used as a function).
❸ We can use frequencies to count the number of occurrences of each postcode.
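The postcodes listing itself is elided from this excerpt; based on the callouts (four nested calls), it might look like the following sketch:

```clojure
(require '[clojure.string :refer [split-lines]])

;; Illustrative reconstruction: the postcode is the last line of each label.
(defn postcodes [mailing]
  (map (fn [entry] (last (split-lines (:label entry)))) mailing))
```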
29
This style of composition is also called point-free style
❷ After the changes we make sure that the results are the same as before.
By using comp we added emphasis on the sequence of transformations. This is the effect
of removing parentheses, which in turn allows for a natural vertical alignment. Note that
the use of comp in this case is possible because all the functions take one parameter.
comp is also the main construct to compose transducers. Here’s the same postcodes
seen before written using transducers:
(defn postcodes [mailing] ; ❶
(sequence (comp ; ❷
(map :label)
(map split-lines)
(map last))
mailing))
Note the reverse order of the transducing functions compared to the previous version
of postcodes using map instead of sequence. This is an effect of how transducers are
implemented, but the results are the same.
In the following example, we add a step to the transformations to remove Alaska from
the list of postcodes and we prevent duplicates in the final output. Note that thanks to
composition, we can add transformations using a more appealing vertical alignment:
(require '[clojure.string :refer [starts-with? split-lines]])
(unique-postcodes mailing)
;; ("43793" "75395")
❶ The new unique-postcodes function removes Alaska from the list and removes duplicates.
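The unique-postcodes listing is not shown here; one plausible shape, assuming Alaska postcodes are those starting with "99", is:

```clojure
(require '[clojure.string :refer [starts-with? split-lines]])

;; Illustrative reconstruction: drop Alaska ("99…") and de-duplicate.
(defn unique-postcodes [mailing]
  (distinct
   (remove #(starts-with? % "99")
           (map (comp last split-lines :label) mailing))))
```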
See also
• juxt is another function generator. It doesn’t compose functions like comp does,
but executes them independently and collects the results. Use juxt when the input
functions operate independently on the input.
• sequence accepts composition of transducers as demonstrated by the examples.
• transduce is the other transducing function that appears frequently with comp.
Performance considerations and implementation details
(complement [f])
❶ A simple example of using complement to invert the meaning of checking if a value is an integer.
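The snippet the callout refers to is not shown in this excerpt; it might be as simple as:

```clojure
;; complement flips the boolean meaning of integer?.
((complement integer?) 1)   ;; => false
((complement integer?) "a") ;; => true
```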
CONTRACT
Input
• "f" is a function of any number of arguments returning any type.
Output
• returns: a function of any number of arguments returning boolean true or false.
Examples
complement takes advantage of the fact that everything in Clojure has an extended
boolean meaning and always returns either true or false:
((complement {:a 1 :b 2}) :c) ; ❶
;; true
❶ The example shows how to invert the meaning of validating the presence of a key in a map. If :c is
not present in the map, it returns true.
❷ However, we should pay attention when using complement in the presence of nil values. In this second
case :b is present in the map but its value is nil.
complement should be used with care in the presence of nil, as demonstrated by the
example above. A similar scenario is possible with the idiomatic use
of seq to determine if a sequence is empty or not. Assuming we didn’t know about the
existence of empty? or not-empty, we could write the following:
(defn not-empty? [coll]
((complement empty?) coll))
(not-empty? ()) ; ❶
;; false
❶ A problematic not-empty?. You should rather use the standard not-empty (no question mark)
instead.
However, if the presence of nil in the input is under control, we could express that an
item does not belong to a set in a very concise way:
(filter ; ❶
(complement #{:a :b :c})
[:d 2 :a 4 5 :c])
;; (:d 2 4 5)
❶ A concise way to filter all items that don’t match a set of values.
❷ The approach assumes the complemented set does not contain nil as one of the values to remove.
In that case, it won’t be able to remove nil from the input.
complement offers the possibility to derive a function from a negated function. We are
unable to do the same with not, which is a boolean operator. Here’s, for example, a way
to express typical opposites like "left" and "right" in terms of each other:
(defn turning-left? [wheel]
(= :left (:turn wheel)))
(def turning-right?
(complement turning-left?)) ; ❶
❶ Some implementation details have been removed from the implementation of remove as it appears in
the standard library.
See also
• not does not produce a function but just inverts the boolean meaning of its
argument.
Performance considerations and implementation details
(constantly [x])
constantly generates a function that always returns the same result, independently
of the number and type of arguments it is called with. The generated function always
returns constantly's initial argument as its only answer.
CONTRACT
Input
• "x" a mandatory argument of any type used as the returned result from the
generated function.
Output
• returns: a new function of 0 or more parameters of any type.
Examples
constantly can be used in all those situations where an updating function is required
but the new value doesn’t depend on the old. There are many such updating
functions in the standard library. update, for example, takes a map, a key and a
function. The function receives the old value at the key and is expected to use that
value to compute the next.
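When the new value does not depend on the old one, constantly is a natural fit for update (a minimal sketch):

```clojure
;; The old value (60) is ignored; (constantly 62) always returns 62.
(update {:volume 60} :volume (constantly 62))
;; => {:volume 62}
```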
The following example implements a quantize-volume function to calculate the
average volume in a collection of musical notes. The sound expressiveness is
captured by both the :volume and the :expr keys:
(def notes [{:name "f" :volume 60 :duration 118 :expr ">"}
{:name "f" :volume 63 :duration 120 :expr "<"}
{:name "a" :volume 64 :duration 123 :expr "-"}])
(quantize-volume notes) ; ❼
❶ process-note takes a note and a map of functions. Clojure maps support sequential access and can
be used as input for reduce.
❷ update-note is locally bound with letfn. It defines the reducing function used by reduce in the
following line. Apart from destructuring the second argument, it applies update on a note with the
given key and function.
❸ process-note applies update multiple times (once for each updating function passed in
the "fns" map). Since Clojure “hash-map” instances are persistent data structures, we need to make sure that the
updated note each function produces is the input of the next updating
function. reduce implements exactly the updating semantics we are looking for, making sure each
intermediate step is passed as input to the following one. Our "initial value" for reduce becomes "the note"
and we start the update chain from there.
❹ quantize-volume's main goal is to prepare the input functions for update and apply them to all the
notes.
❺ Each note has a :volume key and we want all volumes to be the average. constantly is a good
choice here: we need the same "average" value for all notes and a function wrapper that returns that
value.
❻ :expr key needs the old value to determine the new value, so we pass in a function of the
old expressiveness to the new.
❼ When we finally process the notes, we can see that the maps are updated as expected,
with :volume updated to the average and :expr updated in relation to the volume being above or
below the average.
Another use of constantly is for "stubbing" function calls in testing. A good setup for
a test isolates the function under test from less predictable behavior (such as network
requests) by providing stubbed responses. A stubbed response is also useful to control a
particular aspect of the function under test, so its behavior can be verified.
with-redefs is often used in conjunction with constantly for this purpose:
(ns book.unit
  (:require [clojure.test :refer [deftest testing is]]))

(defn third-party-service [] ; hypothetical: stands in for an unpredictable network call
  {:a (rand)})

(defn fn-depending-on-service [s] ; hypothetical function under test
  (str s (count (mapcat identity (third-party-service)))))

(deftest test-logic
  (with-redefs [third-party-service (constantly {:b "x"})] ; ❸
    (testing "should concatenate 2"
      (is (= "s2" (fn-depending-on-service "s"))))))
WARNING With direct linking 30 switched on, with-redefs stops working. with-redefs relies
on var indirection to temporarily swap the function implementation. When Clojure is compiled
with direct linking, the var’s content is inlined directly and cannot be changed.
See also
• identity also returns the argument that is passed in as a parameter,
but it returns the value directly rather than wrapping it in a function. identity is often
used with goals similar to constantly, with the restriction that identity
accepts exactly one argument instead of many.
• with-redefs is used often in conjunction with constantly to generate stubbed
responses while testing.
Performance considerations and implementation details
2.2.5 identity
function since 1.0
(identity [x])
identity is a little function in the standard library. It just returns its single argument as
output:
(identity 1)
;; 1
Despite the apparent simplicity, there are many cases in which identity can be useful
(see the examples section). identity derives its name from the equivalent mathematical
function 31.
30
Please see the official Clojure documentation about direct-linking here: clojure.org/reference/compilation#directlinking
Input
• "x" is the only mandatory argument and can be of any type.
Output
• returns: the argument that was passed as input.
Examples
The first example illustrates an idiomatic way to transform a map into a flat sequence
of keys and values. It is a one-liner with a single function call; all other options
would require a second function call:
(mapcat identity {:a 1 :b 2 :c 3}) ; ❶
;; (:a 1 :b 2 :c 3)
mapcat iterates and concatenates at the same time. Since iterating over a “hash-map”
produces a sequence of vectors containing key-value pairs, we just need the identity
transformation before concatenating all vectors together.
Identity can also be used as "noop" (contraction for no-operation) to provide a
function when one is required but without producing any effect. One useful case is
when we need to filter all logical false elements from a sequence (anything that is
either nil or false):
(defn custom-filter [x] ; ❶
  (if (or (= x nil) (= false x))
    false
    true))

(filter custom-filter [0 1 2 false 3 4 nil 5]) ; ❷
;; (0 1 2 3 4 5)

(filter identity [0 1 2 false 3 4 nil 5]) ; ❸
;; (0 1 2 3 4 5)
❶ custom-filter implements what we want to achieve in a very verbose way: it ignores
the fact that Clojure accepts any value as logical true/false, so it is not considered idiomatic.
❷ Shows that custom-filter works as expected, filtering out all unwanted nil and false from the
sequence.
❸ The same result can be achieved without a custom function using identity. The reason this works is
that values like false or nil are part of Clojure logical false definition. filter works
31
A wikipedia article introducing the identity function concept en.wikipedia.org/wiki/Identity_function
using nil or false as markers for what items should or should not be in the final
results. identity passes values as they are, so filter can use them directly.
The following example shows how to use identity with some to retrieve the
next logical true element from a collection. A list of cashiers in a supermarket is
marked available by adding a number to a vector at the corresponding index. As soon as
the customer picks a lane, the cashier becomes busy and we need to update the value at
the index so no other customer can pick the same lane. To avoid concurrent read/write
of the cashier line we use a ref, one of the concurrency primitives in Clojure. By using
a ref we can check the availability and book the lane in a single transaction:
(def cashiers (ref [1 2 3 4 5])) ; ❶

(defn next-available [] ; ❷
  (some identity @cashiers))

;; make-available! and make-unavailable! are not shown in the original
;; listing; these are plausible definitions consistent with the notes below:
(defn make-unavailable! [lane] ; ❸
  (alter cashiers assoc (dec lane) false)
  lane)

(defn make-available! [lane] ; ❸
  (dosync (alter cashiers assoc (dec lane) lane)))

(defn book-lane [] ; ❹
  (dosync
    (if-let [lane (next-available)]
      (make-unavailable! lane)
      (throw (Exception. "All cashiers busy!")))))

(book-lane) ; ❺
;; 1
(book-lane)
;; 2
(make-available! 2)
@cashiers
;; [false 2 3 4 5]
❶ cashiers contains a vector initialized with numbers (representing free cashiers lanes). The vector is
wrapped by a ref.
❷ next-available uses identity and some on the vector of cashiers. It returns the first true result,
or nil after reaching the end of the vector. Note that next-available is a read-only operation on
the ref that doesn’t need an explicit transaction context.
❸ make-available! and make-unavailable! take a number as argument and add or remove the
element at that index. This effectively marks the cashier available or not, because marking "not
available" adds a false in the vector at that index causing next-available to continue the search.
❹ book-lane coordinates searching for the next available cashier and booking a lane. dosync needs to
wrap both read/write operations to be effective, as other customers might be trying to use the same
lane simultaneously. In case there are no more lanes available, book-lane throws an exception.
❺ We can see a quick simulation of the system by booking and releasing a few lanes.
identity can be used with partition-by when we are interested in grouping consecutive
elements in a sequence. The following example shows how to search for prominence in
words, assuming repeated letters indicate emphasis:
(def they-say ; ❶
[{:user "mark" :sentence "hmmm this cake looks delicious"}
{:user "john" :sentence "Sunday was warm outside."}
{:user "steve" :sentence "The movie was sooo cool!"}
{:user "ella" :sentence "Candies are bad for your health"}])
(enthusiatic-people they-say) ; ❸
;; ("mark" "steve")
See also
• nil? is a better option to use as a predicate (here with remove) when only nil elements
should be dropped from the sequence. As we have seen in the examples, using identity
with filter also removes false elements, while nil? does not:
(remove nil? [0 1 2 false 3 4 nil 5])
;; (0 1 2 false 3 4 5)
• “constantly” returns a function that accepts any number of arguments but always
returns the same given result. Use “constantly” instead of identity if you need a variable
number of arguments while returning the same result.
Performance considerations and implementation details
2.2.6 juxt
function since 1.1
(juxt
([f])
([f g])
([f g h])
([f g h & fs]))
juxt takes an argument list of functions and returns a new "juxtaposing" function that
applies each original function to the same set of arguments. All results are then
collected in a vector. juxt could be described as a "function multiplexer", since it calls
multiple functions to return multiple results. Here’s how you can use juxt to see the
different effects of calling first, second and last on a list:
((juxt first second last) (range 10)) ; ❶
;; [0 1 9]
We can describe the example above "visually" with the following picture:
CONTRACT
Input
• juxt requires at least one argument up to an unlimited number of arguments.
• "f", "g" and "h" are functions. They need to accept the same number of arguments
the output function will be called with. If for example the generated function is
called with 2 parameters, then "f", "g" and "h" will be called with those 2
parameters.
• "fs" is any additional function after "f", "g" and "h".
Notable exceptions
• clojure.lang.ArityException when juxt is invoked without arguments or when
the generated function is called with the wrong number of arguments.
Output
• juxt returns a function of any number of arguments returning a vector. The
resulting vector has size equal to the number of the initial functions.
Examples
juxt is useful to group multiple actions together. One simple case is searching for
neighbors in a grid of cells identified by two-dimensional coordinates. The neighbors
are the 4 cells sitting above, below, left and right of another cell. Given a pair of
coordinates [x, y], we need to apply 4 transformations to find the adjacent cells.
The following diagram shows the cell [2 1] and its neighbors:
We need to be careful though, because the grid has finite dimensions and we don’t
want to return non-existing neighbors:
(def dim #{0 1 2 3 4}) ; ❶
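The direction functions, valid? and neighbors were not preserved in this listing; a plausible reconstruction consistent with the callouts and the outputs below:

```clojure
(def dim #{0 1 2 3 4}) ; ❶ valid coordinate values, as a set

(defn up    [[x y]] [x (dec y)]) ; ❷
(defn down  [[x y]] [x (inc y)])
(defn left  [[x y]] [(dec x) y])
(defn right [[x y]] [(inc x) y])

(defn valid? [cell] ; ❸ truthy when both coordinates fall inside dim
  (every? dim cell))

(defn neighbors [cell] ; ❹
  (filter valid? ((juxt up down left right) cell)))
```

Using the set dim as a predicate makes valid? a one-liner: set lookup returns the element (truthy) when present and nil otherwise.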
(neighbors [2 1]) ; ❺
;; ([2 0] [2 2] [1 1] [3 1])
(neighbors [0 0]) ; ❻
;; ([0 1] [1 0])
❶ dim defines the possible values for the coordinates of a grid using a 0-based indexing.
❷ up, down, left, right are functions taking a coordinate pair [x y] and computing the coordinates
of the cell above, below, left or right respectively.
❸ valid? is a function that returns true if the given [x,y] cell is contained inside the given grid
dimensions.
❹ juxt groups together the functions we need to calculate the neighbors in a single call.
❺ We can see that these are valid coordinates looking at the diagram above.
❻ This is an example of cell at the edge of the grid returning the only two available neighbors.
Another idiomatic use of juxt serves the purpose of maintaining some unaltered
version of a value along with its transformations. If for example we have a vector of
words and we want to show their length we could use juxt and “identity”:
(def words ["book" "this" "an" "awesome" "is"])
❶ An example of using juxt to decorate each word in a sentence with its length.
By using juxt we are able to map over the sequence of words, keep a copy of the word
unchanged and decorate the word with its length. We could achieve a similar result
using an anonymous function, but we would have to deal explicitly with the
parameters and wrap the results in a vector:
(map #(vector (count %) %) words) ; ❶
;; ([4 "book"] [4 "this"] [2 "an"] [7 "awesome"] [2 "is"])
❶ An alternative version for juxtaposing functions using an anonymous function instead of juxt,
resulting in a more complicated form to read.
Another fairly common use of juxt is as a helper to extract values from a map. The
following example shows how we can create a message by joining together the values of
the relevant keys:
(def post ; hypothetical example map (the original value is not shown)
  {:count 100 :normal-title "A title"})
(->> post
((juxt :count :normal-title)) ; ❶
(interpose " ") ; ❷
(apply str)) ; ❸
❶ post is an example map of data that we thread with ->> through a set of transformations. The first
transformation creates a function with juxt that is applied to the post map. The output shows the
values corresponding to the :count and :normal-title keys.
❷ interpose interleaves a space between the sequence of strings.
❸ apply with str joins everything together in a single string.
In the presence of a list of maps, we could use juxt with sort-by (or group-by) to sort a
sequence of maps by more than one attribute in a nested fashion:
(sort-by (juxt count str) ["wd5" "aba" "yp" "csu" "nwd7"]) ; ❶
;; ("yp" "aba" "csu" "wd5" "nwd7")
❶ This call to sort-by is first sorting by count and then by alphabetical order between those strings with
the same size.
Nested grouping is common when handling tabular data, such as database result sets.
The following person-table definition shows how some raw data might appear once
loaded in memory. We can query the table using a combination of sort-by, group-by
and juxt:
(def person-table ; ❶
[{:id 1234567 :name "Annette Kann" :age 31 :nick "Ann" :sex :f}
{:id 1000101 :name "Emma May" :age 33 :nick "Emma" :sex :f}
{:id 1020010 :name "Johanna Reeves" :age 31 :nick "Jackie" :sex :f}
{:id 4209100 :name "Stephen Grossmann" :age 33 :nick "Steve" :sex :m}])
(sort-by-age person-table)
;; ([31 "Ann"] [31 "Jackie"] [33 "Emma"] [33 "Steve"])
(group-by-age-sex person-table)
;; ({[31 :f] ([31 "Ann"] [31 "Jackie"])}
;; {[33 :f] ([33 "Emma"])}
;; {[33 :m] ([33 "Steve"])})
❶ A person table is represented here as a sequence of maps. This is typically the result of querying a
table from a SQL database. Each record contains attributes for a person and we are interested in
presenting them in some useful way.
❷ sort-criteria and group-criteria are definitions for two functions returned by juxt. We extract
them into their own var definitions, so we can reuse these criteria elsewhere.
❸ sort-by-age uses the sort criteria created with juxt to sort the table before mapping it (again
using juxt) to only show the relevant attributes. Note that when used in conjunction with sort-by
or group-by, juxt means "first sort by the first function and then, among equal values, sort by the
second". juxt is effectively nesting sort and grouping operations.
❹ group-by-age-sex is similar to sort-by-age just applying slightly different criteria. As before, we are
using criteria created with juxt both for grouping and for filtering only the interesting keys in
the map operation.
See also
• comp has some similarities with juxt in that they both compose multiple functions
into one, but they are different in the way the functions are composed to obtain the
final result, for example: ((comp f g h) x) is equivalent to (f (g (h x))),
while ((juxt f g h) x) is equivalent to [(f x) (g x)
(h x)]. Use comp instead of juxt when the goal is for each function output to be
the input for the next function. Use juxt when functions should operate in parallel
on the same input.
• “select-keys and get-in” should be preferred to filter keys and values from a map.
We’ve seen in our examples that juxt can be used effectively as a "select-values"
instead.
• “zipmap” can be used to create pairs from a sequence similarly to what (map
(juxt :somekey identity) maps) does, with a slightly different syntax: (zipmap
(map :somekey maps) maps). While “zipmap” results in an unordered
map, map with juxt creates a sequence of “vector” pairs that maintains order.
Performance considerations and implementation details
The implementation of juxt for up to three functions simply invokes each
function and collects the results in a vector. The variable-arity case is handled
with apply and reduce.
2.2.7 memfn
macro since 1.0
(memfn name & args)
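The opening listing this entry refers to was not preserved; it plausibly contrasted the correct and the wrong way to pass a Java instance method to a higher order function:

```clojure
(map (memfn toUpperCase) ["a" "b" "c"]) ; ❶
;; ("A" "B" "C")

;; (map toUpperCase ["a" "b" "c"]) ; ❷ does not compile:
;; CompilerException java.lang.RuntimeException:
;;   Unable to resolve symbol: toUpperCase in this context
```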
❶ The correct approach uses memfn to make sure toUpperCase is called on each string in the vector.
❷ The wrong approach shows that Clojure is trying to resolve toUpperCase as a symbol.
Input
• "tags" is an optional list of meta tags (in the form of ^:tagname1 ^:tagname2
separated by spaces). "tags" are propagated to the target object receiving the
method call. The main use of tags is for type hinting.
• "name" must be a symbol representing a callable method on a Java class.
• "args" is an optional enumeration of symbols. The optional "args" are delegated to
the Java method.
Notable exceptions
• RuntimeException in the unlikely case you need to pass more than 20 parameters
to memfn.
Output
• memfn returns a new function of at least one argument. The first argument is the
Java object instance the function will receive. Any additional arguments after the
first are passed as-is to the method invocation.
Examples
The value provided by memfn is mainly related to Java inter-operation scenarios,
especially when higher order functions are executed on Java objects. The example
below shows how to process a sequence of java.time.Instant to find their duration
from an initial instant t0. Alternatively, the time at the invocation is taken as the
starting point:
(import '[java.time Instant Duration])
(def instants
(repeatedly (fn [] ; ❶
(Thread/sleep (rand-int 100))
(Instant/now))))
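The durations function the callouts below describe did not survive in this listing; a plausible reconstruction (the name durations-from is an assumption):

```clojure
(import '[java.time Instant Duration])

(defn durations-from
  ([instants] (durations-from (Instant/now) instants))
  ([t0 instants]
   (map (memfn toMillis) ; ❸
        (map #(Duration/between t0 %) instants)))) ; ❷

;; realize a few instants first, then measure:
;; (def sample (doall (take 3 instants)))          ; ❹
;; (durations-from sample)                 ; measured from (Instant/now)
;; (durations-from (first sample) sample)  ; ❺ explicit starting point
```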
❶ “repeatedly” takes a function of no arguments and returns an infinite sequence of invocations of that
function. instants uses “repeatedly” to sleep some rand-int amount of time and then add the current
time in the sequence. The result is a lazy infinite sequence of instants.
❷ Duration/between is a call to a static Java method. Compared to an instance method, a static
method does not require the use of memfn.
❸ memfn wraps the instance method symbol toMillis. This is then used as higher order function
for map.
❹ We take and realize a couple of instants for our experiment. Without doall, (Instant/now) is invoked
at some later time confusing the results.
❺ The second invocation shows that we can pass our own starting point to measure durations.
One aspect to keep in mind when using memfn is related to multiple arguments. If the
instance method requires one or more arguments memfn can be instructed to pass them
through by adding arbitrary symbols:
(map (memfn indexOf ch) ["abba" "trailer" "dakar"] ["a" "a" "a"])
;; (0 2 1)
The map of two sequences requires a function of two arguments, one for the item from
the first sequence and one for the second. The first argument is implicit
(because memfn generates a function of at least one argument to pass to the instance
method) while the second must be explicit like in the example above (indicated by the
symbol "ch"). You need to be careful though. Additional parameters passed
to memfn should not be confused with “partial” application: for example, the
following attempt to find the index of a letter in a string doesn’t compile:
(map (memfn indexOf "a") ["abba" "trailer" "dakar"])
;; CompilerException java.lang.Exception: Unsupported binding form: a
The reason for the problem becomes clear if we “macroexpand, macroexpand-1 and
macroexpand-all” the form:
(macroexpand '(memfn indexOf "a"))
;; (fn* ([target12358 p__12359]
;; (clojure.core/let ["a" p__12359] (. target12358 (indexOf "a"))))) ; ❶
❶ The macro expansion shows that the string "a" is used as a local binding in a “let and let*” form. This
is the reason why all arguments after the first need to be valid symbols (at compile-time) and valid for
the method signature (at run-time).
See also
• “fn” and the related function literal #() can be used in all places where memfn is
used. memfn is however a better choice to invoke Java instance methods as higher
order functions.
• “".", ".." and doto” helps with side-effecting Java methods invocations, allowing
multiple invocations to be chained together. If the instance method you need to
call is side-effecting (like a setter method), prefer “".", ".." and doto” instead.
Performance considerations and implementation details
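For reference, the implementation of memfn in clojure.core (at the time of writing) is essentially the following, which the two callouts below describe:

```clojure
(defmacro memfn [name & args]
  (let [t (with-meta (gensym "target")
            (meta name))] ; ❶ metadata on "name" propagated to the target
    `(fn [~t ~@args]
       (. ~t (~name ~@args))))) ; ❷ args are unquote-spliced after the method name
```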
❶ Note that any metadata associated with the first parameter is propagated to the target object instance.
❷ The optional args are "unquote-spliced" after the method name.
One important performance aspect affecting memfn is related to reflective calls during
Java interoperation. The following example shows what happens when memfn is used
without type hinting:
(set! *warn-on-reflection* true)
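The unhinted benchmark the callout describes might have looked like this (the reflection warning appears at compile time; timings vary by machine):

```clojure
(time (dotimes [n 100000]
        (map (memfn toLowerCase) ["A" "B"]))) ; ❶ triggers a reflection warning
;; Reflection warning - call to toLowerCase can't be resolved.
```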
❶ We use “dotimes” and time (a rudimentary but sufficient benchmarking approach) to show a rough
estimate of the time consumed to map toLowerCase 100000 times.
memfn accepts and propagates metadata when present on the first argument. We can use
this aspect to type hint the Java call to remove reflection:
(time (dotimes [n 100000]
(map (memfn ^String toLowerCase) ["A" "B"]))) ; ❶
;; "Elapsed time: 5.701509 msecs"
;; nil
❶ The type hint appears to affect toLowerCase but memfn is propagating it to the right place after macro
expansion.
With the type hint in place, the time consumed is roughly cut in half. It’s
probably safe to suggest that the presence of memfn (especially when used as a higher
order function on collections) should trigger a check for the presence of expensive
reflective calls.
2.2.8 partial
function since 1.0
(partial
([f])
([f arg1])
([f arg1 arg2])
([f arg1 arg2 arg3])
([f arg1 arg2 arg3 & more]))
partial is used when a function requires one or more arguments but not all of them
are available at the time of the invocation (for example because the function is passed
to another function):
(def incrementer (partial + 1)) ; ❶
(incrementer 1 1) ; ❷
;; 3
❶ partial "injects" the parameter "1" into the function + creating another incrementer function as
output. The incremeneter function does not evaluate, waiting for other parameters to be available at
some later point.
❷ incrementer is evaluated with 2 additional parameters, bringing the total of passed parameters to 3.
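The finder used below was presumably created with partial by fixing the first argument of a search function; a plausible sketch, where the target text is an assumption:

```clojure
(require '[clojure.string :as str])

(def finder (partial str/index-of "tons of bricks")) ; ❶ fixes the text to search in

(finder "tons")
;; 0
```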
(finder "tons") ; ❷
;; 0
Maybe more interesting is the fact that partial is positional and can capture arguments
strictly starting from the left. Suppose, continuing the previous example, that we only
want to "fix" the word to search to always be "tons" but leave the target text as a free
variable: partial won’t allow us to do that.
Examples
partial achieves a similar effect to fn (or the literal syntax #()) but with reduced
flexibility. While fn supports the missing arguments in any position (and with possible
gaps), partial only allows missing arguments at the end of the function signature:
(let [f (partial str "thank you ")] (f "all!")) ; ❶
;; "thank you all!"
(let [f #(str %1 "thank you " %2)] (f "A big " "all!")) ; ❷
;; "A big thank you all!"
❶ "all!" with partial can only close the sentence. We can’t add something before "thank you" for
example.
❷ Using the function literal #() (which is syntactic sugar for the anonymous function fn) we have the
flexibility to accept additional arguments to place before the others.
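The next example specializes equality with partial. Its definitions did not survive in this listing, but the callouts below suggest something along these lines (the names as and same? are taken from the annotations):

```clojure
(defn as [item] ; ❶ suspend "=" after the first argument
  (partial = item))

(defn same? [item coll] ; ❷ apply (as item) to every element of coll
  (apply (as item) coll))

(def all-a? (partial same? \a)) ; ❸

(def all-red? (partial same? :red)) ; ❹
```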
(all-a? "aaaaa") ; ❺
;; true
(all-red? [:red :red :red]) ; ❻
;; true
❶ partial is used to suspend equality "=" after the first argument. We want something to be equal to "x"
but we still don’t know what that is going to be.
❷ same? contains the call to (as item), passing as many arguments as there are in the collection given
as argument. This always works because equality can take any number of arguments.
❸ We use partial again to suspend the second argument to same?, creating another specialization
of "=" that just checks for the single character \a in a sequence.
❹ Similarly we can do the same for other kinds of items (for example keywords) and reuse everything we
have written so far.
❺ As expected it works for strings (which are sequences of characters).
❻ Thanks to another specialization of partial we can use all-red? on a collection of keywords.
(def valid-req
{:id "1322"
:cache "rb001"
:product "cigars"})
(def invalid-req
{:id "1323"
:cache "rb004"
:spoof ""})
❶ validate applies some simple rules to the keys and values of a map, combining them into a final
true/false answer with every?. It also requires a list of "whitelisted" keys for one of the validation checks.
❷ The map operation happens on a collection of maps to be validated. We know at this point which keys
are allowed in each map, while the actual map comes from the internal iteration. partial gives us
exactly the higher order function we need: it accepts the arguments we already know
and creates the single-argument function map requires. The alternative of an anonymous
function here is still possible but more verbose.
Currying
Partial application in Clojure is often compared to "currying". Depending on the level of formalism, the
two concepts are often conflated, but there is usually a difference related to the level of support for
currying in the hosting programming language. Currying is a mathematically inspired concept related to
the fact that a function of multiple arguments can be expressed as a chain of functions of a
single argument. The following examples illustrate the idea:
(defn f1 [a b c d]
(+ a b c d))
(defn f2 [a b c]
(fn [d]
(+ a b c d)))
(defn f3 [a b]
(fn [c]
(fn [d]
(+ a b c d))))
(defn f4 [a]
(fn [b]
(fn [c]
(fn [d]
(+ a b c d)))))
(f1 1 2 3 4)
((f2 1 2 3) 4)
(((f3 1 2) 3) 4)
((((f4 1) 2) 3) 4)
f1, f2, f3 and f4 all produce the same result, but they differ in the number of arguments they need
to be invoked with and the level of nesting of the returned function, forcing us to "unroll" the arguments
one by one. If we were to be constrained by the language to only have functions of a single argument,
then we would have to simulate functions of multiple arguments like f4 does. Luckily for us, such a
constraint doesn’t exist in Clojure (or any other mainstream language) but, as we have seen in the
examples for partial in this chapter, sometimes it’s handy to suspend part of a function until more
arguments are available. So in Clojure we won’t define f2, f3 or f4 explicitly and will
prefer partial instead:
((partial f1 1 2 3) 4)
((partial f1 1 2) 3 4)
((partial f1 1) 2 3 4)
The approach Clojure is taking here is to just use higher order functions to remove all the nesting that
would be necessary otherwise. Other languages go beyond partial application supporting currying at the
compiler level. In Haskell, for example, all functions called with less than the declared number of
arguments are automatically turned into their curried form:
f = (+ 1) ; ❶
-- is equivalent to:
f x = x + 1
❶ f is a function of 1 argument incrementing its input by 1. Instead of creating our own implementation
we used the already existing function + and "curried" the first argument to be always 1.
In the Haskell example above, the compiler accepts that + is called with fewer than the declared number
of arguments, producing a function f that accepts the remaining one. There is no need to invoke an
explicit partial function like in the Clojure version, and no need for explicit multiple arities in
Haskell.
See also
• fn (or the equivalent reader macro #()) can be used instead of partial when the
argument we want to suspend is not the last in the argument list. fn is
effectively a general partial that lets us wrap the target function the way we need.
For all other cases partial can be clearer in conveying the idea that a function
is waiting for more arguments. In general, choosing between the two forms is a matter of taste.
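The benchmark the callouts below refer to was not preserved; it plausibly compared calls through a partial of vector (the timing calls are illustrative and machine-dependent):

```clojure
(def myvec (partial vector :start)) ; always insert :start at the beginning

(time (dotimes [n 100000] (myvec 1 2 3)))   ; ❶ direct arity of the partial-ed fn
(time (dotimes [n 100000] (myvec 1 2 3 4))) ; ❷ falls through to apply, slower

(time (dotimes [n 100000]                   ; ❸ explicit lambda for comparison
        ((fn [a b c d] (vector :start a b c d)) 1 2 3 4)))
```

The function returned by partial has direct arities for up to three additional arguments; beyond that it goes through apply, which explains the difference between ❶ and ❷.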
❶ myvec is a small function that always inserts :start at the beginning of a vector. We call myvec with 3
arguments first. vector also has several optimized arities and we try to take advantage of this fact.
❷ The second benchmark adds an additional argument when calling myvec, resulting in a visible
performance impact despite the fact that vector has a specific arity for a 5-argument call (the
threshold before switching to the catch-all variable arity is 6 arguments for vector).
❸ As a comparison, let’s have a look at a similar solution using an explicit lambda function.
The reader should remember that in real-life scenarios micro-benchmarks like the ones
we are performing are influenced by many other factors and the described speed impact
is tiny in absolute terms. But if you happen to use partial in a tight loop with more
than 4 arguments, you should probably look into using fn instead.
2.2.9 every-pred and some-fn
function since 1.3
(every-pred
([p])
([p1 p2])
([p1 p2 p3])
([p1 p2 p3 & ps]))
(some-fn
([p])
([p1 p2])
([p1 p2 p3])
([p1 p2 p3 & ps]))
every-pred and some-fn take one or more predicates and produce a new function. The
returned function takes zero or more arguments and invokes all predicates against all
arguments, combining the results with the equivalent of an and or an or operation,
respectively.
Both every-pred and some-fn apply short-circuiting behavior to the combination of
predicates, stopping evaluation at the first predicate that returns:
• Either nil or false for every-pred.
• A truthy value in case of some-fn.
CONTRACT
Input
• "p", "p1", "p2" and "p3" must be functions supporting a one argument call and
returning any type. Even when a predicate returns something different from a
boolean type, the returned value is evaluated as a true or false following Clojure
conventions.
• "ps" is the list of all remaining predicates after the third.
Notable exceptions
• ArityException when every-pred or some-fn is invoked without arguments, or
when any predicate requires more than 1 argument.
Output
• every-pred: returns a function of any number of arguments of any type that
returns either true or false.
• some-fn: returns a function of any number of arguments of any type that returns
any type. The return value is usually interpreted using extended boolean logic.
Examples
every-pred and some-fn are often found when a sufficiently large combination of
predicates needs to be applied to one or more values. For example it might be natural
to think that this is possible and works correctly:
(remove (and number? pos? odd?) (range 10)) ; ❶
;; (0 2 4 6 8)
What we would like to achieve is to combine a set of predicates using and so that each
element from the input collection is either accepted or rejected by remove. But and is
a macro and its evaluation happens while the form is compiling, resulting in the
following being evaluated instead:
(remove odd? (range 10)) ; ❶
❶ How the form from the example before appears after it is compiled by the Clojure compiler.
The and expression returns the last value that is not false or nil, which is the
function odd? in this case. What we really want is to combine the predicates in such a
way that they are evaluated on each item, so we could do something like this:
(remove #(and (number? %) (pos? %) (odd? %)) (range 10)) ; ❶
;; (0 2 4 6 8)
❶ The correct way to combine multiple predicates requires adding a wrapping anonymous function and
repeating the argument for each predicate.
The correct combination doesn’t read as well as before, now that we have added a
wrapping anonymous function and repeated the argument for each predicate.
The situation becomes worse the more predicates need to be combined. For example,
here’s how we could go about finding palindromes (words that read the same in both directions)
in a collection of items:
(defn symmetric? [xs]
(= (seq xs) (reverse xs)))
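The palindromes function the callouts describe combined four predicates inside an anonymous function; a plausible reconstruction (the sample input is illustrative):

```clojure
(defn palindromes [words] ; ❷
  (filter #(and (some? %) (string? %) (not-empty %) (symmetric? %))
          words))

(palindromes ["racecar" nil "kayak" "" 42 "clojure"])
;; ("racecar" "kayak")
```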
❶ symmetric? verifies that the sequential collection "xs" is equal to its reverse. This is a good (although
not efficient) definition of a palindromic sequence.
❷ palindromes applies a series of steps to establish if a word in a collection is a palindrome: first it
should not be nil, second it should be a string, third it should not be empty and finally, it should be
equal to its reverse.
The function used to filter palindromes contains a conjunction of predicates with the
argument "word" repeated 4 times. By using every-pred we can remove the
anonymous function, the need to pass the "word" argument and the need to use and:
(defn symmetric? [xs]
(= (seq xs) (reverse xs)))
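The in-place every-pred version of the same filter, reconstructed from the description (the sample input is illustrative):

```clojure
(filter (every-pred some? string? not-empty symmetric?)
        ["racecar" nil "kayak" "" 42 "clojure"])
;; ("racecar" "kayak")
```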
The example shows every-pred used in-place, without giving a name to the generated
function. We could also assign every-pred a name and reuse the same combination of
predicates in different places. The following example extracts the check for
palindromes into a new palindrome? function:
(defn symmetric? [xs]
(= (seq xs) (reverse xs)))
(def palindrome? ; ❶
(every-pred some? string? not-empty symmetric?))
❶ The combination of predicates to check for palindromic words is now available for reuse in other parts
of the code.
❷ The palindromes function is now just invoking filter on the collection.
Let’s now have a look at some-fn. In the following example, we are going to perform
multiple checks on a value to determine if an email message is spam:
(defn any-unwanted-word? [words] ; ❶
(some #{"free" "sexy" "click"} words))
(def spam? ; ❹
(some-fn any-unwanted-word? any-link? any-blacklisted-sender?))
❶ any-unwanted-word? contains a set of unwanted words and we want to know if any of them is
present in a message. We use some applied to the sequence of words using the set itself as a
function. If the word is in the set, the word itself is returned. some returns the first occurrence of the
word in the set or nil otherwise.
❷ The second function applies a regular expression to verify if any of the words is a link to external
content on the web. In this simplified example we consider every email containing a link as suspicious.
❸ any-blacklisted-sender? checks the content of the email against a collection of email addresses
that we consider spam. any-blacklisted-sender? works exactly as any-unwanted-word?.
❹ We combine all the functions using some-fn in a new function returning the first positive check
(logical true). We also opted to def the generated function as spam? to promote further reuse.
❺ words splits a string using "clojure.string/split".
❻ We can see an example of "clean" email with no matching words. some-fn eventually
returns nil after calling the chain of predicate functions.
❼ The last call discovers a blacklisted word.
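The helper predicates any-link? and any-blacklisted-sender? from the callouts are not shown in this excerpt. Minimal sketches consistent with the description (the regular expression and the blacklist contents are assumptions):

```clojure
(require '[clojure.string :as string])

(defn any-unwanted-word? [words]
  (some #{"free" "sexy" "click"} words))

;; ❷ hypothetical: flag any word that looks like a web link
(defn any-link? [words]
  (some #(re-find #"https?://" %) words))

;; ❸ hypothetical blacklist; works exactly as any-unwanted-word?
(defn any-blacklisted-sender? [words]
  (some #{"[email protected]"} words))

;; ❺ words splits a string using clojure.string/split
(defn words [s]
  (string/split s #"\s+"))

(def spam?
  (some-fn any-unwanted-word? any-link? any-blacklisted-sender?))

(spam? (words "hello, see you tomorrow")) ; ❻ a "clean" email
;; nil

(spam? (words "click here for a free gift")) ; ❼
;; "click"
```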
some-fn can be used to retrieve the result from a matching predicate (like in the
example above) or in conditional statements, where we have an option to completely
ignore the precise value returned. With reference to the previous example, we can
use when-let to combine conditional logic and value matching:
(when-let [match (spam? (words "from: [email protected], click here for a free gift."))] ; ❶
  (throw (Exception. (str "Spam found: " match))))
;; Exception Spam found: click
❶ We can use some-fn in conditional logic as well as to retrieve a specific value. In this case, when-let
both binds the matching value to the local "match" and enters the body, throwing an exception.
32
This quote is attributed to Phil Karlton, once Netscape architect. See: skeptics.stackexchange.com/questions/19836/has-
phil-karlton-ever-said-there-are-only-two-hard-things-in-computer-science
❶ every-fn has the same interface as every-pred: it takes a list of functions (supposedly predicates)
and returns another function that calls all predicates in "ps" on each of the items in the input. Our
implementation generates permutations of all predicates "ps" against each input and groups them
together with partition.
❷ The example provides two simple predicates (using an extended definition of predicate that
includes any function returning logical true or logical false).
❸ We can now call every-fn using a sample input. The result shows returned values from the
predicates grouped for each input.
❹ By comparison, every-pred would look at each value and stop at the first occurrence of
a nil or false, returning false.
every-fn is designed to return the explicit result of applying each predicate to each input. Without
further processing, we can’t see the same useful information that every-pred is returning, such as
whether there is at least one input that doesn’t satisfy all predicates at once. We can still use the
results in some other way, but at that point we can solve the same problem without creating custom
functions:
❶ The effects of every-fn are produced without using a custom function. juxt applies multiple functions
to the same input and looks like a good choice in this case.
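Neither every-fn nor the juxt alternative appears in this excerpt. A sketch of an implementation matching the description in the callouts, together with the juxt version:

```clojure
;; Hypothetical implementation: apply every predicate in ps to every
;; input, then group the results per input with partition.
(defn every-fn [& ps]
  (fn [& xs]
    (partition (count ps)
               (for [x xs, p ps] (p x)))))

((every-fn pos? even?) 1 2 -4)
;; ((true false) (true true) (false true))

;; The same effect without a custom function, using juxt:
(map (juxt pos? even?) [1 2 -4])
;; ([true false] [true true] [false true])
```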
See also
• every? is used by the every-pred implementation. If you have a single predicate to
apply to multiple arguments, prefer "every?" instead of every-pred.
• some-fn is very similar to every-pred except that instead of
verifying that all predicates are true, it verifies that at least one is true (equivalent to
the boolean "or" operator).
• some applies a similar logic to some-fn. But instead of composing together
multiple predicates it uses a single predicate against each element of the sequence.
Use some instead of some-fn if you are only interested in a single predicate applied
to multiple values.
Performance considerations and implementation details
every-pred and some-fn generate functions in constant time, with a small performance
penalty for 4 or more arguments. The reader should consider these aspects in case
every-pred or some-fn appears as part of processing large collections or in fast
loops. The following benchmark illustrates the point:
(require '[criterium.core :refer [quick-bench]])
Please note that the difference between 3 and 4 arguments is small and there could be
other kinds of dominating computations to consider. A straightforward solution to
eliminate the generation time from the computation is to name the generated function
in a let binding or a var in the namespace. If we look at the performance of the
generated function we can see a similar behavior:
(require '[criterium.core :refer [quick-bench]])
❶ A dummy predicate p always returning true. This is the worst case scenario for every-pred that
doesn’t have a way to short-circuit the computation.
❷ We generate a few functions with every-pred passing a different number of predicates.
The difference between 3 and 4 arguments depends on the fact that the generated
function uses dedicated arities up to 3 predicates and a slower variadic implementation beyond that.
-> (also known as thread first macro or thrush operator) can be used to compose or
group together a list of operations. The arguments of the -> macro consist of an
expression (mandatory) and a list of forms (optional).
The idea is to position the expression as the first argument of the following forms. For
example here’s a step-by-step explanation of what happens when evaluating (-> {:a
2} :a inc). Although this is not strictly how the macro is implemented, it is a good
thinking model:
1. The keyword :a is the first optional form after the initial expression {:a 2}.
Internally the form is wrapped in a list (unless it is one already). Since :a is not already
a list, it is transformed into (:a).
2. The expression {:a 2} is placed as the second item in the previously created list
resulting in (:a {:a 2}). The resulting form is evaluated and passed downstream.
In this case (:a { :a 2}) equals 2 and 2 is passed down to the next form.
3. inc is the next item in the list of forms. As before it’s not a list, so it needs to be
transformed into (inc).
4. The previous result, 2, is then placed as the second item in the previous list: (inc
2).
5. We finally reached the end. The final form is evaluated and returned.
During macro-expansion, the Clojure compiler transforms (-> {:a 2} :a inc) in
(inc (:a {:a 2})). During evaluation the form evaluates to the number 3.
macroexpand confirms our theory:
(macroexpand '(-> {:a 2} :a inc))
;; (inc (:a {:a 2}))
-> tends to improve the readability of some classes of sequential operations that would
otherwise read backward (or inner-most to outer-most). Transformation pipelines
(where the result of a first operation needs to be passed down to the next operation) are
usually good candidates to be "threaded" using ->.
CONTRACT
Input
• "x" can be any valid Clojure expression. It can be useful to remember "x" as the
"x" in "eXpression", which is what -> threads through the following "forms".
• "forms" is an optional list of arguments. If any of the optional forms is not
a “list” already, it will be made so by invoking “list” on it. The first element of
each form must be a callable function (such that (ifn? (first form)) evaluates
to true).
Notable exceptions
• ArityException if called with no arguments.
• ClassCastException if any form is not callable, for example (-> 1 2 []).
Output
• -> produces the evaluation of the last form, using the result of the previously
evaluated forms, following the thread-first rules exposed above. If no forms are
provided, it returns the evaluation of the first argument "x".
Examples
-> is particularly useful for processing pipelines where an initial input is transformed at
each step. This is true for the common case of map processing. The following example
shows a way to parse an HTTP request into a “hash-map”:
(def req {:host "http://mysite.com" ; ❶
          :path "/a/123"
          :x "15.1"
          :y "84.2"
          :trace [:received]
          :x-forward-to "AFG45HD32BCC"})
❶ req is an example request. Some web framework is taking care of transforming the request into
a map for us.
❷ prepare takes the request and assoc a few additional keys. It then removes keys that are no longer
needed and finally updates the :trace.
❸ We can use clojure.pprint/pprint to better format the output. pprint is available at the REPL
directly, but requires an explicit require otherwise.
To prepare the request above, we need a few transformations: join the host and path
together to form the :url, create a vector out of the coordinates, remove the coordinates
and the forward header and finally, update the trace to record that the preparation step
was done. In a real life application, request processing could be arbitrarily long and
complicated. We can take advantage of -> to increase the readability of the
transformation:
(defn prepare [req] ; ❶
  (-> req
      (assoc :url (str (:host req) (:path req))
             :coord [(Double/valueOf (:x req)) (Double/valueOf (:y req))])
      (dissoc :x-forward-to :x :y)
      (update :trace conj :prepared)))
Introducing -> in the prepare function creates a visual top-to-bottom flow that is easier
to read: the req input is passed "down to" the first assoc operation, then dissoc and
finally update.
Another interesting use of -> is in conjunction with the anonymous lambda form #().
When applied to a single argument, -> behaves similarly to the “identity” function, so (->
1) is equivalent to (identity 1). To understand how this could be useful, let’s take a
look at the following failing example:
(def items [:a :a :b :c :d :d :e])
❶ We’d like to create a map out of each element in items, but this is not the right way to do it.
What we would like to achieve in the example above, is to create a map containing a
key :count which is always 1 and a key :item which is the original element from
the items vector. The problem with the above is that the anonymous
function #({:count 1 :item %}) is trying to invoke the map with no arguments. We
need to macroexpand the form to see what’s happening:
(macroexpand '#({:count 1 :item %})) ; ❶
;; (fn* [p1] ({:count 1, :item p1}))
❶ A set of 4 forms all producing the same result on the same vector of items. The first uses hash-map,
which is an idiomatic choice.
❷ We could use identity and keep using the map literal syntax with curly braces {} but the need
for identity is hard to understand.
❸ A shorter alternative to identity is do. However, the presence of do is often associated with side
effects, which are nowhere to be found in this form. Overall, this option is as confusing as the second one.
❹ The final form uses -> and is short and to the point.
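The four equivalent forms described by the callouts are not visible in this excerpt; they presumably looked similar to this sketch:

```clojure
(def items [:a :a :b :c :d :d :e])

(map #(hash-map :count 1 :item %) items)   ; ❶ idiomatic hash-map call
(map #(identity {:count 1 :item %}) items) ; ❷ identity keeps the literal map
(map #(do {:count 1 :item %}) items)       ; ❸ do works, but hints at side effects
(map #(-> {:count 1 :item %}) items)       ; ❹ short and to the point

;; All four evaluate to the same sequence of maps, for example:
;; ({:count 1, :item :a} {:count 1, :item :a} {:count 1, :item :b} ...)
```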
The last option making use of -> conveys information about the operation effectively:
it’s short and easy to read, without introducing the semantic cluttering of identity or do.
Both the hash-map and the -> options are idiomatic, but they produce slightly
different results:
(map type (map #(hash-map :count 1 :item %) [1])) ; ❶
;; (clojure.lang.PersistentHashMap)
(map type (map #(-> {:count 1 :item %}) [1])) ; ❷
;; (clojure.lang.PersistentArrayMap)
Please refer to array-map and hash-map to understand this type difference. Most of the
time, Clojure handles the transition from one map type to another transparently without
the user needing to know.
33
The Wikipedia article on combinatory logic is a good introduction to the subject: en.wikipedia.org/wiki/Combinatory_logic
The two expressions in the example above return the same result, but the second exposes the flow much
more clearly. However, -> as a T-combinator is limited by the fact that it does not support nested
functions with arguments, for example:
The above results in a compile error. The macroexpansion clearly shows what’s wrong:
This is why sometimes the thread operator in Clojure is compared to a limited T-combinator 34.
See also
-> is just one of the several flavors of thread macros offered by Clojure. Initially it was
only ->, followed by ->> in 1.1 and a bigger expansion with the 1.5 release of Clojure
which added as->, some->, some->>, cond-> and cond->>. The other related threading
macros are:
• ->> is called "thread last" macro and is very similar to -> but it puts the element at
the end of the next form instead of as the second element. It is particularly useful
for sequence processing, where the input sequence usually appears last in the list
of arguments.
• "as->" enables the selection of a placeholder, making explicit where the element is
placed in the next form. Use as-> when it’s necessary to finely control the
placement of the element in the next form. This thread macro has the drawback of
being more verbose, because the placeholder is repeated in each form.
• some-> takes care of any initial or intermediate nil value, stopping right away
instead of passing it to the next form. some-> is useful when a form evaluates
to nil, causing an exception further down the chain.
• cond-> enables a custom condition to decide if the processing should continue or
not. This is the only thread macro that allows skipping a step completely.
• get-in fetches a value from an arbitrarily nested associative data structure such as a
Clojure map. For example: (-> {:a 1 :b {:c "c"}} :b :c) is equivalent
to (get-in {:a 1 :b {:c "c"}} [:b :c]). Consider using get-in instead of
-> if you need to access values in a deeply nested map.
34
For the reasons why the Clojure thread operator cannot be considered a true T-combinator, see the very good
explanation by Michael Fogus on his blog: blog.fogus.me/2010/09/28/thrush-in-clojure-redux/
->> (also known as the thread-last macro) can be used to compose or group together a list
of operations by positioning the first expression as the last argument of the following
form (similarly to "->" which places it first instead). ->> tends to improve the
readability of some classes of sequential operations that would otherwise read backward
(or inner-most to outer-most). Transformation pipelines (where the result of a first
operation needs to be passed down to the next operation) are usually good candidates
to be "threaded" using ->>.
The arguments of the ->> macro consist of an expression (mandatory) and a list of
forms (optional). The idea is that the first expression is "piped through" the other forms
that get a chance to process the expression at each step before the final output is
returned.
CONTRACT
Input
• "x" is a mandatory expression. The expression is evaluated and placed last in the
following form (if any).
• "forms" is an optional list of forms. If any form is not a list already, it is wrapped
by a “list”. The first element of each form must be a callable object (such
that (ifn? (first form)) is true). Each evaluated form is placed last in the
following form, which is then evaluated, until there are no more forms.
Notable exceptions
• ArityException if called with no arguments.
• ClassCastException if any form is not callable. For example in (->> "a" "b"
[]) the string "a" is treated as a function.
Output
• returns: the result of evaluating the last form (if any), by placing the previously
evaluated form as the last argument of the next. If no forms are provided, it returns
the evaluation of the expression "x".
Examples
->> is well suited for sequential processing pipelines where an initial input is
transformed by each step into the final output. The following example shows how we
could rewrite the nesting of several filter operations using ->>. We want to filter all
even positive numbers divisible by 3 made by the same repeated digit:
(filter pos? ; ❶
  (filter #(apply = (str %))
    (filter #(zero? (mod % 3))
      (filter even? (range 1000)))))
;; (6 66 222 444 666 888)
❶ This set of nested filters is reasonably easy to follow, but we still need the mental effort of searching
for the inner-most form and moving outward to understand it.
❷ ->> inverts the previous flow, starting with the input sequence first, then the set of operations in the
order they are actually applied.
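The ->> rewrite described in callout ❷ is not shown in this excerpt; it would read:

```clojure
(->> (range 1000)                 ; ❷ the input sequence comes first
     (filter even?)               ; then each operation, in application order
     (filter #(zero? (mod % 3)))
     (filter #(apply = (str %)))
     (filter pos?))
;; (6 66 222 444 666 888)
```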
The following example illustrates how flexible ->> is, for example when different
sequence operations are involved, like the case of parsing the query string of a web
request. Here’s a first option that nests each processing step without using the
thread-last macro:
(require '[clojure.string :refer [split]])
(def sample-query "guidx=123&flip=true")
(params sample-query)
;; {"guidx" "123", "flip" "true"}
The function that processes the parameters is not easy to follow, as it reads backward.
Here’s a new version of params that takes advantage of ->>:
(defn params [query] ; ❶
❶ The rewrite of the params function to take advantage of the thread last macro.
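The body of the rewritten params is missing from this excerpt. A sketch consistent with the sample output, assuming the clojure.string/split required earlier:

```clojure
(require '[clojure.string :refer [split]])

(defn params [query]          ; ❶
  (->> (split query #"&")     ; "guidx=123&flip=true" -> pairs
       (map #(split % #"="))  ; each pair -> ["key" "value"]
       (into {})))            ; collect the pairs into a map

(params "guidx=123&flip=true")
;; {"guidx" "123", "flip" "true"}
```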
The new params function contains exactly the same number of steps as before, just
arranged in a different order. Worth noticing how the same operations naturally flow in
order this time, so we can start reading about the first operation from the top and
follow the vertical flow to the bottom.
See also
->> is one of the most used and generic of the thread-last macros. There are other
thread-last variations that interact more specifically with the processing pipeline:
• some->> is a nil-aware thread last macro that stops processing at the
first nil occurrence in the evaluation chain.
• cond->> allows the presence of a condition at each step to decide whether to continue or
not. This version of the thread-last macro allows skipping one or more steps
completely.
Performance considerations and implementation details
cond-> and cond->> are specialized versions of the basic threading macros -> and
->> respectively. cond-> takes an expression and "threads first" the expression into the
following form (as the first argument of the form) if and only if a clause
is true. Similarly, cond->> "threads last" the expression when the condition is
true. Each form is preceded by a clause that is used to decide if the previous
evaluation should go through the form or not.
One important fact about conditional threading macros is that they are not short-circuiting.
If the clause is false, the related form is simply skipped and computation
resumes from the next. It is also worth noticing that the clauses don’t have access to the
evaluation of the other forms, just the surrounding local bindings, like any other non-macro
evaluated part of the code. Specifically, a clause can’t reference the result of
the previous form (what is "threaded-through"). This behavior can be used to
repeatedly check the initial expression (or some other given option) independently from
the transformations that are happening before.
Here’s a step-by-step explanation to clarify (cond->) logic:
(let [x \c]             ; ❶
  (cond-> x             ; ❷
    (char? x) int       ; ❸
    (char? x) inc       ; ❹
    (string? x) reverse ; ❺
    (= \c x) (/ 2)))    ; ❻
;; 50
Similarly, here’s a step-by-step example of how cond->> operates on its argument:
(let [x [\a 1 2 3 nil 5]]        ; ❶
  (cond->> x                     ; ❷
    (char? (first x)) rest       ; ❸
    true (remove nil?)           ; ❹
    (> (count x) 5) (reduce +))) ; ❺
;; 11
❶ A local binding "x" is established for the vector [\a 1 2 3 nil 5].
❷ "x" is threaded through the cond->> macro.
❸ The clause (char? (first x)) is evaluated. Since \a is a character, the form is
evaluated. Since rest is not a list, it is wrapped in a list with (list
rest) internally. x is placed as the last argument, and (rest x) evaluates to the list (1 2 3
nil 5).
❹ When true is used as a clause, the form is always evaluated. The previous evaluation is added as the
last argument of (remove nil?) resulting in (remove nil? (1 2 3 nil 5)) which evaluates to the
new list (1 2 3 5).
❺ The last clause (> (count x) 5) counts the elements in "x" (this is the original expression, not the
previously evaluated list). Since there are more than 5 items, the previously evaluated list is used as
the last argument of the current form: (reduce + (1 2 3 5)). The final result is 11.
CONTRACT
Input
• "expr" is a mandatory expression. "expr" is evaluated and placed second in the
following form if the clause condition evaluates to true.
• "clauses" is an optional list of clause-form pairs. In each pair, a "clause" is an
expression that evaluates to logical true/false. A "form" must be present for each
clause. If the form is not a list already, a new wrapping “list” is created. The first
element of each form must be a callable object (so that (ifn? (first
form)) is true).
Output
• The result of evaluating the last form (if provided), using the result of the
previously evaluated form. If no forms are provided (or all conditions evaluate
to false), it returns the evaluation of "expr".
Examples
One idiomatic use of cond-> is in conditional forms where the "true" branch should
transform the input while the "false" branch leaves it untouched. For example the
following forms are equivalent:
(let [x "123"] (if (string? x) (Integer. x) x)) ; ❶
(let [x "123"] (cond-> x (string? x) Integer.)) ; ❷
❶ The variable "x" can be a string or a number. If it’s a string, we want to convert it to a number,
otherwise we don’t do anything. The conditional form needs to repeat "x" one more time at the end,
just to leave it as it is.
❷ In the cond-> version, we avoid repeating "x" a third time, as it is threaded
through the Integer constructor only when the condition is true.
cond-> can be used to process heterogeneous data so that it eventually appears under the
same "shape". This situation can happen, for example, when an application receives
XML or JSON for the same entity but there are small differences in the structure or
values (tree-like data structures can be directly represented and processed as hash-maps
in Clojure). The following shape-up function checks if the incoming “hash-map”
conforms to a set of rules and changes it accordingly:
(defn same-initial? [m]
(apply = (map (comp first name) (keys m))))
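The shape-up function itself is missing from this excerpt. A reconstruction consistent with the callouts and the output below (a sketch, not necessarily the book's exact version):

```clojure
(defn same-initial? [m]
  (apply = (map (comp first name) (keys m))))

(defn shape-up [m]
  (cond-> m
    true (update :k3 assoc :j1 "default")               ; ❶ always applied
    (same-initial? m) (assoc :same true)                ; ❷ keys share an initial
    (map? (:k2 m)) (update :k2 #(apply str (vals %))))) ; ❸ flatten inner map
```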
(map shape-up
     [{:k1 "k1" :k2 {:h1 "h1" :h2 "h2"} :k3 {:j2 "j2"}}
      {:k1 "k1" :k2 "k2"}
      {:k1 "k1" :k2 {:h1 "h1" :h3 "h3"} :k3 {:j1 "j1"}}])
;; ({:k1 "k1", :k2 "h1h2", :k3 {:j2 "j2", :j1 "default"}, :same true}
;;  {:k1 "k1", :k2 "k2", :k3 {:j1 "default"}, :same true}
;;  {:k1 "k1", :k2 "h1h3", :k3 {:j1 "default"}, :same true})
❶ The first form enforces the presence of a :k3 key pointing at a map containing :j1 "default" (and
if the :j1 key already exists, its value is replaced). After this first step we are sure the :k3 :j1 key
combination exists, possibly with the "default" value. To ensure the step is always applied, "true"
was used as the clause.
❷ The second clause checks if all the keys are starting with the same letter. If that’s the case, we add a
key :same true.
❸ In the last step, if the value for key :k2 is another hash-map, then we take all the values of that inner
hash-map and concatenate them together as a string. We finally replace the same :k2 key with the new
string.
(process signals {:bypass? even? :interpolate? true :noise? 0.5 :cutoff? 200})
;; (0 4 12 14 16 ...
❶ Using destructuring we can extract the relevant keys from the input map.
❷ Signal processing starts by checking how many sampling events we received. If less than some
amount, each signal gets incremented. In our example this operation is not evaluated.
❸ To simulate the introduction of new data (interpolation) “range” is invoked on each signal in the
sequence, generating a list of nested sequences of different sizes. “mapcat” takes care of joining
everything back together. In our example, the option was set in the map and the interpolation takes
place on the original signal list, because the previous step wasn’t executed.
❹ This step filters the signals based on the bypass? key. If bypass? is nil, then there is no filtering.
When bypass? contains something other than nil, it assumes bypass? is the predicate for the filter.
The filter operation takes place in our example using even? as predicate.
❺ This step optionally adds noise to the signal by random sampling the list using “random-sample”.
Since the noise? key is set, also this step takes place using a 50% (0.5) probability.
❻ Finally the cutoff step removes all signals above a certain threshold. The step gets executed with a
200 threshold.
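Neither signals nor process appears in this excerpt. A sketch of what process might look like, based on the callouts (the sample data and the sampling-events threshold are assumptions):

```clojure
(def signals (range 100)) ; hypothetical sample data

(defn process [signals {:keys [bypass? interpolate? noise? cutoff?]}] ; ❶
  (cond->> signals
    (< (count signals) 10) (map inc)  ; ❷ too few samples: boost each signal
    interpolate? (mapcat range)       ; ❸ interpolate by expanding each signal
    bypass? (filter bypass?)          ; ❹ bypass? doubles as the filter predicate
    noise? (random-sample noise?)     ; ❺ "noise" via random sampling
    cutoff? (remove #(> % cutoff?)))) ; ❻ drop signals above the threshold

(process signals {:bypass? even? :interpolate? true :noise? 0.5 :cutoff? 200})
;; e.g. (0 4 12 14 16 ... -- the exact output varies with random-sample
```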
The condition column has access to local and global bindings (as any other part of the code) but it
doesn’t know anything about the right-hand column. Similarly, the processing column on the right can’t
have any impact on the conditions on the left (assuming no side-effects).
Please keep in mind that the above indentation style is used here to emphasize the vertical flow
of cond->> and is not normally used. Consider emphasis through indentation (or columns) a special
case of documentation to be used sparingly. When in doubt about the correct indentation style
for a function or a macro, the user-contributed Clojure style guide is the definitive reference on the
subject 35 .
See also
• "->" is the "thread-first" macro. Differently from cond->, it doesn’t check any
clause before the execution of the next form.
• ->> is the "thread last" macro. Differently from cond->> it doesn’t check a
condition for the execution of the next form.
• "some->" can be roughly compared to a cond-> where all conditions are only
checking for nil. However, "some->" short-circuits and returns right away in case
of nil.
• some->> roughly compares to cond->> where all conditions check for nil.
However, some->> short-circuits and returns right away if any of the forms
evaluates to nil.
35
The Clojure Style Guide github.com/bbatsov/clojure-style-guide#literal-col-syntax
Performance considerations and implementation details
some-> and some->> are variations of the thread-first -> and thread-last "->>" macros
which return immediately if any of the forms evaluates to nil. This is especially useful
for those functions throwing NullPointerException in the presence of nil (a common
situation with Java interop, but not exclusively):
(-> {:a 1 :b 2} :c inc) ; ❶
;; NullPointerException
❶ An attempt to increment the value for the key :c in a map. The key does not exist, so the
lookup returns nil, and inc throws an exception when passed nil.
❷ The same example using some-> returns nil.
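The some-> version referenced in callout ❷ (not shown above) would read:

```clojure
(some-> {:a 1 :b 2} :c inc) ; ❷ stops at the nil returned by :c
;; nil
```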
CONTRACT
Input
• "expr" is a mandatory argument and can be any valid Clojure expression.
• "forms" are additional optional arguments. If any of the optional forms is not
a list, a new list is created to wrap the form. The first element of each form must
be a callable object (so that (ifn? (first form)) is true).
Output
• some-> returns the result of evaluating the last form by placing the result of the
previous form as the second argument of the next, while some->> places the
evaluated form as the last argument instead. It returns nil if any of the forms
evaluates to nil.
Examples
An idiomatic use of some-> is when using Java interop, for example to convert strings
into numbers. This is often the case when reading from environment variables,
something frequent during system startup:
(defn system-port []
  (or (some-> (System/getenv "PORT") Integer.) ; ❶
      4444))
(system-port) ; ❷
;; 4444
❶ The presence of some-> here prevents a NumberFormatException when the "PORT" variable doesn’t
exist.
❷ Invoking (system-port) works regardless of the presence of the "PORT" environment variable.
When "PORT" is present, its value overrides the 4444 default.
re-seq is a good candidate for conditional processing with some->>: re-seq receives the
target string as the last argument and doesn’t tolerate nil arguments. Here’s a
function to extract the content between <title></title> tags from an HTML
document:
(defn titles [doc]                          ; ❶
  (some->> doc                              ; ❷
           (re-seq #"<title>(.+?)</title>") ; ❸
           (map peek)))                     ; ❹
(titles nil)
;; nil
(titles "<html><head>Document without a title</head></html>")
;; nil
(titles "<html><head>
<title>Once upon a time</title>
<title>Kingston upon Thames</title>
</head></html>")
;; ("Once upon a time" "Kingston upon Thames")
❶ titles is a simple function that searches for pairs of <title></title> tags in an HTML
document using a regular expression and collects their content.
❷ some->> prevents the need for guards against a possible nil value.
❸ If the entire document is nil, we don’t want re-seq to generate a NullPointerException.
❹ re-seq returns matching results in vector pairs. peek is the optimal way to access the last item in a
vector.
WARNING Using regular expressions to match large HTML documents is possible but not efficient. For
intensive HTML processing it is a better idea to use one of the many HTML parsing libraries
available (for example Enlive 36).
Listing 2.18. for-> repetition

(-> 1
    (for-> [x [1 2 3]]
      (+ x)))
;; 7

The thread-expr for-> from Pallet Ops allows for repetition of forms in an already
existing -> thread macro. The example shown here expands into: (-> 1 (+ 1) (+ 2) (+ 3))

Listing 2.20. updating macros

(-> {:a 1 :b {:c 2}}
    (->/update :a inc -)
    (->/in [:b :c]))
;; 2

These are two examples from the Synthread library. ->/update and ->/in are two thread
macros dedicated to maps, similar to “update and update-in” and get-in but supporting
threading multiple updates in a single call, as shown by (->/update :a inc -) which is
incrementing and changing the sign of the value pointed at by the key :a.
See also
• fnil is a function generator that works by wrapping another function. fnil is
preferable when the check around nil values happens in relation to arguments.
Performance considerations and implementation details
as-> specializes the two basic threading macros, -> and ->>, by adding a new
parameter which is used as a placeholder to position the evaluation of the previous
form into the next. With -> and ->>, the evaluation of the expression at the top is
placed second or last in the next form, respectively. All the forms in
the chain need to obey the same positioning. as-> instead enables a precise placement of the
evaluation in the next form:
(as-> {:a 1 :b 2 :c 3} x ; ❶
  (assoc x :d 4)         ; ❷
  (vals x)               ; ❸
  (filter even? x)       ; ❹
  (apply + x))
;; 6
❶ The as-> chain starts with 2 elements: the expression to be threaded in and the local binding "x".
❷ "x" is used as a placeholder in the next form to drive its positioning, in this case right after
assoc.
❸ Note that even when there is no ambiguity, "x" needs to be explicit in the form.
❹ This is an example of placement as last argument, equivalent to ->> positioning.
Since as-> is built on top of let, the chain above roughly expands to a series of rebindings of "x":
(let [x {:a 1, :b 2, :c 3}
      x (assoc x :d 4)
      x (vals x)
      x (filter even? x)
      x (apply + x)]
  x)
Since as-> is based on let, it also supports destructuring (although this was only
enabled starting with Clojure 1.8).
CONTRACT
Input
• "expr" is any valid Clojure expression. The result of the evaluation of the
expression is bound to the placeholder.
• "name" can be either a symbol or a destructuring form. If "name" is a symbol, it
can be used as a placeholder in the following forms. If a destructuring form is used
instead, the subsequent evaluations in the chain have to be compatible with the
destructuring form.
• "forms" is an optional list of forms, potentially making use of the placeholder
defined before.
Notable exceptions
• It throws a generic Exception if the binding placeholder is not a symbol or a
destructuring expression.
Output
• as-> returns the result of evaluating the last form, using the placeholder to refer to
the previously evaluated form. If no forms are provided, it returns the evaluation
of the expression.
Examples
as-> is useful in those cases where the threaded value is positioned differently in each
form. Here’s an example where sequence processing (usually a thread-last operation) is
mixed with map processing (a thread-first operation). The example simulates fetching
data from some URL endpoint that contains id, name, count triplets:
(defn fetch-data [url] ; ❶
  [{:id "aa1" :name "reg-a" :count 2}
   {:id "aa2" :name "reg-b" :count 6}
   {:id "aa7" :name "reg-d" :count 1}
   {:id "aa7" :name nil :count 1}])
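The url-from and process functions referenced in the callouts are missing from this excerpt. A sketch consistent with the callouts and the output (url-from's exact behavior is an assumption; fetch-data is repeated for completeness):

```clojure
(defn fetch-data [url] ; as defined above
  [{:id "aa1" :name "reg-a" :count 2}
   {:id "aa2" :name "reg-b" :count 6}
   {:id "aa7" :name "reg-d" :count 1}
   {:id "aa7" :name nil :count 1}])

(defn url-from [path] ; ❷ hypothetical: build a URL from a simple path
  (str "http://mysite.com/" path))

(defn process [path] ; ❸
  (as-> path <$>
    (url-from <$>)               ; first 3 forms: threaded value appears last
    (fetch-data <$>)
    (filter :name <$>)
    (apply + (map :count <$>)))) ; last form: placeholder in nested position

(process "home/index.html")
;; 9
```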
(process "home/index.html")
; 9
❶ fetch-data simulates a response after fetching data from a remote service. The url parameter is not
used in this example.
❷ url-from creates a valid URL from a simple path.
❸ We can see as-> in action. The first 3 forms require the threaded value to appear last, while the last
form takes the placeholder in nested position.
In the example above, process uses the as-> threading macro. The chain of operations
required to sum the :count key for the relevant items mixes function calls
and sequence operations, so the evaluation of the previous form is required at
different positions. The choice of the placeholder symbol <$> is arbitrary, but this one
is easy to spot through the forms.
The following example illustrates the use of destructuring with as->. One important
aspect to understand is that the same destructuring applies during each evaluation
despite appearing only once at the top. This allows each form to see a fresh update of the
local bindings based on previous evaluations:
(let [point {:x "15.1" :y "84.2"}]
(as-> point {:keys [x y] :as <$>} ; ❶
(update <$> :x #(Double/valueOf %))
(update <$> :y #(Double/valueOf %))
(assoc <$> :sum (+ x y)) ; ❷
(assoc <$> :keys (keys <$>)))) ; ❸
❶ A map contains the coordinates x,y of a point as strings. We destructure the map while declaring the
placeholder for as->.
❷ The value of x and y at this step of the computation is the result of applying destructuring to the
previous form, after both x and y have been converted from strings into doubles.
❸ Note that the placeholder <$> can be used at any location in the expression not just at the beginning.
See also
• The basic threading macros -> and ->> can be regarded as specialized forms
of as-> where the position of the threaded result is fixed (either the first or the
last parameter of the next form).
Performance considerations and implementation details
(apply
([f args])
([f x args])
([f x y args])
([f x y z args])
([f a b c d & args]))
apply, in its most used form, takes a function and a collection of arguments and
returns the result of invoking the function against the arguments in the list. apply is
useful in those cases where the parameters of a function are generated dynamically and
are not known at the time of writing the expression. apply can be visualized as
"unrolling" or "spreading" the arguments from a list onto a function call.
CONTRACT
Input
• "f" is the function to invoke and is a mandatory argument. apply requires at least
two parameters.
• "x", "y", "z", "a", "b", "c" and "d" are arguments with a dedicated function
signature.
• The last parameter must be a sequential collection.
Notable exceptions
• IllegalArgumentException when the last parameter is not a sequential collection.
• ClassCastException if the first argument is not a callable object.
Output
• apply returns the result of invoking "f" against the specified parameters.
Examples
A common case for apply is string concatenation when the collection of strings to
concatenate is known as the result of some runtime computation. Here is for example a
function to generate random binary strings of length "n":
(defn rand-b [n]
(->> #(rand-int 2) ; ❶
(repeatedly n) ; ❷
(apply str))) ; ❸
(rand-b 10)
; "1000000011"
❶ The first step creates a function of no arguments that generates either 0 or 1 with equal
probability. This function is required by repeatedly below.
❷ The random generator is passed to repeatedly, which creates a lazy sequence of "n" random bits.
❸ We use apply with str for the final string concatenation.
The example shows how to generate a random list of bits before converting it into a
single string. Another common use of apply is to create maps using a list as input:
(defn event-stream [] ; ❶
(interleave (repeatedly (fn [] (System/nanoTime))) (range)))
❶ event-stream simulates a stream of events coming from some external source in the form of a
timestamp followed by a value in a simple sequence.
❷ “hash-map” requires key-value pairs as arguments. We can use apply to transform the collection of
events into a list of arguments.
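The transformation described in ❷ can be sketched as follows (the number of events taken from the stream is an illustrative choice):

```clojure
;; hash-map expects alternating keys and values: apply "spreads" the
;; flat event sequence onto its argument list.
(apply hash-map (take 6 (event-stream)))
;; a map of three timestamp->value entries
```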
The following example illustrates apply used in conjunction with map. map accepts
any number of collections as input, so apply can be used to "spread" arguments
to map, for example to process a table of two-dimensional vectors:
(def header [:sold :sigma :end]) ; ❶
(def table [[120 3 399] [100 2 242] [130 6 3002]])
❶ header and table represents a typical destructuring of a two-dimensional table into Clojure data
structures. table contains the actual rows by group of 3 items, while header is the title for each
column.
❷ (apply map + table) for this example is equivalent to (map + [120 3 399] [100 2 242] [130
6 3002]). + can take any number of arguments (in this case 3) creating a total for each column.
❸ Finally, we add the title to each total.
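The calls described by the callouts can be sketched as follows (the use of zipmap in the final step is an assumption based on ❸):

```clojure
(apply map + table) ; ❷ same as (map + [120 3 399] [100 2 242] [130 6 3002])
;; (350 11 3643)

(zipmap header (apply map + table)) ; ❸ attach the column titles to the totals
;; {:sold 350, :sigma 11, :end 3643}
```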
λ[[y;z];cons[car[y];cdr[z]]]
The above is a function of two list arguments y,z that produces a new list as output merging (first
y) and (rest z), equivalent to (fn [y z] (cons (first y) (rest z))) in Clojure.
Sometime between 1958 and 1959, McCarthy wanted to prove that Lisp was better at expressing
computability than the formalism of the Turing Machine. Part of that challenge was also to define a
"universal Lisp function", a function able to parse and execute another copy of itself written with the
same syntax (exactly like the universal Turing Machine is able to accept a definition of itself).
McCarthy had to find a way to express Lisp functions in a form that could be digested by Lisp itself
and decided to encode them in lists, using the convention that the first element of the list was the name
of the function and the rest of the list the parameters. McCarthy called this new notation an S-expression
(where S stands for Symbolic). The above "cons" M-expression would look like the following as an S-
expression (which is perfectly valid modern Lisp):
(lambda (y z) (cons (car y) (cdr z)))
The universal function that was able to parse S-expressions and apply them to arguments was indeed
called apply. McCarthy envisioned apply purely as a research device with no practical scope, until Steve
Russell (one of his graduate students) decided to implement apply in machine language, effectively
creating the first Lisp REPL.
See also
• into can be used to create maps (along with other collection types), similarly to
what we saw in the examples. One difference is that the input sequence needs to
be already in the form of a collection of vector pairs.
• zipmap is the perfect choice to create a hash-map when you have two collections,
one containing the keys and the other containing the values. Combining the keys
and values together and passing them to apply would be more verbose.
• reduce can be used to concatenate strings similarly to apply with the restriction
that reduce only takes functions of 2 arguments. For example: (apply str ["h"
"e" "l" "l" "o"]) produces the same result as (reduce str ["h" "e" "l" "l"
"o"]).
• eval evaluates expressions as lists.
NOTE reduce performs worse than apply for string concatenation. str takes advantage
of java.lang.StringBuilder, a mutable Java object used to build strings incrementally, but only
when all arguments are passed at the same time. reduce instead calls str repeatedly with only 2
arguments, creating many intermediate string builders. As a rule of thumb,
use apply when the function is specifically optimized for long sequences of input.
❶ The benchmark measures apply while increasing the number of explicit arguments.
Beyond the 5th explicit argument apply creates a nested cons list using recursion. The
case with more than 5 arguments is uncommon, so apply should not be considered a
problematic performance hot-spot in normal circumstances.
2.4.2 memoize
function since 1.0
(memoize [f])
memoize generates a function that stores the results of an existing one using the
argument values as key. When the wrapped function is invoked with the same list of
arguments, the result is returned immediately from the cache without any additional
computation. The effects of memoize are readily visible if we print some message from
the wrapped function. We expect the message to appear once for each key:
(defn- f* [a b] ; ❶
  (println (format "Cache miss for [%s %s]" a b))
  (+ a b))

(def f (memoize f*)) ; the public, memoized version of f*

(f 1 2)
;; Cache miss for [1 2]
;; 3
(f 1 2)
;; 3
(f 1 3)
;; Cache miss for [1 3]
;; 4
The first invocation generates the message while subsequent invocations with the same
combination of arguments do not, confirming that the wrapped function f* is not invoked
again.
There is no universal convention for naming, but given the connection between the
target function and the one generated by memoize, the two names should be somewhat
related. In our examples, the public interface of the function remains the same, while
the non-memoized version is private and has a star "*" appended at the end.
CONTRACT
Input
• "f" needs to be a function and is a mandatory argument.
Notable exceptions
• ClassCastException if "f" is not callable.
• ArityException when called without arguments.
Output
• A new function of a variable number of arguments that stores the results of the
evaluation in an internal map.
Examples
memoize works well for non-trivial computations that accept and return values with a
small memory footprint. The following example illustrates the point. The Levenshtein
distance 39 is a simple metric to measure the difference between two strings. The
distance can be used, for example, to suggest corrections for common spelling
mistakes. The distance is straightforward to implement but becomes computationally
intensive for longer strings (roughly 10 characters or more). We could use memoize to
save us from computing the distance of the same pair of strings over and over again.
The input (the strings arguments) and the output (a small integer) are relatively small
in size, so we can cache a large amount of them without exhausting memory (assuming
the list of words with which the function is invoked is some finite number that we can
estimate).
To feed our example we are going to use a dictionary of words in plain text format (on
Unix systems such a file is available at "/usr/share/dict/words"). If we were asked to
implement an auto-correction service, it could work as follows:
1. The user inputs a misspelled word.
2. The system checks the distance of the word against the words in the dictionary.
3. Results are returned in order of smaller distance.
We are also going to pre-compute several small dictionaries starting with the initials of
the word, a technique to further speed-up the distance calculation:
39
The Wikipedia article contains a good introduction to the Levenshtein Distance
algorithm: en.wikipedia.org/wiki/Levenshtein_distance
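The listing for this example does not appear in this extract; the following is a minimal sketch consistent with the callouts that follow (the algorithm body, the dictionary path and the misspelled word are assumptions, and the naive recursion shown here is intentionally simple):

```clojure
(defn levenshtein* [s1 s2] ; ❶ naive recursive implementation, roughly O(n*m)
  (cond
    (empty? s1) (count s2)
    (empty? s2) (count s1)
    :else (min (inc (levenshtein* (rest s1) s2))
               (inc (levenshtein* s1 (rest s2)))
               (+ (if (= (first s1) (first s2)) 0 1)
                  (levenshtein* (rest s1) (rest s2))))))

(def levenshtein (memoize levenshtein*)) ; ❷

(defn to-words [path init] ; ❸ dictionary filtered by initial string
  (->> (slurp path)
       clojure.string/split-lines
       (filter #(clojure.string/starts-with? % init))))

(defn best [word dict] ; ❹ closest matches first
  (->> dict
       (map (fn [w] [w (levenshtein word w)]))
       (sort-by second)
       (take 3)))

(def dict (to-words "/usr/share/dict/words" "ac")) ; ❺

(time (best "achive" dict)) ; ❻
```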
❶ The Levenshtein algorithm presented here is a variation of the many similar ones available online. The
important aspect to remember is that it grows roughly as O(n*m) where n and m are the lengths of the
strings, or in other words O(n^2) in the worst case.
❷ This def actually builds the wrapping function through memoize, conveniently
called levenshtein without the final * that is reserved for the non-memoized version.
❸ to-words is a helper function to prepare the dictionary filtered by the initial string. to-words is part of
the "static" or "learning" phase of the algorithm, since we can prepare words by initial off-line and store
them for later use.
❹ The best function is responsible for the application of the levenshtein memoized function to the
words in the dictionary. It then sorts the results with sort-by and returns the lowest distances.
❺ The def invocation defines a dictionary filtered by the initial "ac" so it doesn’t need to be computed
multiple times. This also prevents the time function from reporting the time needed to read and process
the file.
❻ The first invocation to search the best matches for the misspelled word returns in almost 5 seconds.
The memoized version of the distance function stores each new pair of strings as a key
and the returned distance as the value in an internal map. Each time the function is
invoked with the same arguments the return value is fetched from the map.
The example also shows a way to "train" the memoized distance before actual use. A
real application could pre-compute a set of dictionaries by initials similar to the
indexing happening inside a database. This technique contributes to the speed-up seen
in our implementation, but for serious applications there are algorithms outperforming
Levenshtein 40.
Pure functions
The wrapped function needs to be referentially transparent. If there are factors other than the input
arguments influencing the result, the cached value could differ from what a fresh invocation would
return. The cache would then need to be aware of this side-effecting "context" and use it as part of the
key (if possible). Memoization becomes straightforward in functional languages supporting referential
transparency.
See also
• lazy-seq creates a "thunk" (a wrapper function around a value) that evaluates its
content on first access and returns a cached version on following calls. When the
thunks are joined together in a sequence they form a lazy sequence. Lazy sequences
are comparable to a cache where the order and value of the keys is predetermined.
An "evaluate once" semantic on collections can be achieved with “lazy-seq”.
Since all Clojure sequences are lazy, you might be already using a "cached data
structure" without knowing it.
• atom creates a Clojure Atom, one of the possible Clojure reference
types. memoize uses an atom to store results. Use a
custom “atom” when the memoize implementation is too restrictive for a specific kind
of caching. You can for example use something other than a Clojure “hash-map” to
store items, like a mutable Java map with soft references 41. Keep in mind that
there are already libraries like core.cache
(github.com/clojure/core.cache) providing common caching strategies, if this is
what you’re after.
40
See the list of metrics available on Wikipedia: en.wikipedia.org/wiki/String_metric
Performance considerations and implementation details
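The memoize2 listing referenced by the callouts below is not included in this extract; a sketch consistent with the callouts and the sample output that follows (the counter bookkeeping and the byte-estimate formula are assumptions) could be:

```clojure
(defn memoize2 [f]
  (let [cache (atom {}) ; ❶ counters live alongside the cache
        calls (atom 0)
        hits (atom 0)
        misses (atom 0)]
    (fn [& args]
      (if (= :done (first args)) ; ❷ sentinel to extract statistics
        (let [n-chars (reduce + (map count (flatten (keys @cache))))]
          {:calls @calls :hits @hits :misses @misses
           :count-chars n-chars
           :bytes (+ (* 2 n-chars) 40)}) ; ❸ rough string memory estimate
        (do
          (swap! calls inc) ; ❹ extra swap! operations on each call
          (if-let [e (find @cache args)]
            (do (swap! hits inc) (val e))
            (let [ret (apply f args)]
              (swap! misses inc)
              (swap! cache assoc args ret)
              ret)))))))
```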
❶ Along with the actual cache, additional counters are added to the initial let block.
❷ :done is a sentinel value that can be used to extract statistics during run-time.
❸ This is an estimate of the amount of memory necessary to store the keys given the number of chars 42.
❹ Additional swap! operations are performed to update counters.
41
There are several examples of use of SoftReference for caching in Java. This is a good starting
point: www2.sys-con.com/itsg/virtualcd/java/archives/0507/shields/index.html
42
A good enough formula to estimate the amount of memory necessary to store strings in Java
is: www.javamex.com/tutorials/memory/string_memory_usage.shtml
By accessing the additional stats at run-time, we can estimate the key-space
size or the memory footprint. If we run the same Levenshtein example replacing
memoize with memoize2 we can extract the following results:
(levenshtein :done)
{:calls 400, :hits 0, :misses 400 :count-chars 5168 :bytes 10376}
(levenshtein :done)
{:calls 800, :hits 400, :misses 400 :count-chars 5168 :bytes 10376}
As you can see, the first time the best function is invoked it generates 400 misses
while the second time it results in all hits. We can also see an estimate of the memory
taken by the strings stored in memory, which is around 10KB.
A second aspect to consider when using memoize is the additional hash-
map assoc operation and atom swap! that are added for each new key combination
presented as input. The hash-map takes O(log n) steps to add a new key, while
the atom could underperform under heavy thread contention. Depending on the
application requirements, memoize could be built on top of a transient data structure to
avoid the performance penalty of filling the cache. Another option to consider, when
possible, is "warming the cache": while the application is not yet serving live traffic,
the cache could be populated artificially with the most common keys.
2.4.3 trampoline
function since 1.0
(trampoline
([f])
([f & args]))
CONTRACT
Input
• "f" needs to be a function (such that (fn? object) yields true). "f" will need to
return an object so that (fn? object) is false at least once to prevent trampoline
from going into an infinite recursion.
• "args" are the optional arguments to pass to "f".
Output
• The result of invoking "f" over the optional "args" until the return type is not a
function. trampoline's exit condition checks the returned type with fn?. Vectors,
sets, keywords and symbols are also invocable objects, but they are not considered
invocable by trampoline.
WARNING If the input function "f" already returns a function as the final result, that function will need to
be wrapped in a collection (or other object so that (fn? object) is false) to make
sure trampoline has a proper exit condition.
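A minimal illustration of the warning above (make-adder is a hypothetical example function, not part of the original text):

```clojure
(defn make-adder [n]
  [(fn [x] (+ x n))]) ; wrap the returned function in a vector

(def add5 (first (trampoline make-adder 5))) ; unwrap after trampoline exits
(add5 2)
;; 7
```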
Examples
trampoline can be used to transform stack-consuming mutually recursive functions
into a tail-recursive iteration. Mutual recursion doesn’t occur that often in everyday
programming, but it has a couple of interesting applications. State machines, for
instance, are a well-known example of a problem that mutual recursion solves in an
elegant way. The following example shows how a traffic light (based on US traffic
laws) can be implemented as a state machine and how trampoline can be used to
prevent stack overflow in case of very long sequences of state transitions:
(defn- invoke ; ❶
[f-key & args]
(apply (resolve (symbol (name f-key))) args))
❶ invoke takes a function as a keyword (such as :+) and related args (1 2) and invokes (+ 1
2) provided :+ can be found in the current namespace. The example uses invoke to call one of the
possible traffic light states passing the rest of required transitions as arguments.
❷ The green state function deals with the traffic light when the green light is already on. The function will
determine what should happen given the next required state transition. Other functions for other
colors work the same way. The case switch is instructed to return false if the transition is not
possible, a condition that forces trampoline to break the chain. nil needs to be handled separately,
since this is the transition list terminator marker. The termination marker signals that all transitions
were successful. The catch-all branch at the end of the case statement deals with any additional valid
transition. invoke calls the next transition once the color keyword (any of :green, :amber or :red)
has been translated into the corresponding function.
❸ flashing-red and flashing-amber have one case less to deal with, because all states are allowed
from a flashing light condition. The case statement has been replaced with an if compared to previous
states.
❹ traffic-light is the entry point. It starts the chain of calls through trampoline. Once the traffic light
is turned on for the first time, the first state is flashing-amber.
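The state functions described by the callouts above are not reproduced in this extract; a sketch consistent with them (the specific legal transitions and the state bodies are assumptions; amber and red follow the same pattern as green) might be:

```clojure
(defn- green [[next & ts]] ; ❷ state reached when the green light is on
  (case next
    nil true            ; end of the transition list: everything was legal
    :green false        ; an impossible transition breaks the chain
    #(invoke next ts))) ; any other color: continue through a thunk

(defn- amber [[next & ts]]
  (case next nil true :amber false #(invoke next ts)))

(defn- red [[next & ts]]
  (case next nil true :red false #(invoke next ts)))

(defn- flashing-red [[next & ts]] ; ❸ every state is reachable from here
  (if next #(invoke next ts) true))

(defn- flashing-amber [[next & ts]]
  (if next #(invoke next ts) true))

(defn traffic-light [transitions] ; ❹ entry point, starting from flashing-amber
  (trampoline flashing-amber transitions))

(traffic-light [:green :amber :red :green])
;; true
```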
The last call to traffic-light in the example shows what happens when we pass a
lifetime-long list of traffic light states (with a 2-minute total time per loop, 10 million
cycles correspond to roughly 39 years of continuous traffic light activity). Every item
in the list could potentially create a new stack frame, but thanks to trampoline the
mutual recursion executes on the heap.
See also
• iterate has similar effects to recursion, but it creates a sequence of intermediate
results instead of returning the final result. “iterate” is not an alternative
to trampoline as they solve different problems.
• loop-recur is at the core of trampoline's implementation, eliminating the problem
of consuming the entire stack space.
Performance considerations and implementation details
Basic Constructs
3
This chapter groups together some of the most important constructs in Clojure (and
similarly other programming languages): conditional branching, iteration and local
scope definition. There are other aspects that could be added to this category, like
namespaces, variables or functions, but because of their complexity they have been
given a dedicated chapter.
You might be surprised to see things like conditionals, switch statements or loops as
being part of the standard library. But Clojure (as many other Lisps before it) builds on
a small core of primitives called special forms, and many functions that would be
considered reserved words in other languages are defined (or refined) in the standard
library. This is the reason why the Clojure standard library can be compared to a
language specification.
Although special forms are not technically part of the standard library (they are
implemented mainly in the Compiler on the Java side of Clojure), this book is going to
describe them anyway. The reason for this is that even though special forms are not
usually meant for the public language interface, Clojure is offering some of them
without any standard library wrapper: if and fn* for example don’t have a wrapper. In
the case of if the Java implementation is complete enough to be used directly,
while fn* exposes more advanced functionality that the wrapping macro “fn” can’t
offer (but as the "star" in the name suggests, the function is more directed at other
language implementors than the larger user community).
a let macro:
The symbol "b" defined by the let macro is only visible within the
surrounding parentheses. When add-one is invoked we can’t mention "b" anymore
because it cannot be resolved in the newly created scope. There is indeed a close
relationship between the scope created by a function declaration and the scope created
by a let-like form. let can in fact be considered syntactic sugar for a lambda function
invocation, as illustrated by the following example:
((fn [a b] (* (+ a b) b)) 1 2) ; ❶

(let [a 1 b 2] (* (+ a b) b)) ; ❷
❶ The anonymous function created with “fn” is invoked right away on a couple of arguments. The
function declares two arguments a and b locally bound to the values 1 and 2 respectively. Once inside
the body of the function the arguments can be used many times without any further re-evaluation. The
scope of a and b is bound lexically by the parenthesis defining the anonymous function. From the
reading perspective, the parameters and the values they are bound to are sitting at the extreme of the
function body.
❷ This let declaration achieves the same effect as the anonymous function but reads much better: the
symbols and values are now close together, followed by the main code block.
There is a clear equivalence between let and anonymous functions which sets local
bindings apart from the usual procedural variable assignment: it’s all just immutable
parameter passing. Despite this, even purely functional lexical binding is colloquially
referred to as "assignment" because of the striking similarities. Like imperative
assigned variables, let-bound symbols are available throughout the lexical scope
without any further evaluation of the expression they refer to. Although it’s common to
refer to symbols as "assigned variables", the similarity with the imperative world stops
right there:
• There is no concept of location where the value has been stored.
• Once bound, there is no way to mutate a symbol so it produces a different value.
• The same symbol can be re-bound by shadowing the previous one (which is not
mutated at all) using another binding form.
The macros and special forms in this group offer different possibilities to create lexical
bindings. The most general, let, is followed by a few variants that can conditionally
define symbols or functions. if-let and letfn for example are useful to remove some
typing overhead when creating local symbols. All let-like forms (except letfn which
has a slightly different syntax) accept a vector of pairs which are then used to create
the bindings and a body to execute against those bindings. Lexical binding forms
additionally offer facilities like destructuring, a concise syntax to allow portions of
Clojure collections to be directly assigned to symbols (see “destructure” for details on
how destructuring works and its syntax).
3.1.1 let and let*
macro (let) special-form (let*) since 1.0
let is a very frequently used Clojure macro. One of the main uses of let is to create a
local name which stands for the evaluation of an expression, so the expression doesn’t
need re-evaluation every time it’s used. For example:
(let [x (rand-int 10)] ; ❶
(if (>= x 5)
(str x " is above the average")
(str x " is below the average")))
❶ There is a 50% probability for "x" to be below or above 5. The evaluation of rand-int happens only
once.
Once the local binding "x" has been established, the symbol can be used without re-
evaluation of rand-int (which would then become problematic, since it would return
different values for each invocation).
Destructuring is another common case for using let, when applying the equivalent in
the function parameters is not possible or impractical. let* is the special form used
by let internally to parse and validate bindings. From the user perspective there is no
specific reason to use let* directly, so this chapter focuses mainly on let.
Contract
(let [bindings & body])
bindings :=>
[<bind1> <expr1>, <bind2> <expr2> .. <bind-N> <expr-N>]
Input
• "bindings" is a (possibly empty) vector containing an even number of elements.
• "bind1", "bind2", .. , "bind-N" are valid binding expressions as
per destructuring semantic. They must appear on an even index in the bindings
vector (position 0, 2, 4 and so on).
• "expr1", "expr2", .. , "expr-N" are valid Clojure expressions and must appear on
an odd index in the bindings vector (position 1, 3, 5 and so on).
• "body" is an optional group of expressions (they don’t need explicit wrapping in a
list or other data structure). The "body" is automatically wrapped in a do block.
Notable exceptions
• UnsupportedOperationException when type hinting a local binding with a
primitive type. For example the following expression is not valid: (let [^long i
0]). let automatically recognizes types for primitive locals (like longs, doubles,
etc.) and does not accept type hints in this case.
Output
let returns the evaluation of the last expression in "body" (if multiple are present),
allowing expressions to refer to the bound names set by the binding pairs. It
returns nil when "body" is empty.
Examples
The following code implements the interaction loop commonly found in games with
multiple players. If we assume a human playing against the computer, there is usually a
phase of "input" followed by an action taken by the computer, including printing the
current move on the screen or deciding who is the winner. Let’s take for example the
console version of rock-paper-scissor 43 :
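The rule function invoked by game-loop below is described by callout ❶ but its listing does not appear in this extract; a sketch consistent with the callout and the session transcript (the exact messages are assumptions) could be:

```clojure
(defn rule [moves] ; ❶
  (let [[p1 p2] moves] ; destructuring instead of first/last
    (cond
      (= p1 p2) "it's a tie!"
      (every? #{"paper" "rock"} moves) "paper wins over rock"
      (every? #{"scissor" "rock"} moves) "rock wins over scissor"
      (every? #{"paper" "scissor"} moves) "scissor wins over paper"
      :else "computer can't win that!")))
```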
(defn game-loop [] ; ❷
  (println "Rock, paper or scissors?")
  (let [human (read-line) ; ❸
        ai (rand-nth ["rock" "paper" "scissor"])
        res (rule [human ai])]
    (if (= "exit" human)
      "Game over"
      (do
        (println (format "Computer played %s: %s" ai res))
        (recur))))) ; ❹
43
Rock, Paper, Scissors is a very easy and popular game: en.wikipedia.org/wiki/Rock-paper-scissors
(game-loop)
;; Rock, paper or scissors?
;; Bang
;; Computer played scissor: computer can't win that!
;; Rock, paper or scissors?
;; paper
;; Computer played rock: paper wins over rock
;; Rock, paper or scissors?
;; exit
;; "Game over"
❶ rule contains the rock-paper-scissors rules, which are easy to implement. We need to check whether
the two choices are included in one of the possible sets (independently of the order) and return the
corresponding message. This is, for instance, an idiomatic use of a “set” as a function-predicate
with every? to verify each of the choices. let is used here for destructuring only: p1 and p2 can now
be referenced without any assistance from first or last to extract them from the moves parameter.
❷ the game-loop is a recursive function that repeats multiple plays until the human player types "exit" at
the console. read-line is used to read from standard input.
❸ let declares three local bindings that will be used (potentially multiple times) over the contained
block. You can see that ai is also used directly in the following binding to retrieve the rule results.
❹ We finally recur over the function (no loop statement).
The rock-paper-scissor example shows two facts about let (this extends to the other
flavors letfn and if-let): the locally bound symbol (in this case ai) is immediately
available for other binding definitions. This implicitly defines an ordering for the
evaluation of the right-side expressions, so they can refer to the previously defined
symbols.
The second interesting aspect of the example is that let has been used in
the rule function to destructure the single sequential (vector) argument into its first
and last components. Destructuring removes the need to use (= (first moves)
(last moves)) for the condition in the if statement, saving quite a few keystrokes.
Since let is so connected with the concept of function parameters, destructuring is
available for defn exactly in the same way. Using it in defn or in the inner let is
essentially a matter of opportunity and taste.
• Common Lisp let creates bindings independently (and potentially in parallel, although this is a
compiler implementation detail) so each individual pair cannot see local symbols defined by another
pair. All local symbols will then be available in the main let block at the same time.
• Common Lisp let* is instead the same as Clojure let, allowing the expression under evaluation to
establish a binding to see previously declared symbols right away.
The reason why Common Lisp offers the two forms and takes the less imposing let as the default choice
is often subject to debate 44. The author of Clojure decided to incorporate only the let* flavor into
Clojure (simply renamed let), preventing any further debate once and for all.
See Also
• letfn creates a local binding from a symbol directly into a function definition. It
replaces the slightly more verbose (let [f (fn [x])]) to declare a local
function.
• if-let and when-let are specialized let versions wrapping a condition on top of
the let definition. Use them when the let body starts with if or when. In this case
the let binding can be completely skipped if the expression in the pair
evaluates to nil.
• “for” could be considered a sequential let and indeed, it also supports
destructuring. Consider using “for” when the symbol should be bound to the next
element of a sequence each time the body is evaluated.
Performance Considerations and Implementation Details
44
This StackOverflow question summarizes the debate about the two different let forms in Common
Lisp: stackoverflow.com/questions/554949/let-versus-let-in-common-lisp
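The large-let macro itself is not shown in this extract; a sketch consistent with callout ❶ below (symbols a0, a1, .. bound sequentially, with their values summed in the body) could be:

```clojure
(defmacro large-let [n]
  (let [syms (map #(symbol (str "a" %)) (range n))]
    `(let [~@(interleave syms (range n))]
       (+ ~@syms))))

(macroexpand '(large-let 3)) ; ❶
;; (let* [a0 0 a1 1 a2 2] (clojure.core/+ a0 a1 a2))
```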
(large-let 5000) ; ❷
;; CompilerException java.lang.RuntimeException: Method code too large!
❶ macroexpand shows what the macro is doing, which is simply declaring a few
symbols a0, a1, .. sequentially and reducing their values in the body.
❷ large-let is then used to forge an unusually large let.
As you can see, large-let generates a large let definition that in turn generates
enough bytecode to go beyond the limit allowed by the JVM for the length of a single
method. Let’s use a disassemble utility like no.disassemble 45 to see what’s going on
under the hood:
(require '[no.disassemble :refer [disassemble]])
(println (disassemble (fn [] (large-let 2))))
❶ no.disassemble output has been cleaned up a little to show the most important features. Basically
the invoke() method, generated to allow the function created by “fn” to be invoked, is allocating a
long constant on the stack for each pair in the bindings, explaining why a large number of them can go
beyond the allowed method length.
The generated bytecode also explains the linear aspect of the performance profile, since
the let* Java code needs to iterate through each passed binding to create the necessary
bytecode.
45
"no.disassemble" is available on Github: github.com/gtrak/no.disassemble
(defmacro if-let
([bindings then])
([bindings then else]))
(defmacro when-let
[bindings & body])
(defmacro if-some
([bindings then])
([bindings then else]))
(defmacro when-some
[bindings & body])
if-let, when-let, if-some and when-some are specialized versions of let to create
lexically bound names. They support a single symbol-expression pair in the binding
vector. The form in the body is conditionally evaluated (with the symbol included in
the local scope) based on the expression being logical true/false (if-let and when-let)
or nil (if-some and when-some).
if-let and if-some allow the selection between two possible forms to be executed
based on the condition, while when-let and when-some either execute the forms (using
an implicit do) or return nil (equivalent to if and when semantics, respectively). Here
are some simple examples to demonstrate their use:
(if-let [n "then"] n "else")
;; "then"
(when-let [n "then"] n)
;; "then"
(when-let [n false] n)
;; nil
if-some and when-some are based on the expression being evaluated as "not nil". They
are better understood with a mental translation into "if-not-nil?" and "when-not-nil?":
(if-some [n "then"] n "else")
;; "then"
(if-some [n nil] n "else")
;; "else"
(when-some [n "then"] n)
;; "then"
(when-some [n nil] n)
;; nil
The only case where you need to be careful is where the concepts of being "logical
true" and "not nil" overlap and differ, such as testing false:
(if-let [n false] n "else") ; ❶
;; "else"
❶ if-let is testing for logical true/false. The expression is false hence the alternative body "else" is
returned.
(if-some [n false] n "else") ; ❷
;; false
❷ if-some tests for not nil. Since false is different from nil the expression (not (nil?
false)) is true and the first body, returning the content of the bound variable, is
evaluated.
Contract
(if-let [bind expr] <then-form> <else-form>)
(if-some [bind expr] <then-form> <else-form>)
Examples
The most common usage of conditional let expressions is in the context of a let form
immediately followed by an if or when condition testing for the content of the locally
bound symbol. The following function, for example, counts the lines of code
(LOC) for files on the classpath (the virtual file system that Java implements by
aggregating all known code sources):
(defn loc [resource]
(let [f (clojure.java.io/resource resource)] ; ❶
(when f
(count (clojure.string/split-lines (slurp f)))))) ; ❷
The loc function can be improved combining the creation and check on the local
binding "f" with when-let:
(defn loc [resource]
(when-let [f (clojure.java.io/resource resource)] ; ❶
(count (clojure.string/split-lines (slurp f)))))
❶ The when simply disappeared, removing one set of parentheses in the process.
The function can take a further step with if-let, returning 0 instead of nil when the
resource is missing:
(defn loc [resource]
  (if-let [f (clojure.java.io/resource resource)] ; ❶
    (count (clojure.string/split-lines (slurp f)))
    0)) ; ❷
❶ if-let is now replacing when-let. Since the "else" body is optional, this would work like before
without any other changes. In this case though, we want a specific value other than nil to be
returned.
❷ The "else" body is simply "0". This effectively prevents the function from returning nil.
❸ A positive effect of the introduction of if-let and the 0 default propagates down to the reduce: we
don’t need to think about the potential presence of nil anymore.
Despite the missing "let" in the name, if-some and when-some work the same as if-
let and when-let with a modification to accommodate scenarios
where nil, true or false are part of the business logic. One example of this behavior
happens while processing core.async channels 46.
core.async models computation as streams of items "flowing" through channels from
producer to consumer. Channels are designed to be "open-ended" and it’s an agreement
between consumer and producer to mark the end of the computation. By
calling close! on a channel, the producer sends a conventional nil element to signal
the consumer that there are no more items. This is the reason why nil cannot be sent
down a channel explicitly.
The following example shows the typical master-worker model using core.async. The
worker needs to loop on available items until the nil signal is reached, processing
them one by one. This is a good use case for if-some:
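The original listing is not reproduced here; the following sketch is reconstructed from the callouts below (the names master, worker and process and the "*" decoration follow the callout descriptions, while the rest of the structure is an assumption):

```clojure
(require '[clojure.core.async :as async
           :refer [chan go-loop <! >! <!! close!]])

(defn master [items in] ; ❶ send each item, then close the channel to signal the end
  (go-loop [items (seq items)]
    (if items
      (do (>! in (first items))
          (recur (next items)))
      (close! in))))

(defn worker [out] ; ❷ create the input channel the master will send through
  (let [in (chan)]
    (go-loop []
      (if-some [item (<! in)] ; ❸ a nil take means the input channel was closed
        (do (>! out (str item "*")) ; ❹ simulate processing by decorating with "*"
            (recur))
        (close! out))) ; ❺ propagate the end of the stream downstream
    in))

(defn process [items] ; ❻ coordinate worker and master
  (let [out (chan)]
    (master items (worker out)) ; ❼ the worker call evaluates first, returning the input channel
    (<!! (async/into [] out)))) ; collect the results into a vector

(process [1 2 3])
;; ["1*" "2*" "3*"]
```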
46
core.async is a popular library in Clojure to model concurrent or asynchronous processes. The homepage of the project
is: github.com/clojure/core.async
❶ The master function takes the items to process as input along with the channel the items should be
sent to. The master signals the end of the items by closing the channel.
❷ The worker receives the channel where results should be sent and creates the input channel that will
be used by the master to send items through.
❸ if-some assigns the next element to the symbol item in the following lexical scope. If the item is
different from nil (thus including potential boolean true or false) the item gets processed and
the loop recurs.
❹ Processing is simulated by decorating each item with "*".
❺ If the channel returns a nil the output will be closed.
❻ process coordinates worker and master. It also iterates the results from the output channel once the
computation finishes, transforming the result back into a sequence.
❼ This line effectively starts the computation. The worker call is evaluated first. Thanks to the go-block
the wait on the input channel won’t block (but just park) returning the input channel that is needed by
the master.
❽ Another example of if-some used for the same reasons as before.
Scheme letrec
letrec in Scheme expands on the concept of visibility, making symbols available even to expressions
coming before that symbol’s definition. letrec can be used to make mutually referencing let bindings, for
example (here translated into how it would look in Clojure):
The specific problem of mutually recursive functions can be solved with letfn in Clojure. A
potential letrec macro in Clojure requires some tricks. The main complexity is to "suspend" the symbol
that is not yet defined at the point of the first invocation and deliver the right expression when it’s first
used. One attempt at this was made by Michal Marczyk some time ago and is available as a gist 47.
47
letrec implementation in Clojure can be found here: gist.github.com/michalmarczyk/3c6b34b8db36e64b85c0
(aif true (println "it is" it) (println "no 'it' here"))
(aif false (println it) (println "no 'it' here"))
aif is similar to a simplified if-let macro that doesn’t require the binding vector. The fact that the
symbol it is injected by the macro brings two consequences:
• aif cannot be (easily) nested, since the it bindings would wrap and hide each other ambiguously.
• As with any captured binding, the user might legitimately use it in the outer scope, thinking that it
would resolve correctly inside aif as well:
(let [it 3]
(aif true (println "it is" it)))
it has been captured by the macro, so its value cannot be 3 during the println.
See Also
• let is the generic version of if-let, assigning the local binding unconditionally.
• if and when are the basic conditionals upon which if-let and when-let are based.
If there is no need for locally bound variables, you can use those directly.
3.1.3 letfn and letfn*
macro (letfn) special-form (letfn*) since 1.0
letfn is similar to the combination of let and “fn”. Apart from only being able to
declare locally scoped functions, letfn differs from let in that function names
are immediately available to all the functions at the same time, enabling mutually recursive
calls. letfn should also be considered whenever a non-trivial portion of the code
inside a function is sufficiently self-contained to deserve its own name but not general
enough to be extracted into the namespace. A trivial example of letfn would be to
48
The Arc programming language: arclanguage.github.io/ref/
letfn* is instead the special form responsible for most of the features behind the more
documented and widely used letfn; there is no particular value in using it directly.
Contract
(letfn [fnspec+] & body)
Other use cases involving letfn are related to self-contained bits of computation that
are private to a function and would otherwise break readability when left in the middle.
Have a look, for example, at the following locs-xform transducer. top-locs uses the
transducer to return the top 10 longest functions in a matching namespace:
(defn top-locs
([match] (top-locs match 10))
([match n]
(->>
(all-ns)
(sequence (locs-xform match)) ; ❹
(sort-by last >)
(take n))))
(top-locs "clojure.core" 1)
;; ['clojure.core/generate-class 382]
❶ The transducer chain starts by filtering out of a sequence of namespaces all the ones that are not
matching the given name. To do so it uses re-find.
❷ At some point in the transducer chain we need to transform a Var object into a fully qualified symbol
(such as from #'clojure.core/+ to 'clojure.core/+)
❸ Counting the lines of code is done by asking clojure.repl/source-fn to retrieve the original text of
the function, splitting into lines and counting. This is a very simple approach that doesn’t take into
account empty lines or comments.
❹ The transducer is transformed into a sequence that is then sorted by count and the last n elements are
returned.
As you can see in this second version, the transducer chain inside “comp” almost reads
like plain English:
1. Filter the matching namespaces.
2. Extract all the interned symbols with ns-interns.
3. Just take the vals of the resulting maps.
4. Extract the metadata from the related vars.
5. Translate the var name into a symbol name.
6. Assemble the pairs of names and their LOCs.
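The listing for locs-xform is not shown above; a sketch consistent with the six steps just listed might look like the following (the exact shape of the book’s transducer is an assumption):

```clojure
(require '[clojure.repl :as repl]
         '[clojure.string :as str])

(defn locs-xform [match]
  (comp
   (filter #(re-find (re-pattern match) (str (ns-name %)))) ; 1. keep matching namespaces
   (mapcat ns-interns)                                      ; 2. interned symbol/var pairs
   (map val)                                                ; 3. just the vars
   (map meta)                                               ; 4. their metadata
   (map #(symbol (str (ns-name (:ns %))) (str (:name %))))  ; 5. fully qualified symbol
   (map (fn [sym]                                           ; 6. pair each name with its LOCs
          [sym (count (str/split-lines
                       (or (repl/source-fn sym) "")))]))))
```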
Common Lisp also includes a slightly different macro flet that doesn’t have a Clojure equivalent, but it
would be the same as a Clojure let followed by an “fn” declaration: (let [a (fn [])]). The reason why one
would use flet instead of labels is not immediately obvious and involves shadowing of functions with the same
name. Using Clojure let + fn to simulate flet syntax:
You can note how the inner let is declaring a function a that is both defined in the outer let and re-
defined in the inner let. The second function a is making a call to (a n) that does not result in a stack
overflow because it’s not recursive. The same attempt using letfn would instead consume the stack,
because the call to a from the inner letfn would be recursive:
(a 2)))
;; StackOverflowError
See Also
• let is more generic than letfn. With let you can assign local bindings to any
expression not just function definitions. At the same time, let is unable to look
ahead for other symbol definitions, preventing mutually referencing expressions
(like we’ve seen in the first example). Prefer letfn when the only reason for the
local binding is a function declaration, or there are mutually referencing
expressions.
• “trampoline” should be used to invoke locally defined functions that are mutually
referencing, one of the options offered by letfn.
Performance Considerations and Implementation Details
49
en.wikipedia.org/wiki/Truth_table
Clojure departs from Common Lisp in what is considered false: in Common Lisp, for example,
the empty list () is false, while in Clojure it is true. In Clojure the only value (other
than false itself) that evaluates to logical false is nil.
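A quick REPL check illustrates the difference:

```clojure
(if '() :truthy :falsey) ; the empty list is logical true in Clojure
;; :truthy
(if 0 :truthy :falsey)   ; so is zero
;; :truthy
(if nil :truthy :falsey) ; only nil and false are logical false
;; :falsey
```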
Clojure also contains a rich set of bitwise operators (these are just functions, but since
they are often found implemented directly in hardware, we tend to call them
"operators" like others belonging to the CPU instruction set). Bitwise operators are
more efficient for some classes of operations frequently found in computer science. We
should also remember that arithmetic is always reduced to bit manipulation inside
the registers of the CPU (even when normal programming happens at a much higher
level of abstraction). We are going to see how to use them in the following sections.
3.2.1 not
function since 1.0
(not [x])
Like “complement”, not takes any kind of input (not necessarily boolean), mapping it
to either true or false. Despite its simplicity, not has an important role in improving
readability and expressiveness of code and is used pervasively in the standard library
itself. Many functions and macros like some?, “complement” and if-not are implemented
directly on top of not.
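For example, if-not is defined in clojure.core along these lines (slightly simplified, docstring omitted):

```clojure
(defmacro if-not
  ;; Evaluates test: if logical false, evaluates and returns then,
  ;; otherwise else (or nil when else is not supplied).
  ([test then] `(if-not ~test ~then nil))
  ([test then else]
   `(if (not ~test) ~then ~else)))
```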
The following is Table 3.2, the truth table for not:
x (not x)
true false
false true
Contract
• "x" is a single mandatory argument of any type
• returns: boolean true or false.
Examples
It’s common for strings to be tested to see if they are empty (zero character length), but
sometimes this definition needs to be extended to space-only strings.
The clojure.string namespace already contains blank? to test such a condition, but
it’s missing a complemented version. In the following pluralize function, for instance,
we use not to prevent appending "s" to a blank string:
(defn pluralize [s] ; ❶
(if (not (clojure.string/blank? s))
(str s "s")
s))
(pluralize "flower")
;; flowers
(pluralize "")
;; ""
❶ pluralize is a simple function that returns the plural of a word by appending "s".
When the negation of a boolean test has a strong conventional name, it might be good
to extract the form and make the name explicit, like the following weekday? function:
(defn weekend? [day]
(contains? #{"saturday" "sunday"} day))
(defn weekday? [day] ; ❶
(not (weekend? day)))
(weekday? "monday")
;; true
(weekend? "sunday")
;; true
(weekend? "monday")
;; false
❶ A week day is unambiguously everything outside a weekend. Instead of using (not (weekend?
day)) throughout the code, it’s better to just name a week day directly, avoiding the mental effort
involved in parsing a negative form.
See Also
Related not functions and macros in the standard library are often dealing with specific
cases of "negation". In general, prefer the more idiomatic use of a specific alternative
(when available) instead of building the same logic on top of not.
• “complement” uses not to negate the output of the function passed as argument.
Use “complement” for the specific case of negating the output of a function,
instead of the longer (not (f)).
• boolean can be considered the opposite of not, since it transforms its input into a
boolean without negating it. not achieves the same result returning the logical
opposite of its input.
• bit-not is negation for binary numbers. It negates a numeric operand by
considering its binary representation and converting each 1 to 0 and vice-versa.
Performance Considerations and Implementation Details
(and
([])
([x])
([x & next]))
(or
([])
([x])
([x & next]))
and and or are widely used macros. They implement logical conjunction and disjunction
respectively. One of the best ways to illustrate the behavior of logical operators is through
a truth table, where all the combinations of true and false are described 50:
p     q     (and p q) (or p q)
true  true  true      true
true  false false     true
false true  false     true
false false false     false
50
See the Wikipedia page related to logical connectives for more information at en.wikipedia.org/wiki/Logical_connective
From the table you can see that or is more tolerant of the presence
of false, while and only returns true when all operands are true. Although the table
only shows p and q, Clojure allows both "and" and "or" to receive more than two
arguments (see the contract section). Here’s, for example, a typical use of and for
conditional branching:
(let [probe {:temp 150 :rpm "max"}]
(when (and (> (:temp probe) 120) ; ❶
(= (:rpm probe) "max"))
(println "Too hot, going protection mode.")))
❶ and and or are frequently seen in conditions for if and when statements.
You can also use and and or outside conditions, for example for nil checking. We are
going to see this and other idiomatic uses in the example section below.
Contract
Both "and" and "or" accept 0 or more expressions and evaluate them left to
right. and returns:
• true in the absence of arguments.
• The argument in the case of a single argument (behaving like “identity”).
• false if any expression evaluates to false.
• nil if any expression evaluates to nil.
• The evaluation of the last expression in any other case.
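These cases are easy to verify at the REPL:

```clojure
(and)           ;; true: no arguments
(and :a)        ;; :a, the single argument itself
(and :a nil :b) ;; nil, the first logical false value wins
(and 1 2 3)     ;; 3, the last expression when all are logical true
```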
or returns:
• nil in the absence of arguments.
• The argument in the case of a single argument (behaving like “identity”).
• The value of the first expression that evaluates to logical true.
• The evaluation of the last expression when no expression evaluates to logical true.
(path "/tmp/exp/lol.txt")
;; "/tmp/exp"
(path "")
;; nil
(path nil)
;; nil
❶ The first and guard enables "s" to be safely trimmed, potentially resulting in a nil or an empty string.
This second "s" local binding will hide the one coming from the function parameter.
❷ The second and guard prevents subs from executing on an empty string. (seq coll) is the idiomatic way
to verify that a collection is not empty in Clojure.
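The definition of path is not shown above; a version consistent with the callouts could be (a sketch, names and details assumed):

```clojure
(defn path [s]
  (let [s (and s (clojure.string/trim s))] ; ❶ trim only when s is not nil
    (and (seq s)                           ; ❷ proceed only on a non-empty string
         (subs s 0 (.lastIndexOf s "/")))))

(path "/tmp/exp/lol.txt")
;; "/tmp/exp"
```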
or can be used to provide a default value in case of nil expressions, for example
parsing optional command line options:
(defn start-server [opts]
(let [port (or (:port opts) 8080)]
(str "starting server on localhost:" port)))
(start-server {})
;; "starting server on localhost:8080"
Both examples illustrated in this section are very idiomatic and appear often in
Clojure projects.
toward a purer approach that tries to isolate side effects, so it comes as no surprise that there is no
such operator in Clojure.
See Also
• and and or are macro-expanded in terms of nested if statements. See the
implementation details further down in this chapter.
• every? can be used to check if a collection of expressions all
evaluate to logical true with (every? identity [e1 e2 e3]), instead of the
invalid (apply and [e1 e2 e3]) (and is a macro and cannot be passed to apply).
• some-> or some->> is another option to exit a processing chain in the presence of
a nil.
Performance Considerations and Implementation Details
❶ and expands at compile time to invoke itself on the rest of the expressions until the last one is
reached.
❷ At runtime the nested if statements are executed, possibly stopping ahead of touching the bottom of
the chain at the first logical false value.
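For reference, and is defined in clojure.core as a recursive macro along these lines (docstring omitted):

```clojure
(defmacro and
  ([] true)
  ([x] x)
  ([x & next]
   ;; bind the first expression once, recur on the rest only when it is logical true
   `(let [and# ~x]
      (if and# (and ~@next) and#))))
```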
As you can see from the first let* expression, the short-circuiting logic applies at run-
time. So if some machine-generated code were to produce and forms with a large enough
number of expressions, it might incur a StackOverflowError even
in the presence of a false as the first condition:
(clojure.walk/macroexpand-all ; ❶
`(and
false
~@(take 1000 (repeat true))))
CompilerException java.lang.StackOverflowError
❶ We purposely create trouble for and by generating a compile time expression with 1000 arguments.
The scenario described above is unlikely and shouldn’t be of any concern in normal
applications.
3.2.3 bit-and and bit-or
function since 1.0
NOTE This section also touches briefly on other related functions such as: bit-xor, bit-not, bit-
flip, bit-set, bit-shift-right, bit-shift-left, bit-and-not, bit-clear, bit-test
and unsigned-bit-shift-right.
(bit-and [x y & more])
(bit-or [x y & more])
(bit-xor [x y & more])
(bit-and-not [x y & more])
(bit-not [x])
(bit-clear [x n])
(bit-set [x n])
(bit-flip [x n])
(bit-test [x n])
(bit-shift-left [x n])
(bit-shift-right [x n])
(unsigned-bit-shift-right [x n])
Clojure provides a rich set of bitwise operators. There is no "bit-set" type in Clojure,
but we can use bytes, shorts, integers or longs as bit containers:
(Long/toBinaryString 201) ; ❶
;; "11001001"
(Long/toBinaryString 198)
;; "11000110"
(Long/toBinaryString ; ❸
(bit-and 2r11001001 2r11000110))
;; "11000000"
Bitwise operators operate on bit patterns, providing a fast means to perform certain
classes of arithmetic functions. The speed gain is also a consequence of bits mapping
naturally to CPU internal registers: modern hardware usually offers native bitwise
operators that Clojure leverages via the JVM. One negative aspect of using bitwise
operators is that they are low level and tied to a particular bit size and representation.
Contract
Input
Bitwise operators can be divided into groups based on their input. Unless otherwise
specified, arguments have to be of type byte, short, int or long and cannot be nil:
• bit-not takes a single argument.
• bit-and, bit-or, bit-xor and bit-and-not require at least 2 arguments up to any
number.
• bit-clear, bit-set, bit-flip and bit-test all take 2 arguments: the
numerical bit-set representation and the index of a bit in the set
(starting from the least significant). bit-shift-left, bit-shift-right
and unsigned-bit-shift-right also take 2 arguments: the bit-set and the number
of positions to shift.
Notable exceptions
• IllegalArgumentException if the type of an argument is different from the accepted numeric types.
• NullPointerException if any argument is nil.
Output
All bitwise operators except bit-test return a java.lang.Long that, interpreted as
binary, is the result of the related bitwise operation. bit-test returns a boolean: true if
the bit at index "n" is "1", false otherwise.
Examples
Bitwise operations are normally introduced to speed up recurring arithmetic operations
using the lowest number of CPU cycles 51 . bit-and, bit-or, bit-xor, bit-shift-
left, bit-shift-right and unsigned-bit-shift-right are the fundamental
operations on which the others are built. We’ll have a look at them first and
introduce shorter forms when available.
bit-and
bit-and takes 2 or more arguments and performs the and operation on each pair
(triplet, quadruplet and so on) of corresponding bits:
(require '[clojure.pprint :refer [cl-format]])
51
Please refer to the Wikipedia page at en.wikipedia.org/wiki/Bitwise_operation for an in depth overview
(defn bin [n] ; ❶
  (cl-format nil "~8,'0b" n))

(bin (bit-and 2r11001001
              2r11000110
              2r01011110)) ; ❷
;; "01000000"
❶ bin uses cl-format to properly format binary numbers to a fixed 8 bit size. It is used here and in the rest
of the section for readability.
❷ In this example, bit-and accepts more than 2 arguments. The vertical alignment helps visualizing the
bit triplets involved in the operation.
We call "bit mask" a bit-set built on purpose to "mask" certain bits. Given a target bit
"x", the result of performing an and operation with "1" (true) answers the question if
"x" is true or false:
(def fourth-bit-set-mask 2r00001000) ; ❶

(bin (bit-and 2r11001001 fourth-bit-set-mask)) ; ❷
;; "00001000"
❶ This binary number has a "1" in 4th place (note that a Clojure symbol cannot start with a digit, hence
the spelled-out name). When used with bit-and it represents a mask to answer
the question "is the 4th bit set in the other argument?". We named the binary number in a definition to
clarify its meaning in the following bitwise operation.
❷ With bit-and we can perform "masking" to check if one or more bits are set to "1". The answer in this
example is that the 4th bit is indeed set to "1".
bit-test
bit-test collapses the creation of the mask and checking for a bit into a single
operation (bit-and is useful to perform the same operation on multiple bits at once):
(bit-test 2r11001001 3) ; ❶
;; true
❶ bit-test returns true if the bit at index 3 (0-based) is set to "1". bit-test internally creates the
necessary mask before delegating the question to Java’s bitwise and operation.
By flipping the bits in the masking bit-set, we achieve the effect of setting the
corresponding bits to zero:
(def turn-4th-bit-to-zero-mask 2r11110111)

(bin (bit-and 2r11001001 turn-4th-bit-to-zero-mask)) ; ❶
;; "11000001"
❶ Note that the bit paired up with a "0" in the mask gets set to "0" in the result. Anything else paired with
"1" remains unchanged. We can infer that true (or "1") is the "identity" value for and.
bit-clear
bit-clear achieves the same effect of setting a bit to "0" without the need to provide
a masking bit-set:
(bin (bit-clear 2r11001001 3)) ; ❶
;; "11000001"
❶ Using bit-clear to set the bit at index "3" (zero-based) to "0" (or false).
bit-or
bit-or works similarly to bit-and by applying the boolean operation or on bit pairs,
but bit-or masking is inverted compared to bit-and.
bit-xor
More interesting is the case of bit-xor. "xor" (which stands for "exclusive or") is a
variation on or where, if both bits are true, it results in false instead of true. The
following example illustrates the effect comparing bit-or and bit-xor:
(map bin ((juxt bit-or bit-xor) 2r1 2r1)) ; ❶
;; ("00000001" "00000000")
❶ We present "1" and "1" as operands to bit-or and bit-xor respectively (using juxt). This is the only
case in which the two bitwise operators differ.
bit-xor is particularly useful for the comparison of similar bit-sets. For example, we can
tell that two bit-sets are the same if the result only contains "0". The result contains "1"
for every bit that is different:
(bin (bit-xor 2r11001001
2r11001000)) ; ❶
;; "00000001"
❶ The bit-set contains "0" if the corresponding bit pair was the same, "1" if they were different. In this
example we can see the two bit-sets differ in one place only.
bit-xor is also useful with masking. A mask containing "1" achieves the effect of
"flipping" the bit at that position:
(bin (bit-xor 2r11001001
2r00010001)) ; ❶
;; "11011000"
❶ bit-xor with a mask where the bit in the least significant position (index 0 from the right) and the 4th
bit have been inverted.
bit-shift-right
Another big class of bitwise operations is shifting. Shifting consists of pushing all bits
to the right or left, discarding the least or the most significant bit, respectively. In Java all
numerical types are signed, so the most significant bit represents the sign. During
a right shift the sign bit is preserved and, for negative numbers, "1" is introduced as padding. By
preserving the sign bit, positive numbers remain positive and negative numbers
remain negative (this is also called "arithmetic shifting").
Let’s start by illustrating a simple right shift on a negative number. As you can see,
Clojure inherits Java’s semantics for bit operations, including the two’s complement
format to represent negatives 52:
52
A good overview of bitwise operations including some language implementation details is available at
en.wikipedia.org/wiki/Bitwise_operation#Arithmetic_shift
(Integer/toBinaryString -147) ; ❶
;; "11111111111111111111111101101101"

(Integer/toBinaryString (bit-shift-right -147 1)) ; ❷
;; "11111111111111111111111110110110"

(Integer/toBinaryString (bit-shift-right -147 2)) ; ❸
;; "11111111111111111111111111011011"
❶ We can print binary numbers using Integer/toBinaryString. This is similar to using cl-format like we
did at the beginning of the section, except that cl-format preserves zeroes on the left (if any). Note that the
number is expressed using the two’s complement format by flipping all the bits and adding 1.
❷ bit-shift-right shifts -147 1 bit to the right. The most significant bit (first from left) is the sign bit
which is left unchanged. The least significant bit on the right has been dropped.
❸ This time bit-shift-right pushes 2 bits to the right. Two "1"s are added on the left hand side and
"01" was dropped from the right.
Every position shifted to the right is equivalent to dividing the number by 2. More
generally, the number is divided by 2^n, with "n" the number of shifts:
(bit-shift-right -147 1) ; ❶
;; -74
(bit-shift-right -147 2) ; ❷
;; -37
bit-shift-left
It should come as no surprise that bit-shift-left has symmetrical effects to bit-
shift-right. One interesting property is that every left shift corresponds to
multiplying the number by 2^n, with "n" corresponding to the number of left shifts:
(dotimes [i 5] ; ❶
(println [(int (* -92337811 (Math/pow 2 i)))
(Integer/toBinaryString (bit-shift-left -92337811 i))]))
;; [-92337811 11111010011111110000100101101101] ; ❷
;; [-184675622 11110100111111100001001011011010]
;; [-369351244 11101001111111000010010110110100]
;; [-738702488 11010011111110000100101101101000]
;; [-1477404976 10100111111100001001011011010000]
❶ The effect of calling bit-shift-left up to 4 positions for -92337811. The expression prints both the
decimal and the corresponding binary number.
❷ The first line printed corresponds to a shift of zero positions, which is equivalent to the bit-set itself. As
the shift progresses, we can see "0"s pushed in from the right, while the sign bit is preserved.
For those cases where we can ignore the sign bit (because it doesn’t actually represent
a sign) we can use unsigned-bit-shift-right:
(require '[clojure.pprint :refer [cl-format]])
(dotimes [i 5] ; ❷
  (->> i
       (unsigned-bit-shift-right -23)
       Long/toBinaryString
       right-pad
       println))
;; 1111111111111111111111111111111111111111111111111111111111101001 ; ❸
;; 0111111111111111111111111111111111111111111111111111111111110100
;; 0011111111111111111111111111111111111111111111111111111111111010
;; 0001111111111111111111111111111111111111111111111111111111111101
;; 0000111111111111111111111111111111111111111111111111111111111110
❶ right-pad pads the binary string with "0" up to the full 64 bits, so that the zeroes introduced by the
shift remain visible.
❷ We can see the effect of shifting the number -23 right by up to 4 positions (the first line is the bit-set with no
shifting).
❸ Zeroes start to appear from the left, pushing ones to the right. By using a negative number, we make
sure we can see this effect clearly, contrasting zeroes and ones on the left side.
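right-pad is not defined in the text; a possible definition that pads the binary string to the full 64 bits (the implementation is assumed) is:

```clojure
;; Pad the binary string with "0" up to 64 characters, so that the
;; leading zeroes dropped by Long/toBinaryString remain visible.
(defn right-pad [s]
  (let [missing (- 64 (count s))]
    (str (apply str (repeat missing \0)) s)))
```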
Unsigned shift right (also known as "logical shifting") always pads with zeros from the
left, independently of the presence of a sign bit. Since Clojure always returns 64 bit
numbers of type long, we can now see the full resolution of bitwise operators. Logical
shifting on a negative number always returns a positive number, as a "0" will appear as
the most significant bit after padding.
NOTE there is no unsigned-bit-shift-left because the effect would be exactly the same
as bit-shift-left.
See Also
• “and, or” are the common boolean operators. Unless you’re interested in
processing multiple operations at once, you should probably use “and, or” instead
of bitwise operators.
(dotimes [i 8] ; ❶
  (println (Long/toBinaryString i)))
;; 0
;; 1
;; 10
;; 11
;; 100
;; 101
;; 110
;; 111
❶ This expression shows that increasing binary numbers form all the possible combinations of bits in
different position in the bit-set.
Taking advantage of this fact, we can formulate a new bit-powerset function that uses
one for loop to iterate the bit-sets and an inner loop to fetch the corresponding indexes
from the input collection:
(defn bit-powerset [coll]
(let [cnt (count coll)
bits (Math/pow 2 cnt)] ; ❶
(for [i (range bits)]
(for [j (range cnt)
:when (bit-test i j)] ; ❷
(nth coll j)))))
(bit-powerset [1 2 3]) ; ❸
;; (() (1) (2) (1 2) (3) (1 3) (2 3) (1 2 3))
❶ We need 2^n bit-sets, corresponding to the number of possible combinations of the items in "coll".
❷ The :when constraint in the for controls which elements from the input collection should end up in the
subset.
❸ bit-powerset returns all the combinations of the input, including the empty collection and the input
itself.
(if
([test then])
([test then else]))
(if-not
([test then])
([test then else]))
(when
[test & then])
(when-not
[test & then])
if, if-not, when and when-not are at the core of conditional branching in Clojure.
They are used (as in many other languages) to enable or prevent evaluation of some
part of the code. The condition for evaluation is any valid Clojure expression that is
used as logical true or false. if and if-not can be used to pick one of two branches,
while when and when-not support a decision on a single branch. The -not suffix in
either form simply inverts the meaning of the condition, resulting in enhanced
expressiveness when the "negative" should be given more prominence.
if can be used as simply as:
(if (pos? 1) "positive" "negative")
;; "positive"
Contract
Input
• "test" is a mandatory Clojure expression. After evaluation, the expression produces
a logical boolean value that is used to decide which other argument to evaluate.
• "then" is the first evaluable argument. Unlike arguments to normal functions, this
argument won’t necessarily be evaluated. It is mandatory for if and if-not and optional
for when and when-not. when and when-not automatically consider "then" wrapped
in a do block.
• "else" is meaningful for if and if-not (when and when-not will just treat it as
additional "then" forms, part of the implicit do block). When present, it evaluates
when the "test" is false (for if) or when the test is true (for if-not). When not
present, it behaves as if a nil was passed: (if false :a) is equivalent to (if
false :a nil)
Output
Returns: the result of the evaluation of the expressions depending on the condition.
if evaluates:
• "then" when "test" evaluates to logical true.
• "else" (or nil when "else" is missing) when "test" evaluates to logical false.
❶ “rand and rand-int” returns a float between 0 and 1. Asking if what was returned is above or below the
mid-point of 0.5 is equivalent to a 50% chance.
❷ “repeatedly” is a nice function to call another function continuously. We can then simulate multiple
tosses of a coin easily and take as many as we wish.
of the if. We can simply use if-not to keep having the exit condition at the top
without using a “not”:
(def tree
[:a 1 :b :c
[:d
[1 2 3 :a
[1 2
[1 2
[3
[4
[0]]]]] ; ❶
[:z
[1 2
[1]]]
8]]
nil])
(depth tree)
;; 8
❶ We simulate a tree by arbitrarily nesting vectors. The most indented item is 8 levels deep.
❷ We take advantage of if-not to enforce the fact that the first branch, when selected, means a few
important facts: we reached a leaf, we return a result and we don’t go into further recursion.
❸ The result of mapping over a sequence using the function itself as the mapping function produces a
similarly nested sequence where elements have been replaced with a count (in this case). Therefore
we need to “flatten” and take the “max and min”.
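The depth function itself is not reproduced above; a version matching the callouts could be (a reconstruction, details assumed):

```clojure
(defn depth
  ([t] (depth t 0))
  ([t level]
   (if-not (vector? t) ; ❷ a leaf: stop recursing and return the current level
     level
     ;; ❸ map depth over the children, flatten the nested counts, take the max
     (apply max level
            (flatten (map #(depth % (inc level)) t))))))
```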
Although not a universal rule, the presence of when and when-not may indicate side effects,
since the returned nil is typically just discarded. For example, when is pretty common in the
tear-down phase of component systems to close connections:
(defn start [] ; ❶
(try
(java.net.ServerSocket. 9393 0
(java.net.InetAddress/getByName "localhost"))
(catch Exception e
(println "error starting the socket"))))

(defn stop [socket] ; ❷
  (when socket
    (.close socket)))

(def socket (start))
(.isClosed socket)
;; false
(stop socket) ; ❸
;; nil
(.isClosed socket)
;; true
❶ Starting a socket with Java interop is quite simple. start returns the newly created socket in open
state.
❷ when is used here as a guard against a potentially nil socket that wasn’t correctly set up during
initialization. We really care about closing the socket if the socket is there, nothing otherwise.
❸ The client of the side effecting when is not interested in knowing the results of the operation.
See Also
• “not” is the explicit way to invert the meaning of if or when. It’s unlikely you’ll
have to use it instead of if-not or when-not.
• “cond” is essentially nested if-else statements in a readable form. Use them when
multiple nested conditions are necessary.
• if-let and when-let can be used when conditional branching follows a let binding
and the condition happens on the symbol that was just bound.
Performance Considerations and Implementation Details
(let [a false
      b true]
  (cond
    a :a
    b :b ; ❶
    :else :c)) ; ❷
;; :b
❶ b is declared as true in the let binding. cond will then return the corresponding expression, in this
case the keyword :b.
❷ Note the last :else :c condition-expression pair, which will be used as a default in case no other
condition matches. :else is a completely arbitrary "truthy" value (any other keyword or string could be
used except nil and false).
cond reads easier than the corresponding nested if since conditions and expressions are
vertically aligned, quickly showing which branch belongs to which test expression.
The catch-all :else :c last pair for example is much easier to see than the
corresponding nested “if, if-not, when and when-not” where it ends up as the most
nested form. It’s worth noticing that :else is conventionally used as the last condition, but
any logical true value could be used (which in Clojure is anything other
than nil and false).
Contract
(cond [clauses])
Input
• "clauses" can be zero or more and will be evaluated in order.
• "clause" is a pair formed by a "condition" and an "expression".
• "condition" is any valid Clojure form.
• "expression" can be any valid Clojure form.
Notable exceptions
• IllegalArgumentException: when the number of forms passed is odd,
implying there is at least one incomplete pair; for example (cond (= 1 1)) would
throw an exception because there is no form to evaluate as the result of
the true condition.
Output
cond returns:
• nil when no clauses are given or when no condition evaluates to logical true.
• the evaluation of the "expression" paired with the first "condition" that evaluates to logical true.
Examples
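The response-code function and its sample data used below can be sketched as follows, consistent with callout ❶ (the keys and values are assumptions):

```clojure
(def good-data {:ok true})
(def bad-data {:error "boom"})

(defn response-code [data] ; ❶
  (cond
    (:error data) 500 ; an error maps to a server error
    (:ok data)    200 ; successful data maps to OK
    :else         400)) ; default when nothing else matches
```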
(response-code good-data)
;; 200
(response-code bad-data)
;; 500
❶ response-code contains a cond form with 3 options. The data parameter is inspected for errors or
failures. A default of 400 is returned if no other condition matches.
As a general rule of thumb, cond should be used for any condition requiring 3 or more
branches, while it would be overkill compared to if for the frequent case of 2 branches.
if a > b
print "X"
elsif a == b
print "Y"
else
print "Z"
end
var = 100
if var == 200:
print "1 - Got a true expression value"
elif var == 150:
print "2 - Got a true expression value"
elif var == 100:
print "3 - Got a true expression value"
else:
print "4 - Got a false expression value"
Ruby’s elsif and Python’s elif are reserved words that the compiler understands natively. By
defining cond as a macro, Clojure supports additional conditional branches without
adding any additional complexity to the compiler.
See Also
• if is still a possible solution for short "if-else" combinations but cond normally
reads better. Prefer cond over 2 or more nested if statements.
• “defmulti and defmethod” define multimethods in Clojure.
Consider using multimethods if the kind and number of conditions in
a cond tends to grow frequently to handle previously unknown cases. “defmulti
and defmethod” offer a flexible polymorphic dispatch, including the possibility to
extend the multimethod from different namespaces (while all cond expressions
need to be defined inside a single form).
• cond-> combines the evaluation of multiple conditions with the option to thread a value
through the expressions. Use it when, based on conditions, you also want to
gradually build a result.
• “condp” avoids some typing if the condition just repeats over different values, for
example (cond (= x 1) "a" (= x 2) "b").
Performance Considerations and Implementation Details
Normal use of cond (e.g. not macro generated) should not be particularly relevant
during performance analysis. In order to see how many clauses could be used before
exhausting the stack, the curious reader can try to execute the following loop that
increasingly creates larger and larger cond:
(require '[clojure.walk :as walk])

(doseq [n (filter even? (range 10000))]
  (println n)
  (walk/macroexpand-all ; ❶
    `(cond ~@(repeat n false))))
;; ...
;; ...
;; 2040
;; 2042
;; 2044
❶ The technique used here consists of fully expanding a cond invocation, using unquote-splicing to
give cond a large list of clauses.
The above example generates and evaluates a cond form containing 1022 ((/ 2044 2))
pairs. This should be considered a very unusual case to find in real code that is not
machine-generated, and is thus not relevant for standard performance analysis.
3.3.3 condp
macro since 1.0
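The op function invoked in the opening example can be sketched as a condp dispatch from an operator name to a function (the exact definition is an assumption consistent with the description below):

```clojure
(defn op [selector]
  (condp = selector ; the predicate = is applied to each choice in turn
    "plus"  +
    "minus" -
    "mult"  *))
```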
((op "mult") 3 3)
;; 9
The predicate ("=" in the example) is applied to "plus", "minus" and finally "mult", which
is the first evaluation returning logical true, hence * is selected as the return value.
Contract
(condp <pred> <expr> [clauses] [<default>])
Input
• "pred" is a mandatory function of 2 arguments ("selector" and "expr"). The return
value is interpreted as logical boolean.
• "expr" is mandatory and can be any valid Clojure expression.
• "clauses" can be zero or more and are evaluated in order.
• "clause" can contain 2 (a "pair") or 3 items (a "triplet")
• "pair" is a "selector" followed by a "choice". Both are valid Clojure expression of
any type.
• "triplet" is a "selector" followed by the symbol :>> and a function "f". The selector
is any valid Clojure expression while "f" must take a single argument of any type
and can return any type.
• "default" is any valid Clojure expression.
Notable exceptions
• condp throws IllegalArgumentException when a matching clause cannot be
found (in contrast with cond, which would return nil instead) and no "default" is
provided.
Output
• "default" when there is no matching clause.
• the evaluation of "choice" of the first pair-clause where (pred selector expr) is
logical true.
• the evaluation of (f (pred selector expr)) for the first triplet-clause
where (pred selector expr) is logical true.
Examples
The mime-type function is in charge of setting the right mime-type (the media type,
also known as mime-type, is used by browsers to interpret the response returned by a
web server, which is ultimately just a stream of bytes) by looking at the extension of the
URL given as argument. We could use condp to decide what mime-type to assign:
(defn extension [url] ; ❶
(last (clojure.string/split url #"\.")))
(defn mime-type [url] ; ❷
  (let [ext (extension url)]
    (condp = ext
      "jpg" "image/jpeg"
      "png" "image/png"
      "bmp" "image/bmp"
      "application/octet-stream")))
(mime-type "http://example.com/image.jpg") ; ❸
;; "image/jpeg"
(mime-type "http://example.com/binary.bin")
;; "application/octet-stream"
❶ extension is a helper function to extract the last part of the url after ".".
❷ mime-type passes the extension through condp to decide which mime-type it corresponds to. Note
that the default "octet-stream" identifies a generic binary type that we couldn’t recognize.
❸ The returned string is the mime-type that can be used in the response.
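The FizzBuzz implementation described by the callouts below can be sketched as follows (the fizzbuzz name, the predicate shape and the use of map are assumptions):

```clojure
(defn fizzbuzz [n]
  (condp #(zero? (mod %2 %1)) n ; true when n is a multiple of the selector
    15 "fizzbuzz" ; must come first: multiples of 15 also match 3 and 5
    3  "fizz"
    5  "buzz"
    n)) ; default: the number itself

(map fizzbuzz (range 1 16)) ; results over the positive naturals; nth isolates one item
```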
❶ To solve FizzBuzz, we use a predicate of two arguments. The predicate returns true if one number is
a multiple of the other.
❷ The string "fizzbuzz" needs to appear first to avoid returning "fizz" or "buzz" for numbers divisible by
15 (3 and 5 are both divisors of 15).
❸ This FizzBuzz implementation works on the natural positive numbers to retrieve results. We could
use nth to isolate a single item from the results.
The last example shows how we can use :>>, a special keyword in condp, to attach
actions to choices. It works the same as the basic condp but when the :>> keyword is
present in the clause, the last element of the triplet is considered a function and
invoked with the result of the predicate. In the following (simplified) Poker game
implementation condp is at the core of the game decision step 53.
The first set of functions are helpers used later on to identify relevant combination of
cards:
(def card-rank first) ; ❶
(def card-suit second)
52
FizzBuzz, also popular in developer interviews, is a game to teach division to children: en.wikipedia.org/wiki/Fizz_buzz
53
Here is a nice summary of the standard rules of Poker: en.wikipedia.org/wiki/List_of_poker_hands
(defn rank-freqs [hand] ; helper name assumed; counts occurrences of each rank
  (->> hand
       (map card-rank)
       frequencies))
❶ card-rank and card-suit are aliases for first and second respectively. Using aliases in this case
helps readability by giving a precise meaning to an otherwise very general standard library function
(thanks to Ted Schrader for suggesting this and other changes in this section).
❷ The first functions of the example are helpers arranging cards by suit (one of the 4 types) or by rank
(in our example, the Jack, Queen, King and Ace have been numbered 11, 12, 13 and 14 respectively).
The next set of functions builds on top of the previous to identify winning
combinations for the game of Poker. There are more, but in this example we
implemented just a few to keep the example shorter:
(defn three-of-a-kind [hand] ; ❶
(n-of-a-kind hand 3))
❶ Using n-of-a-kind we can create functions to identify if the hand contains 3 or 4 of the same kind of
cards.
❷ A straight flush requires additional logic to sort cards.
❸ Functions to recognize winning combinations use the thread-last operator ->> to combine helper
functions in a meaningful way.
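The n-of-a-kind helper used by three-of-a-kind above can be sketched as follows (the exact body is an assumption): it checks whether any rank occurs exactly n times in the hand.

```clojure
;; card-rank is the alias for first defined earlier in the example
(defn n-of-a-kind [hand n] ; logical true when some rank appears n times
  (->> hand
       (map card-rank)
       frequencies
       vals
       (some #{n})))

(defn four-of-a-kind [hand] ; used later by the game function
  (n-of-a-kind hand 4))
```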
Finally, condp sits at the core of the game to determine who is the winner given a set of
players sitting at the table. This is accomplished by filtering players based on different
kind of winning combinations and then selecting the highest in case of tie:
(defn game [players]
(condp (comp seq filter) players ; ❶
straight-flush :>> straight-flush-highest
four-of-a-kind :>> n-of-a-kind-highest
three-of-a-kind :>> n-of-a-kind-highest
(n-of-a-kind-highest players)))
❶ condp combines a filter operation with seq through comp so that if the filter returns an empty list, then
it results in a nil.
The simplified game of Poker described here checks only 3 out of the 7 potential
winning conditions for a real game, not considering a full-house or a
straight. condp aggregates the decision logic around the following design:
• Higher ranking combinations should be checked first because as soon as we have
a match (for example four of a kind) we are not interested in other lower ranking
combinations.
• The predicate gives us the possibility to filter players by cards combinations and
passing them through to the related clause.
• In case of players with equally ranking combinations, we pass the matching
players to the clause function (through :>>) that sorts the combination based on a
more specific ranking.
The following games verify if the Poker game has been implemented correctly. Each
card is encoded as a pair of rank-suit where clubs (♣) is ":c", diamonds (♦) is ":d",
hearts (♥) is ":h" and spades (♠) is ":s":
(game [#{[8 :h] [2 :h] [2 :s] [2 :c] [2 :d] } ; ❶
#{[8 :h] [1 :h] [1 :s] [1 :c] [1 :d] }
#{[2 :h] [2 :s] [2 :d] [12 :s] [12 :h]}
#{[5 :d] [4 :s] [7 :d] [14 :s] [14 :h]}
#{[8 :s] [4 :c] [3 :d] [10 :s] [10 :h]}])
❶ Games are implemented as collections of sets. Each set represents a player. We encode cards as
pairs with a rank and a suit.
❷ This game simulation has 4 players. The hand with a four of a kind wins.
❸ This game contains two straight flushes. The one with the highest rank wins.
See Also
• “cond” supports similar functionality to condp. Use “cond” when you need a
different predicate for each clause. Use condp if you have the same predicate or you
are interested in the :>> form to trigger a function after a match.
• cond-> has a similar intent of selecting one or more branches (although it is not short-
circuiting, so it might execute multiple true branches). Use cond-> when you
don’t need to repeat the same predicate and you are interested in executing multiple
branches.
Performance Considerations and Implementation Details
54
The thread talking about fcase and condp inclusion into the standard
library: groups.google.com/forum/#!topic/clojure/3ukQvvYpYDU
55
The thread where addition of :>> was discussed for
condp: groups.google.com/d/msg/clojure/DnULBF2HAfc/1nfJS7n3BQYJ. It was proposed by Meikel Brandmeyer.
56
cond documentation in Scheme is available here: docs.racket-lang.org/guide/conditionals.html
case differs from “cond” or “condp” in several ways, but it can be considered part of the same family of macros:
(let [n 1] ; ❶
(case n
0 "O"
1 "l"
4 "A"))
;; l
Under the surface, case diverges from cond in its treatment of test expressions, which
are not evaluated at macro-expansion time. This means that an expression like (inc
0) is not evaluated to "1" before being used as a test expression. In the context of case, (inc 0) is
equivalent to the set containing the symbol inc and the number 0:
(let [n 1]
(case n
(inc 0) "inc" ; ❶
(dec 1) "dec" ; ❷
:none))
;; "dec"
❶ This branch of the case statements verifies if the number "1" (the current local binding of the symbol
"n") is present in the set formed by "inc" and "0". The answer is false and the control moves forward.
❷ The following branch contains the number "1" and "dec" is selected as the answer.
Input
• "expr" is mandatory and can be any valid Clojure expression.
• "clauses" are grouped into zero or more pairs. If there are no clauses, there should
be at least one "default" expression.
• "test" is a compile-time literal and is not evaluated at macro-expansion time.
Examples of valid literals
are: :a (keywords), 'a (symbols), 1, 1.0, 1M, 1N (numbers), {} #{} ()
57
To know more about the "tableswitch" JVM instruction please read the following article about control flow in the Java virtual
machine: www.artima.com/underthehood/flowP.html
Exceptions
java.lang.IllegalArgumentException when:
• there is no matching "test" for the given expression and no "default" is given.
• there is a duplicate "test" constant.
Output
• case returns the "default" if one or more clauses are present but none is matching.
• case returns the evaluation of "then" for the first pair-clause where (identical?
test expr) is true.
Examples
Let’s first clarify some aspects of the contract. case tests are compile-time literals, with
implications like the following example, which tries to use the symbols 'alpha, 'beta and 'pi for
branching:
(case 'pi ; ❶
'alpha \α
'beta \β
'pi \π)
(macroexpand ''alpha) ; ❷
;; (quote alpha)
(case 'pi ; ❸
(quote alpha) \α
(quote beta) \β
(quote pi) \π)
(case 'pi ; ❹
alpha \α
beta \β
pi \π)
;; \π
❶ Symbols like 'alpha that would be evaluated as the symbol itself at the REPL, are not evaluated
here. This case expression fails claiming that there is a "quote" symbol somewhere that we don’t see
immediately.
❷ case sees the quoted version of 'alpha at macro expansion time, which is equivalent to "double-
quoting" the symbol at the REPL like shown here.
❸ If we replace the single quote char ' using the full call (quote) instead, we can see what is wrong.
The symbol "quote" appears inside all test expressions on the left hand side, resulting in
ambiguous multiple matching branches. Also note that the list (quote alpha) is testing for the
presence of "'pi" inside the set formed by "quote" and "alpha", and it’s not treated as a
proper clojure.lang.PersistentList instance.
❹ The correct way to match against symbols is to completely remove the single quote from test constants.
You should take particular care using case with test expressions other than numbers,
strings and keywords. The special cases to remember are:
• Expressions containing reader macros are compared ahead of their expansion. We
saw the example of a single quoted symbol, but other common cases
are var literal #' or deref literal @.
• List literals are compared for inclusion rather than equivalence (see example
below).
• Other collection literals, such as vectors, sets and maps are compared using
normal equality.
case compares list literals by checking if they contain the test expression. We can take
advantage of list literals to enumerate matching operators in the following infix
calculator:
(defn error [& args]
(println "Unrecognized operator for" args))
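The operator, execute and calculator functions described by the callouts below can be sketched as follows (the exact synonyms and tokenization are assumptions):

```clojure
(defn operator [op] ; ❶ translate an operator string into a Clojure function
  (case op
    ("+" "add" "plus")  +
    ("-" "sub" "minus") -
    ("*" "mul" "times") *
    ("/" "div")         /
    (error op))) ; fall back to the error handler defined above

(defn execute [[op1 sym op2]] ; ❷ infix: operand, operator, operand
  ((operator sym) op1 op2))

(defn calculator [s] ; ❸ split the raw string into tokens and evaluate
  (let [[a op b] (clojure.string/split s #" ")]
    (execute [(read-string a) op (read-string b)])))

(calculator "3 times 4") ; ❹
;; 12
```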
❶ operator translates an operator as string into the corresponding Clojure function. We can use case to
select between operations or an error function to handle unrecognized operators. Note how we can
add multiple synonyms for the four basic operations using a list literal.
❷ execute takes the operator and operands and evaluates the corresponding operation once it has been
translated by the case statement.
❸ calculator takes the raw unevaluated string and converts it into "tokens" ready for evaluation.
❹ Invoking the calculator produces the expected results.
Considering lists have a special meaning for case, we are apparently in trouble if we
want to compare lists as actual collections. Clojure equality, however, does not distinguish
between lists and vectors as container types, but only compares their contents, allowing
us to match against lists. We are going to see how in the following example, designed
to score the effectiveness of Vim users at the keyboard 58.
Vim is a popular editor that leverages short mnemonic key sequences to execute
arbitrarily complex tasks. We could score a user based on the best key combination to
achieve some editing task (usually the smallest number of keystrokes wins). For
simplicity we are going to consider the very simple task of moving the cursor from the
lower-left corner of a 5x5 grid terminal to the upper-right corner, as shown in
the picture below:
Figure 3.2. Visually representing Vim keystrokes movement to move from one corner to the
other.
The letter "k" moves the cursor up while the letter "l" moves it to the right. One poor
solution would be to hit "k" four times followed by hitting "l" four more times
(diagram on the left): in this case we are going to acknowledge the accomplishment but
give it a low score of "5". A better solution would be to press "4" followed by the
moving letter, halving the number of keystrokes compared to the previous solution
(picture on the right). The code to score such a result could be implemented as the
following case statement:
(defn score [ks]
(case ks ; ❶
[\k \k \k \k \l \l \l \l] 5
58
Vim is a popular text editor that thanks to editing contexts has very short key combinations.
See en.wikipedia.org/wiki/Vim_(text_editor) to know more.
[\4 \k \4 \l] 10
0))
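The check function invoked below presumably just turns the input string into characters before scoring, per callout ❷ (a sketch):

```clojure
(defn check [s]
  (score (seq s))) ; a string becomes a sequence of characters, e.g. (\k \l)
```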
(check "kl")
;; 0
(check "kkkkllll")
;; 5
(check "4k4l")
;; 10
❶ We group the movement constants in a vector, each vector representing one test expression in
the case statement. Note that case does not consider the presence of the letter "k" or "l" in multiple
vectors as duplication (it would be an exception if we used list literals).
❷ Since the input is a string, we just need to call seq on it to transform it into a sequence of characters.
The practical implication for Clojure is that there must be a way to transform compile-time constants (or
groupings thereof) into integers, and to shift/mask the integers to obtain the smallest possible gap between
keys. Another potential problem happens on hash collisions and in general when transforming
composites into integers. So despite the simple idea, Clojure has to do quite a lot of non-trivial
processing to get it right 59. A few fairly complicated functions (prep-hashes, merge-hash-collisions,
fits-table? and others) are dedicated in "core.clj" to transform case constants into a
gap-less list of non-clashing integers.
59
A good selection of case corner cases is visible on this ticket: dev.clojure.org/jira/browse/CLJ-426
See Also
• “cond” has a similar semantic compared to case. The most notable difference is
that cond evaluates its test expressions at runtime, while case requires compile-time constants.
• “condp” allows you to supply the predicate that should be used for matching and adds
the additional :>> semantic.
“cond” and “condp” are in general more flexible. As a rule of thumb, prefer case in the
presence of literals or when performance is specifically important.
Performance Considerations and Implementation Details
(defn c1 [n]
(cond
(= n 0) "0" (= n 1) "1"
(= n 2) "2" (= n 3) "3"
(= n 4) "4" (= n 5) "5"
(= n 6) "6" (= n 7) "7"
(= n 8) "8" (= n 9) "9"
:default :none))
(defn c2 [n]
(case n
0 "0" 1 "1"
2 "2" 3 "3"
4 "4" 5 "5"
6 "6" 7 "7"
8 "8" 9 "9"
:default))
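The timings quoted below were presumably collected with Criterium (see the footnote); a typical benchmarking session could look like this sketch, assuming the criterium dependency is on the classpath:

```clojure
(require '[criterium.core :refer [bench]])

(bench (c1 9)) ; reports mean execution time for the cond version
(bench (c2 9)) ; and for the case version
```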
As you can see the mean execution time goes from 10.825367 ns for the version
using “cond” to 6.716657 ns for the version using case, which is about 40% faster.
The speedup is also explained by the fact that “cond” is using the "=" equality operator
60
Criterium is the de-facto benchmarking tool for Clojure: github.com/hugoduncan/criterium
while case, being based on constant literals, is implicitly using reference equality. A
more "fair" benchmark could use identical?, but that would restrict the normal
operational spectrum of “cond” with potentially surprising results:
(defn c1 [n]
(case n 127 "127" 128 "128" :none))
(c1 127)
;; "127"
(c1 128) ; ❶
;; "128"
(defn c2 [n]
(cond (identical? n 127) "127" (identical? n 128) "128" :else :none))
(c2 127)
;; "127"
(c2 128) ; ❷
;; :none
Please note that there is nothing wrong with the implementation of “cond”: the difference has
more to do with the implications of using identical? as the equality
operator. case simply avoids the additional cognitive time required to understand the
implications of using identical?.
If we macroexpand a simple example, we can see how case delegates down to case* (a
special form) passing down the arguments that are needed to create the necessary
bytecode:
(macroexpand
'(case a 0 "0" 1 "1" :default))
;; (let*
;; [G__759 a]
;; (case* G__759
;; 0 0 :default
;; {0 [0 "0"], 1 [1 "1"]}
;; :compact :int))
Going further down to the produced JVM bytecode, the case* special form produces
the following (showing just the main tableswitch and related details):
(require '[no.disassemble :refer [disassemble]]) ; ❶
(println
(disassemble ; ❷
#(let [a 8] (case a 0 "0" 1 "1" :default))))
61
See www.owasp.org/index.php/Java_gotchas#Immutable_Objects_.2F_Wrapper_Class_Caching to know how Java’s
internal caching of boxed values works.
;; [...] ; ❸
0 ldc2_w <Long 8> [12]
3 lstore_1 [a]
4 lload_1 [a]
5 lstore_3 [G__22423]
6 lload_3 [G__22423]
7 l2i
8 tableswitch default: 54
case 0: 32
case 1: 43
;; [...]
❶ disassemble is a library used in this example to de-compile the object produced by evaluating a
Clojure form.
❷ We call disassemble on a case expression wrapped in a let block.
❸ The disassembled object is long and contains many other parts that are not shown here. We are only
interested in showing the specific portion regarding the translation of the case statement in the
expression. As you can see, the case was translated into a tableswitch bytecode instruction.
❶ This commented out line would cut the loop short. Possible, but potentially dangerous. What if we
assign i = 8 instead?
In the Java version, the mutable variable "i" is created at the beginning of the loop and
mutated at each iteration. "i" controls the loop and we can interfere by changing it from
within the loop, something that would have to be difficult and explicit to achieve with Clojure. In Clojure we
would instead pass successive values as "parameters" (more properly, local bindings) to the next
iteration:
(loop [i 0 s []]
(if (< i 10)
(recur (inc i) (conj s (* i i)))
s))
;; [0 1 4 9 16 25 36 49 64 81]
In Clojure there is no way to mutate "i" inside the body of the loop simply because "i"
is not mutable. Secondly, the Java "for" statement only allows interaction with the
outside world by mutation (in this case the outer-scope java.util.Stack object) while
Clojure returns the last expression before exiting the loop. To be fair, both languages
would allow the non-idiomatic alternative approach:
public static Stack square(int i, Stack s) {
if (i < 10) {
s.push(i * i);
square(++i, s); ❶
}
return s;
}
System.out.println(square(0, new Stack()));
// [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
❶ Recursive square invocation happens as the last instruction in a mutually exclusive branch (either "i"
is less than 10 or is not). This recursive computation could be transformed into iterative 62.
Java doesn’t push strongly on recursion because its compiler lacks automatic tail-
call optimization capabilities (see the tail-recursion section in loop for a detailed
explanation). Any sufficiently large recursive iteration in Java would eventually
consume the entire stack, even if the recursion happens as the last instruction (like in
our example). Similarly, Clojure would allow the following mutating loop:
(let [i (atom 0) s (atom [])] ; ❶
(while (< @i 10)
(swap! s #(conj % (* @i @i)))
(swap! i inc))
@s)
;; [0 1 4 9 16 25 36 49 64 81]
❶ Clojure would only allow controlled mutation through an atom (or other concurrency-aware primitives
like refs).
Like the non-idiomatic Java recursion, the above usage of “while” with
mutating atoms significantly increases complexity of the code and is strongly
discouraged (and very non-idiomatic Clojure).
Recursion is so common that it comes with a specific vocabulary:
• loop and the fn family of function declarations are considered the recursion "targets", the
instructions where execution jumps after a recur.
62
Perhaps one of the best explanations of recursive computation and tail-call optimization is in SICP, Structure and
Interpretation of Computer Programs: mitpress.mit.edu/sicp/full-text/book/book-Z-H-11.html#%_sec_1.2.1
• The "exit condition" is a conditional form (usually if or a cond) that decides when
to exit the loop. A condition is always present (if we exclude the degenerate
single-iteration and infinite-iteration cases).
• When the recursive call happens as the last instruction of the current scope, then
the recursion is called "tail recursive".
3.4.1 loop, recur and loop*
macro and special-form since 1.0
loop-recur is the most basic recursive construct in Clojure. loop is one of the possible
targets to resume execution, while recur performs the controlled "jump" to transfer
control to the inner-most loop or fn form (including defn, defn-, fn* and anonymous
function literal #()). In general, Clojure allows 3 ways to recur:
1. A call to the function from within the function itself. No loop or recur is used in
this case. The recursive call can appear anywhere, not just as last instruction, like
in this example returning the n-th element in the Fibonacci series 63:
(defn fib [n]
(if (<= 0 n 1)
n
(+ (fib (- n 1)) (fib (- n 2))))) ; ❶
❶ Note that although this line is the last in a mutually exclusive if branch, the first fib invocation is
not in tail position, because the second (fib (- n 2)) evaluation and the final addition still follow.
2. The recur special form invoked without a loop target will use the innermost
containing function definition. Here’s the same example re-written to avoid
recurring twice:
(defn fib [a b cnt]
(if (zero? cnt)
b
(recur (+ a b) a (dec cnt)))) ; ❶
63
The Fibonacci sequence is characterized by the fact that every number in the series is the sum of the two preceding
ones: en.wikipedia.org/wiki/Fibonacci_number
❶ The recur target here is the top-level function definition fib, because there is no other inner-most
definition and no loop instruction either.
3. The recur special form invoked with a loop target will restart computation from
the innermost loop form. The following is probably the most effective of the
3 fib versions presented so far. The presence of loop takes care of some
initialization that doesn’t need to be outside the function and at the same time we
are not recurring twice:
(defn fib [n]
(loop [a 1 b 0 cnt n] ; ❶
(if (zero? cnt)
b
(recur (+ a b) a (dec cnt)))))
❶ Note that the 0 and 1 initialization parameters had to be passed in when invoking the previous version. They
are now handled by loop without being exposed in the function parameters.
These are the main differences between calling the function recursively and using
the recur variants:
• The compiler ensures that if recur is present, it needs to be in tail position. The
compiler throws an exception if recur is not in tail position.
• The compiler enables a special optimization when using recur that doesn’t
consume the stack (see the tail-recursion section below for a detailed explanation).
• loop offers additional control over local bindings without interfering with the
function arguments, for example to initialize values or to add additional
parameters to the recursion.
• loop is also the main choice for iteration when speed is important. loop takes care
of propagating type information to avoid unnecessary boxing/unboxing, which is
usually an important factor for fast code execution 64 . The rest of the chapter will
fully expand on the speed aspect of loop-recur.
Contract
(target [binding-parameters]
(<body>
(recur binding-parameters)))
64
Autoboxing is the automatic conversion of primitive types into the corresponding wrapper class (int to Integer for
example in Java). Boxing has usually a minimal cost, but a big impact in Clojure when primitive types could be used and
are instead converted into their wrapping object by a function call. Without the necessary type hinting the Clojure
compiler needs to compile a function into a generic bytecode able to deal with any type of argument (e.g.
java.lang.Object).
Input
• "target" can be any of loop, defn, defn-, fn* or anonymous function literal #(). A
target for recur must always be present, although the short (recur) is valid
Clojure resulting in an infinite loop.
• "binding-parameters" is a vector containing symbols (like in the case of a normal
function declaration) or a vector of bindings in the case of loop.
• "params" are used in case of recursion with a function declaration as target. In that
case the recur invocation must pass the same number of arguments as the parameters declared by the
function.
• "bindings" is used in case of recursion with loop as target. The bindings are a
(potentially empty) vector containing an even number of
elements. The recur invocation must pass the same number of values as there are bound locals,
which is half the number of elements in the bindings vector. The "bindings" in loop are essentially
equivalent to "bindings" in let.
• "body" contains everything regarding the computation including recur as the last
instruction. It needs to contain at least 1 condition instruction to select when
to recur and when instead to return results.
Notable exceptions
• java.lang.UnsupportedOperationException: Can only recur from tail position.
The message of the exception explains that recur was used but another form will
need evaluation after the recursion returns. In this case loop-recur cannot be
used. If the algorithm cannot be re-formulated with tail-recursion, then the only
available option is to use explicit recursion.
Output
• The evaluation of the last non-recurring instruction in the body.
Examples
We briefly described the possibility for recur to use any of the macros dedicated to the
creation of functions. The following is a rewrite of the Fibonacci example using fn and
the function literal #(). Apart from requiring to be invoked in a different way, they are
equivalent to recur with defn as seen at the beginning of the chapter, but they are
definitely less readable. While the first example with fn as target is tolerable for very
small functions, the second example using the function literal #() is rarely used:
(map
(partial
(fn [a b cnt] ; ❶
(if (zero? cnt)
b
(recur (+ a b) a (dec cnt)))) 1 0)
(range 10))
;; (0 1 1 2 3 5 8 13 21 34)
(map
(partial
#(if (zero? %3) ; ❷
%2
(recur (+ %1 %2) %1 (dec %3))) 1 0)
(range 10))
;; (0 1 1 2 3 5 8 13 21 34)
loop-recur can also be used in cases where the iteration is not necessarily collection
traversal (in that case, sequence operations like map would be an obvious choice). This
book contains interesting examples of loop used in contexts other than collection
traversal. The reader is invited to take a look at the following:
• let shows an infinite (and side-effecting) loop to collect user input for an
interactive game.
• if-let shows a master-worker computational pattern where workers wait for work
in an infinite loop-recur.
• clojure.zip/zipper shows how to traverse a tree with zippers, another typical use
of recur.
The following example explores another of the good reasons to use explicit recursion:
speed. Let’s see how loop-recur can replace an example of collection traversal when
speed of execution is paramount. The Newton method to compute the square root of a
number describes an algorithm where an initial guess converges to an approximate
solution 65. Assuming we don’t know about the existence of Math/sqrt, let’s
implement a solution using sequences. The following approach consists of pulling from
an infinite stream of gradually improving approximations and then stopping when the
solution is precise enough:
(set! *warn-on-reflection* true)
65
The Newton method can be generalized for other problems, not just the square root calculation. More details available
on en.wikipedia.org/wiki/Newton%27s_method#Square_root_of_a_number
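The definitions of lazy-root and sq-root are not shown above; the following is a minimal sketch consistent with the callouts below (the names come from the text, the exact book code may differ):

```clojure
;; Hedged reconstruction of the elided example.
(defn lazy-root [x]
  (->> (iterate #(/ (+ % (/ x %)) 2.0) 1.0)                ; improving guesses
       (filter #(< (Math/abs ^double (- (* % %) x)) 1e-8)) ; close enough?
       first))

(defn sq-root [x]
  (when-not (neg? x)   ; guard against negative input
    (lazy-root x)))
```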
(sq-root 2)
;; 1.4142135623746899
❶ By type hinting the argument to double we make sure that Clojure makes the right call
to (Math/abs) without requiring reflection, as *warn-on-reflection* was correctly pointing out. Without
the hint, lazy-root would be one order of magnitude slower.
❷ Although there are more sophisticated ways to select the initial guess, 1 is reasonable enough here
66
.
❸ The anonymous function passed to iterate takes the current guess % and produces a better one by
averaging % with (/ x %). We use “iterate” to calculate one step and feed the newly
improved guess into the next iteration, effectively producing an infinite lazy sequence of guesses from
which we pull as many improved guesses as needed.
❹ We can now filter the best guess out of the increasingly better ones and take the first item. The
predicate function uses the square of the guess (* % %) to verify how far off we are from the
perfect solution. We use a very small number like 1e-8: by making this number even smaller we can
get more precision at the price of computing more guesses.
❺ The wrapper function sq-root just makes sure special cases are accounted for. Something we don’t
want to allow, for instance, is the square root of a negative number.
lazy-root is sufficiently readable, idiomatic and reasonably fast. Notice how lazy-root,
by adopting a stream-like model for processing guesses, implicitly collects all the
results: we could just remove the call to “last” to see all of them. This additional
feature may or may not be useful depending on the context, but it’s implicit in the way the
stream of guesses is processed on demand. Let’s now concentrate on performance and
check how well we are doing against Java’s Math/sqrt (which is likely a faster rival):
(require '[criterium.core :refer [bench]])
The benchmark shows that lazy-root is about 100 times slower than the
JDK Math/sqrt function. Apart from teaching us to use the JDK math functions when
possible, the benchmark also shows that producing and consuming lazy sequences
comes with an associated cost that can be considered a problem or not depending on
the use case. For instance:
• We need to pass functions to iterate and take-while, which adds some invocation
indirection. Secondly, this forces Clojure to compile for generic types, because
there is no way for the compiler to know at compile time that x is a double without
type hinting. Clojure greatly benefits from higher-order functions, but when
speed is paramount their cost becomes visible.
66
Here’s a more detailed explanation about how to pick the initial guess: math.stackexchange.com/questions/787019/what-
initial-guess-is-used-for-finding-n-th-root-using-newton-raphson-method
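The recursive-root function discussed next is not shown; a loop-recur sketch consistent with the surrounding description (the name comes from the text, the body is an assumption):

```clojure
;; Hedged reconstruction: guess stays a primitive double across iterations,
;; so no type hint is needed on the loop binding.
(defn recursive-root [x]
  (loop [guess 1.0]
    (if (< (Math/abs ^double (- (* guess guess) x)) 1e-8)
      guess
      (recur (/ (+ guess (/ x guess)) 2.0)))))
```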
The recursive-root version certainly reads well and performs much better: it is now
comparable to the Java version, despite still being twice as slow. Note how:
• Anonymous functions are not necessary. The double type of the guess local
binding is now enforced between iterations and doesn’t require type hinting.
• There is no collection of guesses or results, or any intermediate ones. Only the last
guess is ever passed between invocations.
• The recursive model is sufficiently easy to reason about in this case, but in general
it requires some practice to create a mental model for recursion compared to other
types of computations.
The example in this chapter is not suggesting to abandon powerful tools like lazy
sequences, but to search for an equivalent explicit loop-recur when speed is an
important factor.
A stack frame is the portion of memory that stores the state of a procedure that has not yet returned a result, plus any additional contextual information about the calling
site.
Conceptually, recursive calls are no different from any other calls: a new stack frame is
created at each invocation, independently of the fact that the caller is calling itself or something else.
But while normal call chains are driven by how the code is laid out manually, recursive calls are driven by
data: they usually iterate over some data structure, executing operations on each element until
exhaustion, or, like in the case of the square root, until reaching some wanted precision. The room available
for creating frames is limited by the amount of memory available and recursion can easily consume all
the space available (the dreaded StackOverflowError).
Tail-recursion is important because when a recursive call is the last instruction of a repeating set,
there is no need to remember the state of the function at that point in time and thus no need to create a
frame: certainly no other instruction would benefit from remembering the execution state at that point in
time. Advanced compilers (Scheme being a notable example) are able to automatically recognize the
presence of a recursive call in tail position and prevent the stack-based propagation. The compiler can
then treat the sequence of repeating instructions as if there was a "jump" or "goto" instruction as the last
call in the procedure, without any stack creation and just the current value as parameter.
Clojure doesn’t offer automatic tail-recursion optimization, but can optimize tail recursion with
the loop-recur construct. It would be relatively simple to have an automatic way to detect tail-call
optimizable code, but Clojure prefers to rely on Java semantics for method calls and Java doesn’t
implement tail-call optimization 67 .
See Also
• “trampoline” handles the case of mutual recursion, something that loop-recur is
not designed for. Interestingly it implements mutual recursion in a straightforward
way based on loop-recur.
• while performs side-effect based iterative code. It is there specifically to handle
those (mostly Java-interop) cases where side-effects are necessary to manage the
exit condition. It should be used sparingly.
• for is the Clojure list comprehension form. for is very useful for generating
potentially complicated sequences to drive further processing. If we consider
recursion as an algorithmic recipe composed of argument passing and argument
processing, for represents the sequence of parameters as they are passed over time,
while other sequence functions perform the actual computation. Both models have
advantages in different situations, with loop-recur being generally lower level and
better performing.
Performance Considerations and Implementation Details
67
Clojure support for automatic tail-recursion has been often discussed on the mailing list. One thread that explains the
rationale behind Clojure opting for loop-recur instead can be found
here: groups.google.com/forum/#!msg/clojure/4bSdsbperNE/tXdcmbiv4g0J
As we have seen in the examples, loop is smart enough to recognize and maintain
primitive types declared within the bindings of the loop. Let’s disassemble a small
snippet to see what happens:
(require '[no.disassemble :refer [disassemble]]) ; ❶
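The disassembled snippet itself is elided; the kind of loop under discussion, with a binding initialized to a literal 0 that Clojure keeps primitive, can be sketched as follows (the function name is hypothetical):

```clojure
;; Even without a hint on n, the loop binding i stays a primitive long
;; because its initial value is the literal 0.
(defn count-up [n]
  (loop [i 0]
    (if (< i n)
      (recur (inc i))   ; inc on a primitive long stays primitive
      i)))
```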
But loop type recognition wouldn’t be exploited to its full potential without adding
types to bindings that are not automatically recognized by Clojure. We just need to add
the necessary hint:
(println (disassemble (fn [^long n] (loop [i 0] (< i n) (inc i)))))
❶ Clojure is now producing the perfect call with primitive types, which doesn’t incur any casting or
boxing penalties.
If you remember our iterative-root function from the examples, we didn’t add type
hinting for x in the function arguments. The reason is that although the type hint would
produce better performing bytecode, the cost of the operations performed in the loop
outweighs the optimization. The only way to know this is by consistently measuring
with tools like Criterium 68 before taking any decision.
3.4.2 range
function since 1.0
(defn range
([])
([end])
([start end])
([start end step]))
range is a general purpose number generator with many practical applications. One of
the most used arities is the one with a single argument producing a sequence of
integers:
(range 10)
;; (0 1 2 3 4 5 6 7 8 9)
The main use case for range is to provide a sequence of numbers that can be used by
other sequence processing operations to create more complex behavior.
Contract
Input
• "end" is the number delimiting where the generated sequence should stop (it stops
at "end" minus 1). (number? end) must return true. When "end" is not given (no
arguments) it defaults to positive infinity, creating an infinite range.
• "start" is the number at which the generated sequence should start. (number?
start) must return true. "start" defaults to 0 when not given (only "end" is
present).
• "step" is the increment between each element in the sequence. (number?
step) must return true. "step" defaults to 1 when only "start" and "end" are given.
Notable exceptions
• clojure.lang.ArityException when more than 3 arguments are present.
68
Criterium is the de-facto benchmarking tool for Clojure: github.com/hugoduncan/criterium
Output
range returns: a lazy sequence of numbers from "start" (inclusive) to "end" (exclusive), incremented by "step".
❶ The resulting sequence has items promoted from long to bigint automatically.
Examples
Let’s start with some interesting ways of using range before going into more complex
problems. A list of even numbers can be easily obtained using the "step" parameter:
(range 0 20 2) ; ❶
;; (0 2 4 6 8 10 12 14 16 18)
Negative ranges can be produced using negative steps and using negative numbers as
extremes:
(range -1 -20 -2) ; ❶
;; (-1 -3 -5 -7 -9 -11 -13 -15 -17 -19)
❶ Negative odd numbers sequence using a negative "end" and a negative "step" to decrease from the
bigger "start".
Worth remembering that range works with any kind of number. To work with other
numerical types it might be necessary to remember the rules of type conversion for
addition. In general, having the "step" or the "end" of a specific type triggers an output
sequence with that type:
(range 0.5 5 0.5) ; ❶
;; (0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5)
❶ A "step" of type double produces an output sequence of doubles.
By combining range and map we can obtain some interesting behavior, for example a
sequence of other sequences gradually increasing in size:
(map range (range 10)) ; ❶
;; (()
;; (0)
;; (0 1)
;; (0 1 2)
;; (0 1 2 3)
;; (0 1 2 3 4)
;; (0 1 2 3 4 5)
;; (0 1 2 3 4 5 6)
;; (0 1 2 3 4 5 6 7)
;; (0 1 2 3 4 5 6 7 8))
The same concept can be expanded to create successive sequences in which the
"extremes" get removed:
(->> (reverse (range 10)) ; ❶
(map range (range 10)) ; ❷
(remove empty?)) ; ❸
;; ((0 1 2 3 4 5 6 7 8) ; ❹
;; (1 2 3 4 5 6 7)
;; (2 3 4 5 6)
;; (3 4 5)
;; (4))
Let’s have a look now at how we can use range in practice. Many algorithms are based
on some non-trivial iteration over a collection. Comparing the edges of a list is at the
core of searching for palindromes, for example 69 . A palindrome is a sequence of letters
69
Palindromes are described very well on Wikipedia: en.wikipedia.org/wiki/Palindrome
which reads the same backward: "Was it a car or a cat I saw" is a typical example. One
way to find if a string is a palindrome is to check if the middle letters are the same and
then proceed outward to verify the others until we reach the end of the sequence:
|<-------------------------------( n )------------------------------->|
w a s i t a c a r o r a c a t i s a w
|<------------(quot n 2)----------->|
|<-- (- idx) -->|<-- (+ idx) -->|
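The functions described by the callouts below are elided; a sketch following the stated approach (the names palindrome? and string-palindrome? come from the text, the bodies are assumptions):

```clojure
(require '[clojure.string :as str])

(defn palindrome? [xs n]                        ; a sequence plus its count
  (every? #(= (nth xs %) (nth xs (- n % 1)))    ; compare symmetrical pairs
          (range (quot n 2) -1 -1)))            ; middle index down to 0

(defn string-palindrome? [s]
  (when s                                       ; nil guard (the text uses some->>)
    (let [xs (vec (remove #{\space} (str/lower-case s)))]
      (palindrome? xs (count xs)))))
```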
❶ palindrome? is a function taking a sequence xs and a count of the elements in the sequence.
❷ idx contains indexes to access the sequence in reverse starting from half the count down to 0. We
use quot to avoid conversion into a ratio type that would occur through the division operation /.
❸ We access the sequence by index with nth. Note that in case of a lazy sequence xs, the first nth call
realizes roughly half of the sequence (if the collection supports chunking, evaluation might continue
beyond the halfway point, up to the end of the current chunk). After comparing all the symmetrical pairs
with = we verify that none of them is false with every?.
❹ string-palindrome? performs some initial preparation, like lower-casing letters and removing
spaces. some->> guards against potential nil inputs.
The palindrome example presented here is one of the many ways to check if a sequence
is a palindrome. Depending on problem requirements (like memory allocation, length
of the sequence or probability for palindromes) other solutions based on vectors are
likely to perform better (see how rseq can be used with vectors to check for
palindromes for example). The performance section contains a few more
considerations around range efficiency and laziness trade-offs.
See Also
• for can be considered range's big brother. It allows more flexibility in selecting
how the sequence should be generated. Use range if you need a simple numeric
sequence; use for if you need to filter out elements of the sequence in a more
complicated way, or you need to cross multiple generating methods or different
item types.
• iterate accepts a function that is called with the result of the previous computation
to generate the next item. (take 10 (iterate inc 0)) for instance is equivalent
to (range 10) but with the added flexibility to change inc to another function.
Performance Considerations and Implementation Details
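The two examples described by the callouts below are elided; hypothetical forms showing the head-retention difference could look like this:

```clojure
(defn safe-last [n]
  (last (range n)))         ; ❶-style: earlier items can be GC'd during traversal

(defn unsafe-last [n]
  (let [r (range n)]        ; ❷-style: the local r retains the head,
    [(last r) (first r)]))  ; so the whole sequence stays in memory
```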
❶ The example show access to the last element of a large sequence created with range. Since last is
also the final result of evaluating the entire form, the rest of the sequence can be safely garbage
collected as the sequence is processed.
❷ The last operation appears before another operation accessing the large sequence. As a result the
sequence produced by range needs to remain in memory in full, creating a
possible OutOfMemoryError (also depending on the allowed heap size).
range (like iterate, repeat and cycle) is implemented as a Java class and provides a
specialized algorithm for reduce and related functions including transducers. To
activate the fast path, you need to pay attention not to wrap range in a sequence
generating function:
(require '[criterium.core :refer [quick-bench]])
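The benchmarked forms are elided; the shapes being compared are, in sketch (criterium calls omitted):

```clojure
;; ❶-style: map wraps range, hiding it from reduce's fast path.
(reduce + (map inc (range 1000)))

;; ❷-style: the transformation moves into a transducer and the range
;; stays visible, so transduce can activate the fast reduce path.
(transduce (map inc) + (range 1000))
```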
❶ reduce cannot activate the range fast path because the range is wrapped in a map function. The
default sequential path for reduce is selected instead.
❷ The transformation is now part of a transducer and the range type is left visible for transduce that can
activate the fast path. transduce uses reduce internally.
Similar considerations are valid for apply, which does not follow the fast reduce path.
The following function kth calculates the k-th coefficient of (x - 1)^n (part of the
calculation necessary to test if a number is prime following the AKS primality test 70).
The function uses range to create a potentially long sequence and it has been
implemented with both apply and reduce for comparison:
(defn kth [n k]
  (/ (apply *' (range n (- n k) -1)) ; ❶
     (apply *' (range k 0 -1))
     (if (and (even? k) (< k n)) -1 1)))
(quick-bench (kth 820 6))
;; Execution time mean : 924.071439 ns
(defn kth [n k]
  (/ (reduce *' (range n (- n k) -1)) ; ❷
     (reduce *' (range k 0 -1))
     (if (and (even? k) (< k n)) -1 1)))
(quick-bench (kth 820 6))
;; Execution time mean : 401.906780 ns
3.4.3 for
macro since 1.0
70
Please see en.wikipedia.org/wiki/AKS_primality_test
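The opening example of for is elided; a hypothetical form consistent with the callouts below (the map values ":a !" and ":b ?" come from the text, the other bindings are assumptions):

```clojure
(for [i [1 2 3]                    ; ❶ "i" is a local binding
      [k v] {:a "!" :b "?"}        ; ❷ destructuring over the map
      :let [s (str i (name k) v)]  ; ❸ extra, non-iterating local
      :while (not= k :b)           ; ❹ stops the inner "k v" iteration at :b
      :when (odd? i)]              ; ❺ filters individual permutations
  s)
;; ("1a!" "3a!")
```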
❶ "i" is declared as a local binding and will be visible further down the bindings and in the body of
the for macro.
❷ "k" and "v" are also locals demonstrating that destructuring is available over the map. While the first
value of "i" is assigned, "k" and "v" will assume all the values in the map as pairs ":a !", ":b ?" and so
on until all the permutations of "i" and "k v" have been formed.
❸ The ":let" expression creates an additional local binding which is not based on iterating over a
sequence like "i" or "k v".
❹ The ":while" expression accepts a predicate that is evaluated for each permutation. As soon as the
predicate is false, the presence of :while stops the current iteration (in our case the "k v" iteration
of local bindings against the map). In this case, the iteration will stop as soon as "k" is equal to the
keyword ":b", preventing that permutation and any following one in the map from entering the final results.
❺ The ":when" filter operates similarly to the ":while" filter by preventing some permutations from entering the
final sequence of results. Differently from ":while", it’s not going to affect other elements in the iteration
after the one that makes the predicate false.
Contract
(for [binding*] <body>)
binding :=> bind-expr | let-expr | while-expr | when-expr
Examples
The following table collects a few notable examples of for focusing on some non-
trivial aspects. Each row contains a description and an example.

Description: :when or :while with a dependency on multiple local bindings. Equivalent to a
constraint based on a function f(x1,x2,..xn) of the local bindings. This is to point out the fact
that constraints are flexible and can depend on multiple local bindings at once.

Example:
(for [x (range 100)
      y (range 10)
      :when (= x (* y y))]
  [y x])
;; ([0 0] [1 1] [2 4] [3 9] [4 16] [5 25] [6 36] [7 49] [8 64] [9 81])
The following is instead a complete example making good use of the for macro.
Conway’s Game of Life is a classic example of cellular automaton 71 . Cellular
automata are mathematical models exhibiting a high degree of complexity generated in
turn by very simple rules 72. They are usually visualized as two-dimensional grids
where elements in the grid are cells and rules are applied to define interaction between
them, as can be seen in the following picture.
Figure 3.3. A 5x5 Game of Life grid, with a "blinker" presented at initial state and after one
71
en.wikipedia.org/wiki/Conway%27s_Game_of_Life
72
en.wikipedia.org/wiki/Cellular_automaton
iteration. The numbers [x y] in square brackets are the coordinates of the neighbors of cells [1
2].
Passing of time can be implemented as a discrete "tick" during which rules are applied,
transforming cells from dead (uncolored white square) to alive (black square) or vice versa.
There is some analogy between the 4 rules governing the Game of Life and
societies of living organisms (hence the name):
1. Health: any live cell with two or three live neighbors lives on to the next
generation.
2. Reproduction: any dead cell with exactly three live neighbors becomes a live cell.
3. Underpopulation: any live cell with fewer than two live neighbors dies.
4. Overpopulation: any live cell with more than three live neighbors dies.
The following example shows how the Game of Life could be implemented in Clojure.
Since for can easily create non-trivial sequences, we can use it to "navigate" the grid and
define the meaning of being a neighbor:
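The implementation of next-gen is mostly elided; the following is a hedged reconstruction from the callouts below (helper names other than next-gen and apply-rules are assumptions):

```clojure
(defn neighbors [w h [x y]]               ; ❶ count candidates around [x y]
  (for [dx [-1 0 1]
        dy [-1 0 1]
        :let [cell [(+ x dx) (+ y dy)]]   ; ❷ the shifted cell
        :when (and (not= [0 0] [dx dy])   ; ❸ skip [x y] itself...
                   (< -1 (+ x dx) w)      ; ...and cells outside the grid
                   (< -1 (+ y dy) h))]
    cell))

(defn apply-rules [alive? n]              ; ❹ the four rules as boolean logic
  (or (and alive? (<= 2 n 3))
      (and (not alive?) (= n 3))))

(defn next-gen [w h alive]                ; ❺ all coordinates, filtered by rules
  (set (for [x (range w)
             y (range h)
             :let [cell [x y]
                   n (count (filter alive (neighbors w h cell)))]
             :when (apply-rules (contains? alive cell) n)]
         cell)))
```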
;; testing a blinker:
(next-gen 5 5 #{[2 1] [2 2] [2 3]})
;; #{[1 2] [2 2] [3 2]}
(next-gen 5 5 (next-gen 5 5 #{[2 1] [2 2] [2 3]}))
;; #{[2 1] [2 2] [2 3]}
❶ The first application of for is used to count the neighbors of a [x y] cell. In a two-dimensional system
where cells are identified by x and y coordinates (like our case), the problem of finding neighbors is
about moving the coordinates up-down, left-right and diagonally (by incrementing and decrementing in
turn). The two increments dx and dy are the ranges of permutations we need.
❷ The :let expression inside the for macro helps us define temporary locals available inside the loop
without them necessarily being part of the value comprehension (as happens instead
for dx and dy). In the :let we define the cell found by incrementing or decrementing the given cell [x
y].
❸ :when defines a filter for the comprehension. In our case we don’t want the [x y] cell itself and we
don’t want cells outside the grid either.
❹ The application of all the rules combined happens inside the apply-rules function, which essentially
operates on boolean logic. This result will be used later in the last for macro to keep or remove cells
we don’t want in the final computation.
❺ This last for generates all the possible cell coordinates for a grid of size w,h. Assuming the presence
of a cell's pair of coordinates indicates the cell is alive, our job is to remove all those cells that are not
going to live to the next generation. At the same time, we want other cells to become alive, if they
weren’t, based on the rules of the game. The filter is achieved with another :when expression that just
delegates to apply-rules.
See Also
• “while”. If for is the functional way of iterating without mutable
variables, “while” is offered for those cases where side effects are needed to
control the loop. Use “while” for Java interoperation, especially when the Java
code is in some external library that you can’t control requiring explicit use of
side-effects to control the loop.
©Manning Publications Co. To comment go to liveBook
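The for version under discussion is elided; it presumably resembled this sketch, generating card labels from ranks and suit initials:

```clojure
;; Hedged reconstruction of the elided for form producing "1-D", "1-C", ...
(for [i (range 1 14)          ; ranks 1-13
      a ["D" "C" "H" "S"]]    ; suit initials
  (str i "-" a))
```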
Should be preferred to the equivalent but less easy to read nested maps version:
(mapcat
(fn [i] (map
(fn [a] (str i "-" a))
["D" "C" "H" "S"]))
(range 1 14))
The resulting lazy sequence, if fully unrolled (for example with “count”), would result
in "n" (number of elements in each range) to the power of "c" (number of bindings)
iteration steps. for's laziness and abundance of features might not be the optimal
solution for tight loops where performance is important. In that case it might be a better
option to use a custom loop (or even transients).
The implementation details are mainly related to the mechanics of the creation of lazy
sequences, as can be seen in the following macro-expansion of a simple form
with macroexpand (code has been formatted and cleaned-up):
(macroexpand '(for [i (range 3)] i))

(let* [main-fn
       (fn recur-fn [xs]
         (lazy-seq
          (loop [xs xs]
            (when-let [xs (seq xs)] ; ❶
              (if (chunked-seq? xs)
                (let [fchunk (chunk-first xs)
                      chunk-size (int (count fchunk))
                      chunk-buff (chunk-buffer chunk-size)]
                  (if (loop [i (int 0)]
                        (if (< i chunk-size)
                          (let [item (.nth fchunk i)]
                            (chunk-append chunk-buff item)
                            (recur (unchecked-inc i)))
                          true))
                    (chunk-cons
                     (chunk chunk-buff)
                     (recur-fn (chunk-rest xs))) ; ❷
                    (chunk-cons (chunk chunk-buff) nil)))
                (let [i (first xs)]
                  (cons i (recur-fn (rest xs)))))))))]
  (main-fn (range 3)))
❶ The input sequence is iterated differently based on whether it is itself a chunked lazy sequence or not.
❷ Chunks of the input sequence are appended using lazy-seq to the output lazy sequence.
Despite not being the easiest code to follow, the main goal of for is to create a
"chunked" lazy sequence (the default Clojure implementation of lazy sequences). The
snippet is complicated by the fact that the input sequence needs to be treated
differently if it is already chunked, so the internal chunks can be iterated accordingly: from
this point of view, for can be thought of as a sophisticated machine for lazy-sequence
building.
3.4.4 while
macro since 1.0
The while iteration macro is possibly the closest to the loop construct found in other
imperative languages. while takes a test expression and a body and repeatedly executes
the body until the expression evaluates as false. It follows that some side effect, other
than the result returned by the body, needs to flip the test expression from true to
false. The following snippet for example uses “rand and rand-int” in the test expression
to exit the while loop:
(while (> 0.5 (rand))
(println "loop"))
;; loop
;; loop
;; nil
“rand and rand-int” are impure functions, since the final returned value depends on
something outside the application’s control (usually some operating system primitive).
Usage of while should be restricted to a few special cases such as Java interoperability,
since more idiomatic iteration forms exist in Clojure that don’t require side effects (for
example “for” to build an initial range followed by map or filter functions). Despite
this, there are still a few legitimate cases to use while that will be illustrated in the
examples.
Contract
Input
• "test" is any Clojure expression yielding logical true or false as a result.
• "body" can be 0 or more Clojure forms.
Output
• while returns nil, regardless of whether the "body" was ever evaluated.
Examples
"while true" expressions in Java are quite common to create daemon threads to run a
parallel task along with the main application. We could use while to start a never
ending loop, for example to print a health-check message on the console output to
monitor the good health of the application:
(defn forever []
  (while true ; ❶
    (Thread/sleep 5000) ; ❷
    (println "App running. Waiting for input...")))

(defn status-thread []
  (let [t (Thread. forever)] ; ❸
    (.start t)
    t))
(def t (status-thread))
;; App running. Waiting for input...
;; App running. Waiting for input...
;; App running. Waiting for input...
(.stop t) ; ❹
;; nil
❶ We can create an infinite while loop by using an expression that can only be true.
❷ We sleep the current thread for 5 seconds to prevent a flood of output messages.
❸ Threads are created by simply using the constructor and passing the function they need to execute.
The thread is then started right away.
❹ The always true expression used in the while macro can only be affected from outside the body of
the loop. The consequence in this case is that we need to stop the entire thread to stop the loop.
Other examples of while usually happen with Java IO. Java IO often requires testing
the status of a stream to understand when the end has been reached. The main
operation of reading bytes from the stream also has the side effect of advancing a
"cursor" holding the current reading position, which is what we want to test inside the
test expression. The following Clojure code computes the SHA-256 hash 73 of a
file:
(import
'java.io.File
'javax.xml.bind.DatatypeConverter
'java.security.MessageDigest
'java.security.DigestInputStream)
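The hashing code itself is elided; a sketch consistent with the callouts below (the function name and signature are assumptions; javax.xml.bind.DatatypeConverter ships with JDK 8 but requires a separate dependency on later JDKs):

```clojure
(defn sha-256-file [path]
  (let [sha (MessageDigest/getInstance "SHA-256")]       ; ❶ hashing state
    (with-open [in (DigestInputStream.                   ; ❷ stream wraps sha
                    (java.io.FileInputStream. (File. ^String path)) sha)]
      (while (> (.read in) -1)))                         ; ❸ body-less while
    (DatatypeConverter/printHexBinary (.digest sha))))   ; ❹ readable form
```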
❶ We need to obtain a MessageDigest instance for the type of hashing we need. The sha instance
created here holds the current state of the SHA-256 computation and can be updated at each read
from the input stream reading the file.
❷ The DigestInputStream instance is created on top of the sha instance. Notice that “with-open” is
used to automatically close the stream after we finish reading from it on the line below.
❸ while is used here to keep reading from the DigestInputStream until it returns "-1", a pattern that is
commonly used in Java. This while form is side-effecting in two ways: the test expression
becomes false as the cursor in the file goes beyond the end of the file, and
because it has no body: the sha instance is updated just by reading from the input stream.
❹ The computed sha is finally converted into readable form.
int count = 1;
do {
  System.out.println("Count is: " + count);
  count++;
} while (count < 4);
73
SHA-256 is a very well known cryptographic and hashing function. See en.wikipedia.org/wiki/SHA-2 for the details.
The count mutable variable needs to be mutated by the body of the loop in order for the loop to exit at
some point (here it is mutated using the ++ operator). Functional languages don’t support (or strongly
discourage) iteration in this style, preferring instead recursion or list comprehension.
Recursion is obtained with a function invoking itself (or multiple mutually recursive functions,
see “trampoline”) passing the mutating variable as the argument of the next invocation. The following
example is the re-working of the do while Java code into Clojure:
(loop [count 1]
(when (< count 4)
(println "Count is:" count)
(recur (inc count))))
;; Count is: 1
;; Count is: 2
;; Count is: 3
;; nil
As you can see, the mutating element becomes the argument of recur and it’s incremented
at every iteration. Compared to the Java code, the test expression previously inside the while has been
translated into a when invocation in the Clojure code: a condition to exit the loop is always required
inside loop-recur and is typical of recursive code.
A list comprehension, instead, is the concatenation of many processing steps starting from an initial
list of values. Comprehensions can also be used to mimic iteration, but they go beyond that, formulating a
new programming style. Instead of mutating or recursively changing the value to check against the test
expression, the sequence of all values is assembled first and the computation builds up from those. If we
look at the previous example, we can collect the different values of the count variable during each
iteration like this:
(loop [count 1
res []]
(if (< count 4)
(recur (inc count) (conj res count))
res))
;; [1 2 3]
Once the values upon which the iteration should be performed are decided, we can build up the
computation using sequence manipulation functions. In this case we don’t need loop-recur just to
build the natural numbers from 1 to 3; we could use map or “for”:
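The elided equivalents could be sketched like this (both print the counts and return nil):

```clojure
(dorun (map #(println "Count is:" %) (range 1 4)))
;; or
(dorun (for [count (range 1 4)]
         (println "Count is:" count)))
```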
;; Count is: 1
;; Count is: 2
;; Count is: 3
;; nil
Both forms produce the same output as the initial example by feeding println with an initial list of values.
We could add more processing steps on top of the initial value generation, simulating the equivalent of
multiple isolated loops in an imperative language. Thanks to Clojure's map, “for”, filter, reduce (and many
other functions), programming by list comprehension results in code that is more concise and expressive
than its imperative counterpart.
See Also
• “for” is an idiomatic alternative to iteration by mutation in a functional language
like Clojure. It offers a powerful syntax to generate driving values to process with
sequence manipulating functions like map or filter. Prefer “for” instead
of while unless mutation is an essential part of the iteration.
• loop is the lowest common denominator for many iteration-like forms in Clojure and
is also used inside while's implementation. loop gives greater control over the
iteration, including the definition of local bindings. Use loop and recur when other
parameters (which are likely not side effects) are controlling the loop and should
appear as locally bound variables.
Performance Considerations and Implementation Details
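The expansion in question can be reproduced at the REPL:

```clojure
;; while is a thin macro over loop-recur; the symbols test-expr and
;; body-expr here are placeholders, not evaluated.
(macroexpand '(while test-expr body-expr))
;; (loop* [] (clojure.core/when test-expr body-expr (recur)))
```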
The expansion reveals a basic use of the loop-recur recursion pattern with a when to
verify the expression.
3.4.5 dotimes
macro since 1.0
dotimes is used to repeat some portion of code multiple times. The form to be repeated
appears as the last argument of the macro while the first argument is a binding vector
that contains a local binding and the number of desired repetitions, for example:
(dotimes [i 3] (println i))
;; 0
;; 1
;; 2
;; nil
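The timing example described by the callout below is elided; a hypothetical equivalent:

```clojure
;; Repeat a cheap expression many times and measure the total elapsed time.
(time
 (dotimes [i 1000000]
   (max (rem i 3) (rem i 5))))
;; "Elapsed time: ... msecs"
```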
❶ A typical use of dotimes to repeat the execution of some code and calculate the total elapsed time.
To measure the performance of “max and min” above, the form is evaluated some large
number of times and the total elapsed time is measured with time. By using dotimes it’s
possible to quickly verify assumptions about performance before using more rigorous
methods (such as Hugo Duncan’s Criterium library github.com/hugoduncan/criterium).
Outside of REPL use, dotimes is often connected to the execution of side effects. The
locally bound variable provided by dotimes is a perfect fit for array access
operations. The following example shows a faster version of the fizz-buzz
game presented in the “condp” chapter:
(require '[criterium.core :refer [quick-bench]])
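The three functions described by the callouts below are mostly elided; a hedged reconstruction (arities and the condp predicate are assumptions):

```clojure
(defn fizz-buzz-for [n]                 ; ❶ conditional replacement per number
  (condp #(zero? (rem %2 %1)) n
    15 "fizzbuzz"
    3  "fizz"
    5  "buzz"
    n))

(defn fizz-buzz-slow [n]                ; ❷ idiomatic, fully realized with doall
  (doall (map fizz-buzz-for (range n))))

(defn fizz-buzz [n]                     ; ❸ transient vector plus dotimes
  (let [v (transient [])]
    (dotimes [i n]
      (assoc! v i (fizz-buzz-for i)))   ; ❹ side effect at the current index
    (persistent! v)))                   ; ❺ back to a persistent vector
```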
❶ fizz-buzz-for is the function that contains the conditional deciding if the number needs to be
replaced with the corresponding string based on the divisors.
❷ fizz-buzz-slow is exactly the same as before, with just a doall added to realize the lazy sequence in
full. Despite the claim that this version is slower, fizz-buzz-slow is still a very idiomatic and natural
way to solve the problem and should be considered the best solution unless raw performance is an
important factor.
❸ The new fizz-buzz function first creates an empty transient vector and uses dotimes to perform side
effects on the indexes.
❹ assoc! is used here to permanently alter the transient vector at the current index "n" of
the dotimes iteration.
❺ The transient is finally returned as a normal persistent collection for results.
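The code listing for this example was lost in extraction. Following the callouts, a sketch of what it may have looked like (the condp predicate shape is an assumption consistent with the surviving `5 "buzz"` fragment; the quick-bench comparisons are omitted):

```clojure
(defn fizz-buzz-for [n] ; ❶ pick the replacement string based on divisors
  (condp #(zero? (mod %2 %1)) n
    15 "fizzbuzz"
    3  "fizz"
    5  "buzz"
    n))

(defn fizz-buzz-slow [limit] ; ❷ idiomatic version; doall realizes the lazy seq
  (doall (map fizz-buzz-for (range limit))))

(defn fizz-buzz [limit] ; ❸ empty transient vector filled by index
  (let [v (transient [])]
    (dotimes [n limit]
      ;; ❹ assoc! at index n; for transient vectors assoc! at index
      ;; (count v) appends, mutating the same transient in place.
      (assoc! v n (fizz-buzz-for n)))
    (persistent! v))) ; ❺ back to a normal persistent vector
```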
As you can see from the benchmark, there is a noticeable speed up by using a transient
collection 74. dotimes is a perfect choice to perform the side effect of adding elements to the
vector, including the necessary incrementing index. Although the doall behavior can be
desirable at times (see “doseq, dorun, run!, doall, do”), dorun and dotimes return nil by
design to avoid any memory overflow in case the iteration produces a collection 75.
74
In the spirit of searching for the best possible performance, there are other important factors to consider for the version of
Fizz Buzz presented here that are not discussed in this chapter because they are not relevant to the discussion.
75
See for example this excellent StackOverflow answer regarding a common problem found with retaining the head while
processing sub-sequences: stackoverflow.com/questions/15994316/clojure-head-retention
See Also
• “doseq, dorun, run!, doall, do” is very similar to dotimes but it supports extended
bindings including multiple locals and destructuring. Prefer “doseq, dorun, run!,
doall, do” when the single incrementing integer provided by dotimes is not
sufficient.
• doall takes a sequence as input and iterates the sequence realizing its items.
Use doall when the only goal of the iteration is realizing a lazy sequence.
• dorun is similar to doall but it returns nil without holding the head of the
sequence. Prefer dorun when the input is a sequence whose items produce side
effects once realized.
Performance Considerations and Implementation Details
It is worth noticing that, in order for the loop-recur loop to be the fastest possible, the
numeric binding (the number of times to execute the iteration) is cast to a long and
incremented with unchecked-inc.
repository of Clojure projects, with the name of the functions and macros from the
index of this book (around 700):
Table 3.5. The top 20 most used functions/macros when searching Clojure repositories.
Name Frequency
“ns, in-ns, create-ns and remove-ns” 394490
defn 293918
“refer, refer-clojure, require, loaded-libs, use, import” 279210
let 237654
def 172983
“refer, refer-clojure, require, loaded-libs, use, import” 163654
map 159781
“fn” 154482
str 145899
nil? 125109
“refer, refer-clojure, require, loaded-libs, use, import” 119952
“test and assert” 115419
first 98908
“get” 93911
true? 91826
when 91463
name 90469
string? 86492
if 85942
keys 85435
Even if this section offers only a small overview of what can be done with collections
(more specifically of the "sequential" type), the following subset is powerful enough to
get you started:
• “first, second and last” are handy helpers to fetch the first, the second or the last
element of a collection.
• map is the primary way to apply transformations to the elements.
• filter yields specific elements from a collection depending on a predicate function.
• reduce can be used to converge the collection to a final result, obtained by
combining a group of items in some meaningful way.
Other collection/sequences functions will be discussed further on in their dedicated
chapters 76.
76
The book will try to clarify the difference between collections and sequences when necessary, but a good starting point is
this article on sequences by Alex Miller:insideclojure.org/2015/01/02/sequences/
(first [xs])
(second [xs])
(last [xs])
first, second and last are functions taking a sequence-able collection (any Clojure
collection that can be iterated using the sequence interface) and extracting the element
at the position described by their names. They are straightforward to use:
(def numbers '(1 2 3 4))
(first numbers)
;; 1
(second numbers)
;; 2
(last numbers)
;; 4
first, second and last are part of a larger group of functions to access specific parts
of a sequential collection.
Ultimately it is the specific collection type that decides how to implement the sequential
access operation. For example, unordered collections like sets and maps also
implement clojure.lang.Seqable:
• hash-maps: when iterated sequentially, a map becomes a list of key-value pairs.
But when fetching elements, they do not necessarily follow insertion order:
(def a-map (hash-map :a 1 :b 2 :c 3 :d 4 :e 5 :f 6 :g 7 :h 8 :i 9))
(first a-map)
;; [:e 5]
(second a-map)
;; [:g 7]
(last a-map)
;; [:a 1]
• sets: similarly to hash-maps, they have no notion of ordering (see sorted-set for
that purpose), so the same uncertainty factor applies:
(def a-set #{1 2 3 4 5 6 7 8 9})
(first a-set)
;; 7
(second a-set)
;; 1
(last a-set)
;; 8
Contract
Input
first, second and last all accept one parameter "xs":
Notable exceptions
• None. first, second and last all use nil to signal exceptional conditions.
Output
• the element at the first, second or last position in the sequence, if available. If no
element exists at the desired position, it returns nil. If the input sequence "xs"
is nil, it returns nil.
Examples
first
One common use case is to pass first as a parameter to higher-order functions. The
following example shows how first can be used with map to extract just the first
element from a small sequence. Extracting parts of a string (in this case a phone
number) is a common case:
(def phone-numbers ["221 610-5007"
"221 433-4185"
"661 471-3948"
"661 653-4480"
"661 773-8656"
"555 515-0158"])
(unique-area-codes phone-numbers)
;; ("221" "661" "555")
❶ At this point, the string containing the entire phone number has been split into two parts based on the
position of the space character. We just want the area code, so we ask for the first.
❷ distinct can be used to get rid of repetitions inside a sequence. We use it here to remove duplicated
area codes.
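The definition of unique-area-codes was elided from this extract. A plausible sketch consistent with the callouts above (the use of clojure.string/split is an assumption):

```clojure
(require '[clojure.string :as string])

(defn unique-area-codes [numbers]
  (->> numbers
       (map #(string/split % #" ")) ; split on the space character
       (map first)                  ; ❶ keep only the area code
       distinct))                   ; ❷ remove duplicated area codes
```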
second
Extracting the second element from a sequence is frequent enough to warrant a dedicated
function. One reason is that many intermediate steps in data processing involve
small lists, and second can save a few keystrokes compared to the equivalent (first
(rest xs)). The following example shows a sequence of temperature samples from
different locations, reporting the maximum and minimum temperatures recorded for
the day. The max temperature appears right after the first element. We can
sort-by the second element to extract the highest temperature like this:
(def temp '((60661 95.2 72.9) (38104 84.5 50.0) (80793 70.2 43.8)))
(max-recorded temp)
;; (60661 95.2 72.9)
❶ sort-by takes a function and optionally a comparator to decide how to order the sequence. Here we
use second to define which element in the triplet we should sort by. The second parameter is the
comparator > "greater than" to sort in reverse order.
❷ After sorting the sequence, we can just drop everything except the highest temperature recorded,
which is now at the top.
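The max-recorded definition was also elided; based on the callouts it is probably something like:

```clojure
(defn max-recorded [temps]
  (first (sort-by second > temps))) ; ❶ sort descending by max temp, ❷ take the top
```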
last
Similarly to first and second, last can be used to fetch the last element in a
sequence. The following example shows last in action with re-seq and regular
expressions. Given a long string of commands, we want to know which user was last
set before sending the message, assuming users are set with the
syntax user:username in the message:
❶ re-seq returns a list of matching patterns, in this case anything in the form "user-colon-name".
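The code for this example was elided from the extract. A hedged reconstruction of the idea (the message string and user names are illustrative, only the user:username syntax comes from the text):

```clojure
(def message "user:anna deploy user:bob restart user:carl status")

(last (re-seq #"user:\w+" message)) ; ❶ all matches, then keep the last one
;; "user:carl"
```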
Many of the Lisp implementations that followed continued the tradition of naming functions to access
the first and last element of a list as car and cdr, even when the hardware didn’t have such registers
anymore. Nowadays, Common Lisp, Scheme, Arc (and many others) still use car and cdr and all
combinations thereof, while Clojure decided to name them differently to detach itself from this old part of
the Lisp heritage:
Clojure names might be slightly longer but they better convey the semantics of the function.
See Also
first, second and last are just a few of the many ways you can access the parts of a
sequence. These functions will be extensively explained in their own sections, but the
following is a useful summary of what is available:
• next and rest return what’s remaining after throwing away the first element of the
sequence. They differ in the way they treat empty collections.
• drop accepts the number of elements to remove from the head of the sequence, not
just the first.
• drop-last and butlast drop the last element and keep the rest.
• take and drop-last remove elements from the end of a sequence and keep what’s
left. The difference is in the interpretation of "n": take will return a collection of
"n" elements, drop-last will make sure that the last "n" elements are removed.
• ffirst, nfirst, nnext and fnext are shortcuts for common operations involving
sequences containing other sequences. The first letter, "f" or "n", indicates the outer
operation, either first or next, and the rest of the name the inner operation. So
for example ffirst is equivalent to (first (first xs)), fnext to (first (next
xs)) and so on.
• “nth” is a general way to access an element by index in a collection.
• “rand-nth” extracts a random element from a sequence.
• nthrest and nthnext return everything after the nth element.
There are also other functions similar to the ones above that are optimized for a specific
collection type:
• peek grabs the first element of lists and queues, and the last element of vectors.
• pop returns everything but the first element for lists and queues, and everything
but the last element for vectors.
• pop! removes the last element of a transient vector, returning the vector.
• “get” is mainly for hash-maps, but works also on vectors and strings to fetch the
element at the specific index. It works on “set” to check for the inclusion of an
element.
• “subvec” is dedicated to splitting vector apart at some index n.
Performance Considerations and Implementation Details
types are accepted as input, they need to be converted into sequences, potentially
producing sub-optimal performance. last, for instance, should be avoided on vectors,
for which there are better performing functions (such as peek).
The following table shows the most used collection types, suggesting a faster
alternative to first or last when one exists.
NOTE please note that O(1) is used as an approximation of O(log32 N) here and in several other
places in the book. O(log32 N) is very close to O(1) for most practical purposes. When the
difference is important, it’s appropriately made clear.
Table 3.7. Alternative ways to access the head or the tail for ordered collection types.
(map
([f])
([f c1])
([f c1 c2])
([f c1 c2 c3])
([f c1 c2 c3 & colls]))
(map-indexed
([f])
([f coll]))
map is a fundamental tool in almost every functional language. The basic form takes a
function and a collection and returns the sequence of results of applying the function to
each element of the collection. The following, for instance, inverts the sign of each
number in the list:
(map - (range 10))
;; (0 -1 -2 -3 -4 -5 -6 -7 -8 -9)
Contract
The contract of map is different based on how many collections are passed after the
mapping function. "f" should preferably be free of side effects: because map
and map-indexed operate on lazy sequences, there is no guarantee about a
specific "once-only" calling semantics for "f".
Let's divide the contract based on those cases.
(map f): no input collections
• When map is invoked with just "f" it returns a transducer and no actual invocation
of "f" is performed until the transducer is invoked.
(map f coll): single collection as input
• "f" is invoked with 1 argument and can return any type. "f" needs to support at
least arity-1 but it can also have others, e.g.: (map - (range 10)).
• "coll" is a collection that can be iterated sequentially, so that (instance?
clojure.lang.Seqable coll) returns true.
• returns: a lazy sequence containing the result of applying f to all the elements in
the input collection.
(map f c1 c2 & colls): with any number "n" of collections
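The example discussed below was lost in extraction; a reconstruction consistent with the callout (the letters beyond ["a" "b" "c"] are assumptions matching the "h", "i", "j" arguments it describes):

```clojure
;; The infinite range and the 4-element vector are both truncated
;; by the shortest input, the 3-element vector.
(map str (range) ["a" "b" "c"] ["h" "i" "j" "k"])
;; ("0ah" "1bi" "2cj")
```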
As you can see, the "middle" vector containing the 3 letters ["a" "b" "c"] determines
when the map operation is going to end. “str” receives 3 arguments on each
invocation: "0 a h", "1 b i" and "2 c j".
The contract for map-indexed is more restrictive:
• When map-indexed is invoked with just "f" it returns a transducer that can be later
composed or applied.
• "f" must be a function of at least 2 arguments returning any type.
• returns: a lazy sequence containing the result of applying f(idx,item) to each
item in the collection.
Examples
map is often present in data transformations (along with filter) to prepare the data for
further processing. In the following example a list of credit products contains essential
data like the annual interest rate and the minimum credit allowed. Given a loan amount
and a desired number of years, we would like to output how much we will have to
repay back and the cost of the credit. The final result gives us a way to compare the
cheapest credit for the amount of money we wish to borrow:
(def products ; ❶
[{:id 1 :min-loan 6000 :rate 2.6}
{:id 2 :min-loan 3500 :rate 3.3}
{:id 3 :min-loan 500 :rate 7.0}
{:id 4 :min-loan 5000 :rate 4.8}
{:id 5 :min-loan 1000 :rate 4.3}])
(cost-of-credit 2000 5)
❶ The list of products is short and in memory for this example. It would probably come from a separate
source and contain much more detailed data.
❷ The compound interest formula is a direct translation of the Wikipedia version 77.
❸ add-cost is the function that injects two new keys into the input product. The total payment and cost
of credit are doubles with many digits.
❹ min-amount returns a function predicate that is dependent on the requested loan amount. It will be
used by filter in the main calculation below.
❺ round-decimals is the second function we use with map. In this case given a product we want the two
costs to be rounded to the second decimal. update-in is relatively straightforward to use for this goal.
❻ Finally we chain everything together using ->>. filter operations appear first so downstream parts of
the computation receive less work to do.
From the example we can see that for our request to borrow 2000 and repay it in 5
years, product id "5" is the best option; other products like id "1" have a very
competitive rate but they don’t allow borrowing 2000.
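Most of the functions in this example were elided from the extract. The following is a sketch reconstructed from the callouts (the compound helper and the exact rounding mechanics are assumptions):

```clojure
(defn compound [principal rate years] ; ❷ compound interest formula
  (* principal (Math/pow (+ 1.0 (/ rate 100.0)) years)))

(defn add-cost [amount years product] ; ❸ inject the two new keys
  (let [total (compound amount (:rate product) years)]
    (assoc product :total-payment total :credit-cost (- total amount))))

(defn min-amount [amount] ; ❹ predicate closed over the requested amount
  (fn [product] (>= amount (:min-loan product))))

(defn round-decimals [product] ; ❺ round both costs to two decimals
  (let [round2 (fn [d] (/ (Math/round (double (* 100.0 d))) 100.0))]
    (-> product
        (update-in [:total-payment] round2)
        (update-in [:credit-cost] round2))))

(defn cost-of-credit [amount years] ; ❻ ->> chain, filter first
  (->> products
       (filter (min-amount amount))
       (map (partial add-cost amount years))
       (map round-decimals)
       (sort-by :credit-cost)))
```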
Now an example related to map-indexed, which comes in handy when we want to
associate an ordinal number (usually a natural number) to the elements in a collection,
so that it’s possible to relate them to their position. map-indexed saves us from
explicitly passing a range. Showing the winning tickets for a lottery could be such an
example:
(def tickets ["QA123A3" "ZR2345Z"
"GT4535A" "PP12839"
"AZ9403E" "FG52490"])
❶ draw takes the tickets and performs a “random-sample” of the winners. 0.5 is the probability of each
element in the collection being part of the final sequence.
❷ display uses map-indexed to attach the order of extraction (and thus a higher prize) to the
extracted tickets, printing them in a nice “format, printf and cl-format”.
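The draw and display functions were elided; a sketch under the assumptions described by the callouts (the output format string is invented):

```clojure
(defn draw [tickets] ; ❶ each ticket has a 0.5 probability of winning
  (random-sample 0.5 tickets))

(defn display [winners] ; ❷ attach the order of extraction to each ticket
  (map-indexed
    (fn [idx ticket] (format "Prize %d: %s" (inc idx) ticket))
    winners))

(display (draw tickets))
```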
77
See en.wikipedia.org/wiki/Compound_interest for an example of compound interest calculation
We don’t need to enter into the details of how diff was supposed to work, but we can have a look into
how maplist is used in this fragment:
maplist(cdr(J), K, diff(K))
In this early design (early 1958), maplist takes 3 arguments: a list of items (for example cdr(J)), a
target list to collect the results (K) and the actual invocation of a function (diff). McCarthy, after
finding it impractical to implement maplist as designed, introduced the lambda notation. The following is
a re-write of the diff function some time later:
diff(L,V) = (car(L)=const->copy(CO),
car(L)= var -> (car (cdr(L)) = V -> copy(C1, 1->copy(C0)),
car(L)= plus -> consel(plus, maplist(CDR(L), λ(J diff(car(J), V)))),
car(L)= times-> consel(plus, maplist(cdr(L),
λ(J, consel(times, maplist(cdr(L),
λ(K, (J != K -> copy(car(K)), l->diff(car(K), V))))))))))
Calls to maplist are now making use of 2 arguments, like the following fragment:
maplist(CDR(L), λ(J diff(car(J), V)))
The first argument is now the list to map over (like for example CDR(L)) and the second a lambda λ(J,
f): a function of J followed by the body of the function, removing the need to pass a list K to
hold the results as an argument. maplist eventually made it to the famous 1960 original Lisp paper with the following
definition:
In Clojure this is very similar to the current map implementation (although in Clojure this is complicated
by building the resulting sequence as a lazy-sequence).
See Also
• “mapcat” is useful when the result of applying f to an item is again a sequence,
with the overall results of producing a sequence of sequences. “mapcat” applies a
final concat operation to the resulting list, flattening the result.
• amap operates with the same semantics as map on Java arrays.
• mapv is a specialized version of map producing a vector instead of a lazy sequence
as output. It uses a transient internally so it’s faster than the equivalent (into []
(map f coll)).
• pmap executes the map operation on separate threads, thus creating a parallel map.
Replacing map with pmap makes sense when the overall cost of handing the
function f off to separate threads is less than the execution of f itself. Long or
otherwise processor-consuming operations usually benefit from using pmap.
• clojure.core.reducers/map is the version of map used in the context of “Reducers”.
It has the same semantics as map and should be used similarly in the context of a
chain of reducers.
Performance Considerations and Implementation Details
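The first of the two forms compared by the callouts is missing from this extract; given the description in ❶, it likely swaps the two accesses so that (last res) is the final use of the local:

```clojure
(let [res (map inc (range 1e7))] (first res) (last res)) ; ❶
```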
(let [res (map inc (range 1e7))] (last res) (first res)) ; ❷
❶ last is forcing map to perform the computation on all elements to return the last one in the sequence.
Since nothing else needs the local binding res after closing the scope, every item before the last can
be safely garbage collected.
❷ Here (last res) is requested first, forcing map to go trough all the 10M elements and increment
them. Differently from before, we still need res after that because there is another instruction in the
scope of the local binding. This second version will likely exhaust the memory (depending on
hardware and the JDK settings) because no elements of the output collection can be garbage
collected until first is evaluated.
interface. Since virtually all "iterable" things in Clojure implement Seqable we can talk
about an "input collection", but map is technically a sequence-in sequence-out operation.
The only reason you might be interested in this detail is if you wanted to create your
own sequence type that integrates nicely with the rest of the Clojure ecosystem.
3.5.3 filter and remove
function since 1.0
(filter
([pred])
([pred coll]))
(remove
([pred])
([pred coll]))
filter and remove are very common operations on sequences. They perform the
complementary actions of keeping or removing an item in a sequence based on a predicate
(a function returning logical true or false):
• filter allows the item through when the predicate is true.
• remove prevents the item from appearing in the resulting sequence when the predicate
is true.
filter is essentially the complemented remove operation (and the other way around):
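We can verify the equivalence directly, for example:

```clojure
;; filtering with a predicate is the same as removing with its complement
(= (filter odd? (range 10))
   (remove (complement odd?) (range 10)))
;; true
```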
Contract
• "pred" is a mandatory argument. It must be a function of 1 argument returning any
type (which will be interpreted as logical true or false). "pred" should preferably
be free of side effects: because filter and remove operate on lazy sequences,
there is no guarantee about a specific "once-only" calling semantics for "pred".
• "coll" can be any sequential collection (such that (instance?
clojure.lang.Seqable coll) is true).
• returns: a (potentially empty) lazy sequence which is the same size as or smaller than
the input sequence. filter keeps items when (pred item) is logically
true, while remove removes them for the same predicate result.
Examples
filter and remove are typically found in processing pipelines. Some data enters the
pipeline on one end and is subject to a mix of transformations to produce the result. It’s
usually a good idea to remove unwanted elements before doing any other expensive
computation. For this reason operations like filter or remove most likely appear at the
top of the chain. There are filter examples throughout the book worth reviewing:
• Filtering out interesting sentences in the sentiment analysis example.
• Preparing an index by initial letter for a dictionary in the Levenshtein distance
example.
• filter is also common in transducer chains, like the following example to find
the longest function in a namespace.
In this section we are going to show a common usage of remove in conjunction
with some-fn to remove some types of values accumulating during the computation. In
the following example, a network of sensors connected to weather stations produces
regular readings that are encoded as a list of maps. Each map contains some
identification data, a timestamp and a payload containing the data for all the available
sensors. One potential problem is that any of the sensors could fail, resulting in that
particular key missing or an :error value being reported. We want to be able to
process such events and take care of possible errors:
(def events
[{:device "AX31F" :owner "heathrow"
:date "2016-11-19T14:14:35.360Z"
:payload {:temperature 62
:wind-speed 22
:solar-radiation 470.2
:humidity 38
:rain-accumulation 2}}
{:device "AX31F" :owner "heathrow"
:date "2016-11-19T14:15:38.360Z"
:payload {:wind-speed 17 ; ❶
:solar-radiation 200.2
:humidity 46
:rain-accumulation 12}}
{:device "AX31F" :owner "heathrow"
:date "2016-11-19T14:16:35.362Z"
:payload {:temperature :error ; ❷
:wind-speed 18
:humidity 38
:rain-accumulation 2}}
{:device "AX31F" :owner "heathrow"
:date "2016-11-19T14:16:35.364Z"
:payload {:temperature 60
:wind-speed 18
:humidity 38 ; ❸
:rain-accumulation 2}}])
(def event-stream ; ❹
(apply concat (repeat events)))
(defn average [k n]
(let [sum (->> event-stream
(map (comp k :payload)) ; ❺
(remove (some-fn nil? keyword?)) ; ❻
(take n)
(reduce + 0))]
(/ sum n)))
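With the definitions above in scope, the average can be requested per sensor key; the arguments below are arbitrary:

```clojure
;; mean of the first 4 valid temperature readings (62, 60, 62, 60)
(average :temperature 4)
;; 61
```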
NOTE We are going to see the same example of processing events and calculate their average
with “Transducers” while talking about completing.
(defn walk-all
"Returns a lazy-seq of all first elements in coll,
then all second elements and so on."
[colls]
(lazy-seq
(let [ss (map seq colls)] ; ❶
(when (every? identity ss) ; ❷
(cons (map first ss) (walk-all (map rest ss))))))) ; ❸
(defn filter+
([pred coll] ; ❹
(filter pred coll))
([pred c1 c2 & colls] ; ❺
(filter+ #(apply pred %) (walk-all (conj colls c2 c1)))))
❶ We use walk-all helper function to create a lazy sequence of all the first elements in a list of input
collections, then the second elements and so on, stopping when we reach the end of the first
collection. Before doing that, we make sure all collections are not empty using seq.
❷ We also need to make sure we didn’t reach the end of any of the collections. We can make sure there
are no nils by checking that identity returns logical true for every element in the sequence.
❸ We build the lazy sequence by using cons on all the first elements and the recursive call to walk-
all for all the remaining elements.
❹ The basic arity of filter+ is just calling into filter.
❺ The extended arity of filter+ takes the results of the walk-all function and applies the
predicate to all the first elements, then the second ones and so on.
Compared to map it’s not immediately obvious how to use our new extended filter+. One idea is to
consider the predicate as a function of multiple arguments returning a result that will be interpreted as
logical true or false. We could for example keep only those numbers (as strings) containing at least
one instance of the index at which they appear in the input collection:
(filter+ re-seq ; ❶
(map re-pattern (map str (range))) ; ❷
["234983" "5671" "84987"])
;; ((#"1" "5671")) ; ❸
❶ re-seq is a function of two arguments, exactly what we need for the two-collections input in this
example.
❷ The first collection builds up from an infinite range into a list of regular expressions: #"1", #"2" and so
on. It uses a string as input for re-pattern.
❸ "5671" appears at index "1" in the input vector and contains the number "1", so it appears in the final
results.
See Also
• keep is a cross between map and remove: like map it applies a function to a
sequence and like remove with nil? it removes nil from the output. It could be
used with similar effect to (remove nil?) when using identity as the
function: (keep identity coll).
• filterv is the equivalent operation producing a vector instead of a lazy sequence.
Prefer filterv whenever a vector output can be assumed, since the operation in
that case is much faster.
Performance Considerations and Implementation Details
(reduce
([f coll])
([f val coll]))
(reductions
([f coll])
([f val coll]))
reduce takes a function of two arguments. After calling the function on the first two
items (or on "val" and the first item, if "val" is provided), it proceeds to call the same
function with the previous result and the next item in the sequence. At each step in
walking the input sequence, the function has an opportunity to do something with the
"result-so-far" and the next element.
Similarly to the other functions presented in this chapter, reduce is a well known
functional tool. When describing operations on sequences (or Clojure collections in
general) reduce is often mentioned as part of the trio with map and filter, as it
frequently appears as the last step of a processing pipeline. The following example
shows an initial list of numbers transformed into squares and their total sum used to
calculate the average:
(defn sum-of-squares [n]
(->> (range n) ; ❶
(map #(* % %)) ; ❷
(reduce +))) ; ❸
(average-of-squares 10)
;; 28.5
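The example defines sum-of-squares but calls average-of-squares, whose definition was elided from this extract; it presumably divides the sum by the count, e.g.:

```clojure
(defn average-of-squares [n]
  (/ (sum-of-squares n) (double n))) ; 285 / 10.0 => 28.5 for n = 10
```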
reductions helps visualizing the reduce process. It has the same interface
as reduce but it also outputs all the intermediate results:
(reductions + (map #(* % %) (range 5)))
;; (0 1 5 14 30)
• (+ 0 1) the first two squares in the sequence are added together; the sum-so-far
becomes 1.
• (+ 1 4) the sum-so-far is added to the square of the next number in the range.
• (+ 5 9) the step before the last continues with the same idea.
• (+ 14 16) the square of the last element of the sequence, 16, gets added to the sum-
so-far. There are no more inputs, so the last evaluation is "30".
As a consequence of the relationship between reduce and reductions it is possible to
say that given a collection "coll" and a function "f": (= (reduce f coll) (last
(reductions f coll))).
reduce implements the typical recursive iterative process (where the incremental
results appear in the argument list) and includes a standard vocabulary:
• "accumulator" is the name given to the "result-so-far". It is sometimes abbreviated
as "acc" in source code.
• "reducing function" is the function "f" of two arguments. Note that "reducing"
does not necessarily mean a scalar value or "single object" as output. You can
indeed use reduce with hash-maps to enrich them with new keys (see the
examples section after the contract).
• "fold" is the class of operations reduce belongs to, more specifically a "fold-left".
This is because the elements of the input collection are gradually consumed from
the left, as if we were "folding up" the sequence.
Contract
Input
• "f" should provide both a 0-argument and a 2-argument arity and is a required
argument. The 0-argument arity is only called if there is no "val" initial value and
the collection is either empty or nil:
(reduce + nil) ; ❶
;; 0
(reduce / []) ; ❷
;; ArityException
❶ The collection is nil, so (+) is invoked without arguments returning the identity for addition.
❷ An exception is thrown on an empty collection because the function "f" does not have a 0-arity
call.
• "coll" is also required and can be nil or empty. If "coll" is not nil, it needs to implement
the Seqable interface such that (instance? clojure.lang.Seqable
coll) returns true (the only unsupported types are transients).
• "val", when present, is used instead of the first item in the collection to start the computation.
It follows that (reduce + 1 [1 2 3]) and (reduce + [1 1 2 3]) are equivalent. When
"coll" is either nil or empty, then "val" is always returned.
Notable exceptions
• IllegalArgumentException when "coll" is not a sequential collection
(word-count "To all things, all men, all of the women and children")
;;{"To" 1
;; "all" 3
;; "and" 1
;; "children" 1
;; "men," 1
;; "of" 1
;; "the" 1
;; "things," 1
;; "women" 1}
❶ The first operation is to associate the number "1" to each item in the list.
❷ reduce comes next, to "reduce" multiple "1" appearing for the same key. We destructure here each
vector-item in the input into a key "k" and value "cnt" bindings.
78
See the Google paper that popularized the topic a while ago: research.google.com/archive/mapreduce.html
❸ reduce starting point is an empty map. We assoc the element at key "k" knowing that it might not be
found. By using get to fetch the current counter we can pass a default initializer of 0 for the sum.
Conveniently, the count-occurrences function in the example can handle any item
type, not just "words" (provided the items have some definition of equality that can be
used to store them in the hash-map). Even more conveniently, Clojure already contains
such a function in the standard library: it’s called “frequencies”:
(defn word-count [s]
(frequencies (.split #"\s+" s))) ; ❶
(word-count "To all things, all men, all of the women and children")
;;{"To" 1
;; "all" 3
;; "and" 1
;; "children" 1
;; "men," 1
;; "of" 1
;; "the" 1
;; "things," 1
;; "women" 1}
❶ The custom made count-occurrences has been replaced with the standard library
equivalent “frequencies”.
79
Please read the Wikipedia article on moving averages available at en.wikipedia.org/wiki/Moving_average to know more
;; [1 5.4 5.4]
;; [2 8.8 4.4]
;; [3 15.8 5.266666666666667]
;; [4 24.0 6.0]
;; [5 35.0 7.0])
❶ next-average is our reducing function. It destructures the result so far into a counter, the sum and
the last average calculated. It then proceeds to generate a new average that is stored in a new
triplet, ready to be returned for the next iteration.
❷ reductions is invoked with the reducing function, an initializer triplet of all zeroes and a collection of
values.
❸ The result of invoking stock-prices shows all generated triplets. If we are interested in just the
average, we could (map last) the results and ignore the rest.
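The listing itself was elided from this extract. Reconstructing from the callouts and the printed triplets (the prices vector is read off the output above; note reductions also yields the initial [0 0 0] triplet at the head):

```clojure
(defn next-average [[n sum _] price] ; ❶ counter, sum and last average
  (let [n   (inc n)
        sum (+ sum price)]
    [n sum (/ sum n)]))

(def prices [5.4 3.4 7.0 8.2 11.0])

(defn stock-prices [] ; ❷ all-zero initializer triplet
  (reductions next-average [0 0 0] prices))
```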
(foldl + 0 numbers)
;; 10
❶ numbers has been defined with the typical "cons-cell" design to show the left-to-right movement in
folding the list operated by foldl.
❷ The recursion "unfolds" the list at each iteration applying "f" to the first element and the results so far
(stored in "init").
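The foldl listing was elided; a reconstruction matching the callouts:

```clojure
(def numbers (cons 1 (cons 2 (cons 3 (cons 4 nil))))) ; ❶ cons-cell style

(defn foldl [f init coll]
  (if (seq coll)
    (recur f (f init (first coll)) (rest coll)) ; ❷ tail call, "init" accumulates
    init))

(foldl + 0 numbers)
;; 10
```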
The above is conceptually how reduce is implemented in Clojure for lists 80. The example shows what
happens at each step of the iteration over the input list. In the first recursion, "init" is (+ 0 1), then (+ 1
2), then (+ 3 3) and finally (+ 6 4). Visually, the computation starts by applying "f" from the left,
which is the reason why Clojure’s reduce is also called a left-fold. Also note how foldl is tail-recursive, since the
new foldl invocation is the very last operation in the loop.
There is also another way to write the same operation, suspending the application of "f" until we
reach the end of the list:
80
reduce is instead implemented as a for loop in Java for most of the Clojure collections
(foldr + 0 numbers)
;; 10
❶ The last operation is now "f" invoked over the arguments, where the collection is represented by the
recursive call to foldr.
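Again, the definition was elided from this extract; the recursive foldr described here is along these lines:

```clojure
(defn foldr [f init coll]
  (if (seq coll)
    (f (first coll) (foldr f init (rest coll))) ; ❶ "f" waits for the recursion
    init))

(foldr + 0 '(1 2 3 4))
;; 10
```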
The implementation above is also called a right-fold, because the first invocation of "f" happens using the
tail of the collection (the number 4) and moves backward until it reaches the head to perform the last
operation. To obtain this effect, the recursive foldr invocation happens inside "f" in the last line, forcing
the computation to suspend until the frame returns. Note how foldr is now not tail-recursive and
potentially subject to stack overflow (aggressively lazy languages like Haskell have instead the option of
making good use of foldr without exhausting the stack).
A practical distinction between foldr and foldl concerns non-associative operations, where the
order in which the list is consumed matters. Operations like division (/), for example, behave differently
with foldl or foldr: the unfolding of foldl with / would result in (/ (/ (/ (/ 1. 1.) 2.) 3.)
4.) while foldr would produce the equivalent of (/ 1 (/ 2 (/ 3 (/ 4 1.)))) generating a
different output:
(foldl / 1. numbers)
;; 0.041666666666666664
(foldr / 1. numbers)
;; 0.375
foldr is not part of the Clojure standard library, partly because of the problem with tail-recursion and
partly because it can be easily implemented using reverse (although at a higher performance cost):
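The reverse-based version can be sketched as follows; note how the reducing function swaps its arguments before applying "f":

```clojure
;; foldr implemented on top of reduce: reverse the input and
;; swap the arguments of the reducing function
(defn foldr [f init coll]
  (reduce (fn [acc x] (f x acc)) init (reverse coll)))
```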
(foldr / 1. numbers)
;; 0.375
❶ foldr implemented using reduce and reverse. Note that the reducing function "f" needs to swap its
arguments.
See Also
• reduce-kv is the analog of reduce for associative data structures. Instead of a
reducing function of 2 arguments, reduce-kv takes a function of 3 arguments: the
accumulator, a key and the corresponding value. Prefer reduce-kv when reducing over a hash-map.
• loop is the lowest common denominator of almost all sequential processing
functions. There is always a way to transform a reduce into a loop-recur where
you can customize all aspects of the reduction, including propagating types if
necessary.
• “frequencies” was mentioned in the examples as a perfect application for reduce,
where a final data structure is created incrementally by walking a sequential input.
• “reduced, reduced?, ensure-reduced, unreduced” are a group of functions that you
can use to fine-tune the behavior of reduce or reductions. When the reducing
function returns a reduced? value, reduce stops the computation and returns the result
immediately. This behavior requires a reducing function that knows how to deal with reduced values.
Figure 3.5. reduce invoked on different collection types and sizes. Lower number means faster
execution.
The diagram shows the linear behavior of reduce while increasing the collection size
from 100 to 500 and then 1000 items. It also shows that reduce on sets (ordered or
unordered) is roughly 5 times slower than on vectors, the fastest of the benchmark. In
absolute terms, reduce (especially on vectors or lists) is hard to beat, even with a loop-recur.
reduce walks the entire sequence by design, so it’s not lazy (although there are ways to
short-circuit using reduced). The memory footprint depends largely on the reducing
function. Assuming "f" is not accumulating the entire input in memory, even large
sequences can be reduced in linear time without worrying about going out of memory:
(let [xs (range 1e8)] (reduce + xs)) ; ❶
;; 4999999950000000
❶ + uses the items to complete the sum, but after that they can be safely garbage collected, resulting in
just a portion of the large collection being in memory at any given time.
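By contrast, a reducing function that retains every input forces the whole result into memory. The example described by the next callout might look like the following sketch (the single-entry maps are illustrative; any input-accumulating reduction behaves the same):

```clojure
;; merge accumulates every key-value pair into one growing map,
;; so the result ends up as large as the entire input
(let [xs (map (fn [n] {n n}) (range 1e8))]
  (reduce merge xs))
;; likely java.lang.OutOfMemoryError (depending on JVM settings)
```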
❷ In this second example, the reducing function is merge. The result is a collection with the same size of
the input, forcing all elements in memory. The likely outcome (depending on the JVM settings) is an
out of memory error.
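Holding the head of a lazy sequence has a similar effect. A sketch of the expression discussed in the next callout:

```clojure
;; last forces the whole sequence, while reduce, later in the same
;; form, still holds a reference to its head: nothing can be freed
(let [xs (range 1e8)]
  (last xs)
  (reduce + xs))
```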
;; OutOfMemory
❶ The call to last happens before reduce. Since they appear in the same form, the content of xs cannot
be garbage collected before reduce also has an opportunity to scan the sequence.
Both the last and reduce function calls wouldn’t normally produce an out of memory error when
taken in isolation. The problem is that they appear inside the same expression: the
garbage collection that would normally kick in while last is scanning through the
sequence cannot happen, as reduce holds the head of the sequence, preventing its
elements from being collected.
One last word about reductions, which is also part of this chapter. Despite exhibiting
the same behavior, reduce and reductions are quite different in
performance. reductions is not a drop-in replacement for reduce, because it always
walks the input collection sequentially, regardless of any custom reduce
implementation the collection might provide:
(let [xs (range 1000)]
(* 10e6 (b (last (reductions unchecked-add-int xs))))) ; ❶
;; 530.79127793974734 (µs)
Thanks to Nicola Mometto, Clojure core committer, for contributing this chapter
(except “definline”).
4
Arguably one of the most powerful aspects of any LISP is the ability to define custom
and arbitrarily complex macros, and Clojure is no exception. Although many languages
have the concept of macros, LISP macros are an entirely different beast, effectively
providing the users of the language with an expressive power that in other languages
only compiler authors can have.
Given their power, one would expect macros to be a complex and advanced feature to
use. This is in fact not the case: because of the homoiconic nature of Clojure 81 ,
defining a macro is as simple as defining functions and manipulating data. Macros are
indeed just regular functions that the compiler invokes at compile time, passing as
inputs their arguments as if wrapped in an implicit “quote” invocation and expecting back a
valid Clojure expression that will be evaluated at run time.
Macros can be used for a variety of reasons, from simply reducing the amount of
repeated code, to allowing code to be expressed in a more concise way, to writing
complex DSLs or embedding small compilers 82 .
This chapter is dedicated to the facilities in the standard library (and the language) to
create, inspect and help using macros. Here’s a brief summary:
81
Homoiconicity is the property of a language in which its syntax is represented in terms of data structures of the language
itself, see en.wikipedia.org/wiki/Homoiconicity
82
The core.async library is perhaps one of the best examples, implementing a source-to-source rewriting compiler as a
single macro github.com/clojure/core.async
• “defmacro” is the main entry point in the language to create macros. The body of
the macro is assigned to a newly created var in the current namespace, ready to
be referenced. Although several Clojure facilities can also be used outside macros,
many are found almost exclusively when creating them (like “Syntax Quote” for
instance). We are going to see a few of them while illustrating defmacro.
• “macroexpand, macroexpand-1 and macroexpand-all” are debugging tools to
show how the macro will process some input without actually executing it. The
"expanded" macro is simply returned for inspection.
• “quote” is a special form that prevents evaluation of what is passed in as argument. It
is simple but fundamental for macro programming.
• “gensym” is a helper function to generate unique symbol names. It is part of
macro hygiene 83.
• “definline” takes a body and defines both a function and an "inlined" version of
that function. The inlined version is very similar to a macro and shares the same
syntax.
• “destructure” is used by many macros in the standard library to implement
destructuring, a key feature of Clojure.
• “clojure.template/apply-template” and “clojure.template/do-template” are
dedicated to replacement of symbols in expressions during macro expansion.
4.1 defmacro
macro since 1.0
defmacro is to macros what defn is to functions, but while a function executes after
compilation, a macro executes while the code that calls it is being compiled. This gives the macro an
opportunity to alter the output of the compiler, including intercepting arguments before
they are evaluated (evaluation that would instead happen for a normal Clojure function
call). Being a macro itself, defmacro’s action can be revealed with macroexpand:
(macroexpand '(defmacro simple [a] (str a))) ; ❶
;; (do
;; (clojure.core/defn simple ([&form &env a] (str a))) ; ❷
;; (. (var simple) (setMacro)) ; ❸
;; (var simple)) ; ❹
❶ The macro being defined is simply returning a string conversion of its only argument.
83
Hygiene in macros has to do with preventing symbols defined outside the macro to collide with what happens inside the
macro. For an initial overview on the topic please see en.wikipedia.org/wiki/Hygienic_macro
❷ defmacro produces a do block starting with a call to defn to define what is at the beginning just a
function. As you can see two arguments are automatically added to the generated
function, &form and &env the meaning of which will be explained further down in the chapter.
❸ Once the function is defined, it is transformed into a macro making direct access to
the clojure.lang.Var object that the previous line just interned in the current namespace.
❹ The last form returns the “var, find-var and resolve” object just created and set to be a macro.
Because defmacro is built on top of defn, it supports all of its features, including
multiple arities, destructuring, :pre and :post conditions and more. Please
check defn contract and examples for any of these features.
All the macros provided by the Clojure standard library are defined
using defmacro itself (except defn that needs to come first). This is for example
how when is defined:
(defmacro when
"Evaluates test. If logical true, evaluates
body in an implicit do." ; ❶
{:added "1.0"} ; ❷
[test & body]
(list 'if test (cons 'do body))) ; ❸
❶ The string documenting the macro needs to be between the name of the macro and the arguments
declaration.
❷ Here an additional metadata map is present.
❸ The macro body returns a list. Since the macro is executed at compile time, the returned list (with
"test" and "body" replaced by the proper expansions) is "in-lined" at the calling site in place of the
original when form.
We can verify the expected behavior with macroexpand-1, which for clarity does not
expand past the first level:
(macroexpand-1 '(when (= 1 2) (println "foo")))
;; (if (= 1 2) ; ❶
;; (do (println "foo")))
(when (= 1 2) ; ❷
(println "foo"))
;; nil
❶ The arguments passed to the macro are not evaluated like for normal function calls, but are instead
passed to the macro as their quoted value and a valid Clojure expression is returned.
❷ Because when is implemented as a macro, the body expression is only evaluated if the test expression
returns logical true. This wouldn’t be possible using a normal function.
Contract
Input
defmacro uses the same syntax as defn. The reader is invited to review defn for the
complete set of options.
• "name" must be a valid Clojure symbol. "name" is used as the name for the macro
and is required.
• "fdecl" is commonly given as a vector of arguments and a body. Two implicit
arguments, &form and &env, are always added to the argument vector.
Output
defmacro returns a clojure.lang.Var object referencing the macro just created. The
macro "name" becomes available in the current namespace as a side effect.
NOTE defmacro’s hard limit on the number of fixed arguments is not 20 but 18, because of the 2
implicit arguments.
Examples
One of the most common usages is the with- style of macros, a class of macros that
execute their bodies within a defined context, automatically performing some logging
or cleanup logic.
Here’s an example of such a macro applied to the problem of contacting a
third party service over the network. When networking is involved, an
application always needs to prepare for the worst, such as intermittent connections,
unreachable hosts and so on. For this reason one common pattern is to keep track of
networking errors and retry contacting the third party service some number of times
before giving up completely and raising a proper error:
(defn backoff! [attempt timeout] ; ❶
(-> attempt
(inc)
(rand-int)
(* timeout)
(Thread/sleep)))
(defn frequently-failing! [] ; ❷
(when-not (-> (range 30)
(rand-nth)
(zero?))
(throw (Exception. "Fake IO Exception"))))
(defmacro with-backoff! ; ❸
[{:keys [timeout max-attempts warning-after] :or {timeout 100}} & body]
`(letfn [(warn# [level# n#] ; ❹
(binding [*out* *err*]
(println
(format "%s: expression %s failed %s times"
(name level#) '(do ~@body) n#))))] ; ❺
(loop [attempt# 1] ; ❻
(when (not= :success (try
~@body
:success
(catch Exception e#)))
(when (= ~warning-after attempt#)
(warn# :WARN attempt#))
(if (not= ~max-attempts attempt#)
(do
(backoff! attempt# ~timeout)
(recur (inc attempt#)))
(warn# :ERR attempt#))))))
(with-backoff! ; ❼
{:timeout 10
:max-attempts 50
:warning-after 15}
(frequently-failing!))
❶ The function backoff! implements a simple backoff algorithm: taking as input an attempt number and
a timeout, it picks n, a random number between 0 and attempt, and sleeps for n*timeout ms.
❷ The function frequently-failing! simulates a function that is subject to frequent failures, only
succeeding 1/30 of the time.
❸ The macro with-backoff! takes a map defining the desired backoff behavior and a body to execute
in that backoff context.
❹ We start right off by using “Syntax Quote” on the returned expression of the macro, making sure that
we return a data structure representing a program rather than executing that program. We
immediately make use of the additional features that syntax-quote has over normal quote, by using
the auto-gensym feature for both the local function we’re defining and for its arguments. In particular
the function warn that we’re defining will deal with printing a warning or an error message
to *err*, reporting the number of retries and the expression that is being retried.
❺ Here we make use of the unquote-splicing feature of syntax-quote, to splice the list of expressions
into a do body. Note that if, for example, we defined with-backoff! as a function taking an
anonymous function, this level of reporting wouldn’t have been possible, as functions don’t have a way
of accessing the actual representation of the arguments they’re handed.
❻ The macro then emits a loop in which the body is evaluated. If its evaluation causes an exception, the
exception is caught and we proceed with the potential backoff and retry, otherwise the loop simply
returns.
❼ Here we demonstrate how with-backoff! is used, using the previously defined frequently-
failing! function as its body, with a backoff timeout of 10 ms, a maximum number of attempts of 50
and telling the macro to print a warning after 15 failed attempts at executing its body.
`s/upper-case
;; clojure.string/upper-case
`lower-case
;; clojure.string/lower-case
`foo
;; user/foo
WARNING Usage of the tilde-single-quote "pattern" is highly discouraged: the reason for the auto-
qualification and auto-gensym features is to avoid the age-old LISP macro problem of
accidental symbol capture without having to implement a purely hygienic macro system 84, and
this "pattern" sidesteps those safety measures. There are legitimate cases where this is indeed
the desired behavior (some instances appear in the clojure.core code-base itself), but they
are extremely rare and usually only needed in very complex contexts.
3. unquote is the escape mechanism that turns syntax-quote into a full-blown
templating engine for Clojure expressions. By prefixing an expression used from
within a syntax-quote context with the unquote symbol ~, that expression is
evaluated normally, as opposed to being quoted, and the result of its evaluation is
embedded into the syntax-quote expression:
`[1 2 (+ 1 2) ~(+ 1 2)]
;; [1 2 (clojure.core/+ 1 2) 3] ; ❶
84
More on the problem of accidental symbol capture and hygienic macros here: en.wikipedia.org/wiki/Hygienic_macro
❶ Everything inside the square brackets should be quoted and unevaluated but unquote (tilde)
temporarily turns on the normal evaluation engine for the form it precedes (including all its
inner forms).
❶ unquote-splicing (tilde-at) turns on evaluation and treats the following form as a collection.
`[~@:foo]
;; IllegalArgumentException Don't know how to create ISeq from:
clojure.lang.Keyword
`~@[1]
;; IllegalStateException splice not in list
clojure.lang.LispReader$SyntaxQuoteReader.syntaxQuote
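A just-print-me macro matching the description in the callouts below can be sketched with the &form implicit argument:

```clojure
;; &form is implicitly bound to the entire form invoking the macro;
;; printing happens at macro expansion time, and the macro expands
;; to nil (the return value of println)
(defmacro just-print-me [& args]
  (println &form))

(just-print-me 1 2 3)
;; prints (just-print-me 1 2 3)
;; nil
```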
❶ We define the just-print-me macro, which does exactly what its name suggests: it prints the form
that is being invoked and returns nil.
❷ A quick invocation of this macro shows that it’s behaving as expected, printing exactly the form that is
being invoked.
Some might be tempted to observe that the previous macro could be rewritten without
any need for &form, like this:
(defmacro just-print-me [& args]
(println (apply list 'just-print-me args)))
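The with-locals-to-string macro described by the callouts below can be sketched as follows, using (keys &env) to enumerate the locals visible at the point of expansion:

```clojure
;; &env maps the symbols of the locals in scope to compiler info;
;; we only need the symbols (the keys), used twice: once as the
;; destructuring target and once as the values to stringify
(defmacro with-locals-to-string [& body]
  `(let [~(vec (keys &env)) (mapv str ~(vec (keys &env)))]
     ~@body))
```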
(let [a 1
b [:foo :bar]]
(with-locals-to-string [a b])) ; ❸
❶ The with-locals-to-string macro retrieves the local symbols available at the point of macro-
expansion using (keys &env) and puts them into a vector so that it will be possible to use that vector
in a destructuring let.
❷ It then emits a destructuring let statement wrapping the body, where every local will be rebound to the
result of invoking str on itself.
85
An actual real world example of a macro that uses &env is core.async’s go macro, possibly the most complex Clojure
macro written to date
❸ Here’s an example of how with-locals-to-string is used. Unfortunately it’s not possible to inspect
a macro that uses &env with macroexpand-1 while preserving the lexical context, but this is what that
expression will macro expand to:
(let [a 1
b [:foo :bar]]
(let [[a b] (mapv str [a b])]
[a b]))
Macros were later proposed and quickly replaced fexprs, being both easier to reason about for
humans and allowing compilers to do a better job at optimizing expressions 87. To this day there is
still a small number of minor LISPs that make use of fexprs instead of macros, such as newLISP 88 and
PicoLisp 89 .
The evolution of LISP macros didn’t stop with their proposal though; different LISPs had different
implementations and thus offered different behaviors: for example, the MIT PDP-6 Lisp expanded macros
on the fly at function call rather than at function definition. This had the advantage of allowing macro
redefinition without requiring redefinition of the functions using those macros, but required the
interpreter to expand the same macro call every time, reducing execution speed.
A big jump forward in the LISP macros evolutionary time-line happened in the mid '70s with the
introduction of the “Syntax Quote” templating system in ZetaLisp. This allowed macros to be written in a
significantly more concise style and also allowed ordinary programmers to write macros (at that time writing
complex macros was considered an activity that only real gurus could perform).
During the '80s the problem of macro hygiene arose, and caused Scheme to diversify significantly
from the other major LISP of that time, Common Lisp. While Common Lisp tried to side-step that problem
86
For a more in depth analysis of the history of macros in LISPs, refer to chapter 3.3 of "The Evolution of Lisp" by Guy
Steele and Richard Gabriel:www.csee.umbc.edu/courses/331/resources/papers/Evolution-of-Lisp.pdf
87
This is discussed at length in Kent Pitman’s 1980’s paper "Special Forms in
Lisp": www.nhplace.com/kent/Papers/Special-Forms.html
88
www.newlisp.org/
89
picolisp.com/wiki/?home
by instructing programmers to make use of “gensym” (Clojure uses the same style of macros as Common
Lisp, but rather than relying on users to make use of “gensym”, it forces them to by auto-qualifying symbols
in “Syntax Quote” expressions), Scheme decided that defmacro-style macros, which allow
arbitrary computation to happen in macro bodies and force users to deal with problems
like macro hygiene and manual parsing, were too hard to write.
To solve those problems, Scheme ditched defmacro-style macro definitions in favor of define-
syntax, syntax-rules and later syntax-case. Those primitives allow users to create macros as
syntax transformers, by simply defining the input language in a BNF 90 style and declaring a
transformation. Here’s an example of how the when macro would be defined in Scheme (note the lack of
explicit quoting/unquoting):
(define-syntax when
  (syntax-rules ()
    ((when pred body ...)
     (if pred (begin body ...)))))
There are several libraries that implement similar functionality in Clojure, and Clojure itself will probably
include something similar in future releases 91
See Also
• “eval” is a function that offers the opposite functionality of macros, by taking a
quoted expression and evaluating it.
• “macroexpand, macroexpand-1 and macroexpand-all” are invaluable functions
when debugging or trying to understand a macro, allowing the user to inspect the result of
a macro call while sidestepping the evaluator.
• “quote” is a special form used to prevent the compiler from evaluating an
expression. Conceptually a macro can be simulated by appropriately
combining eval and quote.
• “definline” blurs the difference between defn and defmacro, defining a function
that can also act as a macro when not used in a higher-order context.
Performance Considerations and Implementation Details
90
Backus–Naur Form, a language for describing the syntax of languages, see: en.wikipedia.org/wiki/Backus–Naur_Form
91
See the Clojure wiki page on macro grammars to have an idea of the kind of work that is currently in
progress: dev.clojure.org/display/design/Macro+Grammars
recursive analysis step is what distinguishes a macro from a normal function, which
would instead proceed directly to call the generated Java class. macroexpand allows
the user to invoke the recursive analysis process, stopping just before the evaluation
step.
(macroexpand form)
(macroexpand-1 form)
(clojure.walk/macroexpand-all form)
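A quick sketch of the kind of call the callout below refers to:

```clojure
(macroexpand '(when (> 2 1) "hello")) ; ❶
;; (if (> 2 1) (do "hello"))
```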
❶ The result of macro expanding a simple when form. Note the required use of quoting (') so the
Clojure runtime does not evaluate the form straight away.
Input
• "form" must be a valid Clojure expression.
Output
• Returns a macro expanded version of "form", with depth of macro expansion
depending on the macroexpand* variant used, as described above and illustrated
below in the examples.
Examples
Here’s an example using all the three macroexpand* variants on the same form,
showcasing the difference in how they work:
(macroexpand-1 '(when-first [a [1 2 3]] (println a))) ; ❶
;; (clojure.core/when-let [xs__5218__auto__ (clojure.core/seq [1 2 3])]
;; (clojure.core/let [a (clojure.core/first xs__5218__auto__)]
;; (println a)))
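For comparison, the other two variants can be tried on the same form. The expansions below are abbreviated, since the auto-generated symbol names vary between runs:

```clojure
(macroexpand '(when-first [a [1 2 3]] (println a))) ; ❷
;; (let* [...] ...)
;; expansion is repeated until the head (here let*) is no longer
;; a macro; forms nested deeper inside remain unexpanded

(require '[clojure.walk :as w])
(w/macroexpand-all '(when-first [a [1 2 3]] (println a))) ; ❸
;; every macro call at every level is expanded: only special
;; forms such as let*, if and do remain in the result
```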
❶ macroexpand-1 runs the macro expander exactly once on the input form, as we can see when-
first macro expands into a combination of when-let, seq and let.
❷ macroexpand loops macroexpand-1 on the form until the first element doesn’t resolve to a macro
anymore, in this case it will run 3 times: when-first macro expands to a when-
let expression, when-let macro expands to a let expression, let macro expands to
a let* expression.
❸ clojure.walk/macroexpand-all walks the expression running macroexpand on each subform,
using a depth-first traversal. All macro calls in the returned form have been fully macro expanded.
While it is true that the macro expansion utilities are almost exclusively used in the
REPL for interactive exploration and debugging, they can be useful in code as well to
implement really complex macros or tooling utilities. In the following example we
use macroexpand-all and clojure.walk/walk to find an approximation of all the
functions called by another function:
(require '[clojure.walk :as w])
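Following the description in the callouts below, find-invoked-functions can be sketched like this (a sketch assuming clojure.walk is required as w, as above):

```clojure
(defn find-invoked-functions [expr]              ; ❶
  (let [!fns (atom #{})                          ; ❷ collected vars
        walkfn!
        (fn walkfn! [form]
          (cond
            ;; ❸ a seq whose first element is a symbol is the
            ;; syntax for a function call in Clojure
            (and (seq? form) (symbol? (first form)))
            (do (when-let [v (resolve (first form))] ; ❹
                  (swap! !fns conj v))
                (when-not (= 'quote (first form))
                  (w/walk walkfn! identity form)))
            ;; ❺ other collections: recurse on their content
            (coll? form) (w/walk walkfn! identity form)
            :else nil))]
    (walkfn! (w/macroexpand-all expr))           ; ❻
    @!fns))
```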
(find-invoked-functions
'(when-first [a (vector 1 2 3)] ; ❼
(inc a)))
❶ find-invoked-functions is a function that takes a quoted expression and returns a set of vars that
approximates the actual set of functions that the expression references.
❷ !fns is an atom that we will use to collect referenced vars while walking the expression.
❸ walkfn! is a recursive function that is invoked on each sub-form that could contain function calls and
collects invoked functions. It starts by checking whether the sub-expression is a sequence whose first
element is a symbol, the syntax for function call in Clojure.
❹ If the sub-expression is a function call we try to resolve the symbol in function position to
a var using resolve; if that returns a var we conjoin it to !fns, then we recurse walkfn! on the sub-
expressions using clojure.walk/walk. We skip the recursive walk if the symbol in function position
is quote, since nothing inside a quote body is evaluated and thus there can be no function referenced.
❺ If the sub-expression is a collection then we recurse walkfn! on its content, otherwise we do nothing.
❻ Here we invoke walkfn! on the given expression, invoking clojure.walk/macroexpand-all on it
first to make sure we find all the functions referenced by the expression body.
❼ Finally we invoke find-invoked-functions on a simple expression; the result shows a set
of clojure.core vars. As we can see, the resulting set contains seq and first, neither of which
appears explicitly in our expression, but both are used by the expansion of when-first. Had we not
used clojure.walk/macroexpand-all, we wouldn’t have been able to know they were referenced.
The function just showcased isn’t perfect (it won’t find functions used as values, for
instance), but it’s a good example of how we can implement a simple call resolution
algorithm without the use of complex analysis tools.
Shortcomings
The macroexpand* functions have a couple of known shortcomings that can be
potentially surprising and should be kept in mind:
• They’re not aware of the surrounding lexical environment, meaning it’s not
always possible to macro expand macros that make use of &env
• clojure.walk/macroexpand-all macro expands without taking into account the
syntactic rules of Clojure, meaning it will potentially macro expand subforms that
should not be macro expanded, because they either appear in the body of a special
form or the referenced macro has been shadowed by a local binding
See Also
• “eval” is a function that takes a Clojure expression and evaluates it as code; macro
expansion happens as part of the evaluation of a form in the Clojure compiler
pipeline.
• “read-string” is a function that takes a Clojure expression as a string and returns its
representation as a Clojure data structure; reading precedes macro expansion in the
Clojure compiler pipeline.
• “quote” is a special form used to prevent the compiler from evaluating an
expression; Clojure forms can be passed to macroexpand* either through the use
of quote or through the use of read-string.
4.3 quote
special form since 1.0
(quote [expr])
quote is a special form that simply returns its input expression without evaluating it:
(quote (+ 1 2)) ; ❶
;; (+ 1 2)
'(+ 1 2) ; ❷
;; (+ 1 2)
As with all the utilities that affect how Clojure forms are evaluated, quote is mostly
useful in metaprogramming contexts. Because of how primitive quote is to the
language, Clojure provides a shorthand syntax to quote expressions via the reader
macro ' (the single-quote character). In other words (quote foo) can be conveniently re-written using the
more concise and equivalent syntax 'foo.
Contract
Input
• "expr" is the required and only argument.
Output
• returns: the argument that was passed as input, unevaluated.
Examples
Because of the evaluation rules of Clojure, if a symbol points to a var, then the value of
that var is dereferenced in place. Using quote is the only way to embed list and
symbol literals in code. Literal symbols are used for all kinds of purposes in Clojure,
most frequently as input to functions that provide runtime introspection functionality,
such as resolve:
(resolve '+)
;; #'clojure.core/+
Without the quote special form, one would be forced to write that call as:
(resolve (read-string "+"))
;; #'clojure.core/+
This is not only more cumbersome to write, but also less performant: rather than
embedding a constant at compile time, this forces Clojure to parse the string and
create a new symbol every time the expression is evaluated.
Besides the more common usage of embedding symbol literals in code, quote is
sometimes used in macros inside complex syntax-quote expressions as an escape hatch
for its automatic namespace qualification feature, via the "unquote-quote" pattern ~'.
To showcase this pattern we define a macro called defrecord* which augments
“defrecord” by making it implement the clojure.lang.IFn interface, so that records
created with defrecord* are callable just like maps:
(defmacro defrecord* [name fields & impl]
`(defrecord ~name ~fields ; ❶
~@impl
clojure.lang.IFn
(~'invoke [this# key#] ; ❷
(get this# key#))
(~'invoke [this# key# not-found#]
(get this# key# not-found#))
(~'applyTo [this# args#] ; ❸
(case (count args#)
(1 2) (this# (first args#) (second args#))
(throw (AbstractMethodError.))))))
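To exercise the macro we first need an actual record; a minimal definition consistent with the usage below (the name Foo and the single field a are taken from that usage):

```clojure
;; defines the Foo record via defrecord*, making it callable
(defrecord* Foo [a])
```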
((Foo. 1) :a)
;; 1
((Foo. 1) :b 2)
;; 2
❶ We define the defrecord* macro, taking as input the record name, fields and default implementations
and we insert those args into a defrecord expression
❷ After the provided record impl, we implement clojure.lang.IFn and the two arities of
its invoke method that just delegate to get. Here we make use of the unquote-quote pattern so that
the method name will be invoke rather than user/invoke
❸ Similarly we implement the applyTo method so that we can also use apply on our record.
❹ We can verify that our macro is doing what it’s supposed to by instantiating an example record and
invoking it as a function.
See Also
• “eval” is a function that takes as input a quoted expression and returns its
evaluated value
• “Syntax Quote” is a reader macro that can be considered quote on steroids and the
go-to tool for writing macros
Performance Considerations and Implementation Details
4.4 gensym
function since 1.0
(gensym
([])
([prefix-string]))
gensym is a simple function whose only purpose is to return a unique symbol each
time it’s invoked. It’s mainly used in the context of writing macros, to avoid the
problem of accidental symbol capture when the automatic symbol generation feature of
“Syntax Quote” is not enough, but it can be used for any reason when there’s a need for a
fresh symbol, such as generating unique labels.
(gensym) ; ❶
;; G__14
(gensym "my-prefix") ; ❷
;; my-prefix17
Contract
Input
• "prefix-string" is the only optional argument. If no prefix is provided, "G__" will
be used as prefix.
Output
• gensym returns a symbol whose name is the prefix followed by a number
guaranteed to be unique in the current Java instance.
Examples
Here’s an example showcasing gensym while manipulating symbolic expressions for a
small logic language. First order logic 92 allows for quantification of logic variables
with quantifiers such as "any" (there is at least one item for which the expression
is true) and "all" (the expression should be true for all items). Expressions written in
first order logic are amenable to transformations that maintain logic equivalence
between formulae. One of them allows pulling a quantified expression "up":
(OR (EXIST x (Q x)) (P y)) ; ❶
The logic formula above reads: either there is at least one "x" such that "Q" of "x"
is true, or "P" of "y" is true. "Q" and "P" represent logic predicates. Logic predicates
are similar to functions: they take a logic variable (such as "x" or "y") and they
evaluate to true or false in a logic expression. We can claim that this expression is
logically equivalent to another using the "<=>" symbol (which means "if and only if"):
(OR (EXIST x (Q x)) (P y)) <=> (EXIST x (OR (Q x) (P y))) ; ❶
❶ Two logic expressions are logically equivalent when they evaluate the same given the same "x", "y"
input.
Our goal is to write a Clojure function that "pulls up" a nested quantifier in a logic
formula, so the quantifier appears external to the expression, as illustrated by the
logic equivalence above. One problem related to this transformation is the potential
accidental capture of logic variables. Observe the following:
(OR (EXIST x (Q x)) (P x)) <!=> (EXIST x (OR (P x) (Q x))) ; ❶
❶ The accidental capture of "x" does not guarantee equivalence between these expressions. In one
case "x" is quantified but the quantification should not be extended to other predicates arbitrarily.
In the last example, the predicate "(P x)" suddenly becomes part of the quantification
of the variable "x" when previously it wasn’t, breaking the logic equivalence between
the expressions. We need to make sure that when transforming the expression, we
change the quantified variable to avoid accidental capture. We can achieve this by
using gensym as follows:
(defn- quantifier? [[quant & args]] ; ❶
(#{'EXIST 'ALL} quant))
92
First order logic is a formal system for logic reasoning. Compared to other kinds of formal systems (such as propositional
logic), first order logic also allows quantification of logic expressions over collections of items. Please
see en.wikipedia.org/wiki/First-order_logic for more information
❶ The function quantifier? returns true if the argument is a sequence starting with either "EXIST" or
"ALL".
❷ emit-quantifier assembles a new quantified expression given a quantifier and the original
expressions.
❸ To assemble the new expression, emit-quantifier makes sure the quantified variable is brand new,
so it cannot clash with an already existing variable in either expression.
❹ At the same time we need to make sure the predicate that was originally part of the quantified
expression also receives the newly created variable name. Note how the final expression is assembled
easily using syntax-quote.
❺ Callers perform transformations using the pull-quantifier function. This function understands
which expression contains the quantifier and calls emit-quantifier accordingly.
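The excerpt above shows only quantifier?; the emit-quantifier and pull-quantifier functions described by the callouts are not in the text. Here is a hedged sketch of what they could look like, reconstructed from the callout descriptions (not the book's exact listing):

```clojure
(require '[clojure.walk :as walk])

(defn- quantifier? [[quant & args]]              ; from the listing above
  (#{'EXIST 'ALL} quant))

(defn- emit-quantifier [[quant v expr] op other] ; ❷
  (let [fresh (gensym v)                         ; ❸ brand new variable name
        renamed (walk/postwalk-replace {v fresh} expr)] ; rename v inside expr
    `(~quant ~fresh (~op ~renamed ~other))))     ; ❹ assembled with syntax-quote

(defn pull-quantifier [[op e1 e2 :as formula]]   ; ❺
  (cond
    (quantifier? e1) (emit-quantifier e1 op e2)
    (quantifier? e2) (emit-quantifier e2 op e1)
    :else formula))

(pull-quantifier '(OR (EXIST x (Q x)) (P x)))
;; e.g. (EXIST x4401 (OR (Q x4401) (P x)))
```

The gensym-based renaming is what prevents the accidental capture of "x" discussed above.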
4.5 definline
experimental macro since 1.0
WARNING As of Clojure 1.10, definline is the only remaining instance of an "experimental" declaration in the
clojure.core namespace. "Experimental" should be read as "use at your own
risk". definline had at least one serious issue related to AOT compilation 94 on Clojure
versions prior to 1.6.0 and may be replaced with a different solution in future Clojure releases.
Lisp-like compiler-macros are being considered, for example. 95 Despite these
problems, definline and the :inline keyword are widely used in small and big projects. 96
(timespi 3) ; ❷
;; 9.42
;; (do
;;  (defn timespi [x] ; ❹
;;    (* x 3.14))
;;  (alter-meta! (var timespi) ; ❺
;;    assoc :inline
;;    (fn timespi [x]
;;      (seq
;;       (concat (list (quote *))
;;               (list x)
;;               (list 3.14)))))
;;  (var timespi))
93
Function inlining is an internal process by which a compiler replaces a function invocation with the body of that function
at compilation time. More information is available on Wikipedia: en.wikipedia.org/wiki/Inline_expansion
94
see dev.clojure.org/jira/browse/CLJ-1227
95
See dev.clojure.org/display/design/Inlined+code on the Clojure wiki
96
A partial list of projects making use of inlining is available on the Clojure mailing list: groups.google.com/d/msg/clojure-dev/UeLNJzp7UiI/WA6WALO6EPYJ
When the inlined version of the function is just the same as the body (at least for a
subset of the arities), using the :inline metadata keyword creates a duplication
that definline can take care of. Like the metadata keyword, definline allows the
compiler to treat a function differently based on the way it is invoked. Direct
invocations of the inlined function will be expanded similarly to macros, while higher-
order uses in which the function is passed as an argument will be treated like any other
function definition.
The main use case of function inlining has to do with performance optimizations
during Java interoperation (commonly referred to as "interop"). With an inlined version
of a function the compiler has a chance to use the presence of type hints to make calls
to the right Java method (when many overloads are present). Without the inlined
version, Clojure would have to wrap the primitive Java type argument into
a java.lang.Object.
Contract
Input
• "name" is the name of the function that definline will generate as part of the
macro expansion. The name should be a valid symbol as per Clojure Reader
rules 97 .
• "&decl": despite the presence of "&", "decl" is not really optional.
Because definline has to expand into a defn declaration, "decl" must contain at
least a vector (representing the list of parameters for the function). So: (definline
f []) is perfectly valid but (definline f) is not permitted.
Output
• definline returns a clojure.lang.Var object pointing at the function that was
just declared. The function is created in the current namespace, so there is
97
See the main Clojure Reader documentation at clojure.org/reader
Examples
The following example is going to explore a hypothetical integer math Java library that
we want to use from Clojure. To keep things simple, the library accepts different
numeric types but only outputs integers. The math library contains a plus method that is
overloaded for boxed numbers (e.g. java.lang.Integer) and also for primitive types
(int). It also contains a catch-all plus method that accepts generic java.lang.Object
as a last resort for other types that can be cast to java.lang.Number:
public class IntegerMath { ; ❶
❶ The IntegerMath Java class, simulates a fast math library that we wish to use from a Clojure
program.
Our goal, as the developers of the Clojure layer on top of the IntegerMath class, is to
be able to invoke the right plus method based on the inferred or explicit type. This also
includes the possibility for clients to call the native-unboxed "int" option if needed.
Finally, we would like to hide all of the complexity of the Java interoperation from the
developers of the Clojure application. To achieve this isolation we design the
following intermediate layer:
(ns math-lib ; ❶
(:import IntegerMath))
(defn plus [x y] ; ❷
(IntegerMath/plus x y))
❶ A Clojure namespace that hides the complexity related to invoking methods on the Java class.
❷ Clojure clients wishing to use the IntegerMath class only see a plus function of 2 arguments.
The math-lib namespace is designed to be the public interface to clients wishing to use
the IntegerMath class. The following example illustrates the use of the namespace to
sum a number to a list of other numbers:
(ns math-lib-client
(:require [math-lib :as m]))
(vsum 3 [1 2 3])
❶ the client code requires the library and executes a sum of some numbers without any knowledge that
Java-interop is required for this operation.
❷ printouts are showing that we end up calling the generic plus of objects instead of the more
specialized integer version
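The vsum helper used by the client above is not shown in the excerpt; a minimal, hypothetical sketch of it (with a pure-Clojure stand-in for math-lib/plus so it runs standalone) could be:

```clojure
(defn plus [x y]           ; stand-in for math-lib/plus (hypothetical)
  (+ x y))

(defn vsum [x v]           ; sums x to every element of v through plus
  (map #(plus x %) v))

(vsum 3 [1 2 3])
;; (4 5 6)
```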
Clojure doesn’t have a clue about what kind of sum vsum is executing once everything
is compiled: it could be summing up boxed or unboxed numbers, floats or integers. The
reason this information is missing is that plus was compiled to a Java class
with an invoke method that accepts and returns Objects. An attempt at coercing types
would not work either since the math-lib library is already compiled, as demonstrated
by the following:
(ns math-lib-client
(:require [math-lib :as m]))
(vsum 3 [1 2 3])
❶ the only change was to cast x and the item from the “vector” to be integer, but still the compiler won’t
take advantage of this.
❷ despite the type coercion to int, we are still calling into the generic plus version of the Java method
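The coerced variant the callout refers to is not reproduced in the excerpt; it presumably looked something like this sketch (again with a hypothetical stand-in plus, and the int casts the callout mentions):

```clojure
(defn plus [x y] (+ x y))   ; stand-in for math-lib/plus (hypothetical)

(defn vsum [x v]
  ;; casting both operands to int, as described in the callout
  (map #(plus (int x) (int %)) v))

(vsum 3 [1 2 3])
;; (4 5 6)
```

Because plus is already compiled with an Object-typed invoke, the casts carry no benefit at the call site.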
(definline plus [x y] ; ❶
`(IntegerMath/plus ~x ~y))
❶ Rewriting of the plus function in math-lib using definline. Note the similarity to macro writing.
Now plus is expanded in the place of the invocation, where information about the
types is still available for use:
(ns math-lib-client
(:require [math-lib :as m]))
(vsum 3 [1 2 3]) ; ❶
;; int plus(int int)
;; int plus(int int)
;; int plus(int int)
;; (4 5 6)
❶ The new printout confirms plus is now routed to the more specific Java method for unboxed integers.
(definline sq [x] ; ❶
`(let [x# ~x] (* x# x#)))
(direct-use 2.0)
;; 4.0
❶ sq simply multiplies its argument by itself. The let form and the apparent redefinition of the symbol "x"
is there to prevent double evaluation (common practice for generic macro programming, since "x"
could be an entire form including side effects). The "#" pound sign suffix in a macro is syntactic sugar
for “gensym”.
❷ direct-use is a function invoking sq directly
❸ higher-order-use is a function that passes sq to another function, in this case map
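direct-use and higher-order-use are not shown in the excerpt; based on the callouts they could be defined as follows (a sketch):

```clojure
(definline sq [x]              ; ❶ as defined above
  `(let [x# ~x] (* x# x#)))

(defn direct-use [x]           ; ❷ direct invocation: inlined at compile time
  (sq x))

(defn higher-order-use [v]     ; ❸ sq passed as an argument to map
  (map sq v))

(direct-use 2.0)
;; 4.0
(higher-order-use [2.0])
;; (4.0)
```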
As expected, direct use and higher-order use return the same results. Clojure compiles the direct use of
the function using the inlined form, effectively replacing direct invocations of sq with their macro-expanded
form: the direct-use function above is effectively replaced by its expansion.
This is the reason why definline needs to use macro syntax: it will be treated similarly to a
macro expansion at compile time to replace all direct uses of the function. Now let’s assume a
situation where we are playing at the REPL to solve a problem. We decide that the square function must
return integers, so we cast the result using int. One very common thing to do at the REPL is to go back to the
definition of the function in local history, change what we want to change and re-evaluate the function,
which is exactly what we are going to do below, without redefining direct-use:
(definline sq [x] ; ❶
`(let [x# ~x] (int (* x# x#))))
(direct-use 2.0)
;; 4.0
As you can see, direct-use does not truncate the return value to be an integer, while the higher-order
version returns "4" as expected. The same would happen when changing a macro and forgetting to
re-evaluate the functions using it, a common "reloading" problem. In a simple example like this one it’s easy
to see why this is happening, but in much bigger namespaces, whose dependency graphs are evaluated
at the REPL, this behavior can trip you up.
See Also
• “memfn” is a good choice when wrapping calls to instance methods of Java
objects for use in higher-order functions. definline has a similar effect with
better control of type passing, at the cost of an additional function to write. For
example, the following invocations of the toString method on a Java object are
equivalent. Prefer the memfn solution in this case:
(map (memfn toString) [(Object.) (Object.)])
;; ("java.lang.Object@65b38578" "java.lang.Object@88df565")
• “defmacro” if the logic of the function is mostly related to the compile time aspect
(as a macro) and the higher-order function is never used, consider
using “defmacro” instead to make explicit that the only intended use of the
function is as a macro.
4.6 destructure
function since 1.0
(destructure [bindings])
(destructure '[[x y] [1 2]]) ; ❶
; [vec__14 [1 2]
;  x (nth vec__14 0 nil)
;  y (nth vec__14 1 nil)]
❶ destructure returns the form that, when evaluated, produces the destructuring of a collection type (in
this case a vector).
❶ We can compose a let expression using “Syntax Quote” and decide what kind of destructuring to use
programmatically.
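The idea in the callout above can be sketched as a simplified my-let macro (a hypothetical reconstruction, but essentially how clojure.core/let itself is built on destructure):

```clojure
(defmacro my-let [bindings & body]
  ;; destructure runs at expansion time and produces the flat,
  ;; let*-compatible binding vector shown in the previous example
  `(let* ~(destructure bindings) ~@body))

(my-let [[x y] [1 2]]
  (+ x y))
;; 3
```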
Contract
Destructuring expressions can get very complex: the syntax supports a lot of different
options and can be arbitrarily nested. Here’s our attempt at a pseudo-formal
specification:
(destructure [bindings])
bindings :->
[bind1 expr1 .. bindN exprN]
bind :->
sym
OR
vec-bind
OR
map-bind
vec-bind :->
[bind1 .. <& bindN> <:as sym>]
map-bind :->
{<:keys [qbind1 .. qbindN]>
<:strs [sym1 .. symN]>
<:syms [qbind1 .. qbindN]>
<:or {sym1 expr1 .. symN exprN}>
<:as sym>
<bind1 expr1 .. bindN exprN>}
Sequential destructuring
Sequential destructuring works over any collection type that implements the concept of
sequential ordering; this includes Clojure sequences and vectors, strings, Java arrays
and lists. It is used to efficiently and concisely alias the nth or nthnext elements of a
collection, without having to explicitly access each element at its index. For example:
(let [my-vec [1 2 3 4]
[a b] my-vec ; ❶
[_ _ & r] my-vec ; ❷
[_ _ c d e :as v] my-vec] ; ❸
[a b c d e r v])
;;[1 2 3 4 nil (3 4) [1 2 3 4]]
❶ This is the simplest usage of sequential destructuring: the destructuring expression [a b] is applied to
the vector [1 2 3 4], causing a and b to be bound to 1 and 2, the rest of the vector is ignored.
❷ This destructuring expression uses the "tail destructuring" feature of sequential destructuring via
the & symbol: after ignoring the first two elements of the vector, r is bound to the remainder of the
collection as per nthnext (meaning that if the sequence is over, r will be bound to nil rather than to
an empty sequence). Note that _ is not a special symbol used in destructuring, it’s just a regular local
binding name that is idiomatically used for values that we’re not interested in.
98
For a more in-depth guide on destructuring, refer to: clojure.org/guides/destructuring
❸ Finally this destructuring expression uses the "collection aliasing" feature via the :as keyword: v will
be bound to the original collection being destructured, preserving its original type and metadata (if
applicable). This destructuring expression also showcases how it’s possible to destructure more
elements than there are in the destructured collection: in this case e will be bound to nil.
(defn dedupe-string [s]
  (loop [[el & more] s ; ❶
         [cur ret :as state] [nil ""]] ; ❷
    (cond
      (not el) ; ❸
      (str ret cur)
      (= el cur) ; ❹
      (recur more state)
      :else ; ❺
      (recur more [el (str ret cur)]))))
(dedupe-string "")
;; ""
(dedupe-string "foobar")
;; "fobar"
(dedupe-string "fubar")
;; "fubar"
❶ The function is implemented as a loop over the string, during each step of the loop we want to
consider the first character of the remaining string so we use destructuring to split apart the first
character (bound to el) from the rest of the string (bound to more).
❷ The loop also needs to keep some internal state representing the character we’re currently deduping
and the deduped string it’s built so far. We use destructuring to bind the current char to cur (initialized
to nil), the deduped string to ret (initialized to the empty string) and aliasing the whole state vector
to state.
❸ We’re in the body of the loop now, if el is nil it means the string has been fully consumed, so we exit
the loop by concatenating the current deduped string with the last char being deduped.
❹ If there is a char to consider and it’s the same as the char being deduped, we simply recur on the
remainder of the string and we keep the state unaltered, discarding the current char.
❺ If the current char to consider is not the same as the char being deduped, we recur on the
remainder of the string and we update cur to be the current character and ret to be the
concatenation of ret and cur.
Associative destructuring
Associative destructuring works over any collection that implements the concept of
key-value pairs; this includes Clojure maps, sets, vectors, records and strings. It is used
to efficiently and concisely extract and alias values from associative collections:
(let [my-map {:x 1 :y 2 :z nil}
{x :x y :y :as m} my-map ; ❶
{:keys [x y]} my-map ; ❷
{:keys [z t] :or {z 3 t 4}} my-map] ; ❸
[x y z t m])
;; [1 2 nil 4 {:x 1, :y 2, :z nil}]
❶ Basic associative destructuring: x and y are bound to the values found at keys :x and :y, while m
aliases the entire input map via :as.
❷ The :keys shortcut binds each symbol to the value of the keyword with the same name.
❸ The :or map supplies defaults for missing keys: t is absent from my-map so it is bound to 4, while z
is present (with value nil) so its default is not used.
Because Clojure encourages the use of maps for named or optional args to functions
(over the more typical Lisp keyword args), map destructuring is very commonly found
in the arguments of function definitions.
Nested and composed destructuring
Both sequential and associative destructuring expressions can be composed and
arbitrarily nested. Deeply nested destructuring expressions can quickly become hard to
read, so idiomatic Clojure usually doesn’t nest more than 2 destructuring expressions.
Here’s, for example, an extract-info function that takes keys
like :address or :contacts in a map and additionally destructures them:
(defn extract-info [{:keys [name surname] ; ❶
{:keys [street city]} :address ; ❷
[primary-contact secondary-contact] :contacts}] ; ❸
(println name surname "lives at" street "in" city)
(println "His primary contact is:" primary-contact)
(when secondary-contact
    (println "His secondary contact is:" secondary-contact)))
❶ First we extract "name" and "surname" from the input map using :keys destructuring.
❷ Without closing the first destructuring, we further destructure ":address" into "street" and "city".
❸ Finally, ":contacts" are subject to further sequential destructuring.
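Assuming the function body closes as the callouts describe, a sample invocation could look like the following sketch (the data values are made up for illustration):

```clojure
(defn extract-info [{:keys [name surname]                       ; ❶
                     {:keys [street city]} :address             ; ❷
                     [primary-contact secondary-contact] :contacts}] ; ❸
  (println name surname "lives at" street "in" city)
  (println "His primary contact is:" primary-contact)
  (when secondary-contact
    (println "His secondary contact is:" secondary-contact)))

(extract-info {:name "John" :surname "Doe"
               :address {:street "Long Rd" :city "London"}
               :contacts ["alice" "bob"]})
;; John Doe lives at Long Rd in London
;; His primary contact is: alice
;; His secondary contact is: bob
```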
Destructured vectors of arguments are also useful to describe the shape of the input
data structure in the function’s signature, since they will be included in the output
of “doc”.
See Also
• let is arguably the macro where destructuring is used most frequently, as
destructuring reduces the mental overhead of having to extract values out of
nested collections.
• “fn” also supports destructuring in its argument vectors by relying internally
on destructure. Support for keyword arguments can be achieved by combining
varargs and associative destructuring, since destructuring a sequence using
associative destructuring just converts the sequence to a map as per
(apply hash-map the-sequence).
• loop, “doseq, dorun, run!, doall, do”, “for” and all the other macros that support
argvecs or binding vectors support destructuring, as they usually build on top of
either let or “fn”.
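The keyword-arguments point above can be sketched as follows (configure is a hypothetical example function):

```clojure
(defn configure [& {:keys [host port] :or {port 8080}}]
  ;; varargs collect the kv pairs as a sequence; associative destructuring
  ;; turns that sequence into a map as per (apply hash-map the-sequence)
  [host port])

(configure :host "localhost")
;; ["localhost" 8080]
```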
Performance Considerations and Implementation Details
destructure is optimized to perform similarly to the same data lookup written
explicitly:
• Sequential destructuring has the same performance characteristics as repeatedly
using “nth” on the input collection (and using nthnext for tail destructuring).
• Associative destructuring has the same performance characteristics as repeatedly
using “get” on the input collection.
4.7 clojure.template/apply-template
function since 1.0
apply-template has some specific use cases in macro programming and symbolic
manipulation in general.
Contract
Input
• "argv" is a vector of symbols.
• "expr" is a valid Clojure expression that potentially contains one or more instances
of some of the symbols in "argv".
• "values" is a collection of Clojure values that will be used to replace in "expr" the
symbols at the matching position in "argv"
Output
apply-template returns an expression that is the same as "expr" but with the symbols
in "argv" that have a matching value in "values" replaced by the matching value.
NOTE if (count argv) is not the same as (count values), only the symbols that can be matched
with a value will be replaced. Any extra symbols in "argv" or extra expressions in "values" will
be ignored.
Examples
While its docstring explicitly states that its main usage should be in macros, it is
actually not a good idea to use apply-template in that scenario unless its mechanism of
action is completely understood, as it can lead to some unexpected results. apply-
template expands lexically, without knowledge of the semantics of a specific form. For
example:
(require '[clojure.template :refer [apply-template]])
❶ A small example of apply-template expansion shows that using the same symbol "x" in both
arguments generates incorrect Clojure.
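The expansion the callout mentions is not reproduced in the excerpt; a hypothetical example of the same pitfall:

```clojure
(require '[clojure.template :refer [apply-template]])

;; "x" appears both in argv and as a binding symbol inside expr:
(apply-template '[x] '(fn [x] (* x 2)) '[3])
;; (fn [3] (* 3 2)) <- not valid Clojure: 3 cannot be a parameter name
```

The substitution is purely lexical, so the binding occurrence of "x" in the fn parameter vector is replaced along with the others.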
Even though this small example is sufficiently trivial to understand what is going on
and what the fix should be ("expr" should not use binding symbols that appear in
"argv"), issues like this one could happen if the "expr" is provided by users of macros
that use this function.
For this and other reasons, the original author of apply-template has on more than one
occasion stated that its inclusion in the Clojure standard library might have been a bad
idea 99 .
For other cases, apply-template could be a useful tool to apply simple substitutions. For
example, here’s how to replace the variable "x" with "y" in an arbitrarily nested
expression:
(apply-template '[x] '(P(x) ∧ (∃ x Q(x))) '[y]) ; ❶
;; (P (y) ∧ (∃ y Q (y)))
❶ We use apply-template to replace all occurrences of the symbol "x" with "y" in a logic expression.
See Also
• clojure.template/do-template is a macro that uses clojure.template/apply-
template to expand the same template "expr" multiple times, using a different set of
values as substitutes for the symbols in "argv".
• clojure.walk/postwalk-replace is a function that deep-walks Clojure expressions,
replacing matching expressions along the way. It’s a more general version
of clojure.template/apply-template.
Performance Considerations and Implementation Details
4.8 clojure.template/do-template
macro since 1.0
99
grokbase.com/t/gg/clojure/124q5bb8y1/stackoverflowerror-caused-by-apply-
template#20120423oadz7ag6ufqed27u2jsxsk5e64
Input
• "argv" is a vector of symbols.
• "expr" is a valid Clojure expression that potentially contains one or more instances
of some of the symbols in "argv"
• "values" is a collection of Clojure values that will be partitioned by the count of
"argv" and each partition will be used to replace in "expr" the symbols at the
matching position in the current partition.
Output
• do-template repeatedly executes the template expression substituting the symbols
in "argv" with the matching value in the current partition of "values". It
returns nil.
Examples
The following is a simple example that prints the same expression with different
substitutions:
(require '[clojure.template :refer [do-template]])
❶ We need a side effecting function such as println to see the effects of do-template. Also note that
some white spaces in the result are not present in the original form.
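The actual call is not in the excerpt; a hypothetical invocation matching the description could be:

```clojure
(require '[clojure.template :refer [do-template]])

;; the template (println x y) is instantiated once per pair of values
(do-template [x y] (println x y)
  1 2
  3 4)
;; 1 2
;; 3 4
;; nil
```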
Overview
Operations on Numbers
5
Arithmetic operations are a fundamental feature of a language. This chapter collects the main arithmetic
operations offered by Clojure. The list might seem quite small, and the reason is that Clojure neither re-
implements nor wraps the vast selection of math functions that Java offers. If you are, for example,
searching for a function to truncate decimals or take the square root of a number, it can easily be reached
through Java interoperability.
Clojure still offers explicit versions of the most common math operations in the
standard library. This is mainly to provide optimal performance without requiring
explicit type hints. The operations Clojure provides are part of this chapter and are
summarized by the following table:
“+, -, * and /” The 4 basic arithmetic operations. Unlike Java they throw an exception on
overflow.
“inc and dec” Commonly used shortcuts for incrementing and decrementing numbers
by one.
“quot, rem and mod” Clojure offers one function to retrieve the quotient of a number and two
types of remainder operations.
“max and min” Calculate the max and min of a set of numbers.
“max-key and min-key” Calculate the max or min after applying a transformation function.
“rand and rand-int” Generation of random numbers.
“with-precision” Utilities to set the rounding strategies for decimal operations.
+' Core set of arithmetic operations with auto-promoting capabilities (note the
single quote appended to the name).
unchecked-add and other unchecked-* functions Java-style arithmetic on longs, subject to
truncation on overflow. This is the way Clojure can call the corresponding Java basic math operators.
unchecked-add-int and other unchecked-*-int functions Java-style arithmetic on ints, also subject
to potential truncation on overflow.
ARBITRARY PRECISION
The basic math operations +, -, *, inc and dec are all examples of simple precision
operators. When their short, int or long type operands go beyond the boundaries
of Long/MIN_VALUE and Long/MAX_VALUE, these operators throw an exception. Clojure also
offers a different option: the arbitrary precision operators such as +' (note the single quote
appended at the end of the name) automatically promote their return values to
a BigInt type which holds arbitrary size numbers (subject to memory availability).
For many applications the default precision of integer types is good enough. But there are
classes of applications that require the representation of numbers bigger than 2^63 - 1 (the
largest long number representable with 64 signed bits). If that’s the case, the "single-
quote" operators will make your life much easier compared to Java. Java big-integer
arithmetic is based on classes and objects without overloaded math operators, which
means that there is no easy way to sum up two BigIntegers other than creating their
respective instances and calling methods on them. Clojure would automatically use the
right precision just by using +'.
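The difference between the two families of operators can be seen at the REPL:

```clojure
;; (+ Long/MAX_VALUE 1) throws ArithmeticException integer overflow
(+' Long/MAX_VALUE 1)
;; 9223372036854775808N
```

The trailing N marks the auto-promoted BigInt result.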
NOTE You might have noticed that Clojure arbitrary precision operators are missing a /' (divide-quote)
equivalent. You should consider that / is already special because it potentially produces
fractional numbers that already preserve all possible precision (for example (/ 10 3) returns
the symbolic representation 10/3 without actually realizing any decimals). Secondly, / cannot
create long overflow if both arguments are longs (excluding the "zero" special case).
ROLL-OVER PRECISION
Roll-over precision defines a set of functions in the Clojure standard library that do not
result in an exception (or a type promotion) when the allocated storage space for that
type is exhausted. The roll-over behavior for the long type refers to the fact that:
• Upon reaching Long/MAX_VALUE, increasing the number by one results
in Long/MIN_VALUE
• Upon reaching Long/MIN_VALUE, decreasing the number by one results
in Long/MAX_VALUE
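The two bullet points above can be verified with the unchecked operators described in this chapter:

```clojure
(unchecked-inc Long/MAX_VALUE)
;; -9223372036854775808
(unchecked-dec Long/MIN_VALUE)
;; 9223372036854775807
```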
The roll-over behavior is implemented by the set of functions ending with the "-int"
suffix: “unchecked-add-int and other unchecked-int operators”. Clojure roll-over
functions mimic Java, in case of overflow restarting from the number at the
opposite end of the scale. The next table shows what happens to the 64 available bits
during a long type overflow (we need to remember the two’s complement integer
representation 101):
101
en.wikipedia.org/wiki/Two%27s_complement is the Wikipedia article describing how the two’s complement binary
representation works
As you can see from the table, approaching Long/MAX_VALUE fills up 63 bits with "1"s
and the change of sign happens by resetting everything to zero except the first bit.
NON-CASTING, ROLL-OVER PRECISION
Another group of math operators is named after the pattern "unchecked-*-int"
(replacing * with the name of the operation): unchecked-add-int, unchecked-
subtract-int, unchecked-multiply-int, unchecked-divide-int, unchecked-inc-
int, unchecked-dec-int, unchecked-negate-int, unchecked-remainder-int. These 8
functions are very similar, operating on the int type only, and we are going to
describe them as a single group under the unchecked-add-int function.
The int native type in Java has 32 bits and it’s stored using 2’s complement
format 102. The "unchecked-*-int" operators overflow into the opposite sign when
reaching the (Integer/MAX_VALUE) and (Integer/MIN_VALUE) limits. The following
table shows the bits layout upon reaching those limits and the effect of the related
operation:
32 bits                           decimal value  relative value
01111111111111111111111111111101  2147483645     (- Integer/MAX_VALUE 2)
01111111111111111111111111111110  2147483646     (- Integer/MAX_VALUE 1)
01111111111111111111111111111111  2147483647     Integer/MAX_VALUE, (unchecked-subtract-int Integer/MIN_VALUE 1)
10000000000000000000000000000000  -2147483648    Integer/MIN_VALUE, (unchecked-add-int Integer/MAX_VALUE 1)
10000000000000000000000000000001  -2147483647    (+ Integer/MIN_VALUE 1)
10000000000000000000000000000010  -2147483646    (+ Integer/MIN_VALUE 2)
102
See en.wikipedia.org/wiki/Two%27s_complement for the details
5.1 +, -, * and /
function since 1.0
(+
([])
([x])
([x y])
([x y & more]))
(-
([x])
([x y])
([x y & more]))
(*
([])
([x])
([x y])
([x y & more]))
(/
([x])
([x y])
([x y & more]))
The basic arithmetic operations have a lot of features in common. The following
description applies to +, -, * and / unless specified otherwise. One of the main aspects of
the math operations in Clojure is that they take advantage of multiple "arities" to
work in different contexts with great flexibility and performance.
NOTE Before Clojure 1.2, the basic math operators worked equivalently to the current auto-promoting
versions (the functions ending with a single quote). After Clojure 1.2, their behavior was
changed to the current one (throwing instead of auto-promoting) to avoid the related
performance penalty.
CONTRACT
• - and / do not support the no-arguments arity.
• When invoked with no arguments, (+) and (*) return their identity values, 0 and 1
respectively.
• When invoked with one argument, (- x) inverts the sign of the "x" argument.
• When invoked with one argument, (/ x) returns the reciprocal of "x", commonly
indicated as 1/x.
• When invoked with a single argument, both + and * just return the argument.
• All arguments must be of type java.lang.Number or subclasses ((number?
x) must return true for all arguments). They throw a ClassCastException when an
argument is not of type Number.
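The arities described by the contract, at a glance:

```clojure
(+)     ;; 0   (identity of addition)
(*)     ;; 1   (identity of multiplication)
(- 5)   ;; -5  (sign inversion)
(/ 4)   ;; 1/4 (reciprocal)
(+ 7)   ;; 7   (single argument returned as is)
```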
Return types change based on the input arguments. The following table summarizes the
possibilities (excluding for the moment a few exceptional corner cases listed right
after). Each box in the table shows the return type for each of the math operations
given the operand types on the row and column axes. If multiple return types are present for an
operation (such as (/) ratio long) it means the return type also depends on aspects
other than the types of the operands:
short/int/long with short/int/long: (+)long (-)long (*)long (/)ratio long
short/int/long with float/double: (+)double (-)double (*)double (/)double
short/int/long with BigInt: (+)bigint (-)bigint (*)bigint (/)ratio bigint
short/int/long with BigDecimal: (+)bigdec (-)bigdec (*)bigdec (/)bigdec[!]
short/int/long with Ratio: (+)ratio bigint (-)ratio bigint (*)ratio bigint (/)ratio bigint
float/double with any type: (+)double (-)double (*)double (/)double
BigInt with short/int/long: (+)bigint (-)bigint (*)bigint (/)ratio bigint
BigInt with float/double: (+)double (-)double (*)double (/)double
BigInt with BigInt: (+)bigint (-)bigint (*)bigint (/)ratio bigint
BigInt with BigDecimal: (+)bigdec (-)bigdec (*)bigdec (/)bigdec[!]
BigInt with Ratio: (+)ratio bigint (-)ratio bigint (*)ratio bigint (/)ratio bigint
BigDecimal with short/int/long: (+)bigdec (-)bigdec (*)bigdec (/)bigdec[!]
BigDecimal with float/double: (+)double (-)double (*)double (/)double
BigDecimal with BigInt: (+)bigdec (-)bigdec (*)bigdec (/)bigdec[!]
BigDecimal with BigDecimal: (+)bigdec (-)bigdec (*)bigdec (/)bigdec[!]
BigDecimal with Ratio: (+)bigdec (-)bigdec (*)bigdec (/)bigdec[!]
Ratio with short/int/long: (+)ratio bigint (-)ratio bigint (*)ratio bigint (/)ratio bigint
Ratio with float/double: (+)double (-)double (*)double (/)double
Ratio with BigInt: (+)ratio bigint (-)ratio bigint (*)ratio bigint (/)ratio bigint
Ratio with BigDecimal: (+)bigdec (-)bigdec (*)bigdec (/)bigdec[!]
Ratio with Ratio: (+)ratio bigint (-)ratio bigint (*)ratio bigint (/)ratio bigint
WARNING Operand types marked with [!] can result in an ArithmeticException "Non-terminating
decimal expansion". See “with-precision”.
(def empty-coll []) ; ❶
(apply + empty-coll) ; ❷
;; 0
(apply * empty-coll) ; ❸
;; 1
❶ A simple empty-coll var simulates the result of some computation whose cardinality we don’t know in
advance, resulting in an empty collection.
❷ Since + is equipped with a zero-arity variant it works fine on empty sequences, without requiring an
explicit check.
❸ * works the same way, just returning 1 instead of 0.
In more general terms, + and * implement the identity element for addition and
multiplication respectively 103.
The single-operand version of / can be used to represent the reciprocals of a number series.
The value of the Riemann zeta function at 2, for example, is the sum of the reciprocals
of the squares of the natural numbers 104. Other values of the Riemann zeta function are important in
statistics and physics. We can approximate the value of zeta at 2 by taking some
number of elements from the series (it is possible to demonstrate that the series converges
to PI^2/6, the Basel problem solved by Euler in 1734):
(defn x-power-of-y [x y] (reduce * (repeat y x))) ; ❶
(def square #(x-power-of-y % 2)) ; ❷
(def cube #(x-power-of-y % 3))
(defn reciprocal-of [f] ; ❸ a possible definition matching the pipeline below
  (map (comp / f) (iterate inc 1)))
(defn riemann-zeta [f n] ; ❹
  (->> f
       reciprocal-of
       (take n)
       (reduce +)
       float))
103
This simple Wikipedia article also illustrates identity elements for other
operations: en.wikipedia.org/wiki/Identity_element
104
To know more about the Riemann zeta function see the introductory article at
Wikipedia en.wikipedia.org/wiki/Riemann_zeta_function
(/ (* Math/PI Math/PI) 6) ; ❻
;; 1.6449340668482264
2-arity
The next example shows what is probably the most used number of arguments with
basic math operators: two operands. The annual interest rate formula, for example, is a
way to determine how much an initial capital will increase over time. We are going to
see how easily the mathematical formula can be translated into Clojure by
using “partial” application:
ca^1, ca^2, ca^3, ca^4, …
where:
• c = initial investment
• r = interest rate
• a=1+r
Each item in the series represents the total amount each year. So if we assume an initial
investment of c = 1000$ and we want to know how much will be in the bank after 3
years with a 20% interest rate, we’ll have to look at the 3rd element in the list: 1000 *
(1 + 0.2)^3 = 1728. We can generalize the formula using Clojure, creating an infinite
sequence from which we take as many yearly forecasts as we want:
(defn powers-of [n] ; ❶
(iterate (partial * n) 1))
❶ powers-of creates an infinite sequence of powers of the number n. We use “partial” along with * to
let iterate pass down the result of the previous multiplication.
❷ interest-at groups together the rest of the formula. Again the use of “partial” prepares for one
element from the previous series of powers to be multiplied for the initial investment.
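The interest-at function mentioned in the callouts is not shown in this excerpt; a minimal sketch based on the description in ❷ (the name and exact shape are a reconstruction, not the book’s code) could look like this:

```clojure
;; Hypothetical reconstruction of interest-at: decorate the powers of
;; (1 + r) by multiplying each one by the initial capital c.
(defn powers-of [n]
  (iterate (partial * n) 1))

(defn interest-at [c r]
  (map (partial * c) (powers-of (+ 1 r))))

;; 1000$ at 20% interest: the element at index 3 is the total after 3 years,
;; approximately 1728.0 as in the worked example above
(take 4 (interest-at 1000 0.2))
```

The only moving part is `(partial * c)`, which turns the series of growth factors into a series of yearly totals.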
Precision
The common (and default) math operators can throw exceptions (in this respect Clojure
departs from Java). Clojure number literals are handled as long by default,
corresponding to the java.lang.Long Java class. So, for example, the very fast-growing
series a(n) = a(n-1)^2 will throw ArithmeticException pretty soon:
(take 7 (iterate #(* % %) 2)) ; ❶
;; ArithmeticException integer overflow
NOTE Please note that despite the fact that Clojure treats numbers as Long by default, the error
message always refers to an "Integer" overflow. It should be read more generally as a
"natural" numbers overflow, whether those are Integers or Longs.
In Java the + operator will happily execute an overflowing operation and return a
negative number! This is why it’s a common idiom in Java to check for over/underflow
or to use the BigInteger class 105. Clojure took the more conservative approach that an
operation should never result in an implicit truncation or sign change. Developers
can still access that behavior if they need it by using the "unchecked" versions of the same
operators.
105
With the introduction of Java 8 there is now a set of arithmetic operations that throw an exception in case of
overflow/underflow, exactly like Clojure. See for
example: docs.oracle.com/javase/8/docs/api/java/lang/Math.html#addExact-long-long-
coming from some other Lisp language). By executing a simple (+ 1 1) newbies get a good idea about
invoking functions with parentheses and prefix operators. That’s why + is likely the first function ever
executed by someone learning Clojure.
See also:
• A single quote ' appended to the operator symbol defines its auto-
promoting version. Instead of throwing an exception upon reaching the limit of
Java longs, it promotes the long to a BigInt instance (which supports
arbitrary precision). Use, for example, *' when it is important for your application
to maintain precision. Beware that precision comes at a cost.
• “unchecked-add and other unchecked operators” are versions of the math operators
(including the basic ones described in this chapter) that remove the over/underflow
checks. This is the standard Java behavior. Use the unchecked versions if you are
willing to trade the possibility of a sign change on overflow for performance.
If your application will never see big numbers and you need a
performance boost you can use the unchecked versions with confidence.
• “unchecked-add-int and other unchecked-int operators” are even faster. All other
operators promote int operands to longs and return longs. Use the unchecked-
int versions when working primarily with integers to avoid unnecessary casts to
long. Unless you’re doing fast integer math, it is unlikely you’ll ever need
unchecked integer operations.
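The three behaviors described above can be contrasted directly at the REPL near the Long boundary:

```clojure
;; checked (the default): throws on overflow
(try (* Long/MAX_VALUE 2)
     (catch ArithmeticException e :overflow))
;; :overflow

;; auto-promoting: *' promotes to BigInt and keeps precision
(*' Long/MAX_VALUE 2)
;; 18446744073709551614N

;; unchecked: wraps around with a sign change, like Java's * operator
(unchecked-multiply Long/MAX_VALUE 2)
;; -2
```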
Performance considerations and implementation details
(inc [x])
(dec [x])
inc and dec are basically shortcuts for #(+ % 1) and #(- % 1) respectively.
Incrementing and decrementing are such common operations in everyday programming
that inc and dec save a good amount of keystrokes over a typical
Clojure development life. Using inc or dec is as easy as:
(inc 1)
;; 2
(dec 1)
;; 0
CONTRACT
• "x" is the single mandatory argument and must be of numerical type (i.e. (number? x) must be
true)
• returns: the value obtained by incrementing or decrementing x by 1. inc throws
ArithmeticException on Long/MAX_VALUE overflow and dec on Long/MIN_VALUE underflow.
Examples
Maps, atoms or any other data structure offering an "update" function are good
candidates to store a counter and can be used in conjunction with inc or dec. The
following example shows an instrument function that takes another function as
argument and "injects" it with a counter to store the number of calls it receives. The
number can be read later using a special keyword:
(defn instrument [f]
(let [calls (atom 0)] ; ❶
(fn [& args]
(if (= "s3cr3tC0d3" (first args)) ; ❷
@calls
(do (swap! calls inc) ; ❸
(apply f args))))))
(def say-hello ; ❹
(instrument #(println "hello" %)))
(say-hello "john")
;; hello john
;; nil
(say-hello "laura")
;; hello laura
;; nil
(say-hello "s3cr3tC0d3") ; ❺
;; 2
❶ an atom instance initialized to zero is created every time instrument is invoked. Compare-and-swap
(CAS) semantics prevent missed (or double) counts even when this function is called
in highly concurrent environments.
❷ we intercept the arguments and when the first one is a special "secret" code the function returns the
count so far instead of delegating to the wrapped function.
❸ in all other cases, we increment the counter. With atoms it’s as easy as passing the incrementing
function for updating to swap!. What we need is a function of one argument that increments it: the
perfect spot for inc.
❹ say-hello shows how instrument can be used to wrap another function.
❺ after using say-hello a few times we can see what happens when we use the secret code that
displays the number of times the inner println was called.
Haskell
Haskell has "pred" and "succ" functions that work similarly to Clojure:
> succ 1
2
> pred 0
-1
Haskell can also define curried functions in a very compact form, so despite the fact that we are using
the common + and - operators we can express incrementing and decrementing easily:
> (+1) 1
2
> ((-) 1) 1
0
Although for subtraction it doesn’t work as well, because of the ambiguity generated by -1 as a negative
number literal.
Ruby
Ruby’s main inspiration in this case is object orientation. Numbers are objects and can receive
"messages". We can send the message succ or pred to a number like this:
irb(main):001:0> 1.succ
=> 2
irb(main):002:0> 1.succ.pred
=> 1
Java
Unlike Ruby, Java number literals cannot receive method calls directly. Although numbers can
be wrapped in a new Integer() object first, there is no method to get the successor of a number. The only way is
through mutation. Java derives its increment and decrement operators from C. There is a big difference
between Java’s ++ increment operator and Haskell’s succ, for example: Java’s ++ will also mutate the
variable while making it bigger:
class Test
{
public static void main (String[] args)
{
int i = 0;
System.out.println("incrementing " + ++i);
System.out.println("and i is? " + i);
}
}
>> incrementing 1
>> and i is? 1
See also:
• inc' and dec': similarly to +' and -', the single quote ' identifies the auto-promoting
versions of inc and dec respectively. If the number is
Long/MAX_VALUE or Long/MIN_VALUE, plain inc or dec throws an exception;
the single quote versions avoid the problem by promoting the long to a
BigInt.
• unchecked-inc and unchecked-dec: these versions of the operators neither auto-
promote nor throw an exception. Upon reaching the upper/lower limit the result
simply inverts the sign and starts from the other side:
(unchecked-inc Long/MAX_VALUE)
;; -9223372036854775808
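inc' behaves the opposite way at the same boundary, promoting to BigInt instead of wrapping:

```clojure
(inc' Long/MAX_VALUE)
;; 9223372036854775808N
```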
(loop [n (int n) i 0]
(if (< i n)
(recur n (inc i)) ; ❶
i)))
NOTE The example shows a visible improvement from removing the check for integer overflow. But it
also shows that operations other than the one we are trying to optimize can dominate the
performance profile. In our case, the comparison (< i n) (where n is
not cast to int) dominates over unchecked-inc. Always use a profiler to verify your
assumptions about a target hotspot before engaging in changes that don’t produce a
noticeable effect.
In the Euclidean division (the process of division of two integers 106) the "quotient" is
the result of the division while the "remainder" is whatever is left over when the two
numbers are not multiples of each other. In Clojure quot returns the result of dividing
the numbers while rem returns the rest of the integer division, if any. The quotient can
also be defined as the number of times the divisor divides "num", excluding any
fractional part, which in turn is equivalent to taking the result of the division and truncating
the decimals. Finally, the "modulus" operation mod is very similar to rem but with
slightly different rules regarding how results are returned in the presence of negative
numbers 107. So for example:
106
For a formal definition of Euclidean division see the Wikipedia
article: en.wikipedia.org/wiki/Euclidean_division
(quot 38 4)
;; 9
(rem 38 4)
;; 2
The above means that the number "4" (the divisor) fits 9 times into the dividend "38",
with a remainder of 2. This description is not very rigorous, because things get hairy
when negative numbers are involved. This is where mod and rem differ:
(rem -38 4)
;; -2
(mod -38 4)
;; 2
We won’t go into the gory details of rem and mod for negative numbers, since the most
common uses in real-world applications are around positive quantities.
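Whatever the signs involved, quot and rem always satisfy the Euclidean identity: the dividend can be reconstructed from the quotient and the remainder. A quick REPL check (the helper name is ours, for illustration):

```clojure
;; num = div * quot + rem holds for any sign combination
(defn euclidean-identity? [num div]
  (= num (+ (* div (quot num div)) (rem num div))))

(euclidean-identity? 38 4)   ;; true
(euclidean-identity? -38 4)  ;; true
(euclidean-identity? 38 -4)  ;; true
```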
CONTRACT
Input
• "num" and "div" must be numbers (which means that (and (number? num)
(number? div)) must be true). So although quot deals conceptually with integer
quantities it can return doubles representing them:
(quot 38. 4)
;; 9.0
Notable exceptions
• ArithmeticException when "div" is 0.
Output
Returns the quotient or the remainder of dividing "num" by "div". If any of "num" or
"div" is a double, then the result will also be a double.
Examples
quot and rem are often seen in problems where some items should be partitioned into
containers. If we can’t physically cut an item into halves or
other fractions, quot is handy to discover how many items we can distribute evenly in
the containers. Let’s say we need to load a truck with some goods and the truck only
accepts 22 containers. Given 900 items to transport, we want to know how many items
we should put in each container:
107
The Wikipedia article about the remainder operation also clarifies how each language distinguishes between modulo and
remainder: en.wikipedia.org/wiki/Remainder
(defn optimal-size [n m]
(let [size (quot n m) ; ❶
left? (zero? (rem n m))] ; ❷
(if left?
size
(inc size))))
❶ optimal-size returns the best size for a container given n items and m containers. It makes good
use of quot to find out how many items fit evenly in the containers.
❷ rem is then used to see how many items would be left out. If there are items left out, the optimal
size is increased by one.
❸ When we call optimal-size with the example parameters we can see that the optimal number of
items for each container is 41.
❹ We can also see how the items would be distributed across the containers using partition-all. The output
is quite big, so it has been omitted from the example.
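The calls described in ❸ and ❹ are not shown in this excerpt; reconstructing them (with the defn repeated so the snippet can be pasted into a fresh REPL) confirms the numbers:

```clojure
;; repeated from above for a self-contained session
(defn optimal-size [n m]
  (let [size (quot n m)
        left? (zero? (rem n m))]
    (if left? size (inc size))))

(optimal-size 900 22)
;; 41

;; 900 items at 41 per container: 21 full containers and a last one with 39
(map count (partition-all 41 (range 900)))
```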
Although rem and mod are interchangeable in the example above, there is one use case
that commonly correlates with mod more than rem (although even in this case, since all
the quantities are positive, it still doesn’t make a difference): the definition of basic
operators on top of a finite set. Let’s take the alphabet, for example: we want to
implement an increment operator that returns the next letter. After creating the
alphabet as a vector we can make good use of indexes:
(def alpha
["a" "b" "c" "d"
"e" "f" "g" "h"
"i" "j" "k" "l"
"m" "n" "o" "p"
"q" "r" "s" "t"
"u" "v" "w" "x"
"y" "z"])
(defn ++ [c] ; ❶
(-> (.indexOf alpha c)
inc
(mod 26)
alpha))
(++ "a")
;; "b"
(++ "z")
;; "a"
❶ several interesting things are going on in this one-liner: first we extract the position of the letter in the
vector using the Java interop method indexOf, which is supported by the “vector” type. After incrementing
the index we need to make sure we are not past the boundaries, and if we are we want to overflow gracefully
and restart from the beginning. The key enabler is (mod x 26), which maps any number x back into
the range 0-25, including any potential roll-over. Once we have the new
index we just access the vector again. Notice that alpha is used as a function, passing the index to
retrieve the element at that index.
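To see why mod (and not rem) is the right choice here, consider the decrement counterpart, a hypothetical -- operator (our addition, not in the original): at the lower boundary rem would produce a negative index, while mod wraps around correctly.

```clojure
;; alpha rebuilt compactly; equivalent to the vector above
(def alpha (mapv str "abcdefghijklmnopqrstuvwxyz"))

(defn -- [c]
  (-> (.indexOf alpha c)
      dec
      (mod 26) ; (mod -1 26) => 25, while (rem -1 26) would give -1
      alpha))

(-- "b") ;; "a"
(-- "a") ;; "z"
```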
So, rem or mod? Some of the confusion stems from the fact that in "C" languages there
is a % operator that is commonly referred to as "mod" but actually implements the
remainder operation. So some classic examples of remainder use are described as
using mod in those languages.
See also:
• / is the common division operator. It will return fractions instead of removing
decimals as quot does. Use quot only when you need to deal exclusively with
integer quantities.
• unchecked-remainder-int is available to improve performance when quantities are
between Integer/MIN_VALUE and Integer/MAX_VALUE. This version of the
remainder operation won’t auto-promote or cast to long. See the performance
considerations section for additional details.
Performance considerations and implementation details
(let [num (int 100) div (int 10)] (quick-bench (unchecked-remainder-int num div)))
; ❷
;; Execution time mean : 8.169385 ns
❶ The presence of let communicates the types of "num" and "div" to the compiler, so when the
expression is compiled to Java bytecode "num" and "div" appear as primitive types. This is
important to avoid the conversions between the boxed class type and the primitive type (also
known as "boxing/unboxing").
❷ We can definitely observe a speed-up.
We can observe a 50% improvement which is substantial but small in absolute terms
(timings are nanoseconds). Only use unchecked-remainder-int when you are able to
prove there are no other factors dominating the execution of the expression.
(max
([x])
([x y])
([x y & more]))
(min
([x])
([x y])
([x y & more]))
max and min are two useful math-related functions in the standard library. Given a list
of numerical arguments they return the biggest or the smallest number in the list,
respectively. So:
(max 5 7 3 7) ;; 7
(min -18 4 -12) ;; -18
max and min imply some notion of ordering, which is almost always
guaranteed with numbers. Notable cases of tricky numbers to order are "Infinity" and
"NaN". While for infinity we have negative and positive by convention (so it’s always
possible to determine which is bigger), max or min operations involving NaN (Not A
Number 108) always return NaN.
CONTRACT
Input
• "x", "y" and "more" are numbers (such that (number? x) is true).
108
NaN (or Not a Number) is a special floating point value used to represent results that are undefined or not
representable. See en.wikipedia.org/wiki/NaN
Notable exceptions
• ClassCastException if any argument is not a number.
Output
max and min return the biggest or smallest of their respective arguments. If negative
infinity is one of the arguments, min will always return negative infinity. If positive
infinity is an argument, max will always return positive infinity. When one of the
arguments is NaN they both return NaN:
(max Long/MAX_VALUE 5 (/ 1.0 0)) ; ❶
;; Infinity
❶ The most important corner cases of max and min when dealing with special numbers.
Examples
max and min are obviously related to statistics. A first application could be finding the
extremes of a given collection, which can be solved with an easy one-liner:
(apply (juxt min max) (range 20)) ; ❶
;; [0 19]
❶ Since max and min don’t take a collection directly, we use apply to spread the collection. To
combine max and min in a single call we used “juxt”.
To expand the example in the direction of searching for local maxima, we could
implement a "best-lap" function that can be used to show what is the current best lap in
a competition:
(defn tracker []
(let [times (atom [])]
(fn [t] ; ❶
(swap! times conj t) ; ❷
(apply min @times)))) ; ❸
❶ The tracker function initializes a new state atom each time it is invoked. It then "closes over" the
initialized state returning a new function accepting a single "time" parameter. The higher order function
is returned to the caller ready to be used.
❷ Every time we receive a new time measure we add it to the collection of time events stored in
the atom
❸ The best of the timings collected up to now is returned. We use min and “apply” to find out the
smallest number in the collection.
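A short reconstructed session shows the tracker in action (the defn is repeated so the session is self-contained; best-lap is the name suggested by the text for the returned function):

```clojure
(defn tracker []
  (let [times (atom [])]
    (fn [t]
      (swap! times conj t)     ; record the new lap time
      (apply min @times))))    ; return the best lap so far

(def best-lap (tracker))
(best-lap 41.20) ;; 41.2
(best-lap 38.53) ;; 38.53
(best-lap 39.80) ;; 38.53  (39.80 is recorded, but 38.53 is still the best)
```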
(def ∞ (/ 1. 0))
(/ 0. 0)
(/ ∞ ∞)
(* 0 ∞)
(- ∞ ∞)
(Math/pow 1 ∞)
(Math/sqrt -1)
;; all producing:
;; NaN
Clojure mainly follows Java rules for NaNs. One of the interesting properties of NaN is that it is the only
number that is not equal to itself:
(def NaN (/ 0. 0))
(== NaN NaN)
;; false
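Because of this property, equality checks cannot be used to detect NaN; the Java interop static method Double/isNaN is the reliable test:

```clojure
(def NaN (/ 0. 0))
(Double/isNaN NaN) ;; true
(Double/isNaN 1.0) ;; false
```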
See also:
• “max-key and min-key” work very similarly to min and max, except that they can
accept any type of argument with the constraint that there must be a way to turn
the arguments into numbers. A function passed as the first argument determines
how the arguments are turned into numbers.
Performance considerations and implementation details
(max-key
([k x])
([k x y])
([k x y & more]))
(min-key
([k x])
([k x y])
([k x y & more]))
max-key and min-key build conceptually on top of the “max and min” functions.
While “max and min” operate on numbers only, max-key and min-key also
accept other types plus an additional function (the key) that transforms or extracts
numbers from those arguments to determine the biggest or the smallest,
respectively. Instead of just returning the min or the max number, max-key and
min-key return the winning argument itself:
(max-key :age {:name "anna" :age 31} {:name "jack" :age 21})
;; {:name "anna", :age 31}
CONTRACT
Input
• "k" is a function of a single argument returning a numerical type. "k" is invoked
over the arguments: (k x), (k y) and so on.
• "x", "y" (and any additional arguments) can be of any type.
Notable exceptions
• ClassCastException when "k" returns a result which is not a number.
Output
max-key and min-key return the argument that after calling "k" returns the biggest or
the smallest result respectively.
WARNING Unlike “max and min”, min-key and max-key do not handle NaN (while they handle
positive and negative infinity just fine). So you need to pay attention when using min-
key and max-key with number processing that can potentially generate NaN in any of the
arguments.
The following example shows an erroneous result produced by the presence of NaN.
This function to calculate the speed of sound contains a slightly wrong formula:
(def air-temp [[:cellar 4]
[:loft 25]
[:kitchen 16]
[:shed -4]
[:porch 0]])
❶ Speed of sound in air roughly increases with the square root of the temperature. The problem in this
formula is that we should be talking about the absolute temperature instead: (inc (/ temp
273.15)). But we forgot to do the conversion.
❷ When searching for the part of the house where sound moves fastest we get a wrong result, without
any exception being raised. The problem is that the square root of a negative number generates NaN,
forcing max-key to always return the last argument of the collection, despite others being bigger.
The correct formulation that prevents the problem above is for example the following:
(defn speed-of-sound [temp] ; ❶
(* 331.3 (Math/sqrt (inc (/ temp 273.15)))))
❶ We change the formula so it transforms the temperature into absolute Kelvin (which is never negative,
assuming the input is correct).
❷ We can see that sound moves fastest in the loft, where the temperature is highest.
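The search expression for the corrected formula is not part of the excerpt; a reconstruction using max-key over the air-temp pairs (repeated here for a self-contained session) finds the warmest room as expected:

```clojure
(defn speed-of-sound [temp]
  (* 331.3 (Math/sqrt (inc (/ temp 273.15)))))

(def air-temp [[:cellar 4] [:loft 25] [:kitchen 16] [:shed -4] [:porch 0]])

;; pick the room where sound travels fastest
(apply max-key (fn [[_ t]] (speed-of-sound t)) air-temp)
;; [:loft 25]
```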
Examples
When talking about min we used a tracker function that, when invoked sequentially
with time measurements, always responded with the current minimum. We now want
to extend the example so it can also record additional information such as the
athlete name, so we can use it to show who the winner of a competition is, not just the
best time. min-key solves the problem quickly:
(defn new-competition [] ; ❸
(let [stats (atom {:min {} :max {} :events []})]
(fn
([] (str "The winner is: " (:min @stats))) ; ❹
([t] (swap! stats (partial update-stats t)))))) ; ❺
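The update-stats function and the race1 setup referenced by the callouts are not part of this excerpt; a sketch consistent with the callouts (the event shape is an assumption, and new-competition is repeated so the session can run in a fresh REPL) could be:

```clojure
;; Hypothetical reconstruction: events are maps like {:athlete "Won T." :time 36.44}
(defn update-stats [t stats]
  (let [events (conj (:events stats) t)]
    (assoc stats
           :events events
           :min (apply min-key :time events)    ; best time so far
           :max (apply max-key :time events)))) ; worst time so far

;; repeated from above for a self-contained session
(defn new-competition []
  (let [stats (atom {:min {} :max {} :events []})]
    (fn
      ([] (str "The winner is: " (:min @stats)))
      ([t] (swap! stats (partial update-stats t))))))

(def race1 (new-competition))
(race1 {:athlete "Won T." :time 36.44})
(race1 {:athlete "Sec O." :time 41.12})
(race1)
;; "The winner is: {:athlete \"Won T.\", :time 36.44}"
```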
(race1) ; ❽
;; "The winner is: {:athlete \"Won T.\", :time 36.44}"
❶ update-stats is a function that takes a new event and the current stats and calculates the new stats.
❷ Using both min-key and max-key we can calculate the best and the worst time keeping all the other
related information.
❸ new-competition wraps the setup of the initial state. The initial state is stored inside an atom and
consists of a few nested data structures identified by relevant keys. The local state is part of the
bindings of the function that is returned to the caller.
❹ The no-arguments arity is called at the end of the competition to show the winner.
❺ The second arity takes the new event as input and proceeds to update the state with swap!, passing
update-stats the old state. The new event is also passed down to update-stats
through “partial”.
❻ A new competition is created by invoking the main new-competition function without arguments.
❼ The examples are showing that after sending new events to the competition, the returned value can
be queried for statistics, like best time or worst time.
❽ The race1 function invoked without arguments prints the final result.
Another way to look at problems involving minimum and maximum is trying to
minimize or maximize a function, for example in nearest neighbor search. Donald
Knuth informally called this class of searches the "post-office problem" 109.
Let’s take the challenge literally and try to solve the following: given a list of
geographical coordinates of post offices, we want to know which post office to
assign to a new house in the area. We can solve the problem by searching for the value that
minimizes the Euclidean distance (approximated for a spherical surface). This is more
109
A nice introduction to nearest neighbor problems is available on
wikipedia: en.wikipedia.org/wiki/Nearest_neighbor_search
(def post-offices
[[51.74836 -0.32237]
[51.75958 -0.22920]
[51.72064 -0.33353]
[51.77781 -0.37057]
[51.77133 -0.29398]
[51.81622 -0.35177]
[51.83104 -0.19737]
[51.79669 -0.18569]
[51.80334 -0.20863]
[51.74472 -0.19791]])
❶ sq, rad, cos and sq-diff are all helper functions necessary for conversions and other geometry-
related transformations.
❷ haversine calculates the spherical distance, which is an approximation on Earth since the radius
is not constant. It is definitely a good enough approximation for this exercise.
❸ closest is the function that determines which post office position minimizes the
distance to the target residence coordinates. We start from the vector of post office
positions, decorate it with the distance to the target residence with a map operation and then use min-
key to see which one is the closest.
❹ Invoking the closest function returns a pair containing the shortest distance in kilometers.
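The helper functions and closest are not included in this excerpt; the following is a sketch of what they might look like based on the callouts (the helper names, the Earth radius constant and the residence coordinates are assumptions, not the book’s code):

```clojure
(defn rad [deg] (Math/toRadians deg))
(defn sq [x] (* x x))

;; ❷ spherical ("great circle") distance in km via the haversine formula
(defn haversine [[lat1 lon1] [lat2 lon2]]
  (let [r 6371.0 ; mean Earth radius in km, an approximation
        dlat (rad (- lat2 lat1))
        dlon (rad (- lon2 lon1))
        a (+ (sq (Math/sin (/ dlat 2)))
             (* (Math/cos (rad lat1))
                (Math/cos (rad lat2))
                (sq (Math/sin (/ dlon 2)))))]
    (* 2 r (Math/asin (Math/sqrt a)))))

;; ❸ decorate each office with its distance, then minimize on the distance
(defn closest [residence offices]
  (apply min-key first
         (map (fn [office] [(haversine residence office) office]) offices)))

(closest [51.76 -0.30]
         [[51.74836 -0.32237] [51.75958 -0.22920] [51.83104 -0.19737]])
```

The returned value is a pair of [distance office], matching the description in ❹.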
See also:
• “max and min” have a similar goal but they only accept numerical arguments,
without the option to pass a "key" function to decide how to process arguments of
other types. Prefer “max and min” when you are only interested in retrieving the
actual min or max from a collection of numbers, without any other related data.
• reduce is the generalization of the process adopted by min-key and max-
key. reduce can iterate over a sequence while storing some information for the
next step to execute. This is exactly what happens when electing a local minimum
or maximum in the current comparison and then repeating the process with the
next. Consider “reduce and reductions” if you need different kinds of aggregation
statistics, such as the median.
110
For examples and an explanation of the Haversine formula see the following Wikipedia
article: en.wikipedia.org/wiki/Haversine_formula
Performance considerations and implementation details
(rand
([])
([n]))
(rand-int
[n])
rand and rand-int are a common feature in many languages. In Clojure they generate
random numbers of type double or int respectively, using the Java pseudo-random
generation capabilities. For many practical applications this kind of random number
is good enough, but they are not considered good for stronger randomness requirements
like Monte Carlo simulations 111. By default rand produces a double number in the
range 0 <= x < 1 while rand-int requires a specific upper bound:
(rand)
;; 0.6591252808399537
(rand-int -10)
;; -5
111
As usual, Wikipedia has a good introductory article about the type of random algorithm used in Java, called a Linear
Congruential Generator: en.wikipedia.org/wiki/Linear_congruential_generator
CONTRACT
Input
• rand with no arguments returns a pseudo-random double between 0.0
inclusive and 1.0 exclusive. rand-int requires exactly one argument.
• "n": is the upper bound for both functions and must be of
type double for rand and of type int for rand-int, negative values included.
Output
The output is an integer for rand-int and a double for rand.
Examples
Randomness finds applications in many fields. A simple use case is A/B testing,
when we want some percentage of the users to see a new feature. Let’s say we want
to increase the number of people answering a survey and we think that showing how
many questions remain is a good idea. At the same time we don’t want to risk
making a potentially bad decision on the entire population of customers taking the survey,
so we decide to roll out the feature to only 50% of the total requests. We can achieve this by
using rand:
(def questions
[["What is your current Clojure flavor?" ["Clojure" "ClojureScript" "CLR"]]
["What language were you using before?" ["Java" "C#" "Ruby"]]
["What is your company size?" ["1-10" "11-100" "100+"]]])
(ask questions)
;; Q(3/1): What is your current Clojure flavor? [1] Clojure [2] ClojureScript [3]
CLR
;; 2
;; Q(3/2): What language were you using before? [1] Java [2] C# [3] Ruby
;; 1
;; Q(3/3): What is your company size? [1] 1-10 [2] 11-100 [3] 100+
;; 3
;; [2 1 3]
❶ A-B is a function for branching, very similar to if, but it also accepts a number between 0 and 1 to
determine with which probability one branch of the condition will be used compared to the other. It can
be implemented very simply by comparing the input probability with the result of
invoking rand, which effectively converts into a probability to pick option A or option B in the “if, if-not,
when and when-not” statement.
❷ progress-handler, of all the features offered by our survey manager, is the feature impacted by the
A/B testing. It uses A-B to control the probability that the new feature "A" will be
presented to the user instead of the old feature "B". We use 50% (here translated to 0.5) so the
user is presented with a "progress indicator" instead of just a colon ":" symbol 50 times out of 100.
❸ when it’s time to ask, we start the survey by looping over the questions. Since we want all
questions to either show the progress or not show it, the A-B conditional branching needs to
happen ahead of the loop.
❹ since we are missing the current index of the question being asked, we need to wait until that data is
available inside the loop. This is why progress-handler returns a function of the progress so far,
to be passed as an argument at the time of the actual invocation. During this println we
have the information available, so we invoke the higher order function that was created in
progress-handler.
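The A-B function itself is not shown in the excerpt; following the description in ❶, it could be sketched as follows (taking the branches as plain values rather than as a macro is our simplifying assumption):

```clojure
;; Returns branch a with probability p, branch b otherwise.
(defn A-B [p a b]
  (if (< (rand) p) a b))

;; roughly half the calls pick the new feature
(A-B 0.5 :new-progress-indicator :plain-colon)

;; the edge cases are deterministic:
(A-B 1.0 :a :b) ;; :a
(A-B 0.0 :a :b) ;; :b
```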
The first problem has to do with the distribution of the random numbers and can be visualized if we plot
random points in a 3D space, thus the name (this 1968 paper is probably the first to call the problem
this way). One LCG algorithm that suffers from the problem more than others is RANDU 112 and
although it has been replaced with much better ones, it helps to understand the problem. As you can see
in the picture below, the plotted points are not distributed evenly as one would expect, but form a
predictable pattern of 2D planes.
112
The infamous RANDU generator of the first era of programming is described in en.wikipedia.org/wiki/RANDU
Figure 5.1. RANDU generated 3D points: the pattern is visible in the form of 2D planes.
The second problem has to do with the way the numbers are generated, which essentially is a repeating
bit-wise mutation every new number. When we look at the number generated as bits, we can see a
pattern emerging. This time we can verify our assumption directly on java.util.Random and visualize
the problem with the following example:
(import 'java.awt.image.BufferedImage
'java.util.Random
'javax.imageio.ImageIO
'java.io.File)
(defn coords [x y]
  (for [m (range x)
        n (range y)] [m n]))

;; width, height, rand-pixel and save! are not shown in this excerpt;
;; the following are minimal sketches of what the text implies.
(def width 256)
(def height 256)
(defn rand-pixel [r] ; a random black or white pixel
  (if (.nextBoolean r) 0xFFFFFF 0x000000))
(defn save! [img]
  (ImageIO/write img "png" (File. "random.png")))

(defn a []
  (let [r (Random.)
        img (BufferedImage. width height BufferedImage/TYPE_BYTE_BINARY)]
    (doseq [[x y] (coords width height)]
      (.setRGB img x y (rand-pixel r)))
    (save! img)))
The code in the example creates a binary image of 256x256 pixels and colors each pixel randomly
black or white. As you can see in the generated image, the dots align horizontally in some places
instead of being distributed uniformly, demonstrating the other limitation of LCGs.
See also:
• “rand-nth” takes a sequence and returns a random element from it.
Use it when you’re interested in a random element of a collection
rather than in the random index itself.
• “shuffle” takes a collection as input and returns a new collection with the elements
randomly reordered. Use “shuffle” when your only interest is swapping the
elements of a collection randomly.
• “random-sample” takes a collection as input and a probability factor "p". It
returns a new sequence in which each of the original elements appears with probability "p".
The smaller the probability, the fewer elements are returned in the sequence.
Performance considerations and implementation details
5.7 with-precision
Notable exceptions since 1.0
(/ 22M 7) ; ❷
;; ArithmeticException Non-terminating decimal expansion; [...]
❶ The first decimal expansion is sufficiently precise but not perfect. It is allowed, assuming there is no
need for additional precision.
❷ If we try to get maximum precision, we face the problem of a never-terminating list of
decimals, which would require an infinite amount of memory. This is clearly not allowed and produces an
exception.
❸ Using with-precision we can specify how much precision we want when storing decimal
numbers. Here we ask to stop at the fourth significant decimal (not included).
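The expression described in ❸ is not shown in the excerpt; a reconstruction of the exception-free version, also showing the effect of an explicit rounding mode:

```clojure
;; 4 significant digits, default HALF_UP rounding
(with-precision 4 (/ 22M 7))
;; 3.143M

;; same precision, rounding toward negative infinity instead
(with-precision 4 :rounding FLOOR (/ 22M 7))
;; 3.142M
```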
CONTRACT
(with-precision <precision> [<rounding>] [<exprs>]) ; ❶
precision :=> ; ❷
0 <= X <= Integer/MAX_VALUE
rounding :=> ; ❸
:rounding [CEILING|FLOOR|HALF_UP|HALF_DOWN|HALF_EVEN|UP|DOWN|UNNECESSARY]
exprs :=> ; ❹
form1,form2,..,formN
❶ with-precision takes a mandatory precision argument, an optional rounding type and zero or more
forms to evaluate.
❷ the mandatory precision is a positive integer from zero inclusive to Integer/MAX_VALUE.
❸ the optional rounding-type is the keyword :rounding followed by any of the values in square brackets.
If the rounding-type is not specified it is assumed :rounding HALF_UP by default.
❹ exprs are optional Clojure forms. If no form is passed as input with-precision returns nil.
Examples
with-precision has applications in any numerical calculation handling
java.math.BigDecimal object instances. This can happen directly, because of the use of
BigDecimal Clojure literals, or because a function receives them as parameters.
The BigDecimal type is said to be "contagious" because once an expression introduces it
somewhere in the code, the rest is affected. One strategy for designing the internals
of an application based on BigDecimal is to assume that outside the function the
parameters are of type double and their precision has already been taken care of. Inside the
function we promote to BigDecimal, execute calculations and finally return a double
again, following some contract regarding the required precision.
For example, the following shows how to calculate the number of shares that can be
purchased with the current account value at the current market price:
(defn share-qty [account price] ; ❶
(let [accountM (bigdec account)
priceM (bigdec price)]
(if (zero? priceM)
0
(long (with-precision 5 :rounding DOWN (/ accountM priceM)))))) ; ❷
❶ Input parameters are promoted to BigDecimal using bigdec. They are now ready for any subsequent
calculation without the fear of causing a loss of precision.
❷ with-precision is used to prevent an unwanted ArithmeticException if the ratio between the current
account total and the market price of the share has an infinite number of decimals. In this case
we also want to always return integer quantities for the number of shares in the account, so we
accept rounding them DOWN.
❸ The qty parameter was not converted assuming the quantity of shares is always an integer number.
(- 1.03 0.42)
;; 0.6100000000000001 ; ❶
Although the error is very small, it can accumulate into more worrying amounts when scaled up to
millions of operations. For this reason all mainstream languages have some way to deal with this
potential loss of precision. Java has BigDecimal for floating point numbers and BigInteger for
unlimited size integer numbers. Clojure builds on top of those making the code immensely less verbose
when promoting, constructing or operating on them.
See also:
• bigdec is used to transform other types into BigDecimal while with-
precision deals with the parameters that should be used for the following
operations.
• *math-context* is the dynamic variable that with-precision sets with the
selected precision and rounding mode. If you need finer control over
the MathContext used by BigDecimal operations, you can bind it directly,
bypassing with-precision:
(binding [*math-context* (java.math.MathContext. 10 java.math.RoundingMode/UP)]
(/ 10M 3))
;; 3.333333334M
(+'
([])
([x])
([x y])
([x y & more]))
(*'
([])
([x])
([x y])
([x y & more]))
(-'
([x])
([x y])
([x y & more]))
(inc' [x])
(dec' [x])
As expected +' -' *' inc' and dec' are very similar to their "unquoted" counterparts.
They implement the basic operations following similar rules to + - * inc and dec,
changing behavior only for results beyond Long/MIN_VALUE and Long/MAX_VALUE. This
chapter will focus mainly on those differences.
CONTRACT
For the general contract please refer to + and inc. The only difference is that when the
result "x" of the operation falls outside the long range (x < Long/MIN_VALUE or x >
Long/MAX_VALUE, the lower and upper limits respectively), the result is
automatically converted to clojure.lang.BigInt.
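A quick sketch of the difference at the REPL:

```clojure
;; + throws when the result overflows the long range:
;; (+ Long/MAX_VALUE 1)
;; => ArithmeticException integer overflow

;; +' auto-promotes to clojure.lang.BigInt instead:
(+' Long/MAX_VALUE 1)
;; => 9223372036854775808N

(inc' Long/MAX_VALUE)
;; => 9223372036854775808N
```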
Examples
The Diffie-Hellman key-exchange algorithm is the first published algorithm that allows
two peers to exchange a private key to start an encrypted conversation 113.
Before Diffie-Hellman, two parties had to use some physical medium (such as paper) to
exchange the key, as was done during World War 2 to set up the Enigma machine. Once the key
is shared, the two peers can start an encrypted conversation using a symmetric key
cipher. One important aspect of the algorithm is that the modulus "p" and the initial
secrets "a" and "b" must be sufficiently large to prevent brute force attacks. For this reason
they are chosen well beyond the 64-bit long range. The related math operations
need to operate on bigint, so the arbitrary-precision operators fit perfectly.
What follows is a simplified example that does not use the more complicated parameter
conventions introduced later by modern protocols like SKIP (Simple Key-Management
for Internet Protocol), but it is still functional and usable:
(defn genbig [n] ; ❶
(->> #(rand-int 10)
(repeatedly n)
(apply str)
(BigInteger.)
bigint))
113
This is the very accessible Wikipedia entry about the Diffie-Hellman algorithm for key
exchange: en.wikipedia.org/wiki/Diffie–Hellman_key_exchange
(defn- mod-pow [b e m] ; ❹
(bigint (.modPow (I b) (I e) (I (next-prime m)))))
(= a-common-secret b-common-secret) ; ⓫
;; true
❶ genbig is a helper function to generate the extremely large numbers needed for strong protection
during the exchange. The final product is a Clojure bigint type.
❷ The I function has an unconventional name. It is however a catchy mnemonic for the goal of
transforming a Clojure int type (small "i") into a Java BigInteger (big "I").
❸ next-prime finds the closest prime to a given number. We need this function to make
sure the modulus is a prime number. Since we generate the number randomly, we use next-prime to
grab the closest prime instead. This is where we make good use of inc' and dec': if by any
chance the number n we pass in doesn't exceed the long range, there is no need to auto-promote.
❹ mod-pow wraps the same function provided by Java BigInteger class. There is no "pow" operation in
Clojure that deals with big integers, so we use the one provided by Java.
❺ public-share applies the mod-pow operation on the base, secret and modulus as per the
Diffie-Hellman specification.
❻ shared-secret is exactly the same operation, defined as a different function so that the
names of the parameters can clearly highlight the two different contexts in which the two functions
should be used.
❼ This step is where our example usage starts. We first decide on a modulus and a base. All values
produced by functions ending in -pub can be shared in public. So two parties A and B can agree on
the base and modulus by email, for example, without any encryption necessary.
❽ A can now generate a public "mask" covering the private key "123456789N" that can be sent in the
clear. Although this public part is sent in the clear, there are too many possibilities for an
attacker to try in order to find which key generated the public mask. B does the same with
"987654321N" as the key. Notice that these private keys never have to leave the local computer.
❾ a-common-secret is generated in a similar way to the public share, using what B shared in public
and A's private key. This number is what A can use to encrypt all the following messages.
❿ B executes the same operation, obtaining b-common-secret that will be used both to decrypt and
encrypt messages to A.
⓫ As you can see a-common-secret and b-common-secret are effectively the same number.
The example above highlights a potential issue when working with bigint: despite
being far less verbose than the equivalent Java syntax, some casting back and forth
from Java BigInteger is necessary as soon as we need functionality
implemented only on the Java side.
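The I helper from callout ❷ is not reproduced above; a plausible one-line definition (our sketch, not necessarily the book's exact code) simply delegates to clojure.core/biginteger:

```clojure
;; Hypothetical reconstruction of the I helper: convert any number
;; to a java.math.BigInteger (big "I").
(defn I ^java.math.BigInteger [n]
  (biginteger n))

(type (I 42N))
;; => java.math.BigInteger
```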
(type 2M)
;; java.math.BigDecimal
(type 2N)
;; clojure.lang.BigInt
See also:
• bigint creates a clojure.lang.BigInt starting from all numeric types,
including java.math.BigInteger and even strings: (bigint "1") gives 1N.
• If you are sure that the application will never need to upgrade
to BigInt you can just use the normal operators, such as +.
• “unchecked-add and other unchecked operators” are the related unchecked-*
versions, which map directly to the basic Java operations. The unchecked versions
do not auto-promote: on crossing the long boundaries they won't throw an
exception but simply wrap around to the opposite sign. Unchecked math
should not be used if precision is important.
As you can see, unless the application is specifically dealing with a lot of big integer
math, these operations shouldn't be the primary source of concern in performance
analysis. But if you can constrain the hotspot in your application to only work with
longs (even better with ints) then it might be worth moving to unchecked operators (of
which you can see an example in the performance section about “inc and dec”).
(unchecked-add [x y])
(unchecked-subtract [x y])
(unchecked-multiply [x y])
(unchecked-inc [x])
(unchecked-dec [x])
(unchecked-negate [x])
(unchecked-subtract 2 38)
;; -36
(unchecked-multiply 10 3)
;; 30
(unchecked-inc 100)
;; 101
(unchecked-dec 12)
;; 11
(unchecked-negate 1)
;; -1
CONTRACT
Input
Unlike other math operators, unchecked-add, unchecked-subtract, unchecked-
multiply, unchecked-inc, unchecked-dec and unchecked-negate have a restricted
number of arguments:
• unchecked-add, unchecked-subtract and unchecked-multiply accept exactly 2
numeric arguments.
• unchecked-inc, unchecked-dec and unchecked-negate accept a single numeric
argument.
Output
They all return the result of the corresponding operation. The final type depends on the
types of the arguments, following the rule of the most "contagious" type. See the call-
out in this chapter to know more about the precedence rules of type promotion.
Examples
The long number type in Clojure maps to the equivalent signed long type in Java, which
has 64 bits. One bit is used for the sign, so the available range for a long type goes
from -2^63 to 2^63 - 1:
[(long (- (Math/pow 2 63))) (long (Math/pow 2 63))] ; ❶
;; [-9223372036854775808 9223372036854775807]
❶ The range of numbers representable using a long type. Note that we need a cast to long
because Math/pow returns a double in scientific notation.
If we increment the largest possible long number using unchecked-inc the result
restarts from the other end with a change of sign. Similarly if we decrement the
smallest number:
(unchecked-inc (long (Math/pow 2 63))) ; ❶
;; -9223372036854775808
(unchecked-dec (long (- (Math/pow 2 63)))) ; ❷
;; 9223372036854775807
❶ The effect of using unchecked-inc on the largest representable long number.
❷ Similarly, unchecked-dec on the smallest (negative) long number achieves the effect of restarting
from the largest one.
❶ scramble is a function of two longs, "x" and "y", that applies some simple arithmetic. We need to
"type-hint" the parameters to long type because the function is passed to reduce later below. The
reason why we need to do this is that the compiler is not able to see types at runtime and treats them
as boxed objects instead. Unfortunately unchecked operators will use normal math if any of the
operands are boxed Java objects, something that is under discussion for a possible fix in later Clojure
releases 114 .
❷ This is where we need to use something that deals with potentially big numbers, beyond long values.
Since we are not interested in the precision but only about the bits relationship, we can accept the
operation to overflow. The overflow still produces a valid long which is what we want. The number "31"
is a prime number. Prime numbers have the property of being less prone to introduce bit bias in the
operation.
❸ hash-str is a function taking the arbitrarily long string we want to hash.
❹ We need a large number to start the sum with. The main reason is to avoid the upper bits always
being set to zero for shorter strings. The large number is also a prime, again to avoid introducing
unwanted bit bias.
❺ We need to transform each character in the string into a number. For this we can use the ASCII table.
The next step is to apply scramble and sum up the result.
❻ when called, hash-str returns a number that converted into binary is the expected 64 bits long.
114
For the problem of unchecked operators falling back on checked math in the presence of Boxed numbers, see the
following ticket: dev.clojure.org/jira/browse/CLJ-1832
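The scramble and hash-str definitions the callouts refer to are not reproduced above; the following sketch (our reconstruction, with an illustrative prime seed, not the book's exact code) shows the shape they describe:

```clojure
;; ❶-style scramble: long type hints on the parameters, as the
;; callout advises for functions passed to reduce.
(defn scramble ^long [^long x ^long y]
  ;; 31 is a small prime, reducing bit bias; overflow is acceptable
  ;; here because only the bit pattern matters, not the magnitude.
  (unchecked-add (unchecked-multiply 31 x) y))

;; ❸-style hash-str: start from a large prime seed and fold the
;; character codes of the string through scramble.
(defn hash-str [s]
  (reduce scramble 1125899906842597 (map long s)))
```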
Contagious Types
Programming languages often support a selection of numerical types with different behaviors. Numerical
types can be classified by their supported precision, which describes the lower/upper
limits each type can handle. Functions in the standard library are affected by the types present in
the language and should describe what happens when operations are applied to different numerical
types.
Intuitively, it makes sense that if an operator with greater precision is used along with one that has
less, the returned type should have at least the precision of the most precise operand. Clojure roughly
applies the precedence list shown in the figure to determine how the result of an operation should be
promoted:
Figure 5.3. Contagion rules: the result of an operator will roughly return the most precise of the operands types.
Clojure "roughly" applies the contagion rules the picture describes, because as we have seen in
the return types relationships table at the beginning of the current chapter, there are several exceptions
to the rule. Also consider that double, the most contagious type, also has a very peculiar definition of
precision:
(double (- 10 1/3)) ; ❶
;; 9.666666666666668
❶ A generic operation showing automatic rounding of the last digit of a periodic decimal number.
The above snippet just shows a very well known fact of modern computer floating point arithmetic:
since CPU registers only have limited precision, some rounding needs to happen for things like
periodic floating point numbers. So if precision matters, you should design code that handles
precision explicitly instead of leaving the language to decide using contagion rules.
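The contagion rules can be summarized with a few REPL evaluations:

```clojure
;; The most "contagious" operand decides the result type:
(+ 1 2.0)   ;; => 3.0   (double wins over long)
(+ 1 2N)    ;; => 3N    (BigInt wins over long)
(* 2 3M)    ;; => 6M    (BigDecimal wins over long)
(+ 1/2 1)   ;; => 3/2   (ratio wins over long)
(+ 1/2 0.5) ;; => 1.0   (double wins over ratio)
```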
See also:
• The unchecked-*-int family of functions is strictly related to the unchecked-*
functions described in this chapter. They differ mainly in two aspects: the arguments are
cast to int (so values above (Integer/MAX_VALUE) or
below (Integer/MIN_VALUE) respectively will generate an error) and the overflow
happens on integers instead of longs. Apart from that, the argument and operation
semantics are the same.
• “hash” is the function that Clojure offers to create hash (int) numbers for all the
types in the standard library. Sometimes it delegates to Java sometimes it
implements better options (like Murmur3 for collections 115). It is unlikely you’ll
have to implement your own hashing function and before doing so check
if “hash” is good enough already.
Performance considerations and implementation details
(unchecked-add-int [x y])
(unchecked-subtract-int [x y])
(unchecked-multiply-int [x y])
(unchecked-divide-int [x y])
(unchecked-inc-int [x])
(unchecked-dec-int [x])
(unchecked-negate-int [x])
(unchecked-remainder-int [x y])
115
The original Murmur3 algorithm was written in C++ and is visible
here: github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp
(unchecked-add 2.0 1) ; ❶
;; 3.0
(unchecked-add-int 2.0 1) ; ❷
;; 3
❶ unchecked-add operates a promotion of the second argument to double type to return a double.
❷ unchecked-add-int returns a primitive int type instead.
for coordinates (a pixel in the image can only have integer coordinates because Java
arrays max size is Integer/MAX_VALUE) but also for data. To keep things simpler, the
color depth is going to be only 2: black (1) or white (0). Since we know that the
domain will be integers only, we can take advantage of the "unchecked-*-int" functions
described in this chapter.
We want to be able to draw points and lines on a digital canvas. Oblique lines are
especially tricky to draw with discrete pixels, forcing us to "interpolate" points so that
they are roughly aligned. Bresenham's line algorithm 116 can help us figure out
which points form the line under these conditions. Since there are many parts to
explain, let's start with a few helper functions:
(defn- steep? [x1 x2 y1 y2] ; ❶
(> (Math/abs (unchecked-subtract-int y1 y2))
(Math/abs (unchecked-subtract-int x1 x2))))
❶ the line is represented by the four coordinates x1,y1 and x2,y2, which are the two line
extremes. steep? helps us find the direction of the slope, upward or downward.
❷ adjust-slope swaps the coordinates of the extremes if the line points upward. We need this, along
with other transformations, to "normalize" the line information before searching for the line points.
❸ adjust-direction swaps the extremes of the line considering the "x" coordinates.
❹ adjust combines the transformations together, adjusting the slope incline and the line direction.
❺ swap creates a function that, based on the slope incline, swaps the x,y point coordinates.
The following to-points function takes the two extremes of a line (x1, x2, y1, y2) and
applies the Bresenham’s line interpolation algorithms to return the collection of the
points forming the line. As you can see, we cast or use unchecked int operators
throughout:
(defn to-points [x1 y1 x2 y2]
116
See this article with nice illustration around the mechanism used by the Bresenham’s family of
algorithms www.cs.helsinki.fi/group/goa/mallinnus/lines/bresenh.html
❶ let with type hinting is a common idiom to force the Clojure compiler to understand types at compile
time. If type hinting was not in place, Clojure would use the generic less-than < operator for objects
instead of the one for primitive integers. This behavior can be seen by setting a warning on boxed
math with (set! *unchecked-math* :warn-on-boxed)
❷ The main loop of the computation recurs on the next point coordinates (x,y) which are found by
applying increment/decrement on previously found points.
❸ We recur in two different ways based on the approximation error resulting from placing the current
point in a line. The new x,y coordinates are found using unchecked-inc-int or unchecked-dec-int.
Finally, after extracting the points forming the line, we need to draw them. The next
function paint! takes a canvas (a bidimensional array of zeros) and modifies it by
drawing the given points:
(defn paint! [^"[[I" img points] ; ❶
(let [pset (into #{} points)]
(dotimes [i (alength img)]
(dotimes [j (alength (aget img 0))]
(if (pset [i j])
(aset-int img i j 1))))) ; ❷
img)
;; [0 0 0 0 0 0 0 0 0 0 0 0]
;; [0 0 0 1 0 0 0 0 0 0 0 0]
;; [0 0 0 0 1 0 0 0 0 0 0 0]
;; [0 0 0 0 0 1 0 0 0 0 0 0]
;; [0 0 0 0 0 0 1 0 0 0 0 0]
;; [0 0 0 0 0 0 1 0 0 0 0 0]
;; [0 0 0 0 0 0 0 1 0 0 0 0]
;; [0 0 0 0 0 0 0 0 1 0 0 0]
;; [0 0 0 0 0 0 0 0 0 1 0 0]
;; [0 0 0 0 0 0 0 0 0 0 1 0]
;; [0 0 0 0 0 0 0 0 0 0 0 0]
;; ]
❶ By type-hinting the image as an array of arrays of ints (you can type-hint a local with a string
containing the Java rendition of a bidimensional array of integers, "[[I") Clojure can use the right
polymorphic calls for ints and avoid any reflective call.
❷ aset-int is used to mutate the array at the given position with "1" if this is a point of the line.
❸ For our example we need a few more helpers just to set up an empty canvas. But once everything is
ready, we can invoke to-points to produce the line approximation between (2,3) and (10,10) (for
example) and pass the points to the paint! function for the final draw. We need to convert the array
into a vector to actually print it on the screen.
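Callout ❸ mentions a few helpers for setting up an empty canvas; a minimal sketch (hypothetical name, not the book's exact code) could be:

```clojure
;; An all-zero bidimensional int array, ready to be passed to paint!.
(defn empty-canvas ^"[[I" [rows cols]
  (make-array Integer/TYPE rows cols))

;; Every cell starts at 0 (white):
(mapv vec (empty-canvas 2 3))
;; => [[0 0 0] [0 0 0]]
```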
As you can see the "line of 1s" is approximately a line that sometimes aligns a couple of
"1s" on top of each other because of the line incline. When seen on a much bigger
canvas this would actually appear as a straight line.
The main reasons to use unchecked-int operators in this case are consistency and
performance. Since a Java array is integer-indexed and even the contents of the arrays
are integers, it makes sense to avoid any transformation (even an implicit one) to further
stress the fact that all calculations are in the integer range. Secondly, unchecked-int
operators remove unnecessary boxing and reflective calls, resulting in better
performance overall.
117
See the Jira ticket related to this issue at dev.clojure.org/jira/browse/CLJ-1905
As you can see, the difference between the two versions is minimal, showing that other parts of the
computation dominate over the int-long cast. Since the results might differ in the next version of
Clojure containing the fix for the unwanted widening to long, we suggest using unchecked-int math
operators anyway. See the performance considerations section for further information.
See also:
In order of less restrictive behavior:
• If you are not in an integer-only domain, you can use simple unchecked
operators: “unchecked-add and other unchecked operators”
• If you don’t want the overflowing behavior but you need precision instead, the
self-promoting operators are what you need: +'.
• “hash” is the function that Clojure offers to create integer hashes of all the Clojure
common objects.
❶ The first test is using the current Clojure master branch without any additional change. The execution
time mean is around 770 microseconds.
❷ The second test is conducted against a locally patched Clojure, showing more than a 50%
performance improvement (at around 291 microseconds).
The suggestion is to use unchecked int operators anyway, since the problem is known
and will likely be fixed in the next Clojure releases.
Equality in Clojure takes into account several aspects when comparing "objects"
6
("object" here is intended in the most general meaning, without relationship to Object
Oriented Programming):
• Their values. The "value" of an object is the main ingredient of Clojure's equality. One
consequence is that collections don't necessarily need to have the same type to be equal. Values
satisfy the common intuition that (1 2) and [1 2] are equal, despite the fact
that clojure.lang.PersistentList and clojure.lang.PersistentVector are not the
same type. The equality operator = is mainly based on this principle.
• Their types. Types are definitely a relevant aspect for comparison. Types are used for example to
add semantic meaning to equality in case of ordered collections.
• Their identity. Finally, Clojure also offers a way to check if two objects are the same instance (which
most of the time resolves into checking if they have the same address in memory) with “identical?”.
• “= (equal) and not= (not equal)” are used mainly as operators in conditions. They
know how to deal with Clojure data structures and compare most of them
intuitively.
• “== (double equal)” is a specialized equality operator for numbers. There are
specific comparison rules for different types of numbers.
• “< , > , <= , >=” are the standard comparison operators, requiring arguments to
implement a notion of ordering when processing results.
• “compare” provides complete information about the relative ordering of
arguments. It is also used in standard sorting when no other comparator is
provided.
• identical? allows Clojure to access Java equality by reference.
• hash is the main hashing function in Clojure. Clojure hashing needs a specific
optimization to deal with collection as keys in hash-maps compared to basic Java
hashing.
• “clojure.data/diff” builds on top of equality and provides a way to retrieve
differences in deeply nested data structures.
Correctly addressing equality in a programming language is a very difficult task, and
Clojure had to tackle the additional problem of integrating with Java semantics. The
overall result is arguably the best possible tradeoff: while there are still rough edges
you need to be aware of, Clojure achieves a remarkable balance between the
JVM, purely functional data structures and usable equality.
(=
([x])
([x y])
([x y & more]))
(not=
([x])
([x y])
([x y & more]))
= ("equals to") is one of the most frequently used Clojure functions. not= ("not equals
to") is just the opposite of = and helps shorten the more verbose (not (= a b)).
They both take one or more arguments to compare, returning true (or false for not=)
when they are the "same". Despite the simple explanation, the meaning of "same"
depends on the kind of things being compared, which is not necessarily the definition
found in other programming languages. Basic usage is pretty simple:
(= "a" "a" "a") ;; true
(not= 1 2) ;; true
Equivalence implemented with = takes into account values and their representations,
which means that in case of collections for example, the comparison happens against
the content and not the type of the container. The contract section in this chapter is
more specific about rules and exceptions.
CONTRACT
"x", "y" and "more" can be any kind of Clojure expression, literal or nil. At least one
argument is required.
Output
= returns true or false depending on types and content. If we call T1 the type of "x"
and T2 the type of "y", we can broadly split types using the concept of compatibility
group:
• List-like group (ordered comparison): lists, vectors, subvectors, persistent queues,
ranges, lazy sequences.
• Map-like group (unordered comparison): hash maps, sorted maps, array maps.
• Set-like group (unordered comparison): sets and sorted sets (sorted-
map and sorted-set are sorted during sequential access, but ordering is ignored
when comparing).
• Integer group: byte, short, int, long, big-integer, ratios.
• Floating point group: float, double
Any other type not mentioned in the groups above can only be compared with an object
of the same type: big-decimal, strings, symbols, keywords, deftypes, defrecords, plain
Java types (following normal Java semantic). Given the mentioned compatibility
groups, equality between "Object1" and "Object2" is roughly described by the
following:
• If Object1 Object2 are compatible and ordered containers, = returns true if the
content is the same in the same order.
• If Object1 Object2 are compatible and unordered containers, = returns true if the
content is the same in any order.
• If Object1 Object2 are compatible but not containers, = returns true if the objects
have the same value.
• Return false in any other case.
The following section contains many examples going through each comparison in
deeper details.
Examples
The contract for = is difficult to express in a formal way to include all possible
permutations of object types. The best way to understand equivalence is by examples:
• (= 1N (byte 1)) is true because the operands are part of a compatible numeric
integer class (byte and big-integer). The main reason for them being
compatible is that their values are not subject to loss of precision during a potential
type conversion.
• (= 1M 1.) is false, because big-decimals and floating point numbers are not in
the same compatibility group. There is a dedicated == operator for number
equivalence that returns true in this case.
• (= '(1 2 3) [1 2 3]) is true because both collections belong to the compatible
list-like ordered group and their content is the same in the same order.
• (= [0 1 2] [2 1 0]) is false because vectors belong to the compatible list-like
ordered group and even if content and types are the same the order is not.
• (= #{1 2 3} #{3 2 1}) is true because sets belong to the same compatible set-
like unordered group.
• (= (sorted-set 2 1) (sorted-set 1 2)) is true despite the "sorted"
designation. Sets are always compared without considering ordering, but they
could be ordered for sequential access.
• (= {:a "a" :b "b"} {:b "b" :a "a"}) is true similarly to sets comparison,
because maps are unordered and compared without considering ordering.
• (= (sorted-map :a "a" :b "b") (sorted-map :b "b" :a "a")) is true,
because sorting of sorted-map is applied only to the sequential access. Sorting is
not considered when comparing.
• (defrecord SomeRecord [x y]), (= (SomeRecord. 1 2) (SomeRecord. 1
2)) is true, since the two fields x and y in the two records are the same.
• (deftype SomeType [x y]), (= (SomeType. 1 2) (SomeType. 1 2)) is false,
because there is no handling of equality other than what is inherited from Java
Object (which by default uses identical?).
• (= [1 2 3] #{1 2 3}) is false because of incompatible group types.
• (= "hi" [\h \i]) is false because of incompatible group types.
• (= (Object.) (Object.)) is false, since Clojure delegates equality for objects to
Java semantics (the .equals(Object o) method call).
Equality accepts any number of arguments, not just two. We could use a variable
number of arguments in the following simulation. In the classic slot-machine gambling
game, three (or more) reels come to a stop after spinning for some time. The user wins
if the reels are aligned showing the same symbol. Here’s how a simulated slot-machine
could be implemented in Clojure:
(defn generate [& [{:keys [cheat reels]
:or {cheat 0 reels 3}}]] ; ❶
(->> (repeatedly rand) ; ❷
(map #(int (* % 100))) ; ❸
(filter pos?) ; ❹
(map #(mod (- 100 cheat) %)) ; ❺
(take reels))) ; ❻
:result res}))
(play)
;; {:win false, :result (12 9 19)}
❶ The generate function is responsible for generating numbers (instead of symbols or patterns like in
real slot-machines). It takes two optional parameters: the number of reels to use and a cheating
factor going from 0 (no cheating) to 100 (100% cheating) so we can manipulate results. This feature
should obviously be well hidden in a real system!
❷ We start the generation from an infinite stream of random numbers between 0 and 1.
❸ The next step transforms the floating point numbers into integers between 0 and 100.
❹ We need to filter out zeros because we don't want the next step using mod to incur a division by
zero error.
❺ mod is used to force the generation to return the same number multiple times. The higher the
"cheating" factor, the closer the probability of the numbers being the same.
❻ Finally, we just extract as many generated numbers as we need.
❼ To check if the generated numbers are a winning position, we use “apply” with = to see if they are all
the same.
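The winning check described in ❼ relies on the variadic arity of =:

```clojure
;; All reels aligned: every argument is equal.
(apply = [7 7 7])
;; => true

;; One reel off: not a winning position.
(apply = [7 7 9])
;; => false
```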
// reference types can't compare even when they represent the same number:
System.out.println(new Integer(1).equals(new Long(1))); // false
System.out.println(new Integer(1).equals(new Short((short)1))); // false
Clojure's definition of == removes any trace of the primitive vs. reference duality: (== 1 1N (Integer.
1) 1. 1M) returns true as one would expect. Despite the huge improvement, == still suffers from the
notorious problem affecting floating points represented in binary:
1) = for collections allows different types of containers with the same content to be represented as a
single key in a hash map.
2) = for numbers ensures that there is always a distinction between numbers with different precision
features, so if they compare false and they are not floating point numbers we know for certain that
they are different (with floating points you should always consider potential precision problems).
The lesson we should take away from the implementation of = and == is that you’re pretty much free to
use anything you want as a key in a hash map, but there could be very subtle bugs happening if you use
floating point numbers as keys. Secondly, if you know that you’re comparing numbers and only numbers,
you should use == to remove the artificial categories that = introduces. Equality design in Clojure is
possibly the perfect compromise between Java constraints, intuitive behavior and the much more liberal
use of collections that Clojure enables.
See also:
• == is what should be used to compare numbers in the most general
case. == follows Java semantic for primitive numbers comparison but prevents
Java equals() requirement to box primitive numbers.
• <= and >= are also dedicated to numbers, additionally checking for relative
ordering between the arguments.
• identical? verifies if two objects are the same by checking if they live at the same
address in memory (following Java semantics). It follows that (identical? 1.
1.) is false because two java.lang.Double objects are created and they live at
two different addresses in memory. Since integers between -128 and 127 are
cached by the JVM, (identical? 127 127) is true but (identical? 128 128) is false.
118
See the following StackOverflow answer to know more: stackoverflow.com/questions/588004/is-floating-point-math-
broken
• distinct can be an interesting option if you are searching for all the unique elements in
a collection. This is equivalent to removing elements from a collection when they are
the same.
Performance considerations and implementation details
When searching for raw speed in tight loops, it might be worth considering the
Java .equals() method, which removes some of the overhead of Clojure's = at the price of
being exposed to Java equality semantics. Here's, for example, what happens when
comparing longs:
(require '[criterium.core :refer [bench]])
(set! *warn-on-reflection* true) ; ❶
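The measured expressions are elided here; based on the callouts that follow, the comparison presumably looks similar to the following sketch (the exact bindings are an assumption, and timings depend on your hardware):

```clojure
(let [a (Long. 1000)
      b (Long. 1000)]
  (bench (= a b))               ; Clojure equality on two boxed longs
  (bench (.equals ^Long a b))) ; ❷ Java equality, type hinted to avoid reflection
```

The ^Long hint on the first argument is what prevents the reflective call that *warn-on-reflection* would otherwise report.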
❶ Setting the warning on reflection dynamic variable to true shows any use of .equals that forces the
compiler to use reflection. We just want to make sure correct type hinting is in place to avoid unfair
benchmarking.
❷ Please note that timings are related to the specific hardware where the tests run, so if you try
the examples yourself these numbers may differ (although you should still see a speed improvement
using .equals).
As you can see there is around a 25% speed improvement by using .equals, although
there are trade-offs to consider:
1. We were forced to deal with explicit type hinting to avoid incurring reflection
penalties.
2. We are now forced to use boxed numbers (the uppercase java.lang.Long class instead of
the primitive long).
3. Java equality on boxed numbers results in surprises we need to be aware of, like (.equals
(Integer. 1) (Short. (short 1))) being false.
(==
([x])
([x y])
([x y & more]))
== is a specific equivalence operator for numbers. While = (single equal sign) is stricter
and only compares numbers if they are from the same numerical category (see = for a
definition of the categories), == can also compare across categories:
(== 1M 1.) ; ❶
;; true
(= 1M 1.) ; ❷
;; false
❶ A bigdec "1M" and a double "1." are the same for ==.
❷ = "single-equal" considers numbers as belonging to distinct categories. bigdec and double are not part
of the same category, hence = returns false.
CONTRACT
Contract
• == requires at least one argument. Calling == with a single argument (numeric or
not) always returns true.
• "x", "y" and any other arguments have to be numbers such that (number?
x) is true.
Notable exceptions
• ClassCastException if there is more than one argument and any argument is not a
numeric type.
Output
== returns:
• true if all arguments are numerically equivalent, even when their types differ,
provided there is a widening transformation from the less precise type to the more
precise one. The available transformations are governed by the Java Language
Specification rules for binary numeric promotion 119, ultimately delegating to
the equals() implementation of the first operand.
• false in all other cases.
Examples
One of the main reasons for Clojure to include both = and == operators is that they are
specialized for specific tasks without being mutually exclusive. == is best for numbers
because it respects the general notion that numeric equivalence is independent from
types or binary representation.
In the next example, we want to implement an exchange service to enable transactions
between buyers and sellers. Trades happen on different markets and each market
provides a slightly different API to list current buy/sell requests. We can match
requests through their stock symbols and create a transaction for each compatible pair.
Requests are compatible if their buying price is matched by an equivalent selling price.
Here’s what requests look like when they enter the system:
(def tokyo ; ❶
[{:market :TYO :symbol "AAPL" :type :buy :bid 22.1M}
{:market :TYO :symbol "CSCO" :type :buy :bid 12.4M}
{:market :TYO :symbol "EBAY" :type :sell :bid 22.1M}])
(def london ; ❷
[{:market :LDN :symbol "AAPL" :type :sell :bid 23}
{:market :LDN :symbol "AAPL" :type :sell :bid 22}
{:market :LDN :symbol "INTC" :type :sell :bid 14}
{:market :LDN :symbol "EBAY" :type :buy :bid 76}])
(def nyc ; ❸
[{:market :NYC :symbol "YHOO" :type :sell :bid 28.1}
{:market :NYC :symbol "AAPL" :type :buy :bid 22.0}
{:market :NYC :symbol "INTC" :type :buy :bid 31.9}
{:market :NYC :symbol "PYPL" :type :sell :bid 44.1}])
119
The Java Language Specification contains an extensive description of the possible conversions and
promotions: www.cs.cornell.edu/andru/javaspec/5.doc.html
Prices are ultimately a mix of integers, doubles and big-decimals, depending on the
market they come from. The service needs to compare numbers using mixed type
arithmetic, something that == is most suitable for:
(defn group-orders [& markets] ; ❶
(group-by :symbol (apply concat markets)))
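The remaining functions referenced by the callouts below (compatible?, matching and exchange) are elided here; based on those descriptions, a possible sketch is the following (the names come from the callouts, but the exact bodies are an assumption):

```clojure
(defn- compatible? [{type1 :type bid1 :bid} {type2 :type bid2 :bid}] ; ❷
  (and (not= type1 type2) ; one :buy and one :sell
       (== bid1 bid2)))   ; same price, whatever the numeric type

(defn- matching [requests] ; ❸
  (for [r1 requests r2 requests
        :when (compatible? r1 r2)]
    #{r1 r2}))

(defn exchange [grouped-orders] ; ❹
  (->> grouped-orders
       (mapcat (comp matching last)) ; ❺ ❻
       distinct))                    ; ❼
```

A call such as (exchange (group-orders tokyo london nyc)) would then surface the compatible buy/sell pairs across all markets.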
❶ group-orders is a helper function to aggregate listings from different markets grouping them by stock
symbol. The resulting map contains the symbol as a key and a list of requests as value.
❷ compatible? tells us if two requests match. The rules are: they need to be a buy/sell pair and have
the same amount. not= is used here to make sure the requests are not buy/buy or sell/sell pairs. == is
used to verify that the bid price is the same for both requests independently from the type. Notice that
both requests are coming in as maps that are destructured in the function definition.
❸ matching creates all possible permutations given a list of requests for the same stock symbol. for is
perfect to create permutations, including filtering through the compatibility rule applied directly using
the :when directive to remove unwanted pairs.
❹ exchange is the main entry function for the computation. It accepts requests from different markets
grouped by symbol and tries to match them.
❺ We need to access the last of each input pair. The first is the symbol each listing is grouped by. The
last element is instead the list of requests we want to match.
❻ We need to mapcat each of the results coming from matching because they are all returned
contained in their own sequence (which could also be empty). By concatenating we make sure they
are all flattened into the same sequence without "gaps".
❼ The distinct is necessary because matching returns matching pairs in both directions: #{order1
order2} and #{order2 order1} are both returned in the results if they are compatible. distinct gets
rid of this duplication (thanks to the = operator, which knows how to deal with unordered sets).
❽ We can finally see that an example request for $22 worth of Apple stock matches and can proceed to
trade.
0.1 ; ❶
;; 0.1
(BigDecimal. 0.1) ; ❷
;; 0.1000000000000000055511151231257827021181583404541015625M
❶ Typing "0.1" at the REPL produces a double type.
❷ The same number used to initialize a BigDecimal instance produces a much larger number of
decimals, revealing interesting facts about the concept of "precision".
The number 0.1 for example is printed as expected in its literal form, because rounding makes it
display correctly. As soon as we turn on full precision by creating a java.math.BigDecimal, we can see
exactly what is stored inside the 64 bits. Even without using big decimals, we can see the
rounding problems propagate with a simple operation:
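The elided operation could be any simple double arithmetic; a classic illustration (an assumption, not necessarily the book's exact example) is:

```clojure
(+ 0.1 0.2)
;; 0.30000000000000004
```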
This fundamental imprecision is the reason why currencies shouldn’t be represented as "float" or
"double". A more exact representation can be achieved with a wrapper like a big decimal, which
ultimately stores an arbitrary-precision unscaled value plus a scale inside each BigDecimal instance.
See also:
• = is a more general equality operator (since it operates on types other than
numbers) but at the same time more tailored around Clojure data structures (especially
for using collections as keys in maps and sets). You should use the = single equal operator
in mixed contexts (numbers and other types) or in any other non-numerical case.
• identical? verifies if two objects are the same by checking if they live at the same
address in memory (following Java semantic). You should use identical? mainly
in Java inter-operation scenarios requiring reference equality.
120
The floating point entry in Wikipedia is a nice summary of everything related to this tricky
subject: en.wikipedia.org/wiki/Floating_point
(quick-bench (= 1 1 1 1 1)) ; ❶
Execution time mean : 86.508844 ns
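The matching measurement for == (elided above) would simply swap the operator; the absolute timing depends on your hardware:

```clojure
(quick-bench (== 1 1 1 1 1)) ; ❷
```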
❶ Both = and == have a catch-all variable arity argument to deal with more than 2 arguments.
❷ == is just a bit faster than =.
As suggested for =, some speed-up is possible by using Java interop
with .equals directly on the number (with some trade-offs to consider; please
check = for more details).
< ("less than" or "lt"), > ("greater than" or "gt"), <= ("less than or equal" or "lte") and >=
("greater than or equal" or "gte") are common operators in many languages. They work
by assuming the existence of an order relationship between the given arguments,
returning true if this relationship holds for all the arguments and false otherwise.
Usage of ordering predicates is ubiquitous. In Clojure, they accept only one or more
numeric arguments:
(< 0 (byte 1) 2 2.1 3N 4M 21/2) ; ❶
;; true
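The second example referenced by ❷ below is elided; a plausible reconstruction is:

```clojure
(< 0 1 2 2M) ; ❷
;; false
```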
❷ Note that since (== 2 2M), < is not satisfied for all arguments. <= would instead return true in
this case.
CONTRACT
Contract
• When a single parameter "x" is present, "x" can be of any type (it is not restricted
to just numbers). (< "a") is for example a valid expression
returning true, while (< "a" "b") is not allowed and throws an exception.
• When 2 or more parameters are present, arguments must be numeric,
that is (number? x) must be true for all of them. Integers, decimals, ratios, big
integers and big decimals are all accepted, as well as other Java numeric types
(like AtomicLong).
Notable exceptions
• clojure.lang.ArityException when calling with no arguments.
• java.lang.ClassCastException if there is more than one argument and any of
them is not a number (that is, (number? x) returns false).
Output
• true when there is only one argument (of any type).
• true when there are 2 or more arguments and the ordering relationship holds for
all of them (see below).
• false in any other case.
The order relationships that <, >, <= and >= are designed to verify are defined as
follows:
• Strictly monotonically increasing <: the arguments taken from left to
right are always in a relationship that goes from the smaller to the bigger (but
never equal).
• Monotonically increasing <=: same as before, but also allows the arguments to be
equal (using the same semantics as == for numeric equality).
• Strictly monotonically decreasing >: the arguments taken from left to
right are always in a relationship that goes from the bigger to the smaller (but
never equal).
• Monotonically decreasing >=: same as before, but allows the arguments to be
equal (using the same semantics as == for numeric equality).
Examples
The fact that <, >, <= and >= accept more than two arguments comes in handy to check if a
quantity belongs to a range. We’ve seen this already in other chapters throughout the
book:
• In the game of life example when talking about for, we saw how to constrain the
permutations in search for neighbor cells by using (<= 0 x' (dec w)). In this
expression the quantity to check, x', sits between 0 and (dec w). The expression
only allows values of x' that are greater than or equal to zero and at the same time
less than or equal to (dec w). This is the same as typing: (and (>= x' 0) (<= x' (dec w))).
• A comparison predicate is often present in loop-recur or recursion in general. This
is a consequence of the presence of a condition to exit the loop. You can see an
example of this in the Fibonacci sequence implementation included in fn.
In general <, >, <= and >= can be used to verify if a collection of numbers is ordered:
(apply < [2.1 4 5.2 8 124 9012 1e4]) ; ❶
;; true
Another notable use of <, >, <= and >= is in conjunction with sorting operations
like sort or sort-by. Assuming the input contains numbers, we can for
example sort in reverse by using an ordering predicate:
(sort > (range 10)) ; ❶
;; (9 8 7 6 5 4 3 2 1 0)
❶ Passing > as the comparator sorts the range in descending order.
In the following example we want to consume the elements from a sequence until they
reach some threshold, an operation that can be readily accomplished using drop-while and
a comparison predicate:
(drop-while
(partial > 90) ; ❶
(shuffle (range 100))) ; ❷
;; (96 23 46 18 61 84 60 83 56 32 38 54 87...) ; ❸
❶ With partial we can fix the threshold above which elements are returned.
❷ We simulate random numbers with shuffle.
❸ The first element in this sequence is always a number bigger than 90.
See also:
• “compare” returns -1, 0 or 1 to signal the relative order of the arguments. Use
it when you need different actions based on the ordering of the
operands. “compare” also works for other types like strings, provided they are
"comparable".
• “reverse” can be used to invert the order of a collection.
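The snippets referenced by the callouts below are elided; they presumably contrast an unhinted and a hinted version of the same comparison, along these lines (a sketch, the actual functions in the book may differ):

```clojure
(defn lt-obj  [a b] (< a b))              ; ❶ a and b compiled as Object
(defn lt-long [^long a ^long b] (< a b))  ; ❷ a and b are primitive longs
```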
❶ The first example does not contain type hints. The compiler does not know what happens at run-time
and is forced into the conservative approach of considering the arguments of type Object. The
relevant bytecode is shown below.
❷ The second example contains type hints, so the compiler can infer long types. The resulting bytecode
shows that the generated method call can be now more specific about the type (resulting in a speed-
up).
Similarly to the single or double equal operators, <, >, <= and >= consume processing time
linearly with the number "n" of arguments. Any pair of inputs for which
the predicate is false immediately terminates the evaluation, avoiding a walk of the entire
sequence.
An easy improvement has been proposed on the Clojure Jira board (visible
at dev.clojure.org/jira/browse/CLJ-1912) to speed up predicates for more than 2
arguments. < implementation is fairly easy to follow, so let’s have a look at the
proposed improvement by first looking at the current version:
(defn <
{:inline (fn [x y] `(. clojure.lang.Numbers (lt ~x ~y)))
:inline-arities #{2}
:added "1.0"}
([x] true)
([x y] (. clojure.lang.Numbers (lt x y))) ; ❶
([x y & more]
(if (< x y)
(if (next more) ; ❷
(recur y (first more) (next more)) ; ❸
(< y (first more)))
false)))
The improvement removes the double evaluation of (next more) that happens when
the iteration needs to move forward toward the end of the sequence when there are
more than 2 arguments. It can be fixed easily by introducing a let:
(defn new<
{:inline (fn [x y] `(. clojure.lang.Numbers (lt ~x ~y)))
:inline-arities #{2}
:added "1.0"}
([x] true)
([x y] (. clojure.lang.Numbers (lt x y)))
([x y & more]
(if (< x y)
(let [nmore (next more)] ; ❶
(if nmore
(recur y (first more) nmore)
(< y (first more))))
false)))
❶ The let statement now prevents the double evaluation of (next more).
We can see the benefit using Criterium to benchmark the new function:
(require '[criterium.core :refer [bench]])
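The benchmark itself is elided; it presumably pits the two versions against each other along these lines (a sketch; timings are omitted because they depend on your hardware):

```clojure
(bench (apply < (range 100)))    ; current implementation
(bench (apply new< (range 100))) ; version without the double (next more)
```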
Despite not being a huge improvement on an absolute scale, easy changes like the
above can make an appreciable difference in very tight loops. The same change could
be applied to all the other predicates, including == and =.
6.4 compare
function since 1.0
(compare [x y])
compare is one of the options Clojure offers to compare values. The main difference
with comparison predicates like =, ==, <, >, <=, >= is that compare returns a
java.lang.Integer (-1, 0 or 1, or more generically a negative, zero or positive
number) to indicate that "x" is less than, the same as or more than "y" respectively. As a
consequence an exhaustive conditional expression over compare requires 3
branches:
(let [c (compare 1 2)]
(cond
(neg? c) "less than" ; ❶
(zero? c) "equal"
(pos? c) "more than"))
;; "less than"
❶ An example showing the 3 branches necessary to cover all possible results of compare.
A function that given two arguments returns an integer to indicate relative order is also
called a "comparator". compare is the default comparator in Clojure (if no other is
given) in functions like sort or sort-by. compare also works for types that are not
necessarily numbers or even custom types, provided they implement the
java.lang.Comparable interface 121. Many Java classes already implement
Comparable, so compare can be used for example on java.lang.String,
java.util.Calendar and many others:
(import 'java.util.GregorianCalendar)
(def t1 (GregorianCalendar/getInstance)) ; ❶
(def t2 (GregorianCalendar/getInstance)) ; ❷
(compare t1 t2)
;; -1
❶ ❷ Two calendar instances created one after the other: t1 holds an earlier timestamp than t2, hence the negative result.
Similarly, Clojure implements the Comparable interface for several Clojure internal
types:
• Vectors created with vector or the [] literal syntax (clojure.lang.PersistentVector).
• Keywords (clojure.lang.Keyword).
• Ratios (clojure.lang.Ratio) like the literal 2/3.
• Refs (clojure.lang.Ref) created with ref.
• Symbols (clojure.lang.Symbol).
Each Clojure type provides a specific interpretation of compare and some of them are
not obvious. For example, refs are compared based on their creation order (which should
mostly be treated as an implementation detail):
121
Please see docs.oracle.com/javase/7/docs/api/java/lang/Comparable.html about Comparable in Java
❶ ref are compared independently of their content based on creation order. This is an implementation
detail used in transactions to establish precedence of ref updates.
We are going to see a few examples of how compare works for internal Clojure types
and other Java types in the extended examples below.
CONTRACT
Input
"x" and "y" are mandatory. "x" can be compared with "y" when:
• nil is one of the arguments (or both).
• (instance? java.lang.Number x) and (instance? java.lang.Number y) are
both true.
• (instance? java.lang.Comparable x) is true (there is one generalization to this
case: when (identical? x y) is true, "x" can be of any type).
• The implementation of compareTo() provided by "x" (if any) allows comparison
with an instance of type "y". compareTo() is the method required by
the java.lang.Comparable interface.
Notable exceptions
• clojure.lang.ArityException: compare requires exactly 2 arguments.
• java.lang.ClassCastException: when "x" doesn’t implement
the java.lang.Comparable interface (but not when (identical? x y) is true).
• Any other exceptional condition specific to the implementation
of compareTo() provided by "x".
Output
• A negative, zero or positive java.lang.Integer depending on how "x" and "y"
compare. A negative number if "x" is less-than "y", 0 if "x" equals "y", positive
otherwise.
(compare nil (/ -1. 0)) ; ❶
;; -1
❶ An extreme example of the results compare can provide. The number (/ -1. 0) is negative
infinity, and compare reports that nil is even smaller.
Examples
Let’s start with some simple examples that better explain some of the peculiarities
of compare described in the contract section. nil is accepted as a possible argument and
is always considered the "smallest" value, even smaller than negative infinity:
(def -∞ (/ -1. 0))
(map compare [nil nil "a"] [-∞ nil nil]) ; ❶
;; (-1 0 1)
❶ nil always sorts before any other value, including negative infinity; compare returns 0 only when both
arguments are nil.
When the two arguments are the same (as in identical?, which means they are the same
Java object) compare returns 0. In this example we are apparently comparing ranges,
but compare returns zero because they are the same object instance:
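The elided example might look like the following (the binding name is arbitrary):

```clojure
(let [r (range 10)]
  (compare r r)) ; same object instance on both sides
;; 0
```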
(instance? java.lang.Comparable (range 10)) ; ❶
;; false
❶ Ranges do not implement Comparable at all: comparing two distinct ranges would throw a ClassCastException.
The final decision about how to compare the arguments is delegated to the specific
implementation of compareTo() defined in the type of the first argument (when
available). For example vectors are compared first by size and then by juxtaposing
each pair of items:
(compare [1 1 1 1] [2 2 2]) ; ❶
;; 1
(compare [1 2 4] [1 2 3]) ; ❷
;; 1
❶ The first vector contains 4 elements, while the second only 3. This is equivalent to (compare
4 3), ignoring the content altogether.
❷ Provided the size is the same, the first pair is compared. If they are equal, the second pair is
compared and so on, until the first pair that is not equal or the end of the vector. The last pair [4 3] is
the one producing the result.
Clojure strings are java.lang.String instances, thus providing the same Comparable semantics:
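The elided examples can be reconstructed from the callouts that follow:

```clojure
(compare "a" "z")      ; ❶
;; -25
(compare "abc" "abcz") ; ❷
;; -1
```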
❶ "a" and "z" are in ascending order in the ASCII table, with "a" appearing first determining the negative
result. There are 25 letters in between.
❷ The two strings are of different sizes with "abc" substring of "abcz". Their length is compared.
Clojure keywords and symbols behave like strings with the addition that if they are
namespace qualified, then the namespace string comparison takes precedence:
(map compare [:a :my/a :a :my/a :abc123/a] ; ❶
[:z :my/z :my/a :a :abc/a ])
;; (-25 -25 -1 1 3)
❶ When both keywords are namespace qualified, the comparison of the namespaces takes precedence,
following the same rules as for strings and ignoring the name part of the keyword. When neither keyword
is namespace qualified, the string comparison happens on the keyword name only. If the first
keyword is not qualified but the second is, the result is always -1. If the second keyword is namespace
qualified but the first is not, then the result is always a positive 1.
❷ Exactly the same applies to symbols, which keywords internally wrap.
In the next example we are going to provide Clojure with a way to sort custom types
using our own definition of comparison. We want to know which is the closest gas station given
an origin location where our car is currently located. Both the gas station and the
concept of location are modeled with a defrecord declaration and enhanced
with our own version of compareTo to make them comparable. We assume the
locations are on a two-dimensional plane for simplicity 122:
(defn- sq [x] (* x x))
122
The formula to calculate the distance between two geographical locations would likely be different, for example the Haversine
formula we saw in the post office problem
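The definitions themselves are elided here; reconstructed from the callouts below, a possible sketch is (the exact bodies in the book may differ):

```clojure
(defn- distance [x1 y1 x2 y2] ; ❶
  (Math/sqrt (+ (sq (- x1 x2)) (sq (- y1 y2)))))

(defrecord Point [x y distance-origin] ; ❷
  Comparable
  (compareTo [this other]              ; ❸
    (compare (distance-origin this)
             ((:distance-origin other) other))))

(defn relative-point [x y origin-x origin-y] ; ❹
  (->Point x y (fn [p] (distance (:x p) (:y p) origin-x origin-y))))

(defrecord GasStation [brand location] ; ❺
  Comparable
  (compareTo [this other]              ; ❻
    (compare (:location this) (:location other))))
```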
❶ distance calculates the euclidean distance between two points using the classic formula.
❷ We define a Point as a record containing the expected coordinates (x,y) plus an additional
function. distance-origin is used to calculate the distance of this location from "the origin", an
arbitrary selected point from which all other distances are calculated.
❸ Our implementation of compareTo uses compare on the result returned by distance-origin which is
a number. Any other implementation obeying the contract of returning an integer (negative, zero or
positive) would do.
❹ relative-point contains the logic to build a new point. Our points all require a function to
calculate a distance given the coordinates. relative-point creates a new location which also
embeds the information about the "origin", a special location describing where the user doing the
search is located.
❺ The second defrecord definition is used to describe a GasStation which contains a brand (or name of
the company selling gas) and where it is located.
❻ Gas stations can also be compared. A gas station is "less than" another gas station when it is closer to the
origin and vice-versa. compareTo is implemented by calling compare on the locations of
the two gas stations. A location is represented as a Point object, so the compareTo implementation of
Point will be invoked to compare gas stations.
By providing a compareTo logic to both GasStation and Point objects, we are now
able to sort gas stations based on the position where we are located:
(def gas-stations
(let [x 3 y 5] ; ❶
[(->GasStation "Shell" (relative-point 3.4 5.1 x y))
(->GasStation "Gulf" (relative-point 1 1 x y))
(->GasStation "Exxon" (relative-point -5 8 x y))
(->GasStation "Speedway" (relative-point 10 -1 x y))
(->GasStation "Mobil" (relative-point 2 2.7 x y))
(->GasStation "Texaco" (relative-point -4.4 11 x y))
(->GasStation "76" (relative-point 3 -3 x y))
(->GasStation "Chevron" (relative-point -2 5.3 x y))
(->GasStation "Amoco" (relative-point 8 -1 x y))]))
❶ Our coordinates are used to create the gas stations objects. The relative-point constructor takes
care of creating a Point that is related to an origin.
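The sorting call described by ❷ (elided above) would simply be:

```clojure
(sort gas-stations) ; ❷
```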
❷ sort works without the need for a specific comparator. compare will be used by default and dispatched
at run-time to the provided compareTo implementations.
(compare Double/NaN 1)
;; 0
(compare 1 Double/NaN)
;; 0
(compare Double/NaN Double/NaN)
;; 0
compare always returns 0 when Double/NaN is present, with the consequence that NaN compares as
equal to any other number. Quite confusing, especially when we remember that (== Double/NaN
Double/NaN) is instead false. To add to the confusion consider the following:
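The elided snippet likely shows sort at work on collections containing NaN; for instance (an assumption, not necessarily the book's exact vectors):

```clojure
(sort [3 Double/NaN 1 2]) ; NaN compares as 0 against any neighbor,
(sort [1 2 Double/NaN 3]) ; so the result depends on element order
```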
sort by default uses compare as its comparator and, since compare of NaN is always 0, different results
are produced based on the relative ordering of the elements appearing before NaN in the vector. This
behavior could be especially problematic if we wanted to use collections of numbers as keys in a hash-
map (or set), resulting in unexpectedly distinct keys after sorting. The lesson to take away is to
refrain from using floating point numbers as keys: first because of potential rounding problems, and even
more so if they can be NaN.
See also:
• comparator, given a function of two arguments, returns a wrapper that translates
the function's results into -1, 0, or 1. Use it when you need a comparator-like
function and you already have a predicate-like function whose results can be
translated into integers.
• identical? verifies if two objects are the same by checking if they live at the same
address in memory (following Java semantic). You should use identical? mainly in
Java inter-operation scenarios requiring reference equality.
• = or == should be used if you are only interested in knowing if two arguments are
the same, without specific interest in their relative order relationship.
Performance considerations and implementation details
The performance profile of compare depends on the types the comparison is applied
to. The following table is a quick summary of what to
expect when comparing the most common Clojure and Java types:
Table 6.1. Various performance profiles for compare based on the type of the arguments.
In general, you’ll have to pay attention to the potential linear scanning of very big
vectors or strings.
6.5 identical?
function since 1.0
(identical? [x y])
identical? has a few specific use cases related to object identity and should not be used as a general-purpose equality operator. A typical use case is a sentinel value 123, an object whose identity marks a special condition, as in the following producer-consumer example:
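The setup referenced by callouts ❶ and ❷ below is elided; it presumably looks similar to the following (the choice of queue class is an assumption):

```clojure
(import 'java.util.concurrent.LinkedBlockingQueue)

(def channel (LinkedBlockingQueue.)) ; ❶ blocking queue connecting the threads
(def SENTINEL (Object.))             ; ❷ a unique object marking the end of input
```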
(defn encode [] ; ❸
(let [e (.take channel)]
(if (identical? SENTINEL e)
(println "done")
(do (println (hash e))
(recur)))))
123
Please read the Wikipedia article about sentinels for more background: en.wikipedia.org/wiki/Sentinel_value
(defn start [] ; ❹
(let [out *out*]
(.start (Thread.
#(binding [*out* out]
(encode))))))
(do
(start)
(.offer channel :a)
(.offer channel (Object.)) ; ❺
(.offer channel SENTINEL)
(.offer channel :a)) ; ❻
;; -2123407586
;; 1430420663
;; done
❶ We use a blocking queue to orchestrate the communication between the producer thread (the REPL
thread in this case) and the consumer. The server thread can then run in an infinite loop waiting for
input. The take operation on the queue is blocking, so the server waits for at least one element to be
present on each iteration.
❷ The sentinel is a generic java.lang.Object instance defined in the current namespace.
❸ encode contains a loop to examine the next event offered by the blocking queue. The loop stops as
soon as the sentinel is identified. Since any object could be sent by the producer, we use identical?
rather than = (single equal) to eliminate the risk of confusing another object instance with our sentinel.
❹ start is dedicated to preparing the two threads for correct communication, for example making sure
that they both use the same standard output.
❺ We can send any kind of event to the channel (even another object instance) because we are sure
nothing can return true when compared with identical? if not the sentinel object itself.
❻ But when we actually send the SENTINEL object, the loop exits as the "done" message shows.
Additional offers to the channel won’t print the “hash” code anymore.
On the other hand, other data literals are not subject to interning, or are proper global constants:
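The elided examples, reconstructed from the callouts that follow:

```clojure
(identical? () ()) ; ❶ the empty list is a singleton
;; true
(class 2/1)        ; ❷ the reader normalizes 2/1 to a plain long
;; java.lang.Long
```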
❶ The empty list literal is the only instance of the empty list inside a running JVM.
❷ The clojure.lang.Ratio literal 2/1 is read and stored internally as a long: (class
2/1) is java.lang.Long. Ratios in general are not interned.
Interning doesn’t cover all possible long and char literals, but just the most used: between -128
and 127 for longs, and the 0-127 ASCII range for chars. On the other hand interning works for all
string literals. Complete string interning is done under the assumption that an average program
contains a limited number of strings in its source code. Strings created during the application
lifetime are of course not interned.
The following example shows a collection of literals that are not identical, either because they are out
of range of their interning capabilities or because they don’t support interning at all:
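The elided collection of non-identical literals might look like this (a sketch consistent with the callout below):

```clojure
(identical? 'foo 'foo) ; ❶ symbols are never interned
;; false
(identical? 200 200)   ; outside the -128..127 long cache
;; false
(identical? 1.5 1.5)   ; doubles are never cached
;; false
```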
❶ A collection of data values that are not interned. Note for instance that symbols are not interned.
Finally, notice that as soon as a native type is used to explicitly create a new instance of the
corresponding reference type, interning is no longer possible, because Java doesn’t have a chance to
look into the cache to return the interned instance:
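The elided example presumably calls the wrapper constructor directly:

```clojure
(identical? (Long. 100) (Long. 100)) ; ❶ the constructor always allocates
;; false
```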
❶ By wrapping the number 100 with the corresponding constructor we explicitly ask the JVM to
create a new instance of the number, ignoring any interned option.
To give Java the option to look into the interned cache of numbers, we need to
use Long/valueOf instead of the constructor. This is exactly the mechanism Java uses to
transform a native parameter into a reference, for example when passing arguments to methods:
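The elided example, reconstructed from the callout below:

```clojure
(identical? (Long/valueOf 100) (Long/valueOf 100)) ; ❶ 100 is in the cache
;; true
```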
❶ Long/valueOf is interning-aware.
The last interesting behavior of identical? discussed here is related to "boxing", the informal
name given to the action of wrapping a native data type with the corresponding full-fledged class. Let’s
compare the following applications of identical? where the arguments are bound
as vars or as let locals respectively:
(def a 1000)
(def b a)
(identical? a b) ; ❶
;; true
(let [x 1000 y x]
(identical? x y)) ; ❷
;; false
❶ On creation of the var "a", the primitive 1000 is passed to the constructor
of clojure.lang.Var as (Long/valueOf 1000), because a var holds a
generic java.lang.Object. Note that 1000 does not belong to the pool of interned integer constants.
Clojure automatically de-references any usage of "a" from now on, including the definition of the second
var "b", passing the same Long instance contained in "a" to "b". Since 1000 was already
transformed into a reference type, it doesn’t need boxing again. As expected, comparing the
var "a" to "b" with identical? reports that we are talking about the same Long instance.
❷ The same number 1000 is now used in a let block without var indirection. "y" is assigned "x" again
but as a native primitive value. We would expect the same equivalence as before but this is not
happening, showing that two independent instances of 1000 have been built.
In the second case, "x" is a native type without a var wrapping it. In the generated code, the primitive
long 1000 is passed as an argument to clojure.lang.Util/identical?(Object x, Object
y), forcing Java to box the primitive into a reference java.lang.Long twice, once for "x" and once for "y".
This is equivalent to the following rewrite:
(let [x 1000 y x]
(identical? (Long/valueOf x) (Long/valueOf y))) ; ❶
;; false
As you can see from the examples, identical? is subject to a lot of exceptional behavior you should
be aware of. Clojure equality (=), when possible, should be the preferred option for equivalence testing.
See also:
• Use compare when you are interested in comparing quantities (not references) and
respecting the relative order of the operands.
• = is the most generic and flexible of the comparison operators. It’s not the best
choice for mixed numbers comparison, but it works on collections and other
Clojure data types.
• == is the operator dedicated to numerical equivalence. Use it when you are
interested in comparing numerical quantities instead of references.
Performance considerations and implementation details
6.6 hash
NOTE This section also mentions other related functions such as: mix-collection-hash, hash-
ordered-coll and hash-unordered-coll.
(hash [x])
(hash-ordered-coll [coll])
(hash-unordered-coll [coll])
(mix-collection-hash [hash-basis count])
NOTE Java comes with its own hashing algorithm accessible through the hashCode() method on
each object instance. Clojure supports Java hashing requirements by
implementing hashCode() on its own collections. However, Java hashing falls short of certain
idiomatic Clojure scenarios, such as using collections as keys in associative data structures.
This is one of the main reasons Clojure provides its own hashing function.
CONTRACT
Contract
• hash: "x" is the only required argument. It can be any type including nil.
• hash-ordered-coll and hash-unordered-coll: "coll" is the only mandatory
argument. It has to implement the java.lang.Iterable interface.
• mix-collection-hash: "hash-basis" and "count" are both required arguments of
type long.
Notable exceptions
• NullPointerException: when passing nil to hash-ordered-coll or hash-
unordered-coll.
Output
• hash returns a java.lang.Integer between -2^31 and 2^31-1. When "x" is a
number, a string or implements clojure.lang.IHashEq, the output is consistent
with the Clojure hashing implementation. For all other types, hash delegates
to .hashCode() from java.lang.Object.
• hash-ordered-coll and hash-unordered-coll return a java.lang.Long
between -2^31 and 2^31-1 (the same range as integers, but with a final cast to long).
• mix-collection-hash returns a java.lang.Long: the final collection hash obtained by
mixing the bits of "hash-basis", where "count" is the number of elements included in the basis.
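These contract rules can be quickly verified at the REPL; a minimal sketch:

```clojure
(hash nil)                 ; nil is a legal input for hash
;; 0

(class (hash [1 2 3]))     ; hash always returns a (boxed) 32-bit integer
;; java.lang.Integer

(try (hash-ordered-coll nil)           ; nil is rejected here, as per the contract
     (catch NullPointerException _ :npe))
;; :npe
```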
Examples
The last changes to the hash function (including the introduction of hash-ordered-
coll, hash-unordered-coll and mix-collection-hash) are relatively recent. Before
Clojure 1.6, it was possible to produce inefficient programs by using composite keys in
maps. The inefficiency was the result of frequent collisions in such situations 124. To
understand why Clojure needs its own hashing, have a look at the following example:
(def long-keys [-3 -2 -1 0 1 2])
(def composite-keys [#{[8 5] [3 6]} #{[3 5] [8 6]}])
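The collisions described in the callouts below can be reproduced directly with .hashCode (a sketch added here for clarity):

```clojure
(map #(.hashCode %) long-keys)      ; ❶ upper and lower 32 bits XORed together
;; (2 1 0 0 1 2)

(map #(.hashCode %) composite-keys) ; ❷ both sets sum their elements' hashCodes
;; (2274 2274)
```

Negative and positive longs collide pairwise, and the two different sets produce exactly the same hashCode.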
❶ We can see the effect of Java hashCode() on numbers of type long. Java simply combines the upper
and lower bits of a 64-bit long integer to shrink them to the required 32-bit size. But in doing so, it
creates some evident collisions between negative and positive numbers.
❷ Another problem with hashCode() manifests on small collections with repeating patterns of items (a
common case with Clojure). The sets presented here collide if we use hashCode().
Java hashCode() produces relatively frequent collisions, for example on longs, vectors or
sets, when used in algorithms or data structures that require hashing. hash improves on Java by taking
these factors into account:
(map hash long-keys) ; ❶
;; (-1797448787 -1438076027 1651860712 0 1392991556 -971005196)
124
A nice summary on the topic of hashing in Clojure is available on the Clojure mailing list
(groups.google.com/d/msg/clojure-dev/lWXYrjaDuIc/WE_LUtll7VgJ), written by Mark Engelberg
❶ hash improves hashing on longs compared to Java by removing the effect of simple compression from
64 to 32 bits.
❷ Similarly, hash has been extended to Clojure collections producing evenly distributed hash numbers.
Clojure collections use the same hashing function provided by hash internally, so we
are free to use numbers or small collections as keys without the risk of generating
frequent collisions. When it comes to interoperation scenarios though, we need to be
careful:
(import 'java.util.ArrayList)
(def k1 [1 2 3]) ; ❶
(def k2 (ArrayList. [1 2 3]))
(= k1 k2)
;; true
{k1 :a k2 :b} ; ❷
;; IllegalArgumentException: Duplicate key [1, 2, 3]
(count (hash-map k1 :a k2 :b)) ; ❸
;; 2
❶ In this interoperation scenario with Java, we have two collections with the same content but different
types: one is a vector, the other a java.util.ArrayList.
❷ We are unable to create an array-map, because it does not use hashing to check for the presence of
keys. It uses Clojure equality, which correctly claims the two collections are the same.
❸ We are instead able to create a hash-map, because the two collections use different hashing
algorithms and thus appear as different keys.
The different behavior between array-map and hash-map is consistent with their design
goals: ArrayList uses a different hashing algorithm than Clojure collections. Similar
behavior should be expected for other custom types (not necessarily
collections) that don't implement the clojure.lang.IHashEq interface.
If we wanted to use Clojure hashing on collections consistently with their content (but
independently from their type), we could use hash-ordered-coll or hash-unordered-
coll. With the "unordered" version, the output hash doesn’t change when the order of
the content changes. This could be useful to mix different collection types as keys in a
Clojure hash-map while guaranteeing hash consistency:
(import 'java.util.ArrayList)
(import 'java.util.HashSet)
(defn hash-update [m k f] ; ❶
(update m (hash-unordered-coll k) f))
(def m (hash-map))
(-> m ; ❷
(hash-update [1 2 3] (fnil inc 0))
(hash-update k1 (fnil inc 0))
(hash-update k2 inc))
;; {439094965 3}
❶ hash-update is a small wrapper around the normal update for maps. Instead of using the
given key directly, hash-update first calls hash-unordered-coll on it.
❷ The hash-map "m" is repeatedly updated using hash-update. Despite the very different collections used
as keys, they all update the same value (instead of creating new keys).
The example above implies that hash-unordered-coll generates the same key for
different collection types provided they have the same content in any order:
(=
(hash-unordered-coll [1 2 3]) ; ❶
(hash-unordered-coll [3 2 1])
(hash-unordered-coll #{1 2 3}))
;; true
(=
(hash-ordered-coll [1 2 3]) ; ❷
(hash-ordered-coll [3 2 1])
(hash-ordered-coll #{1 2 3}))
;; false
❶ We can verify that hash-unordered-coll generates the same hash number for diversely ordered
collections.
❷ If we need ordering to determine different hashing numbers, we can use hash-ordered-coll.
In the next example we are going to design a Clojure-compatible hashing function that
works on java.util.HashMap, enabling us to compare hashing across a mix of Clojure and
Java maps. We can do this by iterating the map and summing the hash of each
key-value pair. There is however one last problem to deal with, something that
Clojure implements for us in hash-unordered-coll and that we need to replicate.
©Manning Publications Co. To comment go to liveBook
Hashing algorithms often suffer from a problem related to how many of the bits in the
hash change when the input is altered. A good hashing algorithm produces an
"avalanche effect": each change in the input causes at least half of the bits in
the output to change (possibly all, although this is not always achievable in practice). The
avalanche effect is best achieved with one last step that "mixes" the bits to maximize
the changes. Since we are implementing our own hashing algorithm, we also
need to call mix-collection-hash explicitly as the last step:
(defn hash-java-map [^java.util.Map m]
(let [iter (.. m entrySet iterator)] ; ❶
(loop [ret 0 cnt 0]
(if (.hasNext iter)
(let [^java.util.Map$Entry item (.next iter) ; ❷
kv [(.getKey item) (.getValue item)]]
(recur
(unchecked-add ret ^int (hash kv)) ; ❸
(unchecked-inc cnt)))
(.intValue ^Long (mix-collection-hash ret cnt)))))) ; ❹
❶ To iterate the java.util.HashMap we need to go through its EntrySet first, which is an iterable
object.
❷ An Iterator object is stateful and advances each time we call .next on it.
❸ Note that to hash a java.util.HashMap$Entry we need its key and value components to form
a vector.
❹ After summing all hashed map pairs, we call mix-collection-hash to ensure a good avalanche
effect.
❺ hash called on java.util.HashMap results in a different number than the same called
on clojure.lang.PersistentArrayMap.
❻ But if we use hash-java-map we enable Clojure-style hashing on Java map objects.
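The comparisons the callouts ❺ and ❻ describe can be sketched along these lines (the map contents here are illustrative and `hash-java-map` is the function defined above):

```clojure
(import 'java.util.HashMap)

(def jm (doto (HashMap.)    ; a Java map with the same content as {:a 1 :b 2}
          (.put :a 1)
          (.put :b 2)))

;; ❺ hash on a HashMap delegates to Java's hashCode-based algorithm,
;; so it differs from the Clojure hashing of an equal Clojure map:
(= (hash jm) (hash {:a 1 :b 2}))

;; ❻ hash-java-map applies Clojure-style hashing to the Java map instead:
(= (hash-java-map jm) (hash {:a 1 :b 2}))
```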
6.7 clojure.data/diff
function since 1.3
(diff [a b])
(require '[clojure.data :refer [diff]]) ; ❶
(diff {:a "1" :b "2"} ; ❷
      {:b "2" :c "4"})
;; ({:a "1"} ; ❸
;; {:c "4"} ; ❹
;; {:b "2"}) ; ❺
❶ Note that diff is not part of the clojure.core namespace and thus is not available by default.
❷ diff takes two arguments and returns a sequence of 3 elements.
❸ The first item in the result is what is present only in the first argument and not present in the second.
❹ Likewise, the second item in the result is what is present only in the second argument but not in the
first.
❺ The third and final item in the result is what is common between the two arguments, if any.
diff works across all Clojure data structures and scalar types with some limitations
that are going to be illustrated in this chapter.
CONTRACT
Contract
• "a" and "b" are the two mandatory arguments. They can be of any type
including nil.
Notable exceptions
• clojure.lang.ArityException: when the number of arguments is not exactly 2.
Output
A sequential collection (list or vector) of 3 elements at index 0, 1 or 2. The resulting
triplet contains:
• [a b nil] when "a" and "b" don’t have anything in common or when the types of
"a" and "b" are not compatible (see below).
• [only-in-a only-in-b common-items] when "a" and "b" have something in
common and their types are compatible.
For collections, assuming "a" is of type A and "b" of type B, A and B are compatible
if:
• They are both Java arrays (such that (.isArray (class a)) and (.isArray
(class b)) are both true).
• They are both java.util.Set (such that (instance? java.util.Set a) and
(instance? java.util.Set b) are both true).
• They are both java.util.List (such that (instance? java.util.List a) and
(instance? java.util.List b) are both true). This is what makes lists and
vectors compatible.
• They are both java.util.Map (such that (instance? java.util.Map a) and
(instance? java.util.Map b) are both true).
For scalars (any other type that is not a container), diff follows the compatibility
rules of = (single equal).
Once it is established that "a" and "b" are compatible for diff, the result contains:
• At index 0: an element of type A that contains all items (and sub-items) only
present in "a" but not in "b". When "a" and "b" don’t have anything in common,
the index-0 element contains "a" itself.
• At index 1: an element of type B that contains all items (and sub-items) only
present in "b" and not in "a". When "a" and "b" don’t have anything in common,
the index-1 element contains "b" itself.
• At index 2: the elements common to both "a" and "b" as a list or vector, or a
single nil in case they don’t have anything in common.
The example section contains examples of the most interesting diff applications.
WARNING Any interleaving nil occurrence in the resulting triplet should be ignored, as it could be the
result of diff internal processing and not an actual occurrence of nil in the input
arguments. diff is thus not well equipped to handle input with explicit nil or empty
collections, as it would be problematic to tell them apart from missing elements in the
resulting triplet.
Examples
Let’s start with a list of small examples to show how diff behaves in relation to the
type categories illustrated in the contract section. Remember to require clojure.data
if you want to use diff like the examples below:
(diff 1.0 1) ; ❶
;; [1.0 1 nil]
(diff [1 "x" 3 4] ; ❷
'(1 "y" 3 5))
;; [[nil "x" nil 4]
;; [nil "y" nil 5]
;; [1 nil 3]]
diff is a powerful tool to compare nested data structures and obtain immediate
feedback about how they differ. Services written in Clojure, for example, often
produce JSON or EDN output with arbitrary nesting of data structures. If we
implemented changes to such a service and wanted to be sure not to introduce any
regression, we could compare the output of the new service with the old one and check for
differences. Some of them might be expected; others aren't.
In the following example, a service is returning metadata about Clojure libraries and
their dependencies. Here’s a sample response from the live service for project "prj1":
(def orig
{:defproject :prj1
:description "the prj"
:url "https://theurl"
:license {:name "EPL"
:url "http://epl-v10.html"}
:dependencies {:dep1 "1.6.0"
:dep2 "1.0.13"
:dep6 "1.7.5"}
125
For this bug afflicting diff on sorted-map see dev.clojure.org/jira/browse/CLJS-1709
We now make the same request to the new service, that is just different code using the
same database/infrastructure:
(def new-service
{:defproject :prj1
:description "the prj"
:url "https://theurl"
:license {:name "EPL"
:url "http://epl-v10.html"}
:dependencies {:dep1 "1.6.0"
:dep2 "1.0.13"
:dep6 "1.7.5"}
:profiles {:uberjar {:main 'some.core :aot :all}
:dev {:dependencies {:dep8 "1.6.1"}
:plugins {:dep9 "3.1.1" :dep11 {:id 13}}}}})
They apparently look the same, but how can we be sure? To make our life easier, we’d
like to use some automation to extract all the paths where the two data structures differ,
without any nesting or actual values. diff can do the heavy lifting, we just need to
build on top of it:
(require '[clojure.data :refer [diff]])
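The definitions behind the callouts below are elided here. One possible implementation consistent with them (the names walk-diff, flatten-paths and diff-to-path come from the callouts; the exact bodies are a reconstruction):

```clojure
(require '[clojure.data :refer [diff]])

(defn walk-diff ; ❶ ❷
  "Walks one element of the diff triplet, building the
   path of keys leading to each difference."
  [d path]
  (if (map? d)
    (map (fn [[k v]] (walk-diff v (conj path k))) d)
    path))

(defn flatten-paths ; ❸ ❹
  "Removes the nested lists produced by walk-diff, keeping
   only the path vectors found in a tree-seq depth-first walk."
  [w]
  (filter vector? (tree-seq seq? seq w)))

(defn diff-to-path ; ❺
  "Entry point: diffs the two inputs and extracts the
   paths at which they differ."
  [orig other]
  (flatten-paths (walk-diff (first (diff orig other)) [])))
```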
❶ walk-diff is a function that knows how to parse the results coming back from diff. We are
interested in creating a path of hash-map keys in the form of a vector, like [:a :b :c] for each
difference that was found by diff. To do so we need to walk diff results recursively and go deeper
every time a new hash-map is found.
❷ If what is presented as argument d is a hash-map, we know there are differences, and we follow
through each key to find how deep they go. We call walk-diff recursively, passing the path found
so far as the second argument.
❸ flatten-paths helps us clean the final output, removing any unnecessary nesting of lists that
contain a single vector path. This is necessary because walk-diff recursively generates a
nested list for each map invocation.
❹ “tree-seq” is another great resource in the Clojure standard library. tree-seq transforms an arbitrarily
nested sequence into a tree, of which it returns a depth-first walk. We can use it here to produce a tree
whose nodes are the vector paths we want to extract.
❺ diff-to-path is our entry point. It takes orig and other arguments to compare with diff. We
take the first element of the diff triplet (taking the second would work equally well) and pass it
through walk-diff. As explained before, walk-diff output needs to be cleaned of all the cluttering
nested lists surrounding the paths.
(get-in orig ; ❸
[:profiles :dev :dependencies :dep8])
;; "1.6.3"
❶ From the output of diff we can see that there are indeed differences. It could take some time to find
out what's different in the original output though, especially if there are many more differences.
❷ diff-to-path produces an alternative view of the differences. We can quickly see that there are 2 of
them and where they are located in the input.
❸ Here's how we can get-in with one of the paths to see what the differing value is.
See also:
diff is the most sophisticated option of all the functions in this chapter. But
it might be overkill if you need a simple operator or equality predicate. All the
following alternatives sit at a lower abstraction level and are more specialized:
• = is what diff is based on. Use = to test simple conditions and where deeply
nested equality is not the point.
• compare offers a way to verify multiple comparable conditions at once.
• == is equality for numbers and it should be used in all those cases where numbers
are primarily involved.
Other functions in the standard library could be used to "navigate" data structures
if diff is not doing what you need:
• clojure.walk/walk can iterate over tree-like data structures. The behavior to
execute when a node is found can be easily customized.
• “tree-seq” was seen in action in this chapter. It doesn’t provide a way to execute
behavior when a node is found, but it produces a depth-first walk that can be later
processed.
Performance considerations and implementation details
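The definitions of generate and blow are not reproduced here; a reconstruction consistent with the callouts below might look like the following (the exact bodies are a guess):

```clojure
(require '[clojure.data :refer [diff]])

(defn generate ; ❶ e.g. (generate 3) => {0 {1 {2 2}}}
  [n]
  (reduce (fn [acc k] {k acc}) (dec n) (reverse (range n))))

(defn blow ; ❷ retries diff on shallower maps until it stops overflowing
  [n]
  (loop [i n]
    (when (pos? i)
      (let [a (generate i)
            b (assoc-in a (range i) :x)] ; differ only at the deepest level
        (try
          (diff a b)
          (catch StackOverflowError _
            (println "StackOverflow at" i "deep.")
            (recur (- i 50))))))))
```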
(blow 700)
;; StackOverflow at 700 deep.
;; StackOverflow at 650 deep.
;; StackOverflow at 600 deep.
;; StackOverflow at 550 deep.
;; StackOverflow at 500 deep.
;; ... from here diff starts working ; ❸
;; correctly from the bottom of the stack
❶ generate generates a nested map n-levels deep. For example (generate 3) generates the hash-
map {0 {1 {2 2}}}.
❷ blow repeatedly calls diff with gradually shallower maps (50 levels shallower each time),
waiting for the point at which the StackOverflowError stops appearing. The generated
maps a and b differ in their very last nested map, forcing diff to walk the entire structure
to find out.
❸ Since diff first walks the structure to find the first place where the a-branch differs from the b-
branch using only = (single equal), the stack is the first resource to be consumed and the exception
is thrown immediately. As soon as the data structure is small enough to fit inside the stack, the
actual diff computation starts, and it might take minutes to end.
As you can see, you need very deep data structures to start suffering from stack
size problems (between 450 and 500 levels deep), and there is also a good chance
you'll run out of heap space before reaching the end of the stack. Fortunately, real-life
data structures are unlikely to be that deep and, if they are, you likely have bigger
performance problems to solve first.
This chapter touches on these aspects and illustrates the different functions involved.
Some extended examples are included in call-out sections. These are:
7.1 Reducers
Reducers were introduced in Clojure 1.5. Their implementation can be found in
the clojure.core.reducers namespace (which needs to be required before use).
Reducers contain a wrapper layer on top of the Java fork-join framework (a model for
parallelism introduced in Java 1.7) and a set of collection-processing
functions with the same names as the ones in core: map, filter and reduce, to name a
few. Compared to the ones in clojure.core, they create a "recipe" for processing that
is only executed when calling reduce:
(require '[clojure.core.reducers :as r]) ; ❶
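The intermediate steps building apply-to-input are elided here; one recipe consistent with the final result (the exact chain is a guess) could be:

```clojure
(require '[clojure.core.reducers :as r])

(def apply-to-input                        ; a recipe: nothing is computed yet
  (r/map inc (r/filter odd? (range 10))))  ; filter odds, then increment
```

Only the final reduce at ❻ below executes the filter and the map, in a single pass over the input.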
(reduce + apply-to-input) ; ❻
;; 30
Reducers also introduce some new vocabulary that the reader should be aware of:
• A "reducible" collection is a collection that provides reduce with a custom
implementation. If the collection implements the coll-reduce protocol
(clojure.core.protocols/CollReduce), reduce delegates the iteration to the collection itself instead of
using the generic mechanism. For example, (range 10) is a reducible (sequential)
collection. (r/map inc (range 10)) is also a reducible collection, although it does not
exhibit other typical properties of a collection.
• A "reducing function" is a function of 2 arguments that can be used in
a reduce operation (for example, +).
• A "reducer" is a function that, when invoked on a reducible collection, returns a
"reducible transforming collection". For example, (r/map inc (range 10)) is a
reducible transforming collection, because a reduce operation on this collection
applies the inc transformation as part of the reduction itself.
Their semantics are the same as the related functions in clojure.core, so the rest of the
chapter is mainly dedicated to the functions specific to
Reducers: fold, reducer, monoid, folder, foldcat, cat and append!. To avoid
confusion with same-named functions in clojure.core, these functions are often prefixed
with r/ (the conventional alias for clojure.core.reducers).
7.1.1 fold
function since 1.5
(fold
([reducef coll])
([combinef reducef coll])
([n combinef reducef coll]))
In its simplest form, fold takes a reducing function (a function supporting at least 2
arguments) and a collection. If the input collection supports parallel folding
(currently vectors, maps and foldcat objects), fold splits the input collection into chunks
of roughly the same size and executes the reducing function on each chunk in
parallel (on multiple CPU cores when possible). It then combines the results back
into a single output:
(require '[clojure.core.reducers :as r]) ; ❶
(r/fold + (into [] (range 1000000))) ; ❷
;; 499999500000
❶ Reducers are bundled with Clojure, but they need to be required before use.
❷ fold splits the 1-million-element vector into chunks of roughly 512 elements each (the default). Chunks are
then sent to the fork-join thread pool for parallel execution, where they are reduced by +. The partial results
are subsequently combined, again with +.
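The optional "n" argument from the 4-argument arity controls the chunk size; a sketch of the same computation with explicit chunking:

```clojure
(require '[clojure.core.reducers :as r])

;; same parallel sum, but with chunks of about 100000 elements
;; instead of the default 512
(r/fold 100000 + + (into [] (range 1000000)))
;; 499999500000
```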
fold offers parallelism based on "divide and conquer": chunks of work are created and
computation happens in parallel while, at the same time, finished tasks are combined
back into the final result. The following diagram illustrates the journey of a collection
going through a fold operation:
An important mechanism that fold implements (the diagram can't show this clearly
without being confusing) is work-stealing. After fold sends a chunk to the Java fork-
join framework, each worker could further split the work into smaller pieces,
generating a mix of smaller and larger chunks. When free, a worker can "steal" work
from another 126. Work-stealing improves over basic thread-pooling, especially for less
predictable jobs that keep one or more threads unexpectedly busy.
CONTRACT
Input
The contract is different based on the presence of the optional "combinef" function and
whether the input collection is a map:
• "reducef" is a mandatory argument. It must be a function supporting at least 2
arguments (and a 0-argument call when "combinef" is not provided). The 2-argument
call implements the canonical reduce contract, receiving an accumulator
and the current element. The 0-argument call is used to establish the seed for the
126
The Fork-Join model for parallel computation is a complicated subject that can’t be illustrated in this book. If you want
to know more, please read the following paper by Doug Lea, the author of Fork-join in
Java: gee.cs.oswego.edu/dl/papers/fj.pdf
❶ r/monoid is a helper function to create a function suitable for "combinef". The first argument
to r/monoid is the merge function to use when two pieces are combined together. We want to sum the
counts for the same word, something we can do with merge-with.
❷ "reducef" needs to assoc every word into the results map "m". Two cases are possible: either the word already
exists and its count gets incremented, or the word doesn't exist and 0 is used as the initial count.
❸ "coll" needs to be a vector so we make sure the input is transformed with into. The transformation of
each line includes the creation of a tuple (vector of 2 items) with the word and the number 1. We
use r/map from the reducers library for this, so the transformation is deferred to parallel execution.
NOTE Project Gutenberg files are unfortunately not available in certain countries. In that case any
other large text file could replace the examples in the book.
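The word-count listing the callouts above refer to is not reproduced here; a self-contained sketch along the same lines (splitting an in-memory string on whitespace instead of reading a Project Gutenberg file; the name word-frequencies is hypothetical) could be:

```clojure
(require '[clojure.core.reducers :as r])
(require '[clojure.string :as str])

(defn word-frequencies [s]
  (r/fold
    (r/monoid #(merge-with + %1 %2) (constantly {}))  ; ❶-style combinef
    (fn [m [word n]]                                  ; ❷-style reducef
      (assoc m word (+ n (get m word 0))))
    (into [] (r/map #(vector % 1)                     ; ❸-style [word 1] tuples
                    (str/split s #"\s+")))))

(word-frequencies "to be or not to be")
;; {"to" 2, "be" 2, "or" 1, "not" 1}
```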
fold also works natively on maps. We could use the freqs map produced before as input
for another fold operation, for example to see the relationship between the first
letter of a word and its frequency in the book.
The following example groups words by their initial letter and then calculates their
average frequency. This operation is a good candidate for parallel fold, since the input
contains thousands of keys (one for each word found in the input text):
(defn group-by-initial [freqs] ; ❶
(r/fold
(r/monoid #(merge-with into %1 %2) (constantly {})) ; ❷
(fn [m k v] ; ❸
(let [c (Character/toLowerCase (first k))]
(assoc m c (conj (get m c []) v))))
freqs))
(defn update-vals [m f] ; ❹
(reduce-kv (fn [m k v] (assoc m k (f v))) {} m))
(defn avg-by-initial [m] ; ❺
  (update-vals m #(double (/ (reduce + %) (count %)))))
(defn most-frequent-by-initial [freqs] ; ❻
  (->> freqs
       group-by-initial
       avg-by-initial
       (sort-by second >)
       (take 5)))
(most-frequent-by-initial freqs) ; ❼
;; ([\t 41.06891634980989]
;; [\o 33.68537074148296]
;; [\h 28.92705882352941]
;; [\w 26.61111111111111]
;; [\a 26.54355400696864])
❶ group-by-initial uses fold, expecting a hash-map from strings to numbers. The output is a much
smaller map from letters to vectors. The number of keys in this map is equal to the number of letters in the
alphabet (assuming the text is large enough and we filtered out numbers and symbols). The letter "a"
in this map maps to something like [700 389 23 33 44], the occurrences of each word in
the book starting with the letter "a".
❷ The combining function is assembled using r/monoid. The initial value for each reducing operation is
the empty map {}. Partial results are combined by key, merging their vector values
into a single vector.
❸ The reducing function takes three parameters: a map of partial results "m", the current key "k" and the
current value "v". Similarly to the word-frequency count, we fetch a potentially existing key (using an
empty vector as the default value) and conj the current value "v" onto it. The key is the initial letter of
each word found in the input map.
❹ update-vals takes a map and a function "f" of one parameter. It then applies "f" to every value in the
map using “reduce-kv”.
❺ avg-by-initial replaces each vector value in a map with the average of the numbers found in it.
❻ most-frequent-by-initial orchestrates the functions seen so far to extract the most frequent
words by initial.
❼ freqs is the result of the word count from earlier in the example.
After running most-frequent-by-initial we can see that the letter "t" is on average
the most used at the beginning of a word, closely followed by "o", "h", "w" and "a".
This indicates that words starting with the letter "t" are on average the most repeated
throughout the book (while some other word not starting with "t" might be, in
absolute terms, the most frequent).
like inc or str would probably be overkill for fold parallelism, so we are going to use the Leibniz
formula to approximate "Pi" instead (we already encountered this formula while talking about “filterv”).
We would like to execute the transformation on each key in parallel.
The design of the parallel execution is as follows: instead of splitting the values into chunks, we are
going to split the keys. The values corresponding to each partition are transformed in parallel by separate
threads. No clashing would normally happen (as keys are unique), but fork-join is a work-stealing
algorithm, so a partition could be routed to a thread where another partition has been assigned,
generating an overlap. This is the reason why we
need java.util.concurrent.ConcurrentHashMap instead of a plain java.util.HashMap.
(import 'java.util.concurrent.ConcurrentHashMap)
(require '[clojure.core.reducers :as r])
(defn pi ; ❶
  "Pi Leibniz formula approx."
  [n]
  (->> (range)
       (filter odd?)
       (take n)
       (map / (cycle [1 -1]))
       (reduce +)
       (* 4.0)))
(defn large-map [i j] ; ❷
(into {}
(map vector (range i) (repeat j))))
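The definitions of a-large-map, combinef and reducef are not shown in this listing; a reconstruction consistent with the callouts below (the map sizes are illustrative, and pi and large-map are the functions defined above) might be:

```clojure
(import 'java.util.concurrent.ConcurrentHashMap)

;; a mutable Java map seeded with the content built by large-map
(def a-large-map
  (ConcurrentHashMap. ^java.util.Map (large-map 100000 800)))

(defn combinef [m] ; ❸ every thread updates the same mutable map
  (fn ([] m)
      ([m1 m2] m1)))

(defn reducef [^ConcurrentHashMap m k v] ; ❹ doto makes .put return the map
  (doto m (.put k (pi v))))
```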
(dorun ; ❺
(r/fold
(combinef a-large-map)
reducef
a-large-map))
;; IllegalArgumentException No implementation of method: :kv-reduce
❶ pi calculates an approximation of the π value. The greater the number "n", the better the
approximation. Relatively small numbers in the order of the hundreds already generate an expensive
computation.
❷ large-map builds the content for the large ConcurrentHashMap used in our example.
The map keys are increasing integers while the value is always the same.
❸ combinef with no arguments returns the base map, the one all threads should update concurrently.
There is no need for concatenation, as the updates happen on the same
mutable ConcurrentHashMap instance. So combinef with two arguments just returns one of the two
(they are the same object). combinef could effectively be replaced by (constantly m).
❹ reducef replaces an existing key with the calculated "pi". Note the use of “".", ".." and doto”
so that Java operations like .put, which would otherwise return nil, return the map itself.
❺ fold is unsuccessful, as it searches for a suitable implementation of reduce-kv, which is not found.
We are facing the first problem: fold fails because two polymorphic dispatches are
missing. fold doesn't have a specific parallel version
for java.util.concurrent.ConcurrentHashMap, so it routes the call to reduce-kv. reduce-
kv also fails, because there is an implementation for the Clojure hash-map but not for
the Java ConcurrentHashMap. As a first step, we could provide a reduce-kv implementation
to remove the error, but this alone is not enough to run the transformations in parallel:
(extend-protocol ; ❶
clojure.core.protocols/IKVReduce
java.util.concurrent.ConcurrentHashMap
(kv-reduce [m f _]
(reduce (fn [amap [k v]] (f amap k v)) m m)))
(time ; ❷
(dorun
(r/fold
(combinef a-large-map)
reducef
a-large-map)))
;; "Elapsed time: 41113.49182 msecs"
(extend-protocol r/CollFold ; ❷
java.util.concurrent.ConcurrentHashMap
(coll-fold
[m n combinef reducef]
(foldmap m n combinef reducef)))
(time ; ❸
(dorun
(into {}
(r/fold
(combinef a-large-map)
reducef
a-large-map))))
;; "Elapsed time: 430.96208 msecs"
After extending the CollFold protocol from the clojure.core.reducers namespace, we can see
that fold effectively runs the update of the map in parallel, cutting the execution time considerably. As a
comparison, here is the same operation performed on a persistent hash-map, which is parallel-enabled by default:
(time
(dorun
(r/fold
(r/monoid merge (constantly {}))
(fn [m k v] (assoc m k (pi v)))
a-large-map)))
;; "Elapsed time: 17977.183154 msecs" ; ❶
❶ We can see that although the Clojure hash-map is parallel-enabled, the fact that it is a persistent data
structure works against fast concurrent updates. This is not a weakness of Clojure data structures,
as they are designed with a completely different goal in mind.
See also:
• pmap also applies a transformation function to an input sequence in
parallel. fold and pmap have commonalities, but they differ in the computational
model. pmap supports laziness and has a variable number of workers (dependent
on the collection chunk size and the number of available cores plus 2). However,
before moving on to the next chunk in the sequence, pmap has to wait for all
workers in the current chunk to finish. Less predictable operations (those keeping
a worker busy longer than usual) effectively prevent pmap from reaching full concurrency. fold,
on the other hand, allows a free worker to help a busy one deal with a longer-
than-expected task. As a rule of thumb, prefer pmap to enable lazy processing
on predictable tasks, but use fold in less predictable scenarios where laziness is
less important.
⇒ O(n) linear
fold is implemented to recursively split a collection into chunks and send them to the
fork-join framework, effectively building a tree in O(log n) passes. However, each
chunk is subject to a linear reduce that dominates the logarithmic traversal: the bigger
the initial collection, the more calls to the reducing function, making the overall
behavior linear.
Orchestration of parallel threads has a cost that should be taken into account when
executing operations in parallel: like pmap, fold performs optimally for non-trivial
transformations on potentially large datasets. The following simple operation, for
example, results in a performance degradation when executed in parallel:
(require '[criterium.core :refer [quick-bench]])
(require '[clojure.core.reducers :as r])
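The benchmark itself is elided here; a sketch of the kind of comparison intended (the exact timings depend on the machine, so none are reported):

```clojure
(require '[criterium.core :refer [quick-bench]])
(require '[clojure.core.reducers :as r])

(def v (into [] (range 100000)))

(quick-bench (reduce + v))   ; sequential sum
(quick-bench (r/fold + v))   ; parallel sum: thread coordination costs
                             ; dominate such a trivial reducing function
```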
As the collection gets bigger, the computation more complicated and more cores become
available, fold starts to outperform a similar sequential operation. A potential
performance boost alone is still not enough to justify the use of fold, since other variables
come into play, such as memory requirements.
fold is designed to be an eager operation, as the chunks of input are further segmented
by each worker to allow an effective work-stealing algorithm. fold operations like the
examples in this chapter need to load the entire dataset in memory before starting
execution (or as part of the execution). When fold produces results that are
substantially smaller than the input, there are ways to prevent the entire dataset from
loading into memory, for example by indexing it on disk (or a database) and including in the
reducing function the necessary IO to load the data. This approach is used, for example,
in the Iota library 127, which scans large files to index their rows and uses that index as the
input collection for fold.
7.1.2 reducer and folder
function since 1.5
127
The Iota library README explains how to use the library: github.com/thebusby/iota
Both reducer and folder take a collection and a function of one argument. They
enhance their input collection with a custom reduce implementation (and
additionally fold in the case of folder), as directed by the "xf" argument. Here's an
example of a collection enhanced by reducer and one by folder:
(require '[clojure.core.reducers :as r]) ; ❶
(def divisible-by-10 ; ❷
  (fn [rf]
    (fn [result input]
      (if (zero? (mod input 10))
        (rf result input)
        result))))
(into [] ; ❸
(r/reducer
(range 100)
divisible-by-10))
;; [0 10 20 30 40 50 60 70 80 90]
(r/fold ; ❹
(r/monoid merge (constantly {}))
(fn [m k v] (assoc m k (+ 3 v)))
(r/folder
(zipmap (range 100) (range 100))
(fn [rf] (fn [m k v] (if (zero? (mod k 10)) (rf m k v) m)))))
;; {0 3, 70 73, 20 23, 60 63, 50 53, 40 43, 90 93, 30 33, 10 13, 80 83}
❶ Both reducer and folder live inside the reducers namespace. You need to require the namespace
before use.
❷ divisible-by-10 is an example of a transformation on a reducing function. reducer transforms the
input collection using divisible-by-10 as the new reducing behavior. divisible-by-10 verifies whether the
current element is divisible by 10 and applies the current reducing function only in that case.
❸ “into” is used here to show how the collection is now transformed. into is implemented on top
of reduce, which is why the transformation takes place. divisible-by-10 has the same effect as
filtering the input collection.
❹ folder works similarly for fold. This is demonstrated using a hash-map as input. folder
instruments the reducing function so that only keys that are multiples of 10 pass through. Another
reducing function is present to increment every value by 3. The two are eventually composed
together.
The input function "xf" in both reducer and folder gets a chance to intercept the
current call to the original reducing function and potentially alter the results. reducer
and folder are useful in the definition of custom reducers (and they are used
extensively inside the implementation of reducers themselves). The following table is a
summary of the available reducers and their foldable behavior:
The table shows that, apart from r/take-while, r/take and r/drop, all other standard
reducers are foldable. The practical effect is that using any of the three
non-foldable reducers prevents parallelism during a fold. Please refer to the
call-out section after the examples to see how non-foldable reducers can also be
enabled in parallel contexts.
CONTRACT
Input
• "coll" is any collection supported by seq, which excludes transients and
deprecated structs.
• "xf" is a function of 1 argument returning a function of 2 arguments. "x" stands for
"transforming" while "r" means "reducing". reducer invokes "xf" in the context of
a reduce call, passing the original reducing function. "xf" returns a function of 2
arguments as per reduce contract. reducer replaces the original reducing function
with the new reducing function returned by "xf".
Output
• reducer returns "coll" enhanced with additional behavior as dictated by the "xf"
argument that will apply in the context of a reduce call.
• folder applies the same changes as reducer, but also includes enhancing the input
collection in the context of a fold operation. folder only intercepts "reducef"
behavior, not "combinef" (please see “fold” contract).
Examples
dedupe is a function in the standard library that removes consecutive occurrences of
the same item in a collection. We want to create a version, reducer-dedupe, that
plays nicely with the other transforming functions in the clojure.core.reducers
namespace. To do so, we are going to use reducer to wrap the given collection with
some additional behavior:
(require '[clojure.core.reducers :as r])
(defn reducer-dedupe [coll]
  (r/reducer coll ; ❶
    (fn [rf]
      (let [prior (volatile! ::none)] ; ❷
        (fn [result input]
          (let [p @prior]
            (vreset! prior input)
            (if (= p input) ; ❸
              result
              (rf result input))))))))
(into [] ; ❹
  (reducer-dedupe
    [1 1 3 3 1 3 1 3 5 5 1 3 5 1 1 3 5 7 1 3 3 5 7 7])) ; any vector with consecutive duplicates
;; [1 3 1 3 1 3 5 1 3 5 1 3 5 7 1 3 5 7]
❶ reducer wraps the incoming collection and indicates how the reduction process should be altered,
passing a function of one argument "rf" (the original reducing function, for example “conj” when
“into” is the caller).
❷ Our reducer-dedupe needs to remember the previous element at each invocation of the reducing
function. We need to store state, and since the state is local to the function that wraps the
reduction, we can use volatile! (an atom would also work, but it would introduce additional complexity
regarding thread isolation in a concurrent context).
❸ Each reducer in the chain, including reducer-dedupe, decides what to do with the next
transformations. In the case of reducer-dedupe, the next transformation happens only if the current
element is not a duplicate.
❹ We can now use reducer-dedupe similarly to the other reducer functions.
In the previous example, reducer was used to enhance the input collection. reducer
does not provide a fold implementation, so our reducer-dedupe prevents
parallelism (without any warning) for vectors or maps when they are the input of a
fold operation:
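A sketch of such a check (hypothetical code, using r/reducer with a pass-through function in place of reducer-dedupe; each element is tagged with the name of the thread that processes it):

```clojure
(require '[clojure.core.reducers :as r])

;; Wrapping the vector in r/reducer loses the fold implementation:
;; r/fold silently falls back to a single-threaded reduce.
(def sequential-only
  (r/map (fn [x]
           (print (str (.getName (Thread/currentThread)) " "))
           x)
         (r/reducer (vec (range 100)) identity)))

(r/fold + sequential-only)
;; every print comes from the calling thread: no parallel fold
```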
❶ r/map function prints the thread signature for each element in the collection, thus showing on which
thread the reduction is happening.
❷ The prints on the screen are coming from the main thread, confirming that reducer-dedupe is
preventing parallel fold.
NOTE Reducers like the reducer-dedupe implemented in our example are also called "stateful",
because they need to propagate information between invocations of the reducing function (the
reducers namespace already contains the stateful reducers r/take and r/drop). Stateful reducers
typically define a local variable of type volatile! or an atom.
The darker squares in the diagram are the target for removal from the head of each partitioned
collection. We could implement the same behavior sequentially with the following:
(->> (vec (range 1600)) ; ❶
(partition 200)
(mapcat #(drop 10 %))
(reduce +))
;; 1222840
❶ A collection of 1600 numbers is split into 8 partitions of 200 items each. Standard drop is used to
remove the first 10 items on each partition. The numbers are finally summed up together.
Our first attempt naturally follows the idea of using folder instead of reducer to wrap
the drop operation. That enables parallelism straight away, but it comes with a surprising result:
(defn pdrop ; ❶
[n coll]
(r/folder
coll (fn [rf]
(let [nv (volatile! n)]
(fn
([result input]
(let [n @nv]
(vswap! nv dec)
(if (pos? n)
result
(rf result input)))))))))
(distinct ; ❷
(for [i (range 1000)]
(->> (vec (range 1600))
(pdrop 10)
(r/fold 200 + +))))
;; (1279155 1271155 1277571)
❶ pdrop is a custom reducer using r/folder to define a specific foldable behavior. When
executed, pdrop doesn’t propagate reduction for the first "n" elements, effectively ignoring them in the
final result.
❷ We now try to fold over the 1600 numbers, defining a 200 chunk size. To show the inconsistency,
a for repeats the same operation 1000 times. As we can see, the result is not only different from the
expected 1222840, but also changes randomly.
The parallel-enabled pdrop returns inconsistent results. The reason is how the volatile! closes
over the reducing function. Each fork-join task is created around a standard reduce, which in turn uses
our enhanced reducing function. The problem is that each reduce task sees the same volatile "nv", and
each task could run on a separate thread. Depending on which thread reads "nv" and when, we drop
fewer or more items than expected. Even assuming we could use isolated counters on each
thread, work-stealing could migrate a chunk to another thread with a different counter condition.
One solution is to initialize state at every reduce invocation instead of at reducer creation. To do this,
we need to create our own folding algorithm (very similar to the one found in the standard
library) and a revised version of pdrop that suspends the creation of the state until the execution point
inside the fork-join task. The only change necessary to the current r/foldvec reduce-combine algorithm
is to "unwrap" the additional function created around "reducef" before use:
(defn stateful-foldvec
[v n combinef reducef]
(cond
(empty? v) (combinef)
(<= (count v) n) (reduce (reducef) (combinef) v) ; ❶
:else
(let [split (quot (count v) 2)
v1 (subvec v 0 split)
v2 (subvec v split (count v))
fc (fn [child] #(stateful-foldvec child n combinef reducef))]
(#'r/fjinvoke
#(let [f1 (fc v1)
t2 (#'r/fjtask (fc v2))]
(#'r/fjfork t2)
(combinef (f1) (#'r/fjjoin t2)))))))
(defn pdrop
[dropn coll]
(reify ; ❷
r/CollFold
(coll-fold [this n combinef reducef]
(stateful-foldvec coll n combinef
(fn [] ; ❸
(let [nv (volatile! dropn)]
(fn
[result input]
(let [n @nv]
(vswap! nv dec)
(if (pos? n)
result
(reducef result input))))))))))
(distinct ; ❹
(for [i (range 1000)]
(->> (vec (range 1600))
(pdrop 10)
(r/fold 200 + +))))
;; (1222840)
❶ stateful-foldvec is copied from the private function foldvec inside the reducers namespace.
There is only one small change where reduce is invoked on a chunk: "reducef" is called with no
arguments to initialize the reducing function and remove the additional wrapping function.
❷ pdrop implements its own reify of the CollFold protocol. pdrop instructs fold to call the
new stateful-foldvec. The transformation of the reducing function happens when passing the last
argument to stateful-foldvec.
❸ The reducing function is wrapped in a "thunk" (a lambda of no arguments with the only goal of
delaying evaluation). The thunk is unwrapped at execution time by stateful-foldvec.
❹ We can see now the expected single result.
The idea for stateful reducer parallelization exposed above can be extended to transducers by wrapping
state initialization in a similar way. The need to change the standard r/foldvec function from
reducers remains, but we don’t need to include that in the new transducer implementation:
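A sketch of what drop-xform and stateful-folder could look like (the exact definitions may differ; stateful-foldvec is repeated from the previous example so the snippet is self-contained):

```clojure
(require '[clojure.core.reducers :as r])

;; Same stateful-foldvec as before: (reducef) unwraps the thunk
;; inside each fork-join task, so state is created per chunk.
(defn stateful-foldvec [v n combinef reducef]
  (cond
    (empty? v) (combinef)
    (<= (count v) n) (reduce (reducef) (combinef) v)
    :else
    (let [split (quot (count v) 2)
          v1 (subvec v 0 split)
          v2 (subvec v split (count v))
          fc (fn [child] #(stateful-foldvec child n combinef reducef))]
      (#'r/fjinvoke
       #(let [f1 (fc v1)
              t2 (#'r/fjtask (fc v2))]
          (#'r/fjfork t2)
          (combinef (f1) (#'r/fjjoin t2)))))))

;; drop transducer with the volatile! creation delayed by a thunk.
(defn drop-xform [n]
  (fn [rf]
    (fn []                       ; extra no-arg layer: fresh state per reduce
      (let [nv (volatile! n)]
        (fn [result input]
          (let [k @nv]
            (vswap! nv dec)
            (if (pos? k)
              result
              (rf result input))))))))

;; Route fold through stateful-foldvec instead of the built-in foldvec.
(defn stateful-folder [v]
  (reify r/CollFold
    (coll-fold [_ n combinef reducef]
      (stateful-foldvec v n combinef reducef))))
```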
(distinct ; ❸
(for [i (range 1000)]
(r/fold 200
+
((drop 10) +)
(vec (range 1600)))))
;; (1279155 1271155 1267155 1275155 1275145 ...)
(distinct ; ❹
(for [i (range 1000)]
(r/fold 200
+
((drop-xform 10) +)
(stateful-folder (vec (range 1600))))))
;; (1222840)
❶ The new version of the drop transducer is the same as the one in the standard library, except for a layer
of indirection introduced just before the state is initialized. This is just a lambda function with no
arguments that delays creation of the volatile! instance.
❷ To prevent r/fold from using the standard vector parallelization, we wrap the vector instance with
a reify call that swaps the base implementation for our new stateful-foldvec.
❸ We want to compare the differences between using drop transducer without any modification and our
version. As you can see here, drop transducer used in a parallel context shows multiple inconsistent
results.
❹ By contrast, the parallel-enhanced drop-xform shows the expected result consistently.
See also:
• reify is the main mechanism used by reducer and folder. Use reify on the target
collection in case you need additional control over the implementation of
the coll-reduce or coll-fold protocols.
Performance considerations and implementation details
❶ r/drop is used in this reducer chain without actually dropping any elements, just to show the effects on
computation. The elapsed time is around 45 seconds.
❷ The same operation (except for r/drop) now executes in parallel, greatly reducing execution time to
below 10 seconds.
A fold operation using r/take-while, r/take, r/drop or any custom reducer that uses
reducer to reify the input collection appears to work normally, but the reader
should be aware that there is no parallelism in that case.
7.1.3 monoid
function since 1.5
(r/fold
(r/monoid str (constantly "Concatenate ")) ; ❷
["th" "is " "str" "ing"])
;; "Concatenate this string"
Normal reduce has an additional parameter to pass an initial value for reduction. fold,
on the other hand, doesn’t offer this option (having many parameters already). If the
reducing function for a fold doesn’t provide the zero-argument call, monoid offers a
quick way to fix the problem without the need to use an anonymous function.
CONTRACT
Input
• "op" must be a function accepting two arguments and is a required argument.
• "ctor" must be a function accepting a zero-argument call and is a required
argument.
Notable exceptions
• ArityException or ClassCastException are typically seen when "ctor" is given as
a value (e.g. a number, or empty vector) instead of a function of no arguments:
(r/fold (r/monoid + 0) (range 10))
;; ClassCastException java.lang.Long cannot be cast to clojure.lang.IFn
To prevent this from happening, remember to use something like constantly to wrap
around the constant value.
Output
Returns a function of 2 arities that can be called with zero or two arguments. When
invoked with no arguments, it returns the result of invoking (ctor). When two
arguments are present, they are passed to "op" in order. This is equivalent to (op a
b) if "a" and "b" are the arguments.
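For instance, a function built by monoid behaves according to this contract:

```clojure
(require '[clojure.core.reducers :as r])

(def add (r/monoid + (constantly 0)))

(add)     ; zero arguments: invokes (ctor)
;; 0
(add 1 2) ; two arguments: delegates to op
;; 3
```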
Examples
monoid is mainly used to build the "reducef" or "combinef" argument for fold. When
processing a hash-map, for instance, fold often requires an empty hash-map to start
with:
(r/fold (r/monoid merge (constantly {})) ; ❶
(fn [m k v] (assoc m k (str v)))
(zipmap (range 10) (range 10)))
❶ monoid is used here to create a function that combines partial results with merge. Each reduce operation
that fold performs on a chunk is going to use the zero-argument arity provided by monoid (the empty
map).
Please also check out fold for another example of monoid used to calculate the word
frequencies for some long text.
• Identity element: there is a unique element in the set that, when used in the binary operation,
returns the other argument untouched.
• Associativity: applications of the binary operation to different subsets are order independent.
Let’s verify that natural numbers with "+" (binary operator) and "0" (identity element) are indeed a
monoid:
(+ 99 0) ; ❶
;; 99
(= (+ (+ 1 2) 3) ; ❷
(+ 1 (+ 2 3)))
;; true
❶ When the "0" identity element is used, the other argument is returned unchanged. This is tested with
one random number, but it holds for all other natural numbers by definition of addition with zero.
❷ We have here a subset of the natural numbers formed by "1,2,3". The application of "+" to "1,2" first and
"3" next is the same as "2,3" first and "1" next.
128
Please see the Wikipedia page for an introduction to the topic: en.wikipedia.org/wiki/Monoid
r/monoid has been named this way considering its applicative context. The zero-argument arity of the
reducing function in fold provides a way to bootstrap reduce with an initial element. The initial value can
optionally be the identity element for the reducing function (like "0" for "+"), thus providing an identity
transformation for the first element.
The name r/monoid reminds standard library users that the binary operation in fold should be
associative (especially when the reducing function is also used for concatenation), as it can potentially run
in parallel and execute in any order.
See also:
• completing has a similar goal to monoid. completing is used with
custom transducers to provide additional arities to the reducing function. The
single arity call in transducers is used to signal the end of the
reduction. completing provides a quick way to create all the required arities
around the main reducing function.
• fold as explained throughout the chapter, is the main use case for monoid to create
the combine function.
Performance considerations and implementation details
(foldcat [coll])
(cat ([])
([ctor])
([left right]))
(append! [acc el])
r/append! adds elements to a mutable collection, while r/cat joins partial results into a
tree. The orchestration of both effects is achieved by r/foldcat, which simply uses
r/cat as "combinef" and r/append! as "reducef":
(require '[clojure.core.reducers :as r])
(def input (r/map inc (into [] (range 1000))))
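A sketch of the two equivalent invocations described by the callouts:

```clojure
(require '[clojure.core.reducers :as r])

(def input (r/map inc (into [] (range 1000))))

;; ❶ explicitly passing r/cat and r/append! to fold
(def words-1 (r/fold r/cat r/append! input))

;; ❷ the equivalent shortcut
(def words-2 (r/foldcat input))

(= (seq words-1) (seq words-2))
;; true
```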
❶ This example shows using r/cat and r/append! explicitly with fold.
❷ The second example is equivalent to the first, showing the use of r/foldcat to achieve the same
result.
r/foldcat returns the root of the chunks produced by fold as a network of Cat objects
(note the uppercase "c" denoting a class name instead of the function). Cat nodes are
"counted" (they support the clojure.lang.Counted interface and can be counted in
constant time), reducible and foldable, so they can be used efficiently as further input
of a reduce or a fold.
CONTRACT
Output
• r/foldcat returns a java.util.ArrayList when the size of the input is below the size of
the requested chunk (512 by default), or a clojure.core.reducers.Cat object for larger
collections. The Cat type represents the node of a binary tree with left and right
children. The more chunks processed by fold, the deeper the tree.
• r/cat with no arguments returns an empty ArrayList. With one argument it returns
a new function that overrides the no-argument behavior using the result of
calling "ctor" without arguments. With two non-null arguments, it returns a
new clojure.core.reducers.Cat object with count equal to the sum of the counts
of the "left" and "right" objects.
• r/append! returns the result of calling the java.util.Collection/add method on
"acc" using "el" as the argument.
Examples
r/foldcat uses r/cat and r/append! internally to generate results. The following
example shows how we could process words from a large text using reducers and
r/foldcat:
(require '[clojure.string :as s])
(def text ; ❶
(-> "http://www.gutenberg.org/files/2600/2600-0.txt"
slurp
s/split-lines))
(def r-word ; ❷
(comp
(r/map #(vector % (count %)))
(r/map s/lower-case)
(r/remove s/blank?)
(r/map #(re-find #"\w+" %))
(r/mapcat #(s/split % #"\s+"))))
(def words ; ❸
  (r/foldcat (r-word text)))
(take 5 words)
;; (["the" 3] ["project" 7] ["gutenberg" 9] ["ebook" 5] ["of" 2])
❶ We fetch "War and Peace", a large book, from Project Gutenberg (a collection of literature classics).
We need to split the file into lines to create the initial vector for r/foldcat.
❷ r-word is a composition of reducers dedicated to text processing and cleanup. They are applied
bottom-up: lines are split into words, non-alphabetic characters and empty strings are removed, words
are converted to lower-case and finally a pair is formed from each word and its length.
❸ r/foldcat accepts the collection of lines wrapped in the reducers call.
If we inspect the previous results, we can see that "words" is not a normal collection (it
would have been an ArrayList for a file with fewer than 512 words):
(type words) ; ❶
;; clojure.core.reducers.Cat
(.count words) ; ❷
;; 565985
(.left words) ; ❸
;; #object[clojure.core.reducers.Cat 0x28e8dde3
"clojure.core.reducers.Cat@28e8dde3"]
(.right words) ; ❹
;; #object[clojure.core.reducers.Cat 0x1f6c9cd8
"clojure.core.reducers.Cat@1f6c9cd8"]
If we walk the tree all the way down to a leaf, we can find
a java.util.ArrayList instance created by invoking r/append! on each chunk:
(loop [root words cnt 0]
(if (< (count root) 512) ; ❶
(str (type root) " " (count root) " elems, depth: " cnt)
(recur (.left root) (inc cnt))))
❶ 512 is the default size of a chunk of computation in a fold operation. When the chunk size is below that
threshold, we know we are in front of a leaf of the tree.
❷ In this case we also know that the ArrayList found at that leaf has 321 words in it.
By knowing the depth of the binary tree, we also know the approximate number of
created nodes, corresponding to how many chunks of work were necessary to process
the initial vector of lines. The number of nodes at the lowest level in a binary tree is 2^k,
where k is the last level of the tree (counting from 0). In our example approximately
2^8 = 256 splits were created.
After looking at the words in the result, it’s easy to see that there are many duplicates:
(count (distinct (seq words))) ; ❶
;; 17200
❶ Note the use of seq on the Cat tree instance returned by r/foldcat. A Cat object supports count but
does not support nth (and many other sequential operations).
There are 17200 distinct words out of 565985, showing that the vast majority of the
words returned are duplicates. We can use the properties of java.util.HashSet to get rid
of the duplicates. r/cat has a single-argument call that allows swapping the internal
implementation of the mutable data structure, as long as it exposes an .add method:
(import 'java.util.HashSet)
(def words
(r/fold
(r/cat #(HashSet.)) ; ❶
r/append!
(r-word text)))
(count words) ; ❷
;; 185561
❶ r/cat accepts a function of no arguments. The function is used to initialize the data structure that
seeds the reduction on each parallel chunk. We can pass the HashSet constructor here.
❷ We count the result to see if the words are now distinct.
We see a surprising number when trying to count the result. This number is lower than
the total count of words, but nowhere near 17200. The reason is that although sets
contain unique elements before concatenation, they are concatenated as sequences,
potentially reintroducing duplicates. To fix this problem, we can walk the tree merging
individual HashSets back into a set, getting rid of duplicates in the process:
(defn distinct-words [words]
(letfn [(walk [root res] ; ❶
(cond
(instance? clojure.core.reducers.Cat root)
(do (walk (.left root) res) (walk (.right root) res))
(instance? java.util.HashSet root)
(doto res (.addAll root))
:else res))]
(into #{} (walk words (HashSet.))))) ; ❷
(count (distinct-words words)) ; ❸
;; 17200
❶ walk is a recursive function that merges the content of each leaf into a new HashSet instance. It starts
a new recursion for each left and right branch (when we are on a Cat node).
❷ The function encapsulates the mutable part of the merge, only returning a persistent data structure as
the last step.
❸ The number of distinct words is now the same that we found using a sequential distinct.
By iterating the tree structure produced by r/cat we can continue using a mutable data
structure to incrementally build results. Alternatively, r/cat is also reducible and
foldable, which means that we can further use reduce or even r/fold on the results,
producing another parallel computation:
(reduce + (r/map last words)) ; ❶
;; 1105590
(frequencies res))) ; ❸
❶ reduce can be used just fine on the result of r/foldcat or r/fold with r/cat as combine function.
Here we are summing up the length of all the words in the set.
❷ In this r/fold invocation, we use a custom r/cat constructor and a custom reducing function
(instead of the standard r/append!). Each chunk is processed in parallel to create
a StringBuilder instance, which is a fast way to concatenate a large number of strings.
❸ The results are iterated using the sequential interface provided by clojure.core.reducers.Cat to
create a map of frequencies. This operation and the following sort-by execute outside
the fold operation and thus sequentially.
The choice of accumulating results in mutable data structures while processing parallel
chunks should be guided by performance measurements. There are many parameters
impacting speed, like the size of the chunk, the amount of processing requested and the
final use of the produced output. These concerns are better addressed in the performance
section at the end of the chapter.
See also:
• fold is the core engine upon which r/foldcat, r/cat and r/append! operate. The
details covered in fold are fundamental to understanding what is covered in this
chapter. Plain fold, compared to r/foldcat, allows additional flexibility
in the choice of combining and reducing functions. Use r/foldcat when the
provided ArrayList building block is sufficient to cover the given use case. Use
r/fold with a specific r/cat initializer to have additional control over the mutable
data structure.
• cat is a concatenation transducer based on reduce. cat is much more general-purpose
than r/cat, which is tied to reducers and folding. Use cat if your goal is to flatten
nested collections.
• concat is used to iterate the tree produced by r/fold sequentially. concat is more
general-purpose than r/cat alone and can be used to merge collections together.
Performance considerations and implementation details
⇒ O(n) (foldcat)
For a general discussion of r/foldcat performance characteristics, please read the fold
performance section. r/foldcat is an application of r/fold that does not change its
performance profile. The use of r/cat or r/append! in isolation is constant time.
The most interesting performance aspect of r/foldcat compared to a plain
r/fold is the speed improvement related to the use of mutable data structures:
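A sketch of such a comparison on synthetic data (timings vary by machine; the point is the relative difference between mutable and persistent accumulation):

```clojure
(require '[clojure.core.reducers :as r])

(def v (into [] (range 1000000)))

;; mutable accumulation: ArrayList chunks joined by Cat nodes
(time (count (r/foldcat (r/map inc v))))

;; persistent accumulation: conj into vectors, merged with into
(time (count (r/fold (r/monoid into vector) conj (r/map inc v))))
```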
❶ Please refer to the beginning of the chapter for the definition of r-word reducer and text initial
collection of lines.
❷ A similar computation based on standard conj shows a relevant increase in computation time.
If the problem you are solving includes a large input data-set, non-trivial processing
steps and still produces a collection, chances are that r/foldcat will outperform
plain r/fold. Moreover, if the r/foldcat output needs additional processing, it can be
processed again in parallel using additional fold operations.
7.2 Transducers
Transducers are a recently introduced Clojure feature that has an impact on many
standard library functions, including the introduction of new dedicated ones
like transduce, eduction, completing or cat. The impact of transducers on already
existing functions usually consists of adding a new arity that returns a specific
transducer type. The following list is a summary of all the transducer-enabled
functions currently available in the standard library and a brief description of their use
in this context. They are illustrated in deeper detail later in the chapter:
• transduce: applies a reducing function and related transducer chain to a collection.
• completing: completes a binary function with the required calls so it can be
invoked as a transducer.
• eduction: applies a transducer chain to a collection producing a lazy sequence of
transformed elements.
• sequence: similar to eduction with additional caching if the sequence is iterated
multiple times.
• into: copies elements from one collection type to another, optionally transforming
them with a transducer chain.
• cat: a transducer in the standard library. It concatenates each (collection) item
of the input into the output.
All the following collection processing functions, when invoked without their
collection argument, return a transducer:
• map: returns a transducer that applies a transformation to each element.
• map-indexed: like map, but the produced transducer also includes the index for
each item.
• mapcat: returns a transducer that concatenates a transformation of each item into
the final results.
• filter: returns a transducer that applies the reducing function or not, based on a
predicate.
• remove: similar to filter, but inverting the meaning of the predicate.
• take: produces a transducer that terminates the reduction after the given number of
elements.
• take-while: produces a transducer that terminates the reduction as soon as a predicate
returns false.
• take-nth: produces a transducer that collects each "nth" element in the collection.
• drop: produces a transducer that doesn’t invoke the reduction for the first "n"
elements.
• drop-while: produces a transducer that doesn’t invoke the reduction until a
predicate returns false.
• replace: produces a transducer that replaces elements of the input collection
following the given substitution map.
• partition-by: produces a transducer that splits the input collection every time a
given function applied to each element returns a different value.
• partition-all: produces a transducer that splits the input into partitions of the
given size, allowing the last partition to hold fewer than the requested number of elements.
• keep: produces a transducer that transforms each element and keeps those that are
not nil after the transformation.
• keep-indexed: like keep, but the produced transducer also includes the index for
each item.
• distinct: produces a transducer that removes all duplicates from the output of the
preceding transducer.
• interpose: produces a transducer that alternates input items with the given
separator.
• dedupe: like distinct but the produced transducer only removes contiguous
duplicates, allowing repetitions if there is something separating them.
• random-sample: produces a transducer that lets each item through based on the
given probability value.
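As a quick illustration, any of these single-arity transducer forms compose with comp and plug into functions such as into:

```clojure
;; filter then increment, applied in one pass over the input
(into [] (comp (filter odd?) (map inc)) (range 10))
;; [2 4 6 8 10]
```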
7.2.1 transduce
function since 1.7
(transduce
([xform f coll])
([xform f init coll]))
transduce is one of the main entry points into the transducers abstraction. It works
similarly to reduce, but it also accepts a composition of transforming reducing
functions (the so-called transducers) as the parameter "xform". The following example
shows the same operation performed with reduce and transduce to see how the
two compare:
(reduce + (map inc (filter odd? (range 10)))) ; ❶
;; 30
❶ reduce is used to sum 10 numbers after incrementing them and just keeping the odd ones.
❷ The same operation is now executed with transduce.
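The transduce version the second callout refers to can be sketched as:

```clojure
;; the transformations move into a transducer chain,
;; alongside the reducing function and the collection
(transduce (comp (filter odd?) (map inc)) + (range 10)) ; ❷
;; 30
```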
The similarity with reduce is evident and deliberate, as the two operations are an
expression of the same form of iteration. The main difference is that transduce isolates
transforming operations (like map or filter) alongside the other arguments in the
parameter list. This design has a lot of interesting consequences, such as enabling the
same transformation chain to be reused in other contexts (for example
the core.async library).
CONTRACT
Input
• "xform" is a function following the transducer semantics and is a mandatory
argument. "xform" is invoked with the reducing function "f" and should return
another function supporting at least two arities: a single-argument call to be
invoked at the end of the reduction and a two-argument call for the actual reduction. An
optional zero-argument call is currently unused by transduce but could be in the
future.
• "f" is a reducing function of two arguments, receiving the results so far and the
next item in "coll". It is a mandatory argument.
• "init" is optional. When present, it is used as the first accumulator value in the
reducing process, similarly to reduce.
• "coll" is any collection supported by reduce. transients and scalars (like numbers,
keywords and so on) are not supported. Anything else is, including nil, Java
Iterable and arrays.
Notable exceptions
• NullPointerException when "xform" or "f" is nil.
• ArityException could happen if the given "xform" doesn’t support some of the
required arities. Unless you’re using custom transducers, you shouldn’t be
concerned about this. If you are using custom transducers, please
check completing.
Output
• returns: the result of applying the reducing function, along with any other
transforming reducing functions, to "coll". When "coll" is nil, returns "init" or the
result of invoking (f) without arguments. In both cases, the result also depends on
the single-argument arity of any transducer in the chain. A transducer can in fact
alter the final result of the reduction (like partition-all, for example, which flushes the
last partial partition if necessary).
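To illustrate how a transducer's completing arity can alter the final result, partition-all flushes its last, possibly incomplete partition when the reduction finishes:

```clojure
;; the final [3 4] only appears because the single-argument
;; (completing) arity of partition-all runs at the end
(transduce (partition-all 3) conj (range 5))
;; [[0 1 2] [3 4]]
```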
Examples
Most of the time transduce can replace reduce, provided it is possible to rearrange
transformations of the input (if any) into a transducer chain with comp. The following
example illustrates the point by showing how to implement the Egyptian multiplication
algorithm 129. Ancient Egyptians didn’t use times tables to multiply numbers; they
worked out the multiplication by decomposing one operand into powers of two:
(defn egypt-mult [x y]
(->> (map vector ; ❶
(iterate #(quot % 2) x)
(iterate #(* % 2) y))
(take-while #(pos? (first %)))
(filter #(odd? (first %)))
(map second)
(reduce +))) ; ❷
❶ The computation begins by generating pairs of numbers where the first is increasingly halved while the
second increasingly doubled.
❷ After stopping at the first zero occurrence, we filter the odd numbers. The final step is to use reduce to
sum up the second element in the pair.
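A transduce-based version consistent with the callouts that follow could look like this (a sketch; the pair creation stays outside the transducer chain, as the callouts describe):

```clojure
(defn egypt-mult [x y]
  (transduce ; ❶
    (comp ; ❷
      (take-while #(pos? (first %)))
      (filter #(odd? (first %)))
      (map second))
    +
    (map vector ; ❸
      (iterate #(quot % 2) x)
      (iterate #(* % 2) y))))

(egypt-mult 17 23)
;; 391
```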
❶ The threading macro and the final reduce operation have been replaced by a transduce call.
❷ comp groups together all the preprocessing steps except for the initial creation of pairs.
❸ The creation of the sequence of pairs happens through map and is not part of the transducer
composition.
129
Please see the following Wikipedia article to know more about Egyptian multiplication
en.wikipedia.org/wiki/Ancient_Egyptian_multiplication
egypt-mult is now using a transducer chain, but only after forming the
pairs of numbers. It would be nice to move all processing inside transduce, because
forming the pairs creates an intermediate sequence that we would like to avoid. Our
only hope is to find a way to express the formation of pairs with an alternative design
so it can be included in transduce. Although there is no guarantee that such an
alternative design exists, in this case we can express the formation of pairs as the
interleaving of two sequences followed by grouping.
The next call-out section shows how this can be done, including creating custom
transducers if the standard library doesn’t provide one.
(defn egypt-mult [x y]
(->> (interleave ; ❶
(iterate #(quot % 2) x)
(iterate #(* % 2) y))
(partition-all 2)
(take-while #(pos? (first %)))
(filter #(odd? (first %)))
(map second)
(reduce +)))
❶ The only change made to egypt-mult is to replace the (map vector) expression
with interleave followed by partition-all.
The new algorithm is very similar to the original one, except for isolating the interleaving from the pair
creation. This allows us to design the reduction on top of one sequence, while the other is part of the
creation of the transducer chain. There is a problem though: there is no interleave transducer in the
standard library. We can create our own interleave-xform transducer as follows:
(defn interleave-xform ; ❶
[coll]
(fn [rf]
(let [fillers (volatile! (seq coll))] ; ❷
(fn
([] (rf))
([result] (rf result))
([result input]
(if-let [[filler] @fillers] ; ❸
(let [step (rf result input)]
(if (reduced? step) ; ❹
step
(do
(vswap! fillers next) ; ❺
(rf step filler))))
(reduced result))))))) ; ❻
❶ interleave-xform is modeled on the same semantics as the interleave function in the standard library:
it interleaves elements up to the end of the shortest sequence. interleave-xform supports all the
required arities: no arguments, single argument and two arguments.
❷ interleave-xform assumes the interleaving sequence is passed in while creating the
transducer; the other sequence is the transducing collection. We need to keep track of the remaining
items as we consume them, so the rest of the sequence is stored in a volatile! instance.
❸ During the reducing step we verify that at least one more element is available to interleave before
allowing the reduction. Note the use of if-let and destructuring on the first element of the content of the
volatile instance.
❹ As any good transducer "citizen", we need to check whether another transducer along the chain has
requested the end of the reduction. In that case we obey without propagating any further reducing step.
❺ If instead we are not at the end of the reduction and we have more elements to interleave, we can
proceed to update our volatile state and call the next transducer using the "filler" element coming from
the internal state. Note that at this point, this is the second time we invoke "rf": the first one for the
normal reducing step, the second is an additional reducing step for the interleaving.
❻ In case we don’t have any more items to interleave, we end the reduction using reduced. This
prevents nil elements from appearing in the final output, exactly like normal interleave.
(defn egypt-mult [x y]
(transduce
(comp
(interleave-xform (iterate #(* % 2) y)) ; ❶
(partition-all 2)
(take-while #(pos? (first %)))
(filter #(odd? (first %)))
(map second))
+
(iterate #(quot % 2) x))) ; ❷
(egypt-mult 4 5)
;; 20
❶ What used to appear as the second iteration of increasingly doubling numbers in the (map
vector) form is now considered the interleaving sequence that we pass when creating the
transducer.
❷ The other iteration with increasingly halved numbers is now the normal input for transduce. The two
sequences are interleaved together and partitioned into vectors as part of the transducing step.
See also:
• reduce is the core abstraction underlying the semantics of transduce. reduce is still
necessary in all those cases where it is not possible to reformulate the algorithm
with transduce. There are also reasons related to laziness that prevent the
implementation of some algorithms with transduce (please
see sequence regarding this aspect in relation to transducers).
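The two annotated forms below are missing from this excerpt; a minimal sketch of the comparison they describe:

```clojure
;; The same sum of incremented, filtered numbers, expressed both ways.
(reduce + (filter odd? (map inc (range 10)))) ; ❶
;; 25

(transduce (comp (map inc) (filter odd?)) + (range 10)) ; ❷
;; 25
```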
❶ reduce is used to sum a collection of numbers that have been incremented and filtered.
❷ The same operation is performed with transduce which allows processing to happen at the same
time the collection is iterated for reduction.
You shouldn’t hesitate to use transduce if your algorithm can be easily rewritten in
terms of a composition of transducers. The Egyptian multiplication example in this
chapter should be considered borderline: it requires a new design and a custom
transducer, which is additional complexity that needs to be justified by the
performance improvement it brings.
7.2.2 eduction
function since 1.7
eduction takes any number of transducers (with or without explicit comp composition)
and a collection and applies the transducers chain to each element in the collection:
(take 2 (eduction (filter odd?) (map inc) (range))) ; ❶
;; (2 4)
❶ Note that despite using an infinite sequence, eduction works lazily and returns with the requested
elements.
130
The reason for the chunk size 32 is that eduction returns a java.lang.Iterable, which is iterated in chunks
of 32 items by all sequence-aware functions in Clojure.
• Returns the sequential version of "coll" if a single argument is present and is
supported by seq.
• Returns the transformed iteration of "coll" as directed by the transducers
composition in any other case.
Examples
The result of eduction is a rare example of a lazy sequence-like collection that does not
cache its results. As the composition of transducers returns a value for each item in the
input, no chain of cons-cells is created to contain the resulting values. Standard lazy
sequences cache their results so that repeated access to the same item is possible without
repeating the computation (which also avoids repeating potential side effects).
eduction becomes faster than sequence when the final goal is to reduce the output, an
operation that does not require caching results in any way. The following example
illustrates the effect of caching:
(let [input (sequence (map #(do (print ".") %)) (range 10)) ; ❶
odds (filter odd? input)
evens (filter even? input)]
(if (> (first odds) (first evens))
(println "ok")
(println "ko")))
;; ..........ok ; ❷
The example shows that even when the same sequence requires multiple passes (to
filter odds and then evens in this case), it only produces the initial computation once
(the dot only prints once for each item). To achieve this result, sequence produces a
chain of cached values amenable for subsequent access. Now observe the same
example after replacing sequence with eduction:
(let [input (eduction (map #(do (print ".") %)) (range 10)) ; ❶
odds (filter odd? input)
evens (filter even? input)]
(if (> (first odds) (first evens))
(println "ok") ; ❷
(println "ko")))
;; ....................ok
The presence of 20 dots in the last example with eduction shows that the source
sequence needs a new pass of computation each time we call filter on it.
Here’s a more interesting example. We are going to design a best-product function
that applies transformations to a large collection of financial products (kept small for
the purpose of this example). The function accepts parameters influencing the
composition of transducers. But first let’s have a look at the input data:
(def data
[{:fee-attributes [49 8 13 38 100]
:product {:visible true
:online true
:name "Switcher AA126"
:company-id 183
:part-repayment true
:min-loan-amount 5000
:max-loan-amount 1175000
:fixed true}
:created-at 1504556932728}
{:fee-attributes [11 90 79 7992]
:product {:visible true
:online true
:name "Green Professional"
:company-id 44
:part-repayment true
:min-loan-amount 25000
:max-loan-amount 3000000
:floating true}
:created-at 15045569334789}
{:fee-attributes [21 12 20 15 92]
:product {:visible true
:online true
:name "Fixed intrinsic"
:company-id 44
:part-repayment true
:min-loan-amount 50000
:max-loan-amount 1000000
:floating true}
:created-at 15045569369839}])
Next we define two groups of transducers. The first, prepare-data, shapes the data in a
slightly different form than the raw input, while filter-data filters it based on
user arguments. Several helper functions are also present to improve readability:
(import 'java.util.Date)
(defn- update-at [k f]
(map (fn [m]
(update m k f))))
(defn if-equal [k v]
(filter (fn [m]
(if v (= (m k) v) true))))
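The merge-into helper used by prepare-data below is not shown in this excerpt. The following sketch is an assumption inferred from its usage (the exact behavior is not confirmed by the source): it copies the given top-level keys into the nested map at "k":

```clojure
;; Assumed behavior, inferred from (merge-into :product [:fee-attributes :created-at]):
;; copy the top-level entries named in "ks" into the nested map at "k".
(defn- merge-into [k ks]
  (map (fn [m]
         (update m k merge (select-keys m ks)))))
```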
(def prepare-data ; ❷
(comp
(merge-into :product [:fee-attributes :created-at])
(update-at :created-at #(Date. %))))
(best-product
{:repayment-method :part-repayment
:loan-amount 500000}
data
best-fee)
❶ Somewhat arbitrarily, we decide that the best product is the one offering the minimum last fee
attribute.
❷ eduction also takes an already composed transducer as argument. We use the parameters from the
request to prepare a different transducer configuration. You can see that the eduction output is the
input for the enclosing reduce operation.
(def best-fixed ; ❷
(eduction (xform {:rate :fixed}) data))
❶ best-part-repayment represents the computation and the data necessary to retrieve which product
has the lowest fee that allows "part repayment" as repayment method. The computation is defined, but
doesn’t run yet. This is similar to a delayed reduce operation.
❷ We also define another reduction, best-fixed, which retrieves the cheapest fixed-rate product.
❸ By using reduce we force the eduction and retrieve a result.
The eductions defined in the previous example are created once at load time and reused
throughout the lifetime of the application, additionally improving performance.
See also:
• sequence is eduction’s close sibling (in the context of transducers).
Use sequence when you plan to use the produced output multiple times, for
example assigning it to a local binding. Use eduction when there is no plan to reuse the output.
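The example the annotations below describe is not reproduced in this excerpt; a sketch consistent with the described behavior (the counter names "cnt1" and "cnt2" are assumptions):

```clojure
;; Counting how many times the transducing step runs for an eduction
;; versus a (cached) sequence.
(def cnt1 (atom 0))
(def ed (eduction (map #(do (swap! cnt1 inc) %)) (range 10)))
(first ed) ; fresh pass over the source
(rest ed)  ; another fresh pass
@cnt1
;; 20

(def cnt2 (atom 0))
(def sq (sequence (map #(do (swap! cnt2 inc) %)) (range 10)))
(first sq) ; realizes (and caches) the first chunk
(rest sq)  ; served from the cached sequence
@cnt2
;; 10
```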
❶ first and rest have the effect of forcing the eduction to re-execute, at least up to the chunk size
necessary to fulfill the request. Although the input collection contains only 10 elements, it appears that
we are iterating them twice, as shown by the "@cnt1" counter.
❷ We now repeat the same operation on a sequence, which shows the caching behavior by not
executing the computation again. The counter "@cnt2" correctly shows only 10 evaluations.
The eduction approach has benefits in memory allocation at the price of possible
re-evaluations. Since results are not cached, there is no "holding on to the head"
problem either:
(defn busy-mem []
(str (/ (-
(.. Runtime getRuntime totalMemory)
(.. Runtime getRuntime freeMemory))
1024. 1024.) " Mb"))
(System/gc) (busy-mem) ; ❶
;; 5.574 Mb
(System/gc) (busy-mem) ; ❷
;; 7.615 Mb
(System/gc) (busy-mem) ; ❸
;; 304.5126 Mb
❶ We start a new REPL with -Xmx512M option to allocate max 512MB of heap size. Just after starting,
we ask the JVM for garbage collection and measure the currently used heap size, which is around
5Mb.
❷ We now execute an eduction pass on 10 million elements taking only the last. This is going to
process the entire sequence moving the iterator from the first item down to the last. We keep
the eduction alive by storing the head of the sequence "s1", but by the time we call the garbage
collector and measure the memory again, we can see that the heap size only increased a couple of
Mb.
❸ If we try the same using sequence instead, we can see that memory remains allocated even after
calling the garbage collector. sequence is caching all elements and by holding the head "s2" all items
remain in memory.
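The definitions measured above are missing from this excerpt; a sketch of the two passes the annotations describe (the names "s1" and "s2" are taken from the annotations):

```clojure
;; Traversing 10 million elements with and without caching.
(def s1 (eduction (map inc) (range 10000000)))
(last s1) ; ❷ iterates all elements without caching them

(def s2 (sequence (map inc) (range 10000000)))
(last s2) ; ❸ realizes and caches the whole sequence: "s2" holds the head
```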
(completing
([f])
([f cf]))
completing provides (or replaces) the single-arity call in a reducing function. This is
useful when working with transduce:
(transduce (map inc) - 0 (range 10)) ; ❶
;; 55
❶ transduce invokes subtraction (-) with the result of the reduction as the last step. Subtraction with a
single argument, (- -55), negates the already negative output, resulting in the apparently wrong result.
❷ completing wraps the single arity of subtraction using identity as a default, which leaves the result
untouched.
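The second annotated call is not shown above; the fixed version reads:

```clojure
;; completing defaults the single-arity call to identity,
;; leaving the reduction result untouched.
(transduce (map inc) (completing -) 0 (range 10)) ; ❷
;; -55
```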
There are three possible problems related to the single arity call that can arise using a
reducing function with transduce:
1. The single arity version of the reducing function is generating unwanted changes.
This is the case we’ve seen in the previous example.
2. The single arity version of the reducing function does not exist. In this case we
want to provide one to avoid exceptions.
3. We want to provide a specific wrap-up behavior for transduce, so when the
reduction is complete we still have control of the very last step.
Based on the list of scenarios above, completing is a handy function that completes or
fixes the reducing function for transduce.
CONTRACT
Input
• "f" is a reducing function. It is expected to support a zero-arity call and the
standard two-arity call with an accumulator/item pair. It is a mandatory argument.
• "cf" is a "closing function": the function that should be called at the end of the
reduction process. "cf" must support at least one argument. "cf" is an optional
argument; when it is not provided, identity is used.
Output
• returns: a function accepting zero, one and two arguments. With no arguments, it
returns the result of invoking "f" with no argument. With a single argument, it
returns the result of invoking "cf" (or identity when omitted) with that argument.
Finally, it invokes "f" with two arguments when invoked with two arguments.
Examples
The following example illustrates the way invocations flow through a transducer chain
when invoked with transduce. identity-xform is a custom transducer that works like
the identity function with additional printing when a dynamic variable is set to true.
By executing the transducer with tracing enabled, we can follow the invocations on
screen:
(def ^:dynamic *debug* false)
(defn- identity-xform ; ❷
([]
(fn [rf]
(fn
([] (print-if "#0") (rf))
([acc] (print-if "#1") (rf acc))
([acc item] (print-if "#2") (rf acc item)))))
([x] x))
;; #2 #2 #2 #1 #done! 9
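The helpers and the call producing the output above are missing from this excerpt. The following sketch reproduces the behavior; the input collection (range 1 4) is an assumption chosen to match the printed result, and the definitions from the listing above are repeated so the snippet is self-contained:

```clojure
(def ^:dynamic *debug* false) ; as defined above

(defn- print-if [s] ; ❶ conditional printing helper
  (when *debug* (print (str s " "))))

(defn- identity-xform ; ❷ as in the listing above
  ([]
   (fn [rf]
     (fn
       ([] (print-if "#0") (rf))
       ([acc] (print-if "#1") (rf acc))
       ([acc item] (print-if "#2") (rf acc item)))))
  ([x] x))

(defn- completing-debug [f] ; ❸ completing with a traced closing function
  (completing f #(do (print-if "#done!") %)))

(binding [*debug* true] ; ❹
  (transduce
   (comp (map inc) ((identity-xform)))
   (completing-debug +)
   (range 1 4)))
;; #2 #2 #2 #1 #done! 9
```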
❶ print-if conditionally prints to the standard output based on the dynamic variable *debug*.
❷ The identity-xform custom transducer prints a different message for each provided arity. It does not
touch the results in any other way. We need this custom transducer to see what happens when the
single-arity call is invoked at the end of the reduction.
❸ Similarly, completing-debug wraps completing to print a message on screen. We want to verify
when the function generated by completing is called.
❹ We put everything together when calling transduce. *debug* is set to true so we can see the related
messages on the screen. The transducing chain increments each item and the pass them through
the identity-xform custom transducer.
We can see from the results of the example that, after the two-argument call is invoked
for each element in the input collection, the single arity of each transducer is called
(we only see messages from identity-xform, but the map transducer single arity is called
right before that). The very last step is performed by completing-debug, which prints
the "#done!" message just before the result is returned.
WARNING Note that the 0-arity call is not currently used by any transducing context, but it could be in the
future. Although transducers are not using it yet, other contexts might be, for example fold: it is
possible to use a stateless transducer with fold, but the reducing function needs to provide an
initialization value when called with no arguments. ((map inc) +), for example, generates 0 when
called with no arguments (because (+) returns 0), enabling fold use: (r/fold ((map inc) +) (range 10)).
Knowing that the 1-arity version of the reducing function is called last, we can create a
function to calculate the average of numbers based on completing. The following is
the same example found in filter, rewritten to take advantage of transducers:
(def events ; ❶
(apply concat (repeat
[{:device "AX31F" :owner "heathrow"
:date "2016-11-19T14:14:35.360Z"
:payload {:temperature 62.0
:wind-speed 22
:solar-radiation 470.2
:humidity 38
:rain-accumulation 2}}
{:device "AX31F" :owner "heathrow"
:date "2016-11-19T14:15:38.360Z"
:payload {:wind-speed 17
:solar-radiation 200.2
:humidity 46
:rain-accumulation 12}}
{:device "AX31F" :owner "heathrow"
:date "2016-11-19T14:16:35.360Z"
:payload {:temperature 63.0
:wind-speed 18
:humidity 38
:rain-accumulation 2}}])))
(defn average [k n] ; ❷
(transduce
(comp
(map (comp k :payload))
(remove nil?)
(take n))
(completing + #(/ % n))
events))
❶ events is a simulation of an infinite series of readings from weather sensors connected to a central
server. We just concatenate the same 3 events to generate an infinite stream.
❷ average contains a call to transduce and takes two parameters to change what data is returned.
After selecting the specific sensor reading, we remove potentially empty results and take the requested
amount of information. completing adds a last step to + that calculates the average.
(def xform ; ❶
(comp (map inc)
(partition-all 3)
cat))
(def xform-reductor ; ❷
(xform
(completing +
#(do (print "#done! ") %))))
(xform-reductor 0 0) ; ❸
;; 0
(xform-reductor 0 0) ; ❹
;; 0
(xform-reductor 0) ; ❺
;; #done! 2
❶ A transducer chain is assembled as usual. Notice that partition-all is a stateful transducer with an
internal buffer of length 3. The cat transducer that follows unwraps any inner lists created by
partition-all.
❷ The transducer chain is invoked with +, preparing it for execution.
❸ The first invocation of xform-reductor simulates what happens when 0 is the first element in the
input collection. It gets incremented by the time it reaches partition-all, which stores the number
in the internal buffer because it has not reached size 3 yet. 0 is returned, because + is yet to be
invoked.
❹ After calling xform-reductor with 0 again, partition-all contains [1,1] in its internal buffer.
❺ We now simulate the end of the reduction calling xform-reductor with a single argument.
The example shows what happens inside transduce when invoked with a collection containing 2 zeros:
[0 0]. First the transducer chain is invoked with +, which instantiates all the transducing functions. The
result is named xform-reductor, and partition-all initializes its internal state at this point.
xform-reductor is a fully fledged reducing function, which increments each item, stores it temporarily
in partition-all (up to 3 items) and sums the item with the previous results.
Without the last step, (xform-reductor 0), the result would be just "0". This is
because partition-all doesn’t have a chance to empty the internal buffer. transduce conveniently
invokes the last step for us, making sure tear-down of transducer state happens consistently.
See also:
• monoid has some similarities with completing but they are made for different
purposes and cannot be used interchangeably. monoid compensates for the lack of
a zero-arity call in a reducing function, while completing is mainly dedicated to
the single-arity call.
• transduce is the main use case for completing.
(cat [rf])
cat is the only pure transducer in the standard library (that is, not connected to an
already existing sequence function). While many other transducers can be obtained
by calling a sequence processing function (usually without
arguments), cat is the only one that can be used directly:
(eduction
(map range)
(partition-all 2)
cat
cat ; ❶
(range 5))
;; (0 0 1 0 1 2 0 1 2 3)
❶ Note the use of cat without any wrapping parentheses. This is just an effect of eduction, which
combines transducers through comp internally.
Examples
The following example illustrates the relationship of cat with reduce:
((cat +) 0 (range 10)) ; ❶
;; 45
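The equivalent plain reduce call the second annotation refers to is missing from this excerpt:

```clojure
;; Once cat receives a reducing function, it behaves like reduce:
(reduce + 0 (range 10)) ; ❷
;; 45
```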
❶ We can invoke cat with a reducing function such as +. Normally this is not how cat is used: its
natural place is inside a transducer composition.
❷ cat works like calling reduce once a reducing function has been assigned to it.
The previous example shows that cat, once supplied with a reducing function
like + (plus), expects its input to be sequential. Differently from reduce, cat is a
proper transducer and can be used to flatten back the inner collections produced by an
upstream step:
(def team
["jake" "ross" "trevor" "ella"])
(def week
["mon" "tue" "wed" "thu" "fri" "sat" "sun"])
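rotate, described by the first annotation below, is not shown in this excerpt; a sketch consistent with that description:

```clojure
;; rotate repeats a finite enumeration forever; cat removes the inner
;; wrapping layer introduced by repeat.
(defn rotate [coll] ; ❶
  (sequence cat (repeat coll)))

(take 9 (rotate ["mon" "tue" "wed" "thu" "fri" "sat" "sun"]))
;; ("mon" "tue" "wed" "thu" "fri" "sat" "sun" "mon" "tue")
```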
(def rota ; ❷
(sequence
(map vector)
(rotate team)
(rotate week)))
❶ rotate is a function that creates an infinite sequence of repeating elements starting from a finite
enumeration. We could for example repeat the days of the week or the members of a
team. repeat creates a new sequence to enclose the infinite repetition of the input, but we don’t want
the inner wrapping layer to appear. cat is used here to eliminate the inner collections, forming a flat
sequence.
❷ rota shows how to assign a team member to each day of the week, even when the two enumerations
have different lengths. The rota sequence can be used to decide who is in charge of a specific aspect of
the project each day.
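The comparison the next two annotations describe is missing from this excerpt; a sketch with assumed input data:

```clojure
;; Plain concatenation, with and without the mapping step.
(sequence (mapcat identity) [[1 2] [3 4]]) ; ❶
;; (1 2 3 4)

(sequence cat [[1 2] [3 4]]) ; ❷
;; (1 2 3 4)
```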
❶ In this case we only need concatenation, not transformation. For this reason the
"mapping" part of mapcat is given identity as the transforming function.
❷ identity can be removed completely by using cat alone.
cat is essentially the "cat" part of mapcat. Since the iteration mechanism is now implemented by
transducers, we don’t necessarily have to carry over the mapping, and can use cat alone when
appropriate.
mapcat (which is also transducer-ready) is still useful when flattening includes a transformation step.
For instance:
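A sketch of the example the annotation below describes (the exact listing is missing from this excerpt):

```clojure
;; mapcat transforms each item into a range, then concatenates the results.
(sequence (mapcat range) (range 5)) ; ❶
;; (0 0 1 0 1 2 0 1 2 3)
```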
❶ The use of a range on top of another range with mapcat produces a series of increasingly bigger
enumerations all starting at zero.
See also:
• mapcat adds to cat a transformation step ahead of concatenation. If your
algorithm always requires a transformation, prefer mapcat to cat composed
with map. mapcat is also necessary when transducers are not an option (this
often has to do with the different laziness guarantees offered by transducers compared
to normal sequential processing).
• concat lazily concatenates collections together.
• flatten is a stronger form of concatenation which also removes every layer
of nested collections, not just the first. cat and mapcat only affect the
first level of nesting, leaving inner collections untouched. Use flatten when you
want to remove all levels of nesting.
• r/cat in reducers is a concatenation operation specifically created to
perform fold efficiently.
• r/foldcat combines r/fold and r/cat together. Like r/cat this is a specific
operation used in reducers only.
Performance considerations and implementation details
(reduced [x])
(reduced? [x])
(ensure-reduced [x])
(unreduced [x])
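The reduce example producing the ";; 15" output below is missing from this excerpt; a sketch consistent with the annotations (the exact threshold is inferred from the result):

```clojure
;; Stop the reduction early with reduced once elements exceed 5.
(reduce
 (fn [acc el] ; ❶ only keep reducing with + for small elements
   (if (> el 5)
     (reduced acc) ; ❷ wrap the result so far to stop the loop
     (+ acc el)))
 (range 10))
```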
;; 15
❶ The custom reducing function passed to reduce includes a condition on the incoming element "el". It
only reduces with + if the element is less than 5.
❷ When the element is more than 5, a reduced value wrapping the result so far is returned.
When reduce finds an element wrapped in a Reduced instance, it stops recurring over
the remaining elements in the input. Both reducers and transducers take advantage of
this mechanism to signal a premature end of the reduction (for example, take-while
immediately wraps the result so far in a reduced so other reducers/transducers
down the line have a chance to avoid computationally intensive operations).
The wrapped Reduced instance supports the IDeref interface, allowing easy
inspection using the @ (deref) operator:
(def r (reduced "string"))
(reduced? r)
;; true
r
;; #object[Reduced 0x3e7d {:status :ready :val "string"}]
@r ; ❶
;; "string"
❶ A reduced object supports the IDeref interface similar to atoms or other reference objects.
CONTRACT
Input
• "x" is a required argument in all cases. It can either be an object already wrapped
in a clojure.lang.Reduced instance or a plain object of any type.
Notable exceptions
• ClassCastException is the typical exception encountered when a
wrapped reduced object is used without dereferencing it first.
Output
• reduced: returns the input object "x" wrapped in a new instance
of clojure.lang.Reduced.
• reduced?: returns true if "x" is a reduced object, false otherwise.
• ensure-reduced: is like reduced, but does not wrap "x" again when "x" is already
a reduced object.
• unreduced: retrieves the dereferenced version of "x" when "x" is a reduced object.
NOTE there are 3 ways to look inside the content of a reduced object: given "r" is reduced,
then: @r, (deref r) and (unreduced r) are equivalent.
Examples
With reduced we can provide an alternative way to drop elements from a collection
after some condition is met. drop-while implements a similar concept, but if we are
interested in the first element "after" the condition is met, we need an
additional first call. The reduced option is not only shorter but likely faster: depending
on the type of the input collection, the iteration could already be optimized for reduce.
Let’s compare the two forms in the following example 131:
(def random-vectors ; ❶
(repeatedly #(vec (drop (rand-int 10) (range 10)))))
(first ; ❷
(drop-while
#(>= 3 (count %))
random-vectors))
;; [2 3 4 5 6 7 8 9]
(reduce ; ❸
#(when (> (count %2) 3) (reduced %2))
random-vectors)
;; [4 5 6 7 8 9]
131
Thanks to Max Penet (@mpenet on the Slack Clojurians group chat) for suggesting the inclusion of this comparison in
the book
❶ random-vectors is an infinite lazy sequence of vectors of random length (between 1 and 10 elements).
❷ There are several options to solve the problem of returning the first vector longer than 3 items. One of
them is to use sequential functions like drop-while to remove vectors from the head of the sequence
while the predicate is true. At that point we need to call first to get the first item from the remaining
vectors in the list.
❸ We can use reduce and reduced to achieve a similar effect and stop the underlying loop when we find
the first item satisfying the predicate. reduced stops the computation and exits reduce at that point,
removing the need to take the first element from the remaining ones.
A "moving average" is a time-based series of average values. It’s useful to
show snapshot results over time, without necessarily waiting for other values that
might still be in transit. A typical example is the plot of a stock price where
each dot on the chart is an average of the values in the previous period (daily or
otherwise) 132. The idea is to use reduce in a blocking fashion, to start processing
values as they arrive, effectively suspending the loop in between. We need a few tools
for this to happen:
• java.util.concurrent.LinkedBlockingQueue is a type of queue where push and pop
operations block: pushing blocks while the queue is full, popping blocks while it is
empty. We can use it as a channel to feed new values in.
• We also need a "signal" to mark the end of the event stream, so
the reduce operation can unblock and return.
To implement this design, we need to create and check for reduced
elements during the reducing process. A reduced signal is sent by the client down the
queue to mark the end of the events for that session. The reduced signal is interpreted
by the reducing function and propagated to the reduce implementation to stop the loop (in
normal conditions, the signal would be the end of the input sequence, but we don’t
have one in this case):
(import 'java.util.concurrent.LinkedBlockingQueue)
(def values (LinkedBlockingQueue. 1)) ; ❶
(defn value-seq [] ; ❷
(lazy-seq
(cons (.take values) (value-seq))))
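moving-average (the third annotation below) is missing from this excerpt; a sketch consistent with the REPL session that follows: the accumulator is a [count sum average] triplet, printed at every step:

```clojure
;; Reducing function accumulating [count sum average]; it accepts either a
;; plain value or a reduced value marking the end of the stream.
(defn moving-average [[n sum _] x] ; ❸
  (let [n (inc n)
        sum (+ sum (unreduced x)) ; unwrap a reduced value if necessary
        result [n sum (/ sum (double n))]]
    (println result)
    (if (reduced? x)
      (reduced result) ; propagate the end-of-reduction signal
      result)))
```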
(defn start [] ; ❹
(let [out *out*]
(.start (Thread.
#(binding [*out* out]
(println "Done:"
(reduce
moving-average
[0 0 0]
(value-seq)))))))))
132
Please read the Wikipedia article on moving averages available at en.wikipedia.org/wiki/Moving_average
(start)
(.offer values 10)
;; [1 10 10.0]
(.offer values 10)
;; [2 20 10.0]
(.offer values 50)
;; [3 70 23.333333333333332]
(.offer values (reduced 20))
;; [4 90 22.5]
;; Done: [4 90 22.5]
❶ A linked queue of size one means that as soon as the first element is added, no more pushes into the
queue are allowed until that element is consumed. This causes reduce to block waiting for another
element to be available.
❷ value-seq creates a lazy sequence on top of the queue, a sequential interface that reduce can
understand and use without knowing the internals of the blocking queue.
❸ moving-average is the reducing function. The function is unaware of the queue-based
implementation and would work just fine in a normal reduce call. The function assumes that, at any
point, the argument "x" could be reduced. In that case the end-of-reduction signal is sent
to the reduce implementation by wrapping the returned result in a reduced call.
❹ The start function starts the reduction in a separate thread. The main thread is used for the client to
send values to the queue, while the background thread runs the (otherwise blocking) reduction. Note
that we redirect the standard output of the new thread to the current one, so we can see messages at
the REPL. The call to reduce is blocking and only returns when a reduced element is sent down the
queue.
In the example, the interaction starts by forking a new thread containing the
background (blocking) reduce call. The current thread holds a reference to the queue
that can be used to send values to reduce. As soon as we offer a value wrapped
by reduced, the thread stops and exits, showing the last available average. The vector
contains a triplet with the count of elements seen so far, their sum and finally their
average.
From the example, we can see how reduced, reduced? and unreduced are used:
• To transform a normal element into a reduced one, use reduced.
• To verify if an element is reduced, use reduced?.
• To use an element regardless of it being reduced or not, use unreduced. Using the
dereference form with @ in this case would require checking with reduced? first
to avoid an exception.
See also:
• reduce is the main recipient of reduced objects. It is worth understanding a bit more
about reduce if you need to work with the functions in this section.
Performance considerations and implementation details
8
Collections
This chapter groups together the functions that can be used consistently across
different types of collections.
The chapter is further divided into additional sub-chapters. We are going to first have a
look at the basic functions to create, count or otherwise access a collection in the most
general way. Next are the functions designed to be polymorphic. Finally, the last
sub-chapter is dedicated to general purpose functions like grouping, sorting,
partitioning and others that are very common in everyday use.
8.1 Basics
8.1.1 into
function since 1.0
(into
([])
([to])
([to from])
([to xform from]))
into is a frequently used function in the Clojure standard library. into copies the
content of a collection into another:
(into #{} (range 10))
;; #{0 7 1 4 6 3 2 9 5 8} ; ❶
The output collection can be (and most frequently is) of a different type than the input
collection, effectively simulating the "creation" of another collection. into can also be
used to apply a transducer chain "xform" to the input collection, adding the option to
transform data while copying them from one collection to another:
(into (vector-of :int) (comp (map inc) (filter even?)) (range 10))
;; [2 4 6 8 10] ; ❶
❶ into applies a transducer chain while copying the content from a sequence into a vector of primitives.
CONTRACT
Input
• "from" can be any seqable collection or nil.
• "to" should be such that (coll? to) is true:
lists, vectors, sets, queues or maps are all allowable target collections. The
following non-persistent data structures are instead not supported:
transients, arrays or generic Java containers. "to" can also be nil, in which case a
default list is used as the target collection.
• "xform" is a transducer composition (or a single transducer) and is an optional
argument.
Notable exceptions
• ClassCastException: "X" cannot be cast to java.util.Map$Entry appears
when "to" is a map and the "from" input is not in a form
that “conj” can process for maps. In this case the "from" collection should be
structured using pairs or other maps, for example:
(into {} [[:a "1"] [:b "2"]]) ; ❶
;; {:a "1", :b "2"}
(into {} [{:a "1"} {:b "2"}]) ; ❷
;; {:a "1", :b "2"}
❶ The example illustrates the correct format the input collection should present to into when the
target collection is a map.
Output
• The "to" collection object, with additional elements added from the input
collection, if any. The elements are added to the "to" collection
following “conj” semantics.
Examples
into can be used in those cases where the application requires the features of a specific
data structure but the input is given in a different one. The reader is invited to check
the “for” chapter for an example. The Game of Life implementation in that chapter uses
a set as the main data structure to enable easy checking of the presence of a cell in the
set. But “for” returns a sequence, so into is used to transform the results back into a
set for further computation. Similarly, the following maintain function applies
sequence processing (like map or filter) while making sure that the output collection
remains of the same type as the input collection:
(defn maintain [fx f coll] ; ❶
(into (empty coll) (fx f coll)))
(->> #{1 2 3 4 5} ; ❷
(maintain map inc)
(maintain filter odd?))
;; #{3 5}
(->> {:a 1 :b 2 :c 5} ; ❸
(maintain filter (comp odd? last)))
;; {:a 1 :c 5}
❶ maintain delegates processing of the collection to the input "fx" and "f" functions. It then
uses empty to create an empty collection of the same type as the input collection. into is finally used
to copy the processed content before returning.
❷ A set gets incremented and even numbers are removed. The standard calls to map and filter would
return a lazy-sequence instead.
❸ Another example using a map.
(def xform ; ❶
(comp
(map dec)
(drop-while neg?)
(filter even?)))
(defn maintain ; ❸
([fx f coll]
(into (empty coll) (fx f coll)))
([xform coll]
(into (empty coll) xform coll)))
(def input-queue
(queue -10 -9 -8 -5 -2 0 1 3 4 6 8 9))
(def transformed-queue ; ❹
(maintain xform input-queue))
(peek transformed-queue) ; ❺
;; 0
(into [] transformed-queue) ; ❻
;; [0 2 8]
with-meta attaches metadata (an arbitrary map of key-value pairs) to a data structure, while meta reads the metadata information back out.
The standard library functions and macros usually play nicely with metadata, carrying them over as one
would expect. into takes the approach of preserving any metadata already present in the "to"
collection, discarding what is instead part of the "from" collection:
For example, even if a function attaches metadata to the incoming "from" collection (say, a string
conversion of its content), a simple experiment shows that the metadata resulting from calling into
comes from the target collection.
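A minimal sketch (not the book's original example) demonstrating this behavior:

```clojure
;; into keeps the metadata of the target ("to") collection and
;; discards the metadata of the source ("from") collection:
(def to   (with-meta [] {:origin "to"}))
(def from (with-meta '(1 2 3) {:origin "from"}))

(meta (into to from))
;; {:origin "to"}
```
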
See also:
• into-array is the equivalent operation when the target collection to return is a
native array. into doesn’t support arrays.
• conj can be used similarly to into when the input is not a collection but just a
single item. into is based on the semantics of conj in relation to types and
behavior.
Performance considerations and implementation details
(require '[criterium.core :refer [quick-benchmark]])
(require '[com.hypirion.clj-xchart :as c]) ; assuming the clj-xchart library

(defmacro b [expr] ; ❷
  `(first (:mean (quick-benchmark ~expr {}))))

;; a sample helper (not shown in this excerpt) uses b to time into
;; calls over collections of growing size

(c/view ; ❹
  (c/xy-chart
    {"(list)"   [(sample '()) (range 100000 1e6 100000)]
     "(vector)" [(sample []) (range 100000 1e6 100000)]}))
133
The clj-xchart library can be used to plot charts quite easily: github.com/hyPiRion/clj-xchart/
Figure 8.1. into used with vectors compared to into used with a list. The x-axis is time in
seconds.
The graph shows that into's behavior is approximately linear (the lines are not perfectly
straight because there are only 10 samples). But after the first few samples, the version
with the vector becomes faster, especially for larger collections.
The following benchmark shows a comparison between apply merge and into {} to
merge multiple maps into a single one:
(require '[criterium.core :refer [quick-bench]])
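The code behind the callouts below is not included in this excerpt; a sketch under the stated assumptions (10 maps of 5 distinct keys each, with illustrative key names) could look like:

```clojure
;; Build 10 maps of 5 distinct keys each (illustrative data):
(def maps
  (for [i (range 10)]
    (zipmap (for [j (range 5)] (keyword (str "k" i "-" j)))
            (range 5))))

;; The two merging strategies under comparison:
(quick-bench (apply merge maps))
(quick-bench (into {} maps))
```
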
❶ The benchmark creates a sequence of 10 maps of 5 keys each. The keys are distinct, so the resulting
map will contain all of them.
❷ The benchmark reveals that there is not much difference between the two.
Both forms are idiomatic Clojure and they perform roughly the same, but into
{} seems slightly better in terms of conveying the overall meaning of the operation
with the visual aid of the map literal "{}".
8.1.2 count
function since 1.0
(count [coll])
count eventually asks the input collection to count itself, a design decision that allows
each data structure to provide the most efficient implementation. See the performance
section for more information about how this decision affects speed.
CONTRACT
Input
• "coll" can be any collection type, record or struct. (count nil) is 0.
Notable exceptions
• UnsupportedOperationException: when "coll" cannot be counted, for example for
scalars like keywords, symbols, numbers and so on.
• ArithmeticException: if the number of elements in the collection is
beyond Integer/MAX_VALUE, for example: (count (range (inc
Integer/MAX_VALUE))).
Output
• returns: a java.lang.Integer representing the number of elements in "coll". The
definition of "element" is type specific: it can be a tuple as in the case of maps or
the number of fields in a record.
Examples
The need for counting is ubiquitous in programming and algorithmics. Here’s a broad
categorization of problems involving count in one form or another:
• Determining the largest/smallest of a set or group of items.
• Math formulas like the average, min or max.
(defn- print-usage [] ; ❶
(println "Usage: copy 'file-name' 'to-location' ['work-dir']"))
(defn- copy ; ❷
([in out]
(copy in out "./"))
([in out dir]
(io/copy (io/file (str dir in)) (io/file out))))
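The entry point referenced by the callouts below is not included in this excerpt; a sketch consistent with them (the exact original code may differ) might be:

```clojure
;; Hypothetical -main dispatching on the argument count with cond,
;; assuming print-usage and copy defined above and
;; (require '[clojure.java.io :as io]) already in scope:
(defn -main [& args] ; ❸
  (cond
    (= 2 (count args)) (apply copy args)
    (= 3 (count args)) (apply copy args)
    :else (print-usage)))

(-main "file.txt" "/tmp/file.txt") ; ❹
(-main "file.txt" "/tmp/file.txt" "/projects/") ; ❺
```
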
❶ We print the expected usage of the command through this helper function.
❷ The core logic is implemented in the copy function, which contains 2 arities. The function expects 2 or
3 arguments.
❸ The entry-point function in Clojure, using the conventional "-" prefix. The Clojure bootstrap sequence
can be instructed to search for this main signature function. Here we are using a “cond” instruction
to count arguments.
❹ For simplicity we are calling the -main function directly; otherwise we would have to compile and
invoke the generated class, requiring several other steps that are system specific. This version with 2
arguments assumes the current folder as the "origin" from where the input file should be loaded. The
current folder in Java usually means the folder from where the JVM is invoked.
❺ In this second version, we pass the third parameter. The additional "dir" param is used as the "origin"
instead of the current folder.
For data structures going through a phase of construction, counting can be done as each element enters
the collection, resulting in a very fast count operation not requiring any additional enumeration
(e.g. vectors).
Lazy-sequences are instead assembled on-demand at the time of consumption. It follows that there
is no way for Clojure to hide the cost of counting at build-time, forcing a walk over the sequence to find
where it ends. This is the primary reason why count on lazy-sequences is a linear operation. Please see
the performance section for additional details.
See also:
• counted? verifies if the given object supports the clojure.lang.Counted interface,
which for Clojure collections means constant time count. Other notable Java types
like strings and arrays are not counted but still constant time operations.
Performance considerations and implementation details
The table describes a very good situation overall, with most collection types counting in
constant time. There are a few notable linear time exceptions:
• lazy-collections and chunked sequences 134. Note that count also realizes
the lazy sequence when counting.
• long-range behaves differently from range created on other types (doubles for
instance). count on a long-range is constant time, while it’s not for other types.
• seq called on sets, maps, sorted-sets and sorted-maps is also linear-time (while
the same on vectors or array-maps is instead constant time).
If counting is a necessary step for code that should also be as fast as possible, you
should pay particular attention to lazy-sequences that are built as intermediate results
after using seq, as they often force a linear-time count operation.
8.1.3 nth
function since 1.0
(nth
([coll index])
([coll index not-found]))
nth returns the value found at the specified index from the collection passed as
argument:
(let [coll [0 1 2 3 4]]
(nth coll 2))
;; 2
nth throws an exception in case the index doesn’t exist (instead of returning nil,
as get would instead do for sequential access). nth doesn’t throw an exception when a
default value is provided:
(nth [] 2) ; ❶
;; IndexOutOfBoundsException
(nth [] 2 0) ; ❷
;; 0
134
Chunked sequences are a by-product of a performance optimization that Clojure implements behind the scenes.
nth is specifically designed for data structures offering random access lookup
like vectors, sub-vectors, primitive vectors and transient vectors. It also handles native
Java arrays with the same efficiency. nth can also access
sequences although, in this case, less efficiently. It does not work at all on
associative data structures (e.g. maps, sets and related ordered variants) for which a
dedicated alternative exists (get).
CONTRACT
Input
• "coll" can be any collection type, excluding maps and sets (whether ordered,
unordered or created as transients).
• "index" can be any type of positive number up to Integer/MAX_VALUE. "index" is
truncated to an integer if it contains decimals: (nth [1 2 3] 3/4) is equivalent
to (nth [1 2 3] 0). nth using Double/NaN as "index" always returns the first
element (when present).
• "not-found" is optional and can be of any type. "not-found" is returned when the
requested index is not present.
• nil is accepted as a degenerate collection type. nth on nil returns nil unless
"not-found" is provided.
Notable exceptions
• UnsupportedOperationException is thrown when "coll" does not support any
kind of indexed access.
• IndexOutOfBoundsException when the requested index is outside the bounds of
the collection.
• IllegalArgumentException when the given "index" is out of the integer range
((> "index" Integer/MAX_VALUE) is true).
Output
nth returns: the element found at position "index" in "coll" (counting from 0), or the
"not-found" default when the index is out of bounds and a default is provided.
Examples
nth offers fast indexed access, which the following example uses to build a basic
hash-table on top of a vector. The idea is to compute a number (the hash 135) from the key of each
135
Hashing is the action of applying a hashing function to an object. The hashing function always converts the object to a
number.
key-value pair. We can use the hash function to retrieve such number given the key and
use the number to store the item at the index in the vector.
One problem we have with this solution is the pre-allocation of potentially large
vectors (even for a single element). This is because the hash could be any number
between -2^31 and 2^31-1.
NOTE The reader should consider using a proper hash-map for any practical purposes. The
implementation given here should give a good idea of the challenges related to creating robust
data structures.
To reduce memory consumption, we are going to limit the hash to 2^16 positive integers
at the price of increasing the probability of collisions (which will still be relatively low
and sufficient for the example):
(defn to-hash [n] ; ❶
(bit-and (hash n) 0xFFFF))
❶ The hashing function builds on top of Clojure's own hash function. We need to restrict the hash domain
so it doesn’t go beyond 2^16 and it doesn’t return negative numbers (which are not valid for vector
indexing). We can use bit-and to apply both constraints at once.
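The rest of the implementation the callouts below describe is not reproduced in this excerpt; a simplified sketch of grow, put, hashtable and fetch (omitting assign, with-hashed-keys and the transient-persistent idiom) could be:

```clojure
;; grow pads the vector with nil up to index n, so assoc at that
;; index is valid (simplified, without transients):
(defn grow [v n]
  (into v (repeat (inc (- n (count v))) nil)))

;; put stores each key-value pair at the index given by the hashed
;; key, growing the vector when needed:
(defn put [v & kvs]
  (reduce (fn [v [k val]]
            (let [h (to-hash k)]
              (assoc (grow v h) h val)))
          v
          (partition 2 kvs)))

;; hashtable builds a new hash-table calling put on an empty vector:
(defn hashtable [& kvs]
  (apply put [] kvs))

;; fetch uses nth for fast direct access by hashed key:
(defn fetch [v k]
  (nth v (to-hash k) nil))
```
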
❷ The grow function decides if we need to allocate more space in the vector holding the hash-table by
checking the highest hash value from the entering keys. If one of the keys require an index in the
vector that is beyond its length, the vector is conj-ed all the additional times required. The transient-
persistent idiom is used to gain some speed in the process.
❸ The assign function assumes the vector holding the hash-table to be already of the right size. It uses
again the transient-persistent idiom to write many key-value pairs efficiently.
❹ with-hashed-keys is a helper function to deal with transforming a list of arguments into a list of pairs
where the first element (the key) is the hashed version.
❺ put is used both during initial construction and following updates. It first checks the arguments and
then proceeds to potentially grow the hash-table (if there is not enough room for one of the arguments)
and assigning the new elements.
❻ hashtable is used to build a new hash-table and it just calls put with an empty vector.
❼ fetch will make direct access to the hash-table to retrieve a specific value based on the hashed key.
Note the use of nth to fetch elements efficiently from the vectorized hash-table. The hash-
table implementation can be used as follows:
(def ht (hashtable :a 1 :b 2)) ; ❶
See also:
• get has a similar approach to nth with a specific focus to associative data
structures. Although get can be used with vectors, there is no specific need to do
that when the input is known to be a vector.
• vectors can also be used as a function with identical results to nth: ([1 2 3] 1).
Vectors as a function can be handy and perform quite well but they don’t support
a default value.
Performance considerations and implementation details
8.1.4 empty
function since 1.0
(empty [coll])
empty is a simple function that creates an empty collection of the same type as the
input argument:
(empty (range 10))
;; ()
empty eventually delegates to the input collection the creation of an empty instance.
Almost all collections (when (coll? coll) is true) provide this facility. Check the
rest of the chapter for a few situations where empty doesn’t work as expected (or
throws an exception).
CONTRACT
Input
• "coll" should be a collection type ((coll? coll) is true) but it can be of any
other type (including nil).
Notable exceptions
• UnsupportedOperationException when "coll" is a collection but creating an
empty version of it is not supported (see the table below).
Output
The type and value returned by empty changes based on the input collection.
The following table shows the result of empty invoked on the most common Clojure
and Java collections:
Table 8.3. Summary of the results of invoking empty on the most common collection types.
Note that queue in the table means the (PersistentQueue/EMPTY) instance. The lower
part of the table starting from transients shows some interesting results:
• transients do have an empty representation, but they are conventionally created
by the transient function only. This is a way to enforce the use of transients
(which are mutable data structures) only in explicit contexts. Allowing other
functions to return transients would escape this form of control. The same goes for
other mutable data structures like Java arrays.
• records have an identity of which attributes are an integral part. Creating an empty
record from another would essentially create a different type. This would
break the empty contract.
• A MapEntry is an implementation detail of how Clojure maps are iterated as
sequential structures. A MapEntry only exists to contain a key-value pair and it
would not make sense to have an empty one.
• nil (along with other scalar types) does not throw an exception but instead
returns nil.
Examples
One of the main goals of empty is to avoid conditionals to verify the type of a collection, so
another can be created of the same type. To see an example of this, the reader is invited
to check into, where empty was used to prevent an input collection from being transformed
into another type.
The following example shows what kind of conditional logic is replaced when
using empty:
(let [coll [1 2 3]] ; ❶
(cond
(list? coll) '()
(map? coll) {}
(vector? coll) []
:else "not found"))
;; []
(empty [1 2 3]) ; ❷
;; []
❶ “cond” checks the type of coll to decide which empty collection to return. Note that this is not an
exhaustive list and would fall short using something like (range 10) as input "coll". empty indeed
implements much more than this illustrative example.
❷ empty effectively "hides" the conditional, avoiding the cumbersome task of listing all the collection types.
Creating new nodes while maintaining the original type is a typical problem when
"walking" collections. Walking is typical of nested data structures and often involves
changes that maintain the original structure. The following example shows how we
could iterate through an arbitrarily nested data structure so a function can be applied to
certain elements. We expect the output to have the same structure as the input,
including nodes and collections of the same type:
(defn walk [data pred f] ; ❶
(letfn [(walk-c [d] (map ; ❷
(fn [k] (walk k pred f)) d))
(walk-m [d] (reduce-kv ; ❸
(fn [m k v] (assoc m k (walk v pred f))) {} d))]
(cond
(map? data) (walk-m data)
(coll? data) (into (empty data) (walk-c data)) ; ❹
:else (if (pred data) (f data) data))))
❶ The logic to walk an arbitrarily nested collection is implemented by the recursive function walk. The
function body is a “cond” instruction that differentiates between maps, other collection types and
non-collections. The need for distinguishing between maps and other collections is because they need
to be iterated and assembled in a slightly different way.
❷ walk-c is the local function dedicated to iterating collections. The function maps over the items of the
nested collection to apply walk again to each element.
❸ Similarly, walk-m is the function dedicated to iterating and rebuilding hash-maps. Clojure maps, when
iterated like collections, produce a sequence of tuples containing key-value pairs, so walk-c can’t be
used as is when a nested collection is of map type.
❹ After dealing with hash-maps in the previous branch of “cond”, we can now make good use
of empty to make sure that whatever is produced by walk-c (very likely a lazy-sequence) is always
transformed back into the original collection type (for instance a “vector”).
❺ We now invoke walk asking to increment all odd numbers any level deep.
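The invocation itself is not shown in this excerpt; a possible call (with made-up data) matching that description:

```clojure
;; Increment every odd number, at any nesting depth, preserving the
;; collection types of the input:
(walk {:a [1 2 3] :c [5 6]} odd? inc)
;; {:a [2 2 4], :c [6 6]}
```
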
See also:
• “into” is often seen in conjunction with empty to build a specific type of
collection.
• “empty? and not-empty” are predicate functions to check if a collection contains
elements. (empty? (empty coll)) is true by definition.
Performance considerations and implementation details
8.1.5 every?, not-every?, some and not-any?
function since 1.0
every?, not-every?, some and not-any? are four related functions that take a
predicate and a collection as arguments. They apply the predicate to the elements in the
collection and combine the results using boolean logic operators like AND, OR and NOT.
every?, for example, returns true if the predicate is true for each element in the
collection (equivalent to the AND concatenation (and (pred c1) (pred c2) … (pred
cN)) where c1, c2, .., cN are the items in "coll"):
not-every?, some and not-any? operate similarly by combining predicate results with
NOT AND, OR and NOT OR respectively:
❶ not-every? returns true because not all of the numbers in the list are negative.
❷ some returns true because at least one item in the collection is negative.
❸ not-any? returns true because it’s false that there is at least one negative.
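The examples behind the callouts above are not included in this excerpt; illustrative calls consistent with them:

```clojure
(every? neg? '(-1 -2 -3))   ;; true: all the numbers are negative
(not-every? neg? '(-1 2 3)) ;; true: not all the numbers are negative
(some neg? '(-1 2 3))       ;; true: at least one number is negative
(not-any? neg? '(1 2 3))    ;; true: no number is negative
```
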
• every?, not-every? and not-any? return booleans (as is evident from the naming
convention of the question mark appended at the end of the name).
• some returns the first predicate result that is not nil.
Examples
Functions like every? or some (and their complementary "not" variants) have some
idiomatic usages in conjunction with other functions in the standard library. Their use
avoids additional function calls, sometimes allowing more readable forms. For
example, the following search by phone number (or any other unique attribute) can be
written without filter or first:
(def address-book
[{:phone "664 223 8971" :name "Jack"}
{:phone "222 231 9008" :name "Sarah"}
{:phone "343 928 9911" :name "Oliver"}])
(->> address-book ; ❶
(filter #(= "222 231 9008" (:phone %)))
first)
;; {:phone "222 231 9008" :name "Sarah"}
❶ The threaded form filters the address book for the given phone number. It results in a sequence of the
filtered items that we know contains a single entry.
❷ some is used to filter out the first item in the sequence that results in the predicate being not nil. We
don’t need the additional first call to unwrap the single returned result.
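The some version the callout refers to is not shown in this excerpt; a sketch:

```clojure
;; some returns the first non-nil result of the predicate, removing
;; the need for the additional first call:
(some #(when (= "222 231 9008" (:phone %)) %) address-book)
;; {:phone "222 231 9008", :name "Sarah"}
```
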
every? also has a few idiomatic uses in conjunction with other functions in the
standard library. The following example shows how we can verify that all collections in a
list have at least one element each:
(every? seq (list [:g] [:a :b] [] [:c])) ; ❶
;; false
In the next example we are using a set as a function to check if all items in a collection
are also contained in a set. The function bingo? checks if all numbers in the card are
contained in today’s numbers by using the drawn set:
(def drawn #{4 38 20 16 87})
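The bingo? function is not included in this excerpt; a sketch using the set as the predicate:

```clojure
;; every? with the drawn set as predicate checks that all the numbers
;; on the card are members of the set:
(defn bingo? [card]
  (every? drawn card))

(bingo? [4 16 87]) ;; true
(bingo? [4 16 99]) ;; false
```
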
SOME
Despite achieving similar goals, some is slightly different from the other functions in
this section. The main differences are:
• It returns the result of applying the predicate to an element directly (there is no
boolean transformation).
• It never returns false. If "el" is an element in "coll" and the result of (pred
el) is false, some doesn’t stop and continues with the next item.
• It returns true only when the predicate itself evaluates to true for some element.
• It returns nil only if the predicate evaluates to nil or false for all elements in the
input.
The following example shows how we could use some to achieve two effects at once:
first pick a valid option and second transform values using a hash-map:
(def prizes {"AB334XC" "2 Weeks in Mexico" ; ❶
"QA187ZA" "Vespa Scooter"
"EF133KX" "Gold jewelry set"
"RE395GG" "65 inches Tv set"
"DF784RW" "Bicycle"})
❶ The prizes hash-map contains a selection of ticket numbers and their associated prizes.
❷ In the win function we can check all the tickets owned by a person. Using some we are constraining a
potential double winner (because they have more than one winning ticket) to only get the prize for the
first winning ticket. The translation from the ticket number to the associated prize happens by using
the hash-map as a predicate function.
❸ win receives lists of tickets to verify if there are winners and what prize they get. If some returns nil it
means there was no winning ticket in the list. We can check this condition with or. When no prize was
found, the default message is printed.
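The win function described by the callouts is not reproduced here; a sketch (the name and default message are assumptions):

```clojure
;; some uses the prizes hash-map as the predicate: the first winning
;; ticket in the list (if any) is translated into its prize, and or
;; supplies the default message when some returns nil.
(defn win [tickets]
  (or (some prizes tickets)
      "Sorry, no prize this time"))

(win ["XX111XX" "QA187ZA"])
;; "Vespa Scooter"
(win ["XX111XX"])
;; "Sorry, no prize this time"
```
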
(every? pos? []) ; ❶
;; true
(every? neg? []) ; ❷
;; true
❶ This result shows that every element in an empty collection is a positive number.
❷ The second result shows the exact opposite: every element in the empty collection is negative.
It might seem an illogical or contradictory result, but it’s a commonly accepted fact in mathematics that
every universally quantified statement about the empty set is true. The reason is that it’s not possible
to give counter-examples to any predicate applied to the empty set, so the answer is assigned by
convention.
See en.wikipedia.org/wiki/Vacuous_truth to know more.
See also:
• every-pred is very similar to every? but it allows multiple predicates, combining
all permutations with a logical "AND". Use every-pred if you need more than one
predicate.
• some? should not be confused with some (without the question mark). some? is the
equivalent of (not (nil? x)) to verify if something is not nil.
Performance considerations and implementation details
Despite the amount of long integers created, the memory profile remains stable with a
minimal garbage collection effort. The following image was taken during the execution
of the line above and shows VisualVM 136 inspection of the JVM memory:
136
VisualVM is a free JVM profiler that can be downloaded from visualvm.github.io
Figure 8.2. VisualVM showing JVM memory profile while running the example.
At "19:39:00" the execution kicks off after a small warm-up. The blue line representing
the used heap grows and is released a few times until the evaluation returns at about
"19:39:30". As the reader can see, the used heap is quite small compared to the 330MB
currently allocated (the orange line could grow up to the 512MB maximum allowed size
before going out of memory) and is gracefully claimed back a few times by the garbage
collector.
8.1.6 empty? and not-empty
function since 1.0
(empty? [coll])
(not-empty [coll])
empty? and not-empty are simple functions to verify if a collection is empty or the
opposite:
(empty? [])
;; true
(not-empty [])
;; nil
(empty? [1 2 3])
;; false
not-empty is not strictly the opposite of empty?, as the missing question mark indicates:
not-empty returns "coll" itself or nil to indicate logical true or false (instead of
boolean types).
CONTRACT
Input
• "coll" can be a collection type, nil, string, array or generic Java iterable
(e.g. (instance? java.lang.Iterable) is true). transients are not supported.
Notable exceptions
• IllegalArgumentException when Clojure has no way to transform "coll" into a
sequence. Apart from scalars (which are not collections, like numbers, symbols,
keywords etc.), the exception is thrown with transients and the now
deprecated structs.
Output
empty? returns true when "coll" has no elements (or is nil) and false otherwise.
not-empty returns "coll" itself when it contains at least one element and nil otherwise.
Examples
empty? and not-empty are handy functions to compose a few idiomatic forms. For
example the following snippet removes all empty strings (including nil) from a
collection, while keeping strings with at least one character:
(remove empty? [nil "a" "" "nil" "" "b"]) ; ❶
;; ("a" "nil" "b") ; ❷
The following example can be used to transform a string into a number. The same code
without making use of not-empty would require a condition with multiple branches or
throw an exception. By using not-empty, we get rid of corner cases like nil or empty
strings:
(defn is-digit [s]
(every? #(Character/isDigit %) s)) ; ❶
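The to-num function is missing from this excerpt; a sketch consistent with the callouts and with the outputs shown below (the parsing call is an assumption):

```clojure
;; not-empty at the top of the and chain guards the following
;; evaluations against nil and empty strings:
(defn to-num [s]
  (and (not-empty s)
       (is-digit s)
       (Long/parseLong s)))
```
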
(to-num nil) ; ❸
;; nil
(to-num "")
;; nil
(to-num "a")
;; false
❶ is-digit verifies if a string contains only numbers. Note that every? returns true when "s" is
empty or nil.
❷ and is short-circuiting, so as soon as one of the expressions returns nil, there are no further
evaluations. By using not-empty at the top of the chain, we make sure that the following evaluations
are not getting nil or an empty string.
❸ A few examples using to-num showing what happens when the string is not properly a number. No
exceptions are thrown and nil or false are the only possible outputs.
❹ to-num behavior can be used in conjunction with when-let so we can proceed to treat "n" as a number
only after the binding has been set.
Note how the use of not-empty with when-let achieves quite a lot of features in a small
amount of code:
(def coll [1 2 3])
❶ not-empty ensures the collection is not empty while when-let only assigns the local binding based
on the results of not-empty. The body is then evaluated as a consequence with the original input
collection.
❷ Comparison between not-empty and seq. seq is roughly equivalent to not-empty with an important
caveat: seq transforms the collection into a sequential iterator which is not compatible with the
following pop invocation.
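The comparison the callouts refer to is not shown in this excerpt; a sketch using the coll defined above:

```clojure
;; not-empty returns the vector itself, so pop works:
(when-let [c (not-empty coll)]
  (pop c))
;; [1 2]

;; seq returns a sequential view, which does not support pop:
(when-let [c (seq coll)]
  (pop c))
;; ClassCastException
```
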
See also:
• clojure.string/blank? is a function dedicated to checking for empty strings. blank? is
more robust than not-empty because it considers white spaces. For example, if a
string contains only spaces, (clojure.string/blank? " ") returns true
while (empty? " ") returns false. If string processing is the main
goal, the functions in clojure.string are in general a better fit.
• seq is widely used in the standard library to iterate over sequential
collections. seq can also be used to check if a collection is not empty with (seq
coll). The main difference between not-empty and seq to check for empty
collections is that not-empty returns the collection unaltered, while seq returns the
sequential iterator over the input collection. This factor is important if seq or not-
empty are then used in a if-let or when-let and the collection is further processed.
empty? and not-empty implementation is based on seq. This has been the subject of
many discussions in the past 137. If, on one hand, seq is not as fast as count to
determine if a collection is empty, on the other it allows lazy-sequences to stay
(almost) lazy: seq (and thus empty?) needs to realize only the first element (or chunk).
The following example illustrates the point:
(empty? (map #(do (println "realizing" %) %) (range 100))) ; ❶
;; realizing 0
;; realizing 1
;; [..]
;; realizing 31
;; false
137
Of the several threads on the mailing list about checking for empty collections with seq, this is one of the most
articulated: groups.google.com/forum/#!topic/clojure/yW1Xw1dllJ8
(count (map #(do (println "realizing" %) %) (range 100))) ; ❷
;; realizing 0
;; [..]
;; realizing 99
;; 100
❶ empty? is used to check if a lazy sequence is empty. The lazy sequence contains a side effect that
prints a message on the screen. We can see 32 messages printed before the result is returned, the
outcome of realizing the first "chunk" of a lazy (chunked) sequence.
❷ When count is used on the same input, the entire lazy sequence is realized.
8.2 Polymorphic
This chapter contains a selection of those functions lacking a strong association with a
specific collection type, but whose behavior can change drastically depending on the
input.
There is no perfect rule to decide what function should be in this section. The
following are a few reasons why “conj”, “get” and contains? have been selected, while
other functions are described elsewhere:
• Functions like assoc have similar polymorphic capabilities: assoc can be used on
vectors as well as hash-maps with very different semantics. Despite this
fact, assoc is predominantly used on hash-maps and has a strong association with
them. It would be weird for a chapter dedicated to hash-maps not to contain assoc,
which is why it doesn’t appear here.
• Functions like nth also work on many types (excluding hash-maps and sets). But
the behavior of nth is the same when used on lists, vectors, lazy sequences and so
on. nth can be considered polymorphic, but there is no drastically different
behavior you should be aware of (except for the performance profile).
NOTE More generally, the reader should see the classification implemented by the book as a tool to
better visualize the content of the standard library and to help remember its content by
association, naming or meaning.
8.2.1 conj
function since 1.0
(conj
([coll x])
([coll x & xs]))
conj (an abbreviation of conjoining) inserts one or more elements into an existing
collection. It takes the collection as the first argument:
(conj [1 "a" :c] \x) ; ❶
;; [1 "a" :c \x]
(conj '(0 1 2) 99) ; ❷
;; (99 0 1 2)
The examples show conj polymorphic behavior with respect to where the element gets
added (at the beginning or at the end) or some special input format (for
example maps require vectors of two items). In general, conj delegates to the accepting
collection to perform the most efficient insertion.
CONTRACT
Input
• "coll" can be any collection type, but arrays and Java collections (such
as java.util.ArrayList) are not supported. When "coll" is nil the empty
list () is used by default.
• "x & xs" are one or more items to be added to "coll". They can be nil.
"x & xs" can be of any type for lists, vectors, sub-vectors, sets and sequences in
general. There are restrictions for other types of "coll":
• If "coll" is a primitive type vector of type "T", then "x" needs to be of a type
compatible with "T". Compatibility can be verified by applying one of Clojure's type
conversion operators (like int, char and so on) to "x": for
example (int \a) is a valid operation, so a (vector-of :char) can accept integers.
• If "coll" is a map type (excluding sorted), then "x" can be:
1. A vector of two elements.
2. An object implementing the java.util.Map$Entry interface (one case
being the clojure.lang.MapEntry object obtained when sequencing with (first
{:a 1})).
3. Another map (like in (conj {:a 1} {:b 2 :c 3})).
4. A record definition (for example (defrecord Point [x y])).
5. An empty sequential collection like #{}, "" or (). Note that list pairs are not
supported, so (conj {} '(1 2)) throws an exception.
• If "coll" is a sorted set or sorted map then "x" needs to be comparable with the
elements already in "coll" (they must share a type or a suitable comparator). For
maps, the rule applies to keys only.
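The rules above can be checked at the REPL. The following snippets are not from the original text, but illustrate a few of the accepted input combinations under those rules:

```clojure
(conj (vector-of :int) 1 2)       ; primitive vector: elements must fit type :int
;; [1 2]

(conj {:a 1} [:b 2])              ; a vector of two elements acts as a key-value pair
;; {:a 1, :b 2}

(conj {:a 1} (first {:b 2}))      ; a MapEntry (a java.util.Map.Entry) also works
;; {:a 1, :b 2}

(conj {:a 1} {:b 2 :c 3})         ; another map: conj behaves like a merge
;; {:a 1, :b 2, :c 3}

(conj nil 1 2)                    ; nil defaults to the empty list
;; (2 1)
```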
Notable exceptions
• IllegalArgumentException when "coll" is a map type and "x" is a collection that
is not supported by seq.
Output
• returns: the result of adding "x" and any additional arguments to "coll". The
elements appear at the beginning of "coll" (for lists and lazy sequences), at the
end of "coll" (for the vector family of types) or at an undetermined position (for
unsorted collections like maps or sets).
Examples
The most frequent use of conj is to insert a single item into vectors or sequences. But as
seen in the contract, there are more ways to use conj with different collection types. The
following examples show some of the most interesting cases:
(conj () 1 2 3) ; ❶
;; (3 2 1)
(conj ; ❸
(Person. "Jake" "38")
(Address. 18 "High Street" 60160)
(Phone. "801-506-213" "299-12-213-22"))
During recursion, conj can be used to incrementally build results. The following
example shows a function that writes text snippets to disk. The function expects a
vector as input containing titles and snippets as strings. The function writes to disk and
outputs the list of the created files:
(require '[clojure.java.io :as io])
(def examples
["add" ["(+ 1 1)" "(+ 1 2 2)" "(apply + (range 10))"]
"sub" ["(map - [1 2 3])" "(- 1)"]
"mul" ["(*)" "(fn sq [x] (* x x))"]
"div" ["(/ 1 2)" "(/ 1 0.)"]])
;; ("/tmp/add/0.clj"
;; "/tmp/add/1.clj"
;; "/tmp/add/2.clj"
;; "/tmp/sub/0.clj"
;; "/tmp/sub/1.clj"
;; "/tmp/mul/0.clj"
;; "/tmp/mul/1.clj"
;; "/tmp/div/0.clj"
;; "/tmp/div/1.clj")
❶ fname joins together a folder path and a file name with extension "clj".
❷ write is a recursive function over the list of input examples. It writes snippets to disk accumulating
their absolute path at each iteration. The accumulation process is a good candidate for conj.
❸ map-indexed creates a list of tuples of number, path and content. “doseq, dorun, run!, doall, do”
destructures the output of map-indexed a couple of lines below.
❹ The first side-effect is to create one or more folders for each group of files.
❺ The second side effect creates one file for each snippet, naming the file with a sequential number.
❻ The function recurs stepping to the next 2 elements in the input using nnext. While invoking recur,
results are pushed via conj to the accumulating vector of results.
❼ We can see the list of files that were created after invoking the write function.
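The write function itself is elided from this excerpt. A hypothetical reconstruction, consistent with the callouts above (fname, map-indexed, nnext and the conj-based accumulation), could look like the following; the exact names and folder layout are assumptions:

```clojure
(require '[clojure.java.io :as io])

(defn fname [folder n]                      ; joins a folder path and a file name with extension "clj"
  (str folder "/" n ".clj"))

(defn write [examples]
  (loop [[title snippets :as input] examples
         res []]
    (if title
      (let [folder (str "/tmp/" title)
            files  (map-indexed             ; tuples of number-derived path and content
                     (fn [i snippet] [(fname folder i) snippet])
                     snippets)]
        (io/make-parents (fname folder 0))  ; first side effect: create the folder
        (doseq [[path snippet] files]       ; second side effect: one file per snippet
          (spit path snippet))
        (recur (nnext input)                ; step to the next 2 elements of the input
               (apply conj res (map first files)))) ; accumulate paths via conj
      (seq res))))
```

(write examples) would then return a list of created paths like the one shown above.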
138
The name "consel" was present in the early Lisp papers, then abbreviated to simply "cons".
The conj function was designed to decouple the insertion logic from the collection
implementation. conj delegates to the input collection to add a new element in the best possible way. For
sequences, it simply delegates back to cons logic, but for vectors it adds the new element at the end of
the vector where a "tail buffer" is available (see vector for the implementation details). Delegating to
the input argument the best way to implement an operation is a common theme in the
Clojure standard library.
See also:
• cons works on sequences (lists, ranges or lazy sequences) and seqable objects (thus
including vectors, sets, maps and many more). conj delegates to the
internal cons method for lists and sequences, so using conj is the best choice in the
general case. cons also has other applications, including building custom lazy
sequences.
• conj! is a special conj operation dedicated to transients. A transient is a special
collection state in which a normally immutable data structure can mutate. A
special set of functions (ending with an exclamation mark) is dedicated to
transients.
• “into”, similarly to conj, adds the elements of one collection into another.
Prefer into for bulk-insertion of many items, as into is optimized for this kind of
operation. Prefer conj to gradually build a collection one item at a time.
• assoc inserts a new value at some index. For maps the index is the key, for vectors
it is the zero-based ordinal position in the sequence of items. conj on vectors
always adds at the end of the vector; with assoc you have more control over
where the new element appears (completely replacing the old one).
Use assoc when the goal is to insert or replace an element at a specific
position/key.
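To summarize the difference between the three functions, here is a small comparison (not from the original text) on vectors:

```clojure
(conj [:a :b :c] :x)      ; conj always appends at the end of a vector
;; [:a :b :c :x]

(assoc [:a :b :c] 1 :x)   ; assoc replaces the element at index 1
;; [:a :x :c]

(into [:a] [:b :c :d])    ; into is the bulk counterpart of conj
;; [:a :b :c :d]
```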
Performance considerations and implementation details
Figure 8.3. Efficiency of conj for sets and maps, ordered and unordered.
conj is O(log N) for sorted maps and sorted sets. This is a remarkable feature for an
ordered collection. conj on unsorted maps (or sets) is instead near-constant time
(more precisely O(log32 N), a distinction which is irrelevant in most practical cases).
The next chart repeats a similar benchmark on some of the most used sequential types.
As expected, conj performs faster on these types because there is no need to check for
the existence of the element, or to prepare space for the new key-value pair in a specific
place in the underlying data structure.
The very fast results on lists and ranges are primarily because conj just creates a cons
cell, a small object that is linked to the rest of the collection. The reader should also
keep in mind that conj is not designed for bulk-inserts of many elements (for which
other functions like “into” are better suited).
8.2.2 get
function since 1.0
(get
([map key])
([map key not-found]))
In the group of functions that access collections (the others
being “nth” and “find, key and val”), get is specifically dedicated to maps (although it
works on other types too):
(get {:a "a" :b "b"} :a)
;; "a"
get is designed to avoid throwing exceptions, preferring nil when the collection type is
not supported:
(get (list 1 2 3) 1)
;; nil
get offers a third argument which is returned when the element is not found or the
collection is not supported:
(def colls [[1 2 3] {:a 1 :b 2} '(1 2 3)])
(def ks [-1 :z 0])
(def messages ["not found" "not found" "not supported"])
❶ We use map's variable arity to call get with one item from each of the three collections at the same time.
This, along with returning nil instead of throwing an exception, makes get quite flexible for
handling mixed-type input.
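The example exercising these definitions is elided here. A plausible reconstruction pairs each collection with a key and a default through map's variable arity (the defs are repeated so the snippet is self-contained):

```clojure
(def colls [[1 2 3] {:a 1 :b 2} '(1 2 3)])
(def ks [-1 :z 0])
(def messages ["not found" "not found" "not supported"])

(map get colls ks messages)   ; each call is (get coll k default)
;; ("not found" "not found" "not supported")
```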
CONTRACT
Input
The first argument called "map" is not limited to map types. It can also be:
• A set (sorted or unsorted, but not transient).
• A record created with “defrecord”.
• Classes implementing the java.util.Map interface (such
as java.util.HashMap object instances).
Other arguments are:
• "key" can be of any type. When "key" is a number used as an index, it can be any
positive integer up to (Integer/MAX_VALUE). When "key" is beyond that range,
the result of get can be difficult to predict, as "key" is truncated to an integer
with potential loss of precision.
• "not-found" is optional and can be of any type. "not-found" is returned when the
requested key or index is not present.
• nil is accepted as a degenerate collection type. get of nil always
returns nil unless "not-found" is provided.
When "key" is a number, "map" can additionally be:
• vector (including sub-vectors and native vector).
• A native Java array (such as those created with make-array or int-array).
• A string.
Notable exceptions
• ClassCastException is thrown when "coll" is a sorted-map or sorted-set and
"key" is not compatible with the content of "coll". To be compatible, the type of
"key" and the content of "coll" need to be the same or have a suitable comparator.
See the examples for additional information.
Output
get returns a value which has a different meaning depending on the type of "coll":
❶ get is searching for a non-existent key. The default value is thus returned.
❷ Another case of a non-existent key, but with a type that cannot be compared to the keywords in the
map.
The second interesting case in the get contract has to do with transient collections. get can
correctly access a transient map or vector, but cannot do the same on transient sets.
This is a bug that should be fixed in a future Clojure release 139:
(get (transient {:a 1 :b 2}) :a) ; ❶
;; 1
139
The problem of some transients not being compatible with collection operations is described in the following Jira
ticket: dev.clojure.org/jira/browse/CLJ-700
The last surprising behavior happens with numerical keys exceeding the maximum value
allowed for integers. get's implementation makes use of a lossy integer truncation
with (.intValue number) that can return unexpected results. Numerical keys are
allowed for vectors, strings and arrays:
(+ 2 (* 2 (Integer/MAX_VALUE))) ; ❶
;; 4294967296
(.intValue 4294967296)
;; 0
❶ A sufficiently large number can mistakenly return valid indexes when coerced into an integer.
❷ get is used on vectors, strings and arrays using a sufficiently large number. The expectation would be
for get to return nil instead.
get is the most resilient of the group of functions dedicated to element lookup in a
collection. It works on every type (even scalars) at the cost of returning nil instead of
throwing exceptions. It also accepts a nil collection as input, making it a good
candidate when the target collection could potentially be nil:
(def mixed-bag [{1 "a"} [0 2 4] nil "abba" 3 '(())])
get also accepts objects implementing the java.util.Map interface, which is typical
of some Java interoperation scenarios. For instance, Java methods like
System/getProperties or System/getenv are useful to gather information about the
running environment and they return Java maps. The following example shows how we
can search for interesting environment properties using get:
(defn select-matching [m k] ; ❶
(let [regex (re-pattern (str ".*" k ".*"))]
(->> (keys m)
(filter #(re-find regex (.toLowerCase %))) ; ❷
(reduce #(assoc %1 (keyword %2) (get m %2)) {})))) ; ❸
(search "user") ; ❺
;; {:USER "reborg"
;; :user.country "GB"
;; :user.language "en"
;; :user.name "reborg"}
❶ select-matching searches for the given key (or portion of it) inside a Java map.
❷ The regular expression is built from the given key name and used to filter the matching keys
regardless of the case.
❸ The following reduce operation builds a new Clojure map with the matching keys. The use of get is
essential to access the Java map for the related value.
❹ search wraps access to the system properties and environment, merging them together before the
actual selection.
❺ We can see the result of searching for "user", producing a Clojure map containing the matching keys.
The content could be different on other systems.
({:a 1 :b 2} :b) ; ❶
;; 2
hash-maps implement the clojure.lang.IFn interface, thus defining an invoke method that is used
when the map appears in a callable position. It supports a second argument to be used as a default
when the key is not found, exactly like the get function. get and "map as a function" even share the
same implementation, both ending up calling the method valAt() from
the clojure.lang.ILookup interface.
Java interop
Clojure maps are also instances of java.util.Map, so you can also access them with the get(Object
key) Java method:
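The interop example is elided from this excerpt; a minimal sketch:

```clojure
(.get {:a 1 :b 2} :b)   ; java.util.Map method access on a Clojure map
;; 2
```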
Find
find is similar to the other methods seen so far but wraps results in a java.util.Map.Entry object
(which Clojure extends in its own clojure.lang.IMapEntry interface). Apart from wrapping the final
result in a newly created map entry object, it shares the same implementation as get:
(find {:a 1 :b 2} :b) ; ❶
;; [:b 2]
Choosing between the possible ways to access a key in a Map is a matter of how maps are used
by different applications. Performance is less of a concern in this case, as Map access is overall a very
fast operation independent of the method used.
See also:
• “find, key and val” is similar to get, but it returns the map entry tuple (a vector of
two elements) instead of just the value. If you need to use the value but maintain
the relationship to its key, find is the perfect choice.
• “select-keys and get-in” can access multiple keys at once and returns a map with
those key-value pairs. Use select-keys to pick multiple values at once with their
corresponding keys.
• “nth” accesses an element by index. get works on vectors too, but nth is
specifically dedicated to that goal. get actually uses nth when the argument is a
vector. Unless you are interested in get flexibility with nil, prefer using nth with
vectors.
• get-in allows fetching values from within nested maps (or the different kinds of
collections supported by get) without nesting get invocations:
(def m {:a "a" :b [:x :y :z]})
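The example is truncated in this excerpt. With the map m as defined (repeated here so the snippet is self-contained), a sketch of the nested access:

```clojure
(def m {:a "a" :b [:x :y :z]})

(get-in m [:b 1])   ; :b selects the vector, 1 selects its second element
;; :y
```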
Overall get is a fast operation that shouldn't raise particular concerns (times are in
nanoseconds on a recent laptop). Sorted collections are penalized, because access
requires comparing the key at each branch of a balanced tree 140. Also note how for
sorted collections get is an O(log N) operation, while it is roughly constant time for other
collections (it's not perfectly constant time because persistent collections are
implemented on top of a very shallow tree). After sorted collections, sets are
roughly 2 times slower than maps, while vectors are 2 times faster
for get access.
In terms of get compared to other ways to access maps (illustrated earlier in the
chapter), please see the following chart.
The chart compares 6 different ways to access a key in a map at increasingly bigger
sizes (up to 1 million keys). Please note again that times are in nanoseconds and overall
we are talking about very fast operations. The bars in the chart, left to right, are:
• "get" is showing get access to the map. It scores around 30 ns average access.
• "find" is using the function “find, key and val” and, despite the creation of the
map-entry object, it performs roughly the same as get. This can be explained by
the missing check for nil and for the default value in “find, key and val” compared
to get.
• "keyword" is using the key itself to access the map. It is 10% faster than get.
• ".get clj-map" is using the Java interop ".get" Java method to access a persistent
hash-map in Clojure.
140
Red-black trees are used in Clojure to implement sorted collections. Please see the following Wikipedia entry for more
information: en.wikipedia.org/wiki/Red–black_tree
The chart essentially shows how well Clojure performs against mutable Java
HashMaps (for read-access). It also shows that using a map directly as a
function of its keys is one of the best choices both for readability and speed. get remains
a very good option when we want to take care of potentially nil maps without an
explicit condition, or when we want to access Java HashMaps.
8.2.3 contains?
function since 1.0
Other types are supported, although their use is less common. The contract section
goes into deeper details.
CONTRACT
Input
The main goal of contains? is to check for the presence of an element inside an "associative"
data structure. In Clojure "associative" is a broad category
including maps, vectors and, by their relationship with maps, records. Sets do not
implement the clojure.lang.Associative interface, but they are supported
by contains?. The following is an exhaustive list of all supported collections for the
"coll" argument and related restrictions:
• map (sorted or unsorted, but not transient).
• set (sorted or unsorted, but not transient).
• vector (including sub-vectors and native vector).
• A record created with “defrecord”.
• Classes implementing the java.util.Map interface (such
as java.util.HashMap object instances).
• Classes implementing the java.util.Set interface (such
as java.util.HashSet object instances).
• When "key" is a number, "coll" can additionally be a string or a native Java array.
• nil is accepted as a degenerated collection type.
"key" can be any positive integer up to 2^32. When "key" is beyond that range, the result
of contains? depends on the result of truncating "key" into an integer, which can be
lossy. For instance:
(def power-2-32 (long (Math/pow 2 32))) ; ❶
(contains? [1 2 3] power-2-32) ; ❷
;; true
Notable exceptions
• IllegalArgumentException when contains? is not supported on a specific
collection type. Noteworthy examples of collections throwing an exception
with contains? are clojure.lang.PersistentQueue, java.util.ArrayList and
transients in general.
• ClassCastException is thrown when "coll" is a sorted-map or sorted-set and
"key" is not compatible with the content of "coll".
Output
contains? has a different meaning depending on the type of
"coll". contains? returns true when:
• The "key" is present when "coll" is a “hash-map”, sorted-map, “array-map”.
• The "key" is in the set, when "coll" is a set or sorted-set.
• The index "key" is present when "coll" is a “vector”, sub-vector, native vector or
string. "key" must be numeric in this case.
• The index "key" is present when "coll" is a native Java array.
• There is a field "key" in a record instance, similarly to map access.
contains? returns false when "coll" is nil, and throws an exception for all other types.
Examples
contains? distinguishes between the presence of a key with a nil value and the
absence of the key in a hash-map. Other functions like get could return misleading
results when used as conditions:
(def m {:a 1 :b nil :c 3}) ; ❶
❹ To achieve the same effect with get we need a sentinel value like ::none as the default, to
differentiate it from a potential nil.
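The comparison the callouts describe is elided here. A sketch, with the map m above redefined to keep the snippet self-contained (::none resolves to :user/none in the default namespace):

```clojure
(def m {:a 1 :b nil :c 3})

(contains? m :b)     ; the key exists, despite its nil value
;; true

(get m :b)           ; nil is ambiguous when used as a condition
;; nil

(get m :b ::none)    ; sentinel default: nil here means "present, with value nil"
;; nil

(get m :z ::none)    ; the sentinel signals a truly missing key
;; :user/none
```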
Expanding on the issue, let's have a look at the output of a group of electronic devices.
A device reports nil to indicate that the sensor did not send a response when it
was requested. The following example shows how we could use contains? to verify
the presence of a nil "key" in a Clojure hash-set:
(def sensor-read ; ❶
[{:id "AR2" :location 2 :status "ok"}
{:id "EF8" :location 2 :status "ok"}
nil
{:id "RR2" :location 1 :status "ok"}
nil
{:id "GT4" :location 1 :status "ok"}
{:id "YR3" :location 4 :status "ok"}])
(raise-on-error sensor-read) ; ❸
;; RuntimeException At least one sensor is malfunctioning
❶ sensor-read is an example of a vector containing sensor data as maps. Two sensors returned no
data resulting in a nil.
❷ contains? is used to build a predicate function problems? that can be used to verify the presence
of nil in the set.
❸ In case of failed read of at least one sensor, it correctly throws an exception.
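problems? and raise-on-error are elided from this excerpt. A sketch consistent with the callouts (the names come from the callouts, the bodies are assumptions):

```clojure
(defn problems? [reads]                     ; true when at least one read is nil
  (contains? (into #{} reads) nil))

(defn raise-on-error [reads]
  (when (problems? reads)
    (throw (RuntimeException. "At least one sensor is malfunctioning"))))
```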
Note that the common approach of using a set as a function, passing nil as the
argument, would not work in this case:
((into #{} sensor-read) nil) ; ❶
;; nil
❶ Checking for the presence of nil in a set, using the set as a function, generates an ambiguous result.
The nil returned in this case is ambiguous: it could mean that nil is a
matching element, or that nil is not present in the set. contains? is the right
function for this problem.
(contains? [1 2 3 4] 4) ; ❶
;; false
❶ contains? on a vector checks indexes, not values: the element 4 is present, but index 4 is not.
❷ Vector indexes should be in the integer range. When a type other than an integer is used, contains? on a
vector always returns false.
contains? on vectors only works in the presence of an integer as the second argument. Other types are
accepted but always return false, adding to the confusion of certain expressions where the
element is clearly present in the vector.
Clojure beginners expect contains? on a vector to perform a linear search for the element in the
vector, instead of checking for the presence of the element at an index. Many discussions happened in
the past around the option of renaming contains? or introducing a different function to scan
collections for elements 141.
contains? is deliberately designed the way it is, to prevent its use in contexts where it could
generate a linear search. Other, more explicit options are available in Clojure to perform linear scans
for elements (for example some or the .contains Java interop). This prevents abuse of standard
library functions and data structures that would result in less than optimal performance.
See also:
• some can be used to perform linear scans on vectors and other sequential
collections: (some #{:a} [:a :b :c]). There are two restrictions compared to
contains?: some needs a predicate, so the element to search for usually ends up
inside a set. The other problem is that it can't be used to search for a nil element.
This can be fixed using equality as the predicate: (some #(= val %) coll) works
when val is nil.
• .contains does not belong to the standard library but is a similar Java method.
Many Clojure data structures support the java.util.Collection interface, with
the exception of maps. .contains also works on strings and allows searching for
substrings:
(.contains [:a :b :c :d] :a) ; ❶
;; true
141
This long thread from the mailing list summarizes many of the concerns related
to contains?: groups.google.com/d/msg/clojure/bSrSb61u-_8/3AmJbVYOrzwJ
;; true
❶ Compared to contains?, the Java interop version .contains verifies the presence of the
element in the collection (not the index).
❷ .contains does not work on hash-maps, but it works on hash-sets (sorted, unsorted or
transient).
❸ On strings, .contains verifies the presence of a substring.
Figure 8.7. contains? performs roughly constant time for all practical purposes.
142
HAMT (Hash Array Mapped Trie) is a data structure first presented by Phil Bagwell.
See: lampwww.epfl.ch/papers/idealhashtrees.pdf
143
Please note that records do not support more than 255 fields, so a fixed size record has been used throughout the test.
The chart shows that all collection types slowly increase in average access time
as the number of items in the collection increases. Considering the scale, it's much
easier to see this for sorted collections than for vectors. In terms of
implementation, contains? is predominantly written in the Java side of Clojure, where
a simple entry point dispatches the call to the correct data type. For vectors, the
implementation just checks if the requested index is within the length of the vector.
(rand-nth [coll])
rand-nth selects a random element from a collection (excluding maps and sets) and
returns it:
(rand-nth (range 10)) ; ❶
;; 2
(rand-nth "abcdefghijklmnopqrstuvwxyz") ; ❷
;; \b
CONTRACT
rand-nth (as implied by the name) is built on top of nth and the collection needs to be
"counted" in order to prevent exceeding the largest available index (please refer
to count for more information about counted collections). A combination of the
restrictions seen for nth and count applies to rand-nth as follows:
Input
• "coll" can be any collection type excluding maps and sets (the exclusion extends to
sorted, unsorted or transient maps and sets). rand-nth also works
for java.util.ArrayList or native arrays.
• Empty collections are not accepted as input.
• nil is accepted as a degenerate collection type and always returns nil as output.
Notable exceptions
• IndexOutOfBoundsException: when the input collection is empty.
• UnsupportedOperationException: when "coll" is not a sequential collection, for
example for scalar values (longs, chars) or unsupported data types like maps and
sets.
• ArithmeticException: if the number of elements in the collection is
beyond (Integer/MAX_VALUE).
Output
• A randomly selected element from the input collection.
• nil is returned when "coll" is nil or when nil was present in the collection and
was selected.
Examples
rand-nth could be used to retrieve a random choice from typical enumerations like the
sides of a die, or a coin toss:
(defn roll-dice [] ; ❶
(rand-nth [1 2 3 4 5 6]))
(defn flip-coin [] ; ❷
(rand-nth ["heads" "tails"]))
❶ Simple utility function roll-dice to return a number between 1 and 6 with equal probability.
❷ Equally simple helper flip-coin to return "heads" or "tails" with 50% probability each.
rand-nth is often found in games to avoid repetitiveness. It was used, for example, in
the implementation of the "rock paper scissors" game in let to randomize the computer
choice.
The following example shows a way to generate proverbs given a grammar. Although
the grammar rules are very simple, it can still generate some realistic results:
(def article ["A" "The" "A" "All"])
(def adjective ["nearer" "best" "better" "darkest"
"good" "bad" "hard" "long" "sharp"])
(def subject ["fool" "wise" "penny" "change" "friend"
"family" "proof" "necessity" "experience"
"honesty" "no one" "everyone" "every"])
(def action ["is" "is not" "are" "are not" "help" "be" "create"])
(def ending ["dying." "a dangerous thing." "a lot of noise." "no pain."
(def grammar ; ❶
[article adjective subject action ending])
(defn generate ; ❹
([] (generate 1))
([n]
(repeatedly n #(to-sentence grammar))))
(generate 5)
❶ grammar contains the recipe to assemble a sentence. Each part is a vector containing a selection of
strings to randomly select from.
❷ to-sentence takes a grammar and proceeds to assemble the final string by joining all the parts
together.
❸ rand-nth is used to pick a random choice for each part in the sentence. A better grammar would
define weights by which each token is related to others.
❹ generate can be used to produce multiple proverbs using repeatedly.
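to-sentence is elided from this excerpt. A sketch matching the callouts (the str/join approach is an assumption):

```clojure
(require '[clojure.string :as str])

(defn to-sentence [grammar]          ; one random token per grammar part
  (str/join " " (map rand-nth grammar)))
```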
As you can see, some generated sentences make more sense than others. Considering
the small amount of required code, this is still a remarkable result. For anything more
sophisticated, there are other more powerful and complicated techniques (see for
example Markov Chains en.wikipedia.org/wiki/Markov_chain).
See also:
• rand-int offers a mechanism to generate random integers within a range. The
number can then be used to access an element in a collection at that index, which
is essentially what rand-nth does. Use rand-int if you need to have control over
the index generation.
• “shuffle” returns a random permutation of the entire collection not just one
element. Use shuffle when the plan is making multiple sequential requests of
random elements. The "shuffled" collection can then be iterated without the risk of
retrieving the same element twice (something multiple calls to rand-nth would
eventually generate).
(def n (nth (map #(do (println %) ".") (range 100)) (rand-int 100))); ❷
;; prints 32 to 100 dots
❶ When rand-nth is used to pick an element at random from a lazy sequence, the entire sequence is
realized. Also note that this is a linear operation that depends on the size of the input collection.
❷ Using a combination of rand-int and nth we can avoid realizing the entire sequence on average.
This doesn't eliminate the worst case: the entire sequence can still be fully realized when the selected
element appears at the end.
(shuffle [coll])
shuffle takes a collection and returns a vector which contains a random permutation
of its elements:
(shuffle [1 2 3 4 5 6 7 8 9])
;; [1 7 3 4 5 6 2 9 8]
The algorithm used is the Fisher-Yates shuffle 144 as shipped with the Java standard library (java.util.Collections/shuffle).
144
Please refer to en.wikipedia.org/wiki/Fisher–Yates_shuffle for additional information about how the shuffling algorithm
works
145
Here is a more specific explanation of the Round-Robin algorithm: en.wikipedia.org/wiki/Round-robin_DNS
(get) ; ❼
;; calling http://10.100.23.12/index.html
;; 1
❶ round-robin prepares internal state as part of the initial let block. One step of the initialization
process consists of shuffling the list of hosts passed as arguments. This prevents other clients in a
similar initialization state from all starting to request the first host in the list.
❷ The other part of the initialization contains the index of the host the next request should be made to.
❸ The request is made by invoking "f" on the host at the current index.
❹ Finally, the index is moved forward one element in the collection of hosts. mod makes sure we restart
from the first host every time we reach the end of the list.
❺ request is the generic function to use to make requests. In a real scenario we would probably make
actual http requests. We are printing to the standard output instead.
❻ get is assigned the function returned by round-robin. It can be now used by invoking it without
arguments.
❼ Calling get prints the result on screen and returns the index in the collection of hosts to use for the next
request. The host is picked at random and it will be different if we re-initialize the get var again.
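The round-robin example itself is elided from this excerpt. A hypothetical reconstruction following the callouts (the second host URL and the get-host name are assumptions; the original binds the returned function to a var called get, shadowing clojure.core/get):

```clojure
(defn round-robin [f & hosts]
  (let [hosts (vec (shuffle hosts))   ; shuffle, so clients don't all hit the same host first
        idx   (atom 0)]               ; index of the next host to request
    (fn []
      (let [i @idx]
        (f (nth hosts i))             ; invoke "f" on the current host
        (swap! idx #(mod (inc %) (count hosts))))))) ; advance and wrap around

(defn request [host]                  ; stand-in for an actual HTTP request
  (println "calling" host))

(def get-host (round-robin request
                "http://10.100.23.12/index.html"
                "http://10.100.23.13/index.html"))

(get-host)
;; calling http://10.100.23.12/index.html   (the host depends on the shuffle)
;; 1
```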
See also:
• “rand and rand-int” are available to access randomly generated numbers.
• “rand-nth” randomly selects an item from a collection.
146
A post on comp.lang.functional explains how purely functional shuffle works. It is available
here: okmij.org/ftp/Haskell/perfect-shuffle.txt
(random-sample
([prob])
([prob coll]))
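The introductory example is elided in this excerpt. A sketch along the lines the callouts below describe (results vary between evaluations):

```clojure
(random-sample 0.5 (range 10))   ; each of the 10 items has a 50% chance to appear
;; e.g. (0 2 3 7 9)
```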
❶ random-sample with 0.5 (50%) probability is used on a sequence of 10 items. Each element has 50%
chances to appear in the output.
❷ Results could differ when the same form is evaluated again.
Note that a probability of 0.5 does not mean that exactly half of the elements appear in
the output: each element is selected independently, so the output can contain anywhere
from none to all of the elements.
When no input collection is provided, random-sample returns a transducer with the
same characteristics. The following example simulates a scenario in which a coin is
flipped repeatedly some number of times between 0 and "n":
(defn x-flip [n] ; ❶
(comp (take n) (random-sample 0.5)))
(def head-tail-stream ; ❷
(interleave (repeat "head") (repeat "tail")))
(flip-up-to 10)
;; ["head" "head" "tail" "head" "tail" "head" "tail" "tail"]
❶ x-flip is a function returning a transducer. The transducer takes at most n elements from the input
sequence and then applies a selection with 50% probability to them.
❷ head-tail-stream produces an infinite sequence of alternating head-tail strings.
❸ flip-up-to applies the transducer to the infinite stream of head-tail strings.
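flip-up-to is elided from this excerpt. A sketch based on the callouts (x-flip and head-tail-stream are repeated for self-containment; the into-with-transducer form is an assumption):

```clojure
(defn x-flip [n]
  (comp (take n) (random-sample 0.5)))

(def head-tail-stream
  (interleave (repeat "head") (repeat "tail")))

(defn flip-up-to [n]                 ; at most n flips, each kept with probability 0.5
  (into [] (x-flip n) head-tail-stream))
```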
CONTRACT
Input
• "prob" can be any number, although only the range between 0 and 1 is meaningful
to calculate the probability. Any number below zero is considered 0%, while any
number above 1 is considered 100%.
• "coll" is an optional collection input.
Notable exceptions
• ClassCastException is raised when "prob" is not a number.
Output
• returns: a lazy sequence of randomly selected items from "coll". Each item has
probability "prob" to appear in the output.
Examples
random-sample can be used to implement a simple password generator. One fact to
take into account when using random-sample is that the probability passed as an
argument influences how similar the output is to the input. Observe the following:
(take 10 (random-sample 0.01 (cycle (range 10)))) ; ❶
;; (1 7 4 9 4 9 1 9 9 5)
(take 10 (random-sample 0.99 (cycle (range 10)))) ; ❷
;;(0 1 2 3 4 5 6 7 8 9)
❶ A very low probability of 0.01 prevents many elements from being selected for the output, so the same
range from 0 to 9 needs to be cycled several times before accumulating 10 elements.
❷ A probability close to 1, on the other hand, produces a sequence that very closely mimics the input.
Using a low probability for random-sample requires a longer input sequence to produce
items in the output. If we use cycle we can repeat the same input range until random-
sample picks enough elements for the output. We can use this recipe to create a random
password generator:
(def letters (map char (range (int \a) (inc (int \z)))))
(def LETTERS (map #(Character/toUpperCase %) letters))
(def symbols "!@£$%^&*()_+=-±§}{][|><?")
(def numbers (range 10))
(def alphabet (concat letters LETTERS symbols numbers)) ; ❶
(apply str)))
(generate-password 10)
;; "C3pu@Y6Xhm"
❶ The alphabet is the concatenation of all symbols, numbers and letters (upper and lower case).
❷ cycle is used to create an infinite concatenation of the alphabet to itself.
❸ random-sample is used here with a low probability to create a sufficiently random sequence.
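Most of generate-password is elided from this excerpt, apart from its final (apply str). A sketch consistent with the callouts (the alphabet is repeated in abbreviated form for self-containment; the probability value is an assumption):

```clojure
;; abbreviated version of the alphabet defined above:
(def alphabet (concat (map char (range (int \a) (inc (int \z))))
                      "!@$%^&*" (range 10)))

(defn generate-password [n]
  (->> (cycle alphabet)            ; infinite repetition of the alphabet
       (random-sample 0.01)        ; a low probability spreads picks across many cycles
       (take n)
       (apply str)))

(generate-password 10)
;; e.g. "q3u@y6xhm$"
```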
(defn random-subset [k s]
(loop [cnt 0 res [] [head & others] s]
(if head
(if (< cnt k)
(recur (inc cnt) (conj res head) others)
(let [idx (rand-int (inc cnt))] ; pick an index in [0, cnt], as Algorithm "R" requires
(if (< idx k)
(recur (inc cnt) (assoc res idx head) others)
(recur (inc cnt) res others))))
res)))
❶ Using Algorithm "R" we can be sure that the output sample contains the required amount of elements.
See also:
• “rand and rand-int” are the primitives used in many other functions in the standard
library to deal with randomness. They generate random numbers that can be used
as indexes or as a general source of randomness.
• “rand-nth” extracts a random element from an indexed collection. Use rand-
nth when you are interested in a single random element from the input collection.
• “shuffle” returns a random permutation of the input collection. Use shuffle when
you are interested in all elements.
147
Reservoir Sampling is a family of algorithms dedicated to randomly choosing a sample from an arbitrarily large input.
Algorithm "R" is the most common example of such algorithms. Please see en.wikipedia.org/wiki/Reservoir_sampling for
more information
❶ random-sample has a simple implementation making use of rand to decide whether an element should
be included in the output.
Since there is no guarantee about how many iterations filter needs before the predicate
returns true again, we can’t be sure about the number of steps required when we
use take to sample a specific number of elements (potentially O(n) in the worst case).
Please refer to the call-out section about "Controlling the input size" for alternatives.
8.3.4 frequencies
function since 1.2
(frequencies [coll])
frequencies counts the repetition of the same item in a collection and returns the
results as a map from the item (the key) into a count (the value):
(frequencies [:a :b :b :c :c :d]) ; ❶
;; {:a 1, :b 2, :c 2, :d 1}
❶ frequencies used to count the repetition of elements in a vector. We can see 2 occurrences
of :b and :c. All other elements are unique.
frequencies is a handy tool for quickly calculating the distribution of a group of items. Note
that the grouping follows Clojure equality semantics, so equivalent collections and numbers of
different types count as the same item:
(frequencies [[1 2] '(1 2) [3 4]]) ; ❶
;; {[1 2] 2, [3 4] 1}
(frequencies [2 2N (int 2)]) ; ❷
;; {2 3}
❶ Lists, vectors and queues all belong to the same equality category (please see = for more
information).
❷ Similarly, the same number expressed in all integer types is considered the same number.
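The definitions that the following call relies on could be sketched as below, consistent with the callouts that follow (the Gutenberg URL for a plain-text copy of War and Peace is an assumption, as is the exact shape of the pipeline):

```clojure
(require '[clojure.string :refer [split lower-case]])

;; Hypothetical source: any sufficiently large plain-text book works.
(def book (slurp "https://www.gutenberg.org/files/2600/2600-0.txt")) ; ❺

(defn freq-used-words [s]
  (->> (split s #"\s+")  ; ❶ break the text into single words
       (map lower-case)  ;   normalize the case
       frequencies       ; ❷ count each distinct word
       (sort-by second >) ; ❸ highest counts first
       (take 5)))         ; ❹ keep the top five
```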
(freq-used-words book) ; ❻
;; (["the" 34258] ["and" 21396] ["to" 16500] ["of" 14904] ["a" 10388])
❶ The content of the book as a string is piped through ->>. split takes the regular expression #"\s+" to
split the text apart into single words. We then lower-case the string to remove unwanted differences.
❷ frequencies is used as is to calculate the word counts.
❸ We then need to order the result by the highest frequency descending. It is done using sort-by and a
suitable comparator.
❹ The last step is to just take the top-most 5 words to show as results.
❺ “slurp and spit” can be used on local file paths as well as remote URLs. Be sure to have a working
Internet connection before trying the example.
❻ The result shows that the most common articles and conjunctions are at the top of the list.
Parallel frequencies
Counting distinct items is associative (it doesn’t matter which order the keys are added together), so it is
relatively easy to transform it into a parallel operation. Let’s revisit the word-count example
using “Reducers” instead of frequencies. The general design now consists of a merging operation (to
bring together results coming from different cores) and a reducing operation to be used on each single
core:
(require '[clojure.core.reducers :refer [fold]])
(require '[clojure.string :refer [blank? split split-lines lower-case]])
(defn combinef ; ❷
([] {})
([m1 m2] (merge-with + m1 m2)))
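The reducing function and the driver described by callouts ❶ and ❸ below might be sketched as follows (this freq-used-words and the explicit 512 chunk size follow the surrounding description; the original code may differ in detail):

```clojure
(defn reducef ; ❶
  [m line]
  (merge-with + m
    (frequencies
      (map lower-case
        (remove blank? (split line #"\s+"))))))

(defn freq-used-words [book] ; ❸
  (->> (split-lines book) ; split into lines ahead of the parallel part
       (into [])          ; fold parallelizes over vectors
       (fold 512 combinef reducef)
       (sort-by second >)
       (take 5)))
```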
(freq-used-words book)
;; (["the" 34258] ["and" 21396] ["to" 16500] ["of" 14904] ["a" 10388])
❶ reducef is the reducing function used by fold. This will be used to reduce words on each processing
unit. We designed the reducing function to push part of the computation that previously happened
ahead of frequencies down to the parallel chunks: splitting and lower-casing are now parallel as well.
❷ combinef is used by fold to merge processed chunks back together. This is essentially merge-
with with an additional no-argument arity to provide a starting empty map {}.
❸ The main data pipeline is very similar to the sequential one. We additionally take care of splitting the
large text into lines ahead of the parallel part of the computation. This is what allows additional
processing (such as word splitting and casing) to happen in parallel. fold is used here with the
standard chunk size of 512, a quantity that could be tuned based on the size of the input text
and the number of processing cores.
The parallel word-count is only marginally faster than the sequential version, showing that the fork-join
coordination cost should be taken into account for a simple computation like counting words on this file
size and degree of parallelism (it was tested on 4 cores). For other scenarios, for example if more
computation is required on the raw string or more cores are present, the parallel version could be visibly
faster than the sequential one.
See also:
• reduce is the fundamental construct used by frequencies to iterate over the input.
Please refer to the frequencies source if your problem requires optimizing the
reducing function.
• group-by also produces a map as a result of a grouping rule. Use group-by when
you want to split a collection based on some logic and have a key to access each
group.
• partition-by produces a grouping but not as a map. The returned lazy sequence
contains other sequences grouping the initial input based on a function passed as
input. Use partition-by when the grouping needs to be iterated rather than
accessed through a key.
Performance considerations and implementation details
8.3.5 sort and sort-by
function since 1.0
(sort
([coll])
([^java.util.Comparator comp coll]))
(sort-by
([keyfn coll])
([keyfn ^java.util.Comparator comp coll]))
sort and sort-by are sorting functions in the Clojure standard library. They take an
input collection containing items of comparable type and return a sorted version based
on a comparator (compare is the default if none is given):
(sort [:a :z :h :e :w]) ; ❶
;; (:a :e :h :w :z)
sort-by adds the option of passing a function to be invoked on each item before
passing it to the comparator. The additional function allows sort-by to pre-process
items or to transform their types before comparison:
(sort-by :age [{:age 65} {:age 13} {:age 8}]) ; ❶
;; ({:age 8} {:age 13} {:age 65})
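Callout ❷ below refers to sorting otherwise incompatible types through their string representation; such a call might look like:

```clojure
;; :f, "s", \c and 'u are not mutually comparable,
;; but the strings produced by str are.
(sort-by str [:f "s" \c 'u]) ; ❷
;; (:f \c "s" u)
```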
❶ sort-by can be used to sort on a specific key of a map (instead of the entire map, for example).
❷ str can be used to transform otherwise incompatible types into strings before comparing them. The
transformation does not appear in the final result. Note that (sort [:f "s" \c 'u]) would throw an
exception instead.
CONTRACT
Input
"coll" is valid for sort or sort-by when it is supported by seq (transients and scalars in
general are not supported by seq while anything else is, including Java Iterable and
arrays). The items inside "coll" are sortable when they are nil, identical? or belong to
the following categories:
• They are numbers such that (instance? java.lang.Number x) is true for every
"x" in "coll".
• They are comparable such that (instance? java.lang.Comparable x) is true for
every "x" in "coll".
• They provide a specific implementation of compareTo(). compareTo() is the method
required by the java.lang.Comparable interface.
"comp", as declared by the type hint in the function signature, must support
the java.util.Comparator interface, which implies the presence of an int
compare(Object o1, Object o2) method available for execution. All Clojure
functions implement this interface.
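For instance, any Clojure function can stand in where a Comparator is expected:

```clojure
;; Clojure functions implement java.util.Comparator.
(instance? java.util.Comparator <)
;; true

;; So < works directly as a comparator:
(sort < [3 1 2])
;; (1 2 3)
```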
Examples
Sorting is a core operation in computer science and the subject of dedicated research. It’s
essential for the solution of many problems and this book has already used sort and sort-
by in several examples. Consider the following:
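The calls that callouts ❶ and ❷ describe might look like the following (the inputs are illustrative):

```clojure
(sort > [4 2 9 1]) ; ❶
;; (9 4 2 1)

(sort-by count > ["aaa" "a" "bb" "cc"]) ; ❷
;; ("aaa" "bb" "cc" "a")
```

Note that "bb" and "cc" compare as equal (both count 2) and keep their original relative order.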
❶ sort takes an optional comparator argument before the input collection. Clojure extends functions so
they can be used as comparators.
❷ Similarly, sort-by accepts a custom comparator before the input collection. Equal elements keep
their relative order (the sort is stable).
In the following extended example, we’re going to build a parallel (and lazy) merge-
sort on top of sort and sort-by. Basic sort and sort-by are eager: they need to
load the collection into memory to perform reordering. This is fine in many cases, but
if the dataset doesn’t fit in memory we need to operate differently.
Merge-sort is a popular sorting algorithm based on "divide and conquer" paradigm 148:
the idea is to split the initial collection, sort the smaller chunks and merge everything
back in order:
• Take advantage of Clojure “Reducers” to split and parallelize the processing of the
initial collection.
• On each parallel thread, we are going to fetch a chunk of data from some external
source, sort it and store it to disk.
• By having a relatively small amount "n" of concurrent threads, we can be sure that
only "n" chunks of data are actively loaded into memory, never loading the entire
dataset at once.
The algorithm has two separate phases: the first splits the large input into
smaller pieces, processes them in parallel and stores the ordered chunks to disk. The
second knows how to lazily merge the ordered chunks so they appear as a single
sequence to the caller. Here’s how we could go about implementing the first phase:
(require '[clojure.java.io :as io])
(require '[clojure.core.reducers :as r])
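save-chunk! (callout ❶ below) and the load-chunk function used later by psort could be sketched as follows. This simplified version serializes a whole chunk with pr-str; the original may stream more lazily:

```clojure
(defn save-chunk! [coll] ; ❶
  (let [file (java.io.File/createTempFile "chunk" ".edn")]
    (spit file (pr-str coll)) ; side effect: store the sorted chunk on disk
    file))                    ; return the file handle

(defn load-chunk [file]
  (read-string (slurp file)))
```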
(defprotocol DataProvider ; ❷
(fetch-ids [id-range]))
148
As usual Wikipedia has a very detailed article about Merge-Sort available at en.wikipedia.org/wiki/Merge_sort
(extend-type IdRange ; ❻
DataProvider
(fetch-ids [id-range]
(shuffle (range (:from id-range) (:to id-range)))))
❶ save-chunk! is dedicated to the creation of a temporary file and storing of a sorted collection to disk.
The exclamation mark is there to remember the side effect of calling this function: it stores on disk and
returns a file handle.
❷ The DataProvider protocol is dedicated to potential clients of our algorithm. Fetching the data from
some external source is the only portion of the business logic that doesn’t depend on merge-sort, so
it’s a good idea to extract it prominently and offer an easy way to plug in different data logic.
This is done further below through the use of extend-type.
❸ process-leaf is the core part of the algorithm, the computation that happens on each thread in
parallel. It collects the operations that we need to perform on each chunk of data: fetch the IDs, sorting
the resulting data, save the results.
❹ The “Reducers” library offers different entry points. We decided to encapsulate the logic about
how fold is going to behave in a new data type. IdRange is a record of two keys: from and to. They
represent an integer range, something typically offered by databases as primary key for a table.
The defrecord could be different to reflect a different storage system which is not based on
integers to uniquely identify records. What is important is that there is a way to express a partition of the
full dataset without loading the actual data. A unique identifier is normally present in most systems.
The second important aspect of the defrecord definition is extending the CollFold protocol from
the reducers namespace. By extending CollFold we can use an IdRange type as the last parameter
of a fold call and have the call routed to our custom implementation.
❺ The split happens with this if condition. If the chunk size is below the threshold (the chunk size
determines how much data is loaded in memory in parallel before storing it to disk) we proceed with
the sort operation. If the chunk is still too big, we fork-join the task of performing the same operation
on the newly produced splits. Functions like fjinvoke or fjfork are the lowest Clojure primitives
interacting with the Java side of the fork-join framework.
❻ Now that the IdRange type is defined, we can extend it to support the fetch-ids operation. This is
performed on each IdRange object inside process-leaf. fetch-ids, when invoked on
a IdRange object, delegates to the implementation specified with this extend-type instruction.
❼ fold, invoked on an IdRange instance, returns a list of file handles containing the sorted chunks.
Now that we are able to split, fetch and store sorted chunks of data, the second phase
consists of merging back the files together without loading them all at once. Since the
chunks are ordered, we can just look at the first item in each chunk to know which
should come first. At every iteration, we proceed to lazily load the next chunk of data,
as implemented by sort-all:
(defn sort-all ; ❶
([colls]
(sort-all compare colls))
([cmp colls]
(lazy-seq
(if (some identity (map first colls))
(let [[[win & lose] & xs] (sort-by first cmp colls)] ; ❷
(cons win (sort-all cmp (if lose (conj xs lose) xs))))))))
(defn psort ; ❸
([id-range]
(psort compare id-range))
([cmp id-range]
(->> (r/fold 10000 concat (partial sort cmp) id-range)
(map load-chunk)
(sort-all cmp))))
❶ sort-all assumes a collection of pre-sorted chunks (also collections) as input. It then lazily iterates
the first item in each coll, searching for the smallest (or biggest, depending on the comparator)
element and gradually forming a lazy sequence. Destructuring is quite helpful here to remove many
occurrences of first and rest.
❷ We use sort-by on each iteration to find the next ordered element between all collections.
❸ psort is the main entry point (it means "parallel sort"). It offers a few defaults and prepares the call
to fold. When the list of file handles is available, it lazily loads the files from disk and calls sort-all.
Only as much of each file as necessary is loaded into memory when we take from psort.
❹ Finally, we can invoke psort and see the results as expected.
Unless the client realizes the entire sequence, psort is never going to load the entire
dataset in memory. psort additionally allows some configuration to take place, for
example to change the comparator or provide specific logic to fetch the data.
See also:
• “compare” is the default Clojure comparator. It returns -1, 0, or 1 based on
comparing the two arguments.
• A custom predicate can be transformed into a comparator with comparator.
• sorted-set, “sorted-map and sorted-map-by” can be used to build ordering
incrementally, as the elements arrive. They also accept a custom comparator.
Performance considerations and implementation details
149
Timsort was implemented first for the Python language and subsequently adopted by Java. Please
see en.wikipedia.org/wiki/Timsort for more information
Figure 8.8. sort is used to sort arrays with different levels of ordering.
The chart shows that sort is only marginally faster on a collection which is 95%
already sorted compared to a collection 10% sorted. Surprisingly, sort is faster when
the collection is completely unsorted (alternating ascending/descending contiguous
pairs) compared to 95% sorted (but this last case is not common in real life scenarios).
8.3.6 group-by
function since 1.2
(group-by [f coll])
group-by groups the elements of an input collection based on the result of a function.
Each distinct result becomes a key in the output hash-map, while a vector collects the related
values:
(group-by first ["John" "Rob" "Emma" "Rachel" "Jim"]) ; ❶
;; {\J ["John" "Jim"], \R ["Rob" "Rachel"], \E ["Emma"]}
❶ first is called on each element in the collection, returning the first letter of each name. The letter is
used as the key entry in the resulting hash-map. If two items have the same initial, they are grouped
together in the same vector.
Contract
Input
• "f" is a required function argument. "f" is invoked with each item in "coll" and can
return any result. It can be nil when "coll" is also nil (or empty).
• "coll" is also a required argument. It can be nil or empty. If "coll" is not nil, "coll"
needs to implement the Seqable interface such that (instance?
clojure.lang.Seqable coll) returns true (transients are not supported).
Notable exceptions
• NullPointerException when "f" is nil and "coll" is not empty.
Output
• A hash-map containing the results of invoking "f" on each item in "coll" (as keys)
and the grouping of the items in "coll" (as values). As a consequence, if a key exists
in the map then its value must be a vector containing at least one element.
Examples
group-by is a flexible function with a broad application scope. We can use group-by to
create dictionary-like data structures out of plain sequences, using the grouping
function to decide how the values should aggregate. In conjunction with juxt, group-
by allows us to further refine the grouping rules. juxt determines the creation
of a composite key vector:
(group-by (juxt odd? (comp count str)) (range 20)) ; ❶
;; {[false 1] [0 2 4 6 8]
;; [true 1] [1 3 5 7 9]
;; [false 2] [10 12 14 16 18]
;; [true 2] [11 13 15 17 19]}
As you can see, juxt forms vectors of results based on the functions passed as input.
The range of 20 is split by odd/even numbers and then again based on the count of
digits they have.
Let’s use group-by now to search for anagrams. Anagrams are permutations of the
same group of letters, which in our case represents the key once ordered:
(def dict (slurp "/usr/share/dict/words")) ; ❶
(->> dict
(re-seq #"\S+") ; ❷
(group-by sort) ; ❸
(sort-by (comp count second) >) ; ❹
(map second) ; ❺
first)
❶ "/usr/share/dict/words" is a file present in most of Unix based systems. If you don’t have one on your
system, you can use the plain text version of War and Peace available from tinyurl.com/uyovxow (a
Github link), or any other sufficiently large file of words.
❷ The first step is to split the large string into single words. We can use re-seq to achieve the goal using a
regular expression.
❸ After the list of words is created, we feed it to group-by using sort. This creates an ordered list of the
characters in each word, allowing group-by to see which words are formed by the same letters.
❹ Using sort-by, we can sort by the count of grouped words, starting from the larger group.
❺ Now we eliminate the key and just keep the list of words (the second in each vector pair). The first list
of 9 anagrams is visible in the output.
We can now extend the previous example to also force the presence of the letter "x" in
a word using juxt. The example only needs a couple of changes:
(def dict (slurp "/usr/share/dict/words"))
(->> dict
(re-seq #"\S+")
(group-by (juxt #(some #{\x \X} %) sort)) ; ❶
(filter ffirst) ; ❷
(sort-by (comp count second) >)
(map second)
(take 3))
❶ Along with sort, we use juxt to require additional rules for a word to enter an anagram group. some is
used with a set literal as a predicate to verify if the letter "x" exists in the word.
❷ We also need to eliminate all keys where the "x" component was not found. When a hash-map is used
as the input for filter it decomposes into a sequence of vector pairs, where the first element is the key
(which is again a vector pair containing the result of the some operation in first position). By taking
the ffirst we are taking the first of the first item from the key. If that is nil then the word doesn’t contain
the letter "x".
See also:
• partition-by does not produce a map, but it creates nested sequences inside the
input collection based on the changing results of a function. partition-by creates
a sequential grouping, where nested parentheses separate the original items
without a key. Use partition-by when you don’t need access by key or when
you need laziness.
• “frequencies” also returns a map, but where the original input items are the key
and the values are the number of their repetitions.
Figure 8.9. group-by performance by number of key collisions. The chart shows group-by
performance for a collection with the same input size that changes in the number of items that
can be grouped. The ratio "x/y" indicates the number of keys "x" in the final map containing "y"
items each.
The chart shows that appending items to an already existing vector is faster than
introducing new keys to the map, although in relative terms, the speedup is not huge
(going from 150 to 50 microseconds).
8.3.7 replace
function since 1.0
(replace
([smap])
([smap coll]))
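The basic behavior described by callout ❶ below can be seen with a simple illustrative call:

```clojure
;; Keys found in the substitution map are swapped for their values;
;; a vector input produces a vector output.
(replace {:a 1, :b 2} [:a :b :c]) ; ❶
;; [1 2 :c]
```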
❶ replace takes a dictionary of substitutions and replaces keys appearing in the input collection with the
corresponding values. Note that a vector is returned when the input is a vector.
(transduce
(comp (replace {"0" 0})
(map inc)) ; ❶
+
["0" 1 2 "0" 10 11])
;; 30
Contract
Input
• "smap" is an associative data structure (a data structure supporting access by key)
and it’s a mandatory argument. Types supported are vectors (including transient
vectors, subvectors and vectors of primitives) and maps (including transient hash-maps,
array maps, sorted maps and Java HashMap types). "smap" can also be nil or
empty.
• "coll" is a collection and is an optional argument. When not
provided, replace returns a transducer. Almost all collection types are accepted
with a few exceptions. "coll" can also be nil or empty.
Notable exceptions
• IllegalArgumentException when "smap" is not associative, that
is (associative? smap) returns false.
• IllegalArgumentException when it’s not possible to obtain a sequential version
of the "coll" (most notably, transients).
Output
• returns: "coll" with "smap" substitutions. nil when "coll" is nil. The return type
is a vector when "coll" is a vector, or a sequence otherwise.
Examples
The most common type of substitution dictionary for replace is a map. It also accepts
a vector (another associative data structure). In this case it uses the index of the vector
to match the element to replace:
(replace [:a :b :c] (range 10)) ; ❶
;; (:a :b :c 3 4 5 6 7 8 9)
❶ vectors can be used as containers for substitutions. Each item in the vector is indexed by its position,
creating a relationship equivalent to the map: {0 :a 1 :b 2 :c}
It’s also possible to replace key-value pairs in maps, although it is a less common
operation:
(def user {:name "jack" :city "London" :id 123})
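The substitution the callouts below describe might be sketched as follows (the sub dictionary and the new city value are illustrative):

```clojure
;; Keys in the dictionary are whole MapEntry objects, built
;; with the static MapEntry/create method. ; ❶ ❷
(def sub {(clojure.lang.MapEntry/create :city "London")
          [:city "New York"]})

(into {} (replace sub user)) ; ❸
;; {:name "jack", :city "New York", :id 123}
```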
❶ Instead of searching for a key, we need to search for the entire map entry, including the value. When
a map is iterated sequentially, it returns a list of MapEntry objects, which is what we need to match
against. There is no Clojure function to create a MapEntry, but we can call the static create method
to the same effect.
❷ The dictionary of substitutions contains MapEntry objects as keys and vector pairs as values.
❸ Once replaced with replace we need to turn the sequential list of vector pairs back into
a map with into.
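The next, transducer-based example fills a text template; text and sub below are illustrative definitions consistent with the callouts that follow:

```clojure
;; Hypothetical template and substitution dictionary:
;; placeholders are whole words wrapped in curly braces.
(def text "Dear {user} your order has shipped to {city}")
(def sub {"{user}" "Jack" "{city}" "London"})
```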
(transduce
(comp
(replace sub) ; ❶
(interpose " "))
str
(clojure.string/split text #"\s")) ; ❷
❶ Placeholders are represented by curly braces enclosing an identifier. Once the text has been split into
words, they are isolated as a vector of sub-strings. The replace can be the first transducer to be
applied in the chain, followed by interpose to restore the missing spaces.
❷ We use string/split to split the string into a vector of sub-strings ready for processing.
See also:
• string/replace is a function with the same name in the clojure.string namespace.
It provides regular expressions based textual replacement for strings.
Use clojure.string/replace if replacements are easy to describe with a regular
expression and the input is text.
• clojure.walk/prewalk-replace works similarly to replace, but additionally walks
nested data structures to apply substitutions.
• reduce-kv is another way to transform a map into another. It gives more control
than replace over selecting the key-value pairs to substitute and over the actual
substitution semantics. Prefer reduce-kv to replace on maps for all non-trivial
transformations.
Performance considerations and implementation details
⇒ O(n) linear
replace needs to fully iterate the input to replace matching elements, so the number of
computation steps increases linearly with the length of the input collection.
replace also needs to perform a lookup for each element in the collection. Since the
lookup is almost constant time (O(log32 N)), there shouldn’t be any visible degradation,
unless huge dictionaries are involved:
(require '[criterium.core :refer [quick-bench]])
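The benchmark the callouts describe could be sketched as below (large-map is a hypothetical helper building a dictionary with 1 million keys):

```clojure
(defn large-map [n] ; ❶ a map of n identical key/value pairs
  (zipmap (range n) (range n)))

(let [coll (range 1e6)]
  (quick-bench (doall (replace {0 :zero} coll)))        ; small dictionary
  (quick-bench (doall (replace (large-map 1e6) coll)))) ; ❷ 1M-key dictionary
```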
❶ With the help of large-map function, we create a map with 1 million keys.
❷ We use a small map and a large map to invoke replace using the same input collection of 1 million
items. As expected, the size of the dictionary is also influencing the results, although dictionaries of 1
million keys are not common.
Now let’s have a look at the difference in using a vector or a sequence as input:
(let [s (range 1e6) ; ❶
v (into [] s)]
(quick-bench (doall (replace {:small "map"} s)))
❶ In this benchmark, we have a look at the differences in performance when we feed replace with a
sequence or a vector. replace has two different implementations for them.
❷ replace is just a bit slower on sequences and at the same time it offers laziness.
❶ In this benchmark, we compare replace and the replace transducer. We need to remember
to doall on the resulting lazy sequence to fully realize the results.
❷ We can see that the transducer version adds some more time to the iteration.
We can see that the transducer version is almost twice as slow as the normal version.
The reader in this case should consider that the real advantage of transducers is when
they are combined to perform multiple transformations at once without generating
intermediate sequences.
8.3.8 reverse
function since 1.0
(reverse [coll])
reverse, as the name suggests, returns an inverted list of the elements in a collection:
(reverse [9 0 8 6 7 5 1 2 4 3]) ; ❶
;; (3 4 2 1 5 7 6 8 0 9)
❶ reverse takes a collection as input and returns the elements in the input collection in reverse order.
While other sequential operations seen so far produce a lazy sequence, reverse is a
rare example of a function producing a clojure.lang.PersistentList data structure:
(type (reverse [1 2 3])) ; ❶
;; clojure.lang.PersistentList
Contract
Input
• "coll" is mandatory and can be of any type supported by seq.
Notable exceptions
• IllegalArgumentException if you try to reverse something that does not offer a
sequential version (a transient for instance).
Output
• returns: a persistent list containing the input items in reverse order.
Examples
Clojure beginners tend to use the idiom sort then reverse to put a collection in reverse
order. This performs an unnecessary extra pass: we can instead use sort with a comparator:
(reverse (sort (shuffle (range 10)))) ; ❶
;; (9 8 7 6 5 4 3 2 1 0)
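The alternatives that callouts ❷ and ❸ describe might look like:

```clojure
(sort > (shuffle (range 10))) ; ❷
;; (9 8 7 6 5 4 3 2 1 0)

;; Swapping the arguments of compare reverses the default order:
(sort #(compare %2 %1) ["ant" "zebra" "horse"]) ; ❸
;; ("zebra" "horse" "ant")
```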
❶ An inefficient use of reverse to order a collection starting from the bigger element. This is definitely
possible, but sort supports a custom comparator to provide a precise ordering.
❷ We could use the more efficient comparator ">" with sort.
❸ If the input is not numeric, we can create a custom comparator. Strings are comparable,
so compare works with them directly.
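The DNA input for the example below can be any string over the A, T, C, G alphabet; this particular value is reconstructed as the reverse complement of the printed result, so it should reproduce the output shown:

```clojure
;; Reconstructed input (reverse complement of the expected output).
(def DNA "CTATCTTTTAATCGGTTCTTGCAGTGAGATACATTCCACATGCCCGACTT")
```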
(->> DNA
reverse ; ❶
(replace {\A \T \T \A \C \G \G \C}) ; ❷
(apply str)) ; ❸
;; "AAGTCGGGCATGTGGAATGTATCTCACTGCAAGAACCGATTAAAAGATAG"
150
We can’t add too much detail in the book regarding DNA transcription. However, the principles are clearly explained in
this Wikipedia
entry: en.wikipedia.org/wiki/Complementarity_(molecular_biology)#DNA_and_RNA_base_pair_complementarity
The proposed solution takes advantage of the sequential nature of strings to decompose
the input into single letters, inverting the sequence, applying substitutions and then
composing the result back into a string. Although the solution is not the most efficient
(we are going to see a faster version in rseq), it’s definitely simple and readable.
See also:
• rseq inverts a sequence in constant time, but only those implementing
the clojure.lang.Reversible interface (essentially vectors, sorted sets and sorted maps).
• sort can be used instead of reverse if you also need to order the collection before
reading it in reverse. The reverse ordering can be obtained while sorting the
collection, without the need of an additional reverse step.
Performance considerations and implementation details
⇒ O(n) linear in n
reverse works by pushing each input item in a persistent list: the first item goes in
first, then it "cons-es" the second and so on, obtaining the reverse effect typical of
"cons-ed" lists ("cons-ed" is used colloquially in Lisp to identify a linked list built by
consecutively invoking cons on the input). Thus reverse is not a lazy operation:
(first (reverse (map #(do (print % "") %) (range 100)))) ; ❶
;; 0 1 2 3 4...98 99 99
Please note that it’s not possible to have a sub-linear (less than O(n)) implementation
of reverse. An in-place reverse on a mutable data structure still requires n/2 swaps.
rseq achieves constant time by creating a lazy reverse indexing of the
input; however, it becomes linear as soon as the sequence is consumed. Let’s compare
the two approaches in the case of a fully consumed reverse sequence:
(require '[criterium.core :refer [quick-bench]])
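The three measurements the callouts refer to could be sketched as below (the collection size is illustrative):

```clojure
(let [r (range 1e6)
      v (vec r)]
  (quick-bench (reverse r))       ; ❶ reverse on a lazy range
  (quick-bench (reverse v))       ; ❷ reverse on a vector
  (quick-bench (doall (rseq v)))) ; ❸ rseq needs doall for a fair comparison
```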
❶ The first benchmark measures reverse on a long range, a typical case with lazy sequences.
❷ In the second benchmark, we use again reverse on a vector instead, a data structure that is more
suitable for rseq.
❸ Finally we compare with rseq. Note that we now need to doall the reverse sequence.
The benchmark on a collection of 1 million items shows that the results (for fully
realized results) are very similar between reverse and rseq. However, when a
reversible input is available, rseq remains the best choice to consume smaller parts of
the reversed sequence.
8.4 Traversing
Data structures in Clojure are even more fundamental than in other languages. Clojure
applications not only use data, they are designed on top of it. As a consequence,
arbitrarily nested and multi-typed data structures are common, especially as part of the
data exchange between distributed systems.
Nested data naturally models as a tree. Let’s take the following nested data structure:
{:t 'x ; ❶
:n [{:t 'y :n [{:t 'x :n false}
{:t 'k :n [{:t 'h :n :halt}]}]}
{:t 'y :n "2011/01/01"}
{:t 'h :n [{:t 'x :n 90.11}]}]}
❶ An arbitrarily nested map which includes vectors and other data types.
There isn’t a single best way to model this data as a tree. We could for example
establish the following convention:
• The presence of a vector indicates branching: the item containing the vector
becomes a parent node and the items inside the vector are children nodes.
• The value at ":n" decides if there is additional branching.
• If the ":n" key does not contain a vector, the entire map at that level is a terminal
node (also referred to as a "leaf" node).
• Anything other than the ":n" key is "data" belonging to the node.
The tree formed by feeding the example data to the convention described above would
look like the following picture:
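The walk/walk demonstration that the callouts below describe might be sketched as:

```clojure
(require '[clojure.walk :as walk]) ; ❶

(defn inner [x] (print "inner:" x "") x) ; ❷ identity plus printing
(defn outer [x] (print "outer:" x "") x)

(walk/walk inner outer [1 [2] 3])
;; inner: 1 inner: [2] inner: 3 outer: [1 [2] 3]
;; [1 [2] 3]
```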
❶ All functions in this section require explicit require of the clojure.walk namespace.
❷ "inner" and "outer" are identity functions that additionally print their argument. After feeding them
to walk/walk we can see that "inner" evaluates on each item of the input while "outer" executes just
once at the end.
NOTE clojure.walk/walk is not particularly interesting on its own, because it’s not recursive. It is
however the fundamental polymorphic step for all other clojure.walk functions. We are
going to see how clojure.walk/walk helps traversal when talking
about prewalk and postwalk.
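A run of prewalk-demo on the same input used later for postwalk-demo shows the pre-order visit:

```clojure
(require '[clojure.walk :refer [prewalk-demo]])

(prewalk-demo [1 [2 [3]] 4]) ; ❶
;; Walked: [1 [2 [3]] 4]
;; Walked: 1
;; Walked: [2 [3]]
;; Walked: 2
;; Walked: [3]
;; Walked: 3
;; Walked: 4
;; [1 [2 [3]] 4]
```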
❶ prewalk-demo executes a depth-first pre-order traversal of an arbitrarily nested data structure using a
printing function for illustrative purposes.
prewalk-demo
With prewalk-demo, we see the effect of traversing the data structure depth-first,
printing a debug message ahead of visiting each node and while "moving down" the
traversal. The traversal path formed by the nested vector in the example is illustrated
by the following picture:
Figure 8.11. Depth-first, pre-order traversal of a simple tree. The continuous line shows the
traversal path, while the little cameras represent the call to the visiting function.
postwalk-demo
Similarly, postwalk-demo shows the effect of traversing the tree depth-first but only
printing the node when ascending from a visited node:
(postwalk-demo [1 [2 [3]] 4]) ; ❶
;; Walked: 1
;; Walked: 2
;; Walked: 3
;; Walked: [3]
;; Walked: [2 [3]]
;; Walked: 4
;; Walked: [1 [2 [3]] 4]
;; [1 [2 [3]] 4]
Differently from prewalk-demo, postwalk-demo prints the message coming back from
each node and only after reaching the bottom of a branch, as illustrated by the next
diagram:
Figure 8.12. Depth-first, post-order traversal of a simple tree. The continuous line shows the
traversal path, while the little cameras represent the call to the visiting function.
There are good reasons to execute the visiting function ahead or after visiting each
node: in the pre-order case, we have a chance to alter the traversal path by changing
elements in the node. With a post-order visit we can process the output of the traversal,
for example "reducing" the tree. We are going to see examples of both in the following
section.
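As a quick taste of the post-order case, a postwalk visit can "reduce" a tree bottom-up, for example collapsing every nested vector into the sum of its (already reduced) children:

```clojure
(require '[clojure.walk :as walk])

;; Each vector is replaced by the sum of its children, which have
;; already been reduced by the time the parent is visited.
(defn sum-tree [t]
  (walk/postwalk #(if (coll? %) (reduce + %) %) t))

(sum-tree [1 [2 [3]] 4])
;; 10
```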
(postwalk [f form])
(prewalk [f form])
Input
• "f" is a function of one argument. The function evaluates for each nested
collection or other type in "form". It’s mandatory argument.
• "form" can be of any type (with a couple of exceptions, see below) including nil.
If "form" is a collection, then "form" is iterated recursively and "f" called on each
item in turn. It’s a mandatory argument.
Notable exceptions
• UnsupportedOperationException is possible if "form" is recognized as a Clojure
collection type but the type does not implement all the necessary functions. One
rare case in the standard library is bean, which produces a map-like representation
of an object but doesn't provide an empty method.
Output
• The output type for both prewalk and postwalk depends mainly on the
transformations operated by "f". In general usage, the output is a collection of the
same type as the input.
Examples
prewalk executes a depth-first traversal of an arbitrarily nested collection, calling a
function "f" on each item before descending into any nested item (also see “prewalk-
demo” for more information):
(require '[clojure.walk :refer [prewalk]]) ; ❶
(prewalk #(doto % println) [1 [2 [3]]]) ; ❷
;; [1 [2 [3]]] ; ❸
;; 1 ; ❹
;; [2 [3]] ; ❺
;; 2
;; [3]
;; 3 ; ❻
;; [1 [2 [3]]] ; ❼
prewalk
In the following example, we are going to use prewalk to prevent processing of a large
branch in a deeply nested data structure. If a node is of type "pipeline" we don’t want
to execute any ":action" in the current or nested nodes:
(def data ; ❶
{:type "workflow"
:action '(do (println "flowchart") :done) ; ❷
:nodes [{:type "flowchart"
:action '(do (println "flowchart") :done)
:nodes [{:type "workflow"
:action nil
:nodes false}]}
{:type "routine"
:action '(do (println "routine") :done)
:nodes [{:type "delimiter"
:action '(println "delimiter")
:nodes "2011/01/01"}]}
{:type "pipeline"
:action '(do (println "pipeline") :done)
:nodes [{:type "workflow"
:action '(Thread/sleep 10000) ; ❸
:nodes 90.11}]}
{:type "delimiter"
:action '(do (println "pipeline") :done)
:nodes [{:type "workflow"
:nodes 90.11}]}]})
;; flowchart
;; routine
;; delimiter
;; pipeline
;; "Elapsed time: 4.098095 msecs"
;; {:type "workflow", :action (do (println "flowchart") :done), :nodes [{:type
"flowchart", :action (do (println "flowchart") :done), :nodes [{:type "workflow",
:action nil, :nodes false}]} {:type "routine", :action (do (println "routine")
:done), :nodes [{:type "delimiter", :action (println "delimiter"), :nodes
"2011/01/01"}]} {:type "pipeline", :action (do (println "pipeline") :done)} {:type
"delimiter", :action (do (println "pipeline") :done), :nodes [{:type "workflow",
:nodes 90.11}]}]}
❶ data is a small section of a much larger data structure that contains nodes that are very expensive to
process. We still want to process the data, but we want to skip any wasteful processing.
❷ Each node contains :type, :action and :nodes keys. The :action needs evaluation but we don’t
want to evaluate any "pipeline" action, including sub-nodes. When the action evaluates, the keyword
":done" appears in the output data.
❸ To show that prewalk is not processing the entire tree, a Thread/sleep call adds a 10-second delay
if evaluated.
❹ The step function contains the necessary logic. If a node is of type "pipeline", no action gets
evaluated. The nested nodes are removed to prevent any further evaluation and they won’t appear in
the output.
❺ Calling time on prewalk immediately reveals that there is no 10-second wait. At the same time, other
actions are evaluated as confirmed by the printouts of the different node types. Finally, the output data
is the same as the input but the "pipeline" sub-nodes have disappeared and evaluated actions are replaced
with ":done".
If we used postwalk in the example above, we would see that prewalk and postwalk
produce the same output but not the same side effects:
(require '[clojure.walk :refer [prewalk postwalk]])
;; flowchart ; ❷
;; flowchart
;; routine
;; delimiter
;; pipeline
;; flowchart
;; delimiter
;; routine
;; pipeline
;; flowchart
❶ This equivalence demonstrates that prewalk and postwalk produce the same result. However, the
ordering of side effects (if any) and the computational cost differ.
❷ The printouts correspond to the type of the nodes, prewalk first and postwalk next. As you can see,
the ordering is different.
❸ Another clue that postwalk is unable to prevent evaluation of sub-nodes comes from the 10 seconds
necessary to return the result.
While prewalk is useful to reason about the structure of nested data ahead of
processing, postwalk is perfect to process branches ahead of the parent node. A typical
case is representing an expression as a tree where nodes are the operators and branches
the operands. The operator cannot process the operands until they are of the correct
type (for example numbers), but this requires processing the operands first (the
equivalent of evaluating arguments before passing them into a function). To illustrate the
point, let’s take the formula to calculate compound interest that we saw when talking
about map, along with its representation as data:
(defn compound-interest ; ❶
[rate loan-amount period]
(* loan-amount
(Math/pow
(inc (/ rate 100. 12))
(* 12 period))))
(defn compound-interest-data ; ❷
[rate loan-amount period]
{:function *
:children
[loan-amount
{:function #(Math/pow %1 %2)
:children [{:function inc
:children [{:function /
:children [rate 100. 12]}]}
{:function *
:children [12 period]}]}]})
❶ compound-interest is a formula that calculates the total cost of a loan at the given yearly rate and
period (please see the first example in map for more details). Code is data: in this case nested lists
are interpreted by Clojure as function calls.
❷ compound-interest-data is the same function expressed using a different data structure made by
maps and vectors.
❶ evaluate is a function of a node. If the node contains a function, then the function gets invoked on the
children of the node. The operation succeeds only if the children are also evaluated, which happens
only if we evaluate the nodes starting from the leaves. Note that the result of calling apply replaces the
node.
❷ We can see how much you are required to pay for a loan of $5000 with an annual interest of 7.2%
which is paid back in 2 years.
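The evaluate function itself is not reproduced in this excerpt; based on the description in the annotation above, a minimal sketch (the name is assumed) might be:

```clojure
(require '[clojure.walk :refer [postwalk]])

;; Apply the node's function to its (already evaluated) children;
;; the result of apply replaces the node.
(defn evaluate [node]
  (if (and (map? node) (:function node))
    (apply (:function node) (:children node))
    node))

(postwalk evaluate (compound-interest-data 7.2 5000 2))
;; => approximately 5772.0
```

Since postwalk processes leaves first, the inner maps are already replaced by numbers by the time each outer :function is applied.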
prewalk, in this case, would be a difficult choice. prewalk would call evaluate while
descending each node, when children are not evaluated yet. postwalk instead calls
evaluate on leaves first, then on nodes while ascending back to the root, which is
exactly what we expect when applying a function to its (evaluated) arguments.
See also:
• clojure.zip is a different way to traverse nested data structures while maintaining
the traversal state. Use zippers to create a relationship between the state of the
program and the state of the traversal and, in general, take complete ownership of
the traversal policies.
• tree-seq flattens a depth-first pre-ordered traversal into a lazy sequence for
additional processing. Use tree-seq to leverage laziness, for example to stop
traversal when a specific node is found: (some pred? (tree-seq coll? identity
coll)). tree-seq removes the parent-children relationship from the output, which
is not ideal for any post-order processing.
• prewalk-replace or postwalk-replace offer a simpler approach if you are only
interested in changing or replacing nodes in the input.
Performance considerations and implementation details
(def data ; ❷
[[1 2]
[3 :a [5 [6 7 :b [] 9] 10 [11 :c]]]
[:d 14]])
We can generalize the concept of "dictionary of substitutions" to any data structure
supporting contains? that can be used as a function of one argument. An array-map or
hash-map is a natural choice, but a vector works as well:
(def ^:const greek ; ❶
'[α β γ δ ε ζ η θ ι κ λ μ ν ξ ο π ρ σ τ υ φ χ ψ ω])
❶ greek is a vector containing the Greek letters in lower-case. Each index in the vector associates with
the given Greek letter, forming a dictionary where the keys are the numbers 0-23.
❷ prewalk-replace verifies if greek contains a value using the items in data as keys. If the item is a
number between 0 and 23, it performs the substitution.
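Putting greek and data together (assuming clojure.walk is aliased as w, as elsewhere in this chapter):

```clojure
(require '[clojure.walk :as w])

;; A vector supports contains? on its indexes, so it acts as a
;; substitution map from index to Greek letter.
(w/prewalk-replace greek data)
;; [[β γ] [δ :a [ζ [η θ :b [] κ] λ [μ :c]]] [:d ο]]
```

Keywords such as :a or :b are left untouched because contains? on a vector returns false for non-integer keys.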
postwalk-replace
While prewalk-replace applies the substitution before descending into the
data, postwalk-replace only applies the substitution after visiting a leaf that can’t be
iterated any further. This is similar to the difference we’ve seen between prewalk and
postwalk. For example, in the following problem involving boolean expressions, we
can simplify the input formula using postwalk-replace:
(def formula ; ❶
'(and (and a1 a2)
(or (and a16 a3) (or a5 a8)
(and (and a11 a9) (or a4 a8)))
(and (or a5 a13) (and a4 a6)
(and (or a9 (and a10 a11))
(and a12 a15)
(or (and a1 a4) a14
(and a15 a16))))))
(def ands ; ❷
'{(and true true) true (and true false) false
(and false true) false (and false false) false})
(def ors
'{(or true true) true (or true false) true
(or false true) true (or false false) false})
(def var-map ; ❸
'{a1 false a2 true a3 false a4 false
a5 true a6 true a7 false a8 true
a9 false a10 false a11 true a12 false
a13 true a14 true a15 true a16 false})
(def transformed-formula ; ❹
(w/postwalk-replace (merge var-map ands ors) formula))
transformed-formula
;; (and
;; false
;; (or false true false)
;; (and true false
;; (and false false
;; (or false true false))))
❶ formula is a nested list of "and" and "or" boolean operators combining 16 variables
(from a1 to a16). The formula needs substitution of the variables before it can evaluate
to true or false. However, before evaluation we have a chance to reduce the size of the formula
using truth tables.
❷ ands and ors contain the and and or truth tables as key-value pairs in a map.
❸ var-map is a possible combination of values for the variables from a1 to a16.
❹ postwalk-replace is given the concatenation of the truth tables with the variable substitution map.
The substitution of variables happens during a post-order traversal: leaf node processing happens
first, giving postwalk-replace an opportunity to also apply the truth tables.
The transformed-formula contains fewer nodes than the original while maintaining the
original meaning. prewalk-replace would not be able to simplify the formula after
replacing the variables, as you can test by replacing postwalk-replace with prewalk-
replace in the same example.
8.4.4 clojure.zip
clojure.zip is a namespace part of the standard library that contains an
implementation of the zipper data structure 151. A zipper represents a location inside a
tree: the zipper can "move" around, retrieve nodes, or perform functional changes
(where the original input never actually mutates).
When using zippers, it might be useful to think in terms of "modes". After creating a
zipper we are in "editing mode": we can move around, retrieve nodes and perform
changes. After performing any relevant operation, we call the function root to retrieve
the resulting data (which includes all changes). Calling root exits editing mode: to re-
enter editing we need to create a new zipper.
While in editing mode with a zipper, we have the option of retrieving nodes or
locations: a node is pure data, the same you would expect from using normal Clojure
functions. Locations are instead new positions that effectively move the focus of the
zipper elsewhere.
151
Gerard Huet introduced a formalization of zippers in his paper from 1997: gallium.inria.fr/%7ehuet/PUBLIC/zip.pdf
Compared to clojure.walk functions, zippers separate the concept of traversal from
processing. With clojure.walk, a depth-first traversal is the only option to perform
any kind of operation. With zippers, the traversal algorithm is not part of the contract
(although a depth-first traversal is optionally offered using zip/next or zip/prev) and
we can pick a different (or partial) traversal quite easily.
The following is a summary of the zipper functions, including a brief explanation of their
goal. We are going into more detail in the following sections:
• Building functions: a zipper can be created with the generic zipper function, or
we can use seq-zip, xml-zip or vector-zip to create one starting from
a sequential, XML or vector input, respectively. make-node creates a new single
node that can be added to the zipper while editing.
• Location functions: up, down, right and left move the zipper in one of the
possible directions. rightmost and leftmost will jump many
times right or left to reach the most distant sibling in the respective direction.
• Retrieving functions: node retrieves the data at the current
location. children, lefts and rights retrieve data from below, left or right of
the current location. branch? answers whether the current location is a
branch node (and implicitly if it’s a leaf). Finally, path returns the list of nodes
necessary to reach the current location of the zipper.
• Update functions: replace applies a substitution of the current node with a new
one. edit is similar, but it takes a function of the current node to produce the
next. insert-left, insert-right and append-child add a new node in one of the
respective directions.
• Traversal functions: zippers come with a built-in depth-first traversal facility that
can start from any location. next retrieves the next location, depth-
first. prev retrieves the previous location in reverse depth-first order. end?
returns true after reaching the end of the traversal.
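Before looking at each group in detail, a tiny end-to-end sketch gives the flavor of the editing workflow:

```clojure
(require '[clojure.zip :as zip])

(def z (zip/vector-zip [1 [2 3]]))

(-> z zip/down zip/node)                      ; move down, read the node
;; 1
(-> z zip/down zip/right zip/down zip/node)   ; navigate into the nested vector
;; 2
(-> z zip/down (zip/replace 0) zip/root)      ; edit, then exit with root
;; [0 [2 3]]
```

Note that the original z is never mutated: every expression starts from the same zipper.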
Building Zippers
Clojure offers options to create zippers out of nested vectors (including subvectors and
primitive vectors), lists (including other native sequential types) and XML (as returned
by clojure.xml/parse):
(require '[clojure.zip :as zip]) ; ❶
(require '[clojure.xml :as xml])
(require '[clojure.java.io :as io])
(def vzip ; ❷
(zip/vector-zip
[(subvec [1 2 2] 0 2)
[3 4 [5 10 (vector-of :int 11 12)]]
[13 14]]))
(def szip ; ❸
(zip/seq-zip
(list
(range 2)
(take 2 (cycle [1 2 3]))
'(3 4 (5 10))
(cons 1 '(0 2 3)))))
(def xzip ; ❹
(zip/xml-zip
(->
"<b>
<a>3764882</a>
<c>80.12389</c>
<f>
<f1>77488</f1>
<f2>1921.89</f2>
</f>
</b>"
.getBytes io/input-stream xml/parse)))
❶ Clojure zippers live in the clojure.zip namespace, which needs to be explicitly required.
❷ The first example shows how to create a zipper from a vector, including other vector-like types
provided by Clojure.
❸ Similarly, we can create a zipper starting from some sequence types (those implementing
clojure.lang.ISeq, which are the native sequences list and cons and sequence generators such
as range or cycle).
❹ zip/xml-zip creates zippers out of XML documents. The document needs to be in the format
provided by the output of clojure.xml/parse, which in turn requires an input-stream.
The recipe that describes how to navigate a specific type of data structure is embedded
as metadata as part of the zipper:
(pprint (meta vzip)) ; ❶
;; {:zip/branch? ; ❷
;; #object[clojure.core$vector_QMARK___4369 0x23802cfd "clojure.core$vector_QMARK___4369@23802cfd"],
;; :zip/children ; ❸
;; #object[clojure.core$seq__4357 0x265a07d7 "clojure.core$seq__4357@265a07d7"],
;; :zip/make-node ; ❹
;; #object[clojure.zip$vector_zip$fn__7605 0x4cae7ff1 "clojure.zip$vector_zip$fn__7605@4cae7ff1"]}
❶ The recipe that tells the zipper how to traverse the data structure is embedded as metadata.
❷ The zip/branch? key is similar to the "branch?" parameter in tree-seq and contains a function that
tells the zipper how to distinguish a branch from a leaf. If (branch? item) returns true for any item
in the input then the zipper knows that the item can be descended further.
❸ The zip/children key contains a function to retrieve the children from a branch.
❹ Finally, :zip/make-node contains the function that is used to create a new node when needed.
The default zipper builders cover a few interesting cases and they were created at a
time when XML was the common data exchange format. For general use, you’ll likely
need to use the most general zipper function. But before creating a custom zipper, we need
to introduce a few more primitives to change and retrieve locations.
Location Functions
The "location" of a zipper is the vector of two items returned right after construction
and after calling a location function. The location contains a copy of the original input
destructured to represent the current position in the data structure. A location function
is a function that changes such destructuring to represent another position. Let’s see
how zip/down changes the location:
(pprint vzip) ; ❶
;; [[[1 2] [3 4 [5 10 [11 12]]] [13 14]] nil]
(pprint (zip/down vzip)) ; ❷
;; [[1 2]
;; {:l [],
;; :pnodes [[[1 2] [3 4 [5 10 [11 12]]] [13 14]]],
;; :ppath nil,
;; :r ([3 4 [5 10 [11 12]]] [13 14])}]
(pprint (-> vzip zip/down zip/rightmost)) ; ❸
;; [[13 14]
;; {:l [[1 2] [3 4 [5 10 [11 12]]]],
;; :pnodes [[[1 2] [3 4 [5 10 [11 12]]] [13 14]]],
;; :ppath nil,
;; :r nil}]
❶ This is the original vzip instance as built by the zip/vector-zip constructor. The input data is intact
as the first item in the vector.
❷ After calling zip/down the focus of the original data becomes [1 2] while the rest appears as part of
the map in the second item of the tuple. The map contains keys for :l left nodes (there are no left
nodes at this level), :r right nodes (we have the 2 nodes [3 4 [5 10 [11 12]]] and [13 14])
and :pnodes parent nodes looking up from where we came from.
❸ The call to zip/rightmost is now related to the location [1 2]. It moves the location all the way to
the right most node available at that point of the traversal, which is [13 14].
As you can see, each location function preserves the information required to move in
other directions in the data structure without the need for any additional state. On
reaching the edge of the data in any direction, location functions return nil:
(-> vzip zip/down zip/down zip/down) ; ❶
;; nil
❶ On reaching a leaf (in this case the number 1) the next request to move down results in nil being
returned. This signals that we reached the edge of the data and cannot move further in that direction. Also
note that it’s quite idiomatic to compose location functions using the -> macro.
Looking inside locations is useful to understand how zippers work, but it’s not the way
they are actually used. After the desired location has been reached, there are dedicated
functions to access the current or neighboring nodes that we are going to see in the
next section.
Retrieving Functions
One of the most used functions after moving the zipper to a specific location
is zip/node, which retrieves the node at the current location as pure data:
(-> xzip zip/down zip/node) ; ❶
;; {:tag :a, :attrs nil, :content ["3764882"]}
We could also "look around" starting from a location and retrieves nodes which are to
the right, left or below the current node:
(-> xzip zip/down zip/children) ; ❶
;; ("3764882")
❶ zip/children takes a location and returns a list of children nodes. In this case the only child node
is available traversing the :content key of the xml map structure.
❷ There are no zip/lefts nodes available starting at the current location. zip/lefts returns nil.
❸ There are 2 zip/rights nodes to the right of the current location.
Using the zip/path function, we can "look up" from the current location and retrieve
the trail of parent nodes descended so far. For example, if we move through the
sequential zipper created at the start of the section down to the "(5 10)" node we can
collect parents traveled so far with zip/path:
(zip/node szip) ; ❶
;; ((0 1) (1 2) (3 4 (5 10)) (1 0 2 3))
(-> szip ; ❷
zip/down zip/right zip/right
zip/down zip/rightmost
zip/down
zip/path)
;; [((0 1) (1 2) (3 4 (5 10)) (1 0 2 3)) (3 4 (5 10)) (5 10)] ; ❸
❶ It’s useful to print the root of the sequential zipper to follow the traversal.
❷ We pass the sequential zipper "szip" through locations to reach the "(5 10)" node. At the end of the
chain we call zip/path to collect all visited nodes so far.
❸ The result of zip/path only includes the nodes zip/down was called on.
NOTE zip/path does not return a traversal of all visited nodes. As you can see from the example
above, nodes like "(0 1)" and "(1 2)" were also part of the visit, but they are not collected
by zip/path. We are going to see how to achieve a proper traversal with zip/next below.
❶ The document presented here is a fragment from a much larger data structure originally transmitted as
a JSON file. A node is represented by a map with :tag, :meta and :node keys. If the value at
the :node key is a collection of maps, then those represent the children of the node.
There is no built-in zipper constructor for this kind of data, but we are going to take
inspiration from the very similar xml-zip implementation to build our own:
(defn custom-zip [root]
(zip/zipper ; ❶
#(some-> % :node first map?) ; ❷
(comp seq :node) ; ❸
(fn [node children] ; ❹
(assoc node :node (vec children)))
root)) ; ❺
❶ We can use zip/zipper to create a generic constructor for custom data structures.
❷ some-> is a good choice to compose the conditions that determine if the passed argument is
a branch: it needs to contain a :node key, the value at that key must have a first element and that
element must be a map?.
❸ The second argument is the "children" function. The function embeds the logic to extract
a sequence out of the nodes.
❹ We also need to tell the zipper how to assemble a new node given a node and a collection of children,
although we are not using it in this specific example.
❺ The final argument is the input data structure.
❻ We use custom-zip as usual to create a new zipper and navigate to the deepest node.
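The usage mentioned in the last annotation is not shown in this excerpt. With a hypothetical document in the shape described above (the doc value here is invented for illustration), it might look like:

```clojure
;; Hypothetical document: maps with :tag, :meta and :node keys.
(def doc {:tag :root :meta nil
          :node [{:tag :a :meta nil
                  :node [{:tag :leaf :meta nil :node "data"}]}]})

(-> (custom-zip doc) zip/down zip/down zip/node)
;; {:tag :leaf, :meta nil, :node "data"}
```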
Update functions
Clojure offers a few functions to change, insert or delete nodes in a
zipper. replace overwrites the node at the current location without looking at its
current content, while edit takes a function from the old node to the
new one. remove simply deletes the current node:
(-> vzip zip/down zip/rightmost (zip/replace :replaced) zip/up zip/node) ; ❶
;; [[1 2] [3 4 [5 10 [11 12]]] :replaced]
(-> vzip zip/down zip/rightmost (zip/edit conj 15) zip/up zip/node) ; ❷
;; [[1 2] [3 4 [5 10 [11 12]]] [13 14 15]]
(-> vzip zip/down zip/rightmost zip/remove zip/root) ; ❸
;; [[1 2] [3 4 [5 10 [11 12]]]]
❶ vzip is the vector zipper created at the beginning of the zipper section. In this first example we can
see how to change the location to "[13 14]" and replace its content with the :replaced keyword.
❷ If we use zip/edit instead of zip/replace at the same location, conj receives "[13 14]" as the first
argument to which it adds "15".
❸ In the last example, the node is completely removed from the output. Note that we used zip/root to
unwind directly to the root without using zip/up.
Note that the current location remains unchanged for all update functions
except zip/remove. After removing a node, the current location becomes the location
of the node that comes before the one that was removed in depth-first traversal order:
(-> vzip zip/down zip/rightmost zip/remove zip/node) ; ❶
;; 12
(zip/node vzip)
;; [[1 2]
;; [3 4 [5 10 [11 12]]] <-- location jump on 12
;; [13 14]] <-- removes here
❶ zip/remove removes the current node, in this case "[13 14]", and jumps to the previous location in
depth-first traversal order, in this case "12". Note that the location could jump to a completely different
branch, like in this case.
Zippers also offer a few options to add nodes: insert-left and insert-right add a
new node to the left or to the right of the current location, respectively:
(-> vzip zip/down zip/rightmost (zip/insert-left 'INS) zip/up zip/node) ; ❶
;; [[1 2] [3 4 [5 10 [11 12]]] INS [13 14]]
(-> vzip zip/down zip/rightmost (zip/insert-right 'INS) zip/up zip/node) ; ❷
;; [[1 2] [3 4 [5 10 [11 12]]] [13 14] INS]
❶ For both examples, the location moves to the node "[13 14]" before any insert operation. In this first
case, we add a node "INS" to the left and at the same level of the current location.
❷ With zip/insert-right the new element is added to the right.
insert-child and append-child are similar operations for branch nodes only. insert-
child adds a new child node of the current location as the leftmost node,
while append-child appends the new node as the rightmost:
(-> vzip zip/down zip/rightmost (zip/insert-child 'INS) zip/up zip/node) ; ❶
;; [[1 2] [3 4 [5 10 [11 12]]] [INS 13 14]]
(-> vzip zip/down zip/rightmost (zip/append-child 'INS) zip/up zip/node) ; ❷
;; [[1 2] [3 4 [5 10 [11 12]]] [13 14 INS]]
(-> vzip zip/down zip/rightmost zip/down (zip/insert-child 'INS)) ; ❸
;; Exception called children on a leaf node
❶ insert-child adds a new element as the leftmost item in the collection of existing children.
❷ append-child adds a new element as the rightmost item instead.
❸ Neither insert-child nor append-child automatically promotes a leaf node to a branch. If the node
is a leaf, the operation throws an exception.
MAKE-NODE
make-node is useful to create a new branch node that can be part of an existing zipper
without necessarily knowing how nodes are assembled together. For example, we
could use make-node to write a function to remove the first child from a node as
follows:
(defn remove-child [loc] ; ❶
(zip/replace loc (zip/make-node loc (zip/node loc) (rest (zip/children loc)))))
❶ make-node takes a location, a node and a collection of children. The details related to the internals of
the zipper come as part of the metadata at the location.
❷ remove-child is used similarly to the rest of the zipper interface. A call to remove-child removes
the first child node at that location.
❸ We can call remove-child repeatedly until there are no more children to remove, leaving the node
empty.
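A quick check of remove-child on a vector zipper (illustrative, not from the book's listing):

```clojure
(-> (zip/vector-zip [[1 2 3]])
    zip/down          ; location at [1 2 3]
    remove-child      ; drop its first child
    zip/root)
;; [[2 3]]
```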
Traversal Functions
The zipper namespace contains a few functions that move the current location in a
predefined direction following a depth-first traversal
path. zip/next and zip/prev move the location to the next or previous depth-first
location, respectively:
(-> vz zip/next zip/node) ; ❶
;; [1 2]
(-> vz zip/next zip/next zip/node)
;; 1
(-> vz zip/next zip/next zip/next zip/node)
;; 2
(-> vz zip/next zip/next zip/next zip/next zip/node)
;; [3 4 [5 [6 7 8 [] 9] 10 [11 12]]]
❶ vz is the vector zipper that was defined at the beginning of the chapter. We can follow the result of
repeated invocations of zip/next descending to the first node "[1 2]", visiting its elements and finally
moving up to the next node.
If we want to perform the traversal of all the available nodes, we can repeatedly
call zip/next on the result of the previous invocations with iterate. Note the use
of zip/end? to decide when to stop the traversal:
(->> vz
(iterate zip/next) ; ❶
(take-while (complement zip/end?)) ; ❷
(map zip/node)) ; ❸
;; ([[1 2] [3 4 [5 [6 7 8 [] 9] 10 [11 12]]] [13 14]]
;; [1 2] 1 2
;; [3 4 [5 [6 7 8 [] 9] 10 [11 12]]] 3 4
;; [5 [6 7 8 [] 9] 10 [11 12]]
;; 5
;; [6 7 8 [] 9]
;; 6 7 8 [] 9 10
;; [11 12]
;; 11 12
;; [13 14]
;; 13 14)
❶ iterate repeatedly calls zip/next on the location returned by the previous invocation (initially the
vector zipper "vz", which is the first location).
❷ There is a specific zip/end? predicate of a location that returns true when the location is the last
available in the traversal.
❸ All collected locations need to be translated into simple nodes to visualize their content.
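The zip-walk helper used in the next snippet is not defined in this excerpt. A minimal sketch that matches its usage (the name and exact shape are assumptions) could be:

```clojure
;; Walk the whole zipper depth-first, applying f to every location,
;; and return the resulting data structure once the end is reached.
(defn zip-walk [f z]
  (if (zip/end? z)
    (zip/root z)
    (recur f (zip/next (f z)))))
```

Note how the traversal (zip/next, zip/end?) is decoupled from the processing (f), which is the main selling point of zippers over clojure.walk.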
(zip-walk ; ❷
#(if (zip/branch? %) % (zip/edit % * 2))
(zip/vector-zip [1 2 [3 4]]))
;; [2 4 [6 8]]
(def zipper-end ; ❶
(-> (zip/vector-zip [1 2]) zip/next zip/next zip/next))
(zip/end? zipper-end) ; ❷
;; true
(zip/prev zipper-end) ; ❸
;; nil
❶ The simple vector "[1 2]" produces an equivalently simple zipper.
❷ After calling zip/next 3 times, we are at the end of the traversal.
❸ After reaching the end of the traversal, the location cannot be used for any further navigation,
including going back the traversal path with zip/prev.
If the traversal has not reached the end, then zip/prev or any other location change is still possible.
Also note that zip/prev does not have the same behavior on reaching the root
node: zip/next after zip/prev on the root location works as expected.
This concludes our description of the zipper functions and the chapter on collections.
We are going to see a more specialized kind of collection called a "sequence" in the
next chapter.
Sequences
A Clojure sequence is an abstract data type. An abstract data type (or ADT) describes
the behavior of a data structure without mandating a specific implementation. The
following are the main properties of the abstraction:
• As the name implies, it’s iterated sequentially: you cannot access the nth element without first
accessing the (n-1)th (and there are no gaps).
• It works like a stateless cursor: the iteration can only move forward.
• It’s persistent and immutable: like all other core data structures, sequences cannot be altered once
created, but changes are possible in terms of a new sequence based on the previous (with
structural sharing).
Optionally, sequences also support the following features (although they are not part of
the contract):
• They are commonly (but not necessarily) lazy: the next element is produced only
if that element is requested.
• They are also cached: the first access to the sequence elements produces a cached
version of each item. Subsequent access to the sequence does not require further
computation.
• They often apply "chunking" to improve performance. Chunking consists of
processing a few more elements than requested, assuming the caller will soon
move forward and access the rest of the sequence.
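A small REPL experiment (illustrative) shows laziness, caching and chunking at work:

```clojure
;; range is both lazy and chunked; map preserves those properties.
(def s (map #(do (println "computing" %) %) (range 3)))

(first s)  ; forces computation; chunking may realize the whole
           ; (small) chunk, printing all three items at once
(first s)  ; the items are now cached: nothing is printed again
```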
Clojure makes heavy use of sequences and, as a consequence, there are many functions
devoted to them. The book dedicates two chapters to the topic, one about the way
sequences are produced and another about their processing.
The next chapter is about sequence producers. There are essentially 4 ways to create
them:
1. "Seqable" collections are collections supporting the sequential interface. A
sequential view is produced by calling seq on them directly, or implicitly through
one of the many processing functions (using seq internally).
2. On-demand generation: the data doesn’t exist before consuming the sequence, but
it is generated as soon as requested. Functions like range are a perfect example:
the list of numbers does not exist until it is requested. range (and other similar
functions) describes a recipe to produce the data, but it’s not the data itself.
3. Custom generation: a sequence is built (often through the use of lazy-seq) on top
of some source of data that is not necessarily structured or available in memory.
4. Native sequences: Clojure offers two concrete data structures implementing the
sequential interface natively: cons cells and persistent lists.
• rseq creates a reverse sequence, a sequence that reads backwards from the last
element.
• subseq and rsubseq produce a sequence from a portion of a sorted-set or sorted-
map.
• seque creates a blocking sequence, a sequence backed by a blocking queue that
can potentially block if the consumer gets ahead of the producer.
• pmap produces a sequence while applying a transformation on the items in
parallel.
We are now going to see the different types of sequential generation in detail.
9.1.1 seq and sequence
function since 1.0
(seq [coll])
(sequence
([coll])
([xform coll])
([xform coll & colls]))
seq and sequence enable sequential behavior on top of existing collections. The
collection is ultimately responsible for sending the data over, but the end result has to
conform to a Clojure "sequence", a persistent and immutable data structure. Based on
the target collection, the sequence is additionally cached or chunked (see the beginning
of the chapter for a brief explanation of these features).
Despite being of huge importance for Clojure internals (all sequence functions call seq
in one way or another), explicit use of seq has just a few idiomatic uses. For
example, seq can be used to check if a collection contains at least one element:
(def coll [])
❶ An idiomatic use of seq is to offer a uniform way to check if a collection contains at least one
element. seq of an empty collection returns nil (and not the empty collection), which is what enables
this conditional to work properly.
❷ Assuming we are happy to invert the order of the conditional branches (giving prominence to the fact
that the collection is empty in the first place), we could use empty? instead.
❸ Finally, if we agree to negate the conditional form, we can also use not-empty.
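The conditional code this snippet refers to is elided above. A minimal sketch of the three options described in the callouts (re-stating the def so the snippet is self-contained, with hypothetical keyword results) might look like:

```clojure
(def coll [])

(if (seq coll) :non-empty :empty)       ; seq returns nil on []
;; :empty
(if (empty? coll) :empty :non-empty)    ; inverted branches with empty?
;; :empty
(if (not-empty coll) :non-empty :empty) ; negated form with not-empty
;; :empty
```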
sequence has additional features. Used with a single collection argument, it works
similarly to seq with the only difference being the treatment of the empty collections:
(seq nil)
;; nil
(sequence nil) ; ❶
;; ()
(seq [])
;; nil
(sequence []) ; ❷
;; ()
❶ seq returns nil when invoked on a nil collection. sequence returns a new empty list instead.
❷ seq returns nil on an empty collection, while sequence returns an empty list.
After the addition of transducers to the standard library, sequence also offers the
possibility to apply a transducer:
(sequence (map str) [1 2 3] [:a :b :c]) ; ❶
;; ("1:a" "2:b" "3:c")
❶ sequence accepts a transducer (or composition thereof) and a variable number of collections.
❷ When multiple collections are present, the first transducer receives a transformation call with 2 (or
more) parameters. In this case the mapping function "*" receives two parameters to generate the
square of a number.
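The second callout refers to a multiplication example that is elided here; it might be reconstructed as:

```clojure
;; with two input collections, the mapping function receives
;; one argument from each collection at the same position
(sequence (map *) [1 2 3] [1 2 3])
;; (1 4 9)
```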
CONTRACT
Input
• "coll & colls" are compatible collection types, including Clojure collections
(excluding transients), other sequences, strings, arrays or Java
iterables. seq requires one collection argument that can be empty
or nil. sequence allows for a variable number of empty or nil collections, but at
least one must be present.
• "xform" is a function following transducer semantics. sequence is the only
transducer function supporting multiple collection inputs. If sequence receives
two or more "colls" arguments, then the transducer "xform" receives two or more
arguments as well.
Notable exceptions
• IllegalArgumentException is thrown for unsupported collection types.
Output
• returns: a sequence representing the sequential view over the input collection. If
one or more transducers are present, sequence applies the transducer chain to each
element returned in the output sequence. When more than one collection is
present, the output stops after reaching the shortest input.
Examples
seq has a few idiomatic uses. We saw in the introduction that it can be used to verify if
a collection is not empty. This is a useful property, for example during recursion, to
gradually consume a collection. Here’s the general mechanism implemented to reverse
a generic collection input:
(defn rev [coll]
(loop [xs (seq coll) done ()] ; ❶
(if (seq xs) ; ❷
(recur
(rest xs) ; ❸
(cons (first xs) done)) ; ❹
done)))
(rev [8 9 10 3 7 2 0 0]) ; ❺
;; (0 0 2 7 3 10 9 8)
❶ To be absolutely sure we can operate on "coll" through the sequential interface, we call seq when we
first initialize the loop-recur construct. This has the effect of throwing an exception at the earliest possible
point. If the rev function only operates in the context of Clojure data structures, explicit seq is not
usually necessary.
❷ seq is used to check if a collection is not empty. We could also ask if the collection is empty? and
reverse the if statement, but depending on algorithmic emphasis and programming style, it’s good to
have both options always available.
❸ We can definitely call rest now, as we are forcing a seq "xs" conversion at the beginning of the loop.
❹ We are using cons on "done", a local name initially bound to the empty list. Lists support the sequential
interface natively without a sequential adapter.
❺ Collections like vectors are not native sequences, but Clojure adapts them easily by walking their
internal structure.
In a similar fashion, seq can be used as a predicate to verify that all collections in a
list contain at least one item:
(every? seq [#{} [:a] "hey" nil {:a 1}]) ; ❶
;; false
❶ seq returns nil for the empty set and for nil, so every? returns false.
After the introduction of transducers, sequence is now available for a brand new set of
applications. Similarly to seq, sequence produces a sequential view on top of the input
collection. Additionally, sequence applies a transformation on each item using the
provided transducer chain.
In the following example, we are going to parse some unstructured input to extract the
information we need. We need to connect to a device which returns data as a two-
dimensional representation similar to the grid on a spreadsheet. The output also
contains interleaving rows that we don’t need. Here’s an example of the data we would
like to extract and how it appears when connecting to the device:
;; == example data pattern == ; ❶
;; Wireless MXD CXP ; header: kind & codes
;; ClassA 34.97 34.5 ; metric: name & measures
;; ClassT 11.7 11.4 ; metric: name & measures
;; ClassH 0.7 0.4 ; metric: name & measures
(def device-output ; ❷
[["Communication services version 2"]
["Radio controlled:" "Enabled"]
["Ack on transmission" "Enabled" ""]
["TypeA"]
["East" "North" "South" "West"]
["10.0" "11.0" "12.0" "13.0"]
["Wireless" "MXD" ""]
["ClassA" "34.97" "" "34.5"]
["ClassB" "11.7" "11.4"]
["Unreadable line"]
["North" "South" "East" "West"]
["10.0" "11.0" "12.0" "13.0"]
["Wired" "QXD"]
["ClassA" "34.97" "33.6" "34.5"]
["ClassC" "11.0" "11.4"]])
❶ The example shows the kind of data pattern we are searching for: it contains a header followed by
several lines of numerical metrics.
❷ This is the output we receive from the device. The interesting data appears in the output interleaved
with additional "noise" we want to remove.
The approach we follow is to read the device output top to bottom and gradually create
groups of rows. We keep the group only if it conforms to the interesting pattern of
data. We can use the following predicates to check if a group of lines is something we
are interested in or not:
(defn measure? [measure] ; ❶
(and
measure
(re-matches #"[0-9\.]*" measure)))
❶ A measure? is a string representing a decimal number. The regular expression is very simple for the
purpose of this example.
❷ A metric? is a list containing a name and any number of measures. The name must start with "Class"
and be followed by a letter.
❸ A header? is a list of strings. It should start with a "kind" (either "Wireless" or "Wired") followed by
a "code".
❹ A pattern? matches the entire specification. It checks that the first line is a valid header and what
follows are metrics.
❺ We can try the predicate on a test specification.
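The definitions of metric?, header? and pattern? are elided above; a minimal sketch consistent with the callouts (repeating measure? so the snippet is self-contained) could be:

```clojure
(defn measure? [measure]
  (and
   measure
   (re-matches #"[0-9\.]*" measure)))

(defn metric? [[name & measures]]
  ;; a metric row: "Class" plus a letter, followed by measure strings
  (boolean (and name
                (re-matches #"Class[A-Z]" name)
                (every? measure? measures))))

(defn header? [[kind code & _]]
  ;; a header row: a kind ("Wireless" or "Wired") followed by a code
  (boolean (and (#{"Wireless" "Wired"} kind) code)))

(defn pattern? [[header & metrics]]
  ;; the full pattern: one header line followed by metric lines only
  (boolean (and (header? header)
                (seq metrics)
                (every? metric? metrics))))
```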
Now that we have implemented the predicates to recognize the interesting pattern of
data, it’s time to process the raw feed from the device. We proceed by iterating a range
from 0 up to the number of input lines and then use nthrest to gradually remove them
from the top. This generates a list of all ordered subsets of the input. We know that
some subset of the input could be the pattern of data we are interested in. sequence
comes in handy to process this sequence using transducers:
(defn all-except-first [lines] ; ❶
#(nthrest lines %))
(def if-header-or-metric ; ❷
#(take-while (some-fn header? metric?) %))
(filter-pattern device-output)
❶ The first transducer takes all lines except the first "n", where "n" comes from the iteration of
a range input.
❷ The second transducer only keeps lines in a subset that are either headers or metrics, which are the
only line types we are interested in.
❸ filter-pattern assembles all transducers together.
When we finally test filter-pattern on the raw input coming from the device, we can
see that it correctly assembles the patterns of data we are searching for.
See also:
• list creates a concrete collection that is also a sequence natively.
• lazy-seq offers a way to create a custom data generator for a sequence.
• iterator-seq and enumeration-seq return a sequential view over
a java.util.Iterator instance. Many Clojure and Java collections are accessible
through the Iterator interface.
Performance considerations and implementation details
Figure 9.2. seq called on the most common sequential types ("hset" and "hmap" stand for hash-set and hash-map)
On the fast side of the chart, we find native sequential collections and generators like
ranges and long ranges. Other types of vectors are also performing well. For many of
the types benchmarked in the chart the sequential transformation is of relative
importance since their main goal is to offer direct lookup access.
Caching is an important factor in performance, especially if expensive computations
are involved. If reuse is necessary, the sequence can be closed over (for example with a
let binding) and re-used efficiently. We could go ahead and extend the parser output
example seen previously by adding a side-effecting transducer to see if the same
message appears more than once:
(defn filter-pattern [lines]
(sequence
(comp
(map (all-except-first lines))
(keep if-header-or-metric)
(filter pattern?)
(map #(do (println "executing xducers") %))) ; ❶
(range (count lines))))
;; executing xducers ; ❸
;; executing xducers
;; [nil nil nil]
❶ The side effecting transducer is a map transducer printing to standard output and returning the input
without any modification.
❷ groups is the local name for the results of the sequence call. We are then creating another sequence
on top, accessing the first and the last element.
❸ We can see two printouts corresponding to the groups that were found in the input.
Although first and last both call seq on their input (effectively creating a new sequence on top
of groups), no other printouts are visible, showing that the transducer chain is never invoked
again. eduction, in contrast, does not cache results and can be used in those cases where
transducers are expected to execute again.
Laziness and caching have implications when the wrapped collection is mutable (like a
Java collection) and it mutates after the sequential view is created. Here’s an example
showing the effect of mutation on the sequential view:
(import '[java.util ArrayList])
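The rest of this snippet is elided. A sketch of what it might show (relying on the fact that seq over a Java Iterable caches elements in chunks of 32) is:

```clojure
(import '[java.util ArrayList])

(def array-list (ArrayList. (range 40)))
(def s (seq array-list))   ; a caching (chunked) view over the mutable list
(first s)                  ; realizes and caches the first chunk of 32 items
;; 0
(.set array-list 0 99)     ; mutation inside the cached chunk
(.set array-list 35 99)    ; mutation beyond the cached chunk
(first s)                  ; the cached value hides the mutation
;; 0
(nth s 35)                 ; not yet realized when mutated: the change leaks through
;; 99
```

Note that .set is not a structural modification of the ArrayList, so it does not trigger a ConcurrentModificationException during iteration.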
A final remark about the use of sequence with transducers from the performance
perspective. sequence implementation for transducers uses a buffer mechanism to
temporarily park each transformed item and return it as the sequence is consumed. For
trivial transducer chains, normal sequential generation is faster than transducer
sequential generation:
(require '[criterium.core :refer [bench quick-bench]]) ; ❶
❶ We use the Criterium library to measure performance in the most accurate way.
❷ A range of 500,000 items is completely evaluated by accessing the last element. The processing
involved is very simple.
❸ We operate similarly with sequence and a transducer chain. We can see a slight performance
degradation.
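The elided measurement might be sketched as follows, using clojure.core/time as a lightweight stand-in for Criterium and inc as a hypothetical trivial transformation:

```clojure
;; plain sequential generation: last forces full evaluation
(time (last (map inc (range 500000))))
;; 500000

;; transducer-based generation: sequence buffers each transformed
;; item internally, adding a small per-item overhead
(time (last (sequence (map inc) (range 500000))))
;; 500000
```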
9.1.2 rseq
function since 1.0
(rseq [rev])
rseq creates a reversed sequential view on top of a collection. The collection needs to
know how to produce such a view to work with rseq (it needs to implement
the clojure.lang.Reversible interface).
One of the main uses of rseq is to provide a constant-time reverse for vectors,
sorted maps and sorted sets, which would otherwise be forced into a linear sequence
scan. rseq returns a "reversed view" of the input data structure, so when elements are
iterated they are returned in reverse order:
(rseq [:b :a :c :d]) ; ❶
;; (:d :c :a :b)
As we can see, results print inside round parentheses. This correctly indicates that the
return type is sequential (vectors, sorted maps and sorted sets return a
specific rseq wrapper that implements the sequential interface):
(conj (rseq [1 2 3]) :a) ; ❶
;; (:a 3 2 1)
❶ conj into the reversed vector inserts at the "head" instead of the tail position (as would be the case
with vectors).
rseq on vectors effectively returns a sequence. For this reason, beware that operations
like peek or nth are not optimized on the rseq output, even when the input data structure
is a vector.
WARNING rseq differs from reverse in its treatment of the empty collection and nil. (reverse
nil) and (reverse []) both return the empty list, while (rseq []) returns nil and (rseq nil) leads to
a NullPointerException being thrown.
CONTRACT
Input
• "rev" is the only mandatory argument. "rev" must be a collection implementing
the clojure.lang.Reversible interface. A helper can be used to check if this is
true for "rev": (reversible? rev) should return true. Currently only vectors
created with vector, vector-of or “subvec” are reversible. Additionally, sorted
maps and sorted sets are also reversible.
Notable exceptions
• If "rev" is not a vector or a sorted map/set, then it
throws ClassCastException.
• If "rev" is nil, then it throws NullPointerException.
Output
• rseq returns: a sequence containing the items of the input collection in reverse order,
or nil when the collection is empty.
152
en.wikipedia.org/wiki/Palindromic_sequence
(is-palindrome? [\a])
;; => false
The algorithm above is an elegant example that demonstrates the use of rseq and
avoids reversing the input. However, it wasn’t designed with production scale in mind
and you should look into more sophisticated techniques for real-life problems 153.
See also:
• seq is used to return a sequence on a collection without reverting its content.
• “reverse” returns a reverse sequence on all seqable collections, not just vectors.
Performance considerations and implementation details
153
On the topic of efficient DNA processing, please see www.ncbi.nlm.nih.gov/pmc/articles/PMC3602881/
A naive implementation uses reverse and then compares with the original input. The reverse-based solution is
compared here to a solution based on rseq:
(require '[criterium.core :refer [quick-bench]])
The solution based on rseq is almost an order of magnitude faster than a similar
solution based on reverse. The result is achieved considering that random sequences
have a very low probability of being a palindrome and equality returns false pretty
soon after the start of the iteration, without the need to fully realize the reversed
sequence.
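The two solutions being compared are elided here; hypothetical implementations might look like:

```clojure
(defn palindrome-reverse? [v]
  ;; reverse is eager: it always realizes the whole reversed list
  (= (seq v) (reverse v)))

(defn palindrome-rseq? [v]
  ;; rseq is O(1) on vectors; = short-circuits at the first mismatch
  (= (seq v) (rseq v)))

(palindrome-rseq? (vec "racecar"))
;; true
(palindrome-rseq? (vec "random"))
;; false
```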
9.1.3 subseq and rsubseq
function since 1.0
(subseq
([sc test key])
([sc start-test start-key end-test end-key]))
(rsubseq
([sc test key])
([sc start-test start-key end-test end-key]))
subseq and rsubseq create a sequence out of the elements enclosed by a lower/upper
bound in a sorted collection:
(subseq (apply sorted-set (range 10)) > 2 < 8) ; ❶
;; (3 4 5 6 7)
rsubseq differs from subseq in the order the sequence is generated: from the first
matching element (subseq) or from the last (rsubseq). subseq and rsubseq implicitly
require the input collection to support a notion of ordering, which restricts the possible
input types to sorted-set and sorted-map.
CONTRACT
Input
• "sc" is a sorted collection implementing the clojure.lang.Sorted interface.
There are currently 2 concrete implementations in the standard library: sorted-
set and sorted-map.
• "test" can be one of the four comparators: <, <=, > or >=.
• "key" type needs to be comparable with the content of "sc". In most practical
situations it means that "key" has the same type of the keys in "sc".
• "start-test", "start-key", "end-test" and "end-key" are of the same type as "test" and
"key" respectively. The different names are only necessary when both bounds are
present in the function call.
Notable exceptions
• ClassCastException is thrown when "key" is not comparable with the keys in
"sc" or when "sc" is not a sorted collection.
Output
• subseq: returns the forward sequence of elements enclosed between the
lower/upper bound (when both present). When the upper bound is not present, the
start/end of the sequence delimits the upper/lower bound implicitly.
• rsubseq: works like subseq but inverts the order in which the elements are
returned in the output sequence.
Examples
subseq and rsubseq can be used to perform searches of elements above or below a
certain key. For example, here’s how to answer the question "What is the
smallest/biggest element for which the key is more/less than x":
(defn smallest> [coll x] (first (subseq coll > x))) ; ❶
(defn smallest>= [coll x] (first (subseq coll >= x)))
(defn greatest< [coll x] (first (rsubseq coll < x))) ; ❷
(defn greatest<= [coll x] (first (rsubseq coll <= x)))
❶ smallest> retrieves the portion of the sequence beyond the given boundary. The first element
is the smallest after the target.
❷ greatest< uses rsubseq to avoid taking the last element from the resulting sequence. Accessing
the last element of a sequence is usually not a great idea in terms of performance, and here we have a
straightforward way to avoid it.
The following example shows how to implement auto-completion of words given the
first few letters. We can load a dictionary in a sorted-set as part of the application
bootstrap and use it to quickly select a range of words to complete what the user types.
(require '[clojure.string :refer [split]])
(def dict
(into (sorted-set) ; ❶
(split (slurp "/usr/share/dict/words") #"\s+")))
❶ We build a dictionary starting from a list of words (in this case the Unix standard dictionary location).
The dictionary is created as a sorted-set.
❷ complete takes a word or more likely the first few letters and returns the first 4 words after the
fragment from the dictionary.
❸ Here you can see a simulation of the user typing, progressively trying to spell "closure". You can see
that the word is first in line at the 5th letter (and it could appear before if we took longer auto-
completion lists).
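The complete function is elided above; a sketch consistent with the callouts (using a small in-memory dictionary in place of the Unix word list) might be:

```clojure
;; a small in-memory dictionary standing in for /usr/share/dict/words
(def dict
  (into (sorted-set) ["cat" "close" "closure" "clothes" "cloud" "dog"]))

(defn complete [dict fragment]
  ;; subseq reaches the first word >= fragment in O(log n);
  ;; take limits the number of suggestions
  (take 4 (subseq dict >= fragment)))

(complete dict "clo")
;; ("close" "closure" "clothes" "cloud")
```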
Red-black Trees
Ideal Hash Trees usage in Clojure is well documented in articles and presentations (Phil Bagwell’s HAMT
tree idea has been adapted to be the basis of persistent data structures in Clojure; you can read the original
paper here lampwww.epfl.ch/papers/idealhashtrees.pdf). But Clojure also makes practical use of other
interesting data structures like Red-black Trees, a type of self-balancing tree which is at the basis
of sorted sets and sorted maps 154.
In Red-black tree nodes, a bit is dedicated to identifying a color (black or red by convention) which helps
keep the tree balanced during insertion. Inspiration for the current Clojure Red-black tree
implementation comes from Okasaki 155 and it was partially described on the Clojure mailing list 156.
Apart from direct use of sorted-set and sorted-map for basic ordering, subseq and rsubseq are the
only functions in the standard library to make explicit use
of clojure.lang.PersistentTreeMap methods. The reason is the O(log n) access guarantee
(compared to the linear access in a sequence) to reach the requested element from where to start
generating the sequence. Even assuming an ordered vector or sequence, it would take linear time to
reach the requested element with other data structures.
See also:
• subs retrieves a sub-string from another larger string.
• subvec creates a sub-vector given a start and end index.
• drop and take can be used with their variants to isolate a portion of a sequence.
• rseq can be used to generate a sequence from the element of a collection in
reverse.
• sort is used to order the content of a collection.
• sorted-set and sorted-map store their content ordered by a comparator.
Performance considerations and implementation details
154
Please refer to this general introduction to Red-black trees available on Wikipedia: en.wikipedia.org/wiki/Red–
black_tree
155
Purely Functional Data Structures is an important book in functional programming that describes ways to implement the
most common data structures persistently, that is, with structural sharing preserving older versions
156
See groups.google.com/forum/#!msg/clojure/CqV6smX3R0o/_ZnnimboYjQJ
❶ Before we can search the threshold value, we need to explicitly sort the sequence of "items". We then
proceed to drop-while until we reach the threshold and then return the first element. This is a linear
scan of the sequence.
❷ The same "items" are used to create a sorted-set. We use subseq to access the smallest item after
reaching "x".
The example shows an evident advantage of subseq for these kinds of operations (from
88 microseconds to 76 nanoseconds). What the example does not show, though, is the
time required to create the sorted-set compared to the creation of the sorted sequence.
There is a trade-off to consider when designing an algorithm around subseq, which is
the way the sorted collection is created and evolved during the lifetime of the
application. Performance of subseq and rsubseq is definitely good for use cases like
suggesting words from a dictionary (and optionally adding more words after the initial
creation) as presented in this chapter.
9.1.4 seque
function since 1.0
(seque
([s])
([n-or-q s]))
❶ seque used on a range produces another lazy sequence with the same content.
;; produce 4
;; consume 4 ; ❹
;; produce 3
;; consume 3
;; produce 2
;; consume 2
;; produce 1
;; consume 1
;; produce 0
;; consume 0
❶ fast-producer is a function creating a list, which supports the sequence interface natively
(specifically for this example, it doesn’t use any optimization that would generate a confusing
output). map was added to print each produced element.
❷ slow-consumer takes a sequence "xs" as input and simulates some lengthy computation. keep is
used here to map over each item and to suppress the nil output that would otherwise produce
confusion in the printout.
❸ After calling the slow consumer using the fast producer input, we can see each item producing a
"produce-consume" pair of lines every 2 seconds.
❹ Each "consume" printout happens after 2 seconds from the related "produce".
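fast-producer and slow-consumer are elided above; a reconstruction consistent with the callouts might be:

```clojure
(defn fast-producer [n]
  ;; into () builds a non-chunked list, so map produces one item at a time
  (->> (into () (range n))
       (map #(do (println "produce" %) %))))

(defn slow-consumer [xs]
  ;; println returns nil, which keep suppresses from the output sequence
  (keep #(do (Thread/sleep 2000)
             (println "consume" %))
        xs))
```

Note that (into () (range n)) reverses the range, which is why the printouts start from the highest number.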
We can now add seque between producer and consumer. seque creates an in-memory
buffer that reduces the need for the producer to wait for the consumer:
(slow-consumer (seque (fast-producer 5))) ; ❶
;; produce 4
;; produce 3
;; produce 2
;; produce 1
;; produce 0 ; ❷
;; consume 4
;; consume 3
;; consume 2
;; consume 1
;; consume 0
❶ The only addition to the previous example is the seque call wrapping the fast producer.
❷ fast-producer is now able to move forward without waiting. slow-consumer starts catching up after
around 2 seconds, slowly consuming items from the input sequence but without a dependency on the
fast producer, which now has the opportunity to park resources or do some other work.
Here’s a similar example with opposite roles. A slow producer is attached to a fast
consumer and seque is between them:
(defn slow-producer [n] ; ❶
(->> (into () (range n))
(map
#(do
(println "produce" %)
(Thread/sleep 2000) %))))
;; produce 4 ; ❹
;; produce 3
;; produce 2
;; consume 4
;; 4
;; produce 1
;; produce 0
When seque is present between a slow producer and a fast consumer, it allows the
producer to work "n" items ahead in the background, even when we only request the
first item.
WARNING Note that although similar to sequence "chunking" (an internal optimization that allows
sequences to compute some number of items ahead), seque operates independently from it
and even on sequences that are not necessarily chunked.
(defn lazy-scan [] ; ❷
(->> (java.io.File. "/")
file-seq
(map (memfn getPath))
(filter (by-type ".txt"))
(seque 50)))
(defn go []
(loop [results (partition 5 (lazy-scan))] ; ❸
(println (with-out-str (clojure.pprint/write (first results))))
(println "more?")
(when (= "y" (read-line))
(recur (rest results)))))
(go)
("/usr/local/Homebrew/docs/robots.txt" ; ❹
"/usr/local/Homebrew/LICENSE.txt"
"/usr/local/var/homebrew/linked/z3/todo.txt"
"/usr/local/var/homebrew/linked/z3/LICENSE.txt"
"/usr/local/var/homebrew/linked/z3/share/z3/examples/c++/CMakeLists.txt")
;; more?
❶ by-type is a function to build a predicate for filter below. It returns true if the file name ends with the
given extension.
❷ lazy-scan creates a sequence of files starting from the home folder and only with the given
extension. file-seq is providing the initial lazy sequence by following the file system down each
available folder. We are asking seque to look ahead 50 items on this sequence. Note that even if the
folder contains many files or subfolders, the producer only produces a bounded number of items ahead of
time, without triggering an entire file system scan.
❸ This loop asks the user to input "y" to see the next page of results, or any other letter to stop. While
the first 5 items are displayed, seque is searching for the next 50 in the background.
❹ The list of results could appear different on a different machine.
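The by-type helper referenced in the first callout is elided; a minimal sketch might be:

```clojure
(require '[clojure.string :as str])

(defn by-type [ext]
  ;; builds a predicate for filter: does the file name end with ext?
  (fn [file-name]
    (str/ends-with? file-name ext)))

((by-type ".txt") "robots.txt")
;; true
```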
seque also allows the use of a custom queue. We could use this feature to print a message
every time the buffer is full, useful information for understanding the best buffer size
based on the relative speed of the consumer. The buffer indicator runs from a different
thread and prints how many items are in the queue every second:
(import '[java.util.concurrent LinkedBlockingQueue])
(def q (LinkedBlockingQueue. 2000)) ; ❶
(defn counter [] ; ❷
(let [out *out*]
(future
(binding [*out* out]
(dotimes [n 50]
(Thread/sleep 1000)
(println "buffer" (.size q)))))))
(defn lazy-scan [] ; ❸
(->> (java.io.File. "/")
file-seq
(map (memfn getPath))
(filter (by-type ".txt"))
(seque q)))
(counter) ; ❹
;; #object[clojure.core$future_call$reify__8454 0x4b672daa {:status :pending, :val
nil}]
;; buffer 0
;; buffer 0
;; buffer 0
(go)
;; ("/usr/local/Homebrew/docs/robots.txt" ; ❺
;; "/usr/local/Homebrew/LICENSE.txt"
;; "/usr/local/var/homebrew/linked/z3/todo.txt"
;; "/usr/local/var/homebrew/linked/z3/LICENSE.txt"
;; "/usr/local/var/homebrew/linked/z3/share/z3/examples/c++/CMakeLists.txt")
;; more?
;; buffer 544 ; ❻
;; buffer 745
;; buffer 745
;; buffer 749
;; buffer 749
;; ...
;; buffer 2000
;; buffer 2000
;; ...
❶ The blocking queue is stored in a var. To show that seque is working in the background, we make a
much larger buffer of 2000 files so there is time to print the increasing size of the buffer.
❷ counter starts a future that wakes up every second to print the size of the buffer. This happens 50
times before exiting, which is enough to see progress while typing instructions at the REPL.
❸ lazy-scan is the same as before except when we build seque. Instead of passing the buffer size, we
pass the queue instance directly.
❹ We start the counter first and we can see it printing "buffer 0" every second.
❺ As soon as we invoke the (go) function, we can see the first page of results (this could be different on
different machines).
❻ The counter continues in the background, showing progress while the buffer is filling up from 0 to
2000 items. If we wait long enough without doing anything, we can see it printing 2000 continuously,
a sign that the buffer is full and no more disk I/O is taking place.
157
Back-pressure is an important concept in event driven systems, where we want downstream components to be able to
limit upstream producers. This talk from Zach Tellman www.youtube.com/watch?v=1bNOO3xxMc0 does a wonderful
job introducing the key concepts.
The consumer and the producer in this example are slowed down on purpose, so we can play with them.
Here’s what happens when we start the producer:
Although this is a possible idea to solve the problem of building a sequence on a blocking queue,
libraries like core.async (github.com/clojure/core.async) provide a robust solution to the problem,
which should be evaluated before rolling a custom solution like the one presented above.
See also:
• sequence can create a sequential view on a LinkedBlockingQueue following a
standard, non-blocking, approach.
• lazy-seq is the fundamental building block for seque. It's worth revisiting the
mechanism by which lazy-seq allows the creation of lazy sequences to
understand seque.
Performance considerations and implementation details
• An agent iterates the input sequence in a loop. It processes as much of the sequence
as can fit in a LinkedBlockingQueue instance. The remainder of the sequence is
stored as the new state of the agent and the task exits.
• While the agent is filling the queue, the main thread of computation is building
a lazy sequence off the queue. Each time the loop is able to take an item off the
queue, another "fill" request is sent to the agent. Each task sent to
the agent resumes from the remaining part of the input sequence.
• Any error produces a retry from the previous state.
• Sentinel objects are used to signal the end of the input, which propagates to the
lazy sequence that eventually returns nil.
9.1.5 pmap, pcalls and pvalues
function: pmap and pcalls macro: pvalues since 1.0
(pmap
([f coll])
([f coll & colls]))
(pcalls [& fns])
(pvalues [& exprs])
pmap, pcalls and pvalues build a lazy sequence as the result of processing a set of
expressions in parallel (using futures). Both pcalls and pvalues build on top of pmap.
pmap has a similar interface to map, but transformations apply to the input in parallel:
❶ pmap has a similar interface to map but the input transformations happen in parallel.
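The elided example might look like the following sketch (slow-inc is a hypothetical transformation added to make the parallelism visible):

```clojure
(defn slow-inc [x]
  (Thread/sleep 100)   ; simulate a non-trivial transformation
  (inc x))

(time (doall (map slow-inc (range 8))))   ; sequential: roughly 8 x 100ms
(time (doall (pmap slow-inc (range 8))))  ; parallel: much closer to a single
                                          ; 100ms step on a multi-core machine
;; (1 2 3 4 5 6 7 8)
```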
pcalls builds on top of pmap accepting any number of functions as input. It then
creates a lazy sequence from the results of calling the functions without
arguments. pcalls is a good solution for side effecting parallel transformations:
(pcalls ; ❶
(constantly "Function")
#(System/currentTimeMillis)
#(println "side-effect"))
;; side-effect ; ❷
;; ("Function" 1553770187108 nil)
pvalues is a macro also building on top of pmap. It takes any number of expressions that are evaluated in parallel:
❶ pvalues is a macro, allowing deferred and parallel evaluation of the expressions in the input.
❷ The result is the lazy sequence of evaluations of the expressions in the input.
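The elided pvalues snippet might be reconstructed as:

```clojure
;; the expressions are not evaluated at the call site:
;; each one runs in its own future
(pvalues (+ 1 2) (* 3 4) (str "a" "b"))
;; (3 12 "ab")
```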
All the functions, pmap, pcalls and pvalues, produce a lazy sequential output which
corresponds to the ordered evaluation of the items in the input.
CONTRACT
Input
• "f" in pmap is a mandatory argument. "f" must be a function of one or more
arguments. The number of arguments corresponds to the number of input
collections.
• "coll" in pmap is a mandatory argument. "coll" needs to provide a sequential view,
such that (instance? clojure.lang.Seqable coll) is true.
• "colls" in pmap means that any number of additional collections are accepted after
"coll". The number of collections determines the required arity for "f".
• "fns" in pcalls is any number of functions of no arguments. It also accepts no
arguments.
• "exprs" in pvalues is any number of valid Clojure expressions, including an
empty list of expression arguments.
Notable exceptions
• ArityException when invoking pmap without at least one "coll". Note: there is
no transducer version of pmap.
Output
• pmap returns the lazy sequence containing the result of applying "f" to all the
elements in the input collection. Please see map for general considerations about
the presence of multiple collections.
• pcalls returns the lazy sequence containing the result of invoking each input
function without arguments.
• pvalues returns the lazy sequence containing the evaluation of all expression
arguments.
Examples
pmap achieves easy and immediate parallelism with the same map interface. However,
there are good reasons to avoid replacing every use of map with pmap:
• The computational cost of the transformation function (or of the arguments in case
of pcalls/pvalues) should be substantial: if it isn't, there is a good chance the
thread orchestration cost is going to outweigh any performance benefit.
• pmap produces the output in the same order as the input by performing ordered
batches of parallel computation. If one input produces significantly more work
than another in the batch, pmap needs to wait before moving on to the next batch.
The presence of a longer computation in a batch decreases the level of parallelism
(see the performance section for an example).
Even taking into account the constraints above, a sufficiently large application usually
contains some part of the code justifying the need for pmap. One such case is
processing large datasets, for example the documents resulting from querying
ElasticSearch 158 or some other service. The reader is invited to review a few
solutions using pmap illustrated elsewhere in the book:
• xml-seq contains an example processing of large documents that uses pmap to
speed up processing. Each document is roughly the same length and the
transformation is not trivial.
• partition also contains an interesting example of pmap using aggregation to group
smaller tasks into larger ones.
In those cases where it makes sense to use pmap, depending on the type of
transformation "f" and the input size, we can control the thread orchestration by
using partition-all. By grouping items, we create partitions of sequential processing
that could improve performance:
(require '[criterium.core :refer [quick-bench]])
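A sketch of the three measurements the callouts below describe (the size of "xs" and the partition size are assumptions, and time is used here for a rough signal where the book's numbers come from quick-bench):

```clojure
(def xs (doall (range 100000))) ; a relatively large, fully realized input

(time (dorun (map eval xs)))  ; sequential baseline: eval each number
(time (dorun (pmap eval xs))) ; naive pmap: one future per item
(time (dorun (pmap #(dorun (map eval %))          ; partitioned pmap:
                   (partition-all 512 xs))))      ; sequential work per batch
```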
❶ "xs" is a relatively large sequence. We use eval to simulate a non-trivial computation. Calling eval on
a number just produces the same number. We can see that it takes around 23ms on average to
process the input sequence.
158
ElasticSearch is a popular document store. Queries to ElasticSearch can retrieve potentially large lists of documents in
JSON format.
❷ We decide to give pmap a go, assuming eval is expensive enough to justify the thread orchestration
cost. This is true, but the advantage is minimal, taking around 19ms on average.
❸ By partitioning the input we trade some of the parallelism in exchange for a reduced thread overhead
that in this case is definitely paying off.
In cases where pmap's advantage seems minimal compared to the sequential case, it’s
worth testing if partitioning the input produces positive effects. However, the reader
should always remember to carefully benchmark such assumptions.
Understanding pmap
One frequent question about pmap is how many threads are actually working in parallel. We can’t
control pmap parallelism directly, but we have some control over the chunk size of the input sequence
(the other option being increasing the number of CPU cores). The easiest case to understand is to
assume no chunking in the input sequence:
(defn f [x] ; ❷
(Thread/sleep (+ (* 10 x) 500))
(println (str "done-" x))
x)
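Callout ❶ below refers to a dechunk function; a minimal sketch (the definition of "s", shown as a comment, matches the narrative of the callouts):

```clojure
(defn dechunk [xs] ; remove chunking: realize strictly one element at a time
  (lazy-seq
    (when-let [s (seq xs)]
      (cons (first s) (dechunk (rest s))))))

;; "s" would then be defined along the lines of:
;; (def s (pmap f (dechunk (range 8))))
```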
(first s) ; ❹
0
;; done-2
;; done-3
;; done-4
;; done-5
;; done-6
(take 2 s) ; ❺
(0 1)
;; done-7
❶ dechunk creates a lazy sequence on top of another, removing any chunking for downstream consumers.
❷ We call pmap using a tracing function that prints a message after a sleep period that slightly increases
on each call. The increasing sleep time gives time to println to flush the entire string to standard
output, so we can see each message appearing on a different line (they would interleave otherwise).
❸ Interestingly, two threads start at definition time, even if we don’t consume elements from the
sequence. This is a byproduct of how pmap implementation destructures input internally. In general,
this shouldn’t be a problem, unless you’re searching for maximum laziness.
❹ As soon as we take the first item, pmap goes ahead with the computation. The expression was
evaluated on a 4-core machine. pmap is designed to stay (+ 2 N-cores) ahead of the requested
item: the requested item is the first, 2 items had already been computed before, and 4 more
items are evaluated.
❺ From this point onward, following requests move the head of the computation forward one element,
starting a new future each time.
If the input sequence is not chunked, pmap stays (+ 2 N-cores) ahead of the requested item,
providing readily cached results for incoming requests. If we suddenly request the last
element, pmap guarantees it would never go beyond (+ 2 N-cores) concurrent requests. The situation
changes with chunked sequences, as the next request for an item might result in the entire chunk
getting realized:
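A possible definition for the chunked counterpart used below (f reproduced from the earlier listing so the snippet stands alone):

```clojure
(defn f [x] ; as defined earlier: a traced, slightly increasing sleep
  (Thread/sleep (+ (* 10 x) 500))
  (println (str "done-" x))
  x)

(def s (pmap f (range 10))) ; a chunked range this time: no dechunk
```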
(first s) ; ❷
0
The following rules can be used to understand how many threads pmap runs at once (assuming tasks are
roughly the same computational cost). The min level corresponds to the situation where the consumer is
slower than the producer, while the max level is when the consumer is faster than the producer:
• When the sequence is not chunked (for example subvec) the min parallelism is 1 and the max
parallelism is (+ 2 N-cores). Example: with 12 cores, (doall (pmap #(Thread/sleep %)
(subvec (into [] (range 1000)) 0 999))) keeps 12+2 threads busy.
• In case of chunked sequences (the vast majority have size 32), the min parallelism is (min chunk-size
(+ 2 n-cores)), while the max amount is equal to (+ chunk-size 2 N-cores). Example: with
12 cores, (doall (pmap #(Thread/sleep %) (range 1000))) keeps 12+2+32 threads busy.
With those rules in mind, by changing the chunk size, we can get any grade of parallelism:
(defn re-chunk [n xs] ; ❶
(lazy-seq
(when-let [s (seq (take n xs))]
(let [cb (chunk-buffer n)]
(doseq [x s] (chunk-append cb x))
(chunk-cons (chunk cb) (re-chunk n (drop n xs)))))))
Sequences with a custom chunk size are rare but possible. If your application implements one,
uses pmap, and the chunk size is in the thousands, there is some possibility of saturating the
unbounded future thread pool and losing control of the JVM. Something to keep in mind.
See also:
• fold is the main entry point into a different model of parallel computation called
"fork-join". fold is designed to handle some variance in computational complexity
thanks to an algorithm called "work-stealing".
• future is the threading primitive used by pmap to send computations off to parallel
threads.
Performance considerations and implementation details
The situation does not necessarily improve with aggregation when the computation is
too trivial. However, if we repeat the same example with aggregation, we can see
how pmap is now only 2 times slower:
(let [xs (partition-all 1000 (range 100000))]
(quick-bench
(into [] (comp cat (map inc)) xs))) ; ❶
;; Execution time mean : 6.553814 ms
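The pmap variant that callout ❷ compares against might look like the following sketch (the exact shape is an assumption; time is used here where the book uses criterium):

```clojure
(let [xs (partition-all 1000 (range 100000))]
  (time
    (into [] cat (pmap #(mapv inc %) xs)))) ; parallel work on each group,
                                            ; cat removes the inner grouping
```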
❶ partition-all groups input items into inner sequences of 1000 items each. We can
use into with transducers to process each item in the inner sequences and remove the inner grouping with cat.
❷ pmap is now only 2 times slower.
Another aspect to consider is uniformity of the computational cost across the input.
The following example distributes 10 long computations across an input sequence of
320 items. The size of the input and the distribution of the long tasks exacerbate the
dependency of pmap on chunked sequences:
(def xs (map #(if (zero? (mod % 32)) 1000 1) (range 0 320))) ; ❶
❶ The input sequence "xs" contains 320 repetitions of the number 1, but every 32 items the 1 was
replaced by 1000. This means that "xs" contains exactly 1 occurrence of the number 1000 for each
chunk of size 32.
❷ We first run normal map with a function waiting the number of milliseconds as indicated by the input
item. As expected, the sequential execution lasts roughly 10 seconds.
❸ The second run uses pmap instead of map. The execution lasts again 10 seconds despite the
parallelism.
The example above is the worst possible scenario for pmap, which executes exactly like
the sequential case. But if we push multiple occurrences of the number 1000 inside the
same chunk (the higher the number the slower the task), we give pmap the opportunity
to execute them in parallel:
(time (dorun (pmap #(Thread/sleep %) (sort xs)))) ; ❶
;; "Elapsed time: 1028.686387 msecs"
❶ The only change compared to the previous example is sorting the input.
After sorting the input we achieve the effect of compacting all long running tasks
inside the same chunk, showing that uniformity is key for pmap.
• repeatedly calls a function without arguments to obtain the next element for the
sequence.
• iterate calls a function and uses the result to invoke the same function again to
obtain the next element for the sequence.
• repeat repeats the same input items to produce a sequence.
• cycle repeats the same input items in order to produce a flattened sequence.
Implementation notes
Generators like iterate, repeat, cycle and range have a dedicated Java implementation as part of the
Clojure internals (while repeatedly is the only one written in pure Clojure). There are good performance
reasons to implement them in Java (especially to provide a fast path for reducing and transducing), but
as an exercise in functional design, here’s how they would be implemented in Clojure with cons and lazy-
seq:
(defn iterate* [f x] ; ❷
(lazy-seq (cons x (iterate* f (f x)))))
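The remaining implementations that the callouts below describe (repeatedly*, repeat*, cycle* and range*) are not shown; they might be sketched as follows:

```clojure
(defn repeatedly* [f] ; "cons" the result of (f) on each iteration
  (lazy-seq (cons (f) (repeatedly* f))))

(defn repeat* [x] ; no function involved: "cons" the same input over and over
  (lazy-seq (cons x (repeat* x))))

(defn cycle* [coll] ; inner step loop over coll, restarting when exhausted
  ((fn step [xs]
     (lazy-seq
       (if-let [s (seq xs)]
         (cons (first s) (step (rest s)))
         (when (seq coll) (step coll)))))
   coll))

(defn range* [n] ; simplified: count up from 0 until the counter reaches n
  ((fn step [i]
     (lazy-seq
       (when (< i n)
         (cons i (step (inc i))))))
   0))
```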
❶ repeatedly* is the only function from this set already implemented in Clojure. The implementation
has been copied over as a model for the others. We can see that in order to create the sequence we
need to "cons" (f) on each iteration.
❷ iterate* treats the function differently, as we want the result of each (f x) to become the input for
the next iteration. (f x) is invoked on iteration and we "cons" the value that was returned before
iterating again.
❸ repeat* does not use a function. We just "cons" the input over and over at each iteration while
generating the lazy sequence.
❹ cycle* works similarly to repeat*, but the fact that the input collection needs to be iterated requires an
additional inner loop wrapped by the step function. The outer recursion starts with the step function
being invoked with "coll" the initialization parameter. After destructuring the input we verify if we have
more items coming from "coll" and in that case we "cons" the first of them into the sequence. In case
we don’t have more items, we start a new outer cycle using "coll" again.
❺ range* presented here is a simplified version of the function in core. Similarly to cycle* we need to use
an inner recursion to maintain an incrementing counter. At each step of the recursion we "cons" the current
number into the sequence and call step again incrementing it. We are done when the counter is equal
to "n".
What follows is a more formal treatment for each of the functions in this section.
9.2.1 repeatedly
function since 1.x
(repeatedly
([f])
([n f]))
repeatedly generates an infinite lazy sequence by calling the same function with no
arguments and collecting the results:
(take 3 (repeatedly rand)) ; ❶
;; (0.2416205627046507 0.8326807316362209 0.9275189497929626)
❶ rand returns a random double between 0 and 1. repeatedly calls rand each time producing a
different number which is collected as a lazy sequence. Remember to use take to avoid printing
infinite numbers on screen.
repeatedly is useful to create sequences from functions with side effects. With a pure
function (a function returning the same output given the same input) it would return a
repetition of the same item over and over (and repeat already exists for that use case).
repeatedly also takes a number "n" of repetitions to perform before stopping:
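The example the callout below describes could look like this (the exact predicate is an assumption, and the output will vary since rand is involved):

```clojure
(repeatedly 5 #(< (rand) 0.5)) ; five coin flips, e.g. (true false true true false)
```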
❶ An example showcasing the number "n" of repetitions to produce. The side effecting function converts
random numbers into a sequence of true or false.
CONTRACT
Input
• "f" is a function of no arguments, possibly side-effecting, returning different
results on each invocation.
• "n" is optional. When present, it is expected to be a positive number or
zero. double numbers are rounded up to the nearest integer.
Notable exceptions
• NullPointerException if "f" or "n" is nil.
Output
repeatedly returns the sequence generated by calling "f" with no arguments "n" times,
or an infinite number of times if "n" is not present.
Examples
repeatedly is used throughout the book for random data generation. One interesting
example is about the generation of proverbs while discussing rand-nth. Other examples
generate streams of random numbers, but we could also generate strings or keywords
with gensym:
(zipmap (map keyword (repeatedly gensym)) (range 5)) ; ❶
;; {:G__52 0, :G__53 1, :G__54 2, :G__55 3, :G__56 4}
❶ We use repeatedly and gensym to generate more interesting keys for a map.
Another example of a side effecting function is a future. A future takes a form body
without evaluating it and sends the form to a separate thread for evaluation. Using this
knowledge, we can build an infinite lazy sequence of workers. When we need more
workers, we take from the sequence and start the concurrent computation. The
resulting sequence contains the results of the computation:
(import '[java.util.concurrent ConcurrentLinkedQueue])
(def q (ConcurrentLinkedQueue. (range 1000))) ; ❶
(def ^:const parallel 5)
(def workers ; ❸
(repeatedly
#(let [out *out*]
(future
(binding [*out* out]
(when-let [item (.poll q)]
(task item)))))))
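The helpers task, done? and run that the callouts describe are a possible reconstruction; the snippet below condenses the definitions above so it stands alone, and the names and the exact stop condition are assumptions matching the printed output:

```clojure
(import '[java.util.concurrent ConcurrentLinkedQueue])

(def q (ConcurrentLinkedQueue. (range 1000))) ; as above
(def ^:const parallel 5)                      ; as above

(defn task [item] ; simplified work: sleep, report, return the increment
  (Thread/sleep item)
  (println "Work done on" item)
  (inc item))

(def workers ; as above
  (repeatedly
    #(let [out *out*]
       (future
         (binding [*out* out]
           (when-let [item (.poll q)]
             (task item)))))))

(defn done? [results] ; stop when the sum of the batch results exceeds 30
  (> (reduce + 0 (remove nil? results)) 30))

(defn run [workers]
  (loop [ws workers]
    (println "-> starting" parallel "new workers")
    (let [results (mapv deref (doall (take parallel ws)))] ; join the batch
      (cond
        (done? results) results     ; condition met: return the batch results
        (.isEmpty q)    nil         ; queue drained: stop with no results
        :else           (recur (drop parallel ws)))))) ; next batch of workers
```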
(run workers) ; ❾
;; -> starting 5 new workers
;; Work done on 0
;; Work done on 1
;; Work done on 2
;; Work done on 3
;; Work done on 4
;; -> starting 5 new workers
;; Work done on 5
;; Work done on 6
;; Work done on 7
;; Work done on 8
;; Work done on 9
;; [6 7 8 9 10]
❶ We need a concurrent data structure to hold the input for the workers. Workers compete to take an
item from the data structure to produce a
result. java.util.concurrent.ConcurrentLinkedQueue is a good choice, as in a real life scenario
we could push more input into this queue while the workers are running. For this example we push
integers into the queue.
❷ The task is the core job of each worker. The example simplifies the task by just waiting some number
of milliseconds, printing a message and returning the incremented integer.
❸ workers builds an infinite sequence of workers ready to work on tasks. The function passed
to repeatedly needs to establish a binding between the standard output of the main thread and the
one in the future. This is done here to show messages on screen. The body of the future invocation
takes one item from the queue and produces a result.
❹ run orchestrates how many workers should be put at work and when we should stop.
❺ done? is a predicate that decides if the current set of results satisfies the global condition to stop and
return the results we found. In this example we check if the sum of the results is above 30 to finish the
computation.
❻ Each iteration we take and realize some number of futures (5 in this example, but it would likely be
a higher number of concurrent threads in a real-life scenario). (doall (take parallel
workers)) realizes the first 5 elements from the infinite sequence returned by repeatedly, which
means 5 concurrent threads will be at work. doall is necessary to actually kick-off the threads as
the take operation would be lazy without.
❼ mapv and deref ensure the futures are done before checking the results (the equivalent of a "join"
operation to wait for all threads).
❽ If we are done with the results, we return them so they can be inspected. If the queue is empty, we
stop the recursion and return no results. If we can proceed, we recur after dropping the first 5 workers
we just used.
❾ We can see no output messages before running the loop, confirming that the entire design is fully lazy
and no threads are in flight before we start.
After calling run we can see batches of 5 workers printing messages. After reaching
the exit condition, run prints the batch that produced the right result.
See also:
• repeat produces a sequence starting from a single repeating value or expression,
instead of a function of no arguments.
• dotimes does not produce a sequence, but executes the body the given number of
times for side effects. Use dotimes when the computation is purely for side effects
and there are no results to be collected.
• iterate passes the result of the previous invocation as the input for the next one.
Use iterate when there is a relationship between the results of each invocation.
• rand, future or atom are typical targets for repeatedly, often appearing as part
of the function passed as argument.
• lazy-seq is repeatedly's building block. General considerations about laziness
discussed in lazy-seq apply to repeatedly as well.
Performance considerations and implementation details
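The two scenarios the callouts below contrast can be sketched as follows (n is kept illustrative here; the book's example uses a much larger sequence to actually trigger the error):

```clojure
(def n 100000) ; large enough in the real example (e.g. 100000000) to exhaust memory

(last (repeatedly n rand)) ; no reference to the head: items can be GC'd as we go

;; Holding on to the head while walking to the last item retains every
;; cached element, eventually throwing OutOfMemoryError:
;; (let [xs (repeatedly n rand)]
;;   [(last xs) (first xs)])
```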
❶ We generate a large sequence of random numbers. When we access the last item, there is nothing
else before the end of the form that needs to consume the sequence, so the items in the sequence
can be iterated and discarded, giving a chance to the garbage collector to remove them from memory.
❷ The second example asks again for the last item, but we also want to access the first at the end of the
evaluation of the form. The sequence cannot be garbage collected and needs to stay in memory to
satisfy this last request. At the same time, the request for the last item completely evaluates the
sequence, caching all random numbers in memory. We can see the typical
OutOfMemoryError message because the garbage collector is unable to free enough memory for new
items to be cached.
As mentioned at the beginning of the generators section, repeatedly does not have a
fast reducing/transducing path like the other generators. There are the following
consequences:
1. When invoking reduce or any transducing functions (like transduce or into) the
iteration happens through the lazy sequence without optimizations (while other
generators have custom reducing paths).
2. Results while reducing are cached, so multiple reduce calls on the same sequence
are also cached.
The last point about caching during reduction is important for a function
like repeatedly designed to work with side effects: if results were not cached, you
could see different results each call. Other generators working with pure functions can
instead skip caching and gain speed during reduce or transduce.
9.2.2 iterate
function since 1.0
(iterate f x)
iterate takes a function "f" and an initial parameter "x" and invokes "f" over "x". The
result of the invocation is used as the new parameter to invoke "f" again, which returns
the next result and so on, gradually building a sequence of the results:
(take 10 (iterate inc 0)) ; ❶
;; (0 1 2 3 4 5 6 7 8 9)
❶ iterate is used here to simulate range. 0 is the explicit start of the sequence, then (inc 0) produces
1, (inc 1) produces 2 and so on, until we take 10 elements.
One of the main use cases for iterate is the generation of arbitrarily complex lazy
sequences to be used as input into other sequential processing.
CONTRACT
Input
• "f" is a function of one argument and is a mandatory argument. Compared
to repeatedly, "f" is expected to be free of side effects.
• "x" is the argument for "f" and can be of any type. It is also a required argument.
Notable exceptions
• NullPointerException when "f" is nil.
• ClassCastException when "f" is not a function (does not
implement clojure.lang.IFn).
Output
• returns: the infinite lazy sequence of x, followed by (f x), (f (f x)) and so on.
iterate returns a clojure.lang.Iterate object, which is a type of sequence
(implementing the clojure.lang.ISeq interface).
Examples
One of the classic examples of iterate is the function producing the Fibonacci
numbers. The series starts with a fixed "0,1" pair and each number that follows is the
sum of the previous two:
(def fibo
(iterate ; ❶
(fn [[x y]] [y (+' x y)])
[0 1]))
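Extracting the results as described in callout ❷ (fibo reproduced so the snippet stands alone):

```clojure
(def fibo ; as above
  (iterate (fn [[x y]] [y (+' x y)]) [0 1]))

(take 8 (map first fibo)) ; keep only the first element of each pair
;; (0 1 1 2 3 5 8 13)
```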
❶ The design of the function follows quite straightforward from the definition of the series. The initial
argument for iterate is the given pair [0 1]. The iteration shifts the pair forward by using the sum of
the elements as the new item in the next pair. Note the use of +' to enable automatic promotion
to clojure.lang.BigInt. This is useful for Fibonacci numbers that grow quite quickly.
❷ The vector pair in the sequence carries all the information needed for the next iteration, but to extract
the actual results we need to only take the first element.
iterate shines with series where the previous element has a relationship with the next,
like Fibonacci or the inverse tangent series at the basis of the Leibniz approximation of
Pi 159. We saw the formula already in filterv in its sequential
version. iterate has a custom reduce implementation and we want to take advantage
of that by rewriting the formula to use transduce:
(defn calculate-pi [precision] ; ❶
(transduce
(comp
(map #(/ 4 %))
(take-while #(> (Math/abs %) precision)))
+
(iterate #(* ((if (pos? %) + -) % 2) -1) 1.0)))
(calculate-pi 1e-6) ; ❷
;; 3.141592153589724
❶ calculate-pi is a rewrite of the function presented in filterv to calculate the approximation of Pi. The
transformation of the formula to use transduce is straightforward: iterate becomes the
source of items and processing is now part of the composition of transducers.
❷ We can see how to calculate an approximation of Pi. Decimals are correct up to the 6th decimal digit.
Note that asking for additional precision requires exponentially more time.
iterate can be generalized to any process in which state depends on the previous state,
not necessarily simple numbers. We already created a Game of Life implementation
when talking about for, but we never used it for a full simulation. The reader is invited
to review that implementation, but for the purpose of iterating states of the Game of
Life we are only interested in the next-gen function.
next-gen takes height and width of the grid where cells live and an initial set of alive
cells. It returns the next state, also as a set. We are going to iterate some number
of next-gen states and print them:
;; please see "next-gen" from the "for" chapter.
159
The Leibniz formula is well described in the dedicated Wikipedia page en.wikipedia.org/wiki/Leibniz_formula_for_π
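The grid and life functions the callouts below describe are not shown in full; a possible sketch (next-gen comes from the for chapter, and its exact signature here is an assumption):

```clojure
(declare next-gen) ; implemented in the "for" chapter

(defn life [height width init] ; infinite lazy sequence of Game of Life states
  (iterate #(next-gen height width %) init))

(defn grid [height width state] ; render living cells as "<>" inside edges
  (apply str
         (concat
           (repeat (* 2 width) "-") ["\n"]
           (for [i (range height)]
             (str "|"
                  (apply str (for [j (range width)]
                               (if (state [i j]) "<>" "  ")))
                  "|\n"))
           (repeat (* 2 width) "-"))))
```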
(def pulsar-init ; ❸
#{[2 4] [2 5] [2 6] [2 10] [2 11] [2 12]
[4 2] [4 7] [4 9] [4 14]
[5 2] [5 7] [5 9] [5 14]
[6 2] [6 7] [6 9] [6 14]
[7 4] [7 5] [7 6] [7 10] [7 11] [7 12]
[9 4] [9 5] [9 6] [9 10] [9 11] [9 12]
[10 2] [10 7] [10 9] [10 14]
[11 2] [11 7] [11 9] [11 14]
[12 2] [12 7] [12 9] [12 14]
[14 4] [14 5] [14 6] [14 10] [14 11] [14 12]})
(defn pulsar [] ; ❹
(let [height 17 width 17 init pulsar-init]
(doseq [state (take 3 (life height width init))]
(println (grid height width state)))))
❶ The grid function contains all the code necessary to format the life grid. A living cell is printed with
"<>" else the cell is left blank. The function also takes care of printing horizontal and vertical edges to
enclose the grid.
❷ life contains a call to iterate to create an infinite sequence of Game of Life states given an initial
one. "height" and "width" are necessary to next-gen to calculate neighbors. next-gen function is
visible in the example section of the for chapter.
❸ A Pulsar (www.ericweisstein.com/encyclopedias/life/Pulsar.html) is a period 3 oscillator that creates a
nice shape. We can give any of the 3 states as the initial population. The initialization is a set of live
cells.
❹ To print a Pulsar we need a large enough grid of 17x17 cells. We use doseq to take 3 states (after
which the printout restarts from the initial state).
(pulsar)
----------------------------------
| |
| <><><> <><><> |
| |
| <> <> <> <> |
| <> <> <> <> |
| <> <> <> <> |
| <><><> <><><> |
| |
| <><><> <><><> |
| <> <> <> <> |
| <> <> <> <> |
| <> <> <> <> |
| |
| <><><> <><><> |
| |
----------------------------------
----------------------------------
| |
| <> <> |
| <> <> |
| <><> <><> |
| |
| <><><> <><> <><> <><><> |
| <> <> <> <> <> <> |
| <><> <><> |
| |
| <><> <><> |
| <> <> <> <> <> <> |
| <><><> <><> <><> <><><> |
| |
| <><> <><> |
| <> <> |
| <> <> |
| |
----------------------------------
----------------------------------
| |
| <><> <><> |
| <><> <><> |
| <> <> <> <> <> <> |
| <><><> <><> <><> <><><> |
| <> <> <> <> <> <> |
| <><><> <><><> |
| |
| <><><> <><><> |
| <> <> <> <> <> <> |
| <><><> <><> <><> <><><> |
| <> <> <> <> <> <> |
| <><> <><> |
| <><> <><> |
| |
----------------------------------
See also:
• repeat produces a sequence starting from a single repeating value or expression,
instead of a function invoked on the previous result.
• repeatedly produces a sequence by invoking the same function without arguments
(primarily for side effects).
• dotimes produces the side effects (if any) of evaluating its body the given amount
of times.
Performance considerations and implementation details
(defn iterate* [f x] ; ❶
(lazy-seq (cons x (iterate* f (f x)))))
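The two measurements the callouts below refer to could be sketched as follows (time is used here for a rough signal; the book's numbers come from criterium, and iterate* is reproduced so the snippet stands alone):

```clojure
(defn iterate* [f x] ; as above
  (lazy-seq (cons x (iterate* f (f x)))))

(time (into [] (take 1000) (iterate* inc 0))) ; reduce over a plain lazy-seq
(time (into [] (take 1000) (iterate inc 0)))  ; optimized internal reduce
```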
❶ iterate* is a straightforward implementation using lazy-seq and cons, showing what a pure
Clojure implementation would look like.
❷ The first benchmark measures our iterate* implementation based on lazy
sequences. into uses reduce which in this case is not optimized.
❸ The second benchmark repeats the same operation on the standard optimized iterate.
Note that the custom reduce implementation of iterate lacks the typical sequential
caching. In this respect iterate is similar to eduction:
(let [itr (iterate* #(do (println "eval" %) (inc %)) 0) ; ❶
v1 (into [] (take 2) itr)
v2 (into [] (comp (drop 2) (take 2)) itr)]
(into v1 v2))
;; eval 0
;; eval 1
;; eval 2
;; eval 3
;; [0 1 2 3]
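The second example (callout ❷) simply swaps iterate* for core iterate; the repeated "eval 0" line in the output reveals the missing cache:

```clojure
(let [itr (iterate #(do (println "eval" %) (inc %)) 0)
      v1 (into [] (take 2) itr)
      v2 (into [] (comp (drop 2) (take 2)) itr)]
  (into v1 v2))
;; prints "eval 0" more than once: evaluations are not cached
;; => [0 1 2 3]
```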
❶ The first example shows iterate* (an implementation of iterate using cons and lazy-seq) being
used twice after creation in two separate into invocations to create a vector and applying a take-
drop transducer combination. The function used to iterate is also printing each incremented number.
❷ In the second example, core iterate has been used instead for the same exact operation.
We can see how iterate* caches evaluations of items (and also evaluates an additional
item ahead, to check for the end of the sequence condition before creating the
next lazy-seq instance). The standard iterate version produces multiple prints of the
same item evaluation, showing that there is no caching. This is especially important to
understand when iterate is used with side-effecting functions, as there is no guarantee
about how many times iterate will call the function.
9.2.3 repeat and cycle
function since 1.0
(repeat
([x])
([n x]))
(cycle
[coll])
repeat and cycle have a similar goal, generating a sequence by performing some
repetition of the input. repeat takes a single value "x" and produces a lazy sequence by
repeating it an infinite number (or "n") of times. cycle takes instead a collection of values and
produces a sequence by repeating the content of the collection in a cycle:
(take 5 (repeat (+ 1 1))) ; ❶
;; (2 2 2 2 2)
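The cycle counterpart described by callout ❷:

```clojure
(take 7 (cycle [1 2 3])) ; items taken from the collection in a cycle
;; (1 2 3 1 2 3 1)
```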
❶ repeat follows normal evaluation rules and evaluates the expression passed as an argument. The
number is then repeated indefinitely creating an infinite sequence. We take some amount of items
from the sequence to show the result.
❷ cycle uses the given collection as the source of repetition, taking the items in the collection in a cycle
and producing a new sequence out of them.
repeat also accepts a number of elements to limit the length of the sequence:
(repeat 5 1) ; ❶
;; (1 1 1 1 1)
CONTRACT
Input
• "x" can be any expression or value including nil. It is a mandatory argument.
• "n" is optional. When present it can be a positive number. If 0 or
negative, repeat produces an empty clojure.lang.PersistentList. If "n" is a double,
it is truncated to the nearest integer.
• "coll" is a sequential collection (a collection that can produce a sequence
following the seq contract) or nil.
Notable exceptions
• NullPointerException when "n" is nil.
Output
• repeat generates a sequence by repeating the single value "x" infinite (or "n")
times. The type of the result is a sequence-
compatible clojure.lang.Repeat object.
• cycle generates a sequence by cycling through a collection of values an infinite
number of times. The type of the result is a sequence-
compatible clojure.lang.Cycle object. It produces an
empty clojure.lang.PersistentList when the input collection is empty.
Examples
repeat and cycle are flexible tools for sequential processing. For example, repeat can
be used as the second input for map to transform a collection of words:
(defn bang [sentence] ; ❶
(map str (.split #"\s+" sentence) (repeat "!")))
❶ The bang function splits a string into separate words. repeat is used to generate an infinite sequence
of exclamation marks which can be adapted to the sequence of words as the input for map.
The second parameter "n" can be used to limit the sequence length, for example to
calculate x^y (the power of "x" to exponent "y") with:
(defn pow [x y] (reduce * (repeat y x))) ; ❶
(pow 2 3)
;; 8
❶ We use repeat to create a sequence of multiplication factors that we can reduce with "*" to obtain the
final result.
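The to-tally helper the callouts below describe is not shown; a possible sketch using repeat twice, as callout ❶ explains (the exact name is taken from the example, the implementation is an assumption):

```clojure
(defn to-tally [n] ; n as tally marks: one 卌 per group of five, then the remainder
  (apply str
         (concat (repeat (quot n 5) "卌")
                 (repeat (rem n 5) "|"))))
```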
(defn new-tally [] ; ❷
(let [cnt (atom 0)]
(fn []
(to-tally (swap! cnt inc)))))
(def t (new-tally))
(t) ; ❸
;; "|"
(t)
;; "||"
(t)
;; "|||"
(t)
;; "||||"
(t)
;; "卌"
(repeatedly 5 t) ; ❹
;; ("卌|" "卌||" "卌|||" "卌||||" "卌卌")
❶ The implementation of the tally system is a concatenation of the constituent characters based on their
ordinal number. The special UTF-8 symbol 卌 (U+534C) is used to simulate the horizontal
strikethrough every 4 vertical lines. repeat is used twice, one for the repetition of the strikethrough
and another for the remainder after the last strikethrough.
❷ A new tally is created by closing over an atom state. This simulates the appearance of a new tally sign
each time we call the generated function.
❸ Each invocation of the tally function t returns the next state of the tally, up to the point where the
strikethrough is used.
❹ We can optionally use repeatedly to generate many of them in a sequence.
cycle can be used to adapt a short collection to fit a larger sequential view. We’ve seen
160
Tally marks are simple numeral systems with a single symbol. The number of symbols is the total count, while their
grouping is a visual tool to help counting large numbers. See the Wikipedia entry to know
more: en.wikipedia.org/wiki/Tally_marks
a few examples already in the book that the reader is invited to review:
• cycle was used in trampoline to build an infinite sequence of states for the
changing colors of a traffic light. There are only three colors, but they’ve been
adapted by cycle to simulate infinite traffic light changes over time.
• We’ve used cycle to generate an arbitrary long password using a finite alphabet of
symbols in random-sample.
• The Leibniz formula approximation for Pi also uses cycle in the "creating your
own fold" call-out.
For idiomatic use, cycle should be preferred over the following combination of concat and repeat:
(take 10 (apply concat (repeat [1 2 3]))) ; ❶
;; (1 2 3 1 2 3 1 2 3 1)
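Callout ❷ refers to the equivalent, preferred cycle version:

```clojure
(take 10 (cycle [1 2 3])) ; same result, no intermediate nesting
;; (1 2 3 1 2 3 1 2 3 1)
```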
❶ concat is used with repeat over a collection of elements, flattening the infinite sequence of [1 2 3]
vectors produced by repeat.
❷ This is exactly the use case for cycle.
See also:
• iterate is a flexible form of repetition to generate a lazy sequence with a function
that decides the next item based on the previous. repeat, for instance, can be
written in terms of iterate as: (iterate identity x).
• repeatedly has a specific focus on side effecting functions to produce the output
sequence.
• constantly produces a function not a sequence. The function can be passed any
number of arguments and always returns the same result.
• dotimes iterates the body and produces nil. The body is evaluated for side effects
only.
Performance considerations and implementation details
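The two measurements the callouts below describe might be sketched as follows (the alternating series is reconstructed from the callouts, and time stands in for criterium's quick-bench):

```clojure
;; map over two inputs: cycle provides the alternating sign
(time (reduce + (take 1000000 (map * (cycle [-1 1]) (range)))))

;; transduce: map-indexed generates the numbers, cycle is the only source
(time (transduce (comp (map-indexed (fn [i sign] (* i sign)))
                       (take 1000000))
                 +
                 (cycle [-1 1])))
```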
❶ The initial map operation with two sequences as input produces the series of alternating sign integers
(0 1 -2 3 -4 5 -6 7 -8 9 and so on) and calculates the sum of the first million numbers.
❷ The transduce version of the same operation uses the generation of positive numbers by map-
indexed and keeps cycle as the only source. We can see an approximate 50% speed gain.
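The REPL interaction that the callouts below describe could look like this:

```clojure
(lazy-seq '(1 2 3)) ; a list wrapped in a lazy sequence
;; (1 2 3)

(type (lazy-seq [1 2 3])) ; a vector works as well
;; clojure.lang.LazySeq

(instance? clojure.lang.ISeq (lazy-seq [1 2 3])) ; it conforms to ISeq
;; true
```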
(lazy-seq 1 2 [3]) ; ❹
;; (3)
❶ The REPL automatically prints the evaluation of the last expression. When the value is sequential, like
in this case, it prints as a list.
❷ We can try another sequential input, such as a vector. We can see that the type is a
special clojure.lang.LazySeq object.
❸ The object returned by lazy-seq is a sequence conforming to the clojure.lang.ISeq interface.
❹ Note that you can pass a variable number of arguments. They are implicitly treated in a do block.
The reader at this point might be wondering what’s the purpose of wrapping a
sequential object (something that can be turned into a sequence) in a sequence. There
are two reasons. The main purpose of lazy-seq as a macro is to delay the evaluation of
the input:
(def output (lazy-seq (println "evaluated") '(1 2 3))) ; ❶
;; #'user/output
(first output) ; ❷
;; evaluated
;; 1
❶ Note that we added a side effecting println as part of the arguments. lazy-seq prevents the
evaluation of the input: we don’t see the message when we declare the output.
❷ As soon as we access output, for example to fetch the first item, the body evaluates.
The second goal of lazy-seq is to cache the result of evaluating the input. When the
same lazy-seq form evaluates again, the result comes from the internal cache:
(defn trace [x] (println "evaluating" x) x) ; ❶
(def output (lazy-seq (list (trace 1) 2 3))) ; ❷
(first output) ; ❸
;; evaluating 1
;; 1
(first output) ; ❹
;; 1
❶ trace is a simple debugging function that prints the argument before returning it without any changes.
❷ We use list to produce a list of 3 numbers. The first number is wrapped within the trace function. On
creation of the lazy-seq nothing is printed on screen.
❸ Evaluating the first element produces the message as well.
❹ The second time we access the output, we only see the number 1 and no "evaluating" string. lazy-
seq evaluated the body once and then cached the result of evaluating the expression.
The fundamental properties of lazy-seq are not particularly interesting in
isolation, but lazy-seq is a fundamental building block to produce lazy sequences in
tandem with cons. We can chain lazy-seq objects together to delay the evaluation of the items
in a sequence (the fundamental aspect of laziness). Please see the example section for
the details.
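As a minimal preview of the pattern (the function name integers-from is illustrative, not from the book's listings):

```clojure
;; Chaining lazy-seq and cons: each recursive call is suspended inside
;; a new lazy-seq object, so items materialize only on demand.
(defn integers-from [n]
  (lazy-seq (cons n (integers-from (inc n)))))

(take 5 (integers-from 0))
;; (0 1 2 3 4)
```

Although integers-from calls itself with no base case, the recursion is safe: each call returns immediately with an unrealized LazySeq.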
CONTRACT
Input
• "& body" is a variable length argument. The arguments are implicitly wrapped in
a do block. The result of evaluating the do block expression needs to be sequential
(any collection implementing the clojure.lang.Seqable interface).
Notable exceptions
IllegalArgumentException when "body" does not evaluate to a sequential collection,
as per seq semantics. Note that given the implicit do block, the last expression needs
to return a sequential collection:
(lazy-seq 1 2 3) ; ❶
;; IllegalArgumentException Don't know how to create ISeq
❶ lazy-seq accepts any number of arguments that are treated as an implicit do block. The result of the
evaluation of the do block needs to return a sequential collection. In this case, "3" is not sequential.
Output
• returns: a sequence representing the sequential view over the input "body".
Returns an empty sequence if there are no arguments or if the argument is an
empty collection. lazy-seq never returns nil.
Examples
A sequential operation, like transforming each item in a sequence, can become lazy by
interleaving a delaying lazy-seq into each transformation step. The pattern of
using lazy-seq and cons in a recursive function is the canonical way to generate a
lazy sequence in Clojure (and it’s used pervasively throughout the standard library).
Please compare the following functions to transform the input "coll" in a list (a custom
version of map):
(defn eager-map [f coll] ; ❶
(when-first [x coll]
(println "iteration")
(cons (f x)
(eager-map f (rest coll)))))
(defn lazy-map [f coll] ; ❸
  (lazy-seq
    (when-first [x coll]
      (println "iteration")
      (cons (f x)
            (lazy-map f (rest coll))))))
(def lazy-out (lazy-map str (range 10))) ; ❹
(first lazy-out) ; ❺
;; iteration
;; "0"
❶ eager-map is a recursive function creating a list of cons "pairs". Each iteration produces a cons object
by pairing the current transformed item and the computation of the rest of the list.
❷ We can see that even without using the output of eager-map we already fully evaluated the recursion
including performing all transformations. The newly transformed sequence of cons is already existing
in memory right after evaluation.
❸ lazy-map wraps the outer form in a lazy-seq call. This is the only change compared to eager-
map. cons produces a sequence type which is accepted by lazy-seq (along with persistent
list, cons is the other native sequential Clojure type).
❹ The function call is evaluated but it does not produce any output.
❺ Additionally, we can see that asking a single element produces a single recursion loop. This is
because the next lazy-seq wrapper does not force another recursion unless explicitly requested.
The following diagram shows the general idea of the pattern: the fundamental aspect is
the presence of the wrapping lazy-seq call before the body of the function myfn, which
can then be called recursively at any point.
We can apply the same pattern to build lazy sequential abstractions on top of a
disparate range of data producers (the standard library does this extensively). Data
producers can be concrete collections, services or abstract generators.
The following naive implementation of the Sieve of Eratosthenes uses a natural
number generator to return an infinite sequence of prime numbers 161:
(defn sieve [n]
(letfn [(divisor-of? [m] #(zero? (rem % m))) ; ❶
(step [[x & xs]] ; ❷
(lazy-seq (cons x ; ❸
(step (remove (divisor-of? x) xs)))))] ; ❹
(take n (step (nnext (range))))))
161
The Sieve of Eratosthenes is possibly one of the most instructional algorithms to study the effect of laziness. The naive
version presented here is far from being the best algorithm to find prime numbers, but it’s relatively simple to understand.
This Wikipedia page describes enhancements to the basic form as well as links to other
algorithms: en.wikipedia.org/wiki/Sieve_of_Eratosthenes#Algorithmic_complexity
(sieve 10)
;; (2 3 5 7 11 13 17 19 23 29)
❶ The divisor-of? predicate takes a number "m" and returns a function that checks
whether "m" divides its argument: ((divisor-of? 2) 4) is for example true.
❷ The step function contains the recursive action. Each step corresponds to an iteration during
recursion. The step wraps lazy-seq around its body and contains the logic to calculate the next
number. The initial destructuring into "x" and "xs" helps removing the occurrence
of first and rest later on.
❸ The basic recurring step consists of taking the current number "x" at the beginning of the list
and cons it into the rest of the computation. "x" is already a prime number because the rest of the
calculation removes all its divisors from the list of all natural numbers.
❹ The beginning of the recursion prepares an infinite list of positive integers starting from 2
(using nnext is equivalent to calling next twice).
(macroexpand ; ❶
 '(lazy-seq
    (when coll
      (cons
        (f (first coll))
        (lazy-map f (next coll))))))
;; (new clojure.lang.LazySeq
;; (fn* []
;; (when coll
;; (cons
;; (f (first coll))
;; (lazy-map f (next coll))))))
❶ We call macroexpand on the body definition of the lazy-map function as seen in the examples before
in the chapter.
The expansion shows that lazy-seq is a Java object constructor accepting a function
object as parameter. The function object has no parameters and when invoked it
evaluates the content of the body. The recursive call to lazy-map that appears inside
the body of the function does not live on the stack, because it returns the next
clojure.lang.LazySeq object immediately. The object contains a promise for
computation at some later time and is parked on the heap (the default residence for
object allocation).
When a consumer pulls an item from the lazy sequence, the outer LazySeq object
evaluates and caches its value. At the same time, the recursion produces the next
promise for computation. As the consumer asks for additional items, the recursive action
produces more promises for computation. If nothing is holding the head of the
sequence, the first outer LazySeq can be garbage collected and the entire sequence
never resides in memory at once.
©Manning Publications Co. To comment go to liveBook
Simple recursion with lazy-seq produces linear, heap-consuming lazy sequences like
the one shown in this diagram:
Figure 9.4. Recursive use of lazy-seq produces a concatenation of cons objects with a
final LazySeq holding the unrealized rest of the infinite sequence.
You need to be careful when nesting recursive lazy sequence generators. This happens
quite often as part of designing typical algorithms for sequential processing. The
problem is that the nesting could end up consuming the stack even when you don’t
expect it. The following lazy-bomb function illustrates the problem:
(defn lazy-bomb [[x & xs]]
(letfn [(step [[y & ys]]
(lazy-seq
(when y
(cons y (step ys)))))] ; ❶
(lazy-seq
(when x
(cons x (lazy-bomb (step xs))))))) ; ❷
(last (lazy-bomb (range 10000))) ; ❸
;; StackOverflowError
❶ lazy-bomb contains a step function which contains another recursion using the typical lazy-
seq pattern. The step doesn’t do anything apart from destructuring and rebuilding the input with an
intermediate recursive call.
❷ The main body of lazy-bomb is a similar recursion which takes each item from the input and conses it
into a call to the inner step function.
❸ lazy-bomb generates stack overflow for modestly large inputs of a few thousands items.
The step function in lazy-bomb follows the standard lazy-seq pattern but produces a
stack overflow error when we would expect a heap-consuming recursion. The
problem is in the interleaving of the outer recursion of lazy-bomb and the inner
recursion of step. The structure of the algorithm is such that lazy-bomb always returns
an unrealized sequence as the target for the first cons, as illustrated in the following
diagram.
Figure 9.5. Nested use of lazy-seq can produce a sequence with intermediate unrealized
steps.
In order to satisfy requests for more items, the unrealized lazy sequence appearing at
the head needs to be traversed to get to the range generator. The farther away the
requested item, the longer the traversal to get to the generator. The traversal happens
on the stack because it’s part of evaluating the "misplaced" lazy-seq.
Unfortunately, nested lazy-seq recursions can be hidden away behind other
innocent-looking functions, making them difficult to see. If you look carefully at the sieve function
presented before, it contains such accidental nesting disguised as a remove call. We
can rework the sieve function to inline the remove call and make that explicit:
(defn sieve [n]
(letfn [(remove-step [x [y & ys]] ; ❶
(lazy-seq
(when y
(if (zero? (rem y x)) ; ❷
(remove-step x ys)
(cons y (remove-step x ys))))))
(sieve-step [[x & xs]] ; ❸
(lazy-seq
(cons x (sieve-step (remove-step x xs)))))]
(take n (sieve-step (nnext (range))))))
(sieve 10) ; ❹
;; (2 3 5 7 11 13 17 19 23 29)
(sieve 10000) ; ❺
;; StackOverflowError
❶ The previous remove call has been replaced with a remove-step local function, which is just the
implementation of remove with a fixed predicate.
❷ The divisor-of? logic appears inlined in the if condition, which is the general rule for removing items: if
the next number is a divisor of the current prime, skip it and move to the next.
❸ The recursive call to sieve has also been extracted into a new sieve-step. The relationship between the
outer sieve-step and inner remove-step is now explicit.
❹ We can see that this is generating prime numbers like before.
❺ For large numbers, this sieve implementation goes into stack overflow.
The sieve function described so far suffers from the lazy-seq nesting problem,
producing a stack overflow for relatively small numbers. There are several alternatives
to consider, including reformulating the algorithm to be tail recursive. In doing so, we
have an opportunity to look at the accumulated list of prime numbers so far and use
that knowledge to reduce the searching space for the next prime 162:
• We could search for odd numbers only, as no even number could ever be a prime.
• We can concentrate on just the last found prime number onward.
• We can check for prime factors up to the square root of the prime candidate.
The following sieve generates a vector using the suggestions above:
162
The problem of efficiently generating prime numbers is vast and fascinating: if you want to know more, this paper about
the "Genuine Sieve of Eratosthenes" is worth reading: www.cs.hmc.edu/~oneill/papers/Sieve-JFP.pdf
❶ sieve starts with a few function definitions for internal use before entering the main loop-
recur recursion.
❷ odds-from starts a range of odd numbers starting from the first available after "n".
❸ divisor? returns a predicate to check if a number is a divisor of "p".
❹ cross-upto takes a list of prime numbers and returns them up to the first one that, when squared, goes
beyond the given candidate prime "n". The name "cross" evokes the similar operation in the Sieve of Eratosthenes.
❺ The main loop starts by setting the counter and initial vector of primes (which always starts from 2).
❻ The next prime is the first after we dropped candidates using a predicate function to cross the relevant
divisors.
❼ Calling sieve to see the 10000th prime number does not result in a stack overflow.
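The code listing for this tail-recursive sieve is missing from the text above; the following sketch is a reconstruction guided by the callouts (the helper names come from the callouts, but the exact bodies are assumptions, not the book's code):

```clojure
(defn sieve [n]
  (letfn [(odds-from [m]                      ; ❷ odd numbers after "m"
            (iterate #(+ % 2) (if (odd? m) (+ m 2) (inc m))))
          (divisor? [p]                       ; ❸ is the argument a divisor of "p"?
            #(zero? (rem p %)))
          (cross-upto [primes m]              ; ❹ primes whose square doesn't exceed "m"
            (take-while #(<= (* % %) m) primes))]
    (loop [i 1 primes [2]]                    ; ❺ counter and initial vector of primes
      (if (= i n)
        primes
        (let [next-p (first
                       (drop-while            ; ❻ drop candidates with a relevant divisor
                         (fn [c] (some (divisor? c) (cross-upto primes c)))
                         (odds-from (peek primes))))]
          (recur (inc i) (conj primes next-p)))))))

(peek (sieve 10000))                          ; ❼ no stack overflow
;; 104729
```

loop/recur keeps the recursion in constant stack space; the price is that the result is an eagerly built vector rather than a lazy sequence.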
Although the algorithm presented above does not consume the stack, it is far from
efficient. Asking for a million primes takes a considerable amount of time, and the
result is not lazy and consumes linear memory. Improving on such an algorithm is possible
but beyond the goals of this book.
See also:
• seq produces a lazy sequence out of a collection supporting the sequential interface.
• concat creates a lazy sequence out of the concatenation of 2 or more collections.
Performance considerations and implementation details
The clearing of local bindings right after their last use is called "locals clearing" and
happens both inside the LazySeq object and the generated fn (which is an interesting
case of ^{:once true} metadata).
The locals-clearing feature of lazy-seq is subject to a sporadic NullPointerException
when an error occurs during evaluation (a known bug, see
dev.clojure.org/jira/browse/CLJ-2069):
(defn squares [x] ; ❶
(cons (* x x) (lazy-seq (squares (* x x)))))
(def sq2 (squares 2))
(take 5 sq2) ; ❷
;; (4 16 256 65536 4294967296)
(take 6 sq2) ; ❸
;; ArithmeticException integer overflow
(take 6 sq2) ; ❹
;; NullPointerException
The example above shows that locals clearing removed references to local bindings
before the lazy-seq step was able to cache the result. Additional requests for the same
item result in a NullPointerException, because the LazySeq object was unable to cache
the result, instead of reproducing the previous ArithmeticException.
9.3.2 tree-seq
function since 1.0
(tree-seq [branch? children root])
(pretty-print ; ❷
(tree-seq vector? identity [[1 2 [3 [[4 5] [] 6]]]]))
;; [[1 2 [3 [[4 5] [] 6]]]]
;; [1 2 [3 [[4 5] [] 6]]]
;; 1
;; 2
;; [3 [[4 5] [] 6]]
;; 3
;; [[4 5] [] 6]
;; [4 5]
;; 4
;; 5
;; []
;; 6
❶ pretty-print is a helper function to format the result of tree-seq to read on multiple lines.
❷ After instructing tree-seq on how to recognize a branch, it returns the lazy sequence of visited nodes
in depth-first traversal order.
tree-seq needs three pieces of information to perform the traversal:
1. How to distinguish a branch node: when a branch is found, tree-seq iterates its
content potentially following down into other branches.
2. How to iterate the content of a branch in case it’s not sequential.
3. How to pre-process nodes before moving further.
tree-seq is the lazy equivalent of clojure.walk. By producing a lazy depth-first
traversal, it can process large data structures that don’t fit into memory (taking care
not to hold on to the head of the sequence).
CONTRACT
Input
• "branch?" is a predicate function returning logical true or false. It’s invoked on
each node to understand if it’s a branch or not.
• "children" is a function of one argument. It’s invoked on a branch to obtain its
sequential view.
• "root" is the root object from which tree-seq starts the traversal. It can be nil or an
empty collection.
Notable exceptions
• NullPointerException when either "branch?" or "children" are nil.
Output
• tree-seq returns the lazy sequence of nodes visited during a depth-first traversal
starting at "root".
Examples
tree-seq is useful to traverse deeply nested data structures to process interesting
nodes. In the following example, we can see how to collect all positive values from a
nested vector:
(defn collect [pred? branch?] ; ❶
(fn [children]
(filter
(fn [node]
(or (branch? node) (pred? node)))
children)))
❶ collect is a function of 2 predicates, "pred?" and "branch?". "pred?" is used to process nodes that
are not branches (in our example, everything that is not a vector). "branch?" is used to understand
which node is a branch. collect returns a function of a collection of children nodes. This function
decides which nodes should belong to the final result.
❷ collect-if prepares the call to tree-seq. It defines the meaning of what is a "branch?" and how to
process "children".
❸ Note that we need to remove branches from the results if we only care about terminal nodes.
❹ We can see that the traversal returns positive nodes in depth-first order.
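The collect-if function and the calls described by the last three callouts are not shown above; here is a hypothetical reconstruction (the input vector and the wiring of collect-if are assumptions consistent with the callouts):

```clojure
(defn collect [pred? branch?]            ; collect as defined above
  (fn [children]
    (filter (fn [node] (or (branch? node) (pred? node))) children)))

;; collect-if wires collect into tree-seq: branches are recognized by
;; branch?, and the children of each branch are filtered down to
;; branches plus the nodes satisfying pred?.
(defn collect-if [pred? branch? root]
  (tree-seq branch? (collect pred? branch?) root))

;; collect every positive number, then drop the branch nodes themselves:
(remove vector? (collect-if pos? vector? [[1 -2 [3 [[-4 5] [] 6]]]]))
;; (1 3 5 6)
```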
(import '[java.io File])
(take 5
  (tree-seq
    (memfn ^File isDirectory) ; ❶
    (comp seq (memfn ^File listFiles)) ; ❷
    (File. "/"))) ; ❸
❶ tree-seq invokes the File/isDirectory method for each file. The presence of a directory
induces tree-seq to descend into its content.
❷ File/listFiles is used by tree-seq for each File object representing a directory. When the item is
a directory, listFiles produces an array of file objects (nil otherwise). seq transforms the array into
sequential content.
❸ The root object is a file object representing the beginning of the iteration.
❹ Note that the output presented here might be different on other systems.
In the following example, we want to process a nested document which contains a mix
of vectors and hash-maps. Such data structure is typically the result of parsing JSON or
similar exchange formats:
(def document ; ❶
{:tag :balance
:meta {:class "bold"}
:node
[{:tag :accountId
:meta nil
:node [3764882]}
{:tag :lastAccess
:meta nil
:node ["2011/01/01"]}
{:tag :currentBalance
:meta {:class "red"}
:node [{:tag :checking
:meta nil
:node [90.11]}]}]})
(def branch? ; ❷
(complement (some-fn string? number?)))
(def document-seq ; ❸
(tree-seq
branch?
:node
document))
❶ The document implements branching through maps and vectors. If a node is a map type and contains
a :node key, then the children are available as the value at that key. Terminal nodes are either strings
or numbers. The document seems to follow this convention top to bottom.
❷ The branch? predicate works by negating the type of a terminal node with complement. It seems
more straightforward to mention what a branch is not than what a branch actually is.
❸ document-seq stores the lazy evaluation of the document sequence in a var. The "children" is simply
the :node keyword.
❹ As seen before, we remove branch nodes from the final result to concentrate on simple values.
❺ Other kinds of filtering are also possible, for example showing all the :meta values.
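The calls the last two callouts refer to are not reproduced above; a sketch consistent with them (reusing the document, branch? and document-seq definitions) might look like the following:

```clojure
;; document, branch? and document-seq as defined above:
(def document
  {:tag :balance :meta {:class "bold"}
   :node [{:tag :accountId :meta nil :node [3764882]}
          {:tag :lastAccess :meta nil :node ["2011/01/01"]}
          {:tag :currentBalance :meta {:class "red"}
           :node [{:tag :checking :meta nil :node [90.11]}]}]})
(def branch? (complement (some-fn string? number?)))
(def document-seq (tree-seq branch? :node document))

(remove branch? document-seq)             ; ❹ keep terminal values only
;; (3764882 "2011/01/01" 90.11)

(map :meta (filter branch? document-seq)) ; ❺ all the :meta values instead
;; ({:class "bold"} nil nil {:class "red"} nil)
```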
An eager tree-seq
The tree-seq implementation in the standard library consists of a lazy recursive walk to generate only
as much output as requested. For those scenarios where the output is fully consumed, we can achieve
better performance by giving away laziness as follows:
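The eager-tree-seq listing is missing from the text above. A possible shape, assuming (as the performance callouts suggest) that the eager version accumulates into a vector, is the following sketch; it is a reconstruction, not the book's exact code:

```clojure
;; An explicit stack drives the depth-first visit; results accumulate
;; eagerly into a vector. Children are pushed in reverse so the leftmost
;; child is processed first, matching tree-seq's order. (Caveat: a nil
;; or false node would terminate the loop early, a simplification
;; acceptable for a sketch.)
(defn eager-tree-seq [branch? children root]
  (loop [stack [root] out []]
    (if-let [node (peek stack)]
      (recur (into (pop stack)
                   (when (branch? node) (reverse (children node))))
             (conj out node))
      out)))

(eager-tree-seq vector? seq [[1 2] 3])
;; [[[1 2] 3] [1 2] 1 2 3]
```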
The new eager-tree-seq function gives away laziness to gain some speed. Please check the
performance section to see a full comparison between the lazy and the eager version of tree-seq.
See also:
• clojure.walk/walk performs a depth-first traversal of arbitrarily nested data
structures. The "branch?" predicate for clojure.walk/walk is implicitly true for
all the most common collection types. clojure.walk/walk maintains the original
nesting of the input, instead of creating a flattened sequence of nodes.
• A zipper is another option to traverse a deeply nested data structure. zippers are
the most flexible option, as they separate the traversal logic (which is fixed to
depth-first for both tree-seq and clojure.walk/walk) from the traversal state.
Performance considerations and implementation details
(def document
(parse "https://nvd.nist.gov/feeds/xml/cve/misc/nvd-rss.xml"))
❶ The first benchmark uses the standard tree-seq. We need dorun to fully realize the sequence.
❷ This is the same benchmark using the eager-tree-seq seen before. This function produces a vector
instead of a sequence and does not require dorun.
The eager version is roughly 5 times faster. This doesn’t mean that lazy-seq is
necessarily slow. There are many factors influencing the choice between a slower but
lazy function and a faster but memory-consuming eager version. If you are planning to
access just the initial part of the output (potentially never reaching the end),
then lazy-seq is still the best choice. If your application requires maximum speed and
the input tree is reasonably sized, prefer an eager version. An eager application of the
file system scan seen previously could easily consume the entire memory, for instance.
9.3.3 file-seq
function since 1.0
(file-seq [dir])
;; ("/etc"
;; "/etc/afpovertcp.cfg"
;; "/etc/aliases"
;; "/etc/aliases.db"
;; "/etc/apache2"
;; "/etc/apache2/extra")
❶ "/usr/share/man" is a typical Unix location for command manuals. On this system there are 16727 files
and folders, as file-seq returns both.
❷ "/etc" is another standard folder on Unix systems. file-seq returns a sequence
of java.io.File objects from which we can extract the full path as a string with getPath. “memfn” is
used here to create a Java-interop anonymous function.
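The calls producing the output above are not shown in the text; a hypothetical reconstruction consistent with the "/etc" listing could be:

```clojure
(require '[clojure.java.io :as io])

;; file-seq returns java.io.File objects; getPath extracts the path
;; string of each. The first item is always the root itself.
(->> (file-seq (io/file "/etc"))
     (map (memfn ^java.io.File getPath))
     (take 6))
```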
file-seq performs a depth-first file traversal: if the next file is a folder, file-seq
descends into the folder immediately, before traversing the other files at the same
level. The example above shows that file-seq descends into "apache2" as soon as it is
found ("apache2/extra" appears right after).
CONTRACT
Input
• "dir" is a mandatory argument of type java.io.File.
Notable exceptions
• ClassCastException if "dir" is not a java.io.File.
• NullPointerException if "dir" is nil.
Output
• returns: a depth-first traversal of all files and folders under "dir" as a
lazy sequence.
Examples
Let’s start by illustrating the behavior of file-seq in the presence of a few special
java.io.File objects. Some of them produce a valid (abstract) path, which results in
apparently empty folders:
(def work-dir (file-seq (java.io.File. "."))) ; ❶
(def abstract-path (file-seq (java.io.File. ""))) ; ❷
(def non-existent (file-seq (java.io.File. "NONE"))) ; ❸
❶ The "." is the standard representation for the current folder, which is the folder the JVM process was
started from.
❷ The empty string is accepted as a valid path, but it is referred to as an "abstract path" as it is not a
physical path. It brings potential inconsistencies and should be avoided. We can see that it prints the
current folder, but that’s the current folder name plus the empty abstract-path folder, which is still a
non-existent folder.
❸ A malformed path (a random string "NONE") shows a non-existent folder appended to the working
directory. It has the same effect as the empty abstract path.
The following example shows a simplified "grep" utility. "grep" is a common Unix
command to search for a string inside files. Our "grep" offers the possibility to
restrict the search to specific file extensions, starting from the working folder:
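The grep listing itself is missing from the text above; the sketch below is a reconstruction under stated assumptions (the function name, signature and helpers are invented; the book's version starts implicitly from the working folder, while this one takes the starting directory as an explicit argument for easier testing):

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn grep
  "Return the paths of the files under dir whose extension is in exts
   and whose content contains the string s."
  [dir s exts]
  (let [ext? (set exts)]
    (->> (file-seq (io/file dir))
         (filter (memfn ^java.io.File isFile))
         (filter #(ext? (last (str/split (.getName ^java.io.File %) #"\."))))
         (filter #(str/includes? (slurp %) s))
         (map (memfn ^java.io.File getPath)))))

;; e.g. (grep "." "lazy-seq" #{"clj" "txt"})
```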
See also:
• clojure.java.io/file is the recommended way to create file objects in Clojure.
• tree-seq is the generic mechanism used by file-seq to perform the folder
traversal. Have a look at how file-seq is implemented if you need to perform the
file traversal in a specific way.
Performance considerations and implementation details
9.3.4 xml-seq
function since 1.0
(xml-seq [root])
(def balance
"<balance>
<accountId>3764882</accountId>
<currentBalance>80.12389</currentBalance>
<contract>
<contractId>77488</contractId>
<currentBalance>1921.89</currentBalance>
</contract>
</balance>")
❶ For illustration purposes, we are going to use a small XML fragment encoded directly as a
string. xml/parse requires that we convert the string into an input-stream before parsing.
❷ The output sequence produced by xml-seq contains branch nodes, those with a :content key which
refers to other nodes. Here we are just interested in the terminal nodes of the XML structure, those
with a :content key that contains strings.
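The parsing and filtering code the callouts describe is not reproduced above; a hypothetical reconstruction (the var name parsed is an assumption, and the XML string is the same balance document with whitespace removed for compactness):

```clojure
(require '[clojure.xml :as xml])
(import '[java.io ByteArrayInputStream])

;; balance as defined above, with the whitespace removed:
(def balance
  "<balance><accountId>3764882</accountId><currentBalance>80.12389</currentBalance><contract><contractId>77488</contractId><currentBalance>1921.89</currentBalance></contract></balance>")

(def parsed (xml/parse (ByteArrayInputStream. (.getBytes balance)))) ; ❶

(->> (xml-seq parsed)                                                ; ❷
     (filter #(string? (first (:content %))))
     (map (juxt :tag (comp first :content))))
;; ([:accountId "3764882"] [:currentBalance "80.12389"]
;;  [:contractId "77488"] [:currentBalance "1921.89"])
```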
xml-seq was added to the standard library at a time when XML was the lingua
franca for inter-process communication. Nowadays other formats are more commonly
used, but xml-seq remains an effective approach to basic XML processing.
CONTRACT
Input
• "root" should be consistent with the format produced by clojure.xml/parse.
(def feeds
[[:guardian "https://www.theguardian.com/world/rss"]
[:wash-post "http://feeds.washingtonpost.com/rss/rss_blogpost"]
[:nytimes "https://rss.nytimes.com/services/xml/rss/nyt/World.xml"]
[:wsj "https://feeds.a.dj.com/rss/RSSWorldNews.xml"]
[:reuters "http://feeds.reuters.com/reuters/UKTopNews"]])
❶ The first thing to do is to keep nodes with a content attribute of type string.
❷ There are many types of terminal nodes, including those with only metadata, links and so on. The
nodes with a title are selected next.
❸ It’s time to match against the given regular expression and only keep those nodes matching the query.
❹ Each feed generates a potentially expensive http call. pmap is an easy choice to achieve better
performance by processing the feeds in parallel.
The output from the previous example could be different based on the type of news in
the feeds. At the time the example was added to the book, it found only one news item
related to "climate".
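The search code described by the callouts is not shown above. The following sketch is a reconstruction under assumptions (the function names titles-matching and search-feeds are invented); the matching logic is kept separate from the fetching so it can run without the network:

```clojure
(require '[clojure.xml :as xml])
(import '[java.io ByteArrayInputStream])

(defn titles-matching [re parsed]
  (->> (xml-seq parsed)
       (filter #(= :title (:tag %)))     ; ❷ keep the nodes with a title
       (map (comp first :content))       ; ❶ extract their string content
       (filter string?)
       (filter #(re-find re %))))        ; ❸ match the query

(defn search-feeds [re feeds]            ; ❹ pmap parallelizes the http calls
  (->> feeds
       (pmap (fn [[k url]] [k (titles-matching re (xml/parse url))]))
       (into {})))
```

(search-feeds #"climate" feeds) would then fetch and scan the feeds in parallel.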
See also:
• clojure.xml/parse is a mandatory step before calling xml-seq. You could parse the
xml using other tools and still be able to call xml-seq by making sure the output
from the parser is compliant with the required format.
• tree-seq implements a generic depth-first traversal for nested data structures.
xml-seq is based on tree-seq, like other functions in this chapter.
9.3.5 re-seq
function since 1.0
(re-seq [re s])
re-seq creates a lazy sequence from the matching instances of a regular expression in a
string:
(re-seq #"\d+" "This sentence has 2 numbers and 6 words.") ; ❶
;; ("2" "6")
❶ re-seq creates a sequence of numbers from this sentence. The numbers are still in their original
format as strings.
CONTRACT
Input
• "re" is an object of type java.util.regex.Pattern. Clojure has a reader
literal #"regex" to create a Pattern instance, or you can use the extended form
with Java interop.
• "s" is a java.lang.CharSequence. It is normally a string, but it could be another
object implementing the same interface. For example:
(def sb (doto (StringBuilder.)
(.append "23")
(.append "aa 42")))
(re-seq #"\d+" sb) ; ❶
;; ("23" "42")
❶ re-seq accepts any CharSequence implementation: here the matches are extracted from the content of a StringBuilder.
Notable exceptions
• NullPointerException if either "re" or "s" are nil.
Output
• returns: a sequence of the matching instances of "re" in the string "s". This is a
sequence of strings, or a sequence of vectors of strings when the pattern contains
matching groups. nil is returned when there are no matches of "re" in "s".
Examples
Strings are inherently sequential in Clojure, producing a sequence
of java.lang.Character objects forming the string. re-seq can be used to produce a
sequential list of strings instead of characters:
(seq "hello") ; ❶
;; (\h \e \l \l \o)
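For instance (an illustrative example, not from the original listing), a pattern matching any single character turns each character into a one-character string:

```clojure
(re-seq #"." "hello")
;; ("h" "e" "l" "l" "o")
```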
re-seq returns matching instances with individual matching groups if they are present.
In the following example we match and destructure a list of repeating names and phone
numbers:
(def signed-up ; ❶
"Jack 221 610-5007 (call after 9pm),
Anna 221 433-4185,
Charlie 661 471-3948,
Hugo 661 653-4480 (busy on Sun),
Jane 661 773-8656,
Ron 555 515-0158")
❶ This is a sample string of a potentially longer list of people who signed up to teach programming to a
group of kids. The text contains the name, phone and an optional note regarding availability.
❷ We can destructure the text using a regular expression, because names and phone numbers appear
with the same pattern throughout the text. re-seq is given the pattern to search for and the string. Note
the round parentheses in the pattern: we want to be able to isolate specific portions of the matching
substring (this is part of the standard regex syntax).
❸ re-seq returns a vector containing the matching string and any groups within. We can use the list of
vectors to group the information we need.
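The call the callouts refer to is not shown above; a hypothetical reconstruction (the exact pattern is an assumption, with one capturing group for the name and one for the phone number):

```clojure
;; signed-up as defined above:
(def signed-up
  "Jack 221 610-5007 (call after 9pm),
  Anna 221 433-4185,
  Charlie 661 471-3948,
  Hugo 661 653-4480 (busy on Sun),
  Jane 661 773-8656,
  Ron 555 515-0158")

(re-seq #"(\w+) (\d{3} \d{3}-\d{4})" signed-up)
;; (["Jack 221 610-5007" "Jack" "221 610-5007"] ...)

;; the vectors can then be grouped, for example into a name -> phone map:
(into {} (map (comp vec rest)) (re-seq #"(\w+) (\d{3} \d{3}-\d{4})" signed-up))
```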
In the next example we are taking advantage of re-seq laziness on some large text.
The text contains a fairly long list of 1 million digits of Pi 163:
(def pi-digits
(slurp "https://tinyurl.com/pi-digits")) ; ❶
(def pi-seq ; ❷
(sequence
(comp
cat ; ❸
(map int) ; ❹
(map #(mod % 48))) ; ❺
(re-seq #"\d{10}" pi-digits)))
(take 20 pi-seq)
;; (1 4 1 5 9 2 6 5 3 5 8 9 7 9 3 2 3 8 4 6)
❶ pi-digits contains a text version of the "EBook of Pi", a book containing Pi digits up to 1 million
places. The text contains predominantly digits, but it also includes an introduction and a specific
space-separated format for the digits.
❷ We can produce the sequence of matching Pi digits from the book by matching them by groups of 10,
which is how they are formatted in the book.
❸ The cat transducer takes each 10-digit string and concatenates its characters into a single stream of digit characters.
❹ The java.lang.Character instance coerced into an int produces the index at which it is stored in the
ASCII table.
163
Supercomputers are now enabling many more digits to be calculated, see
en.wikipedia.org/wiki/Pi#Modern_quest_for_more_digits
❺ Since "0" is 48 in the ASCII table, the modulo operation has the effect of translating the ASCII entry
index into the actual number.
Thanks to re-seq laziness, we don’t need to match the entire book to retrieve the first
20 digits, saving some computation in case we don’t want to consume all the
digits. However, note that we are still forced to load the entire book in memory
before starting the computation. In the next extended example we are going to see how
to fix that.
The sequence generation issues just enough HTTP requests to satisfy the number of digits to print,
preventing the entire book from residing in memory all at once.
See also:
• “re-pattern, re-matcher, re-groups, re-seq, re-matches, re-find” are other functions
dedicated to regular expression matching that don’t create a sequence. Use re-seq
if you are interested in a sequential view of the matching pattern in a string.
Performance considerations and implementation details
9.3.6 line-seq
function since 1.0
(line-seq [rdr])
line-seq creates a sequence of lines from a stream of characters. A new line item in
the sequence is created for each line termination marker found in the input:
(require '[clojure.java.io :refer [reader]]); ❶
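A minimal, self-contained illustration (this wraps an in-memory string instead of a file, an arrangement chosen here for brevity):

```clojure
(require '[clojure.java.io :refer [reader]])
(import '[java.io StringReader])

;; reader wraps the StringReader into the java.io.BufferedReader that
;; line-seq expects; doall realizes the lines before with-open closes
;; the underlying reader.
(with-open [r (reader (StringReader. "first\nsecond\nthird"))]
  (doall (line-seq r)))
;; ("first" "second" "third")
```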
CONTRACT
Input
• "rdr" stands for "reader". line-seq expects an instance
of java.io.BufferedReader, a commonly used Java type to process newline
separated files.
Notable exceptions
• NullPointerException if "rdr" is nil.
• ClassCastException if "rdr" is not a java.io.BufferedReader instance.
Output
• returns: the sequence of lines found by reading the input.
Examples
line-seq is useful to process large textual files without loading them completely in
memory. This can happen in two ways:
1. The lazy sequence is never fully realized, stopping processing after a few items.
2. The lazy sequence is fully realized, but processing accesses each item only once to
retain some partial information (for example a count or other statistics). In doing so, the
head of the sequence is not retained and each item can be garbage collected as we
process the rest of the input.
In the next example we access the top 1 million Alexa entries (Alexa is a
company providing web analytics), a relatively large (15 MB) archive
containing the most popular websites by traffic. The archive is compressed, but we can
uncompress it while processing the content line by line. We want to know what is the
top-ranking ".me" domain:
(import '[java.net URL])
(import '[java.util.zip ZipInputStream])
(require '[clojure.java.io :as io])
(defn zip-reader [url] ; ❶
  (-> (URL. url)
      .openStream
      ZipInputStream.
      (doto .getNextEntry)
      io/reader))
(first-of-domain "me") ; ❹
;; "246,line.me"
❶ zip-reader creates a java.io.BufferedReader starting from a URL. As you can see, Clojure
does a great job creating straightforward and readable code out of the many objects required to open the
reader. One important detail is that while opening a zip archive you have to position the input stream
at the beginning of the next entry with getNextEntry. In our case it’s relatively easy since the archive
contains a single entry.
❷ domain takes a line and extracts the domain of the website it contains.
❸ first-of-domain takes an extension as parameter. It then accesses the sequence in the context of
a with-open block. some consumes the sequence until the predicate first returns a match,
which is the corresponding line containing the domain.
❹ We can see that the highest traffic ".me" website is "https://line.me/", a website for sending free SMS
messages. This result could be different at some other time.
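The domain and first-of-domain helpers described by the callouts are missing from the text above. The sketch below is a reconstruction under assumptions: the book's first-of-domain opens the Alexa archive itself, while here the reader is a parameter so the sketch runs on any line source:

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])
(import '[java.io StringReader])

(defn domain [line]                ; ❷ the extension after the last dot
  (last (str/split line #"\.")))

(defn first-of-domain [ext rdr]    ; ❸ first line with that extension
  (some #(when (= ext (domain %)) %) (line-seq rdr)))

(with-open [r (io/reader (StringReader. "1,google.com\n246,line.me"))]
  (first-of-domain "me" r))
;; "246,line.me"
```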
The search for the first matching domain executes relatively fast, which signals that the
archive is not completely downloaded. The next example extends the previous one by
ranking the list by the most frequent domain extension:
(defn top-10-domains-by-traffic []
  (with-open [r (zip-reader alexa)]
    (->> (line-seq r)
         (map domain)
         frequencies ; ❶
         (sort-by last >)
         (take 10))))

(top-10-domains-by-traffic)
;; ["com" 487682] ["org" 50189] ["ru" 43619]
;; ["net" 42955] ["de" 36887] ["br" 20192]
;; ["uk" 18828] ["ir" 16915] ["pl" 16730] ["it" 11708]
❶ We use the same zip-reader function to operate on the sequence in the context of a with-open call.
This ensures the resource is correctly closed at the end of the computation. line-seq starts
generating the lazy sequence from which we extract the domain extension, and then we pass the
entire sequence to frequencies.
The second part of the example uses frequencies, an eager function that scans the entire
input to populate its counters. In doing so, it doesn’t hold onto the head of the sequence.
The use of line-seq allows very large files to be processed without loading them entirely
into memory, assuming the collected information (keys and counts in the case
of frequencies) fits in memory. Anything else that is not retained in the final results can
be safely garbage collected while processing is still ongoing.
See also:
• slurp downloads the content of a URL or local file into memory as a single string.
It’s usually a straightforward choice for configuration files or other small files.
• split-lines can be used to split a large string into lines, producing a vector. It is not
lazy, so it should be used when the memory footprint it generates is predictable.
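For instance, a quick sketch of the eager behavior:

```clojure
(require '[clojure.string :as str])

;; split-lines is eager: the whole vector of lines is produced at once.
(str/split-lines "first\nsecond\nthird")
;; ["first" "second" "third"]
```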
Performance considerations and implementation details
164
JDBC, the Java Database Connectivity framework, is one of the well-known Java features. For an overview of the
framework and how to work with it, please have a look at the Java
Tutorial: docs.oracle.com/javase/tutorial/jdbc/basics/index.html
❶ db-driver creates a dynamic instance of a ResultSet implementation with reify. It contains the
implementation of the functions used by resultset-seq and a small mechanism to produce
semi-realistic results. Note that the next implementation always returns true.
❷ After creating an instance of the ResultSet stub, we can invoke resultset-seq directly on it. We
always need to remember to take a finite amount of elements from the infinite sequence of results.
CONTRACT
Input
• "rs" must be an instance of java.sql.ResultSet and is a required argument.
Notable exceptions
• NullPointerException if "rs" is nil.
• ClassCastException if "rs" is not a java.sql.ResultSet instance.
Output
• returns: a sequence of (now deprecated) Clojure struct types.
Each struct contains the keyword rendering of each column name in the
database, associated with the value of that record at that key.
NOTE Although structs are now deprecated in favor of defrecords, you can access them the same as
normal hash-maps. Their use in resultset-seq does not require any other specific
knowledge.
Examples
The following example shows a basic JDBC interaction with a database and the
way resultset-seq wraps the results. The example requires the SQLite driver for Java
(available from github.com/xerial/sqlite-jdbc) in the classpath of the running process.
We are going to use SQLite configured as an in-memory database:
(import '[java.sql DriverManager ResultSet])
NOTE When the sequence creation happens inside a try-finally block that closes the connection,
make sure that all necessary operations happen inside the block. If a portion of the unrealized
sequence escapes the block, you might incur a "connection already closed" exception. One
solution to the problem (though it removes the benefits of laziness) is to use doall as shown
in the example. Another laziness-friendly option is to pass the processing function into the body
of the try-finally block.
Processing the results as a lazy sequence is especially useful when:
• The database driver streams the results from the server (instead of bulk loading
them into memory).
• The results are too large to load into memory at once, but we are able to process
them incrementally.
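The escape problem described in the note above is not specific to JDBC; the same effect can be sketched with a plain reader (leaky and safe are names invented for this sketch):

```clojure
(import '[java.io BufferedReader StringReader])

(defn leaky []
  (with-open [r (BufferedReader. (StringReader. "a\nb\nc"))]
    (line-seq r))) ; the unrealized sequence escapes the block

(defn safe []
  (with-open [r (BufferedReader. (StringReader. "a\nb\nc"))]
    (doall (line-seq r)))) ; fully realized before the reader closes

(safe)
;; ("a" "b" "c")
;; Realizing the rest of (leaky) instead throws IOException: Stream closed.
```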
The SQLite driver streams results lazily by default (the same is not true for other
popular drivers like MySQL). If you are dealing with large results and want to process
them lazily, you need to make sure the driver supports streaming.
See also:
• Ad-hoc solutions to the problem of iterating a JDBC ResultSet can be created
with lazy-seq and cons, the building blocks for Clojure lazy sequences
(resultset-seq included). For any other standard iteration, consider
using resultset-seq.
(iterator-seq [iter])
(enumeration-seq [e])

(def an-iterator (.iterator [1 2 3])) ; ❶

(def an-enumeration ; ❷
  (java.util.Collections/enumeration [1 2 3]))

(iterator-seq an-iterator) ; ❸
;; (1 2 3)

(enumeration-seq an-enumeration) ; ❹
;; (1 2 3)
❶ The iterator method (from the java.lang.Iterable interface) is present in most Clojure and Java
collections. We can see here how to invoke it on a Clojure PersistentVector.
❷ Enumeration objects are more difficult to find, as the interface was gradually abandoned in favor of
iterators. There are still plenty of examples kept in the JDK for backward compatibility. The
method java.util.Collections::enumeration can be used to extract an enumeration from any
collection supporting an iterator.
❸ iterator-seq is used here to generate a sequence out of the iterator object.
❹ You can see that the output of enumeration-seq is the same as iterator-seq.
CONTRACT
Input
• "iter" must implement the java.util.Iterator interface.
• "e" must implement the java.util.Enumeration interface.
Notable exceptions
• NullPointerException when "iter" or "e" are nil.
Output
• returns: a sequence generated by iterating the Iterator or Enumeration object. It
returns nil in case of iterators or enumerations coming from empty collections.
Examples
The large majority of Java and Clojure collections support the Iterator interface.
Collections don’t implement the interface directly, but provide an iterator() method
to get a fresh Iterator object for every new iteration. While seq knows how to
produce a sequence using the iterator() method, iterator-seq remains for those
cases in which the Iterator is the only object available (for example as the return type
from another function call).
Java 8 introduced java.util.stream.Stream, a new interface to support a more
functional style of collection processing in Java. The Stream supports the iterator
interface, so we can use iterator-seq to generate a sequence:
(->> "Clojure is the best language"
(.splitAsStream #"\s+") ; ❶
.iterator ; ❷
iterator-seq) ; ❸
;; ("Clojure" "is" "the" "best" "language")
❶ splitAsStream is available on regular expression patterns and can be applied to a string. In this case
the regular expression returns any group of 1 or more non-space characters.
❷ The stream does not implement the Iterable interface, so calling seq on it would throw an
exception. We can instead call iterator explicitly to retrieve the iterator for this stream.
❸ iterator-seq knows how to translate an iterator into a sequence.
WARNING iterator-seq produces a cached sequence. After processing an item from the iterator
source, that item gets cached by the generated sequence, effectively creating an immutable
view of the iterator at that point in time. Contrary to Clojure design principles,
the Iterator interface even includes a remove method that allows clients to remove objects
from the source of the iterator! These changes, if any, are not visible from iterator-seq
after the output sequence has been generated. The reader should remember this behavior
when wrapping Java classes that reuse the same iterator instance.
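The caching behavior is easy to observe with a mutable Java collection (a small sketch):

```clojure
;; iterator-seq caches items: mutations to the source after realization
;; are not reflected in the already-generated sequence.
(def source (java.util.ArrayList. [1 2 3]))
(def cached (iterator-seq (.iterator source)))

(doall cached)  ; realize (and cache) every item
(.clear source) ; mutate the source collection

cached
;; (1 2 3)
```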
There are still a few objects in the Java standard library that offer an Enumeration view of their content.
©Manning Publications Co. To comment go to liveBook
❶ reducers/fold is used on top of a mutable, but concurrent, data structure. This is why combinef always
seeds the initial reduction with the reference to the map "m". There is also no real combining step, as
there are no chunks to concatenate.
❷ The last step converts the mutable hash-map (an implementation detail of how parallel-
distinct works) back into an immutable sequence using enumeration-seq on the keys of the map.
❸ After producing a long vector of repeating numbers, we can see that parallel-distinct returns
them without duplication.
See also:
• sequence generates sequences from objects implementing
the java.util.Iterable interface. When that is not available but there is another
way to produce a java.util.Iterator object, use iterator-seq instead.
• lazy-seq is the main mechanism used to create sequences out of iterable or
enumerable objects.
Performance considerations and implementation details
165
There is no ConcurrentHashSet in the Java standard library, but it’s possible to obtain a ConcurrentHashMap
backed KeySet that fulfills a similar role. See
docs.oracle.com/javase/8/docs/api/java/util/concurrent/ConcurrentHashMap.html#newKeySet-- for more details.
❶ dbg-coll creates a new list of the given size, wrapping each element with a console log printed
when the element is evaluated.
❷ We can extract an Iterator from a Clojure sequence with .iterator, as they all
implement Iterable. We can see that iterator-seq pulls 32+1 items out of the input sequence: 32
items is the size of the chunk that is evaluated when we call first, and 1 item is evaluated from the
next chunk to check if there is more input.
❸ enumeration-seq is fully lazy without any chunking.
NOTE The chunking behavior of iterator-seq is a by-product of recent work related
to enabling sequence with transducers, which is in turn driven by performance. In
general, the chunking behavior of sequences is a trade-off between performance and
full laziness.
(concat
([])
([x])
([x y])
([x y & zs]))
❶ concat is used here to concatenate several types of sequential collections. It produces a lazy-seq.
lazy-cat is a macro built on top of concat that wraps each input collection into lazy-
seq before passing it to concat. This additional layer of protection enables lazy
concatenation without evaluating the arguments. We can see the strong relationship
between lazy-cat and concat by macroexpanding a simple form:
(macroexpand '(lazy-cat [1 2 3] (range))) ; ❶
;; (concat (lazy-seq [1 2 3]) (lazy-seq (range)))
❶ The macroexpansion of lazy-cat shows the use of concat with argument wrapping into lazy-seq.
Laziness is the most interesting aspect of both concat and lazy-cat and we are going
to see how it can be used in the example section.
CONTRACT
Input
• With no arguments, concat returns the empty sequence ().
• A single argument "x" is accepted by concat. It needs to be of a sequential type (such
that (instance? clojure.lang.Seqable x) is true) or nil. With a single
argument concat behaves similarly to lazy-seq, producing a lazy sequence from
"x".
• "x", "y" and "zs" can be sequential collections or nil.
Notable exceptions
• IllegalArgumentException if any of the inputs is not sequential as
per the seq contract.
Output
• returns: the lazy sequence generated by concatenating the content of "x", "y" and
"zs" (if any) or empty sequence otherwise.
Examples
concat is useful to create a uniform view over different sources, each one producing
an independent collection or lazy sequence. Here’s for example
an identifier function that produces a unique object identifier including all
implemented classes and interfaces:
(defn identifier [x]
(let [classname #(.getName ^Class %) ; ❶
split #(.split ^String % "\\.")
typex (type x)]
(apply str
(interpose "-"
(concat
(split (classname typex)) ; ❷
(mapcat (comp split classname) (supers typex))))))) ; ❸
(identifier #"regex") ; ❹
;; "java-util-regex-Pattern-java-io-Serializable-java-lang-Object"
❶ The let block defines two helper functions: classname to get the name of a class as a string
and split to split a string at each "." dot position.
❷ One source of names is the package and class name.
❸ Another source of names is derived from processing supers of the same class. We concat those
together before interposing them with a dash "-" sign.
❹ You can try identifier with an object like a java.util.regex.Pattern or a Clojure vector (which
produces a much longer list).
When the list of source collections is only known at runtime, concat can be used
with apply to concatenate all arguments:
(def sold-icecreams ; ❶
[[:strawberry :banana :vanilla]
'(:vanilla :chocolate)
#{:hazelnut :pistachio}
[:vanilla :hazelnut]
[:peach :strawberry]])
(defn next-day-quantities [icecreams] ; ❷
  (->> (apply concat icecreams)
       frequencies
       (sort-by last >)))

(next-day-quantities sold-icecreams) ; ❸
;; ([:vanilla 3] [:strawberry 2] [:hazelnut 2]
;; [:banana 1] [:chocolate 1] [:pistachio 1] [:peach 1])
❶ In this example, we receive a list of today’s sold ice creams. The list is simplified in length and
structure, reporting only the group of flavors each ice cream contained.
❷ We want to be able to see all flavors together, so we can calculate how many ingredients we need to
stock for the next day. apply concat is a useful idiom to concatenate all the lists together.
❸ We can see which flavors are most requested and stock accordingly.
Perhaps the most interesting aspect of both concat and lazy-cat is laziness. Both of
them concatenate just enough of the input to satisfy the consumer request:
(defn trace [x] (println "evaluating" x) x) ; ❶
(def l1 (map trace [1 2 3])) ; ❷
(def l2 (map trace [4 5 6]))

(def l3 (concat l1 l2)) ; ❸

(first l3) ; ❹
;; evaluating 1
;; evaluating 2
;; evaluating 3
;; 1
❶ trace is a simple function that prints its argument before returning it.
❷ l1 and l2 are lazy sequences built with map.
❸ Nothing is printed when calling concat
❹ Access to the first element only realize enough of the concatenation to return the first element.
(time (first (concat [1 2 3] (vec (range 1e7))))) ; ❶
;; "Elapsed time: ... msecs"
;; 1

(time (first (lazy-cat [1 2 3] (vec (range 1e7))))) ; ❷
;; "Elapsed time: ... msecs"
;; 1
❶ concat is used on a small vector and a much larger one. We only want the first element, but we incur
in the cost of creating the large vector anyway.
❷ lazy-cat defers evaluation of arguments until the last possible moment. Since we only look at the
first element, the large vector is never materialized.
We could leverage laziness to produce "padding" for strings: we want to fill a string
with spaces to the right until it reaches a given width. The following example shows
how we could use concat to draw a rectangle on screen to enclose a given sentence:
(require '[clojure.string :as s])
❶ padder creates a padding function for a given width. After being created it can be used as a
transformation function for map. We use lazy-cat on the input string (which is sequential) and the
infinite repetition of the space character. Since the infinite sequence appears at the end, we can
take as much padding as we need without worrying about an upper bound.
❷ line is a function that creates a line made of dashes, ready for display.
❸ transduce composes the lines forming the drawing together. The horizontal header is the initial
argument for the reducing function, while the closing footer is created by the single-argument call to
the reducing function. We use completing to compose the finalization step with the reducing
function str.
❹ We can see how to use quote-sentence to draw a rectangle around a given sentence.
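The core of the padding idea in ❶ can be sketched in a few lines (a minimal version; the exact padder in the full example may differ):

```clojure
;; Lazily append an infinite run of spaces, then take a fixed width:
;; the infinite tail is never realized beyond what take requests.
(defn padder [width]
  (fn [s]
    (apply str (take width (lazy-cat s (repeat \space))))))

((padder 10) "abc")
;; "abc       "
```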
(defn get-batch [n] ; ❶
  (repeat n n))

(defn step
([n] (step n ()))
([n res]
(if (pos? n)
(recur (dec n) (concat res (get-batch n))) ; ❷
res)))
(step 4) ; ❸
;; (4 4 4 4 3 3 3 2 2 1)
(step 10000) ; ❹
;; StackOverflowError
❶ get-batch simulates some computation to retrieve a list of items. In a real scenario this could be a
database query.
❷ concat concatenates the last results in front of the current batch. The operation is recursive until we
reach the amount of desired step.
❸ Calling step with a small number results in a flat sequence.
❹ But large enough steps produce an unexpected StackOverflowError.
The StackOverflowError is surprising as we are using loop-recur, a construct that doesn’t consume
the stack. The problem is not the recursion, which happens in tail position, but the nested concat calls
that gradually build up on the stack.
At each iteration, concat produces a new result wrapped in a lazy-seq object. The chain of lazy-
seq segments grows to the point that traversing it needs too many stack frames 166. A quick solution
is to break the lazy-seq nesting by inverting the order of the concat arguments. Please note that this
also changes the order of the results:
(defn step
([n] (step n ()))
([n res]
(if (pos? n)
(recur (dec n) (concat (get-batch n) res)) ; ❶
res)))
(step 4) ; ❷
;; (1 2 2 3 3 3 4 4 4 4)
166
Stuart Sierra wrote an article about the same problem at stuartsierra.com/2015/04/26/clojure-donts-concat
The new version of the step function traverses sequences instead of “lazy-seq” closures to produce
results, consuming heap instead of stack. However, the results now have a different ordering,
which might not be acceptable in some situations.
See also:
• mapcat is the preferred choice when the concatenation is preceded by a
transformation. For example:
(apply concat (map rest [[1 2 3] [4 5 6]])) ; ❶
;; (2 3 5 6)
(def fibs ; ❶
  (lazy-cat [0 1] (map +' fibs (rest fibs))))
(take 10 fibs) ; ❷
;; (0 1 1 2 3 5 8 13 21 34)
(nth fibs 1000000) ; ❸
;; OutOfMemoryError
❶ Note how fibs is just about to be defined, but is used already in the definition itself. This is possible
because lazy-seq is a macro and does not evaluate its argument. concat would not work in this
case.
❷ We can see that fibs works as expected for small numbers.
❸ If we try to access the 1 millionth Fibonacci number we incur in a OutOfMemoryError (depending also
on JVM settings).
9.4 Lists
The term "list" is subject to overlapping definitions. In Clojure, a list is a concrete data
type (clojure.lang.PersistentList), in the same way vectors or maps are.
The list function is also the builder for the same data type. Lists are also sequences as
they implement the abstraction directly, but they are not technically sequence
generators because they are the sequence themselves. Lists are fundamental, starting
from the fact that evaluating a Clojure file creates a list which eventually feeds the
compiler.
cons is, along with list, the other concrete data type
extending clojure.lang.ASeq directly (other collections have an adapter class for
it). Like list, cons is also the name of the builder function for the type. The two
types, list and cons, are closely related and some functions treat them interchangeably
or transparently (thanks to supporting the same sequential interface). cons and list
share the same system of building up chains of linked cells to create the sequential
effect. list supports more features than cons: for example, a list can be counted in
constant time or reduced with an optimized algorithm.
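The constant-time counting difference is easy to verify: counted? checks for the clojure.lang.Counted interface, which PersistentList implements and Cons does not:

```clojure
;; A list knows its count in constant time; a chain of cons cells does not.
(counted? (list 1 2 3))
;; true
(counted? (cons 1 [2 3]))
;; false
```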
9.4.1 list
function since 1.0
The list function creates a new clojure.lang.PersistentList data type from the
given items:
(list 1 2 3 4 5) ; ❶
;; (1 2 3 4 5)
The list constructor is fundamental in the language: Clojure code that appears as text
is first transformed into lists and symbols, then macro-expanded and eventually
compiled to bytecode. Functional arguments are also processed as lists. It’s not a
coincidence that list is the first function defined in the standard library at the top of
the core namespace 167.
Despite their extensive use in the language itself, lists are less flexible than
lazy sequences, vectors, sets or hash-maps for everyday Clojure programming. There
are however a few use cases, which we are going to explore in the example section.
CONTRACT
Input
• "args" are zero or more arguments including nil.
Output
• returns: a clojure.lang.PersistentList object containing the arguments, or
empty list if no arguments.
Examples
Lists are created by "linking" elements to one another. The last item to enter the list is
added to the head (the left of the list when printed) and points at the previous head,
forming a chain. The following diagram shows the constituents of a typical list:
167
core.clj is the main standard library file in the Clojure codebase:
github.com/clojure/clojure/blob/master/src/clj/clojure/core.clj
Figure 9.6. A list formed from a chain of PersistentList objects. The EmptyList is a
specialized version of PersistentList to handle the tail position.
conj can be used to push elements into a list. Each item is pushed at the head of the
list, which makes the list appear backward when printed:
(conj () 1) ; ❶
;; (1)
(conj (conj () 1) 2)
;; (2 1)
(conj (conj (conj () 1) 2) 3) ; ❷
;; (3 2 1)
❶ The versatile conj understands how to push elements onto a list, as well as onto many
other data types.
❷ Arguments pushed onto the list appear in reverse order when we print the list.
The fact that a list prepends new elements at the head can be used with into (which
repeatedly uses conj) to reverse the content of another collection:
(defn rev [coll] (into () coll)) ; ❶
❶ Creating a list by pushing items into an empty list. The produced list now prints backward,
because into uses conj to push new elements to the head of the list.
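A quick check (repeating the definition so the snippet is self-contained):

```clojure
;; into uses conj, and conj on a list pushes at the head:
;; the input comes out reversed.
(defn rev [coll] (into () coll))

(rev [1 2 3])
;; (3 2 1)
```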
stack
A list is a good choice to implement a stack (last-in first-out queues are also called
stacks) as it supports the peek-pop interface. In the following example we are going to
use a list as a stack to find the sequence of the nearest smaller values 168.
To understand how the nearest smaller value search works, let’s have a look at a small
input first. Given a list (8 11 4 12 5 6) the sequence of the nearest smaller values
is (8 4 4 5):
• "8" has no previous number, so there is nothing to add to results.
• "11" has a previous smaller value, so "8" is added to the results.
• "4" has two previous values, but none is smaller.
• "12" has three smaller previous numbers. "4" is the nearest and is added to the
output.
• "5" closest smaller value is "4" so another "4" appears in the output.
• "6" nearest and smaller value is "5".
Note how, once we find that "y" is bigger than "x" (for example y="12" and x="5" in the
previous list), we can exclude all elements before "y" that are bigger than "y" (for
example, we don’t need to compare "5" with "11"). We are going to use a stack to keep
track of the visited elements, implicitly giving us the opportunity to skip items in the
next iteration:
(defn stack [] ()) ; ❶
(defn push [x stack] (conj stack x)) ; ❷
(defn nearest-smaller [coll] ; ❸
  (letfn [(step [xs st]
            (lazy-seq
              (when-first [x xs]
                (loop [st st] ; ❹
                  (cond
                    (empty? st) (step (rest xs) (push x st)) ; ❺
                    (< (peek st) x) (cons (peek st) (step (rest xs) (push x st)))
                    :else (recur (pop st)))))))]
    (step coll (stack))))
(nearest-smaller [0 8 4 12 2 10 6 14 1 9 5 13 3 11 7 15])
;; (0 0 4 0 2 2 6 0 1 1 5 1 3 3 7)
❶ stack is "syntactic sugar" to create a new list. It effectively renames list to stack helping us
dealing with the abstraction properly.
❷ Similarly, the primitive conj operation has been renamed push to enforce the proper use of a stack.
168
Nearest smaller value search is an optimization used in many algorithms, for instance merge sort. Please see the
Wikipedia entry for an overview:en.wikipedia.org/wiki/All_nearest_smaller_values
❸ nearest-smaller’s general setup follows the typical lazy-seq pattern: the step inner function
generates the lazy sequence after receiving the input sequence and an empty stack. As soon
as when-first returns nil the generation ends. What characterizes nearest-smaller is the presence
of an inner loop-recur construct to iterate the content of the stack.
❹ The inner (not lazy) loop performs the iteration of the stack content at that point in the recursion. We
are searching the stack for the first number smaller than the current head of the input. If we
find one, it goes into the generated output sequence straight away. If we can’t find one, we search the
next top of the stack and so on.
❺ Every time we reach the bottom of the stack without a smaller item, we step into the next iteration
without cons-ing an output element. Note how the recursion generating the output always pushes "x"
onto "st" (the current stack view), which means the current head of the input is positioned at the top of
the stack.
The outer step recursion and the inner loop are nested in nearest-smaller. The
presence of nested loops typically indicates O(nm) behavior (where m is the level of
nesting and n is the length of the input), but not in this case. Each item is pushed onto
and popped from the stack at most once, effectively limiting the number of operations
for the inner loop to a constant factor on average.
See also:
• vector, compared to list, offers direct lookup of elements by index.
• conj is the primary tool to push elements into a list after construction.
• cons is also a form of list (for which list* is the constructor; note the "star" at the
end of the name). cons chains are not generally used as a data structure as they offer
very limited flexibility (they are neither counted nor reducible). cons’ main use
case is as a building block for lazy sequences.
• seq generates a lazy sequence that, apart from laziness, behaves similarly to a list.
Lists support the sequence interface without the need for an adapter. At the same time,
this prevents lists from providing chunked behavior like ranges or vectors. It’s worth
remembering that lists are sequential but not lazy: their creation already implies the
evaluation of all the elements.
Since one of the practical use case for lists is to implement a stack, let’s compare them
to vectors. The following check function is used to verify balanced parenthesis (the
example is presented fully in “peek and pop”). All we need to do to use a different kind
of stack is to pass a different stack parameter to the function:
(require '[clojure.set :refer [map-invert]])

(def brackets {\( \), \[ \], \{ \}})
(defn push [q x] (conj q x))

(defn check [stack form] ; ❶
  (reduce
    (fn [q x]
      (cond
        (brackets x) (push q x)
        ((map-invert brackets) x)
        (if (= (brackets (peek q)) x)
          (pop q)
          (throw
            (ex-info
              (str "Unmatched delimiter " x) {})))
        :else q)) stack form))
❶ All we need to do to change stack implementation is passing something different as stack parameter
to the check function.
We are now going to compare a vector stack and list stack for the check function:
(require '[criterium.core :refer [quick-bench]])
❶ small and large contain similar patterns of nested parentheses. The pattern repeats at
different depths up to the given maximum as directed by take, forcing deeper stacks and stressing the
different stack implementations.
We can see that a list outperforms a vector implementing a stack, although not by a big
margin.
9.4.2 cons and list*
function since 1.0
(cons [x seq])
(list*
([args])
([a args])
([a b args])
([a b c args])
([a b c d & more]))
cons creates a new clojure.lang.Cons data structure by linking the given element to a
sequential tail:
(cons :a [1 2 3]) ; ❶
;; '(:a 1 2 3)
❶ cons takes the element to be added and another sequential data structure, joining them together in a
new sequential view.
The output of cons is itself sequential and can be used for another cons, gradually
forming longer linked lists of cons cells:
(cons 1 (cons 2 (cons 3 (cons 4 ())))) ; ❶
;; (1 2 3 4)
Beyond the first few items, list* can be used to create longer cons chains and avoid
repetition:
(list* 1 2 3 4 5 ()) ; ❶
;; (1 2 3 4 5)
❶ list* is used to create a linked list of cons cells by repeatedly applying cons on each element in the
input. Similarly to cons, the last element needs to be sequential.
Lists of cons cells are rarely used to create large data structures (Clojure, for example,
uses list* internally to compose arguments into a single list). cons is used primarily
as the building block of lazy sequences.
CONTRACT
Input
• "x" can be any type and is required for cons.
• "seq" is a sequential collection (as per the seq contract) or nil.
• "a", "b", "c" and "d" for list* can be of any type. "a" becomes the head
of the resulting cons list.
• "args" in list* indicates that the last argument is different from the others: it is
required to be sequential or nil.
• "more" in list* allows any number of arguments, but the last needs to be a
sequential collection or nil.
Notable exceptions
• IllegalArgumentException is thrown when "seq" is not sequential.
Output
cons returns:
• The clojure.lang.Cons instance containing "x" as the first element and "seq" as
the rest.
• When "seq" is nil it returns clojure.lang.PersistentList instead
of clojure.lang.Cons.
list* returns:
• A linked list composed by clojure.lang.Cons cell objects and "args" as the last
element.
Examples
Creating a cons-list longer than a few items is possible but not encouraged, as its use
as a collection is penalized by performance considerations. If you are thinking of
using apply to create longer chains, keep in mind that the end result might not be a
pure cons-list:
(def l (apply list* -2 -1 (range 10) ())) ; ❶
;; (-2 -1 0 1 2 3 4 5 6 7 8 9)
As discussed in the introduction, one of the main use case for cons is to build lazy
sequences. Let’s review the typical sequence generation scenario and focus on the use
of cons:
(defn lazy-loop [xs] ; ❶
(lazy-seq
(when-first [x xs]
(cons x ; ❷
(lazy-loop (rest xs))))))
(nth (lazy-loop (range)) 100000) ; ❸
;; 100000
❶ lazy-loop generates a lazy sequence from its input without any transformation. One side effect of
this apparently useless loop is the removal of chunks when iterating chunked sequences.
❷ After checking if there are more items to process, it pushes the current item into the next recursion
with cons.
❸ We can access elements that are far away from the head without consuming the stack, even without
tail-recursion.
❶ Compared to the previous example, we are using conj instead of cons. Note that we had to invert the
argument order.
❷ This time, lazy-seq is unable to keep the body unevaluated: conj evaluates its arguments
eagerly, since a PersistentList does not have the option of a lazy sequential tail.
NOTE Cons cells in Clojure are different from cons cells in other Lisps. In most Lisps, a "cons" holds
pointers to arbitrary objects. In those Lisps, you could use cons cells to build trees for instance.
A Clojure cons takes an arbitrary Object as head but only allows an ISeq as tail.
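The restriction on the tail is easy to observe (a small check; trying to cons onto a non-seqable tail fails):

```clojure
;; Any object can be the head, but the tail must be seqable.
(cons 1 [2 3])
;; (1 2 3)

(try
  (cons 1 2) ; 2 is not seqable
  (catch IllegalArgumentException e :not-seqable))
;; :not-seqable
```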
See also:
• conj understands all Clojure types including lists, invoking the right "append"
semantics based on the type of the input collection. Use cons to create lazy
sequences (or small throw-away lists), and only when you are absolutely certain
that conj won’t work in your case.
• lazy-seq understands cons, to which it is intimately connected for the generation of
lazy sequences.
Performance considerations and implementation details
10
Sequential Processing
This chapter is about functions and macros for sequential processing. The functions in
this chapter typically transform their input into a sequence (if it’s not one already) and
produce another sequence. Although sequential functions can be used with any
collection type offering a sequential interface, they tend to perform best with
pure sequential input/output. Sequential processing can be broadly categorized as
follows:
• Partitioning: retrieve a consecutive portion of the sequence, at the beginning or
from the end, by index, number of items or using a custom predicate.
• Selection: retrieve items from the sequence but not necessarily as a consecutive
selection of elements.
• Transforming: apply a transformation function to each item in the sequence to
produce another sequence.
• Combining: combination of multiple sequences to form another sequence.
• Chunking: process a sequence by groups of multiple elements instead of one at a
time.
Partitioning in particular has a rich interface. The naming convention is consistent, but
it can be confusing to pick the right function given that there are so many.
For this reason, partitioning functions have been divided into two groups:
• “rest, next, fnext, nnext, ffirst, nfirst and butlast” partition the sequence by step of
a single item. Then they return either the single item or the rest of the sequence.
Some of them are combined into other functions.
• “drop, drop-while, drop-last, take, take-while, take-last, nthrest, nthnext” partition
the sequence by the item index or a predicate. As a result, the partitioning starts or
ends at any point from the input sequence. Also in this case, some combination of
functions have been extracted into other separate functions.
Some functions like “first, second and last”, “map and map-indexed” or “filter and
remove” also belong here, but they are given specific treatment in the "Basic
Constructs" chapter.
(rest [coll])
(next [coll])
(butlast [coll])
(fnext [coll])
(nnext [coll])
(ffirst [coll])
(nfirst [coll])
rest, next, butlast, nnext and nfirst generate a lazy sequence after removing at most
one element from either the head or the tail of the input
collection. fnext and ffirst return a single element instead.
The following table summarizes the functions in this section and their goals:
Name Description
rest Returns coll except for the first item, or empty list.
next Returns coll except for the first item, or nil if no items.
butlast Returns coll except for the last item, or nil if no items.
fnext Returns the first of the next of coll. Same as second.
nnext Returns the next of next of coll, nil if empty.
ffirst Returns the first of the first of coll. Assumes nested coll.
nfirst Returns the next of the first of coll. Assumes nested coll.
As you can see, this group of functions differs in a few aspects, like what they return if
there are no more items (empty sequence or nil), which side of the input they partition
(beginning or end) or whether they return another sequence or a single item.
rest and next (along with first) play an important role in recursive algorithm
definitions over sequential inputs. Their combinations
(fnext, nnext, ffirst and nfirst) spare a few keystrokes and parentheses when
necessary.
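For instance, each combined function is just shorthand for nesting the basic calls (a quick sketch):

```clojure
;; each combination is equivalent to nesting first/next
(fnext [1 2 3])        ;; => 2, same as (first (next [1 2 3]))
(nnext [1 2 3])        ;; => (3), same as (next (next [1 2 3]))
(ffirst [[1 2] [3]])   ;; => 1, same as (first (first [[1 2] [3]]))
(nfirst [[1 2] [3]])   ;; => (2), same as (next (first [[1 2] [3]]))
```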
Contract
By using juxt we can verify the behavior of the functions in this section on corner
cases:
((juxt rest next butlast nfirst ffirst nnext fnext) nil) ; ❶
;; [() nil nil nil nil nil nil]
((juxt rest next butlast nfirst ffirst nnext fnext) '()) ; ❷
;; [() nil nil nil nil nil nil]
❶ All functions in this section can be called on nil. rest is the only one returning an empty list.
❷ All functions also accept an empty collection as input, producing exactly the same result.
Input
• "coll" can be any sequential input or nil. A collection is "seqable" when it
provides a sequencing strategy for seq. Most Clojure data structures are
seqable, as are the most important Java collections.
Notable exceptions
• IllegalArgumentException especially for ffirst and nfirst which assume
nested data structures.
Output
• rest, next, butlast, nnext and nfirst return a sequence. In case there are no items
to return for the requested operation, rest returns an empty list while the others
return nil.
• ffirst and fnext return a single item, or nil if the operation results in no
item being available.
Examples
As mentioned in the introduction, rest or next are part of the fundamental recursive
idiom for sequences along with first:
(defn rest-loop [coll] ; ❶
(loop [xs coll results []] ; ❷
(if-let [xs (seq xs)] ; ❸
(recur
(rest xs) ; ❹
(conj results (first xs)))
results)))
❶ rest-loop iterates over the elements of the given collection and puts them in a vector without any
transformation.
❷ The loop-recur construct defines the initial bindings. We bind "coll" to "xs" inside the loop, so we are free to
consume it at each iteration. "results" holds the gradual accumulation of the output.
❸ We need to check if there are elements in "xs" before starting another recursion. seq is used here to
transform a potentially empty list (what rest returns) into nil, so it can be used in the if-let condition.
❹ If there are more elements to process, the first one is added to the results and the rest of "xs" is used
for the next recursion.
It is now apparent why next is useful if we want to put the condition on the collection
itself. Since next returns nil to signal the end of the input (instead of an empty list), we
can use the data structure itself as a logical boolean (another Clojure idiom, also known
as "nil punning"). By doing so we can remove the if-let binding:
(defn next-loop [coll] ; ❶
(loop [xs coll results []]
(if xs ; ❷
(recur
(next xs) ; ❸
(conj results (first xs)))
results)))
❶ next-loop is a rewrite of the previous rest-loop to take advantage of the nil-punning quality
of next.
❷ The if condition now happens directly on "xs" which is the current view of the sequence.
❸ next is used instead of rest.
Note that the two functions rest-loop and next-loop are designed to fully consume
their input, without specific concerns about laziness. If however the input was
something extremely expensive to compute, then we might be interested in the
difference between rest and next in terms of laziness. To illustrate the point, let’s now
create a lazy-seq recursive loop using next:
(defn lazy-expensive [] ; ❶
(map #(do (println "thinking hard") %)
(into () (range 10))))
❶ Our input is an expensive lazy sequence. lazy-expensive produces a side effecting print on screen
so we can see when something is produced.
❷ lazy-loop uses the recursive lazy-seq idiom to build a lazy sequence on top of the input. As author of
the function we don’t know what kind of input will be passed in, but we guarantee to the outside world
that we are going to consume it lazily.
❸ We decide to use the next looping style, taking advantage of nil punning in the when condition.
❹ There are two prints on screen when we ask for the first element.
❶ lazy-loop has been changed to accommodate the rest recursive style. We are using the
handy when-first shortcut, which expands into an assignment of (first xs) to the local binding "x".
❷ We can now use rest instead of next.
❸ The output is now fully lazy, without consuming more items than the ones actually requested.
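The two looping styles described by the callouts above can be sketched as follows (the original uses a single lazy-loop function; the two names and the exact shapes here are assumptions):

```clojure
;; next-based style: evaluating (next xs) as the argument of the
;; recursive call realizes one element beyond the item being consed,
;; so asking for the first element of the output realizes two items.
(defn lazy-loop-next [xs]
  (lazy-seq
    (when (seq xs)
      (cons (first xs) (lazy-loop-next (next xs))))))

;; rest-based style: when-first binds (first xs) and rest never looks
;; ahead, so the output realizes exactly one input item per element.
(defn lazy-loop-rest [xs]
  (lazy-seq
    (when-first [x xs]
      (cons x (lazy-loop-rest (rest xs))))))
```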
❶ Our into* is reusing into internally after processing the parameters. After the first parameter "to" there
is a catch-all "args" that can optionally include transducers.
❷ We isolate potential transducers in the arguments with butlast. We know that a source collection is
always required, so we can safely exclude the last argument. We can rely on into for
parameter validation.
❸ A few tests to verify that into* works as intended.
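The into* listing itself is not reproduced here; one way it might be written, consistent with the callouts (the shape is an assumption), is:

```clojure
;; into* behaves like into but accepts any number of optional
;; transducers between the target collection and the source:
;; (into* to xform1 xform2 ... from)
(defn into* [to & args]
  (let [from   (last args)       ;; the source collection is always last
        xforms (butlast args)]   ;; anything in between is a transducer
    (if (seq xforms)
      (into to (apply comp xforms) from)
      (into to from))))

(into* [] (map inc) (filter odd?) (range 10))
;; => [1 3 5 7 9]
```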
(process message) ; ❹
;; ({:item "A", :succ ["H" 37 82 11]}
;; {:item "H", :succ ["N" 127 0]}
;; {:item "N", :succ :incomplete})
❶ The sample message presented here is a short vector, but the process function is designed to accept
an (arbitrarily long) lazy sequence and return another lazy sequence.
❷ process generates a lazy sequence from the input message. We have an option to customize the
recursion so we can check following elements in the sequence to decide what to do.
❸ ffirst and fnext are convenience functions to access the current head of the element and the next
element.
❹ In the final output we can see that each element is potentially linked to the next and the last one is
marked as "incomplete".
See also
• first and last are other popular functions to access specific elements of a sequence.
• second is equivalent to fnext to access the second element in a sequence. There is
no "third" function or following ordinals.
• pop is the equivalent of butlast for vectors. butlast, in order to remove the last
element, needs to walk the entire sequence, resulting in linear behavior (the worst
case). Vectors are specifically optimized for tail access and pop is the correct way
to get rid of the last element.
169
The rationale for introducing “lazy-seq” and removing lazy-cons is described on this page of the main Clojure
website: https://clojure.org/reference/lazy. The page is still there mainly for historical purposes.
(not-chunked rest) ; 32
(not-chunked next) ; 33
(chunked rest) ; 32
(chunked next) ; 64
❶ counter creates a mapping function which closes over a mutable atom. We can use the state to count
how many items are flowing through the sequence.
❷ not-chunked creates a non-chunked sequence by using a persistent list as the source for
the map operation. map will call seq on the list, which doesn’t use chunking. Note that the fact that the list
is created on top of a chunked range is neutralized as soon as we use it to build the list.
❸ chunked, on the other hand, creates a chunked sequence by using the range directly. Both not-
chunked and chunked drop 31 items before applying "f" to the resulting sequence.
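The counter, not-chunked and chunked helpers described in the callouts are not shown here; one possible reconstruction (names and details are assumptions) is:

```clojure
(defn counter [state]
  ;; mapping function that counts how many items flow through
  (fn [x] (swap! state inc) x))

(defn not-chunked [f]
  ;; a persistent list does not produce a chunked seq
  (let [state (atom 0)]
    (f (drop 31 (map (counter state) (into () (range 100)))))
    @state))

(defn chunked [f]
  ;; range produces a chunked seq, realized 32 items at a time
  (let [state (atom 0)]
    (f (drop 31 (map (counter state) (range 100))))
    @state))
```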
The 4 results we see when executing not-chunked and chunked can be explained by the
following:
1. We call rest on a non-chunked sequence after dropping 31 items. The item at
index 31 is evaluated in order for rest to move forward. The total number of items
evaluated is 32 (indexes start from zero).
2. We call next on a non-chunked sequence, again dropping the first 31 items. The
item at index 31 is evaluated for next to get past and the item at index 32 is
evaluated to establish whether to return nil or not. 33 items in total are evaluated.
3. We call rest on a chunked sequence after dropping 31 items. The item at index 31
needs to be evaluated for rest to move past it to the next chunk in the sequence, which is not
evaluated yet. 32 items are evaluated in total, the entire content of the chunk from
index 0 to 31.
4. Finally, we call next on the chunked sequence. We dropped 31 items, item at
index 31 needs evaluation, as well as item at index 32 to verify the end of the
sequence. But item at index 32 sits in the next chunk, therefore the second chunk
is evaluated, resulting in a total of 64 items evaluated.
Chunked evaluation is a trade-off that gives up some laziness in exchange for the
preemptive caching of additional items beyond what is requested. Given that 32 is the
branching factor of persistent data structures, there is clearly a correlation: the
evaluation of the chunk corresponds to the bulk array copy of a node in the bit-partitioned
trie (please refer to the introduction of the vectors chapter to know more about
the implementation details of Clojure persistent data structures).
The functions in this section generate a sequence of contiguous elements from an input
collection. They offer different parameters to control how the selection should happen:
• By number of items from the head or from the tail of the input (drop, drop-
last, take, take-last, nthrest, nthnext)
• Using a predicate (drop-while and take-while)
• Keep the head, drop the tail (take, take-while, drop-last)
• Drop the head, keep the tail (drop, drop-while, take-last, nthrest, nthnext)
The following table summarizes the functions in this section and their goals:
As the reader can see from the table, some functions provide a transducer version while
others (specifically take-last, nthrest and nthnext) partially evaluate their
arguments. Please look at the "Examples" section below for more details.
Contract
Input
• "n" can be any number, positive or negative, integer or floating point. It defaults to
1 for drop-last. It is a required argument for take-
last, drop, take, nthrest and nthnext.
• "pred" is a function of one argument returning logical true or false. It is a required
argument for drop-while and take-while.
• "coll" can be any sequential input or nil (a collection is "seqable" when it
provides a sequencing strategy for seq or is a sequence itself). It is optional input
for the transducer-aware functions: drop, drop-while, take and take-while.
Notable exceptions
• IllegalArgumentException when "coll" is not sequential (as per seq contract).
Output
The functions in this section generate a sequence from an input collection (except
when returning the transducer version, see below):
• drop removes the first "n" items, returning an empty sequence if "n" is bigger than (count
coll). When only "n" is present it returns a transducer that removes the first "n"
elements when used.
• drop-while removes those items from the head of "coll" for which (pred
item) returns logical true. An alternative way to describe it is: drop-while drops
elements from "coll", stopping the first time (pred item) returns false. When
"coll" is not provided, it returns the transducer version of the function. Returns an
empty sequence when there are not enough items to satisfy the request.
• drop-last removes "n" items from the tail of the input. It removes the last item
when "n" is 1. Returns an empty sequence when there are not enough items to satisfy
the request.
• take keeps the first "n" items, dropping the rest. When "coll" is not provided,
it returns a transducer version that keeps the first "n" elements. Returns an empty
sequence when "n" is larger than (count coll).
• take-while keeps those items for which (pred item) is true. When "coll" is not
provided it returns the transducer version of the function. Returns an empty sequence
when there are not enough items to satisfy the request.
• take-last keeps the last "n" elements starting from the tail of "coll". It returns
empty sequence if "n" is larger than (count coll).
• nthrest removes the first "n" items from "coll". It returns an empty list when "n" is
larger than (count coll).
• nthnext is like nthrest, but returns nil when "n" is larger than (count coll).
The functions returning a transducer when "coll" is not present are: take, take-
while, drop and drop-while. The returned transducer follows the transducers contract
and accepts a reducing function to use in a transducing context.
Examples
drop and take are typically used when the number of items to drop or take is known
ahead of time. For example, if we wanted to process the remaining days in the current year
(assuming today is December 25th):
(import '[java.util Calendar])
(def day-of-year (.get (Calendar/getInstance) Calendar/DAY_OF_YEAR)) ; ❶
❶ One way to obtain the number for the current day of the year.
❷ Once we have the current day of the year, we can drop it from the range of the days in a year.
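The expression for callout ❷ could look like the following sketch (remaining-days is an assumed name; note that Calendar/DAY_OF_YEAR is a static field, accessed without parentheses):

```clojure
(import '[java.util Calendar])

;; the current day of the year, e.g. 359 for December 25th
(def day-of-year (.get (Calendar/getInstance) Calendar/DAY_OF_YEAR))

;; drop the days already passed from the (at most 366) days of a year
(def remaining-days (drop day-of-year (range 1 367)))
```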
We can use take in a similar way, for example to extract information that always
appears at the beginning of a collection. In the following example, a message hub
contains messages from other applications. Each message is encoded as a vector starting
with an error code. We only want to process some messages and discard others based
on the error code and the month the message was generated:
(def hub-sample ; ❶
[[401 7 :mar "-0800" :GET 1.1 12846]
[200 9 :mar "-0800" :GET 1.1 4523]
[200 2 :mar "-0800" :GET 1.1 6291]
[401 17 :mar "-0800" :GET 1.1 7352]
[200 23 :mar "-0800" :GET 1.1 5253]
[200 7 :mar "-0800" :GET 1.1 11382]
[400 27 :mar "-0800" :GET 1.1 4924]
[200 27 :mar "-0800" :GET 1.1 12851]])
(process-errors hub-sample)
;; ([401 7 :mar "-0800" :GET 1.1 12846]
;; [401 17 :mar "-0800" :GET 1.1 7352]
;; [400 27 :mar "-0800" :GET 1.1 4924])
❶ A small sample of the content from the message hub has been created for this example.
❷ The error? function isolates error codes for a specific month.
❸ process-errors contains the logic to iterate through the messages in the hub. It uses a filter function
to isolate the interesting messages. The predicate of the filter only takes the first 3 items from each
message, which include an error code, a day of the month (which is not used) and the name of the
month this message belongs to. The error code and the month are sent to error? to decide if the
message should be kept.
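The error? and process-errors definitions are not shown here; one possible implementation consistent with the callouts (the 4xx-in-March rule is an assumption) is:

```clojure
(def hub-sample
  [[401 7 :mar "-0800" :GET 1.1 12846]
   [200 9 :mar "-0800" :GET 1.1 4523]
   [200 2 :mar "-0800" :GET 1.1 6291]
   [401 17 :mar "-0800" :GET 1.1 7352]
   [200 23 :mar "-0800" :GET 1.1 5253]
   [200 7 :mar "-0800" :GET 1.1 11382]
   [400 27 :mar "-0800" :GET 1.1 4924]
   [200 27 :mar "-0800" :GET 1.1 12851]])

(defn error? [code month]
  ;; assumption: errors are 4xx codes generated in March
  (and (<= 400 code 499) (= :mar month)))

(defn process-errors [hub]
  (filter (fn [msg]
            ;; only the first 3 items matter for the decision
            (let [[code _ month] (take 3 msg)]
              (error? code month)))
          hub))
```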
We can use a predicate function when there is a rule driving what should be
taken/removed. The following example shows how we could implement a function to
isolate contiguous items in a list and generate a lazy sequence of them:
(defn tokenize [pred xs]
(lazy-seq
(when-let [ys (seq (drop-while (complement pred) xs))] ; ❶
(cons (take-while pred ys) ; ❷
(tokenize pred (drop-while pred ys)))))) ; ❸
❶ The first step involves drop-while to remove all the items we don’t want from the head of the list.
Once we are positioned on something we are interested in (the first odd number in this example) we
start collecting results. Note how we need to use seq here, as drop-while returns an empty list when
we are at the end of the input.
❷ take-while collects the items we want to group and isolate from the head of the sequence. Those
are pushed with cons to the lazy sequence under construction.
❸ Before we start over, we need to drop-while all the items that we just collected, as we don’t want to
have them again for the remaining part of the input.
❹ We can use a list of digits to test the results. The resulting lazy sequence contains all the groups of
contiguous odd digits present in the input.
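The test expression for callout ❹ might look like this (repeating the tokenize definition for completeness):

```clojure
(defn tokenize [pred xs]
  (lazy-seq
    (when-let [ys (seq (drop-while (complement pred) xs))]
      (cons (take-while pred ys)
            (tokenize pred (drop-while pred ys))))))

;; groups of contiguous odd digits in the input
(tokenize odd? [1 3 2 4 5 7 6])
;; => ((1 3) (5 7))
```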
❶ We can see the transducer version of drop used to remove the first 3 items before summing up
the rest with +.
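A sketch of what such an expression might look like:

```clojure
;; drop the first 3 items of (range 10), then sum the remaining 3..9
(transduce (drop 3) + (range 10))
;; => 42
```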
warning: take, drop, take-while and drop-while are stateful transducers. Stateful
transducers can produce inconsistent results in concurrent scenarios. You should pay
attention when using them with fold or the core.async pipeline construct (see
https://clojuredocs.org/clojure.core.async/pipeline and the transducer chapter for more
information).
Laziness considerations
You need to pay attention to some of the functions in this section when dealing with infinite sequences: the eagerly evaluating ones, take-last in particular, never return on infinite input.
One interesting effect of using eager evaluation with nthrest or nthnext is the possibility to execute
side effects even when remaining elements in a sequence are never evaluated. In the next example, a
service produces a sequence backed by temporary files connected to each item. Some application logic
decides to drop elements from the sequence before returning it to a client. After the control returns from
the service we don’t know if the client is going to consume the rest of the sequence or not, but we
certainly know that we can clean the files related to the items dropped from the sequence:
(defn service [] ; ❸
(let [data (map #(generate-file %) (list 1 2 3 4 5))]
(nthrest (map fetch-clean data) 2)))
See also
• “rest, next, fnext, nnext, ffirst, nfirst and butlast” are similar to the functions in
this section, but default "n" to 1 for most of their operations.
• subvec can be used to extract portions of a vector in a similar way
to drop or take, with better performance.
• pop should be used instead of drop-last on vectors, as drop-last would convert
the vector to a sequence first.
Performance Considerations and Implementation Details
⇒ O(1) time and space, best case ⇒ O(n) steps and space, worst case
For most of the functions in this section, the worst case is fully consuming the input
sequence, resulting in linear behavior. Here are the combinations of parameters forcing
full evaluation of an input sequence "xs" of length "l":
(drop l xs) ; ❶
(drop-while (constantly false) xs) ; ❷
(drop-last xs) ; ❸
(take l xs) ; ❹
(take-while (constantly true) xs) ; ❺
(take-last 1 xs) ; ❻
(nthrest xs l) ; ❼
(nthnext xs l) ; ❽
❶ drop over the length of the input, fully evaluates the input but does not retain any element
in O(1) space.
❷ Similarly for drop-while with an always false predicate.
❸ drop-last, with a default "n" of 1, fully evaluates its input and holds on to the head,
resulting in O(n) space.
❹ take over the length of the sequence is equivalent to the sequence itself, plus another copy of its
elements for the sequence just created by take. This operation is O(n) steps and O(n) space.
❺ The same happens to take-while with an always true predicate.
❻ take-last for a single element is O(n) steps and O(1) space.
❼ nthrest over the length of the input evaluates all but holds nothing in O(1) space.
❽ Same as nthrest for nthnext.
There are differences in the level of laziness offered by some functions: take-
last, nthrest and nthnext, for instance, evaluate results partially at creation time, as
outlined in "Laziness considerations" above.
Using sequence processing functions on non-sequential data is not usually a
good idea, as the data needs to be converted into a sequence first. Where necessary,
equivalent operations exist for vectors, for example subvec. Here is how we could
write drop and take in terms of subvec:
(defn dropv [n v] (subvec v n (count v))) ; ❶
(defn takev [n v] (subvec v 0 n)) ; ❷
(keep
([f])
([f coll]))
(keep-indexed
([f])
([f coll]))
❶ keep is used to take the first element from each of the collections in the input. Note that when first happens on
an empty collection or nil, the generated nil does not appear in the final output.
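The expression described by the callout might have looked like this sketch:

```clojure
;; first over an empty collection or nil returns nil, which keep removes
(keep first [[1 2] [] nil [3 4]])
;; => (1 3)
```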
keep certainly has a lot in common with map, including a keep-indexed version which
also passes the index of the current item to the transformation:
(keep-indexed #(nthnext (repeat %2 %1) %1) [1 3 8 3 4 5 6]) ; ❶
;; ((0) (1 1) (2 2 2 2 2 2))
❶ Similarly to map-indexed, keep has a keep-indexed version. We are using it here to generate a
sequence of repetitions of each index. The input collection drives the number of repetitions for
each index (e.g. at index 2 in the vector we have the number 8, producing 8 copies of 2) and the
transformation function then drops "index" elements from each repetition (e.g. the 8 copies of
index 2 lose 2 elements). When the number at an index is not bigger than the index itself (e.g. the
numbers 3, 4, 5 and 6 at indexes 3 to 6 in the vector) we get a nil that is removed from the final result.
Unlike map, keep doesn’t accept multiple collection arguments, but like map it
provides a transducer version:
(sequence (keep #(when (> 0.5 (rand)) %)) (range 20)) ; ❶
;; (0 2 3 4 5 10 13 15 16 17 18 19)
Contract
Input
• "f" is a function of one (keep) or two (keep-indexed) arguments. In the case of two
arguments, the first is the index of the item passed as the second argument. It is
a mandatory argument.
• "coll" is any sequential-aware collection, as per seq contract. "coll" is optional.
Notable exceptions
• ArityException when keep-indexed is erroneously passed a function only
accepting one argument.
Output
• When "coll" is present: the lazy sequence generated by applying the
transformation "f" to each element in "coll". If any application of "f"
returns nil, that nil does not appear in the final output.
• When "coll" is not present, keep and keep-indexed return a transducer which
accepts a reducing function as mandated by the transducers contract.
Examples
keep has been used a few times in the book. The reader is invited to review those
examples:
• flatten has an example of retaining just the last item of the results of applying a
regular expression with re-find.
• sequence contains a transducer example of keep to parse some structured output.
• seque contains an example of keep to remove unwanted nil from the output.
Idiomatically, keep is known for its shortening effect on the (remove nil? (map f
coll)) form:
❶ The implementation of first-index-of takes advantage of the nil filtering provided by keep-
indexed: map-indexed would produce a sequence of nil except for the matching item, while the
keep-indexed output contains a single element (or none).
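A possible reconstruction of first-index-of, consistent with the callout (the exact shape is an assumption):

```clojure
;; return the index of the first item matching pred, using keep-indexed
;; to discard every non-matching (nil) result
(defn first-index-of [pred coll]
  (first (keep-indexed (fn [i x] (when (pred x) i)) coll)))

(first-index-of #{:c} [:a :b :c :d])
;; => 2
```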
note: please be aware that accessing an element by index is not the designed use for
sequences. There are better data structures with random lookup access for this, such
as vectors or maps. They allow an almost constant time index lookup
in O(log32 N), where "N" is the size of the data structure.
See also
• map and map-indexed are similar to keep and keep-indexed without filtering
for nil.
• filter can be used on top of any sequential processing to remove nil or other
unwanted items.
• range can be used to generate an infinite list of positive integers to use as indexes
in custom solutions not involving keep-indexed.
Performance Considerations and Implementation Details
As with other sequence processing functions, keep expects a sequential input and
produce caching of the sequential output, producing linear behavior in space when the
results are fully consumed.
Also see map and lazy-seq performance sections for more information.
10.4 mapcat
function since 1.0
(mapcat
([f])
([f & colls]))
mapcat functionality is well described by its name: the union of a map operation and
the concatenation of the transformations produced by map. For the concatenation part to
work correctly, it is assumed that the transformation produces a sequential collection:
(mapcat range [1 5 10]) ; ❶
;; (0 0 1 2 3 4 0 1 2 3 4 5 6 7 8 9)
❶ mapcat applies range to each item in the collection, producing intermediate sequences of numbers.
The intermediate sequences are concatenated together for the final output.
❶ The mapcat transducer is used with two input collections. The function repeat accepts two arguments: the
first is the number of repetitions (coming from the first collection) and the second is the item to repeat
(coming from the second collection).
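The multi-collection transducer example the callout refers to could look like this sketch:

```clojure
;; sequence with a transducer accepts multiple input collections:
;; repeat is called with the pairs (1 :a), (2 :b) and (3 :c), and
;; mapcat concatenates the resulting repetitions
(sequence (mapcat repeat) [1 2 3] [:a :b :c])
;; => (:a :b :b :c :c :c)
```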
Contract
Input
• "f" is a function of one or more arguments returning any sequential type (as
described by the seq contract) or nil. It is a mandatory argument. The number of
arguments accepted by "f" corresponds to the number of "colls" passed as input.
• "colls" is a variable number of 0 or more sequential collections (as described
by the seq contract). "colls" is optional.
Notable exceptions
• IllegalArgumentException is typically thrown if the transformation "f" does not
produce sequential output.
Output
• returns: a lazy sequence of the concatenation of the transformations produced by
applying "f" to each element in "colls". If multiple "colls" are present, "f" is applied
to all first elements in "colls", then all second elements, and so on until reaching
the end of the shortest collection. If no "colls" is provided, it
returns the transducer version of mapcat.
Examples
mapcat with multiple collections can be used to isolate different ranges out of a larger
set and merge them back together. Hexadecimal characters, for instance, are defined by
the ten digits "0123456789" and the six letters "ABCDEF". Both sets are
available in the ASCII table at different indexes. The index range for the digits is 48-58
while the range for the first six uppercase letters of the alphabet is 65-71 (upper
bounds excluded):
(def hex?
(set (sequence ; ❶
(comp
(mapcat range) ; ❷
(map char))
[48 65] ; ❸
[58 71])))
❶ The hex? function takes advantage of the fact that sets can be used as functions of one argument to decide
if an element belongs to the set or not. We just need to def the set to the var hex? to create a properly
working predicate.
❷ mapcat is the first transducer in the list, because it takes care of the input coming from multiple
collections. It is followed by an int to char transformation.
❸ The 2 input vectors contain the index ranges for the characters we are interested in. All the lower
bounds appear in the first vector and all the upper bounds in the second vector. This allows
range to receive them in the right order.
❹ We can use the predicate on a string with every?.
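A usage sketch for callout ❹ (repeating the hex? definition for completeness):

```clojure
(def hex?
  (set (sequence
         (comp
           (mapcat range)
           (map char))
         [48 65]
         [58 71])))

;; strings are seqable as sequences of characters
(every? hex? "BAD5")  ;; => true
(every? hex? "2G")    ;; => false
```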
We can see mapcat in action in the following topological sort: ordering a list of
interdependent tasks so that tasks without dependencies come first 170 .
The example uses a map of library names and their direct dependencies. We can
use mapcat to extract the next layer of transitive dependencies using the map as a
function. Let’s have a look at this pattern in isolation first:
(def libs {:async [:analyzer.jvm] ; ❶
           :analyzer.jvm [:memoize :analyzer :reader :asm]
           :memoize [:cache]
           :cache [:priority-map]
           :priority-map []
           :asm []})
170
Typical application of topological sorting is the ordering of the Java classpath so that classes are loaded only when all
their dependencies are satisfied. The Wikipedia entry has more
examples: https://en.wikipedia.org/wiki/Topological_sorting
Using mapcat on any group of keys, we can see the first layer of transitive
dependencies. To be sure all dependencies are satisfied we need to iterate further until
we find a library that has no dependencies. We can then walk the list of dependencies
backward to read the order in which we should satisfy the tasks:
(defn tsort [deps k] ; ❶
(loop [res () ks [k]]
(if (empty? ks) ; ❷
res
(recur (apply conj res ks) ; ❸
(mapcat deps ks)))))
❶ tsort implements a simplified form of topological sort (it does not detect cycles). It contains a loop-
recur which is used internally for recursion after preparing the initial arguments.
❷ The termination condition for the recursion is an empty list of transitive key dependencies, which means
we’ve followed all dependencies.
❸ mapcat is used here to send the newly discovered layer of transitive dependencies to the next
iteration. At the same time results are accumulated by conj into a list. The form with apply allows us to
treat the "ks" list as separate arguments to insert in the list, removing the unwanted nesting. Note
that cons does not support a variable number of arguments.
❹ We ask tsort to find the order of dependencies to process so that the root task :async is satisfied.
Note that libraries, tasks or other forms of dependencies are equivalent for tsort, assuming we can
form an initial map which contains all of the direct dependencies.
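Putting it together, calling tsort on the libs map (both definitions repeated for completeness) yields the dependency order, leaves first:

```clojure
(def libs {:async [:analyzer.jvm]
           :analyzer.jvm [:memoize :analyzer :reader :asm]
           :memoize [:cache]
           :cache [:priority-map]
           :priority-map []
           :asm []})

(defn tsort [deps k]
  (loop [res () ks [k]]
    (if (empty? ks)
      res
      (recur (apply conj res ks)
             (mapcat deps ks)))))

(tsort libs :async)
;; => (:priority-map :cache :asm :reader :analyzer :memoize :analyzer.jvm :async)
```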
See also
• map is used ahead of concat by mapcat to apply transformations.
• concat concatenates two or more sequential collections together.
• cat is the transducer version of concat.
• r/mapcat is the reducers version of mapcat.
❶ mapcat is used here to concatenate different ranges of numbers. We can see 4 dots printed even if
there is no consumer of the generated sequence.
(def a (mapcat* range (map #(do (print ".") %) (into () (range 10))))) ; ❷
❶ We can lazily apply the transformation to all collections with map. The laziness problem affects the
second apply call that sits on top of this first one in the core implementation.
❷ Using the same example, no dots are printed this time, a sign that mapcat* is fully lazy.
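A fully lazy mapcat* consistent with the callouts might be defined as follows (a sketch, not necessarily the book's exact listing):

```clojure
;; wrap the whole apply/concat machinery in lazy-seq so that nothing
;; is realized until the output sequence is actually consumed
(defn mapcat* [f & colls]
  (lazy-seq (apply concat (apply map f colls))))
```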
mapcat also has a transducer version that we can compare for sequential generation or
reduction. We expect the mapcat transducer to have an advantage over normal mapcat, as
the transducer version eliminates the need for the intermediate sequences generated
internally by map. The following benchmark measures the different ways of producing
output with mapcat:
(require '[criterium.core :refer [bench]]) ; ❶
(let [xs (range 1000)] (bench (last (mapcat range xs)))) ; 18ms ; ❷
(let [xs (range 1000)] (bench (last (sequence (mapcat range) xs)))) ; 48ms ; ❸
(let [xs (range 1000)] (bench (last (eduction (mapcat range) xs)))) ; 48ms
(let [xs (range 1000)] (bench (reduce + (mapcat range xs)))) ; 8.5ms ; ❹
(let [xs (range 1000)] (bench (transduce (mapcat range) + xs))) ; 6.9ms
❶ As usual throughout the book, we are making use of the Criterium benchmarking library.
❷ This is the basic mapcat generating a lazy sequence without using transducers.
❸ sequence and eduction are benchmarked next. They are considerably slower than basic mapcat.
❹ Finally we can see a comparison between reduce and transduce using mapcat.
The transduce version is marginally faster than the reduce version, but sequence and
eduction perform 3 times slower than plain mapcat. This can be explained by the
additional complexity that the sequence implementation has to deal with to enable
transducers. The problem was discussed in more detail in the performance section of sequence.
If performance is important and the transducer chain is not particularly
complicated, the best option for sequential access is to use basic mapcat.
Alternatively, if laziness is not an issue, the mapcat transducer can generate
a vector using into with very good performance:
(let [xs (range 1000)] (bench (into [] (mapcat range) xs))) ; 10.4ms ; ❶
❶ Creating a vector with a transducer chain using mapcat is faster than lazy sequential mapcat.
(interpose
([sep])
([sep coll]))
(interleave
([])
([c1])
([c1 c2])
([c1 c2 & colls]))
interpose and interleave add elements to a sequence by alternating old and new
items in a new output sequence. In the case of interpose the new items are all the same
and are repeated throughout the length of the input sequence:
(interpose :orange [:green :red :green :red]) ; ❶
;; (:green :orange :red :orange :green :orange :red)
❶ The keyword :orange is interposed between the elements of the input sequence to form a new
sequence. Note how the interposing stops before the last item and there is no :orange as the last
element of the output sequence.
❷ interpose also has a transducer version.
interleave takes interpose a step further and offers a way to take the elements to
alternate from another sequential source:
❶ interleave takes two or more sequential inputs to produce a lazy sequence of the alternating
elements in them.
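The call the callout describes might look like this sketch:

```clojure
(interleave [1 2 3] [:a :b :c])
;; => (1 :a 2 :b 3 :c)

;; interleave stops at the shortest input:
(interleave [1 2 3] [:a])
;; => (1 :a)
```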
Contract
Input
• "sep" can be any expression or constant value including nil. It is a required
argument for interpose.
• "c1", "c2", "coll" or "colls" are collection arguments supporting the sequential
interface (see sequence contract) or nil.
Output
• Without a collection, interpose returns a transducer that alternates "sep" in the
reducing step.
• With no arguments, interleave returns the empty list. With a single
collection, interleave transforms the input into a lazy-seq without altering its
content. With 2 or more "colls", interleave alternates items from each collection
into a new output sequence stopping at the shortest input.
Examples
interpose and interleave are flexible functions for general sequential
processing. interpose, for example, works well with string concatenation:
(def grocery ["apple" "banana" "mango" "other fruits"])
❶ interpose can merge words together using a separator. No separator appears at the end of the
output sequence by default, which is a welcome feature in this scenario.
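The string concatenation the callout describes could look like this sketch:

```clojure
(def grocery ["apple" "banana" "mango" "other fruits"])

;; interpose the separator, then collapse everything into one string
(apply str (interpose ", " grocery))
;; => "apple, banana, mango, other fruits"
```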
❶ The same operation is translated into a transducer context. See the performance section for a
comparison between the two forms.
There are numerous examples in the book that can be used as a starting point to
observe how interpose or interleave are used in practice:
• random-sample features interleave laziness to alternate two infinite sequences of
"head" and "tail" results for a coin toss simulation.
• rand-nth contains an example of interpose to generate a sentence from a list of
words.
• partition contains an example of interpose used to format a SQL query for
submission.
©Manning Publications Co. To comment go to liveBook
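The team-rotation example itself is missing from this excerpt; a sketch consistent with the callouts below (the names team and shifts, and the sample teams, are assumptions) might be:

```clojure
;; Each name is repeated indefinitely, then interleaved with the others
;; in the order they enter the function.
(defn team [& names] ; ❶
  (apply interleave (map repeat names)))

;; Interleave the (infinite) member sequences of each team.
(defn shifts [& teams] ; ❷
  (apply interleave teams))

(take 10 (shifts (team :john :rob :jessica) ; ❸
                 (team :arthur :giles)))
```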
;; (:john :arthur :rob :giles :jessica :arthur :john :giles :rob :arthur)
❶ We need a way to define a team taking a list of names. We want the list to repeat so we can cross the
content with other teams. Each name is first repeated indefinitely, then interleaved with the others in
the order they enter the function.
❷ Second we need to interleave members of each team. interleave would stop at the shortest of
the team, but we made sure each team is instead a sequence of its members repeating ad infinitum.
The output of shifts is again an infinite sequence of interleaving team members.
❸ After creating a few teams for testing, we can see that the generated sequence rotates evenly through
the members of all teams. Members of smaller teams appear more often and need to do more work.
Inverse of interleave
interleave merges two sequences together by alternating their elements into a new sequence. We’d
like to figure out a function to invert the process and produce the opposite effect.
We need at least one change though: interleave takes a variable number of collections as input,
but we cannot have a "variable number of outputs" from a function in Clojure. The consequence is that
our function is going to return a sequence of the originally interleaved sequences.
Second, once we receive an interleaved input, we lose the information about how many input
sequences were interleaved in the first place. We need that number to be part of the input. Here’s how
we could approach the problem:
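A sketch of such an untangle function, following the recipe in the callouts below (the name and exact structure are assumptions); "n" is the number of originally interleaved sequences:

```clojure
(defn untangle
  ([n xs] (untangle n 0 xs))
  ([n i xs]
   (lazy-seq
     (when (< i n)
       ;; select the series of alternating items starting at offset i,
       ;; then shift one forward for the next iteration.
       (cons (take-nth n (drop i xs)) ; ❸
             (untangle n (inc i) xs)))))) ; ❹

(untangle 2 (interleave [1 2 3] [:a :b :c])) ; ❺
;; ((1 2 3) (:a :b :c))
```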
❸ take-nth is a good option for this problem, because it selects alternating elements from an input
sequence. We can start by selecting the first series of alternating items at index (0,2,4..) to cons into
the results. That’s the first sequence that was interleaved.
❹ We now shift one forward using take-nth again the next iteration.
❺ Calling untangle with 2 interleaved sequences extracts the original sequences back.
Another aspect to consider is laziness. interleave handles laziness by accepting infinite sequences as
input and creating a fully lazy output to consume. If the interleaved sequences from the input "xs" are
infinite, we need to be careful not to realize them when they are returned:
(def infinite ; ❶
(interleave
(iterate inc 1)
(iterate dec 0)
(iterate inc 1/2)))
See also
• concat concatenates collections without interleaving of their elements.
• clojure.string/join performs string merging with an optional separator. Consider
using clojure.string/join if you are not interested in laziness and only targeting
formatting of strings.
Performance Considerations and Implementation Details
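The two snippets compared by the callouts below are not in this excerpt; hedged reconstructions (the collection size is an assumption) might look like:

```clojure
;; Walking to the end last: already-consumed items can be garbage collected.
(last (interpose :x (range 100000000))) ; ❶

;; Holding xs while realizing the tail: the head must stay in memory.
(let [xs (interpose :x (range 100000000))]
  [(last xs) (first xs)]) ; ❷
```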
;; likely OOM
❶ Access to the end of the sequence is the last operation performed. All elements cached so far can be
garbage collected because nothing is holding an explicit reference to them.
❷ The end of the sequence is reached first, but the head of the sequence needs to stay around
for first to access the first element after that. Depending on JVM settings, this can lead to massive
garbage collection work that ends in an out-of-memory error.
❶ For this benchmark we are using a large public domain book which is split into lines (thus eliminating
the new line character).
❷ The interpose transducer runs as an eduction to produce a non-cached sequence as a result.
❸ Basic interpose is faster in this text.
❶ xform adds some processing on each line of the large text. After splitting into words, they are
changed into upper-case, words with numbers are removed and finally they are separated with the
pipe symbol.
❷ plainform is the nested sequential rendition of the same transducers chain.
The advantage in case of multiple step processing is still marginal, suggesting that
there are few cases where interpose transducer should be used to generate a sequence
instead of basic interpose. The results are inverted if we give up laziness, for example
to generate a single output string:
(import '[java.lang StringBuilder]) ; ❶
(quick-bench
(str
(reduce ; ❷
#(.append ^StringBuilder %1 %2)
(StringBuilder.)
(interpose "|" lines))))
;; Execution time mean : 14.763760 ms
(quick-bench
(transduce ; ❸
(interpose "|")
(completing #(.append ^StringBuilder %1 %2) str)
(StringBuilder.)
lines))
;; Execution time mean : 9.631605 ms
❶ We use a mutable StringBuilder to create the string incrementally and avoid the creation of many
intermediate strings.
❷ interpose is part of an initial sequential transformation. The sequence is then used as the input
for reduce. The reducing function appends the string accumulated so far (including pipe separators) to
a StringBuilder instance which is initially empty. We need to remember the type hints because
the StringBuilder instance is passed as a generic object into the reducing function and type
information is lost.
❸ The same operation is translated into transduce. interpose transducer is used instead and apart
from a couple of other small differences, the principle is the same.
❹ A further comparison with clojure.string/join shows similar performances. clojure.string/join is
definitely simpler if the only reason to process the input is to join into a final string.
We can see a speed improvement by using transduce instead of reduce (the equivalent
operation for standard interpose). To summarize: if the main goal of
using interpose is to completely evaluate the sequential output (giving up laziness),
it’s worth investigating interpose as a transducer. If laziness is still
important, the interpose transducer offers an advantage only when combined with other
sequential transformations in the same transducer chain.
(partition
([n coll])
([n step coll])
([n step pad coll]))
(partition-all
([n])
([n coll])
([n step coll]))
(partition-by
([f])
([f coll]))
The three partitioning functions in this paragraph create subsequences (a lazy sequence
containing other sequences) from an input collection. partition and partition-
all use a counter to decide when to split into the next sub-sequence:
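The snippet referenced by the callout below is not in this excerpt; a reconstruction consistent with its description:

```clojure
;; Partition whenever the number of digits changes.
(partition-by (comp count str) [1 9 10 99 100 999 1000]) ; ❶
;; ((1 9) (10 99) (100 999) (1000))
```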
❶ A new subsequence is created every time the number of digits in the list of numbers changes.
There are a few differences between the three functions that make them suitable for
different problems. We are going to see what they are and how to use them in the
contract and example sections.
Contract
Input
• "coll" is the input collection to partition. "coll" is mandatory for partition but not
for partition-by or partition-all that return a transducer in this case. "coll"
must be compatible with seq to generate a sequence in case "coll" is not a
sequence already.
• "n" establishes the max number of items in each partition. It’s mandatory
for partition and partition-all, which are based on it, but not for partition-
by, which instead uses a user-provided function to decide the split point. It can be
negative (in which case an empty list is returned), 0 (which returns a potentially
infinite sequence of empty lists) or positive (which is the most common case).
• "step" is only used by partition. It determines the distance (in terms of how
many items apart) at which each subsequence should start. It works similarly to an
offset, potentially repeating elements in multiple subsequences.
• "padding" is another collection only supported by partition. When there are
remaining items from the input collection that cannot fit in the
partitioning, partition can use the padding collection to fill the gaps and return
the remaining items. It can be empty or nil and needs to be supported by seq.
• "f" is the mandatory argument for partition-by. "f" is a function of 1 argument
returning any type. The returned values from "f" are compared and a new
subsequence is cut each time the value changes (as per equality contract).
Notable exceptions
Although not exceptions proper, it’s possible to run into an infinite sequence:
(partition 3 0 (range 10)) ; ❶
;; WARNING infinite sequence of ((0 1 2) (0 1 2)...)
❶ We are asking partition to return subsequences of 3 items each with a zero offset always restarting at
"0" to create the next partition. Use take to limit the amount of results you need.
Output
• returns: a lazy sequence of partitions of the input collection "coll" as dictated by
the given input parameters, or nil when "coll" is nil.
Examples
Let’s start with a few examples to illustrate padding and offset:
(partition 3 3 (range 10)) ; ❶
;; ((0 1 2) (3 4 5) (6 7 8))
(partition 3 2 (range 10)) ; ❷
;; ((0 1 2) (2 3 4) (4 5 6) (6 7 8))
❶ The default "step" is the same as the partition size "n": the next subsequence should start 3 elements
apart from the beginning of the previous one. The default step can be omitted without altering the
results.
❷ We decreased the "step" and we can see that each next subsequence is now starting 2 elements
apart from the beginning of the previous, even if this implies repeating the same item in different
subsequences.
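The padding example referenced by the callout below is missing from this excerpt; a reconstruction:

```clojure
(partition 3 3 [:a :b :c] (range 10)) ; ❶
;; ((0 1 2) (3 4 5) (6 7 8) (9 :a :b))
```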
❶ The last subsequence, the one with the number "9" in it, is now appearing thanks to the padding
collection that was also provided. The padding collection [:a :b :c] provides additional padding
elements to compensate for the missing 2 elements in the last partition.
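The partition-all snippets described by the callouts below are not in this excerpt; reconstructions consistent with their description:

```clojure
(partition-all 3 (range 10)) ; ❶
;; ((0 1 2) (3 4 5) (6 7 8) (9))

(partition-all 3 2 (range 10)) ; ❷
;; ((0 1 2) (2 3 4) (4 5 6) (6 7 8) (8 9))
```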
❶ If we don’t care about uneven partitions, partition-all achieves similar results to partition with
padding.
❷ partition-all also allows an offset specification that works similarly to partition with any items
left added to the last partition.
partition-by is used idiomatically with identity to transform the element itself into
the partitioning function. This has the effect of splitting the input collection for each
contiguous repetition of elements. For example, let’s assume a sensor network reads
the temperature every minute. To know how fast the temperature is changing we could
use partition-by:
(def temps [42 42 42 42 43 43 43 44 44 44 45
45 46 48 45 44 42 42 42 42 41 41])
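The snippet the callouts below refer to is missing from this excerpt; a reconstruction consistent with their reading of the results:

```clojure
;; Partition on each contiguous run of equal readings, then count
;; how long each temperature lasted.
(map count (partition-by identity temps)) ; ❶
;; (4 3 3 2 1 1 1 1 4 2)
```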
❶ partition-by used with identity has some interesting application in highlighting churning rates, such
as how much the temperature is changing by minute.
❷ We can read the results as: after 4 minutes the temperature changes, then after 3 minutes changes
again, then after 3 minutes and so on, until we see a different reading every 1 minute, which means a
sharp gradient change.
partition-by was also used for a simple sentiment analysis while showing
the identity function. partition-by was used there with the assumption that
sentiment is sometimes expressed in text as a repetition of letters. The reader is invited
to review the example to see how partition-by was used.
Finally, both partition-all and partition-by can be used with transducers:
(eduction ; ❶
(map range)
(partition-all 2)
(range 6))
;; ([() (0)] ; ❷
;; [(0 1) (0 1 2)]
;; [(0 1 2 3) (0 1 2 3 4)])
❶ eduction produces a cache-less sequence and takes any number of transducers as input (without the
need for comp).
❷ Note how in the case of transducers, the partition produced by partition-all or partition-by is
not a lazy sequence but a vector.
(defn partition-by [f coll] ; ❶
  (lazy-seq
    (when-let [s (seq coll)]
      (let [fst (first s)
            fv (f fst)
            run (cons fst (take-while #(= fv (f %)) (next s)))] ; ❷
        (cons run (partition-by f (seq (drop (count run) s))))))))
❶ partition-by as it is implemented in core, but stripped of the transducer implementation.
❷ The "recipe" of partitioning lives here. "coll" is accumulated with take-while until an (f item) is found
that is different from (f first-item). At that point we take the accumulation and we
invoke partition-by recursively with the remaining elements.
To change the partitioning strategy, we can change the take-while predicate. Here’s for example a
predicate "f" taking two arguments: the current item and the next item in the collection. Based on
returning true or false a new partition is created:
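A sketch of such a partition-with, along with the burst? predicate described by the callouts below (the names follow the callouts; using plain second offsets instead of real instants is an assumption):

```clojure
;; Like partition-by, but the predicate f receives the previous and the
;; current item; a new partition starts when (f prev curr) returns false.
(defn partition-with [f coll]
  (lazy-seq
    (when-let [s (seq coll)]
      (let [run (cons (first s)
                      (map second
                           (take-while (fn [[a b]] (f a b))
                                       (map vector s (rest s)))))]
        (cons run (partition-with f (drop (count run) s)))))))

(defn burst? [t1 t2] ; ❸
  (< (- t2 t1) 120))

(partition-with burst? [0 30 60 300 320 600]) ; ❹
;; ((0 30 60) (300 320) (600))
```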
❸ The burst? function is a predicate of two instants. It returns false if their difference is more than 120
seconds, indicating that the two events are too far apart to be considered part of the same group.
❹ We use partition-with as illustrated before, using the burst? function as the predicate. The result
contains a partitioning of the input events so that two or more events are together if their time
difference is below 2 minutes. If more than 2 minutes, then the event is considered part of the
following group.
See also
• mapcat and partition are often found together in a processing pipeline, because
while partition introduces a new nested level of partitions, mapcat removes the
level returning to a flat sequence.
• pmap is a parallel map implementation. It is often associated with partition-
all to perform parallel batch processing.
• “split-at and split-with” are similar partitioning functions that split the input
sequence into 2 parts only. Use split-at or split-with if you are looking for a
single split point in the sequence.
Performance Considerations and Implementation Details
⇒ O(n) linear
The partition functions are implemented on top of a relatively simple recursion that
starts every time a new partition is created. The number of iterations is linked to the
number of input items producing a linear behavior.
partition functions are lazy, producing just enough of the iterations as requested by
the caller:
(first (partition 3 (map #(do (println %) %) (range))))
;; 0
;; 1
;; 2
;; (0 1 2) ; ❶
❶ partition, partition-all and partition-by are lazy functions. As expected, we can see that only
the items necessary to form the first partition are realized.
Laziness with transducers works differently and in general is more eager. Even when a
single partition is requested, the partition-all transducer realizes n*(32+1) items.
This is because of how sequence works: it always requests at least 32 items from the result
of the transducer, which translates into 32 partitions of n elements each even when only
one is requested:
(first
(sequence
(comp (map #(do (print % ",") %)) ; ❶
(partition-all 100)) ; ❷
(range)))
;;0, 1, 2, ....., 3299, ; ❸
;;[0 1 2 ... 98 99] ; ❹
❶ We use an identity transducer which is just printing and passing the element back.
❷ partition-all is used here as transducer.
❸ The identity transducer prints up to 3299, reaching the (* 100 (+ 1 32)) = 3300th element in the
list.
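The snippets behind the two callouts below are missing from this excerpt; plausible reconstructions (note the second form never returns, so it is left commented out):

```clojure
;; Plain partition-by only realizes what's needed for the first partition.
(first (partition-by pos? (range))) ; ❶
;; (0)

;; The transducer version would try to realize the second partition,
;; which is infinite:
;; (first (sequence (partition-by pos?) (range))) ; ❷ never returns
```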
❶ partition-by is strictly lazy and just evaluates enough to give the result.
❷ The transducer version tries to realize the second partition which is infinite.
10.7 flatten
function since 1.2
(flatten [x])
flatten is a function that takes an arbitrarily nested collection and returns a sequence
where all the nested sequential collections have been removed:
(flatten [[1 2 [2 3] '(:x :y [nil []])]]) ; ❶
;; (1 2 2 3 :x :y nil)
Contract
Input
• "x" is the only mandatory argument. It can be any type including nil. When "x" is
not a sequential type (a sequential type returns true when sequential? is invoked
on it) then "x" alone is returned wrapped in a list. Types that are not sequential
include: maps, sets, transients, native arrays and Java iterables like ArrayList. If they
are present in "x" at any level, they simply won’t be iterated any further:
(import 'java.util.ArrayList)
(flatten [{:a 1} (doto (ArrayList.) (.add [1 2 3]))])
;; ({:a 1} [[1 2 3]])
Notable exceptions
None.
Output
• returns: a lazy sequence containing all the items that are not sequential at any level
of the input collection.
Examples
flatten is a useful function when upstream processing creates additional levels of
nesting. There are legitimate reasons why other Clojure functions wrap elements in
subsequences, and flatten can be used at the end of the processing pipeline to clean up
nesting when it becomes unnecessary.
A typical nested structure is the result of macro-expansion. We could decide to
understand which Clojure functions are used after expanding the result of a macro. We
know that core functions are generally prefixed with the clojure.core namespace
(although special forms are not and won’t appear in the output). We could
use flatten to surface all the symbols from their nested position and then clean-up the
results:
(require '[clojure.walk :as w])
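The core-fns helper itself is not in this excerpt; a plausible sketch consistent with the callouts below (the exact filtering regex is an assumption, and it relies on the require above):

```clojure
(defn core-fns [form]
  (->> (w/macroexpand-all form)               ; ❶ fully expand macros
       flatten                                ; ❷ surface nested symbols
       (map str)
       (filter #(re-find #"^clojure\.core/" %))
       distinct))                             ; ❸ remove repeats
```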
(core-fns ; ❹
'(for [[head & others] coll
:while #(< i %)
:let [a (mod i 2)]]
(when (zero? a)
(doseq [item others]
(print item)))))
❶ macroexpand-all is a function that, given a form, invokes macroexpansion recursively until there are no
more macros to expand and then returns the expanded form. This is usually quite a bit bigger than the
original form, depending on the usage and complexity of the macros in it.
❷ flatten called on the expanded form returns 271 symbols after unwrapping any level of nesting.
❸ After transforming symbols into strings and pattern matching on them, we need to remove repeating
function names.
❹ Finally, an example for loop form is used as input. We can see in the results which functions
our for-loop uses (although special forms like let*, if, loop or recur are also used, they are not
visible).
See also
• mapcat applies a function to each element in a collection. Assuming the
transformation introduces an additional layer of sequences, mapcat also removes
that layer. When used with the identity function, mapcat can be used to remove
one layer of nested collections.
• transducer cat and reducer cat apply a similar concept to mapcat to remove a
single nesting from their input collection.
Performance Considerations and Implementation Details
⇒ O(n) linear
To walk the nested levels of the input collection, flatten needs to reach for every
sequential collection and unwrap its content. It follows that the number of steps to
perform is linear in the amount of elements at any level and of any type in the input
collection.
From the implementation perspective, flatten is built on top of tree-seq, which
performs a lazy depth-first walk of the nested structure. It is then a matter of
descending into sequential collections and keeping only their non-sequential content.
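This is essentially how flatten is defined in core (renamed flatten* here to avoid clashing with the real one):

```clojure
;; tree-seq walks the tree lazily, descending into anything sequential;
;; the filter keeps only the leaves (non-sequential items).
(defn flatten* [x]
  (filter (complement sequential?)
          (rest (tree-seq sequential? seq x))))

(flatten* [[1 [2 [3]] 4]])
;; (1 2 3 4)
```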
flatten operates lazily and will only pull enough of input sequence to output the
requested result:
(->>
(range) ; ❶
(map range)
(map-indexed vector)
flatten
(take 10))
;; (0 1 0 2 0 1 3 0 1 2)
❶ The input source for this processing chain is the infinite sequence of integers produced by
the range invocation. It can be used safely with flatten.
(distinct
([])
([coll]))
(dedupe
([])
([coll]))
(distinct?
([x])
([x y])
([x y & more]))
distinct and dedupe remove duplicates from an input collection, while distinct? just
reports their presence, returning true or false. distinct and distinct? detect
all duplicates of the same item in a collection (or list of arguments) while dedupe only
removes contiguous repetitions:
(distinct [1 2 1 1 3 2 4 1]) ; ❶
;; (1 2 3 4)
(distinct? 1 2 3 2 4 1) ; ❷
;; false
(dedupe [1 2 1 1 3 2 4 1]) ; ❸
;; (1 2 1 3 2 4 1)
❶ distinct removes duplicated items throughout the collection, independently from their relative
position.
❷ distinct? detects duplicates presence also independently from their position.
❸ dedupe only removes contiguous repetitions of the same item, allowing duplicates which are at least 1
element apart.
distinct, dedupe and distinct? use Clojure’s extended equality semantics, accepting
both scalars (numbers, keywords, symbols and so on) and compound values in
collections. Clojure equality uses "compatibility groups" to decide when two items are
the same (please refer to = for an exhaustive explanation).
distinct, dedupe and distinct? are used quite frequently in Clojure, reflecting the
many problems in computer science that deal with duplicates (for example,
data deduplication 171).
Contract
There are small contract differences between the three functions illustrated in this
section. distinct does not accept sets or maps as arguments (including their native Java
forms HashMap and HashSet), while dedupe accepts them. Maps and sets already detect
duplicates at construction time, so there is little sense in feeding them to dedupe.
Input
distinct and dedupe:
• "coll" is optional and can be nil. When "coll" is not present, both functions return
the related transducer. transients are not accepted.
• distinct does not allow sets or maps (throwing exception).
• dedupe works on all collection types excluding transients.
171
See https://en.wikipedia.org/wiki/Data_deduplication
distinct?:
• "x", "y" and "more" form the typical signature of variadic functions like distinct?.
"x", "y" and "more" can be any kind of Clojure form, literal or nil. At least one
argument is required.
Notable exceptions
• UnsupportedOperationException when a set, map, HashMap or HashSet are used
as an argument for distinct.
• ArityException when distinct? is invoked without arguments. A specific case
which is less easy to detect is when distinct? is used
with apply: (apply distinct? []) produces the exception on empty collections,
forcing a check ahead of time, for example: (and (seq []) (apply distinct?
[]))
Output
distinct and dedupe:
• A lazy sequence of the non repeated items in "coll". dedupe allows duplicates
which are at least 1 item apart.
• An empty list when "coll" is nil.
distinct?:
• true when there are no duplicates independently from their position in the list of
arguments.
• false where there is at least one duplicate item.
Examples
A voting system allows for a maximum of 5 votes for 3 distinct candidates. Users of
the system might double vote for a candidate (either on purpose or by mistake) and we
want to be sure that when votes are counted, we discard any additional vote for a
candidate which is coming from the same user:
(def votes [ ; ❶
{:id 14637 :vote 3 :secs 5}
{:id 39212 :vote 4 :secs 9}
{:id 39212 :vote 4 :secs 9}
{:id 14637 :vote 2 :secs 43}
{:id 39212 :vote 4 :secs 121}
{:id 39212 :vote 4 :secs 121}
{:id 45678 :vote 1 :secs 19}])
(->> votes ; ❷
(group-by :id)
(reduce-kv
(fn [m user votes]
(assoc m user (distinct (map :vote votes))))
{}))
❶ Votes enter the system as a list of all the votes at the end of the competition. Here we show a small
sample of a much larger list. Each ":id" identifies a user, followed by a ":vote" for a candidate as a number
and finally the number of seconds elapsed since the beginning of the competition.
❷ After grouping votes by users using group-by we can see how many votes each candidate received
from each user. We can process the map with “reduce-kv” and make sure that each value list does not
contain duplicates for a specific candidate using distinct.
After analyzing the data, it is discovered that a problem with the voting hardware is
generating "bursts of clicks" each time the user presses a button on the voting remote
control. We want to get rid of the unwanted clicks as early as possible, as they present
a problem for the scalability of the system. Luckily for us, the problematic clicks
happen a few milliseconds apart so we can clearly tell which votes are to be discarded,
as they appear exactly the same in the list. We can do this with dedupe ahead of
the group-by:
(->> votes ; ❶
dedupe
(group-by :id)
(reduce-kv
(fn [m user votes]
(assoc m user (distinct (map :vote votes))))
{}))
❶ We execute the same operation as before, but we get rid of contiguous duplicates first thing
with dedupe.
❷ The results are the same as before, as expected.
distinct? can be used as a predicate to find collections of distinct items. Clojure, for
example, uses distinct? internally to find a suitable combination of hashes when
implementing case 172. case needs to adapt test expressions so they can fit the switch
table of the corresponding JVM instruction, which requires distinct integer keys.
When case test constants are generic objects, case calculates the hash of each object
and then attempts several bit/mask combinations to find a transformation that produces
distinct keys:
(def max-mask-bits 13) ; ❶
172
Please see case implementation in core at this link https://github.com/clojure/clojure/blob/clojure-
1.8.0/src/clj/clojure/core.clj#L6343
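The definitions of shift-mask and maybe-min-hash are missing from this excerpt; sketches closely following their private definitions in clojure.core (the exact iteration order of the for loop is a best-effort reconstruction):

```clojure
;; Apply a bit-shift and a bit-mask to a hash number.
(defn shift-mask [shift mask x] ; ❷
  (-> x (bit-shift-right shift) (bit-and mask)))

;; Try every shift/mask permutation, keeping the first one that
;; produces distinct keys for all the given hashes.
(defn maybe-min-hash [hashes] ; ❸
  (first
    (filter (fn [[s m]] ; ❹
              (apply distinct? (map #(shift-mask s m %) hashes)))
            (for [mask (map #(dec (bit-shift-left 1 %))
                            (range 1 (inc max-mask-bits)))
                  shift (range 0 31)]
              [shift mask]))))
```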
(maybe-min-hash
(map (memfn hashCode) [:a :b :c :d])) ; ❺
;; [1 3]
(map #(shift-mask 1 3 %)
(map (memfn hashCode) [:a :b :c :d]))
;; (0 2 1 3)
❶ The specific max size of the mask bits is because of the 32-bit size allowed for the JVM tableswitch
instruction 173.
❷ shift-mask applies a shift of bits and the specific mask to a hash number. This is ultimately the
transformation that we want to apply to each test case expression, but only if it’s not producing
duplicates.
❸ maybe-min-hash generates all possible permutations of a bit-shift and a bit-mask using a for loop. It
then applies them to the given hashes resulting in a collection from which we only want the first
combination that is not producing duplicates.
❹ apply distinct? takes the result of the shift-mask transformation and verifies that it produces distinct
results for each hashed test expression of the case statement. We don’t want the actual collection of
distinct values, so we use the predicate to filter the shift-mask combinations instead.
❺ We can see how to use maybe-min-hash on [:a :b :c :d]. The result "[1 3]" says that by bit-
shift-right each hash of "1" and then bit-and each hash with mask "3" generates distinct keys to
be used in the generated JVM instruction.
❻ This shows how the case expression for this example would look. The 4 keywords used as test
expressions can be encoded as the integer keys "0,2,1,3" with the shift-mask transformation "[1 3]".
As transducers, distinct and dedupe can be used invoking their zero-arity version:
(sequence
(comp
(map range)
cat
(distinct)) ; ❶
(range 10))
173
there is some complexity related to the way the JVM implements a fast lookup switch, if you want to know more this is a
good starting pointhttps://stackoverflow.com/questions/10287700/difference-between-jvms-lookupswitch-and-tableswitch
;; (0 1 2 3 4 5 6 7 8)
(sequence
(dedupe) ; ❷
[1 1 1 2 1 1 1 3 1 1])
;; (1 2 1 3 1)
❶ distinct as a transducer requires wrapping inside parentheses, unlike “cat” which appears directly
above.
❷ dedupe transducer removes contiguous duplicates like the sequential version.
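The duplicates collection and the sort-based snippet referenced below are missing from this excerpt; a hypothetical value and reconstruction consistent with the shown outputs:

```clojure
;; Hypothetical sample data (not from the original text).
(def duplicates [8 1 2 7 3 8 1 2 7 3])

;; sort groups equal items together, so dedupe removes all duplicates.
(dedupe (sort duplicates))
;; (1 2 3 7 8)
```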
(distinct duplicates) ; ❶
;; (8 1 2 7 3)
❶ distinct, as implemented in the standard library, maintains ordering of the original collection while
removing duplicates.
❷ sort has the effect of grouping duplicates together in a way that dedupe can completely remove.
Both distinct and dedupe with sort remove all duplicates, but they return the same list of numbers in
different order. There is also a data structure with very similar properties: hash sets, which force
uniqueness of their elements by design, can be used to remove duplicates:
(set duplicates) ; ❶
;; #{7 1 3 2 8}
❶ set can be used to create a Clojure set directly from another collection, producing no duplicates.
We’ve seen many ways of removing duplicates in this section. Which one to use depends on many factors:
• Constraints on the ordering of the output: if the ordering of the initial collection is important,
then distinct is the primary choice.
• Presence of transformations on the input collection: then the best choice is to use the transducer
version of distinct or dedupe.
• Need for checking the presence of an element while removing duplicates: transforming the input into a
hash-set offers fast lookups and duplicate removal at the same time.
See also
• sort was mentioned a few times. sort does not remove duplicates, but puts the
input collection in a state where duplicates are immediately adjacent. dedupe can be used
in conjunction with sort to obtain a form of ordered distinct.
• set produces a Clojure hash-set from an input collection, automatically removing
duplicates.
❶ distinct is the laziest, consuming exactly the required amount of items to satisfy the request.
❷ dedupe is semi-lazy, as it is implemented on top of its transducer version, which makes use
of sequence, and sequence always consumes the first 32 items.
❶ The Criterium library that we used throughout the book is available on GitHub.
❷ with-dupes is a helper function that creates a collection of n*n elements containing duplicates.
❸ The benchmark compares both distinct and dedupe in their standard and transducer versions.
Some of the Criterium output is omitted for clarity.
The results of the simple benchmark show that the distinct transducer version
outperforms the standard version by a significant margin, while dedupe performance
is roughly the same. A more precise benchmark should take into account the amount
of duplicates present in the original input and their ordering, as they both influence the
final result, especially in the case of dedupe.
10.9 take-nth
function since 1.0
(take-nth
([n])
([n coll]))
take-nth selects elements from a sequence (the first is included by default). The
next items to include are identified by repeatedly dropping the same number of items:
(take-nth 3 [0 1 2 3 4 5 6 7 8 9]) ; ❶
;; (0 3 6 9)
❶ take-nth selects "0" as the first element of the output. Then it skips 3 items to reach the number "3"
which is added to the output. The process repeats until reaching the end of the input.
❶ take-nth without the collection arguments returns a transducer with similar capabilities.
Contract
Input
• "n" is the number of elements to drop after the first to reach the next element to
include in the output. take-nth requires a positive number greater than zero.
Decimal numbers are possible but get rounded.
• "coll" can be any sequential collection and is an optional argument.
Notable exceptions
• ArithmeticException divide by zero error: only on the transducer version for n =
0 and non empty "coll", such as (into [] (take-nth 0) [1 2 3]).
• NullPointerException for the transducer version, when "n" is nil and "coll"
contains at least one element.
Output
The transducer and the basic version differ in the treatment of corner cases. For the
normal case with "n" a positive integer, take-nth returns the sequence generated by
taking the first element from "coll", dropping "n" - 1 elements, then taking the element
at "n", then dropping "n" - 1 elements and so on.
take-nth base version:
take-nth transducer:
Examples
take-nth is a natural solution for the problem of generating the multiples of a number:
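The mult-n snippet referenced by the callouts below is not in this excerpt; a reconstruction consistent with their description:

```clojure
;; Multiples of n: take every nth integer, dropping the initial zero.
(defn mult-n [n] ; ❶
  (rest (take-nth n (range))))

(take 5 (mult-n 11)) ; ❷
;; (11 22 33 44 55)
(take 5 (mult-n 42))
;; (42 84 126 168 210)
```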
❶ mult-n is a function that, given a number, generates an infinite sequence of its multiples. We
use rest to drop the initial zero from the list.
❷ We can see how to generate multiples for the number "11" and "42".
take-nth is also useful when handling a variable number of key-value pair arguments.
Here’s for example a function to create a sparse vector, a vector that has zeros at every
index except for those indicated by the arguments:
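The sparsev definition is missing from this excerpt; a possible sketch, assuming the arguments come in index/value pairs (as the usage below suggests):

```clojure
(defn sparsev [& kvs]
  (let [idxs (take-nth 2 kvs)          ; the index of each pair
        vs   (take-nth 2 (rest kvs))]  ; the value of each pair
    (reduce (fn [v [i x]] (assoc v i x))
            (vec (repeat (inc (apply max idxs)) 0))
            (map vector idxs vs))))
```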
(sparsev 1 4 3 7 21 8)
;; [0 4 0 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8]
Writing drop-nth
take-nth almost naturally calls for a drop-nth with similar features but inverted meaning: generate
the lazy sequence of items which is left after we select each "nth" element in the input sequence. One
option is to open up take-nth sources and change the relevant parts:
❶ Apart from the name change, the general design of the function stays the same.
❷ Instead of cons results, we need to concat them, because we select the multiple elements between
the nth gaps instead of a single nth element. We need to skip the first item which is in nth position and
then take up to nth - 1 more items.
❸ The next iteration uses the sequence after dropping everything up to the next nth element.
Alternatively, we could use rem to see which items corresponds to which index and implement drop-
nth on top of keep-indexed:
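A sketch of the keep-indexed alternative (note that keep-indexed discards nil results, so nil items in the input would be lost):

```clojure
;; Keep only the items whose index is not a multiple of n.
(defn drop-nth [n coll]
  (keep-indexed (fn [i x] (when (pos? (rem i n)) x)) coll))

(drop-nth 3 (range 10))
;; (1 2 4 5 7 8)
```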
See also
• filter removes elements from a sequence using a predicate instead of using the
distance between the elements.
• split-at splits a sequence into two parts at the requested index.
• partition also splits a sequence into subsequences of the requested size.
Performance Considerations and Implementation Details
❶ take-nth basic version is benchmarked against the transducer version in the same scenario of
generating another sequence. Note the use of last to fully evaluate the sequence.
❷ We can see that the transducer version is slightly faster. This is in part due to the different
implementation used for the transducer version of take-nth.
(split-at [n coll])
(split-with [pred coll])
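The calls that the callouts below refer to are not shown in this excerpt; equivalents consistent with the descriptions might look like:

```clojure
(split-at 8 (range 10))         ; ❶ split when we reach the 8th element
;; [(0 1 2 3 4 5 6 7) (8 9)]

(split-with pos? [2 1 0 3 4])   ; ❷ split on the first zero
;; [(2 1) (0 3 4)]
```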
❶ split-at is used with a number representing the index at which we want the split. In this case for
example we are indicating we want a split when we reach the 8th element in the sequence.
❷ split-with works with a predicate in the same way as take-while, splitting the input as soon as the
predicate turns false when evaluated on an item. We can see that the split happens on the first
appearance of a zero.
Input
• "n" can be any integer equal to or greater than zero. Negative numbers are possible,
but their effect is the same as "n" zero. Decimal numbers are also possible
and are rounded up to the nearest integer. It is a required argument for split-at.
• "pred" is a function of one argument returning logical true or false. It is a required
argument for split-with.
• "coll" is any sequential collection (as per seq contract) or nil.
Notable exceptions
• IllegalArgumentException when "coll" is not sequential (as per seq contract).
Output
• split-at returns a vector of two elements: the first is a lazy sequence of the first
"n" items in "coll" and the second is a lazy sequence of the remaining (- (count coll) n) items.
• split-with returns a vector of two elements: the first is the lazy sequence of
elements up to the first false evaluation of "pred" and the second contains all the
remaining items.
Examples
split-at and split-with vector output can be destructured easily:
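The snippet described in ❶ is omitted from this excerpt; a sketch of the destructuring it describes:

```clojure
(let [[small big] (split-at 3 [1 2 3 4 5])]  ; ❶ destructure the result vector
  {:sum-small (apply + small)                ; process the two parts independently
   :sum-big   (apply + big)})
;; {:sum-small 6, :sum-big 9}
```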
❶ The vector results from split-at can be easily destructured, so the two parts can be processed
independently.
Both functions work lazily on their input and the fact that results are returned in
a vector doesn’t mean laziness is lost. For example:
(take 10 (last (split-at 10 (range)))) ; ❶
;; (10 11 12 13 14 15 16 17 18 19)
❶ split-at is invoked on an infinite range. The first result can be fully evaluated but the last needs
bounded access with take.
The predicate option can be used with sets or maps to split based on existence of a key.
Since the split happens on the first false evaluation of the predicate, we need to
remember to complement the predicate:
(split-with (complement #{\a \e \i \o \u}) "hello") ; ❶
;; [(\h) (\e \l \l \o)]
❶ We created a set literal containing vowels and used it as a predicate in its complemented form. The
meaning of the expression is: return the split of the word at the first occurrence of any vowel.
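The sorted-map example that callout ❷ below refers to is missing from this excerpt; a possible equivalent, splitting on the keys of map entries:

```clojure
(split-with #(< (key %) 3)                      ; ❷ predicate on each map entry
            (sorted-map 1 :a 2 :b 3 :c 4 :d))
;; [([1 :a] [2 :b]) ([3 :c] [4 :d])]
```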
❷ split-with is used on a sorted-map. The map needs to be sorted for the split to be consistent and
repeatable, otherwise the order of the map is undetermined and as a consequence also the split
output is undetermined.
split-with only splits the sequence the first time the predicate returns false. Additional
items with the same property do not cause the sequence to split again. The
following split-by function recursively calls split-with to further split the sequence
and returns all the partitions found:
(defn split-by [pred coll] ; ❶
  (lazy-seq
    (when-let [s (seq coll)] ; ❷
      (let [!pred (complement pred)
            [xs ys] (split-with !pred s)] ; ❸
        (if (seq xs) ; ❹
          (cons xs (split-by pred ys)) ; ❺
          (let [skip (take-while pred s) ; ❻
                others (drop-while pred s)
                [xs ys] (split-with !pred others)] ; ❼
            (cons (concat skip xs)
                  (split-by pred ys))))))))
❶ split-by follows the typical lazy sequence organization pattern. There are two possible cons points
before calling split-by recursively.
❷ This condition verifies the end of the input and terminates the recursion.
❸ During each recursion we move a virtual cursor forward in the input sequence and we can
invoke split-with at that point.
❹ There are two possible outcomes: either the sequence starts with a splitting point (or several of them
in a row), or the splitting point is beyond the first element. split-with returns an empty "xs"
sequence when the split point is the first element.
❺ In case the split point is beyond the first element, "xs" is the first batch of results to cons into the
results. We recur with the rest "ys".
❻ In case the split point is on the first element we need additional processing to discover the next split
point without throwing away elements. To do so we take-while until the predicate becomes false,
which reaches our next split point. We call these initial elements "skip" elements. "others" is anything
beyond the new splitting point.
❼ We can now apply a new split-with on the "others" and proceed from there. What goes into the
results with cons is the concatenation of the item we had to "skip" with the new group "xs". We then
recur with the rest of items beyond the second split point.
❽ We can use split-by to partition a list of integers at multiples of 5. Note that partition-by can do
something similar, with the difference that splitting items are isolated in their own partition.
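The call that callout ❽ describes is not shown in this excerpt; it might look like the following (split-by is repeated here so the snippet is self-contained):

```clojure
;; split-by as defined above:
(defn split-by [pred coll]
  (lazy-seq
    (when-let [s (seq coll)]
      (let [!pred (complement pred)
            [xs ys] (split-with !pred s)]
        (if (seq xs)
          (cons xs (split-by pred ys))
          (let [skip (take-while pred s)
                others (drop-while pred s)
                [xs ys] (split-with !pred others)]
            (cons (concat skip xs)
                  (split-by pred ys))))))))

(split-by #(zero? (rem % 5)) (range 12)) ; ❽ split at each multiple of 5
;; ((0 1 2 3 4) (5 6 7 8 9) (10 11))
```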
See also
• “partition, partition-all and partition-by” are generic partitioning functions with
support for multiple splitting points.
• drop-while and take-while work on similar principles as split-with.
Performance Considerations and Implementation Details
10.11 when-first
macro since 1.0
Contract
(when-first bindings <body>)
Input
• "bindings" is a vector of exactly 2 elements: the "name" of the local binding and
its "value". "name" should be a valid Clojure symbol and "value" any sequential
collection (as supported by seq).
• "body" can be any Clojure form. The "name" local binding is made available to
the body at compile time (during macro-expansion).
Notable exceptions
• IllegalArgumentException if "bindings" does not contain exactly 2 arguments or
it is not a vector.
Output
• returns: the result of evaluating "body" when "value" is a collection with at least
one item, nil otherwise. "name" is available as a local binding during the evaluation
of "body".
Examples
One good use of when-first is to improve the readability of the idiomatic lazy recursive
loop. The following dechunk function loops over the input to remove chunking from
chunked sequences (for example ranges or vectors). Sequence chunking is normally a
useful feature, but there are cases where we want full control over evaluation (for
example when dealing with costly or side-effecting inputs):
(first (map #(do (print ".") %) (range 100))) ; ❶
;; ................................0
❶ In this first experiment, we use a side-effecting map operation to print a "." dot each requested
element. We can see that by asking the first element from a range, 32 dots are printed.
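The dechunk definition that callouts ❷ and ❸ below describe is missing from this excerpt; a sketch consistent with the description:

```clojure
(defn dechunk [coll]            ; ❷ rebuilds coll lazily, one item at a time
  (lazy-seq
    (when-first [x coll]        ; ❸ binds the head without an explicit (first coll)
      (cons x (dechunk (rest coll))))))

(first (map #(do (print ".") %) (dechunk (range 100)))) ; ❹
;; .0
```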
❷ dechunk implements a simple lazy loop that cons the first item of the input into the output sequence
without any additional transformation.
❸ By using when-first, we can avoid using first on "x" removing one set of parenthesis.
❹ We can use dechunk in front of the chunked sequence to prevent the rest of the sequential
computation from being performed in chunks of 32 items. The print of a single dot confirms that
chunking has been removed for the downstream consumers, although range itself still evaluates
32 items at a time internally.
when-first also avoids double evaluation of the input by reusing the transformation of
the input into a sequence to extract the first item. Please check the performance section
below for additional details on when-first evaluation policy.
See also
• first is used to access the first item of a sequential collection.
• when-let implements a similar mechanism to when-first, creating a local binding
to a generic object (collection or not) if and only if it evaluates to logical true.
Performance Considerations and Implementation Details
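The expansion that the callouts below describe can be reproduced at the REPL; the gensym names will differ between runs (the temp_123 name below is illustrative only):

```clojure
(require '[clojure.pprint :refer [write]])      ; ❶ write pretty-prints the expansion

(write (macroexpand-1 '(when-first [x coll] x)))
;; Prints something equivalent to (namespaces and gensyms elided):
;; (when-let [temp_123 (seq coll)]  ; ❷ coll is evaluated once, through seq
;;   (let [x (first temp_123)]      ; ❸ first binds the head element to x
;;     x))
```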
❶ We use write to format the macro-expansion into multiple lines. Namespaces have been removed
from functions for clarity.
❷ seq is called on "coll" with the effect of creating a sequential version (if it’s not already a sequence)
and generating a nil in case "coll" is empty.
❸ If "temp_123" (a randomly generated symbol name) is not nil, it means we have some content to
process. This is where we use first to bind the first element of the input to "x".
We can see from the macro expansion that the input "coll" is only evaluated once as
the argument for seq. This feature could be important depending on the kind of input.
If "coll" is not a cached sequence for instance, we might be interested in preventing
multiple evaluations. The following take-first function produces a lazy sequence of
the first element from the input collection. The first implementation illustrates the
problem:
(defn take-first [coll] ; ❶
  (lazy-seq
    (when (seq coll) ; ❷
      (cons (first coll) ()))))
❶ take-first is a simple function to create a lazy sequence containing only the first element from the
input "coll".
❷ seq can be used to verify whether a collection is empty. We don't want to push a nil into
the generated output, so we first check that there is something in "coll".
❸ take-first is invoked on an input produced with sequence. Note that we print "eval" for each item
that is evaluated. We can see a single print of "eval" for the number 1.
❹ Now take-first is used on a sequence produced with eduction which is a non-caching type of
sequence. We can see two prints for the evaluation of the number 1.
The provided implementation of take-first evaluates the input twice: once to check if
it's empty and a second time to get the first element. The only reason why we don't see
double evaluation is the implicit caching provided by sequence.
eduction makes the problem clear because it does not cache evaluations of the input.
©Manning Publications Co. To comment go to liveBook
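The improved take-first that the callouts below refer to is omitted from this excerpt; based on the description it was presumably:

```clojure
(defn take-first [coll]     ; ❶ when-first evaluates coll only once
  (lazy-seq
    (when-first [x coll]    ; binds the head and checks for emptiness in one step
      (cons x ()))))
```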
❶ The second version of take-first uses when-first, which evaluates "coll" just once. It also removes
one function call to first.
❷ The new version of take-first not only produces the same results with a caching sequence, but is
also more efficient: it avoids going back to the cache to get the first item again.
❸ The test with a non-caching sequence now shows that we are not evaluating the input twice.
(chunk-buffer [capacity])
(chunk-append [b x])
(chunk [b])
The functions in this section (chunk-* from now on, for brevity) are part of the chunking
sequence abstraction. chunked-seq? is also part of the same group, but it has been
described in another section.
Chunking is a Clojure feature that allows data structures to enforce a specific fetching
granularity during sequential iteration. Without chunking, a lazy sequence always
realizes one item at a time, advancing the related iterator one position forward in
the collection. With chunking, more than one item can be realized at a time,
even if they are not consumed straight away. Chunked items are parked in an
intermediate buffer that provides elements until the end of the chunk. If
more items are requested, another chunk is created, positioned in the buffer
and consumed. The cycle repeats until the end of the input.
Chunking is mainly a performance optimization that leverages the internal data layout
of some sequential collections: vectors, vector-of and ranges are among those that
benefit the most from chunking. The chunk-* functions in this section allow other data
sources to take advantage of chunking to improve their performance during sequential
processing and can be divided into two groups:
1. chunk-cons, chunk-first, chunk-rest and chunk-next are almost drop-in
replacements for cons construction in lazy sequences along
with first, rest and next.
2. chunk-buffer, chunk-append and chunk serve the purpose of creating the
intermediate (mutable) buffer and handing it back to chunk-cons. Buffers
should be treated as internal artifacts, used for the sole purpose of processing
chunked sequences.
WARNING chunk-* functions were released as part of Clojure 1.1 and labeled "implementation
details" 174. The release note also specifies that they are public to allow experimentation. Ten
years later, chunking functions are used extensively in the standard library and, despite still
being undocumented, there are no signs of them being deprecated, changed or removed.
Contract
Input
• "chunk" is an object of type clojure.lang.IChunk, which is the return type
of chunk and the input for chunk-cons.
• "rest" is any sequential collection (as per seq contract), not necessarily chunked.
• "s" indicates a chunked sequence, such that (chunked-seq? s) returns true.
• "capacity" is a positive integer (must be less than Integer/MAX_VALUE) that
represents the size of the buffer.
• "b" must be an object of type clojure.lang.ChunkBuffer which is essentially a
wrapper around a Java object array.
• "x" can be any object.
Notable exceptions
• NullPointerException if "s", "capacity" or "b" is nil.
• NegativeArraySizeException if "capacity" is less than 0.
• ArrayIndexOutOfBoundsException when attempting to chunk-append on a
full chunk-buffer.
• ClassCastException trying to chunk-first, chunk-rest or chunk-next on
something that is not a chunked sequence. Notably, (chunk-rest ()) produces
the error, because the empty list is sequential but not chunked.
174
The full text of Clojure release 1.1 is visible here
https://github.com/richhickey/clojure/blob/68aa96d832703f98f80b18cecc877e3b93bc5d26/changes.txt#L92 with this
link pointing at chunked sequence functions
Output
Depending on the specific function:
• chunk-cons returns a clojure.lang.ChunkedCons object, similarly to
the clojure.lang.Cons object returned by cons.
• chunk-first returns the first chunk of a chunked sequence input.
• chunk-rest returns the rest of a chunked sequential collection after removing the
first chunk, empty list otherwise.
• chunk-next returns the rest of a chunked sequential collection after removing the
first chunk, nil otherwise.
• chunk-buffer returns a clojure.lang.ChunkBuffer of the given "capacity".
• chunk-append adds an element to a clojure.lang.ChunkBuffer instance, up to the
available space in the buffer.
• chunk returns an object of type clojure.lang.IChunk given
a clojure.lang.ChunkBuffer instance.
Examples
Copying the data into a buffer is necessary because chunk-first returns an
(immutable) view of the internal state of the collection that cannot be processed
directly. A buffer can be created and used as follows:
(def b (chunk-buffer 10)) ; ❶
(chunk-append b 0) ; ❷
(chunk-append b 1)
(chunk-append b 2)
(def first-chunk (chunk b)) ; ❸
(chunk-cons first-chunk ()) ; ❹
;; (0 1 2)
Please note that a buffer that was used to create a chunk becomes unusable:
(def b (chunk-buffer 10))
(chunk-append b 0)
(chunk b)
(chunk-append b 0) ; ❶
;; NullPointerException clojure.lang.ChunkBuffer.add
❶ Once a buffer has been transformed into a chunk, any following attempt to chunk-append to the buffer
fails with an exception. A buffer should be used once to create a chunk and then thrown away.
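The example walked through by the callouts below is missing from this excerpt; a sketch consistent with the steps (using dotimes rather than the interop reduce mentioned in ❹, and the hypothetical name chunked-map):

```clojure
(defn chunked-map
  "Maps f over coll one chunk at a time, preserving chunking."
  [f coll]
  (lazy-seq
    (when-let [s (seq coll)]                      ; ❶ at least one item?
      (if (chunked-seq? s)
        (let [cf (chunk-first s)                  ; ❷ first chunk (an IChunk)
              b  (chunk-buffer (count cf))]       ; ❸ new buffer of matching size
          (dotimes [i (count cf)]                 ; ❹ transform each chunk item
            (chunk-append b (f (nth cf i))))
          (chunk-cons (chunk b)                   ; ❺ chunk the buffer and recurse
                      (chunked-map f (chunk-rest s))))
        (cons (f (first s)) (chunked-map f (rest s)))))))
```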
❶ Before processing we need to verify that there is at least one item in "coll".
❷ We can proceed and take chunk-first from the chunked sequence.
❸ A new chunk-buffer is created and assigned as local binding.
❹ Chunked objects such as range or vector are equipped with a reduce implementation that is
accessible only through Java interop.
❺ The buffer is converted into the corresponding chunked array instance which is used by chunk-
cons to go into recursion and gradually build a lazy chunked sequence.
In the following example we are going to create our own chunked sequence. Data on a
physical device (such as a disk or network) might be subject to hardware-related
constraints, like a specific storage block size or transmission packet size. On a file
system, for instance, data is usually organized in blocks to produce uniform allocation
and predictable performance. We want to be able to read bytes from a file lazily and
choose the optimal chunk size for the device:
(import '[java.io FileInputStream InputStream])
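The byte-seq definition that the callouts below walk through is omitted from this excerpt; a hypothetical reconstruction matching the description (the import above is repeated so the snippet is self-contained):

```clojure
(import '[java.io FileInputStream InputStream])

(defn byte-seq [^InputStream is size]
  (let [ib (byte-array size)]                 ; ❷ read buffer of the chosen size
    ((fn step []                              ; ❶ defined and immediately invoked
       (lazy-seq
         (let [n (.read is ib)]               ; ❸ bytes read, -1 at end of input
           (when-not (== -1 n)
             (let [cb (chunk-buffer n)]       ; ❹ second, chunk-typed buffer
               (dotimes [i n]                 ; ❺ transfer ib into cb
                 (chunk-append cb (aget ib i)))
               (chunk-cons (chunk cb) (step))))))))))  ; ❻ recurse

;; ❼ consume inside with-open so the stream is closed after processing,
;; for example (the file name is hypothetical):
;; (with-open [is (FileInputStream. "data.bin")]
;;   (doall (take 20 (byte-seq is 4096))))
```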
❶ byte-seq follows the typical lazy sequence generation pattern, enclosing the body of
the step function in a lazy-seq call. Both "is" and "ib" (the input stream and input buffer respectively)
are mutable objects and don't need to be passed as parameters to the inner step. So
the step function is defined and immediately invoked (note the additional set of parentheses).
❷ "ib" is a byte-array buffer whose size tells the .read operation how many bytes to fetch (from the file
system in this case). It is overwritten with fresh data on each iteration.
❸ .read returns the number of bytes read. "-1" indicates the end of input, so we stop recursion unless
there are more bytes to process.
❹ The second buffer "cb" is created to hold the content of the first buffer. This is redundant, but the two
buffers have incompatible types and the input buffer cannot be used directly to create the chunk-
cons.
❺ dotimes is used to transfer the content of the "ib" buffer into the "cb" buffer with chunk-append.
❻ chunk-cons is called as the last step before going into the step recursion.
❼ When a java.io.InputStream is involved in lazy processing, the .close operation needs to
happen after the end of the computation. In this example, we read 20 bytes using a 4096-byte buffer.
We need to process the bytes before the end of the with-open block, which closes the input stream.
See also
• lazy-seq allows chunk-cons (and cons) to generate a sequence lazily.
• first and rest are the equivalent of chunk-first, chunk-rest for sequences not
supporting chunking.
Performance Considerations and Implementation Details
11 Maps
Maps, along with sequences and vectors, are possibly the most flexible and used
Clojure data structures. They support Clojure application design in several ways:
• Attaching "names" to data that semantically belong together. Each key in a map is
a name for a value.
• Supporting immutability and persistence in a performance-effective way (maps
use the same HAMT, Hash Array Mapped Trie, data structure used by vectors 175).
• Allowing lookup by key, including using the map itself as a function.
• The standard library contains many functions dedicated to map manipulation (such
as “assoc, assoc-in and dissoc”, merge, select-keys, etc.). The description of such
functions is the topic of this chapter.
Clojure contains several kinds of maps, which are often described using a mix of
constructor names and actual Java types. The following are concrete implementations
inheriting from the common clojure.lang.IPersistentMap interface:
• array-map is the default choice for small maps. The class name
is PersistentArrayMap. For a small number of keys it's compact and fast, but it doesn't
scale well to larger maps. For this reason, Clojure automatically promotes an array-
map to a hash-map under certain conditions. Along with structs, it also maintains
insertion order.
• hash-map is the most flexible implementation. Its class is PersistentHashMap. It
scales well to a larger number of keys, maintaining good performance at the same
time. It does not retain insertion order, so when keys or values are requested, they
are not necessarily returned in the same order they were added.
175
Please see the vector chapter for additional information on the implementation details of HAMT
• sorted-map maintains internal ordering of key-value pairs thanks to a comparator.
The implementation is contained in the PersistentTreeMap class. Sorted maps
come in handy for certain use cases.
• structs are based on PersistentHashMap, but they maintain an additional notion of
a minimal set of keys, or "structure", as a new PersistentStructMap class type.
They have features similar to hash-map, plus additional constraints.
• records generate a new Java class whose attributes are accessible as if they were
part of a map. They are not designed to scale beyond a few keys, as their main
goal is to provide object-like features such as inheritance of behavior
(with protocols).
Clojure further distinguishes the different nuances of map behavior into other interfaces,
most notably clojure.lang.Associative and clojure.lang.ILookup.
clojure.lang.IPersistentMap (thus all the types in the list above) implements both, but other Clojure data
structures implement only one or the other interface. The net result is that collections
like vectors, which implement Associative, get a subset of map properties, such as the
possibility to assoc elements onto them. Another example is transient maps, which
support lookup with get but have a completely different assoc! (note the bang)
function.
The functions in this chapter have been grouped into the following sub-chapters:
• Creating contains map constructors for the different types of maps. The functions
in this section can be used to create a new instance of any of the map types. Other
functions like frequencies or group-by also create a map, but in the context of
processing general collections.
• Accessing contains functions dedicated to fetching a specific key or group of keys,
returning their values with or without their associated keys.
• Processing contains functions to alter the content of a map. All map types (with
the exception of transient maps) are immutable (they can’t be changed) and
persistent (changes generate a copy of the original object plus the changes).
• Map utilities contains other interesting functions to manipulate maps.
11.1 Creating
11.1.1 hash-map
function since 1.0
hash-map is the builder function for Clojure hash maps, a type of immutable data
structure that supports direct access lookup by key (also called hash table in other
languages):
(def phone-book
(hash-map "Jack N" "381-883-1312" ; ❶
"Book Shop" "381-144-1256"
"Lee J." "411-742-0032"
"Jack N" "534-131-9922"))
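The lookup described in ❷ is not shown in this excerpt; it presumably looked like the following (phone-book repeated so the snippet is self-contained):

```clojure
;; phone-book as defined above:
(def phone-book
  (hash-map "Jack N" "381-883-1312"
            "Book Shop" "381-144-1256"
            "Lee J." "411-742-0032"
            "Jack N" "534-131-9922"))

(phone-book "Jack N")   ; ❷ the map used as a function
;; "534-131-9922"
```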
❶ hash-map takes any number of key-value pairs. Note that "Jack N" is a duplicated entry.
❷ The result of calling hash-map (the map itself) can be used as a function to lookup its content by key.
The last value associated to the "Jack N" entry overwrites any previous entries associated with the
same key.
Contract
Input
• "keyvals" can be any even number of arguments or no arguments.
Notable exceptions
• IllegalArgumentException is thrown when the number of arguments is not even.
Output
• returns: a clojure.lang.PersistentHashMap instance containing the given key-
value pairs, or empty. When the same key is present multiple times, the last key-
value pair overwrites the content of the previous ones, while metadata on the key (if
any) is retained from the original key.
Examples
hash-map, when compared to the syntax literal {}, allows dynamic creation of the
hash map at run time, usually from the content of other data:
(apply hash-map (mapcat vector (range 4) (repeatedly rand))) ; ❶
;; {0 0.6232152613924482
;; 1 0.07009565532668205
;; 3 0.9616604642779419
;; 2 0.8674645383318249}
❶ apply spreads the list of arguments for hash-map. The result of mapcat is a list that alternates
integers with randomly generated floats.
One frequent source of key-value pairs is URL parameters. In the following example we
have a relatively long URL and we would like to build a map of the parameters
passed in the request:
(require '[clojure.string :as s])
(def long-url ; ❶
(str "https://notifications.google.com/u/0/_"
"/NotificationsOgbUi/data/batchexecute?"
"f.sid=4896754370137081598&hl=en&soc-app=208&"
"soc-platform=1&soc-device=1&_reqid=53227&rt="))
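The split-pair and params helpers described in ❷ and ❸ are missing from this excerpt; hypothetical reconstructions consistent with the callouts and the output below:

```clojure
(require '[clojure.string :as s])

(defn split-pair [str-pair]          ; ❷ "a=b" -> ["a" "b"], "a=" -> ["a" nil]
  (let [[k v] (s/split str-pair #"=")]
    [k v]))

(defn params [url]                   ; ❸ vertical flow with as->
  (as-> url x
    (s/split x #"\?")
    (second x)                       ; keep only the query string
    (s/split x #"&")
    (mapcat split-pair x)
    (apply hash-map x)))
```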
(params long-url) ; ❹
;; {"soc-device" "1"
;; "_reqid" "53227"
;; "soc-platform" "1"
;; "f.sid" "4896754370137081598"
;; "rt" nil
;; "soc-app" "208"
;; "hl" "en"}
❶ To avoid pagination problems, the long url string has been artificially split into parts. It’s not easy to
recognize keys and their values in this form.
❷ split-pair takes a string pair in the form "a=b" and returns a vector of [a b]. The only complexity to
deal with is the potential absence of the value which is replaced with nil.
❸ params is organized to flow vertically making good use of as->. Each line executes a small bit of
processing and the output becomes the input of the following form using the "x" placeholder. hash-
map is the last part to apply.
❹ The output shows what parameters are present on the request URL and deals with potential missing
values.
See also
• apply can be used in conjunction with hash-map to spread arguments from a
collection instead of enumerating them explicitly.
• zipmap allows the creation of a hash map from two ordered collections, the first
providing the keys the second the values.
• into offers another option to build a hash map starting from a list of pairs.
Unlike hash-map, which would require apply (a linear spread of the
arguments into the hash-map function), into accepts a collection of key-value
pairs.
⇒ O(n) space
Clojure hash-map creates and potentially assocs multiple elements onto a
new clojure.lang.PersistentHashMap instance. hash-map performs linearly in the
number of key-value pairs to be added, and the relationship with occupied memory
space is similarly linear. Each subsequent assoc operation can be considered constant
time for practical use, but the assoc profile is O(log32 N). For very large maps (millions of
keys) the non-linear profile starts to be visible, but this is more a concern
for assoc than hash-map (please see the assoc performance profile).
The following benchmark compares hash-map to create a large map of the same size
with a few alternative solutions:
(require '[criterium.core :refer [quick-bench]])
❶ In all examples the created map has 1M keys. The first benchmark uses apply on a flat list of items
which hash-map organizes as key-value pairs.
❷ In the second approach we create a list of pairs suitable for into. The generated map is the same as in
the first example.
❸ The third example assumes we have a Java java.util.HashMap created elsewhere.
The java.util.HashMap instance to use in the benchmark is created from a Clojure map, but this is
just for the benchmark.
❹ The last benchmark uses zipmap.
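The zipmap* variant that callout ❶ below describes is omitted from this excerpt; a hypothetical reconstruction matching the description:

```clojure
(defn zipmap* [ks vs]               ; ❶ assumes ks and vs are vectors
  (let [n (min (count ks) (count vs))]
    (loop [i 0
           m (transient {})]        ; transient while building
      (if (< i n)
        (recur (inc i)
               (assoc! m (ks i) (vs i)))  ; direct index lookup into the vectors
        (persistent! m)))))         ; back to persistent at the end
```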
❶ This version of zipmap* assumes arguments are vectors. With that simplification in mind, we can
make direct access lookup to them to incrementally build the final map. Notice how the map enters the
loop as a transient to temporarily remove persistence and is turned back into persistent! at the end.
❷ We can shave about 100ms by removing any sequential access to keys and values, in case keys and
values are given as vectors.
11.1.2 array-map
function since 1.0
(array-map
([])
([& keyvals]))
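The definition of the m used below is missing from this excerpt; based on the lookup result it was presumably something like (a hypothetical reconstruction):

```clojure
(def m (array-map :a 1 :b 2)) ; hypothetical: the original definition is omitted
```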
(m :a) ; ❷
;; 1
❶ array-map has the same interface as hash-map, with an additional arity to handle the no-arguments
case. Otherwise it accepts any number of parameters.
❷ The map created by array-map can be used exactly like a hash-map.
array-map maintains the same interface as hash-map, but it has a simpler linear
implementation (compared to the tree-like implementation of
clojure.lang.PersistentHashMap). One peculiar aspect compared to hash-map or
sorted-map is that array-map maintains insertion order.
Contract
Input
• "keyvals" is an even list of arguments of any type, including no arguments.
Notable exceptions
• IllegalArgumentException is thrown when the number of arguments is not even,
showing which key is missing the respective value.
Output
• returns: a clojure.lang.PersistentArrayMap containing the given "keyvals" in
insertion order (when iterated). In case of duplicate keys, the value of the last key
overrides the previous ones, while metadata on the key (if any) is retained from the
original key.
Examples
Clojure contains a mechanism of self-promotion for array-maps into hash-maps. It can
be seen in action using map literals:
(type {0 1 2 3 4 5 6 7 8 9})
;; clojure.lang.PersistentArrayMap ; ❶
(type {0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19})
;; clojure.lang.PersistentHashMap ; ❷
assoc (and functions like zipmap that are based on it) also promotes an array-map into a
hash-map if necessary:
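The snippet the callouts below describe is not shown in this excerpt; an equivalent might look like:

```clojure
(def an-array-map (apply array-map (range 200))) ; ❶ an array-map with 100 keys
(type an-array-map)
;; clojure.lang.PersistentArrayMap

(type (assoc an-array-map :k :v))                ; ❷ assoc promotes to a hash map
;; clojure.lang.PersistentHashMap

(type (dissoc an-array-map 0))                   ; ❸ dissoc does not self-promote
;; clojure.lang.PersistentArrayMap
```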
❶ We start from an array-map with 100 keys. It is correctly reported as an ArrayMap instance.
❷ As soon as we assoc into an-array-map we get back a HashMap instance instead.
❸ Note however that dissoc does not self-promote.
The reason for the self-promotion is that array-map is a simple map-like data structure
that does not scale beyond a few hundred entries. Even with this limitation, array-map
has some applications related to its insertion-order guarantee. An array-map
representing the headers of a table, for example, could aggregate the following
information in a single data structure:
• The key is the name of the column to be used when exporting data into CSV files.
• The value is a vector pair containing the mapping to the database column and a
function that can be used to validate the corresponding data.
• The position of the key in the map is the position at which the corresponding
column should appear.
The last information regarding insertion ordering could be used to validate and export
data into a comma separated file (CSV, or comma separated values, is a simple and
ubiquitous plain text exchange format for tabular data):
(def query-results ; ❶
[{:date "01/05/2012 12:51" :surname "Black"
:name "Mary" :title "Mrs" :n "20"
:address "Hillbank St" :town "Kelso" :postcode "TD5 7JW"}
{:date "01/05/2012 17:02" :surname "Bowie"
:name "Chris" :title "Miss" :n "44"
:address "Hall Rd" :town "Sheffield" :postcode "S5 7PW"}
{:date "01/05/2012 17:08" :surname "Burton"
:name "John" :title "Mr" :n "41"
:address "Warren Rd" :town "Yarmouth" :postcode "NR31 9AB"}])
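The checkfn helper referenced by customers-format (and described in ❷ below) is omitted from this excerpt; a hypothetical version consistent with the description:

```clojure
;; Hypothetical: takes a predicate of one argument and builds a validation fn
;; that passes a valid value through and throws a RuntimeException otherwise.
(defn checkfn [predicate]
  (fn [x]
    (if (predicate x)
      x
      (throw (RuntimeException. (str "Invalid value " (pr-str x)))))))
```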
(def customers-format ; ❸
(array-map
'TITLE [:title (checkfn #{"Mrs" "Miss" "Mr"})]
'FIRST [:name (checkfn (comp some? seq))]
'LAST [:surname (checkfn (comp some? seq))]
'NUMBER [:n (checkfn #(re-find #"^\d+$" %))]
'STREET [:address (checkfn (comp some? seq))]
'CITY [:town (checkfn (comp some? seq))]
'POST [:postcode (checkfn #(re-find #"^\w{2,4} \w{2,4}$" %))]
'JOINED [:date (checkfn #(re-find #"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}$" %))]))
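The csv-str, format-row and format-data helpers described in ❹, ❺ and ❻ below are missing from this excerpt; hypothetical reconstructions, assuming the validation functions (like the hypothetical checkfn above) return their valid input:

```clojure
(require '[clojure.string :as s])

(defn csv-str [coll]                    ; ❹ values -> one comma separated line
  (str (s/join "," coll) "\n"))

(defn format-row [format]               ; ❺ returns a fn of one database row
  (fn [row]
    (csv-str
      (map (fn [[_ [db-key check]]]     ; each format entry: name [db-key check]
             (check (get row db-key)))  ; validate while selecting
           format))))

(defn format-data [format records]      ; ❻ header line plus one line per record
  (apply str
         (csv-str (keys format))
         (map (format-row format) records)))
```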
❶ query-results is a sample output for a typical database query about customers. Data is organized
in maps with keys corresponding to the columns of the corresponding table. These columns are not
necessarily in the format required for data export, for example CSV files.
❷ checkfn is a helper function to create validation functions. It takes a function of 1 argument and
creates a function that can be used for validation. The "predicate" is used to determine the validity of
the input. If the input is not valid, the generated function throws a RuntimeException.
❸ customers-format is an array-map representing the relationship between the internal database
format and the external format. The keys have the right name and the right order. Values are vectors of
two elements: the first is the name of the column in the database, the second is a validation function
created with checkfn. Note the use of symbols instead of keywords for keys: symbols print as they
are, without the ":" prefix that would need to be removed.
❹ csv-str takes a collection of values and composes them into a comma separated string ending in a
new line.
❺ The format-row function accepts a "format" parameter specification and uses it to create another
function. The returned function takes one database row and produces a single comma-separated line
containing the selected data in the right order. This function is also responsible for invoking each
column's validation function.
❻ format-data transforms each record into the corresponding row, invoking format-row on each
database record.
❼ We can see the output from format-data when all validations are successful.
(def customers-format ; ❶
(array-map
(with-meta 'TITLE {:db :title }) (checkfn #{"Mrs" "Miss" "Mr"})
(with-meta 'FIRST {:db :name }) (checkfn (comp some? seq))
(with-meta 'LAST {:db :surname }) (checkfn (comp some? seq))
(with-meta 'NUMBER {:db :n }) (checkfn #(re-find #"^\d+$" %))
(with-meta 'STREET {:db :address }) (checkfn (comp some? seq))
(with-meta 'CITY {:db :town }) (checkfn (comp some? seq))))
See also
• hash-map is the more robust and scalable version of array-map. array-map differs
❶ "r1" is a range of 2000 elements where the last 1000 are repeating the same number. The resulting
map is going to contain 500 unique keys.
❷ "r2" is a range of 2000 items without duplicates. The resulting map will contain 1000 keys.
sorted-map uses the default comparator to maintain order. sorted-map-by can be used
to pass a different comparator:
(sorted-map-by #(< (:age %1) (:age %2)) ; ❶
{:age 35} ["J" "K"]
{:age 13} ["Q" "R"]
{:age 14} ["T" "V"])
;; {{:age 13} ["Q" "R"], {:age 14} ["T" "V"], {:age 35} ["J" "K"]}
❶ sorted-map-by accepts a function of two arguments returning a negative number, 0 or a positive
number (the first argument is less than, equal to or greater than the second argument, respectively),
which is the typical interface for a comparator. The custom comparator is used instead of the default to
determine the ordering of the keys.
Ordering requires that the keys in the map can be compared. It follows that key objects:
1. Need to support the java.lang.Comparable interface.
2. Must be of the same type as the first key in the map (so all the keys must have the
same type).
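Both requirements can be illustrated quickly (a minimal sketch):

```clojure
(sorted-map 2 :b 1 :a)       ; Long keys are mutually Comparable
;; {1 :a, 2 :b}

;; Mixing incompatible key types fails as soon as the two keys are compared:
;; (sorted-map 1 :a "b" :c)  => ClassCastException
```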
Contract
Input
• "keyvals" is a list of arguments of any type, including no arguments. Arguments
need to be in pairs, so the list count must be even.
• "comparator" is a function of 2 arguments. The function should return a negative,
0 or positive number to indicate the first argument is less than, equal or more than
the second argument respectively. It’s a mandatory argument for sorted-map-by.
Notable exceptions
• IllegalArgumentException is thrown when the number of arguments is not even,
showing which key is missing the respective value.
(println m) ; ❸
;; {a 2}
❶ The timed function takes a symbol "s" and returns the symbol after adding metadata containing the
time of the creation. It also prints on standard output when the key was created.
❷ We create a sorted-map using the timed function to insert keys. The second key is the same as the
first.
❸ We can see the value of the "a" key is the second value that was added.
❹ But the metadata on the key comes from the creation of the first key.
One interesting use of sorted-map is implementing priority queues 176 . Priority queues
are at the base of important algorithms in computer science, such as searching for the
optimal path in a graph. The A* (a-star) algorithm, for instance, can be implemented
using a priority queue.
A* was devised in the late 1960s to guide a robot around obstacles in the optimal
way and is still used nowadays in games and navigation software. A* requires a
heuristic, an approximate distance to the destination used to filter out unwanted paths.
In the case of a car navigation system for example, the heuristic could be the straight
line distance between two locations, which is easily obtainable from geospatial data.
176
A priority queue is a type of data abstraction similar to a queue where each item is also assigned a priority. The priority
is used to decide which element is dequeued next. Please see the Wikipedia entry for more information:
We could implement the priority queue of the optimal paths using sorted-map-by with
a composite vector key. Here’s how it would look after processing two paths
departing from "origin" and reaching "point 2" and "point 3":
(sorted-map-by compare ; ❶
[4.5 "point 2"] [["origin"] 1.5] ; ❷
[5.5 "point 3"] [["origin"] 2.5]) ; ❸
sorted-map-by in the example above shows how to use composite keys with sorted
maps. Vectors can be used as keys: they compare element by element through compare,
so the first differing element (here the score) decides the ordering, making them
well suited as composite keys with compare as the comparator.
Before we can iterate over locations, we need a suitable representation. Locations can be
represented graphically as a cyclic directed graph like the one shown below:
Figure 11.1. Locations and routes connecting them. We want to move from "Orig" to "Dest"
using the shortest path.
We could translate the graph using a map containing all locations as keys:
(def graph
{:orig [{:a 1.5 :d 2} 0] ; ❶
:a [{:orig 1.5 :b 2} 4]
:b [{:a 2 :c 3} 2]
:c [{:b 3 :dest 4} 4]
:dest [{:c 4 :e 2} 0]
:e [{:dest 2 :d 3} 2]
:d [{:orig 2 :e 3} 4.5]})
❶ The value for each location key is a vector of two items. The first item is another map containing the
connected locations and their distances from the key. The second element is the heuristic for that
location.
The following implementation of the A* algorithm takes graph as input, an origin and
a destination. The function contains two inner letfn definitions, which help keep the
list of parameters short (the inner functions can access parameters from the
outer scope). walk is a tail recursive inner function that traverses the graph in the
format given before. discover removes already visited nodes at each iteration,
eliminating potentially infinite cycles. At each iteration we move to the next node that
minimizes the heuristic plus the actual distance from the origin, and discover new nodes
that we put in the priority queue for the next iteration:
(defn a* [graph orig dest]
(letfn
[(discover [node path visited]
(let [walkable (first (graph node)) ; ❶
seen (map last (keys visited))]
(reduce dissoc walkable (conj seen (last path))))) ; ❷
(walk [visited]
(let [[[score node :as current] [path total-distance]] (first visited)] ; ❸
(if (= dest node)
(conj path dest)
(recur
(reduce-kv ; ❹
(fn [m neighbour partial-distance]
(let [d (+ total-distance partial-distance)
score (+ d (last (graph neighbour)))]
(assoc m [score neighbour] [(conj path node) d]))) ; ❺
(dissoc visited current) ; ❻
(discover node path visited))))))]
(walk (sorted-map-by compare [0 orig] [[] 0])))) ; ❼
❶ To access all reachable nodes from the current location, we look up the location in the graph map.
❷ The repeated use of dissoc removes nodes that we’ve already seen.
❸ destructuring greatly helps create concise code to access standard data structures,
including sorted-map-by. The first entry of visited is the location that currently minimizes the
sum of the distance traveled so far and the heuristic (in this example the straight line distance to
destination).
❹ reduce-kv iterates over the list of walkable locations starting from the priority map of visited locations.
While iterating over each new path departing from the current location, it calculates aggregated metrics
like the total distance from the origin.
❺ While iterating each new location, we also need to build the path necessary to reach it in case this is
the solution (e.g. the new location is the destination). The path is returned as the final result at
destination.
❻ We need to remove the first node in the priority list before we start calculating again to avoid infinite
recursion.
❼ At initialization, a sorted map is created with sorted-map-by containing "orig" only.
❽ We can invoke a* to search for the best path from :orig to :dest or any other pair of locations.
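For example, invoking a* on the graph above finds the shorter route through :d and :e (the definitions are repeated here so the sketch can be evaluated on its own):

```clojure
;; Definitions from above, repeated for a self-contained sketch.
(def graph
  {:orig [{:a 1.5 :d 2} 0]
   :a    [{:orig 1.5 :b 2} 4]
   :b    [{:a 2 :c 3} 2]
   :c    [{:b 3 :dest 4} 4]
   :dest [{:c 4 :e 2} 0]
   :e    [{:dest 2 :d 3} 2]
   :d    [{:orig 2 :e 3} 4.5]})

(defn a* [graph orig dest]
  (letfn
    [(discover [node path visited]
       (let [walkable (first (graph node))
             seen (map last (keys visited))]
         (reduce dissoc walkable (conj seen (last path)))))
     (walk [visited]
       (let [[[score node :as current] [path total-distance]] (first visited)]
         (if (= dest node)
           (conj path dest)
           (recur
             (reduce-kv
               (fn [m neighbour partial-distance]
                 (let [d (+ total-distance partial-distance)
                       score (+ d (last (graph neighbour)))]
                   (assoc m [score neighbour] [(conj path node) d])))
               (dissoc visited current)
               (discover node path visited))))))]
    (walk (sorted-map-by compare [0 orig] [[] 0]))))

(a* graph :orig :dest)
;; [:orig :d :e :dest]
```

The route through :d and :e has total distance 7, against 10.5 for the route through :a, :b and :c, so the priority queue reaches :dest through :d and :e first.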
(sorted-map-by
#(compare (count %2) (count %1)) ; ❶
[:a :b] 1 [:a] 2 [:b] 3)
;; {[:a :b] 1, [:a] 3}
❶ A flawed custom comparator for a sorted-map-by.
The first sign that the comparator is not working properly is the missing [:b] key in the resulting map. A
second hint is that subsequent operations fail in an apparently unpredictable way:
(def ordered-by-count ; ❶
(sorted-map-by
#(compare (count %2) (count %1))
[:a :b] 1 [:a] 2 [:b] 3))
The reason our comparator is flawed is that the comparator is used to check whether a key is already in
the map as well as to decide in which relative order it should appear in the resulting sorted-map. Our
comparator does not take into account the first aspect, the fact that it is going to be used to check if the
key is already in the map:
(def flawed-comparator
The comparator will be called for each key already in the sorted map and the new [:x] key to be
inserted. Since they have the same count, it wrongly concludes [:x] is already in the map and should not
be added. What is missing is an additional check on whether the keys are actually equal, not just
whether they compare equal by count:
(def good-comparator ; ❶
#(compare [(count %2) %1] [(count %1) %2]))
With the correct formulation of the custom comparator, we can see that sorted-map-by behaves as
expected:
(def ordered-by-count
(sorted-map-by
#(compare [(count %2) %1] [(count %1) %2])
[:a :b] 1 [:a] 2 [:b] 3))
The reader is also invited to check the call-out section in sorted-set for additional considerations.
See also
• sorted-set and sorted-set-by are similar functions that create an ordered set instead of a map.
• “subseq and rsubseq” are used to generate a sequence starting from an element of
a sorted map or set.
• “hash-map” or “array-map” are the other kinds of persistent dictionary-like data
structures available in Clojure.
❶ We use apply to spread a list of key-value pairs for the sorted-map constructor.
❷ In this case we use into with the same number of keys.
❸ The last case builds a similar but mutable Java data structure.
The benchmark shows almost no difference between apply and into. The mutable Java
version is roughly 3 times faster.
While discussing the a* algorithm, we used first to access the best location from the
sorted map. Although first is in general a good choice, it adds the overhead of
transforming the sorted-map into a sequence. The Java class implementing sorted-
map contains two public methods min and max that are not exposed from the standard
library. The following benchmark shows a consistent performance gain when we
access .min using Java interop:
(require '[criterium.core :refer [quick-bench]])
(import '[clojure.lang PersistentTreeMap])
177
Red-Black trees are a flexible data structure. Compared to plain binary trees, Red-Black trees self-adjust to
avoid unbalanced branches. Please have a look at the Wikipedia article to know more: https://en.wikipedia.org/wiki/Red–black_tree
It’s a common use case to access a sorted-map to retrieve the first or the last element
(especially when sorted maps are used as priority queues). In that case we can
use the min or max methods directly, remembering to type hint the map to avoid a very
costly reflective lookup.
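A sketch of that direct access (the type hint on the local binding is what avoids the reflective call):

```clojure
(import '[clojure.lang PersistentTreeMap])

;; .min and .max return the entries with the smallest and largest key.
(let [^PersistentTreeMap m (sorted-map 3 :c 1 :a 2 :b)]
  [(.min m) (.max m)])
;; [[1 :a] [3 :c]]
```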
11.1.4 create-struct, defstruct, struct-map, struct and accessor
functions (except defstruct) since 1.0
in the function body) and it’s still in use in some parts of the Clojure standard library
(for example xml processing, “resultset-seq” or cl-format). Along with array-map
and defrecord, struct is one of the few map-like types that maintain insertion order.
Contract
Input
• "keys" is the list of the minimal set of keys that should be present in the struct.
• "name" defines the name for struct just assigned to a local var in the current
namespace by defstruct.
• "s" is a struct definition created with create-struct or defstruct. It is used
by struct-map, struct and accessor to get access to a previously defined struct.
• "inits" are unstructured key-value pairs that can be passed to struct-map to
initialize a new struct instance.
• vals is a list of values passed to struct. Values should match positionally in
relation to the keys already present in a struct definition.
• key is used by accessor to create a function to access that key in a
given struct instance.
Notable exceptions
• IllegalArgumentException is thrown when more "vals" are passed to struct than
there are keys in the definition.
• RuntimeException is thrown when trying to dissoc a key that is part of the struct definition.
Output
• create-struct creates a new PersistentStructMap$Def definition object with
the given "keys". At least one key is required.
• defstruct is equivalent to assigning the definition created by create-struct to a
var "name" (the equivalent of using def).
• struct-map creates a new struct instance based on a struct definition and a list
"inits" of key value pairs. If "inits" does not contain the keys defined in
the struct, then the keys are assigned a default nil value. Other key-value pairs
are accepted along with the keys from the definition.
• struct accepts a definition and a list of values. struct will try to match values
positionally against the key definitions. Missing values for keys result in a default
value of nil being assigned. "vals" can be equal or less than the number of
keys. IllegalArgumentException is thrown attempting to pass more values than
keys in the definition.
• accessor returns a function of one argument. The function accepts
a struct instance and retrieves the value at the specific "key".
Examples
structs are map-like types with a minimal set of required keys. When using
a struct we need to distinguish between their definition (the result of calling create-
❶ In the first example, we compare two struct definitions. Although they contain the same keys, the two
definitions are independent.
❷ The second example instantiates an actual struct object from each definition using the same values.
The two instances, having the same keys and values, are equal.
Additionally, definitions with different sets of keys can generate equal struct instances
by adding the missing keys:
(defstruct point-2d :x :y) ; ❶
(defstruct point-3d :x :y :z) ; ❷
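For instance (a sketch based on the two definitions just given), assoc-ing the missing :z key onto a point-2d instance makes it equal to the corresponding point-3d instance:

```clojure
(defstruct point-2d :x :y)    ; definitions repeated for a self-contained sketch
(defstruct point-3d :x :y :z)

;; Struct instances compare with ordinary map equality:
(= (assoc (struct point-2d 1 2) :z 3)
   (struct point-3d 1 2 3))
;; true
```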
A struct is much simpler than the more powerful defrecord. One fundamental
difference is that create-struct creates an anonymous struct definition,
while defrecord creates a Java class. Class generation is a necessary side-effect of
using defrecord: this allows powerful features such as full-fledged inheritance of
record types. At the same time, records are more complicated to use across namespaces
(as they require an explicit import statement) and they are trickier to reload when their
definition changes. The following example illustrates the difference:
(struct (create-struct :a :b :c) 1 2 3) ; ❶
;; {:a 1, :b 2, :c 3}
(defrecord abc [a b c]) ; ❷
;; user.abc

(abc. 1 2 3) ; ❸
;; #user.abc{:a 1, :b 2, :c 3}
❶ A struct definition exists without necessarily being assigned a name, so structs can be created and
instantiated directly.
❷ defrecord on the other hand, requires a name and generates a java.lang.Class that cannot be used
directly.
❸ The class "abc" generated by defrecord, is now available in the "user" namespace and can be used to
initialize a new record instance.
We are going to make use of an inlined struct definition in the following example. A
waypoint contains the coordinates of a point of interest on Earth: a type, an id, a
latitude and a longitude. We could define a struct that contains the necessary keys:
(require '[clojure.string :refer [split-lines split]])
(def waypoints
(let [sdef (create-struct :type :lat :lon :id)] ; ❷
(transduce
(comp
(map #(split % #"\s+"))
(map #(apply struct sdef %))) ; ❸
conj
lines)))
(first waypoints) ; ❹
;; {:type "VHF", :lat "0.000000", :lon "0.000000", :id "ABI"}
❶ This url points to a list of about 50000 points of interest, one for each line. We can load its content and
proceed to split the lines using clojure.string/split-lines.
❷ The struct definition is created as part of the computation to parse the list with create-struct. Note
that, if necessary, the definition could be altered at runtime, for example to add a specific key based on
the ":type" of the waypoint.
❸ Each line is split into 4 values which are then applied to the struct definition.
❹ We can see one example of waypoint that appears printed exactly like a normal map.
struct-map is the other way to instantiate a struct. It’s useful when converting a
normal hash-map into a struct with corresponding keys:
(defstruct waypoint :type :lat :lon :id) ; ❶
(def coordinates [ ; ❷
{:alt 150 :lat "18.3112" :lon "3.1314" :id "XVA"}
{:alt 312 :lon "10.04883" :id "FFA" :type "XFV"}
{:temp 78.3 :lat "23.7611" :id "XJP"}])
(defn to-waypoints [coords] ; ❸
  (map #(apply struct-map waypoint (mapcat identity %)) coords))

(to-waypoints coordinates) ; ❹
;; ({:type nil, :lat "18.3112", :lon "3.1314", :id "XVA", :alt 150}
;; {:type "XFV", :lat nil, :lon "10.04883", :id "FFA", :alt 312}
;; {:type nil, :lat "23.7611", :lon nil, :id "XJP", :temp 78.3})
❶ The waypoint struct is now defined and assigned to a var in the current namespace. The definition is
visible in the current namespace (or other namespaces if required) from this point onward.
❷ We receive a vector of heterogeneous coordinates, sometimes with keys we need, sometimes with
missing data.
❸ to-waypoints transforms a list of maps into waypoint struct objects. We need to use mapcat to
flatten the content of each map into a plain list of key-value pairs. We can then
use apply to feed the list into struct-map.
❹ The end result is a list of structs. What is different from before is that the presence of the minimal set
of keys from the waypoint definition is guaranteed, although the corresponding value could be nil.
accessor optimizes frequent access to the field of the struct by skipping the
typical hash tree traversal and using a faster array-index lookup. With reference to the
previous example, we can define accessors to access waypoint instances as follows:
(def type (accessor waypoint :type)) ; ❶
(def lat (accessor waypoint :lat))
(def lon (accessor waypoint :lon))
(def id (accessor waypoint :id))
❶ accessor creates a function in the current namespace that accepts a struct instance. We define
one accessor for each key in a waypoint.
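A usage sketch of one of these accessors (the struct definition is repeated so the snippet stands alone):

```clojure
(defstruct waypoint :type :lat :lon :id)

(def lat (accessor waypoint :lat))   ; accessor function for the :lat key

(lat (struct waypoint "VHF" "0.000000" "0.000000" "ABI"))
;; "0.000000"
```

The accessor behaves like keyword lookup but skips the generic map lookup path, which is what makes it faster on struct instances.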
Lisp defstruct
The main goal of Clojure struct is to remove repeated storage of keys. struct guarantees a minimal
set of keys, similarly to the consistent set of attributes offered by object classes. At the time when Clojure
didn’t have deftype or defrecord, it was a common question in the Clojure mailing list how to design
inheritance using defstruct because of its class-like appearance. But defstruct doesn’t offer any of
the powerful features an object system provides, such as inheritance of behavior through types. One
additional reason to think defstruct is the entry point to create an object system is the presence of
the same name in Common Lisp:
;; Lisp code
(defstruct ; ❶
(person (:constructor create-person (id name age)))
id name age)
Common Lisp defstruct is designed to offer object oriented features like inheritance:
(defstruct
The similarities between Clojure defstruct and Common Lisp might explain why people coming to
Clojure in the early days (especially from a Lisp background) were trying to use struct differently from
what it was intended for (a map optimization). Clojure’s introduction of defrecord and defprotocol a
few years later removed this ambiguity. defrecord also offers the same capabilities as defstruct:
currently, there are very few use cases that require struct instead of defrecord, and some of the
differences have been highlighted in this chapter. This is the reason why the official Clojure
documentation points the reader to defrecord instead of defstruct.
See also
• defrecord effectively supersedes defstruct to define strongly typed map-like
structures. It additionally offers inheritance and interface declaration mechanisms
more suitable for object oriented programming. Unless you need lightweight
throwaway map-like objects for internal computations, the better option is to
use defrecord instead.
Performance considerations and implementation details
(defn w-hmap [type lat lon id] (hash-map :type type :lat lat :lon lon :id id))
❶ lines contains a list of waypoints as vectors of strings. By asking for the last element we fully realize
the sequence.
❷ w-struct, ->w-record, w-map and w-hmap are the 4 constructors that we need for the test. Note
that w-map creates an array-map, not a hash-map.
❸ We can see that defrecord is the fastest of the group, with array-map and struct following. The
creation of a hash-map has the worst performance.
We can now benchmark access to the waypoints. Each type has a few ways to access
values given a key. For struct we are going to create specific accessors:
(def points-struct (doall (map #(apply struct w-struct %) lines)))
(def points-record (doall (map #(apply ->w-record %) lines)))
(def points-map (doall (map #(apply w-map %) lines)))
(def points-hmap (doall (map #(apply w-hmap %) lines)))
❶ We can see substantial equivalence between struct and defrecord, but we used the
faster accessor for struct and the normal key lookup for defrecord.
❷ defrecord also generates a Java type that allows direct access to the accessor methods on its
class. We need to remember to give the compiler a type hint, but the results are almost 50% faster.
Overall defrecord is faster at creating and accessing data compared to defstruct, with
the additional Java interop option to achieve further speed-up. The
following picture shows the memory profiling of the 4 struct-like types
examined so far. The byte count has been obtained by loading the same list of waypoints
presented before on an empty JVM using memory profiling in VisualVM 178 :
178
VisualVM is available for free at https://visualvm.github.io
Figure 11.2. A comparison of the memory allocation for the different type of map-like
structures.
We can see roughly the same number of allocated objects for each type (about 42,900),
corresponding to the number of waypoints in the file. clojure.lang.PersistentStructMap represents
defstruct with 1,716,360 allocated bytes. defstruct is cheaper than the
user.poi-record defrecord that follows with 2,069,632 bytes. The cheapest of the group is array-map
with a total of 1,373,152 allocated bytes. The last entry in the group is
clojure.lang.PersistentHashMap, the normal hash-map. In this case
BitmapIndexedNode instances are also allocated, bringing the total
memory allocation to 2,746,328 bytes.
11.1.5 zipmap
function since 1.0
zipmap creates a new hash-map from two collections of keys and values respectively.
Key-value pairs are formed by making ordered sequential access to both inputs:
(zipmap [:a :b :c] [1 2 3]) ; ❶
;; {:a 1, :b 2, :c 3}
Like hash-map, there is no guarantee that the generated map will preserve the order
of the input keys or vals:
(zipmap (range 10) (range 10)) ; ❶
;; {0 0, 7 7, 1 1, 4 4, 6 6, 3 3, 2 2, 9 9, 5 5, 8 8}
❶ The example shows that even if the keys and values are given in order, the map internal ordering
when printed is not determined.
zipmap is useful to create maps programmatically at runtime and covers the case in
which keys and values are not alternating in the same sequence (in that case we could
use hash-map) or they are not already forming pairs (for which we could use into).
Contract
Input
• "keys" and "vals" are both sequential collection following seq contract for
conversion into a sequence (when necessary). They are both required arguments
but they can be empty or nil.
Notable exceptions
• IllegalArgumentException is thrown if either "keys" or "vals" is not sequential.
Output
• returns: the hash-map formed by taking keys and values pairwise from "keys"
and "vals" consecutively. If "keys" contains duplicates, the last key (and
corresponding value) overwrites the previous one (following assoc semantics). If the
number of keys and values differ, zipmap stops after exhausting the shorter of
the two.
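Both behaviors can be verified directly:

```clojure
(zipmap [:a :b :a] [1 2 3])   ; duplicate key: the last value wins
;; {:a 3, :b 2}

(zipmap [:a :b :c] [1 2])     ; stops at the shorter of the two inputs
;; {:a 1, :b 2}
```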
NOTE Please note that the actual type returned by zipmap can be
either clojure.lang.PersistentArrayMap for smaller maps (up to 8 keys)
or clojure.lang.PersistentHashMap for larger maps. This is consistent with the auto-
promoting behavior of hash-map related functions such as assoc. Also note that the order in
which the map is printed (or iterated) could be different from the order of the initial "keys" and
"vals".
Examples
Despite the absence of ordering guarantee for the generated map, the following
expression is always true:
(let [m {:a 1 :b 2 :c 3 :d 4 :e 5}]
(= m (zipmap (keys m) (vals m)))) ; ❶
❶ We can always build an equivalent map by calling keys and vals on an input map "m" and
using zipmap to build a new map. The original map "m" and the newly created one are equivalent
despite the non deterministic ordering.
zipmap could be used to generate a map where all values are the same (for example,
initializing counters to "1" for later updates):
(zipmap ["red" "blue" "green"] (repeat 1)) ; ❶
❶ An infinite sequence is not a problem for zipmap, assuming the other collection is finite. Here we can
see that each color key gets a value of "1".
zipmap is useful to build a map when keys and values are not necessarily known at the
time of writing the code. A typical case is record-oriented data such as database
result sets or comma separated values (CSV) files. In the following example we want to
process a CSV file (a plain text file that usually contains a header followed by many rows
of values corresponding to the labels in the header). We want each row of values to be
transformed into a Clojure hash-map using the names of the headers as keys:
(require '[clojure.java.io :as io])
(require '[clojure.string :as s])
(def file-content ; ❶
"TITLE,FIRST,LAST,NUMBER,STREET,CITY,POST,JOINED
Mrs,Mary,Black,20,Hillbank St,Kelso,TD5 7JW,01/05/2012 12:51
Miss,Chris,Bowie,44,Hall Rd,Sheffield,S5 7PW,01/05/2012 17:02
Mr,John,Burton,41,Warren Rd,Yarmouth,NR31 9AB,01/05/2012 17:08")
❶ A small portion of the CSV file is simulated in memory as a string. The string breaks in the source are
automatically converted into new lines.
❷ The split function contains the logic to split a line string into multiple strings, removing the ","
separators between them. The resulting sequence can be used as values for zipmap.
❸ transform contains the logic to transform "data" (assumed to be
a java.io.Reader instance) into a well formed hash-map. The responsibility to close the reader is
delegated to the caller.
❹ The transformation is modeled as a transducer composition with eduction. zipmap is part of the last
transducer applied to each incoming list of values. The headers have already been assigned to a local
binding and are ready to be used.
❺ We simulate loading from a sample string instead of a file. To load the data from a file, we would only
have to change the binding of "data" to (io/reader "somefile.csv") with no other changes.
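The split and transform functions themselves are not shown in this excerpt. A simplified version of the same idea, operating directly on the in-memory string (split-line is a hypothetical helper name), could look like this:

```clojure
(require '[clojure.string :as s])

(def file-content
  "TITLE,FIRST,LAST
Mrs,Mary,Black
Miss,Chris,Bowie")

;; Hypothetical helper: split one CSV line on ","
(defn split-line [line]
  (s/split line #","))

;; zipmap pairs the header names with each row of values.
(let [[header & rows] (s/split-lines file-content)
      ks (split-line header)]
  (mapv #(zipmap ks (split-line %)) rows))
;; [{"TITLE" "Mrs", "FIRST" "Mary", "LAST" "Black"}
;;  {"TITLE" "Miss", "FIRST" "Chris", "LAST" "Bowie"}]
```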
See also
• hash-map is the standard map constructor taking any number of arguments (as
key-value pairs). hash-map can be used to enumerate key-value pairs explicitly or
when key and values are coming as a single collection.
• array-map is a specialized version of map that maintains insertion order. It works
effectively for a small number of keys (like maps to pass around as argument to
functions) but it suffers from inefficient linear access.
• sorted-map offers a way to create maps ordered by key using a comparator.
Performance considerations and implementation details
❶ The new zipmap* is a slight modification of the existing implementation that transforms the internal
"map" into a transient as it builds.
❷ We can see about 30% speed up when using transient in this example of 1000 keys.
11.2 Accessing
11.2.1 keys and vals
(keys [map])
(vals [map])
keys and vals are two utility functions to retrieve a sequence of keys or values from a
map-like type (hash-map, array-map, sorted-map, struct-map for Clojure or any
implementation of the java.util.Map interface):
(keys {:a 1 :b 2 :c 3 :d 4 :e 5 :f 6 :h 7 :i 8 :j 9}) ; ❶
;; (:e :c :j :h :b :d :f :i :a)
❶ keys retrieves a sequence of the keys in the given map. Note that there is no order guarantee,
because the map is large enough to be created as a hash-map instead of an array-map.
❷ vals retrieves the sequence of values in a map. Small maps, when created with the curly braces
literal syntax, are created as array-maps, which maintain insertion order. The insertion order is then
reflected when retrieving keys or values.
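For example:

```clojure
(vals {:a 1 :b 2 :c 3})  ; small literal maps are array-maps: order is kept
;; (1 2 3)
```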
Contract
Input
• "map" is the only argument and is mandatory. It can be one of the possible Clojure
map types (such that it implements the clojure.lang.IPersistentMap interface),
a Java map type (extending java.util.Map) or a collection
of java.util.Map$Entry, an inner class representing a list of key-value pairs.
Empty collection or nil are also possible.
Notable exceptions
• ClassCastException is thrown when the "map" is not one of the allowed map
types, or when the collection does not contain java.util.Map$Entry instances.
Output
• keys returns a sequence of the keys in "map" in undetermined order.
• vals returns a sequence of the values in "map" in undetermined order.
NOTE Although keys and vals retrieve sequences in undetermined order, ordering between them is
consistent assuming they are called on the same map instance, such that: (= (zipmap (keys
m) (vals m)) m) is always true.
Examples
The contract allows for all types of Clojure and Java maps, but also includes the option
❶ We call filter on a hash-map. The map is transformed into a sequence of pairs, where each pair is
a java.util.Map$Entry instance. We can use keys to see which keys point to an odd number.
keys and vals could be used to extract meaningful information from a configuration
map. A configuration map is a hash-map instance, read at program startup, that
determines the behavior of the application. We could use it, for example, to implement a
simple form of language processing. The following configuration map contains
information about the n-grams (combinations of words) used to evaluate the specific
tone of a sentence. Let’s say we want to measure how much emphasis is used in a
sentence:
(def matchers ; ❶
{"next generation" 10
"incredible" 10
"revolution" 10
"you love" 9
"more robust" 9
"additional benefits" 8
"evolve over time" 8
"brings" 7
"better solution" 7
"now with" 6})
❶ matchers is a map of word fragments to weights. The higher the weight, the more important the
fragment is to determine emphasis in a sentence. In a real scenario, the map would be much bigger and
the result of applying sophisticated natural language processing techniques.
❷ avg-xf is an average transducer. Apart from the usual transducer requirements, this special transducer
needs to be the last in a composition chain that processes only numbers. Internally, it maintains a
counter of how many items are added up to the final sum and it produces the average on the exit step.
❸ score contains the logic to calculate the total score of a sentence. The text is searched for the
fragments and each fragment weighs differently based on the content of the matchers map.
❹ The keys of the matchers map are the input for transduce. Each key is used in a regular expression
and then again to access the weight.
❺ A typical advertisement message scores around 7 with our simple emphasis metric.
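The avg-xf and score functions themselves are not shown in this excerpt. A simplified score that averages the weights of the matched fragments (dropping the transducer machinery) might look like:

```clojure
(def matchers
  {"next generation" 10 "incredible" 10 "revolution" 10
   "you love" 9 "more robust" 9 "additional benefits" 8
   "evolve over time" 8 "brings" 7 "better solution" 7
   "now with" 6})

;; Simplified sketch: average the weights of the fragments
;; found in the sentence, using the keys of matchers both as
;; patterns and as lookups for the weights.
(defn score [sentence]
  (let [ws (keep #(when (re-find (re-pattern %) sentence) (matchers %))
                 (keys matchers))]
    (if (seq ws)
      (double (/ (reduce + ws) (count ws)))
      0.0)))

(score "now with additional benefits, the revolution you love")
;; matches "now with" (6), "additional benefits" (8),
;; "revolution" (10) and "you love" (9): average 8.25
```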
See also
• find searches for the given key in a map and returns the
corresponding java.util.Map$Entry key-value pair object.
• key and val extract the key or the value from a java.util.Map$Entry instance
respectively.
• select-keys returns a map containing just the selected keys from another input
map.
Performance considerations and implementation details
While the creation is constant time, the iteration of the sequence returned by keys
or vals is linear in the number of keys present in the map.
find searches for a key in a map such as those created by hash-map, array-map, sorted-
map, struct or objects implementing java.util.Map:
(import 'java.util.HashMap)
;; [:a 1] ; ❶
❶ All the calls to find in the example produce the same result of [:a 1] when we search for the key :a.
find also accepts vectors, subvectors and native vectors. In this case find looks up an
item at the given index:
(find [:a :b :c] 1) ; ❶
;; [1 :b]
❶ find used on a common vector. If an element is found at the index, a Map$Entry is returned
containing both the index and the element.
❷ find used on a subvector works similarly.
❸ Finally, find can be used on native vectors built with vector-of.
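The subvector and native vector cases can be sketched as:

```clojure
(find (subvec [:a :b :c] 1) 0)    ; index is relative to the subvector
;; [0 :b]

(find (vector-of :long 1 2 3) 2)  ; native vector built with vector-of
;; [2 3]
```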
All find examples so far have the return type in common: a java.util.Map$Entry
instance composed of a key and a value. key and val are functions dedicated to extracting
the key or the value from a Map$Entry without using Java interop:
(key (first {:a 1 :b 2})) ; ❶
;; :a
(val (find {:a 1 :b 2} :b)) ; ❷
;; 2
((juxt key val) (last (System/getenv))) ; ❸
;; ["JENV_LOADED" "1"]
❶ key can be used on the elements of the sequence produced by a map. Here we extract the first key-
value pair and then the key.
❷ find similarly produces a Map$Entry instance that we can then access to retrieve the key.
❸ System/getenv returns a map of all the environment variables currently visible by the running Java
virtual machine. We can access the last key-value pair directly using last but here we prefer to retrieve
a vector instead of a Map$Entry, so we use juxt with key and val.
Contract
Input
find:
NOTE Since the Clojure 1.9 release, find also accepts transient maps or vectors as arguments, returning
a java.util.Map$Entry instance with the same semantics.
Examples
find works similarly to filter by returning the java.util.Map$Entry instance for a
specific key. We could use it on a list of maps to isolate interesting key-value pairs:
(def records ; ❶
[{(with-meta 'id {:tags [:expired]}) "1311" 'b "Mary" 'c "Mrs"}
{(with-meta 'id {:tags []}) "4902" 'b "Jane" 'c "Miss"}
{(with-meta 'id {:tags []}) "1201" 'b "John" 'c "Mr"}])
❶ records is an example list of maps resulting from some data store. The 'id key for each record
additionally contains metadata.
❷ We could use (map 'id records) directly, but that would remove the key and the potentially useful
metadata. By using find we can extract the key-value pairs and decide to use the metadata later if we
need to.
❸ We can access the metadata attached to each key, for example the first one, using key and then meta.
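The find call the annotations describe is not shown in this excerpt; a minimal sketch consistent with them, reusing the records defined above, could be:

```clojure
;; keep the whole Map$Entry so the key metadata survives
(def ids (map #(find % 'id) records))

;; read the metadata of the first key through key and meta
(meta (key (first ids)))
;; {:tags [:expired]}
```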
When find is used on vectors, the "key" argument becomes one of the available indexes and should be
a positive integer to be meaningful. Passing any other type as index is possible but always results in
nil.
There are several other cases in the standard library where the in-function documentation is not
explicit about the meaning of "key" for all the possible types. Such cases have to be inferred from using
the function or looking at the sources.
Once the rule is established, using a non-integer argument as "key" for a vector with find is
generally referred to as "garbage in, garbage out", expressing that the violation of an input contract results in
undetermined results. A glaring example of GIGO (Garbage In Garbage Out) is the following:
(def power-2-32 (long (Math/pow 2 32))) ; ❶
(find [1 2 3] power-2-32) ; ❷
;; [4294967296 1]
❶ 2^32 - 1 corresponds to the maximum number that can fit in a 32-bit integer in Java. To express 2^32 and
beyond, more than 32 bits are needed. Integers are truncated when accessing array indexes, so 2^32 becomes 0
(the "1" at bit 33 is not considered).
❷ This is why find effectively finds something at 2^32: the truncated index is 0.
See also
• contains? works similarly to find but returns true or false to indicate if the
element is present. It also extends to other non-associative types such
as sets, because it does not verify the presence of the element at a specific index (which
wouldn’t be possible on sets, as they are unordered).
• get is the most flexible and general compared to contains? and find.
⇒ O(log N) steps
For all practical purposes find is a constant time operation. find is however O(log32 N)
for all supported types except for sorted-map, which is O(log2 N). The
algorithmic family is O(log N) in all cases, but sorted-map has a higher constant
factor. Similar considerations were made for contains?, which the reader is invited to review.
The find implementation is mostly delegated to the Java side of
Clojure. clojure.lang.RT/find contains a dispatch based on types and the actual
search is delegated to the specific collection implementation.
11.2.3 select-keys and get-in
function since 1.0 (select-keys), 1.2 (get-in)
(get-in
([m ks])
([m ks not-found]))
select-keys and get-in are functions to access keys and values from maps:
❶ select-keys retrieves keys and related values from a map, returning a new map with the selected pairs,
if any.
❷ get-in only retrieves values, but it can follow an arbitrarily nested map multiple levels deep.
select-keys and get-in also work on vectors (although they are used most frequently
with maps):
(select-keys [:a :b :c :d :e] [1 3]) ; ❶
;; {1 :b, 3 :d}
(get-in [:a [:b :c [:d]]] [1 2 0]) ; ❷
;; :d
❶ select-keys on a vector returns a map which contains integers as keys and values from the
input vector. The keys are the indexes passed as input.
❷ get-in accepts integer coordinates and traverses a nested vector to retrieve the value at the index.
Contract
Input
select-keys
Notable exceptions
• IllegalArgumentException is thrown by select-keys on unsupported types.
Output
• select-keys always returns a map-type (array-map for smaller maps, hash-
map when there are over 10 keys). The resulting map contains the matching keys
from "keyseq" if any. It returns an empty map if either "map" or "keyseq" are nil.
• get-in returns the value found by accessing "m" using the keys in "ks", in order.
Each key in "ks" extracts a value from the following nested level in "m", if any.
Returns nil if no value was found, or "not-found" if a default is present. Sequential
collections like ranges or lists do not work with get-in, which always returns nil:
(get-in '(0 1 2 3) [0]) ; ❶
;; nil
❶ Accessing a list with get-in always returns nil, even when the corresponding nth operation works
correctly.
Examples
Note the opposite behavior of select-keys compared to get-in when an
empty vector is used to extract keys:
(select-keys {:a 1 :b 2} []) ; ❶
;; {}
(get-in {:a 1 :b 2} []) ; ❷
;; {:a 1, :b 2}
❶ select-keys returns an empty map when an empty vector is used to select the keys.
❷ get-in returns the input instead.
(def m (select-keys ^:original {:a 1 :b 2} [:a])) ; ❶ ❷
(meta m)
;; {:original true}
❶ A map is created using the metadata literal notation "^:", which implies the given key associated to
a true value.
❷ The map returned by select-keys preserves the given metadata.
select-keys can be used with vectors, for example to extract letters from a word or
sentence:
(let [word "hello"]
(select-keys (vec word) (filter even? (range (count word)))))
;; {0 \h, 2 \l, 4 \o}
get-in can be used to extract values from deeply nested data structures containing
supported collection types (typically a mix between map and vector). Data in Json
format, for instance, is used to encode information between data services and tends to
be arbitrarily nested. In the following example we receive a list of financial products
ordered by lowest legal fees. The list have been translated from Json (JavaScript
Object Notation) to Edn (Extensible Data Notation) using one of the several library
available and is ready for processing:
(def products ; ❶
[{:product
{:legal-fee-added {:rate "2%" :period "monthly"}
:company-name "Together"
:fee-attributes [["Jan" 8] 99 50 13 38 62]
:initial-rate 9.15
:initial-term-label {:bank "provided" :form "Coverage"}
:created-at 1504556932727}}
{:product {}}]) ; second product elided
❶ products is a small sample of a much larger data structure received from another service after Json
serialization. It contains an initial vector of products and each product is detailed with additional nested
vectors and maps.
We can inspect the product at the top of the list using get-in:
(defn lowest-rate [products] ; ❶
(get-in products [0 :product :legal-fee-added :rate]))
(lowest-rate products)
;; "2%"
❶ We structure the function lowest-rate around the list of products. get-in accesses the element
at index "0" first and the rest of the keys extract specific information a few levels deeper, like the rate of
the legal fees.
See also
• get does not offer access to nested data structures, stopping at the first nesting
level. Prefer get to get-in if the key sequence contains just a single item.
• keys retrieves the collection of keys from a map instance.
• zipmap creates a new map instance starting from two sequential collections of keys
and values respectively.
Performance considerations and implementation details
⇒ O(n) Linear
Both select-keys and get-in show linear behavior in the number of selection keys.
The select-keys implementation needs to build the output by gradually introducing keys.
Currently it does not take advantage of transients (something which is already captured
as an enhancement for a future Clojure release 179 ). We could go ahead and produce
such an implementation and compare it against the current one:
(require '[criterium.core :refer [bench]])
179
https://dev.clojure.org/jira/browse/CLJ-1789
❶ This version of select-keys is called select-keys2 and is based on transduce. It iterates "keyseq"
and uses each key to call find against the input map "m". Each entry is then added to
the transient results using conj!. Results are finally transformed back into a persistent data structure
using completing and persistent!.
❷ The benchmark tests an average size map of 20 keys and a selection of 7 keys.
❸ We can see that select-keys2 has a small advantage over the standard version.
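The select-keys2 implementation the annotations walk through is not reproduced in this excerpt. A sketch along the described lines (transduce, find, conj! into a transient, persistent! on completion) could be:

```clojure
(defn select-keys2 [m keyseq]
  (transduce
    (keep #(find m %))                 ; look up each requested key in "m"
    (completing conj! persistent!)     ; conj! each entry, persistent! at the end
    (transient {})
    keyseq))

(select-keys2 {:a 1 :b 2 :c 3 :d 4} [:a :c :x])
;; {:a 1, :c 3}
```

Keys that are not found are simply skipped by keep, matching the behavior of the standard select-keys.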
11.3 Processing
11.3.1 assoc, assoc-in and dissoc
functions since 1.0
(assoc
([map key val])
([map key val & kvs]))
(assoc-in
[map ks val])
(dissoc
([map])
([map key])
([map key & ks]))
assoc, assoc-in and dissoc are fundamental operations on maps (they also work on
other associative data structures such as vectors).
assoc replaces the value of an existing key or inserts a new key if one doesn’t exist:
(def m {:a "1" :b "2" :c "3"})
(assoc m :b "changed") ; ❶
;; {:a "1" :b "changed" :c "3"}
m ; ❷
;; {:a "1" :b "2" :c "3"}
❶ The effect of assoc in the presence of an existing key is to replace its value.
❷ As in all other persistent Clojure data structures, assoc returns a new instance of the input with
changes on top, any references to the same input data structure remain unchanged.
❶ We use dissoc on the original var "m" which contains a reference to a previously created map.
❶ The vector [:c :x1 :x2] identifies a descending path in the nested data structure "m". The value
"z2" is swapped with "z1" after subsequent extractions of the nested maps from the input collection.
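The examples these two annotations refer to are missing from this excerpt; a minimal sketch, assuming a nested map "m" like the following, could be:

```clojure
(def m {:a "1" :b "2" :c {:x1 {:x2 "z1"}}})

(dissoc m :b)                      ; removal of a top-level key
;; {:a "1", :c {:x1 {:x2 "z1"}}}

(assoc-in m [:c :x1 :x2] "z2")     ; swap a value deep in the nesting
;; {:a "1", :b "2", :c {:x1 {:x2 "z2"}}}
```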
Contract
The first input argument "map" is common to all three functions. It’s an associative
data structure, so that (associative? map) is true. When nil it defaults to the empty
map. It’s a mandatory argument.
assoc input
• "key": for maps, it can be any object, while for vectors it has to be an integer. To
match an existing key, "key" is compared using = equality semantics. It’s a
mandatory argument.
• "val": is the value to assoc to "key". It’s a mandatory argument.
• "kvs": is any additional key-value pair. It’s optional but, if present, it contains an
even number of items.
assoc-in input
• "ks" is a sequence of keys. Each key is used to subsequently look-up at the value
that was found at the previous key. When empty or nil, it is equivalent to [nil],
which is nil key at the first level of the input "map". In case of nested vectors,
❶ "m" is a partially built map. We can see that each step executes some operation on the map and the
vertical arrangement helps keep track of the flow.
❷ Maps are unordered by default and we can see that the processed map is returned without any
specific key or value ordering.
assoc can also be used to gradually build a hash-map with reduce. This can be useful
when the values can be derived from the keys, for example to retrieve data using an
"id":
(defn lookup [id] ; ❶
{:index "backup"
:bucket (rand-int (* 100 id))})
❶ lookup simulates the interaction with a service or database to retrieve structured information by id.
❷ Likewise, request contains a random selection of ids that in a real-life application would probably come
from some user interaction.
❸ reduce accepts an empty map literal {} to start creating new results. reduce iterates on the content
of the request to pass the content of the map so far and the next item to assoc. The key is the item
itself, while the value is retrieved from the lookup service.
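The request definition and the reduce call described in the annotations are not visible here; a sketch could be (the ids below are made up):

```clojure
(def request [4 7 9])   ; hypothetical ids from user interaction

(reduce (fn [m id] (assoc m id (lookup id))) {} request)
;; {4 {:index "backup", :bucket ...}, 7 {...}, 9 {...}} — buckets are random
```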
assoc-in tends to be used with deeply nested data structures, especially those
mixing maps and vectors that can be traversed by assoc-in independently from the
nested type:
(def articles ; ❶
[{:title "Another win for India"
:date "2017-11-23"
:ads [2 5 8]
:author "John McKinley"}
{:title "Hottest day of the year"
:date "2018-08-15"
:ads [1 3 5]
:author "Emma Cribs"}
{:title "Expected a rise in Bitcoin shares"
:date "2018-12-11"
:ads [2 4 6]
:author "Zoe Eastwood"}])
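The assoc-in call the annotation ❷ describes is not shown in this excerpt; consistent with the before/after data, it would look like:

```clojure
(assoc-in articles [2 :ads 1] 3) ; move the second ad of the last article
```

Index 2 selects the last article, :ads its vector of positions, and index 1 the entry to change from 4 to 3. The second articles listing in the text shows the resulting structure (❸).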
(def articles
[{:title "Another win for India"
:date "2017-11-23"
:ads [2 5 8]
:author "John McKinley"}
{:title "Hottest day of the year"
:date "2018-08-15"
:ads [1 3 5]
:author "Emma Cribs"}
{:title "Expected a rise in Bitcoin shares"
:date "2018-12-11"
:ads [2 3 6] ; ❸
:author "Zoe Eastwood"}])
❶ articles is a simplified portion of a larger map which contains several levels of nesting in the form
of vectors (for items that need to be listed) or maps (for items that can be retrieved by key). The ":ads"
key contains the positions of the ads in the articles, for example after the 2nd, 5th and 8th paragraph.
❷ We want to alter the ads position for the last article, moving one ad up to a position after the 3rd
paragraph instead of the 4th.
❸ We can see the change from [2 4 6] to [2 3 6].
It’s worth remembering that assoc is an effective option to update vectors and change
their content. We used assoc in identity for instance, to change the availability of
cashiers in a queue modeled as a vector. assoc also works as an alternative to conj for
those situations where an element could be either replaced or added to the vector:
(def pairs [[:f 1] [:t 0] [:r 2] [:w 0]]) ; ❶
❶ The second element of each pair is a number that can be used as an index for the
following assoc operation.
❷ Using destructuring, we access the first "item", the "index" and the entire pair as "v". We can
then assoc the pair "v" using the "index" as key and the "item" as value.
❸ The result shows the different transformations involved: when the index is 1 the pair is repeated, when
the index is 0 nothing happens and when the index is 2 the pair becomes a triplet.
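The transformation the annotations ❷ and ❸ describe is not shown; a sketch consistent with them could be:

```clojure
(mapv (fn [[item index :as v]]   ; destructure item, index and the whole pair
        (assoc v index item))
      [[:f 1] [:t 0] [:r 2] [:w 0]])
;; [[:f :f] [:t 0] [:r 2 :r] [:w 0]]
```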
dissoc-in
There is no dissoc-in in the standard library, but in this extended example we are going to create one.
Differently from assoc, which works naturally for both maps and vectors, dissoc is tricky for vectors,
which need shifting to be shortened. But let’s proceed step by step and solve dissoc-in for maps first:
A different and elegant version of dissoc-in that works for maps only is the following:
But this version of dissoc-in fails when the last dissoc executes on a vector instead of a map:
To handle vectors similarly to assoc-in, we need to treat them differently. The section related
to subvec also contains an example function called remove-at that can be used to remove an element
at the given index in a vector. We can use remove-at to dissoc from a vector after checking the type of
the collection:
We can see that the last solution solves all combinations of nesting.
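None of the sidebar’s code appears in this excerpt. A sketch of a dissoc-in along the lines described (dissoc for maps, a remove-at helper for vectors) could be:

```clojure
(defn remove-at [v idx]                        ; drop the element at idx from a vector
  (into (subvec v 0 idx) (subvec v (inc idx))))

(defn dissoc-in [m [k & ks]]
  (if ks
    (assoc m k (dissoc-in (get m k) ks))       ; recurse down the path
    (if (vector? m)
      (remove-at m k)                          ; vectors need shifting
      (dissoc m k))))                          ; maps can dissoc directly

(dissoc-in {:a {:b [1 2 3]}} [:a :b 1])
;; {:a {:b [1 3]}}
```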
See also
• assoc! is an assoc version specifically designed for transients.
• get-in and update-in are similar functions to assoc-in to handle reading/writing of
nested data structures. Differently from assoc-in, update-in accepts a function of
the old value to produce the new one.
⇒ O(log2 N) sorted-map
⇒ O(n) defrecord
Many Clojure persistent collections, such as maps and vectors, are built on HAMTs
(Hash Array Mapped Tries), a shallow bit-mapped tree data structure. The most common
operations on HAMTs, like traversing or updating, are O(log32 N) with "N" the number
of items. The only exception is sorted-map, which is instead implemented as a binary tree
(more precisely, a self-adjusting variant called a Red-Black tree). assoc for sorted-
map is still logarithmic but with a constant factor of 2 instead of 32: in practice, there is
a small difference (large maps) or no difference (small maps).
The assoc-in profile is different from assoc, because it also has a linear dependency on the
length of "ks", which dominates the tree traversal. In practice, "ks" normally
contains just a few items, as it represents the level of nesting of the input.
In the following chart we are going to compare assoc on some supported data
structures. The benchmark executes assoc on different map types sizes: 10 keys, 50
keys and 100 keys. The key is selected so it exists in the current structure, replacing the
existing value:
We can see that defrecord is an order of magnitude slower than the other types. This can
be explained by remembering that defrecord builds its associative nature on top of Java
class attributes: assoc is implemented by matching the required key in
a condp condition, resulting in linear behavior. Since normal usage
of defrecord includes just a few keys, this shouldn’t be a concern for performance.
11.3.2 update and update-in
function since 1.7 (update), 1.0 (update-in)
(update ; ❶
([m k f])
([m k f x])
([m k f x y])
([m k f x y z])
([m k f x y z & more]))
(update-in
([m ks f & args]))
❶ The many arities offered by update are a performance optimization for the most frequent calls.
update and update-in are both designed to alter values inside associative data
structures (those implementing direct lookup access are hash-map, sorted-map, array-
map, records, structs, vector, sub-vectors and native-vectors). Differently from assoc
and assoc-in, they accept a function of the old value into the new one:
(update {:a 1 :b 2} :b inc) ; ❶
;; {:a 1, :b 3}
(update-in {:a 1 :b {:c 2}} [:b :c] inc) ; ❷
;; {:a 1, :b {:c 3}}
❶ inc is a function of one argument. update invokes inc with the current value of the ":b" key and
replaces it with its increment.
❷ update-in takes a list of keys [:b :c] as input. Each key is used in turn to get nested associative
collections.
Both functions are built on top of assoc, so similar considerations apply for the input
types and performance.
Contract
• "m" is an associative data structure. It implies that (associative? m) is true. It’s
a mandatory argument.
• "k": when "m" is a map type (hash-map, sorted-map, array-map, records, structs or
native Java maps), "k" can be any object. When "m" is a vector type (vector, sub-
vectors or native-vectors) "k" must be an integer not exceeding 2^32. To match an
existing key, "k" is compared using = equality semantics.
• "ks" is a sequence of keys. Each key follows the same contract as "k", applying to
the type of associative data structure found at that level of nesting.
• "f" is a function from a generic object into another generic object. "f" is invoked
with the value found at the relevant key.
• "x", "y", "z", "more" and "args" are additional arguments for the function "f"
(other than the value found at key which is passed as the first argument).
Notable exceptions
• UnsupportedOperationException is thrown if you forget to wrap "ks" in a list or
vector when using update-in.
• IllegalArgumentException is thrown when using a key that is not an integer to access a vector,
including when the vector is nested with update-in.
Output
• update returns the input data structure "m" with the value "v" indicated by "k"
replaced with the result of invoking (f v). If "k" does not exist and "m" is a map,
"k" is created and the result of (f nil) is used as the new value. If "m" is a vector,
then "k" needs to be within 0 and (inc (count m)).
• update-in returns the input data structure "m" with the value "v" indicated by "ks"
replaced with the result of invoking (f v). "ks" is interpreted so the first key is
used to get a value in "m", the second key is used to get a value in the previous
value and so on, up to the last key. The same considerations as update apply in terms of
the type of the keys in "ks", which should be integers for vectors and any object for maps.
Examples
Differently from assoc, update and update-in can be used to "upsert" new keys
(update or insert, instead of replace or insert). This model applies well to counters, or in
general any update that requires the presence of the previous value. fnil can be used
with update to provide a default when the key does not exist:
(def words {"morning" 2 "bye" 1 "hi" 5 "gday" 2}) ; ❶
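The update call the example builds toward is missing here; a sketch of the counter "upsert" with fnil, reusing the words map above, could be:

```clojure
(update words "hello" (fnil inc 0))   ; missing key: (fnil inc 0) turns nil into 0
;; {"morning" 2, "bye" 1, "hi" 5, "gday" 2, "hello" 1}
```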
update on vectors follows a similar pattern, allowing the addition of a new element at
the tail when "k" is equal to the length of the vector:
(update [:a :b :c] 3 (fnil keyword "d")) ; ❶
;; [:a :b :c :d]
❶ The index "3" is allowed for update, but it’s out of bounds for [:a :b :c] where the only available
indexes are 0 (:a), 1 (:b) and 2 (:c).
❷ The addition of a new item in the vector only works right after the last item. Trying to access past the
size of the vector causes IndexOutOfBoundsException.
In the following example, a list of products contains a key dedicated to store how many
items are in stock. When a product is sold, we want to decrease the number:
(def products ; ❶
{"A011" {:in-stock 10
:name "Samsung G5"}
"B032" {:in-stock 4
:name "Apple iPhone"}
"AE33" {:in-stock 13
:name "Motorola N1"}})
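The sale function used below is not shown in this excerpt; a sketch consistent with annotation ❷ could be:

```clojure
(defn sale [products product-id]                    ; decrease the stock of one product
  (update-in products [product-id :in-stock] dec))
```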
(get-in ; ❸
(sale products "B032")
["B032" :in-stock])
;; 3
❶ The list of products is a sample version of a bigger list coming from the database. Products are
located by key and the value contains the details, including how many items are still in stock.
❷ We use update-in to access a product by key and then update the ":in-stock" key, decreasing its
value by 1.
❸ get-in is useful to focus on the newly updated :in-stock key.
update (but also assoc and their *-in variants) are often seen in conjunction
with swap! to change the atom during a compare and swap (CAS) transaction 180 . With
reference to the previous example, we could now allow concurrent sales:
(def products ; ❶
(atom {"A011" {:in-stock 10
:name "Samsung G5"}
"B032" {:in-stock 4
:name "Apple iPhone"}
"AE33" {:in-stock 13
:name "Motorola N1"}}))
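total-products, sale! and sale-simulation! are not visible in this excerpt; a sketch consistent with the annotations (names and details assumed) could be:

```clojure
(defn total-products [products]      ; sum all the :in-stock counts
  (reduce (fn [n [_ v]] (+ n (:in-stock v))) 0 products))

(defn sale! [products product-id]    ; side-effecting version using swap!
  (swap! products update-in [product-id :in-stock] (fnil dec 0)))

(defn sale-simulation! [products n]  ; n concurrent sales via pmap
  (dorun (pmap (fn [id] (sale! products id))
               (take n (cycle ["A011" "B032" "AE33"])))))
```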
(total-products @products) ; ❸
;; 27
180
Compare and swap (CAS) is the semantics used by Clojure concurrency primitives: atom, ref and agent. It consists of
attempting an unsynchronized mutating operation followed by a check on the original value. Only if the original value is
still intact is the new value committed to the atom.
(total-products @products) ; ❻
;; 21
❶ products is an in-memory view of the warehouse state. It wraps the state in atom to enable
concurrency control.
❷ total-products is used to count how many products, in total, are still in stock. It uses reduce to
count over all the ":in-stock" keys.
❸ We can see that we have 27 items in the warehouse before sales are taking place.
❹ sale! is similar to the function we defined before, but it is now side-effecting altering the state of the
products. swap! has a similar call format to update-in: it takes a function of an old value into a new
value, plus any additional arguments to be passed to update-in when it’s executed. The effect is
similar to the thread first macro ->: products is placed as the first argument of update-in and the
vector of keys and fnil default follows.
❺ sale-simulation! simulates the interaction of multiple clients concurrently. We use pmap to start
several sales in parallel. Note the conventional "!" bang symbol at the end of the function name to
denote a side-effecting function.
❻ If we check the number of total products, we can see that the right amount of products is consistently
sold. By using swap! we make sure that the operation always decreases the stock number by one,
as the transaction repeats if another thread was able to decrease the number at the same time.
See also
• fnil was mentioned a few times in the examples. It’s not an alternative to update,
but it works well to provide a default for missing keys.
• assoc produces a similar effect to update where the new value does not depend on
the old. Same applies to assoc-in compared to update-in.
• get and get-in retrieve the value without changes.
Performance considerations and implementation details
⇒ O(log2N) sorted-map
⇒ O(n) defrecord
update and update-in are built on top of assoc. Supported types are the same, as well
as the performance profile. The reader is invited to visit assoc’s performance section
for additional details.
Like assoc, update and update-in are in general well-performing operations, also
considering they don’t operate on sequences, preventing unwanted linear behavior.
11.3.3 merge and merge-with
functions since 1.0
merge and merge-with are functions useful to merge one or more maps together:
❶ merge-with allows to decide what should be done in case the target key already exists at destination.
In this case the current value of the key ":a" is added to the new value.
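The introductory example is missing from this excerpt; a sketch of both functions could be:

```clojure
(merge {:a 1 :b 2} {:b 3 :c 4})   ; last value wins on conflicting keys
;; {:a 1, :b 3, :c 4}

(merge-with + {:a 1 :b 2} {:a 3}) ; conflicting :a values are added instead
;; {:a 4, :b 2}
```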
Contract
Input
• "maps" is any number of map types (hash-map, sorted-map, array-
map, records, structs, but not native java.util.HashMap). The first map in "maps" is
the "target" map and determines the return type. Sequential types
(like vectors or lists) do not generate exceptions, but they don’t result in a proper
merging of data structures, with one or more of them forming a nested level.
Types other than map types are therefore not supported.
• "f" is a mandatory argument for merge-with. If a key already exists in the results,
"f" is invoked with two arguments, the current value of the key and the new value
of the key. The result of "f" is used in place of the old value.
Notable exceptions
• ClassCastException when the target map is not
a clojure.lang.IPersistentCollection.
• IllegalArgumentException when the target map is followed by elements that are
not pairs, for example (merge {} [1 2 3]). This is because merge is attempting
to transfer the sequential collection into the target map but there is a key ("3") that
is missing its value.
Output
• No arguments or nil arguments: returns nil.
• Single argument other than nil: returns the argument itself.
• In all other cases, merge attempts to transfer the content of (rest
maps) into (first maps), copying each key-value pair. In case of conflicting keys,
the value corresponding to the last key to enter the output overwrites the previous
value.
• merge-with output is the same as merge but the key conflict resolution is handled
by the custom function "f".
The type of the output is the same as (first maps), with the following specific
rules:
• When the target is an array-map the output can auto-promote to hash-
map depending on the number of keys (usually beyond 10).
• When the target is a sorted-map, keys need to have the same types to be
comparable.
Examples
The book uses merge and merge-with while describing other functions. Here’s a list of
the examples the reader can have a quick look at:
• fn contains an example of merge on an arbitrary number of input maps. apply can
be used with merge if the input is in the form of a list of maps.
• ->> contains a similar example to merge parameters after parsing them from an
URL.
• merge and merge-with are also typical in combining algorithms. fold contains
several examples of combining functions based on merge-
with using reducers/monoid like (r/monoid merge (constantly {})).
If the values in the input are all vectors, we can use functions like into to store all
values for the same key. This can be useful to group values together:
(let [m1 {:id [11] :colors ["red" "blue"]} ; ❶
m2 {:id [10] :colors ["yellow"]}
m3 {:id [31] :colors ["brown" "red"]}]
(merge-with into m1 m2 m3))
;; {:id [11 10 31], :colors ["red" "blue" "yellow" "brown" "red"]}
❶ Note that ":id" and ":colors" for all maps "m1", "m2", "m3" need to be vectors.
❷ The result of (merge-with into) is a map with the same keys and the union of all values from all
maps for that key.
The following merge-into function can be used to move keys at different nesting
levels inside a map. In this example we receive a product that contains keys at different
nesting levels. We want to group them together all at top level:
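The merge-into definition (annotation ❶ below) does not appear in this excerpt; a sketch consistent with the annotations could be:

```clojure
(defn merge-into [target-key ks]   ; higher-order: returns a fn of the input map
  (fn [m]
    (merge (get m target-key)      ; the nested map becomes the merge target
           (select-keys m ks))))   ; the lifted keys are merged into it
```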
(def product-merge ; ❷
(merge-into :product [:fee-attributes :created-at]))
(def product ; ❸
{:fee-attributes [49 8 13 38 62]
:product {:visible false
:online true
:name "Switcher AA126"
:company-id 183
:part-repayment true
:min-loan-amount 5000
:max-loan-amount 1175000}
:created-at 1504556932728})
(product-merge product) ; ❹
;; {:visible false,
;; :online true,
;; :name "Switcher AA126",
;; :company-id 183,
;; :part-repayment true,
;; :min-loan-amount 5000,
;; :max-loan-amount 1175000,
;; :fee-attributes [49 8 13 38 62],
;; :created-at 1504556932728}
❶ merge-into is designed as a higher-order function. It returns a function of the map to be transformed,
given the key that corresponds to the target map for merge and the list of keys to lift.
❷ The result of invoking merge-into can be further assigned globally so it can be reused from different
functions. We call this specialization product-merge.
❸ Here’s an example of input product. The :fee-attributes and :created-at keys really belong to
the :product, which already contains other relevant keys.
❹ product-merge transforms the input into a new map which contains everything that was previously
under :product, including :fee-attributes and :created-at that were not.
In the next example we are going to use merge-with to implement addition on complex
numbers. Complex numbers have a real and an imaginary part that we could implement as
keys in a map-like type. Clojure records give us a syntactically appealing form to deal
with complex numbers, including the option to expand the set of available operations
in the future:
(defprotocol IComplex ; ❶
(sum [c1 c2]))
❶ The protocol IComplex defines a suitable interface to collect complex number operations (the "I"
prefix in the name reminds us this is an interface).
❷ The sum of a complex number defined as Clojure record can be implemented using merge-with.
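The record and its sum implementation (annotation ❷) are missing here; a sketch could be (the Complex name and its fields are assumptions):

```clojure
(defrecord Complex [real img]
  IComplex                             ; protocol from the listing above
  (sum [c1 c2] (merge-with + c1 c2)))  ; add :real and :img pairwise

(sum (->Complex 1 2) (->Complex 3 4))
;; => Complex record with :real 4 and :img 6
```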
Multi-type merge
When merging maps, merge and merge-with offer two strategies to deal with different values for the
same key: merge simply replaces the old value with the new value, while merge-with accepts a function
of the old and the new value to decide how to replace the old. If the values in all maps are homogeneous
(for example all values are vectors) then we could use (merge-with into) to collapse all values into
the same key. But what if values are not vectors or are of different types?
We could write a custom function for merge-with with conj to store new values in a vector:
(let [m1 {:a 1 :b 2}
m2 {:a 'a :b 'b}
m3 {:a "a" :b "b"}]
(merge-with (fn [v1 v2]
(if (vector? v1) ; ❶
(conj v1 v2)
[v1 v2]))
m1 m2 m3))
❷ If the metadata is not present, we know for sure that the incoming value needs wrapping in a
new vector, with the additional metadata to distinguish it from any other vector.
❸ We can see that the results are now the expected ones.
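The metadata-based refinement the annotations describe (❷, ❸) is missing from this excerpt; a sketch could be (the ::acc marker is an assumption):

```clojure
(merge-with (fn [v1 v2]
              (if (::acc (meta v1))                 ; only our accumulator vectors grow
                (conj v1 v2)
                (with-meta [v1 v2] {::acc true})))  ; wrap, marking with metadata
            {:a [1 2] :b 2}
            {:a 'a :b 'b}
            {:a "a" :b "b"})
;; {:a [[1 2] a "a"], :b [2 b "b"]}
```

The metadata distinguishes vectors created by the merge from vectors that were already values in the input, so an original [1 2] is preserved as a single element instead of being flattened into the accumulator.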
See also
• group-by can be used to collapse a list of collections into a map based on some
feature of the data. While group-by is more about grouping by key, merge is more
about how values are combined. Compared to merge-with, group-by collects the
entire input as values of the resulting map, which can be an unwanted feature for
large inputs. With merge-with there is more flexibility to decide what part of the
input ends up in the final result.
Performance considerations and implementation details
❶ Criterium 181 is the library often used throughout the book to benchmark code snippets.
❷ The benchmark creates a hash-map with 1000 keys and another one of the same size but different
keys, without overlap.
❸ The same benchmark is attempted on a variation of merge that uses transients, with a visible speed
up.
181
https://github.com/hugoduncan/criterium/
❶ First of all, let’s benchmark normal merge-with using 2 maps with 1000 non-overlapping keys.
❷ The new version of merge-with* follows the standard core implementation with some cosmetic
changes. The use of transients translates into using assoc! instead of normal assoc.
❸ Note that we can’t use contains? directly on a transient because it is not supported. We can
use get instead, provided we use a sentinel value ::none to establish if the key exists (with a potential
value of nil) or doesn’t exist.
❹ The inner function mrge-into establishes the switch from normal hash-map to transients hash-maps.
❺ The call to persistent! happens at the end of the computation.
❻ The use of transient generates a visible speed up.
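The transient-based merge-with* the annotations walk through is not shown; a sketch along those lines (not necessarily the author’s exact code) could be:

```clojure
(defn merge-with* [f & maps]
  (when (some identity maps)
    (persistent!
      (reduce
        (fn [m1 m2]
          (reduce-kv
            (fn [m k v]
              (let [old (get m k ::none)]   ; contains? is unsupported on transients
                (assoc! m k (if (identical? old ::none) v (f old v)))))
            m1
            m2))
        (transient (or (first maps) {}))    ; switch to a transient accumulator
        (rest maps)))))                     ; persistent! only at the very end

(merge-with* + {:a 1 :b 2} {:a 3 :c 4})
;; {:a 4, :b 2, :c 4}
```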
11.3.4 reduce-kv
function since 1.4
(reduce ; ❶
(fn [m [k v]] (assoc m k (inc v)))
{}
{:a 1 :b 2 :c 3})
;; {:a 2, :b 3, :c 4}
(reduce-kv ; ❷
(fn [m k v] (assoc m k (inc v)))
{}
{:a 1 :b 2 :c 3})
;; {:a 2, :b 3, :c 4}
❶ Normal reduce for a map-type like array-map or hash-map requires a reducing function that
understands the next item is an entry formed by a key and a value. We use destructuring to get them
individually. Under the hood, reduce is forced into transforming the associative data structure into
a sequence first.
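The plain reduce version described in note ❶ would look like the following (a sketch consistent with the note):

```clojure
;; reduce over a map yields [key value] entries: destructuring in the
;; reducing function extracts the key and the value individually.
(reduce
 (fn [m [k v]] (assoc m k (inc v)))
 {}
 {:a 1 :b 2 :c 3})
;; {:a 2, :b 3, :c 4}
```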
❷ reduce-kv is dedicated to associative data structures, so the reducing function takes 3 arguments
instead of 2: the accumulator, the key and the value. What is not visible though, is that reduce-
kv takes a faster path to iterate the input.
reduce-kv allows the input data structure to provide a specific implementation through
the dedicated clojure.core.protocols/IKVReduce protocol. Compatible Clojure data
structures already implement IKVReduce and other associative types could extend the
same abstraction. We are going to see how to extend reduce-kv to other map-like types
in the examples.
Contract
Input
• "f" needs to be a function of 3 arguments that is expected to return the
accumulation of the results so far. It is a mandatory argument.
• "init" is the value that is passed to "f" as first argument during the first call. This is
usually an empty collection (not necessarily associative) that reduce-kv is
supposed to fill with results.
• "coll" can be one of the supported types or nil. Supported types are those
implementing clojure.lang.IKVReduce or clojure.lang.IPersistentMap. The
former interface indicates a custom implementation that usually performs better.
The following table contains a summary of what is supported and does not throw an
exception:
Table 11.1. Summary of the collection types supported by reduce-kv (including nil as an
exceptional case).
NOTE associative collections like subvectors or java.util.Map are not supported (while in general
they are for other functions in this chapter).
Notable exceptions
• IllegalArgumentException is thrown when there is no reduce-kv implementation
for a specific type (for example list).
Output
• returns: the result of applying "f" to "init" and the key and value of the first entry
in "coll", followed by applying "f" again to the previous result and the second
entry, and so on, until there are no more entries.
Examples
reduce-kv is useful to process maps, for example to update all values, keys or both.
We could, for example, transform keys representing environment variables into
keywords following normal Clojure naming conventions:
(def env ; ❶
{"TERM_PROGRAM" "iTerm.app"
"SHELL" "/bin/bash"
"COMMAND_MODE" "Unix2003"})
(reduce-kv ; ❸
(fn [m k v] (assoc m (transform k) v))
{}
env)
;; {:term-program "iTerm.app", ; ❹
;; :shell "/bin/bash",
;; :command-mode "Unix2003"}
❶ env contains a list of environment variables that conventionally are uppercases and underscored.
❷ The transformation we want to apply is a composition of toLowerCase, replacing underscores with
dashes and converting to a keyword. some-> is a good idea in case the transformation receives nil,
which would generate a NullPointerException with string manipulation functions. The type hint helps
reinforce the expectations around the type of the input, as well as improving performance.
❸ The reduce-kv call is relatively straightforward. We are going to move the entries from the input map
into another map transforming the keys while doing so.
❹ We can see the transformed keys as we would expect them formatted in standard Clojure code.
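A transform consistent with note ❷ could be defined as follows (a reconstruction: the actual definition appears in a previous example in the book, which used interop .toLowerCase with a type hint; here the equivalent clojure.string functions are used):

```clojure
(require '[clojure.string :as s])

;; lower-case the key, swap underscores for dashes, make it a keyword;
;; some-> short-circuits safely when the input is nil.
(defn transform [^String k]
  (some-> k
          s/lower-case
          (s/replace "_" "-")
          keyword))

(transform "TERM_PROGRAM")
;; :term-program
(transform nil)
;; nil
```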
Environment variables like the ones from the example can be retrieved using the Java
interoperation call (System/getenv). The type returned by (System/getenv) is
java.util.Collections$UnmodifiableMap for which there is no default reduce-kv
implementation:
(reduce-kv
(fn [m k v] (assoc m (transform k) v))
{}
(System/getenv)) ; ❶
;; IllegalArgumentException No implementation of method: :kv-reduce of protocol:
#'clojure.core.protocols/IKVReduce found for class:
java.util.Collections$UnmodifiableMap
❶ reduce-kv doesn’t have a specific implementation for the kind of Map returned
by (System/getenv). transform was defined in the previous example.
(extend-protocol clojure.core.protocols/IKVReduce ; ❶
java.util.Map ; ❷
(kv-reduce [m f init]
(let [iter (.. m entrySet iterator)] ; ❸
(loop [ret init]
(if (.hasNext iter) ; ❹
(let [^java.util.Map$Entry kv (.next iter)]
(recur (f ret (.getKey kv) (.getValue kv))))
ret)))))
(reduce-kv
(fn [m k v] (assoc m (transform k) v))
{}
(System/getenv)) ; ❺
;; {:jenv-version "oracle64-1.8.0.121",
;; :tmux "/private/tmp/tmux-502/default,2685,2",
;; :term-program-version "3.1.5",
;; :github-username "reborg"
;; ...}
(reduce-kv
(fn [m k v] (assoc m (transform k) v))
{}
(System/getProperties)) ; ❻
;; {:java.vm.version "25.121-b13",
;; :java.specification.name "Java Platform API Specification",
;; :java.io.tmpdir "/var/folders/25/T/",
;; :java.runtime.name "Java(TM) SE Runtime Environment",
;; ...}
❷ We want to extend the protocol to the java.util.Map type using the dedicated extend-protocol
macro.
❸ The fastest way to iterate a Java Map is to get the entrySet (usually cached internally) which
provides an iterator instance. The iterator returns all entries sequentially.
❹ The main design of the iteration consists of a loop-recur instruction that runs while the
iterator’s .hasNext returns true. On each iteration we read the content of the entry, invoking "f" with the
result so far, the key and the value. The result of invoking "f" is conventionally the accumulated result,
which is used for the next iteration.
❺ The invocation of reduce-kv on (System/getenv) now results in the polymorphic call to the relevant
protocol extension, producing the expected results (truncated for brevity).
❻ Similarly, other Java types implementing the java.util.Map interface are now returning results, like
this call to (System/getProperties) for example.
(reduce-kv
(fn [m k v]
(if (> k 2)
(reduced m) ; ❶
(assoc m k v)))
{}
[:a :b :c :d :e])
;; {0 :a, 1 :b, 2 :c} ; ❷
❶ We choose an arbitrary condition based on one of the keys. When the condition is true, we return a
"reduced" result and skip the related assoc operation, signaling that we want to terminate
the reduction. reduce-kv knows how to interpret the signal and does not proceed any further.
❷ As expected, the resulting map is missing keys above the number 2.
(import 'java.util.LinkedHashMap)
(reduce-kv
(fn [m k v]
(if (= k :abort)
(reduced m) ; ❶
(assoc m k v)))
{}
(LinkedHashMap. {:a 1 :abort true :c 3})) ; ❷
❷ We can see that reduced is not handled by our previous protocol extension. Instead of stopping
after reaching a reduced item, we pass it to assoc, which fails.
To solve the problem, we need to enhance the protocol extension for java.util.Map types to handle an
element wrapped in a reduced object:
(extend-protocol clojure.core.protocols/IKVReduce
java.util.Map
(kv-reduce [m f init]
(let [iter (.. m entrySet iterator)]
(loop [ret init]
(if (.hasNext iter)
(let [^java.util.Map$Entry kv (.next iter)
ret (f ret (.getKey kv) (.getValue kv))]
(if (reduced? ret) ; ❶
@ret
(recur ret)))
ret)))))
(reduce-kv
(fn [m k v]
(if (= k :abort)
(reduced m)
(assoc m k v)))
{}
(LinkedHashMap. {:a 1 :abort true :c 3}))
;; {:a 1} ; ❷
❶ We repeat the same protocol extension from before, but this time, we check to see if we are passed
a reduced? element and in that case, stop recursion. We also need to unwrap the reduced item with
the "@" (dereference) reader macro.
❷ We can see that the result only contains keys up to the point where the ":abort" request was found.
See also
• reduce is the model that reduce-kv is inspired by. reduce should be considered for all
other non-associative data structures.
• “reduced, reduced?, ensure-reduced, unreduced” are the functions implementing
the signaling mechanism that was also discussed for reduce-kv.
Performance considerations and implementation details
The chart shows that records are an order of magnitude slower (but they are also unlikely to
contain that many keys). The fastest type in the benchmark is array-map but the other types
(except records) follow closely.
Memory allocation depends on the reducing function: the generation of a map from
another map where all the keys are preserved is linear in space. reduce-
kv (like reduce) is not lazy and unless interrupted by reduced, it processes all items
independently from how many are consumed downstream.
❶ There are no changes to the input map, as the keys are not of type string.
❷ Likewise, stringify-keys does not transform the keys if the key type is not keyword.
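The two notes above refer to calls like the following (reconstructed examples, since the original listing is not shown here):

```clojure
(require '[clojure.walk :refer [keywordize-keys stringify-keys]])

;; keywordize-keys only transforms string keys: with no string keys
;; in the input, the map is returned unchanged.
(keywordize-keys {1 "one" 2 "two"})
;; {1 "one", 2 "two"}

;; stringify-keys only transforms keyword keys, so this map is
;; likewise unchanged.
(stringify-keys {1 "one" 2 "two"})
;; {1 "one", 2 "two"}
```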
(def products ; ❶
[{"type" "Fixed"
"bookings" [{"upto" 999 "flat" 249.0}]
"enabled" false}
{"type" "Variable"
"bookings" [{"upto" 200 "flat" 20.0}]
"enabled" true}])
(keywordize-keys products)
;; [{:type "Fixed" ; ❷
;; :bookings [{:upto 999 :flat 249.0}]
;; :enabled false}
;; {:type "Variable"
;; :bookings [{:upto 200 :flat 20.0}]
;; :enabled true}]
WARNING both functions accept all map types (hash-map, array-map, sorted-map, records and structs) but
they always return an array-map (or a hash-map for bigger inputs, following the array-map auto-
promoting feature).
11.4.2 clojure.set/rename-keys
The name rename-keys is quite self-explanatory. Given an input map and a dictionary map,
it renames keys according to the content of the dictionary:
(require '[clojure.set :refer [rename-keys]]) ; ❶
❶ rename-keys is a public function inside the clojure.set namespace which is part of the standard
library.
❷ Each matching key in the first map is replaced by the corresponding value in the second map.
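Note ❷ refers to a call like the following (a reconstructed example with hypothetical key names):

```clojure
(require '[clojure.set :refer [rename-keys]])

;; each key of the input map that matches a key in the dictionary
;; is replaced by the corresponding dictionary value.
(rename-keys {:a 1 :b 2 :c 3} {:a :apple :b :banana})
;; {:c 3, :apple 1, :banana 2}
```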
rename-keys is a useful function for simple renaming of keys, for example when passing
from one data format to another. The renaming is limited to the first level and does not
loop over nested maps, if any. If there are clashing keys, the last key to be added
from the dictionary map wins:
(rename-keys {:a 1 :b 2 :c 3} {:c :a :a :b :b :c}) ; ❶
;; {:a 3, :b 2}
❶ In this example, :c is replaced with :a, but the input map contains :a already. The old {:a 1} pair is
effectively replaced by a new {:a 3} pair which is equivalent to the previous value of :c with the
replaced key :a.
There are some restrictions dependent on the type of input map to consider. Let’s start
from records:
(defrecord A [a b c])
(type *1) ; ❷
;; clojure.lang.PersistentArrayMap
❶ After creating a simple record "A" of 3 fields :a, :b and :c, we ask rename-keys to change some keys.
❷ The operation is successful but the type of the results is not a record but array-map.
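The full example behind the two notes might look like this (a reconstruction of the missing call):

```clojure
(require '[clojure.set :refer [rename-keys]])

(defrecord A [a b c])

;; renaming a basis field of a record succeeds, but dissoc-ing a
;; basis key demotes the record to a plain map.
(rename-keys (->A 1 2 3) {:a :z})
;; {:b 2, :c 3, :z 1}

(type (rename-keys (->A 1 2 3) {:a :z}))
;; clojure.lang.PersistentArrayMap
```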
We can use rename-keys on a sorted-map provided the replacements are of the same
type as the keys:
(rename-keys (sorted-map :a 1 :b 2 :c 3) {:a :z}) ; ❶
;; {:b 2, :c 3, :z 1}
❶ We use a sorted-map as input for rename-keys. The operation completes successfully and the
returned type is again a sorted-map.
❷ We need to be careful to use the proper replacement type, as once the sorted-map is created, it
requires comparable keys.
Finally, a note on structs. Since rename-keys first removes replacement keys using
dissoc, structs throw errors, as removing keys that are part of the definition is not
allowed:
(rename-keys (struct (create-struct :a :b :c) 1 2 3) {:a 9}) ; ❶
;; RuntimeException Can't remove struct key
11.4.3 clojure.set/map-invert
map-invert swaps keys and values in a map:
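A basic call (reconstructed, since the introductory listing is not shown here) illustrates the swap:

```clojure
(require '[clojure.set :refer [map-invert]])

;; keys become values and values become keys
(map-invert {:a 1 :b 2 :c 3})
;; {1 :a, 2 :b, 3 :c}
```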
If there are identical values in the input map, the last instance of the value is the one
inverted:
(map-invert (zipmap [0 1 2 3] [0 0 0 0])) ; ❶
;; {0 3}
❶ On a small array-map (one that doesn’t cross the threshold to become a hash-map) the iteration order
of map-invert is easy to predict: each pair overwrites the previous one at key 0, so the last pair {3 0}
wins.
Special cases include inverting the empty map or other types of empty sequential
collection:
(map-invert {})
(map-invert [])
(map-invert ())
(map-invert "")
;; {} ; ❶
❶ All these examples return the same empty map, even though the input type is not necessarily a map type.
All map types can be inverted. In the following examples, all the map-invert calls return the
same result:
(map-invert (hash-map :a 1 :b 2 :c 3))
(map-invert (array-map :a 1 :b 2 :c 3))
(map-invert (sorted-map :a 1 :b 2 :c 3))
(map-invert (struct (create-struct :a :b :c) 1 2 3))
(defrecord A [a b c])
(map-invert (A. 1 2 3))
❶ All the map-invert invocations in this example return the same result.
❶ scramble-key is a map from char to char pairing up every letter in the alphabet with another random
one.
❷ scramble uses the scramble-key to obfuscate the content of a sentence.
❸ unscramble can revert the obfuscation effect using the randomized letter (what appears as the value
of scramble-key) as a key. We can quickly obtain this effect by using map-invert.
❹ The obfuscated text is transformed back into the correct clear text.
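The notes above can be sketched as follows. This is a reconstruction: scramble-key, scramble and unscramble are the names mentioned in the notes, but the bodies are illustrative and the key is random on each evaluation:

```clojure
(require '[clojure.set :refer [map-invert]])

(def alphabet "abcdefghijklmnopqrstuvwxyz")

;; pairs every letter in the alphabet with another random one
(def scramble-key
  (zipmap alphabet (shuffle (seq alphabet))))

;; obfuscate a sentence; non-letter characters pass through unchanged
(defn scramble [text]
  (apply str (map #(get scramble-key % %) text)))

;; revert the obfuscation by inverting the key with map-invert
(defn unscramble [text]
  (let [k (map-invert scramble-key)]
    (apply str (map #(get k % %) text))))

(unscramble (scramble "meet me at noon"))
;; "meet me at noon"
```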
12
Thanks to Rachel Bowyer for contributing this chapter.
Vectors
Clojure’s vector is one of the standout features of the language: performant, immutable
and with a convenient literal syntax. Back in 2008 when Clojure launched there was
nothing else quite like it, and it set Clojure apart from earlier LISPs. Since then other
functional languages such as Scala and Haskell have added their own immutable
vector.
Clojure’s vector stores elements sequentially, indexed by zero based integers. It
provides efficient index based read and write, and also append. It supports efficient
delete from the tail of the vector (with pop), but not from other locations (for which the
best workaround is to use “subvec”).
The literal syntax for a vector consists of merely enclosing a space separated list of the
elements within a pair of square brackets.
[:a :b :c]
;; => [:a :b :c]
As well as being a data structure, a vector is also a function that looks up a value. It
takes one argument, the zero based index, and if it is out of range then
an IndexOutOfBoundsException is thrown.
(ifn? []) ❶
;; true
([:a :b :c] 2) ❷
;; => :c
([:a :b :c] 3) ❸
;; IndexOutOfBoundsException clojure.lang.PersistentVector.arrayFor
(PersistentVector.java:158)
❶ ifn? verifies that a vector implements the function interface and can therefore be invoked like a
function.
❷ Invoking the vector with the index 2 returns the element at that zero-based position.
❸ Invoking the vector with an out-of-range index like 3 throws an IndexOutOfBoundsException.
WARNING Although vectors can behave like a sequence and thus used as an argument to seq, they are
not a sequence type themselves: seq? returns false. The implication for vectors is that
all sequence operations work on them, but the vector is implicitly transformed into a sequence
first.
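The warning above can be verified directly at the REPL:

```clojure
(seq? [1 2 3])   ;; false: a vector is not itself a sequence
(seq [1 2 3])    ;; (1 2 3): but it readily converts to one
(first [1 2 3])  ;; 1: sequence functions perform the conversion implicitly
```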
There are specialized versions of map, filter and reduce, called mapv, filterv and
reduce-kv respectively (reduce-kv works on associative data structures as well, hence
the "kv" key-value name). mapv and filterv return a vector rather than a
sequence. “reduce-kv” avoids internal transformations of the vector into a sequence to
output the final result. Vectors also offer a way to be efficiently reversed with rseq.
Likewise, subvec could be considered a specialized version of rest.
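A few quick calls illustrate the vector-oriented variants just mentioned:

```clojure
(mapv inc [1 2 3])          ;; [2 3 4]: a vector, not a sequence
(filterv odd? [1 2 3 4 5])  ;; [1 3 5]
(rseq [1 2 3])              ;; (3 2 1): efficient reversal
(subvec [1 2 3 4] 1)        ;; [2 3 4]: a "rest" for vectors
```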
182
Typical application of topological sorting is the ordering of the Java classpath so that classes are loaded only when all
their dependencies are satisfied. The Wikipedia entry has more examples: en.wikipedia.org/wiki/Topological_sorting
Figure 12.1. Part of a persistent vector containing the text to "Pride and Prejudice"
When an element is modified, the leaf node, the root node and all the nodes on the path from the leaf
node to root have to be copied, but crucially none of the other nodes. Further, the tree uses 32 way
branching at each non-leaf node and leaf nodes that contain up to 32 values, leading to a very shallow
tree. The entire text of "Pride and Prejudice" fits into a tree that is only 4 levels deep! Therefore, although
technically modifying an element in a vector with n elements is O(log n) in time, in practice the behavior
is almost constant time. This is because the trees are so shallow. Even a vector containing a billion
elements would only be 6 levels deep as log32(1 billion) < 6 183.
In September 2011, Bagwell, along with Rompf, extended the Clojure vector to create Relaxed Radix
Balanced Tree (RRB Trees). These offer concatenation and insert-at in O(log n) rather than O(n) time.
RRB Trees remain an active, ongoing research topic. The following are a few links related to
RRB-Trees that might be of interest for further exploration: the RRB-Trees paper by
Bagwell infoscience.epfl.ch/record/169879/files/RMTrees.pdf, a Clojure
implementation github.com/clojure/core.rrb-vector and Improving RRB-Tree
Performance hypirion.com/thesis.pdf.
183
P. Bagwell. Ideal Hash Trees. Technical report, EPFL, 2001. lampwww.epfl.ch/papers/idealhashtrees.pdf
also a persistent vector. However, these names are used by the Clojure community
when discussing them instead of using the lengthy class type.
The following table is a quick summary of the behavior of the different types of vector.
To learn more about their characteristics, please see the entries for the related
functions.
Created with:   | Supports transient | Stores nil                   | Stores mixed data types      | Efficient use of space | Efficient construction
“vector”        | Yes                | Yes                          | Yes                          | No                     | Yes
“vector-of”     | No                 | No                           | No                           | Yes                    | No
“subvec”        | No                 | Depends on underlying vector | Depends on underlying vector | Yes                    | Yes
(first {:a 1})  | No                 | Yes                          | Yes                          | Yes                    | Yes
12.1 vector
function since 1.0
(vector
([])
([a])
([a b])
([a b c])
([a b c d])
([a b c d e])
([a b c d e f])
([a b c d e f & args]))
The function vector creates a vector (one of the main Clojure data structures) whose
elements consist of its arguments. The order of elements in the vector is the same as
the order of arguments given to the function:
(vector :a :b :c)
;; [:a :b :c]
It produces the same output as the reader literal [] (a pair of square brackets enclosing
other forms or constants):
[:a :b :c]
;; [:a :b :c]
If the elements are known in type and number at the time of writing the code, then the
literal [] is normally used instead of the function as it is shorter and more idiomatic 184
. The function is still used when it’s not possible to write each argument of
the vector explicitly, for example to collect all the variable arguments of a function
184
For a more detailed but still accessible explanation of how vectors work in Clojure please see J. N. L’orange'
blog hypirion.com/musings/understanding-persistent-vector-pt-1
declaration:
(defn var-args [a b & all]
(apply vector a b all)) ❶
(var-args :a :b :c)
;; [:a :b :c]
❶ apply is used to collect the sequence of arguments and pass them to vector. Note how it wouldn’t be
possible to use the vector literal syntax. This is because all the arguments to the function "var-args"
that have been gathered up into "all" would be added to the vector as a single element of type list,
instead of being added as individual elements.
Contract
Input
• vector accepts zero arguments, returning an empty vector.
• "a", "b", "c", "d", "e", "f" and any additional arguments can be of any type
including nil and other arbitrarily nested vectors.
• Although all provided arities are going to create the same kind of vector, the first
7 arities (from 0 to 6 arguments) are slightly faster (a common pattern in the
Clojure standard library). See the call-out section further below for details.
Output
• A persistent vector containing the given arguments in order. If there are zero
arguments, then an empty persistent vector is returned.
Examples
Another case in which vector needs to be used instead of the square brackets (along
the one presented in the introduction to collect var-args), is when it is part of a function
literal invocation. The syntax #() expands into a function call that would be unsuitable
to contain a vector literal as first form. The following longest-palindrome example
illustrates the use of vector from inside a function literal:
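A possible definition consistent with the notes below is the following (a reconstruction, not necessarily the book’s exact code):

```clojure
;; keep the palindromes, pair each with its length, take the longest
(defn longest-palindrome [words]
  (->> words
       (filter #(= % (apply str (reverse %))))
       ;; vector must be used here: the literal #([(count %) %])
       ;; would expand into a call of the vector itself and throw
       ;; an ArityException at runtime.
       (map #(vector (count %) %))
       (apply max-key first)))
```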
(def palindromes ["hannah" "kayak" "civic" "deified"])
(longest-palindrome palindromes)
;; [7 "deified"]
❶ We reverse the word to compare it with itself. This quick solution to the problem of finding a
palindrome is simple enough for this example, but there are more efficient alternatives (see rseq).
❷ The similar but incorrect syntax using a vector literal (map #([(count %) %])) would
throw ArityException at runtime. This is because the function literal #() expands its content into a
function call.
❸ This macroexpansion shows why the vector literal inside the function literal doesn’t work. The
expansion shows that the generated “fn” uses the vector as a function, invoking it without the
mandatory argument.
Additionally vector can be used with higher order functions like map. The next
example shows how two streams of data can be joined together before being compared.
This could happen before releasing a new version of the data feed into the website, so
the new version can be regression tested against the old. You could use the following
code:
(require '[clojure.data :refer [diff]])
(def old-real-estate-system
[{:summary "Bijou love nest" :status "SSTC"}
{:summary "Country pile" :status "available"}])
(def new-real-estate-system
[{:summary "Bijou love nest" :status "SSTC"}
{:summary "Country pile" :status "SSTC"}])
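Comparing the two feeds with diff then highlights exactly which status changed. The following self-contained sketch uses shorter names mirroring the definitions above:

```clojure
(require '[clojure.data :refer [diff]])

(def old-feed [{:summary "Bijou love nest" :status "SSTC"}
               {:summary "Country pile" :status "available"}])
(def new-feed [{:summary "Bijou love nest" :status "SSTC"}
               {:summary "Country pile" :status "SSTC"}])

;; diff returns [things-only-in-a things-only-in-b things-in-both],
;; recursing into the maps at each position of the vectors
(diff old-feed new-feed)
;; [[nil {:status "available"}]
;;  [nil {:status "SSTC"}]
;;  [{:summary "Bijou love nest", :status "SSTC"}
;;   {:summary "Country pile"}]]
```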
(doc vector)
([] [a] [a b] [a b c] [a b c d] [a b c d e] [a b c d e f] [a b c d e f & args])
Creates a new vector containing the args.
Many other functions in the standard library follow a similar pattern. You might wonder why the signature
isn’t simply (defn vector [& args]) and the reason is performance. There are two aspects
connecting arities and performance:
• In general, the presence of a “& args” arity implies that the function implementation contains
somewhere an iteration over the variable number of arguments. This iteration is more expensive than
directly accessing arguments.
• Depending on the function implementation, the different arities could simply share the same
underlying code or have completely different ways to achieve the final result.
vector in particular, takes advantage of knowing the number of arguments at compile time (0 to 6) by
creating the tail of the persistent vector directly. The variable arity case instead, makes use of a more
flexible "create and expand" loop. Here’s a simple benchmark that shows the speed gain:
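A benchmark of the kind described could be sketched as follows, assuming Criterium is on the classpath (as elsewhere in the book); absolute timings vary by machine:

```clojure
(require '[criterium.core :refer [quick-bench]])

;; fixed arity (up to 6 args): the tail of the vector is built directly
(quick-bench (vector 1 2 3 4 5 6))

;; variable arity (7+ args): the more flexible but slower
;; "create and expand" loop is used instead
(quick-bench (vector 1 2 3 4 5 6 7))
```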
(handler 7) ❸
(handler 8)
(handler 9)
(handler 10)
See Also
• vec to create a persistent vector from a seqable collection. Prefer vec if the input
content for the newly created vector is coming from an already existing sequence.
• vector-of to create a vector specifying a primitive Java type. Use vector-of if
185
The Clojure Style Guide github.com/bbatsov/clojure-style-guide#literal-col-syntax
186
github.com/rachbowyer/keirin
(defn test-speed-creation-keirin []
(let [results (for [i (range 10)] (test-speed-creation-keirin' i))]
(doseq [i (range 3)]
(doseq [result results]
(let [num (cond-> (get result i)
(not= i 0)
:median)]
(printf "%10.3f " (double num))))
(println))))
The linear behavior of the functions can be clearly seen. vec runs slightly faster
than vector because vector iterates the input using first/next semantics compared
to the faster reduce. We can verify the assumption by using a lazy sequence which
❶ Execution time is equivalent for vector and vec when the input collection is seqable but not
reducible.
❷ vec is faster if the reduce implementation provided by the input collection is more performant.
Let’s now have a look at the memory allocation using the Java Jamm library 187. The
snippet below is used to illustrate the memory used by a persistent vector in the plot
that follows:
(import '[org.github.jamm MemoryMeter])
(defn test-memory-vector-of-jamm []
(let [meter (MemoryMeter.)
results (for [elements (range 100000 1100000 100000)]
[elements
(.measureDeep meter (make-array Object elements))
(.measureDeep meter (vec (repeat elements nil)))])]
(doseq [i (range 4)]
(doseq [result results]
(printf "%11d " (get result i)))
(println))))
The next plot shows the memory usage (in megabytes) for a Clojure vector and
confirms the linear behavior as "n" increases.
187
The way range implements reduce is visible in the clojure.lang.LongRange class
Figure 12.4. Memory overhead of using a persistent vector compared to a Java array
The overhead is very high for small vectors, but after around 1000 elements settles
down at 35%. For comparison, Java’s java.util.ArrayList, when growing
dynamically one item at a time, will have an overhead of 25% on average 188. This is
because each time the array resizes, it increases in length by 50%. However, if the size
of the ArrayList is known in advance, then the array will be sized correctly for the
number of elements and have minimal overhead.
12.2 vec
function since 1.0
(vec coll)
vec creates a new “vector” given another collection as input. vec works on almost all
188
github.com/jbellis/jamm
collection types:
• Common Clojure collections like lists, sets, hash-maps, etc.
• Java iterables like clojure.lang.PersistentQueue or java.util.ArrayList.
• Native Java arrays (like the ones created with make-array).
The order of the resulting vector matches the order of the elements in the input except
for unsorted collections (like hash-sets or hash-maps), for which there isn’t a specific
order. vec can be used like:
(vec '(:a 1 nil {})) ❶
;; [:a 1 nil {}]
❶ Note that other nested collections are not transformed by vec recursively.
WARNING If the input collection is a Java array of reference types containing 32 elements or fewer, the
output vector produced by vec will be just an alias to the native array. Therefore, the Java array
should not be modified after the call to vec or else the immutable Clojure vector may change
value! See the examples for more information.
Contract
Input
• "coll" can be a "seqable" collection (such that (instance? clojure.lang.Seqable
coll) is true) an "iterable" collection (such that (instance?
java.lang.Iterable coll) is true) or a Java array (such that (.isArray (class
coll)) is true).
• "coll" can also be nil.
• The only collection-like Clojure data structures that do not work
with vec are transients and the now obsolete structs.
Notable exceptions
• In the event that "coll" is not a seqable data type, a RuntimeException is thrown.
Output
• A persistent “vector” containing the elements in the collection "coll". Order in the
produced vector is respected for ordered collections.
Examples
The following example shows the potential side effects of using vec on Java arrays:
(def a (make-array Long 3))
(def v (vec a))
v
;; [nil nil nil]
(aset a 1 99)
;; 99
v
;; [nil 99 nil] ❶
❶ The aset operation on the array side-effects the vector created by vec. The same doesn’t happen
for arrays of primitives, e.g. (def a (int-array [1 2 3])).
The following table shows several examples of vec against different collection types. A
brief note is given in the table to explain the results.
Now a more involved example. Madison is looking to buy a blue dress from the
"Rachel’s Rags" website. Firstly, she searches on the site for a blue dress. Behind the
scenes the website queries a database, converts the results to a vector, allocates a
search id and caches the results. Then the website returns the first page of the results as
JSON to Madison’s browser along with the search id. Madison’s browser then renders
the JSON as HTML.
Madison, for some reason, then decides to look at page 3 of the results. In response her
browser makes an AJAX request to the website passing the search id. The website then
retrieves page 3 of the results and returns it to her browser as JSON.
Caching the search results as a vector works particularly well as Madison can jump
from page to page of the search results at random. However, as most Clojure database
libraries will return their results as a sequence, vec is needed to convert the results to a
©Manning Publications Co. To comment go to liveBook
vector. A simplified version of the server code, with the database mocked, may look
like this:
(import (java.util UUID))
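The definitions referenced by the notes below are not shown here; a reconstruction along the lines of those notes might look like this (all names follow the notes, but the bodies are illustrative):

```clojure
;; the cache is a map held by an atom: search id -> vector of results
(def search-cache (atom {}))

(defn search-merchandise [search-options]
  ;; mocked database query: returns a sequence, as most Clojure
  ;; database libraries would
  (list {:id 1 :summary "Blue cocktail dress" :price 39.99}
        {:id 2 :summary "Navy maxi dress" :price 49.99}
        {:id 3 :summary "Azure wrap dress" :price 59.99}))

(defn cache-user-search-results! [search-id results]
  ;; vec converts the sequence into a vector for efficient paging
  (swap! search-cache assoc search-id (vec results)))

(defn retrieve-user-search-results [search-id page]
  ;; one item per page keeps the example simple; "page" is zero-based
  (let [results (get @search-cache search-id)]
    (subvec results page (inc page))))

(defn render-to-json [results]
  ;; a stand-in for a real JSON library such as Cheshire
  (pr-str results))

(def search-id (str (java.util.UUID/randomUUID)))
```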
(cache-user-search-results!
search-id
(search-merchandise {:type :dress :color :blue}))
(println
(-> (retrieve-user-search-results search-id 0)
render-to-json))
(println ❼
(-> (retrieve-user-search-results search-id 2)
render-to-json))
❶ search-merchandise searches the database based on the "search-options" passed in and returns a
list. For simplicity the results have been mocked.
❷ The search results are cached in a map held by an “atom, swap!, reset! and compare-and-set!”.
This is thread safe and, for small systems, completely appropriate depending on memory
requirements 189. Larger, more complicated systems may benefit from using libraries like core.cache 190
or a distributed cache such as Redis.
❸ cache-user-search-results! takes the search results, converts them to a vector using vec, and
stores them in the cache.
189
Technically this is the per element overhead. Java’s ArrayList also has some static overhead such as the class
overhead, but this will be insignificant unless the number of elements is very small.
190
github.com/clojure/core.cache
❹ Given a search id, retrieve-user-search-results returns the results to show on a given page.
"page" is the page of results required and is zero-based indexed, so for example to retrieve the
second page of results, "page" would be 1. retrieve-user-search-results is efficient as the
results are stored as a vector. To keep the example simple, one item per page is returned.
❺ render-to-json is a simple JSON renderer. A real-world system would use an external library such as
Cheshire 191.
❻ This code simulates the scenario when Madison initiates her search. A unique search id is allocated at
random using Java UUID generation. Then a search of the database is performed for "blue dresses"
and the result is cached. Finally the first page of her results is retrieved and converted to JSON.
❼ This code simulates the scenario when Madison views the third result page. Using the existing search
id, the third page of her results is retrieved from the cache and converted to JSON.
See Also
• vector-of to create a vector of a primitive type. Use vector-of if space efficiency
is the biggest concern.
• make-array to create a Java array. Use a Java array if the performance benefits of a
mutable data structure are required or if interfacing with Java code.
• into can be used to "transfer" the content of a collection into another, including
vectors. There are a few small differences compared to vec: into does not alias
arrays, it supports transducers and can also create other collection types. If you
focus on vectors only, vec better conveys the meaning of the transformation and
spares a few keystrokes.
Performance Considerations and Implementation Details
191
redis.io/
by vec to ask the input to build itself. This is possible if the input collection supports
the clojure.lang.IReduce interface. In that case, reduce is used to build the
vector. vec’s performance profile is thus dependent on the type of the input collection, as
demonstrated by the following benchmark:
(require '[criterium.core :refer [quick-benchmark]]) ❶
(import '[java.util ArrayList LinkedList])
Since we’ve compared into and vec briefly before, here’s a quick benchmark between
the two:
(require '[criterium.core :refer [bench]])
The difference between into and vec is very small and is not sufficient alone to
determine a clear winner between the two. As explained before, vec should be
preferred in the context of vector processing to better convey the meaning of the
computation. In terms of implementation, while into is mainly implemented in Clojure,
vec delegates almost immediately to clojure.lang.LazilyPersistentVector that
proceeds to invoke the correct sequential transformation on the input sequence to
create the final vector.
(peek [coll])
(pop [coll])
peek and pop access or remove (in the immutable data structure sense) the head
element from either a vector, list or queue. The head position depends on the collection
type:
(import '[clojure.lang PersistentQueue])
(def q (PersistentQueue/EMPTY))
(def v [])
(def l ())
❶ peek called on a clojure.lang.PersistentQueue type returns the first element added to the queue.
❷ peek on a vector, returns the last element added to the vector (when printed, this appears as the right-
most element).
❸ peek is used on a list, which returns the last element added (when printed it appears as the left-most
element).
It’s easy to get confused when collections are printed, especially when we look
at vectors and lists (queues need a transformation before printing), because the
"head" element we just described is not printed on the same side:
((juxt vec peek) (conj (PersistentQueue/EMPTY) "a" "b" "c"))
;; [["a" "b" "c"] "a"] ❶
((juxt identity peek) (conj [] "a" "b" "c"))
;; [["a" "b" "c"] "c"] ❷
((juxt identity peek) (conj () "a" "b" "c"))
;; [("c" "b" "a") "c"] ❸
❶ The peek element for a queue is the element that was added last and is printed as the first element.
❷ The peek element for a vector is the element that was added last and is printed also as the last
element.
❸ The peek element for a list is again the element that was added last, but it appears first when printed.
(vec (pop (conj (PersistentQueue/EMPTY) "a" "b" "c"))) ❶
;; ["b" "c"]
(pop (conj [] "a" "b" "c")) ❷
;; ["a" "b"]
(pop (conj () "a" "b" "c")) ❸
;; ("b" "a")
❶ When we invoke pop on a queue the non-printable queue object is returned. We can see the content
of the queue using vec.
❷ pop on a vector removes the tail element, which is the right-most item when printed and the last that
was added.
❸ pop used on a list removes the head element, which is the left-most and was the last added.
The following table summarizes what we have just seen in the examples and
additionally shows whether it’s possible to pop from an empty collection of that kind:
Table 12.2. Difference between insertion and printing order for peek and pop

Collection   peek/pop element   Printed position   pop on empty
queue        first added        left-most          returns an empty queue
vector       last added         right-most         IllegalStateException
list         last added         left-most          IllegalStateException
Contract
Input
• "coll": coll is the only mandatory argument. The collection needs to implement
the IPersistentStack interface, which can be verified with (instance?
clojure.lang.IPersistentStack coll). The commonly used collection types
supporting this interface are: queues, lists and vectors.
Notable exceptions
• ClassCastException when the collection does not support
the IPersistentStack interface. Common examples producing the errors
are: (peek (range 10)) or (peek #{1 2 3 4}).
• IllegalStateException trying to pop from an empty vector or an empty list. It’s
still okay to pop an empty queue.
Output
• peek: returns the element that was added first (for queues) or the element that was
added last (vectors and lists). nil if "coll" is empty or nil.
• pop: returns what remains of the collection after removing the element
returned by peek. pop throws an exception for an empty vector or list.
Returns nil when "coll" is nil.
Examples
peek and pop are useful to create a consistent interface around queues in Clojure. A
queue is an abstract data type characterized by its insertion/extraction order 192.
192
Abstract data types are a specification of the semantics of a data structure without implementation details. The most common
abstract data types are summarized on this Wikipedia page: en.wikipedia.org/wiki/Abstract_data_type
Thanks to peek and pop, a LIFO queue (last-in first-out queues are also called stacks)
can be implemented efficiently on top of vectors (for FIFO queues there is a dedicated
data structure called a queue). Let’s see how we can use a vector-based queue to verify
that a Clojure form contains balanced parentheses without evaluating it:
(require '[clojure.set :refer [map-invert]])
❶ To additionally conform to the idea of using a queue, we add this simple constructor that wraps an
empty vector. We could swap this with an empty list or another queue implementation provided they
support the peek, pop and conj semantics.
❷ Similarly, push is just an alias for conj to help us think in terms of queues.
❸ We need a list of all the allowed brackets and their matching pairs. We can organize this as a
dictionary to quickly look up the closing bracket given the opening bracket as a key.
❹ The check function performs the scan of the input. It is organized as a reduce around an initially
empty queue and the list of input characters.
❺ Each character goes through a cond expression: if we have an opening bracket, we push it onto the
queue and wait to see what happens in the next iteration.
❻ If we have a closing bracket, we check (first out) whether the related opening bracket is the last thing
we saw (last in). This is a peek operation on the queue. If we have a match, we pop the matching
bracket and wait for the next iteration. If it’s not a match, an exception is thrown. Note the use
of map-invert from the clojure.set namespace to invert keys-values in a map.
❼ If it’s not a bracket, we do nothing and return the queue.
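The code the callouts above describe is elided here. A sketch consistent with them (the names queue, push and check follow the callouts; balanced? is an added convenience) might be:

```clojure
(require '[clojure.set :refer [map-invert]])

(defn queue [] [])                    ;; ❶ constructor wrapping an empty vector
(def push conj)                       ;; ❷ push is an alias for conj

(def brackets {\( \), \[ \], \{ \}})  ;; ❸ opening bracket -> closing bracket

(defn check [s]                       ;; ❹ reduce over the input characters
  (reduce
   (fn [q c]
     (cond
       ;; ❺ opening bracket: push it and wait for the next iteration
       (contains? brackets c) (push q c)
       ;; ❻ closing bracket: it must match the last bracket pushed (peek)
       (contains? (map-invert brackets) c)
       (if (= c (get brackets (peek q)))
         (pop q)
         (throw (ex-info "unbalanced form" {:at c})))
       ;; ❼ not a bracket: return the queue unchanged
       :else q))
   (queue)
   s))

(defn balanced? [s] (empty? (check s)))
```

Usage: (balanced? "(mapv inc [1 2 3])") returns true, while (balanced? "(mapv inc [1 2 3]") leaves an unmatched bracket in the queue and returns false.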
peek and pop can also be used efficiently in vector-based loops, where the current item
is extracted at each iteration and the remainder is sent to the next via recur:
(defn reverse-mapv [f v] ❶
(loop [v v res (transient [])] ❷
(if (peek v) ❸
(recur ❹
(pop v)
(conj! res (f (peek v))))
(persistent! res)))) ❺
❶ reverse-mapv returns the reverse of a vector and at the same time can apply a transformation to
each element.
❷ The loop starts by assigning the vector to the local binding "v" and creating an empty vector for the
results. For additional efficiency, we can use a transient vector because the accumulation of results
is local to the loop.
❸ peek returns nil when there are no more elements, so we can use this as the signal for stopping the
recursion.
❹ The two arguments for recur are the pop of the vector and the new results vector in which
we conj the transformation of the peeked element.
❺ The transient vector needs to be made immutable again before leaving the local context
with persistent!.
See Also
• conj is what is used to push elements into lists, vectors and queues. There is no
"push" function in the standard library but conj can be used in exactly the same way.
• first and rest, while working natively for lists, require a transformation into a
sequence when used on a vector. peek and pop represent the efficient way to
perform similar operations on a vector.
Performance Considerations and Implementation Details
lists tend to perform better in both benchmarks, although in absolute terms these are all
very fast operations. The histograms are not perfectly flat (by collection type) because
of small fluctuations in the benchmarks. Some loss of precision is possible at this
resolution of just a few milliseconds.
12.4 vector-of
function since 1.2
(vector-of
([t])
([t x1])
([t x1 x2])
([t x1 x2 x3])
([t x1 x2 x3 x4])
([t x1 x2 x3 x4 & xn]))
This is a specialist function creating a persistent vector that stores its elements
internally as a primitive type. It is used when a lower memory footprint is required
than that of a vector created by vector, provided primitive types fit the specific
problem. vector-of behaves very similarly to a normal vector: it can be accessed
randomly or treated as sequential, it can be used as a function to access its elements by
index and it’s comparable:
(vector-of :int) ❶
;; []
((vector-of :int 1 2 3) 2) ❸
;; 3
Contract
Input
• "t" is one of the following 8 keywords representing their respective
types: :int, :long, :float, :double, :byte, :short, :char or :boolean.
• "x1", "x2", "x3", "x4" and "xn" are optional arguments. The elements can be of
different types than "t", but they must be able to be coerced to "t".
Notable exceptions
(vector-of Integer 1 2 3) ❶
;; NullPointerException
(vector-of :int "a") ❷
;; ClassCastException
(vector-of :int nil) ❸
;; NullPointerException
(vector-of :byte 1000) ❹
;; IllegalArgumentException
❶ NullPointerException when "t" is not one of the 8 accepted types. An improvement is currently
being discussed to generate a better error message 193.
❷ If an element cannot be coerced to "t", then a ClassCastException is thrown.
❸ If an element is nil, then a NullPointerException is thrown.
❹ If there is an underflow or overflow of an element then an IllegalArgumentException is thrown.
OUTPUT
• A persistent vector containing the given elements in order, if any. Otherwise an
empty vector.
193
See the Clojure Jira ticket system at dev.clojure.org/jira
WARNING vectors created with vector-of cannot contain nil, whether passed at construction time or
added via conj. This is because vector-of only allows primitive types.
EXAMPLES
One area where vector-of is useful is in numerical computing where many numbers
have to be stored in memory, for example in creating fractal images. Fractal images,
such as the Mandelbrot set 194, have entered the popular culture and illustrate the
beauty and complexity of mathematics.
To produce the image of the Mandelbrot set, an iterative process is applied to the
numbers on the complex plane. For each complex number 195, the number of iterations
before the process heads off to infinity is counted. The 3-tuple consisting of the real
part, the imaginary part and the number of iterations, can be efficiently stored in a
vector of primitives. The triplets are then plotted: the real part on the x-axis, the
imaginary part on the y-axis, and the number of iterations is mapped on to a color
gradient.
Here is a simplified version of the code 196 that produced the image of the Mandelbrot
set:
194
See en.wikipedia.org/wiki/Mandelbrot_set
195
See en.wikipedia.org/wiki/Complex_number
196
See github.com/rachbowyer/csl-10-vector-public
(def mandelbrot-set ❹
(for [im (range 1 -1 -0.05) re (range -2 0.5 0.0315)]
(calc-mandelbrot re im)))
;; **************************************************************************
;; ******************************************************** *****************
;; ***************************************************** ****************
;; **************************************************** ***************
;; ***************************************************** ***************
;; ************************************************** * ** *************
;; ******************************************* *** **********
;; ****************************************** ** ****
;; ******************************************* ****
;; ***************************************** *****
;; **************************************** ****
;; ************************************** *
;; **************************** ********* **
;; *********************** * * ***** **
;; *********************** *** **
;; ********************* * **
;; ********************* **
;; ***************** ****
;; *** ***** ******
;; ***************** ****
;; ********************* **
;; ********************* * **
;; *********************** *** **
;; *********************** * * ***** **
;; **************************** ********* **
;; ************************************** *
;; **************************************** ****
;; ***************************************** *****
;; ******************************************* ****
;; ****************************************** ** ****
;; ******************************************* *** **********
;; ************************************************** * ** *************
;; ***************************************************** ***************
;; **************************************************** ***************
;; ***************************************************** ****************
;; ******************************************************** *****************
;; **************************************************************************
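calc-mandelbrot is elided above. A minimal sketch (max-iterations and the escape radius are assumptions; this version returns the character to print rather than the 3-tuple of primitives discussed earlier) could be:

```clojure
(defn calc-mandelbrot [re im]
  (let [max-iterations 100]
    (loop [zr 0.0 zi 0.0 i 0]
      (cond
        ;; |z| > 2: the iteration escapes to infinity, the point is outside the set
        (> (+ (* zr zr) (* zi zi)) 4.0) \space
        ;; still bounded after max-iterations: consider the point inside the set
        (>= i max-iterations) \*
        ;; z <- z^2 + c, with c = re + im*i
        :else (recur (+ (- (* zr zr) (* zi zi)) re)
                     (+ (* 2.0 zr zi) im)
                     (inc i))))))
```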
SEE ALSO
• vec to create a vector from different types of collections. You probably want to
use normal vectors most of the time, as they are the most flexible. However, if the
application has memory constraints related to storing many small vectors, consider
using vector-of primitives.
• make-array to create a Java array. Use a Java array if, along with space
optimizations, the performance benefits from a mutable data structure. Also use
Java arrays if interfacing with Java code that requires them.
PERFORMANCE CONSIDERATIONS AND IMPLEMENTATION DETAILS
⇒ O(n log(n)) time, with n number of arguments ⇒ O(n) space
All arities, except the last, are faster options to create a new vector with vector-of:
(require '[criterium.core :refer [quick-bench bench]])
For larger vectors there is no vec-of function, but the overhead can be slightly
alleviated using conj:
(def data (doall (range 100000)))
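The elided comparison might look like the following sketch; since primitive vectors have no transient support, into falls back to repeated conj:

```clojure
(require '[criterium.core :refer [quick-bench]])

(def data (doall (range 100000)))

;; primitive vector: built one conj at a time
(quick-bench (into (vector-of :long) data))
;; regular vector: into goes through a transient internally
(quick-bench (vec data))
```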
vector-of has a similar invocation semantic to “vector”. The two can be compared
with the following benchmark, which shows “vector” being more than twice as fast
as “vector-of”:
(require '[criterium.core :refer [quick-bench]])
197
Open ticket to add transients to vector-of: dev.clojure.org/jira/browse/CLJ-1416
198
github.com/jbellis/jamm
(import 'org.github.jamm.MemoryMeter) ;; MemoryMeter is provided by the jamm library

(defn memory-vector-of []
(let [meter (MemoryMeter.)
bytes-vector (.measureDeep meter (vector 1.0 1.1 1.2))
bytes-vector-of (.measureDeep meter (vector-of :double 1.0 1.1 1.2))
saving (* (double (/ (- bytes-vector bytes-vector-of) bytes-vector)) 100)]
(println "Bytes used by vector" bytes-vector)
(println "Bytes used by vector of" bytes-vector-of)
(println (str "Saving " (format "%3.2f" saving) "%"))))
(memory-vector-of)
;; Bytes used by vector 328
;; Bytes used by vector of 264
;; Saving 19.51%
Accessing elements in a vector created with vector-of is slower than normal vectors:
(let [v1 (vec (range 10000))]
(bench (nth v1 1000))) ❶
;; Execution time mean : 12.264993 ns
❶ Accessing the element at index "1000" using nth on a normal vector created with “vec”.
❷ Access to the same element at index "1000" on a vector created with vector-of.
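The measurement for callout ❷ is elided above; it might be reconstructed as:

```clojure
(require '[criterium.core :refer [bench]])

(let [v2 (into (vector-of :long) (range 10000))]
  ;; nth must box the primitive long before returning it
  (bench (nth v2 1000)))
```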
As the example shows, the slowdown can be as high as 40% (but in absolute terms, we
are still talking about very fast access times in the order of nanoseconds). The cause of
this is that the accessor functions get and nth, or even using the vector as a function, all
return a reference type. Therefore, the element has to be boxed before it can be
returned. Users of vector-of should therefore pay attention when the vector is
frequently accessed and weigh the gain in memory space against access speed to
decide which implementation to use.
12.5 mapv
function since 1.4
(mapv
([f coll])
([f c1 c2])
([f c1 c2 c3])
([f c1 c2 c3 & colls]))
mapv is the eager, vector-returning version of map: it applies "f" to each element of
"coll" and returns a persistent vector of the results. mapv also works with multiple
collections. In this case "f" is applied simultaneously to
the first element in each collection, then the second and so on until reaching the end of
the shortest collection:
(mapv hash-map [:a :b :c] (range))
;; [{:a 0} {:b 1} {:c 2}]
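The transducer example mentioned next is elided here. The idea can be sketched as follows (a hedged reconstruction: there is no removev in the standard library, but a transducer chain builds the filtered vector in one pass):

```clojure
(let [v (vec (range 10000))]
  ;; sequence version: intermediate lazy seqs, then a vector at the end
  (vec (remove odd? (map inc v)))
  ;; transducer version: one pass, straight into a vector
  (into [] (comp (map inc) (remove odd?)) v))
```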
In this example the transducer runs more than 50% faster than the sequence version reducing the need
for a "removev".
EXAMPLES
Persistent vectors of type double and length "n" can be used to represent mathematical
vectors in \mathbb{R}^n 199. It is then straightforward to use mapv to implement
addition, subtraction, scalar multiplication as follows:
(defn create-vector-fn [f] ❶
(fn [a & b] (apply mapv f a b)))

(def add (create-vector-fn +)) ❷
(def subtract (create-vector-fn -))

(defn scalar-multiply [c a]
(mapv (partial * c) a))

(defn dot-product [a b] ❸
(reduce + (map * a b)))
(add [1 2] [3 4])
;; => [4 6]
(subtract [2 7 3] [5 4 1])
;; => [-3 3 2]
(scalar-multiply 3 [1 2 3])
;; => [3 6 9]
(dot-product [1 1 0] [0 0 1]) ❹
;; => 0
199
en.wikipedia.org/wiki/Real_coordinate_space
❶ There is a lot going on in this function. It is a higher-order function that takes as its input a function, "f",
and returns another function. The function returned takes one or more vectors as arguments and
uses mapv to apply "f" to their elements. The argument "b" optionally contains a list of vectors and
so apply is needed to execute mapv correctly.
❷ Here we make use of the higher-order function create-vector-fn. We pass in the operator + and
receive back a function that adds one or more vectors together. The function is then bound to "add"
using def.
❸ This function implements the algebraic dot product 200. The operation multiplies elements in the
vectors before summing them. As a scalar is returned, not a vector, mapv does not help us.
Instead map and reduce are used.
❹ As [1 1 0] and [0 0 1] are perpendicular, then their dot product should be 0 as expected.
SEE ALSO
• map the standard map operation, which produces either a transducer or a lazy
linked list. Use standard map when you are not interested in the result as a vector.
• mapcat is useful when the result of applying f to an item is again a sequence, with
the overall result of producing a sequence of sequences. mapcat applies a
final concat operation to the resulting list, flattening the result one level.
• amap operates with the same semantics of map on Java arrays.
• “pmap” executes the map operation on separate threads, thus creating a
parallel map execution pool. Replacing map with “pmap” makes sense when the
overall cost of handing the function f to separate threads is less than the
execution of f itself. Long or otherwise processor-consuming operations usually
benefit from using “pmap”.
• clojure.core.reducers/map is the version of map used in the context of “Reducers”.
It has the same semantics as map and should be used similarly in the context of a
chain of reducers.
PERFORMANCE CONSIDERATIONS AND IMPLEMENTATION DETAILS
⇒ O(n) time and space, n number of elements in the shortest input collection
Without considering laziness, mapv typically outperforms the single input
collection version of map, as mapv writes its results directly to a transient vector
whereas map creates a lazy sequence 201:
(require '[criterium.core :refer [bench quick-bench]])
❶ We give away laziness by forcing the sequence into a vector as this is the goal of the current
comparison.
200
en.wikipedia.org/wiki/Dot_product#Algebraic_definition
201
Benchmarked using Criterium: github.com/hugoduncan/criterium
Using map with transducers skips the intermediate sequence creation and performs
considerably better than the basic map version but still worse than the mapv version:
(let [r (range 10000)] (quick-bench (into [] (map inc) r))) ❶
;; Execution time mean : 293.384399 µs
❶ The solution that uses map with into via transducers performs roughly in between the two other
versions seen before.
In some other cases, the benefits of being lazy leads to code using the two-arity version
of map outperforming mapv. For example, if an application is predominantly using just
a few items from a bigger collection, it makes sense to use lazy sequences instead of
vectors (if possible):
(let [r (range 10000)] (quick-bench (subvec (mapv inc r) 0 10))) ❶
;; Execution time mean : 263.561818 µs
❶ After incrementing the element of the vector, we use subvec to extract the first 10 of them.
❷ Similarly, the first 10 elements are extracted after map is applied, with a vector transformation
happening at the end. This operation, since it only realizes a few elements of the sequence, is 200
times faster!
mapv arities taking more than one collection are not making direct use of transients.
Despite the fact that the input/output are still vectors, mapv performs an intermediate
transformation into a sequence and back, with appreciable performance impact:
(let [r (range 10000)] (quick-bench (into [] (map + r r)))) ❶
;; Execution time mean : 1.139211 ms
If we wanted to implement a version of mapv that performs better using two collections
as input, the following could be a possible option:
(defn mapv+ [f c1 c2] ❶
(let [cnt (dec (min (count c1) (count c2)))]
(loop [idx 0
res (transient [])]
(if (< cnt idx)
(persistent! res)
(recur (+ 1 idx) (conj! res (f (nth c1 idx) (nth c2 idx))))))))
❶ mapv+'s strategy is to use a loop-recur over the shortest of the two input vectors. A transient vector is
gradually built using conj!. A persistent! vector is then returned when the max number of
elements has been reached.
❷ The benchmark confirms around a 50% improvement compared to plain mapv.
12.6 filterv
function since 1.4
(filterv
([pred coll]))
Apart from returning a vector, filterv differs from filter in the following:
• filterv is not lazy and will eagerly load the resulting vector into memory.
• filterv is missing a dedicated transducer arity.
CONTRACT
INPUT
• "pred" is a predicate function. The returned value is interpreted as logical boolean.
• "coll" is any seqable collection. If "coll" is nil, then it is treated as an empty
collection.
EXCEPTIONS
• If "pred" has the incorrect arity, then an ArityException is thrown.
• If "pred" is not a function, then a ClassCastException is thrown.
• If "coll" is not "seqable", then an IllegalArgumentException is thrown.
OUTPUT
A persistent vector consisting of all items in "coll" for which (pred item) is truthy.
The order of the items in the vector matches the order of the items in "coll".
EXAMPLES
In the following example, two tasks need asynchronous processing. An invaluable tool
to orchestrate asynchronous tasks is "core.async" 202, a library commonly found in
many concurrent Clojure applications. There are many possible candidates for
asynchronous processing, for example calling 3rd party APIs requiring a network
connection. The main application needs to wait for the results of the asynchronous
tasks before continuing, but the main thread is free to do additional processing, usually
resulting in better resource allocation.
After the first task completes, filterv could be used to remove the channel that has
completed from the list of all channels 203.
For simplicity, instead of calling external resources, the code below runs the following
tasks in parallel: calculating the "e" and "π" constants. The important concept to
consider is that the tasks will be busy for some considerable time depending on the
precision requested:
(require '[clojure.core.async :refer [>!! <!! >! <! alts!! chan go]])
202
github.com/clojure/core.async
203
This example is inspired by the following Stack Overflow post: stackoverflow.com/questions/31858846/waiting-for-n-channels-with-core-async
❶ π is calculated using the Leibniz formula 204 . Mathematicians view the calculation of π as the
summation of an infinite series and the power of a functional language like Clojure can easily reflect
this. "precision" is used in an informal way to specify the accuracy.
❷ e is calculated using one of the Brothers' formulae 205 . Again "precision" is used to specify the
accuracy in an informal way.
❸ get-results is a recursive function that waits for each channel in turn to complete and return its
results.
❹ filterv is used to remove the channel that completed from the list of all channels. The code would
work perfectly well if it used filter instead of filterv, but this would introduce a subtle problem.
When get-results is called, "channels" is a vector. But filter returns a sequence, so the data type
of "channels" is changed as the code runs in a non-obvious way. This type of behavior should be
avoided as it can lead to bugs.
❺ Two go blocks are setup, the first calculates π and the second e.
❻ Requests to calculate π and e are placed on the appropriate channel and the code then waits for both
results.
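The code the callouts describe is elided here. A sketch consistent with them (the names calc-pi, calc-e and get-results follow the callouts; the precision arguments are informal, as noted above; core.async must be on the classpath):

```clojure
(require '[clojure.core.async :refer [>! alts!! chan go]])

(defn calc-pi
  "Leibniz formula: pi = 4 * sum of (-1)^k / (2k + 1)."
  [precision]
  (* 4.0 (reduce + (map (fn [k] (/ (if (even? k) 1.0 -1.0)
                                   (inc (* 2.0 k))))
                        (range precision)))))

(defn- factorial [n]
  (reduce *' (range 1 (inc n))))

(defn calc-e
  "One of the Brothers' formulae: e = sum of (2k + 2) / (2k + 1)!."
  [precision]
  (double (reduce + (map (fn [k] (/ (+ (* 2 k) 2)
                                    (factorial (inc (* 2 k)))))
                         (range precision)))))

(defn get-results
  "Wait for each channel to deliver its result, removing completed
  channels from the vector with filterv."
  [channels]
  (loop [channels channels
         results  []]
    (if (empty? channels)
      results
      (let [[v ch] (alts!! channels)]
        (recur (filterv #(not= % ch) channels)
               (conj results v))))))

(let [pi-chan (chan)
      e-chan  (chan)]
  (go (>! pi-chan (calc-pi 1000000)))
  (go (>! e-chan (calc-e 10)))
  (get-results [pi-chan e-chan]))
```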
SEE ALSO
• filter is the less specific and more frequently used sister function of filterv. Usage
of filterv should be restricted to cases where the input/output are expected to be
(and remain) vectors. Prefer filter to filterv when the type of the output is not
relevant or laziness is more important.
• mapv is another vector-oriented operation to process each element of a vector.
• reduce-kv is dedicated to associative structures, but works on vectors just fine. It
completes the set of processing operations along with map and filter.
PERFORMANCE CONSIDERATIONS AND IMPLEMENTATION DETAILS
⇒ O(n) time and space, where n is the number of items in "coll"
The comments in the "Performance Considerations and Implementation Details"
section for mapv regarding transducers and laziness also apply to filterv and filter.
In summary, filterv typically outperforms the two-arity version of filter as filterv
writes its results directly to a transient vector whereas filter first creates a lazy
sequence. For instance:
(require '[criterium.core :refer [quick-bench]])
❶ We remove laziness from filter by forcing the final result into a vector.
204
en.wikipedia.org/wiki/Leibniz_formula_for_π
205
en.wikipedia.org/wiki/List_of_representations_of_e#As_an_infinite_series. Amazingly this formula was only discovered
in 2004!
12.7 subvec
function since 1.0
(subvec
([v start])
([v start end]))
subvec creates a "sub-vector" from the elements contained in another vector. The
"start" and the (optional) "end" parameters define the portion of elements to
extract into the new vector. subvec can be used to extract contiguous parts of a vector
without the need to walk all elements from the beginning. We can for example
extract a specific "window" into the input vector by specifying a start and an end:
(subvec [1 2 3 4] 1 3)
;; [2 3]
If the "end" is omitted, the sub-vector contains the range starting from the element at
index "start" (inclusive) up to (count v) (exclusive):
(subvec [1 2 3 4] 1)
;; [2 3 4]
Sub-vectors can be created on top of the other kinds of vectors. In this case the resulting
sub-vector inherits the characteristics of the underlying vector:
(def subv (subvec (vector-of :int 1 2 3) 1)) ❶
(conj subv \a)
;; [2 3 97] ❷
(conj subv nil)
;; java.lang.NullPointerException ❸
WARNING The vector returned by subvec is an independent view of a range of elements contained in the
input vector. Once the sub-vector is generated, the originating vector can be altered without any
impact on related sub-vector instances. Although the two instances are substantially
independent, there is one subtle side effect: as Fogus M. and Houser C. warn 206, vectors are
not queues and subvec should not be used to implement a pop from the front of the vector.
Although subvec completes in constant time, it keeps a reference to the underlying
vector so none of the items popped off are ever garbage collected. See the "Performance and
Implementation Details" section below for more details.
CONTRACT
INPUTS
• "v" parameter is a vector. It follows that (vector? v) must be true. "v" can be
empty but it cannot be nil.
• "start" and "end" are numbers in the integer range
(from Integer/MIN_VALUE to Integer/MAX_VALUE)
• "end" is optional. When "end" is not present, then "end" == (count v)
NOTABLE EXCEPTIONS
• If "v" is not a vector, then a ClassCastException is thrown.
• If "v" is nil, then a NullPointerException is thrown.
• If "start" or "end" cannot be coerced to a Number, then a ClassCastException is
thrown.
• If "start" or "end" are beyond the Integer range,
then IllegalArgumentException is thrown.
• If "start" < 0 or "start" > (count v) then an IndexOutOfBoundsException is
thrown.
• If "end" is provided and "end" < "start" or "end" > (count v) then
an IndexOutOfBoundsException is thrown.
OUTPUTS
• A vector. The sub-vector starts at the zero-based position "start" inclusive and
runs to the zero-based position "end" exclusive if "end" is provided, otherwise it
runs until the end of "v".
EXAMPLES
subvec is an efficient solution to "remove" an element from a vector. Vectors are
immutable data structures, so "removing" an element implies creating a new vector
from two sub-vectors where the element to remove has been left out:
(defn remove-at [v idx] ❶
(into (subvec v 0 idx)
(subvec v (inc idx) (count v))))
206
"The Joy of Clojure" by Michael Fogus and Chris Houser, Chapter 5.2.7
(remove-at [0 1 2 3 4 5] 3)
;; [0 1 2 4 5]
❶ “into” can be used on the first sub-vector to append the second one, which starts from the given
index idx plus 1.
subvec can be used for recursion, in a similar way to how first/rest is used to advance a
sequence. The following norm function calculates the norm of a vector 207:
(defn norm [v]
(loop [v v
res 0.]
(if (= 0 (count v))
(Math/sqrt res)
(recur (subvec v 1) ❶
(+ res (Math/pow (nth v 0) 2)))))) ❷
❶ subvec with the input vector and 1 as arguments is similar to the effect of rest on sequences.
❷ The "head" of the vector is accessed via nth.
(defn merge-vectors [v1-initial v2-initial cmp] ❶
(loop [result []
v1 v1-initial
v2 v2-initial]
(cond
(empty? v1) (into result v2)
(empty? v2) (into result v1)
:else
(let [[v1-head & v1-tail] v1
[v2-head & v2-tail] v2]
(if (cmp v1-head v2-head)
(recur (conj result v1-head) v1-tail v2)
(recur (conj result v2-head) v1 v2-tail))))))
(defn merge-sort
([v]
(merge-sort v <=))
([v cmp]
(if (< (count v) 2)
v
(let [split (quot (count v) 2)
v1 (subvec v 0 split) ❷
v2 (subvec v split (count v))]
(merge-vectors (merge-sort v1 cmp) (merge-sort v2 cmp) cmp)))))
207
en.wikipedia.org/wiki/Norm_(mathematics)
(merge-sort [2 1 5 0 3])
;; => [0 1 2 3 5]
(merge-sort [[2 :b] [2 :a] [1 :c]] #(<= (first %1) (first %2))) ❸
;; => [[1 :c] [2 :b] [2 :a]]
❶ merge-vectors takes two sorted vectors ("v1-initial" and "v2-initial") and "cmp", a comparison function,
and merges the vectors into one sorted vector. It compares the first element of each vector and the
element that is less according to the comparison function is added to the "result" vector. This is
repeated recursively until all the elements have been merged.
❷ subvec is used to split the vector "v" in half. Each half is sorted before being merged into the final
vector.
❸ One of the key properties of the merge sort algorithm is that it is a stable sort, which means that if two
elements have equal keys their relative order is unchanged. Here the key of both [2 :b] and [2 :a] is 2,
so their order is left unchanged.
SEE ALSO
• vec and vector also produce a persistent vector from a seqable collection.
Use vec when the creation of the new vector does not involve a subset of
another vector. Use vector to specify the elements that should belong to the vector.
• vector-of to create a vector of a primitive type. Use vector-of if space efficiency
is the biggest concern.
• into was used throughout the chapter to join sub-vectors back together.
❶ norm was slightly modified compared to the version in the examples. The introduction of the
index idx in the parameters of the loop-recur avoids the use of count at each iteration to count the
remaining elements in the sub-vector.
❷ The timing for a medium size vector of 1000 elements is about 91 micro-seconds (10⁻⁶ seconds).
❸ norm-idx is very similar to norm. The changes are removing the sub-vector and using nth to fetch the
element at idx.
❹ norm-idx is roughly 6 times faster. Considering that the loop is identical but missing sub-vector (the
difference between nth and peek is negligible) the slow-down is determined by the use
of subvec alone.
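The benchmark these callouts refer to is elided. A sketch consistent with them (the modified norm and norm-idx below follow the callouts; the vector size is an assumption):

```clojure
(require '[criterium.core :refer [quick-bench]])

(defn norm [v]
  (loop [v   v
         idx (count v)  ;; track remaining length instead of calling count each iteration
         res 0.]
    (if (zero? idx)
      (Math/sqrt res)
      (recur (subvec v 1) (dec idx)
             (+ res (Math/pow (nth v 0) 2))))))

(defn norm-idx [v]
  (let [cnt (count v)]
    (loop [idx 0
           res 0.]
      (if (== idx cnt)
        (Math/sqrt res)
        (recur (inc idx)
               (+ res (Math/pow (nth v idx) 2)))))))

(let [v (vec (range 1000))]
  (quick-bench (norm v))       ;; creates a sub-vector at every step
  (quick-bench (norm-idx v)))  ;; plain indexed access
```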
Thanks to immutability, the view created by subvec is isolated from additional changes
happening to the original vector. The isolation is not perfect though and can result in
subtle side effects dealing with large vectors. The newly created sub-vector is
essentially a wrapper around the original vector screening out the unwanted elements,
so no copying of elements occurs and hence subvec completes in constant time and
space. The generated sub-vector maintains a reference to the original vector though,
which can prevent elements from being garbage collected. The following example
illustrates the problem: 2 sub-vectors are created with the intention of joining them later:
(defn bigv [n]
(vec (range n)))
❶ subvec is used to cut a tiny slice of a much larger vector. The large vector is not referenced anywhere
else, but is kept alive by a reference living inside the subvec implementation.
❷ Depending on the JVM settings, you might need to tweak the size of the larger vector to see the out of
memory problem shown here. The JVM was started with 512Mb of heap size in this case.
❸ Each subvector is transferred into a new vector instance, so their inner vector reference can be
garbage collected.
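The example code is elided above; a sketch of the idea (the vector sizes are assumptions, scaled down from the out-of-memory scenario described in the callouts):

```clojure
(defn bigv [n]
  (vec (range n)))

;; each slice is tiny, but keeps the whole underlying vector alive
(def s1 (subvec (bigv 1000000) 0 5))
(def s2 (subvec (bigv 1000000) 0 5))

;; copying the slices into fresh vectors drops the references to the
;; big vectors, letting them be garbage collected
(def joined (into (into [] s1) (into [] s2)))
;; joined is [0 1 2 3 4 0 1 2 3 4]
```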
13
Sets
A set is a data structure that contains distinct and unordered items. Sets are widely used
in computer science and have a strong connection with mathematical set theory.
Clojure offers two types of sets with their corresponding initializer functions:
• hash-set is the most used, offering fast lookup, the specific syntax literal #{} and
a transient version. hash-set is implemented on
top of clojure.lang.PersistentHashSet which in turn is a thin layer on top
of clojure.lang.PersistentHashMap, the same as hash-map. Both hash-set
and hash-map are instances of a Hash Array Mapped Trie 209. hash-set offers
near constant time lookup (at O(log32N)), addition and removal of items.
• sorted-set also guarantees uniqueness of items but additionally maintains ordering
based on a comparator. It is based on Red Black trees (see subseq call-out section)
and offers a well balanced logarithmic access (at O(log2N)).
Both set types can be used as functions (especially as predicates) to verify the presence
of items in a concise way:
((sorted-set 5 3 1) 1) ; ❶
;; 1
❶ An example of using a sorted-set as a function. The sorted-set looks up the argument in the set and
returns it if present, or nil otherwise.
209
HAMT, or Hash Array Mapped Trie, is a tree-like data structure suitable for implementing persistent collections. We
introduced its general properties at the beginning of the vector chapter.
❷ Another idiomatic use of sets with some to determine if at least one of the items in the input vector is
in the set. In this case, the number "1" is present and returned.
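An expression matching this description could be (the sample input vector is an assumption):

```clojure
(some (sorted-set 5 3 1) [4 1 7]) ; ❷
;; 1
```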
13.1 hash-set
function since 1.0
hash-set is the main initializer function for Clojure hash sets, a type of unordered data
structure that does not allow duplicates:
(hash-set :yellow :red :green :green) ; ❶
;; #{:yellow :green :red}
❶ hash-set takes any number of values. Note that :green is present twice in the input but only once in
the resulting set.
CONTRACT
INPUT
• "keys" can be any number of items of any type.
OUTPUT
• returns: a clojure.lang.PersistentHashSet instance containing the given
"keys". It returns an empty set when invoked without arguments. If multiple
instances of the same item (as per equality semantics) are present, only one instance
is added to the set.
EXAMPLES
hash-set has a reader literal #{} that denotes the same data structure:
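For instance, the two forms below produce equal sets:

```clojure
(= #{:yellow :red :green}
   (hash-set :yellow :red :green)) ; ❶
;; true
```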
❶ The reader literal #{} expands to create the related hash-set data structure.
Note however that the syntax literal #{} does not take care of duplicates automatically:
#{:yellow :red :green :green} ; ❶
;; IllegalArgumentException Duplicate key: :green
❶ Differently from hash-set, the set syntax literal does not allow duplicate keys: it throws an exception
instead of creating the set.
Also differently from the syntax literal, hash-set is useful when the input items are not
compile-time constants (such as numbers, strings, keywords etc.) but need evaluation:
#{(rand) (rand) (rand)} ; ❶
;; IllegalArgumentException Duplicate key: (rand)
❶ We are trying to create a set with 3 random numbers in it. The syntax literal #{} treats the input items
as compile-time constants: the three identical (rand) forms are detected as duplicates by the reader,
before any of them is evaluated. This results in a duplicate key exception.
❷ Using hash-set we make sure input elements are evaluated before entering the set, avoiding the
unexpected compile-time error.
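The hash-set version of the same expression might read:

```clojure
(hash-set (rand) (rand) (rand)) ; ❷
;; a set of three (almost certainly distinct) random numbers
```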
Let’s clarify the meaning of uniqueness of the items in a set in the context of metadata.
If multiple instances of the same value are added to the set and they have different
metadata, hash-set retains the metadata from the first item:
(def set-with-meta
(hash-set ; ❶
(with-meta 'a {:pos 1})
(with-meta 'a {:pos 2})
(with-meta 'a {:pos 3})))
set-with-meta ; ❷
;; #{a}
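Checking the metadata on the single element (here rebuilding the set inline so the snippet is self-contained) shows the metadata of the first inserted instance:

```clojure
(meta (first (hash-set (with-meta 'a {:pos 1})
                       (with-meta 'a {:pos 2})
                       (with-meta 'a {:pos 3})))) ; ❸
;; {:pos 1}
```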
❶ We use hash-set as usual. The items are all instances of the same symbol 'a but they have different
metadata.
❷ When we check the content of the set, we see a single element as expected.
❸ The metadata on the element is the one attached to the element that was first inserted. This behavior
is similar to hash-map’s treatment of multiple instances of the same key.
hash-set (like hash-map) has a syntax literal #{} to enable quick creation of sets. The
syntax literal enables elegant use of sets as predicates, like in the following example
using some:
(some #{:x :c} [:a :b :c :d :e]) ; ❶
;; :c
❶ A set created with hash-set or the syntax literal #{} is also a function. As a function, the set verifies if
the given argument is present in the set or not. some applies the set as a predicate to the elements
of the input collection, stopping at the first result different from nil.
❷ If there are no elements from the collection in the set, some returns nil.
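When no element of the collection is in the set, the same idiom returns nil:

```clojure
(some #{:x :y} [:a :b :c :d :e]) ; ❷
;; nil
```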
(def powerset-of-s ; ❷
#{#{} #{:a} #{:b} #{:c} #{:a :b} #{:a :c} #{:b :c} #{:a :b :c}})
❶ Our input is a set of 3 elements. We want to find all the combinations of the 3 elements irrespective of
ordering, including the empty set #{}. If we were to take order into account, we would more
specifically talk about permutations instead of combinations.
❷ We can see here the expected output. Note that we are using set literals #{} for the outer and inner
sets to enforce uniqueness of items.
The powerset of a set contains (Math/pow 2 (count s)) items (the example above contains
2^3 = 8 elements including the empty set), where (count s) is the size of the input. There are several ways
to calculate the powerset. With reference to the powerset of the 3 items :a, :b and :c, we can start by
observing the following equivalence (pseudo-code):
[[] [:a] [:b] [:a :b]] ; ❶
U [[:c] [:a :c] [:b :c] [:a :b :c]] ; ❷
= [[] [:a] [:b] [:c] [:a :b] [:a :c] [:b :c] [:a :b :c]] ; ❸
❶ The U operator means union. The first term of the union is the powerset(:a, :b, :c) after we
remove all the subsets containing the item :c. The item has been selected at random.
❷ The second term of the union are all the other subsets that instead contain the previously removed
item :c. Note that this second subset can be derived from the first one by adding :c to each item.
❸ The union U results in the complete powerset(:a, :b, :c).
A more formal definition of what we observe in the example above is: the powerset(:a :b :c) can be
assembled together from the union of powerset(:a :b) with the set obtained from powerset(:a
:b) once we’ve added :c to each element. We can use this observation to recursively build a powerset:
❶ when-first takes the first element from the input set and, at the same time, guards against the empty
case. When the input set "s" is empty, we’re done with the recursion.
❷ disj is used next to remove the first element from the set. The recursive powerset call potentially
returns nil after reaching the end of the set (the result of when-first), so we provide the initial value
for the computation using or. This is the empty hash-set containing another empty set. We could
have written {{}} instead, but the double nested constant literal is less readable.
❸ We have now the result of calling powerset on "s" (minus one item) available as the local binding "p".
We can now apply the rest of the observation: the clojure.set/union function is applied to "p" and
the result of merging back the removed element into each item of "p".
❹ The test confirms that the powerset function produces the expected 8 output items.
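A version of powerset matching the description in the callouts might look like this (a sketch, not necessarily the book’s exact code):

```clojure
(require '[clojure.set :refer [union]])

(defn powerset [s]
  (when-first [x s]                          ; ❶ nil when "s" is empty
    (let [p (or (powerset (disj s x))        ; ❷ recur on "s" minus "x",
                #{#{}})]                     ;   seeding with #{#{}}
      (union p (set (map #(conj % x) p)))))) ; ❸ merge "x" back into each subset

(count (powerset #{:a :b :c})) ; ❹
;; 8
```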
This powerset formulation faithfully implements the initial observation that we made on what
constitutes a "powerset", but it’s not tail-recursive. It takes a top-down approach by defining the powerset
from its final content down to the empty set, consuming the stack and building the actual result on the
way back. If we want to express it as tail-recursive, we need a way to accumulate results from the
bottom up, starting from the empty set and gradually adding elements to each subset. The observation is
the same but reads backward: the next powerset is equal to the union of the previous one plus the set
where each item has been added a new element. The following formulation takes this approach, resulting
in better performing and more concise code:
❶ Instead of custom recursion or loop-recur, we can now delegate recursion to reduce and build the
incremental results starting from the empty set.
❷ At each reduce step we are presented with the partial view of the powerset so far and the next
element "x". We can proceed to apply the observation using union of the current powerset "s" with all
elements of "s" after adding the new element "x".
❸ We confirm that the new formulation is returning the same results.
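The reduce-based formulation could be written as (again a sketch consistent with the callouts):

```clojure
(require '[clojure.set :refer [union]])

(defn powerset* [coll]
  (reduce
   (fn [s x]                                 ; ❷ partial powerset "s", next "x"
     (union s (set (map #(conj % x) s))))
   #{#{}}                                    ; ❶ start from the empty set
   coll))

(= (powerset* #{:a :b :c})
   #{#{} #{:a} #{:b} #{:c} #{:a :b} #{:a :c} #{:b :c} #{:a :b :c}}) ; ❸
;; true
```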
SEE ALSO
• set is optimized to create a set starting from an already existing collection.
Prefer set to hash-set with apply to create a set from a collection.
• sorted-set maintains the order of the items in a set through a comparator (default
or custom). When the set is iterated, the items are returned following the ordering
generated with the comparator. Use sorted-set instead of hash-set if you require
ordering of the elements in a set. Note that sorted-set is not based on insertion
order (like array-map for instance). If you need insertion order, the closest option
is to use a vector and remove duplicates with distinct before iteration.
PERFORMANCE CONSIDERATIONS AND IMPLEMENTATION DETAILS
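The benchmarks described below can be approximated with expressions like the following (shown here with time for a rough idea; the book uses criterium’s quick-bench, and the collection size is an assumption):

```clojure
(def xs (vec (range 100000)))

(time (apply hash-set xs)) ; ❶ apply spreads the collection as arguments
(time (into #{} xs))       ; ❷ into conj-es onto a transient set
(time (set xs))            ; ❸ the specialized constructor
```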
❶ The first benchmark uses hash-set with apply to create a large set. apply transforms the arguments
in a sequence and each element is added to the set. Internally, items are added to a transient version
of the set.
❷ into supports transients. The multiple conj operations can now happen on a mutable instance.
❸ set is a specialized function to transform other collections into a set. It also uses transients.
Please also check the set performance section for additional benchmarks related to different input collection types.
©Manning Publications Co. To comment go to liveBook
13.2 set
function since 1.0
(set [coll])
set creates a new clojure.lang.PersistentHashSet (the Clojure set type) from the
items in the given input collection:
(set [1 2 3 4 1 4]) ; ❶
;; #{1 4 3 2}
❶ set "converts" the input vector into a new set instance. Note that the output set doesn’t have a
specific ordering and doesn’t contain duplicates.
Along with hash-set and into, set is the main way to create new set instances.
CONTRACT
INPUT
• "coll" is the only mandatory argument. "coll" can be any collection including the
most common Java iterable types but excluding transients.
NOTABLE EXCEPTIONS
• IllegalArgumentException when "coll" is not a collection.
OUTPUT
• returns: a new set instance containing the items from the input collection. If "coll"
is empty or nil, it returns an empty set. All duplicated items from "coll" (if any)
are removed in the output set. If "coll" is already a set the input is passed through
without transformations, but the metadata, if any, are removed.
EXAMPLES
A typical use of set is to transform an existing collection, removing any duplicates in
the process and preventing new ones in future operations. Note that set implies the
creation of a new independent set, so if any metadata is present in the input, it is
removed on purpose:
(def input-set (with-meta #{} {:original true})) ; ❶
(meta input-set) ; ❷
;; {:original true}
❶ The original input set has some metadata attached using with-meta.
❷ We can see the metadata anytime using meta.
❸ The input set is used as an input for the set function. We can see that the metadata are stripped
away.
Also note that if the input is a sorted-set it is not transformed into an unordered hash-
set:
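For example (assuming a Clojure version where set passes existing sets through, as described in the contract above):

```clojure
(def ss (sorted-set 3 1 2))

(set ss)
;; #{1 2 3}

(sorted? (set ss)) ; still a sorted set, not a hash-set
;; true
```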
Transformations with set are also useful in conjunction with contains?. In the
following example, we set up a simple "honeypot" mechanism to prevent fraudulent use
of a web form 210 . The honeypot consists of an input HTML tag that is not visible to
human users but appears legitimate when bots parse the page.
Once the web request comes in as a hash-map, we need to verify if it contains a specific
value encoded for the honeypot input. Depending on the page, there could be one or
more honeypot fields with legitimate names like "option1" or "option2":
(def honeypot-code "HP1234")
(def valid-request
{:name "John"
:phone "555-1411-112"
:option1 ""
:option2 ""}) ; ❶
(def fake-request
{:name "Sarah"
:phone "555-2413-111"
:option1 "HP1234" ; ❷
:option2 ""})
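The honeypot? predicate itself is not shown above; based on the description in the callouts, a minimal sketch could be:

```clojure
(defn honeypot? [request]
  ;; ❸ set on the values enables contains?, which would not work
  ;; on the sequence returned by vals
  (contains? (set (vals request)) honeypot-code))
```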
(honeypot? valid-request)
;; false
(honeypot? fake-request)
;; true
210
A "Honeypot" in computing is a legitimate mechanism to interact with a service that allows distinguishing fraudulent use.
See en.wikipedia.org/wiki/Honeypot_(computing) for more information.
❶ valid-request contains a few honeypot fields that are correctly empty. A legitimate user can’t see
the corresponding input and can’t make a choice.
❷ If we access the raw source of the same web page instead (like an automatic program would do),
there is no way to distinguish honeypot inputs (at least in this simple example). The automatic
program would then proceed to analyze the page and fill the input with the honeypot code (because
it’s a mandatory choice for a radio-button, for instance).
❸ A straightforward way to verify the presence of one or more honeypot codes, is to use set on the
sequence of values from the request map. contains? supports sets (it would not work with
a sequence or vector) to verify the presence of the honeypot code.
Collection initializers
You might have noticed a pattern learning about data structures and their functions: Clojure usually
offers a constructor taking any number of items (like hash-set) and another to deal with entire
collections (like set). The following table contains the most common constructors for the different
collection types:
Table 13.1. A summary of the initializers available for the different collection types
SEE ALSO
• hash-set also creates new set data structures, allowing any number of items as
arguments. In general, (set coll) should be preferred instead of (apply hash-
set coll).
The following chart shows set applied to different collection types of the same size
(the chart also includes a native array). We can’t see a big difference, as reduce is
optimized for most of the collection types:
In terms of comparison with hash-set, both functions use transients to populate a new
mutable set and convert it into a persistent one before returning it. The differences in terms
of pure speed are small, with some advantage for set:
(require '[criterium.core :refer [quick-bench]])
Let’s now explore how set behaves in relation to the percentage of distinct elements.
The chart below confirms that the more items repeat in the input (clashing with
already existing ones in the output), the faster the creation is:
Figure 13.2. Chart showing how creation speed changes related to the percentage of distinct
elements in the input.
sorted-set and sorted-set-by are initializers for ordered sets, a collection type
similar to hash-set that also maintains ordering based on a comparator. We can build a
new sorted-set passing the required elements as arguments:
(sorted-set "t" "d" "j" "w" "y") ; ❶
;; #{"d" "j" "t" "w" "y"}
We can force a different comparator than the default one using sorted-set-by:
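For example, reversing the default order of compare (the comparator shown is one possible choice):

```clojure
(sorted-set-by (fn [a b] (compare b a)) "t" "d" "j" "w" "y") ; ❶
;; #{"y" "w" "t" "j" "d"}
```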
❶ When the default ordering is not desirable, we can pass sorted-set-by a new comparator.
The compare function in the standard library is compatible with many Clojure and Java types and can
be used with strings as well. If we want to invert the order, we just need to swap the arguments to compare.
CONTRACT
INPUT
• "keys" can be any number of elements to be added to the newly created sorted set.
• "comparator" can be any object implementing
the java.util.Comparator interface. Clojure functions (with the exception of data
structures used as functions) implement the Comparator interface, a nice trick that
allows <, >, >= and <= to be used as comparators.
NOTABLE EXCEPTIONS
• ClassCastException is thrown when any pair of "keys" is not comparable. In
general, keys are not comparable when they are of different types (although
different numeric types are comparable). For a more detailed list of corner cases,
please see compare.
OUTPUT
• returns: a new instance of clojure.lang.PersistentTreeSet, the class
implementing sorted sets in Clojure. The new instance contains all "keys" passed
as input. When iterated, the sorted set returns "keys" in the order determined by
"comparator" or the default one if none is given.
NOTE Please note that, unlike hash-set, there is no transient version of sorted-
set or sorted-map.
EXAMPLES
sorted-set and sorted-set-by guarantee their content does not contain duplicates.
Like hash-map and hash-set, sorted-set maintains the metadata from the first inserted
item in case of duplicates with different metadata in the input:
(defn timed [s] ; ❶
(let [t (System/nanoTime)]
(println "key" s "created at" t)
(if (instance? clojure.lang.IMeta s)
(with-meta s {:created-at t})
s)))
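Using timed to build a sorted-set with two instances of the same symbol might look like this (a sketch; the exact timestamps will differ on every run):

```clojure
(def s (sorted-set (timed 'a) (timed 'a))) ; ❷ same symbol, different metadata

(meta (first s)) ; ❸ the metadata of the first inserted instance
;; {:created-at <nanos of the first timed call>}
```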
❶ We use the timed function on the items to store the creation date as part of the object metadata.
❷ "s" is a sorted-set created with the same symbol "a" twice. Symbols support metadata along with
most of the collection types.
❸ We can see the retained metadata is the one attached to the first symbol.
sorted-set is useful to maintain order as the set gets updated incrementally, as each
new item is inserted in order following the comparator. In the next example we let
users update a base dictionary of words with their own new words and spellings. If we
store the dictionary in a sorted-set there is no need to sort after each update:
(require '[clojure.string :refer [split-lines]])
(def dict ; ❶
(atom
(->> "/usr/share/dict/words"
slurp
split-lines
(into (sorted-set)))))
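The functions new-word and spell-check referenced by ask-word are not shown; following the callouts, they could be sketched like this (the prompt strings are assumptions; defonce keeps the dictionary loaded above when present, with a tiny in-memory word list standing in so the snippet is self-contained):

```clojure
;; defonce: a no-op if dict was already defined above; otherwise a
;; small stand-in word list
(defonce dict
  (atom (into (sorted-set)
              ["google" "googly" "googol" "googolplex" "googul"])))

(defn new-word [w] ; ❷
  (println "Add word to dictionary? [y/n]")
  (when (= "y" (read-line))
    (swap! dict conj w)
    ;; ❸ subseq jumps straight to the neighborhood of the new word
    (println "word added" (take 5 (subseq @dict >= w)))))

(defn spell-check [w]
  (if (contains? @dict w) ; ❹
    (println "Word spelled correctly")
    (do (println "Could not find the word:" w)
        (new-word w))))
```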
(defn ask-word []
(println "Please type word:")
(when-let [w (read-line)]
(spell-check w)))
(ask-word) ; ❺
;; Please type word:
;; google
;; Could not find the word: google
;; Add word to dictionary? [y/n]
;; y
;; word added (google googly googol googolplex googul)
(ask-word)
;; Please type word:
;; google
;; Word spelled correctly
❶ dict is a top-level definition in the current namespace that holds the initial load of words from a
local file. /usr/share/dict/words is a file present on most Unix file systems. After splitting the file into
lines, we store the corresponding words into a sorted-set. The set is wrapped in an atom to allow
controlled mutation.
❷ new-word is used when we discover a word that is not already in the dictionary. The function asks the
user if they want the word to be added and proceeds to update the atom.
❸ subseq is a perfect choice to extract a portion from the ordered set, as it avoids a linear scan to reach
the target word. We can provide a quick feedback about the position of the word in the dictionary.
❹ Another idiomatic operation on sorted-set is contains?, which we use here to verify if a word is in the
set.
❺ We can see a user interacting with the example to add a new word to the dictionary.
Custom comparators
Thanks to Clojure’s extended comparison semantics, we can also use collections in a sorted-set:
❶ Comparison of vectors proceeds item by item, positionally. After checking that "1" is
the same, the comparison continues with the next item. Since "a" comes before "b", the entire vector
containing "a" is placed before the entire vector containing "b".
❷ This second formulation using sorted-set-by with compare is equivalent to the first. You’re going to
use sorted-set-by with a different comparator in case the default behavior is not what you want.
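For instance, with vectors (the second form shows the equivalent explicit comparator):

```clojure
(sorted-set [1 "b"] [1 "a"]) ; ❶
;; #{[1 "a"] [1 "b"]}

(sorted-set-by compare [1 "b"] [1 "a"]) ; ❷
;; #{[1 "a"] [1 "b"]}
```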
If the default compare is not sufficient, we can pass a custom one, for example to sort by first or last
element only. At this point it is useful to understand how the comparator works. There are two distinct
phases involved in adding a new element to a sorted-set:
• Skip the item if it’s already in the set. To do this, the comparator is called on each existing item
against the new one. If any comparison returns "0" (which means they are the same for a possibly
custom definition of equality), then the element is not added to the set.
• If not already in the set, modify the set structurally to accommodate the new item in the right place.
This phase uses the comparator again to decide where the new item should be added.
Equality and relative ordering of two items might not follow the same logic. Here’s an example
where we would like to order vectors in a sorted-set by count, bigger vectors first:
(sorted-set-by ; ❶
(fn [a b] (compare (count b) (count a)))
[1 :a] [:b] [3 :c] [:v])
;; #{[1 :a] [:b]}
Some items in the input do not appear in the output, which is correct behavior since we designed a
custom comparator that returns "0" when the items have the same size: the output sorted-set is going
to be unique by vector size instead of vector content.
To prevent the problem, the custom comparator should distinguish between the two aspects. The
following example introduces a fall-back option in case the two sizes are the same, to verify if the
elements are also equal:
(sorted-set-by
(fn [a b]
(let [cmp (compare (count b) (count a))]
(if (zero? cmp) ; ❶
(compare a b)
cmp)))
[1 :a] [:b] [3 :c] [:v])
;; #{[1 :a] [3 :c] [:b] [:v]}
❶ It’s not enough to compare sizes, as the two items "a" and "b" could be equal in size but not content-
equal.
Finally, the same comparator can be shortened using wrapping vectors. This is possible because vector
comparison semantics compare items positionally, removing the need for conditionals:
(sorted-set-by
(fn [a b] ; ❶
(compare [(count b) a] [(count a) b]))
[1 :a] [:b] [3 :c] [:v])
;; #{[1 :a] [3 :c] [:b] [:v]}
❶ The same fall-back option can be expressed by putting the different aspects of equality inside a
vector.
SEE ALSO
• hash-set and sorted-set serve different purposes and have different performance
profiles. Use hash-set if you are not interested in ordering but still need
uniqueness guarantee.
• sorted-map is a very close cousin of sorted-set and sorted-set-by: they are
actually based on the same implementation. Use a sorted-map when it makes
sense to have key-value pairs.
• compare was mentioned many times in this chapter. It is probably the most
flexible tool for comparison, as it extends comparison semantics to most Clojure
types.
PERFORMANCE CONSIDERATIONS AND IMPLEMENTATION DETAILS
13.4 disj
function since 1.0
(disj
([set])
([set key])
([set key & ks]))
disj removes one or more elements from a set (sorted or not sorted):
(disj #{1 4 6 8} 4 8) ; ❶
;; #{1 6}
❶ disj is used to remove the number "4" and the number "8" from the input set.
❷ disj can be used similarly on sorted set.
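disj works the same way on a sorted set:

```clojure
(disj (sorted-set 1 4 6 8) 4 8) ; ❷
;; #{1 6}
```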
CONTRACT
INPUT
• "set" is the argument name for the input set. When this is the only argument, it can
be of any type including nil. When additional arguments are present, then "set" is
expected to implement the clojure.lang.IPersistentSet interface. There are
two Clojure built-in types implementing IPersistentSet: those created
using hash-set (or set) and those created using sorted-set.
• "key" and "ks" are any number of items to remove. They can be of any type and
they are optional arguments.
NOTABLE EXCEPTIONS
• ClassCastException when "set" does not implement
the clojure.lang.IPersistentSet interface. To understand if the "set" argument
implements the right interface, you can use the set? function or,
alternatively, (instance? clojure.lang.IPersistentSet set) should
return true.
OUTPUT
• returns: the input "set" without "key" and the items in "ks". If "set" is the
only argument, disj returns it unchanged (identity semantics). If "set" is nil,
then disj returns nil even if "key" or "ks" are present.
EXAMPLES
We could use disj to verify the presence of invalid values. In the following example
we receive a vector containing configuration values. If we have a list of allowed
values, we can use disj to detect the presence of unwanted values by exclusion:
(defn valid? [allowed values]
(empty?
(apply disj (set values) allowed))) ; ❶
❶ After transforming the input values into a set (which also removes any duplicates), we repeatedly
remove all allowed values with disj. If anything is left, it is not part of the list of valid values.
❷ When valid? is used against a collection that includes an invalid number, it returns false.
❸ After removing the invalid number, valid? returns true.
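Hypothetical calls matching the callouts (the allowed list and the sample values are assumptions):

```clojure
(valid? [1 2 3] [1 2 5 2]) ; ❷ 5 is not allowed
;; false

(valid? [1 2 3] [1 2 2])   ; ❸ all values allowed
;; true
```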
In the next example we are going to use a set to maintain a list of open connections.
Each connection corresponds to a local port served by an echo server (a server that listens
for incoming connections and replies by repeating the input). We maintain the list of
used ports in a globally accessible hash-set of ports 211 and start a new listener only
if the port is free:
(require '[clojure.java.io :as io])
(import '[java.net ServerSocket])
211
Please note that to simplify the example, the condition that checks for a free port and the following start of a
new ServerSocket are not performed atomically. Many concurrent requests on the same port might actually result in
exceptions because the port is already in use.
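The server code is summarized by the callouts above and below; a minimal sketch consistent with them (the ports atom name and the single-request-per-listener behavior are assumptions):

```clojure
(def ports (atom #{})) ; the globally accessible set of busy ports

(defn serve [port]
  (if (contains? @ports port)
    "Port already serving requests."
    (do
      (swap! ports conj port)
      (future
        (with-open [server (ServerSocket. port)
                    socket (.accept server)]       ; blocks until a client connects
          (let [in  (io/reader socket)
                out (io/writer socket)]
            (.write out (str (.readLine in) "\n")) ; echo one line back
            (.flush out)))
        (swap! ports disj port)))))
```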
To see the echo-server in action, we need a mix of Clojure from the REPL and a
command like Telnet 212. After calling serve at the REPL, we need a
corresponding telnet from the command line to unblock the port:
(serve 10001) ; ❶
;; #object[future_call 0x41da {:status :pending, :val nil}]
(serve 10001) ; ❷
;; "Port already serving requests."
❶ serve on an unused port creates a new future instance which is returned in :pending state, as
the readLine call on the socket input is blocking. The thread is ready to receive a request on
port 10001.
❷ A second call to serve on the same port results in a message that the port is already in use.
❸ We use telnet from a command line. After establishing a connection, telnet waits for the input. The
input ends after hitting "return" on the keyboard. If we type "hello" and hit "return" we can see another
"hello" repeated just below.
SEE ALSO
• dissoc is the equivalent operation for hash-map.
• disj! is the equivalent operation for transient sets.
• conj is the opposite of disj, although conj is a general purpose function that
works on many collection types, not just sets.
212
Telnet is installed by default in many Linux distributions, Mac OS and Windows (although on Windows it might require
configuration to enable it).
• difference is another option to remove multiple items from a set. difference takes
the items to remove grouped in a set, while disj accepts them as distinct
arguments.
PERFORMANCE CONSIDERATIONS AND IMPLEMENTATION DETAILS
⇒ O(log32N) single item, hash-set ⇒ O(log2N) single item, sorted-set ⇒ O(n) linear
in the number of arguments
disj (similarly to dissoc) has a linear dependency on the number of keys to remove. In
the case of a single key, the performance profile depends on the type of
set: disj for hash-set is close to constant time, more precisely O(log32N) where "N" is
the number of items in the set. disj for sorted sets is still logarithmic but with a
different constant factor (the base is 2 instead of 32). In general, removing a single
element from a set is a quick operation that shouldn’t generate any major concern.
If disj needs to be used to remove multiple items, there are alternatives to
consider. In the following benchmark we measure disj against clojure.set/difference:
(require '[clojure.set :refer [difference]])
(require '[criterium.core :refer [quick-bench]])
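Sketches of the three measurements (shown with time for a rough idea; the book uses quick-bench, and the collection sizes are assumptions):

```clojure
;; rough timings with time; quick-bench gives more reliable numbers
(def s    (set (range 100000)))
(def ks   (range 200))
(def kset (set ks))

(time (apply disj s ks))        ; ❶
(time (difference s (set ks)))  ; ❷ the argument set is created on the fly
(time (difference s kset))      ; ❸ items already grouped in a set
```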
❶ The first benchmark measures disj on a medium size set to remove some 200 arguments. Note that
we need to use apply to spread the collection of arguments.
❷ clojure.set/difference requires the items to remove to be in a set. The second benchmark
assumes you don’t have a set already, so we create a set with the arguments along with the call
to difference.
❸ In the final benchmark, we assume the optimal case in which the arguments are already grouped as a
set.
The difference between disj and clojure.set/difference for this specific case is
small, but you should use clojure.set/difference if the items to remove are already
in a set (creating the set on the fly slows the benchmark considerably).
If your application has a critical section that requires removing items from a set, you
should look into transients:
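A transient-based variant like the disj* discussed below could be written as:

```clojure
(defn disj* [s ks] ; ❶ transient, remove the items, back to persistent
  (persistent!
   (reduce disj! (transient s) ks)))

(disj* (set (range 10)) [1 3 5])
;; the set without 1, 3 and 5
```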
❶ This version of disj called disj* transforms the set into a transient before removing items.
❷ The benchmark confirms that removing items while the set is in transient state has positive effects
on speed.
disj* is roughly 50% faster than normal disj, but remember that this is true only if
the number of items to remove is sufficiently large.
(union
([])
([s1])
([s1 s2])
([s1 s2 & sets]))
(intersection
([s1])
([s1 s2])
([s1 s2 & sets]))
(difference
([s1])
([s1 s2])
([s1 s2 & sets]))
213
For additional information on set theory and other possible operations, please see the Wikipedia
page: en.wikipedia.org/wiki/Set_theory
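The three operations on a pair of example sets (the inputs are assumptions; element order in the printed sets may vary):

```clojure
(require '[clojure.set :as s])

(s/union #{1 2 3} #{2 4 6})        ; ❶
;; #{1 2 3 4 6}
(s/difference #{1 2 3} #{2 4 6})   ; ❷
;; #{1 3}
(s/intersection #{1 2 3} #{2 4 6}) ; ❸
;; #{2}
```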
❶ The union of two sets groups together all items from two sets removing duplicates.
❷ The difference of two sets removes all items in the first set that are also present in the second set.
❸ The intersection of two sets groups together all the items that are common between the sets.
CONTRACT
INPUT
• union is the only function accepting no
arguments. intersection and difference require at least one set argument
or nil.
• "s1" and "s2" can be nil, hash-set or sorted-set. If they are not nil, it follows that
(set? s1) returns true, (set? s2) returns true and they should both implement
the clojure.lang.IPersistentSet interface.
• "sets" is any additional sets following the same specification as "s1" and "s2".
NOTABLE EXCEPTIONS
Most exceptions happen when arguments are not implementing
clojure.lang.IPersistentSet interface:
OUTPUT
union, intersection and difference return a new set instance or nil. The new set
instance has the type of the first set argument (hash-set or sorted-set in case of
native set types, or possibly nil). The content of the result set depends on the
operation:
• The union of "s1", "s2" and "sets" is the set containing the sum of all the unique
elements from "s1", "s2" and any additional "sets".
• The intersection of "s1", "s2" and "sets" is the set containing all the common
elements in "s1", "s2" or any additional "sets".
• The difference of "s1", "s2" and "sets" is the set containing all the items in "s1",
minus the common items from "s2", minus other common elements from any
additional "sets".
WARNING Use of union, difference and intersection on other types of collections does not generate
errors, but the results are unpredictable. Please ensure the arguments are all sets before
using them.
EXAMPLES
Let’s illustrate a few interesting cases first. intersection results in an empty set
or nil if any of the sets is empty or nil, respectively. It returns an empty set even in the
presence of nil if the last argument is an empty set:
(require '[clojure.set :as s])
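For example:

```clojure
(s/intersection #{1 2} #{})     ; an empty set argument
;; #{}
(s/intersection #{1 2} nil)     ; a nil argument
;; nil
(s/intersection #{1 2} nil #{}) ; nil present, but the last argument is #{}
;; #{}
```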
The effect of nil and the empty set caused by their relative position is also difficult to
predict for union and difference. If nil is potentially one of the arguments, you should
not rely on the equivalence to the empty set to implement a conditional statement. In
general, it is better to remove nil first:
(apply s/intersection
(remove nil? [#{1 2 3} nil #{4 2 6}])) ; ❶
;; #{2}
❶ We use remove to get rid of potentially nil arguments before using intersection.
Figure 13.3. Diagram that shows the symmetric difference between S1 and S2 (darker color).
The symmetric difference of "s1" and "s2" is equivalent to the union of the sets after removing the elements they have in common.
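Clojure has no built-in symmetric difference, but one can be sketched from the primitives above:

```clojure
(require '[clojure.set :as s])

(defn symmetric-difference [s1 s2]
  ;; everything in the union except the shared elements
  (s/difference (s/union s1 s2)
                (s/intersection s1 s2)))

(symmetric-difference #{1 2 3} #{3 4})
;; #{1 2 4}
```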
SEE ALSO
• disj, when used with multiple arguments, works similarly to difference.
Prefer difference when all arguments are part of a set. If the items to subtract are
part of another collection type, the apply with disj is a possible option. There are
minor performance implications to consider discussed in the disj section.
PERFORMANCE CONSIDERATIONS AND IMPLEMENTATION DETAILS
Like other functions in this chapter, subset? and superset? should not be used with
collection types other than sets or nil:
(s/superset? nil #{}) ; ❶
;; true
❶ You should avoid nil arguments because they can give inconsistent results.
❷ Similarly, you should avoid collection types that are not sets. In this example we are testing the
presence of items at indexes "0" and "3" in the second vector argument, not the actual content.
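The vector example from the second callout could be:

```clojure
(s/subset? [0 3] [:a :b :c :d]) ; ❷ index lookup, not content
;; true
```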
(def accounts ; ❷
#{{:acc-id 1 :user-id 1 :amount 300.45 :type "saving"}
{:acc-id 2 :user-id 2 :amount 1200.0 :type "saving"}
{:acc-id 3 :user-id 1 :amount 850.1 :type "debit"}})
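The users relation referenced by the callouts is not shown; a definition consistent with the output printed later would be:

```clojure
(def users ; ❶
  #{{:user-id 1 :name "john"   :age 22 :type "personal"}
    {:user-id 2 :name "jake"   :age 28 :type "company"}
    {:user-id 3 :name "amanda" :age 63 :type "personal"}})
```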
❶ A relation like users has a strong resemblance to a table in a relational database.
❷ The accounts relation contains a "user-id" key. This key can be used to look up the related record in
another relation and is commonly referred to as a "foreign key".
The resemblance of "users" and "accounts" to the tables in a modern database is not a
coincidence. We can now use relation-oriented functions to perform interesting queries
on relations:
(require '[clojure.set :as s])
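The s/select query described in the first callout might look like this (the age predicate is an assumption, applied to the users relation):

```clojure
(s/select #(> (:age %) 25) users) ; ❶
;; #{{:user-id 2, :name "jake", :age 28, :type "company"}
;;   {:user-id 3, :name "amanda", :age 63, :type "personal"}}
```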
(s/project ; ❷
(s/join users accounts {:user-id :user-id}) ; ❸
[:user-id :acc-id :name])
(s/project
(s/join users accounts {:user-id :user-id})
[:user-id :acc-id :type]) ; ❹
214
The relational algebra defines the semantic of operations to handle data stored in a relational database. It was invented
around 1970 at IBM. Please see en.wikipedia.org/wiki/Relational_algebra for more information.
(s/project
(s/join users (s/rename accounts {:type :atype})) ; ❺
[:user-id :acc-id :type :atype])
;; {{:type "company"}
;; #{{:user-id 2, :name "jake", :age 28, :type "company"}},
;; {:type "personal"}
;; #{{:user-id 3, :name "amanda", :age 63, :type "personal"}
;; {:user-id 1, :name "john", :age 22, :type "personal"}}}
❶ s/select is similar to the "select" construct in SQL. SQL is the structured query language used in
relational databases. s/select filters relations based on a predicate (which is similar to the "where"
clause in SQL).
❷ s/project only keeps the given keys from each map in a relation, removing all the others. It is similar
to select-keys applied to each map in the relation.
❸ We can also s/join two relations, which combines the relations based on a common value
of one or more keys. In this specific example, a user map merges into the account map that has the
same ":user-id" value. s/join would automatically join on user-id, even if we didn’t pass the
mapping explicitly as we did. Note that we don’t see "Amanda" in the names, because no account
belongs to her and there is no matching user-id in accounts.
❹ We have a problem of clashing keys between relations: if we join two maps with the same key, the last
value to be merged overrides any previous one. What if we wanted to see both the user type and the
account type?
❺ We can solve clashing keys in a join using s/rename to give a specific key a new name
before s/join. We can see that now we have access to both types. This is equivalent to the "AS"
renaming construct in SQL. Note that this time we didn’t pass an explicit key to join on,
as s/join automatically uses user-id which exists in both relations.
❻ The final example shows how we can group a relation by a specific key using s/index (something
similar to "GROUP BY" in SQL).
Relations and the functions operating on them enable a small but fully working in-
memory database for Clojure. An in-memory database can be used for relatively small
datasets like configuration, rules or other structured data that are limited to the
lifetime of the application.
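As a sketch of the idea (the rules relation and its keys below are invented for illustration), querying such data looks a lot like querying a table:

```clojure
(require '[clojure.set :as s])

(def rules ; a hypothetical relation of configuration rules
  #{{:rule-id 1 :env "prod" :max-conn 100}
    {:rule-id 2 :env "dev"  :max-conn 10}})

(s/project
  (s/select #(= "prod" (:env %)) rules) ; filter rows, like a WHERE clause
  [:rule-id :max-conn])                 ; keep columns, like a SELECT list
;; => #{{:rule-id 1, :max-conn 100}}
```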
14 Concurrency
Unlike other mainstream languages, Clojure does not require explicit locking to solve
concurrency problems 215. Locking is still available as a low-level option, but the
presence of persistent data structures enables elegant and powerful abstractions to
solve the hardest concurrency problems.
This chapter describes the functions in the standard library dedicated to
concurrency and how to use them. They fall into a few main groups:
• future, promise and delay control threads in different ways. These functions are
not connected to a specific way to handle state, nor do they wrap any state
themselves.
• ref, atom and agent are several models to handle state, each one providing
different guarantees in case of concurrent access. var and volatile! also belong to
this group but they are illustrated in other chapters for their role in scenarios other
than pure concurrency.
• deref, validators and watchers are common features implemented in all
concurrency models.
• Finally, locking is the last resort in terms of concurrency handling around critical
sections of the code. The need for explicit locking should be exceptional and
relegated to some cases of Java interoperation.
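As a quick orientation (a minimal sketch, not tied to the examples that follow), the groups can be seen in miniature:

```clojure
;; Thread control: evaluate an expression on another thread.
(def f (future (+ 1 2)))

;; State handling: a value managed for concurrent access.
(def a (atom 0))
(swap! a inc)

;; Common features: deref works uniformly on both.
[@f @a]
;; => [3 1]
```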
215
For an introduction to locking in concurrency problems, please see en.wikipedia.org/wiki/Lock_(computer_science)
14.1 future
NOTE This section also mentions other related functions such as: future-call, future-
done?, future-cancel, future-cancelled? and future?
future takes one or more expressions as input and evaluates them asynchronously in
another thread:
(defn timer [seconds] ; ❶
(future
(Thread/sleep (* 1000 seconds))
(println "done" seconds "seconds.")))
❶ The timer function takes a number of seconds as input and creates a future that contains
a Thread/sleep call for the requested time and prints a message on screen. The result of the
computation is nil.
❷ Invoking timer doesn’t block. The message prints after the requested amount of seconds.
(def t2 (timer 10)) ; ❶
(future? t2) ; ❷
;; true
(future-done? t2) ; ❸
;; false
;; done 10 seconds.
(future-done? t2) ; ❹
;; true
❶ timer is the same function defined at the beginning of the section. We create a second timer "t2" set
to 10 seconds. The timer starts immediately.
❷ future? confirms that "t2" is effectively a future object.
❸ future-done? expects an object with type java.util.concurrent.Future which is part of the
interface returned by future. future-done? returns false while the timer is still running.
❹ When the "done 10 seconds." message prints, the timer is done evaluating the form.
We don’t necessarily need to wait for the future to finish. We can use deref (or the
corresponding reader macro @) to access the result of the computation. Beware that if
the future is not done yet, the call to deref will block:
(def sum (future (Thread/sleep 10000) (+ 1 1))) ; ❶
(realized? sum) ; ❷
;; false
(deref sum) ; ❸
;; 2
(realized? sum) ; ❹
;; true
(deref sum) ; ❺
(def an-hour ; ❶
  (future (Thread/sleep (* 60 60 1000))))
(future-cancelled? an-hour) ; ❷
;; false
(future-cancel an-hour) ; ❸
;; true
(future-cancelled? an-hour) ; ❹
;;true
(future-cancel an-hour) ; ❺
;; false
future-call
future-call is the lower level function used by the future macro to create a future.
There are very few reasons to use future-call directly. One reason is if you need to
use it as a higher-order function:
(mapv future [:f1 :f2]) ; ❶
;; CompilerException java.lang.RuntimeException
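Because future is a macro, it can’t be passed around as a value; future-call, being an ordinary function, can. A minimal sketch of the higher-order use:

```clojure
;; Wrap each value in a no-arg function, then hand it to future-call.
(def futs (mapv #(future-call (constantly %)) [:f1 :f2]))

(mapv deref futs)
;; => [:f1 :f2]
```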
When a function is created using fn* with ^{:once true} metadata, the Clojure compiler generates a
method that enforces clearing of the closed-over locals. Clearing is implemented by setting them to
"null" as soon as possible. This prevents the same function object from being usefully invoked again:
(let [s "local-var"
f1 (^{:once true} fn* [] (str "local-var: " s))
f2 (^{:once false} fn* [] (str "local-var: " s))]
[(f1) (f1)
(f2) (f2)])
;; ["local-var: local-var"
;; "local-var: " ; ❶
;; "local-var: local-var"
;; "local-var: local-var"] ; ❷
❶ Note that the second time "f1" is invoked, the local "s" has been cleared.
❷ When the once-only semantic is disabled, the local var is never cleared.
future calls benefit from the once-only semantics because the body expression runs just once,
while future objects can stay around for an arbitrarily long time (even when terminated). The once-only
semantics give the garbage collector an opportunity to remove potentially large locals even when
a future is still referenced by a thread pool (a common scenario). If you decide to use future-
call directly, it is advisable to apply the same pattern to avoid potential out-of-memory issues.
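Applied to a direct future-call, the pattern might look like the following sketch (expensive-report is a hypothetical stand-in for a large computation):

```clojure
(defn expensive-report []
  (vec (range 1e6))) ; stand-in for a computation producing a large local

;; ^{:once true} lets the compiler clear the closed-over "data" local
;; after the single invocation, so the GC can reclaim it even while
;; the future object is still referenced.
(def f
  (let [data (expensive-report)]
    (future-call (^{:once true} fn* [] (count data)))))

@f
;; => 1000000
```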
One use of future is to decouple the current thread from one or more expensive
computations. In the next example, several futures wrap potentially long http requests
and each call requires roughly the same time. By using future all requests start in
parallel:
(require '[clojure.xml :as xml])
❶ fetch-async takes a URL and makes an HTTP request as part of xml/parse. The resulting XML is
transformed into Clojure data structures which are completely realized.
❷ Each binding in the let block uses fetch-async to download a different feed. Instead of blocking for
each HTTP request, future decouples the call and returns immediately.
❸ At the point of concatenating the articles from the feeds, some requests might be done and others still
downloading. The final deref (@) blocks as long as the slowest request. But at that point, all feeds
have been downloaded.
While future models independent threads effectively, it becomes even more powerful
when used with other concurrency functions like promise or delay. We are going to see
more examples in the following sections.
(deliver p :location) ; ❸
;; #object[clojure.core$promise$reify__7005 0x16fb93fb {:status :ready, :val
:location}]
;; Thread 3 got access to Thread 2 got access to Thread 1
;; got access to :location:location:location
❶ promise creates a protected location "p". The location is initially empty and any attempt to deref "p"
results in a blocking call (unless deref with timeout is used).
❷ We create three separate threads with future. The expression contains a request for access to the
protected location "p" and blocks immediately. The thread evaluating these forms at the REPL never
blocks.
❸ deliver performs the following actions atomically: it stores the value :location inside the promise
"p" and opens the gate to allow access. What we see is the concurrent printing of the threads created
earlier.
As you can see from the example above, as soon as a value is delivered to the promised
location, all blocking threads gain access to the location at the same
time. promise creates a callable object (which is used by deliver):
(def p (promise))
(future (println "Delivered" @p))
(p :value) ; ❶
;; Delivered :value
(realized? p) ; ❷
;; true
(p :value) ; ❸
;; nil
❶ A promise object is also a function of one argument. We can invoke the function with a value to obtain
the same effect as deliver.
❷ realized? returns the current state of the promise which can be either realized or not realized.
❸ Any further delivery to the promise produces no action.
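deref also accepts a timeout in milliseconds and a fallback value, which avoids blocking forever on an empty promise. A small sketch:

```clojure
(def p (promise))

(deref p 100 :not-yet) ; wait at most 100ms, then return the fallback
;; => :not-yet

(deliver p 42)
@p
;; => 42
```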
promise and future can be useful for thread coordination. The next example is
inspired by the cigarette smokers problem described by Suhas Patil in 1971 216. The
problem models the following:
• A cigarette requires tobacco, paper and matches to be prepared and smoked.
• 3 smokers sit at a table having an infinite supply of tobacco, paper and matches
respectively, but missing the other 2 ingredients.
• A person not sitting at the table picks 2 ingredients at random and puts them on the
table.
• Each round, there should be only one person able to light up a cigarette.
The problem poses interesting challenges such as thread contention and
synchronization. We can model the problem using a promise for each ingredient and a
future for each smoker:
(def msgs (atom [])) ; ❶
216
The Cigarette and Smokers problem is described in this Wikipedia page
en.wikipedia.org/wiki/Cigarette_smokers_problem
(defn run [] ; ❹
(dotimes [i 5]
(swap! msgs conj (str "Round " i))
(let [tobacco (promise) paper (promise) matches (promise)]
(future (smoke "tobacco holder" paper matches))
(future (smoke "paper holder" tobacco matches))
(future (smoke "matches holder" tobacco paper))
(doseq [add (pick-two tobacco paper matches)] (add))
(Thread/sleep 10)))
@msgs)
❶ Printing to standard output would produce unreadable messages because of the multiple threads
trying to write to the same stream. A possible option is to serialize messages into a vector inside
the msgs atom.
❷ smoke takes the name of a smoker and the two missing ingredients they are waiting for. It
then derefs both ingredients in an attempt to complete and light a cigarette.
❸ pick-two selects two ingredients at random. It uses shuffle to randomize access to the ingredients.
Note that we need to wrap the deliver request in a function to avoid immediate execution.
❹ run orchestrates the simulation. Each promise represents the location for an ingredient
and future wraps the attempt at smoking a cigarette for each smoker. doseq evaluates delivery of only
two ingredients while three smokers compete for them. run returns the messages collected so far.
There is also the possibility for a deadlock if, for example, the smoker with tobacco
removes the paper while another removes the matches from the table. The simulation
avoids this kind of deadlock because the promise caches its value internally.
Unfortunately, there is another problem we need to fix. Let’s check the output from the
simulation:
(pprint (partition 5 (run))) ; ❶
;; ("Round 0" ; ❷
;; "tobacco holder attempts"
;; "paper holder attempts"
;; "matches holder attempts"
;; "matches holder successful!")
;; ("Round 1"
;; "tobacco holder attempts"
;; "paper holder attempts"
;; "tobacco holder successful!"
;; "matches holder attempts")
;; ("Round 2"
;; "tobacco holder attempts"
❶ run returns the collection of messages retrieved during the simulation. We use partition to group
messages by iteration.
❷ There is only one successful smoker each round, as expected.
(run)
(print @msgs)
;; ["Round 0" ; ❷
;; "tobacco holder :paper :matches"
;; "Round 1"
;; "matches holder :tobacco :paper"
;; "Round 2"
;; "matches holder :tobacco :paper"
;; "Round 3"
;; "tobacco holder :paper :matches"
;; "Round 4"
;; "tobacco holder :paper :matches"
;; "matches holder fail! :paper"
;; "paper holder fail! :matches"
;; "tobacco holder :paper fail!"
;; "paper holder :tobacco fail!"
;; "paper holder :tobacco fail!"
;; "tobacco holder :paper fail!"
;; "matches holder fail! :paper"
;; "paper holder fail! :matches"
;; "matches holder fail! :paper"
;; "paper holder fail! :matches"]
❶ The new version of the smoke function adds a timeout of 100 milliseconds to the deref call. It also
adds a default message to return when reaching the timeout.
❷ The messages are now showing failure messages after the timeout expires.
The modified version of deref using timeouts prevents threads from hanging
indefinitely when an ingredient is not delivered to the promise.
14.3 delay
NOTE This section also mentions other related functions: delay? and force.
delay is a macro that takes a form as argument. delay guarantees that the form is
evaluated only once, the first time deref is called:
(def d (delay (println "evaluated"))) ; ❶
(deref d) ; ❷
;; evaluated
;; nil
@d ; ❸
;; nil
❶ delay returns an object of type clojure.lang.Delay that we save in a var called "d". Nothing prints
on screen as delay stores the expression without evaluating it.
❷ As soon as we call deref on "d" the expression is evaluated.
❸ We can also use the reader macro "@" instead of deref. The result of evaluating the expression is
cached internally and returned. We can see there is no printout the second time we call deref.
Please note that using an atom would not replace the need for delay. Even with
an atom in place, concurrent threads would still be able to produce multiple
initializations:
(import '[java.net InetAddress Socket])
(def connection (atom nil)) ; ❶
(defn connect [] ; ❷
(swap! connection
(fn [conn]
(or conn
(let [socket (Socket. (InetAddress/getByName "localhost") 61817)]
(println "Socket connected to 61817")
socket)))))
(dotimes [i 3] ; ❹
(future (handle-request i)))
;; Socket connected to 61817
;; Socket connected to 61817Doing something with
;; 0
;; Doing something with 2
;; Socket connected to 61817
;; Doing something with 1
When the application starts receiving requests, multiple threads are able to execute
the swap! request to change the atom content. The atom handles concurrency by
allowing a few retries until one thread is able to store the socket connection. In doing
so, other connections are created and immediately abandoned, wasting
resources. By using delay instead of an atom, we can achieve the desired effect of
connecting once without wasting resources:
(def connection
(delay ; ❶
(let [socket (Socket. (InetAddress/getByName "localhost") 61817)]
(println "Socket connected to 61817")
socket)))
(dotimes [i 3] ; ❸
(future (handle-request i)))
;; Socket connected to 61817
❶ Instead of defining an atom, the connection is now declared as a delay. The comparison logic
previously required inside the swap! call is gone; only the creation of the Socket object
remains.
❷ handle-request now uses @ to dereference the delay.
❸ We start the same number of threads as before and now we can see a single printout confirming that
the socket object has been created just once. Following that, we can see the interleaving output
generated by handle-request.
delay? and force are utility functions to help manage delay objects. delay? asks if the
given argument is a delay:
(def d (delay (println :evaluated)))
(if (delay? d) ; ❶
:delay
:normal)
;; :delay
❶ delay? is a predicate function returning true when the given argument is a delay object. Note
that delay? does not force any evaluation.
force encapsulates a condition on the input argument: it derefs the argument if it’s
a delay, or returns it unchanged otherwise:
(def coll [(delay (println :evaluated) :item0) :item1 :item2]) ; ❶
(map force coll) ; ❷
;; :evaluated
;; (:item0 :item1 :item2)
Note that if a delayed computation produces an exception, the same exception object is
re-thrown at each deref:
(def d (delay (throw (ex-info "error" {:cause (rand)})))) ; ❶
❶ This delay definition produces an error on purpose. The error created with ex-info contains a random
number to verify whether the body that creates the exception gets reevaluated or not.
❷ As we can see by calling deref/@ twice on the same delay object, the exception is the same.
14.4 ref
NOTE This section also mentions other related functions such
as: sync, dosync, alter, commute, ensure, ref-set, ref-history-count, ref-min-
history, ref-max-history and io!.
ref, dosync and the other functions in this section are the main entry point into Clojure
Software Transactional Memory (STM for short). Specifically, ref is one of the
concurrency primitives along with atoms, agents and vars. What differentiates ref
from the other concurrency primitives is that multiple refs can coordinate inside the
same transaction. The canonical example of reference coordination in a transaction is
modeling the transfer of a sum from one bank account to another:
(def account-1 (ref 1000)) ; ❶
(def account-2 (ref 500))
(transfer 300) ; ❹
;; {:account-1 700, :account-2 800}
❶ ref creates an object of type clojure.lang.Ref. The ref accepts an initial value of any type
(including nil).
❷ dosync initializes a transaction context that monitors access to the reference objects inside the body
of the expression.
❸ Operations like alter notify the STM about the intention to perform a change to the reference. Such
changes can happen immediately, at the end of the dosync block, repeat multiple times or not
happen at all.
❹ In this simple example, we can verify that a sum of "300" was withdrawn from "account-1" and moved
to "account-2".
An apparently innocuous account transfer operation (such as the one above) leads to
many challenges in concurrent applications: after checking that there is enough money,
how do we make sure that another thread does not empty the first account before we
are able to transfer money to the second? A traditional solution to this problem is to
use explicit locking, delegating to programmers the responsibility to deal with
concurrency. Clojure however takes a lock-free approach to thread coordination with
the STM 217.
217
The STM does use some degree of locking internally. The claim that the STM is "lock-free" is from the user perspective,
not the internal implementation.
NOTE dosync is a wrapper around sync that does not pass any option. sync was designed to accept
options but as of now, there is no option available.
(ref-history-count r) ; ❷
;; 0
(future ; ❸
(dosync
(println "T1 waiting 5 seconds")
(Thread/sleep 5000)
(println "T1 reading ref:" @r)))
;; T1 waiting 5 seconds
(future ; ❹
(dosync
(println "T2 changing ref")
(println "T2 new value of ref:" (alter r inc))))
;; T2 changing ref ; ❺
;; T2 new value of ref: 1
;; T1 waiting 5 seconds
;; T1 reading ref: 1
(ref-history-count r) ; ❻
;; 1
STM transaction isolation ensures that any temporary state of a ref is not visible
(ref-min-history r 1) ; ❸
(ref-max-history r 7)
The pre-allocated space ensures that the committed value becomes available in the
history of the reference straight away, preventing the faulty read. The storage space is
used similarly to a stack: if a new in-transaction value becomes available, the current
value of the ref is pushed into the history queue first. In our example, the transaction
"T2" pushes the value "0" into the history queue while the ref assumes the in-
transaction value of "1":
(def r (ref 0 :min-history 1)) ; ❶
(future
(dosync
(Thread/sleep 5000)
(println "T1 reading ref:" @r)))
(future
(dosync
(println "T2 changing ref")
(println "T2 new value of ref:" (alter r inc))))
;; T2 changing ref
;; T2 new value of ref: 1
;; T1 reading ref: 0 ; ❷
❶ The new definition of the reference "r" contains an explicit requirement to store the current value of the
reference before it is replaced with a different in-transaction value.
❷ The rest of the computation remains the same as the previous example, except that the transaction is
not restarted and "T1" can now read from the history that the value "0" of the reference was present at
the time "T1" started.
❶ alter takes the reference and a function of one argument. The function receives the
old value of the ref, which can be used to compute the new value.
❷ ref-set ignores any old value in the ref and just replaces it with a new one.
(defn perform [] ; ❷
(dosync
(dotimes [i 3] ; ❸
(println (format "###-%s-###" ; ❹
(hash (Thread/currentThread))))
(alter op1 inc)
(alter op2 inc)
(alter res conj (+ @op1 @op2))
(println
(format "%s + %s = %s (i=%s)"
@op1 @op2 (+ @op1 @op2) i))
(Thread/sleep 300))
@res))
(perform) ; ❺
;; ###-2023564354-###
;; 1 + 2 = 3 (i=0)
;; ###-2023564354-###
;; 2 + 3 = 5 (i=1)
;; ###-2023564354-###
;; 3 + 4 = 7 (i=2)
;; [3 5 7]
(perform)
;; ###-2023564354-###
;; 4 + 5 = 9 (i=0)
;; ###-2023564354-###
;; 5 + 6 = 11 (i=1)
;; ###-2023564354-###
;; 6 + 7 = 13 (i=2)
;; [3 5 7 9 11 13]
❶ The ref objects are given an initial value of 0, 1 and empty vector respectively.
❷ perform executes a few calculations inside a dosync block. The results are stored in the ref and
returned.
❸ Inside each dotimes loop, perform increments the operands and prints their sum on screen. The loop
executes inside a transaction, enforcing the constraint that once perform is called, all changes either
happen as a whole or they don’t.
❹ At the beginning of each loop, we also print a message that contains the thread identification (as
a hash of the thread object). This is useful to understand how threads are competing to control the
execution of the code.
❺ We call perform without concurrency at first, to show what results to expect. We can see [3 5 7] as
the first result and [3 5 7 9 11 13] if we call perform again without resetting op1 and op2.
In the previous example, perform is invoked twice without concurrency just to show
how the loop behaves. If we run multiple perform in separate threads, the STM
guarantees the same result seen in the sequential case at the price of some amount of
restarts:
(dosync ; ❶
(ref-set op1 0)
(ref-set op2 1)
(ref-set res []))
;; ###-1539638732-### ; ❸
;; ###-1047541620-###
;; 1 + 2 = 3 (i=3)
;; ###-1047541620-### ; ❹
;; ###-1047541620-###
;; ###-1047541620-###
;; ###-1539638732-###
;; 2 + 3 = 5 (i=2)
;; ###-1047541620-###
;; ###-1047541620-###
;; ###-1539638732-###
;; 3 + 4 = 7 (i=1)
;; ###-1047541620-###
;; ###-1047541620-###
;; ###-1047541620-###
;; ###-1047541620-### ; ❺
;; 4 + 5 = 9 (i=3)
;; ###-1047541620-###
;; 5 + 6 = 11 (i=2)
;; ###-1047541620-###
;; 6 + 7 = 13 (i=1)
;; [3 5 7 9 11 13]
❶ Before starting each experiment, it’s good practice to reset the content of the shared state.
❷ perform is now invoked from two separate threads and then we wait for results to come back by de-
referencing them in a vector. The results are available by calling @ref at the end of the let block.
❸ One of the threads, either p1 or p2, enters the transaction first.
❹ We can see repeated attempts of the second thread, the one that didn’t get access to the reference
first. What we see is the result of the STM restarting the body of the dosync instruction. Considering
our sleep period of 300 milliseconds, we can infer that the STM applies a 100 milliseconds wait period
between transactions retries, which is indeed the case (this is not user configurable).
❺ Eventually, the second thread is able to perform the loop. This happens when the first one completes
the transaction.
To understand how the output interleaves, we need to consider that one thread always
enters the transaction first. As soon as that happens, the late thread is forced to restart.
One restart is not enough time for the first thread to complete the transaction, so we
can see a few of them happening each loop. There is a hard limit of ten thousand
retries before the STM gives up and throws an exception (in our example, we are pretty
far from hitting that limit).
commute is a relaxed form of alter that signals to the STM that write operations can
execute in any order (write operations with this property have to be commutative).
When commute is used instead of alter, transactions don’t need to restart waiting on
each other’s results, because the computation does not depend on read-order
consistency.
commute is the wrong choice for non-commutative operations, such as the mix of
increment and addition seen in the previous example. It is however a good candidate in
other scenarios where the order of updates is not important. In the following
simulation, a polling system receives votes for candidates and prints the name of the
first candidate who reaches 100 votes. We are not interested in maintaining an ordered
list of the preferences as they were received, so commute seems a natural choice:
(def votes (ref {})) ; ❶
(dosync
(doseq [pref poll]
(commute votes update pref (fnil inc 0))))))
;; {"candidate-0" 50 ; ❺
;; "candidate-1" 153
;; "candidate-2" 42
;; "candidate-3" 157
;; "candidate-4" 33}
❶ votes is a Clojure map wrapped by a ref. Votes are collected by the ref and shared across the
system.
❷ counter is a function returning a future. The body of the future takes the incoming batch of votes
and increments a counter inside the map corresponding to the votes. Using alter would force a
restart to maintain read consistency, but we are not interested in which number gets incremented
first, just the totals. The STM can optimize for this scenario with commute.
❸ generate-poll simulates users casting votes for candidates. It takes any number of votes, assuming
the position of each number in the arguments associates with a specific candidate starting at index 0,
then index 1 and so on.
❹ The incoming batches of votes are assigned to different counter threads, so the counting can operate
in parallel. The creation of the future object also starts the computation. The vector [@c1 @c2] makes
sure all futures have finished before reading the results.
❺ We can see the expected count of results.
NOTE Even when using ref and transactions, consistency is possible only with appropriate data
structures. If we used a collection that is not concurrent or immutable, the STM wouldn’t be
able to help with consistency.
Along with faulty reads, snapshot isolation can also generate "write skews". A write
skew could potentially happen when there are constraints applied to multiple refs. For
example, let’s add a constraint to the voting system that stops the competition as soon
as there are more than 5 "honeypot" votes. A "honeypot" on a web page consists of
adding a hidden input field to a form. Humans don’t see the input but bots do and fill
it in. As soon as we detect some amount of suspicious submissions we stop the
competition. Each transaction should now ensure (not coincidentally using ensure) that
the honeypot count does not change outside the current transaction:
(def votes ; ❶
{"honeypot" (ref 0)
"candidate-0" (ref 0)
"candidate-1" (ref 0)
"candidate-2" (ref 0)
"candidate-3" (ref 0)
"candidate-4" (ref 0)})
❶ All candidates are now modeled as reference objects inside the votes map. This enables us
to ensure the honeypot key as well as commute votes independently.
❷ The dosync body now contains a call to ensure on the reference containing the honeypot
count. doseq has also been updated to check the honeypot count before proceeding any further.
❸ generate-poll accepts a number of honeypot entries to generate along with proper candidates for
the simulation.
❹ A test run of the new voting system confirms a possible fraud. A winner is still calculated but additional
batches of votes would not alter the current result.
An important aspect to consider when using the STM is to make sure that expressions
in a transaction are side-effect free (more precisely, that the expression doesn’t rely on side
effects to succeed). As the transaction could call into any other function, there needs to
be a way for arbitrary code to signal unsuitability for transactions. We can signal this
fact using io!. For example, function f1 opens a transaction context that involves
calling function f2 (possibly many more layers below). f2 is side effecting but designed
for the general case. We can use io! to signal the fact that should f2 ever be part of a
transaction, the transaction should throw an exception:
(def counter (ref 0))
(defn f2 [value] ; ❶
(io! (println "Sorry, side effect on" value))
(inc value))
(defn f1 [] ; ❷
(dosync (f2 (commute counter f2))))
(f1) ; ❸
;; IllegalStateException I/O in transaction user/f2
❶ f2 is a function that processes a value and makes use of side effects. While f2 is designed for the
general case, using it in a transaction could create problems. Knowing about this possibility, the author
of the function wrapped the side effect in an io! call.
❷ f1 performs operations inside a transaction and explicitly uses f2 (but the use of f2 might not be so
easy to see).
❸ An attempt to run f1 reveals the presence of io! down the call chain.
To finish the section, it’s worth mentioning that you can pass metadata to a ref object
during construction using the :meta option:
(def r (ref 0 :meta {:create-at :now})) ; ❶
(meta r) ; ❷
;; {:create-at :now}
❶ Functions like with-meta don’t work with reference types, but ref offers the :meta option during
construction to specify metadata.
❷ We can see the metadata is correctly set.
14.5 atom
NOTE This section also mentions other related functions such as: swap!, reset! and compare-and-
set!
;; 1
swap! is the most flexible choice to work with atom. It takes any additional number of
arguments to pass to the update function, making it an ideal choice for updating
collections:
(def m (atom {:a 1 :b {:c 2}})) ; ❶
(swap! m
(fn [m]
(update-in m [:b :c]
(fn [x] (inc x))))) ; ❷
;; {:a 1 :b {:c 3}}
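The extra-arguments form makes the same update more direct, passing update-in, the path and inc straight to swap!:

```clojure
(def m (atom {:a 1 :b {:c 2}}))

;; swap! applies (update-in old-value [:b :c] inc) atomically.
(swap! m update-in [:b :c] inc)
;; => {:a 1, :b {:c 3}}
```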
NOTE If the update function passed to swap! contains side effects, please be aware
that swap! might execute the function any number of times, especially in highly concurrent
scenarios.
If there is no interest in changing the atom depending on the previous value, we
can use reset! instead of swap!. Unlike swap!, reset! always succeeds
without the need for retries:
(def configuration (atom {})) ; ❶
(defn initialize [] ; ❷
(reset! configuration (System/getenv)))
(initialize) ; ❸
(take 3 (keys @configuration))
;; ("JAVA_MAIN_CLASS_65503" "IRBRC" "PATH")
❶ configuration is designed to contain the view of the environment variables available after
initialization. Other parts of the program can alter the configuration at any time but an additional
initialization should reset the atom to the system environment.
❷ initialize can remove any previous state of the configuration atom. reset! is more appropriate
than swap! in this case, as reset! does not involve any compare logic.
❸ After calling initialize we can see the configuration contains the current view of the system
environment.
NOTE An atom does not need to be necessarily a top level var. A typical use of atom is for example as
"closed-over" state for some class of problems, such as memoization. Please
see memoize performance section to have an idea of the use of atom as a cache for repeating
computations.
❶ The interface of swap-or-bail! is similar to swap! except for taking an additional "attempts"
argument with a default of 3 if none is given.
❷ (f old) triggers a potentially slow update function on the old value. If by the time (f old) returns
there was a change to the value in the atom (situation that we are going to force in this example),
then compare-and-set! fails and the operation retries.
❸ slow-inc is an intentionally slow version of inc. We have 5 seconds to evaluate a reset! call on the
atom to force a CAS retry.
❹ swap-or-bail! runs in its own thread, so we are free to change the content of the atom while it's
running.
❺ If each reset! operation happens within 5 seconds of the last compare-and-set! attempt, we
force another retry. After reaching the maximum number of attempts, swap-or-bail! prints a failure
message and exits.
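The swap-or-bail! implementation these callouts describe is not shown in this excerpt. A sketch consistent with the description (the structure is reconstructed, not the book's exact code) could look like:

```clojure
(defn swap-or-bail!
  ([a f] (swap-or-bail! a f 3)) ; default of 3 attempts
  ([a f attempts]
   (if (zero? attempts)
     (println "Too many failed attempts, bailing out!")
     (let [old @a
           new (f old)] ; potentially slow update function
       ;; If the atom changed while (f old) was running,
       ;; compare-and-set! fails and we retry.
       (when-not (compare-and-set! a old new)
         (recur a f (dec attempts)))))))
```

Unlike swap!, which retries forever, this version gives up after a bounded number of failed compare-and-set! attempts.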
You should avoid calling compare-and-set! with a value that is not coming from the
same atom instance you're trying to update. The risk is incurring surprises caused
by the Java equality semantics used by compare-and-set!. For example:
(def a (atom 127))
❶ compare-and-set! requires 3 arguments: the atom instance, a comparison value and the desired
new value. The desired value becomes the new value of the atom only if the comparison value is the
same as the current value of the atom.
❷ Since the atom was mutated to contain 128, the second compare-and-set! operation fails, as the
comparison value does not match the current value (127 != 128).
❸ Strangely enough, even passing the right value (128), compare-and-set! refuses to update.
The reason why compare-and-set! refuses to update the atom even when the old value
is apparently the same as the one provided is Java reference equality (the Java
operator ==) combined with autoboxing. Long values from -128 to 127 are cached, so
Long.valueOf(127) == Long.valueOf(127) is true because the two numbers are effectively the
same instance. But Long.valueOf(128) == Long.valueOf(128) is false in Java because the
two objects are distinct instances (there is no caching outside that range; note
that new Long(n) always allocates a fresh instance and never compares equal with ==). Clojure
boxes numerical arguments through this caching mechanism, resulting in the
observed compare-and-set! behavior.
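The caching behavior described above can be observed directly at the REPL with identical?:

```clojure
(identical? 127 127) ; true: longs from -128 to 127 share cached boxes
(identical? 128 128) ; false: each 128 is boxed into a distinct Long
```

The same reference-equality rule is what compare-and-set! applies to the comparison value.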
14.6 agent
NOTE This section also mentions other related functions such as: send, send-off, send-via, set-agent-send-executor!, set-agent-send-off-executor!, restart-agent, shutdown-agents, release-pending-sends, await, await-for, agent-error, set-error-handler!, error-handler, error-mode, set-error-mode!
Let's look at each of these properties in more detail:
• Asynchronous: an agent performs actions in a separate thread from the caller. In
this respect, an agent is analogous to a future (they even share the same thread
pool). Unlike a future, an agent has an internal state.
• Uncoordinated: like an atom, an agent can't coordinate with another agent (as in,
preventing state changes in another agent based on conditions on the current one), but
agents can participate in an STM transaction by holding actions until commit time.
• Sequential: actions sent to an agent execute in the order they are received. This
makes an agent a good candidate to handle side-effecting operations.
The agent works on queued actions unless it encounters an error condition.
Pending work can resume or be removed after an error. The agent can also create
tasks for itself or other agents to execute immediately or after a state change.
Note: agent bears some resemblance to Erlang's actors 218. There are however a
few fundamental differences: agents are not distributed, they accept any function (not
just a predefined set of messages), and you can read an agent's state at any time
without sending it a message.
send delivers to the agent a function from the current value to the new one:
(def a (agent 0)) ; ❶
(send a inc) ; ❷
(deref a) ; ❸
;; 1
@(send a inc) ; ❹
;; 1
(deref a) ; ❺
;; 2
As seen in the previous example, agents are asynchronous and we might need to wait to
see the results of sending them an action. If we need to wait for the action to
complete, we can use await:
(def a (agent 10000)) ; ❶
(def b (agent 10000))
(send a slow-update) ; ❸
(send b slow-update)
218
Erlang is a popular functional language with a solid industrial history. It popularized the "actor" approach to concurrency
from which many other languages took inspiration. Clojure agent brings some similarities to actors but they are also
fundamentally different in several ways.
Both await and await-for accept any number of agent instances. await-for also
accepts a timeout in case the agent takes too long to return:
(send a slow-update) ; ❶
(time (await-for 2000 a)) ; ❷
;; "Elapsed time: 2003.144351 msecs"
;; false
In the previous example, a slow-update function was sent to the agent, keeping a
thread busy for potentially a long time. To accommodate different workloads,
agent has two default options available: send executes on a fixed-size thread pool
(number of cores + 2) while send-off uses an unbounded thread pool. Using one or the
other is a function of the particular problem at hand. Longer input/output operations
normally benefit from the unbounded pool with send-off, while shorter CPU-intensive
operations are better suited to send. However, unbounded thread pools
could result in lagging applications or out-of-memory problems if left unattended, so
there isn't really a thread pool for all situations.
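As an illustrative sketch (the URL is a placeholder), the choice between the two dispatch functions follows the workload:

```clojure
(def pages (agent []))
;; I/O-bound action: use the unbounded send-off pool, so a blocked
;; thread doesn't starve other agents.
(send-off pages conj (slurp "https://example.com"))

(def total (agent 0))
;; CPU-bound action: use the fixed-size send pool (cores + 2).
(send total + (reduce + (range 1000000)))
```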
Controlling thread pools
If none of the pre-configured thread pools is suitable for a particular problem, send-
via allows using a custom thread pool. A good option for shorter, CPU-bound tasks is
the work-stealing ForkJoin pool, exposed through Executors/newWorkStealingPool since Java 8 219:
(import '[java.util.concurrent Executors])
219
The ForkJoin pool is also used by Reducers. Please refer to the related chapter to know more about the fork-join
paradigm and work-stealing.
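The send-via example itself is elided here; a minimal sketch, assuming a work-stealing pool named fj-pool as in the surrounding text, could be:

```clojure
(import '[java.util.concurrent Executors])

(def fj-pool (Executors/newWorkStealingPool 100))
(def a (agent 0))
(send-via fj-pool a inc) ; dispatch this action on the custom pool
```

send-via takes the executor as its first argument, followed by the same arguments as send.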
Clojure also allows changing the thread pool strategy for send or send-off, for
example to control agent usage in an application without the need of changing call
sites:
(import '[java.util.concurrent Executors])
(def fj-pool (Executors/newWorkStealingPool 100)) ; ❶
(set-agent-send-executor! fj-pool) ; ❷
(set-agent-send-off-executor! fj-pool)
❶ fj-pool defines a WorkStealingPool that will attempt to keep the number of concurrent workers
at or below 100 (this does not necessarily reflect the number of created threads).
❷ From this point onward, send and send-off are going to use the newly created thread pool.
❶ We can see that *agent* and a are the same object from within the update function sent to the agent.
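The example this callout annotates is not shown in this excerpt; a minimal sketch of *agent*, which is bound to the current agent while an action runs:

```clojure
(def a (agent 0))
(send a (fn [state]
          (println "same object?" (identical? *agent* a))
          (inc state)))
;; same object? true
```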
Using agent we could create the following "ping" agent, that executes a request to
some URL just to verify if it responds correctly:
(def a (agent {:enable false :url nil})) ; ❶
(defn ping [{:keys [enable url] :as m}] ; ❷
  (when (and enable url)
    (try
      (slurp url)
      (println "alive!")
      (catch Exception e
        (println "dead!" (.getMessage e)))))
  (Thread/sleep 1000)
  (send-off *agent* ping)
  m) ; ❸
(send-off a ping) ; ❹
(send-off a assoc :url "http://manning.com")
(send-off a assoc :enable true) ; ❺
alive!
alive!
alive!
(send-off a assoc :url "http://nowhere.nope") ; ❻
dead! nowhere.nope
dead! nowhere.nope
dead! nowhere.nope
(send-off a assoc :enable false)
❶ An agent is initialized with a state map containing the :enable and :url keys
initially false and nil respectively.
❷ The ping function is sent to the agent as the update function. It takes the current state map as input
and prints a message regarding the availability of the target when the :url is available and
the :enable condition is true. Apart from that, ping never changes the state and always returns
the current state of the agent unaltered as the last expression.
❸ Note that in all cases, the ping function always sends off another request to execute itself after
waiting 1 second.
❹ The first send-off call starts the internal agent recursion, but given the conditions are not met,
the agent doesn’t actually ping the requested web page.
❺ After all conditions are met, we can finally see messages coming from the inner loop.
❻ If we change the target page to a non-existent one, the catch block executes printing a different
message.
An agent action could result in any of the following (or combination thereof):
• Nothing happens and the state of the agent remains unchanged.
• The state of the agent changes.
• The agent dispatches another action to itself (like the ping function seen before).
• The agent dispatches one or more actions to other agents.
In case of additional dispatches from within the update function, the default behavior
of the agent is to wait until the state has changed before proceeding sequentially with
all created actions. When this wait is undesirable (usually because the order in
which actions are performed doesn't matter), we can force all pending actions to start
immediately with release-pending-sends. In the following example, we use the
coordination between different agents to find word and letter frequencies in a large
text. While the first agent is busy calculating the word frequencies, we can release
messages for the other agents, one for each letter of the alphabet:
;; book is assumed to be a string containing the text to analyze (not shown)
(def alpha (vec (repeatedly 26 #(agent 0)))) ; ❶
(def others (agent 0)) ; ❷
(def words (agent {})) ; ❸
(send-off words ; ❹
  (fn [state]
    (doseq [letter book
            :let [l (Character/toLowerCase letter)
                  idx (- (int l) (int \a))]]
      (send (get alpha idx others) inc))
    (release-pending-sends) ; ❺
    (merge-with + state (frequencies (split book #"\s+")))))
❶ alpha is a vector containing 26 agents, one for each letter in the alphabet.
❷ We also prepare others for any other letter which is not part of the simple alphabet.
❸ The "words" agent collects the frequencies of all the words found in the text.
❹ Processing starts from the send-off instruction. The update function first processes the book letter by
letter, sending each corresponding agent an increment request. The second part updates
the agent state with the word frequencies.
❺ Creating frequencies is a potentially expensive operation so we take advantage of release-pending-
sends to start processing letter frequencies even if the state of the current agent has not been
updated yet.
Handling Errors
When a problem happens in a different thread, the source of the problem is often lost
unless specific care is taken to raise the problem to the attention of another controlling
thread. For example, the following agent is given the impossible task of dividing a
number by zero. We can check if there was a problem with agent-error:
(def a (agent 2))
(send-off a #(/ % 0)) ; ❶
(agent-error a) ; ❷
;; #error {
;; :cause "Divide by zero"
;; :via
;; [{:type java.lang.ArithmeticException
;; :message "Divide by zero"
;; :at [clojure.lang.Numbers divide "Numbers.java" 163]}]
;; ...
;; }
❶ After generating an ArithmeticException on purpose, the send-off call does not report the
problem, as it happens on a different thread.
❷ The root cause of the problem was saved inside the agent for later inspection and is visible when
calling agent-error.
After the agent enters an error condition, it stops processing further work and any task in
the working queue is suspended. To reset the state of the agent and resume any
pending work, we can use restart-agent (or we can discard any pending work by
passing :clear-actions true):
(restart-agent a 2) ; ❶
(send-off a #(/ % 2)) ; ❷
@a
;; 1
(restart-agent a 2) ; ❸
❶ restart-agent removes any error condition and replaces the current state with the new one passed
as a parameter.
❷ The agent is now ready to accept additional work.
❸ Note that restart-agent throws an exception when called on a healthy agent with no error conditions.
Instead of checking for errors manually, we can install an error handler that is invoked automatically when an action throws:
(defn handle-error [a e]
(println "Error was" (.getMessage e))
(println "The agent has value" @a)
(restart-agent a 2))
(set-error-handler! a handle-error) ; ❶
(send-off a #(/ % 0))
;; Error was Divide by zero ; ❷
;; The agent has value 2
@a ; ❸
;; 2
If all error conditions are considered recoverable and we can always resume
working after an error, we can set the :continue error mode on the agent, completely
ignoring the problem (and not requiring a restart-agent call):
(def a (agent 2))
(set-error-mode! a :continue) ; ❶
(send-off a #(/ % 0)) ; ❷
@a ; ❸
;; 2
❶ set-error-mode! changes the way the agent handles errors. If we use :continue (as opposed to
the default :fail mode) the agent simply ignores any error.
❷ This send-off operation generates an error. The agent does not enter the failure state and does not
throw.
❸ The state does not change when an update generates an error.
Finally, if running agents get out of control or if the application doesn't exit properly
after executing, chances are that some agent threads are still allocated in the thread
pools. In that case, shutdown-agents performs a graceful shutdown of all agent pools
and prevents the execution of any other action from that point onward.
deref works similarly for the types in the table, but there are a few differences:
• When applied to an agent, var, volatile! or atom, it returns its current state.
• When applied to a delay it also forces its body to evaluate, unless it was already
evaluated: in that case it returns the cached value immediately.
• When applied to a ref, it returns the in-transaction value for the ref or missing
that, it returns the most recently committed value. Please refer to ref-min-
history to see how deref of a ref object could cause a read fault and a transaction
restart.
• When applied to a future, the call might block, waiting for the future to
finish the evaluation of its body.
• When applied to a promise it will block until a value is delivered. An expression
like @(promise) will always require a thread interrupt if evaluated from the
current thread. You should always define the promise as an independent reference,
so you can deliver a value to it, for example: (let [p (promise)] (deliver p 0)
@p). Alternatively, use the deref variant that supports a timeout.
deref also has a variant supporting a timeout (with a default value) for those reference
types that could result in a blocking call. For the same types, we can also
use realized? to check if a value is available without blocking. The only
supported reference types are promise and future:
(def p (promise)) ; ❶
(def f ; ❷
(future
(loop []
(let [v (deref p 100 ::na)]
(if (= ::na v) (recur) v)))))
(realized? p) ; ❸
;; false
(realized? f)
;; false
(deliver p ::finally) ; ❹
(deref f 100 ::not-delivered) ; ❺
;; :user/finally
(realized? p) ; ❻
;; true
(realized? f)
;; true
realized? also works with Clojure lazy sequences to verify if the first item in the
sequence has been evaluated (and cached for later use):
(def s1 (map inc (range 100))) ; ❶
(realized? s1) ; ❷
;; false
(first s1) ; ❸
;; 1
(realized? s1) ; ❹
;; true
❶ map (as well as many other common sequential processing functions) produces a lazy sequence.
❷ The sequence is initially unrealized, as proved by calling realized? on it just after construction.
❸ By calling first, we force realization of the first (and potentially others) element in the sequence.
❹ realized? now returns true.
(def a (atom 1))
(defn should-be-positive [n] ; ❶
  (or (pos? n)
      (throw (ex-info (str n " should be positive")
                      {:value n
                       :error "Should be a positive number"
                       :action "State hasn't changed"}))))
(set-validator! a should-be-positive)
(swap! a dec) ; ❷
;; ExceptionInfo 0 should be positive
;; {:value 0,
;; :error "Should be a positive number",
;; :action "State hasn't changed"}
❶ should-be-positive is a more descriptive validator function than pos? alone. Instead of returning
just false it throws a descriptive exception with ex-info.
❷ An invalid operation now explains what exactly went wrong.
❸ If we want even more details, we can catch the exception and extract the descriptive error map
with ex-data.
(def a 1)
(get-validator #'a) ; ❶
;; nil
(set-validator! #'a pos?)
(get-validator #'a) ; ❷
(def a 0) ; ❸
;; IllegalStateException Invalid reference state
❶ var objects also accept validation functions. We need to remember to access the var object using
the var function or the equivalent reader literal "#'".
❷ After setting the pos? function as the validator, we can get back to its name from the function object
returned by get-validator.
❸ The var "a" is a global object, and attempting a redefinition is now controlled by the existing
validator. To remove a validator, use set-validator! with nil.
NOTE set-validator! fails if, at the moment of installing the new validator function, the current
state of the reference already violates the validator constraints. For example: given (def a
(atom 0)), the call (set-validator! a pos?) throws an exception and the validator is
not installed.
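The note can be verified directly by catching the exception:

```clojure
(def a (atom 0))
(try
  (set-validator! a pos?) ; 0 already violates pos?
  (catch IllegalStateException e
    (.getMessage e)))
;; "Invalid reference state"
```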
Validator functions are also accepted at reference creation time (except for vars). We
could use validators to prevent account overdraft in the following bank transfer
simulation:
(def account-1 (ref 1000 :validator pos?)) ; ❶
(def account-2 (ref 500 :validator pos?))
(defn transfer [amount from to] ; ❷
  (dosync
    (alter from - amount)
    (alter to + amount)))
(transfer 2000 account-1 account-2) ; ❸
;; IllegalStateException Invalid reference state
❶ Each account is represented by a ref object. Each ref has a validator installed that prevents the
account from going to 0 or below.
❷ The transfer function moves money from one account to another inside a dosync block.
❸ An attempt to transfer more money than is actually available results in
an IllegalStateException.
❶ account-1 and account-2 are refs created with a validator function preventing them from going into overdraft.
❷ Additionally, we want the account to send transfers to a monthly statement file every time there is a
money transfer in or out of the account.
❸ The monthly statement tracing is installed as a watcher on each of the reference objects.
❹ transfer is the same function that was used in the previous example about validators. The function
subtracts the sum from the origin account and moves it into the destination account inside a
transaction.
❺ We can print the content of the statement files to verify withdrawals from and deposits to the account.
Watchers execute synchronously and only after the new reference state has been set,
isolating correct state handling from any problem happening on the watcher call.
Multiple watchers are called in an unspecified order:
(dotimes [i 10] ; ❶
(add-watch a i (fn [k r o n] (print k))))
(swap! a inc) ; ❷
;; 07146329581
❶ There are 10 watches with keys ranging from 0 to 9 added to an atom instance.
❷ After calling swap!, each watch prints its own key. The order in which watchers are invoked is not
specified (in the same way the keys of a hash-map are returned in no specific order). Note: the last
repeated "1" is the return value from the swap! call.
If a watcher stops being useful for a specific scenario, it can be removed with remove-
watch:
(dotimes [i 10] ; ❶
(remove-watch a i))
(swap! a inc) ; ❷
;; 2
❶ With reference to the previous example, we call remove-watch for all 10 keys.
❷ The next call to swap! to change the state only prints the new value of the atom.
(def lock (Object.)) ; ❶
(def acc1 (volatile! 1300)) ; ❷
(def acc2 (volatile! 0))
(defn transfer [amount from to] ; ❸
  (locking lock
    (when (> @from amount)
      (vswap! from - amount)
      (vswap! to + amount))))
(dotimes [_ 1500] ; ❹
  (future (transfer 1 acc1 acc2)))
[@acc1 @acc2] ; ❺
;; [1 1299]
❶ The lock Object is only used to synchronize the critical section in the transfer function.
❷ A volatile! is a mutable object that does not implement any protection against concurrent access.
❸ With locking, we prevent another thread from changing the content of the accounts while the current
thread checks that there is enough money to move.
❹ There are many more requests to transfer money than the actual amount available.
❺ Our assumption is that, at the end of all transfers, there should always be 1 in the first account and
1299 in the second. If you remove the locking protection from the example above, you would
see that the second account gets credited more money than expected, as if it appeared from
nowhere.
NOTE "Threads contention" usually refers to the situation in which many threads are trying to access
the same lock-protected section of code, although contention can be experienced for other
contended resources, not just locks.
monitor-enter and monitor-exit are even lower-level primitives and there are even
fewer reasons to use them explicitly. They are special forms translating directly to the
corresponding Java bytecode to mark critical sections. They only work when a local
binding represents the lock:
(def v (volatile! 0))
(let [lock (Object.)]
  (monitor-enter lock) ; ❶
  (try
    (vswap! v inc)
    (finally
      (monitor-exit lock)))) ; ❷
❶ monitor-enter marks the beginning of the critical area that should be protected from concurrent
access.
❷ If we failed to release the lock, no other thread would be allowed to execute the code. The operation is
so critical, that it usually appears in a finally block.
220
Rich Hickey articulated the reasons why there are so many ways to generate classes in Clojure in this post from the
Clojure mailing list: https://groups.google.com/forum/#!msg/clojure/pZFl8gj1lMs/qVfIjQ4jDDMJ
Printing Types
Every Clojure object has a related Java type in the Java class space. We can inspect the type of a Clojure
object by printing it as a string. Type names are usually composed of a list of nested namespaces (based
on Java packages) separated by dots, followed by the name of the type. For example:
(type []) ; ❶
;; clojure.lang.PersistentVector
(type "") ; ❷
;; java.lang.String
(type #()) ; ❸
;; user$eval25$fn__26
(type nil) ; ❹
;; nil
❶ [] is a type defined by Clojure. The name has the typical format of any other class in Java.
❷ "" The string type is borrowed directly from Java and is exactly the same type as the Java type.
❸ #() anonymous functions generate a new Java class on each evaluation. The printed class
name "user$eval25$fn__26" shows the following information: it was assigned the
evaluation id "25" and generated as a "fn" expression with id "26". Incremental numbers and naming rules
can change between evaluations, so the reader should consider them implementation
details.
❹ nil doesn't have a class associated with it. If that was the case, we would then have to deal with instances of
the type nil. But this would lead to a contradiction, as nil is by definition the absence of an object.
Clojure borrows many of its basic types from Java. Strings, numbers, chars and
booleans for example, share the same implementation between the two
languages. clojure.lang.Symbol and clojure.lang.Keyword are specific to Clojure
and functions like symbol and keyword create the new corresponding instances:
(symbol "s") ; ❶
;; s
(keyword "k") ; ❷
;; :k
Symbols, when printed, are similar to strings without the double quotes. A symbol is
also different from a string in the following ways:
• It can be further qualified using a namespace. The namespace part of the symbol
comes before its name and is followed by "/". For example: a/b is the symbol "b"
in the "a" namespace.
• It supports metadata.
• It can be used as a function to look itself up in a map.
• Although symbols allow spaces or punctuation, these are not normally used, as symbols
don't represent general text.
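A few of the points above, shown at the REPL:

```clojure
(def m {'a 1})
('a m)                           ; a symbol looks itself up in a map
;; 1
(meta (with-meta 'a {:doc "x"})) ; symbols support metadata
;; {:doc "x"}
(namespace 'user/a)              ; the namespace part of a qualified symbol
;; "user"
```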
WARNING Using the symbol or the keyword functions bypasses some of the validation that
happens when using their respective literal forms. Please refer to the Clojure reference
guide for the rules concerning valid symbols and keywords
at https://clojure.org/reference/reader.
Symbols are used extensively in Clojure to alias var objects, local bindings or function
parameters. If a symbol appears in an expression, Clojure attempts to lookup the
symbol in the current namespace or surrounding scope:
first ; ❶
;; #object[clojure.core$first__4339]
Symbols also appear after reading Clojure code, which is one of the main reasons they
exist. Without this distinction, we wouldn’t be able to tell the difference between text
that belongs to the program and text that represents data. Macro evaluation also
produces symbols for similar reasons: a macro is a function that executes after reading
"text" but before actual evaluation:
(def form (read-string "(a b)")) ; ❶
(reading-symbols a b) ; ❹
;; (clojure.lang.Symbol clojure.lang.Symbol)
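The reading-symbols helper used above is not shown in this excerpt; its effect can be approximated by mapping type over the form returned by read-string:

```clojure
(def form (read-string "(a b)"))
(map type form) ; the read form contains symbols, not evaluated values
;; (clojure.lang.Symbol clojure.lang.Symbol)
```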
15.1.1 name
The presence of a forward slash "/" in the name of a symbol or keyword assigns it
to a specific namespace. To access the different parts of a qualified 221 symbol, we use
the functions name and namespace:
(def ax (symbol "a/x")) ; ❶
(def bx (symbol "b/x"))
(= ax bx) ; ❸
;; false
❶ The presence of a "/" in the name of a symbol creates a link between the symbol and a namespace
object.
❷ Namespace qualification plays a role in equality: here the symbol named "x" is defined in 2 different
namespaces. To see the name of the symbol we need to use the name function.
❸ Even if the 2 symbols have the same name, they are not equal.
❹ The reason they are not equal is because their namespace component is not the same.
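The name and namespace calls the callouts mention look like this:

```clojure
(name (symbol "a/x"))      ; => "x"
(namespace (symbol "a/x")) ; => "a"
(namespace (symbol "b/x")) ; => "b"
```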
The equivalent way to assign a symbol or keyword to a namespace is to use their two-argument
constructors:
(def ax (symbol "a" "x")) ; ❶
(def bx (keyword "b" "x"))
❶ This example produces a result equivalent to the previous one. The namespace portion of a
symbol or keyword can also be assigned using the first argument of the two-argument constructor.
❷ We can see that the symbol and the keyword belong to the expected namespaces.
221
Qualification of a symbol or keyword in Clojure means that they have been assigned a namespace reference. The
presence of a namespace relationship is optional.
15.1.2 find-keyword
On the surface, symbols and keywords are similar. They both help to name elements of
the language instead of representing some kind of text. But keyword additionally
implements a form of caching called "interning". Once a keyword is created, the object
instance is cached and reused:
(identical? (symbol "a") (symbol "a")) ; ❶
;; false
(identical? (keyword "a") (keyword "a")) ; ❷
;; true
❶ Two calls to symbol using the same letter "a" produce different objects.
❷ The effects of interning are visible on keyword: the same instance is returned.
❶ We compare the cost of creating symbols and keywords. symbol is faster in producing a new object,
but different objects are allocated for the same symbol "a".
❷ keyword is slower with the advantage of creating a single instance of the keyword ":a" thanks to
interning.
The difference is related to the additional cost of the caching mechanism. If you need
to know whether a keyword is already present in the cache, you can use find-keyword:
(find-keyword "never-created") ; ❶
;; nil
(find-keyword "doc") ; ❷
;; :doc
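The add-meta/no-meta example the following callouts annotate is elided; a sketch consistent with their description (the names and the :vect tag are reconstructed):

```clojure
(def add-meta (with-meta [1 2] {:type :vect}))
(def no-meta [1 2])

(type add-meta) ; the :type metadata key wins
;; :vect
(type no-meta)  ; falls back to the class
;; clojure.lang.PersistentVector
```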
❶ add-meta is a vector with an associated :type key in metadata. no-meta is the same vector without
metadata.
❷ We can see that type first verifies the presence of the :type key and falls back to the class name
when no key is present.
We could build a simple form of polymorphism using maps with associated :type
metadata. For example, a "contact" could be a person or a business and we want to
print them differently:
(defn make-type [obj t] ; ❶
  (vary-meta obj assoc :type t))
(def person (make-type {:first "John" :title "Mr"} :person)) ; ❷
(def manning (make-type {:name "Manning" :contact "Marjan"} :business))
(defn print-contact [c] ; ❸
  (condp = (type c)
    :person (println (:title c) (:first c))
    :business (println (str (:name c) " (" (:contact c) ")"))
    (println "Unknown format.")))
(print-contact person) ; ❹
;; Mr John
(print-contact manning)
;; Manning (Marjan)
(print-contact nil)
;; Unknown format.
❶ make-type alters an incoming object, adding the :type key to the metadata map. vary-meta is the
perfect choice if we want to keep any existing metadata intact.
❷ We can see a person and a business definition. The two maps differ in the number and type of keys. Their
metadata contains a :type key with value :person and :business respectively.
❸ print-contact takes a contact and prints it differently based on its type. condp ("cond" with "p"-
predicate) creates a condition on top of the type. As you can see, we don't need explicit access to
the metadata.
❹ Each contact prints differently based on its type.
❺ type also works with class types when there is no :type metadata.
The previous example works okay for a small and fixed number of types. If the types
are added frequently and in large numbers, Clojure offers a much better solution based
on multimethods. With multimethods, types can be added incrementally to the
application without touching existing code:
(defmulti print-contact type) ; ❶
(defmethod print-contact :person [c] ; ❷
  (println (:title c) (:first c)))
(defmethod print-contact :business [c]
  (println (str (:name c) " (" (:contact c) ")")))
(defmethod print-contact :default [c]
  (println "Unknown format."))
(print-contact person) ; ❸
;; Mr John
(print-contact manning)
;; Manning (Marjan)
(print-contact nil)
;; Unknown format.
Type syntax
The type name has to conform to the Java Language Specification 222. One of the most visible effects of
the Java influence is that idiomatic Clojure dashes "-" need to be replaced by underscores "_" (along with
other restrictions related to package names that can be found in the specification). For example:
(in-ns 'my-package) ; ❶
(clojure.core/refer-clojure) ; ❷
(type (fn q? [])) ; ❸
;; my_package$eval1777$q_QMARK___1778
❶ We define and set "my-package" as the current namespace name.
❷ In moving to the new namespace, Clojure doesn’t automatically include all of the functions in the
standard library. To do that, we can refer-clojure.
❸ We create a new named function "q?". Both "?" and "-" are not allowed in package names (leaving
only a few other separators like "$" available). Clojure transforms "?" to "_QMARK_".
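The same name translation is available programmatically through clojure.core/munge:

```clojure
(munge "my-fn?") ; "-" becomes "_", "?" becomes "_QMARK_"
;; "my_fn_QMARK_"
```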
(instance? java.lang.Comparable 1) ; ❷
;; true
222
Versions of the Java Language Specification are available from https://docs.oracle.com/javase/specs/
223
AOT compilation is used in Clojure to produce physical class files on disk. The generated classes contain the bytecode
necessary to run a Clojure application. When classes are not saved to disk (the default), Clojure just loads them in memory.
AOT compilation can be used to avoid distributing Clojure sources or to speed up Clojure start time.
• To provide a "main" method to run a Clojure application from the command line
using Java tools.
• To extend, access and use existing Java classes.
15.3.1 gen-interface
gen-interface supports a few key-value parameters (:name to specify the interface
name, :extends to specify other interfaces to extend and :methods to specify
the method signatures) and returns the newly generated class:
(gen-interface
:name "user.BookInterface" ; ❶
:extends [java.io.Serializable]) ; ❷
(ancestors user.BookInterface) ; ❸
;; #{java.io.Serializable}
(reify user.BookInterface ; ❹
Object (toString [_] "A marker interface for books."))
;; #object[user$eval20 0x2e "A marker interface for books."]
gen-class on the other hand, is not designed to be called directly, but to work in
cooperation with a Clojure namespace that contains the related method
implementations. If you call gen-class directly nothing happens:
(gen-class 'testgenclass) ; ❶
;; nil
❶ A direct call to gen-class does not produce any noticeable effect, either in memory or on the file system.
The reason no classes are generated is that gen-class checks if the dynamic
variable *compile-files* is set to true before doing any work. But if we try again, even
binding the variable does not produce any visible effect:
(binding [*compile-files* true]
(gen-class 'testgenclass)) ; ❶
❶ Another attempt at class generation, setting *compile-files*, fails without producing any noticeable
effect.
gen-class is a macro that expands while the enclosing form is compiled, and at that
point (assuming you start a REPL) *compile-files* is set to false. We can finally
see gen-class in action when we force evaluation from a different context than the one
currently running the REPL, such as when we invoke compile on a file:
(spit "bookgenclass.clj"
  "(ns bookgenclass)
   (gen-class :name book.GenClass
              :main true)") ; ❶
(binding [*compile-path* "."] ; ❷
  (compile 'bookgenclass))
❶ We create a Clojure file using spit. The simple file contains a namespace declaration and a gen-
class directive. Note that the name of the file needs to correspond to the name of the namespace
(with added extension ".clj"). By using :name we make sure the generated class has a specific
package and name. :main true enables the generation of a public static void main[] Java
method.
❷ The dynamic variable *compile-path* forces output class files into a folder we can control, for example
the current folder. compile searches for files inside the classpath of the running JVM: the current
folder needs to be part of the classpath so "bookgenclass.clj" can be found (by default the system
property java.class.path is set to the current folder).
If we inspect the current folder ".", we can see a few classes generated by Clojure,
including a new folder containing "./book/GenClass.class". We can now call the
generated main method on GenClass:
(import 'book.GenClass) ; ❶
(GenClass/main (into-array String [])) ; ❷
❶ This import refers to the previous example where the class was generated.
❷ Note that the exception thrown by this call refers to a missing bookgenclass/-main function.
As soon as we try to call methods defined on the newly generated GenClass class, we
can see that the class expects the presence of the function -main in
the bookgenclass namespace (the one that was previously written to disk). Note the
added prefix "-", which is the default and can be changed. We can provide the missing
function to prove the connection between the generated class and the namespace:
(spit "bookgenclass.clj"
  "(ns bookgenclass)
   (gen-class :name book.GenClass :main true)
   (defn -main [& args]
     (println \"Hello!\"))") ; ❶
❶ Compared to the previous example, we added a new function called "-main" taking a variable number
of arguments.
❷ The related class on disk needs to be generated again. The previous version is written over.
❸ If we try again to call the static GenClass/main method we can see that the generated class correctly
looks up the -main function in the namespace.
When the namespace exists with the sole goal of driving gen-class or gen-interface,
there is also the option of embedding the directive inside ns directly:
(ns bookgenclass
(:gen-class :name book.GenClass)) ; ❶
❶ The embedded :gen-class key produces the same effect as the previous isolated gen-class call,
with the difference that the main function is now generated by default (no :main true is necessary).
gen-class accepts a long list of parameters to influence the generation of the class,
covering the most complicated interoperability scenarios 224. The rich set of features
requires some time to use proficiently. Moreover, there are easier alternatives
like proxy or reify covering the most common use cases. For this reason, gen-class is
mostly used as a lower level tool when other options fail. One exception is the
generation of the main entry point for Clojure applications where gen-class is used
pervasively.
(deftype Point [x y]) ; ❶

(def p (Point. 1 2)) ; ❷
(new Point 1 2)
(->Point 1 2)

(.x p) ; ❸
;; 1
(.-x p)
;; 1
(. p y)
;; 2
❶ deftype requires a vector of attributes. In this case the generated class contains two attributes "x"
and "y". Note that you can redefine Point as much as you want, as the old definition gets simply
replaced with the new one.
224
gen-class and gen-interface are well documented, as you can see if you type (clojure.repl/doc gen-class) at the REPL.
❷ There are several ways to create an instance of the newly generated class; we can see a few
equivalent options here. new is a special form that results in the invocation of the related constructor.
Appending a dot after the class name is a shorter way to invoke the new operator. The last form, using
an arrow "->", is generated by deftype and immediately conveys the fact that Point was created
by a deftype call. For this reason, the last form should be preferred.
❸ Once an instance is created, we can access its attributes using the Java interoperation "." macro.
Again, there are options: when both an "x" attribute and an "x" method are present, the second
form .-x uniquely identifies the attribute, while the first form picks the method "x" and, missing
that, the attribute "x". The last form (where the "." is detached from the rest) is the most basic form but
also gives emphasis to the object receiving the call.
After the attribute declaration, deftype accepts one or more interface declarations
followed by the implementation of the related functions. We could for example
compare points based on their distance from the origin of the 2D plane:
(defn- distance [x1 y1 x2 y2] ; ❶
(Math/sqrt
(+ (Math/pow (- x1 x2) 2)
(Math/pow (- y1 y2) 2))))
(deftype Point [x y]
Comparable
(compareTo [p1 p2] ; ❷
(compare (distance (.x p1) (.y p1) 0 0)
(distance (.x p2) (.y p2) 0 0))))
❶ The euclidean distance between two points is the length of the straight line connecting them.
❷ We define a Point including a declaration for the java.lang.Comparable interface. This interface
requires a method compareTo taking the current Point instance and another point. The function
calculates the distance between the current Point and the origin, then between the other Point and the origin.
❸ When we call sort on a collection of points they are returned in increasing order of distance from the
center.
The reader can probably see what we should do next in order to print a point in a way
that displays its coordinates. We could for example override toString from the
Object class:
(deftype Point [x y]
Object
(toString [this] ; ❶
(format "[%s,%s]" x y)))
(Point. 1 2) ; ❷
;; #object[user.Point 0x65f02188 "[1,2]"]
❶ This deftype declaration shows how to override toString from the Object class. You can see that to
access declared fields inside an implemented method, there is no need for Java interop (x and y are
used without reference to "this").
❷ We can see the coordinates of the Point printed as part of the object signature.
NOTE java.lang.Object is the only class accepted by deftype that is not an interface.
❶ A complete Point declaration which is both comparable and printable, showing that sort orders
points starting from the closest to the origin (0,0).
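The complete declaration the callout refers to can be sketched by merging the two previous examples (the sample points are assumptions):

```clojure
(defn- distance [x1 y1 x2 y2] ; same helper as before
  (Math/sqrt
    (+ (Math/pow (- x1 x2) 2)
       (Math/pow (- y1 y2) 2))))

(deftype Point [x y]
  Comparable
  (compareTo [p1 p2]
    (compare (distance (.x p1) (.y p1) 0 0)
             (distance (.x p2) (.y p2) 0 0)))
  Object
  (toString [this]
    (format "[%s,%s]" x y)))

(sort [(Point. 3 4) (Point. 0 1) (Point. 1 1)]) ; ❶ ordered by distance from (0,0)
```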
deftype is one of the few ways to create truly mutable objects in Clojure, a
feature documented in the API. deftype attributes are normally declared public and
final (a Java keyword that prevents the attribute from being reassigned once initialized).
WARNING by making deftype attributes mutable, the programmer has to deal with explicit
synchronization in case of concurrent access to the type, exactly as would happen in Java.
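A minimal sketch of a mutable deftype (the ICounter interface and Counter type are illustrative names, not from the book):

```clojure
(definterface ICounter
  (incr [])
  (value []))

(deftype Counter [^:unsynchronized-mutable n] ; mutable, non-volatile attribute
  ICounter
  (incr [this] (set! n (inc n)) this) ; set! is only allowed inside method bodies
  (value [this] n))

(def c (Counter. 0))
(.value (.incr c))
;; 1
```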
225
The volatile keyword in Java has deep implications for attribute visibility across concurrent threads. A very simplistic
view is the following: without volatile there is no immediate guarantee that changes made by a thread are seen by
other threads. For more information, please see chapter 3.1.4 of the book "Java Concurrency in Practice".
226
The JavaBeans specification is accessible from https://www.oracle.com/technetwork/java/javase/documentation/spec-136004.html
(definterface IPerson ; ❶
(getName [])
(setName [s])
(getAge [])
(setAge [n]))
deftype generated classes are especially useful during AOT (Ahead Of Time)
compilation, so they can be exported to the file system for Java applications to use. In
the following example we write the deftype declaration to a Clojure file, simulating
the normal conditions under which compilation occurs:
(spit "bookdeftype.clj" ; ❶
"(ns bookdeftype)
(defn bar [] \"bar\")
(defprotocol P (foo [p]))
(deftype Foo [] P (foo [this] (bar)))")
❶ The spit instruction creates a new file called bookdeftype.clj on the file system. The file is created
in the same folder the REPL was first started in. The file contains a namespace declaration followed
by a bar function declaration, a defprotocol declaration and a deftype directive. We are also
showcasing the fact that deftype supports defprotocol as an extension mechanism. The
function foo declared in the type "Foo" invokes the bar function.
❷ Once the namespace is on disk, we can ask Clojure to compile the file to produce the actual Java
class.
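The compilation step mentioned in ❷ can be sketched as before:

```clojure
(binding [*compile-path* "."] ; ❷ the folder must be on the classpath
  (compile 'bookdeftype))
```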
If we now inspect the file system, we can see several files generated by the Clojure
compiler. If we open another REPL from the same folder, we can try to import and use
the generated class:
❶ After opening another REPL session we import the newly created class sitting in the current folder
(which should be part of the classpath for this to work).
❷ Something goes wrong when we try to use the Foo class. The problem is that functions inside the
namespace but outside the deftype definition might not be loaded.
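A sketch of the failing interaction the callouts describe (the exact exception may vary):

```clojure
;; from another REPL started in the same folder
(import 'bookdeftype.Foo) ; ❶
(.foo (Foo.)) ; ❷
;; fails: the var bookdeftype/bar is unbound because the
;; namespace was never loaded
```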
Functions inside the namespace that are used by the foo function are not necessarily
loaded just by importing the generated Java class. deftype offers an option to load
the hosting namespace automatically:
(spit "bookdeftype.clj"
"(ns bookdeftype)
(defn bar [] \"bar\")
(defprotocol P (foo [p]))
(deftype Foo [] :load-ns true P ; ❶
(foo [this] (bar)))")
If we recompile the file, open a new REPL session and invoke the function foo, we no
longer see any error:
;; from another REPL
(import 'bookdeftype.Foo)
(def p (Foo.))
(.foo p) ; ❶
"bar"
❶ After loading the class Foo as we did before, we invoke the .foo method on a newly created instance.
This time it prints "bar" correctly.
15.5 proxy
proxy generates a Java class that can either extend or implement other classes or
interfaces. The main purpose of proxy (going back to when it was first introduced) is
to allow complex interoperation scenarios. There are for example Java frameworks that
explicitly require clients to extend a specific class. In that situation you have the
option of using gen-class (but it requires a specific compilation step) or
using proxy, which returns an instance of the newly created class right away:
(def ^Runnable r ; ❶
  (proxy [Runnable] [] ; ❷
    (run [] (println (rand))))) ; ❸

(.run r) ; ❹
;; 0.1678203879530764
❶ Adding type hints to proxies is normally a good idea, considering they are consumed like any other Java
object. Without type hints, Clojure doesn’t know the type of an object until runtime, when it’s too late for
any optimization.
❷ We are creating a new class that implements the java.lang.Runnable interface.
❸ The Runnable interface requires the implementation of a single method called run. If we don’t provide an
implementation, Clojure generates a default one that throws an exception when called.
❹ We can call run directly on the newly created proxy.
proxy captures any bindings surrounding the proxy definition form, and we can use them
as part of our implementations. We could for example implement new functionality
on java.io.File without touching the existing interface:
(import '[java.io File])
(definterface Concatenable ; ❶
(^java.io.File concat [^java.io.File f])) ; ❷
❶ We want proxy to extend the existing class java.io.File. The generated proxy can be used
anywhere a file would be used. In this case we want to add a new function (not altering the behavior of
an existing one). The only way to add new behavior is to create an interface containing the new
methods.
❷ Note that the interface to concatenate two files takes a single argument. This is because the call
happens on the first implicit argument this. Type hints on the interface are correctly propagated
by proxy to the corresponding function overrides.
❸ cfile creates a new "concatenable-file" that extends java.io.File and implements Concatenable.
The second vector in proxy lists the arguments to use to invoke the constructor from the super class.
In this case we invoke java.io.File with a string representing the file path.
❹ It follows a list of function implementations. concat copies the content of this into the second file
argument.
❺ We store the files in vars using def. Note that only the first call to cfile generates a new class.
The second call to create f2 creates a new instance but not another class definition. Generated
classes conforming to the same interface are cached and reused.
❻ We need to use spit on f2 to create the file before copying content into it.
❼ The concat call appends etchosts into f2. If we open "temp2.txt" we can see an initial comment line
"need to create this file" followed by the content of "/etc/hosts".
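The cfile helper the callouts refer to is along these lines (a sketch; the exact implementation is an assumption consistent with the callouts):

```clojure
(import '[java.io File])

(defn cfile [path] ; ❸ extend java.io.File and implement Concatenable
  (proxy [File Concatenable] [path] ; constructor args for the File superclass
    (concat [f2] ; ❹ append the content of this file to f2
      (spit (.getPath ^File f2) (slurp this) :append true)
      f2)))

(def etchosts (cfile "/etc/hosts")) ; ❺
(def f2 (cfile "temp2.txt"))
(spit f2 "need to create this file\n") ; ❻ the target file must exist first
(.concat etchosts f2) ; ❼ "temp2.txt" now also contains /etc/hosts
```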
proxy classes with the same signature are cached and reused to enable fast instantiation
of similar proxies. Although classes are cached, actual proxy instances are
not. proxy uses instance attributes to store overridden methods along with their closed-over
vars. Using functions like proxy-mappings or update-proxy we can inspect or
alter implemented functions after the proxy creation. With reference to the previous
example, we can now add to the already existing etchosts instance an option to
automatically create a file before using it:
(update-proxy ; ❶
etchosts
{"concat"
#(let [^File f1 %1 ^File f2 %2]
(.createNewFile ^File f2) ; ❷
(spit (.getPath f2) (slurp f1) :append true)
f2)})
(-> etchosts ; ❸
(.concat (cfile "temp3.txt"))
(.concat (cfile "hosts-copy.txt")))
❶ update-proxy takes a proxy instance and a map of method names into functions. We call
update-proxy on the previously created etchosts proxy. Inside the map each key represents a method to be
overridden/implemented.
❷ Compared to the previous implementation of the same function, we added a call
to .createNewFile that creates the file if it doesn’t exist. Note also that the function takes
2 explicit arguments: the first is this, the second is the target file. The let block was added with the
only goal of type hinting the arguments.
❸ We don’t need to create "temp3.txt" or "hosts-copy.txt" anymore before adding them into the chain
of concat calls.
The function proxy groups together class generation, object creation and functional
overrides into a single call (which is in general very convenient). However, you can
separate the life cycle phases using functions like get-proxy-class, construct-proxy and init-proxy:
(defn bail ; ❷
([ex s]
(throw
(-> ex
(construct-proxy s)
(init-proxy
{"deref" (fn [this] (str "Cause: " s))}))))
([ex s ^Exception e]
(throw
(-> ex
(construct-proxy s e)
(init-proxy
{"deref" (fn [this] (str "Root: " (.getMessage e)))})))))
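The parts of the example not shown above can be sketched as follows; the class var and the verify-age function are assumptions consistent with the calls below:

```clojure
;; ❶ a proxy class extending Exception and implementing IDeref,
;; so the caught exception can be dereferenced with @
(def ex-class
  (get-proxy-class java.lang.Exception clojure.lang.IDeref))

;; ❸ hypothetical validation that re-throws through bail
(defn verify-age [s]
  (try
    (Integer/parseInt s)
    (catch NumberFormatException e
      (bail ex-class s e))))
```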
(try ; ❹
(let [age "AA"]
(verify-age age))
(catch Exception e @e))
;; "Root: For input string: \"AA\""
This is a very unlikely scenario in production code, but it is much more common
while developing at the REPL.
• proxy is not suitable to create complex hierarchies. For example, while you can
extend and implement classes or interfaces, you can’t extend another proxy.
• Performance is generally good, but every method on a proxy performs a lookup
to see if there is an override available or not.
• It’s also quite important to keep reflective calls under control (using (set!
*warn-on-reflection* true) for example) and add the proper type hinting.
• It’s "closed polymorphism": either the methods are in the interface at definition
time, or you’re not going to be able to call them later on, even if you call
update-proxy with the correct override.
The lesson learned from proxy is that unless you’re forced to extend a class from a
Java framework in order to use it, you should probably look into reify instead
of proxy for the creation of quick throw-away instances. If instead your goal is
polymorphism in Clojure, there are better options with protocols and multimethods.
15.6 reify
reify is a lightweight proxy. It focuses on the essentials: generate a one off object
instance implementing a set of interfaces. reify can be useful when a framework (or
computation model) requires the creation of an object with a specific interface (like
events, observable, listeners etc.). These objects are short-lived and there is not much
value in creating and maintaining an explicit class for them.
In the following example, a Java framework provides classes with properties that are
"fired" when something interesting happens (it could be a button click for example).
The Java framework uses the PropertyChangeSupport facility in the java.beans
package to implement the feature.
import java.beans.PropertyChangeSupport;
import java.beans.PropertyChangeListener;
❶ ClassWithProperty is how a class with observable properties would be implemented inside the Java
framework.
❷ When we alter the content of the value field, the class fires a property change call to notify all
potential listeners.
❶ We implement the PropertyChangeListener interface using reify. This interface has a single
method, propertyChange. Note that we need to pass the implicit this parameter even if it is not
declared in the interface. We use the bean function to transform the event argument so we can access
its properties like keys in a Clojure map.
❷ The "reified" instance is ready for use and we can now register it to receive events.
❸ As soon as we change the value on the Java class, propertyChange is invoked and we can see what
was the old value along with the new one.
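The reify side of the example can be sketched like this (registration with the Java class is assumed to go through a standard addPropertyChangeListener method):

```clojure
(import '[java.beans PropertyChangeListener])

(def listener
  (reify PropertyChangeListener ; ❶ single-method interface
    (propertyChange [this event] ; the implicit "this" must be declared
      (let [{:keys [oldValue newValue]} (bean event)] ; bean → Clojure map
        (println "old:" oldValue "new:" newValue)))))
```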
When Clojure needs to provide points of extension, it normally uses defprotocol. A
protocol implicitly creates an interface that reify can implement. We’ve seen
reified protocols already in the book, for example when talking about reducer to
implement the clojure.core.reducers.CollFold protocol. We could use a similar
mechanism to extend reduce-kv to java.util.HashMap:
(import 'java.util.HashMap)
(import 'clojure.core.protocols.IKVReduce)

(def m (doto (HashMap.) ; ❶
         (.put :a 1)
         (.put :b 2)))

(defn stringify-key [m k v] ; ❷
  (assoc m (str k) v))
(reduce-kv stringify-key {} m) ; ❸
;; IllegalArgumentException No implementation of method: :kv-reduce...
(reduce-kv stringify-key {}
(reify IKVReduce ; ❹
(kv-reduce [this f init]
(reduce-kv f init (into {} m)))))
❶ We create a java.util.HashMap instance using doto and adding a few key-value pairs invoking the
method .put directly.
❷ stringify-key is a simple function that assocs a key and a value into a map "m" passed as arguments.
❸ If we try to use reduce-kv directly on a java.util.HashMap the call generates an exception. The
reason is that Clojure doesn’t have an implementation of reduce-kv for java.util.HashMap.
❹ We have several options to provide reduce-kv to java.util.HashMap. The easiest by far is to create
a Clojure map out of the Java map and delegate to that version of reduce-kv. A better performing
version would be to mutate the Java map in place, assuming our application is single-threaded and
there are no concurrency issues.
reify is generally faster than proxy. proxy retains some advantage for the implementation
of interfaces with many methods, where class caching avoids the continuous generation
of large classes. reify is preferable in all cases where you don’t need any of
the proxy-specific features.
15.7 defrecord
defrecord generates a deftype-based class that additionally implements Clojure map
semantics on declared attributes 227:
(defrecord Point [x y]) ; ❶
❶ We declare a Point record which includes two attributes "x" and "y".
❷ We can use the deftype nature of the record through Java interop.
❸ defrecord also understands map semantics, so we can access attributes as keys in a map, or create a
new Point from a map containing those attributes.
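The record behaviors the callouts describe can be verified with a few standard calls:

```clojure
(def p (->Point 1 2))

(.x p) ; ❷ deftype-style Java interop
;; 1
(:x p) ; ❸ map-style access by key
;; 1
(map->Point {:x 1 :y 2}) ; a record from a map of attributes
;; #user.Point{:x 1, :y 2}
```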
NOTE defrecord replaces defstruct in generating custom types with map-like semantics. There are
few remaining legitimate cases where defstruct is preferable: please have a look at defstruct to
understand more.
Like deftype, defrecord can implement any number of interfaces. A record also
extends java.lang.Object by default. We could write a Comparable Point record as
follows (you can see a similar example in the deftype section):
227
When we say that a data structure offers "Clojure map semantics", we mean the ability to access attributes by name,
similar to accessing a value by key in a map.
(defrecord Point [x y] ; ❶
Comparable
(compareTo [p1 p2]
(compare (euclidean-distance (.x p1) (.y p1) 0 0)
(euclidean-distance (.x p2) (.y p2) 0 0))))
❶ Compared to the deftype version of the same Point class, the only difference is the use
of defrecord instead of deftype. After the defrecord keyword we have the option to implement one
or more interfaces. The methods conventionally follow after the interface name, but they could appear
in any order.
❷ We can prove points are Comparable by sorting them. In this case they are ordered by increasing
Euclidean Distance from the origin (0,0) of the coordinate system.
defrecord prints better than the corresponding deftype. This is because there is
a print-method override for records that works with println. If we call str on a record
though, we get back a raw string without attributes. We can alter the
way defrecord transforms into a string by overriding toString():
(defrecord Point [x y])
(str (->Point 1 2)) ; ❶
;; "user.Point@78de238e"
(defrecord Point [x y] ; ❷
Object
(toString [this]
(format "[%s,%s]" x y)))
❶ If we don’t provide a specific implementation, toString() formats an object with the
name of its class followed by the "@" sign and the result of invoking hashCode() on the object (which
normally results in a hexadecimal string that roughly maps to an address in memory).
❷ Let’s extend Point to include an override of toString().
❸ The new string rendering now contains relevant information.
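Continuing the example, the rendering from callout ❸ is:

```clojure
(str (->Point 1 2)) ; ❸
;; "[1,2]"
```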
Apart from their use as map-like Java-aware classes, defrecord plays a
fundamental role in Clojure’s polymorphic offering, working in conjunction
with protocols. We are going to explore this aspect in the following sections.
15.8 defprotocol
The defprotocol macro initializes a polymorphic dispatch mechanism for functions
based on types. The macro orchestrates the generation of:
• A native Java interface (with gen-interface). The interface is given the same name
as the protocol.
• A var holding data about the methods and their signatures.
• The dispatch functions connecting the protocol methods to their implementations.
We can see the generated artifacts in the current namespace after
calling defprotocol with a name and a list of methods:
(defprotocol Foo ; ❶
(foo [this])
(bar [this]))
(pprint Foo) ; ❸
;;{:on user.Foo,
;; :on-interface user.Foo,
;; :sigs
;; {:foo {:name foo, :arglists ([this]), :doc nil},
;; :bar {:name bar, :arglists ([this]), :doc nil}},
;; :var #'user/Foo,
;; :method-map {:bar :bar, :foo :foo},
;; :method-builders
;; {#'user/foo #object["user$eval1884$fn__1885@69be5837"],
;; #'user/bar #object["user$eval1884$fn__1896@5377a034"]}}
(fn? foo) ; ❹
;; true
(fn? bar)
;; true
❶ The defprotocol definition takes a name (this will be the name of the generated class and local var)
and a list of method signatures. The methods are part of the generated interface.
❷ The class user.Foo was generated by defprotocol and we can see it contains the two expected
methods.
❸ The var Foo is also generated by defprotocol. It contains a map that defines the content of the
protocol, including method signatures and method builders. Each method builder generates an
instance of the dispatching mechanism for that specific function.
❹ defprotocol also creates a function for each method. The body of each function is generated from
the method builder with the same name as the function.
defprotocol generates functions in the current namespace, one for each method found
in the interface declaration. The generated functions contain the dispatch mechanism to
find and invoke the right function based on the type of the first argument:
(foo "arg") ; ❶
;; IllegalArgumentException:
;; No single method: foo of interface: user.Foo found for class: java.lang.String
(extend java.lang.String ; ❷
  Foo
  {:foo #(.toUpperCase %)})

(foo "arg") ; ❸
;; "ARG"
❶ If we try to invoke foo just after defining the protocol, Clojure tells us that there is no implementation of
that method that can be called.
❷ There are several ways to add a valid implementation. One way is to use extend to provide an
implementation as a map from the name of the method on the protocol to the function to invoke.
❸ After adding the method for the type String, foo returns the result of invoking the provided function.
When we "extend" the protocol, we add a new entry to the dispatch table that the
protocol maintains. But if the class implements a method of the protocol interface directly,
that implementation takes precedence (and the class cannot be extended again):
(deftype FooImpl [] ; ❶
Foo
(foo [this] "FooImpl::foo"))
(foo (FooImpl.)) ; ❷
;; "FooImpl::foo"
(extend FooImpl ; ❸
Foo
{:foo (constantly "extend::foo")})
;; IllegalArgumentException class FooImpl already implements interface user.Foo
❶ FooImpl is a class defined with deftype implementing the interface Foo generated by the protocol.
❷ The protocol method foo has a dispatch available for the class FooImpl and the call succeeds.
❸ An attempt to extend the same class FooImpl to the protocol Foo fails because the class already
implements the same interface directly.
NOTE Clojure 1.10 introduces a new dispatch method based on metadata: (foo (with-meta 1
{'foo (fn [this] "on numbers")})) prints "on numbers" (we assume the
same defprotocol definition used at the beginning of this section). The dispatch mechanism
is updated as follows: first the presence of the method is checked on the actual class, then in
the metadata (when supported) and finally in the extension table.
From the examples we can see that there are two options to provide the implementation
of a protocol method:
• Implementing the protocol interface directly (this can be achieved with any
of defrecord, deftype, proxy or reify).
• Extending the protocol to an existing class (using extend and the related
variants extend-type or extend-protocol).
Implementing the interface directly is as fast as extending the protocol at some later
point. Let’s have a look at the following benchmarks:
(require '[criterium.core :refer [bench]])
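The shape of such a benchmark might be (a sketch; the type names are illustrative and the timings depend on the machine):

```clojure
(require '[criterium.core :refer [bench]])

(defprotocol IBench (run-it [this]))

(deftype Direct [] ; implements the protocol interface directly
  IBench
  (run-it [this] :ok))

(deftype Extended []) ; extended after the fact
(extend Extended
  IBench
  {:run-it (fn [this] :ok)})

(bench (run-it (Direct.)))   ; direct dispatch
(bench (run-it (Extended.))) ; dispatch through the extension table
```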
extend (along with the related helper macros extend-type and extend-protocol) adds
a new dispatch option to a protocol:
(require '[clojure.string :refer [replace]])
(defprotocol Reflect ; ❶
(declared-methods [this]))
(extend java.lang.Object ; ❷
Reflect
{:declared-methods
(fn [this]
(map
(comp #(replace % #"clojure\.lang\." "cl.")
#(replace % #"java\.lang\." "jl."))
(.getDeclaredMethods (class this))))})
❶ We create a Reflect protocol that contains a method declared-methods that inspects a class to
retrieve its publicly declared methods.
❷ With extend we can attach this new capability to any class. Since all classes in Java (and Clojure)
ultimately extend from Object, declared-methods is now available for any type of argument.
❸ As a test we call declared-methods on the object returned by an atom. The resulting list of methods
is shortened for readability.
extend takes a map from function names (as keywords) to function bodies, enabling
simple reuse of functions across different objects. We could for example implement a
lightweight version of a Java abstract class as follows 228 :
(defprotocol IFace ; ❶
(foo [this])
(bar [this])
(baz [this]))
(def AFace ; ❷
{:foo (fn [this] (str "AFace::foo"))
:bar (fn [this] (str "AFace::bar"))})
(defrecord MyFace []) ; ❸

(extend MyFace
  IFace
  (assoc AFace :foo (fn [this] (str "MyFace::foo")))) ; ❹
(foo (->MyFace)) ; ❺
;; "MyFace::foo"
(bar (->MyFace)) ; ❻
;; "AFace::bar"
(baz (->MyFace)) ; ❼
;; No implementation of method: :baz of protocol: #'user/IFace
228
Java abstract classes contain a mix of implemented methods and abstract methods. You can’t instantiate an abstract class
until you sub-class it and provide the missing implementations.
(def my-face (->MyFace))

(foo my-face) ; ❶
;; "MyFace::foo"
(extend MyFace
IFace
(assoc AFace :foo (fn [this] (str "new")) ; ❷
:baz (fn [this] (str "baz"))))
(foo my-face) ; ❸
;; "new"
(baz my-face) ; ❹
"baz"
❶ With reference to the previous example, we assign an instance of the MyFace record to the var
my-face, showing that it prints the custom foo implementation as expected.
❷ We now repeat the extend call, this time changing the implementation of the method foo.
❸ Previously created instances are now extending the new version of the function foo.
❹ We also take the opportunity to provide the last missing implementation of baz that now prints
correctly.
You can extend protocols to interfaces or other protocols. In the next example we
model the relationship between nodes in a tree. The fact of being a "Node" and having
a "value" is inherited by both branches and leaves in the tree:
(defprotocol INode (value [_])) ; ❶
(defprotocol IBranch (left [_]) (right [_]))
(defprotocol ILeaf (compute [_]))
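The record definitions the callouts below refer to are not shown here; one sketch consistent with the traversal output (the book extends INode to the other protocols, while here the protocols are implemented inline, and the field names are assumptions) is:

```clojure
(defrecord Branch [id l r] ; ❸
  INode   (value [_] (str "Branch::" id)) ; ❷ a branch is also a node
  IBranch (left [_] l) (right [_] r))

(defrecord Leaf [n]
  INode (value [_] (str "Leaf::" n))
  ILeaf (compute [_] (str "computed:" n)))
```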
(def tree ; ❹
(->Branch 1
(->Branch :A (->Leaf 4) (->Leaf 5))
(->Branch :B (->Leaf 6) (->Leaf 7))))
(defn traverse
([tree]
(traverse [] tree))
([acc tree]
(let [acc (conj acc (value tree))] ; ❺
(if (satisfies? IBranch tree) ; ❻
(into
(traverse acc (left tree))
(traverse acc (right tree)))
(conj acc (compute tree))))))
(traverse tree) ; ❼
;; ["Branch::1" "Branch:::A" "Leaf::4" "computed:4"
;; "Branch::1" "Branch:::A" "Leaf::5" "computed:5"
;; "Branch::1" "Branch:::B" "Leaf::6" "computed:6"
;; "Branch::1" "Branch:::B" "Leaf::7" "computed:7"]
❶ INode, IBranch and ILeaf are protocol definitions. INode represents everything that is common to
the nodes of a tree and is designed to be "mixed in" along with other node specializations.
❷ We express the fact that a branch and a leaf are also nodes by extending the INode protocol into
the IBranch and ILeaf protocols. This enables the value method in INode to be invoked on
branches and leaves.
❸ The Branch defrecord takes an id and left and right branches. It implements both
the INode and IBranch protocols.
❹ We create a sample tree by instantiating and linking together a root, branches and leaves.
❺ The function traverse shows that the call (value tree) is always valid, independently of the type
of the node. traverse calls itself recursively on branches and always accumulates the value of the
node. traverse also collects the result of calling compute on leaves in the final results.
❻ Note the use of satisfies? to verify if the IBranch protocol contains an implementation for (type
tree) (the class of the tree object).
❼ We can see the result of the traversal going depth first into "Leaf::4", then following the tree left to
right down to the last available leaf.
extend-type is a shorter form of extend that uses fully formed function bodies, very
similar to what is available at defrecord declaration time:
(extend-type MyFace
IFace
(bar [this] (str "MyFace::bar"))) ; ❶
(bar my-face)
;; "MyFace::bar"
❶ The extend-type syntax is the same as the one used at declaration time of the MyFace record. In this
case we override the initial default implementation of the bar method that we saw earlier.
If the same protocol needs to be extended to many types, it can be repetitive to write
an extend-type for each of the types the protocol should be extended to.
extend-protocol aggregates all types into a single call:
(defprotocol Money
(as-currency [n]))
(extend-protocol Money ; ❶
Integer
(as-currency [n] (format "%s$" n))
clojure.lang.Ratio
(as-currency [n]
(format "%s$ and %sc" (numerator n) (denominator n))))
(extenders Money) ; ❷
;; (java.lang.Integer clojure.lang.Ratio)
❶ The Money protocol is extended to only 2 types in this example, but the list could be longer. We could
express the same with one extend-type for each numeric type, but it would be more verbose.
❷ extenders shows that the Money protocol has now two extensions.
❸ Similarly we can ask if a protocol extends to a specific type with extends?.
In this section we’ve seen how to use defprotocol to create interesting relationships
between types and their interfaces. extend is the idiomatic mechanism to dispatch
function calls, but we could include other aspects of an object, not just the type. This is
what we are going to see in the next section dedicated to derivation and multimethods.
(def h (custom-hierarchy ; ❷
[:clerk :person]
['owner 'person]
[String :person]))
❶ We are going to use custom-hierarchy from now on to repeatedly apply derive on pairs of child-parent
derivations. This allows us to create many of them conveniently.
❷ To create a new hierarchy, we invoke custom-hierarchy with any number of vector pairs. In this
example we use only keywords and one Java class, but we could use symbols as well (or any other
object implementing the clojure.lang.Named interface).
❸ We can check the relationships we just created with isa?. The three example all return true.
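A custom-hierarchy helper matching the callouts can be sketched as a fold of derive over the pairs:

```clojure
(defn custom-hierarchy [& pairs] ; ❶ each pair is [child parent]
  (reduce (fn [h [child parent]]
            (derive h child parent))
          (make-hierarchy)
          pairs))

(def h (custom-hierarchy [:clerk :person] ['owner 'person] [String :person]))

(isa? h :clerk :person) ; ❸
;; true
(isa? h String :person)
;; true
```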
Derivation is transitive:
(def h (custom-hierarchy
[:unix :os]
[:bsd :unix]
[:mac :bsd]))
❶ The fact that a :mac is a :unix is not declared explicitly, but it can be inferred by traversing the hierarchy
across multiple levels.
(descendants h :unix) ; ❶
;; #{:redhat :linux :debian :bsd :mac}
(ancestors h :mac) ; ❷
;; #{:unix :os :bsd}
❶ We want to modify this hierarchy to remove the relationship that says that :windows is a :unix.
❷ underive takes the hierarchy (optional) and the parent-child pair to remove.
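A sketch of the underive step the callouts describe (the fuller hierarchy, including the Linux flavors visible in the descendants output above, is an assumption):

```clojure
(def h (custom-hierarchy ; ❶ includes the relationship to remove
         [:unix :os] [:bsd :unix] [:mac :bsd]
         [:linux :unix] [:debian :linux] [:redhat :linux]
         [:windows :unix]))

(def h2 (underive h :windows :unix)) ; ❷
(isa? h2 :windows :unix)
;; false
(descendants h2 :unix)
;; #{:redhat :linux :debian :bsd :mac}
```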
isa? works on vectors by testing their respective items. We could for example check
different inheritance chains in the same hierarchy:
(def h (custom-hierarchy ; ❶
[:clerk :person]
[:owner :person]
[:unix :os]
[:bsd :unix]
[:mac :bsd]))
❶ This hierarchy contains specializations for "clerks" and "owners" as well as flavors of Unix systems.
❷ isa? tests the pair :mac and :unix followed by :owner and :person. It then returns true only if all
the relationships are true.
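The vector-based check the callout describes might look like this (using the hierarchy h defined above):

```clojure
(isa? h [:mac :owner] [:unix :person]) ; ❷ both pairs must hold
;; true
```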
(ancestors String) ; ❶
#{java.lang.CharSequence
java.io.Serializable
java.lang.Comparable
java.lang.Object}
❶ We can use ancestors directly with Java classes and use this information to cast an object to one of
its superclasses or interfaces.
You can use derivation functions with multimethods to express polymorphic behavior
which is not based on types. isa?, for instance, is used by multimethods instead of
plain equality to enable derivation on keywords or symbols. We are going to explore
their use in the next section.
The group of functions and macros presented in this section controls "multimethods" in
Clojure. A "multimethod" is a special Clojure function that has multiple
implementations. The choice of a specific implementation is made through a dispatch
function that the user has to provide.
Here’s for example how we would design a multimethod to evaluate mathematical
operations represented as data. We can create a calculate multimethod that dispatches
known operations and use a :default case to evaluate all other expressions:
(def total-payments ; ❶
{:op 'times
:expr
[[:loan 150000]
{:op 'pow
:expr
[{:op 'plus
:expr
[[:incr 1]
{:op 'divide
:expr [[:rate 3.16]
[:decimals 100]
[:months 12]]}]}
{:op 'times
:expr [[:months 12] [:years 10]]}]}]})
(def ops ; ❷
{'plus +
'times *
'divide /
'pow #(Math/pow %1 %2)})
(defmulti calculate ; ❸
(fn [form] (:op form)))
(calculate total-payments) ; ❼
;; 205659.10262863498
❶ total-payments represents the data structure for the formula: (* 150000 (Math/pow (+ 1 (/
3.16 100 12)) (* 12 10))). Operations are encoded as Clojure maps with an :op key and
an :expr vector. An expression can be a pair [<:description> <number>] or another operation
recursively.
❷ ops is a dictionary that translates from the operation as symbol to the actual function to call.
❸ The definition of the multimethod starts with a defmulti declaration called calculate. It takes a
single argument "form" and uses the :op key to select a specific implementation.
❹ Each defmethod implements a specific operation. We can destructure the form in the parameters and
just concentrate on "expr", a vector of other expressions. For most of the supported operations, we
use apply to accept any number of arguments. Note that we call calculate on the content of the
expression recursively.
❺ Calling a keyword on a vector, as in (:op [:int 20]), invariably returns nil (keywords
can only look themselves up in maps). When we reach a "leaf" in the data structure, we stop the
recursion and return the number from the vector pair.
❻ The special dispatch value :default receives calls that do not match any defmethod declaration. We
can use it to alert the user about a potentially missing dispatch value by stopping the evaluation and
throwing an exception.
❼ calculate is now ready to evaluate the total-payments data. What we see here is how much
we are going to repay on a $150,000 loan over 10 years at an annual interest rate of 3.16%.
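The defmethod definitions themselves are missing from this excerpt. A sketch consistent with callouts ❹, ❺ and ❻ could be (the exact shape of each method body is an assumption):

```clojure
;; Repeated from above so the sketch is self-contained (defmulti is a
;; no-op if calculate already exists).
(def ops {'plus + 'times * 'divide / 'pow #(Math/pow %1 %2)})
(defmulti calculate (fn [form] (:op form)))

;; ❹ One defmethod per operation, recursing into sub-expressions.
(defmethod calculate 'plus   [{expr :expr}] (apply (ops 'plus) (map calculate expr)))
(defmethod calculate 'times  [{expr :expr}] (apply (ops 'times) (map calculate expr)))
(defmethod calculate 'divide [{expr :expr}] (apply (ops 'divide) (map calculate expr)))
(defmethod calculate 'pow    [{[b e] :expr}] ((ops 'pow) (calculate b) (calculate e)))

;; ❺ Leaves like [:loan 150000] dispatch on nil, because
;; (:op [:loan 150000]) returns nil. Recursion stops here.
(defmethod calculate nil [[_ n]] n)

;; ❻ Anything else is an unknown operation.
(defmethod calculate :default [form]
  (throw (RuntimeException. (str "Don't know how to calculate " form))))
```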
We could use a custom hierarchy to take care of the potential duplication generated by
similar defmethods. In the previous example, defmethod implementations are changing
in relation to the number of arguments supported by each operation. In the following
improved implementation, we provide calculate with a custom hierarchy to group
operations by number of arguments. The custom hierarchy builds on top of a new data
structure that replaces the previous dictionary called ops:
(def ops
{'plus [+ :varargs] ; ❶
'times [* :varargs]
'divide [/ :varargs]
'pow [#(Math/pow %1 %2) :twoargs]})
❶ ops is a mapping from the operation symbol (for example 'plus), to the actual function to call (for
example +) and the operation type (for example, :twoargs or :varargs).
Here is how we could model calculate on top of a custom hierarchy expressed by the
new ops mapping:
(defn- add-ops [hierarchy ops] ; ❶
(reduce
(fn [h [op [f kind]]] (derive h op kind))
hierarchy
ops))
(def hierarchy ; ❷
(add-ops (make-hierarchy) ops))
(do ; ❹
(def calculate nil)
(defmulti calculate
(fn [form] (:op form))
:hierarchy #'hierarchy))
(defmethod calculate :default
  [form]
  (throw
    (RuntimeException.
      (str "Don't know how to calculate " form))))
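The grouped implementations implied by the surrounding text are not shown. Assuming the :varargs/:twoargs hierarchy just built, they might read as follows (the setup is repeated so the sketch stands on its own):

```clojure
;; Setup repeated from above.
(def ops {'plus   [+ :varargs]
          'times  [* :varargs]
          'divide [/ :varargs]
          'pow    [#(Math/pow %1 %2) :twoargs]})
(defn- add-ops [hierarchy ops]
  (reduce (fn [h [op [f kind]]] (derive h op kind)) hierarchy ops))
(def hierarchy (add-ops (make-hierarchy) ops))
(defmulti calculate (fn [form] (:op form)) :hierarchy #'hierarchy)

;; One defmethod per *kind* of operation: 'plus, 'times and 'divide all
;; derive from :varargs in the custom hierarchy, so a single method
;; covers the three of them. 'pow derives from :twoargs.
(defmethod calculate :varargs [{op :op expr :expr}]
  (apply (first (ops op)) (map calculate expr)))
(defmethod calculate :twoargs [{op :op [a b] :expr}]
  ((first (ops op)) (calculate a) (calculate b)))
(defmethod calculate nil [[_ n]] n) ; leaves are still [keyword number] pairs
```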
(calculate total-payments) ; ❼
;; 205659.10262863498
WARNING defmulti definitions are not overridable. Attempts to redefine a defmulti with the same
name and namespace as an existing one will silently fail.
Another interesting feature of multimethods is that we can extend them after they’ve
been declared. Differently from protocols (which are also extensible at run-time), we
don’t need to register a new type to add behavior. The next example contains a
different formula with new operators. We are going to use a special form (which is not
part of the actual data) to instruct the defmethod about the new operators:
(defn sound-speed-by-temp [temp] ; ❶
{:op 'with-mapping
:expr
[{'inc [inc :onearg]
'sqrt [(fn [x] (Math/sqrt x)) :onearg]}
{:op 'times
:expr
[[:mph 738.189]
{:op 'sqrt
:expr
[{:op 'inc
:expr
[{:op 'divide
:expr [[:celsius temp]
[:zero 273.15]]}]}]}]}]})
❶ The function sound-speed-by-temp takes a temperature and produces the data equivalent of the
formula (* 738.189 (Math/sqrt (inc (/ temp 273.15)))). The formula calculates the speed of
sound at a given temperature, returning the result in miles per hour. The formula is wrapped
inside 'with-mapping, a special operator used by the :default dispatch multimethod.
❷ If we try to invoke the new formula we get an error, as it contains unknown symbols: 'with-mapping,
'sqrt and 'inc. The :default multimethod informs us about the problem.
The formula to calculate the speed of sound in relation to temperature introduces two
new operations: 'sqrt and 'inc. The next example designs a new :default dispatch
method to intercept the presence of 'with-mapping. In the presence of the special
operator, we alter the hierarchy and the operator mappings to introduce the new
operations:
(defmethod calculate :default
[{op :op [ops forms] :expr :as form}]
(if (= 'with-mapping op)
(do
(alter-var-root #'hierarchy add-ops ops) ; ❶
(alter-var-root #'ops into ops) ; ❷
(calculate forms)) ; ❸
(throw (RuntimeException. (str "Don't know how to calculate " form)))))
❶ We use alter-var-root to change the content of the hierarchy and add the new operators. Note
that this operation is idempotent: the hierarchy doesn’t change if the same relationship is already
present.
❷ The mapping definitions in ops also need updating. Differently from the previous update on the
hierarchy, this second alter-var-root "upserts" (updates or inserts) the given operators in the table.
❸ calculate can now move forward to the rest of the data structure.
❹ We can see that the speed of sound at -60 celsius (a typical temperature at about 11000 feet of
altitude) is around 112 MPH slower than room temperature.
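As a sanity check of the last callout, the underlying formula can be evaluated directly, bypassing the multimethod machinery (sound-speed is a throwaway helper, not part of the book's code):

```clojure
(defn sound-speed [celsius] ; speed of sound in MPH at the given temperature
  (* 738.189 (Math/sqrt (inc (/ celsius 273.15)))))

(- (sound-speed 20) (sound-speed -60)) ; room temperature vs -60 Celsius
;; roughly 112.6
```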
Let’s now review other utility functions for multimethods. While defmethod adds a
new dispatch option to an already existing multimethod, remove-method performs the
opposite operation (remove-all-methods removes all dispatch options instead):
(remove-method calculate :twoargs) ; ❶
(calculate {:op 'pow :expr [[:int 2] [:int 2]]}) ; ❷
;; RuntimeException Don't know how to calculate {:op pow [...]}
❶ With reference to the previous example involving the calculate multimethod, we now proceed to
remove the dispatch method for the :twoargs key.
❷ calculate is now unable to calculate the power of a number.
❶ methods returns all the registered dispatch implementations, confirming the fact that :twoargs was
removed from the dispatch table.
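As a self-contained illustration of methods and remove-method (the area multimethod here is hypothetical, not from the book):

```clojure
(defmulti area :shape)
(defmethod area :square [{s :side}] (* s s))
(defmethod area :circle [{r :radius}] (* Math/PI r r))

(set (keys (methods area))) ; all registered dispatch values
;; #{:square :circle}

(remove-method area :circle)

(set (keys (methods area)))
;; #{:square}
```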
Finally, let's have a look at prefers and prefer-method. The reason why we are able to
express a "preference" is that sometimes there is ambiguity while dispatching to the
correct defmethod. This is a typical situation when extending multimethods to interface types,
as Java allows inheritance from multiple interfaces. Here’s an example using typical
Clojure data structures:
(defmulti edges ; ❶
"Retrieves first and last from a collection" type)
❶ edges is a multimethod to retrieve the edges from a collection. It dispatches using the type function.
❷ Many collections are Iterable in Clojure, so we define a version of edges to deal with them.
❸ We are also interested in a specific version for Clojure lists.
❹ If we try to call edges on a list we discover that lists are both IPersistentList and Iterable.
❺ prefer-method establishes a preference order to use in case of ambiguous dispatch.
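The remaining definitions the callouts describe are missing here. A plausible reconstruction:

```clojure
(defmulti edges
  "Retrieves first and last from a collection" type) ; as above

(defmethod edges Iterable [coll] ; ❷ generic version for Iterable collections
  [(first coll) (last coll)])

(defmethod edges clojure.lang.IPersistentList [coll] ; ❸ list-specific version
  [(first coll) (last coll)])

;; ❹ At this point (edges '(1 2 3)) throws: lists are both IPersistentList
;; and Iterable, so the dispatch is ambiguous.

(prefer-method edges clojure.lang.IPersistentList Iterable) ; ❺

(edges '(1 2 3))
;; [1 3]
```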
We can check the presence of entries from the table of preferences using prefers.
Here’s for example the rich set of preferences defined by print-dup:
(pprint (prefers print-dup)) ; ❶
;; {clojure.lang.ISeq
;; #{clojure.lang.IPersistentCollection java.util.Collection}
;; clojure.lang.IRecord
;; #{java.util.Map clojure.lang.IPersistentCollection
;; clojure.lang.IPersistentMap}
;; java.util.Map #{clojure.lang.Fn}
;; clojure.lang.IPersistentCollection
;; #{clojure.lang.Fn java.util.Map java.util.Collection}
;; java.util.Collection #{clojure.lang.Fn}}
❶ prefers shows a few choices for print-dup. Almost everything is printable in Clojure. Collections are
a typical example. print-dup defines a format suitable for serialization of an object to file and has to
deal with the many similar interfaces required for interoperation.
(recursive 1) ; ❹
;; 1
;; 2
;; 3
;; 4
;; 5
❶ For the purpose of this example, the dispatch function is identity.
❷ This defmethod definition contains a locally bound name recursive-impl between the dispatch
value and the arguments.
❸ recursive-impl can be used inside defmethod to call the same definition without passing through
the dispatch mechanism.
❹ When we call recursive with a number, we dispatch to the corresponding (and only) multimethod
available. The number is then used as the initial value for the recursion. On reaching "5" the recursion
ends as expected.
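The definitions the callouts walk through are missing from this excerpt. They plausibly looked like this (defmethod accepts an optional function name before the argument vector, as the footnote below explains):

```clojure
(defmulti recursive identity) ; ❶ the dispatch function is identity

(defmethod recursive :default
  recursive-impl ; ❷ local name bound to this defmethod's function
  [n]
  (println n)
  (when (< n 5)
    (recursive-impl (inc n)))) ; ❸ recurse without going through dispatch
```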
229
Thanks to Rupert Ede for suggesting the inclusion of this section in the book
The name of the anonymous function is attached to a specific defmethod definition so we can track it
down in stack traces while debugging.
16
Vars and Namespaces
Before we introduce functions and macros in this section, it’s useful to refresh a few
definitions about Clojure namespaces, vars and bindings.
Namespaces
• A "namespace" is an instance of clojure.lang.Namespace class and is essentially
a container for other objects.
• The "mapping table" is a dictionary object associated with each namespace. It
contains mappings between Clojure symbols and objects such as vars or classes.
Functions like intern or def add items into the table, while ns-map shows the
content.
• The "alias table" is another type of dictionary associated with a namespace. It
contains relationships between namespaces, using a symbol as key and a
namespace name as value. Items can be added to the alias table with :as during
namespace creation or with the alias function. ns-aliases shows its content.
• Once a namespace is created, a reference is added to the global namespace
repository. This is a static map inside the clojure.lang.Namespace class itself.
Hence namespaces are "global", that is, the running process doesn’t need to hold
an explicit reference to keep them alive (from the garbage collector perspective).
Libraries
• A "library" is a source file named after the namespace declaration it contains. The
concept of library overlaps substantially with namespace, but libraries dictate a
few conventions to store and reuse Clojure code from disk.
• Loaded libraries are stored inside the *loaded-libs* dynamic var.
• The existence of a library implies the existence of the related namespace, while
the opposite is not always true (for example, create-ns creates just a namespace,
©Manning Publications Co. To comment go to liveBook
not a library). require loads a library and any additional transitive dependency.
Vars and Bindings
• A "var" object is an instance of the clojure.lang.Var class (to avoid confusion
with other languages, we are going to use the shorter form "var" instead of
"variable"). A var is a container for a single value, with the option to hold multiple
values when concurrency is involved.
• A "root binding" is the optional value that the var can be associated with. It is
visible from all threads and this is the default behavior.
• A "dynamic binding" is a value that the var associates with a specific thread. Each
thread can access a different value for the same var. The feature is not enabled by
default but it can be enabled by passing the ^:dynamic metadata during var
definition.
• A var that allows dynamic bindings is called "dynamic" and its name is
conventionally surrounded by "*" (colloquially called "earmuffs").
• A var can be "bound" (it has a root bound value) or "thread-bound" when it has at
least one thread bound value.
• Compared to root bindings, thread bindings can "stack up". This is achieved by
nesting binding forms. Previous dynamic values are preserved while moving back
from each nested binding scope.
• A var is always associated with at least one namespace (through the mapping
table). It follows that vars are also global.
• The behavior of vars is heavily influenced by the metadata attached to them. The most
important are:
• :dynamic: indicates that the var is thread-local enabled and can stack-up values
per-thread. A var marked as dynamic can be used with set! or var-get.
• :inline and :inline-arities enable var "inlining", an alternative
implementation that takes precedence over the root binding or thread binding if
present. See definline for more information.
Vars are functions themselves (ifn? returns true on var objects) and they are an
integral part of Clojure's indirection mechanism (what makes Clojure a dynamic
language). For example, the expression (+ 1 1) needs the following steps to execute in
the namespace "user":
• The symbol '+ is resolved through the namespace mappings. Assuming an entry is
present (this is always true for the "user" namespace and clojure.core/+) the
value of the entry is returned.
• If the value from the mappings is a var object, it gets invoked with the given
arguments. This is equivalent to calling (#'clojure.core/+ 1 1) instead of
simply (+ 1 1).
• The var then delegates the call to either the inlined version, the thread local value
or the root binding in this order. + has inlined arity-2, so the inlined version takes
precedence. + arity-1 is not inlined, so the root binding is invoked instead.
At the end of the invocation chain, from a namespace entry to the var value, sits the
bytecode that was generated during compilation. Each of the following invocations,
produce the same result by removing one of the intermediate steps:
(+ 1 1) ; ❶
((var clojure.core/+) 1 1) ; ❷
((deref (var +)) 1 1) ; ❸
❶ All the forms in this example produce the same result "2". The first form goes through the full lookup
chain, from the namespace entry for the symbol +, to the var indirection, to the bytecode generated for
the function.
❷ The second form skips the namespace lookup.
❸ The third form skips namespace indirection and var indirection (performed explicitly by deref).
mydef ; ❸
"thedef"
❶ ns moves to a new namespace (potentially creating it). When learning about def and vars we want to be
sure to start from a pristine namespace definition.
❷ def returns an object of type clojure.lang.Var.
❸ Typing the symbol "mydef" produces the lookup of the var from the namespace mapping. The var is
then asked its value.
❹ We can fetch the var object from the namespace mappings with ns-map. The two expressions shown
here retrieve the same object. The last one is the most explicit: ns-map returns the hash-
map of mappings corresponding to 'myns. We then access the map using the symbol 'mydef as key.
❺ Finally, we can see what additional information is attached to the var as metadata. The REPL
instance where the form was evaluated is using a temporary file that we can see here as ":file".
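Putting the pieces of this example back together (the exact REPL metadata output varies by environment):

```clojure
(ns myns) ; ❶ start from a pristine namespace

(def mydef "thedef") ; ❷ returns the var: #'myns/mydef

mydef ; ❸ the symbol resolves to the var, which is asked for its value
;; "thedef"

((ns-map 'myns) 'mydef)     ; ❹ the var, via the mapping table
(get (ns-map 'myns) 'mydef) ; ❹ same object, most explicit form

(meta #'mydef) ; ❺ includes keys such as :file, :line, :name and :ns
```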
def also supports custom metadata and a documentation string (normally abbreviated
"docstring"):
(ns myns)
(clojure.repl/doc def-meta-doc) ; ❸
;; -------------------------
;; myns/def-meta-doc
;; A def with metadata and docstring. ❹
❶ A metadata literal map was added between def and the name of the definition. It contains a creation
date for the definition.
❷ The documentation string should go just after the def name and before the body of the definition.
❸ We can see the documentation extracted by clojure.repl/doc.
❹ The metadata are attached to the var object.
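The definition these callouts refer to could have looked like the following (the metadata key and value are assumptions for illustration):

```clojure
(def ^{:created "2019-01-01"} ; ❶ hypothetical custom metadata
  def-meta-doc
  "A def with metadata and docstring." ; ❷ docstring before the body
  42)

(:doc (meta #'def-meta-doc)) ; ❹ the docstring ends up on the var's metadata
;; "A def with metadata and docstring."
```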
declare
We can also invoke def without a body. The generated var object becomes "unbound".
(ns myns)
(def unbound-var) ; ❶
;; #'myns/unbound-var
unbound-var ; ❷
;; #object[clojure.lang.Var$Unbound 0x3f351b94 "Unbound: #'myns/unbound-var"]
Unbound vars can be useful in case of mutually recursive definitions. Clojure even
provides a declare macro to clarify the meaning of an empty definition. In the
following example we create a simple state machine that verifies the presence of
alternating "0" and "1" in a string:
(declare state-one) ; ❶
(def state-zero ; ❷
#(if (= \0 (first %))
(state-one (next %))
(if (nil? %) true false)))
(def state-one ; ❸
#(if (= \1 (first %))
(state-zero (next %))
(if (nil? %) true false)))
(state-zero "0100100001") ; ❹
;; false
(state-zero "0101010101") ; ❺
;; true
❶ (declare state-one) is equivalent to (def state-one) but it immediately clarifies the reason for
the missing body. declare acts as a warning about the presence of a recursive cycle in the following
definitions.
❷ state-zero defines an anonymous function that calls state-one which is not yet
defined. declare allows state-one definition regardless.
❸ state-one is defined again, this time with a proper body. The var object linked to state-one is not
created again, but it is assigned a body for evaluation.
❹ The state machine only matches patterns starting with "0" and then alternating "1" and "0". This
pattern has a series of repeating zeroes and therefore is not valid.
❺ A valid pattern triggers the correct answer.
intern
intern works similarly to def, but offers the possibility of creating definitions in other
namespaces:
(ns myns)
(create-ns 'ext) ; ❶
*ns* ; ❷
;; #object[clojure.lang.Namespace 0x68ff111c "myns"]
❶ create-ns creates a new namespace 'ext. Compared to other namespace-related macros, create-
ns does not change the current namespace.
❷ We can check the content of the dynamic var *ns* to verify that we are still in the same
namespace.
intern is useful for all programmatic definitions of vars. We could for example create
a new namespace and a list of vars without the need to create a Clojure source file:
(def definitions ; ❶
{'ns1 [['a1 1] ['b1 2]]
'ns2 [['a2 2] ['b2 2]]})
(defns definitions) ; ❸
;; (#'ns1/a1 #'ns1/b1 #'ns2/a2 #'ns2/b2)
❶ The definitions map contains namespaces to be created as keys and definitions as vectors.
❷ defns takes the map of definitions and iterates through namespaces and required definitions using for.
We need to remember to call create-ns to make sure the namespace exists before we call intern.
❸ defns returns the list of vars created and mapped to the respective namespaces.
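defns is used above but its definition is missing from this excerpt. A sketch matching callout ❷:

```clojure
(def definitions ; repeated from above
  {'ns1 [['a1 1] ['b1 2]]
   'ns2 [['a2 2] ['b2 2]]})

(defn defns [definitions] ; ❷ iterate namespaces and their definitions
  (for [[ns-sym defs] definitions
        [sym value] defs]
    (do (create-ns ns-sym) ; make sure the namespace exists first
        (intern ns-sym sym value))))

(defns definitions) ; ❸
;; (#'ns1/a1 #'ns1/b1 #'ns2/a2 #'ns2/b2)
```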
defonce
defonce is another def variant. def allows redefinition by "upserting" the current
namespace mappings: if an entry exists for the definition, then the existing var is
updated, replacing the old body (what the var evaluates to) with a new
one. defonce first checks for the presence of an already defined var for the given name
and only creates the new definition if it does not already exist:
(def redefine "1") ; ❶
(defonce dont-redefine "1")
(def redefine "2")
(defonce dont-redefine "2")
redefine
;; "2"
dont-redefine
;; "1"
❶ def is compared to defonce by repeating the same definitions and attempting to change their value
from "1" to "2". defonce does not change its value.
The reason for defonce is to protect important data from accidental redefinition.
NOTE defonce should not be confused with (def ^:const). The presence of the
metadata :const on a definition produces an effect similar to "inlining": all references to the
definition are replaced verbatim with the value of the definition. After evaluation the (def
^:const) effectively stops existing, without a namespace mapping or a var being created.
var
var is a special form that retrieves a var definition (the clojure.lang.Var object
associated with a symbol) from the current or another namespace. It throws an exception
if the var is not found, and does not create namespaces or symbols automatically:
(var a) ; ❶
CompilerException java.lang.RuntimeException: Unable to resolve var: a [...]
(def a 1)
(var a) ; ❷
;; #'user/a
(var test-var/a) ; ❸
;; CompilerException java.lang.RuntimeException: Unable to resolve var: test-var/a
(create-ns 'test-var)
(intern 'test-var 'a 1)
(var test-var/a) ; ❹
;; #'test-var/a
❶ The var "a" does not exist in the current namespace and var throws an exception. Note that "a" is
evaluated as a symbol even if we don't explicitly qualify it as such. var behaves as a macro in this
respect.
❷ After defining "a", var correctly retrieves the var object.
❸ var accepts namespace qualified symbols such as test-var/a. But the namespace does not exist
yet, let alone the var "a" in that namespace.
❹ After creating the namespace and a definition for "a", var returns the var object associated with test-
var/a.
The var objects are different despite the fact that the definition is about the symbol "a" in both cases.
230
github.com/stuartsierra/component
(var clojure.core/+) ; ❶
;; #'clojure.core/+
#'clojure.core/+ ; ❷
;; #'clojure.core/+
❶ var retrieves the var object given the qualified symbol clojure.core/+. The printer method for vars
is instructed to print the var object as the reader syntax literal.
❷ The Clojure reader interprets "#" as a dispatch into the read table for the single quote that follows.
This is translated internally into (var clojure.core/+) and then compiled as usual, resulting in
exactly the same form being evaluated.
find-var
find-var works similarly to var but requires fully qualified symbols as input (a
symbol name preceded by a namespace and separated by a forward slash, such as a/b).
It does not throw an exception in case the var isn't found:
(find-var 'user/test-find-var) ; ❶
;; nil
(find-var 'test-find-var) ; ❷
;; IllegalArgumentException Symbol must be namespace-qualified
❶ find-var returns nil when searching for the yet to be created test-find-var. Note that the name
of the var is qualified with user, a namespace that is guaranteed to exist at the REPL.
❷ Symbols that are not fully qualified are not accepted by find-var.
Prefer find-var to var if you don't want to use a try-catch block to deal with non-existing
vars.
resolve and ns-resolve
resolve and ns-resolve add a couple of additional options on top of var and find-
var while searching for vars. resolve always uses the current namespace for
searching (the content of the *ns* dynamic var) while ns-resolve can be given a specific
namespace to search in.
The first added feature is searching for Java classes as well as var objects:
(resolve 'Exception) ; ❶
;; java.lang.Exception
;; [I
❶ resolve returns the class associated with the symbol 'Exception. This is how, at the REPL, we can
just type "Exception" without having to import the class first. The REPL imports most of
the java.lang classes in the "user" namespace automatically.
❷ An array of integers has a class type in Java named using an open square bracket. We can retrieve
those kinds of classes by creating the symbol from a string.
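The snippet for the second callout is missing; it presumably resembled:

```clojure
(resolve (symbol "[I")) ; ❷ the class of a Java array of primitive ints
;; [I
```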
(def mydef 1) ; ❷
(def system :dont-change-me)
(replace-var 'x 2) ; ❸
;; nil
(replace-var 'mydef 2) ; ❹
mydef
;; 2
(replace-var 'system 2) ; ❺
system
;; :dont-change-me
❶ replace-var is a function that swaps existing var values with new ones. It contains a set of
"protected" vars that cannot be overridden. The set is passed to resolve to prevent the
resolution of protected vars.
❷ We define mydef and system. The latter is protected by replace-var.
❸ replace-var accepts non-existing vars and does nothing.
❹ Existing vars that are not protected are replaced with a new value/expression.
❺ However, we are unable to replace system, which is protected and thus not visible to resolve.
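replace-var is used above but never shown. One way to implement the described behavior relies on resolve's two-argument arity, whose environment map hides the protected vars (the protected set and the implementation details are assumptions):

```clojure
(def protected '#{system}) ; hypothetical set of protected var names

(defn replace-var [sym value]
  ;; Symbols present in resolve's env map resolve to nil, so protected
  ;; vars are never found and therefore never replaced.
  (when-let [v (resolve (zipmap protected (repeat nil)) sym)]
    (alter-var-root v (constantly value))))
```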
(dir-fn 'clojure.set) ; ❶
;; (difference index intersection
;; join map-invert project rename
;; rename-keys select subset?
;; superset? union)
(dir clojure.set) ; ❷
;; difference
;; index
;; intersection
;; [..]
❶ dir-fn returns the list of public definitions in a namespace. In this case we can see public definitions
for clojure.set. The list is sorted alphabetically.
❷ dir retrieves the same list and prints it on the screen returning nil.
(def avar) ; ❷
(bound? #'avar)
;; false
(thread-bound? #'avar)
;; false
❶ *dvar* is a var marked as ^:dynamic using metadata. The var was not bound at definition, which is
true irrespective of the evaluation happening inside a specific thread. We can see that the var object
has been created and added to the namespace mappings. Both bound? and thread-
bound? return false. With binding we can open a thread-aware context to set the dynamic var. Both
functions return true inside the binding.
❷ We define now a normal var called avar. The var is still unbound at definition and both functions
agree on this fact.
❸ If we intern the var with a value (which looks up the var in the mapping and updates its root) we can
see that the var is now bound. But from the perspective of the current thread, the var still does not
have a value and thread-bound? returns false.
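Reconstructing the code these callouts walk through (the var names follow the callouts, with conventional earmuffs added to the dynamic one):

```clojure
(def ^:dynamic *dvar*) ; ❶ dynamic, but unbound at definition

[(bound? #'*dvar*) (thread-bound? #'*dvar*)]
;; [false false]

(binding [*dvar* 1] ; a thread-aware context that sets the dynamic var
  [(bound? #'*dvar*) (thread-bound? #'*dvar*)])
;; [true true]

(def avar) ; ❷ a normal var, also unbound at definition

[(bound? #'avar) (thread-bound? #'avar)]
;; [false false]

(intern *ns* 'avar 42) ; ❸ sets the root binding

[(bound? #'avar) (thread-bound? #'avar)]
;; [true false]
```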
Both functions accept more than one parameter returning true only if all vars are
bound:
(def a 1) (def b 2) (def c 3)
(bound? #'a #'b #'c) ; ❶
;; true
❶ Both bound? and thread-bound? accept any number of vars. In this case it verifies that all the vars
are bound at once.
❶ alter-var-root changes the root binding of the var a-var. It accepts a function from the old value
to the new one (in this case update-in), plus any additional parameters. The new value is returned.
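The call this callout refers to is missing; it plausibly looked like the following (the starting value of a-var is an assumption):

```clojure
(def a-var {:counter 0}) ; hypothetical starting value

(alter-var-root #'a-var update-in [:counter] inc)
;; {:counter 1}
```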
alter-var-root performs the change atomically: while changing the var, the
corresponding var object is locked for reading or writing ("synchronized" in Java
terminology):
(def a-var 1)
(future ; ❶
(alter-var-root
#'a-var
(fn [old]
(Thread/sleep 10000)
(inc old))))
❶ timespi is a simple function that multiplies a number by the Pi constant. The var that
contains timespi is inlined: the metadata on the var contains a pre-compiled evaluation of the
function.
❷ We call alter-var-root on the var object, to change the root binding to a new function that always
returns 1.
❸ Calling timespi invokes the inlined version of the function, which hasn't changed.
❹ We need to change or remove the related metadata on the var object to force evaluation through the
root binding.
timespi now invokes the root binding of the var object.
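The timespi definitions are not shown in this excerpt. The behavior the callouts describe can be reproduced with :inline metadata; the following is a sketch, not necessarily the book's exact code:

```clojure
(def ^{:inline (fn [x] `(* Math/PI ~x))} ; ❶ expansion used at compile time
  timespi
  (fn [x] (* Math/PI x)))

(timespi 2)
;; 6.283185307179586

(alter-var-root #'timespi (constantly (fn [_] 1))) ; ❷ new root binding

(timespi 2) ; ❸ still the old result: the inline expansion takes precedence
;; 6.283185307179586

(alter-meta! #'timespi dissoc :inline) ; ❹ drop the inline metadata

(timespi 2) ; the root binding is finally invoked
;; 1
```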
❶ fetch-title is designed to request the content of a given web address and search the text with a
regular expression. The function as is would need to connect to the network to execute the request.
❷ Instead of testing the function assuming a network connection is available, we use with-redefs to
temporarily change the function attached to the slurp var, forcing it to return a sample string.
❶ with-redefs-fn requires a bit more ceremony compared to with-redefs, but the two
functions are equivalent.
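The code behind these callouts is missing here. A hypothetical fetch-title and its stubbed tests might read:

```clojure
(defn fetch-title [url] ; ❶ hypothetical implementation: fetch, then search
  (second (re-find #"<title>(.*?)</title>" (slurp url))))

;; ❷ Temporarily rebind slurp so no network connection is needed.
(with-redefs [slurp (constantly "<html><title>Stub</title></html>")]
  (fetch-title "http://example.com"))
;; "Stub"

;; The equivalent with-redefs-fn version: a map of vars to values plus
;; a thunk to execute.
(with-redefs-fn {#'slurp (constantly "<html><title>Stub</title></html>")}
  #(fetch-title "http://example.com"))
;; "Stub"
```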
The example above related to testing is not a coincidence: a normal Clojure
application probably shouldn't contain any redefinition, as with-redefs is not thread
safe:
(defn x [] 5) ; ❶
(defn y [] 9)
(dotimes [i 10] ; ❷
(future (with-redefs [x #(rand)] (* (x) (y))))
(future (with-redefs [y #(rand)] (* (x) (y)))))
[(x) (y)] ; ❸
;; [0.6022778872500808 9]
The problem with the example above is that with-redefs might access an altered value
of a var while it's in the process of being changed by the other thread. It then
restores a root binding that was not the original. To solve the problem, we should
use dynamically bound vars instead:
(defn ^:dynamic x [] 5) ; ❶
(defn ^:dynamic y [] 9)
(dotimes [i 10] ; ❷
(future (binding [x #(rand)] (* (x) (y))))
(future (binding [y #(rand)] (* (x) (y)))))
[(x) (y)] ; ❸
[5 9]
16.4 binding
NOTE This section also mentions other related functions such as: with-bindings, with-
bindings*, push-thread-bindings, pop-thread-bindings, bound-fn, bound-fn*.
binding has already been used throughout the book and in the previous sections. As already
mentioned, binding creates a context in which vars can be assigned a thread-local
value, leaving the root binding untouched.
Dynamic vars can be used to share simple state between calls in the same thread
without necessarily passing the same parameter to all functions. The following
example assumes a concurrent system allocating one thread per-request (common case
for web applications). If the system is invoked with trace=enabled we collect a lot
more information about that specific request:
(def ^:dynamic *trace*) ; ❶
Note how, in the example above, the vector containing the messages is never passed as
parameter to the other functions. Messages are shared through the *trace* dynamic var
without requiring any special synchronization apart from the enclosing binding form.
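A sketch of the kind of code the text describes (the handler and function names are hypothetical):

```clojure
(def ^:dynamic *trace*) ; as defined above

(defn trace! [msg] ; record msg only when tracing is active on this thread
  (when (thread-bound? #'*trace*)
    (set! *trace* (conj *trace* msg))))

(defn handle-request [params]
  (trace! {:step :handle :params params})
  {:status 200})

(defn process [params trace?]
  (if trace?
    (binding [*trace* []] ; one fresh vector per request/thread
      (assoc (handle-request params) :trace *trace*))
    (handle-request params)))
```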
bound-fn
The other functions in this section, with-bindings, with-bindings*, push-thread-
bindings, pop-thread-bindings, bound-fn and bound-fn* are lower level or more
specific (and rarely used). Here's a summary:
• with-bindings and with-bindings* (macro and function version) are similar
to binding but require a map of bindings with explicit var objects. Use them to
customize the way to fetch var objects or if you want to store the binding pairs
(var objects and values) separately.
• push-thread-bindings and pop-thread-bindings are lower level primitives that
are used to set thread-local bindings. They should be called with the following
sequence: push bindings, evaluate body, pop bindings in a finally block. It’s
unlikely you’ll ever need to process bindings in a different way and that’s
what binding does for us already.
• bound-fn and bound-fn* (macro and function version) are helpers to wrap existing
functions so that thread-local bindings are propagated when the wrapped function
runs in a new thread created from within a binding form.
bound-fn deserves some additional explanation. Dynamic var values are thread-local
and don't share their state with other threads. There is however the legitimate case in which
a new thread is created inside an already existing binding context:
(def ^:dynamic *debug*)
❶ debug is a function that checks the presence of a thread binding for the *debug* dynamic var.
❷ If we wrap function calls inside a binding form that sets *debug* to true, we expect debugging
messages to print on screen. But messages are not appearing. The reason is that the inner form is
creating a separate thread and the thread local bindings are by definition invisible to the new thread.
❸ bound-fn* wraps the function passed as argument into another function that, before calling the inner
function, copies the bindings from the current thread into the new one. We can see the debugging message
correctly appearing.
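Reconstructing the example around *debug* (a sketch; the thread-handling details are assumptions):

```clojure
(def ^:dynamic *debug*) ; as defined above

(defn debug [msg] ; ❶ print only when *debug* is thread-bound and truthy
  (when (and (thread-bound? #'*debug*) *debug*)
    (println "DEBUG:" msg)))

(binding [*debug* true] ; ❷ plain thread: the binding is invisible, no output
  (doto (Thread. #(debug "lost")) .start .join))

(binding [*debug* true] ; ❸ bound-fn* carries the bindings into the thread
  (doto (Thread. (bound-fn* #(debug "visible"))) .start .join))
;; DEBUG: visible
```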
bound-fn helps propagate bindings correctly. Clojure itself uses bound-fn internally
with future and other concurrency primitives.
❶ In the spirit of going "all the way", ++ encapsulates the concept of mutation with increment on a var
object.
❷ with-local-vars creates a new var object called "a" and then assigns its thread-local value to 0. The
var "a" is now available inside the form for reading or writing.
❸ As expected, there are 5 even numbers in the range from 0 to 9.
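The definitions behind these callouts might be:

```clojure
(defn ++ [v] ; ❶ imperative-style increment of the value held by var v
  (var-set v (inc (var-get v))))

(defn count-even [coll] ; ❷ "a" is a thread-local var visible inside the body
  (with-local-vars [a 0]
    (doseq [n coll]
      (when (even? n)
        (++ a)))
    (var-get a)))

(count-even (range 10)) ; ❸
;; 5
```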
Note that multiple threads accessing count-even would perform isolated changes to the
counter resulting in thread-safe mutations. Also note how the var "a" was never defined
explicitly (for example with def) outside the function. Even if that were the case
(e.g. (def a 10) appears somewhere at the top level), the definition of "a"
inside with-local-vars would shadow any external reference.
The only reason to use with-local-vars is to force imperative style mutable locals
instead of idiomatic recursion, a concession given to mimic other Lisps that have this
option available. However, with-local-vars is rarely seen in Clojure code.
• Search the global namespace repository for an existing namespace myns. If none is
found, it creates a new one and adds it to the global namespace repository.
• When creating a new namespace, in-ns also injects the "default imports" into the
namespace mapping, the list of java.lang.* classes that are directly available to
other forms declared in the namespace.
• It sets the dynamic var *ns* to myns (the REPL, for instance, uses that information
to change the prompt name).
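A quick illustration of those three effects (a sketch; temp.ns is an arbitrary name):

```clojure
(in-ns 'temp.ns)                                       ; creates and switches to temp.ns

;; the namespace is now registered in the global repository:
(clojure.core/some? (clojure.core/find-ns 'temp.ns))
;; => true

;; the default java.lang imports are present in the new mapping table:
(clojure.core/contains? (clojure.core/ns-map 'temp.ns) 'String)
;; => true

(clojure.core/in-ns 'user)                             ; switch back
```

Note that in-ns, unlike ns, does not refer clojure.core, which is why the calls above are fully qualified.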
(ns ns1)
(ns ns2)
(def b 1) ; ❷
(ns-name (.ns #'b))
;; ns2
❶ The Clojure REPL bootstraps itself in the user namespace. We can verify this by creating a definition for the var "a" and asking for the name of its namespace. There is no standard library function to access the namespace of a var object directly, so we use Java interop to read the ns field. ns-name extracts the name of the namespace.
❷ After creating the new namespaces ns1 and ns2, we repeat the operation and create another definition.
This time the var namespace is "ns2".
ns supports a large set of options to alter mappings and aliases at the time the
namespace is created. Most of the options are also available as top-level functions and
can be used independently (using the implicit ns reference). Please refer to the
following functions for the details: refer-clojure, refer, require, use, import and gen-
class. Here’s a sample:
(create-ns 'a)
(ns my.ns
(:refer a) ; ❶
(:refer-clojure :exclude [+ - * /]) ; ❷
(:import java.util.Date) ; ❸
(:require (clojure.set)) ; ❹
(:use (clojure.xml))) ; ❺
❶ :refer copies public vars from the mapping table of namespace "a" into the mapping table of
namespace "my.ns".
❷ :refer-clojure is the same as :refer but for clojure.core only. We see here one of the supported options, :exclude, which prevents a few arithmetic functions from being available in "my.ns".
❸ :import creates a new entry in the "my.ns" mapping table for one or more Java classes. The Java class is added using just the name as key, not the entire package.
❹ :require assumes the existence of a file on the current classpath. The convention is that all dot-separated words except the last form the folder path from the root of the classpath (e.g. "clojure/") and the last word is the name of the file (e.g. "set.clj"). If the file is present, the corresponding namespace is created.
❺ :use loads the corresponding file like :require but additionally :refer to all public symbols.
Compared to ns and in-ns, create-ns is purely about creating namespaces. The only
side effect of using create-ns is the entry created or removed in the global namespace
repository, which immediately enables resolving of vars:
ns1/v1 ; ❶
;; CompilerException java.lang.RuntimeException: No such namespace: ns1
(create-ns 'ns1) ; ❸
(intern 'ns1 'v1 "now it's working") ; ❹
(contains? (ns-map 'ns1) 'v1) ; ❺
;; true
ns1/v1 ; ❻
;; "now it's working"
❶ The namespace "ns1" does not exist. We get an error if we try to access anything in that namespace.
❷ The global namespace repository confirms there is no such namespace.
❸ create-ns creates the namespace (and doesn’t change the current one).
❹ intern creates a new var "v1" in ns1.
❺ We also verify that intern added the related entry in "ns1"'s mapping table.
❻ The simple expression now resolves correctly, looking up the namespace in the global repository, then
the symbol in the mapping table.
❼ One final check for the presence of the newly created namespace.
ns1/v1 ; ❷
;; CompilerException java.lang.RuntimeException: No such namespace: ns1
More dangerously, mappings to the same var in other namespaces prevent garbage
collection of the namespace and, if the namespace is recreated, can become stale:
(create-ns 'disappear) ; ❶
my-var
;; 0
(remove-ns 'disappear) ; ❷
(.ns #'my-var)
;; #object[clojure.lang.Namespace 0x1f780201 "disappear"]
(create-ns 'disappear) ; ❸
(intern 'disappear 'my-var 1)
my-var ; ❹
;; 0
@#'disappear/my-var ; ❺
;; 1
❶ This sequence of calls creates a new namespace "disappear", including a var "my-var". The var is imported into the mappings of the current namespace, where it evaluates normally to 0.
❷ After removing the namespace "disappear" we can see that "my-var" as it appears in the mappings of
the current namespace is keeping the namespace alive.
❸ A namespace with the same name and var is created again, this time with value 1.
❹ However, the local "my-var" entry still points at the old copy of the var.
❺ We can see that the new var should evaluate differently.
❶ ns-aliases shows the content of the alias table for namespace "com.web.tired-of-typing-this.myns".
The name is annoyingly long on purpose.
A var from another namespace is readily available without doing anything special:
(intern 'com.web.tired-of-typing-this.myns 'myvar 0) ; ❶
com.web.tired-of-typing-this.myns/myvar ; ❷
;; 0
However, it would be nice to give the namespace a shorter name that is not subject to
the restrictions in place for Java packaging. We can create a new alias for the
namespace like this:
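The call itself is not included in this excerpt; it would look something like this (the alias name t is an arbitrary choice, and create-ns/intern are repeated for self-containment):

```clojure
(create-ns 'com.web.tired-of-typing-this.myns)
(intern 'com.web.tired-of-typing-this.myns 'myvar 0)

;; add an entry to the alias table of the current namespace:
(alias 't 'com.web.tired-of-typing-this.myns)

t/myvar
;; => 0
```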
❶ alias adds an entry to the alias table for the current namespace. The entry uses the wanted name as
the key and the namespace it should resolve to as the value.
❷ We can now use the shorter form.
ns offers a similar feature with :require and the :as option, but only if the namespace
can load from a file (which is the most common case):
(ns anotherns (:require [clojure.set :as s])) ; ❶
(ns-aliases 'anotherns) ; ❷
;; {s #object[clojure.lang.Namespace 0x5d1fa08b "clojure.set"]}
❶ ns (and require) offers the option of creating the namespace and the alias at the same time using
the :as option. This works only if the namespace is defined on a file in the classpath.
❷ We check the alias table to see that the symbol "s" is now pointing at the clojure.set namespace.
A final mention for ns-unalias, which not surprisingly, removes an alias entry from
the alias table. With reference to the preceding example, we can decide to remove the
alias:
(ns-aliases 'anotherns) ; ❶
;; {s #object[clojure.lang.Namespace 0x5d1fa08b "clojure.set"]}
(ns-unalias 'anotherns 's) ; ❷
(ns-aliases 'anotherns) ; ❸
;; {}
❶ The alias to clojure.set created previously is still visible in the alias table.
❷ ns-unalias takes a namespace symbol and the entry key to remove.
❸ The alias was removed as expected.
;; sort-by #'clojure.core/sort-by,
...
We can remove an entry from the mapping (for example to remove the content or a
deleted namespace) using ns-unmap:
(ns-unmap 'myns '+) ; ❶
;; nil
(+ 1 1) ; ❷
;; Unable to resolve symbol: +
The "+" function has not disappeared from the system, it’s just unavailable in the
"myns" namespace. We can put it back with refer:
(refer 'clojure.core :only ['+]) ; ❶
(+ 1 1)
;; 2
❶ refer has access to the mapping table of the namespace and adds back the removed entry.
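The examples below rely on a clean-ns helper defined earlier in the book but not shown in this excerpt. A plausible definition (an assumption based on the behavior described) unmaps every entry from the namespace, defaults included:

```clojure
;; remove every mapping (vars and imported classes alike) from a namespace:
(defn clean-ns [ns-sym]
  (doseq [sym (keys (ns-map ns-sym))]
    (ns-unmap ns-sym sym))
  (ns-map ns-sym))

;; for example:
(create-ns 'scratch.ns)
(clean-ns 'scratch.ns)
;; => {}
```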
(ns myns)
(#'user/clean-ns 'myns)
(clojure.core/alias 'c 'clojure.core) ; ❷
(c/ns-map 'myns) ; ❸
;; {}
❶ clean-ns completely removes any mapping from the mapping table of a given namespace, getting rid
of all defaults in the process.
❷ After cleaning the namespace, we need a quick way to access the functions in the standard library,
without importing any mapping. We can use alias to create an alias "c" to clojure.core.
❸ We can see that the namespace is definitely empty.
Now that we cleared the namespace, we can add a few definitions to test the filters:
(def normal-var :public) ; ❶
(def ^:private private-var :private)
(c/import 'java.lang.Number)
(c/ns-map 'myns) ; ❷
;; {private-var #'myns/private-var,
;; Number java.lang.Number,
;; normal-var #'myns/normal-var}
(c/ns-publics 'myns) ; ❸
;; {normal-var #'myns/normal-var}
(c/ns-interns 'myns) ; ❹
;; {private-var #'myns/private-var
;; normal-var #'myns/normal-var}
(c/ns-imports 'myns) ; ❺
;; {Number java.lang.Number}
❶ Note that def is a special form and it doesn’t have an entry in the mapping table of clojure.core or
any other namespace, so we don’t need to prefix the call with c/def.
❷ The setup for the mappings is complete.
❸ ns-publics retrieves only entries where the value is a public var.
❹ ns-interns retrieves only entries where the value is a public or private var.
❺ ns-imports retrieves only entries where the value is a class (more properly, anything that is not a
var).
(clean-ns 'myns) ; ❶
;; {}
(ns-map 'myns) ; ❸
;; {minus #'clojure.core/-
;; plus #'clojure.core/+}
refer-clojure is pretty much the same as refer but restricted to use clojure.core as
the source for importing. So the previous could be written:
(binding [*ns* (the-ns 'myns)] ; ❶
(refer-clojure
:only ['+ '-]
:rename {'+ 'plus '- 'minus}))
require
require's unit of work is the "library", a file available on the classpath that follows a
specific naming convention. As a side effect of loading a library, it also creates a new
namespace:
(contains? (set (map ns-name (all-ns))) 'clojure.set) ; ❶
;; false
(require 'clojure.set)
❶ Assuming a freshly started REPL session from the Clojure uberjar, the clojure.set namespace does
not exist, despite the fact that the classpath contains a file called clojure/set.clj that contains the
recipe to create that namespace. But the file was never required.
❷ After calling require on the namespace that will be created once the file is loaded, the namespace
appears in the global repository.
We can use loaded-libs to verify what libraries have been loaded so far. This can be
the result of calling require or use explicitly, or the result of walking the dependency
tree of a required library:
❶ This is the typical result of running loaded-libs on a freshly opened REPL. We can see a few of the usual files from the Clojure standard library.
❷ An interesting aspect to verify is that there is definitely a namespace corresponding to each library, but the opposite is not necessarily true. Namespaces created with ns, in-ns or create-ns, for instance, are not registered as libraries.
❸ We read the results of clojure.data/diff as: there are no libs that are not also namespaces (good, we
wanted to hear that).
❹ But there are a few namespaces that don't have a library recorded, although things like clojure.set or clojure.data are definitely there as files in the distribution. One reason for this is that functions like ns start out with basic bootstrap functionality and are redefined later in the bootstrap process. Some namespaces are created by the early version of ns while others pass through the later macro redefinition.
❺ Finally, the list of namespaces with a corresponding library file.
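The behavior described above can be verified directly (a small sketch; not.a.lib is an arbitrary name):

```clojure
;; loaded-libs returns the sorted set of library symbols loaded so far:
(require 'clojure.set)
(contains? (loaded-libs) 'clojure.set)
;; => true

;; a namespace created without a backing file is not recorded as a library:
(create-ns 'not.a.lib)
(contains? (loaded-libs) 'not.a.lib)
;; => false
```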
Note that require is not going to work on a namespace that is not backed by a file,
even if the namespace already exists:
(create-ns 'myns) ; ❶
(require 'myns) ; ❷
;; Could not locate myns__init.class or myns.clj on classpath
require is often used with the :as or :refer options. This is true especially in
conjunction with the ns macro, but they can be used directly:
(ns myns)
(require
'[clojure.set ; ❶
:as se ; ❷
:refer [union]] ; ❸
'[clojure.string ; ❹
:as st
:refer :all]) ; ❺
❶ Note the use of the square brackets. They are required to include options, as there could be different sets of options for different libraries.
❷ :as creates an alias in the current namespace that is resolved to the library.
❸ :refer imports the specified symbols in the namespace mappings.
❹ You can list many libraries in a single require call.
❺ When :refer :all is specified, all public vars from the library are imported into the namespace
mappings.
The general wisdom about importing mappings from libraries, is to restrict them to the
minimum necessary for readability and prefer aliases when possible. Once all symbols
from a library are imported into the current namespace mappings, it becomes difficult
to understand where they are defined, with the risk of polluting the namespace of
unwanted mappings over time.
use
use mixes the semantics of require with the options of refer:
(ns myns)
(use '[clojure.java.io ; ❶
:only [reader file] ; ❷
:rename {reader r}] ; ❸
:verbose
:reload-all) ; ❹
;; (load "/clojure/java/io")
;; (in-ns 'myns)
;; (refer 'clojure.java.io :only '[reader file] :rename '{reader r})
❶ use requires quoted symbols when used outside the ns macro. We can quote the vector to quote all
symbols within.
❷ :only restricts the set of symbols to import into the local namespace mapping. :exclude is also supported, with the opposite intent.
❸ :rename offers the option of interning symbols under a different name than the one in the origin namespace.
❹ use and require also support :reload, :reload-all and :verbose. Reloading forces a reload of the file to re-sync with possible changes on the file system. :reload-all also reloads any transitive dependency. :verbose prints information about namespace loading, in particular regarding the dependency tree.
use does not support aliasing with :as, which could lead to very long lists of :only
symbols. Probably for this reason it has been abused in the past to import all symbols
all the time, attracting a good amount of bad press. Nowadays, require with aliasing
offers a more scalable option to namespace dependencies than use and is generally
preferred. But use can still be used for renaming that is not supported by require.
import
import is the macro equivalent of refer for classes instead of vars:
(clean-ns 'myns) ; ❶
(binding [*ns* (the-ns 'myns)] ; ❷
(import '[java.util ArrayList HashMap])) ; ❸
(ns-imports 'myns) ; ❹
;; {HashMap java.util.HashMap
;; ArrayList java.util.ArrayList}
❶ Please see refer about clean-ns. It deletes all mappings for a given namespace.
❷ import does not accept an origin namespace as a parameter, so we temporarily swap the current namespace with binding.
❸ import does not support options, but it can import many classes at once. The vector shows how to
group several class names from the same package.
❹ We can see that the requested classes have been added to the namespace mapping table.
find-ns retrieves a namespace object given its name as a symbol, assuming the
namespace exists:
(find-ns 'clojure.edn) ; ❶
;; #object[clojure.lang.Namespace 0x20312893 "clojure.edn"]
(find-ns 'no-ns) ; ❷
;; nil
❶ find-ns retrieves the namespace object corresponding to the given symbol, assuming the
namespace was created at some point.
❷ If we try with a namespace not yet created, we unsurprisingly get nil.
(the-ns 'notavail) ; ❶
;; Exception No namespace: notavail found
(the-ns 'clojure.edn) ; ❷
;; #object[clojure.lang.Namespace 0x20312893 "clojure.edn"]
(the-ns *ns*) ; ❸
;; #object[clojure.lang.Namespace 0xcc62a3b "user"]
ns-name works on top of the-ns to access the name of the namespace as a symbol.
This is useful if you have the namespace object and want to transform it into a symbol
that can be used as a key:
(ns com.package.myns)
(ns-name *ns*) ; ❶
;; com.package.myns
❶ ns-name transforms a namespace object (here corresponding to the current namespace) into the
corresponding name as a symbol.
(namespace ::a)
;; "user"
(namespace 'alsosymbols/s)
;; "alsosymbols"
231
Non-functional requirements of an application are all aspects of software that are not directly driven by business
requirements: logging, tracing, performance, stability, robustness, etc. have all an impact on code, but they are not the
main goal of the application.
hinting and so on. We also used metadata in interesting ways throughout the book.
Please check the following:
• We used metadata in defn to mark functions for benchmarking.
• Metadata was used to store database mapping in array-map
• sorted-set has an example of storing timestamps in metadata.
There are three main families of metadata support and each family has access to
specific functions:
1. Read: metadata are present on the object but we can only read them with meta.
2. Clone: the object supports the creation of a new object of the same type and value
of the old, but with different metadata using with-meta or vary-meta.
3. Write: the object supports thread-safe mutation of metadata without cloning into a
new object with alter-meta! and reset-meta!.
Many Clojure objects in the standard library are metadata-aware:
• Persistent data structures usually support reading and cloning: lists, vectors, sets,
maps.
• Lazy sequences also support reading and cloning: ranges, cons, repeat, iterate etc.
• Reference types have mutable (but thread-safe) metadata: vars, atoms, refs, agents.
An exception in this group are namespaces, which are not references but support
mutable metadata. The functions to mutate metadata are alter-meta! and reset-meta!.
• Other objects support a mix of the different kinds of metadata support: symbols, functions, subvectors.
meta returns the metadata attached to an object, or nil otherwise:
;; {:added "1.2",
;; :ns #object[clojure.lang.Namespace 0x1edb61b1 "clojure.core"],
;; :name +,
;; :file "clojure/core.clj",
;; :inline-arities
;; #object[clojure.core$_GT_1_QMARK_ 0x7b22ec89 "GT_1_QMARK"],
;; :column 1,
;; :line 965,
;; :arglists ([] [x] [x y] [x y & more]),
;; :doc
;; "Returns the sum of nums."
;; :inline
;; #object[clojure.core$nary_inline 0x790132f7 "clojure.core$nary_inline"]}
(meta 1) ; ❷
;; nil
❶ meta shows a quite rich set of metadata for the function "+" from the standard library. Metadata are attached to the var object, not the symbol "+".
❷ We can point meta to unsupported objects without having to catch an exception.
❶ with-meta stores the information about the initial count at the time the vector was created.
❷ The fact that new elements are added to the vector doesn’t change its metadata.
❸ Careful, because with-meta completely replaces an existing set of metadata.
❹ The policy for metadata migration between data structures differs depending on the function. into is designed to create a new data structure using a copy of the content of another to begin with, but without sharing the same metadata.
❺ Although the exception message is not particularly clear, you can't use with-meta if the object supports mutable metadata. It wouldn't make a lot of sense to "clone" an atom or a ref when their purpose is to handle safe mutation instead of persistence.
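The listing the callouts above refer to is not included in this excerpt; a minimal reconstruction of the behaviors they describe (v1 and the metadata keys are illustrative):

```clojure
(def v1 (with-meta [1 2 3] {:initial-count 3}))

(meta (conj v1 4))            ;; => {:initial-count 3}  growing the vector keeps metadata
(meta (into [] v1))           ;; => nil                 into does not copy the source metadata
(meta (with-meta v1 {:a 1}))  ;; => {:a 1}              with-meta replaces metadata wholesale

;; (with-meta (atom 0) {})    ;; would throw: atoms have mutable metadata only
```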
with-meta replaces any existing set of metadata if they are already present. If you want
to preserve the existent, or update metadata selectively, you can use vary-meta instead:
(def v (with-meta [1 2 3]
{:initial-count 3 :last-modified #inst "1985-04-12"})) ; ❶
(meta v)
;; {:initial-count 3
;; :last-modified #inst "1985-04-12T00:00:00.000-00:00"}
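The vary-meta call itself is elided here; continuing the example, it applies a function to the existing metadata instead of replacing it wholesale (the :reviewed key is an arbitrary example, and v is redefined for self-containment):

```clojure
(def v (with-meta [1 2 3]
         {:initial-count 3 :last-modified #inst "1985-04-12"}))

;; vary-meta applies assoc to the current metadata map:
(meta (vary-meta v assoc :reviewed true))
;; => {:initial-count 3
;;     :last-modified #inst "1985-04-12T00:00:00.000-00:00"
;;     :reviewed true}
```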
(def counter ; ❶
(atom 0
:meta {:last-modified #inst "1985-04-12"}))
(meta counter)
;; {:last-modified #inst "1985-04-12T00:00:00.000-00:00"}
(alter-meta! ; ❷
(do (swap! counter inc) counter)
assoc :last-modified #inst "1985-04-13")
(meta counter) ; ❸
;; {:last-modified #inst "1985-04-13T00:00:00.000-00:00"}
❶ An atom is a type of reference supporting mutable metadata. The atom constructor takes a :meta key at construction time to initialize metadata.
❷ alter-meta! takes a function of the old metadata plus any additional arguments. We use assoc to
selectively change the :last-modified key.
❸ Calling meta on the atom correctly reports the updated timestamp.
Finally, reset-meta! offers the possibility to completely replace the metadata map in
case we are not interested in keeping anything of the old:
(reset-meta! *ns* {:doc "The default user namespace"}) ; ❶
(meta *ns*)
;; {:doc "The default user namespace"}
❶ Namespaces are not reference types, but they support mutable metadata. We use reset-meta! to
specify some documentation about the namespace.
17
Evaluation
The functions illustrated in this section are at the heart of the Read, Eval, Print, Loop
(REPL for short). When you open the Clojure REPL, you're welcomed by a prompt
waiting for a form to evaluate. That's a call to read using the standard input as
argument. After hitting the return key, eval analyzes the forms and emits bytecode that
is put into execution. The result is printed on the screen and the loop starts again. But
even when you are not using the REPL, Clojure uses the same functions extensively to
run a program. There are three main families of functions dedicated to evaluation (with
some overlapping):
• read, read-string and eval work on a single form and produce an in-memory
evaluation.
• compile, load, load-file, load-reader and load-string operate on a library as a group
of forms 232 . compile also produces a file on disk.
• clojure.edn/read and clojure.edn/read-string are the equivalent read and read-
string operations for EDN, the Extensible Data Notation. EDN is a subset of the
Clojure syntax designed specifically for data transport.
232
Please review the introduction to "Var and Namespaces" for an explanation about the concept of library
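The listing that produces output below is elided, and it relies on a reader-from helper that is also not shown in this excerpt. A plausible definition (an assumption based on how it is used) wraps a string in the pushback reader type that read expects:

```clojure
(import '[java.io StringReader]
        '[clojure.lang LineNumberingPushbackReader])

;; build a read-compatible reader on top of a string:
(defn reader-from [s]
  (LineNumberingPushbackReader. (StringReader. s)))

;; read one form from the simulated input:
(def output (read (reader-from "(+ 1 2)")))
```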
output ; ❹
;; (+ 1 2)
(type output) ; ❺
;; clojure.lang.PersistentList
read supports a few options to control the reading process. For example, reader
conditionals are turned off by default 233. :read-cond can be set to
either :allow or :preserve to allow reader conditionals and to preserve all branches
respectively:
(def example ; ❶
"#?(:clj (System/currentTimeMillis)
:cljs (js/Console :log)
:cljr (|Dictionary<Int32,String>|.)
:default <anything you want>)")
❶ #? is the reader conditional macro. It selects one version of the form based on the hosting platform.
❷ reader-from takes a string and creates a suitable reader to use with read.
❸ By default, read does not allow the reader conditional macro and throws an exception.
233
Reader Conditionals are a relatively new feature to support Clojure implementations on other platforms, in particular ClojureScript and ClojureCLR. See https://clojure.org/guides/reader_conditionals for more information
❹ We ask read to enable reader conditionals with :read-cond :allow. The selected form is the one
corresponding to the hosting platform of the Clojure runtime (in this case "clj" means Clojure on the
JVM).
❺ If we use the :preserve keyword, we prevent read from making any choice and it returns the form as is.
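The read calls the callouts describe can be sketched like this (a simplified conditional, not the book's original example; reader-from is repeated for self-containment):

```clojure
(import '[java.io StringReader]
        '[clojure.lang LineNumberingPushbackReader])
(defn reader-from [s] (LineNumberingPushbackReader. (StringReader. s)))

;; on the JVM, :read-cond :allow selects the :clj branch:
(read {:read-cond :allow} (reader-from "#?(:clj :jvm :cljs :js)"))
;; => :jvm

;; :preserve returns an unevaluated reader-conditional object instead:
(reader-conditional?
 (read {:read-cond :preserve} (reader-from "#?(:clj :jvm :cljs :js)")))
;; => true
```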
We can add a new platform using the :features option. The new platform's branch is
selected instead of the one indicated by the :default key:
(def example ; ❶
"#?(:cljs :cljs :my :my :default <missing>)")
Using the :eof option, we can control the behavior of read in case it reaches the end of
file (abbreviated "eof") before reading a form:
(read (reader-from ";; a comment")) ; ❶
;; RuntimeException EOF while reading
❶ A comment skips to the end of the stream without reading a form. read throws an exception in this
case.
❷ If we prefer nil, we can indicate so using the :eof option.
❸ The same control is also available through positional arguments.
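For example (reader-from repeated for self-containment):

```clojure
(import '[java.io StringReader]
        '[clojure.lang LineNumberingPushbackReader])
(defn reader-from [s] (LineNumberingPushbackReader. (StringReader. s)))

;; return nil instead of throwing when only a comment is read:
(read {:eof nil} (reader-from ";; a comment"))
;; => nil

;; the same control through positional arguments (eof-error? and eof-value):
(read (reader-from ";; a comment") false :done)
;; => :done
```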
Besides the options we've seen so far, the reader also complies with the *read-eval*
dynamic var. *read-eval* controls the behavior of the reader when parsing the
read-eval reader macro #=. When present in front of a form, the reader first parses the
form as usual and then invokes eval on it:
(read (reader-from "#=(+ 1 2)")) ; ❶
;; 3
❶ Instead of the expected list of a symbol and two numbers ('+ 1 2) the reader also evaluates the
form.
❶ The System/exit call exits the running JVM. But read just evaluates the form into a list.
❷ The read-eval macro in front of the form forces evaluation of the form, which in this case exits the
JVM.
To prevent evaluation, wrap read in a binding context that sets *read-eval* to false:
(binding [*read-eval* false] ; ❶
(read (reader-from "#=(java.lang.System/exit 0)")))
;; RuntimeException EvalReader not allowed when *read-eval* is false
❶ read throws an exception if a read-eval macro is present in the form and *read-eval* is set to false.
In case you want to prevent reading altogether (for example, to stop large data
structures from loading into memory via read), you can use :unknown instead of false:
(binding [*read-eval* :unknown] ; ❶
(read (reader-from "(+ 1 2)")))
;; RuntimeException Reading disallowed - *read-eval* bound to :unknown
NOTE There is also another feature influencing how data are read: tagged literals. This is a way to
extend the set of available reader syntax macros beyond the ones that are installed by default.
Please check the section on tagged literals to know more.
17.1.2 read-string
The examples so far have created a clojure.lang.LineNumberingPushbackReader on
top of a string to simulate the content of a file (or other input stream). But if you are
dealing directly with a string, read-string works the same as read:
(read-string "(+ 1 2)") ; ❶
;; (+ 1 2)
❶ read-string works exactly like read using a string to create a reader object similarly to what we did
for the example above.
❷ Options are also the same.
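For example:

```clojure
;; read-string accepts the same options map as read:
(read-string {:read-cond :allow} "#?(:clj :jvm :cljs :js)")
;; => :jvm

(read-string {:eof :done} ";; just a comment")
;; => :done
```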
17.2 eval
eval takes an object and returns its evaluated form. If the object is a native sequence
(it's not enough for the object to just be sequential), then eval interprets the first
element in the sequence as a function and the rest of the sequence as arguments:
(eval [+ 1 2]) ; ❶
;; [#object[clojure.core$_PLUS "clojure.core$_PLUS"] 1 2]
(eval '(+ 1 2)) ; ❷
;; 3
❶ A vector is sequential (it produces a sequence when calling seq on it) but is not a sequence
itself. eval doesn’t interpret the vector any further.
❷ A list is a native sequence. eval assumes the first element is a function and the rest are the arguments. eval returns the evaluation of the function on the arguments.
❶ In case of multiple forms wrapped in a do block, eval proceeds to evaluate them all.
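For example (the var names are arbitrary):

```clojure
;; eval walks through every form inside a do block:
(eval '(do (def eval-x 1)
           (def eval-y 2)
           (+ eval-x eval-y)))
;; => 3
```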
(def rules ; ❶
"If the light is red, you should stop
If the light is green, you can cross
If the light is orange, it depends")
parenthesize
(map read-string)
(map #(list* (first %) color (rest %))) ; ❺
(some eval)))
❶ rules contains simple facts about the meaning of the different colors in a traffic light.
❷ Our goal is to transform uppercase If into a macro and the rest of the sentence into arguments we
can manipulate. If has to be a macro so the content of the sentence does not evaluate (there would
be a lot of unknown symbols but we can ignore most of them anyway). We then destructure the input
and use when to verify if the "light" argument is the same color as that described by the sentence.
❸ parenthesize is a small transformation of the input string so that each sentence is wrapped in a set of parentheses. Each sentence then appears to eval as a list, triggering a call to the If macro.
❹ traffic-light orchestrates the process: rules are first wrapped in parenthesis, then transformed
into lists using read-string.
❺ Now that rules are encoded as lists, we inject the missing "light" parameter. Note that at this point we are doing list manipulation, not string processing. The list is ready for eval, which invokes the If macro and returns the answer.
❻ These are a few examples with different colors to test the different answers.
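The full listing is only partially visible in this excerpt. A possible reconstruction of the If macro, parenthesize and traffic-light, following the callouts above (the details are assumptions, not the book's exact code):

```clojure
(require '[clojure.string :as string])

(def rules
  "If the light is red, you should stop
If the light is green, you can cross
If the light is orange, it depends")

;; If compares the injected color with the one in the sentence at
;; macro-expansion time; the rest of the sentence is never evaluated.
(defmacro If [color _ _ _ expected _ & outcome]
  (when (= color expected)
    `'~outcome))

;; wrap each sentence in parentheses so eval sees a call to If:
(defn parenthesize [s]
  (map #(str "(" (string/trim %) ")") (string/split-lines s)))

;; read each sentence, inject the color, then evaluate:
(defn traffic-light [color]
  (->> (parenthesize rules)
       (map read-string)
       (map #(list* (first %) color (rest %)))
       (some eval)))

(traffic-light 'red)    ;; => (should stop)
(traffic-light 'green)  ;; => (can cross)
(traffic-light 'orange) ;; => (depends)
```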
(test #'sqrt) ; ❷
;; RuntimeException sqrt(4) should be 2
❶ defn accepts metadata in several locations. We need to provide a :test key with a function of no arguments as value. The best placement to enhance readability (also depending on the length of the testing function) is before the arguments.
❷ We can prove the algorithm works as expected by calling test on the var object that contains the function. In this case we encounter a surprise, as our function guesses an approximate square root of 4 using the Newton method which is very close to, but not exactly, 2.
As you can see from the example, test recognizes the failure only in the presence of
an exception. To reduce the amount of boilerplate necessary to prepare and throw the
exception, we can use assert. assert is a macro that evaluates the given expression
for truthiness and throws an exception otherwise:
(assert (= 1 (+ 3 3)) "It should be 6") ; ❶
;; AssertionError Assert failed: It should be 6
;; (= 1 (+ 3 3))
We can now see how to use assert in the previous sqrt function:
(defn sqrt
{:test #(assert (== (sqrt 4) 2.) "sqrt(4) should be 2")} ; ❶
[x]
(loop [guess 1.]
(if (> (Math/abs (- (* guess guess) x)) 1e-8)
(recur (/ (+ (/ x guess) guess) 2.))
guess)))
(test #'sqrt) ; ❷
;; AssertionError Assert failed: sqrt(4) should be 2
;; (== (sqrt 4) 2.0)
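With a tolerance in the assertion instead of strict equality (a variation on the example above, not part of the original listing), the test passes and clojure.core/test returns :ok:

```clojure
(defn sqrt
  ;; accept any guess within the same 1e-8 tolerance used by the loop:
  {:test #(assert (< (Math/abs (- (sqrt 4) 2.0)) 1e-8)
                  "sqrt(4) should be close to 2")}
  [x]
  (loop [guess 1.]
    (if (> (Math/abs (- (* guess guess) x)) 1e-8)
      (recur (/ (+ (/ x guess) guess) 2.))
      guess)))

(test #'sqrt)
;; => :ok
```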
17.4.1 load
load is dedicated to the evaluation of "libraries". A library is a file that contains
Clojure source code that conforms to the following conventions:
• The code lives in a file available from inside the classpath of the running process.
• The file contains a namespace declaration with the same name as the relative path
of the file (replacing "/" with dots ".").
To load a library, we need a file path as a string. If the file path starts with "/" (forward
slash) the file loads from the root of the classpath. If it doesn’t start with "/" then it is
assumed that the file path starts from the location of the current namespace:
(load "/clojure/set") ; ❶
;; nil
(ns clojure.set) ; ❸
(load "zip")
❶ "clojure/set.clj" is a library present in every Clojure distribution. To load the library from the root of the
classpath, we prefix its name with "/". You have to omit the ".clj" file extension.
❷ The set library is effectively loaded and we can use it, although we need to use the explicit prefix for
function calls.
❸ If we now move to the clojure.set namespace, we also virtually move to the "clojure" folder which
contains other libraries such as "zip".
load supports a useful verbose mode that prints every loaded file while traversing the
dependency tree. Use the *loading-verbosely* dynamic var to activate this feature:
(binding [clojure.core/*loading-verbosely* true]
(load "criterium/core")) ❶
;; (clojure.core/load "/criterium/core")
;; (clojure.core/in-ns 'criterium.core)
;; (clojure.core/refer 'clojure.set)
;; (clojure.core/load "/criterium/stats")
;; (clojure.core/in-ns 'criterium.core)
;; (clojure.core/refer 'criterium.stats)
;; (clojure.core/load "/criterium/well")
❶ load supports verbose output to print all dependencies loaded during traversal, including additional information on created aliases and refers.
17.4.2 load-file
load-file evaluates a file not necessarily located in the Java classpath. load-file is a
good choice for running Clojure scripts (Clojure programs that run and terminate from
a single file):
(spit "source.clj" ; ❶
"(ns ns1)
(def a 1)
(def b 2)
(println \"a + b =\" (+ a b))")
(load-file "source.clj") ❷
;; a + b = 3
❶ source.clj contains a simple Clojure program that defines two vars "a" and "b" and sums them up.
❷ load-file without a forward slash uses the relative location starting from where the current process
was started (the content of the "user.dir" Java property).
17.4.3 load-string
load-string is substantially equivalent to read-string followed by eval:
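For example:

```clojure
;; load-string parses and evaluates the string in one step:
(load-string "(+ 1 2)")
;; => 3

;; the equivalent two-step version:
(eval (read-string "(+ 1 2)"))
;; => 3
```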
❶ load-string performs both parsing and evaluation of a string. We can see it produces the same result as read-string followed by eval.
However, there are a few differences which make load-string more suitable to load
the content of a file (as a string):
• load-string does not support options, while read-string can be instructed about
several aspects of the reading process (please see read-string for a list of the
supported options).
• load-string does not require wrapping in a do block to read multiple
forms. read-string on the other hand, reads a single form only.
• load-string keeps track of line numbering in vars metadata.
Let’s investigate the last aspect with an example:
(ns user) ; ❶
(def code "(do (def a 1)\n(def b 2)\n(def c 3))")
(ns code1) ; ❷
(meta (load-string user/code))
(:line (meta #'c))
;; 3
(ns code2) ; ❸
(eval (read-string user/code))
(:line (meta #'c))
;; 1
❶ code is a string containing var definitions separated by newline, forming the equivalent of a file with
three lines.
❷ A new namespace "code1" loads and evaluates the content of user/code. Metadata on the last var definition contains the expected line number.
❸ Another namespace "code2" loads and evaluates user/code using an eval and read-string combination. The var definition still reports line 1.
17.4.4 load-reader
load-reader behaves exactly like load-string or load-file, but it requires a
java.io.Reader input type. load-reader is useful to control the specific type of reader
to use.
17.5 compile
compile accepts a library path as a symbol and performs parsing and evaluation similarly
to load 234. It additionally dumps the generated bytecode to disk:
(spit "src/source.clj" ; ❶
"(ns source)
(defn plus [x y] (+ x y))")
❶ Let’s create a simple Clojure library called "source". The related file is saved in "src/source.clj", which
is part of the classpath.
❷ *compile-path* needs to be set. It might already be set on your system, but in case it is not, let's set it to the "target/classes" folder, also part of the classpath.
compile produces a few class files on disk for "source.clj". Some of them initialize the
namespace and associated vars, including static loading to register the namespace in
the global repository and related vars in the namespace mapping table. Another class
file implements the "plus" function. compile produces additional class files
for each function (including anonymous ones) in the input library. The availability of the
generated classes in the classpath is sufficient to make sure that namespaces and vars
declared therein are available after bootstrap.
Compilation with compile is also called Ahead of (run) Time compilation (or briefly
AOT). AOT compilation is useful for several reasons. On the plus side:
• Eliminates the need for distributing Clojure sources. Once class files are generated
with compile they take precedence over sources. Class files are also amenable to
"obfuscation", the process by which class files are made difficult to read or
decompile back into sources.
• Reduces the application startup time, which is useful especially for large
applications.
• Makes the application available to other languages on the JVM.
AOT compilation also removes some flexibility:
234
we talked about libraries in load and vars
• Fixes the code at a specific version in time. This is especially relevant for Clojure
libraries (Clojure code designed to be used by other applications). It also fixes the
class file format to a specific version of the Java runtime.
• Adds complexity to the build process.
• Adds complexity to the testing process, as some inconsistencies are introduced by
the different order in which the application loads as classes compared to the same
application as Clojure sources. Some bugs become visible only when testing the
application after it has been AOT compiled.
The choice of compiling a Clojure application ahead of time is therefore a trade-off
between the additional complexity introduced by AOT and the advantages it gives in
distributing the application.
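The following examples compare clojure.core/read-string with clojure.edn/read-string. They assume requires along these lines (a sketch; the original setup is not reproduced here):

```clojure
(require '[clojure.edn :as edn])    ; ❶ clojure.edn needs an explicit require
(require '[clojure.core :as core])  ; an alias to clojure.core, just for clarity
```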
(core/read-string "@#'+") ; ❷
;; (clojure.core/deref (var +))
(edn/read-string "@#'+") ; ❸
;; RuntimeException Invalid leading character: @
❶ The clojure.edn namespace is not available by default and requires an explicit require. Just for this
example and the following ones, we also add an alias to clojure.core for clarity.
❷ The "@" sign in front of an expression is equivalent to calling the deref function and is interpreted by
the reader.
❸ EDN does not have such a reader macro (along with several others) and throws an exception.
(edn/read-string ; ❷
{:readers {'point identity}}
"#point [1 2]")
;; [1 2]
(edn/read-string ; ❸
{:readers {'inst (constantly "override")}}
"#inst \"2017-08-23T10:22:22.000-00:00\"")
;; "override"
The :default option creates a default implementation if a tagged literal is not found
in default-data-readers or data-readers:
(edn/read-string ; ❶
{:default #(format "[Tag '%s', Value %s]" %1 %2)}
"[\"There is no tag for \" #point [1 2] \"or\" #line [[1 2] [3 4]]]")
❶ Trying to read the "#point" or "#line" tags would normally result in an exception. We can handle all
missing tags with the :default option, which takes a function of two arguments: the tag name and its value.
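The callouts below refer to building such objects with tagged-literal; a minimal sketch:

```clojure
(def tl (tagged-literal 'point [1 2]))  ; ❶ a new TaggedLiteral object

tl          ; printed nicely by Clojure
;; #point [1 2]

(:tag tl)   ; ❷ key access to the tag
;; point

(:form tl)  ; ❷ key access to the form
;; [1 2]
```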
❶ tagged-literal creates a new TaggedLiteral object. Clojure knows how to print them nicely.
❷ We can also access the :tag or the :form by keys.
Tagged literal objects can be used whenever the reader requires a custom function to parse an
unknown tag, sparing some typing:
(require '[clojure.edn :as edn])
(edn/read-string
{:default tagged-literal} ; ❶
"[\"There is no tag for \" #point [1 2] \"or\" #line [[1 2] [3 4]]]")
;; ["There is no tag for " #point [1 2] "or" #line [[1 2] [3 4]]]
❶ We've seen this example in edn/read-string where we used an anonymous function of 2
arguments. tagged-literal receives the unregistered tag point and the related form as arguments.
❷ read and read-string don't support a :default option. However, the dynamic var *default-data-
reader-fn* serves the same purpose.
17.8 default-data-readers
default-data-readers retrieves the default data readers installed with Clojure.
Currently, Clojure ships with the following data readers:
default-data-readers ; ❶
;; {inst #'clojure.instant/read-instant-date,
;; uuid #'clojure.uuid/default-uuid-reader}
❶ default-data-readers contains a mapping between the name of the tag as a symbol and the
function of one argument that is going to receive the form as read by the reader.
Data readers have been introduced along with EDN to allow easy "round-tripping" of
data to and from strings:
(def date (edn/read-string "#inst \"2017-08-23T10:22:22.000-00:00\"")) ; ❶
❶ "#inst" is a default tagged literal instructed to parse the string that follows as
a java.util.Date object.
❷ Tagged literals are designed so they write and read as strings, enabling round-trip data exchange over a
network or file. We can verify that transforming a date to a string and reading that string back produces
an object equal to the original.
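The verification in the second callout can be sketched as follows (repeating the definitions for completeness):

```clojure
(require '[clojure.edn :as edn])

(def date (edn/read-string "#inst \"2017-08-23T10:22:22.000-00:00\""))

;; print the date back to a string and read it again:
;; the result is equal to the original object.
(= date (edn/read-string (pr-str date)))
;; true
```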
The dynamic var *data-readers* uses the same format as default-data-readers and,
when bound, allows adding new data readers or modifying the default ones:
(binding [*data-readers* {'uuid (constantly 'UUID)}] ; ❶
(read-string "#uuid \"374c8c4-fd89-4f1b-a11f-42e334ccf5ce\""))
;; UUID
❶ *data-readers* offers a way to change the default readers or add new ones.
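The callouts below refer to reading reader conditionals with read-string; a sketch of what that looks like (the original example is not reproduced here):

```clojure
;; ❶ without splicing: the form is read literally
(read-string {:read-cond :allow} "#?(:clj [1 2 3])")
;; [1 2 3]

;; ❷ with splicing: the elements are unwrapped into the enclosing collection
(read-string {:read-cond :allow} "[0 #?@(:clj [1 2 3])]")
;; [0 1 2 3]
```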
❶ When reading a reader conditional without splicing, the reader interprets the corresponding form
literally, in this case the vector [1 2 3].
❷ Splicing unwraps the form (assuming it is inside a collection) and retrieves just the elements.
If we preserve a reader conditional instance instead of evaluating it, we can access its
form and splicing flag:
(def parse (read-string {:read-cond :preserve} "#?(:clj [1 2 3])"))
(:form parse) ; ❶
;; (:clj [1 2 3])
(:splicing? parse) ; ❷
;; false
❶ A reader conditional object offers a :form key to access the matching form, including the platform.
❷ Another :splicing? key accesses the splicing status, in this case false (there is no "@" sign).
format, printf and cl-format are functions dedicated to string formatting. format is a
wrapper around Java's String::format method, which is inspired by the
venerable printf function from the C language. In Clojure printf is a small function
wrapping print with format.
cl-format is instead a port of Common Lisp's format function, formerly an external
package called XP 236.
We’ve seen both format and cl-format in action in the book. Here’s a few pointers to
interesting examples for review:
• In memoize we used format to print the cache hit or miss information.
• In rand we used format to print a text-based progress handler.
• In vec we used format to render a simple JSON snippet.
• We used cl-format in Chapter 1 when describing how to improve printing of
decimal values in the XML example.
format has a rich set of formatting directives. The reader is invited to check
the java.util.Formatter Java documentation for the full details, but here’s a group of
useful examples:
(format "%3d" 1) ;; "  1" ; ❶
(format "%03d" 1) ;; "001" ; ❷
(format "%.2f" 10.3456) ;; "10.35" ; ❸
236
The XP pretty printing library detailed description is available at dspace.mit.edu/bitstream/handle/1721.1/6503/AIM-
1102.pdf. The paper also contains historical notes linking our Clojure cl-format all the way back to MacLisp original print
system in 1977.
237
The clojure.pprint.cl-format namespace source is well documented and worth a
read: github.com/clojure/clojure/blob/master/src/clj/clojure/pprint/cl_format.clj
In the following example we see cl-format in action to wrap text to a specific line
size 238:
(def paragraph
["This" "sentence" "is" "too" "long" "for" "a" "small" "screen"
"and" "should" "appear" "in" "multiple" "lines" "no" "longer"
"than" "20" "characters" "each" "."])
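The wrapping call itself is not shown above. A plausible sketch uses the justification directive from the tutorial cited in the footnote; the exact directive in the original example may differ:

```clojure
(require '[clojure.pprint :refer [cl-format]])

(def paragraph
  ["This" "sentence" "is" "too" "long" "for" "a" "small" "screen"
   "and" "should" "appear" "in" "multiple" "lines" "no" "longer"
   "than" "20" "characters" "each" "."])

;; ~{…~} iterates the collection; ~<~%~1,20:;~A~> inserts a newline
;; whenever the next word would cross the 20-character margin.
(cl-format true "~{~<~%~1,20:;~A~> ~}" paragraph)
```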
As you have seen cl-format is quite powerful, but it takes some time to use
proficiently: there are many directives and their syntax can be difficult to read. When a
directive becomes too long or too complicated, the user should consider longer but
explicit alternatives, for example using sequential processing.
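The callouts that follow describe a comparison between pr and print on the same arguments; a minimal sketch:

```clojure
(pr "s" 's \s)     ; ❶ quoting preserved for reading back
;; "s" s \s
;; nil

(print "s" 's \s)  ; ❷ human-readable, no quoting
;; s s s
;; nil
```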
❶ Both outputs end with a nil, which is an artifact of the REPL printing the result of the last evaluated
expression. Both pr and print are side-effecting functions returning nil; this nil is printed after
printing to the standard output. We can see that pr distinguishes between strings, symbols and
characters by printing them with surrounding double quotes, without the leading quote, and with a
leading backslash \, respectively.
238
This example is adapted from the following paper: cybertiggyr.com/fmt/fmt.pdf
❷ println prints the three types of objects the same way, removing any quoted decoration.
What makes the pr functions suitable to be read back by the Clojure reader is the
presence of specific quoting that helps the reader interpret the character stream. You
could argue that from the human readability perspective there isn't much of a
difference, but let's have a look at a Java map:
(import 'java.util.HashMap)
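java-map is not defined in the text above; presumably it is something along these lines (an assumption, inferred from the outputs shown below):

```clojure
(import 'java.util.HashMap)

;; a mutable Java map built from a Clojure map literal
(def java-map (HashMap. {:a "1" :b nil}))
```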
(prn java-map) ; ❶
;; {:a "1", :b nil}
(println java-map) ; ❷
;; #object[java.util.HashMap 0x1ffddcad {:a=1, :b=null}]
❶ prn appends a new line to standard output after printing its arguments. The evaluation of the form
is nil and the REPL prints it after the new line: in this example the nil was omitted for clarity.
❷ The human readable output for java-map contains the Java object hash in hexadecimal, the name of
the class and the content of the map. As you can see nil is printed as null.
The four functions ending with *-str return their content as a string instead of printing
to the current value of *out* (a dynamic variable pointing at standard output by
default):
(def data {:a [1 2 3]
:b '(:a :b :c)
:c {"a" 1 "b" 2}})
(pr-str data) ; ❶
;; "{:a [1 2 3], :b (:a :b :c), :c {\"a\" 1, \"b\" 2}}"
(prn-str data) ; ❷
;; "{:a [1 2 3], :b (:a :b :c), :c {\"a\" 1, \"b\" 2}}\n"
(print-str data) ; ❸
;; "{:a [1 2 3], :b (:a :b :c), :c {a 1, b 2}}"
(println-str data) ; ❹
;; "{:a [1 2 3], :b (:a :b :c), :c {a 1, b 2}}\n"
❶ pr-str is like pr but the result is returned as a string instead of being printed to
standard output (the default value of *out*).
❷ prn-str just adds a new line to the previous output by pr-str.
❸ print-str is like pr-str except that some objects print differently, strings for example (note the
double quotes surrounding them in the first two examples only).
❹ println-str appends an additional new line at the end of the string, visible as \n.
In the case of pr, prn, print and println, *out* can be bound
with binding to output to an alternate Java OutputStream or Writer (two Java
abstractions for writing bytes and characters, respectively):
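A sketch of the redirection the callouts below describe (the file name numbers.txt is hypothetical):

```clojure
(require '[clojure.java.io :as io])           ; ❶ wrappers around the Java IO framework

(with-open [w (io/writer "numbers.txt")]      ; ❷ a BufferedWriter, closed after the body
  (binding [*out* w]                          ; ❸ temporarily swap *out* with the writer
    (print (range 100000))))                  ; ❹ the range is generated lazily
```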
❶ clojure.java.io is a namespace that is part of the standard library. It contains functions wrapping
the Java IO framework.
❷ io/writer returns a BufferedWriter object "w". with-open makes sure that the buffer is closed
after evaluating the body. Buffering accumulates bytes in memory before writing to disk, limiting the
number of transmissions to physical disk (an expensive operation).
❸ binding temporarily swaps the current value of out with the newly created writer.
❹ print is instructed to output a long range of numbers. Note that range creates the 100000 numbers
lazily.
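The following examples assume pprint and pp are referred and data is defined along these lines (a sketch; the original definition is not shown here):

```clojure
(require '[clojure.pprint :refer [pprint pp]])

(def data {:a ["red" "blue" "green"]
           :b '(:north :south :east :west)
           :c {"x-axis" 1 "y-axis" 2}})
```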
data ; ❶
;; {:a ["red" "blue" "green"], :b (:north :south :east :west), :c {"x-axis" 1, "y-
axis" 2}}
(pp) ; ❷
;; {:a ["red" "blue" "green"],
;; :b (:north :south :east :west),
;; :c {"x-axis" 1, "y-axis" 2}}
(pprint data) ; ❸
;; {:a ["red" "blue" "green"],
;; :b (:north :south :east :west),
;; :c {"x-axis" 1, "y-axis" 2}}
❶ By simply typing "data" at the REPL we trigger a basic printout of the content of the
corresponding var object. If "data" is large, we could potentially wait a few seconds for the screen to
scroll a dense wall of text.
❷ pp invokes pprint on the last evaluated expression. We can see that pprint is aware of what kind of
object we want to print and nicely aligns keys and values for us in a readable way.
❸ We can call pprint directly on any printable object. We can see that pprint produces exactly the
same output on "data" as pp before.
pprint and pp are the main entry points into the pretty printer, but other functions are
part of its interface. pprint is readily available at the REPL, but any other use
requires an explicit require, for example from inside a program or to access other
available functions, such as clojure.pprint/write:
(require '[clojure.pprint :as pretty]) ; ❶
(require '[clojure.java.io :as io])
239
Tom Faulhaber announced the cl-format library in this post groups.google.com/d/msg/clojure/hkDA8zotzUc/x3b-
QBbBfvYJ from the Clojure mailing list.
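The setup for the next example is not reproduced above; a sketch of the op-fn definition and the pretty printing call the callouts below describe (the original string is presumably similar):

```clojure
(def op-fn
  "(defn op [sel]
     (condp = sel
       \"plus\" + \"minus\" - \"mult\" * \"div\" / \"rem\" rem \"quot\" quot))")

;; read the string, then pretty print the resulting list
;; with the default (data-oriented) dispatch
(clojure.pprint/pprint (read-string op-fn))
```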
;; (defn
;;  op
;;  [sel]
;;  (condp
;;   =
;;   sel
;;   "plus"
;;   +
;;   "minus"
;;   -
;;   "mult"
;;   *
;;   "div"
;;   /
;;   "rem"
;;   rem
;;   "quot"
;;   quot))
❶ We are given a Clojure function as a string. This can be the result of opening a Clojure file as text or
perhaps it was stored in a database.
❷ read-string invokes the Clojure reader to read the content of the string. read-string returns
the list starting with the symbol "defn". The list remains in its unevaluated form.
❸ pprint does not distinguish between a list containing code (a list that is supposed to be evaluated at
some point in time) and a simple data structure.
To print Clojure code correctly we need a specific formatting that understands Clojure
forms. We can change default formatting for pprint using with-pprint-dispatch:
(pretty/with-pprint-dispatch
pretty/code-dispatch ; ❶
(pprint (read-string op-fn)))
;; (defn op [sel]
;; (condp = sel
;; "plus" +
;; "minus" -
;; "mult" *
;; "div" /
;; "rem" rem
;; "quot" quot))
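The table below appears to be produced by clojure.pprint/print-table over four identical maps; a sketch that generates a similar table (the exact data used in the original is not shown here):

```clojure
(require '[clojure.pprint :refer [print-table]])

;; four identical rows: integer keys 0-9 mapped to 100-109
(print-table
  (repeat 4 (zipmap (range 10) (map #(+ 100 %) (range 10)))))
```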
;; | 0 | 7 | 1 | 4 | 6 | 3 | 2 | 9 | 5 | 8 |
;; |-----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
;; | 100 | 107 | 101 | 104 | 106 | 103 | 102 | 109 | 105 | 108 |
;; | 100 | 107 | 101 | 104 | 106 | 103 | 102 | 109 | 105 | 108 |
;; | 100 | 107 | 101 | 104 | 106 | 103 | 102 | 109 | 105 | 108 |
;; | 100 | 107 | 101 | 104 | 106 | 103 | 102 | 109 | 105 | 108 |
❶ print-table takes a collection of maps (it also takes an optional list of keys; as a default header, it
uses the keys found in the first map).
240
For a historical perspective see groups.google.com/d/msg/clojure/5wRBTPNu8qo/1dJbtHX0G-IJ. The suffix "dup"
refers to "duplication", as the printed object is effectively duplicated once it is evaluated back from a string.
structures. The printing mechanism is based on a multimethod and the standard library
comes with a default implementation for most Clojure types. When that's not the
case, printing defaults to the name of the class and a few additional pieces of information:
(deftype Point [x y]) ; ❶
;; user.Point
❶ deftype creates the corresponding Java class in the current classpath. Clojure doesn’t have any
directive regarding how this new object should be printed.
❷ When we print a new custom type, Clojure uses the default formatting: it includes the initial "#object"
declaration, followed by the class name and the object hash as hexadecimal ("0x2e6b5958").
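For instance, a sketch of the default output (the hexadecimal hash varies between runs):

```clojure
(println (Point. 1 2))
;; prints something like: #object[user.Point 0x2e6b5958 "user.Point@2e6b5958"]
```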
If we want to print custom types differently, we can tell Clojure using print-method:
(defmethod print-method user.Point [object writer] ; ❶
(let [class-name (.getName (class object))
args (str (.x object) " " (.y object))]
(.append writer (format "(%s. %s)" class-name args)))) ; ❷
❶ print-method is a multimethod. We extend the multimethod with defmethod and the type of the
object. print-method is defined with two arguments, the object that received the call to print (or
other printing functions) and the currently open "writer" instance.
❷ We can now .append anything we want to the writer: in this example we pick the same format used to
create a new Point instance (user.Point. x y) replacing "x" and "y" with the current content of the
"object".
❸ Let’s print the string representation of a Point first. We also print the type for clarity.
❹ We can now take the string representation and ask the Clojure Reader to parse the content of the
string. We do this using read-string. The result is a PersistentList instance ready for evaluation.
❺ Calling eval on a list forces interpretation of the first element as a function (or Java interoperation call,
like our case) and the rest of the list as arguments. eval invokes the Point constructor, generating a
duplicate instance of the initial Point with coordinates [1 2].
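The round trip described by the last three callouts can be sketched as follows (assuming the Point type and the print-method definition above):

```clojure
(def p-str (pr-str (Point. 1 2)))  ; ❸ the string representation
;; "(user.Point. 1 2)" (assuming the type was defined in the user namespace)

(def p-form (read-string p-str))   ; ❹ a PersistentList, ready for evaluation

(eval p-form)                      ; ❺ a duplicate Point with coordinates [1 2]
```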
❶ print-dup is not a function to invoke directly. The dynamic variable *print-dup* controls the print-
dup serialization format. We can see that the map {:a 1 :b 2} is serialized
as a PersistentArrayMap that needs to be created explicitly with its static create method.
❶ With a new definition of print-method, the Point instance produces a visually appealing
representation. However, note that the output is not valid Clojure.
❷ We can use print-dup to create a Clojure-aware string representation. We define a new multimethod
instance to deal with the Point class. print-ctor takes care of generating the correct constructor
call.
❸ To trigger the alternate print-dup representation, we bind the dynamic variable *print-
dup* to true before using any of the printing functions (in this case pr-str). Note that print-
ctor outputs the constructor call inside the "reader eval macro" #=().
The reader eval macro #= has the same effect as calling eval on the form that follows. As
a consequence, read-string can be used to read the string back into a list and evaluate it at
the same time:
(binding [*print-dup* true]
  (read-string (pr-str (Point. 1 2))))
❶ read-string produces the combined effect of reading and evaluating Clojure code when used
with print-dup aware objects.
Clojure serialization with print-dup is effective but vulnerable to code injection. This
might explain why print-dup is generally undocumented and read-string is
discouraged (unless you are in total control of serialized data). However, there might
be cases where it makes sense to use print-dup, for example to temporarily park data
on disk.
Additionally, if the first string argument can be interpreted as a URL (that is, it has
the URL syntax), slurp can read from it and return a string (spit instead works only
with URLs starting with file://):
(def book (slurp "http://www.gutenberg.org/files/2600/2600-0.txt")) ; ❶
(reduce str (take 22 book))
;; "The Project Gutenberg"
❶ slurp comes handy to quickly load the content of a web site, such as a book from the Gutenberg
project. The argument is a string, but the format is compatible with a java.net.URL object.
❶ The :encoding key forces a particular encoding on the string being read (as in this case) or written
(using spit). Here we see the result of interpreting a UTF-8 file as UTF-16: the content stops making
sense. The default encoding is "UTF-8" unless otherwise specified.
❷ We can see the effect of :append true on spit: subsequent calls to write on the same file are
appended to the current content instead of overwriting it (the default).
Both slurp and spit do their best to deal with the type of their arguments
automatically. The following table shows an example for each of the supported types.
To run the snippets in the table, you need the following imports:
(import '[java.io FileReader FileWriter])
(import '[java.io ByteArrayInputStream ByteArrayOutputStream])
(import '[java.io File])
(import '[java.net URL Socket])
19
One of the most common programming tasks is the transformation and manipulation of
strings. Regular expressions are also an essential tool for processing strings and are
included in this chapter.
19.1 str
str is one of the most used Clojure functions. It takes one or more strings and returns
their concatenation:
(str "Should " "this " "be " "a " "single " "sentence?") ; ❶
;; "Should this be a single sentence?"
❶ Note that str does not include any space or punctuation, so we had to add a space at the end of each
word for the final sentence to read correctly.
But str is not limited to strings, because it converts any non-string argument before
concatenating it. It is common to see str used on all sorts of values, because every type
in Clojure (inheriting this behavior from Java) contains at least a default conversion to
string. Sometimes this behavior is not desirable, as in the case of lazy sequences:
(str :a 'b 1e8 (Object.) [1 2] {:a 1}) ; ❶
;; ":ab1.0E8java.lang.Object@dd2856e[1 2]{:a 1}"
❶ A small sample of values supported by str. There is nothing that str can't transform into a string,
because Object (the parent of all classes in Java) has a default .toString method that is used when
there is no specific definition.
❷ Types like clojure.lang.LazySeq are containers for collections that are not evaluated yet. They don't
override the default method inherited from Object, which also prevents accidental evaluation of
potentially expensive sequences.
❸ There is a way to ask Clojure to print the lazy sequence. pr-str's main goal is to create a version of
the lazy sequence that can be stored in a file (or other media) so it can be read back by another
Clojure process. pr-str prints the lazy sequence as a list, a concrete data type that is sequential
(although not lazy).
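For lazy sequences, for example, the difference looks like this (a minimal sketch; the hexadecimal hash varies):

```clojure
(str (map inc (range 3)))     ; the default Object representation, unevaluated
;; "clojure.lang.LazySeq@705d"

(pr-str (map inc (range 3)))  ; readable back as a list
;; "(1 2 3)"
```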
Another typical use of str is on collections with either apply or reduce. The resulting
string is equivalent to the concatenation of all the items in the input collection:
(apply str (range 10)) ; ❶
;; "0123456789"
❶ We can use apply or reduce (example below) to concatenate the items from an input collection.
❷ We can play with string concatenation further and interleaving commas for example. This is typical in
human-readable data rendering, for example comma separated values for storage in files.
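The interleaving mentioned in the second callout can be sketched as:

```clojure
(apply str (interpose "," (range 10)))  ; comma separated values
;; "0,1,2,3,4,5,6,7,8,9"
```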
In terms of general performance, str is based on the Java StringBuilder class, which
accumulates fragments in a mutable buffer in case of multiple arguments.
Although reduce is generally a faster option, this is one case in which apply performs
better: reduce would simply call str on each iteration, throwing away any
buffering StringBuilder instance. Here's a benchmark that shows the difference:
(require '[criterium.core :refer [quick-bench]])
❶ apply calls the variadic arity (defn str [x & xs]) of str, which pushes each converted string into a
mutable Java StringBuilder before the final concatenation.
❷ reduce produces the same result but much slower: each new iteration creates a new StringBuilder of
the accumulated string plus the appended new item, which is immediately concatenated into a new string.
19.2 join
The join function is part of the clojure.string namespace. It accepts a sequential
collection of objects and transforms them into a string:
(join (list "Should " "this " "be " 1 \space 'sentence?))
;; "Should this be 1 sentence?" ; ❷
❶ join also allows a separator, producing a shorter version of the similar interpose call.
❷ A quick comparison with the equivalent interpose function call that achieves the same effect.
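The separator variant and its interpose equivalent can be sketched as:

```clojure
(require '[clojure.string :refer [join]])

(join ", " [1 2 3])                   ; ❶ with a separator
;; "1, 2, 3"

(apply str (interpose ", " [1 2 3]))  ; ❷ the interpose equivalent
;; "1, 2, 3"
```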
❶ "xs" is the result of the lazy application of interpose to a range. Considering we want a single string out
of the collection, we pay the price of laziness without making use of it.
❷ join adds the separator to the StringBuilder while iterating the input collection, without the need
to cons the separator into a lazy sequence first. This results in a speed improvement.
❶ We need to require replace to be available in the current namespace. In this case we use the "s"
alias.
❷ Hyphens are replaced by a space using s/replace.
We could use a string instead of a single character to replace entire words, but we can't
mix character substitutions with string targets and vice versa:
(s/replace "Closure is a Lisp" "Closure" "Clojure") ; ❶
;; "Clojure is a Lisp"
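s/replace also understands regular expressions and match groups in the replacement. The callouts below describe an example along these lines (a sketch with hypothetical dates):

```clojure
(require '[clojure.string :as s])

(s/replace "12/31/2017 and 01/01/2018"
           #"(\d+)/(\d+)/(\d+)"   ; ❶ parentheses mark the month, day and year groups
           "$2/$1/$3")            ; ❷ "$n" refers to group n: swap month and day
;; "31/12/2017 and 01/01/2018"
```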
❶ A string contains multiple dates expressed in month, day, year format. We can match the date format
with surrounding parenthesis to indicate groups.
❷ The special symbol "$" followed by a progressive number represents the different matching groups. To
swap month and day, we need to invert the position of "$1" and "$2".
In the last example we can see that the dollar sign "$" has a special meaning when
using a regular expression. If we want to treat it literally, we can use re-quote-
replacement:
241
Entire books and websites are dedicated to regular expressions. One useful online resource to get started is www.regular-
expressions.info
❶ We want to request 10$ for each month appearing in the string. Our first attempt fails because the "$"
has a special meaning for regular expression replacements.
❷ re-quote-replacement prevents the wrong (in this case) interpretation of the dollar sign.
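A sketch of the scenario the callouts describe (the month names are hypothetical):

```clojure
(require '[clojure.string :as s])

;; (s/replace "January February" #"\w+" "10$")
;; throws: "$" is special in regular expression replacements

(s/replace "January February" #"\w+" (s/re-quote-replacement "10$"))
;; "10$ 10$"
```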
replace-first has the same calling contract as replace, but it only executes the first
substitution, if any:
(def s "A drink here and a drink home.")
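A sketch of replace-first on this string, followed by the redefinition of s that the next examples appear to assume (an assumption, inferred from the outputs shown below):

```clojure
(require '[clojure.string :as s])

(s/replace-first s "drink" "coffee")  ; only the first match is replaced
;; "A coffee here and a drink home."

;; the substring and split examples that follow assume s redefined as:
(def s "The quick brown fox\njumps over the lazy dog")
```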
(subs s 20 30) ; ❷
;; "jumps over"
(s/split s #"\s") ; ❸
;; ["The" "quick" "brown" "fox" "jumps" "over" "the" "lazy" "dog"]
(s/split-lines s) ; ❹
;; ["The quick brown fox" "jumps over the lazy dog"]
❶ s/split and s/split-lines are defined inside the clojure.string namespace. We need
to require them in order to use them.
❷ subs returns a substring delimited by "start" and "end" indexes. If the end index is omitted,
the substring extends to the end of the string.
❸ s/split creates a split of the input string for each match of the given regular expression. "\s" means
"any white space character", which also includes the "\n" new line separator. The resulting vector contains
all the splits.
❹ s/split-lines executes an implicit split for each "\n" new line separator. The resulting vector
contains all the splits by new lines.
subs is useful when the input string has a specified structure and the portion to extract
always appears in the same position:
(def errors ; ❶
["String index out of range: 34"
"String index out of range: 48"
"String index out of range: 3"])
❶ errors contains error messages coming from a log. The message repeats with the same
structure and the variable portion always appears at index "27" (the index is 0 based).
❷ We don't need to specify the end position if we know that the variable number always appears at the
end of the string.
If the portion of the input string we need to extract moves inside the string, we can use
a regular expression with s/split:
(def errors ; ❶
["String is out of bound: 34"
"48 is not a valid index."
"Position 3 is out of bound."])
❶ The new error messages contain the information about the wrong index at different positions in the
string.
❷ By splitting at "\D+" ("one or more non-digit characters") we remove everything from the input
message that is not a digit. In doing so, we produce a vector as output that contains the number as the
last element. peek accesses the last element of a vector efficiently.
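The split-based extraction can be sketched as:

```clojure
(require '[clojure.string :as s])

(def errors
  ["String is out of bound: 34"
   "48 is not a valid index."
   "Position 3 is out of bound."])

;; split at non-digit runs; the number ends up as the last element
(map #(peek (s/split % #"\D+")) errors)
;; ("34" "48" "3")
```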
(last printers) ; ❹
;; "-rw-r--r-- 1 root _lp 1111829 10 May 13:49 _192_168_176_12.ppd"
(sequence
;; ("Brother_DCP_7055.ppd" ; ❼
;; "Training_room.ppd"
;; "_192_168_176_12.ppd")
❶ sh executes a command in the default shell provided by the operating system. It is available from
the clojure.java.shell namespace.
❷ sh takes a list of strings representing the command and options and returns a map that contains the
output as a string under the :out key.
❸ As a first step, we split the single string output into a list of lines, more or less as they would appear on
the screen after executing the command from a terminal.
❹ We can see an example of the content of printers. Each line contains further information about the
files in the folder, including permissions, ownership and so on. The last part of the line contains the
name of the printer configuration with extension ".ppd".
❺ Processing of each line starts with a s/split instruction that splits the line at each group of spaces or
tabs, isolating the interesting parts into single strings.
❻ The printer name appears as the last element in each line. A few lines don't contain a printer name
though, such as the "." or ".." special directory files. We remove them using filter.
❼ We can see an example of the intended output (you might need to tweak the shell command and the
regular expressions to execute this example from your system).
A regular expression engine is a sophisticated tool that comes with some performance
penalty. Most of the time regular expressions are an invaluable feature that is difficult to replace, but
if you have a fixed-structure string to analyze, it makes sense to avoid regular
expressions to speed up computation. Here's a benchmark that compares subs
and s/split to give you an idea of the speed implications:
(require '[criterium.core :refer [quick-bench]])
variants, such as tabs, returns, separator and related variants (see the call-out for more
details).
(map ; ❶
#(hash-map :int % :char (char %) :hex (format "%x" %))
(filter (comp #(Character/isWhitespace %) char) (range 65536)))
Note how the Ogham space mark \u1680 is a white space with a printable representation (similar to a "-
") but other printable white spaces like the Ethiopic Wordspace \u1361 are not considered white space.
(s/trim "\t1\t2n\n") ; ❸
❶ We need to require the clojure.string namespace to use any of the trimming functions.
❷ s/trim removes spaces from both ends of a string.
❸ As discussed, the definition of white space also includes other non-printable characters such as
tabulations.
s/trimr and s/triml are similar to s/trim but they only remove Java white spaces
from the right or the left edge of a string respectively:
(s/trimr " *Spaces on the left are not removed with trimr.* ") ; ❶
;; " *Spaces on the left are not removed with trimr.*"
(s/triml " *Spaces on the right are not removed with triml.* ") ; ❷
;; "*Spaces on the right are not removed with triml.* "
❶ s/trimr only removes Java spaces that appear on the right edge of a string.
❷ s/triml only removes Java spaces from the left side of the string.
❶ s/trim-newline only trims newlines \n and carriage returns \r at the end of the string.
(def link ; ❷
(def link-escape ; ❸
{\, "_comma_"
\space "_space_"
\. "_dot_"
\' "_quote_"
\: "_colon_"
\newline "_newline_"})
;; {\newline "newline"
;; \tab "tab"
;; \space "space"
;; \backspace "backspace"
;; \formfeed "formfeed"
;; \return "return"}
;; {\newline "\\n"
;; \tab "\\t"
;; \return "\\r"
;; \" "\\\""
;; \\ "\\\\"
;; \formfeed "\\f"
;; \backspace "\\b"}
❶ As we did before, we can print the content of the substitution map to inspect its content.
❷ We'd like to print these instructions as they appear in the string, but println correctly converts special
sequences like \n into a new line on screen.
❸ s/escape with char-escape-string converts the special characters in the string so that they print as
originally intended.
(-> some? ; ❶
source ; ❷
with-out-str ; ❸
s/upper-case ; ❹
println)
;; (DEFN SOME? ; ❺
;; "RETURNS TRUE IF X IS NOT NIL, FALSE OTHERWISE."
;; {:TAG BOOLEAN
;; :ADDED "1.6"
;; :STATIC TRUE}
;; [X] (NOT (NIL? X)))
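The code the callouts below refer to is not reproduced here; a minimal sketch along the same lines, with a short sample string standing in for the book text:

```clojure
(require '[clojure.string :as s])

(def primary-colors #{"red" "blue" "yellow"})           ; ❶ a set of color words
(def book "Red roses and blue skies. The red sun.")      ; ❷ stands in for a large text

;; ❸ split into words, keep only primary colors, count them
(frequencies (filter primary-colors (s/split book #"\W+")))
;; {"blue" 1, "red" 1}

;; ❹ lower-casing first catches words capitalized at sentence start
(frequencies (filter primary-colors (map s/lower-case (s/split book #"\W+"))))
;; {"red" 2, "blue" 1}
```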
❶ We have a set of primary colors. We could optionally spell the colors all uppercase, assuming we are
happy with that specific portion of the code standing out some more.
❷ book contains a string version of a large text, for example "War and Peace".
❸ We use s/split to split the book into single words. We do this with a simple regular expression that
covers most of the cases. We can filter the primary-colors only by using filter and the set itself as a
predicate. Finally, we call frequencies to see the numbers. In doing so, we compare our list of all
lowercase colors to the content of the book, which might appear with a different case.
❹ On a second attempt, words are converted to lower case before filtering. We can see that there are
two additional "red" occurrences, probably because they appeared at the beginning of the sentence.
capitalize transforms the first letter of a string to uppercase and all the other into
lowercase. This is useful to unify the format of a list, or to spell proper nouns correctly.
We could use capitalize on a vector of customer names collected from different
sources to be sure they are spelled correctly:
(def names
["john abercrombie"
"Brad mehldau"
"Cassandra Wilson"
"andrew cormack"])
(sequence
(comp
(mapcat #(s/split % #"\b")) ; ❶
(map s/capitalize) ; ❷
(partition-all 3) ; ❸
(map s/join)) ; ❹
names)
;; ("John Abercrombie"
;; "Brad Mehldau"
;; "Cassandra Wilson"
;; "Andrew Cormack")
❶ The first step is to s/split the string containing the full name. We need to join the string back at the
end, so we want to preserve spaces during the split. The regex "\b" means "any word boundary"
(which includes words made of spaces or other characters). mapcat collapses the inner vectors into a
single collection of words.
❷ s/capitalize turns each first letter to uppercase.
❸ Now that the names are in the correct format, we need to prepare to join the strings back together. We
use partition-all assuming names are always formed by a first and last name (no middle names or
other words).
❹ Finally, we use s/join to collapse triplets back into a single string.
Note that upper-case, lower-case and capitalize can be used on any printable object
(virtually all Clojure and Java types):
(map s/upper-case ['symbols :keywords 1e10 (Object.)]) ; ❶
;; ("SYMBOLS" ":KEYWORDS" "1.0E10" "JAVA.LANG.OBJECT@4C7A1053")
Both functions take an optional integer to start the search from. index-of skips the first
"n" characters if "n" is given, while last-index-of truncates the input at "n" before
starting the search backward:
(s/index-of "Bonjure Clojure" \j 4) ; ❶
;; 11
(s/last-index-of "Bonjure Clojure" "ju" 10) ; ❷
;; 3
❶ index-of drops the first "4" chars from the input string before starting the search. Since "4" is beyond
the position of the first "j", the next "j" is found at index 11.
❷ last-index-of truncates the input string beyond index "10". The next "ju" in the string is found at
index "3" searching backward from index "10".
If the target char or string is not found or the start index is beyond the string boundaries,
both functions return nil:
(s/index-of "Bonjure Clojure" "z") ;; nil ; ❶
(s/index-of "Bonjure Clojure" "j" 20) ;; nil
(s/last-index-of "Bonjure Clojure" "z") ;; nil
(s/last-index-of "Bonjure Clojure" "j" -1) ;; nil
❶ A group of examples showing what happens when we search for a non-existent substring or we pass
a start index "n" beyond the string boundaries.
Apart from strings and single characters, other types of java.lang.CharSequence are
accepted, for example java.lang.StringBuffer:
(import 'java.lang.StringBuffer)
(s/index-of ; ❶
(doto (StringBuffer.)
(.append "Bonjure")
(.append \space)
(.append "Clojure"))
\j)
;; 3
❶ The java.lang.CharSequence interface has a few implementations available in the Java standard
library, such as StringBuffer, StringBuilder or the very common String class. index-of and
last-index-of work with any CharSequence.
ends-with? and starts-with? return true when the given substring appears at the
end or the beginning of another string, respectively:
(s/starts-with? "Bonjure Clojure" "Bon") ;; true
(s/starts-with? "Bonjure Clojure" "Clo") ;; false
(s/starts-with? "" "") ;; true
(s/starts-with? "Anything starts with nothing." "") ;; true ; ❶
❶ Note how, for both starts-with? and ends-with?, the empty string always starts or ends a given
string, returning true.
includes? looks for a substring match at any position in a given string:
;; ([tree-seq #'clojure.core/tree-seq] ; ❸
;; [line-seq #'clojure.core/line-seq]
;; [iterator-seq #'clojure.core/iterator-seq]
;; [enumeration-seq #'clojure.core/enumeration-seq]
;; [resultset-seq #'clojure.core/resultset-seq]
;; [re-seq #'clojure.core/re-seq]
;; [lazy-seq #'clojure.core/lazy-seq]
;; [file-seq #'clojure.core/file-seq]
;; [chunked-seq? #'clojure.core/chunked-seq?]
;; [xml-seq #'clojure.core/xml-seq])
❶ You can use re-find as a predicate to verify the presence of a substring inside another string.
❷ The list of strings in this case comes from all the public function names in the core namespace.
;; ❸ The results answer the question: "Which functions in the standard library contain -seq in their name?"
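The example these callouts describe can be sketched as follows (a minimal reconstruction, assuming the names come from ns-publics on clojure.core as callout ❷ suggests):

```clojure
;; Collect all public vars in clojure.core whose name contains "-seq".
;; re-find returns the first match (or nil), which makes it directly
;; usable as a predicate inside filter.
(filter (fn [[sym _]] (re-find #"-seq" (str sym)))
        (ns-publics 'clojure.core))
;; returns pairs like [line-seq #'clojure.core/line-seq]
```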
re-find is a good choice as a predicate because even if there are more matches, it stops
at the first one. But in case we want to extract all matching parts, we need to use
re-seq. Here for example we want all email addresses found in a web page:
❶ The regular expression presented here is a quick solution relying on the presence of HTML tags
around the email address. It works sufficiently well in most cases of web scraping. It also assumes
emails belong to the ".com" domain, which is certainly not true for all email addresses. re-seq performs
multiple "find" operations on the same Matcher to accumulate results.
We can now briefly compare a few options to verify if a string contains another string.
We’ve seen index-of, re-find and includes?:
(require '[criterium.core :refer [quick-bench]]) ; ❶
(require '[clojure.string :as s])
(let [s contacts]
(quick-bench (s/index-of s "[email protected]")))
;; Execution time mean : 16.570516 ns
(let [s contacts
re #"[email protected]"]
(quick-bench (re-find re s)))
;; Execution time mean : 345.104914 ns ; ❸
(let [s contacts]
(quick-bench (s/includes? s "[email protected]")))
;; Execution time mean : 18.364512 ns
❶ As for the rest of the book, the Criterium library is used to benchmark functions.
❷ The benchmark consists of searching for the Manning support email address in a short string.
❸ Searching with re-find is about 20 times slower.
re-find is penalized because it has to analyze the regular expression before applying it
to the input string. At the same time, re-find allows for much more powerful features than
checking the presence of a substring.
20
Mutation and Side Effects
(conj v 1)
;; ClassCastException
;; clojure.lang.PersistentVector$TransientVector
;; cannot be cast to clojure.lang.IPersistentCollection
While a subset of read-only functions like get, nth or count still works, an entirely new
set of functions is available to mutate a transient. Their names match the corresponding
standard functions, with the conventional "!" added at the end:
(def v (transient []))
(def s (transient #{}))
(def m (transient {}))
((conj! v 0) 0) ; ❶
;; 0
((conj! s 0) 0) ; ❷
;; 0
((assoc! m :a 0) :a) ; ❸
;; 0
❶ A transient vector "v" is mutated with conj! and used as a function to access the item at index "0".
The item was just added to the transient and is the number "0".
❷ Similarly, we can add a new element to a transient set using conj!. We can use the transient set
as a function to verify if the set contains the item "0". When the element is not present, nil is
returned.
❸ The transient map "m" is mutated with assoc!.
The main use case for transient is to enable controlled and isolated mutation to and
from persistent data structures, removing the overhead associated with creating many
persistent copies that are never going to be shared. The standard library itself has
plenty of such examples: into, mapv, group-by, set, frequencies (and more), are
functions transforming a collection into another using transient to speed up internal
processing.
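The pattern these functions use internally can be sketched with a simplified version of into (this is an illustration of the technique, not the actual clojure.core implementation):

```clojure
(defn fast-into
  "Simplified sketch of into: mutate a transient copy of `to`,
  then freeze it back into a persistent collection."
  [to from]
  (persistent!
   (reduce conj! (transient to) from)))

(fast-into [] (range 5))
;; [0 1 2 3 4]
```

Note how every conj! result feeds the next step through reduce, following the rule (discussed below) of always using the output of the previous mutating operation.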
There are still functions in the standard library that would benefit from transient but
have not been converted yet. The book showed how to use transient to improve
the performance of zipmap, merge, tree-seq, disj and select-keys, for instance. We also
used transient in:
• peek, to create a reverse-mapv function.
• dotimes, which contains a fast "FizzBuzz" implementation using transient.
• nth, which shows how to implement a hash-table data structure on top of arrays. The
hash-table grows and shrinks the internal array using faster transient operations.
The reader is invited to revisit the above examples to see transient in action. For the
remainder of this section, let’s concentrate on a few "gotchas" with transient. One
important aspect of mutating a transient in place is that old references to the
same transient instance might not be consistent. This is in contrast with other
mutable data structures such as java.util.HashMap. The following example illustrates
this behavior:
(import 'java.util.HashMap)

(def transient-map (transient {}))
(def java-map (HashMap.))
(dotimes [i 20] ; ❶
(assoc! transient-map i i)
(.put java-map i i))
(persistent! transient-map) ; ❷
;; {0 0, 1 1, 2 2, 3 3, 4 4, 5 5, 6 6, 7 7}
(into {} java-map) ; ❸
;; {0 0, 7 7, 1 1, 4 4, 15 15, 13 13, 6 6,
;; 3 3, 12 12, 2 2, 19 19, 11 11, 9 9, 5 5,
;; 14 14, 16 16, 10 10, 18 18, 8 8, 17 17}
❶ dotimes iterates the body of the expression 20 times. On each iteration we add the key-value pair [i
i] to both mutable maps.
❷ The persistent! version of the transient map appears to be missing many keys.
❸ The Java HashMap correctly shows 20 key-value pairs as expected.
Clojure transients don’t promise to mutate their input in place. Each
operation should instead treat the input as "obsolete" and treat whatever the mutating
function returns as the valid version. The correct approach with transient is therefore
to always use the output of the previous mutating operation:
(def transient-map (transient {}))
(def m ; ❶
(reduce
(fn [m k] (assoc! m k k))
transient-map
(range 20)))
(persistent! m) ; ❷
;; {0 0, 7 7, 1 1, 4 4, 15 15, 13 13, 6 6,
;; 3 3, 12 12, 2 2, 19 19, 11 11, 9 9, 5 5,
;; 14 14, 16 16, 10 10, 18 18, 8 8, 17 17}
❶ The correct approach to apply multiple mutations to a transient is to always use the last mutated
instance.
❷ transient-map points to a different state of the same mutable transient map than the result of
the reduce call. We can see that the persistent! version now contains the expected 20 keys.
WARNING transients are mutable, unsynchronized data structures. Use of the same transient instance
by multiple threads can lead to unpredictable results. See locking for a way to
ensure transient mutations happen in a synchronized context.
All functions in this section assume a computation that has at least some side
effects, which we want to execute immediately. doseq, dorun and run! walk a lazy
sequence purely for side effects, collecting no results and returning nil:
(defn unchunked [n] ; ❶
(map #(do (print ".") %)
(subvec (vec (range n)) 0 n)))
❶ unchunked creates a non-chunked lazy sequence of size "n". By removing chunked evaluation, we
make sure to evaluate exactly what is requested. With chunked sequences, we would always see the
evaluation of the first 32 items regardless of how many are requested. subvec is one of the few
collections supported by seq that is not chunked.
❷ doseq is fully equipped with a rich semantics for iterating the input, the same available with for. Shown
here is the :while keyword that stops the iteration when "x" becomes equal to 5. The reader should
review for to see all the other options available. We can see a dot printed for each input element passing
through the map function, but the expression itself returns nil.
❸ dorun is simpler and offers less configuration. It accepts an optional number as the first argument that
represents how many elements to iterate. When there is no number, dorun iterates the entire input.
The result is the same as doseq.
❹ run! additionally takes a function and applies the function to each element in the input. run! always
runs through the entire input.
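The calls the callouts describe might look like the following (a sketch based on the unchunked helper above, repeated here so the snippet is self-contained; the exact forms in the elided example may differ):

```clojure
;; unchunked as defined earlier in the section:
(defn unchunked [n]
  (map #(do (print ".") %)
       (subvec (vec (range n)) 0 n)))

;; ❷ :while stops the iteration once "x" reaches 5; dots are printed
;;    for each realized element, but the expression returns nil.
(doseq [x (unchunked 10) :while (< x 5)] x)

;; ❸ dorun with a leading number walks only that many elements.
(dorun 5 (unchunked 10))
;; nil

;; ❹ run! applies a function to every element of the input.
(run! print (unchunked 3))
;; nil
```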
As we can see from the examples, doseq, dorun and run! always return nil, regardless
of the input, so if anything interesting happens while iterating, it must necessarily be
a side effect. As a consequence, doseq, dorun and run! all run in constant memory,
as they don’t retain the head (or any other item) of the sequence.
doall is similar in behavior to dorun but it returns the output. doall is often used to
fully realize a lazy sequence before leaving a context that is necessary for the sequence
to work properly. A typical example is with-open:
(require '[clojure.java.io :refer [reader]])
(count lines) ❸
;; IOException Stream closed
❶ get-lines accepts a URL and uses clojure.java.io/reader to produce the line-seq of the content
of the URL. with-open is a macro that ensures the reader is closed properly after exiting the block.
❷ No problem to read some text from the internet, apparently.
❸ But as soon as we try to move forward in the sequence, for example to check how many lines there
are, we get an IOException.
What happens in the example above is that the java.io.Reader instance from which
the lazy sequence is supposed to read is already closed by the time the related code
actually evaluates. with-open, as expected, took care of that aspect inside the function,
leaving the lazy sequence free to escape the context where the connection was still
open. The solution is usually to fully realize the sequence:
(count lines) ; ❷
29301
❶ The only change to get-lines was to add doall to the lazy sequence produced by line-seq before
leaving the with-open context.
❷ Counting the lines does not produce any problem once the sequence is fully realized.
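A minimal sketch of the fix the callout describes (get-lines is reconstructed here since its definition is not shown; the input source is an assumption):

```clojure
(require '[clojure.java.io :refer [reader]])

(defn get-lines [url]
  (with-open [r (reader url)]
    ;; doall realizes the whole sequence while the reader is still open,
    ;; so the result can safely escape the with-open block.
    (doall (line-seq r))))
```

Without the doall, line-seq would lazily read from a reader that with-open has already closed, producing the IOException shown above.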
The last macro (a special form) of this section is do. do evaluates all its expression
arguments and returns the result of the last one (or nil when there are no expressions).
The expressions preceding the last are presumably side-effecting, as their results are
discarded:
(do
(println "hello") ; ❶
(+ 1 1))
;; hello
;; 2
❶ This form of do expression is often introduced temporarily to print a debug message in a critical
section of the code. Other tracing techniques, such as logging, are more likely to be left in the code
inside the do expression.
❷ Another typical use of do is with if statements, as their "then-else" blocks only accept a single
expression.
❶ A volatile! is designed to hold data like other concurrency constructs. The initial value must be part
of the call to volatile!.
❷ volatile! returns a clojure.lang.Volatile that is saved in the var "v". We can then ask if "v" is a
volatile?.
❸ There are two ways to mutate a volatile!: we can use vswap! passing a function from the old value
to the new, or use vreset! to just replace the value.
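The calls described by the callouts above can be sketched as:

```clojure
(def v (volatile! 1))  ;; the initial value is required

(volatile? v)
;; true

(vswap! v inc)   ;; mutate with a function of the old value
;; 2

(vreset! v 0)    ;; or just replace the value
;; 0
```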
Java Volatile
volatile! does not protect its state against concurrent access like other concurrency primitives. On the
contrary, the goal of volatile! is to make sure concurrent threads promptly see any state update. To
understand why this might be necessary, we need to talk about the default thread visibility of variables
in Java.
In Java compiled bytecode, reads and writes to instance attributes of a class are subject to a process
called reordering. The JVM may reorder access to a variable to improve the performance of a critical
section of the code. Moreover, CPU registers might hold a copy of the value of a variable while the
variable is written from another section of the code. When the code executes sequentially, reordering and
local caches don’t constitute a problem. But when multiple threads are involved, there is no guarantee
that a reading thread will see the value of an attribute written by a writer thread.
Locking is one way to guarantee temporal ordering during concurrent access by multiple threads. But
there are cases where just disallowing reordering or register caching would be sufficient, without
incurring the overhead of locking. In Java we can achieve such a goal by declaring a variable "volatile".
When a field is volatile, the compiler notifies the runtime that reads and writes to the variable should not
be subject to reordering and should not be cached in registers.
Clojure’s volatile! defines a reference type that implements this behavior. Before the introduction
of volatile!, there was no way to produce a stand-alone "volatile" attribute, with the exception of
creating a whole deftype with the :volatile-mutable option.
The main reason for the existence of volatile! in Clojure is stateful transducers and
their use in environments like core.async 242.
Some of the core.async constructs like "pipelines" achieve parallelism through
coordinated multi-threaded access. In a pipeline, multiple threads take care of the
stages of a transducer chain, but each thread accesses only one stage at a time. It
follows that a thread executing the next step of a stage needs to see any previously
written state.
Under the conditions of a core.async pipeline, the state inside a stateful transducer
doesn’t necessarily need to be thread-safe. What’s more important is that state should
be immediately visible, not reordered and not cached in CPU registers. The reader is
invited to take a look at transduce to see an example of a custom stateful transducer
that uses volatile!.
Other than stateful transducers, volatile! can be used to solve some thread
coordination problems. In the following example, a "producer" thread wants to
communicate to a consumer thread that work is done and results are available:
242
core.async is a Clojure library implementing a form of concurrency called Communicating Sequential Processes (CSP)
(def result (volatile! nil))
(def ready (volatile! false))

(defn start-consumer []
(future
(while (not @ready) ; ❶
(Thread/yield))
(println "Consumer getting result:" @result)))
(defn start-producer [] ; ❷
(future
(vreset! result :done)
(vreset! ready :done)))
(start-consumer) ; ❸
(start-producer) ; ❹
;; Consumer getting result: :done
❶ start-consumer implements a "spinning" loop that continuously checks to see if results are
ready. Thread/yield communicates the willingness of the current thread to yield the CPU so
that other threads can execute.
❷ start-producer starts a new future thread that delivers the result and changes the volatile! "ready"
to a truthy value.
❸ The spinning thread starts checking to see if there are results.
❹ As soon as the producer flips the "ready" flag, the consumer exits the loop and prints a message.
20.4 set!
set! is a mutating assignment special form. It works differently based on the type of the
mutable target. For example, set! can write into static or instance attributes of Java
classes (assuming fields are "public" and not "final" 243):
(import 'java.awt.Point) ; ❶
(def p (Point.)) ; ❷
[(. p x) (. p y)]
;; [0 0]
(set! (. p -x) 1) ; ❸
(set! (. p -y) 2)
[(. p x) (. p y)]
;; [1 2]
243
The following tutorial contains a good summary of Java field visibility options:
docs.oracle.com/javase/tutorial/java/javaOO/accesscontrol.html.
❶ java.awt.Point is a simple class from the Abstract Window Toolkit (AWT) Java standard library. It
contains the public instance fields "x" and "y".
❷ A Point initializer with no arguments initializes the point at the x=0, y=0 coordinates. We can read
the coordinates with (. p x): since "x" is a public field, Clojure accesses the field directly (a getX()
method of no arguments would be tried next if no such field existed).
❸ The dash sign "-" tells Clojure that "-x" refers unambiguously to the instance attribute "x" and not the
getter method "p.getX()". set! uses the instance attribute to set the new coordinates.
set! can also mutate thread-bound var objects (never the root value):
(def non-dynamic 1) ; ❶
(def ^:dynamic *dynamic* 1)
(set! non-dynamic 2) ; ❷
(set! *dynamic* 2)
(binding [*dynamic* 1] ; ❸
(set! *dynamic* 2))
❶ non-dynamic and *dynamic* are two vars. *dynamic* is declared dynamic and can accept locally bound
values.
❷ set! cannot change the root binding of the var in either case.
❸ It can however change the locally bound value of the dynamic var.
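For reference, evaluating the forms above behaves roughly like this sketch (the exception message is abbreviated):

```clojure
(def ^:dynamic *dynamic* 1)

;; Without a thread-local binding in place, set! has no root access:
;; (set! *dynamic* 2)
;; => IllegalStateException: Can't change/establish root binding ...

;; Inside binding, set! replaces the thread-local value and returns it.
(binding [*dynamic* 1]
  (set! *dynamic* 2))
;; 2
```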
❶ This anonymous function is using Java interop to call the toString method on the object "x". The call
produces a Java object that is printed on screen.
❷ Access to *warn-on-reflection* is granted to set! because *warn-on-reflection* is implicitly bound
to the currently running thread.
❸ After setting the var, any reflective call is promptly reported.
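A sketch of what these callouts describe (the reflective call is illustrative; the warning text varies by Clojure version):

```clojure
;; *warn-on-reflection* is thread-bound at the REPL, so set! can mutate it.
(set! *warn-on-reflection* true)
;; true

;; With no type hint on "x", the compiler cannot resolve the call target
;; and emits a reflection warning similar to:
;; Reflection warning - reference to toString can't be resolved.
(fn [x] (.toString x))
```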
❶ deftype is one of the few options in Clojure to create concurrently "unsafe" objects. Counter defines
a class with a private field "cnt".
❷ Functions defined as part of the interface of the class can set! the private field. In this case we make
the counter "callable" by implementing the clojure.lang.IFn interface.
❸ After creating a new counter instance, we can increment its content by invoking the object without
arguments.
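The Counter the callouts refer to can be sketched like this (a minimal reconstruction; the mutable-field annotation and method body are assumptions based on the callouts):

```clojure
(deftype Counter [^:unsynchronized-mutable cnt] ;; private mutable field
  clojure.lang.IFn
  (invoke [this]            ;; zero-argument invocation makes the object callable
    (set! cnt (inc cnt))    ;; set! is allowed on the field from inside the type
    cnt))

(def c (Counter. 0))
(c)
;; 1
(c)
;; 2
```

As the callout ❶ warns, such an object is concurrently "unsafe": nothing synchronizes access to cnt.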
21
Java Interoperation
The functions and macros in this chapter are predominantly related to Java
interoperation, the part of the standard library that sits closest to the Java runtime.
Clojure offers a functional view over Java’s mutable world with a syntax that
is concise and readable. The following is a summary of what is described in the
chapter:
• dot "." is a versatile special form to access Java methods and attributes. double-
dot ".." is useful to assemble chained calls, while "doto" concatenates side-
effecting calls instead.
• new creates a new instance of an object.
• try and catch (and related special forms finally and throw) are the fundamental
mechanism for exception handling in Clojure.
• ex-info and ex-data build on top of the Java exception mechanism by hiding some of
the low-level syntax and offering a mechanism to transport data with exceptions.
• bean is a function that wraps Java objects in a map-like interface.
• The functions in the clojure.reflect namespace perform introspection on the
structure and content of Java classes.
• Clojure also has a rich set of functions to deal with Java arrays. make-array is just
one of the many ways to create an array from Clojure. Functions and macros
like aset, amap or areduce are useful to transform the content of an array. The
section also contains a brief explanation of the array variants that leverage
primitive types (int, boolean, byte and so on).
Java objects. Several variations exist, with the target object appearing as the first
argument ("forward form"), or as the second ("inverted form"), or abbreviated with a
slash "/" ("slashed" form). The meaning of the expression also changes based on the
type of the arguments. Here is an exhaustive list of the alternatives:
❶
(. Thread sleep 1000) ; static method of 1 arg.
(. Math random) ; static field access first, static method of no args next.
(. Math (random)) ; static method of no args (unambiguously).
(. Math -PI) ; static field access (unambiguously).
(. Thread$State NEW) ; inner class static field.
❷
(Thread/sleep 1000) ; static method of 1 arg.
(Math/random) ; static field access first, static method of no args next.
(Math/-PI) ; static field access (unambiguously).
❸
(. (java.awt.Point. 1 2) x) ; instance field first, method of no args next
(.x (java.awt.Point. 1 2)) ; same as above
(. (java.awt.Point. 1 2) (getX)) ; instance method (unambiguously)
(.-x (java.awt.Point. 1 2)) ; instance field (unambiguously)
❶ The first group of dotted forms shows how to use them to access static members of Java classes. A
Java class could declare a static field and a static method of no arguments with the same name. While
in Java this generates no ambiguity, Clojure needs a way to distinguish between the two options. If not
specified, the basic form tries the static field first and the static method of no arguments
next. Field-only access can be specified by prefixing the field name with a dash "-". Method-only
access requires a pair of parentheses instead.
❷ The "slashed" form expands into the corresponding "dotted" form. It is shorter and easier to read. You
should prefer this form when possible.
❸ The last group shows how to use "." to access instance members. Similarly to static members, a dash
"-" prefix requests unambiguous access to the instance field in case an instance method of no
arguments with the same name exists. Parentheses request unambiguous access to the instance
method instead.
The "." is a special form and behaves like a macro in terms of argument evaluation. If
we need to call a method whose name is not known at compilation time, we
need to assemble the dotted form inside a macro:
(defmacro getter [object field] ; ❶
(let [getName# (symbol (str "get" field))] ; ❷
`(. ~object ~getName#)))
❶ In this example, we want to assemble the name of a getter method and call the
corresponding getX() by passing the field name as argument.
❷ The getter name is assembled in the let block as a symbol.
❸ Finally, we use syntax quote to expand the assembled dot notation into the final call.
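As an example of how the macro might be used (java.awt.Point and its standard getX getter are used here for illustration; the getter definition is repeated so the snippet is self-contained):

```clojure
;; getter as defined above:
(defmacro getter [object field]
  (let [getName# (symbol (str "get" field))]
    `(. ~object ~getName#)))

;; (getter p X) expands to (. p getX), i.e. p.getX()
(getter (java.awt.Point. 1 2) X)
;; 1.0
```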
Double dot ".." uses the result of invoking the first two arguments as the input for the
next call, and so on, until there are no more arguments. ".." is useful to connect a chain
of calls in which each call operates on the result of the previous one. A common case is
building instances using the Java builder pattern:
(import '[java.util Calendar Calendar$Builder])
(.. (Calendar$Builder.) ; ❶
(setCalendarType "iso8601")
(setWeekDate 2019 4 (Calendar/SATURDAY))
(build))
;; #inst "2019-01-26T00:00:00.000+00:00"
(macroexpand ; ❷
'(.. (Calendar$Builder.)
(setCalendarType "iso8601")
(setWeekDate 2019 4 (Calendar/SATURDAY))
(build)))
❶ The Calendar API in the Java standard library supports creation of arbitrarily complex calendar
instances through the builder pattern. The builder is created as the first instruction and the result of
each invocation is passed down to the next instruction, up to the final build call that returns the
assembled calendar. The returned calendar instance shows the date of the fourth Saturday of 2019.
❷ The macroexpansion of the same form shows the much less readable dotted form necessary to
execute the same instructions.
The last macro of this section is doto. doto resembles the double dot ".." macro on the
surface, but instead of passing on the result of the previous invocation, the target object
remains the same through all the steps of the chain. This is useful for repeated changes
to mutable objects, such as Java collections:
(import 'java.util.ArrayList)
(def l (ArrayList.))
(doto l ; ❶
(.add "fee")
(.add "fi")
(.add "fo")
(.add "fum"))
;; ["fee" "fi" "fo" "fum"]
❶ Elements are repeatedly added to the ArrayList without the need to thread "l" through all the forms.
21.2 new
new is a special form (for all practical purposes, a macro) that takes a variable number
of arguments: the first argument must be a symbol that resolves to a Java class and the
rest of the arguments are optional. The additional arguments after the first are passed
directly to the constructor of the class in the first argument:
(def sb (new StringBuffer "init")) ; ❶
(.append sb " item")

(str sb) ; ❷
;; "init item"
❶ A java.lang.StringBuffer is a mutable string with an option to append new fragments at the end of
an empty or initial string.
❷ Once the string is assembled correctly, it can be easily transformed back into an immutable string.
Constructor definitions for a class can come from the class itself or any superclass
(constructors in the child class are allowed to extend the corresponding constructor in
the superclass with additional behavior). In order to find the right constructor, Clojure
goes through a process of pattern matching against the number of arguments and their
types. Because of potential type inheritance on arguments, there could be multiple
matching constructors for the same call. One subtle example is java.lang.BigDecimal:
(let [l (Long. 1)] (new BigDecimal l)) ❶
;; java.lang.IllegalArgumentException: More than one matching method found
BigDecimal does not have a constructor with a java.lang.Long parameter, but it does
contain one for primitive int and one for primitive long. The problem from Clojure’s
perspective is that both are possible, as the boxed type Long could downsize to
primitive int or long without loss of precision. Passing a primitive long solves the
problem:
(let [l 1] (new BigDecimal l)) ❶
;; 1M
new has a shorter form that is normally recommended. It removes the need to use
the new symbol, which is replaced by a "." (dot) after the name of the class:
(StringBuffer. "init") ; ❶
;; #object[java.lang.StringBuffer 0x5fa1cc83 "init"]
❶ new has a shorter form that removes the need of using the keyword new itself.
Let’s talk about throw first, as it comes in handy for testing the other
forms. throw expects a single argument of type java.lang.Throwable or any subclass
(most notably, java.lang.Exception and java.lang.Error are both subclasses
of java.lang.Throwable):
(throw (Throwable. ": there was a problem.")) ; ❶
;; Throwable : there was a problem. ; ❷
(clojure.repl/pst) ; ❸
;; Throwable : there was a problem.
;; user/eval1927 (form-init5670973898278733609.clj:1)
;; user/eval1927 (form-init5670973898278733609.clj:1)
;; clojure.lang.Compiler.eval (Compiler.java:6927)
;; [....]
try takes a body expression (automatically wrapped in a do block) and any number
of catch clauses. The catch clause declares a type, a symbol and an expression. If
there is an exception in the outer try and a compatible catch clause, the corresponding
expression evaluates:
(try
(println "Program running as expected") ; ❶
(throw (RuntimeException. "Got a problem.")) ; ❷
(println "program never reaches this line") ; ❸
(catch Exception e ; ❹
(println "Could not run properly. Bailing out." e) ; ❺
"returning home")) ; ❻
❶ try supports any number of top level expressions, similarly to the do block.
❷ We throw an exception on purpose. This alters the normal program flow, for example skipping lines
that would be normally evaluated.
❸ As written, our small example is never able to print this message.
❹ The catch directive declares java.lang.Exception as the type of exception it is able to
handle. Other kinds of errors such as java.lang.Throwable or java.lang.Error won’t match the
clause.
❺ catch also contains an implicit do block. The local binding "e" is bound to the instance of the exception
that was captured.
❻ After printing a message, the catch block returns the evaluation of the last expression.
The rule of evaluation is type based: a clause matches if the exception type is the same
as, or a subclass of, the declared catch type. This allows for a "cascading" style of
catching, from more specific types to the most general:
(import '[java.net
          Socket InetAddress
          ConnectException SocketException])
(try
(Socket. (InetAddress/getByName "localhost") 61817) ; ❶
(catch ConnectException ce ; ❷
(println "Could not connect. Retry." ce))
(catch SocketException se ; ❸
(println "Communication error" se))
(catch Exception e ; ❹
(println "Something weird happened." e)
(throw e)))
❶ Our application needs to connect to some local port for communication. Several things can go wrong:
for example, the other party might not be ready and still starting up at the given port, or we might be
able to connect and then be unable to communicate.
❷ If we are unable to connect, it means that for some reason the other application is not listening. It
could be temporary and it could be worth installing a "retry" mechanism such as the with-backoff
presented here. Note that we also put the exception object in the printout. If we don’t, we
might throw away important information inside the body of the exception (also called "swallowing
exceptions").
❸ After successfully connecting to the socket, there could be problems related to unexpected packets
being sent through the channel. We can handle this condition separately, perhaps retrying the
connection again.
❹ Finally, the general Exception clause catches any other condition. Note that instead of just printing a
message, we also re-throw the same exception. Re-throwing signals that this block of code is unable
to handle the exception but some other code upstream could.
finally is a special clause that is always evaluated, even when there is
no exceptional condition. The try-finally pattern is quite popular for handling
resources that must always be released. with-open is an excellent example of this
behavior and it uses finally internally (similarly, locking must always release the lock
after execution of a critical region). In the following example, we build a
simplified with-open that deals with java.io.Reader objects only:
(require '[clojure.java.io :as io])
(with-reader r "/etc/hosts"
(last (line-seq r)))
❶ The macro with-reader binds the first parameter to an open java.io.Reader instance using a file
as input source.
❷ When "body" evaluates, the parameter "r" is bound to an open reader.
❸ finally guarantees that the reader instance is released correctly, even in case of exception while
evaluating the body.
finally can be used alone or as the final statement after a group of catch clauses. The
entire expression returns the evaluation of the matching catch but the finally block
always executes for side effects:
(try
(/ 1 0)
(catch Exception e "Returning from catch") ; ❶
(finally (println "Also executing finally"))) ; ❷
;; Also executing finally
;; "Returning from catch"
❶ The result of the entire expression is the string evaluated in the catch block.
❷ The finally block still evaluates even after the result is returned.
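The creation of the exception the next callouts refer to was elided; it presumably resembled the following (the message and map contents are illustrative):

```clojure
(def ex (ex-info "Something went wrong" {:cause :demo})) ; ❶
```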
(type ex) ; ❷
;; clojure.lang.ExceptionInfo
❶ ex-info works as a constructor for the exception type clojure.lang.ExceptionInfo. Along with a
name it accepts a map that can contain any sort of information.
ex-info hides any Java interoperation detail necessary to create the exception object.
The only information required is a message and the metadata map. It optionally accepts
a third argument to capture or re-throw a root-cause exception:
(try
(/ 1 0)
(catch Exception e ; ❶
(throw
(ex-info "Don't do this."
{:type "Math"
:recoverable? false} e)))) ; ❷
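The next example relies on a helper, randomly-failing-operation, whose definition was elided. A possible reconstruction, roughly consistent with the printed output below (the messages, the data maps and the failure probability are assumptions):

```clojure
(defn randomly-failing-operation [] ; ❶
  (throw
   (ex-info "Problem."
            {:type :connection
             :recoverable? (zero? (rand-int 2))}
            (ex-info "Weak connection."
                     {:type :connection :recoverable? false}))))
```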
(defn main-program-loop []
(try
(println "Attempting operation...")
(randomly-failing-operation)
(catch Exception e
(let [{:keys [type recoverable?]} (ex-data e)] ; ❷
(if (and (= :connection type) recoverable?)
(main-program-loop)
(ex-info "Not recoverable problem."
{:type :connection} e))))))
(main-program-loop) ; ❸
;; Attempting operation...
;; Attempting operation...
;; #error {
;; :cause "Weak connection."
;; :data {:type :connection, :recoverable? false}
;; :via
;; [{:type clojure.lang.ExceptionInfo
;; :message "Not recoverable problem."
;; :data {:type :connection}
;; :at [clojure.core$ex_info invokeStatic "core.clj" 4617]}
;; {:type clojure.lang.ExceptionInfo
;; :message "Problem."
;; :data {:type :connection, :recoverable? false}
;; :at [clojure.core$ex_info invokeStatic "core.clj" 4617]}]
From the example above we can see that nested ex-info exceptions add up nicely
when examining the stack trace: we can see a "Not recoverable problem." caused by a
"Weak connection." and so on for all the nested exceptions.
The last function in this section is Throwable->map, a useful function to transform the
fragmented information inside a hierarchy of exceptions into a nice Clojure data
structure:
(def error-data ; ❶
(try (throw (ex-info "inner" {:recoverable? false}))
(catch Throwable t
(try (throw (ex-info "outer" {:recoverable? false} t))
(catch Throwable t
(Throwable->map t))))))
(keys error-data) ; ❷
;; (:cause :via :trace)
(:cause error-data) ; ❸
;; "inner"
(:via error-data) ; ❹
;; [{:type clojure.lang.ExceptionInfo,
;; :message "outer",
;; :at [clojure.core$ex_info invokeStatic "core.clj" 4617],
;; :data {:recoverable? false}}
;; {:type clojure.lang.ExceptionInfo,
;; :message "inner",
;; :at [clojure.core$ex_info invokeStatic "core.clj" 4617],
;; :data {:recoverable? false}}]
❶ error-data simulates two nested exceptions for illustrative purposes. Throwable->map converts the
last java.lang.Throwable instance to a Clojure map.
❷ error-data has 3 keys containing different aspects describing the error.
❸ The :cause key shows the root cause error description, in this case inner.
❹ :via is a vector of maps. Each map contains the basic data of each exception instance.
Each :data key contains metadata if the exception was created with ex-info.
❺ The :trace key is another vector containing one entry for each entry in the stack trace of the
exception.
21.5 bean
bean is a function that creates a map-like representation of the attributes available in an
object:
(import 'java.awt.Point)
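The elided example presumably resembled the following (the coordinates are illustrative):

```clojure
(def p (Point. 1 2))
(bean p) ; ❷
;; {:class java.awt.Point, :location #object[java.awt.Point ...], :x 1.0, :y 2.0}
(:x (bean p)) ; ❸
;; 1.0
```

Note that the JavaBean accessors getX() and getY() return doubles, which is why the map contains 1.0 and 2.0 rather than the integers we passed to the constructor.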
❶ java.awt.Point is a simple class in AWT, the original (and basic) Java Abstract Windowing Toolkit.
It exposes x,y coordinates as public attributes, including the methods getX() and getY() to access
them.
❷ bean creates a proxy instance implementing Clojure’s map-like features, including keys for public
attributes.
❸ We can now use map-like keys to access the attributes of the original class.
bean uses introspection to analyze the content of a Java class through the JavaBean
programming interface (see the call-out below for additional information on
JavaBeans). It follows that not all available attributes are visible, but just those
exposed through the JavaBean standard. Compare for example the following:
(bean (Object.)) ; ❶
;; {:class java.lang.Object}
(import 'javax.swing.JButton)
(pprint (bean (JButton.))) ; ❷
;; {:y 0,
;; :selectedObjects nil,
;; :componentPopupMenu nil,
;; :focusable true,
;; :managingFocus false,
;; :validateRoot false,
;; :requestFocusEnabled true,
;; :containerListeners [],
;; :rolloverSelectedIcon nil,
;; :iconTextGap 4,
;; :mnemonic 0,
;; :debugGraphicsOptions 0,
;; [...]
❶ bean can process any type of object, but only those using the JavaBean conventions provide useful
information. Here we use bean on a basic Object instance which only contains the default :class key.
❷ javax.swing.JButton contains many getter methods (those starting with "get" and then the name of
a property) that bean can use to extract a large map of key-property values.
What is a JavaBean?
Introspection (also called reflection) is a Java feature to access a class structure at runtime, for example
listing methods, constructors or attributes. With reflection, a Java program can query the structure of
Java classes, or even invoke constructors to create new objects. Java took introspection to the next
level by introducing JavaBeans in 1996, not long after Java was born.
The JavaBean standard specifies a set of conventions that a Java class should obey 244 :
• All attributes should be private but accessible through getters/setters methods.
• It should provide a public, no-argument constructor.
• It should implement the java.io.Serializable interface.
Provided with this set of conventions, tools can find, analyze, instantiate and call methods on JavaBeans.
Tools can also store a JavaBean as bytes or send them through the network. JavaBeans are popular with
graphical interfaces, where they offer automatic discovery capabilities.
bean is a useful tool to keep in mind for quick conversions of Java bean-like objects
into Clojure maps. However, because of its heavy use of reflection, you should
avoid bean in critical parts of the code where speed is important.
reflect is just a thin wrapper around type-reflect that takes care of calling class on
object instances. reflect accepts the following options: an :ancestors key and
a :reflector key. With :ancestors, reflect retrieves super-classes and super-
interfaces at all levels, not just directly above the target class:
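These examples assume clojure.reflect is loaded under the r alias (the require was elided in this extraction):

```clojure
(require '[clojure.reflect :as r])
```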
(:ancestors (r/reflect clojure.lang.APersistentMap :ancestors true)) ; ❶
;; #{java.lang.Object clojure.lang.Associative
;; java.util.concurrent.Callable java.util.Map clojure.lang.ILookup
;; java.lang.Runnable clojure.lang.IPersistentCollection
;; clojure.lang.IHashEq clojure.lang.IFn clojure.lang.MapEquivalence
;; clojure.lang.Counted clojure.lang.IPersistentMap clojure.lang.Seqable
;; java.io.Serializable clojure.lang.AFn java.lang.Iterable}
244
The JavaBean specification is available from www.oracle.com/technetwork/java/javase/documentation/spec-
136004.html
❶ When :ancestors true is present, reflect retrieves super-classes and super-interfaces beyond
the first hierarchy level and collects the result in the additional :ancestors key.
❷ When :ancestors true, the :members key additionally includes methods declared in any of the
ancestors. We can see the number going up from 29 to 137 collected members (methods or
constructors).
The :reflector key allows specifying a different "Reflector". By default, reflect uses
Java reflection (which requires loading the class into memory for inspection). But there
are other libraries offering similar or better capabilities that can be plugged
into reflect using the :reflector key. Here's, for example, how to create a dummy
reflector for illustration purposes:
(deftype StubReflector [] ; ❶
r/Reflector
(do-reflect [this typeref]
{:bases #{} :flags #{} :members #{}}))
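A hypothetical usage of the stub, passed through the :reflector option:

```clojure
(r/reflect String :reflector (StubReflector.))
;; {:bases #{}, :flags #{}, :members #{}}
```

Whatever map do-reflect returns becomes the result of reflect, which is how alternative bytecode-inspection libraries can be plugged in.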
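The definitions of "a" and "b" were elided; given the outputs below, they were presumably created with make-array along these lines (a reconstruction — the element types are assumed from the default values shown):

```clojure
(def a (make-array Boolean/TYPE 3)) ; primitive booleans, default false
(def b (make-array Object 3))       ; reference types, default nil
```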
(vec a) ; ❷
;; [false false false]
(vec b)
;; [nil nil nil]
Any additional number after the last argument triggers the creation of a multi-
dimensional array:
(def a (make-array Integer/TYPE 4 2)) ; ❶
(mapv vec a) ; ❷
;; [[0 0] [0 0] [0 0] [0 0]]
245
Java has a default value for all primitive types (for example 0 for int, false for boolean and "null" for all reference
types).
❶ The presence of 2 integers prompts make-array to create a two-dimensional array. The structure
contains 4 arrays of 2 integers each.
❷ This time, to convert back into a vector, we need to call vec on each item in the initial array.
WARNING make-array has some (reasonable) limitations: the number of dimensions cannot be more
than 255, only zero or positive integers are accepted and the requested array should fit into the
available memory.
make-array is the most general function to create empty arrays (where empty means
array initialized with the default value for the requested type). The need for arrays of
primitive types is so common that an entire collection of array initializers has been
dedicated to that task. We are going to see them in the following section.
21.7.2 object-array and other typed initializers
NOTE This section also mentions other related functions such as: int-array, boolean-array,
byte-array, short-array, char-array, long-array, float-array and double-array.
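The elided examples for the callouts below might have looked like this (a sketch; the sizes are illustrative):

```clojure
(vec (object-array 3)) ; ❶
;; [nil nil nil]
(vec (char-array 3))   ; ❷ three NUL characters
(vec (double-array 3)) ; ❸
;; [0.0 0.0 0.0]
```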
❶ object-array creates an empty array of the given size. Here the created array of type "Object"
contains 3 nil elements.
❷ char-array initializes the array to the non-printable character corresponding to ASCII table index "0".
❸ double-array creates the array initialized to double type "0.0".
By passing a second parameter of compatible type, we can initialize the array with a
specific value instead of the default provided by Java:
(vec (float-array 3 1.0)) ; ❶
;; [1.0 1.0 1.0]
❶ When a second parameter is present and is compatible with the type of the created array, the value is
used to initialize all the items in the array.
When the first parameter is a sequential collection instead of a number, the content of
the input collection is copied into the newly created array:
(vec (int-array [1 2 3])) ; ❶
;; [1 2 3]
❶ When int-array's first argument is a sequential collection, an array of int is created containing
the items from the input.
The items in the input collection need to be compatible with the type of the array. If
they are not compatible (that is, there is no cast operation that transforms the input into
the correct type), an exception is thrown:
(vec (int-array [\a \b \c])) ; ❶
;; ClassCastException java.lang.Character cannot be cast to java.lang.Number
❶ The items in the input collection need to be compatible with the type declared by the array.
Other cases are subtle, as Java tolerates precision loss for certain types:
(vec (int-array [4294967296])) ; ❶
;; [0]
❶ The input collection contains a large number (2^32 in this case, which is beyond int capacity). Java
truncates the most significant bits to fit the large number in the 32 available bits. The result of the
truncation is the number "0".
Finally, we can pass both a size and an input collection. If the content of the collection
is not sufficient to fill the resulting array, values fall back to the default:
(vec (int-array 5 (range 10))) ; ❶
;; [0 1 2 3 4]
❶ In this case, the resulting array can only fit 5 of the available 10 items in the input collection.
❷ If there is not enough input instead, the rest of the items in the array is initialized with the default value
for that primitive type.
In the next section we are going to see other ways to create arrays starting from an
existing input collection.
21.7.3 to-array, into-array, to-array-2d
to-array is very similar to object-array, but to-array has specialized algorithms to
transform the input collection. The following benchmark compares to-array and
object-array. The output is the same, but to-array is faster:
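A sketch of the comparison (the vector size is illustrative; criterium is assumed to be available, as in the other benchmarks in this book):

```clojure
(require '[criterium.core :refer [quick-bench]])

(def v (vec (range 1000)))
(quick-bench (to-array v))     ; ❶ delegates to the vector itself
(quick-bench (object-array v)) ; ❷ iterates a sequence of the input
```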
❶ to-array performs twice as fast as object-array when the input collection is a vector.
❷ object-array is still a valid option to create an object-array without the need of an input collection.
The speed-up is not guaranteed but depends on the type of the input. When
possible, to-array delegates the creation of the array to the collection, while object-
array always transforms the input into a sequence and then iterates its content.
Prefer to-array to create an object array out of a collection, while object-array still
offers the option to create an empty object array.
into-array is similar to to-array, but it will try to guess or force a specific type for
the output array:
(type (to-array [1 2 3])) ❶
;; [Ljava.lang.Object;
(type (into-array [1 2 3])) ❷
;; [Ljava.lang.Long;
❶ to-array always creates a new Object array independently from the type of the items in the input
collection.
❷ into-array better specializes the type of the output array choosing java.lang.Long instead of the
more generic java.lang.Object.
into-array uses the type of the first element to guess an appropriate type for the
output array. Once the type has been fixed, all following elements need to be of the
same type:
(into-array [1 2 (short 3)]) ; ❶
;; IllegalArgumentException array element type mismatch
❶ into-array does not accept mixed type arrays. This is true even for type-compatible casting such
as short to long.
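The definition of "a" was elided; it presumably forced the component type explicitly, for example:

```clojure
(def a (into-array Short/TYPE [1 2 3])) ; ❷ and ❸ below inspect this array
(into-array Short/TYPE [1 2 100000])    ; ❹
;; IllegalArgumentException: Value out of range for short
```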
(type a) ; ❷
;; [S
(map type a) ; ❸
;; (java.lang.Short java.lang.Short java.lang.Short)
❹ into-array throws an exception if we try to coerce a number too large to fit in the 16 bits allocated for
a short type.
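The definition of "a" for the next example was elided; given the outputs, it was presumably built with to-array-2d:

```clojure
(def a (to-array-2d [[1 2] [3 4]])) ; ❶
```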
(map type a) ; ❷
;; ([Ljava.lang.Object; [Ljava.lang.Object;)
(mapv vec a) ❸
;; [[1 2] [3 4]]
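The array used in the next example was also elided; it was presumably an object array of keywords:

```clojure
(def a (object-array [:a :b :c]))
```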
(aget a 0) ; ❶
;; :a
(aset a 0 :changed) ; ❷
;; :changed
(vec a) ; ❸
;; [:changed :b :c]
❶ We can use aget to access the index "0" from the array "a". aget returns the content of the array at
that index.
❷ With aset we can write to a specific location in the array. aset returns the item that was just written.
❸ Mutable arrays operate very differently from immutable Clojure collections: after writing the array
location with aset the array has permanently changed.
If the array is multi-dimensional, aget and aset accept additional indexes to access
the nested arrays:
(def matrix (to-array-2d [[0 1 2] [3 4 5] [6 7 8]])) ; ❶
(aget matrix 1 1) ; ❷
;; 4
In the following example, we are going to see aget, aset and alength in action to
produce the transpose of a square matrix. The transpose of a matrix is a common
transformation that flips each item over its diagonal 246. For speed purposes, we decide
to implement the matrix as a mutable array of doubles:
(defn transpose! [matrix] ; ❶
(dotimes [i (alength matrix)]
(doseq [j (range (inc i) (alength matrix))]
(let [copy (aget matrix i j)] ; ❷
(aset matrix i j (aget matrix j i))
(aset matrix j i copy)))))
(def matrix ; ❸
(into-array
(map double-array
[[1.0 2.0 3.0]
[4.0 5.0 6.0]
[7.0 8.0 9.0]])))
(transpose! matrix)
(mapv vec matrix) ; ❹
❶ transpose! swaps items in place without the need to create a copy of the matrix. This version of the
algorithm is particularly effective for large matrices, as they don't need to be duplicated in memory at any
given point. As a side effect, the input matrix is permanently changed, a fact that the name ending
with an exclamation mark tries to convey.
❷ We need a total of 2 aget and 2 aset operations. Note that we can use dotimes for the outer index
(across the length of the side of the square matrix) and doseq for the inner loop, which needs to start
from the outer index plus one.
❸ We can’t use “to-array, into-array, to-array-2d” to create a two-dimensional array of type double. We
need double-array for the inner array and into-array for the outer array (because into-array can infer
the type at this point).
❹ After calling transpose! we can have a look at the result transforming the array into a vector of
vectors.
In the version of transpose! above we decided to mutate the array in place. The
solution has the advantage of not requiring a copy of the entire matrix at the price of a
side effect. We could re-work the example to transpose the array into a new copy and
avoid any side effect with aclone:
(defn transpose [matrix]
(let [size (alength matrix)
output (into-array (map aclone matrix))] ; ❶
(dotimes [i size] ; ❷
(dotimes [j size]
(aset output j i (aget matrix i j))))
output))
246
(see en.wikipedia.org/wiki/Transpose).
(def matrix
(into-array
(map double-array
[[1.0 2.0 3.0]
[4.0 5.0 6.0]
[7.0 8.0 9.0]])))
❶ transpose is a rework of the function transpose! seen previously. This version does not require the
mutation of the input array. We use aclone to clone each of the inner arrays that compose the matrix.
All changes can now happen on a brand new array.
❷ The new approach simplifies looping over the coordinates "i" and "j", which is just assigning the value
found at "[i,j]" to the inverted coordinates "[j,i]" on the output.
❸ Note that the transposed matrix is now the output of the function. The "matrix" array is left untouched.
aclone is not the only approach to the problem of transposing a matrix. We could
create an empty multi-dimensional array instead of cloning the input. Note however
that aclone has the advantage of handling the type of the copied array without the need
for transpose to mention double-array (or another typed function) explicitly.
Calculating the transpose matrix (as in the example above) requires knowing the length of
the array. Clojure provides a specific function, alength, for that purpose. count would
also work, but it would generate a reflective call in all cases, as count is designed to
receive a generic object without a compile-time notion of its type. When the type of the
array is known, alength can take advantage of that information. Have a look at the
following benchmark:
(require '[criterium.core :refer [quick-bench]])
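A sketch of the elided benchmark (the array size is illustrative; timings will vary):

```clojure
(def a (int-array (range 1000)))
(quick-bench (count a))         ; ❶ generic, reflective path
(quick-bench (alength ^ints a)) ; ❷ type-aware length access
```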
❶ count works with arrays, but it's not the most effective way to calculate the length, as the benchmark
demonstrates.
❷ alength knows the type of the array and delegates the call accordingly without the need of reflection.
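The defining forms for the next example were elided; they presumably resembled:

```clojure
(def a1 (int-array (range 10)))
(def a2 (amap a1 idx output 1)) ; ❶ every element becomes 1
```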
(vec a2)
;; [1 1 1 1 1 1 1 1 1 1]
(vec a1) ; ❷
;; [0 1 2 3 4 5 6 7 8 9]
❶ This simple transformation replaces each element with the number 1. Note that we are not using "idx"
or "output". See below for an explanation of the meaning of these two parameters.
❷ The source array is unaltered after the transformation.
amap takes the array, an index symbol, a result symbol and an expression. The expression
evaluates once for each element in the input, with the index symbol bound to the
current index and the result symbol bound to the output array under construction:
(def a1 (int-array (range 4)))
❶ With the help of debug, we can see how "idx" and "output" are changing each iteration. The value of
"output" displayed is the one just before the new value is assigned to the corresponding index.
We can use the index symbol to access the input array and perform a transformation of
the old item into the new:
(defn ainc [a]
(amap a idx _ (inc (aget a idx)))) ; ❶
❶ ainc (for "Array Increment") adds one to each element in a numeric array. "idx" is bound by amap to
the current index in the array and we can use it to transform the input item at that index.
The output symbol represents the output array under construction at each stage through
the iteration. We could use this information to prevent any further changes when the
sum of the updated items goes beyond a certain limit:
(defn asum-upto [a i] ; ❶
(loop [idx 0 sum 0]
(if (= idx i)
sum
(recur (inc idx) (+ sum (aget a idx))))))
❶ asum-upto is a function that sums the numbers in a numeric array starting at index 0 up to index "i".
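The companion amap-upto (elided in this extraction) presumably combined amap with asum-upto along these lines. This is a reconstruction: the squaring transformation and the function name are assumptions based on the callouts below.

```clojure
(defn amap-upto [a limit] ; ❷
  (amap a idx output
        (let [item (aget a idx)
              new-item (* item item)]
          (if (<= (+ (asum-upto output idx) new-item) limit)
            new-item
            item))))

(vec (amap-upto (int-array (range 10)) 20)) ; ❸
```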
❷ The amap expression needs to calculate the current item at the index, the value of the new item after
the transformation and the sum of all the transformed items so far. Only if that sum does not exceed
the limit is the current item updated.
❸ We can see that after squaring a few numbers, the processing "stops" (it doesn't actually stop, it
continues to update items with a copy of themselves).
(alength a) ; ❶
;; Reflection warning, call to static method alength on clojure.lang.RT can't be
resolved.
;; 10
❶ When using alength on an array of integers, we get a reflection warning related to the fact that
Clojure can’t determine the type of "a".
The array "a", stored in a clojure.lang.Var object in the current namespace, has to be retrieved through
an automatic indirection: the symbol "a" is looked up in the current namespace mappings and resolves
to a var object. The Clojure runtime then calls deref on the var object
automatically. deref is a function returning a java.lang.Object, because there is no notion of what the var
refers to. Hence Clojure does not know which alength type to call and falls back to the object version,
which uses reflection.
To avoid the reflective warning we need a type hint:
(alength ^"[I" a) ; ❶
;; 10
(alength ^ints a) ; ❷
;; 10
❶ How to type hint an array of primitive integers using Java array class encoding.
❷ Clojure supports a shortcut version for all array types, so we don’t need to remember the Java
encoding for array classes.
Another important factor to consider to improve performance is avoiding autoboxing. Let's go back to
the function amap-upto that was used in the examples and have a look at reflection and boxing
warnings:
; ❷
Reflection warning, method aclone on RT can't be resolved (argument types: unknown).
Reflection warning, alength on RT can't be resolved (argument types: unknown).
Boxed math warning, boolean Numbers.lt(long,Object).
Reflection warning, aget on RT can't be resolved (argument types: unknown, int).
Boxed math warning, Number Numbers.unchecked_add(Object,Object).
Boxed math warning, boolean Numbers.gt(Object,Object).
Reflection warning, aset on RT can't be resolved.
❶ Along with warn-on-reflection, another useful warning to see is unchecked-math in the :warn-
on-boxed variant. :warn-on-boxed shows a message each time Clojure has a primitive type variant
of a function that can’t be selected.
❷ We can see a long list of reflection and boxed math problems to solve.
Problems like those in amap-upto require some practice to solve. First of all, you need to learn the
syntax for type hinting 247 and, second, to follow the compiler messages to understand which parts
of the code require them. Luckily, the compiler is quite precise in indicating where the information is
missing and sometimes a single type hint solves several warnings at once. Here's a version of amap-
upto that solves all warnings:
247
one good place to start is clojure.org/reference/java_interop#typehints
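The fully hinted version was elided in this extraction; it might look like the following sketch, which re-declares asum-upto with hints as well. The hints at the function boundaries (^ints on the arrays, ^long on the numeric arguments) remove the reflective and boxed-math warnings, and the (int ...) cast keeps the stored value compatible with the int array:

```clojure
(defn asum-upto ^long [^ints a ^long i]
  (loop [idx 0 sum 0]
    (if (= idx i)
      sum
      (recur (inc idx) (+ sum (aget a idx))))))

(defn amap-upto [^ints a ^long limit]
  (amap a idx output
        (let [item (aget a idx)
              new-item (* item item)]
          (if (<= (+ (asum-upto output idx) new-item) limit)
            (int new-item)
            item))))
```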
The lesson to learn from this quick exercise in "warnings elimination" is that, especially when working
with native arrays, you need to pay attention to type hints at function boundaries. Additionally, when
using items from the array, there is often the risk of unwanted boxing of primitives.
We are going to close the section having a look at areduce. As you probably imagine
from the name, this is a special reduce version for arrays. It works similarly
to amap with an additional "init" parameter:
(def a (int-array (range 10)))
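The elided call presumably summed the elements, along these lines:

```clojure
(areduce a idx acc 0 (+ acc (aget a idx))) ; ❶
;; 45
```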
❶ areduce binds "idx" and "acc" during the internal iteration, so the expression can access them. This is
similar to amap, but "acc" (the accumulator) contains the sum so far instead of the output array. The
new parameter is the initial value for the computation.
The same recommendations about type hinting are valid when using areduce.
21.7.6 aset-int and other typed setters
NOTE This section also mentions other related functions such as: aset-boolean, aset-byte, aset-
short, aset-char, aset-long, aset-float, aset-double
The group of functions in this section is related to aset and the need to avoid
reflective calls. Observe the following example:
(set! *warn-on-reflection* true) ; ❶
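The elided forms for ❷ and ❸ presumably resembled:

```clojure
(def a (int-array (range 10))) ; ❷
(type a) ; ❸
;; [I
```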
(aset a 0 9) ; ❹
;; Reflection warning, call to static method aset on clojure.lang.RT
;; can't be resolved (argument types: unknown, int, long).
;; 9
❶ warn-on-reflection turns on compiler warnings related to reflective calls. If the compiler is unable
to determine the type of one or more operands, it will issue a reflection warning. The warning does not
affect the correctness of the program, but simply warns the user that, in order to find the right method to
dispatch to, the compiler had to use reflection.
❷ We create a simple array of primitive integers.
❸ The type of the array confirms that this is indeed an array of primitives.
❹ When we use aset on the array, the compiler issues a warning that it cannot find the
right clojure.lang.RT/aset call to dispatch to. The compiler also provides details such as the types
of the parameters that were searched for. The "unknown" refers to "a", the primitive "int" is the index
and the number "9" was passed from the reader as a primitive "long". Note that the array is modified
anyway and the item that was just written is returned.
The reflective problem above is produced by implicit var indirection. Let’s rewrite the
example so the implicit run time steps become explicit:
(def a-lookup (get (ns-map *ns*) 'a)) ; ❶
(type a-lookup) ; ❷
;; clojure.lang.Var
❶ def produces an additional side effect apart from creating an instance of clojure.lang.Var: a new entry
is created in the local namespace mappings (*ns* refers to the current namespace object). The
entry is a pair with the symbol "a" as key and the var object as value. We can use ns-map to retrieve
the var from the mappings.
❷ a-lookup has the expected "Var" type, as well as the content of the var, which is the expected array of
integers.
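Spelling out the indirection, the expression the compiler effectively sees is roughly:

```clojure
(aset (deref a-lookup) 0 9) ; ❸ deref returns a plain Object
```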
❸ This is what aset actually sees at compile time.
When Clojure compiles the code above, it can only see that deref returns an object
(because a var could really point at anything), and even though specialized aset
implementations exist for different primitive types, that information is now lost.
There are two ways to remove the reflective call: first, we could type hint a-lookup as an
array of primitive integers; second, we could use aset-int (or any other typed aset-
* call for other primitive types). The case of type hinting the array is so common
that Clojure offers aset-int to make things easier:
(aset ^ints a 0 9) ; ❶
;; 9
(aset-int a 0 9) ; ❷
;; 9
❶ In the first case, we type hint "a" so it carries the information regarding its type at compile time and
Clojure can emit a call to the right aset specialization (in clojure.lang.RT).
❷ Alternatively, we can use aset-int and achieve the same result.
aset-int and the other typed aset-* functions grouped in this chapter have the same semantics
as aset. In particular, they support multiple indexes for multi-dimensional arrays:
(def matrix ; ❶
(into-array
(map int-array [[1 2 3] [4 5 6]])))
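A reconstructed usage with multiple indexes (the value and position are illustrative):

```clojure
(aset-int matrix 1 2 9) ; two indexes: row 1, column 2
;; 9
(mapv vec matrix)
;; [[1 2 3] [4 5 9]]
```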
As expected, the type of the array should be consistent with the specific aset-* version
we are trying to use. aset-* functions follow Java conventions for allowed casts.
Forcing a type requiring less precision into one requiring more precision is
always possible, but the opposite is not true:
(def int-a (int-array 5))
(def double-a (double-array 5))
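The elided calls likely resembled the following (a sketch):

```clojure
(aset-int double-a 0 99)   ; ❶ the int 99 widens to 99.0
;; 99
(aset-double int-a 0 99.0) ; ❷
;; IllegalArgumentException
```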
❶ We can aset-int into an array of double, because the implicit int to double conversion fits nicely
into the number of bytes allocated for the items in the array.
❷ The opposite, however, is not true. The type conversion would require a loss of precision for
the double number "99.0" to fit into the bytes allocated for a primitive int type.
Similarly to int-array (and the related *-array functions) and aset-int (and the related aset-
* functions), Clojure also offers a group of specialized cast functions.
Clojure provides these specialized primitive type functions to help when working
with primitive arrays. By using the specialized version of a function we can provide the
necessary type information without type hinting, a nice plus in terms of general
readability. Here's, for example, a function asum to sum numeric arrays:
(defn asum [a1 a2]
(let [a (aclone (if (> (alength a1) (alength a2)) a1 a2))]
(amap a idx ret
(aset a idx
(+ (aget a1 idx) (aget a2 idx))))))
The asum function works generically on any array type, but performs poorly because of
many reflective calls and boxing of primitive types:
(set! *warn-on-reflection* true) ; ❶
(set! *unchecked-math* :warn-on-boxed)
❶ If, after turning on reflective warnings and boxing warnings, we re-evaluate the same asum function, we
get an impressive list of warnings.
If we can make the assumption that asum is always going to sum integer arrays, we can
use the corresponding specialized casts:
(defn asum-int [a1 a2]
(let [a1 (ints a1) a2 (ints a2) ; ❶
a (aclone (if (> (alength a1) (alength a2)) a1 a2))]
(amap a idx ret
(aset a idx
(+ (aget a1 idx) (aget a2 idx))))))
;; #'user/asum-int ; ❷
❶ let bindings preserve and propagate type information to downstream forms. By casting "a1" and "a2"
with ints we give the compiler all the necessary information for the input arrays.
❷ After using ints on the inputs, all warnings disappear.
Primitive array casting is useful when the corresponding type hint is not available or its
placement in the form is not trivial to understand.
22 The Toolbox
The last chapter of the book collects a variety of functions in the standard library
dedicated to solving specific problems. Some of the functions or macros in this chapter
have been used throughout the book, but we are going into greater detail here. The
functions are grouped by their originating namespaces. Here's an overview:
• clojure.xml: we used a few XML-related functions throughout the book. We are
going to review those functions specifically, along with others that are available in the
namespace.
• clojure.inspector contains a few facilities to visualize data structures.
• clojure.repl: the repl namespace contains useful helper functions dedicated to
improve the REPL experience.
• clojure.main contains the actual REPL implementation and a set of primitives to
customize the REPL experience.
• clojure.java.browse contains a single public function browse-url to open a native
browser given a specific URL.
• clojure.java.shell: the most important function in this namespace is sh, a function
to "shell-out" commands to the native operating system.
• clojure.core.server contains the implementation of a socket server, a service that
offers Clojure evaluation (similarly to the REPL) to remote clients.
• clojure.java.io contains Clojure wrappers to manage the Java IO (Input/Output)
system including files, streams, the classpath and more.
• clojure.test is the testing framework that ships with Clojure. clojure.test is
configurable and extensible.
• clojure.java.javadoc contains facilities to access Java documentation using the
default system browser.
22.1 clojure.xml
The main goal of clojure.xml is to produce an in-memory data representation of an
XML input source. In doing so the XML content is loaded into memory for further
processing. xml/parse is the main entry point that produces a Clojure data structure
from an XML document:
(require '[clojure.xml :as xml]) ; ❶
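The parsing step was elided; it presumably resembled the following (the path is hypothetical):

```clojure
(def document (xml/parse "sample.xml")) ; ❷
```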
(keys document) ; ❸
;; (:tag :attrs :content)
xml/parse builds a nested structure of struct-map and vectors with the following
recursive structure:
{:tag ... ; ❶
:attrs ...
:content
[{:tag ... :attrs ... :content [...]}
...
{:tag ...
:attrs ...
:content
[{:tag ... :attrs ... :content [...]}
...
{:tag ... :attrs ... :content [...]}]}]}
Each struct represents an XML node. The :tag key is the name of the node,
the :attrs key contains a collection of the attributes of the node and, finally, :content
contains a list of child nodes.
NOTE The use of structs in clojure.xml is one of the few in the standard library (other notable
examples are resultset-seq or cl-format). defrecords effectively replace the need for
defstruct and structs.
(def conforming ; ❶
"<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html SYSTEM 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd'>
<html xmlns='http://www.w3.org/1999/xhtml'>
<article>Hello</article>
</html>")
(def xml ; ❷
(-> conforming .getBytes io/input-stream xml/parse))
❶ This conforming snippet of XHTML (an XML-compliant dialect of HTML) requires an external DTD for
validation. The DTD file contains the specification of the syntax expected in the XHTML.
❷ For this experiment, we load the XHTML directly from a string. xml/parse interprets a string argument
as a URL, which is not what we want here. So we need a quick transformation of the string into an
input-stream, using a function available from clojure.java.io.
❸ If we have networking problems, we soon discover that parsing could take up to 60 seconds or more
(depending on the JDK settings).
Parsing XML under validation constraints is usually a wise feature to have, especially in
production environments. However, for testing or development, we might want to avoid
incurring intermittent networking issues. We can disable XML validation by passing
a non-validating parser function to xml/parse:
(import '[javax.xml.parsers SAXParserFactory])
(defn non-validating [s ch] ; ❶
  (.. (doto (SAXParserFactory/newInstance) ; ❷
        (.setFeature "http://apache.org/xml/features/nonvalidating/load-external-dtd" false))
      (newSAXParser)
      (parse s ch)))
(def xml ; ❸
(-> conforming .getBytes io/input-stream (xml/parse non-validating)))
;; {:tag :html,
;; :attrs {:xmlns "http://www.w3.org/1999/xhtml"},
;; :content [{:tag :article, :attrs nil, :content ["Hello"]}]}
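To give an idea of how this nested structure is typically consumed, here is a small sketch (the document and the :article extraction are invented for illustration) that combines xml/parse with xml-seq, the tree seq over parsed XML available in clojure.core:

```clojure
(require '[clojure.xml :as xml]
         '[clojure.java.io :as io])

;; A tiny document with no DOCTYPE, so no network access is involved.
(def parsed
  (-> "<html><article>Hello</article><article>World</article></html>"
      .getBytes
      io/input-stream
      xml/parse))

;; xml-seq walks every node of the tree; here we keep the text
;; content of each :article node.
(for [node (xml-seq parsed)
      :when (= :article (:tag node))]
  (first (:content node)))
;; ("Hello" "World")
```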
248
XML documents can specify a schema they should conform to. XML parsers usually follow the directive and validate
the content against the schema. With corrupted documents or lacking network connectivity, we might need to
switch off validation to still be able to load the document. The list of all parser features is available
here: https://xerces.apache.org/xerces2-j/features.html
❶ non-validating is a function of the input source and a content handler "ch". The content handler is
the default used by clojure.xml and we can reuse it if we don’t need to alter the way XML is
transformed.
❷ The SAXParserFactory object accepts configuration using the setFeature method.
❸ Parsing is now independent from network calls and is also non-validating.
Just as you can parse XML with xml/parse, you can reverse the process
with xml/emit. xml/emit takes the output of xml/parse and converts it back to a string,
sending it to standard output:
(xml/emit xml) ; ❶
nil
❶ xml is the output of the previous parsing example. Here the data structure converts back to a string,
except for the "DOCTYPE" declaration.
If you want to capture xml/emit output into a string or a file, please have a look
at with-out-str or binding to redirect the standard output to a different output stream.
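For instance, a minimal sketch (the :greeting node is an invented example) of capturing the emitted XML with with-out-str:

```clojure
(require '[clojure.xml :as xml])

;; with-out-str rebinds *out*, so everything xml/emit prints
;; ends up in the returned string instead of on the console.
(def as-string
  (with-out-str
    (xml/emit {:tag :greeting :attrs nil :content ["Hello"]})))

(.contains as-string "<greeting>")
;; true
```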
22.2 clojure.inspector
The clojure.inspector namespace contains a small visualization utility for structured
data. The inspector wraps the data in a basic (Swing 249) UI that supports a few
visualization models: tabular, sequential and tree-like. For example, the following code
produces the window displayed below:
(require '[clojure.inspector :refer [inspect-tree]]) ; ❶
(inspect-tree {:a 1 :b 2 :c [1 2 3 {:d 4 :e 5 :f [6 7 8]}]})
249
The Java Swing framework is a windowing toolkit, part of the standard library since Java 1.2. For more information
please see the introductory Wikipedia page: https://en.wikipedia.org/wiki/Swing_(Java).
Figure 22.1. The inspector visualization tree for a nested data structure.
(inspect-table events)
Figure 22.2. The inspector visualization table for a group of uniform data.
22.3 clojure.repl
clojure.repl contains functions and macros to help interacting with Clojure while
developing at the REPL (the Read Eval Print Loop console is one of the first things
Clojure beginners interact with). It provides two broad categories of utilities:
• Documentation related functions such as: doc, find-doc, source, apropos and dir.
• Exception handling functions like root-cause or pst.
22.3.1 doc
doc is possibly the most used documentation macro. It takes a symbol and returns the
value of the :doc key from the metadata of the corresponding var or namespace object.
In this example we show doc before and after adding the documentation to a new
variable:
(def life 42)
(doc life) ; ❶
;; -------------------------
;; user/life
;; nil
;; nil
(alter-meta! #'life assoc ; ❷
  :doc "Answer to the Ultimate Question of Life the Universe and Everything")
(doc life) ; ❸
;; -------------------------
;; user/life
;; Answer to the Ultimate Question of Life the Universe and Everything
;; nil
(doc doc) ; ❶
;; clojure.repl/doc
;; ([name])
;; Macro
;; Prints documentation for a var or special form given its name
(doc clojure.repl) ; ❷
;; clojure.repl
;; Utilities meant to be used interactively at the REPL
22.3.2 find-doc
doc works well if you know exactly what you are searching for. If you just happen to
know part of a name, or a specific use case you're interested in, you can invoke
find-doc with a string to search for matches in all available documentation strings:
(find-doc "xml") ; ❶
;; ------------------------- ; ❷
;; clojure.xml/parse
;; ([s] [s startparse])
;; Parses and loads the source s, which can be a File, InputStream or
;; String naming a URI.
;; -------------------------
;; clojure.core/xml-seq
;; ([root])
;; A tree seq on the xml elements as per xml/parse
;; -------------------------
;; clojure.xml
;; XML reading/writing.
❶ find-doc takes a string as argument. It searches for matches (including partial matches) of the string in
all available documentation strings of vars and namespaces.
❷ The output of find-doc as well as doc is designed to go directly to standard output, where it’s nicely
formatted.
22.3.3 apropos
find-doc can be very verbose if you use terms that are too generic. For example, if you
search for "first", the output from find-doc is going to be several pages long. The main
culprits for the length of the results are descriptions where the word "first" appears
for very different reasons. If you want to see which functions have "first" in
their names only, you can use apropos:
(apropos "first") ; ❶
;; (clojure.core/chunk-first ; ❷
;; clojure.core/ffirst
;; clojure.core/first
;; clojure.core/nfirst
;; clojure.core/when-first
;; clojure.string/replace-first)
❶ Compared to find-doc, apropos searches for substring matches in the names of definitions only, not in
namespaces or descriptions. Also note that the result is a list and nothing gets printed to standard
output.
22.3.4 dir
Another useful way to search for what is available is by namespace. dir takes a namespace
and prints an ordered list of all its public definitions:
(dir clojure.walk) ; ❶
;; keywordize-keys
;; macroexpand-all
;; postwalk
;; postwalk-demo
;; postwalk-replace
;; prewalk
;; prewalk-demo
;; prewalk-replace
;; stringify-keys
;; walk
❶ dir is not coincidentally named like the similar MS-DOS utility that lists the files in a folder.
22.3.5 dir-fn
Note that dir results are printed on the standard output. This is the most useful way to
access the information at the REPL, but if you need to manipulate the same results as a
sequence, you can use dir-fn:
(require '[clojure.repl :refer [dir-fn]]) ; ❶
(apply str (interpose "," (dir-fn 'clojure.walk))) ; ❷
;; "keywordize-keys,macroexpand-all,postwalk,postwalk-demo,
;; postwalk-replace,prewalk,prewalk-demo,prewalk-replace,stringify-keys,walk"
❶ Differently from the functions and macros seen so far, dir-fn needs an explicit require.
❷ dir-fn returns a sequence ready for further processing. Here we separate each definition with a
comma and produce a single string out of the result.
22.3.6 source
A special kind of documentation, particularly for a readable language like Clojure, is
the source itself. The source macro takes the name of a public definition and prints
its source to standard output:
(source unchecked-inc-int) ; ❶
;; (defn unchecked-inc-int ; ❷
;; "Returns a number one greater than x, an int.
;; Note - uses a primitive operator subject to overflow."
;; {:inline (fn [x] `(. clojure.lang.Numbers (unchecked_int_inc ~x)))
;; :added "1.0"}
;; [x] (. clojure.lang.Numbers (unchecked_int_inc x)))
❶ source is a macro and there is no need to quote the symbol passed as argument.
❷ The source definition of unchecked-inc-int prints on screen with the formatting (lines and
indentation) used in the original definition.
22.3.7 source-fn
The source macro also exists in a function version: source-fn takes a symbol and
returns the original string containing the source without printing it to standard output.
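As a quick sketch of how this differs from source (using when-not purely as an example), source-fn hands back a string we can process like any other value:

```clojure
(require '[clojure.repl :refer [source-fn]])

;; source-fn returns the source as a string (or nil when the
;; source file cannot be found on the classpath).
(def src (source-fn 'when-not))

(subs src 0 9)
;; "(defmacro"
```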
22.3.8 pst
pst is an acronym for Print Stack Trace and is a useful function to retrieve just the right
amount of error information. Java can produce pretty long stack traces that, in some
extreme cases, require scrolling multiple pages to find the root cause. To
avoid cluttering the screen, the REPL only shows the most important information by
default. For example, the following division by zero error shows only the
essential description:
(/ 1 0) ; ❶
❶ The REPL does a good job showing just the essential information by default. In this case we can
understand the problem quickly from a short error description, but that might not be enough for more
generic errors.
For other types of error, we might need to have a look at the stack trace. The stack
trace tells us how the exception propagated up from the site where the exception
happened to the point the request was made. The REPL stores a copy of the full stack
trace in the *e dynamic variable:
(/ 1 0)
*e ; ❶
;;#error {
;; :cause "Divide by zero"
;; :via
;; [{:type java.lang.ArithmeticException
;; :message "Divide by zero"
;; :at [clojure.lang.Numbers divide "Numbers.java" 158]}]
;; :trace
;; [[clojure.lang.Numbers divide "Numbers.java" 158]
;; [clojure.lang.Numbers divide "Numbers.java" 3808] ; ❷
;; ....
❶ The full extent of the error message is stored in the *e dynamic variable. The error is truncated here to
display properly in the book, but it’s many lines longer.
❷ Starting from this item in the :trace we can see new information about how the exception propagated
up to the request site.
By default, pst takes the content of *e and presents the first 12 items from the stack
trace:
(/ 1 0)
(pst) ; ❶
❶ pst looks at the content of *e by default. You might need to generate an exception
before pst, invoked without arguments, can actually show something.
pst optionally takes an exception argument, useful if the last exception available in the
REPL through *e is not the one we are interested in. pst also accepts the number of
items to retrieve from the top of the stack trace:
(def ex (ex-info "Problem." {:status :surprise}))
(pst ex) ; ❶
(pst ex 4) ; ❷
Exception objects can be arbitrarily nested, so upstream code can catch the exception,
do something about it and re-throw a new exception wrapping the original. The chain
of exceptions formed this way gradually enriches the original with additional
information, but it also becomes very long to inspect. pst truncates any nesting of
exceptions to the required number of items:
(def ex ; ❶
(ex-info "Problem." {:status :surprise}
(try (/ 1 0)
(catch Exception e
(ex-info "What happened?" {:status :unknown} e)))))
(pst ex 3) ; ❷
❶ This artificial exception code generates and nests exceptions together for illustration purposes.
❷ pst applies the same rule to all nested exceptions, creating a readable stack trace of the root-cause
chain of exceptions.
22.3.9 root-cause
When working with chained exception objects, it can be useful to access the root cause
directly. root-cause takes the initial reference of a potentially long exception chain
and retrieves just the root cause:
(require '[clojure.repl :refer [root-cause]]) ; ❶
(pst (root-cause ex) 3) ; ❷
❶ root-cause is not available by default and we need to require it from the clojure.repl namespace.
❷ ex is the chained exception object generated previously. It contains a chain of 3 exceptions.
root-cause just retains the inner-most exception.
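The following sketch (the exception chain is artificially constructed) shows root-cause walking the .getCause chain down to the inner-most exception:

```clojure
(require '[clojure.repl :refer [root-cause]])

;; Three nested exceptions: two ex-info wrappers around an
;; ArithmeticException at the bottom of the chain.
(def chained
  (ex-info "outer" {}
    (ex-info "middle" {}
      (ArithmeticException. "Divide by zero"))))

(.getMessage (root-cause chained))
;; "Divide by zero"
```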
DEMUNGE AND STACK-ELEMENT-STR
Something that might not be immediately clear from looking at the stack traces so far
is that pst not only controls the overall length of the trace, but also performs a clean-up
of Clojure "names" using demunge. We can use demunge directly or through
stack-element-str, which is specifically dedicated to improving the appearance of stack
trace elements.
NOTE Although munge and demunge are related, munge lives in the clojure.core namespace
while demunge lives in clojure.repl. This is because demunge is mainly used to
beautify stack trace items for REPL display.
(ns my-namespace) ; ❶
(require '[clojure.repl :refer [demunge stack-element-str]]) ; ❷
(defn my-funct! []
  (throw (RuntimeException. "Boom")))
(str my-funct!) ; ❸
;; "my_namespace$my_funct_BANG_@621ada4f"
(demunge (str my-funct!)) ; ❹
(def stack-trace ; ❺
  (try (my-funct!)
    (catch Exception e (.getStackTrace e))))
(nth stack-trace 2) ; ❻
;; [my_namespace$my_funct_BANG_ invokeStatic "form-init4179141376169992155.clj" 1]
(stack-element-str (nth stack-trace 2)) ; ❼
❶ Note that my-namespace contains a dash "-" sign. Class names in Java do not allow dashes, so
Clojure needs to do some name transformation work on namespaces before they can be part of the
class name and package combination.
❷ Both demunge and stack-element-str need an explicit require before use.
❸ When a function is evaluated, its class is generated. Before generation, Clojure names need translation
to Java conventions. Even if a call to munge is not visible here, the name translation happens behind
the scenes. The package and class name already follow Java naming conventions.
❹ demunge accepts a string representing a Java name and translates it into an idiomatic Clojure name. For
example, we can see words such as "BANG" replaced back with the original "!".
❺ my-funct! generates an exception when called. We can retrieve an array of stack trace elements by
calling the .getStackTrace method on the exception instance.
❻ If we access the element at index 2 in the array, we can see it prints classes using the Java
convention, even if they are generated from a Clojure function.
❼ stack-element-str prints the stack trace element using the better looking Clojure conventions.
The REPL offers a few additional configurations to produce an even better development
experience. We are going to see some of them while talking about
the clojure.main namespace.
22.4 clojure.main
clojure.main contains the entry point executable of the Clojure REPL and a few
assorted functions. We are going to concentrate on the following functions specifically:
• load-script compiles and evaluates the content of a Clojure file.
• repl starts the main REPL loop.
WARNING Most of the functions in clojure.main are public, but some of them are too narrow in scope
to be reusable beyond REPL customizations. This section touches briefly on some of them but
focuses mainly on the main/repl function.
22.4.1 load-script
load-script takes a string that represents the path of a Clojure file available from the
classpath or the file system. If the file name starts with "@" or "@/" the file is
compiled and evaluated from the Java classpath:
(require '[clojure.main :as main]) ; ❶
clojure.core.reducers/fold ; ❷
;; CompilerException java.lang.ClassNotFoundException: clojure.core.reducers
(main/load-script "@clojure/core/reducers.clj") ; ❸
clojure.core.reducers/fold ; ❹
#object[clojure.core.reducers$fold 0x41414539
"clojure.core.reducers$fold@41414539"]
When we use the "@" sign to load a Clojure file, the effect is very similar to using
the load function from the core namespace, with just a different encoding of the path. If
we remove the "@" sign we achieve the same effect as load-file, another function in
the core namespace:
(require '[clojure.main :as main])
(spit "hello.exe" ; ❶
"(ns hello)
(println \"Hello World!\")")
(main/load-script "hello.exe") ; ❷
;; "Hello World!"
;; nil
❶ We create a file on the file system called "hello.exe" which contains a namespace declaration and a
line that prints "Hello World!". The file doesn’t necessarily have to have the "clj" extension.
❷ If the "@" sign is not present, load-script uses the "hello.exe" path to search for a file in the file system
relative to the folder where the REPL process was started. The file "hello.exe" is found there and is
evaluated.
22.4.2 repl
The repl function starts a new Read Eval Print Loop (possibly on top of another
running one). It takes a few configuration options which are helpful to customize the
REPL experience. Here we start a new REPL assuming we are already in the default
one:
(require '[clojure.main :as main]) ; ❶
(main/repl) ; ❷
The :init option can be useful to put the new REPL in a state where a few commands
or vars are available for immediate use. For example, let’s assume we designed a small
calculator that implements the 4 fundamental operations: "plus", "minus", "times" and
"divide". When starting the new calculator REPL we want those functions to be readily
available:
(ns calculator) ; ❶
(def plus +)
(def minus -)
(def times *)
(def divide /)
(clojure.main/repl :init #(require '[calculator :refer :all])) ; ❷
(plus 1 2) ; ❸
;; 3
There are other configuration options that can help improve the user experience when
designing a custom REPL environment. We could use :prompt to clearly state the
purpose of the REPL:
(require '[clojure.main :as main])
(def repl-options
[:init #(require '[calculator :refer :all])
:prompt #(printf "enter expression :> ")])
(apply main/repl repl-options) ; ❶
❶ Using the :prompt option we can change the appearance of the prompt.
We can go further and depart from the way the Clojure REPL normally evaluates
expressions. In the following example, a custom REPL calculates small infix
mathematical expressions. To do this, we need to override both the :read function as
well as the :eval function using the corresponding options keys:
(require '[clojure.main :as main])
(def repl-options
[:prompt #(printf "enter expression :> ")
:read (fn [request-prompt request-exit] ; ❶
(or ({:line-start request-prompt :stream-end request-exit} ; ❷
(main/skip-whitespace *in*))
(re-find #"^(\d+)([\+\-\*\/])(\d+)$" (read-line)))) ; ❸
:eval (fn [[_ x op y]] ; ❹
(({"+" + "-" - "*" * "/" /} op)
(Integer. x) (Integer. y)))])
(apply main/repl repl-options) ; ❺
❶ The :read option accepts a function of 2 arguments. The arguments are themselves functions that we
don't need to customize: we use them to tell the REPL when a new prompt should be requested and
how to handle pressing "ctrl+D", which generates an end of stream signal.
❷ main/skip-whitespace walks the standard input skipping any white space characters (if any) and
positioning the standard input (a stateful object) at one of 3 possible positions: :body, :line-start
or :stream-end. :body means the next readable token (the mathematical expression in our case),
so the or expression falls through to the branch that calls read-line.
❸ read-line reads an entire line from standard input. In our case, it waits for the user to type an
expression and hit enter. At that point the line is read as a string and sent to a regular expression that
splits the line into its matching groups.
❹ The line returns from :read as a list of 4 items. The first is the entire expression itself, which we ignore.
The next 3 items are the first operand "x", the operator "op" and the second operand
"y". :eval proceeds to convert the operands into numbers and the operator into a function which is
finally invoked.
❺ The different prompt warns the user of the different REPL semantics. There is no need for parentheses
and operators appear in infix position. Hit ctrl+D to exit the inner loop and go back to the normal REPL.
22.5 clojure.java.browse
clojure.java.browse contains functions to visualize HTML content through the
system browser. The main and only entry point is browse-url, a function that takes a
URL as a string and interacts with the operating system to open the URL in one of the
available methods:
• HTML Browser: this is the default method.
• Swing browser: if a default HTML browser is not available, browse-url tries to
use a basic Swing (Java graphic library) window.
• Custom script: it's also possible to customize which command line executable to
use through the clojure.java.browse/*open-url-script* dynamic variable.
Using browse-url is quite simple. For example the following opens a browser showing
the home page for this book:
(require '[clojure.java.browse :refer [browse-url]])
(browse-url "https://www.manning.com/books/clojure-the-essential-reference") ; ❶
In the unlikely case a system browser is not available on the current machine, we can
use the dynamic variable *open-url-script* to point at a different command line
executable:
(require '[clojure.java.browse :refer [browse-url *open-url-script*]])
❶ *open-url-script* has been bound to "wget", a popular command line browsing utility. Assuming
"wget" is installed on the local system, the snippet downloads the book "War and Peace" from the
Project Gutenberg website.
22.6 clojure.java.shell
The clojure.java.shell namespace exposes a single entry point function sh, which
executes a command on the host operating system as a separate process:
(require '[clojure.java.shell :refer [sh]]) ; ❶
(sh "ls" "/usr/share/dict") ; ❷
;; {:exit 0, ; ❸
;; :out "README\nconnectives\npropernames\nweb2\nweb2a\nwords\n",
;; :err ""}
❶ sh is the main and only entry point in the clojure.java.shell namespace. We can refer to the
function directly and avoid using a namespace alias, as sh is a short and easy to recognize name.
❷ If the command line contains arguments, each argument is passed as a separate string. We can see here
how to list the content of a folder on a Unix-based system.
❸ The result is always a map with :exit, :out and :err keys. A non-zero :exit indicates that
the command reported an error condition. :out contains the output the command directed
to the standard output, while :err is the standard error stream.
While the command is executing in a sub-process (of the running Java Virtual
Machine), sh blocks until the exit code is available. Commands can send results
to the standard output stream or the standard error stream. Both outputs are reported as
plain strings in the resulting map.
sh supports quite a few options. We can use the :in option key to pass an input stream,
reader, file, byte array or string to the running process input:
(sh "grep" "5" :in (apply str (interpose "\n" (range 50)))) ; ❶
❶ This grep command executes using the string passed with the :in key.
If the input for the command is encoded in another character set (for example because
it's not produced inside the running JVM) we can use :in-enc to specify which
encoding the input is in. Similarly, :out-enc can be used to interpret the output of the
command with a specific encoding. :out-enc also supports a special value :bytes that,
when present, skips converting the output into a string and returns the raw bytes instead:
(def image-file "/usr/share/doc/cups/images/smiley.jpg")
(:out (sh "cat" image-file :out-enc :bytes)) ; ❶
❶ This command assumes you have an image at the specified location. We load the image
with sh using "cat" to send the image to standard output, where it is collected as a byte array and
returned.
Knowing how to pass inputs, we could build a helper function to "pipe" commands
together:
(defn pipe [cmd1 & cmds] ; ❶
(reduce
(fn [{out :out} cmd] ; ❷
(apply sh (conj cmd :in out)))
(apply sh cmd1)
cmds))
(println
(:out
(pipe ; ❸
["env"]
["grep" "-i" "java"])))
;; JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_181.jdk/Contents/Home
;; JAVA_MAIN_CLASS_61966=clojure.main
;; _=/usr/bin/java
;; LEIN_JAVA_CMD=java
;; JAVA_MAIN_CLASS_62001=clojure.main
❶ The pipe function accepts at least one parameter and possibly more. It expects each parameter to be
a vector of strings suitable for sh commands.
❷ pipe always starts by executing the first command, which becomes the initial value for reduce. The
reducing function takes the previous command output and the next command, then executes the next
command using that output as its input.
❸ We pipe the "env" command, which returns the list of all environment variables currently set,
into the "grep" command, which searches for substrings. The output you see here could be very
different when executed in your environment.
sh executes the command in a sub-process, which means that all environment variables
present in the parent are inherited by the child. We can change this behavior and
pass a completely different set using the :env map:
(def env
{"VAR1" "iTerm.app"
"VAR2" "/bin/bash"
"COMMAND_MODE" "Unix2003"})
(println
(:out
(sh "env" :env env))) ; ❶
;; VAR1=iTerm.app
;; VAR2=/bin/bash
;; COMMAND_MODE=Unix2003
❶ We can see that the default environment variables have been completely replaced by the content of
the map env.
Another default of sh is the working folder, the initial path the command is
automatically given. In the following example we first print the current working folder
and then change it using the :dir key:
(println (:out (sh "pwd"))) ; ❶
;; /Users/reborg/prj/my/book
(println (:out (sh "pwd" :dir "/tmp"))) ; ❷
Both the environment and the working folder are common settings, possibly shared
across many sh invocations. To avoid repeating the :dir and :env keys in
all sh commands, clojure.java.shell also contains two handy macros to set those once
and for all inside a binding:
(require '[clojure.java.shell :as shell :refer [sh]])
(shell/with-sh-dir "/usr/share" ; ❶
(shell/with-sh-env {:debug "true"}
[(sh "env") (sh "pwd")]))
❶ We use with-sh-dir and with-sh-env to set the working folder and the environment variables for
all sh commands inside the form.
22.7 clojure.core.server
clojure.core.server contains functions to expose the Clojure environment through a
socket connection and across network boundaries. It's not that different from the
standard REPL environment: while a typical REPL accepts commands from standard
input and prints results to standard output, a socket-based environment uses the socket
to receive requests and send responses. On the other side of the socket, a process (or
human) consumes the results of invoking Clojure functions as usual.
By default, clojure.core.server uses a slightly modified version of the same REPL
offered through the console. To start the server socket REPL we use start-server:
(require '[clojure.core.server :as server]) ; ❶
(server/start-server ; ❷
{:name "repl1" :port 8787 :accept clojure.core.server/repl})
;; #object["ServerSocket[addr=localhost/127.0.0.1,localport=8787]"]
The socket server is highly configurable. Here’s a summary of the available options
and their meaning:
• :address is the network interface the server should be using. It defaults to
127.0.0.1, the default host interface, so it's not normally required.
• :port is the port the server should be using. There is no default and it is a
mandatory argument.
• :name is an identifier for this server. There could be many socket servers running
and each server requires a name. This is a mandatory argument and can be any
string.
• :accept is a fully qualified function declared in a Clojure file available from the
classpath. It's not possible to pass a function created on the fly.
• :args is a list of optional arguments for the accept function.
• :bind-err tells the server whether the standard error stream (bound to *err* at the
REPL) should be bound to the output socket. It defaults
to true, which means that anything printed to the standard error is sent to the other
end of the socket.
• :server-daemon determines if the running socket server is a daemon thread. By
default, the socket server starts as a daemon, which means that the JVM can
shut down even if the server is still serving requests: starting a socket server
should not prevent the rest of the application from exiting.
• :client-daemon configures client threads as daemons. When the socket server
receives a request, it handles the request in a separate thread. By default the
threads serving incoming requests are also daemons.
Most of the available options are fairly self-explanatory. One that deserves some
attention is :accept, which determines the behavior of the server when handling an
incoming request. By default :accept uses clojure.core.server/repl, which starts a
new REPL loop. After starting a normal REPL, clojure.core.server/repl starts a
REPL loop on top of the existing one:
(clojure.core.server/repl) ; ❶
;; nil
:repl/quit ; ❷
❶ The side effects of calling clojure.core.server/repl are not immediately visible. Under the hood,
a new "while true" REPL loop has started to handle requests.
❷ One difference from the normal REPL is that a socket REPL needs a way to handle exit requests
without necessarily typing "CTRL+D", as the other side of the REPL might not have a keyboard. The
socket server adds a :repl/quit command that exits the REPL loop. In this case it returns to the
initial REPL.
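As a sketch of what travels over the wire (the server name "repl3" and port 8789 are arbitrary choices for this example), we can also talk to a socket REPL programmatically from another Clojure process:

```clojure
(require '[clojure.core.server :as server]
         '[clojure.java.io :as io])

;; Start a throwaway socket REPL.
(server/start-server
  {:name "repl3" :port 8789 :accept 'clojure.core.server/repl})

;; Send one expression followed by :repl/quit, which makes the
;; server close its side of the socket, then read everything back.
(def response
  (with-open [socket (java.net.Socket. "127.0.0.1" 8789)
              writer (io/writer socket)
              reader (io/reader socket)]
    (.write writer "(+ 1 2)\n:repl/quit\n")
    (.flush writer)
    (slurp reader)))

(server/stop-server "repl3")

(.contains response "3")
;; true
```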
If we want to customize the socket REPL experience, we need to pass the :accept
option a different function. The following example comes from the Replicant library, a
small proof of concept by the author of the socket REPL feature 250:
(ns data-server) ; ❶
(require '[clojure.core.server :as server]
         '[clojure.main :as main])
(defn data-eval [form] ; ❷
  (let [writer (java.io.StringWriter.)]
    (binding [*out* writer *err* writer]
      {:result (eval form)
       :output (str writer)})))
(defn data-repl [] ; ❸
  (main/repl :eval data-eval))
(server/start-server
{:name "repl2" :port 8788 :accept 'data-server/data-repl}) ; ❹
❶ The accept function needs to be fully qualified. To make sure the example runs in the correct
namespace, we create one before defining functions.
❷ data-eval is the evaluation function the REPL loop will use after reading from the socket. We don’t
use the default clojure.core/eval because standard output and standard error would not be visible on
the other side of the socket. data-eval instead captures standard output and standard error on
a StringWriter instance. The writer is then used to push the output through the socket by
transforming it into a string.
❸ data-repl is a thin layer over clojure.main/repl so we can pass our custom evaluation function.
❹ We can now start a new server using the custom :accept function.
If you have Telnet installed 251 you can open a session to the running socket server
as follows:
250
The Socket REPL feature was implemented by Alex Miller in collaboration with the Clojure core team. You can find the
Replicant library here:https://github.com/puredanger/replicant
251
The venerable Telnet protocol is a way to utilize a terminal over the network. Telnet is also the name of the client utility
that connects to remote sockets, not just the protocol.
❶ An example Telnet session that connects to the socket server to evaluate a simple expression.
❷ Once connected we are offered the usual prompt and we can evaluate expressions as usual.
To stop one of the running socket servers, we can use stop-server, or
stop-servers to close them all with a single call:
(server/stop-server "repl2") ; ❶
;; true
(server/stop-servers) ; ❷
;; nil
❶ stop-server requires a server name (or it will try to use the server/session dynamic variable if no
server name is given). In this example we stop the server started previously.
stop-server returns true when successful, or nil if no server was found with that name.
❷ Alternatively, stop-servers stops all running instances at once, without the need to pass their
names.
It's worth remembering that one of the main goals of the socket server is to start a
distributed REPL on top of an already existing application without the need to change
the application code. We can open a socket server while starting the application,
passing the necessary parameters from the command line:
; ❶
export M2_REPO="/Users/reborg/.m2/repository"
export CLOJURE_18="$M2_REPO/org/clojure/clojure/1.8.0/clojure-1.8.0.jar"
; ❷
java -cp .:$CLOJURE_18 \
-Dclojure.server.repl="{:port 8881 :accept clojure.core.server/repl}" \
clojure.main
;; Clojure 1.8.0
;; user=>
❶ There are a few requirements for this Bash script to work properly. You need a Maven repository (this
is usually already there if you use Leiningen) and to change the environment variable M2_REPO to point
at the root of that repository. By default, the Maven repository is installed in the ~/.m2 folder of the
local user. We are using a Clojure 1.8 jar installation here.
❷ We start Clojure using the clojure.main class directly. We also set the clojure.server.repl Java
property. The socket server checks for the presence of this property and starts one or more servers
as configured. As you can see, the property content is a Clojure map.
You should be able to open a Telnet connection to 127.0.0.1 8881 as before. To stop
the server and the running Clojure instance, just type CTRL+C at the REPL above.
22.8 clojure.java.io
clojure.java.io contains a collection of functions that simplify the interaction with
the Java Input/Output (or simply IO) system. Over the years, Java evolved its
original InputStream and OutputStream IO abstractions into Reader and Writer,
eventually also adding asynchronous IO. During this transformation, Java put a lot of
effort into maintaining backward compatibility, a principle also shared with Clojure.
Unfortunately, the result is a set of coexisting interfaces that negatively impact usability,
forcing Java developers through bridges and adapters to move between different styles
of IO.
Sometimes it’s useful to create a reader from a string (especially for testing),
but reader interprets strings as locations. We can achieve the desired effect by
transforming the string into a character array first:
(require '[clojure.java.io :as io])
(def s "string->array->reader->bytes->string") ; ❶
(slurp (io/reader (char-array s))) ; ❷ ❸
;; "string->array->reader->bytes->string"
❶ io/reader is commonly used to load external resources. Sometimes, especially for testing, it’s useful
to create a reader directly from a string. We use a simple string for illustrative purposes.
❷ char-array transforms the string into a primitive array of chars, preventing reader from interpreting
the string as a location.
❸ slurp has polymorphic behavior similar to reader and in this case transforms the reader back into a
string by reading its content.
NOTE The book contains other interesting examples of use of io/reader: in line-seq we show how to
read from a java.io.InputStream. In disj, instead, we can see an example of how to read
from a java.net.Socket object.
22.9.3 writer
Not surprisingly, writer creates a new writer object accepting the same first argument
types as reader:
(with-open [w (io/writer "/tmp/output.txt")] ; ❶
(spit w "Hello\nClojure!!")) ; ❷
(slurp "/tmp/output.txt") ; ❸
;; "Hello\nClojure!!"
❶ Using a writer is very similar to using a reader. writer creates the object "w" that will automatically
close at the end of the expression thanks to with-open.
❷ spit sends the content of a string into a file. If the file already exists, the content is overwritten.
❸ To test the content of the file, we can use slurp instead of passing through a reader.
As we can see from the examples, reader and writer are almost interchangeable
with slurp and spit. This is a valid assumption for the simple case of reading/writing
using memory as a buffer. If we want to avoid loading the entire content of a file (or
other streamable object) into memory at once, we can chain a reader and
a writer together and process the content using lazy functions like line-seq:
(require '[clojure.java.io :refer [reader writer]])
(require '[clojure.string :refer [upper-case]])
(with-open [r (reader "/usr/share/dict/words") ; ❶
            w (writer "/tmp/words")]
  (doseq [line (line-seq r)] ; ❷
    (.write w (str (upper-case line) "\n")))) ; ❸
❶ Both the reader and the writer need to be closed after use. In this example we use the dictionary
file present on most Unix-based systems. The file is large but not huge.
❷ Using doseq, we make sure that side effects are evaluated lazily and without holding onto the head of
the sequence. The net effect is that just a small portion of the file is present in memory at any given time,
while the garbage collector can reclaim any processed item already written to the output file.
❸ We wouldn’t be able to use spit in this case, because spit automatically closes the writer after
writing the first line.
Both reader and writer optionally accept configuration keys. Here we can see how to
replicate the effect of calling the .append method using the :append key:
(with-open [r (reader "/usr/share/dict/words")
w (writer "/tmp/words" :append true)] ; ❶
(doseq [line (line-seq r)]
(.write w (str (upper-case line) "\n")))) ; ❷
❶ We can use :append to prevent writer from removing any previous content from the file while writing
new content.
❷ Instead of using the .append method we can now use the more generic .write and control the
behavior using configuration options.
22.9.4 resource
resource is quite commonly used in Clojure programming to retrieve resources from the Java
classpath. The classpath normally contains compiled Java classes, Clojure sources
(unless they are explicitly removed) or other artifacts. We could, for example, retrieve
the source of the clojure.java.io namespace with the following:
(require '[clojure.java.io :refer [resource reader]])
(first (line-seq (reader (resource "clojure/java/io.clj")))) ; ❶ ❷
;; ";   Copyright (c) Rich Hickey. All rights reserved."
❶ Clojure sources are packaged as part of the Clojure executable. We can find them using the relative
path of the file inside the Jar archive.
❷ We can see the first line of the file after using a reader and line-seq.
22.9.5 as-url
as-url is a small utility function to create URL objects (without the need to
import java.net.URL to use its constructor directly). as-url adds some level of
polymorphism to handle input types other than strings:
(require '[clojure.java.io :refer [as-url file]])
(import 'java.nio.file.FileSystems)
(def path ; ❶
(.. FileSystems
getDefault
(getPath "/tmp" (into-array String ["words"]))
toUri))
(def u1 (as-url "file:/tmp/words")) ; ❷
(def u2 (as-url (file "/tmp/words"))) ; ❸
(def u3 (as-url path)) ; ❹
(= u1 u2 u3) ; ❺
;; true
❶ path shows how to convert a Java NIO (New IO API) path into a URI.
❷ as-url accepts strings (with protocols) identifying the location of a file on disk.
❸ as-url also accepts the same location as a java.io.File object.
❹ Finally, as-url also accepts a URI, here the result of passing through a java.nio.file.Path object.
❺ The 3 URLs are different objects, but they represent the same location on disk of the file "/tmp/words".
(extend-protocol io/Coercions ; ❶
Path
(as-file [path] (io/file (.toUri path)))
(as-url [path] (io/as-url (.toUri path))))
(def path ; ❷
(.. FileSystems
getDefault
(getPath "/usr" (into-array String ["share" "dict" "words"]))))
(io/as-url path) ; ❸
;; #object[java.net.URL 0x1255fa42 "file:/usr/share/dict/words"]
(io/file path) ; ❹
;; #object[java.io.File 0x1c80a235 "/usr/share/dict/words"]
❶ clojure.java.io contains the Coercions protocol declaring two functions, as-file and as-url.
While as-file has the file wrapper function available, as-url doesn’t have a
corresponding url function. The implementation consists of transforming the path into a URI and
calling the corresponding (and already existing) implementations.
❷ Java NIO Path objects are roughly equivalent to URLs. java.nio.file.Path only offers a translation
into a URI, which we can use to create a URL. The getPath() method takes the initial "root" part of
the path as its first argument, followed by any other segments as variable arguments.
Clojure needs to create an array of strings to be compatible with the type signature.
❸ After extending the protocol, we can use as-url to transform java.nio.file.Path directly.
❹ As a bonus, file can now also create a file object directly from a path.
(io/file "/a/valid/file/path")
;; #object[java.io.File 0x7936d006 "/a/valid/file/path"]
(io/file nil)
;; nil
❶ We can see what single argument types io/file accepts by checking the :impls key of
the Coercions protocol. What follows is a list of all the possible calls to io/file with the respective
argument type.
The default list of types that io/file can understand is visible inside
the Coercions protocol map, as demonstrated in the example. We’ve already seen that
by extending this protocol we can apply io/file to other argument types.
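A quick way to perform this inspection at the REPL is sketched below. Note that :impls is an implementation detail of Clojure protocols, so the exact shape of the map may vary between versions:

```clojure
(require '[clojure.java.io :as io])

;; The protocol map stores, under the :impls key, a map from
;; implementing type to its method implementations. Its keys are
;; therefore the single-argument types io/file can coerce.
(keys (:impls io/Coercions))
;; e.g. (nil java.io.File java.lang.String java.net.URL java.net.URI)
```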
io/file also accepts other arguments after the first with the same type constraints.
Additional arguments have to be relative paths (i.e., they cannot start with a forward
slash '/'):
(io/file "/root" (io/file "not/root") "filename.txt") ; ❶
;; #object[java.io.File 0x6898f182 "/root/not/root/filename.txt"]
22.9.7 copy
io/file does not actually create a physical resource, but just a "pointer" that other
functions like writer can use to write content to. Another way to create content is to
copy one file to another using the io/copy function:
(require '[clojure.java.io :as io])
(io/copy (io/file "/usr/share/dict/words") (io/file "/tmp/words")) ; ❶
;; nil
(.exists (io/file "/tmp/words")) ; ❷
;; true
❶ We can use io/copy to copy the existent /usr/share/dict/words file into a new file in
the /tmp folder.
❷ To check if the file was actually created, we can use the exists() on the java.io.File object.
io/copy supports a long list of argument combinations: from reader to writer, from
string to file, from InputStream to OutputStream and so on. One of them, from file to
file, is specifically optimized using java.nio.channel.FileChannel, which guarantees
optimal performance when the file is cached by the operating system. io/copy, however,
does not support a string to string transfer (with a file to file copy implementation).
We can extend io/copy using the related do-copy multimethod:
(require '[clojure.java.io :as io])
❶ The defmethod definition for io/do-copy is private in clojure.java.io, but we can still access
it by looking up the related var object (with the reader macro #') and
then dereferencing the var with @ (another reader macro). The implementation simply calls io/file on
each argument.
❷ io/copy now accepts a pair of strings as arguments.
❸ We can verify the file was effectively created.
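A sketch of the extension described in the callouts might look like the following. The [String String] dispatch value and the .addMethod call are assumptions based on how Clojure multimethods work, so the original code may differ in detail:

```clojure
(require '[clojure.java.io :as io])

;; do-copy is private: look up its var with #' and dereference it
;; with @ to obtain the multimethod, then register a method for a
;; pair of String arguments that calls io/file on each argument,
;; delegating to the existing file-to-file implementation.
(let [do-copy @#'io/do-copy]
  (.addMethod ^clojure.lang.MultiFn do-copy [String String]
    (fn [input output opts]
      (do-copy (io/file input) (io/file output) opts))))

;; io/copy now accepts a pair of strings:
(spit "/tmp/do-copy-input.txt" "hello")
(io/copy "/tmp/do-copy-input.txt" "/tmp/do-copy-output.txt")
(slurp "/tmp/do-copy-output.txt")
;; "hello"
```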
The example above shows that io/copy accepts options. The :buffer-size option
defaults to 1024 bytes and is used when the origin is an InputStream, while
the :encoding option is in effect for origin Reader objects.
22.9.8 make-parents
When a file path requires sub-folders that do not yet exist, we can use make-
parents to create all the necessary folders. Conveniently, make-parents does not create
the last path segment, considering it the name of the file that will likely be used right
after:
(require '[clojure.java.io :as io])
❶ Instead of a single string containing the path, we assembled the path out of fragments.
❷ make-parents creates any non-existent folder, but does not try to interpret "file.txt" as one, considering
it a file name instead.
❸ The same fragments of the file name can be used with io/file to copy content over to the new folder.
❹ We can check that the content was correctly copied by comparing lines at the origin with the destination.
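The code for this example is not included in this excerpt; a sketch consistent with the callouts (the paths and fragment names are assumptions) could be:

```clojure
(require '[clojure.java.io :as io])

;; The path is assembled out of fragments instead of a single string.
(def fragments ["/tmp" "clj-book" "io" "words.txt"])

;; make-parents creates /tmp/clj-book/io but treats the last
;; segment, "words.txt", as a file name and does not create it.
(apply io/make-parents fragments)

;; The same fragments are then used with io/file as the copy target.
(spit "/tmp/origin.txt" "a\nb\nc\n")
(io/copy (io/file "/tmp/origin.txt") (apply io/file fragments))

;; Compare origin and destination to verify the copy.
(= (slurp "/tmp/origin.txt") (slurp (apply io/file fragments)))
;; true
```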
22.9.9 delete-file
We can use delete-file to remove files. The types supported are the same as io/file.
We can additionally pass a second argument if we want to prevent delete-file from
throwing an exception in case of error:
(require '[clojure.java.io :as io])
(io/delete-file "/does/not/exist") ; ❶
;; IOException Couldn't delete /does/not/exist
(io/delete-file "/does/not/exist" :silently) ; ❷
;; :silently
(io/delete-file "/tmp/words") ; ❸
;; true
❶ When we try to delete a file that does not exist, delete-file throws an exception.
❷ We can prevent the exception in the case of non-existent files by passing a second argument, which is
returned to signal that the operation was not successful.
❸ This file was created previously and should exist on the file system. delete-file correctly
returns true.
22.9.10 as-relative-path
as-relative-path retrieves the path from resource objects (such as files, URIs,
URLs). This is especially useful to convert file objects into path strings for further
processing:
(require '[clojure.java.io :as io])
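The original example is not included in this excerpt; a minimal sketch of as-relative-path might look like the following (note that it throws IllegalArgumentException on absolute paths):

```clojure
(require '[clojure.java.io :as io])

;; A relative java.io.File is converted back into its path string.
(io/as-relative-path (io/file "docs" "chapter1.txt"))
;; "docs/chapter1.txt"

;; Absolute paths are rejected instead of silently converted.
(try
  (io/as-relative-path (io/file "/tmp/file.txt"))
  (catch IllegalArgumentException e :not-relative))
;; :not-relative
```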
22.10 clojure.test
clojure.test is a testing framework shipped by default with Clojure. It works by
attaching specific metadata to var objects to store testing functions.
clojure.test offers several ways to create tests. To create somewhat realistic testing
examples, we are going to use a sqrt function that calculates the square root of a
number using Newton’s method:
(defn sqrt [x] ; ❶
(when-not (neg? x)
(loop [guess 1.]
(if (> (Math/abs (- (* guess guess) x)) 1e-8)
(recur (/ (+ (/ x guess) guess) 2.))
guess))))
❶ The sqrt function calculates an approximation of the square root of the number "x" to the 8th decimal
place. The rest of the section uses this function as an easy testing target.
NOTE clojure.test is one of the few idiomatic uses of :refer :all in the require declaration.
Testing functions are so well known that they are required as a batch at the beginning of a
testing namespace.
The most common way to define tests is deftest (and deftest- to create private test
functions):
(require '[clojure.test :refer [deftest]]) ; ❶
(deftest sqrt-test ; ❷
  (assert (= 2 (sqrt 4)) "Expecting 2")) ; ❸
(test #'sqrt-test) ; ❹
;; AssertionError Assert failed: Expecting 2
;; (= 2 (sqrt 4))
❶ Although it’s customary to :refer :all the entire clojure.test namespace, we limit ourselves to what
is necessary for a specific example to avoid any possible confusion.
❷ deftest creates a new function sqrt-test in the current namespace.
❸ It then adds a meta key :test to the var object sqrt-test, using the body of the function as its value.
❹ We can use clojure.core/test to verify that the tests are running as expected.
NOTE clojure.test offers better assertion primitives than assert to set expectations. We are going
to see them later in this section.
WITH-TEST
A slight variation on deftest is with-test. with-test creates the target function and
the test definition at the same time and does not require the creation of an auxiliary
function just to hold the test implementation:
(require '[clojure.test :refer [with-test]])
(with-test ; ❶
(defn sum [a b] (+ a b))
(println "test called"))
(test #'sum) ; ❷
;; test called
;; :ok ; ❸
❶ with-test is the simplest macro to create a test other than setting the metadata manually.
❷ We call clojure.core/test on the target function itself instead of the generated test function like in the
case of deftest.
❸ The ":ok" printed here is the return value from test assuming that the lack of exceptions means the test
was successful.
clojure.test offers better ways to verify expectations than the basic assert. For
example, is verifies that the given expression is truthy and produces a nice summary
of the test results:
(require '[clojure.test :refer [is deftest test-var]])
(deftest sqrt-test
  (is (= 2 (sqrt 4)) "Expecting 2")) ; ❶
(test-var #'sqrt-test) ; ❷
;; FAIL in () (form-init796879.clj:1) ; ❸
;; Expecting 2
;; expected: (= 2 (sqrt 4))
;; actual: (not (= 2 2.000000000000002))
❶ Compared to the previous example using deftest we replaced assert with is.
❷ We started using test-var instead of clojure.core/test. There is not much difference, but
test-var removes the confusing :ok that clojure.core/test generates.
❸ is interacts with clojure.test’s report system and produces nicer looking results on the screen.
TESTING
Thanks to the summary printed by is, we can finally see why tests to calculate the square root
of 4 are failing. None of the assert variants seen so far printed the reason for the
failure. is takes an optional string to better describe what the test is about. We can
enrich and nest tests contextually using testing:
(require '[clojure.test :refer [is deftest testing test-var]])
(deftest sqrt-test
(testing "The basics of squaring a number" ; ❶
(is (= 3 (sqrt 9))))
(testing "Known corner cases"
(is (= 0 (sqrt 0)))
(is (= Double/NaN (sqrt Double/NaN)))))
(test-var #'sqrt-test) ; ❷
;; FAIL in () (form-init796879.clj:3)
;; The basics of squaring a number
;; expected: (= 3 (sqrt 9))
;; actual: (not (= 3 3.000000001396984))
;;
;; FAIL in () (form-init796879.clj:5)
;; Known corner cases
;; expected: (= 0 (sqrt 0))
;; actual: (not (= 0 6.103515625E-5))
;;
;; FAIL in () (form-init796879.clj:6)
;; Known corner cases
;; expected: (= Double/NaN (sqrt Double/NaN))
;; actual: (not (= NaN 1.0))
❶ We use testing to group related tests together. This visually groups the tests, improving
readability, and also appears as a description in the output of the tests.
❷ It seems that we have quite a bit of work to do to make the sqrt function more robust.
ARE
In the previous example we started stacking up groups of similar tests, all repeating the
same operation with different values. are builds on is, offering a way to batch
together many similar assertions:
(require '[clojure.test :refer [are deftest test-var]])
(deftest sqrt-test
(are [x y] (= (sqrt x) y) ; ❶
9 3
0 0
Double/NaN Double/NaN))
(test-var #'sqrt-test) ; ❷
;; FAIL in () (form-init7968799.clj:2)
;; expected: (= (sqrt 9) 3)
❶ are requires 3 parts: the first declares which variables will be used (in our case, "x" and "y"). The
second is a template expression that relates "x" and "y": here we want to see if the square root
of the first number is equal to the second. Finally, a list of "x","y" values to use in the template.
❷ The end result is similar to multiple execution of is, one for each of the pairs.
Using equality as a predicate is common with is and are, but some expressions
are difficult to put in equality form, for example if we want to know whether a
function throws an exception given some input. clojure.test comes with an extended set
of predicates, thrown?, thrown-with-msg? and instance?, to use for cases other than
equality:
(require '[clojure.test :refer [is deftest] :as t])
(deftest sqrt-test
(is (thrown? IllegalArgumentException (sqrt -4))) ; ❶
(is (thrown-with-msg? IllegalArgumentException #"negative" (sqrt -4))) ; ❷
(is (instance? Double (sqrt nil)))) ; ❸
(binding [t/*stack-trace-depth* 3] ; ❹
(t/test-var #'sqrt-test)) ; ❺
;; FAIL in () (form-init7968799.clj:2)
;; expected: (thrown? IllegalArgumentException (sqrt -4))
;; actual: nil
;;
;; FAIL in () (form-init7968799.clj:3)
;; expected: (thrown-with-msg? IllegalArgumentException #"negative" (sqrt -4))
;; actual: nil
;;
;; ERROR in () (Numbers.java:1013)
;; expected: (instance? Double (sqrt nil))
;; actual: java.lang.NullPointerException: null
;; at clojure.lang.Numbers.ops (Numbers.java:1013)
;; clojure.lang.Numbers.isNeg (Numbers.java:100)
;; user$sqrt.invokeStatic (form-init7968.clj:2)
❶ thrown? verifies that the target function throws a specific kind of exception.
❷ We can also verify that the error message matches a specific regex using thrown-with-msg?.
❸ instance? can verify whether an expression returns a specific type.
❹ t/*stack-trace-depth* is a dynamic variable available in clojure.test that can be used to
configure how many items to display in case of an exception during a test. Here we are requiring only the
first 3 items.
❺ All tests are failing. The first 2 tests fail because no exception is thrown on passing
negative numbers. The last test forces sqrt to throw an exception when it shouldn’t: we want
(sqrt 0) to return 0.0.
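The definition of roughly is not included in this excerpt. A sketch consistent with the assertions below would compare the two numbers up to a given number of decimal digits (the name and signature are taken from the usage below; the implementation itself is an assumption):

```clojure
;; True when guess is within 10^-decimals of x.
(defn roughly [x guess decimals]
  (< (Math/abs (- (double x) (double guess)))
     (Math/pow 10 (- decimals))))

(roughly 2 2.000000000000002 14)
;; true
(roughly 2 2.000000000000002 15)
;; false
```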
(deftest sqrt-test ; ❹
(is (roughly 2 (sqrt 4) 14))
(is (roughly 2 (sqrt 4) 15)))
(t/test-var #'sqrt-test)
Now that we’ve seen how to create tests and increase their expressiveness, it’s time to
look into the options for running them. The most basic one, which we’ve used so far,
is test-var. test-var takes a var object and executes the function found in
the :test key of the var metadata, if any.
TEST-ALL-VARS
test-vars (plural) is very similar and takes multiple var objects to test. But the most
common case is to declare all testing functions (and related var objects) in a specific
namespace. To evaluate all tests in a namespace we have several options, for
example test-all-vars:
(ns my-tests) ; ❶
(require '[clojure.test :refer [is deftest] :as t])
(deftest a (is (= 1 (+ 2 2))))
(deftest b (is (= 2 (+ 2 2))))
(ns user) ; ❷
(require '[clojure.test :refer [test-all-vars]])
(test-all-vars 'my-tests)
;; FAIL in (a) (form-init205934.clj:1)
;; expected: (= 1 (+ 2 2))
;; actual: (not (= 1 4))
;;
;; FAIL in (b) (form-init20593408.clj:1)
;; expected: (= 2 (+ 2 2))
;; actual: (not (= 2 4))
❶ The example switches the current namespace to my-tests before defining new tests the usual way.
❷ Back in the user namespace, we can run all tests in my-tests using test-all-vars.
TEST-NS
test-ns is almost the same as calling test-all-vars, except that it also obeys "test
hooks". deftest calls can be nested at will, or composed later on by grouping them
in a special function, test-ns-hook. If test-ns-hook is found in the target
namespace, test-ns executes the hook instead of all the vars in the namespace:
(ns composable-tests)
(require '[clojure.test :refer [is deftest]])
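The test definitions are omitted from this excerpt. A sketch consistent with the failures reported below could be (the test names and grouping are assumptions):

```clojure
(deftest fail-a (is (= 1 (+ 2 2))))
(deftest fail-b (is (= 2 (+ 2 2)))) ; not referenced by the hook, so it never runs
(deftest fail-c (is (= 1 (+ 2 2))))

;; When test-ns finds test-ns-hook in the target namespace, it calls
;; the hook instead of running every test var: only fail-a and fail-c
;; execute, matching the two failures in the summary.
(defn test-ns-hook []
  (fail-a)
  (fail-c))
```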
(ns user)
(require '[clojure.test :refer [test-ns]])
(test-ns 'composable-tests) ; ❸
;; FAIL in (fail-a) (form-init2059340.clj:1)
;; expected: (= 1 (+ 2 2))
;; actual: (not (= 1 4))
;;
;; FAIL in (fail-c) (form-init2059340.clj:1)
;; expected: (= 1 (+ 2 2))
;; actual: (not (= 1 4))
;; {:test 2, :pass 0, :fail 2, :error 0}
RUN-TESTS
Continuing with test runners, run-tests adds a summary at the end of the run
compared to test-ns. run-tests also runs the current namespace by default if no
arguments are given:
(ns running-tests)
(require '[clojure.test :refer [is deftest run-tests]])
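The tests defined in running-tests are omitted from this excerpt; any deftest in the namespace would be picked up, for example (an illustrative test, not from the book):

```clojure
(require '[clojure.test :refer [is deftest]])

(deftest addition-test ; hypothetical passing test
  (is (= 4 (+ 2 2))))
```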
(run-tests)
;; Testing running-tests
RUN-ALL-TESTS
Until now we’ve seen how to run tests in a single namespace, but with run-all-
tests we can run all of them in any loaded namespace. It also accepts a regular
expression to filter a subset of the namespaces:
(ns a-new-test) ; ❶
(require '[clojure.test :refer [is deftest]])
(ns b-new-test)
(require '[clojure.test :refer [is deftest]])
(ns user)
(require '[clojure.test :refer [run-all-tests]])
(run-all-tests #".*new.*") ; ❷
;; Testing b-new-test
;;
;; Testing a-new-test
;;
;; Ran 4 tests containing 4 assertions.
;; 0 failures, 0 errors.
;; {:test 4, :pass 4, :fail 0, :error 0, :type :summary}
❶ Two namespaces are created containing the word "new" in their name. They contain some simple
illustrative tests.
❷ run-all-tests runs all the tests found in all loaded namespaces. If we pass the optional regular
expression argument, run-all-tests only runs matching namespaces.
FIXTURES
clojure.test also supports fixtures. A good guiding principle for writing effective unit
tests is that they should be isolated and repeatable. Unfortunately, some portions of the
code cannot be completely side effect free. Fixtures can help recreate the necessary
preconditions for a test to run reliably. A common case is the presence of a specific
file on disk, or a table in a database, that the executing test needs in order to return a
specific result. Once defined, a fixture can be applied before or after executing the
test. clojure.test also offers the option to run fixtures at each test execution, or only
once in a testing namespace.
USE-FIXTURES
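The fixture and test definitions are omitted from this excerpt. A sketch consistent with the output below could be (the function and test names are assumptions):

```clojure
(require '[clojure.test :refer [is deftest use-fixtures run-tests]])

;; A fixture is a function of one argument: the test to execute.
(defn around-each [test-fn]
  (println "### before") ; set up a file, a database table, etc.
  (test-fn)              ; run the test (or composition of tests)
  (println "### after")) ; restore the pre-existing conditions

(use-fixtures :each around-each) ; run the fixture around every test

(deftest fixture-test-a (is (= 4 (+ 2 2))))
(deftest fixture-test-b (is (= 4 (+ 2 2))))
```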
(run-tests) ; ❸
;; Testing fixture-test-1
;; ### before
;; ### after
;; ### before
;; ### after
;;
;; Ran 2 tests containing 2 assertions.
;; 0 failures, 0 errors.
;; {:test 2, :pass 2, :fail 0, :error 0, :type :summary}
❶ A fixture is a function of one argument. The argument is a single test or a composition thereof. Calling
the argument executes the test (or tests). Before the test executes we can set up a database, file or
other resource that the code under test might use. Similarly, we can reestablish any pre-existing
condition after running the test.
❷ use-fixtures registers a new fixture with either :each or :once semantics. In this case we expect the
fixture to run for each declared test.
❸ The summary confirms that the fixture function ran once for each test.
22.10.5 clojure.java.javadoc
It’s quite common for Clojure programmers to look up Java-related documentation. One
of the main sources of documentation for Java classes is "Javadoc", a specific markup
to create documentation directly as part of Java sources, and its related tooling 252.
The javadoc command produces an HTML rendering of documented classes structured in
folders and subfolders. It also produces an "index.html" that can be opened with a
browser for offline or online viewing.
JAVADOC
The picture below shows how the Javadoc for the String class looks:
252
Please have a look at https://en.wikipedia.org/wiki/Javadoc for an overview of Javadoc
By default, javadoc opens what is now outdated documentation. The only Java versions
it knows about are 6 (if currently in use) or 7 (for any other version). This means that
even if the current REPL is running on JDK 12 (or a later version), javadoc opens the
JDK 7 documentation regardless. javadoc relies on a few dynamic vars to point at a
different documentation version. There is no need to access them directly: we can use
the add-remote-javadoc function to update them:
(defn java-version [] ; ❶
(let [jsv (System/getProperty "java.specification.version")]
(if-let [single-digit (last (re-find #"^\d\.(\d+).*" jsv))]
single-digit jsv)))
(def jdocs-template ; ❷
(format "https://docs.oracle.com/javase/%s/docs/api/" (java-version)))
(def known-prefix ; ❸
["java." "javax." "org.ietf.jgss." "org.omg."
"org.w3c.dom." "org.xml.sax."])
(require '[clojure.java.javadoc :as browse]
         '[clojure.pprint :refer [pprint]])
(doseq [prefix known-prefix] ; ❹
  (browse/add-remote-javadoc prefix jdocs-template))
(pprint @browse/*remote-javadocs*) ; ❺
;; {"java." "https://docs.oracle.com/javase/8/docs/api/",
;; "javax." "https://docs.oracle.com/javase/8/docs/api/",
;; "org.apache.commons.codec."
;; "http://commons.apache.org/codec/api-release/",
;; "org.apache.commons.io."
;; "http://commons.apache.org/io/api-release/",
;; "org.apache.commons.lang."
;; "http://commons.apache.org/lang/api-release/",
;; "org.ietf.jgss." "https://docs.oracle.com/javase/8/docs/api/",
;; "org.omg." "https://docs.oracle.com/javase/8/docs/api/",
;; "org.w3c.dom." "https://docs.oracle.com/javase/8/docs/api/",
;; "org.xml.sax." "https://docs.oracle.com/javase/8/docs/api/"}
❶ To deal with the change from double to single digit versions, java-version checks the reported Java
version from the java.specification.version property and extracts the version as a single digit
in case it starts with a number followed by a dot. If the reported Java version is "1.8", for instance,
java-version returns just "8".
❷ Oracle’s published JDK documentation follows the same URL format for all versions, so we can just
adjust the URL to the correct version.
❸ javadoc looks up the list of documentation URLs using the package name of the target class. We
are going to update a few of the default prefixes with the new Javadoc URL; they are listed in
the known-prefix definition.
❹ We repeatedly update prefixes and URLs using doseq and add-remote-javadoc.
❺ The current list of known remote locations is visible after printing browse/*remote-javadocs*, the
dynamic var responsible for storing them in the namespace.