In [35]: sorted_df = tweet_counts_df.sort_values(
...: by='Tweets', ascending=False)
...:
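As an aside, the following minimal sketch (our own illustration with made-up data, not the chapter's tweet_counts_df) previews the behavior described below: after sort_values, groupby('State') keeps each state's rows in descending order by tweet count:

import pandas as pd

# hypothetical data standing in for the chapter's tweet_counts_df
df = pd.DataFrame({'State': ['SC', 'SC', 'WY', 'WY'],
                   'Tweets': [100, 300, 50, 200]})
sorted_df = df.sort_values(by='Tweets', ascending=False)

for name, group in sorted_df.groupby('State'):
    print(name, list(group['Tweets']))  # SC [300, 100]   WY [200, 50]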
The loop in the following snippet creates the Markers. First,
sorted_df.groupby('State')
groups sorted_df by 'State'. A DataFrame’s groupby method maintains the original
row order in each group. Within a given group, the senator with the most tweets will be first,
because we sorted the senators in descending order by tweet count in snippet [35]:
In [36]: for index, (name, group) in enumerate(sorted_df.groupby('State')):
    ...:     strings = [state_codes[name]]  # used to assemble popup text
    ...:
    ...:     for s in group.itertuples():
    ...:         strings.append(
    ...:             f'{s.Name} ({s.Party}); Tweets: {s.Tweets}')
    ...:
    ...:     text = '<br>'.join(strings)
    ...:     marker = folium.Marker(
    ...:         (locations[index].latitude, locations[index].longitude),
    ...:         popup=text)
    ...:     marker.add_to(usmap)
    ...:
    ...:
We pass the grouped DataFrame to enumerate, so we can get an index for each group,
which we’ll use to look up each state’s Location in the locations list. Each group has a
name (the state code we grouped by) and a collection of items in that group (the two senators
for that state). The loop operates as follows:
We look up the full state name in the state_codes dictionary, then store it in the
strings list—we’ll use this list to assemble the Marker’s popup text.
The nested loop walks through the items in the group collection, returning each as a
named tuple that contains a given senator’s data. We create a formatted string for the
current senator containing the person’s name, party and number of tweets, then append
that to the strings list.
The Marker text can use HTML for formatting. We join the strings list’s elements, separating each from the next with an HTML <br> element, which creates a line break.
We create the Marker. The first argument is the Marker’s location as a tuple containing
the latitude and longitude. The popup keyword argument specifies the text to display if
the user clicks the Marker.
We add the Marker to the map.
In [37]: usmap.save('SenatorsTweets.html')
Open the HTML file in your web browser to view and interact with the map. Recall that you
can drag the map to see Alaska and Hawaii. Here we show the popup text for the South
Carolina marker:
You could enhance this case study to use the sentiment-analysis techniques you learned in previous chapters to rate as positive, neutral or negative the sentiment expressed by people who send tweets (“tweeters”) mentioning each senator’s handle.
16.5 HADOOP
The next several sections show how Apache Hadoop and Apache Spark deal with big-data storage and processing challenges via huge clusters of computers, massively parallel processing, Hadoop MapReduce programming and Spark in-memory processing techniques. Here, we discuss Apache Hadoop, a key big-data infrastructure technology that also serves as the foundation for many recent advancements in big-data processing and an entire ecosystem of software tools that are continually evolving to support today’s big-data needs.
When Google was developing their search engine, they knew that they needed to return
search results quickly. The only practical way to do this was to store and index the entire
Internet using a clever combination of secondary storage and main memory. Computers of
that time couldn’t hold that amount of data and couldn’t analyze it fast enough to guarantee prompt search-query responses. So Google developed a clustering
system, tying together vast numbers of computers—called nodes. Because having more
computers and more connections between them meant greater chance of hardware failures,
they also built in high levels of redundancy to ensure that the system would continue
functioning even if nodes within clusters failed. The data was distributed across all these
inexpensive “commodity computers.” To satisfy a search request, all the computers in the
cluster searched in parallel the portion of the web they had locally. Then the results of those
searches were gathered up and reported back to the user.
To accomplish this, Google needed to develop the clustering hardware and software,
including distributed storage. Google published its designs but did not open-source its software. Programmers at Yahoo!, working from Google’s designs in the “Google File System” paper, 3 then built their own system. They open-sourced their work, and the Apache organization implemented the system as Hadoop. The name came from an elephant stuffed
animal that belonged to a child of one of Hadoop’s creators.
3
http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf.
Two additional Google papers also contributed to the evolution of Hadoop—“MapReduce: Simplified Data Processing on Large Clusters” 4 and “Bigtable: A Distributed Storage System for Structured Data,” 5 which was the basis for Apache HBase (a NoSQL key–value and column-based database). 6
4
http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf.
5
http://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf.
6
Many other influential big-data-related papers (including the ones we mentioned) can be found at: https://bigdatamadesimple.com/research-papers-that-changed-the-world-of-big-data/.
Hadoop’s key components are:
HDFS (Hadoop Distributed File System) for storing massive amounts of data throughout a cluster, and
MapReduce for implementing the tasks that process the data.
Earlier in the book we introduced basic functional-style programming and filter/map/reduce. Hadoop MapReduce is similar in concept, just on a massively parallel scale. A MapReduce task performs two steps—mapping and reduction. The mapping step, which also may include filtering, processes the original data across the entire cluster and maps it into tuples of key–value pairs. The reduction step then combines those tuples to produce the results of the MapReduce task. The key is how Hadoop performs the MapReduce step. Hadoop divides the data into batches that it distributes across the nodes in the cluster—anywhere from a few nodes to a Yahoo! cluster with 40,000 nodes and over 100,000 cores. 7 Hadoop also distributes the MapReduce task’s code to the nodes in the cluster and executes the code in parallel on every node. Each node processes only the batch of data stored on that node. The reduction step combines the results from all the nodes to produce the final result. To coordinate this, Hadoop uses YARN (“yet another resource negotiator”) to manage all the resources in the cluster and schedule tasks for execution.
7
https://wiki.apache.org/hadoop/PoweredBy.
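To make the mapping and reduction steps concrete before we move to the cluster, here is a minimal, purely local sketch (our own illustration, not Hadoop code) that uses Python’s functional-style tools to count word lengths, which is conceptually what the Hadoop MapReduce example later in this section does in parallel across many nodes:

from collections import Counter

lines = ['Romeo loves Juliet', 'Juliet loves Romeo']

# mapping step: emit a (key, value) pair of (word length, 1) for every word
pairs = [(len(word), 1) for line in lines for word in line.split()]

# reduction step: combine all the pairs that share a key into a single total
totals = Counter()
for length, count in pairs:
    totals[length] += count

print(sorted(totals.items()))  # [(5, 4), (6, 2)]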
Hadoop Ecosystem
Though Hadoop began with HDFS and MapReduce, followed closely by YARN, it has grown into a large ecosystem that includes Spark (discussed in Sections 16.6–16.7) and many other Apache projects: 8, 9, 0
8
https://hortonworks.com/ecosystems/.
9
https://readwrite.com/2018/06/26/complete-guide-of-hadoop-ecosystem-components/.
0
https://www.janbasktraining.com/blog/introduction-architecture-components-hadoop-ecosystem/.
Ambari (https://ambari.apache.org)—Tools for managing Hadoop clusters.
Drill (https://drill.apache.org)—SQL querying of non-relational data in Hadoop and NoSQL databases.
Flume (https://flume.apache.org)—A service for collecting and storing (in HDFS and other storage) streaming event data, like high-volume server logs, IoT messages and more.
HBase (https://hbase.apache.org)—A NoSQL database for big data with “billions of rows by millions of columns—atop clusters of commodity hardware.” 1
1
We used the word “by” to replace “X” in the original text.
Hive (https://hive.apache.org)—Uses SQL to interact with data in data warehouses. A data warehouse aggregates data of various types from various sources. Common operations include extracting data, transforming it and loading it (known as ETL) into another database, typically so you can analyze it and create reports from it.
Impala (https://impala.apache.org)—A database for real-time SQL-based queries across distributed data stored in Hadoop HDFS or HBase.
Kafka (https://kafka.apache.org)—Real-time messaging, stream processing and storage, typically used to transform and process high-volume streaming data, such as website activity and streaming IoT data.
Pig (https://pig.apache.org)—A scripting platform that converts data analysis tasks from a scripting language called Pig Latin into MapReduce tasks.
Sqoop (https://sqoop.apache.org)—A tool for moving structured, semi-structured and unstructured data between databases.
Storm (https://storm.apache.org)—A real-time stream-processing system for tasks such as data analytics, machine learning, ETL and more.
ZooKeeper (https://zookeeper.apache.org)—A service for managing cluster configurations and coordination between clusters.
And more.
Hadoop Providers
Numerous cloud vendors provide Hadoop as a service, including Amazon EMR, Google Cloud DataProc, IBM Watson Analytics Engine, Microsoft Azure HDInsight and others. In addition, companies like Cloudera and Hortonworks (which at the time of this writing are merging) offer integrated Hadoop-ecosystem components and tools via the major cloud vendors. They also offer free downloadable environments 2 that you can run on the desktop for learning, development and testing before you commit to cloud-based hosting, which can incur significant costs. We introduce MapReduce programming in the example in the following sections by using a Microsoft cloud-based Azure HDInsight cluster, which provides Hadoop as a service.
2
Check their significant system requirements first to ensure that you have the disk space and
memory required to run them.
Hadoop 3
Apache continues to evolve Hadoop. Hadoop 3 was released in December of 2017 3 with many improvements, including better performance and significantly improved storage efficiency. 4
3
For a list of features in Hadoop 3, see https://hadoop.apache.org/docs/r3.0.0/.
4
https://www.datanami.com/2018/10/18/is-hadoop-officially-dead/.
We want you to experience the process of setting up clusters and using them to perform tasks. So, in this Hadoop example, you’ll use Microsoft Azure’s HDInsight service to create cloud-based clusters of computers in which to test our examples. Go to
https://azure.microsoft.com/en-us/free
to sign up for an account. Microsoft requires a credit card for identity verification.
Various services are always free and some you can continue to use for 12 months. For information on these services see:
https://azure.microsoft.com/en-us/free/free-account-faq/
Microsoft also gives you a credit to experiment with their paid services, such as their
HDInsight Hadoop and Spark services. Once your credits run out or 30 days pass (whichever
comes first), you cannot continue using paid services unless you authorize Microsoft to
charge your card.
Because you’ll use your new Azure account’s credit for these examples, 5 we’ll discuss how to configure a low-cost cluster that uses less computing resources than Microsoft allocates by default. 6 Caution: Once you allocate a cluster, it incurs costs whether you’re using it or not. So, when you complete this case study, be sure to delete your cluster(s) and other resources, so you don’t incur additional charges. For more information, see:
https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-group-portal
5
For Microsoft’s latest free account features, visit https://azure.microsoft.com/en-us/free/.
6
For Microsoft’s recommended cluster configurations, see https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-component-versioning#default-node-configuration-and-virtual-machine-sizes-for-clusters. If you configure a cluster that’s too small for a given scenario, when you try to deploy the cluster you’ll receive an error.
For Azure-related documentation and videos, visit:
https://docs.microsoft.com/en-us/azure/—the Azure documentation.
https://channel9.msdn.com/—Microsoft’s Channel 9 video network.
https://www.youtube.com/user/windowsazure—Microsoft’s Azure channel on YouTube.
To create your cluster, follow the steps in Microsoft’s Create a Hadoop cluster tutorial at:
https://docs.microsoft.com/en-us/azure/hdinsight/hadoop/apache-hadoop-linux-create-cluster-get-started
While following their Create a Hadoop cluster steps, please note the following:
In Step 1, you access the Azure portal by logging into your account at
https://portal.azure.com
In Step 2, Data + Analytics is now called Analytics, and the HDInsight icon and icon
color have changed from what is shown in the tutorial.
In Step 3, you must choose a cluster name that does not already exist. When you enter
your cluster name, Microsoft will check whether that name is available and display a
message if it is not. You must create a password. For the Resource group, you’ll also
need to click Create new and provide a group name. Leave all other settings in this step
as is.
In Step 5: Under Select a Storage account, click Create new and provide a storage
account name containing only lowercase letters and numbers. Like the cluster name, the
storage account name must be unique.
When you get to the Cluster summary you’ll see that Microsoft initially configures the
cluster as Head (2 x D12 v2), Worker (4 x D4 v2). At the time of this writing, the estimated cost-per-hour for this configuration was $3.11. This setup uses a total of 6 CPU nodes with 40 cores—far more than we need for demonstration purposes.
You can edit this setup to use fewer CPUs and cores, which also saves money. Let’s change the configuration to a four-CPU cluster with 16 cores that uses less powerful computers. In
the Cluster summary:
1. Click Edit to the right of Cluster size.
2. Change the Number of Worker nodes to 2.
3. Click Worker node size, then View all, select D3 v2 (this is the minimum CPU size for
Hadoop nodes) and click Select.
4. Click Head node size, then View all, select D3 v2 and click Select.
5. Click Next and click Next again to return to the Cluster summary. Microsoft will
validate the new configuration.
6. When the Create button is enabled, click it to deploy the cluster.
It takes 20–30 minutes for Microsoft to “spin up” your cluster. During this time, Microsoft is
allocating all the resources and software the cluster requires.
After the changes above, our estimated cost for the cluster was $1.18 per hour, based on average use for similarly configured clusters. Our actual charges were less than that. If you encounter any problems configuring your cluster, Microsoft provides HDInsight chat-based support at:
https://azure.microsoft.com/en-us/resources/knowledge-center/technical-chat/
Hadoop streaming lets you implement the mapper and the reducer as Python scripts that communicate with Hadoop via the standard input and output streams:
Hadoop supplies the input to the mapping script—called the mapper. This script reads its input from the standard input stream.
The mapper writes its results to the standard output stream.
Hadoop supplies the mapper’s output as the input to the reduction script—called the reducer—which reads from the standard input stream.
The reducer writes its results to the standard output stream.
Hadoop writes the reducer’s output to the Hadoop file system (HDFS).
The mapper and reducer terminology used above should sound familiar to you from our discussions of functional-style programming and filter, map and reduce in the “Sequences: Lists and Tuples” chapter.
In the mapper script (length_mapper.py), the notation #! in line 1 tells Hadoop to execute
the Python code using python3, rather than the default Python 2 installation. This line must
come before all other comments and code in the file. At the time of this writing, Python 2.7.12
and Python 3.5.2 were installed. Note that because the cluster does not have Python 3.6 or
higher, you cannot use f-strings in your code.
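If you need formatted strings in scripts that must run under the cluster’s Python 3.5, the str.format method is a workable substitute for an f-string. Here is a minimal illustration (our own, not part of the chapter’s scripts):

word = 'Hadoop'
# equivalent to the Python 3.6+ f-string f'{len(word)}\t1'
print('{}\t{}'.format(len(word), 1))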
1 #!/usr/bin/env python3
2 # length_mapper.py
3 """Maps lines of text to key-value pairs of word lengths and 1."""
4 import sys
5
6 def tokenize_input():
7     """Split each line of standard input into a list of strings."""
8     for line in sys.stdin:
9         yield line.split()
10
11 # read each line in the standard input and for every word
12 # produce a key-value pair containing the word's length, a tab and 1
13 for line in tokenize_input():
14     for word in line:
15         print(str(len(word)) + '\t1')
Generator function tokenize_input (lines 6–9) reads lines of text from the standard input
stream and for each returns a list of strings. For this example, we are not removing
punctuation or stop words as we did in the “Natural Language Processing” chapter.
When Hadoop executes the script, lines 13–15 iterate through the lists of strings from
tokenize_input. For each list (line) and for every string (word) in that list, line 15
outputs a key–value pair with the word’s length as the key, a tab (\t) and the value 1,
indicating that there is one word (so far) of that length. Of course, there probably are many
words of that length. The MapReduce algorithm’s reduction step will summarize these key–
value pairs, reducing all those with the same key to a single key–value pair with the total
count.
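To see the mapper’s output format concretely, here is a small local simulation (our own sketch; on the cluster, Hadoop supplies the lines via the standard input stream):

line = 'But soft what light'  # one sample line of input text

# emit the word's length, a tab and 1 for each word, as length_mapper.py does
for word in line.split():
    print(str(len(word)) + '\t1')

# output: one key-value pair per word, with keys 3, 4, 4 and 5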
1 #!/usr/bin/env python3
2 # length_reducer.py
3 """Counts the number of words with each length."""
4 import sys
5 from itertools import groupby
6 from operator import itemgetter
7
8 def tokenize_input():
9     """Split each line of standard input into a key and a value."""
10     for line in sys.stdin:
11         yield line.strip().split('\t')
12
13 # produce key-value pairs of word lengths and counts separated by tabs
14 for word_length, group in groupby(tokenize_input(), itemgetter(0)):
15     try:
16         total = sum(int(count) for word_length, count in group)
17         print(word_length + '\t' + str(total))
18     except ValueError:
19         pass  # ignore word if its count was not an integer
When the MapReduce algorithm executes this reducer, lines 14–19 use the groupby function
from the itertools module to group all word lengths of the same value:
The first argument calls tokenize_input to get the lists representing the key–value
pairs.
The second argument indicates that the key–value pairs should be grouped based on the
element at index 0 in each list—that is the key.
Line 16 totals all the counts for a given key. Line 17 outputs a new key–value pair consisting of the word length and its total. The MapReduce algorithm takes all the final word-count outputs and writes them to a file in HDFS—the Hadoop file system.
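Note that itertools.groupby only groups adjacent items that have equal keys, so the reducer depends on Hadoop sorting the mapper’s key–value pairs by key before they reach the reducer. The following local sketch (our own illustration) mimics that already-sorted input and shows the totals the reducer logic produces:

from itertools import groupby
from operator import itemgetter

# simulated, already-sorted mapper output: [key, value] pairs of strings
pairs = [['3', '1'], ['4', '1'], ['4', '1'], ['5', '1']]

for word_length, group in groupby(pairs, itemgetter(0)):
    total = sum(int(count) for _, count in group)
    print(word_length + '\t' + str(total))

# output (tab-separated): 3 1, then 4 2, then 5 1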
Next, use scp to copy length_mapper.py, length_reducer.py and RomeoAndJuliet.txt from your local system to the cluster. Press Enter only when you’ve typed the entire command:
scp length_mapper.py length_reducer.py RomeoAndJuliet.txt sshuser@YourClusterName-ssh.azurehdinsight.net:
The first time you do this, you’ll be asked for security reasons to confirm that you trust the target host (that is, Microsoft Azure).
7
Windows users: If ssh does not work for you, install and enable it as described at https://blogs.msdn.microsoft.com/powershell/2017/12/15/using-the-openssh-beta-in-windows-10-fall-creators-update-and-windows-server-1709/. After completing the installation, log out and log back in or restart your system to enable ssh.
Next, use ssh to log into the cluster:
ssh sshuser@YourClusterName-ssh.azurehdinsight.net
For this example, we’ll use the following Hadoop command to copy the text file into the already existing folder /example/data that the cluster provides for use with Microsoft’s Azure Hadoop tutorials. Again, press Enter only when you’ve typed the entire command:
hadoop fs -copyFromLocal RomeoAndJuliet.txt /example/data/RomeoAndJuliet.txt
Next, use the following yarn command to run the MapReduce job on RomeoAndJuliet.txt. Press Enter only when you’ve typed the entire command:
yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar
   -D mapred.output.key.comparator.class=
      org.apache.hadoop.mapred.lib.KeyFieldBasedComparator
   -D mapred.text.key.comparator.options=-n
   -files length_mapper.py,length_reducer.py
   -mapper length_mapper.py
   -reducer length_reducer.py
   -input /example/data/RomeoAndJuliet.txt
   -output /example/wordlengthsoutput
The yarn command invokes Hadoop’s YARN (“yet another resource negotiator”) tool to manage and coordinate access to the Hadoop resources the MapReduce task uses. The file hadoop-streaming.jar contains the Hadoop streaming utility that allows you to use Python to implement the mapper and reducer. The two -D options set Hadoop properties that enable it to sort the final key–value pairs by key (KeyFieldBasedComparator) numerically (the -n comparator option) rather than alphabetically, so a key like 10 sorts after 9 rather than after 1. The other command-line arguments are:
-files—A comma-separated list of file names. Hadoop copies these files to every node in the cluster so they can be executed locally on each node.
-mapper—The name of the mapper’s script file.
-reducer—The name of the reducer’s script file.
-input—The file or directory of files to supply as input to the mapper.
-output—The HDFS directory in which the output will be written. If this folder already exists, an error will occur.
The following output shows some of the feedback that Hadoop produces as the MapReduce job executes. We replaced chunks of the output with ... to save space and bolded several lines of interest, including:
The total number of “input paths to process”—the 1 source of input in this example is the
RomeoAndJuliet.txt file.
The “number of splits”—2 in this example, based on the number of worker nodes in our
cluster.
The percentage completion information.
File System Counters, which include the numbers of bytes read and written.
Job Counters, which show the number of mapping and reduction tasks used and
various timing information.
MapReduce Framework, which shows various information about the steps performed.
packageJobJar: [] [/usr/hdp/2.6.5.3004-13/hadoop-mapreduce/hadoop-streaming-2.7.3.2.6.5.3004-
...
18/12/05 16:46:25 INFO mapred.FileInputFormat: Total input paths to process : 1
18/12/05 16:46:26 INFO mapreduce.JobSubmitter: number of splits:2
...
18/12/05 16:46:26 INFO mapreduce.Job: The url to track the job: http://hn0-paulte.y3nghy5db
...
18/12/05 16:46:35 INFO mapreduce.Job: map 0% reduce 0%
18/12/05 16:46:43 INFO mapreduce.Job: map 50% reduce 0%
18/12/05 16:46:44 INFO mapreduce.Job: map 100% reduce 0%
18/12/05 16:46:48 INFO mapreduce.Job: map 100% reduce 100%
18/12/05 16:46:50 INFO mapreduce.Job: Job job_1543953844228_0025 completed successfully
18/12/05 16:46:50 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=156411
FILE: Number of bytes written=813764
...
Job Counters
Launched map tasks=2
Launched reduce tasks=1
...
MapReduce Framework
Map input records=5260
Map output records=25956
Map output bytes=104493
Map output materialized bytes=156417
Input split bytes=346
Combine input records=0
Combine output records=0
Reduce input groups=19
Reduce shuffle bytes=156417
Reduce input records=25956
Reduce output records=19
Spilled Records=51912
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=193
CPU time spent (ms)=4440
Physical memory (bytes) snapshot=1942798336
Virtual memory (bytes) snapshot=8463282176
Total committed heap usage (bytes)=3177185280
...
18/12/05 16:46:50 INFO streaming.StreamJob: Output directory: /example/wordlengthsoutput
To view the word-length counts stored in HDFS, use the following command:
hdfs dfs -text /example/wordlengthsoutput/part-00000
Here are the results of the preceding command:
18/12/05 16:47:19 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
18/12/05 16:47:19 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [ha
1 1140
2 3869
3 4699
4 5651
5 3668
6 2719
7 1624
8 1062
9 855
10 317
11 189
12 95
13 35
14 13
15 9
16 6
17 3
18 1
23 1
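If you copy the output file to your local system (for example, with scp), a few lines of pandas (our own sketch, not part of the chapter’s cluster workflow) let you total and chart the distribution. We assume here that you saved the output file locally as part-00000:

import pandas as pd
import matplotlib.pyplot as plt

# the reducer's output is tab-separated: word length, then count
lengths = pd.read_csv('part-00000', sep='\t', names=['Length', 'Count'])

print(lengths['Count'].sum())  # total number of words counted
lengths.plot.bar(x='Length', y='Count', legend=False)
plt.show()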
Deleting Your Cluster So You Do Not Incur Charges
Caution: Be sure to delete your cluster(s) and associated resources (like
storage) so you don’t incur additional charges. In the Azure portal, click All
resources to see your list of resources, which will include the cluster you set up and the
storage account you set up. Both can incur charges if you do not delete them. Select each
resource and click the Delete button to remove it. You’ll be asked to confirm by typing yes.
For more information, see:
https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-group-portal
16.6 SPARK
In this section, we’ll overview Apache Spark. We’ll use the Python PySpark library and Spark’s functional-style filter/map/reduce capabilities to implement a simple word count example that summarizes the word counts in Romeo and Juliet.
History
Spark was initially developed in 2009 at U.C. Berkeley and funded by DARPA (the Defense Advanced Research Projects Agency). Initially, it was created as a distributed execution engine for high-performance machine learning. 8 It uses an in-memory architecture that “has been used to sort 100 TB of data 3X faster than Hadoop MapReduce on 1/10th of the machines” 9 and runs some workloads up to 100 times faster than Hadoop. 0 Spark’s significantly better performance on batch-processing tasks is leading many companies to replace Hadoop MapReduce with Spark. 1, 2, 3
8
https://gigaom.com/2014/06/28/4-reasons-why-spark-could-jolt-hadoop-into-hyperdrive/.
9
https://spark.apache.org/faq.html.
0
https://spark.apache.org/.
1
https://bigdatamadesimple.com/is-spark-better-than-hadoop-map-reduce/.
2
https://www.datanami.com/2018/10/18/is-hadoop-officially-dead/.
3
https://blog.thecodeteam.com/2018/01/09/changing-face-data-analytics-fast-data-displaces-big-data/.
4
http://spark.apache.org/.
At the core of Spark are resilient distributed datasets (RDDs), which you’ll use to process distributed data using functional-style programming. In addition to reading data from disk and writing data to disk, Hadoop uses replication for fault tolerance, which adds even more disk-based overhead. RDDs eliminate this overhead by remaining in memory—using disk only if the data will not fit in memory—and by not replicating data. Spark handles fault tolerance by remembering the steps used to create each RDD, so it can rebuild a given RDD if a cluster node fails. 5
5
https://spark.apache.org/research.html.
Spark distributes the operations you specify in Python to the cluster’s nodes for parallel execution. Spark streaming enables you to process data as it’s received. Spark DataFrames, which are similar to pandas DataFrames, enable you to view RDDs as a collection of named columns. You can use Spark DataFrames with Spark SQL to perform queries on distributed data. Spark also includes Spark MLlib (the Spark Machine Learning Library), which enables you to perform machine-learning algorithms, like those you learned in Chapters 14 and 15. We’ll use RDDs, Spark streaming, DataFrames and Spark SQL in the next few examples.
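As a preview of these functional-style RDD operations, here is a minimal PySpark word-count sketch (our own illustration; it assumes PySpark is installed locally and that RomeoAndJuliet.txt is in the current folder, rather than the cluster configuration used in the following sections):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('WordCountSketch').getOrCreate()

counts = (spark.sparkContext.textFile('RomeoAndJuliet.txt')
          .flatMap(lambda line: line.lower().split())  # map each line to words
          .map(lambda word: (word, 1))                 # key-value pairs
          .reduceByKey(lambda x, y: x + y))            # total each word's 1s

for word, count in counts.take(5):  # look at a few of the results
    print(word, count)

spark.stop()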
Providers
Hadoop providers typically also provide Spark support. In addition to the providers listed in Section 16.5, there are Spark-specific vendors like Databricks. They provide a “zero-management cloud platform built around Spark.” 6 Their website also is an excellent resource for learning Spark. The paid Databricks platform runs on Amazon AWS or Microsoft Azure. Databricks also provides a free Databricks Community Edition, which is a great way to get started with both Spark and the Databricks environment.
6
https://databricks.com/product/faq.
Docker
Docker is a tool for packaging software into containers (built from images) that bundle everything required to execute that software across platforms. Some software packages we
use in this chapter require complicated setup and configuration. For many of these, there are
preexisting Docker containers that you can download for free and execute locally on your
desktop or notebook computers. This makes Docker a great way to help you get started with
new technologies quickly and conveniently.
Docker also helps with reproducibility in research and analytics studies. You can create
custom Docker containers that are configured with the versions of every piece of software and
every library you used in your study. This would enable others to recreate the environment you used and reproduce your work, and it will help you reproduce your results at a later time.
We’ll use Docker in this section to download and execute a Docker container that’s
preconfigured to run Spark applications.
Installing Docker
You can install Docker for Windows 10 Pro or macOS at:
https://www.docker.com/products/docker-desktop
On Windows 10 Pro, you must allow the "Docker for Windows.exe" installer to make changes to your system to complete the installation process. To do so, click Yes when Windows asks if you want to allow the installer to make changes to your system. 7 Windows 10 Home users must use VirtualBox as described at:
https://docs.docker.com/machine/drivers/virtualbox/
7
Some Windows users might have to follow the instructions under Allow specific apps to make changes to controlled folders at https://docs.microsoft.com/en-us/windows/security/threat-protection/windows-defender-exploit-guard/customize-controlled-folders-exploit-guard.
Linux users should install Docker Community Edition as described at:
https://docs.docker.com/install/overview/
For a general overview of Docker, read the Getting started guide at:
https://docs.docker.com/get-started/
We’ll use the jupyter/pyspark-notebook Docker stack, which is preconfigured with