In [35]: sorted_df = tweet_counts_df.sort_values(
...: by='Tweets', ascending=False)
...:
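As an aside, the following minimal sketch (our own illustration with made-up data, not the chapter's tweet_counts_df) previews the behavior described below: after sort_values, groupby('State') keeps each state's rows in descending order by tweet count:

import pandas as pd

# hypothetical data standing in for the chapter's tweet_counts_df
df = pd.DataFrame({'State': ['SC', 'SC', 'WY', 'WY'],
                   'Tweets': [100, 300, 50, 200]})
sorted_df = df.sort_values(by='Tweets', ascending=False)

for name, group in sorted_df.groupby('State'):
    print(name, list(group['Tweets']))  # SC [300, 100]   WY [200, 50]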
The loop in the following snippet creates the Markers. First,
sorted_df.groupby('State')
groups sorted_df by 'State'. A DataFrame’s groupby method maintains the original
row order in each group. Within a given group, the senator with the most tweets will be first,
because we sorted the senators in descending order by tweet count in snippet [35]:
In [36]: for index, (name, group) in enumerate(sorted_df.groupby('State')):
    ...:     strings = [state_codes[name]]  # used to assemble popup text
    ...:
    ...:     for s in group.itertuples():
    ...:         strings.append(
    ...:             f'{s.Name} ({s.Party}); Tweets: {s.Tweets}')
    ...:
    ...:     text = '<br>'.join(strings)
    ...:     marker = folium.Marker(
    ...:         (locations[index].latitude, locations[index].longitude),
    ...:         popup=text)
    ...:     marker.add_to(usmap)
    ...:
    ...:
We pass the grouped DataFrame to enumerate, so we can get an index for each group,
which we’ll use to look up each state’s Location in the locations list. Each group has a
name (the state code we grouped by) and a collection of items in that group (the two senators
for that state). The loop operates as follows:
We look up the full state name in the state_codes dictionary, then store it in the
strings list—we’ll use this list to assemble the Marker’s popup text.
The nested loop walks through the items in the group collection, returning each as a
named tuple that contains a given senator’s data. We create a formatted string for the
current senator containing the person’s name, party and number of tweets, then append
that to the strings list.
The Marker text can use HTML for formatting. We join the strings list’s elements, separating each from the next with an HTML <br> element, which creates a line break.
We create the Marker. The first argument is the Marker’s location as a tuple containing
the latitude and longitude. The popup keyword argument specifies the text to display if
the user clicks the Marker.
We add the Marker to the map.
In [37]: usmap.save('SenatorsTweets.html')
Open the HTML file in your web browser to view and interact with the map. Recall that you
can drag the map to see Alaska and Hawaii. Here we show the popup text for the South
Carolina marker:
You could enhance this case study to use the sentiment-analysis techniques you learned in previous chapters to rate as positive, neutral or negative the sentiment expressed by people who send tweets (“tweeters”) mentioning each senator’s handle.
16.5 HADOOP
The next several sections show how Apache Hadoop and Apache Spark deal with big-data storage and processing challenges via huge clusters of computers, massively parallel processing, Hadoop MapReduce programming and Spark in-memory processing techniques. Here, we discuss Apache Hadoop, a key big-data infrastructure technology that also serves as the foundation for many recent advancements in big-data processing and an entire ecosystem of software tools that are continually evolving to support today’s big-data needs.
When Google was developing their search engine, they knew that they needed to return
search results quickly. The only practical way to do this was to store and index the entire
Internet using a clever combination of secondary storage and main memory. Computers of
that time couldn’t hold that amount of data and couldn’t analyze it fast enough to guarantee prompt search-query responses. So Google developed a clustering
system, tying together vast numbers of computers—called nodes. Because having more
computers and more connections between them meant greater chance of hardware failures,
they also built in high levels of redundancy to ensure that the system would continue
functioning even if nodes within clusters failed. The data was distributed across all these
inexpensive “commodity computers.” To satisfy a search request, all the computers in the
cluster searched in parallel the portion of the web they had locally. Then the results of those
searches were gathered up and reported back to the user.
To accomplish this, Google needed to develop the clustering hardware and software,
including distributed storage. Google published its designs but did not open-source its software. Programmers at Yahoo!, working from Google’s designs in the “Google File System” paper, 3 then built their own system. They open-sourced their work, and the Apache organization implemented the system as Hadoop. The name came from an elephant stuffed
animal that belonged to a child of one of Hadoop’s creators.
3
http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf.
Two additional Google papers also contributed to the evolution of Hadoop—“MapReduce: Simplified Data Processing on Large Clusters” 4 and “Bigtable: A Distributed Storage System for Structured Data,” 5 which was the basis for Apache HBase (a NoSQL key–value and column-based database). 6
4
http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf.
5
http://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf.
6
Many other influential big-data-related papers (including the ones we mentioned) can be found at: https://bigdatamadesimple.com/research-papers-that-changed-the-world-of-big-data/.
Hadoop’s key components are:
HDFS (Hadoop Distributed File System) for storing massive amounts of data throughout a cluster, and
MapReduce for implementing the tasks that process the data.
Earlier in the book we introduced basic functional-style programming and filter/map/reduce. Hadoop MapReduce is similar in concept, just on a massively parallel scale. A MapReduce task performs two steps—mapping and reduction. The mapping step, which also may include filtering, processes the original data across the entire cluster and maps it into tuples of key–value pairs. The reduction step then combines those tuples to produce the results of the MapReduce task. The key is how Hadoop performs the MapReduce step. Hadoop divides the data into batches that it distributes across the nodes in the cluster—anywhere from a few nodes to a Yahoo! cluster with 40,000 nodes and over 100,000 cores. 7 Hadoop also distributes the MapReduce task’s code to the nodes in the cluster and executes the code in parallel on every node. Each node processes only the batch of data stored on that node. The reduction step combines the results from all the nodes to produce the final result. To coordinate this, Hadoop uses YARN (“yet another resource negotiator”) to manage all the resources in the cluster and schedule tasks for execution.
7
https://wiki.apache.org/hadoop/PoweredBy.
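To make the mapping and reduction steps concrete before we move to the cluster, here is a minimal, purely local sketch (our own illustration, not Hadoop code) that uses Python’s functional-style tools to count word lengths, which is conceptually what the Hadoop MapReduce example later in this section does in parallel across many nodes:

from collections import Counter

lines = ['Romeo loves Juliet', 'Juliet loves Romeo']

# mapping step: emit a (key, value) pair of (word length, 1) for every word
pairs = [(len(word), 1) for line in lines for word in line.split()]

# reduction step: combine all the pairs that share a key into a single total
totals = Counter()
for length, count in pairs:
    totals[length] += count

print(sorted(totals.items()))  # [(5, 4), (6, 2)]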
Hadoop Ecosystem
Though Hadoop began with HDFS and MapReduce, followed closely by YARN, it has grown into a large ecosystem that includes Spark (discussed in Sections 16.6–16.7) and many other Apache projects: 8, 9, 0
8
https://hortonworks.com/ecosystems/.
9
https://readwrite.com/2018/06/26/complete-guide-of-hadoop-ecosystem-components/.
0
https://www.janbasktraining.com/blog/introduction-architecture-components-hadoop-ecosystem/.
Ambari (https://ambari.apache.org)—Tools for managing Hadoop clusters.
Drill (https://drill.apache.org)—SQL querying of non-relational data in Hadoop and NoSQL databases.
Flume (https://flume.apache.org)—A service for collecting and storing (in HDFS and other storage) streaming event data, like high-volume server logs, IoT messages and more.
HBase (https://hbase.apache.org)—A NoSQL database for big data with “billions of rows by millions of columns—atop clusters of commodity hardware.” 1
1
We used the word “by” to replace “X” in the original text.
Hive (https://hive.apache.org)—Uses SQL to interact with data in data warehouses. A data warehouse aggregates data of various types from various sources. Common operations include extracting data, transforming it and loading it (known as ETL) into another database, typically so you can analyze it and create reports from it.
Impala (https://impala.apache.org)—A database for real-time SQL-based queries across distributed data stored in Hadoop HDFS or HBase.
Kafka (https://kafka.apache.org)—Real-time messaging, stream processing and storage, typically used to transform and process high-volume streaming data, such as website activity and streaming IoT data.
Pig (https://pig.apache.org)—A scripting platform that converts data analysis tasks from a scripting language called Pig Latin into MapReduce tasks.
Sqoop (https://sqoop.apache.org)—A tool for moving structured, semi-structured and unstructured data between databases.
Storm (https://storm.apache.org)—A real-time stream-processing system for tasks such as data analytics, machine learning, ETL and more.
ZooKeeper (https://zookeeper.apache.org)—A service for managing cluster configurations and coordination between clusters.
And more.
Hadoop Providers
Numerous cloud vendors provide Hadoop as a service, including Amazon EMR, Google Cloud DataProc, IBM Watson Analytics Engine, Microsoft Azure HDInsight and others. In addition, companies like Cloudera and Hortonworks (which at the time of this writing are merging) offer integrated Hadoop-ecosystem components and tools via the major cloud vendors. They also offer free downloadable environments 2 that you can run on the desktop for learning, development and testing before you commit to cloud-based hosting, which can incur significant costs. We introduce MapReduce programming in the example in the following sections by using a Microsoft cloud-based Azure HDInsight cluster, which provides Hadoop as a service.
2
Check their significant system requirements first to ensure that you have the disk space and
memory required to run them.
Hadoop 3
Apache continues to evolve Hadoop. Hadoop 3 was released in December of 2017 3 with many improvements, including better performance and significantly improved storage efficiency. 4
3
For a list of features in Hadoop 3, see https://hadoop.apache.org/docs/r3.0.0/.
4
https://www.datanami.com/2018/10/18/is-hadoop-officially-dead/.
We want you to experience the process of setting up clusters and using them to perform tasks. So, in this Hadoop example, you’ll use Microsoft Azure’s HDInsight service to create cloud-based clusters of computers in which to test our examples. Go to
https://azure.microsoft.com/en-us/free
to sign up for an account. Microsoft requires a credit card for identity verification.
Various services are always free and some you can continue to use for 12 months. For information on these services see:
https://azure.microsoft.com/en-us/free/free-account-faq/
Microsoft also gives you a credit to experiment with their paid services, such as their
HDInsight Hadoop and Spark services. Once your credits run out or 30 days pass (whichever
comes first), you cannot continue using paid services unless you authorize Microsoft to
charge your card.
Because you’ll use your new Azure account’s credit for these examples, 5 we’ll discuss how to configure a low-cost cluster that uses less computing resources than Microsoft allocates by default. 6 Caution: Once you allocate a cluster, it incurs costs whether you’re using it or not. So, when you complete this case study, be sure to delete your cluster(s) and other resources, so you don’t incur additional charges. For more information, see:
https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-group-portal
5
For Microsoft’s latest free account features, visit https://azure.microsoft.com/en-us/free/.
6
For Microsoft’s recommended cluster configurations, see https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-component-versioning#default-node-configuration-and-virtual-machine-sizes-for-clusters. If you configure a cluster that’s too small for a given scenario, when you try to deploy the cluster you’ll receive an error.
For Azure-related documentation and videos, visit:
https://docs.microsoft.com/en-us/azure/—the Azure documentation.
https://channel9.msdn.com/—Microsoft’s Channel 9 video network.
https://www.youtube.com/user/windowsazure—Microsoft’s Azure channel on YouTube.
To create your cluster, follow the steps in Microsoft’s Create a Hadoop cluster tutorial at:
https://docs.microsoft.com/en-us/azure/hdinsight/hadoop/apache-hadoop-linux-create-cluster-get-started
While following their Create a Hadoop cluster steps, please note the following:
In Step 1, you access the Azure portal by logging into your account at
https://portal.azure.com
In Step 2, Data + Analytics is now called Analytics, and the HDInsight icon and icon
color have changed from what is shown in the tutorial.
In Step 3, you must choose a cluster name that does not already exist. When you enter
your cluster name, Microsoft will check whether that name is available and display a
message if it is not. You must create a password. For the Resource group, you’ll also
need to click Create new and provide a group name. Leave all other settings in this step
as is.
In Step 5: Under Select a Storage account, click Create new and provide a storage
account name containing only lowercase letters and numbers. Like the cluster name, the
storage account name must be unique.
When you get to the Cluster summary you’ll see that Microsoft initially configures the
cluster as Head (2 x D12 v2), Worker (4 x D4 v2). At the time of this writing, the estimated cost-per-hour for this configuration was $3.11. This setup uses a total of 6 CPU nodes with 40 cores—far more than we need for demonstration purposes.
You can edit this setup to use fewer CPUs and cores, which also saves money. Let’s change the configuration to a four-CPU cluster with 16 cores that uses less powerful computers. In
the Cluster summary:
1. Click Edit to the right of Cluster size.
2. Change the Number of Worker nodes to 2.
3. Click Worker node size, then View all, select D3 v2 (this is the minimum CPU size for
Hadoop nodes) and click Select.
4. Click Head node size, then View all, select D3 v2 and click Select.
5. Click Next and click Next again to return to the Cluster summary. Microsoft will
validate the new configuration.
6. When the Create button is enabled, click it to deploy the cluster.
It takes 20–30 minutes for Microsoft to “spin up” your cluster. During this time, Microsoft is
allocating all the resources and software the cluster requires.
After the changes above, our estimated cost for the cluster was $1.18 per hour, based on average use for similarly configured clusters. Our actual charges were less than that. If you encounter any problems configuring your cluster, Microsoft provides HDInsight chat-based support at:
https://azure.microsoft.com/en-us/resources/knowledge-center/technical-chat/
Hadoop streaming lets you implement the mapper and the reducer as Python scripts that communicate with Hadoop via the standard input and output streams:
Hadoop supplies the input to the mapping script—called the mapper. This script reads its input from the standard input stream.
The mapper writes its results to the standard output stream.
Hadoop supplies the mapper’s output as the input to the reduction script—called the reducer—which reads from the standard input stream.
The reducer writes its results to the standard output stream.
Hadoop writes the reducer’s output to the Hadoop file system (HDFS).
The mapper and reducer terminology used above should sound familiar to you from our discussions of functional-style programming and filter, map and reduce in the “Sequences: Lists and Tuples” chapter.
In the mapper script (length_mapper.py), the notation #! in line 1 tells Hadoop to execute
the Python code using python3, rather than the default Python 2 installation. This line must
come before all other comments and code in the file. At the time of this writing, Python 2.7.12
and Python 3.5.2 were installed. Note that because the cluster does not have Python 3.6 or
higher, you cannot use f-strings in your code.
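If you need formatted strings in scripts that must run under the cluster’s Python 3.5, the str.format method is a workable substitute for an f-string. Here is a minimal illustration (our own, not part of the chapter’s scripts):

word = 'Hadoop'
# equivalent to the Python 3.6+ f-string f'{len(word)}\t1'
print('{}\t{}'.format(len(word), 1))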
1 #!/usr/bin/env python3
2 # length_mapper.py
3 """Maps lines of text to key-value pairs of word lengths and 1."""
4 import sys
5
6 def tokenize_input():
7     """Split each line of standard input into a list of strings."""
8     for line in sys.stdin:
9         yield line.split()
10
11 # read each line in the standard input and for every word
12 # produce a key-value pair containing the word's length, a tab and 1
13 for line in tokenize_input():
14     for word in line:
15         print(str(len(word)) + '\t1')
Generator function tokenize_input (lines 6–9) reads lines of text from the standard input
stream and for each returns a list of strings. For this example, we are not removing
punctuation or stop words as we did in the “Natural Language Processing” chapter.
When Hadoop executes the script, lines 13–15 iterate through the lists of strings from
tokenize_input. For each list (line) and for every string (word) in that list, line 15
outputs a key–value pair with the word’s length as the key, a tab (\t) and the value 1,
indicating that there is one word (so far) of that length. Of course, there probably are many
words of that length. The MapReduce algorithm’s reduction step will summarize these key–
value pairs, reducing all those with the same key to a single key–value pair with the total
count.
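To see the mapper’s output format concretely, here is a small local simulation (our own sketch; on the cluster, Hadoop supplies the lines via the standard input stream):

line = 'But soft what light'  # one sample line of input text

# emit the word's length, a tab and 1 for each word, as length_mapper.py does
for word in line.split():
    print(str(len(word)) + '\t1')

# output: one key-value pair per word, with keys 3, 4, 4 and 5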
1 #!/usr/bin/env python3
2 # length_reducer.py
3 """Counts the number of words with each length."""
4 import sys
5 from itertools import groupby
6 from operator import itemgetter
7
8 def tokenize_input():
9     """Split each line of standard input into a key and a value."""
10     for line in sys.stdin:
11         yield line.strip().split('\t')
12
13 # produce key-value pairs of word lengths and counts separated by tabs
14 for word_length, group in groupby(tokenize_input(), itemgetter(0)):
15     try:
16         total = sum(int(count) for word_length, count in group)
17         print(word_length + '\t' + str(total))
18     except ValueError:
19         pass  # ignore word if its count was not an integer
When the MapReduce algorithm executes this reducer, lines 14–19 use the groupby function
from the itertools module to group all word lengths of the same value:
The first argument calls tokenize_input to get the lists representing the key–value
pairs.
The second argument indicates that the key–value pairs should be grouped based on the
element at index 0 in each list—that is the key.
Line 16 totals all the counts for a given key. Line 17 outputs a new key–value pair consisting of the word length and its total. The MapReduce algorithm takes all the final word-count outputs and writes them to a file in HDFS—the Hadoop file system.
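Note that itertools.groupby only groups adjacent items that have equal keys, so the reducer depends on Hadoop sorting the mapper’s key–value pairs by key before they reach the reducer. The following local sketch (our own illustration) mimics that already-sorted input and shows the totals the reducer logic produces:

from itertools import groupby
from operator import itemgetter

# simulated, already-sorted mapper output: [key, value] pairs of strings
pairs = [['3', '1'], ['4', '1'], ['4', '1'], ['5', '1']]

for word_length, group in groupby(pairs, itemgetter(0)):
    total = sum(int(count) for _, count in group)
    print(word_length + '\t' + str(total))

# output (tab-separated): 3 1, then 4 2, then 5 1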
Next, use scp to copy length_mapper.py, length_reducer.py and RomeoAndJuliet.txt from your local system to the cluster. Press Enter only when you’ve typed the entire command:
scp length_mapper.py length_reducer.py RomeoAndJuliet.txt sshuser@YourClusterName-ssh.azurehdinsight.net:
The first time you do this, you’ll be asked for security reasons to confirm that you trust the target host (that is, Microsoft Azure).
7
Windows users: If ssh does not work for you, install and enable it as described at https://blogs.msdn.microsoft.com/powershell/2017/12/15/using-the-openssh-beta-in-windows-10-fall-creators-update-and-windows-server-1709/. After completing the installation, log out and log back in or restart your system to enable ssh.
Next, use ssh to log into the cluster:
ssh sshuser@YourClusterName-ssh.azurehdinsight.net
For this example, we’ll use the following Hadoop command to copy the text file into the already existing folder /example/data that the cluster provides for use with Microsoft’s Azure Hadoop tutorials. Again, press Enter only when you’ve typed the entire command:
hadoop fs -copyFromLocal RomeoAndJuliet.txt /example/data/RomeoAndJuliet.txt
Next, use the following yarn command to run the MapReduce job on RomeoAndJuliet.txt. Press Enter only when you’ve typed the entire command:
yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar
   -D mapred.output.key.comparator.class=
      org.apache.hadoop.mapred.lib.KeyFieldBasedComparator
   -D mapred.text.key.comparator.options=-n
   -files length_mapper.py,length_reducer.py
   -mapper length_mapper.py
   -reducer length_reducer.py
   -input /example/data/RomeoAndJuliet.txt
   -output /example/wordlengthsoutput
The yarn command invokes Hadoop’s YARN (“yet another resource negotiator”) tool to manage and coordinate access to the Hadoop resources the MapReduce task uses. The file hadoop-streaming.jar contains the Hadoop streaming utility that allows you to use Python to implement the mapper and reducer. The two -D options set Hadoop properties that enable it to sort the final key–value pairs by key (KeyFieldBasedComparator) numerically (the -n comparator option) rather than alphabetically, so a key like 10 sorts after 9 rather than after 1. The other command-line arguments are:
-files—A comma-separated list of file names. Hadoop copies these files to every node in the cluster so they can be executed locally on each node.
-mapper—The name of the mapper’s script file.
-reducer—The name of the reducer’s script file.
-input—The file or directory of files to supply as input to the mapper.
-output—The HDFS directory in which the output will be written. If this folder already exists, an error will occur.
The following output shows some of the feedback that Hadoop produces as the MapReduce job executes. We replaced chunks of the output with ... to save space and bolded several lines of interest, including:
The total number of “input paths to process”—the 1 source of input in this example is the
RomeoAndJuliet.txt file.
The “number of splits”—2 in this example, based on the number of worker nodes in our
cluster.
The percentage completion information.
File System Counters, which include the numbers of bytes read and written.
Job Counters, which show the number of mapping and reduction tasks used and
various timing information.
MapReduce Framework, which shows various information about the steps performed.
packageJobJar: [] [/usr/hdp/2.6.5.3004-13/hadoop-mapreduce/hadoop-streaming-2.7.3.2.6.5.3004-
...
18/12/05 16:46:25 INFO mapred.FileInputFormat: Total input paths to process : 1
18/12/05 16:46:26 INFO mapreduce.JobSubmitter: number of splits:2
...
18/12/05 16:46:26 INFO mapreduce.Job: The url to track the job: http://hn0-paulte.y3nghy5db
...
18/12/05 16:46:35 INFO mapreduce.Job: map 0% reduce 0%
18/12/05 16:46:43 INFO mapreduce.Job: map 50% reduce 0%
18/12/05 16:46:44 INFO mapreduce.Job: map 100% reduce 0%
18/12/05 16:46:48 INFO mapreduce.Job: map 100% reduce 100%
18/12/05 16:46:50 INFO mapreduce.Job: Job job_1543953844228_0025 completed successfully
18/12/05 16:46:50 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=156411
FILE: Number of bytes written=813764
...
Job Counters
Launched map tasks=2
Launched reduce tasks=1
...
MapReduce Framework
Map input records=5260
Map output records=25956
Map output bytes=104493
Map output materialized bytes=156417
Input split bytes=346
Combine input records=0
Combine output records=0
Reduce input groups=19
Reduce shuffle bytes=156417
Reduce input records=25956
Reduce output records=19
Spilled Records=51912
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=193
CPU time spent (ms)=4440
Physical memory (bytes) snapshot=1942798336
Virtual memory (bytes) snapshot=8463282176
Total committed heap usage (bytes)=3177185280
...
18/12/05 16:46:50 INFO streaming.StreamJob: Output directory: /example/wordlengthsoutput
To view the word-length counts stored in HDFS, use the following command:
hdfs dfs -text /example/wordlengthsoutput/part-00000
Here are the results of the preceding command:
18/12/05 16:47:19 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
18/12/05 16:47:19 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [ha
1 1140
2 3869
3 4699
4 5651
5 3668
6 2719
7 1624
8 1062
9 855
10 317
11 189
12 95
13 35
14 13
15 9
16 6
17 3
18 1
23 1
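If you copy the output file to your local system (for example, with scp), a few lines of pandas (our own sketch, not part of the chapter’s cluster workflow) let you total and chart the distribution. We assume here that you saved the output file locally as part-00000:

import pandas as pd
import matplotlib.pyplot as plt

# the reducer's output is tab-separated: word length, then count
lengths = pd.read_csv('part-00000', sep='\t', names=['Length', 'Count'])

print(lengths['Count'].sum())  # total number of words counted
lengths.plot.bar(x='Length', y='Count', legend=False)
plt.show()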
Deleting Your Cluster So You Do Not Incur Charges
Caution: Be sure to delete your cluster(s) and associated resources (like
storage) so you don’t incur additional charges. In the Azure portal, click All
resources to see your list of resources, which will include the cluster you set up and the
storage account you set up. Both can incur charges if you do not delete them. Select each
resource and click the Delete button to remove it. You’ll be asked to confirm by typing yes.
For more information, see:
https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-group-portal
16.6 SPARK
In this section, we’ll overview Apache Spark. We’ll use the Python PySpark library and Spark’s functional-style filter/map/reduce capabilities to implement a simple word count example that summarizes the word counts in Romeo and Juliet.
History
Spark was initially developed in 2009 at U.C. Berkeley and funded by DARPA (the Defense Advanced Research Projects Agency). Initially, it was created as a distributed execution engine for high-performance machine learning. 8 It uses an in-memory architecture that “has been used to sort 100 TB of data 3X faster than Hadoop MapReduce on 1/10th of the machines” 9 and runs some workloads up to 100 times faster than Hadoop. 0 Spark’s significantly better performance on batch-processing tasks is leading many companies to replace Hadoop MapReduce with Spark. 1, 2, 3
8
https://gigaom.com/2014/06/28/4-reasons-why-spark-could-jolt-hadoop-into-hyperdrive/.
9
https://spark.apache.org/faq.html.
0
https://spark.apache.org/.
1
https://bigdatamadesimple.com/is-spark-better-than-hadoop-map-reduce/.
2
https://www.datanami.com/2018/10/18/is-hadoop-officially-dead/.
3
https://blog.thecodeteam.com/2018/01/09/changing-face-data-analytics-fast-data-displaces-big-data/.
4
http://spark.apache.org/.
At the core of Spark are resilient distributed datasets (RDDs), which you’ll use to process distributed data using functional-style programming. In addition to reading data from disk and writing data to disk, Hadoop uses replication for fault tolerance, which adds even more disk-based overhead. RDDs eliminate this overhead by remaining in memory—using disk only if the data will not fit in memory—and by not replicating data. Spark handles fault tolerance by remembering the steps used to create each RDD, so it can rebuild a given RDD if a cluster node fails. 5
5
https://spark.apache.org/research.html.
Spark distributes the operations you specify in Python to the cluster’s nodes for parallel execution. Spark streaming enables you to process data as it’s received. Spark DataFrames, which are similar to pandas DataFrames, enable you to view RDDs as a collection of named columns. You can use Spark DataFrames with Spark SQL to perform queries on distributed data. Spark also includes Spark MLlib (the Spark Machine Learning Library), which enables you to perform machine-learning algorithms, like those you learned in Chapters 14 and 15. We’ll use RDDs, Spark streaming, DataFrames and Spark SQL in the next few examples.
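As a preview of these functional-style RDD operations, here is a minimal PySpark word-count sketch (our own illustration; it assumes PySpark is installed locally and that RomeoAndJuliet.txt is in the current folder, rather than the cluster configuration used in the following sections):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('WordCountSketch').getOrCreate()

counts = (spark.sparkContext.textFile('RomeoAndJuliet.txt')
          .flatMap(lambda line: line.lower().split())  # map each line to words
          .map(lambda word: (word, 1))                 # key-value pairs
          .reduceByKey(lambda x, y: x + y))            # total each word's 1s

for word, count in counts.take(5):  # look at a few of the results
    print(word, count)

spark.stop()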
Providers
Hadoop providers typically also provide Spark support. In addition to the providers listed in Section 16.5, there are Spark-specific vendors like Databricks. They provide a “zero-management cloud platform built around Spark.” 6 Their website also is an excellent resource for learning Spark. The paid Databricks platform runs on Amazon AWS or Microsoft Azure. Databricks also provides a free Databricks Community Edition, which is a great way to get started with both Spark and the Databricks environment.
6
https://databricks.com/product/faq.
Docker
Docker is a tool for packaging software into containers (built from images) that bundle everything required to execute that software across platforms. Some software packages we
use in this chapter require complicated setup and configuration. For many of these, there are
preexisting Docker containers that you can download for free and execute locally on your
desktop or notebook computers. This makes Docker a great way to help you get started with
new technologies quickly and conveniently.
Docker also helps with reproducibility in research and analytics studies. You can create
custom Docker containers that are configured with the versions of every piece of software and
every library you used in your study. This would enable others to recreate the environment you used and reproduce your work, and it will help you reproduce your results at a later time.
We’ll use Docker in this section to download and execute a Docker container that’s
preconfigured to run Spark applications.
Installing Docker
You can install Docker for Windows 10 Pro or macOS at:
https://www.docker.com/products/docker-desktop
On Windows 10 Pro, you must allow the "Docker for Windows.exe" installer to make changes to your system to complete the installation process. To do so, click Yes when Windows asks if you want to allow the installer to make changes to your system. 7 Windows 10 Home users must use VirtualBox as described at:
https://docs.docker.com/machine/drivers/virtualbox/
7
Some Windows users might have to follow the instructions under Allow specific apps to make changes to controlled folders at https://docs.microsoft.com/en-us/windows/security/threat-protection/windows-defender-exploit-guard/customize-controlled-folders-exploit-guard.
Linux users should install Docker Community Edition as described at:
https://docs.docker.com/install/overview/
For a general overview of Docker, read the Getting started guide at:
https://docs.docker.com/get-started/
We’ll use the jupyter/pyspark-notebook Docker stack, which is preconfigured with