Week 02
The computing nodes are clustered in racks connected to each other via a fast network.
Data Distributed across racks
High Concurrency:
• Since the data already resides on these nodes, when parts of the data need to be analyzed in a data-parallel fashion, the computation can be moved to those nodes. This improves system performance and enables data parallelism.
• Cluster computing: the computing nodes are clustered in racks connected to each other via a fast network.
• Computing in one or more of these clusters across a local area network or the internet is called distributed computing.
Data Replicated across racks
Scalability
• Data replication also helps scale access to the data by many users.
• With highly parallel replication, each reader can be served by its own replica node to access and analyze the data.
Data Replication provides Fault Tolerance
• High Availability
• Reliability
Benefits of DFS
• Data scalability
• Data partitioning
• Fault tolerance
• Data replication
• High concurrency
Data Parallelism
• In data parallelism, many jobs that share nothing can work on different data sets or on different parts of a data set.
• Large volumes and varieties of big data can be analyzed with this mode of parallelism, achieving scalability, performance, and cost reduction.
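As a minimal sketch of this shared-nothing pattern (assuming the data has already been split into hypothetical partition files part0.txt ... part3.txt), each worker below counts words in its own partition independently and only the small partial results are combined:

from multiprocessing import Pool

def count_words(path):
    # each worker reads only its own partition ("share nothing")
    with open(path) as f:
        return sum(len(line.split()) for line in f)

if __name__ == "__main__":
    partitions = [f"part{i}.txt" for i in range(4)]   # hypothetical splits
    with Pool(processes=4) as pool:
        partial_counts = pool.map(count_words, partitions)
    print("total words:", sum(partial_counts))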
Commodity Cluster
• Commodity clusters are affordable parallel computers with an average
number of computing nodes.
• They are not as powerful as traditional parallel computers and are often
built out of less specialized nodes.
• Commodity clusters are affordable and less specialized.
• The service-oriented computing community has pushed for computing over the internet to be done on commodity clusters as distributed computations, which in turn reduces the cost of computing over the internet.
• The commodity clusters are a cost effective way of achieving data parallel
scalability for big data applications.
Common Failures in Cluster computing
• A node, or an entire rack can fail at any given time.
• The connectivity of a rack to the network can stop, or
• the connections between individual nodes can break.
• It is not practical to restart everything every time a failure happens.
• The ability to recover from such failures is called Fault-tolerance.
• For Fault-tolerance of such systems, we can:
1. Have Redundant data storage, and
2. Restart failed individual parallel jobs.
Programming Models for Big Data
• A programming model is a set of abstract runtime libraries and programming languages that form a model of computation
• The enabling infrastructure for big data analysis is the distributed file system
• The programming model for big data enables the programmability of the operations within distributed file systems
• A programming model allows writing computer programs that work efficiently on top of distributed file systems using big data, and makes it easy to cope with all the potential issues
Requirements for Big Data Programming Models
• Support Big Data Operations
• Split volumes of data: partitioning and placement of data in and out of computer memory, along with a model to synchronize the datasets later on
• Fast access: access to the data should be fast
• Distribution of computations to nodes: allow fast distribution of computation to the nodes within a rack (potentially the data nodes the computation was moved to), and schedule many parallel tasks at once
• Handle Fault Tolerance:
• Reliability: enable reliable computing and fault tolerance in the face of failures
• Recover files when needed.
• Scalability:
• Enable adding more resources when needed, e.g. Racks.
• This is also called Scaling out
• Optimized for different data types: document, table, key-value, graph, stream, and multimedia
MapReduce
MapReduce is a programming model and an associated
implementation for processing large data sets.
• Challenges:
• How to distribute computation?
• Distributed/parallel programming is hard
Single Node Architecture
[Figure: a single node with its CPU and memory, used for machine learning and statistics]
Motivation: Google Example
• 20+ billion web pages x 20KB = 400+ TB
• 1 computer reads 30-35 MB/sec from disk
• ~4 months to read the web
• ~1,000 hard drives to store the web
• Takes even more to do something useful
with the data!
• Today, a standard architecture for such problems is emerging, which
is the Commodity Clusters:
• Cluster of commodity Linux nodes
• Commodity network (ethernet) to connect them
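A quick back-of-the-envelope check of these estimates (using the assumed figures above: 20 billion pages at 20 KB each, one disk reading roughly 35 MB/sec):

pages = 20e9                     # ~20 billion web pages
page_size = 20e3                 # ~20 KB per page, in bytes
total_bytes = pages * page_size
print(f"corpus size: {total_bytes / 1e12:.0f} TB")        # ~400 TB

read_rate = 35e6                 # one disk reads ~35 MB/sec
days = total_bytes / read_rate / 86400
print(f"single-disk read time: {days:.0f} days (~4 months)")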
Cluster Architecture
• 1 Gbps bandwidth between any pair of nodes in a rack
• 2–10 Gbps backbone between racks
[Figure: nodes connected to a switch within each rack; rack switches connected by a backbone switch]
Idea and Solution
• Issue: Copying data over a network takes time
• Idea:
• Bring computation close to the data
• Store files multiple times for reliability
• Map-reduce addresses these problems
• Google’s computational/data manipulation model
• Elegant way to work with big data
• Storage Infrastructure – File system
• Google: GFS. Hadoop: HDFS
• Programming model
• Map-Reduce
Storage Infrastructure
• Problem:
• If nodes fail, how to store data persistently?
• Answer:
• Distributed File System:
• Provides global file namespace
• Google GFS; Hadoop HDFS;
• Typical usage pattern
• Huge files (100s of GB to TB)
• Data is rarely updated in place
• Reads and appends are common
Distributed File System
• Chunk servers
• File is split into contiguous chunks
• Typically each chunk is 16-64MB
• Each chunk replicated (usually 2x or 3x)
• Try to keep replicas in different racks
• Master node
• a.k.a. Name Node in Hadoop’s HDFS
• Stores metadata about where files are stored
• Might be replicated
• Client library for file access
• Talks to master to find chunk servers
• Connects directly to chunk servers to access data
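A minimal sketch of the rack-aware idea behind keeping replicas in different racks (this is not HDFS's actual placement policy, and the rack map below is invented for illustration):

import random

def place_replicas(racks, replication=3):
    """racks: dict mapping rack_id -> list of node names.
    Picks at most one node per rack; assumes at least `replication` racks."""
    chosen = []
    rack_ids = list(racks)
    random.shuffle(rack_ids)
    for rack_id in rack_ids[:replication]:
        chosen.append(random.choice(racks[rack_id]))
    return chosen

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
print(place_replicas(racks))   # e.g. ['n4', 'n1', 'n5']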
Distributed File System
• Reliable distributed file system
• Data kept in “chunks” spread across machines
• Each chunk replicated on different machines
• Seamless recovery from disk or machine failure
[Figure: chunks C0–C5 and D0–D1 replicated across chunk servers 1 through N]
Sample task:
• We have a huge text document
• Sample application:
• Analyze web server logs to find popular URLs
Task: Word Count
Case 1:
• File too large for memory, but all <word, count> pairs fit in memory
• Use a hashtable
Case 2:
• Even the <word, count> pairs do not fit in memory
• Count occurrences of words:
• words(doc.txt) | sort | uniq -c
• where words takes a file and outputs the words in it, one per line
• Case 2 captures the essence of MapReduce
• Great thing is that it is naturally parallelizable
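A minimal sketch of Case 1 (assuming a hypothetical input file doc.txt): stream the file line by line and keep only the <word, count> hash table in memory.

from collections import Counter

counts = Counter()
with open("doc.txt") as f:            # hypothetical input file
    for line in f:
        counts.update(line.split())   # hash table of <word, count>

for word, count in counts.most_common(10):
    print(word, count)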
MapReduce: Overview
• Sequentially read a lot of data
• Map:
• Extract something you care about from each record (Key)
• Group by key: Sort and Shuffle
• Reduce:
• Aggregate, summarize, filter or transform
• Write the result
Outline stays the same, Map and Reduce
change to fit the problem
MapReduce: The Map Step
[Figure: map tasks read input key-value pairs (k, v) and emit intermediate key-value pairs]
MapReduce: The Reduce Step
[Figure: intermediate key-value pairs are grouped by key into key-value groups, and reduce tasks turn each group into output key-value pairs]
Example: WordCount using MapReduce
[Figure: input files 1..N are processed by WordCount to produce the result file]
Step 0: File is stored in a DFS
Step 1: Map on each node
Node 1 input: "My apple is red and my rose is blue ..."
Map generates key-value pairs:
my, my → (my, 1), (my, 1)
apple → (apple, 1)
is, is → (is, 1), (is, 1)
red → (red, 1)
and → (and, 1)
rose → (rose, 1)
blue → (blue, 1)
Node 2 input: "You are the apple of my eye ..."
Map generates key-value pairs:
You → (You, 1)
are → (are, 1)
the → (the, 1)
apple → (apple, 1)
of → (of, 1)
my → (my, 1)
eye → (eye, 1)
Step 2: Sort and Shuffle. Pairs with the same key are moved to the same node:
(You, 1)
(apple, 1), (apple, 1)
(is, 1), (is, 1)
(rose, 1)
(red, 1)
Step 3: Reduce. Add the values for each key:
(You, 1) → (You, 1)
(apple, 1), (apple, 1) → (apple, 2)
def reduce(key, values):
    # key: a word; values: an iterator over counts
    result = 0
    for v in values:
        result += v
    emit(key, result)   # emit() is provided by the MapReduce runtime
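For completeness, a minimal sketch of the matching map function; emit() here is a stand-in for the hook the MapReduce runtime would provide.

intermediate = []

def emit(key, value):
    # stand-in for the runtime's emit(): just collect the pair
    intermediate.append((key, value))

def map_fn(key, value):
    # key: document name; value: one line of the document
    for word in value.split():
        emit(word, 1)

map_fn("doc1", "My apple is red and my rose is blue")
print(intermediate[:4])   # [('My', 1), ('apple', 1), ('is', 1), ('red', 1)]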
[Figure: Map → Shuffle and Sort → Reduce]
More Specifically
• Input: a set of key-value pairs
• Programmer specifies two methods:
• Map(k, v) → <k', v'>*
• Takes a key-value pair and outputs a set of key-value pairs
• E.g., key is the filename, value is a single line in the file
• There is one Map call for every (k,v) pair
• Reduce(k', <v'>*) → <k', v''>*
• All values v’ with same key k’ are reduced together
and processed in v’ order
• There is one Reduce function call per unique key k’
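A small single-machine sketch of this whole pattern (map, group by key, reduce), with map_fn and reduce_fn following the signatures above; it is only an in-memory simulation, not a distributed implementation.

from itertools import groupby
from operator import itemgetter

def map_reduce(inputs, map_fn, reduce_fn):
    intermediate = []
    for k, v in inputs:
        intermediate.extend(map_fn(k, v))        # Map(k, v) -> [(k', v'), ...]
    intermediate.sort(key=itemgetter(0))         # group by key (sort & shuffle)
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        values = [v for _, v in group]
        output.extend(reduce_fn(key, values))    # Reduce(k', [v', ...]) -> [(k', v''), ...]
    return output

docs = [("doc1", "my apple is red and my rose is blue"),
        ("doc2", "you are the apple of my eye")]
print(map_reduce(docs,
                 map_fn=lambda name, text: [(w, 1) for w in text.split()],
                 reduce_fn=lambda word, counts: [(word, sum(counts))]))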
Map-Reduce: A diagram
[Figure: a big document flows through the three phases]
• MAP: Read the input and produce a set of key-value pairs
• Group by key: Collect all pairs with the same key (hash merge, shuffle, sort, partition)
• Reduce: Collect all values belonging to each key and output the result
Map-Reduce: In Parallel
[Figure: map tasks on nodes 1–3 feed grouped keys to reduce tasks]
All phases are distributed, with many tasks doing the work in parallel
Map-Reduce Pattern
Map-Reduce: Environment
Data Flow
• Input and final output are stored on a distributed file system (FS):
• Scheduler tries to schedule map tasks “close” to physical storage location of
input data
Coordination: Master
• Master node takes care of coordination:
• Task status: (idle, in-progress, completed)
• Idle tasks get scheduled as workers become available
• When a map task completes, it sends the master the location and sizes of its R
intermediate files, one for each reducer
• Master pushes this info to reducers
Dealing with Failures
• Map worker failure
• Map tasks completed or in-progress at worker are reset to idle since output of
the mapper is stored on the local FS
• Reduce workers are notified when task is rescheduled on another worker
• Reduce worker failure
• Only in-progress tasks are reset to idle since output of the reducer is stored
on the DFS
• Reduce task is restarted
• Master failure
• MapReduce task is aborted and client is notified
How many Map and Reduce jobs?
• M map tasks, R reduce tasks
• Rule of thumb:
• Make M much larger than the number of nodes in the
cluster
• One DFS chunk per map is common
• Improves dynamic load balancing and speeds up
recovery from worker failures
• Usually R is smaller than M
• Because output is spread across R files. If there are too
many Reduce tasks the number of intermediate files
explodes.
Combiners
• Often a Map task will produce many pairs of the form (k,v1), (k,v2), …
for the same key k (e.g. the word “the”)
• E.g., popular words in the word count example
• Can save network time by pre-aggregating values in the mapper:
• combine(k, list(v1)) → v2
• Combiner is usually the same as the reduce function
Combiners
• Back to our word counting example:
• Combiner combines the values of all keys of a single
mapper (single machine):
Combiners – Caution!
• Works only if reduce function is commutative and associative
• Sum is commutative and associative
• a + b = b + a and (a + b) + c = a + (b + c)
Combiners – Caution!
• Works only if reduce function is commutative and associative
• Average is NOT commutative and associative
• Combining two partial averages as (avg1 + avg2) / 2 does not give the true average
Combiners – Caution!
• Works only if reduce function is commutative and associative
• Average is NOT commutative and associative, but can be computed if both
the sum and count are returned by the map function (Combiner Trick)
• Each map task emits a partial (sum, count) pair; the reducer computes (sum1 + sum2) / (count1 + count2), which is the true average ✓
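A sketch of this combiner trick (function names are illustrative): partial (sum, count) pairs are associative and commutative to combine, and only the final step divides.

def combine_partials(partials):
    # safe to apply in any order or grouping: sums and counts just add up
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total, count

def final_average(partials):
    total, count = combine_partials(partials)
    return total / count

node1 = (sum([1, 2, 3]), 3)      # (sum, count) from one mapper
node2 = (sum([10, 20]), 2)       # (sum, count) from another mapper
print(final_average([node1, node2]))   # 7.2 = true average of [1, 2, 3, 10, 20]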
Combiners – Caution!
• Works only if reduce function is commutative and associative
• Median is NOT commutative and Associative
• There is no way to split the median computation into independent partial computations
Partition Function
• Want to control how keys get partitioned
• Inputs to map tasks are created by contiguous splits of
input file
• Ensure that records with the same intermediate key end up
at the same worker
• System uses a default partition function:
• hash(key) mod R
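A minimal sketch of the default scheme hash(key) mod R; a stable digest (MD5) is used here instead of Python's built-in hash(), which is salted between runs.

import hashlib

def partition(key, num_reducers):
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_reducers   # hash(key) mod R

R = 4
for word in ["apple", "is", "my", "rose"]:
    print(word, "-> reducer", partition(word, R))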
Problems Suited for
Map-Reduce
Example: Host size
• Suppose we have a large web corpus
• Look at the metadata file
• Lines of the form: (URL, size, date, …)
• For each host, find the total number of bytes
• That is, the sum of the page sizes for all URLs from that particular host
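A sketch of host size in the MapReduce pattern, assuming metadata lines of the illustrative form "URL, size, date": map emits (host, size) per line, and reduce sums the sizes per host.

from urllib.parse import urlparse

def map_fn(_, line):
    url, size, *rest = (field.strip() for field in line.split(","))
    yield urlparse(url).netloc, int(size)     # (host, size in bytes)

def reduce_fn(host, sizes):
    yield host, sum(sizes)                    # total bytes for this host

lines = ["http://example.com/a, 1200, 2024-01-01",
         "http://example.com/b, 800, 2024-01-02"]
groups = {}
for line in lines:
    for host, size in map_fn(None, line):
        groups.setdefault(host, []).append(size)
for host, sizes in groups.items():
    print(list(reduce_fn(host, sizes)))       # [('example.com', 2000)]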
Example: Language Model
• Statistical machine translation:
• Need to count number of times every 5-word sequence occurs in a large
corpus of documents
59
Example: Relational-Algebra Operations in database queries (SQL)
Background
• A relation is a table with column headers called attributes.
• Rows of the relation are called tuples.
• The set of attributes of a relation is called its schema.
• R(A1, A2, . . . , An): the relation name is R and its attributes are A1, A2, . . . , An.

Example relations:
table1 (attributes/column headers: a, b, c; each row is a tuple):
a1 b1 c1
a1 b2 c2
a1 b3 c3
a4 b4 c4

table2 (attributes: d, b, f):
d1 b1 f1
d2 b2 f2
d3 b3 f3
d4 b4 f4

Relational operations
• Selection: Apply a condition C to each tuple in the relation and produce as output only those tuples that satisfy C. Denoted σ_C(R).
• Example SQL: select * from table1 where a = 'a1';
• Projection: For some subset S of the attributes of the relation, produce from each tuple only the components for the attributes in S. Denoted π_S(R).
• Example SQL: select a,b from table1;
• Union, Intersection, and Difference
• Example SQL: select a,b,c from table1 union select d,b,f from table2;
• Natural Join: Given two relations, compare each pair of tuples, one from each relation. If the tuples agree on all the attributes that are common to the two schemas, then produce a tuple that has components for each of the attributes in either schema. The natural join of relations R and S is denoted R ⋈ S.
• Example SQL (inner join): select * from table1 R inner join table2 S on R.b = S.b;
• Grouping and Aggregation: Given a relation R, partition its tuples according to their values in one set of attributes G, called the grouping attributes. Then, for each group, aggregate the values in certain other attributes. The normally permitted aggregations are SUM, COUNT, AVG, MIN, and MAX.
• Example SQL: select a, count(*) from table1 group by a;
Example: Join By Map-Reduce
• Compute the natural join R(A,B) ⋈ S(B,C)
• R and S are each stored in files
• Tuples are pairs (a,b) or (b,c)
R(A, B)      S(B, C)      R ⋈ S (result shown as A, C)
a1 b1        b2 c1        a3 c1
a2 b1        b2 c2        a3 c2
a3 b2        b3 c3        a4 c3
a4 b3
Example: Join By Map-Reduce
• Use a hash function h from B-values to 1...k
• A Map process turns:
• Each input tuple R(a,b) into key-value pair (b,(a,R))
• Each input tuple S(b,c) into (b,(c,S))
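A single-machine sketch of this reduce-side join (the grouping dict stands in for the shuffle): map tags each tuple with its relation and keys it by b; reduce pairs up the R- and S-tuples that share a b value.

from collections import defaultdict

def map_fn(tup, relation):
    if relation == "R":
        a, b = tup                    # R tuple (a, b)
        yield b, (a, "R")
    else:
        b, c = tup                    # S tuple (b, c)
        yield b, (c, "S")

def reduce_fn(b, values):
    r_vals = [a for a, tag in values if tag == "R"]
    s_vals = [c for c, tag in values if tag == "S"]
    for a in r_vals:
        for c in s_vals:
            yield a, b, c             # one joined tuple per matching pair

R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2"), ("a4", "b3")]
S = [("b2", "c1"), ("b2", "c2"), ("b3", "c3")]
groups = defaultdict(list)            # stands in for the shuffle / group-by-key
for rel_name, rel in (("R", R), ("S", S)):
    for tup in rel:
        for k, v in map_fn(tup, rel_name):
            groups[k].append(v)
for b, values in groups.items():
    for joined in reduce_fn(b, values):
        print(joined)                 # ('a3','b2','c1'), ('a3','b2','c2'), ('a4','b3','c3')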
Example: Batch Gradient Descent
Batch Gradient Descent: $\theta_j := \theta_j - \alpha \frac{1}{400} \sum_{i=1}^{400} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
• Assume m = 400; normally m would be more like 400,000,000
• If m is large, this update is really expensive
• Split the training set into different subsets (4 subsets for 4 nodes [mappers])
• Node 1 uses $(x^{(1)}, y^{(1)}), \ldots, (x^{(100)}, y^{(100)})$ and computes the partial sum
$\text{temp}_j^{(1)} := \sum_{i=1}^{100} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
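A small sketch of this split in code (synthetic data, NumPy assumed): each "mapper" computes a partial gradient sum over its quarter of the training set, and a "reducer" adds the partial sums and applies one update of θ.

import numpy as np

def map_partial_gradient(X_part, y_part, theta):
    errors = X_part @ theta - y_part        # h_theta(x^(i)) - y^(i) on this split
    return X_part.T @ errors                # partial sum of gradient terms

def reduce_update(partials, theta, alpha, m):
    grad = sum(partials) / m                # (1/m) * sum over all splits
    return theta - alpha * grad

rng = np.random.default_rng(0)
X, y = rng.normal(size=(400, 3)), rng.normal(size=400)   # m = 400, 3 features
theta = np.zeros(3)
splits = np.array_split(np.arange(400), 4)               # 4 mapper nodes
partials = [map_partial_gradient(X[i], y[i], theta) for i in splits]
theta = reduce_update(partials, theta, alpha=0.1, m=400)
print(theta)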
Problems not suited for MapReduce
• MapReduce is great for:
• Problems that require sequential data access
• Large batch jobs (not interactive, real-time)
• MapReduce is inefficient for problems where random (or irregular)
access to data required:
• Graphs
• Interdependent data
• Comparisons of many pairs of items
Cloud Computing
• Ability to rent computing by the hour
• Additional services e.g., persistent storage
Cost in MapReduce Jobs
• Computation Cost of mappers and the reducers
• System cost is principally sorting key-value pairs by key and merging them at
Reduce tasks
• Communication Cost of shipping key-value pairs from mappers to
reducers
• Map tasks are executed where their input data resides, so no communication is required to read the input
Hands on practice environments
• Cloudera VM
• https://downloads.cloudera.com/demo_vm/virtualbox/cloudera-quickstart-
vm-5.4.2-0-virtualbox.zip
• HDFS, SQL, MongoDB, Spark
• Google CoLab/Jupyter Notebooks
• Spark, Python
Cloudera VM for hands on: Installation
instructions
Instructions
Please use the following instructions to download and install the Cloudera Quickstart VM with VirtualBox before proceeding
to the Getting Started with the Cloudera VM Environment video. The screenshots are from a Mac but the instructions should
be the same for Windows. Please see the discussion boards if you have any issues.
1. Install VirtualBox. Go to https://www.virtualbox.org/wiki/Downloads to download and install VirtualBox for your
computer.
2. Download the Cloudera VM. Download the Cloudera VM
from https://downloads.cloudera.com/demo_vm/virtualbox/cloudera-quickstart-vm-5.4.2-0-virtualbox.zip. The VM is over
4 GB, so it will take some time to download.
3. Unzip the Cloudera VM:
On Mac: Double click cloudera-quickstart-vm-5.4.2-0-virtualbox.zip
On Windows: Right-click cloudera-quickstart-vm-5.4.2-0-virtualbox.zip and select “Extract All…”
4. Start VirtualBox.
5. Begin importing. Import the VM by going to File -> Import Appliance
6. Click the Folder icon.
7. Select the cloudera-quickstart-vm-5.4.2-0-virtualbox.ovf from the Folder where you unzipped the VirtualBox
VM and click Open.
8. Click Continue to proceed.
9. Click Import.
10. The virtual machine image will
be imported. This can take several
minutes.
11. Launch Cloudera VM. When the importing is finished, the quickstart-vm-5.4.2-0 VM will appear on the left
in the VirtualBox window. Select it and click the Start button to launch the VM.
12. Cloudera VM booting. It will take several minutes for the Virtual Machine to start. The booting
process takes a long time since many Hadoop tools are started
13. The Cloudera VM desktop. Once the booting process is complete, the desktop will appear with a browser.
Apache Hadoop:
A Framework to manage and process Big Data
Apache Hadoop
• Hadoop is a framework that allows us to store and process large data sets in a parallel and distributed manner
• Created in 2005; developed and scaled extensively at Yahoo!
• Written in Java
• Implements the MapReduce big data programming model
• Many more big data frameworks have since been released; now there are over 100!
The Hadoop ecosystem includes HDFS, YARN, MapReduce, Zookeeper, Hive, Pig, Giraph, Storm, Spark, Flink, HBase, Cassandra, MongoDB, and many more.
The Hadoop Ecosystem
[Figure: the ecosystem as a stack. Higher levels provide interactivity: Hive, Pig, Giraph, Spark, Storm, Flink. Lower levels provide storage and scheduling: YARN and HDFS, with Zookeeper alongside. MapReduce and the NoSQL stores (HBase, Cassandra, MongoDB) sit in between.]
The Hadoop Ecosystem
• Distributed file system as foundation
• Scalable storage
• Fault tolerance
The Hadoop Ecosystem
• YARN: Yet Another Resource Negotiator
• It is used for Flexible scheduling and resource management
• YARN schedules jobs on more than 40,000 servers at Yahoo!
The Hadoop Ecosystem
Implementation of the MapReduce Programming Model
The Hadoop Ecosystem
• Real-time and in-memory processing
• 100x faster for some tasks
The Hadoop Ecosystem
• NoSQL databases
• Key-value stores
• HBase is used by Facebook's messaging platform
The Hadoop Ecosystem
• Zookeeper for management
• Synchronization
• Configuration
• High-availability
• All these tools are open-source
• Large community for support
HDFS allows storing massively large data sets
• HDFS achieves scalability by partitioning or splitting large files across multiple computers
• The default chunk size (the size of each piece of a file) is 128 MB, but it can be configured to any size.
• What happens if a node fails? Data may get lost unless it is replicated.
• Replication Factor: the number of times the Hadoop framework replicates each data block
• The default replication factor is 3
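A quick illustration of what these two settings mean for storage (the 1 GB file size below is just an example):

import math

file_size_mb = 1000        # a hypothetical 1 GB file
block_size_mb = 128        # default HDFS block (chunk) size
replication = 3            # default replication factor

blocks = math.ceil(file_size_mb / block_size_mb)
raw_storage_mb = file_size_mb * replication
print(f"{blocks} blocks, ~{raw_storage_mb} MB of raw storage across the cluster")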
Two key components of HDFS
• NameNode (also called the master node) for metadata
• It tracks files, manages the file system, and holds the metadata of all of the data stored within it
• Contains the number of blocks, the locations of the DataNodes the data is stored on, where the replications are stored, and other details
• The NameNode has direct contact with the client
• Usually one per cluster
• Secondary NameNode: takes care of the checkpoints of the file system metadata held in the NameNode
• DataNode: the worker node that stores the actual data blocks and serves read and write requests from clients
Data locality is the process of moving the computation close to where the
actual data resides on the node, instead of moving large data to computation
YARN:
The Resource Manager for Hadoop
YARN: Yet Another Resource Negotiator
• YARN provides flexible resource management for Hadoop cluster. It interacts with applications and schedules
resources for their use.
• Allows batch processing, stream processing, graph processing, and interactive processing.
• YARN enables running multiple applications over HDFS other than MapReduce. Extends Hadoop to enable
multiple frameworks such as Giraph, Spark and Flink
Hadoop evolved over time
[Figure: Hadoop 1.0 ran MapReduce directly on HDFS; with YARN in between (Hadoop 2.0), MapReduce, Giraph, Spark, Storm, Flink, HBase and other frameworks all run on the same HDFS cluster]
YARN led to more applications, and the list keeps growing.
Benefits of YARN (Hadoop 2.0)
• It lets you run many distributed applications over the same Hadoop
cluster.
• YARN reduces the need to move data around and supports higher
resource utilization resulting in lower costs.
• It's a scalable platform that has enabled growth of several
applications over the HDFS, enriching the Hadoop ecosystem.
When to reconsider Hadoop?
When to use Hadoop?
• Future anticipated data growth
• Long term availability of data
• Many platforms over single data store
• High Volume
• High Variety
When Hadoop may not be the most suitable option:
• Small Datasets
• Task Level Parallelism
• Advanced Algorithms
• Replacement to your infrastructure – Databases are still needed
• Random Data Access
Hands on Exercise
• Copy data into the Hadoop Distributed File System (HDFS)
• Run the WordCount program
Copy data into the Hadoop Distributed File System (HDFS)
Download the Shakespeare text. Enter the following link in the browser:
http://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt and save the file as words.txt
• Open a terminal shell: Open a terminal shell by clicking on the square black box on
the top left of the screen.
• Run cd Downloads to change to the Downloads directory.
• Run ls to see that words.txt was saved.
• Copy file to HDFS: Run hadoop fs -copyFromLocal words.txt to copy
the text file to HDFS.
• Verify file was copied to HDFS: Run hadoop fs -ls to verify the file was
copied to HDFS.
• Copy a file within HDFS: Run hadoop fs -cp words.txt words2.txt to
make a copy of words.txt called words2.txt
• Copy a file from HDFS: Run hadoop fs -copyToLocal words2.txt . to
copy words2.txt to the local directory.
• Delete a file in HDFS: Run hadoop fs -rm words2.txt
Run the WordCount program
• See example MapReduce programs. Hadoop comes with several example MapReduce applications.
• List the programs by running hadoop jar /usr/jars/hadoop-examples.jar.
• We are interested in running wordcount.
Run the WordCount program
• Verify words.txt file exists. Run hadoop fs -ls
• See WordCount command line arguments. We can learn how to run WordCount by examining its command-line
arguments. Run hadoop jar /usr/jars/hadoop-examples.jar wordcount.
• Run WordCount. Run hadoop jar /usr/jars/hadoop-examples.jar wordcount words.txt out to count the words in words.txt and store the result in the HDFS directory out.
• Look inside output directory. The directory created by WordCount contains several files. Look inside the directory by running hadoop fs -ls out
• Copy WordCount results to local file system. Copy part-r-00000 to the local file system by running hadoop fs -copyToLocal out/part-r-00000 local.txt
• View the WordCount results. View the contents of the results: more local.txt
Further Reading
Reading
• Chapter 2: Mining of Massive Datasets, by Anand Rajaraman and Jeffrey David
Ullman. http://www.mmds.org/.
• Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung: The Google File
System
• http://labs.google.com/papers/gfs.html
Resources
• Hadoop Wiki
• Introduction
• http://wiki.apache.org/lucene-hadoop/
• Getting Started
• http://wiki.apache.org/lucene-hadoop/GettingStartedWithHadoop
• Map/Reduce Overview
• http://wiki.apache.org/lucene-hadoop/HadoopMapReduce
• http://wiki.apache.org/lucene-hadoop/HadoopMapRedClasses
• Eclipse Environment
• http://wiki.apache.org/lucene-hadoop/EclipseEnvironment
• Javadoc
• http://lucene.apache.org/hadoop/docs/api/
Resources
• Releases from Apache download mirrors
• http://www.apache.org/dyn/closer.cgi/lucene/hadoop/
• Nightly builds of source
• http://people.apache.org/dist/lucene/hadoop/nightly/
• Source code from subversion
• http://lucene.apache.org/hadoop/version_control.html