DA Lab EXERCISE
1. Install Java 8:
a. Download and install Java 8.
b. Set environment variables:
i. User variable:
• Variable: JAVA_HOME
• Value: C:\java
ii. System variable:
• Variable: PATH
• Value: C:\java\bin
c. Check on cmd by running "java -version".
2. Download Hadoop-2.6.x:
a. Extract the Hadoop-2.6.x archive and place the files in the D: drive.
b. Download "hadoop-common-2.6.0-bin-master" and paste all of its files into the
"bin" folder of Hadoop-2.6.x.
c. Create a "data" folder inside Hadoop-2.6.x, and create two more folders inside
it named "data" and "name."
d. Create a folder to store temporary data during execution of a project, such as
“D:\hadoop\temp.”
e. Create a log folder, such as "D:\hadoop\userlog."
f. Go to Hadoop-2.6.x\etc\hadoop and edit four files:
i. core-site.xml
ii. hdfs-site.xml
iii. mapred-site.xml
iv. yarn-site.xml
core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file. -->
<!-- Put site-specific property overrides in this file. -->
<configuration>
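<!-- hadoop.tmp.dir: base directory for Hadoop's temporary files (the folder created in step d) -->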
<property>
<name>hadoop.tmp.dir</name>
<value>D:\hadoop\temp</value>
</property>
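<!-- fs.default.name: URI of the default filesystem, i.e., the NameNode address -->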
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:50071</value>
</property>
</configuration>
hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/hadoop-2.6.0/data/name</value>
<final>true</final>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/hadoop-2.6.0/data/data</value>
<final>true</final>
</property>
</configuration>
mapred-site.xml
<?xml version="1.0"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
<property>
<name>mapreduce.application.classpath</name>
<value>/hadoop-2.6.0/share/hadoop/mapreduce/*,
/hadoop-2.6.0/share/hadoop/mapreduce/lib/*,
/hadoop-2.6.0/share/hadoop/common/*,
/hadoop-2.6.0/share/hadoop/common/lib/*,
/hadoop-2.6.0/share/hadoop/yarn/*,
/hadoop-2.6.0/share/hadoop/yarn/lib/*,
/hadoop-2.6.0/share/hadoop/hdfs/*,
/hadoop-2.6.0/share/hadoop/hdfs/lib/*
</value>
</property>
</configuration>
yarn-site.xml
<?xml version="1.0"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>D:\hadoop\userlog</value>
<final>true</final>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>D:\hadoop\temp\nm-local-dir</value>
</property>
<property>
<name>yarn.nodemanager.delete.debug-delay-sec</name>
<value>600</value>
</property>
<property>
<name>yarn.application.classpath</name>
<value>/hadoop-2.6.0/,
/hadoop-2.6.0/share/hadoop/common/*,
/hadoop-2.6.0/share/hadoop/common/lib/*,
/hadoop-2.6.0/share/hadoop/hdfs/*,
/hadoop-2.6.0/share/hadoop/hdfs/lib/*,
/hadoop-2.6.0/share/hadoop/mapreduce/*,
/hadoop-2.6.0/share/hadoop/mapreduce/lib/*,
/hadoop-2.6.0/share/hadoop/yarn/*,
/hadoop-2.6.0/share/hadoop/yarn/lib/*</value>
</property>
</configuration>
g. Go to "Hadoop-2.6.0\etc\hadoop" and edit "hadoop-env.cmd" by adding the line
set JAVA_HOME=C:\java\jdk1.8.0_91
(point it at your own JDK installation path).
h. Set environment variables: go to My Computer -> Properties -> Advanced system
settings -> Advanced -> Environment Variables
i. User variables:
• Variable: HADOOP_HOME
• Value: D:\hadoop-2.6.0
ii. System variable
• Variable: Path
• Value: D:\hadoop-2.6.0\bin
D:\hadoop-2.6.0\sbin
D:\hadoop-2.6.0\share\hadoop\common\*
D:\hadoop-2.6.0\share\hadoop\hdfs
D:\hadoop-2.6.0\share\hadoop\hdfs\lib\*
D:\hadoop-2.6.0\share\hadoop\hdfs\*
D:\hadoop-2.6.0\share\hadoop\yarn\lib\*
D:\hadoop-2.6.0\share\hadoop\yarn\*
D:\hadoop-2.6.0\share\hadoop\mapreduce\lib\*
D:\hadoop-2.6.0\share\hadoop\mapreduce\*
D:\hadoop-2.6.0\share\hadoop\common\lib\*
i. Check on cmd by running "hadoop version".
j. Format the name node: on cmd, go to the "bin" folder by typing
"cd D:\hadoop-2.6.0\bin" and then run "hdfs namenode -format"
k. Start Hadoop. Go to "D:\hadoop-2.6.0\sbin" and run the following files as
administrator: "start-dfs.cmd" and "start-yarn.cmd"
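Once both scripts are running, the daemons can be verified with jps (a process-listing tool shipped with the JDK); NameNode, DataNode, ResourceManager, and NodeManager should all appear in its output:
jps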
In Eclipse, go to Window -> Perspective -> Open Perspective -> Other -> Map/Reduce and
click OK.
A bar appears at the bottom; click on Map/Reduce Locations.
Right-click on the blank space, then click "Edit settings" and fill in the location details.
Result:-
Thus, the Hadoop environment was installed and configured successfully.
EX.2 Implementation of Word Count program using MapReduce
Aim:-
To implement a word count program using MapReduce.
Program:-
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
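The listing above stops after the imports; the class body below is a minimal completion that follows the standard WordCount example shipped with Apache Hadoop (the class names TokenizerMapper and IntSumReducer come from that example).
public class WordCount {

  // Mapper: emits (word, 1) for every token in the input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: wires the mapper and reducer into a job and runs it.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}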
Result:-
Thus, the word count program was executed in the Hadoop environment.
EX.3 Implementation of MR program using Weather dataset
Aim:-
To find the maximum temperature per year from a sensor temperature dataset, using
the Hadoop MapReduce framework.
Procedure:-
Implement the Mapper and Reducer for finding the maximum temperature in Java.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

//Mapper class: emits a (year, temperature) pair for every record.
//The parsing below assumes each input line holds a year and an integer
//temperature reading separated by whitespace; adjust it to the layout
//of your sensor data sheet.
class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().trim().split("\\s+");
    if (fields.length >= 2) {
      context.write(new Text(fields[0]),
          new IntWritable(Integer.parseInt(fields[1])));
    }
  }
}

//Reducer class: keeps the maximum temperature seen for each year.
class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  public void reduce(Text key, Iterable<IntWritable> values,
      Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}

//Driver class: configures the job and submits it.
public class MaxTemperature {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "max temperature");
    job.setJarByClass(MaxTemperature.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.submit();
  }
}
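Assuming the classes above are packaged into a jar named maxtemp.jar (the jar name is illustrative), the job can be run from cmd with:
hadoop jar maxtemp.jar MaxTemperature <input path> <output path>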
Result:-
Thus, the maximum temperature per year was found using the Hadoop MapReduce framework.
EX.4 Installation and Configuration of Spark
Aim:-
To install and configure Spark on a standalone machine.
Procedure:-
Step 1: Install Java 8
Apache Spark requires Java 8. You can check whether Java is installed from the
command prompt.
Open the command line by clicking Start, typing cmd, and clicking Command Prompt.
Type the following command in the command prompt:
java -version
If Java is installed, the command replies with the installed version details.
Step 2: Install Python
1. To install Python, navigate to https://www.python.org/ in your
web browser.
2. Mouse over the Download menu option and click Python 3.8.3 (the latest version
at the time of writing).
3. Once the download finishes, run the file.
4. Near the bottom of the first setup dialog box, check the Add Python 3.8 to PATH box.
Leave the other box checked.
5. Next, click Customize installation.
6. You can leave all boxes checked at this step, or you can uncheck the options you
do not want.
7. Click Next.
8. Select the box Install for all users and leave other boxes as they are.
9. Under Customize install location, click Browse and navigate to the C drive. Add a
new folder and name it Python.
10. Select that folder and click OK.
11. Click Install, and let the installation complete.
12. When the installation completes, click the Disable path length limit option at the
bottom and then click Close.
13. If you have a command prompt open, restart it. Verify the installation by checking
the version of Python:
python --version
The output should print Python 3.8.3.
Step 3: Download Apache Spark
1. Open a browser and navigate to https://spark.apache.org/downloads.html.
2. Under the Download Apache Spark heading, there are two drop-down menus. Use
the current non-preview version.
In our case, in the Choose a Spark release drop-down menu, select 2.4.5.
In the second drop-down, Choose a package type, leave the selection Pre-built for
Apache Hadoop 2.7.
3. Click the spark-2.4.5-bin-hadoop2.7.tgz link.
4. A page with a list of mirrors loads where you can see different servers to download
from. Pick any from the list and save the file to your Downloads folder.
Step 4: Verify Spark Software File
1. Verify the integrity of your download by checking the checksum of the file. This
ensures you are working with unaltered, uncorrupted software.
2. Navigate back to the Spark Download page and open the Checksum link, preferably
in a new tab.
3. Next, open a command line and enter the following command:
certutil -hashfile c:\users\username\Downloads\spark-2.4.5-bin-hadoop2.7.tgz SHA512
4. Change the username to your username. The system displays a long alphanumeric
code, along with the message Certutil: -hashfile completed successfully.
5. Compare the code to the one you opened in a new browser tab. If they match, your
download file is uncorrupted.
Step 5: Install Apache Spark
Installing Apache Spark involves extracting the downloaded file to the desired location.
1. Create a new folder named Spark in the root of your C: drive. From a command
line, enter the following:
cd \
mkdir Spark
2. In Explorer, locate the Spark file you downloaded.
3. Right-click the file and extract it to C:\Spark using the tool you have on your system
(e.g., 7-Zip).
4. Now, your C:\Spark folder has a new folder spark-2.4.5-bin-hadoop2.7 with the
necessary files inside.
Step 6: Add winutils.exe File
Download the winutils.exe file that matches the underlying Hadoop version of the
Spark package you downloaded.
1. Navigate to this URL https://github.com/cdarlint/winutils and inside the bin
folder, locate winutils.exe, and click it.
2. Find the Download button on the right side to download the file.
3. Now, create a new folder hadoop with a bin folder inside it (C:\hadoop\bin), using
Windows Explorer or the Command Prompt.
4. Copy the winutils.exe file from the Downloads folder to C:\hadoop\bin.
Step 7: Configure Environment Variables
Configuring environment variables in Windows adds the Spark and Hadoop locations
to your system PATH. It allows you to run the Spark shell directly from a command
prompt window.
1. Click Start and type environment.
2. Select the result labeled Edit the system environment variables.
3. A System Properties dialog box appears. In the lower-right corner, click
Environment Variables and then click New in the next window.
4. For Variable Name type SPARK_HOME.
5. For Variable Value type C:\Spark\spark-2.4.5-bin-hadoop2.7 and click OK. If you
changed the folder path, use that one instead.
6. In the top box, click the Path entry, then click Edit. Be careful with editing the system
path. Avoid deleting any entries already on the list.
7. You should see a box with entries on the left. On the right, click New.
8. The system highlights a new line. Enter the path to the Spark folder C:\Spark\spark-
2.4.5-bin-hadoop2.7\bin. We recommend using %SPARK_HOME%\bin to avoid
possible issues with the path.
9. Repeat this process for Hadoop and Java.
• For Hadoop, the variable name is HADOOP_HOME and for the value use the
path of the folder you created earlier: C:\hadoop. Add C:\hadoop\bin to the Path
variable field, but we recommend using %HADOOP_HOME%\bin.
• For Java, the variable name is JAVA_HOME and for the value use the path to
your Java JDK directory (in our case it’s C:\Program Files\Java\jdk1.8.0_251).
To verify the installation, open a new command prompt and run spark-shell. To exit
Spark and close the Scala shell, press Ctrl-D in the command-prompt window.
Result:-
Thus, the SPARK was installed and configured successfully.
EX.5 IMPLEMENT WORD COUNT / FREQUENCY PROGRAMS USING SPARK
Aim:-
To implement word count / frequency programs using Spark.
Program:-
package org.apache.spark.examples;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
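The listing ends with the package and imports; the class body below is a minimal completion that follows the JavaWordCount example shipped with Spark (compatible with the Spark 2.4.x Java API).
public final class JavaWordCount {
  private static final Pattern SPACE = Pattern.compile(" ");

  public static void main(String[] args) throws Exception {
    if (args.length < 1) {
      System.err.println("Usage: JavaWordCount <file>");
      System.exit(1);
    }

    // Set up the Spark context.
    SparkConf conf = new SparkConf().setAppName("JavaWordCount");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Read the input file, split each line into words on spaces,
    // map each word to (word, 1), and sum the counts per word.
    JavaRDD<String> lines = sc.textFile(args[0]);
    JavaRDD<String> words =
        lines.flatMap(s -> Arrays.asList(SPACE.split(s)).iterator());
    JavaPairRDD<String, Integer> ones = words.mapToPair(s -> new Tuple2<>(s, 1));
    JavaPairRDD<String, Integer> counts = ones.reduceByKey((a, b) -> a + b);

    // Collect the results on the driver and print each word with its count.
    List<Tuple2<String, Integer>> output = counts.collect();
    for (Tuple2<?, ?> tuple : output) {
      System.out.println(tuple._1() + ": " + tuple._2());
    }
    sc.stop();
  }
}
Package the class into a jar and submit it with spark-submit --class org.apache.spark.examples.JavaWordCount <jar file> <input file>; each distinct word and its count are printed to the console.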