scala-1

MapReduce is a programming model for processing large datasets in parallel, consisting of Map and Reduce functions, and is foundational in big data applications. Apache Hadoop uses the MapReduce model for batch processing, while Apache Spark employs an in-memory processing model for better performance. The document also includes a step-by-step guide for setting up a Spark-Scala application and demonstrates a simple word count example using RDDs in Spark.


MapReduce is a programming model designed for processing and generating large datasets in parallel across a distributed cluster of computers. It simplifies the complexities of parallel programming by dividing tasks into two main functions: Map and Reduce. This model is foundational in big data processing and is utilized in various applications across different industries.
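
To make the two phases concrete, here is a minimal single-machine sketch in plain Scala (no cluster, no framework involved): the map phase emits (word, 1) pairs, and the reduce phase sums the counts for each key.

```scala
// Toy illustration of the MapReduce model using plain Scala collections.
val lines = Seq("big data", "big ideas")

// Map phase: turn each line into (word, 1) pairs
val mapped = lines.flatMap(_.split("\\s+").map(word => (word, 1)))

// Shuffle + Reduce phase: group the pairs by word, then sum the 1s
val counts = mapped.groupBy(_._1).map { case (word, pairs) =>
  (word, pairs.map(_._2).sum)
}

println(counts) // e.g. Map(big -> 2, data -> 1, ideas -> 1)
```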

Apache Hadoop and Apache Spark are both powerful open-source frameworks for processing large datasets, but they differ in architecture, performance, and use cases.

1. Processing Models:

●​ Hadoop: Utilizes the MapReduce programming model, processing data in discrete batches. Each MapReduce job reads data from disk, processes it, and writes the results back to disk, which can introduce latency due to frequent disk I/O operations.

●​ Spark: Employs a Directed Acyclic Graph (DAG) execution engine that performs computations in memory, significantly reducing disk I/O. This in-memory processing enables Spark to handle iterative and interactive tasks more efficiently (see the sketch after this list).
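
As a rough sketch of why in-memory execution helps iterative work (this assumes a running spark-shell, where sc is already defined; input.txt is a hypothetical file):

```scala
// Read a hypothetical text file and keep it in memory after first use.
val data = sc.textFile("input.txt").cache()

// The first action computes the RDD and caches its partitions in memory;
// the second action reuses the cached data instead of re-reading from disk.
val totalLines    = data.count()
val nonEmptyLines = data.filter(_.trim.nonEmpty).count()
```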


Step 1: Install Java

sudo apt update
sudo apt install openjdk-11-jdk
java -version

Step 2: Install Scala

sudo apt install scala
scala -version

Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL

Step 3: Install sbt

echo "deb https://repo.scala-sbt.org/scalasbt/debian all main" | sudo tee /etc/apt/sources.list.d/sbt.list
curl -sL "https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x99e82a75642ac823" | sudo apt-key add -
sudo apt update
sudo apt install sbt

Step 4: Download Spark

wget https://dlcdn.apache.org/spark/spark-3.5.5/spark-3.5.5-bin-hadoop3.tgz

Step 5: Extract and move Spark

tar -xvzf spark-3.5.5-bin-hadoop3.tgz
mv spark-3.5.5-bin-hadoop3 ~/spark

Step 6: Set environment variables

export SPARK_HOME=~/spark
export PATH=$SPARK_HOME/bin:$PATH

Comment: add the two export lines above to ~/.bashrc so they persist across sessions.
Step 7: Create the project structure

mkdir spark-scala-app
cd spark-scala-app
mkdir -p src/main/scala
touch build.sbt
touch src/main/scala/WordCount.scala
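
Step 7 creates an empty build.sbt but the guide does not show its contents, so here is a minimal sketch. The Scala and Spark versions are assumptions chosen to match the Spark 3.5.5 download above (Spark 3.5.x prebuilt binaries target Scala 2.12/2.13):

```scala
// build.sbt — minimal sketch for the word count application
name := "spark-scala-app"
version := "0.1"
scalaVersion := "2.12.18"

// "provided" because spark-shell / spark-submit supplies Spark at runtime
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.5" % "provided"
```

Marking Spark as "provided" keeps the packaged jar small, since spark-submit already ships the Spark classes; drop the "provided" qualifier if you want to run the app with sbt run instead.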

spark-shell

scala> val text = sc.parallelize(Seq("spark is fast", "scala is powerful", "spark scala together"))
text: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:23

text is an RDD of strings. RDD stands for Resilient Distributed Dataset; it is the core data structure in Apache Spark.


scala> val words = text.flatMap(_.split("\\s+"))
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at flatMap at <console>:23

words is an RDD of individual words. The key part is the regular expression \s+.

\s (single backslash) means: any whitespace character.


It can match:

●​ space " "​

●​ tab \t​

●​ newline \n​

●​ carriage return \r, etc.

In Scala, we write \\s instead of \s because:

●​ In a string literal, \ is an escape character.​

●​ So to get a literal \, you write it as \\.


"\\s" → means regex `\s` → matches any whitespace

🔹 What does + mean?

+ in regex means: "one or more times".

So "\\s+" matches one or more whitespace characters.

"hello world" → split into "hello" and "world" (because of 3 spaces)​

👉
"one\ttwo\nthree" → split into "one", "two", "three"
splits each line into words, no matter how many spaces or tabs separate
them.
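
These claims are easy to verify directly in the Scala REPL:

```scala
"hello   world".split("\\s+").foreach(println)   // hello, world
"one\ttwo\nthree".split("\\s+").foreach(println) // one, two, three
```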

🔹 What is Whitespace?

Whitespace means any character that creates space on the screen but is not itself visible. Whitespace characters are used to separate words or lines, but they don't display an actual symbol or letter.

🔸 Common Whitespace Characters:

Character     Description             Looks Like
" " (space)   Regular space           ␣ (invisible)
\t            Tab                     → (moves cursor to next tab stop)
\n            Newline (line break)    ↓ (moves to next line)
\r            Carriage return         ⏎ (used in old systems)
\f            Form feed               — (rarely used now)

"hello world" → space


"hello\tworld" → tab
"hello\nworld" → newline

scala> val wordPairs = words.map(word => (word, 1))
wordPairs: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[2] at map at <console>:23

wordPairs is an RDD of (String, Int) tuples.
scala> val wordCounts = wordPairs.reduceByKey(_ + _)
wordCounts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[3] at reduceByKey at <console>:23

This line counts how many times each word appears. Here's how it works step by step. At this point the pairs look like:

("spark", 1), ("is", 1), ("fast", 1), ("spark", 1), ("is", 1), ...

reduceByKey(_ + _) means:

●​ Group all the key-value pairs by key (i.e., by word).​

●​ For each group, add up the values (the 1s).​

🔸 _ + _ Explanation:

This is a shorthand for:

(x, y) => x + y

For example: ("spark", 1), ("spark", 1) → ("spark", 2)



scala> wordCounts.collect().foreach(println)
(scala,2)
(together,1)
(powerful,1)
(is,2)
(fast,1)
(spark,2)
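
The same pipeline can be placed in the WordCount.scala file created in step 7. The following is a sketch rather than part of the original walkthrough: the SparkConf setup and the local[*] master are assumptions suitable for a single machine.

```scala
// src/main/scala/WordCount.scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark on this machine using all available cores
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val text = sc.parallelize(Seq(
      "spark is fast", "scala is powerful", "spark scala together"))

    val wordCounts = text
      .flatMap(_.split("\\s+"))   // lines -> words
      .map(word => (word, 1))     // words -> (word, 1) pairs
      .reduceByKey(_ + _)         // sum counts per word

    wordCounts.collect().foreach(println)
    sc.stop()
  }
}
```

Build the jar with sbt package and run it with spark-submit --class WordCount, pointing at the jar that sbt produces (its path follows from the name and version in the build.sbt sketch above).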
