scala-1
Apache Hadoop and Apache Spark are both powerful open-source frameworks for processing
large datasets, but they differ in architecture, performance, and use cases.
1. Processing Models:
● Spark: Employs a Directed Acyclic Graph (DAG) execution engine that performs
computations in memory, significantly reducing disk I/O. This in-memory processing
enables Spark to handle iterative and interactive tasks more efficiently.
step 1
step 2
sudo apt install scala
scala -version
step 3
echo "deb https://repo.scala-sbt.org/scalasbt/debian all main" | sudo tee /etc/apt/sources.list.d/sbt.list
curl -sL "https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x99e82a75642ac823" | sudo apt-key add -
sudo apt update
step 4
wget https://dlcdn.apache.org/spark/spark-3.5.5/spark-3.5.5-bin-hadoop3.tgz
step 5
tar -xvzf spark-3.5.5-bin-hadoop3.tgz
mv spark-3.5.5-bin-hadoop3 ~/spark
step 6
export SPARK_HOME=~/spark
export PATH=$SPARK_HOME/bin:$PATH
step 7:
mkdir spark-scala-app
cd spark-scala-app
mkdir -p src/main/scala
touch build.sbt
touch src/main/scala/WordCount.scala
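The empty build.sbt created above needs a Spark dependency before WordCount.scala can compile. A minimal sketch, assuming the Spark 3.5.5 download from step 4 and its default Scala 2.12 build. The project name, version number, and exact Scala patch release are assumptions, not values from these notes:

```scala
// build.sbt (sketch; name, version, and Scala patch release are assumed)
name := "spark-scala-app"
version := "0.1.0"

// Must match the Scala binary version of the Spark artifacts (2.12 here).
scalaVersion := "2.12.18"

// %% appends the Scala binary version, resolving to spark-core_2.12.
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.5"
```

If the Scala version in build.sbt and the Spark build disagree, dependency resolution or class loading is likely to fail, so the two are best kept in step.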
spark-shell
text: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:23
scala>
● tab \t
● newline \n
---
So:
"\\s+" matches one or more whitespace characters.
"one\ttwo\nthree".split("\\s+") gives "one", "two", "three".
Splitting on "\\s+" therefore breaks each line into words, no matter how many spaces or tabs separate
them.
🔹 What is Whitespace?
Whitespace is any character that takes up space on the screen without drawing a visible symbol, such as a space, a tab, or a newline.
Whitespace characters separate words or lines, but they display no symbols or letters of their own.
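The whitespace split can be tried without Spark, since it is an ordinary String method. A minimal sketch in plain Scala; the object name SplitDemo and the helper tokens are hypothetical names used only for illustration:

```scala
object SplitDemo {
  // Split a line on runs of whitespace (spaces, tabs, newlines).
  def tokens(line: String): Array[String] = line.split("\\s+")

  def main(args: Array[String]): Unit = {
    // A tab and a newline both count as separators.
    println(tokens("one\ttwo\nthree").mkString(", "))    // one, two, three
    // Several spaces or a mix of spaces and tabs act as one separator.
    println(tokens("spark   is \t fast").mkString(", ")) // spark, is, fast
  }
}
```

One caveat worth knowing: a line that starts with whitespace yields an empty first element, so calling trim on the line before splitting is a common precaution.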
("spark", 1), ("is", 1), ("fast", 1), ("spark", 1), ("is", 1), ...
This means:
🔸 _ + _ Explanation:
This is a shorthand for:
(x, y) => x + y
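To see that the two forms behave the same, here is a small sketch in plain Scala. It also imitates what reduceByKey does with the (word, 1) pairs, using local collections instead of an RDD. The names ReduceDemo, add, addExplicit, and sumByKey are hypothetical, and sumByKey is only a local stand-in, not Spark's implementation:

```scala
object ReduceDemo {
  // Placeholder form: each underscore stands for one parameter, in order.
  val add: (Int, Int) => Int = _ + _
  // Explicit form of the same function.
  val addExplicit: (Int, Int) => Int = (x, y) => x + y

  // Local stand-in for reduceByKey: group pairs by word, then sum the counts.
  def sumByKey(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, ps) => word -> ps.map(_._2).reduce(_ + _) }

  def main(args: Array[String]): Unit = {
    println(add(2, 3) == addExplicit(2, 3)) // true
    val pairs = Seq(("spark", 1), ("is", 1), ("fast", 1), ("spark", 1), ("is", 1))
    // Sort by word so the printed order is stable.
    println(sumByKey(pairs).toSeq.sortBy(_._1).mkString(", ")) // (fast,1), (is,2), (spark,2)
  }
}
```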
scala>
scala> wordCounts.collect().foreach(println)
(scala,2)
(together,1)
(powerful,1)
(is,2)
(fast,1)
(spark,2)
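The whole pipeline (split into words, pair each word with 1, sum per word) can be sketched with ordinary Scala collections, which is handy for checking the logic before running it on an RDD. The input lines below are an assumption chosen to give the same counts as the output above; the original input is not shown in these notes. LocalWordCount and count are hypothetical names:

```scala
object LocalWordCount {
  // Assumed input; not taken from the original session.
  val lines: Seq[String] = Seq(
    "spark is fast",
    "scala is powerful",
    "spark scala together"
  )

  // Same three stages as the Spark job, on local collections.
  def count(input: Seq[String]): Map[String, Int] =
    input
      .flatMap(_.split("\\s+"))   // lines -> words
      .map(word => (word, 1))     // words -> (word, 1) pairs
      .groupBy(_._1)              // group pairs by word
      .map { case (word, ps) => word -> ps.map(_._2).reduce(_ + _) } // sum counts

  def main(args: Array[String]): Unit =
    // Sorted by word here; Spark's collect() makes no such ordering promise.
    count(lines).toSeq.sortBy(_._1).foreach(println)
}
```

The counts match the Spark output, but the order differs: the RDD result comes back in partition order, which is why the listing above is not alphabetical.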