TP Spark SQL Avec Scala - Fr.en

The document outlines a Spark SQL tutorial using Scala for analyzing employee data stored in a JSON file. It provides step-by-step instructions for setting up a Hadoop environment, loading data, querying it to find IT employees, calculating average salaries by department, and identifying the highest-paid employee. Finally, it explains how to save the results back to HDFS in JSON format.

Uploaded by

asma.asma230683

Translated from French to English - www.onlinedoctranslator.com

Spark SQL TP with Scala: Employee Analysis

Objective:

Learn how to use Spark SQL to load, query, and manipulate tabular data in Scala with Spark-shell.

Assume the file “employees.json” contains the following data, with one JSON object per line:

{"id": 1, "name": "Alice", "age": 30, "department": "HR", "salary": 3500}
{"id": 2, "name": "Bob", "age": 25, "department": "IT", "salary": 4000}
{"id": 3, "name": "Charlie", "age": 28, "department": "Finance", "salary": 3700}
{"id": 4, "name": "David", "age": 35, "department": "IT", "salary": 5000}
{"id": 5, "name": "Eva", "age": 45, "department": "HR", "salary": 4200}

• Create the file employees.json on Windows.

• Start the containers with the following command:

docker start hadoop-master hadoop-worker1 hadoop-worker2

• Copy the employees.json file into the hadoop-master container:

docker cp "C:\Users\LENOVO\Downloads\employees.json" hadoop-master:/tmp/employees.json

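• Then enter the master container:

docker exec -it hadoop-master bash

• Start the YARN and HDFS daemons:

./start-hadoop.sh

• Copy the employees.json file to HDFS. The destination is given explicitly so that it matches the path read in step 3:

hdfs dfs -mkdir -p /user/root
hdfs dfs -put /tmp/employees.json /user/root/employees.json

Scala steps and commands in Spark-shell: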

1. Import the necessary libraries

import org.apache.spark.sql.SparkSession

2. Create a Spark session

val spark = SparkSession.builder.appName("Employee SQL").getOrCreate()
import spark.implicits._

3. Load JSON file from HDFS

val df = spark.read.json("/user/root/employees.json")

4. Display data

df.show()

5. Show the schema

df.printSchema()
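
The schema printed in step 5 is inferred by Spark from the JSON records. As a minimal sketch, assuming you would rather declare the column types yourself, the same data can be read with an explicit schema. The names employeeSchema, lines and typedDf, the local master setting and the in-memory records are assumptions made for illustration and are not part of the original TP.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Assumption: local master for a standalone run; in spark-shell the existing session is reused.
val spark = SparkSession.builder.appName("Employee SQL").master("local[*]").getOrCreate()
import spark.implicits._

// Schema declared by hand instead of being inferred from the data.
val employeeSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType),
  StructField("age", LongType),
  StructField("department", StringType),
  StructField("salary", LongType)
))

// Two of the sample records, held in memory so that no HDFS file is needed.
val lines = Seq(
  """{"id": 1, "name": "Alice", "age": 30, "department": "HR", "salary": 3500}""",
  """{"id": 2, "name": "Bob", "age": 25, "department": "IT", "salary": 4000}"""
).toDS()

// Read the JSON strings using the declared schema, then print it.
val typedDf = spark.read.schema(employeeSchema).json(lines)
typedDf.printSchema()
```

With a file on HDFS, the same schema could be passed before the json(...) call used in step 3.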

6. Create a temporary view

df.createOrReplaceTempView("employees")

7. List all IT employees

spark.sql("SELECT * FROM employees WHERE department = 'IT'").show()
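
The same selection can also be written without SQL text, using the DataFrame API. The sketch below assumes a local session and an in-memory copy of the five sample rows (sampleDf and itEmployees are illustrative names), so that it can be tried without HDFS.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local master for a standalone run; in spark-shell the existing session is reused.
val spark = SparkSession.builder.appName("Employee SQL").master("local[*]").getOrCreate()
import spark.implicits._

// In-memory copy of the sample data from employees.json.
val sampleDf = Seq(
  (1L, "Alice", 30L, "HR", 3500L),
  (2L, "Bob", 25L, "IT", 4000L),
  (3L, "Charlie", 28L, "Finance", 3700L),
  (4L, "David", 35L, "IT", 5000L),
  (5L, "Eva", 45L, "HR", 4200L)
).toDF("id", "name", "age", "department", "salary")

// Same predicate as WHERE department = 'IT', expressed with the Column API.
val itEmployees = sampleDf.filter($"department" === "IT")
itEmployees.show()
```

On the sample data this keeps the rows for Bob and David.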

8. Calculate the average salary by department

spark.sql("SELECT department, AVG(salary) AS avg_salary FROM employees GROUP BY department").show()
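
As a sketch of the same aggregation with the DataFrame API, assuming a local session and an in-memory copy of the sample rows (sampleDf and avgByDept are illustrative names, not part of the original TP):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

// Assumption: local master for a standalone run; in spark-shell the existing session is reused.
val spark = SparkSession.builder.appName("Employee SQL").master("local[*]").getOrCreate()
import spark.implicits._

// In-memory copy of the sample data from employees.json.
val sampleDf = Seq(
  (1L, "Alice", 30L, "HR", 3500L),
  (2L, "Bob", 25L, "IT", 4000L),
  (3L, "Charlie", 28L, "Finance", 3700L),
  (4L, "David", 35L, "IT", 5000L),
  (5L, "Eva", 45L, "HR", 4200L)
).toDF("id", "name", "age", "department", "salary")

// Group by department and average the salary column, as in the SQL query.
val avgByDept = sampleDf.groupBy("department").agg(avg("salary").as("avg_salary"))
avgByDept.show()
```

For the five sample rows the averages work out to 3850.0 for HR ((3500 + 4200) / 2), 4500.0 for IT ((4000 + 5000) / 2) and 3700.0 for Finance. The row order printed by show() is not guaranteed.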

9. Find the highest-paid employee

val maxSalary = spark.sql("SELECT MAX(salary) AS max_salary FROM employees").first().getLong(0)

This query returns only the maximum salary, not the name of the employee who earns it.
To get the name together with the maximum salary:

spark.sql(s"SELECT name, salary FROM employees WHERE salary = $maxSalary").show()
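
The two statements can also be combined into a single query with a scalar subquery, which avoids bringing the maximum back to the driver first. This is a sketch rather than the method of the TP: the view name employees_sample, the in-memory data and the local session are assumptions chosen so that the example does not replace the employees view created in step 6.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local master for a standalone run; in spark-shell the existing session is reused.
val spark = SparkSession.builder.appName("Employee SQL").master("local[*]").getOrCreate()
import spark.implicits._

// In-memory copy of the sample data, registered under a separate (hypothetical) view name.
val sampleDf = Seq(
  (1L, "Alice", 30L, "HR", 3500L),
  (2L, "Bob", 25L, "IT", 4000L),
  (3L, "Charlie", 28L, "Finance", 3700L),
  (4L, "David", 35L, "IT", 5000L),
  (5L, "Eva", 45L, "HR", 4200L)
).toDF("id", "name", "age", "department", "salary")
sampleDf.createOrReplaceTempView("employees_sample")

// The subquery computes the maximum salary; the outer query returns every employee earning it.
val topPaid = spark.sql(
  "SELECT name, salary FROM employees_sample " +
  "WHERE salary = (SELECT MAX(salary) FROM employees_sample)"
)
topPaid.show()
```

Like the two-step version, this returns all employees tied at the maximum salary, which an ORDER BY salary DESC LIMIT 1 query would not.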

10. Save results to HDFS as JSON

df.write.mode("overwrite").json("hdfs:///user/root/employees_json")
