
Spark SQL TP with Scala: Employee Analysis

Objective:

Learn how to use Spark SQL to load, query, and manipulate tabular data in Scala with Spark-shell.

Suppose the file “employees.json” contains the following data:

{"id": 1, "name": "Alice", "age": 30, "department": "HR", "salary": 3500}


{"id": 2, "name": "Bob", "age": 25, "department": "IT", "salary": 4000}
{"id": 3, "name": "Charlie", "age": 28, "department": "Finance", "salary": 3700}
{"id": 4, "name": "David", "age": 35, "department": "IT", "salary": 5000}
{"id": 5, "name": "Eva", "age": 45, "department": "HR", "salary": 4200}

 Create the file employees.json on Windows.

 Launch the Hadoop containers with the following command:

docker start hadoop-master hadoop-worker1 hadoop-worker2

 Copy the employees.json file into the hadoop-master container:

docker cp "C:\Users\LENOVO\Downloads\employees.json" hadoop-master:/tmp/employees.json

 Then enter the master container:

docker exec -it hadoop-master bash

 Then start the YARN and HDFS daemons:

./start-hadoop.sh

 Copy the employees.json file to HDFS:

hdfs dfs -put /tmp/employees.json

With no destination argument, -put places the file in the user's HDFS home directory (/user/root here), which matches the path used in step 3 below.
Scala steps and commands in Spark-shell:

1. Import the necessary libraries


import org.apache.spark.sql.SparkSession

2. Create a Spark session


val spark = SparkSession.builder.appName("Employee SQL").getOrCreate()
import spark.implicits._
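
Note: in spark-shell, a session named spark and the spark.implicits._ import are already provided, so builder.getOrCreate() simply returns the existing session; the step is shown here for completeness.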

3. Load JSON file from HDFS


val df = spark.read.json("/user/root/employees.json")
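
spark.read.json expects JSON Lines input (one complete JSON object per line, as in the file above) and infers the schema automatically from the data.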

4. Display the data


df.show()
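
Note that the columns appear in alphabetical order (age, department, id, name, salary): Spark sorts the field names when it infers a schema from JSON.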

5. Show the schema


df.printSchema()
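
The output should look like this (JSON integers are inferred as long):

root
 |-- age: long (nullable = true)
 |-- department: string (nullable = true)
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- salary: long (nullable = true)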

6. Create a temporary view


df.createOrReplaceTempView("employees")
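
The view is tied to the current SparkSession and disappears when the session ends; it is what lets the SQL queries below refer to the DataFrame as a table named employees.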

7. List all IT employees


spark.sql("SELECT * FROM employees WHERE department = 'IT'").show()

8. Calculate the average salary by department


spark.sql("SELECT department, AVG(salary) as avg_salary FROM employees GROUP BY
department").show()

9. Find the highest-paid employee


val maxSalary = spark.sql("SELECT MAX(salary) AS max_salary FROM employees").first().getLong(0)
This query returns the maximum salary, but not necessarily the name of the employee who earns it. To get the name together with the maximum salary:
spark.sql(s"SELECT name, salary FROM employees WHERE salary = $maxSalary").show()

10. Save results to HDFS as JSON


df.write.mode("overwrite").json("hdfs:///user/root/employees_json")
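
Note that this command writes the original DataFrame df. To save a query result instead, for example the average salaries from step 8, write that DataFrame; a minimal sketch, where the output path avg_salary_json is an arbitrary choice:

// Save the step 8 aggregation; the target directory name is hypothetical
val avgByDept = spark.sql("SELECT department, AVG(salary) AS avg_salary FROM employees GROUP BY department")
avgByDept.write.mode("overwrite").json("hdfs:///user/root/avg_salary_json")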
