The document outlines a Spark SQL tutorial using Scala for analyzing employee data stored in a JSON file. It provides step-by-step instructions for setting up a Hadoop environment, loading data, querying it to find IT employees, calculating average salaries by department, and identifying the highest-paid employee. Finally, it explains how to save the results back to HDFS in JSON format.
TP: Spark SQL with Scala
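Open a shell in the Hadoop master container:
docker exec -it hadoop-master bash
Then start the YARN and HDFS daemons:
./start-hadoop.sh
Copy the employees.json file to HDFS. With no destination given, hdfs dfs -put copies into the current user's HDFS home directory, here presumably /user/root, which is the path read in step 3:
hdfs dfs -put /tmp/employees.json
Scala steps and commands in spark-shell: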
1. Import the necessary libraries
import org.apache.spark.sql.SparkSession
2. Create a Spark session
val spark = SparkSession.builder.appName("Employee SQL").getOrCreate()
import spark.implicits._
3. Load JSON file from HDFS
val df = spark.read.json("/user/root/employees.json")
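By default, spark.read.json expects one JSON object per line and infers column types by scanning the data; a file that holds a single multi-line JSON document needs .option("multiLine", true). As an alternative to inference, the schema can be declared up front. The sketch below assumes the file has exactly the three columns used by the queries in this tutorial (name, department, salary) and that salaries are whole numbers; the names employeeSchema and typedDf are illustrative and not part of the original lab.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Reuses the existing session when run inside spark-shell.
val spark = SparkSession.builder.appName("Employee SQL").getOrCreate()

// Assumed columns, taken from the queries in this tutorial; adjust to the real file.
val employeeSchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("department", StringType, nullable = true),
  StructField("salary", LongType, nullable = true)
))

// Same path as in step 3; with an explicit schema Spark skips the inference pass.
val typedDf = spark.read.schema(employeeSchema).json("/user/root/employees.json")
typedDf.printSchema()
```

Declaring salary as LongType also guarantees that a later getLong call on that column matches the column type.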
4. Display the data
df.show()
5. Show the schema
df.printSchema()
6. Create a temporary view
df.createOrReplaceTempView("employees")
7. List all IT employees
spark.sql("SELECT * FROM employees WHERE department = 'IT'").show()
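The same filter can also be written with the DataFrame API instead of SQL. This is a minimal sketch, not part of the original lab: it uses a small in-memory DataFrame named sample, with hypothetical rows, in place of employees.json so that it runs on its own in spark-shell.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("Employee SQL").getOrCreate()
import spark.implicits._

// Hypothetical rows standing in for the contents of employees.json.
val sample = Seq(
  ("Alice", "IT", 5000L),
  ("Bob", "HR", 4000L),
  ("Carla", "IT", 6000L)
).toDF("name", "department", "salary")

// Equivalent of: SELECT * FROM employees WHERE department = 'IT'
val itEmployees = sample.filter(col("department") === "IT")
itEmployees.show()
```

On the real data, replacing sample with df from step 3 should give the same rows as the SQL query.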
8. Calculate the average salary by department
spark.sql("SELECT department, AVG(salary) as avg_salary FROM employees GROUP BY department").show()
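For comparison, the aggregation can be expressed with groupBy and agg from the DataFrame API. This is a sketch under the same assumptions as the query above (columns department and salary); the in-memory rows and the names sample and avgByDept are hypothetical, chosen only so the snippet runs without the HDFS file.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder.appName("Employee SQL").getOrCreate()
import spark.implicits._

// Hypothetical rows standing in for the contents of employees.json.
val sample = Seq(
  ("Alice", "IT", 5000L),
  ("Bob", "HR", 4000L),
  ("Carla", "IT", 6000L)
).toDF("name", "department", "salary")

// Equivalent of: SELECT department, AVG(salary) AS avg_salary ... GROUP BY department
val avgByDept = sample
  .groupBy("department")
  .agg(avg("salary").as("avg_salary"))
avgByDept.show()
```

Note that avg returns a double even when the salary column holds longs.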
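9. Find the highest-paid employee
val maxSalary = spark.sql("SELECT MAX(salary) AS max_salary FROM employees").first().getLong(0)
This query returns only the maximum salary, not the name of the employee who earns it. Note that getLong(0) assumes the salary column was inferred as a long, which is the case when the JSON salaries are whole numbers. To get the name together with the maximum salary:
spark.sql(s"SELECT name, salary FROM employees WHERE salary = $maxSalary").show()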
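The two-step lookup above can also be collapsed into one statement with a scalar subquery, which avoids pulling the maximum into Scala and therefore does not depend on the column type. The sketch below is an alternative, not the lab's own method; it registers hypothetical in-memory rows under a separate view name, employees_sample, so that it does not replace the employees view created in step 6.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("Employee SQL").getOrCreate()
import spark.implicits._

// Hypothetical rows standing in for the contents of employees.json.
val sample = Seq(
  ("Alice", "IT", 5000L),
  ("Bob", "HR", 4000L),
  ("Carla", "IT", 6000L)
).toDF("name", "department", "salary")
sample.createOrReplaceTempView("employees_sample")

// The inner query computes the maximum; the outer query keeps every row that matches it.
val topEarners = spark.sql(
  """SELECT name, salary
    |FROM employees_sample
    |WHERE salary = (SELECT MAX(salary) FROM employees_sample)""".stripMargin)
topEarners.show()
```

If several employees share the maximum salary, this form returns all of them, as does the two-step version.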