Week12 Assignment Solution
Week12: Apache Spark - Structured API
Part-2
Spark Structured APIs - Assignment Solutions
Assignment 1:
Code:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.log4j.Level
import org.apache.log4j.Logger
import org.apache.spark.sql.functions._

object Assignment1 extends App {

  Logger.getLogger("org").setLevel(Level.ERROR)

  val sparkConf = new SparkConf().setAppName("Assignment1").setMaster("local[*]")
  val spark = SparkSession.builder().config(sparkConf).getOrCreate()

  // Load the department data into a Dataframe using the dataframe reader API
  // (the original listing omitted the read calls; the file names are assumed)
  val deptDf = spark.read.option("header", "true").option("inferSchema", "true")
    .csv("C:/TrendyTech/SparkExamples/dept.csv")
  // deptDf.show()
  // deptDf.printSchema()

  // Load the employee data into a Dataframe using the dataframe reader API
  val employeeDf = spark.read.option("header", "true").option("inferSchema", "true")
    .csv("C:/TrendyTech/SparkExamples/employee.csv")
  // employeeDf.show()
  // employeeDf.printSchema()

  // Join the two dataframes using a left outer join, with the department
  // dataframe on the left side
  val joinCondition = deptDf.col("deptid") === employeeDf.col("deptid")
  val joinedDfNew = deptDf.join(employeeDf, joinCondition, "left_outer")

  // Use the first() function so as to get the other columns also along with
  // the aggregated columns
  joinedDfNew.groupBy(deptDf.col("deptid"))
    .agg(count("empname").as("empcount"), first("deptName").as("deptName"))
    .dropDuplicates("deptName")
    .show()

  spark.stop()
}
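The same left-outer-join-and-count logic can be sanity-checked with plain Scala collections, no Spark needed. The department and employee values below are invented for illustration; the point is that a left outer join keeps every department, so a department with no employees still appears, with a count of 0:

```scala
// Departments on the left side of the join; Sales has no employees
val departments = Seq((1, "HR"), (2, "Engineering"), (3, "Sales"))
val employees = Seq((1, "asha"), (1, "ravi"), (2, "meena"))

// Group employees by department id, then walk every department (left side),
// defaulting to an empty list when no employees match
val empByDept = employees.groupBy(_._1)
val counts = departments.map { case (deptid, deptName) =>
  (deptid, deptName, empByDept.getOrElse(deptid, Nil).size)
}

counts.foreach(println) // (1,HR,2), (2,Engineering,1), (3,Sales,0)
```

An inner join would have dropped Sales entirely, which is why the solution above puts the department dataframe on the left of a left outer join.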
Output:
Assignment 2
Find the top movies as shown in Spark practical 18, using a broadcast join. Use Dataframes or Datasets to solve it this time.
Code:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.log4j.Level
import org.apache.log4j.Logger
import org.apache.spark.sql.functions._

object Assignment2 extends App {

  Logger.getLogger("org").setLevel(Level.ERROR)

  val sparkConf = new SparkConf().setAppName("Assignment2").setMaster("local[*]")
  val spark = SparkSession.builder().config(sparkConf).getOrCreate()

  // Transform to a Dataframe:
  import spark.implicits._

  // Load the ratings data (the original listing omitted this step; the
  // "::"-delimited layout userid::movieid::rating::timestamp is assumed)
  val ratingsRDD = spark.sparkContext.textFile("C:/TrendyTech/SparkExamples/ratings.dat")
  val ratingsDf = ratingsRDD.map(_.split("::"))
    .map(fields => (fields(0), fields(1), fields(2).toDouble))
    .toDF("userid", "movieid", "rating")
  // ratingsDf.show()
  // ratingsDf.printSchema()

  val moviesRDD = spark.sparkContext.textFile("C:/TrendyTech/SparkExamples/movies.dat")
  // movies.dat is assumed to be "::"-delimited: movieid::moviename::genre
  val moviestransformedRDD = moviesRDD.map(_.split("::")).map(fields => (fields(0), fields(1)))
  val moviesNewDf = moviestransformedRDD.toDF("movieid", "moviename")
  // moviesNewDf.show()
  // moviesNewDf.printSchema()

  // Keep only the popular movies: aggregate the view count and the average
  // rating per movie, then filter on a minimum number of views (threshold assumed)
  val popularMoviesDf = ratingsDf.groupBy("movieid")
    .agg(count("rating").as("movieViewCount"), avg("rating").as("avgMovieRating"))
    .filter(col("movieViewCount") > 1000)
  // popularMoviesDf.show()

  // Now we want to associate the movie names also, so we use a broadcast join
  // where the movies data is the smaller dataset
  val joinCondition = popularMoviesDf.col("movieid") === moviesNewDf.col("movieid")
  val joinType = "inner"
  val finalPopularMoviesDf = popularMoviesDf
    .join(broadcast(moviesNewDf), joinCondition, joinType)
    .drop(popularMoviesDf.col("movieid"))
    .sort(desc("avgMovieRating"))

  finalPopularMoviesDf.drop("movieViewCount", "movieid", "avgMovieRating").show(false)

  spark.stop()
}
Output:
Assignment 3
File A is a text file of size 1.2 GB in HDFS at location /loc/x. It contains match-by-match statistics of runs scored by every batsman in the history of cricket.
File B is a text file of size 1.2 MB present in the local directory /loc/y. It contains the list of batsmen playing in the cricket world cup 2019.
File A:
1 Rohit_Sharma India 200 100.2
1 Virat_Kohli India 100 98.02
1 Steven_Smith Aus 77 79.23
35 Clive_Lloyd WI 29 37.00
243 Rohit_Sharma India 23 150.00
243 Faf_du_Plesis SA 17 35.06
File B:
Rohit_Sharma India
Steven_Smith Aus
Virat_Kohli India
Find the batsman participating in the 2019 world cup who has the best career average of runs scored. Solve this using Dataframes or Datasets.
Code:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.log4j.Level
import org.apache.log4j.Logger
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.Row

object Assignment3 extends App {

  Logger.getLogger("org").setLevel(Level.ERROR)

  val sparkConf = new SparkConf().setAppName("Assignment3").setMaster("local[*]")
  val spark = SparkSession.builder().config(sparkConf).getOrCreate()

  val batsmenHistoryRDD = spark.sparkContext
    .textFile("C:/TrendyTech/SparkExamples/FileA_BatsmenDetails_History.txt")

  // Dataframe creation (the original listing omitted this step; File A is
  // whitespace-separated: MatchId Batsman Country RunsScored StrikeRate)
  import spark.implicits._
  val batsmenHistoryDf = batsmenHistoryRDD.map(_.split("\\s+"))
    .map(fields => (fields(1), fields(3).toDouble))
    .toDF("Batsman", "RunsScored")
  // batsmenHistoryDf.show()
  // batsmenHistoryDf.printSchema()

  // Calculate the average runs scored by each batsman in history,
  // with the highest average at the top
  val batsmenBestRunsAvgHistoryDf = batsmenHistoryDf
    .groupBy("Batsman")
    .agg(avg("RunsScored").as("AverageRunsScored"))
    .select("Batsman", "AverageRunsScored")
  // batsmenBestRunsAvgHistoryDf.sort(col("AverageRunsScored").desc).show()

  // Alternative approach instead of using a case class, though a case class
  // could also be used: programmatically create an explicit schema for the
  // world cup 2019 file (the file name is assumed; only the batsman name is kept)
  val worldCupSchema = StructType(List(StructField("batsman", StringType)))
  val batsmenWorldCupRDD = spark.sparkContext
    .textFile("C:/TrendyTech/SparkExamples/FileB_BatsmenWorldCup2019.txt")
    .map(line => Row(line.split("\\s+")(0)))
  val batsmenWorldCupDf = spark.createDataFrame(batsmenWorldCupRDD, worldCupSchema)
  batsmenWorldCupDf.show()
  batsmenWorldCupDf.printSchema()

  // Broadcast join: File B (1.2 MB) is far smaller than File A (1.2 GB)
  val joinCondition = batsmenBestRunsAvgHistoryDf.col("Batsman") === batsmenWorldCupDf.col("batsman")
  val joinType = "inner"
  val finalBestBatsmenPlayingWorldCupDf = batsmenBestRunsAvgHistoryDf
    .join(broadcast(batsmenWorldCupDf), joinCondition, joinType)
    .drop(batsmenBestRunsAvgHistoryDf.col("Batsman"))

  finalBestBatsmenPlayingWorldCupDf.orderBy(desc("AverageRunsScored")).show()

  spark.stop()
}
Output:
+-----------------+------------+
|AverageRunsScored|     batsman|
+-----------------+------------+
|            111.5|Rohit_Sharma|
|            100.0| Virat_Kohli|
|             77.0|Steven_Smith|
+-----------------+------------+
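The averages in this output can be verified by hand from the sample rows of File A and File B shown above, using plain Scala collections in place of the Spark pipeline (filter by the world-cup list, group by batsman, average the runs, sort descending):

```scala
// (batsman, runs) pairs taken from the File A sample rows above
val fileA = Seq(
  ("Rohit_Sharma", 200.0), ("Virat_Kohli", 100.0), ("Steven_Smith", 77.0),
  ("Clive_Lloyd", 29.0), ("Rohit_Sharma", 23.0), ("Faf_du_Plesis", 17.0))
// The File B world-cup 2019 list
val worldCup2019 = Set("Rohit_Sharma", "Steven_Smith", "Virat_Kohli")

// Average runs per batsman, restricted to the world-cup list, highest first
val averages = fileA
  .filter { case (name, _) => worldCup2019.contains(name) }
  .groupBy(_._1)
  .map { case (name, rows) => (name, rows.map(_._2).sum / rows.size) }
  .toSeq
  .sortBy(-_._2)

averages.foreach(println) // (Rohit_Sharma,111.5), (Virat_Kohli,100.0), (Steven_Smith,77.0)
```

Rohit_Sharma appears twice in File A (200 and 23 runs), giving the 111.5 average in the table; Clive_Lloyd and Faf_du_Plesis are filtered out because they are not in File B.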
**********************************************************************