
Zeppelin-Spark Assignment

The client who has given you this data would like a Zeppelin notebook returned with the
following breakdown:

1. Load data into a Spark dataframe

val worldsales = (spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/tmp/worldsales.csv"))

2. Print the dataframe schema

worldsales.printSchema()
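Assuming the CSV carries the columns referenced later in this assignment, the printed schema would look roughly like this (an illustration, not verified output; the real file may have more columns):

root
 |-- Region: string (nullable = true)
 |-- Units_Sold: integer (nullable = true)
 |-- Unit_Cost: double (nullable = true)
 |-- Total_Revenue: double (nullable = true)
 |-- Total_Cost: double (nullable = true)
 |-- Total_Profit: double (nullable = true)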
3. Filter the dataframe to show units sold greater than 8000 and unit cost greater than
500 ("&&" operator can be used for multiple "AND" conditions)

val filtered = (worldsales
  .select("Units_Sold", "Unit_Cost")
  .filter($"Units_Sold" > 8000 && $"Unit_Cost" > 500))
filtered.show()

4. Aggregate the dataframe via group by “Region” and count

// count and desc come from spark's SQL functions; the import is a no-op
// if the notebook already has them in scope
import org.apache.spark.sql.functions.{count, desc}

val grouping = (worldsales
  .select("Region")
  .groupBy("Region")
  .agg(count("Region").alias("RegionCount"))
  .orderBy(desc("RegionCount")))
5. Create a separate dataframe with the above group by results (the grouping dataframe created in step 4 holds these results)

6. Save this new subset dataframe as a csv file into HDFS – make sure it is saved as a single file in HDFS (coalesce(1) collapses the output to one partition, so Spark writes a single part file)

grouping.coalesce(1).write.csv("/tmp/grouped")
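Note that if the notebook is re-run, this write fails because /tmp/grouped already exists (Spark's default save mode is error-if-exists). Adding a save mode and a header row avoids that; an optional refinement, not required by the brief:

grouping.coalesce(1)
  .write
  .mode("overwrite")          // replace any previous output on re-runs
  .option("header", "true")   // keep the column names in the CSV
  .csv("/tmp/grouped")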

7. Create two views using the “createOrReplaceTempView” command

8. View on “SalesView” from the first dataframe

worldsales.createOrReplaceTempView("SalesView")

9. View on “RegionView” from the second dataframe

grouping.createOrReplaceTempView("RegionView")

10. Using SQL, select all from the “RegionView” view and show it in a line graph.

select * from RegionView
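In Zeppelin, each of these queries goes in its own paragraph prefixed with the %sql interpreter binding; the line, bar, or pie chart is then picked from the display toolbar under the result. For example:

%sql
select * from RegionView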


11. Using SQL, from the “SalesView” view, select the region and the sum of units sold, grouped by region

select Region, Sum(Units_Sold) from SalesView group by Region
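The same aggregation can be cross-checked with the DataFrame API directly, if a non-SQL sanity check is wanted; a minimal sketch:

import org.apache.spark.sql.functions.sum
worldsales.groupBy("Region").agg(sum("Units_Sold").alias("Units_Sold")).show()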

12. Using SQL, select from the “SalesView” view the region and the sum of Total_Profit, grouped by region, and display it in a bar chart

select Region, Sum(Total_Profit) from SalesView group by Region

13. Using SQL, select from the “SalesView” view the total profit as Profit, the total revenue as Revenue, and the total cost as Cost, grouped by region

select Region, Sum(Total_Profit) as Profit, Sum(Total_Revenue) as Revenue, Sum(Total_Cost) as Cost from SalesView group by Region
14. The client is in the process of opening a new store and is looking for the best location to do so – they need to see the average profit in each region as a percentage (pie chart) compared to other regions.

select Region, Avg(Total_Profit) as Avg_Profit from SalesView group by Region
