This document provides instructions for analyzing Walmart stock data with Spark. It covers tasks such as loading a CSV file, examining the schema and columns, computing descriptive statistics and correlations between variables, and inspecting the execution plan and shuffle sizes for selected tasks in the Spark web UI.
Load the Walmart stock CSV file and have Spark infer the data types.
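A minimal sketch, assuming a local file named `walmart_stock.csv` with the usual Date/Open/High/Low/Close/Volume columns (the actual path and column names may differ):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("walmart").getOrCreate()

# header=True reads the first row as column names;
# inferSchema=True lets Spark guess each column's data type.
df = spark.read.csv("walmart_stock.csv", header=True, inferSchema=True)
```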
What are the column names?
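One way to list them, assuming the DataFrame from the loading step is bound to `df`:

```python
# The column names are available as a plain Python list.
df.columns
```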
What does the Schema look like?
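A one-liner sketch:

```python
# printSchema() shows each column with its inferred type and nullability.
df.printSchema()
```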
Print out the first 5 rows.
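For example:

```python
# head(5) returns the first five rows as Row objects.
for row in df.head(5):
    print(row)
```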
Use describe() to learn about the DataFrame.
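A short sketch:

```python
# describe() computes count, mean, stddev, min, and max
# for the numeric columns of the DataFrame.
df.describe().show()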
Format the numbers to just show up to two decimal places.
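One way to do this with `format_number`, assuming the Open/High/Low/Close/Volume column names from the loading step:

```python
from pyspark.sql.functions import format_number

desc = df.describe()

# describe() returns string columns, so cast before formatting.
desc.select(
    desc["summary"],
    format_number(desc["Open"].cast("float"), 2).alias("Open"),
    format_number(desc["High"].cast("float"), 2).alias("High"),
    format_number(desc["Low"].cast("float"), 2).alias("Low"),
    format_number(desc["Close"].cast("float"), 2).alias("Close"),
    desc["Volume"].cast("int").alias("Volume"),
).show()
```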
Create a new DataFrame with a column called HV Ratio that is the ratio of the High price to the Volume of stock traded for a day.
What day had the peak High in price?
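A sketch covering both questions:

```python
# HV Ratio: High divided by Volume for each day.
df2 = df.withColumn("HV Ratio", df["High"] / df["Volume"])
df2.select("HV Ratio").show()

# Peak High: sort by High descending and read the Date of the top row.
df.orderBy(df["High"].desc()).head(1)[0]["Date"]
```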
What is the mean of the Close column?
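For example:

```python
from pyspark.sql.functions import mean

df.select(mean("Close")).show()
```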
What is the max and min of the Volume column?
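For example:

```python
from pyspark.sql.functions import max, min

df.select(max("Volume"), min("Volume")).show()
```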
How many days was the Close lower than 60 dollars?
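A sketch:

```python
# filter() keeps rows where Close < 60; count() returns how many there are.
df.filter(df["Close"] < 60).count()
```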
What percentage of the time was the High greater than 80 dollars?
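A sketch:

```python
# Fraction of rows where High > 80, expressed as a percentage.
df.filter(df["High"] > 80).count() / df.count() * 100
```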
What is the Pearson correlation between High and Volume?
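For example:

```python
from pyspark.sql.functions import corr

# corr() computes the Pearson correlation coefficient by default.
df.select(corr("High", "Volume")).show()
```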
What is the max High per year?
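A sketch, assuming a Date column of date type:

```python
from pyspark.sql.functions import year, max

# Derive a Year column from Date, then take the max High per year.
df.withColumn("Year", year(df["Date"])) \
  .groupBy("Year") \
  .agg(max("High").alias("Max High")) \
  .orderBy("Year") \
  .show()
```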
What is the average Close for each Calendar Month?
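A sketch under the same assumptions:

```python
from pyspark.sql.functions import month, avg

# Derive a Month column from Date, then average Close per month.
df.withColumn("Month", month(df["Date"])) \
  .groupBy("Month") \
  .agg(avg("Close").alias("Avg Close")) \
  .orderBy("Month") \
  .show()
```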
Use the Spark web UI to view the execution plan for task no. 15, and report how much data is shuffled for that task.
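The plan can also be printed from code with `explain()`; the web UI (typically at http://localhost:4040, on the SQL and Stages tabs) shows the same plan along with shuffle read/write sizes per stage. A sketch, reusing the aggregation from task 15:

```python
from pyspark.sql.functions import year, max

# explain() prints the physical plan, including the Exchange
# (shuffle) introduced by the groupBy aggregation.
df.withColumn("Year", year(df["Date"])) \
  .groupBy("Year") \
  .agg(max("High")) \
  .explain()
```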
Total bytes shuffled: 960 B
There were 5 jobs with a total of 8 stages, as shown in the picture. Four of the jobs had 960 B of shuffle read divided between them, and the first job had 960 B of shuffle write.