Assignment I (Dataframe) : Analysis of Stocks Data

The document analyzes stock data from a Walmart CSV file using Spark. It covers loading the file, viewing the column names and schema, printing the first 5 columns, describing statistics, and calculating metrics such as the HV Ratio (High price to Volume), the day of the peak High, the mean Close, the max and min Volume, the Pearson correlation between High and Volume, the max High per year, and the average Close per calendar month. It also reports the amount of shuffled data for task 15 in the Spark web UI.

Uploaded by

HPot PotTech

Assignment I (DataFrame)

Analysis of Stocks Data


Load the Walmart Stock CSV File, have Spark infer the data types.

What are the column names?


What does the Schema look like?

To print the schema, the RDD was first converted to a DataFrame, since RDDs do not
carry a schema.

Print out the first 5 columns.


Use describe() to learn about the DataFrame.

Since describe() does not exist for RDDs, we compute all the statistics manually. We also use
tabulate to print the table in a readable format. Note: tabulate needs to be installed using pip.

Format the numbers to just show up to two decimal places.
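
The manual computation is not reproduced in this copy, so here is one way it could look in plain Python for a single numeric column, with the values formatted to two decimal places. The function names `describe_column` and `format_stats` are hypothetical, the prices are made up, and plain string formatting stands in for the tabulate call; the sketch assumes at least two values per column.

```python
import math


def describe_column(values):
    """Return count, mean, sample stddev, min and max for a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    # Spark's describe() reports the sample standard deviation (divides by n - 1).
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return {
        "count": n,
        "mean": mean,
        "stddev": math.sqrt(variance),
        "min": min(values),
        "max": max(values),
    }


def format_stats(stats):
    """Render every statistic except the count with two decimal places."""
    return {
        name: str(value) if name == "count" else f"{value:.2f}"
        for name, value in stats.items()
    }


close_prices = [60.33, 59.71, 59.42, 59.0, 59.18]  # made-up values
for name, value in format_stats(describe_column(close_prices)).items():
    print(f"{name:<8}{value:>10}")
```

With an RDD, the list would come from something like collecting one column, or the sums could be accumulated with RDD aggregations instead.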

Create a new DataFrame with a column called HV Ratio that is the ratio of
the High Price versus volume of stock traded for a day.

What day had the Peak High in Price?

What is the mean of the Close column?


What is the max and min of the Volume column?

How many days was the Close lower than 60 dollars?


What percentage of the time was the High greater than 80 dollars?

What is the Pearson correlation between High and Volume?


What is the max High per year?

What is the average Close for each Calendar Month?

Use the Spark web UI to view the execution plan of task no. 15. Report how
much data gets shuffled for this task.

Shuffled data = 371.0 B
