Spark Walmart Data Analysis Project
Let's get some quick practice with your new Spark DataFrame skills. You will be asked some basic
questions about stock market data, in this case Walmart stock from the years 2012-2017. This
exercise just asks a series of questions, unlike the future machine learning exercises, which will be a little
looser and take the form of "Consulting Projects", but more on that later!
For now, just answer the questions and complete the tasks below.
Use the walmart_stock.csv file to answer and complete the tasks below!
In [2]:
import findspark
findspark.init('/home/jubinsoni/spark-2.1.0-bin-hadoop2.7')

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('walmart').getOrCreate()
Load the Walmart Stock CSV file and have Spark infer the data types.
In [1]:
df = spark.read.csv('walmart_stock.csv', inferSchema=True, header=True)
What are the column names?
In [2]:
df.columns
Out[2]:
['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']
What does the Schema look like?
In [3]:
df.printSchema()
root
|-- Date: timestamp (nullable = true)
|-- Open: double (nullable = true)
|-- High: double (nullable = true)
|-- Low: double (nullable = true)
|-- Close: double (nullable = true)
|-- Volume: integer (nullable = true)
|-- Adj Close: double (nullable = true)
In [4]:
df.describe().show()
+-------+------------------+-----------------+-----------------+-----------------+-----------------+-----------------+
|summary|              Open|             High|              Low|            Close|           Volume|        Adj Close|
+-------+------------------+-----------------+-----------------+-----------------+-----------------+-----------------+
|  count|              1258|             1258|             1258|             1258|             1258|             1258|
|   mean| 72.35785375357709|72.83938807631165| 71.9186009594594|72.38844998012726|8222093.481717011|67.23883848728146|
| stddev|  6.76809024470826|6.768186808159218|6.744075756255496|6.756859163732991|  4519780.8431556|6.722609449996857|
|    min|56.389998999999996|        57.060001|        56.299999|        56.419998|          2094900|        50.363689|
|    max|         90.800003|        90.970001|            89.25|        90.470001|         80898100|84.91421600000001|
+-------+------------------+-----------------+-----------------+-----------------+-----------------+-----------------+
Bonus Question!
There are too many decimal places for mean and stddev in the describe() DataFrame. Format the
numbers to show just two decimal places. Pay careful attention to the datatypes that
.describe() returns; we didn't cover this exact formatting, but we covered something very
similar. Check this link for a hint
(http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.cast)
If you get stuck on this, don't worry, just view the solutions.
In [25]:
'''
from pyspark.sql.types import (StructField, StringType,
                               IntegerType, StructType)

data_schema = [StructField('summary', StringType(), True),
               StructField('Open', StringType(), True),
               StructField('High', StringType(), True),
               StructField('Low', StringType(), True),
               StructField('Close', StringType(), True),
               StructField('Volume', StringType(), True),
               StructField('Adj Close', StringType(), True)
              ]

final_struc = StructType(fields=data_schema)

'''
df = spark.read.csv('walmart_stock.csv', inferSchema=True, header=True)

df.printSchema()
# The manually defined schema above (commented out) is from an older version and is not needed;
# Spark is able to infer the schema correctly now.
root
|-- Date: timestamp (nullable = true)
|-- Open: double (nullable = true)
|-- High: double (nullable = true)
|-- Low: double (nullable = true)
|-- Close: double (nullable = true)
|-- Volume: integer (nullable = true)
|-- Adj Close: double (nullable = true)
In [38]:
+-------+--------+--------+--------+--------+----------+
|summary| Open| High| Low| Close| Volume|
+-------+--------+--------+--------+--------+----------+
| count|1,258.00|1,258.00|1,258.00|1,258.00| 1,258|
| mean| 72.36| 72.84| 71.92| 72.39| 8,222,093|
| stddev| 6.77| 6.77| 6.74| 6.76| 4,519,781|
| min| 56.39| 57.06| 56.30| 56.42| 2,094,900|
| max| 90.80| 90.97| 89.25| 90.47|80,898,100|
+-------+--------+--------+--------+--------+----------+
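The code for the formatted summary cell above is not shown in this export. A minimal sketch of one way to produce it, casting the string columns that describe() returns and applying format_number (the exact rounding of the Volume column may differ slightly):
from pyspark.sql.functions import format_number

result = df.describe()
result.select(result['summary'],
              format_number(result['Open'].cast('float'), 2).alias('Open'),
              format_number(result['High'].cast('float'), 2).alias('High'),
              format_number(result['Low'].cast('float'), 2).alias('Low'),
              format_number(result['Close'].cast('float'), 2).alias('Close'),
              format_number(result['Volume'].cast('float'), 0).alias('Volume')
             ).show()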
Create a new DataFrame with a column called HV Ratio that is the ratio of the High price to the
Volume of stock traded for a day.
In [46]:
+--------------------+
| HV Ratio|
+--------------------+
|4.819714653321546E-6|
|6.290848613094555E-6|
|4.669412994783916E-6|
|7.367338463826307E-6|
|8.915604778943901E-6|
|8.644477436914568E-6|
|9.351828421515645E-6|
| 8.29141562102703E-6|
|7.712212102001476E-6|
|7.071764823529412E-6|
|1.015495466386981E-5|
|6.576354146362592...|
| 5.90145296180676E-6|
|8.547679455011844E-6|
|8.420709512685392E-6|
|1.041448341728929...|
|8.316075414862431E-6|
|9.721183814992126E-6|
|8.029436027707578E-6|
|6.307432259386365E-6|
+--------------------+
only showing top 20 rows
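The cell that builds this column isn't shown in this export; a minimal sketch that divides the High column by the Volume column:
# new DataFrame with the High/Volume ratio for each trading day
df_hv = df.withColumn('HV Ratio', df['High'] / df['Volume'])
df_hv.select('HV Ratio').show()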
What day had the Peak High in Price?
In [61]:
df.orderBy(df['High'].desc()).select(['Date']).head(1)[0]['Date']
Out[61]:
datetime.datetime(2015, 1, 13, 0, 0)
What is the mean of the Close column?
In [90]:
+-----------------+
| avg(Close)|
+-----------------+
|72.38844998012726|
+-----------------+
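The code for this cell isn't shown; a sketch using the mean aggregate function (the output column name avg(Close) matches the table above):
from pyspark.sql.functions import mean

df.select(mean('Close')).show()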
What is the max and min of the Volume column?
In [92]:
+-----------+-----------+
|max(Volume)|min(Volume)|
+-----------+-----------+
| 80898100| 2094900|
+-----------+-----------+
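The code isn't shown; a sketch using the max and min aggregate functions (note these imports shadow Python's built-in max and min inside the cell):
from pyspark.sql.functions import max, min

df.select(max('Volume'), min('Volume')).show()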
How many days was the Close lower than 60 dollars?
In [101]:
Out[101]:
81
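The code for this cell isn't shown; assuming it counts the trading days with a Close below 60 dollars, a minimal sketch:
# count the trading days where the closing price was below $60
df.filter(df['Close'] < 60).count()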
What percentage of the time was the High greater than 80 dollars?
In [105]:
Out[105]:
9.141494435612083
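The code isn't shown; a sketch of one way to compute this percentage:
# days where the High exceeded $80, as a percentage of all trading days
(df.filter(df['High'] > 80).count() * 100.0) / df.count()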
What is the Pearson correlation between High and Volume?
Hint
(http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameStatFunctions)
In [113]:
df.corr('High', 'Volume')
Out[113]:
-0.3384326061737161
In [114]:
+-------------------+
| corr(High, Volume)|
+-------------------+
|-0.3384326061737161|
+-------------------+
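The code for the cell above isn't shown; given the corr(High, Volume) column header, it likely uses the corr function from pyspark.sql.functions, as in this sketch:
from pyspark.sql.functions import corr

df.select(corr('High', 'Volume')).show()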
What is the max High per year?
In [133]:
+----+---------+
|Year|max(High)|
+----+---------+
|2015|90.970001|
|2013|81.370003|
|2014|88.089996|
|2012|77.599998|
|2016|75.190002|
+----+---------+
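The code isn't shown; one way to get the max High per year is to extract the year from the Date timestamp and group by it, as in this sketch:
from pyspark.sql.functions import year

# add a Year column extracted from the Date timestamp, then aggregate
df_year = df.withColumn('Year', year(df['Date']))
df_year.groupBy('Year').max('High').show()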
What is the average Close for each Calendar Month?
In other words, across all the years, what is the average Close price for Jan, Feb, Mar, etc.? Your
result will have a value for each of these months.
In [139]:
+-----+-----------------+
|Month| avg(Close)|
+-----+-----------------+
| 1|71.44801958415842|
| 2| 71.306804443299|
| 3|71.77794377570092|
| 4|72.97361900952382|
| 5|72.30971688679247|
| 6| 72.4953774245283|
| 7|74.43971943925233|
| 8|73.02981855454546|
| 9|72.18411785294116|
| 10|71.57854545454543|
| 11| 72.1110893069307|
| 12|72.84792478301885|
+-----+-----------------+
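The code for this cell isn't shown; a sketch that extracts the month from the Date timestamp, groups by it, and averages the Close:
from pyspark.sql.functions import month

# add a Month column, then compute the average Close per calendar month
df_month = df.withColumn('Month', month(df['Date']))
df_month.groupBy('Month').mean('Close').orderBy('Month').show()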
Thank you!