This document introduces Spark DataFrames, emphasizing their familiarity for users of pandas, R, SQL, or Excel. It highlights the transition from the older RDD syntax to a more user-friendly DataFrame syntax in Spark 2.0 and above, allowing for easier data manipulation and transformation. The course will culminate in a project analyzing historical stock data, reinforcing the skills learned throughout the section.
This document introduces Spark DataFrames, emphasizing their familiarity for users of pandas, R, SQL, or Excel. It highlights the transition from the older RDD syntax to a more user-friendly DataFrame syntax in Spark 2.0 and above, allowing for easier data manipulation and transformation. The course will culminate in a project analyzing historical stock data, reinforcing the skills learned throughout the section.
working with Python and Spark is through the DataFrame Syntax. ● If you’ve worked with pandas in Python, R, SQL or even Excel, a DataFrame will feel very familiar! Python and Spark
● Spark DataFrames hold data in a column
and row format. ● Each column represents some feature or variable. ● Each row represents an individual data point. Python and Spark
● Spark began with something known as
the “RDD” syntax which was a little ugly and tricky to learn. ● Now Spark 2.0 and higher has shifted towards a DataFrame syntax which is much cleaner and easier to work with! Python and Spark
● Spark DataFrames are able to input and
output data from a wide variety of sources. ● We can then use these DataFrames to apply various transformations on the data. Python and Spark
● At the end of the transformation calls, we
can either show or collect the results to display or for some final processing. ● In this section we’ll cover all the main features of working with DataFrames that you need to know. Python and Spark
● Once we have a solid understanding of
Spark DataFrames, we can move on to utilizing the DataFrame MLlib API for Machine Learning. Python and Spark
● After this section you will have a section
for a “DataFrame Project”. ● This Project will be an analysis of some historical stock data information using all the Spark knowledge from this section of the course. Python and Spark
● It will serve as a quick exercise review to
test all the skills learned in this section. ● Let’s get started with learning the basics of Spark DataFrames! Spark DataFrames Project Exercise Solutions