Big Data Tools 2 - Apache Spark With PySpark
Fiqri Wicaksono
Updated
Q2.2020
● The main concern with Hadoop is maintaining speed when processing large datasets, both in the waiting time between queries and the waiting time to run a program.
● Spark is not a modified version of Hadoop and is not really dependent on Hadoop, because it has its own cluster management. Hadoop is just one of the ways to deploy Spark.
● Spark can use Hadoop in two ways: for storage and for processing. Since Spark has its own cluster management and computation engine, it typically uses Hadoop (HDFS) for storage only.
● Apache Spark Core
Spark Core is the underlying general execution engine for the Spark platform on which all other functionality is built. It provides in-memory computing and the ability to reference datasets in external storage systems.
● Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD (later renamed DataFrame), which provides support for structured and semi-structured data.
● Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform
streaming analytics. It ingests data in mini-batches and performs RDD (Resilient
Distributed Datasets) transformations on those mini-batches of data.
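● A minimal word-count sketch of that mini-batch model, assuming an existing SparkContext named sc and a text source listening on localhost:9999 (both placeholders for illustration):
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, batchDuration=5)      # 5-second mini-batches
lines = ssc.socketTextStream("localhost", 9999)  # placeholder text source
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()
# ssc.start(); ssc.awaitTermination()  # uncomment to actually run the stream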
● MLlib (Machine Learning Library)
MLlib is a distributed machine learning framework on top of Spark. Thanks to the distributed memory-based Spark architecture, Spark MLlib is, according to benchmarks done by the MLlib developers against the Alternating Least Squares (ALS) implementations, up to nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).
● GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computation that can model user-defined graphs using the Pregel abstraction API. It also provides an optimized runtime for this abstraction.
● To install Spark, make sure you have Java 8 or higher installed on your computer.
● To install PySpark, make sure you already have Python installed on your computer.
● Then, you can install PySpark using pip, as shown below.
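● For example, from a terminal (assuming pip points at the same Python environment you use for Jupyter; the package name on PyPI is pyspark):
pip install pyspark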
● Open Jupyter Notebook and check whether PySpark works. In a new notebook, paste the following code:
import pyspark
from pyspark import SparkContext
sc = SparkContext()
● If an error is shown, it is likely that Java is not installed on your machine. On macOS, open the terminal and run java -version; if a Java version is reported, make sure it is 1.8 (Java 8) or higher. On Windows, go to Applications and check whether there is a Java folder. If there is a Java folder, check that Java 1.8 is installed.
● SparkContext is the internal engine that handles the connection to the cluster. If you want to run an operation, you need a SparkContext.
● Now that the SparkContext is ready, you can create a collection of data called an RDD (Resilient Distributed Dataset). Computation on an RDD is automatically parallelized across the cluster.
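● For example, you can turn a small Python list into an RDD with parallelize (the list of numbers here is just an illustration; the nums variable is reused in the map example below):
nums = sc.parallelize([1, 2, 3, 4])
print(nums.collect())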
● You can apply a transformation to the data with a lambda function. In the example below, you return the square of each number in nums. It is a map transformation.
squared = nums.map(lambda x: x*x).collect()
for num in squared:
    print('%i ' % (num))
● A more convenient way is to use the DataFrame. SparkContext is already set; you can use it to create the DataFrame. You also need to declare the SQLContext.
● SQLContext allows connecting the engine with different data sources. It is used to
initiate the functionalities of Spark SQL.
from pyspark.sql import SQLContext, Row

sqlContext = SQLContext(sc)
list_p = [('John', 19), ('Smith', 29), ('Adam', 35), ('Henry', 50)]
rdd = sc.parallelize(list_p)
ppl = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
DF_ppl = sqlContext.createDataFrame(ppl)
● If you want to access the type of each feature, you can use printSchema()
DF_ppl.printSchema()
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
● Now, let’s get our hands dirty: open your notebook and run the following programs to understand basic data operations with PySpark.
● First of all, you need to initialize the SQLContext if it is not already initiated.
● Then, you can read the CSV file with sqlContext.read.csv. Set inferSchema to True to tell Spark to guess the type of the data automatically. By default, it is set to False.
from pyspark import SparkFiles
# Assumes adult_data.csv has already been distributed to the cluster, e.g. with sc.addFile("<path or URL to adult_data.csv>")
df = sqlContext.read.csv(SparkFiles.get("adult_data.csv"), header=True, inferSchema=True)
df.printSchema()
● Select Columns
df.select('age','fnlwgt').show(5)
● Count by group
df.groupBy("education").count().sort("count", ascending=True).show()
● Describe the data
df.describe().show()
● Crosstab Computation
df.crosstab('age', 'label').sort("age_label").show()
● Drop column
df.drop('education_num').columns
● Now you are able to do some basic data manipulation with PySpark. However, it doesn't matter much if you don't practice it, so why don't you rebuild your machine learning practice case using PySpark?
● Hint: You can look at MLlib; a minimal sketch is given below.
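● A minimal sketch of what such a pipeline could look like with the DataFrame-based ML API (pyspark.ml), assuming the df loaded above, treating age and fnlwgt as the only features and the string column label as the target; adjust the column names and stages to your own practice case:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import StringIndexer, VectorAssembler

# index the string label into a numeric column
indexer = StringIndexer(inputCol="label", outputCol="label_index")
# assemble the numeric columns into a single feature vector
assembler = VectorAssembler(inputCols=["age", "fnlwgt"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label_index")
pipeline = Pipeline(stages=[indexer, assembler, lr])

# train/test split, fit, and inspect a few predictions
train, test = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)
model.transform(test).select("label", "prediction").show(5)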