Introduction to PySpark | Distributed Computing with Apache Spark
Last Updated: 29 Apr, 2022
Datasets are becoming huge. In fact, data is growing faster than processing speeds. Therefore, algorithms that involve large amounts of data and heavy computation are often run on a distributed computing system. A distributed computing system consists of nodes (networked computers) that run processes in parallel and communicate with each other when necessary.
MapReduce - The programming model used for distributed computing is known as MapReduce. The MapReduce model involves two stages, Map and Reduce (a small pure-Python sketch of both stages follows this list).
- Map - The mapper processes each line of the input data (which is in the form of a file) and produces key-value pairs.
Input data → Mapper → list([key, value])
- Reduce - The reducer processes the list of key-value pairs produced by the mappers, grouped by key, and outputs a new set of key-value pairs.
list([key, value]) → group by key → Reducer → list([key, aggregated value])
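To make the model concrete, here is a minimal, purely illustrative Python sketch of a word count written in the MapReduce style. No Spark is involved yet; the input lines and function names are made up for this example.
Python
from collections import defaultdict

def mapper(line):
    # Map stage: each input line produces a list of (key, value) pairs
    return [(word, 1) for word in line.split()]

def reducer(key, values):
    # Reduce stage: all values collected for one key are combined
    return (key, sum(values))

input_lines = ["spark is fast", "spark is distributed"]

# Map: list of lines -> list of (key, value) pairs
mapped = [pair for line in input_lines for pair in mapper(line)]

# Group (shuffle): (key, value) pairs -> (key, list(values))
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce: (key, list(values)) -> (key, aggregated value)
print([reducer(key, values) for key, values in grouped.items()])
# [('spark', 2), ('is', 2), ('fast', 1), ('distributed', 1)]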
Spark - Spark (an open-source big-data processing engine by Apache) is a cluster computing system. It is faster than other cluster computing systems (such as Hadoop). It provides high-level APIs in Python, Scala, and Java, and parallel jobs are easy to write in Spark. We will cover PySpark (Python + Apache Spark), because this makes the learning curve flatter. Instructions for installing Spark on a Linux system and for running Spark on a multi-node cluster are available in the official documentation (see References). In this article, we will see how to create RDDs, the fundamental data structure of Spark.
RDDs (Resilient Distributed Datasets) - RDDs are immutable collections of objects. Since we are using PySpark, these objects can be of multiple Python types (see the small example after the SparkContext snippet below).
SparkContext - To create a standalone application in Spark, we first define a SparkContext -
Python
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("Test")
# setMaster("local") - run the job on a single machine
sc = SparkContext(conf=conf)
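As a small illustrative sketch (assuming the sc object created above), RDDs can hold objects of many Python types, and transformations always return a new RDD rather than modifying an existing one:
Python
nums = sc.parallelize([1, 2, 3, 4])           # RDD of ints
words = sc.parallelize(["spark", "rdd"])      # RDD of strings
pairs = sc.parallelize([(1, "a"), (2, "b")])  # RDD of (key, value) tuples

squares = nums.map(lambda x: x * x)  # a new RDD; nums itself is unchanged
print(nums.collect())                # [1, 2, 3, 4]
print(squares.collect())             # [1, 4, 9, 16]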
RDD transformations - Now that a SparkContext object has been created, we can create RDDs and apply some transformations to them.
Python
# create an RDD called lines from 'file_name.txt'
# (the second argument is the minimum number of partitions)
lines = sc.textFile("file_name.txt", 2)

# lines.collect() returns the entire RDD as a Python list
print(lines.collect())
One major advantage of using Spark is that it does not load the dataset into memory right away; lines is simply a pointer to the 'file_name.txt' file, and nothing is read until an action is called.
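As a small sketch of this lazy behaviour (the contents of 'file_name.txt' are hypothetical here), transformations such as filter and map only describe the computation; nothing is read from disk until an action such as count(), take(), or collect() is called:
Python
# transformations - lazily build up the computation, nothing runs yet
spark_lines = lines.filter(lambda line: "spark" in line)
line_lengths = lines.map(lambda line: len(line))

# actions - these trigger reading the file and running the computation
print(spark_lines.count())
print(line_lengths.take(5))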
A simple PySpark app to count the degree of each vertex (here, the out-degree, since each edge is counted only for its source vertex) of a given graph -
Python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("Test")
# setMaster("local") - run the job on a single machine
sc = SparkContext(conf=conf)

def conv(line):
    # "1 2" -> (1, [2]): the source vertex mapped to a one-element adjacency list
    line = line.split()
    return (int(line[0]), [int(line[1])])

def numNeighbours(x, y):
    # merge the adjacency lists of the same source vertex
    return x + y

lines = sc.textFile('graph.txt')

# each line "u v" becomes the pair (u, [v])
edges = lines.map(conv)

# combine the adjacency lists per vertex, then count the neighbours
Adj_list = edges.reduceByKey(numNeighbours).mapValues(len)

print(Adj_list.collect())
Understanding the above code -
- Our text file is in the following format (each line represents an edge of a directed graph):
1 2
1 3
2 3
3 4
. .
. .
. .
- Large datasets may contain millions of nodes and edges.
- The first few lines set up the SparkContext. We then create the lines RDD from the input file.
- Then, we transform the lines RDD into the edges RDD. The function conv acts on each line, and key-value pairs of the form (1, [2]), (1, [3]), (2, [3]), (3, [4]), ... are stored in the edges RDD.
- After this, reduceByKey aggregates all the pairs corresponding to a particular key, using the numNeighbours function to merge their adjacency lists, and mapValues(len) converts each merged list into the vertex's degree. The resulting RDD Adj_list has the form (1, 2), (2, 1), (3, 1), ... (an equivalent, more compact formulation is sketched below).
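As a design note, the same out-degrees can be computed with a more compact, equivalent sketch that skips building adjacency lists and simply counts one per edge for each source vertex (this is an alternative formulation, not the code used above):
Python
# map each edge "u v" to (u, 1) and sum the ones per source vertex
degrees = (lines.map(lambda line: (int(line.split()[0]), 1))
                .reduceByKey(lambda x, y: x + y))
print(degrees.collect())   # e.g. [(1, 2), (2, 1), (3, 1), ...]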
Running the code -
- The above code can be run with the following commands:
$ cd /home/arik/Downloads/spark-1.6.0/
$ ./bin/spark-submit degree.py
- Replace the path in the first command with the path of your own Spark installation.
In upcoming articles, we will see how to run MapReduce jobs on a cluster of machines using Spark, and we will go through other MapReduce tasks.
References -
- http://lintool.github.io/SparkTutorial/
- https://spark.apache.org/