
Check for a Substring in a PySpark DataFrame
PySpark is the Python library that provides an interface to Apache Spark, a distributed computing system for processing large datasets. Apart from the traditional languages Java and Scala, PySpark gives us the ability to write Spark applications in Python.
Verifying a Substring in a PySpark DataFrame
PySpark provides the DataFrame API, which helps us manipulate structured data much like SQL queries. It can read data in various formats such as Parquet, CSV, and JSON. It also includes a machine learning library with support for classification, regression, clustering, and more.
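As a quick illustration of the DataFrame API, here is a minimal sketch that reads a CSV file and applies a SQL-style filter. The file name people.csv and its columns are hypothetical placeholders, not part of the example used later in this article.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()

# Read a CSV file into a DataFrame; the same reader interface also
# supports spark.read.json(...) and spark.read.parquet(...).
# "people.csv" and its columns are hypothetical placeholders.
df = spark.read.csv("people.csv", header=True, inferSchema=True)

# SQL-like manipulation of the structured data.
df.select("name").where(df.age > 30).show()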
There are different ways to check for a substring in a PySpark DataFrame. In this article we will discuss the following:
Using like() or rlike()
Using substr()
Installing PySpark
Before looking at each approach in detail, we have to install PySpark in our working environment using the following command.
!pip install pyspark
On successful installation, we get output similar to the one shown below.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.2.tar.gz (281.4 MB)
  . . . . . . . .
Installing collected packages: py4j, pyspark
  Attempting uninstall: py4j
    Found existing installation: py4j 0.10.9.7
    Uninstalling py4j-0.10.9.7:
      Successfully uninstalled py4j-0.10.9.7
Successfully installed py4j-0.10.9.5 pyspark-3.3.2
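To confirm that PySpark was installed correctly, a quick sanity check is to import the package and print its version.

import pyspark

# The version should match the one shown in the installation log above.
print(pyspark.__version__)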
Creating a DataFrame in PySpark
Now we have to create a PySpark DataFrame with the specified data. The following are the steps.
First, we import all the necessary PySpark modules and packages needed to create the DataFrame.
from pyspark.sql import SparkSession
from pyspark.sql.types import *
Next, we create the SparkSession to work with the PySpark DataFrame.
spark = SparkSession.builder \
   .appName("substring check") \
   .getOrCreate()
We then define the schema used to create the DataFrame.
schema = StructType([
   StructField("name", StringType(), True),
   StructField("age", IntegerType(), True),
   StructField("gender", StringType(), True),
   StructField("occupation", StringType(), True),
   StructField("salary", DoubleType(), True)
])
In this step, we create the list of rows that forms the DataFrame.
data = [("John", 25, "Male", "Developer", 5000.0),
   ("Jane", 30, "Female", "Manager", 7000.0),
   ("Bob", 35, "Male", "Director", 10000.0),
   ("Alice", 40, "Female", "CEO", 15000.0)]
Finally, we create the DataFrame from the schema and data specified in the two steps above, and display it after creation.
df = spark.createDataFrame(data, schema)
df.show()
Output
The following is the output of the DataFrame created using PySpark.
+-----+---+------+----------+-------+
| name|age|gender|occupation| salary|
+-----+---+------+----------+-------+
| John| 25|  Male| Developer| 5000.0|
| Jane| 30|Female|   Manager| 7000.0|
|  Bob| 35|  Male|  Director|10000.0|
|Alice| 40|Female|       CEO|15000.0|
+-----+---+------+----------+-------+
Now, let's look at each approach for checking a substring in a PySpark DataFrame.
Using the like() or rlike() functions
In PySpark, we have two functions, like() and rlike(), which are used to check for a substring in a DataFrame column. The like() function checks whether a column contains the specified SQL LIKE pattern, whereas the rlike() function checks the column against a regular expression pattern.
from pyspark.sql.functions import col

df = spark.createDataFrame(data, schema)
df.show()
df.where(col("occupation").like("%Developer%")).show()
df.where(col("occupation").rlike(".*Developer.*")).show()
Output
The output of the like() and rlike() functions is as follows.
+-----+---+------+----------+-------+
| name|age|gender|occupation| salary|
+-----+---+------+----------+-------+
| John| 25|  Male| Developer| 5000.0|
| Jane| 30|Female|   Manager| 7000.0|
|  Bob| 35|  Male|  Director|10000.0|
|Alice| 40|Female|       CEO|15000.0|
+-----+---+------+----------+-------+

+----+---+------+----------+------+
|name|age|gender|occupation|salary|
+----+---+------+----------+------+
|John| 25|  Male| Developer|5000.0|
+----+---+------+----------+------+

+----+---+------+----------+------+
|name|age|gender|occupation|salary|
+----+---+------+----------+------+
|John| 25|  Male| Developer|5000.0|
+----+---+------+----------+------+
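Note that like() is case-sensitive. If a case-insensitive match is needed, one option (a sketch that reuses the df created above, not part of the original example) is to embed the (?i) flag in the rlike() pattern, or to lower-case the column with lower() before matching with like().

from pyspark.sql.functions import col, lower

# Case-insensitive match using the (?i) regular expression flag.
df.where(col("occupation").rlike("(?i).*developer.*")).show()

# Equivalent approach: lower-case the column, then apply like().
df.where(lower(col("occupation")).like("%developer%")).show()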
Using the substr() function
The substr() function extracts a substring from the column; we can then check whether the extracted substring equals the specified value.
from pyspark.sql.functions import col

df = spark.createDataFrame(data, schema)
df.show()
df.where(col("occupation").substr(1, 5) == "Manag").show()
Output
The output of the substr() function is as follows.
+-----+---+------+----------+-------+
| name|age|gender|occupation| salary|
+-----+---+------+----------+-------+
| John| 25|  Male| Developer| 5000.0|
| Jane| 30|Female|   Manager| 7000.0|
|  Bob| 35|  Male|  Director|10000.0|
|Alice| 40|Female|       CEO|15000.0|
+-----+---+------+----------+-------+

+----+---+------+----------+------+
|name|age|gender|occupation|salary|
+----+---+------+----------+------+
|Jane| 30|Female|   Manager|7000.0|
+----+---+------+----------+------+
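Besides like(), rlike(), and substr(), the PySpark Column API also provides contains(), startswith(), and endswith() for simple substring checks. The following sketch, which reuses the same df as above, shows them in action.

from pyspark.sql.functions import col

# contains() checks for the substring anywhere in the value.
df.where(col("occupation").contains("Direct")).show()

# startswith() and endswith() check the beginning and end of the value.
df.where(col("occupation").startswith("Dev")).show()
df.where(col("name").endswith("e")).show()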