How to get keys and values from Map Type column in Spark SQL DataFrame
Last Updated: 24 Apr, 2025
In PySpark, MapType is the data type used to define a map column, which stores key-value pairs, much like a HashMap in Java or a dictionary in Python. Each row of a MapType column holds its own mapping of keys to values, and Spark SQL provides several functions for extracting those keys and values.
A map column is defined by passing two arguments to MapType: a keyType and a valueType, each of which can be any Spark SQL data type such as StringType, IntegerType, or ArrayType. (In the Java/Scala API, the equivalent is the createMapType() method on the DataTypes class.) An optional third parameter, valueContainsNull, is a boolean (True by default) that signals whether the map's values are allowed to be Null/None. Once such a column exists, Spark SQL functions such as map_keys(), map_values(), and explode() can be applied to extract the key-value pairs.
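A minimal sketch of how the type is constructed and inspected (the variable name scores_type and the chosen key/value types are illustrative, not part of the dataset used later):
Python3
from pyspark.sql.types import MapType, StringType, IntegerType

# a map with string keys and integer values whose values may not be None
scores_type = MapType(StringType(), IntegerType(), valueContainsNull=False)

# the type can be inspected through its attributes
print(scores_type.keyType)            # the key type, StringType
print(scores_type.valueType)          # the value type, IntegerType
print(scores_type.valueContainsNull)  # False
print(scores_type.simpleString())     # map<string,int>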
Features and functionalities of MapType:
- We use the MapType column for data transformation due to its flexibility: each row can carry its own set of key-value pairs.
- Various transformations, such as addition, multiplication, or string concatenation, can be applied to the map's values, provided the operation is defined for the value's data type.
- Operations on MapType columns are parallelized, which means they can be executed on multiple threads and executors to maintain performance on massive collections.
- Output is computed lazily, only when an action needs it, which saves both memory and run time (see the sketch below).
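A minimal, self-contained sketch of that lazy evaluation (the two-column DataFrame here is hypothetical and unrelated to the example that follows):
Python3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('LazyDemo').getOrCreate()
demo = spark.createDataFrame([('a', 1), ('b', 2)], ['key', 'value'])

# a transformation: Spark only records the plan, nothing executes yet
doubled = demo.withColumn('value', demo.value * 2)

# an action: only now does Spark actually run the computation
doubled.show()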
Create MapType in Spark DataFrame
Let us first create a PySpark MapType column. Define the map type with the MapType() function, then build the schema using the StructType() and StructField() functions. After that, create a DataFrame using the spark.createDataFrame() method, which takes the data as one of its parameters; in this example, the dataset is a list of tuples. Finally, the printSchema() and show() methods display the schema and the DataFrame as the output.
Python3
# import the needed modules
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, MapType

# a standalone MapType: string keys, string values, no None values allowed
mapCol = MapType(StringType(), StringType(), False)

# build the schema with a MapType column named 'features'
schema = StructType([
    StructField('identification', StringType(), True),
    StructField('features', MapType(StringType(), StringType()), True)
])

spark = SparkSession.builder.appName('GeeksforGeeks').getOrCreate()

# the dataset: a list of tuples, each with an id and a dictionary
dataDictionary = [
    ('Shivaz', {'pupil': 'black', 'nails': 'white'}),
    ('Shivi', {'pupil': 'brown', 'nails': 'yellow'}),
    ('Shiv', {'pupil': 'green', 'nails': 'white'}),
    ('Shaz', {'pupil': 'grey', 'nails': 'yellow'}),
    ('Shiva', {'pupil': 'blue', 'nails': 'white'})
]

# create the DataFrame, then print the schema and its contents
df = spark.createDataFrame(data=dataDictionary, schema=schema)
df.printSchema()
df.show(truncate=False)
Output:
Schema and DataFrame created
Steps to get Keys and Values from the Map Type column in SQL DataFrame
The following examples, written in Python, show different ways to get keys and values from the MapType column in a Spark SQL DataFrame.
Example 1: Display the attributes and features of MapType
In this example, we extract the keys and values of the features map into their own columns. The rdd.map() function applies the provided operation to each Row of the DataFrame's underlying RDD, and toDF() turns the result back into a DataFrame.
Python3
# formatting the features and attributes
# flatten the map into one column per feature
df3 = df.rdd.map(lambda x:
                 (x.identification, x.features["pupil"], x.features["nails"])) \
        .toDF(["identification", "pupil", "nails"])

# print the schema, then show the flattened DataFrame
df3.printSchema()
df3.show()
Output:
Schema, Attributes and Features of the DataFrame
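The same flattening can also be done without dropping to the RDD API. A sketch using column expressions on the same df (indexing a map column with [] looks up the value for that key):
Python3
# equivalent flattening with DataFrame column expressions
df.select(
    df.identification,
    df.features['pupil'].alias('pupil'),
    df.features['nails'].alias('nails')
).show()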
Example 2: Get the keys and values from MapType using .getItem()
Python3
# apply the .getItem() function to pull each
# feature value into its own column
df.withColumn("pupil", df.features.getItem("pupil")) \
  .withColumn("nails", df.features.getItem("nails")) \
  .drop("features") \
  .show()
Output:
Keys and Values using getItem()
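One behaviour worth noting: with Spark's default (non-ANSI) configuration, calling getItem() with a key that is absent from the map yields null rather than an error. A sketch, using a hypothetical 'height' key that does not occur in this data:
Python3
# 'height' is not a key in any row, so the new column is all nulls
df.withColumn("height", df.features.getItem("height")).show()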
Example 3: Getting all the keys of MapType using the explode() function
Using the explode() function together with map_keys(), we can collect every distinct key that appears in the MapType column.
Python3
# import the needed modules
from pyspark.sql.functions import explode, map_keys

# explode the map keys, then collect the distinct ones into a Python list
keysDF = df.select(explode(map_keys(df.features))).distinct()
keysList = keysDF.rdd.map(lambda x: x[0]).collect()
print(keysList)
Output:
['pupil', 'nails']
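A natural follow-up (a sketch, not part of the original example) is to use the collected key list to flatten the map into columns dynamically, so the column names need not be hard-coded:
Python3
from pyspark.sql.functions import col

# build one column per key discovered in the previous step
df.select(
    col("identification"),
    *[col("features").getItem(k).alias(k) for k in keysList]
).show()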
Example 4: Getting the keys and values using the explode() function
Python3
# use the explode() function to get key and value columns
from pyspark.sql.functions import explode

df.select(df.identification, explode(df.features)).show()
Output:
Keys and Values using explode() function
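When a map column is exploded, the generated columns are named key and value by default, but they can be renamed in the same expression. A small sketch (the alias names feature and colour are illustrative):
Python3
# rename the generated key/value columns while exploding
df.select(df.identification,
          explode(df.features).alias("feature", "colour")).show()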
Example 5: Getting the keys using the map_keys() function
Python3
# to extract the keys we can use map_keys()
# import the module
from pyspark.sql.functions import map_keys

df.select(df.identification, map_keys(df.features)).show()
Output:
Keys using the map_keys() function
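Because map_keys() returns an array column, it combines naturally with Spark's array functions. For instance, this sketch keeps only the rows whose map contains a given key:
Python3
from pyspark.sql.functions import array_contains, map_keys

# keep only the rows whose features map contains the key 'pupil'
df.filter(array_contains(map_keys(df.features), "pupil")).show()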
Example 6: Getting the values using the map_values() function
Python3
# import the module
from pyspark.sql.functions import map_values

# map_values() returns the map's values as an array column
df.select(df.identification, map_values(df.features)).show()
Output:
Values using the map_values() function
Conclusion
The implementations above show several ways to fetch keys and values from a MapType column: flattening with rdd.map(), getItem(), explode(), map_keys(), and map_values(). With these functions, a MapType column can be handled almost as easily as a Python dictionary.