How to Convert RDD to DataFrame in Spark Scala?
Last Updated: 04 Jun, 2024
This article focuses on discussing ways to convert an RDD to a DataFrame in Spark Scala.
RDD and DataFrame in Spark
RDD and DataFrame are Spark's two primary abstractions for handling data.
- RDD is the lower-level building block for distributed data processing, while DataFrame provides a higher-level, SQL-like interface with named columns.
- In real projects, there is often a need to switch between RDDs and DataFrames.
Below is the Scala program to set up a Spark session and create an RDD:
Scala
import org.apache.spark.sql.{SparkSession, Row}
import org.apache.spark.sql.types._

// Create a local Spark session and get its SparkContext.
val spark = SparkSession.builder().master("local").appName("RDDExample").getOrCreate()
val sc = spark.sparkContext

// Build an RDD of (name, department, salary) tuples.
val rdd = sc.parallelize(Seq(
  ("Alice", "HR Manager", 40),
  ("Bob", "Software Developer", 35),
  ("Charlie", "Data Scientist", 28)
))
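Creating an RDD is lazy, so this produces no output on its own. As a quick sanity check (an illustrative addition, not part of the original listing), you can pull the data back to the driver and print it:
Scala
// Collect the RDD to the driver and print each tuple.
rdd.collect().foreach(println)
// (Alice,HR Manager,40)
// (Bob,Software Developer,35)
// (Charlie,Data Scientist,28)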
Convert Using createDataFrame Method
Spark's createDataFrame method turns an RDD into a DataFrame. When the RDD contains tuples (or other subclasses of scala.Product), Spark can infer the schema (the structure of your data) for you, so you do not have to specify it right away. Alternatively, you can pass an RDD of Row objects together with an explicit schema for full control over column names and types.
Either way, you can work with your data in DataFrame form without much hassle.
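For instance, with the tuple RDD created above, the no-schema variant is a one-liner (a minimal sketch; when the schema is inferred, the column names default to _1, _2, _3):
Scala
// Schema inference: the column types come from the tuple elements,
// and the column names default to _1, _2, _3.
val dfInferred = spark.createDataFrame(rdd)
dfInferred.printSchema()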
Example with an explicit schema:
Scala
import org.apache.spark.sql.{SparkSession, Row}
import org.apache.spark.sql.types._

object RDDToDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RDD to DataFrame")
      .master("local[*]")
      .getOrCreate()

    val data = Seq(
      ("John", 30),
      ("Alice", 25),
      ("Bob", 35)
    )
    val rdd = spark.sparkContext.parallelize(data)

    // Explicit schema: column names and types are under our control.
    val schema = StructType(
      Seq(
        StructField("Name", StringType, nullable = true),
        StructField("Age", IntegerType, nullable = true)
      )
    )

    // With an explicit schema, createDataFrame expects an RDD[Row],
    // so each tuple is converted to a Row first.
    val df = spark.createDataFrame(rdd.map(t => Row(t._1, t._2)), schema)
    df.show()

    spark.stop()
  }
}
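Run locally, df.show() should print something like:
Output:
+-----+---+
| Name|Age|
+-----+---+
| John| 30|
|Alice| 25|
|  Bob| 35|
+-----+---+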
Let's now examine the schema of the DataFrame we just created:
Scala
df.printSchema()
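Given the explicit schema above, this should print:
Output:
root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)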
- When Spark infers the schema from tuples, the column names follow the template _1, _2, _3, and the inferred types may not always be the ones you want.
- To make sure the data is organized correctly and to keep control over its structure, it is better to define an explicit schema that describes the layout of the table.
- With an explicit schema, createDataFrame requires the data as an RDD[Row] rather than an RDD of tuples, so each tuple has to be converted into a Row first, as shown below.
Below is the code (note the toDouble conversion so the third value matches the DoubleType column declared in the schema that follows):
Scala
import org.apache.spark.rdd.RDD

val rowRDD: RDD[Row] = rdd.map(t => Row(t._1, t._2, t._3.toDouble))
Next, let's create the schema object that we need:
Below is the code:
Scala
val schema = new StructType()
  .add(StructField("EmployeeName", StringType, false))
  .add(StructField("Department", StringType, true))
  .add(StructField("Salary", DoubleType, true))
Let's invoke the method once more, this time passing in an extra schema parameter:
Scala
import org.apache.spark.sql.DataFrame

val dfWithSchema: DataFrame = spark.createDataFrame(rowRDD, schema)
We will print the schema information once again:
Scala
dfWithSchema.printSchema()
Output:
root
 |-- EmployeeName: string (nullable = false)
 |-- Department: string (nullable = true)
 |-- Salary: double (nullable = true)
It is evident that the data types are defined correctly and that the columns have appropriate names.
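To look at the data itself rather than just the schema, you can also call show() (an illustrative addition; a local run should print roughly the following):
Scala
dfWithSchema.show()
Output:
+------------+------------------+------+
|EmployeeName|        Department|Salary|
+------------+------------------+------+
|       Alice|        HR Manager|  40.0|
|         Bob|Software Developer|  35.0|
|     Charlie|    Data Scientist|  28.0|
+------------+------------------+------+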
Conversion Using toDF() Implicit Method
Another common way to turn an RDD into a DataFrame is the .toDF() method. Before using it, we need to import the implicit conversions from our SparkSession (import spark.implicits._).
This makes toDF() available on supported RDDs and local collections:
Below is the code:
Scala
import org.apache.spark.sql.{SparkSession, Row}
import org.apache.spark.sql.types._

object RDDToDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RDD to DataFrame")
      .master("local[*]")
      .getOrCreate()

    val data = Seq(
      ("John", 30),
      ("Alice", 25),
      ("Bob", 35)
    )
    val rdd = spark.sparkContext.parallelize(data)

    // The implicits bring toDF() into scope for RDDs of tuples.
    import spark.implicits._
    val df = rdd.toDF("Name", "Age")
    df.show()

    spark.stop()
  }
}
With that in place, we are all set to convert our RDD. However, it is important to note that this method only works for RDDs whose elements are of specific types: Int, Long, String, or subclasses of scala.Product (which includes tuples and case classes). Since our RDD was constructed from a sequence of tuples, we can use the imported implicit method as follows:
Below is the code:
Scala
val dfUsingToDFMethod = rdd.toDF("EmployeeName", "Department", "Salary")
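Because case classes are also subclasses of scala.Product, the same implicit works for them, and the field names become the column names automatically. Here is a minimal sketch (the Employee case class and its values are illustrative, not part of the original article):
Scala
// Hypothetical case class for illustration; toDF() still needs
// import spark.implicits._ to be in scope.
case class Employee(name: String, department: String, salary: Double)

val employeeRDD = sc.parallelize(Seq(
  Employee("Alice", "HR Manager", 40000.0),
  Employee("Bob", "Software Developer", 35000.0)
))

// No column-name arguments needed: name, department, and salary
// are taken from the case class fields.
val dfFromCaseClass = employeeRDD.toDF()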
Now, let's take a peek at the schema of our freshly minted DataFrame:
Below is the code:
Scala
dfUsingToDFMethod.printSchema()
Upon execution, this will display:
Output:
root
 |-- EmployeeName: string (nullable = true)
 |-- Department: string (nullable = true)
 |-- Salary: integer (nullable = false)
This showcases the schema structure of our DataFrame, including the names and data types of its columns.
Conclusion
In this guide, we have looked at two ways to turn an RDD into a DataFrame and what each one requires. If you need explicit control over column names and types, convert the RDD to an RDD[Row] and pass it to createDataFrame together with a StructType schema. If the RDD holds tuples, case classes, or other supported types, the implicit toDF() method is the quicker option.