How to create DataFrame from Scala's List of Iterables?
Last Updated: 16 Apr, 2024
In Scala, working with large datasets is made easier with Apache Spark, a powerful framework for distributed computing. One of Spark's core abstractions is the DataFrame, which organizes data into named columns for efficient processing. In this article, we'll explore how to create DataFrames from Scala lists of iterables using Apache Spark's DataFrame API.
Understanding DataFrames and Lists
DataFrames are like tables in a database, allowing us to work with structured data easily. On the other hand, lists are collections of data in Scala.
- DataFrames: DataFrames are tabular data structures in Spark, representing distributed collections of data organized into named columns. They offer rich functionality for data manipulation, including filtering, aggregation, and SQL queries, making them indispensable for data processing tasks.
- Lists of Iterables: In Scala, a List is an ordered collection of elements of the same type, while an Iterable represents a collection that can be iterated over. Lists of Iterables are often used to store structured data, where each Iterable represents a row or record, as shown in the sketch after this list.
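For instance, a tiny dataset of people can be held as a List of Lists before any Spark code is involved (a minimal sketch using the same sample values as the examples below):
Scala
// Each inner List is one row (a record); the outer List is the dataset
val rows: List[List[Any]] = List(
  List("Alice", 30, "New York"),
  List("Bob", 25, "Los Angeles"),
  List("Charlie", 35, "Chicago")
)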
Creating DataFrames from Lists
There are two simple ways to turn lists into DataFrames:
1. Using RDDs:
- RDDs (Resilient Distributed Datasets) offer a flexible way to handle data transformations.
- Convert each row in the list into a Row object, the format DataFrames expect.
- Create a DataFrame from the converted RDD together with an explicit schema.
2. Using the toDF method:
- This method is simpler and suitable for straightforward data structures.
- Map each row to a tuple, then convert the list directly into a DataFrame using the toDF method, specifying column names.
Let's see both methods in action.
Method 1: Using RDDs
Scala
// Import necessary libraries
import org.apache.spark.sql.{SparkSession, DataFrame, Row}
import org.apache.spark.sql.types._
// Create SparkSession
val spark = SparkSession.builder().appName("DataFrameFromList").getOrCreate()
// Define schema for DataFrame
val schema = StructType(Seq(
  StructField("Name", StringType, nullable = false),
  StructField("Age", IntegerType, nullable = false),
  StructField("Location", StringType, nullable = false)
))

// Sample data: List of Lists
val data = List(
  List("Alice", 30, "New York"),
  List("Bob", 25, "Los Angeles"),
  List("Charlie", 35, "Chicago")
)
// Convert List of Lists to RDD of Rows
val rowsRDD = spark.sparkContext.parallelize(data).map(Row.fromSeq)
// Create DataFrame
val df: DataFrame = spark.createDataFrame(rowsRDD, schema)
// Show DataFrame
df.show()
Output:

+-------+---+-----------+
|   Name|Age|   Location|
+-------+---+-----------+
|  Alice| 30|   New York|
|    Bob| 25|Los Angeles|
|Charlie| 35|    Chicago|
+-------+---+-----------+
Explanation:
- We import necessary libraries.
- A SparkSession is created.
- We define the DataFrame schema, specifying data types for each column.
- Sample data is represented as a List of Iterables, where each inner list represents a row.
- We use spark.sparkContext.parallelize to create an RDD from the list.
- The map function on the RDD converts each Iterable to a Row object using Row.fromSeq.
- Finally, spark.createDataFrame creates the DataFrame from the RDD and schema.
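Once the DataFrame exists, it supports the operations mentioned earlier, such as filtering, aggregation, and SQL queries. A minimal sketch continuing from the df created above (the view name "people" is illustrative):
Scala
// Filter rows where Age is greater than 28
df.filter(df("Age") > 28).show()

// Register the DataFrame as a temporary view and query it with SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT Location, COUNT(*) AS cnt FROM people GROUP BY Location").show()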
Method 2: Using the toDF method
Scala
// Import necessary libraries
import org.apache.spark.sql.{SparkSession, DataFrame}

// Create SparkSession
val spark = SparkSession.builder().appName("DataFrameFromList").getOrCreate()

// Import implicits to enable the toDF method
import spark.implicits._

// Sample data: List of Lists
val data = List(
  List("Alice", "30", "New York"),
  List("Bob", "25", "Los Angeles"),
  List("Charlie", "35", "Chicago")
)

// Map each row to a tuple, then create a DataFrame with toDF
// (toDF cannot split a List[String] into separate columns on its own)
val df: DataFrame = data.map { case List(name, age, location) => (name, age, location) }
  .toDF("Name", "Age", "Location")
// Show DataFrame
df.show()
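Output (note that Age is a string column here, since the sample values were quoted):

+-------+---+-----------+
|   Name|Age|   Location|
+-------+---+-----------+
|  Alice| 30|   New York|
|    Bob| 25|Los Angeles|
|Charlie| 35|    Chicago|
+-------+---+-----------+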
Explanation:
- We import necessary libraries.
- A SparkSession is created.
- Sample data is defined as a List of Lists, where each inner List represents a row. All values here are strings, so each inner list holds a single element type.
- Each row is mapped to a tuple, and the toDF method then creates the DataFrame, with column names passed as arguments. This requires importing spark.implicits._ and offers a concise approach for well-defined data structures.
- Finally, we display the DataFrame using the show method.
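Because the Age values were supplied as strings in this example, the resulting column has string type. If a numeric column is needed, it can be cast afterwards; a minimal sketch using Spark's built-in cast function:
Scala
import org.apache.spark.sql.functions.col

// Cast the string-typed Age column to an integer column
val typedDf = df.withColumn("Age", col("Age").cast("int"))
typedDf.printSchema()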
Conclusion:
Creating DataFrames from lists of data in Scala is straightforward with Apache Spark's DataFrame API. Whether you choose to use RDDs for flexibility or the toDF method for simplicity, you can quickly organize and process your data for analysis and insights.