Basic DataFrame Operations
Creating a SparkSession
The SparkSession is the entry point to programming with Spark SQL.
It allows you to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read
parquet files.
SparkSession.builder: The builder attribute is a class attribute of SparkSession that provides a way to configure
and create a SparkSession instance.
appName("Example App"): The appName method sets the name of the Spark application. This name will appear
in the Spark web UI and can help you identify your application among others running on the same cluster.
config("spark.some.config.option", "some-value"): The config method allows you to set various configuration
options for the Spark session. In this example, " spark.some.config.option " is a placeholder for an actual
configuration key, and "some-value" is the value for that configuration. You can set multiple configuration options
by chaining multiple config calls.
getOrCreate(): The getOrCreate method either retrieves an existing SparkSession if one already exists or creates a
new one if it does not. This ensures that you do not accidentally create multiple SparkSession instances in your
application.
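A minimal sketch putting these calls together (the app name and config key/value are the placeholders described above):
%python
from pyspark.sql import SparkSession
# Build a new SparkSession, or return the existing one if already created
spark = SparkSession.builder \
    .appName("Example App") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()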
Note: In Databricks, you do not need to create or override the SparkSession as it is automatically created for each
notebook or job executed against the cluster. Databricks manages the SparkSession and SparkContext for you,
ensuring optimal configuration and resource usage.
Creating a DataFrame
1. From a Python List of Tuples
%python
# List of tuples
data = [("John", 25), ("Doe", 30), ("Jane", 22)]
# Creating DataFrame
df_list = spark.createDataFrame(data, ["Name", "Age"])
2. From a List of Dictionaries
%python
# List of dictionaries
data = [{"Name": "Alice", "Id": 1}, {"Name": "Bob", "Id": 2}, {"Name": "Cathy", "Id": 3}]
# Creating DataFrame
df_dict = spark.createDataFrame(data)
3. From a List of Rows
%python
from pyspark.sql import Row
# List of Rows
data = [ Row(Name="Cathy", Id=1),
Row(Name="David", Id=2),
Row(Name="Eva", Id=3),
Row(Name="Frank", Id=4)]
# Creating DataFrame
df_row = spark.createDataFrame(data)
+-----+---+
| Name| Id|
+-----+---+
|Cathy|  1|
|David|  2|
|  Eva|  3|
|Frank|  4|
+-----+---+
4. From an RDD
%python
# Import necessary modules
from pyspark.sql import Row
# Create an RDD
rdd = spark.sparkContext.parallelize([
    Row(Name="Alice", Age=25),
    Row(Name="Bob", Age=30),
    Row(Name="Cathy", Age=22),
    Row(Name="David", Age=35),
    Row(Name="Eva", Age=28),
    Row(Name="Frank", Age=40)
])
# Creating DataFrame from the RDD
df_rdd = spark.createDataFrame(rdd)
df_rdd.show()
+-----+---+
| Name|Age|
+-----+---+
|Alice| 25|
|  Bob| 30|
|Cathy| 22|
|David| 35|
|  Eva| 28|
|Frank| 40|
+-----+---+
.format("csv"): Specifies the format of the data source. In this case, it indicates that the data is in CSV (Comma-
Separated Values) format.
.option("header", "true"): This option tells Spark that the first row of the CSV file contains the column names. If this
option is set to false, Spark will treat the first row as data. "true" means that the CSV file has a header row.
.option("inferSchema", "true"): This option tells Spark to automatically infer the data types of each column in the
CSV file. If this option is set to false, all columns will be read as strings (default behavior). "true" means that Spark will
try to infer the schema (data types) of the columns based on the data.
.load("/FileStore/tables/retail_db/customers"):
This method specifies the path to the CSV file or directory containing CSV files that you want to read.
customer_df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("dbfs:/FileStore/tables/customers_300mb.csv")
%python
# Employee data and schemas
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType, DateType
from datetime import date
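The cell above only imports the type classes. A minimal sketch of how they could be used to build an explicit schema (the employee fields and sample values below are assumptions, not from the original notebook):
%python
# Hypothetical employee schema built from the imported type classes
schema = StructType([
    StructField("emp_id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("salary", FloatType(), True),
    StructField("join_date", DateType(), True)
])
# Sample rows matching the schema (assumed values)
employee_data = [(1, "Alice", 50000.0, date(2023, 1, 15)),
                 (2, "Bob", 62000.5, date(2023, 3, 1))]
df_employee = spark.createDataFrame(employee_data, schema)
df_employee.printSchema()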
Viewing Data: show() and display()
show(): This is a method available on Spark DataFrames that prints the first n rows to the console. It is useful for
quick inspection of data but does not provide rich formatting or interactivity. You can specify the number of rows to
display; it defaults to 20 rows if not specified.
display(): This is a Databricks-specific function that provides a rich, interactive view of the DataFrame. It is more
suitable for use within notebooks as it allows for better visualization, including sorting, filtering, and graphical
representation of data.
customer_df.show(5)
+-----------+----------+------+-----------+-------+-----------------+---------+
|customer_id|      name|  city|      state|country|registration_date|is_active|
+-----------+----------+------+-----------+-------+-----------------+---------+
|          0|Customer_0|  Pune|Maharashtra|  India|       2023-01-19|     true|
|          1|Customer_1|  Pune|West Bengal|  India|       2023-08-10|     true|
|          2|Customer_2| Delhi|Maharashtra|  India|       2023-08-05|     true|
|          3|Customer_3|Mumbai|  Telangana|  India|       2023-06-04|     true|
|          4|Customer_4| Delhi|  Karnataka|  India|       2023-03-15|    false|
+-----------+----------+------+-----------+-------+-----------------+---------+
only showing top 5 rows
customer_df.display()
#display(customer_df)
Inspecting Columns and Schema
columns: This attribute returns a list of the column names in the DataFrame.
printSchema(): This method prints the schema of the DataFrame, including column names and data types, in a
tree format.
customer_df.columns
['customer_id',
'name',
'city',
'state',
'country',
'registration_date',
'is_active']
customer_df.printSchema()
root
|-- customer_id: integer (nullable = true)
|-- name: string (nullable = true)
|-- city: string (nullable = true)
|-- state: string (nullable = true)
|-- country: string (nullable = true)
|-- registration_date: date (nullable = true)
|-- is_active: boolean (nullable = true)
3. Select columns
customer_df.select("name","city").show()
+-----------+---------+
|       name|     city|
+-----------+---------+
| Customer_0|     Pune|
| Customer_1|     Pune|
| Customer_2|    Delhi|
| Customer_3|   Mumbai|
| Customer_4|    Delhi|
| Customer_5|  Kolkata|
| Customer_6|  Kolkata|
| Customer_7|   Mumbai|
| Customer_8|     Pune|
| Customer_9|    Delhi|
|Customer_10|Hyderabad|
|Customer_11|    Delhi|
|Customer_12|    Delhi|
|Customer_13|     Pune|
|Customer_14|  Chennai|
|Customer_15|Hyderabad|
|Customer_16|  Chennai|
|Customer_17|     Pune|
|Customer_18|  Chennai|
|Customer_19|  Chennai|
+-----------+---------+
only showing top 20 rows
4. Filter rows
customer_df.filter(customer_df.city=="Hyderabad").show()
customer_df.where(customer_df.city=="Hyderabad").show()
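filter and where are aliases and return the same result. A sketch of combining multiple conditions (column names taken from the customer schema above; the specific filter is illustrative):
%python
from pyspark.sql.functions import col
# Each condition must be parenthesized when combined with & (and) or | (or)
customer_df.filter((col("city") == "Hyderabad") & (col("is_active") == True)).show(5)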
5. Adding or Replacing a Column
The withColumn method is used to create a new column or replace an existing column in a DataFrame. It takes the new column's name and a Column expression defining its values:
df.withColumn("column_name", expression)
%python
from pyspark.sql.functions import col, concat, lit
# Adding a "full name" column (reconstructed from the output shown below, where names appear as "Customer_N Singh")
df_with_new_column = customer_df.withColumn("full name", concat(col("name"), lit(" Singh")))
withColumnRenamed
The withColumnRenamed method is used to rename a single column in a DataFrame.
%python
# Example: Renaming a column
df_renamed_column = df_with_new_column.withColumnRenamed("full name", "Full Name")
6. Dropping a Column
The drop method is used to remove one or more columns from a DataFrame.
%python
# Dropping multiple columns
df_dropped_columns = df_renamed_column.drop("name", "country")
df_dropped_columns: pyspark.sql.dataframe.DataFrame = [customer_id: integer, city: string ... 4 more fields]
df_dropped_columns.show()
+-----------+---------+-----------+-----------------+---------+-----------------+
|customer_id|     city|      state|registration_date|is_active|        Full Name|
+-----------+---------+-----------+-----------------+---------+-----------------+
|          0|     Pune|Maharashtra|       2023-01-19|     true| Customer_0 Singh|
|          1|     Pune|West Bengal|       2023-08-10|     true| Customer_1 Singh|
|          2|    Delhi|Maharashtra|       2023-08-05|     true| Customer_2 Singh|
|          3|   Mumbai|  Telangana|       2023-06-04|     true| Customer_3 Singh|
|          4|    Delhi|  Karnataka|       2023-03-15|    false| Customer_4 Singh|
|          5|  Kolkata|West Bengal|       2023-08-19|     true| Customer_5 Singh|
|          6|  Kolkata| Tamil Nadu|       2023-04-21|    false| Customer_6 Singh|
|          7|   Mumbai|  Telangana|       2023-05-23|     true| Customer_7 Singh|
|          8|     Pune| Tamil Nadu|       2023-07-17|     true| Customer_8 Singh|
|          9|    Delhi|  Karnataka|       2023-06-02|     true| Customer_9 Singh|
|         10|Hyderabad|      Delhi|       2023-02-23|     true|Customer_10 Singh|
|         11|    Delhi|West Bengal|       2023-11-08|     true|Customer_11 Singh|
|         12|    Delhi|      Delhi|       2023-06-27|    false|Customer_12 Singh|
|         13|     Pune|Maharashtra|       2023-02-03|     true|Customer_13 Singh|
|         14|  Chennai|  Karnataka|       2023-04-06|     true|Customer_14 Singh|
|         15|Hyderabad|West Bengal|       2023-03-31|     true|Customer_15 Singh|
|         16|  Chennai|Maharashtra|       2023-04-26|     true|Customer_16 Singh|
|         17|     Pune|      Delhi|       2023-04-14|    false|Customer_17 Singh|
|         18|  Chennai|Maharashtra|       2023-02-04|    false|Customer_18 Singh|
|         19|  Chennai|  Karnataka|       2023-01-22|     true|Customer_19 Singh|
+-----------+---------+-----------+-----------------+---------+-----------------+
only showing top 20 rows
7. Removing Duplicate Rows
The distinct method returns a new DataFrame containing only the distinct rows.
%python
# Removing duplicate rows
df_distinct = df_renamed_column.distinct()
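distinct() compares entire rows. For deduplicating on a subset of columns, dropDuplicates accepts a list of column names; a small illustrative sketch:
%python
# Keep one row per (city, state) pair; other columns come from an arbitrary surviving row
df_dedup = df_renamed_column.dropDuplicates(["city", "state"])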
Aggregation
Will cover in detail tomorrow
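The code cell for the output below is missing from the export; a groupBy count of this form would produce it:
%python
# Count the number of customers in each city
customer_df.groupBy("city").count().show()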
+---------+------+
|     city| count|
+---------+------+
|Bangalore|661013|
|  Chennai|660249|
|   Mumbai|661241|
|Ahmedabad|660218|
|  Kolkata|660174|
|     Pune|660737|
|    Delhi|661025|
|Hyderabad|662281|
+---------+------+