Databricks Running Notes

These notes provide an overview of narrow and wide transformations in Spark SQL. Narrow transformations such as select, filter, and withColumn operate on individual rows and don't cause shuffling, while wide transformations such as join, groupBy, and distinct operate on groups of rows and do cause shuffling.


Spark supports complex column types; here is how each maps to Python:

map : dictionary
"phone_numbers": {"mobile": "+1 234 567 8901", "home": "+1 234 567 8911"}

array : list
"phone_numbers": ["+1 234 567 8901", "+1 234 567 8911"]

struct : Row
from pyspark.sql import Row
"phone_numbers": Row(mobile="+1 234 567 8901", home="+1 234 567 8911")

---------------------------------------------------------------------------------

Let us get an overview of narrow and wide transformations.

Narrow transformations don't result in shuffling. They are also known as row-level transformations. Here are the functions related to narrow transformations:
df.select
df.filter
df.withColumn
df.withColumnRenamed
df.drop

Here are the functions related to wide transformations:
df.distinct
df.union or any set operation
df.join or any join operation
df.groupBy
df.sort or df.orderBy

Any function that results in shuffling is a wide transformation. For all the wide transformations, we have to deal with groups of records based on a key.
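
The difference shows up in the physical plan: a wide transformation adds an Exchange (shuffle) step, while a narrow one does not. A minimal sketch, assuming the users_df used throughout these notes:

# Narrow: filter is applied row by row, so no Exchange appears in the plan.
users_df.filter("id > 100").explain()

# Wide: groupBy has to bring rows with the same key together,
# so the plan includes an Exchange (shuffle) step.
users_df.groupBy("first_name").count().explain()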

---------------------------------------------------------------------------------
For select we need to import col, concat and lit:

from pyspark.sql.functions import concat, lit, col

users_df. \
    select(
        'id', 'first_name', 'last_name',
        concat(col('first_name'), lit(', '), col('last_name')).alias('full_name')
    ). \
    show()

selectExpr, on the other hand, takes SQL-like expressions. Note that here we don't use the alias function; instead we use the AS keyword, same as in SQL:

users_df.selectExpr(
    'id', 'first_name', 'last_name',
    "concat(first_name, ', ', last_name) AS full_name"
).show()

To use a SQL query, first register the DataFrame as a temporary view. The view is available only within the current Spark session:

users_df.createOrReplaceTempView('users')

spark.sql("""
SELECT id, first_name, last_name,
concat(first_name, ', ', last_name) AS full_name
FROM users
"""). \
show()
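
A quick way to confirm the view was registered in the current session is the spark.catalog API (a minimal sketch; 'users' should show up with isTemporary=True):

# List tables/views visible to this session.
for table in spark.catalog.listTables():
    print(table.name, table.isTemporary)

# Drop the temp view once it is no longer needed.
spark.catalog.dropTempView('users')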
---------------------------------------------------------------------------------
06 Referring to Columns using Spark DataFrame Names

1)
users_df['id']
Passing the column name in square brackets returns a Column object:
Out[7]: Column<'id'>

2)
We can also import col and use it:
from pyspark.sql.functions import col
col('id')
Out[10]: Column<'id'>

3)
You can check the type of the object using type():
type(users_df['id'])
pyspark.sql.column.Column

4)
from pyspark.sql.functions import col
users_df.select('id', col('first_name'), 'last_name').show()
Plain strings and col('...') references can be mixed in select; note that col takes the column name as a quoted string.

5)
You can also refer to a column through the DataFrame name:
users_df.select(users_df['id'], col('first_name'), 'last_name').show()

Note that users_df['id', 'email'] does not return two Column objects; passing a list or tuple in brackets returns a new DataFrame (it is equivalent to a select), so each column has to be referenced individually:
users_df.select(users_df['id'], users_df['email'], col('first_name'), 'last_name').show()
