This document summarizes the BroadcastHashJoin in Spark SQL. A BroadcastHashJoin performs an equi-join by broadcasting one of the relations (usually the smaller one) to every executor. The query runs as two jobs: one collects the build side and turns it into a broadcast hash relation, and the tasks of the other job perform the join by probing that relation with the join keys. Because the smaller relation is reused by every task, the larger side never needs to be shuffled, unlike a shuffled hash join.


Spark “join”

deep dive
Part 3
(DRAFT)
BroadcastHashJoin
Nikolay
Join us on Telegram: t.me/apache_spark
October 2020
Example

val df1 = spark.range(1, 1000000)
val df2 = spark.range(0, Integer.MAX_VALUE)

df2.join(df1, "id").explain(true)
Spark Jobs

[Spark UI screenshots: the query runs as two jobs, Job 0 and Job 1, each with its own DAG; one builds and broadcasts the relation, the other performs the join.]
== Parsed Logical Plan ==
'Join UsingJoin(Inner,List(id))
:- Range (0, 2147483647, step=1, splits=Some(8))
+- Range (1, 1000000, step=1, splits=Some(8))

== Analyzed Logical Plan ==
id: bigint
Project [id#2L]
+- Join Inner, (id#2L = id#0L)
:- Range (0, 2147483647, step=1, splits=Some(8))
+- Range (1, 1000000, step=1, splits=Some(8))

== Optimized Logical Plan ==
Project [id#2L]
+- Join Inner, (id#2L = id#0L)
:- Range (0, 2147483647, step=1, splits=Some(8))
+- Range (1, 1000000, step=1, splits=Some(8))

== Physical Plan ==
Project [id#2L]
+- BroadcastHashJoin [id#2L], [id#0L], Inner, BuildRight
:- Range (0, 2147483647, step=1, splits=8)
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
+- Range (1, 1000000, step=1, splits=8)
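
Why a broadcast join was chosen here: the build side (df1, roughly a million bigint ids, on the order of 8 MB) is presumably estimated to fall under spark.sql.autoBroadcastJoinThreshold, which defaults to 10 MB. A small sketch of inspecting and tuning that setting (the 50 MB value below is only illustrative):

spark.conf.get("spark.sql.autoBroadcastJoinThreshold")    // default: 10 MB

// Raise the threshold so larger build sides are still broadcast ...
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)

// ... or set it to -1 to disable automatic broadcast joins entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)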
/**
* Inner equi-join with another `DataFrame` using the given column.
*
* Different from other join functions, the join column will only appear once in the output,
* i.e. similar to SQL's `JOIN USING` syntax.
*
* {{{
* // Joining df1 and df2 using the column "user_id"
* df1.join(df2, "user_id")
* }}}
*
* @param right Right side of the join operation.
* @param usingColumn Name of the column to join on. This column must exist on both sides.
*
* @note If you perform a self-join using this function without aliasing the input
* `DataFrame`s, you will NOT be able to reference any columns after the join, since
* there is no way to disambiguate which side of the join you would like to reference.
*
* @group untypedrel
* @since 2.0.0
*/
def join(right: Dataset[_], usingColumn: String): DataFrame = {
join(right, Seq(usingColumn))
}
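
If the optimizer does not choose a broadcast join on its own, one can be requested explicitly with the broadcast hint; a minimal sketch reusing df1 and df2 from the example above:

import org.apache.spark.sql.functions.broadcast

// Hint that df1 should be the (broadcast) build side of the join.
df2.join(broadcast(df1), "id").explain(true)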
Spark Planner: the logical Join node is planned into a physical BroadcastHashJoinExec node.

case class Join(
  left: LogicalPlan,
  right: LogicalPlan,
  joinType: JoinType,
  condition: Option[Expression],
  hint: JoinHint)

case class BroadcastHashJoinExec(
  leftKeys: Seq[Expression],
  rightKeys: Seq[Expression],
  joinType: JoinType,
  buildSide: BuildSide,
  condition: Option[Expression],
  left: SparkPlan,
  right: SparkPlan)
BroadcastHashJoinExec

protected override def doExecute(): RDD[InternalRow] = {
  val numOutputRows = longMetric("numOutputRows")

  val broadcastRelation = buildPlan.executeBroadcast[HashedRelation]()

  streamedPlan.execute().mapPartitions { streamedIter =>
    val hashed = broadcastRelation.value.asReadOnlyCopy()
    TaskContext.get().taskMetrics().incPeakExecutionMemory(hashed.estimatedSize)
    join(streamedIter, hashed, numOutputRows)
  }
}
SparkPlan

Returns the result of this query as a broadcast variable by delegating to `doExecuteBroadcast` after preparations. Concrete implementations of SparkPlan should override `doExecuteBroadcast`.

final def executeBroadcast[T](): broadcast.Broadcast[T] = executeQuery {
  if (isCanonicalizedPlan) {
    throw new IllegalStateException("A canonicalized plan is not supposed to be executed.")
  }
  doExecuteBroadcast()
}
BroadcastExchangeExec

override protected[sql] def doExecuteBroadcast[T](): broadcast.Broadcast[T] = {
  try {
    relationFuture.get(timeout, TimeUnit.SECONDS).asInstanceOf[broadcast.Broadcast[T]]
  } catch {
    case ex: TimeoutException =>
      logError(s"Could not execute broadcast in $timeout secs.", ex)
      if (!relationFuture.isDone) {
        sparkContext.cancelJobGroup(runId.toString)
        relationFuture.cancel(true)
      }
      throw new SparkException(s"Could not execute broadcast in $timeout secs. " +
        s"You can increase the timeout for broadcasts via ${SQLConf.BROADCAST_TIMEOUT.key} or " +
        s"disable broadcast join by setting ${SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key} to -1",
        ex)
  }
}
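
The timeout in this error path is spark.sql.broadcastTimeout, which defaults to 300 seconds. A minimal sketch of raising it when the build side is slow to collect (1200 is just an example value):

// Give the driver up to 20 minutes to collect and broadcast the build side.
spark.conf.set("spark.sql.broadcastTimeout", 1200)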
override lazy val relationFuture: Future[broadcast.Broadcast[Any]] = {
  SQLExecution.withThreadLocalCaptured[broadcast.Broadcast[Any]](
    sqlContext.sparkSession, BroadcastExchangeExec.executionContext) {
    try {
      // Setup a job group here so later it may get cancelled by groupId if necessary
      sparkContext.setJobGroup(runId.toString, s"broadcast exchange (runId $runId)",
        interruptOnCancel = true)
      val beforeCollect = System.nanoTime()
      // Use executeCollect/executeCollectIterator to avoid conversion to Scala types
      val (numRows, input) = child.executeCollectIterator()
      if (numRows >= MAX_BROADCAST_TABLE_ROWS) {
        throw new SparkException(
          s"Cannot broadcast the table over $MAX_BROADCAST_TABLE_ROWS rows: $numRows rows")
      }

      val beforeBuild = System.nanoTime()
      longMetric("collectTime") += NANOSECONDS.toMillis(beforeBuild - beforeCollect)

      // Construct the relation.
      val relation = mode.transform(input, Some(numRows))

      val dataSize = relation match {
        case map: HashedRelation =>
          map.estimatedSize
        case arr: Array[InternalRow] =>
          arr.map(_.asInstanceOf[UnsafeRow].getSizeInBytes.toLong).sum
        case _ =>
          throw new SparkException("[BUG] BroadcastMode.transform returned unexpected " +
            s"type: ${relation.getClass.getName}")
      }

      longMetric("dataSize") += dataSize
      if (dataSize >= MAX_BROADCAST_TABLE_BYTES) {
        throw new SparkException(
          s"Cannot broadcast the table that is larger than 8GB: ${dataSize >> 30} GB")
      }

      val beforeBroadcast = System.nanoTime()
      longMetric("buildTime") += NANOSECONDS.toMillis(beforeBroadcast - beforeBuild)

      // Broadcast the relation
      val broadcasted = sparkContext.broadcast(relation)
      longMetric("broadcastTime") += NANOSECONDS.toMillis(
        System.nanoTime() - beforeBroadcast)
      val executionId = sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY)
      SQLMetrics.postDriverMetricUpdates(sparkContext, executionId, metrics.values.toSeq)
      promise.trySuccess(broadcasted)
      broadcasted
The idea

[Diagram: one job iterates the build side's partitions and collects the rows into an array on the driver; the broadcast relation is built from that array and broadcast to the executors; the other job then streams the large side and joins against the broadcast relation.]
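
The same idea can be sketched by hand outside the SQL engine: collect the small side once, broadcast it as a hash map, and let every task of the large side probe it locally, so the large side is never shuffled. This is only an illustration (the names small, largeRdd, hashed, joined are made up), not the actual HashedRelation code:

// Build side: a small, driver-resident collection.
val small = Seq((1, "a"), (2, "b"), (3, "c"))

// Streamed side: a large distributed dataset.
val largeRdd = spark.sparkContext.parallelize(1 to 1000000).map(id => (id, id * 10))

// "Job 0": materialize the build side as a hash map and broadcast it.
val hashed = spark.sparkContext.broadcast(small.toMap)

// "Job 1": each partition of the streamed side probes the broadcast map locally.
val joined = largeRdd.mapPartitions { iter =>
  iter.flatMap { case (id, value) =>
    hashed.value.get(id).map(label => (id, value, label))
  }
}

joined.take(3)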
BroadcastHashJoinExec

protected override def doExecute(): RDD[InternalRow] = {
  val numOutputRows = longMetric("numOutputRows")

  val broadcastRelation = buildPlan.executeBroadcast[HashedRelation]()

  streamedPlan.execute().mapPartitions { streamedIter =>
    val hashed = broadcastRelation.value.asReadOnlyCopy()
    TaskContext.get().taskMetrics().incPeakExecutionMemory(hashed.estimatedSize)
    join(streamedIter, hashed, numOutputRows)
  }
}
protected def join(
    streamedIter: Iterator[InternalRow],
    hashed: HashedRelation,
    numOutputRows: SQLMetric): Iterator[InternalRow] = {

  val joinedIter = joinType match {
    case _: InnerLike =>
      innerJoin(streamedIter, hashed)
    case LeftOuter | RightOuter =>
      outerJoin(streamedIter, hashed)
    case LeftSemi =>
      semiJoin(streamedIter, hashed)
    case LeftAnti =>
      antiJoin(streamedIter, hashed)
    case j: ExistenceJoin =>
      existenceJoin(streamedIter, hashed)
    case x =>
      throw new IllegalArgumentException(
        s"BroadcastHashJoin should not take $x as the JoinType")
  }
private def innerJoin(
streamIter: Iterator[InternalRow],
hashedRelation: HashedRelation): Iterator[InternalRow] = {
val joinRow = new JoinedRow
val joinKeys = streamSideKeyGenerator()
streamIter.flatMap { srow =>
joinRow.withLeft(srow)
val matches = hashedRelation.get(joinKeys(srow))
if (matches != null) {
matches.map(joinRow.withRight(_)).filter(boundCondition)
} else {
Seq.empty
}
}
}
private def outerJoin(
    streamedIter: Iterator[InternalRow],
    hashedRelation: HashedRelation): Iterator[InternalRow] = {
  val joinedRow = new JoinedRow()
  val keyGenerator = streamSideKeyGenerator()
  val nullRow = new GenericInternalRow(buildPlan.output.length)

  streamedIter.flatMap { currentRow =>
    val rowKey = keyGenerator(currentRow)
    joinedRow.withLeft(currentRow)
    val buildIter = hashedRelation.get(rowKey)
    new RowIterator {
      private var found = false
      override def advanceNext(): Boolean = {
        while (buildIter != null && buildIter.hasNext) {
          val nextBuildRow = buildIter.next()
          if (boundCondition(joinedRow.withRight(nextBuildRow))) {
            found = true
            return true
          }
        }
        if (!found) {
          joinedRow.withRight(nullRow)
          found = true
          return true
        }
        false
      }
      override def getRow: InternalRow = joinedRow
    }.toScala
  }
}
private def semiJoin(
streamIter: Iterator[InternalRow],
hashedRelation: HashedRelation): Iterator[InternalRow] = {
val joinKeys = streamSideKeyGenerator()
val joinedRow = new JoinedRow
streamIter.filter { current =>
val key = joinKeys(current)
lazy val buildIter = hashedRelation.get(key)
!key.anyNull && buildIter != null && (condition.isEmpty || buildIter.exists {
(row: InternalRow) => boundCondition(joinedRow(current, row))
})
}
}
A semijoin returns all rows from the left table where there are matching values in the right table, keeping just columns from the
left table.
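
At the DataFrame API level a semi join is requested with the "left_semi" join type, for example:

// Rows of df2 whose id also appears in df1; only df2's columns are kept.
df2.join(df1, Seq("id"), "left_semi")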
private def existenceJoin(
streamIter: Iterator[InternalRow],
hashedRelation: HashedRelation): Iterator[InternalRow] = {
val joinKeys = streamSideKeyGenerator()
val result = new GenericInternalRow(Array[Any](null))
val joinedRow = new JoinedRow
streamIter.map { current =>
val key = joinKeys(current)
lazy val buildIter = hashedRelation.get(key)
val exists = !key.anyNull && buildIter != null && (condition.isEmpty || buildIter.exists {
(row: InternalRow) => boundCondition(joinedRow(current, row))
})
result.setBoolean(0, exists)
joinedRow(current, result)
}
}
private def antiJoin(
streamIter: Iterator[InternalRow],
hashedRelation: HashedRelation): Iterator[InternalRow] = {
val joinKeys = streamSideKeyGenerator()
val joinedRow = new JoinedRow
streamIter.filter { current =>
val key = joinKeys(current)
lazy val buildIter = hashedRelation.get(key)
key.anyNull || buildIter == null || (condition.isDefined && !buildIter.exists {
row => boundCondition(joinedRow(current, row))
})
}
}
An antijoin returns all rows from the left table where there are no matching values in the right table, keeping just the columns from the left table.
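
Correspondingly, an anti join is requested with the "left_anti" join type:

// Rows of df2 whose id does not appear in df1; only df2's columns are kept.
df2.join(df1, Seq("id"), "left_anti")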
Summary
• Job 0 and Job 1
• BroadcastHashJoinExec
• innerJoin, outerJoin (flatMap)
• semiJoin, antiJoin (filter)
• existenceJoin (map)
