Slowly Changing Dimensions (SCD) in Databricks with PySpark and Delta Lake - By Shwetank Singh (GSGLearn.com)
Introduction
In data warehousing, Slowly Changing Dimensions (SCD) are techniques for managing dimensional data that changes slowly (e.g., customer addresses, product names) while preserving historical accuracy. Instead of simply overwriting data, SCD methods let us decide how to handle changes: keep old values, add new records, or retain limited history. Databricks, with PySpark and Delta Lake, provides an efficient platform for implementing SCDs. Delta Lake brings ACID transactions and upsert capabilities (via MERGE), making it easier to handle dimension changes in one unified pipeline. The commonly used SCD types are:
Type 0 -- Fixed Dimension: No changes are applied once recorded. The original value remains as-is (retain initial data).
Type 1 -- Overwrite: Always update to the latest value. No history is kept; old data is lost.
Type 2 -- Add New Row: Add a new row/version for each change, preserving full history. Typically uses effective dates or flags to identify current vs. historical records.
Type 3 -- Add New Column: Keep a limited history by adding additional columns to store previous values (e.g., "Previous Value"). Only a few versions (usually one prior) are retained.
Type 4 -- History Table: Keep only the current data in the main dimension and record changes in a separate history table.
Type 6 -- Hybrid: Combine Type 1, 2, and 3 techniques in one dimension. For example, a Type 2 dimension that also stores some overwrite-able current values or a "previous value" column (to enable both historical and current analysis in one place). Often implemented by adding both a history-tracking column and a current-value indicator to a Type 2 design.
Next, we'll set up a sample dimension dataset to illustrate how each SCD type is
implemented using PySpark and Delta Lake.
CustomerID -- the natural key for the customer (e.g., from source system).
Name -- customer name.
City -- customer city (this is an attribute that might change over time).
SignUpDate -- the date the customer joined (an attribute that does not change
after initial entry).
For example, our initial dimension table (current state) contains two customers: Alice (CustomerID 1, New York, signed up 2020-01-15) and Bob (Los Angeles).
Now suppose we receive an update: Customer Alice (ID=1) has moved from New York to
Chicago as of 2021-05-10. We will demonstrate how each SCD type handles this change.
In PySpark, we'll simulate the initial dimension as a Delta table and the incoming
change as a source DataFrame.
First, let's create the initial Delta table for our dimension and the new data for the
change:
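A minimal setup sketch follows; the storage path and Bob's sign-up date are illustrative assumptions, and updates_df mirrors the change described above.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

dim_path = "/tmp/delta/dim_customer"   # hypothetical storage path for the dimension

# Initial dimension data (Bob's SignUpDate is an illustrative value)
dim_df = spark.createDataFrame(
    [
        (1, "Alice", "New York", "2020-01-15"),
        (2, "Bob", "Los Angeles", "2019-11-20"),
    ],
    ["CustomerID", "Name", "City", "SignUpDate"],
)
dim_df.write.format("delta").mode("overwrite").save(dim_path)

# Incoming change: Alice (CustomerID=1) moved to Chicago effective 2021-05-10
updates_df = spark.createDataFrame(
    [(1, "Alice", "Chicago", "2020-01-15")],
    ["CustomerID", "Name", "City", "SignUpDate"],
)
```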
SCD Type 0 -- Fixed Dimension
Type 0 is used for attributes that should never change or where changes are ignored. For
example, SignUpDate might be treated as Type 0 because once set, it should not be
altered. If a change comes in for a Type 0 attribute, it is simply disregarded.
Implementation: In a Type 0 scenario, you do not perform any update on the attribute.
The Delta Lake table is only initially populated, and further changes to Type 0 fields
are ignored in the ETL logic. The code for Type 0 is trivial: you load the initial
data and then do nothing for that field on updates.
```python
# Initial data was already loaded above; no update code is needed for Type 0 fields.
# If an update DataFrame contains changes to a Type 0 field, we simply exclude
# that field from the merge or update.
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, dim_path)
```
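A minimal merge sketch for this pattern, reusing the updates_df from the setup, might look like the following; SignUpDate is deliberately left out of the update set:

```python
from pyspark.sql import functions as F

deltaTable.alias("tgt").merge(
    updates_df.alias("src"),
    "tgt.CustomerID = src.CustomerID"
).whenMatchedUpdate(set={
    "Name": F.col("src.Name"),
    "City": F.col("src.City")
    # SignUpDate is intentionally omitted: Type 0 attributes are never updated
}).whenNotMatchedInsertAll().execute()
```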
In the above merge, if a matching customer is found, we update the Name and City but leave SignUpDate untouched (it is not listed in the set clause, so it remains as in tgt). Thus, any incoming change to SignUpDate would be ignored. In practice, you
might also simply not allow SignUpDate to appear in the source updates for existing
customers at all. After the merge, for Customer 1, SignUpDate stays "2020-01-15"
(original) even if the source had a different value.
Type 0 is the simplest SCD type -- no history is tracked because the attribute never
changes by design. This approach is suitable for immutable attributes like an original
registration date or an initial category that should never be modified.
SCD Type 1 -- Overwrite
In a Type 1 dimension, the new value simply overwrites the old one, so no history is kept. This method is used when you don't need history, only the current truth (for example, correcting a spelling mistake in a name, or updating a phone number where only the latest contact matters).
Behavior: Using our example, under Type 1, when Alice's city changes to Chicago, we
simply update her City from "New York" to "Chicago" in the customer dimension, and we
do not keep the old city anywhere in the dimension. After the update, the dimension
will show Alice as living in Chicago, and there's no trace in the dimension table that
she ever lived in New York.
Implementation in Delta Lake: We can use Delta Lake's MERGE operation to perform this upsert (update/insert) efficiently. The merge will find the matching dimension row by CustomerID and update the City. If the incoming record were new (no match), it would be inserted as a new customer. Delta Lake handles this in one transaction (atomic), which is simpler and safer than manual upsert logic.
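A minimal sketch of such a merge, again reusing updates_df from the setup, might be:

```python
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, dim_path)

(deltaTable.alias("tgt")
    .merge(updates_df.alias("src"), "tgt.CustomerID = src.CustomerID")
    .whenMatchedUpdateAll()      # overwrite all columns of matched rows (pure Type 1)
    .whenNotMatchedInsertAll()   # insert brand-new customers as-is
    .execute())
```

If some attributes must be protected from overwrites (as with SignUpDate in the Type 0 example), use whenMatchedUpdate with an explicit set of columns instead of whenMatchedUpdateAll.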
After this merge, the Delta table is updated in-place. In our scenario, Customer 1's
City would now be "Chicago". If we query the dimension table:
`spark.read.format("delta").load(dim_path).show()`
We would see Alice's city as Chicago, while Bob remains in Los Angeles. The old city "New York" is no longer in the table. Type 1 is straightforward and useful when historical
reporting is not required. Its advantages are simplicity and low storage usage (since
we don't add rows), but the disadvantage is obvious: loss of history -- we cannot see
past values after updates.
SCD Type 2 -- Add New Row
In a Type 2 dimension, when a change occurs, a new row is added to the dimension table with the updated data,
and the old row is marked as historical (no longer current). This allows unlimited
history tracking -- every change results in a new version in the dimension. To manage
these versions, Type 2 dimensions use additional fields, typically:
A surrogate key (unique primary key for each dimension row version).
Valid-from and valid-to dates (or start and end timestamps) to indicate the
period during which each version was active.
A current flag (boolean or indicator) to easily filter the current record per
natural key.
With these, one can join fact data to the dimension version that was current at the
fact's date, enabling historically accurate analysis.
Behavior: In our example, under Type 2, when Alice moves to Chicago, we insert a new
record for Alice with the new city. The existing Alice record (with New York) is
marked as no longer current (for instance, set IsCurrent = false and EndDate = '2021-05-09', the day before the change became effective, since the change effective date is 2021-05-10). The new record for Alice in Chicago gets IsCurrent = true and a StartDate = 2021-05-10. We also assign a new surrogate key to this new
record (if using surrogate keys). All new facts after May 10, 2021 will now point to
the new surrogate key, thus linking to "Alice in Chicago". Facts before that date
remain linked to the old surrogate key ("Alice in New York")
. This way, queries can correctly aggregate data by the city Alice was in at the time
of each fact.
We will use the merge on condition CustomerID and IsCurrent=true (to target only the
current version of each customer). The logic will be:
When matched (i.e., the current record exists) and a change in City is
detected, update the existing record to set EndDate and IsCurrent=false .
When not matched (no current record for that CustomerID, or after updating the
old one), insert the new record with the updated City, StartDate as the change
date, EndDate = NULL (or a future end like 9999-12-31), and IsCurrent=true .
```python
# Ensure the Delta table already has the SCD Type 2 columns (StartDate, EndDate,
# IsCurrent), e.g. added beforehand via ALTER TABLE ... ADD COLUMNS or schema
# evolution (STRING/BOOLEAN types are assumed here for simplicity).
from delta.tables import DeltaTable
from pyspark.sql import functions as F
from pyspark.sql.functions import lit

deltaTable = DeltaTable.forPath(spark, dim_path)

# Adjust the existing records to fit the Type 2 structure (initial load adjustments).
deltaTable.update(
    condition="true",                       # update all rows
    set={
        "StartDate": F.col("SignUpDate"),   # use SignUpDate as the start date for the initial load
        "EndDate": lit(None),               # no end date yet (current record)
        "IsCurrent": lit(True)              # mark all initial records as current
    }
)
```
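With the table prepared, the merge itself can stage the source so that a changed customer both expires its old row and inserts a new one. A sketch, reusing updates_df and hardcoding the effective dates for clarity:

```python
from pyspark.sql import functions as F
from pyspark.sql.functions import lit

current_df = deltaTable.toDF()

# Rows whose City actually changed become the new versions to insert.
# A NULL mergeKey guarantees they never match an existing row in the merge.
new_versions = (updates_df.alias("u")
    .join(current_df.alias("c"),
          (F.col("u.CustomerID") == F.col("c.CustomerID")) & F.col("c.IsCurrent"))
    .where(F.col("u.City") != F.col("c.City"))
    .selectExpr("CAST(NULL AS INT) AS mergeKey", "u.*"))

# All incoming rows keyed on CustomerID: these expire the matching current rows
# (and brand-new customers, having no match, are simply inserted).
keyed_updates = updates_df.selectExpr("CustomerID AS mergeKey", "*")

staged_updates = keyed_updates.unionByName(new_versions)

deltaTable.alias("tgt").merge(
    staged_updates.alias("src"),
    "tgt.CustomerID = src.mergeKey AND tgt.IsCurrent = true"
).whenMatchedUpdate(
    condition="tgt.City <> src.City",
    set={
        "EndDate": lit("2021-05-09"),   # day before the change became effective
        "IsCurrent": lit(False)
    }
).whenNotMatchedInsert(values={
    "CustomerID": F.col("src.CustomerID"),
    "Name": F.col("src.Name"),
    "City": F.col("src.City"),
    "SignUpDate": F.col("src.SignUpDate"),
    "StartDate": lit("2021-05-10"),     # effective date of the change
    "EndDate": lit(None),
    "IsCurrent": lit(True)
}).execute()
```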
(In a real ETL, you might use current_date() or a timestamp from the source to set
Start/End dates. Here we hardcoded "2021-05-10" for clarity.)
After this merge, our Delta dimension table has two records for CustomerID=1: the original New York row (StartDate = 2020-01-15, EndDate = 2021-05-09, IsCurrent = false) and the new Chicago row (StartDate = 2021-05-10, EndDate = null, IsCurrent = true). We have effectively preserved history: if we query the table for Customer 1, we see both the old and new city with their date ranges. Querying only current customers is easy (WHERE IsCurrent = true). To find what city Alice lived in on, e.g., 2020-12-01, we can filter for StartDate <= date AND (EndDate >= date OR EndDate IS NULL).
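For example, a point-in-time lookup for that date might look like this (dates are stored as ISO strings in this sketch, so lexicographic comparison works):

```python
from pyspark.sql import functions as F

as_of = "2020-12-01"
dim = spark.read.format("delta").load(dim_path)

dim.filter(
    (F.col("CustomerID") == 1) &
    (F.col("StartDate") <= as_of) &
    ((F.col("EndDate") >= as_of) | F.col("EndDate").isNull())
).show()   # returns the New York version that was current on 2020-12-01
```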
Delta Lake's ACID capabilities ensure this change (update + insert) happens
atomically. Also, Delta supports Time Travel, so even without SCD columns, one could
query older table versions. However, SCD Type 2 stores history as data, making it
straightforward to join with fact tables on surrogate keys and analyze historical data
at scale.
Note: Each Type 2 change should generate a new surrogate key in a dimensional model.
In our code above, we used CustomerID as the key just for illustration. In practice,
you might have a surrogate key column (e.g., CustomerKey ) that is unique per row.
Delta Lake can auto-generate surrogate keys or you handle it in the ETL (e.g., by
keeping a sequence or using identity columns in Delta if available).
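For instance, on Databricks a surrogate key can be generated with a Delta identity column; a sketch of such a DDL (the table name here is hypothetical):

```python
spark.sql("""
    CREATE TABLE IF NOT EXISTS dim_customer_scd2 (
        CustomerKey BIGINT GENERATED ALWAYS AS IDENTITY,  -- surrogate key
        CustomerID  INT,                                   -- natural key
        Name        STRING,
        City        STRING,
        SignUpDate  STRING,
        StartDate   STRING,
        EndDate     STRING,
        IsCurrent   BOOLEAN
    ) USING DELTA
""")
```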
SCD Type 3 -- Add New Column
In a Type 3 dimension, instead of adding new rows on changes, we add (one-time) new columns such as
"PreviousCity". When a change occurs, we shift the current value into the "previous"
column and then update the current column with the new value. This way, we retain the
old value alongside the new one in the same row. However, this only keeps track of a
fixed number of changes (often just one prior value). If the attribute changes again,
you would lose the oldest value unless you had multiple columns (PreviousCity2, etc.),
which gets unwieldy. Type 3 is useful when you only care about comparing the current
value to the immediate previous value (for example, current department and previous
department of an employee).
Behavior: In our example, to implement Type 3 for the City attribute, we would alter
the Customer dimension to have a new column, say PreviousCity . Initially, this might
be null or same as current for the first load. When Alice's city changes from New York
to Chicago, we update Alice's single dimension row: set PreviousCity = "New York"
(the old city), and set City = "Chicago" (the new city). We do not create a new row.
The dimension now tells us Alice's current city and her previous city. If another
change comes (say Alice moves to Boston later), a pure Type 3 design would then lose
the fact she was in New York: we'd set PreviousCity to Chicago and City to Boston (New
York would be gone). Thus, only one historical step is kept.
Implementation: For Type 3, since we only update a single row (no insert of new rows),
the implementation is similar to Type 1 but with an additional column update. We need
to update two columns at once: the current value and the "previous" value. Delta Lake
supports this with UPDATE or MERGE operations. We'll assume our dimension now has a
PreviousCity column (added beforehand, default null or could be filled with initial
City for convenience).
```python
# Add a PreviousCity column to the Delta table (for demo, do it via a DataFrame
# transformation; in a real scenario, use Delta's schema evolution or DDL).
from delta.tables import DeltaTable
from pyspark.sql.functions import lit

df_dim = deltaTable.toDF()

# Initialize PreviousCity as null for all rows (cast so the column has a proper type)
df_dim = df_dim.withColumn("PreviousCity", lit(None).cast("string"))

# Overwrite the Delta table with the new schema
df_dim.write.format("delta").mode("overwrite") \
    .option("overwriteSchema", "true").save(dim_path)
deltaTable = DeltaTable.forPath(spark, dim_path)
```
Now perform the Type 3 update for the incoming change (Alice's city):
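One way to express this is a merge whose assignments read the pre-update row, so PreviousCity receives the old City (a sketch reusing updates_df):

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

deltaTable = DeltaTable.forPath(spark, dim_path)

deltaTable.alias("tgt").merge(
    updates_df.alias("src"),
    "tgt.CustomerID = src.CustomerID"
).whenMatchedUpdate(
    condition="tgt.City <> src.City",        # only act when the city actually changed
    set={
        "PreviousCity": F.col("tgt.City"),   # shift the old value into the history column
        "City": F.col("src.City")            # then overwrite with the new value
    }
).execute()
```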
After this merge, the dimension table still has one record per customer. For Customer 1 (Alice), the row now shows City = "Chicago" and PreviousCity = "New York".
Bob's record remains the same (PreviousCity is null or not applicable). We have
effectively kept Alice's old city in the PreviousCity column. If we needed to know
the last city she lived in before the current one, it's directly available.
Type 3 is beneficial for tracking one prior state without the complexity of multiple
records. It's easy to query (no need to join or filter for current vs historical---
both values are in one row). If requirements grow to track more changes, Type 3 falls short. In practice, Type 3
is used sparingly (for example, to track a previous customer segment or last job
title, while full history might not be needed).
SCD Type 4 -- History Table
In a Type 4 design, history is kept separately: the main dimension holds only the current values, while each change is recorded in a dedicated history table.
Behavior: In our example, if City were highly volatile and we chose a Type 4 approach, we would have:
DimCustomer_Current (the main dimension) containing the latest city for each
customer (and no other city history). E.g., after Alice moves, this table would
simply show Alice in Chicago.
DimCustomer_History (the historical table) containing records of changes,
possibly with columns CustomerID, OldCity, NewCity, ChangeDate (or a structure
similar to Type 2 with StartDate/EndDate for each city version). Each time a
customer's city changes, we would insert a record into the history table
capturing that change, and also update the main table. For instance, when Alice
moves to Chicago, we'd record a history entry: (CustomerID=1, City was "New
York", changed to "Chicago", ChangeDate=2021-05-10). The main table is then
updated to "Chicago". If another move happens, another history row is added
(previous "Chicago" to new city, etc.), and main table updated again.
1. Insert into History Table: Write a record of the old state (and new state, or
just old state with effective dates) into the historical Delta table.
2. Update Main Table: Overwrite the main dimension's current value with the new
value (Type 1 update on main).
We'll assume we have two Delta tables: dim_customer_current at dim_path (main table)
and dim_customer_history at history_path (history table). The history table could
have columns like CustomerID, City, StartDate, EndDate or a pair of City columns to
indicate change from -> to. Here, for simplicity, we log each change as a separate row
with the old city and the date it ended.
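A sketch of step 1 (the history-table insert), assuming the updates_df from the initial setup and a history_path variable pointing at the history table:

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F
from pyspark.sql.functions import lit

currentTable = DeltaTable.forPath(spark, dim_path)       # main (current-only) dimension

# 1. Insert into the history table: log each replaced value with the date it ended.
history_rows = (updates_df.alias("src")
    .join(currentTable.toDF().alias("tgt"), "CustomerID")
    .where(F.col("tgt.City") != F.col("src.City"))
    .select(
        F.col("CustomerID"),
        F.col("tgt.City").alias("OldCity"),
        lit("2021-05-10").alias("EndDate")               # change date, hardcoded for clarity
    ))

history_rows.write.format("delta").mode("append").save(history_path)
```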
```python
# 2. Update the main (current) table with the new city -- a Type 1 update on the main dimension
for row in updates_df.collect():
    currentTable.update(
        condition=f"CustomerID = {row['CustomerID']}",
        set={"City": F.lit(row["City"])}
    )
```
Putting the two steps together, we determined that Customer 1's old city was "New York" and the new city is "Chicago", appended a row to the history table (CustomerID=1, OldCity="New York", EndDate="2021-05-10") recording that New York was Alice's city up until that date, and then updated the main table so Alice's City becomes "Chicago".
Type 4 keeps the main dimension very compact (only current data). Historical analysis
is a bit more involved since you'd have to consult the history table (or combine it
with current for a full view). In some cases, Type 4 is implemented by designing a
mini-dimension for a frequently changing subset of attributes and linking that mini-
dimension to the fact table separately. But the essence is separating current vs
history into two locations.
With Delta Lake, both tables can be maintained easily. The history table can grow
large, but since it's not joined to facts on every query (only when historical
reporting is needed), this can mitigate performance issues of a huge Type 2 dimension.
SCD Type 6 -- Hybrid
The goal of Type 6 is to enjoy the benefits of full history (Type 2) while also easily
accessing current values and perhaps a previous value in the same record (Type 1 and
Type 3). In practice, a Type 6 dimension is often a Type 2 dimension that includes
additional columns for current values or previous values that are updated along with
new row inserts.
For example, you might have a Type 2 dimension with StartDate , EndDate , IsCurrent ,
and also a column PreviousCity (Type 3 style) and possibly even a duplicate of the
current attribute that gets overwritten across all histories (Type 1 style for a
particular attribute). This sounds complex, but it provides flexibility: one can
analyze facts by the attribute value that was in effect at the time (using the Type 2
historical records), or by the current value of that attribute (using a Type 1 style
attribute that was overwritten on all records, or via a separate reference). Kimball
describes Type 6 as "Add Type 1 attributes to a Type 2 dimension" -- essentially
storing a continuously updated current value in each historical record for easy
comparison.
Behavior: Suppose we want to design a Customer dimension that allows analysts to query
facts by both the city at the time of the fact and the customer's current city without
complex joins. We could implement Type 6 as follows:
Maintain Type 2 history of City (each change = new row, with Start/End dates,
etc.).
Add a CurrentCity column in the dimension. When a customer's city changes, we
create a new row (Type 2) and also update all rows of that customer to set
CurrentCity = new city (that's a Type 1 aspect applied to the CurrentCity
column across historical rows). Additionally, we might keep a PreviousCity
column in the new row as in Type 3 (the last city before the change).
After such an update: the new row will show the new City, the previous city in
PreviousCity , and its CurrentCity (which might be same as City for the new row). All
older rows for that customer would have their CurrentCity updated to the latest city
as well, even though their own City field (historical) remains unchanged. This way,
any row in the dimension can tell you "what is the current city of this customer"
(even for old records), enabling comparisons like "was the sales made when Alice lived
in New York higher or lower than her sales now that her current city is Chicago?"
without a self-join. Type 6 essentially embeds a point-in-time value and the current
value in each record.
So, our dimension structure for Type 6 might be: CustomerID, Name, City, PreviousCity, StartDate, EndDate, IsCurrent. The process for a change:
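Step 1 can be a merge that expires the currently active row (a sketch, reusing updates_df and hardcoding the dates as in the earlier examples):

```python
from delta.tables import DeltaTable
from pyspark.sql.functions import lit

deltaTable = DeltaTable.forPath(spark, dim_path)

# 1. Expire the currently active record for the changed customer (Type 2 part).
deltaTable.alias("tgt").merge(
    updates_df.alias("src"),
    "tgt.CustomerID = src.CustomerID AND tgt.IsCurrent = true"
).whenMatchedUpdate(
    condition="tgt.City <> src.City",
    set={
        "EndDate": lit("2021-05-09"),   # day before the change takes effect
        "IsCurrent": lit(False)
    }
).execute()
```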
```python
# 2. Insert the new record with PreviousCity (Type 3 part) using the old value.
# Get the old city value from the previously current record (now expired) to use as PreviousCity.
old_city_df = spark.read.format("delta").load(dim_path).filter(
    "CustomerID = 1 AND IsCurrent = false AND EndDate = '2021-05-09'"
).select(F.col("City").alias("OldCity")).limit(1)
old_city = old_city_df.collect()[0]["OldCity"]   # "New York"

# Build the new current row (columns as listed in the Type 6 structure above)
new_row = spark.createDataFrame(
    [(1, "Alice", "Chicago", old_city, "2021-05-10", None, True)],
    schema="CustomerID INT, Name STRING, City STRING, PreviousCity STRING, "
           "StartDate STRING, EndDate STRING, IsCurrent BOOLEAN",
)
new_row.write.format("delta").mode("append").save(dim_path)
```
In step 1, we used a merge to set the existing Alice record IsCurrent=false and
EndDate=2021-05-09 . In step 2, we built a new record for Alice in Chicago: we looked
up her OldCity (New York) from the now-expired record and set that as PreviousCity
for the new row, with StartDate=2021-05-10. We then appended this new row, so the dimension again has two rows for Alice: the expired New York row and the new current Chicago row.
This is very similar to the pure Type 2 result, except we have a PreviousCity on the
new row explicitly storing "New York". If we had also maintained a CurrentCity in
every row and updated it across the board, that would complete a classic Type 6 (so
that even the first row might have CurrentCity=Chicago after the change, reflecting
the latest info). That could be done with an additional update if needed.
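For instance, assuming a CurrentCity column had been added to the table beforehand, that cross-row refresh could be a small update like this:

```python
from pyspark.sql.functions import lit

# Type 1 style refresh: every historical row for this customer reflects the latest city.
deltaTable.update(
    condition="CustomerID = 1",
    set={"CurrentCity": lit("Chicago")}
)
```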
Type 6 (and other hybrid strategies) provide more analytic options at the cost of more
complexity.
The Kimball Group has also defined specific hybrid types beyond Type 6: for example,
Type 5 (Type 4 + Type 1 outrigger) and Type 7 (dual Type 1 and Type 2 dimensions).
These are advanced designs to handle very specific reporting needs (like allowing easy
access to current values alongside historical ones through separate dimension links).
In practice, these are less common and can usually be achieved via simpler means or
additional columns.
In summary, hybrid SCD implementations give more flexibility for analysis -- you can
choose to preserve history for some attributes and not for others, or provide multiple
ways to view data (historical vs current) in your dimensional model. Delta Lake's
support for complex merge conditions and transactions makes implementing even complex
hybrids feasible. However, it's wise to balance complexity with actual business needs;
sometimes a straightforward Type 2 with some Type 1 columns is enough, rather than
implementing every hybrid twist.
Practical Example: Hybrid SCD for a Product Dimension
Scenario: A company maintains a Product dimension in which category changes must be tracked historically, while price and name updates only need to reflect the latest values; changes arrive in a nightly batch.
Solution: They implement the product dimension as a Delta table with SCD Type 2
structure (surrogate key, effective dates, etc.) for category changes. When a product
changes category, a new dimension row is inserted (new surrogate key, new category,
current flag true) and the old row is expired -- preserving the old category history.
However, for price updates, instead of adding new rows, they simply overwrite the
Price in the current dimension record (Type 1 behavior), since historical prices
aren't required in the dimension. In other words, the ETL logic checks which attributes changed and handles each case accordingly (see the merge outline below).
They also maintain a column IsCurrent and dates in the dimension for Category. For
querying current data, they simply filter IsCurrent=true . For historical analysis,
they join facts on product surrogate keys with the appropriate effective date range
(or use the surrogate directly if the fact stores the surrogate foreign key at time of
transaction).
Delta Lake Implementation: Using PySpark, they leverage Delta Lake merges to handle
this logic. A simplified outline of their merge logic could be:
```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F
from pyspark.sql.functions import lit

# productTable is the Delta product dimension; changes is the incoming DataFrame
# of product updates (ProductID, Name, Category, Price, ChangeDate).
productTable.alias("tgt").merge(
    changes.alias("src"),
    "tgt.ProductID = src.ProductID AND tgt.IsCurrent = true"
).whenMatchedUpdate(condition="src.Category = tgt.Category", set={
    # Category didn't change, so just update current fields (Type 1 for Price and Name)
    "Price": F.col("src.Price"),
    "Name": F.col("src.Name")
}).whenMatchedUpdate(condition="src.Category <> tgt.Category", set={
    # Category changed -> expire the old record (Type 2)
    "EndDate": F.col("src.ChangeDate"),
    "IsCurrent": lit(False)
}).whenNotMatchedInsert(values={
    "ProductID": F.col("src.ProductID"),
    "Name": F.col("src.Name"),
    "Category": F.col("src.Category"),
    "Price": F.col("src.Price"),
    "StartDate": F.col("src.ChangeDate"),
    "EndDate": lit(None),
    "IsCurrent": lit(True)
}).execute()
```
If the categories are the same (only price or name might have changed), it does a simple update in place (a Type 1 update for those attributes).
If the category changed, it updates the existing row to mark it as not current (end-dating it), and the insert clause adds the new product row with the new category and current price; as with the Type 2 pattern shown earlier, the changed rows are typically staged with a NULL merge key so they also reach the insert clause. (We used two whenMatchedUpdate clauses with different conditions; Delta's extended merge syntax allows multiple whenMatched clauses, evaluated in order.)
Outcome: This hybrid SCD implementation allows the company to query the Product
dimension in two ways. If they join facts to the product dimension on the surrogate
key used at the time of the sale, they'll get the category as it was at that time
(because the fact would link to the correct historical row by surrogate key). If they
want the current category of products for all sales (for example, reclassify
historical sales under the new categories), they could join on ProductID and pick the
dimension row with IsCurrent=true (or maintain a separate current dimension table if
needed, akin to Type 4 for category). With Delta Lake handling the merges, the nightly
update process is robust --- it can insert and update in one job, and the ACID
guarantees mean analysts won't see partial updates. This approach has been running in
production, and thanks to Delta Lake's performance, even thousands of product changes
per day are processed efficiently (no full table scans or overwrites --- just merges
on the keys and relevant partitions).
Conclusion
In this document, I've explored various Slowly Changing Dimension types
and how to implement them using PySpark and Delta Lake on Databricks. Each SCD type
serves a purpose, from never changing (Type 0) to full audit history (Type 2) to
hybrid combinations for complex needs. Delta Lake's ability to perform upserts and
maintain data versioning makes it an ideal choice for handling these dimension changes
efficiently. By choosing the right SCD strategy (or combination) for your dimensions,
you can ensure your data warehouse preserves the necessary history and provides
correct data for all time periods with optimal performance.