Dataset materialization is a powerful feature that can have benefits in both performance and compute cost, but it can also be used to help track recent changes made to a table or dataset. In practice, your data team may implement Change Data Capture (CDC) which can help track when changes are made to a table by using columns like CREATED_AT and UPDATED_AT. If you are not tracking these real-time changes today and you are looking for a quick and easy way to check for recent changes made to a table or dataset, this blog post will go over how you can do this using dataset materialization.
Essentially, all we have to do is…
- Create a materialized dataset
- Create a live copy of that same dataset
- Perform a full outer join on the columns we want to check for changes between the two datasets. This will create an output with meaningful row duplication signifying deleted records, added records, and updated records that we can clean up to show an output similar to the below that clearly shows which records do not match the last refresh of the materialized dataset.
Note: To do the following steps, you will need to have a Sigma account type with permissions to materialize datasets.
Step 1: Create dataset and materialize it
First, we want to create a dataset and set up materialization to occur at whatever cadence we decide to track these changes to our dataset (e.g., do we want to check if a change occurred in the last hour, day, week, etc.). This materialized dataset will effectively act as a temporary snapshot in time. Note that materializations will overwrite each other, so this method will only show changes made to a table since the last materialization.
In this example, I am going to use a 10 row table with fake Salesforce opportunity data and will simulate changes by manually inserting/deleting/updating records from a table in Snowflake. The screenshot below shows the materialized dataset built off this table.
Step 2: Duplicate dataset
Now, we need to duplicate this dataset. Note that this new dataset does not have a materialization schedule defined, so this will serve as our live/current state of the Salesforce opportunity table.
Step 3: Join datasets
Now I am going to create a workbook and full outer join these two datasets together on the primary key and the columns that we want to check for changes. The screenshot below shows the join parameters where we want to check if an opportunity in Salesforce has a new stage or amount associated with it. In this example, [Id] is the primary key, while [Stage Name] and [Amount] are the columns we want to check for changes.
In my example, I haven’t made any changes to the Opportunity table in Snowflake yet, so the output still only has 10 rows of data. I will now make a manual change to the table in Snowflake to delete the 001 McDonald’s record and add a 011 Applebee’s record. I will also update 004 Domino’s to Stage 4 and 006 Chick-fil-A to have an amount of $10,000. Now when I refresh the data in Sigma, I am seeing duplicate records where there were changes made.
Step 4: Create primary key column and group by this column to maintain cardinality
Now that we have brought in all of the columns from both datasets, let’s rename the columns to make it more clear where each column is coming from. Add “OLD” before the column names coming from the materialized dataset and add “NEW” before the column names coming from the live dataset. Since any change made to the dataset join key columns would result in a duplicate row, we want to create a new calculated column to serve as our new primary key for our output.
[KEY] = Coalesce([NEW Id], [OLD Id])
We are using Coalesce() here because any deleted record will have a null value under [NEW Id] (Coalesce returns the first non-null value). The reason a null may be returned in this scenario is because a deleted record would exist in the materialized dataset, but not in the live dataset.
Group this table by [KEY].
Step 5: Determine changes in output
By doing a full outer join, we can get the full picture as to what changes have been made to the opportunity table since it was last materialized. For any [KEY] where we have multiple rows, we know there was a change made.
Where [OLD Id] is null, we know that record has been added since the last materialization since the record doesn’t exist in the materialized dataset, but it does in the live dataset. Where [NEW Id] is null, we know that record has been deleted since the last materialization.
The following calculation at the [KEY] grouping level will determine the status of the current state of a given record in the opportunity table.
[Record Status] = If(Count() > 1, “Updated record”, IsNull([OLD Id]), “New record”, IsNull([NEW Id]), “Deleted record”, “Existing record”)
In this example we actually have an additional calculation at this grouping level to check if there was a change to the [Amount] field, and if so, how much.
Step 6: Format output
Now we will use the Coalesce() function as we did before to display the current state of the opportunity table as well as the deleted records since the last materialization. To make it more clear to our end users what the changes were since the last materialization, we can add some conditional formatting rules.
And that’s it! Now you have an easy way to track changes made to a table/dataset for simple use cases.