Brent Ozar’s latest blog post contained a question that piqued my interest about a potential method to calculate the checksum for an entire row in SQL Server.
The original question:
Kansas4444 asks: Hi Brent, do you have any advice on calculating checksum for an entire row on SQL Server ? (Brent says: I’ve never had to do this, so no idea.)
https://www.brentozar.com/archive/2023/09/office-hours-questions-i-dont-have-answers-for/
Although I did leave the following comment, I still wanted to explore this more and make sure it’s actually valid and doesn’t rely on stuff like case-sensitive collations.
Use cases
One of the use cases I can think of would be a means of change control, making sure that the latest version of a specific record is a result of a valid change done via the application/website and not by someone who ran an accidental update via SSMS on that table by mixing up Prod with Dev/QA.
This implies that, whenever a “legitimate” insert/update is made, the application has a way to:
- compute the checksum of the newly added/changed record
- store that checksum for that record
Improving on the code suggested in my comment
Brief overview
Note: I’m running this on SQL Server 2019, latest CU, but the code is compatible with SQL Server 2022 as well as with previous versions.
I’m using the 180GB version of the StackOverflow database provided by Brent Ozar here and both the instance and the database are using the SQL_Latin1_General_CP1_CI_AS collation.
For reference, this is the structure of the Users table:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 | CREATE TABLE [dbo].[Users]( [Id] [int] IDENTITY(1,1) NOT NULL, [AboutMe] [nvarchar](max) NULL, [Age] [int] NULL, [CreationDate] [datetime] NOT NULL, [DisplayName] [nvarchar](40) NOT NULL, [DownVotes] [int] NOT NULL, [EmailHash] [nvarchar](40) NULL, [LastAccessDate] [datetime] NOT NULL, [Location] [nvarchar](100) NULL, [Reputation] [int] NOT NULL, [UpVotes] [int] NOT NULL, [Views] [int] NOT NULL, [WebsiteUrl] [nvarchar](200) NULL, [AccountId] [int] NULL, CONSTRAINT [PK_Users_Id] PRIMARY KEY CLUSTERED ( [Id] ASC )); |
Moving on
After my initial comment I’ve realized that I have not taken into account a few situations:
- Without a separator, there is no difference between an empty string and two consecutive empty strings
- Without having a specific placeholder for a NULL and just using
ISNULL(Column, N'')
there’s no difference between an empty string added through an update and a NULL that’s been turned into an empty string throughISNULL()
, or between an Age set to 0 via the Website or from using 0 as a placeholder for NULL when computing the checksum.
Let’s take Brent’s user for example:
1 | SELECT * FROM Users WHERE id = 26837; |
If Brent would update the Age field to 0 via the StackOverflow Website, the checksum after that change would be the same as the one previously generated by the T-SQL in my comment.
First, I’ll need to add a column to store the checksum.
1 | ALTER TABLE Users ADD [CheckSum] VARBINARY(32); |
Since the SHA2_256 output is only 32 bytes, I can limit the width of the new column that byte size.
Next, I create a view that returns the Id and CheckSum for the Users table.
For my PoC, I’m opting for ;
as the separator and -NULL-
as the placeholder for NULLs.
I’m specifying WITH SCHEMABINDING when creating the view in order to avoid situations where someone changes the table structure without also updating the view.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 | CREATE VIEW V_GetUsersRecordChecksum WITH SCHEMABINDING AS SELECT Id, HASHBYTES(N'SHA2_256',CAST(ID AS NVARCHAR(10)) + N';' + ISNULL(AboutMe, N'-NULL-') + N';' + ISNULL(CAST(Age AS NVARCHAR(10)), N'-NULL-') + N';' + CONVERT(NVARCHAR(23), CreationDate, 21) + N';' + DisplayName + N';' + CAST(DownVotes AS NVARCHAR(10)) + N';' + ISNULL(EmailHash, N'-NULL-') + N';' + CONVERT(NVARCHAR(23), LastAccessDate, 21) + N';' + ISNULL([Location], N'-NULL-') + N';' + CAST(Reputation AS NVARCHAR(10)) + N';' + CAST(UpVotes AS NVARCHAR(10)) + N';' + CAST([Views] AS NVARCHAR(10)) + N';' + ISNULL(WebsiteUrl,N'-NULL-') + N';' + ISNULL(CAST(AccountId AS NVARCHAR(10)), N'-NULL-')) AS [CheckSum] FROM dbo.Users; GO |
Note that I am using HASHBYTES, because BINARY_CHECKSUM sometimes produces the same checksum for different data.
Plus, I like playing with HASHBYTES, especially in combination with SQL logins’ password hashes.
I then test the view works as expected.
1 | SELECT * FROM V_GetUsersRecordChecksum WHERE Id = 26837; |
And proceed to update the CheckSum column for Brent’s record.
1 2 3 4 5 6 | UPDATE u SET u.[CheckSum] = c.[CheckSum] FROM Users AS U INNER JOIN V_GetUsersRecordChecksum AS c ON c.Id = u.Id WHERE u.Id = 26837; |
Note that if I want to update the new column for the whole table, I can just omit the WHERE clause.
Afterwards, I compare the values to make sure they are consistent between the view and the stored checskum.
1 2 | SELECT Id, [CheckSum] FROM Users WHERE Id = 26837; SELECT * FROM V_GetUsersRecordChecksum WHERE Id = 26837; |
Testing
Now to do some small changes to test the behavior.
First, I change the case of the letter B in DisplayName.
1 2 | UPDATE Users SET DisplayName = REPLACE(DisplayName, N'B', N'b') WHERE id = 26837; |
I then compare the previous checksum with the new one, then undo the change and verify again.
1 2 3 4 5 6 7 8 9 10 11 | /*Verify checksum change*/ SELECT Id, [CheckSum], DisplayName FROM Users WHERE Id = 26837; SELECT * FROM V_GetUsersRecordChecksum WHERE Id = 26837; /*Undo the change*/ UPDATE Users SET DisplayName = REPLACE(DisplayName, N'b', N'B') WHERE id = 26837; /*Check again*/ SELECT Id, [CheckSum], DisplayName FROM Users WHERE Id = 26837; SELECT * FROM V_GetUsersRecordChecksum WHERE Id = 26837; |
Note that the checksums no longer match in the first pair of the result sets since the value of the CheckSum column in the Users table is the old one.
This also confirms that it’s working as intended regardless if the instance is case-sensitive or not.
What does this mean for the application?
For an update to a record to have a consistent checksum, the application should make sure to keep the CheckSum column updated.
So an update for the LastAccessDate column would look like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | DECLARE @Id INT; SET @Id = 26837 /*Update record*/ UPDATE Users SET LastAccessDate = GETDATE() WHERE Id = @Id; /*Sync CheckSum*/ UPDATE u SET u.[CheckSum] = c.[CheckSum] FROM Users AS U INNER JOIN V_GetUsersRecordChecksum AS c ON c.Id = u.Id WHERE u.Id = @Id; |
If the application already handles inserts and updates via a stored procedure, then the Sync CheckSum part can be included in the stored procedure in order to avoid making code changes in the application itself.
Why not use a trigger or a computed column?
Because they defeat the purpose of trying to ensure that the changes to the data come from the application alone.
A trigger will be fired for every update, unless you start making flag tables in tempdb so that it only fires for sessions coming from the application (more on that in a future post maybe).
Of course, someone who knows the inner workings of the application can just update the CheckSum column themselves, but it’s still an extra step that someone needs to be aware of.
Conclusion
It is possible to get an accurate checksum for a row in SQL Server, it just requires a bit of planning and also adding another step to insert and update operations that touch the table you’re interested in
1 comment
Very interesting, thanks.
Also, sql server has rowVersion data type, that is incremented on insert , update (that column must exists) (like the depricated timestamp)
S