Home » Calculate checksum for entire rows in SQL Server

Calculate checksum for entire rows in SQL Server

by Vlad Drumea
1 comment

Brent Ozar’s latest blog post contained a question that piqued my interest about a potential method to calculate the checksum for an entire row in SQL Server.

The original question:

Kansas4444 asks: Hi Brent, do you have any advice on calculating checksum for an entire row on SQL Server ? (Brent says: I’ve never had to do this, so no idea.)

https://www.brentozar.com/archive/2023/09/office-hours-questions-i-dont-have-answers-for/

Although I did leave the following comment, I still wanted to explore this more and make sure it’s actually valid and doesn’t rely on stuff like case-sensitive collations.

 Vlad Drumea September 19, 2023 1:52 pm 1. The only method I can think of (but I’d like to dig more into this) is something like concatenating that record into a string variable and then applying HASHBYTES on that variable Example using the Users table in the StackOverflow database (yes, I didn’t concatenate all the columns, but it should be enough for a PoC: DECLARE @Record NVARCHAR(MAX); SELECT @Record = CAST(ID AS NVARCHAR(10)) + ISNULL(AboutMe, ”) + CAST(ISNULL(Age, 0) AS NVARCHAR(10)) + CONVERT(NVARCHAR(23), CreationDate, 21) + DisplayName + CAST(ISNULL(DownVotes, 0) AS NVARCHAR(10)) + ISNULL(EmailHash, ”) + CONVERT(NVARCHAR(23), LastAccessDate, 21) FROM Users WHERE id = 26837 SELECT HASHBYTES(‘SHA2_256’, @Record); SEO checksum for row in sql server

Use cases

One of the use cases I can think of would be a means of change control, making sure that the latest version of a specific record is a result of a valid change done via the application/website and not by someone who ran an accidental update via SSMS on that table by mixing up Prod with Dev/QA.

This implies that, whenever a “legitimate” insert/update is made, the application has a way to:

  1. compute the checksum of the newly added/changed record
  2. store that checksum for that record

Improving on the code suggested in my comment

Brief overview

Note: I’m running this on SQL Server 2019, latest CU, but the code is compatible with SQL Server 2022 as well as with previous versions.

I’m using the 180GB version of the StackOverflow database provided by Brent Ozar here and both the instance and the database are using the SQL_Latin1_General_CP1_CI_AS collation.

For reference, this is the structure of the Users table:

Moving on

After my initial comment I’ve realized that I have not taken into account a few situations:

  • Without a separator, there is no difference between an empty string and two consecutive empty strings
  • Without having a specific placeholder for a NULL and just using ISNULL(Column, N'') there’s no difference between an empty string added through an update and a NULL that’s been turned into an empty string through ISNULL(), or between an Age set to 0 via the Website or from using 0 as a placeholder for NULL when computing the checksum.

Let’s take Brent’s user for example:

If Brent would update the Age field to 0 via the StackOverflow Website, the checksum after that change would be the same as the one previously generated by the T-SQL in my comment.

First, I’ll need to add a column to store the checksum.

Since the SHA2_256 output is only 32 bytes, I can limit the width of the new column that byte size.

Next, I create a view that returns the Id and CheckSum for the Users table.

For my PoC, I’m opting for ; as the separator and -NULL- as the placeholder for NULLs.

I’m specifying WITH SCHEMABINDING when creating the view in order to avoid situations where someone changes the table structure without also updating the view.

Note that I am using HASHBYTES, because BINARY_CHECKSUM sometimes produces the same checksum for different data.

Plus, I like playing with HASHBYTES, especially in combination with SQL logins’ password hashes.

I then test the view works as expected.

 checksum for row in sql server

And proceed to update the CheckSum column for Brent’s record.

Note that if I want to update the new column for the whole table, I can just omit the WHERE clause.

Afterwards, I compare the values to make sure they are consistent between the view and the stored checskum.

 checksum for row in sql server

Testing

Now to do some small changes to test the behavior.

First, I change the case of the letter B in DisplayName.

I then compare the previous checksum with the new one, then undo the change and verify again.

 checksum for row in sql server

Note that the checksums no longer match in the first pair of the result sets since the value of the CheckSum column in the Users table is the old one.

This also confirms that it’s working as intended regardless if the instance is case-sensitive or not.

What does this mean for the application?

For an update to a record to have a consistent checksum, the application should make sure to keep the CheckSum column updated.

So an update for the LastAccessDate column would look like this:

If the application already handles inserts and updates via a stored procedure, then the Sync CheckSum part can be included in the stored procedure in order to avoid making code changes in the application itself.

Why not use a trigger or a computed column?

Because they defeat the purpose of trying to ensure that the changes to the data come from the application alone.

A trigger will be fired for every update, unless you start making flag tables in tempdb so that it only fires for sessions coming from the application (more on that in a future post maybe).

Of course, someone who knows the inner workings of the application can just update the CheckSum column themselves, but it’s still an extra step that someone needs to be aware of.

Conclusion

It is possible to get an accurate checksum for a row in SQL Server, it just requires a bit of planning and also adding another step to insert and update operations that touch the table you’re interested in

You may also like

1 comment

Sab September 20, 2023 - 09:48

Very interesting, thanks.
Also, sql server has rowVersion data type, that is incremented on insert , update (that column must exists) (like the depricated timestamp)
S

Reply

Leave a Comment

* By using this form you agree with the storage and handling of your data by this website.