In this post I’m going over potential reasons for why some special characters end up corrupted and displayed as question marks, squares or something completely different in SQL Server.
Note this post doesn’t really cover collations all that much. If you want to read a great post about SQL Server collations, check out Daniel Hutmacher’s post – A quick look at SQL Server UTF-8 collations.
This is one of those things I end up having to explain a couple of times per year, so I figured it might as well be a blog post to spare me the typing/talking the next time I get this question.
Intro
This is how it usually happens:
Someone retrieving some data from SQL Server notices that some of the strings in the result set tend to look like this.
In this case, with some of the characters looking like question marks.
Followed by an allegation that SQL Server and/or SSMS is messing up the data in some way.
In the above image, the column Word should contain the word “duck” in Japanese, Greek and Romanian:
- アヒル
- πάπια
- rață
You can reproduce the same result by running this query in a database that uses SQL Server’s default collation – SQL_Latin1_General_CP1_CI_AS.
```sql
SELECT 'rață' AS Word
UNION
SELECT 'πάπια'
UNION
SELECT 'アヒル';
```
If you don’t have a database with the SQL_Latin1_General_CP1_CI_AS collation, you can create one quickly on an existing instance.
```sql
CREATE DATABASE [DefaultCollationDB]
COLLATE SQL_Latin1_General_CP1_CI_AS;
```
Or create a SQL Server Developer Edition container via sqlcmd or Podman with that collation.
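For example, something along these lines should work with Podman (the MSSQL_COLLATION, MSSQL_PID, and MSSQL_SA_PASSWORD environment variables are supported by Microsoft's SQL Server container images; the password and image tag here are placeholders, adjust them to your needs):

```shell
# Start a SQL Server 2022 Developer Edition container
# with SQL_Latin1_General_CP1_CI_AS as the instance collation
podman run -d --name sql-dev \
  -e "ACCEPT_EULA=Y" \
  -e "MSSQL_SA_PASSWORD=YourStrong!Passw0rd" \
  -e "MSSQL_PID=Developer" \
  -e "MSSQL_COLLATION=SQL_Latin1_General_CP1_CI_AS" \
  -p 1433:1433 \
  mcr.microsoft.com/mssql/server:2022-latest
```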
Note that aside from some characters being replaced with ?, there are also other issues.
The Greek letter π is displayed as a p, and the Romanian letter ă is displayed as an a.
So, in short it looks like the values contain some corrupted characters.
Why do some characters end up displayed as question marks?
I’ll list the situations that cause this from most likely to least likely, at least based on my experience.
The inserted string is treated as non-Unicode
This is the case around 95% of the time.
The target column is NVARCHAR (Unicode/UTF-16), but the INSERT treats the string literals as VARCHAR (non-Unicode, limited to the collation's code page).
Example using the same database with the SQL_Latin1_General_CP1_CI_AS collation:
```sql
CREATE TABLE [Words]
(
    [Word] NVARCHAR(50)
);

INSERT INTO [Words] ([Word])
VALUES ('rață'),
       ('πάπια'),
       ('アヒル');

SELECT Word FROM Words;
```
The data stored in that table will have the same corrupted characters and will look exactly like the previous screenshot: question marks and incorrect letters. Though "corrupted" may not be the most accurate term; rather, information was lost from those characters because they were handled using a lower-fidelity data type.
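Under the hood, storing a Unicode literal in a VARCHAR context forces a conversion to the collation's code page (code page 1252 for SQL_Latin1_General_CP1_CI_AS). Characters outside that code page are either best-fit mapped (π becomes p) or replaced with ?. You can observe the same loss with an explicit CAST — a sketch, assuming you're still in the SQL_Latin1_General_CP1_CI_AS database:

```sql
-- Forcing the NVARCHAR-to-VARCHAR conversion that the earlier INSERT did implicitly.
-- Characters with no code page 1252 mapping come back as ? (or as a best-fit letter).
SELECT CAST(N'πάπια' AS VARCHAR(50)) AS GreekWord,
       CAST(N'アヒル' AS VARCHAR(50)) AS JapaneseWord;
```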
What’s the difference between an NVARCHAR and a VARCHAR string?
One letter – N
Prefixing a string literal with N tells SQL Server to treat it as NVARCHAR (UTF-16).
```sql
INSERT INTO [Words] ([Word])
VALUES (N'rață'),
       (N'πάπια'),
       (N'アヒル');

SELECT Word FROM Words;
```
As a result, all the characters will be stored and displayed correctly, regardless of the database’s collation.
Note that the above result set also includes the previously inserted values that were treated as VARCHAR in the INSERT statement.
And the only way to fix the previously inserted values that contain corrupted characters is to update them using the appropriate type of string.
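For example, a sketch of such an update — assuming 'ra?a' is what actually ended up stored in your table (check first, since the exact corrupted value depends on which characters were best-fit mapped versus replaced):

```sql
-- Overwrite the corrupted row with a proper Unicode (N-prefixed) literal
UPDATE [Words]
SET [Word] = N'rață'
WHERE [Word] = 'ra?a'; -- the corrupted value currently stored (assumed)
```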
The inserted string is treated as Unicode, but the target column is VARCHAR
This is usually due to a design mistake that ended up with the target column being defined as VARCHAR even though it was intended to store Unicode data.
```sql
IF OBJECT_ID(N'dbo.Words', N'U') IS NOT NULL
    DROP TABLE [Words];

CREATE TABLE [Words]
(
    [Word] VARCHAR(50)
);

INSERT INTO [Words] ([Word])
VALUES (N'rață'),
       (N'πάπια'),
       (N'アヒル');

SELECT Word FROM Words;
```
Notice that the INSERT statement does treat the string values as NVARCHAR by having the N prefix.
But since the Word column is defined as VARCHAR, the values get implicitly converted on insert and the Unicode information is lost anyway.
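If you're unsure how a column is actually defined, INFORMATION_SCHEMA.COLUMNS (or sys.columns) will tell you — DATA_TYPE coming back as varchar instead of nvarchar confirms the problem:

```sql
SELECT COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'Words';
```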
In this case the solution is to drop and re-create the table with the Word column being defined as NVARCHAR.
```sql
IF OBJECT_ID(N'dbo.Words', N'U') IS NOT NULL
    DROP TABLE [Words];

CREATE TABLE [Words]
(
    [Word] NVARCHAR(50)
);
```
Wrong database collation
This specific situation has only happened once in my career (so far), so it's usually the last thing I consider. Here, both the inserted string and the target column are VARCHAR by design, but the database was created with the wrong collation.
In this case, a database with an appropriate collation would be able to properly store the 3 strings from my examples.
```sql
USE [master];
GO

CREATE DATABASE [LatinUTF8]
COLLATE Latin1_General_100_CI_AS_KS_SC_UTF8;
GO

USE [LatinUTF8];
GO

CREATE TABLE [Words]
(
    [Word] VARCHAR(50)
);
GO

INSERT INTO [Words] ([Word])
VALUES ('rață'),
       ('πάπια'),
       ('アヒル');
GO

SELECT [Word] FROM [Words];
```
Just note that it can be a bad idea to have a database whose collation differs from the instance's collation.
For more info check out Brent Ozar’s post on collation mismatching.
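A quick way to spot such a mismatch is to compare the two collations side by side using SQL Server's metadata functions:

```sql
-- Compare the instance collation with the current database's collation
SELECT SERVERPROPERTY('Collation')                AS InstanceCollation,
       DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS DatabaseCollation;
```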
Conclusion
Usually it’s a fairly simple mistake in the INSERT/UPDATE statement that leads to specific characters being corrupted and displayed as question marks, squares or something else in SQL Server.
Which means that the fix is also simple, and only requires a minor adjustment to the INSERT/UPDATE statement.
Unfortunately, that only fixes data going forward: character information was already lost when the existing data was stored, so previously inserted strings can only be corrected by updating or re-inserting them from the original source.