
Characters displayed as question marks or squares in SQL Server

by Vlad Drumea

In this post I’m going over potential reasons for why some special characters end up corrupted and displayed as question marks, squares or something completely different in SQL Server.

Note this post doesn’t really cover collations all that much. If you want to read a great post about SQL Server collations, check out Daniel Hutmacher’s post – A quick look at SQL Server UTF-8 collations.

This is one of those things I end up having to explain a couple of times per year, so I figured I might as well turn it into a blog post to spare me the typing/talking the next time I get this question.

Intro

This is how it usually happens:

Someone retrieving some data from SQL Server notices that some of the strings in the result set tend to look like this.

SSMS result set showing some characters returned as question marks from SQL Server

In this case, with some of the characters looking like question marks.

Followed by an allegation that SQL Server and/or SSMS is messing up the data in some way.

In the above image, the column Word should contain the word “duck” in Japanese, Greek and Romanian:

  • アヒル
  • πάπια
  • rață

You can reproduce the same result by running this query in a database that uses SQL Server’s default collation – SQL_Latin1_General_CP1_CI_AS.
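A minimal query along these lines reproduces it; the Word column alias simply mirrors the screenshot:

```sql
/* The literals have no N prefix, so they're treated as VARCHAR and forced
   through the database's code page, losing any characters that aren't in it. */
SELECT 'アヒル' AS Word
UNION ALL
SELECT 'πάπια'
UNION ALL
SELECT 'rață';
```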


If you don’t have a database with the SQL_Latin1_General_CP1_CI_AS collation, you can create one quickly on an existing instance.
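Something along these lines does the trick; the database name is just a placeholder:

```sql
/* Hypothetical database name; the collation is the default one discussed above. */
CREATE DATABASE CollationTest
    COLLATE SQL_Latin1_General_CP1_CI_AS;
```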


Or create a SQL Server Developer Edition container via sqlcmd or Podman with that collation.

Note that aside from some characters being replaced with ?, there are also other issues.
The Greek letter π is displayed as a p, and the Romanian letter ă is displayed as an a.
So, in short, the values look like they contain corrupted characters.

Why do some characters end up displayed as question marks?

I’ll list the situations that cause this from most likely to least likely, at least based on my experience.

The inserted string is treated as non-Unicode

This is the case around 95% of the time.
The target column is NVARCHAR (Unicode/UTF-16), but the INSERT treats the string as VARCHAR (non-Unicode, limited to the database's code page).

Example using the same database with the SQL_Latin1_General_CP1_CI_AS collation:
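Here's a minimal sketch of the scenario; the dbo.Ducks table name is just a placeholder:

```sql
/* The Word column is NVARCHAR (Unicode), but the literals lack the N prefix,
   so SQL Server treats them as VARCHAR and converts them through the
   database's code page before storing them. */
CREATE TABLE dbo.Ducks (
    Word NVARCHAR(50)
);

INSERT INTO dbo.Ducks (Word)
VALUES ('アヒル'),
       ('πάπια'),
       ('rață');

SELECT Word
FROM dbo.Ducks;
```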


The data stored in that table will contain corrupted characters and will look exactly like the data in the previous screenshot, with question marks and incorrect characters being returned.

Admittedly, corrupted characters might not be the most accurate term.
It's more that information was lost from those characters because they were handled using a lower-fidelity data type.

What’s the difference between an NVARCHAR and a VARCHAR string?

One letter – N

Prepending N to the opening string delimiter indicates that the string is NVARCHAR (UTF-16).
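Reusing the placeholder dbo.Ducks table from above, the corrected INSERT would look something like this:

```sql
/* The N prefix makes each literal NVARCHAR (Unicode),
   so no code page conversion happens on the way in. */
INSERT INTO dbo.Ducks (Word)
VALUES (N'アヒル'),
       (N'πάπια'),
       (N'rață');

SELECT Word
FROM dbo.Ducks;
```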

As a result, all the characters will be stored and displayed correctly, regardless of the database’s collation.

SSMS result set showing the originally inserted words with some characters displayed as question marks as well as the words stored correctly in SQL Server

Note that the above result set also includes the previously inserted values that were treated as VARCHAR in the INSERT statement.

And the only way to fix the previously inserted values that contain corrupted characters is to update them using the appropriate type of string.
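In the placeholder dbo.Ducks table, such an update might look like this; the WHERE clause is just for illustration, in a real table you'd pin down the affected rows by a key:

```sql
/* Overwrite the mangled value with a properly typed NVARCHAR literal.
   'ra?a' assumes that's how the corrupted value ended up being stored. */
UPDATE dbo.Ducks
SET Word = N'rață'
WHERE Word = 'ra?a';
```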

The inserted string is treated as Unicode, but the target column is VARCHAR

This is usually due to a design mistake that ended up with the target column being defined as VARCHAR even though it was intended to store Unicode data.
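A minimal sketch of that scenario, again using placeholder object names:

```sql
/* The Word column is VARCHAR (non-Unicode), so even NVARCHAR literals
   get converted to the database's code page when they're stored. */
CREATE TABLE dbo.DucksVarchar (
    Word VARCHAR(50)
);

INSERT INTO dbo.DucksVarchar (Word)
VALUES (N'アヒル'),
       (N'πάπια'),
       (N'rață');

SELECT Word
FROM dbo.DucksVarchar;
```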


Notice that the INSERT statement does treat the string values as NVARCHAR by having the N prefix.
But, since the Word column is defined as VARCHAR, it won’t matter.

result set showing the words being stored with corrupted characters and displaying question marks

In this case the solution is to drop and re-create the table with the Word column being defined as NVARCHAR.
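Something along these lines, using the same placeholder table:

```sql
/* Re-create the table with a Unicode column and re-insert the data. */
DROP TABLE dbo.DucksVarchar;

CREATE TABLE dbo.DucksVarchar (
    Word NVARCHAR(50)
);

INSERT INTO dbo.DucksVarchar (Word)
VALUES (N'アヒル'),
       (N'πάπια'),
       (N'rață');
```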


Wrong database collation

This specific situation has only come up once in my career (so far), so I rarely consider it as a possibility.

Here, both the inserted string and the target column are VARCHAR by design, but the database was created with the wrong collation.

In this case, a database with an appropriate collation would be able to properly store the 3 strings from my examples.
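One way to do that, assuming SQL Server 2019 or later so that a UTF-8 collation is available (the database and table names are placeholders):

```sql
/* With a UTF-8 collation, VARCHAR columns and literals can store these
   characters even without the N prefix. */
CREATE DATABASE DucksUtf8
    COLLATE Latin1_General_100_CI_AS_SC_UTF8;
GO

USE DucksUtf8;
GO

CREATE TABLE dbo.Ducks (
    Word VARCHAR(50)
);

INSERT INTO dbo.Ducks (Word)
VALUES ('アヒル'),
       ('πάπια'),
       ('rață');

SELECT Word
FROM dbo.Ducks;
```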


result set showing all the words displayed correctly without any sign of corrupted characters, question marks or squares

Just note that it can be a bad idea to have a database whose collation differs from the instance's collation.
For more info check out Brent Ozar’s post on collation mismatching.

Conclusion

Usually it’s a fairly simple mistake in the INSERT/UPDATE statement that leads to specific characters being corrupted and displayed as question marks, squares or something else in SQL Server.
That means the fix is also simple and only requires a minor adjustment to the INSERT/UPDATE statement.

Unfortunately, regardless of the situation, character information was lost the moment the data was stored.
So there's no way to correct previously inserted string data without updating or re-inserting it using properly typed Unicode strings.
