Friday, December 26, 2008

Why do some SQL strings have an 'N' prefix?

Why do some SQL strings have an 'N' prefix?

You may have seen Transact-SQL code that passes strings around using an N prefix. This denotes that the subsequent string is in Unicode (the N actually stands for National language character set). Which means that you are passing an NCHAR, NVARCHAR or NTEXT value, as opposed to CHAR, VARCHAR or TEXT. See Article #2354 for a comparison of these data types.

Unicode is typically used in database applications which are designed to facilitate code pages which extend beyond the English and Western Europe code pages (Erland Sommarskog, a native of Sweden, refers to this set as "Germanic and Romance languages"), for example Chinese. Unicode is designed so that extended character sets can still "fit" into database columns. What this means is that Unicode character data types are limited to half the space, because each byte actually takes two bytes (Unicode is sometimes referred to as "double-wide"). For more information on Unicode, see Unicode.org. Note that there are many encoding schemes in the Unicode standard, but SQL Server only supports one: UTF-16.

While using Unicode is a design choice you can make in building your own applications, some facilities in SQL server require it. One example is sp_executeSQL. If you try the following:

EXEC sp_ExecuteSQL 'SELECT 1'

You will get this error:

Server: Msg 214, Level 16, State 2, Procedure sp_executesql, Line 1
Procedure expects parameter '@statement' of type 'ntext/nchar/nvarchar'.

You can get around this in two ways:

-- (a) using the N prefix

EXEC sp_ExecuteSQL N'SELECT 1'

-- (b) using a variable

DECLARE @sql NVARCHAR(100)
SET @sql = N'SELECT 1'
EXEC sp_ExecuteSQL @sql

Note that implicit conversion makes the N prefix optional in case (b); however, for legibility and consistency, you should always use the prefix when defining Unicode strings. One reason is that leaving it off can actually change your data if it contains Unicode characters (losing the additional information), as in the following example:

DECLARE @n NVARCHAR(10)
SET @n = 'a'
PRINT @n
SET @n = N'a'
PRINT @n

The first assignment, which didn't use the N prefix, gets printed as a regular a. Only the second maintains the character that was actually supposed to be represented. As you can imagine, if you are intending to support data entry in foreign languages and code pages, you will likely need to test for Unicode support (making sure that such columns support Unicode, and that data won't be lost when passed into stored procedures, functions, etc.). Note that your application will need to handle Unicode as well; for example, when you try to print this character from ASP...

<% Response.Write("a") %>

...it actually prints out the string aa. (This result might depend on your codepage and regional settings.) So you might consider translating your data into its ASCII equivalent, e.g. a = ā.

Another reason you want to avoid implicit conversion is that there are some potentially serious performance issues. Consider the following quite simple repro:

USE tempdb
GO

CREATE TABLE a
(
b VARCHAR(3),
c NVARCHAR(3)
)
CREATE INDEX b ON a(b)
CREATE INDEX c ON a(c)
GO

SET NOCOUNT ON

INSERT a SELECT 'foo', N'foo'
INSERT a SELECT 'bar', N'bar'

DECLARE
@b VARCHAR(3),
@c NVARCHAR(3)

SELECT
@b = 'foo',
@c = N'foo'

SELECT * FROM a WHERE b = @b
SELECT * FROM a WHERE b = @c
SELECT * FROM a WHERE c = @b
SELECT * FROM a WHERE c = @c
SELECT * FROM a WHERE b LIKE @b
SELECT * FROM a WHERE b LIKE @c
SELECT * FROM a WHERE c LIKE @b
SELECT * FROM a WHERE c LIKE @c

DROP TABLE a

Paste the code into Query Analyzer, turn execution plan on, and let her rip. You'll observe the following breakdown of percentage of work (roughly, depending on your hardware):

VARCHAR = VARCHAR 4.48%
VARCHAR = NVARCHAR 13.31%
NVARCHAR = VARCHAR 4.48%
NVARCHAR = NVARCHAR 4.48%
VARCHAR LIKE VARCHAR 4.48%
VARCHAR LIKE NVARCHAR 13.31%
NVARCHAR LIKE VARCHAR 4.48%
NVARCHAR LIKE NVARCHAR 4.48%

Now, that's not the whole story; we all know that there are many other factors, such as I/O, that will impact the actual time each portion of the query takes. The key is that implicit conversion *can* cause a table scan instead of an index seek, and on larger tables this can really hurt. While it's important to understand why this happens and in which scenarios, my recommendation is to match your character-based datatypes as explicitly as possible.

One other thing to watch out for: your database may be using Unicode without your knowledge. If you upsize from Access to SQL Server, for example, character-based text columns might be translated to Unicode (I believe this is a catch-all technique; in case Access was storing Unicode strings, or if you might be storing Unicode strings later, you won't lose data or require changes). I think the Access upsizing tools should be updated to force a conscious choice, so that you aren't wasting space for nothing, and so that you know that you made a decision at all.

For a more thorough discussion of Unicode and the N prefix, please see KB #239530, this MSDN article, and this Google thread.

In other RDBMS platforms, or in the ANSI and/or ISO specifications, you might see prefixes other than N being used against values. (Current versions of SQL Server only support Unicode.) Here are the additional monikers I am aware of:

B This is used to denote a BINARY string expressed in bits (0 and 1 only)
X This is used to denote a BINARY string expressed in hexadecimal (0 -> F)

No comments: