T-SQL: How to Find Rows with Bad Characters

T-SQL: How to Find Rows with Bad Characters

One of the commonly asked questions in Transact SQL Forum on MSDN is how to filter rows containing bad characters. Also, often times these bad characters are not known, say, in one of the recent posts the question was to filter all the rows where characters were greater than ASCII 127.

The first step towards solution is to realize that in order to quickly filter out something we may want to know the list of allowed characters first. 

I will now show several samples of how important is to know the "good characters" in order to filter the "bad" ones.

Let's suppose we only want alpha-numeric characters to remain and everything else should be considered bad rows.

For all our examples let's create the following table variable:

DECLARE @TableWithBadRows TABLE (
    Id INT identity(1, 1) PRIMARY KEY
    ,description VARCHAR(max)
    );
  
INSERT INTO @TableWithBadRows (description)
  
VALUES ('test1'), ('I am OK'), ('Filter me, please.');


Our pattern then will be 

SELECT * FROM @TableWithBadRows WHERE description LIKE '%[^a-z0-9]%';


where a-z means a range of all letters from a to z, 0-9 means range of all numbers from 0 to 9 and ^ means everything which is not like the following characters.

The above code will return 2 last rows. The second row is returned because it contains a space character which was not included in the list of allowed characters.

Now, what should we do if want to keep all the "normal" characters and only disallow characters which are greater than ASCII 127? In this case, we may want to build the pattern in a loop.

Here is some code demonstrating this idea:

DECLARE @TableWithBadRows TABLE (
    Id INT identity(1, 1) PRIMARY KEY
    ,description VARCHAR(max)
    );
  
INSERT INTO @TableWithBadRows (description)
VALUES ('test1')
    ,('I am OK')
    ,('Filter me, please.')
    ,('Let them be & be happy')
    ,(CHAR(200))
    ,(CHAR(137))
    ,(CHAR(10) + CHAR(13) + 'Test more');
  
SELECT *
FROM @TableWithBadRows;
  
SELECT *
FROM @TableWithBadRows
WHERE description LIKE '%[^A-Z0-9%]%';
  
DECLARE @i INT = 32;
DECLARE @pattern VARCHAR(max) = '^a-Z0-9'
    ,@ch CHAR(1);
  
WHILE @i < 47
BEGIN
    SET @ch = CHAR(@i)
  
    IF @ch = '_'
        SET @pattern = @pattern + '[' + @ch + ']';
    ELSE
        IF @ch = '['
            SET @pattern = @pattern + @ch + @ch;
        ELSE
            SET @pattern = @pattern + @ch;
  
    SET @i = @i + 1;
END
  
SET @i = 58;
  
WHILE @i < 65
BEGIN
    SET @ch = CHAR(@i)
  
    IF @ch = '_'
        SET @pattern = @pattern + '[' + @ch + ']';
    ELSE
        IF @ch = '['
            SET @pattern = @pattern + @ch + @ch;
        ELSE
            SET @pattern = @pattern + @ch;
  
    SET @i = @i + 1;
END
  
SELECT @pattern
  
SELECT *
FROM @TableWithBadRows
WHERE description LIKE '%[' + @pattern +']%'


As you can see from the second select statement, the CHAR(200) (È) is not being filtered by the a-z filter as it is apparently considered a letter.

We may try adding binary collation to treat that letter as bad, e.g.


SELECT *
FROM @TableWithBadRows
WHERE description LIKE '%[^A-Za-z0-9% ]%'  COLLATE Latin1_General_BIN;


As you see, now this letter is considered bad row.

This thread "Getting records with special characters" shows how to create a pattern when the bad characters are in the special table and also which characters ([,^,-) we need to escape.

Conclusion

I have shown several examples of filtering bad rows using various patterns. For more comprehensive and different functions please see links in the See Also section.


See Also


This article participates in the TechNet Guru for August competition.


Leave a Comment
  • Please add 1 and 1 and type the answer here:
  • Post
Wiki - Revision Comment List(Revision Comment)
Sort by: Published Date | Most Recent | Most Useful
Comments
  • Naomi  N edited Revision 7. Comment: Formatting

  • Naomi  N edited Revision 2. Comment: 2 more links in See Also to my blogs in BeyondRelational

  • Naomi  N edited Original. Comment: Forgot to add my own tag!

Page 1 of 1 (3 items)
Wikis - Comment List
Sort by: Published Date | Most Recent | Most Useful
Posting comments is temporarily disabled until 10:00am PST on Saturday, December 14th. Thank you for your patience.
Comments
  • Yes, I found that we may need to use binary collation or any other particular collation if we want to properly filter non English characters. I may need to explore this problem a bit more.

  • Naomi  N edited Revision 7. Comment: Formatting

  • It's great solution!

    Its not bad to add that this solution just work for English characters and numbers. Non-English characters needs to customize the code.

  • Naomi  N edited Revision 2. Comment: 2 more links in See Also to my blogs in BeyondRelational

  • Naomi  N edited Original. Comment: Forgot to add my own tag!

Page 1 of 1 (5 items)