Some reports generated by third-party applications export data that does not have a consistent format. Each line can come in a different format: these applications design the export format on-the-fly, for each record and each value in the record separately, so the format is set by the value itself. For example, if there is data in a field, the data is exported inside quotation marks, but if there is no data, the field is left blank without quotation marks. Moreover, some applications do not use a 'data type' when generating a report: if the data fits as numeric, the application might not use quotation marks at all, while in the same field on a different record, a value that does not fit as numeric will be exported in quotation marks.
Think of a single CSV file in which a specific column is exported in six different formats. In order to use BULK INSERT directly, we have to make sure that all the data is consistent with one format.
We need to use BULK INSERT to import data from a CSV file into a SQL Server database using pure T-SQL. A BULK INSERT operation can use only one format file, so our metadata must remain consistent with it. The first step in using BULK INSERT is to find a set of "bulk insert" rules (like end of line, end of column, collation…) that fit the entire data. As mentioned above, this is sometimes not the case.
If you got an answer in a forum that this can't be done using pure T-SQL, then remember that I always say "never say never".
* In this article we are going to talk only about a pure T-SQL solution. There are several other solutions (like using SSIS, CLR, or a third-party app) that can do this in different ways, sometimes in a better way.
Our application exports the data as a comma-delimited CSV file without a consistent format, as some reports generated by third parties do. In this example we will deal with a very common situation that follows these rules:
1. Our application uses column types (to make it easier, in this article we will focus on a string column). A numeric column will never use quotation marks, and a string column will use quotation marks on and off according to the following rules.
2. If there is data in the field, the data is exported inside quotation marks (no matter whether the data fits as numeric or not, as the column is a string type).
3. If there is no data, the field is left blank and without quotation marks.
The original sample data which we use looks like this:

ID | PhoneNumber | FirstName | LastName
1  | 9999999     | ronen     | ariely
2  | 8888888     | xxx1,xxx2 | yyy
2  | 8888888     | xxx1,xxx2 |
3  |             |           | yyy
4  | 7777777     |           |
   | 2222222     | zzz       | kkk
5  | 1111111     | 5000.5    | 5
According to the application export rules above, our exported CSV file will look like this:
1,9999999,"ronen","ariely"
2,8888888,"xxx1,xxx2",yyy
2,8888888,"xxx1,xxx2",
3,,,"yyy"
4,7777777,,
,2222222,zzz,kkk
5,1111111,"5000.5","5"
* We can see in the last line that our application uses column types, so even when the value fits as numeric it comes inside quotation marks. But remember that there are more complex situations, like applications that do not use column types; then the last line would look like [5,5000.5,5]. It can be even more complex if our culture formats numbers with a comma, like 5,000.5, in which case the CSV line might look like [5,5,000.5,5].
* Remember that this is only a workaround for our specific case; for other data a slightly different solution might fit. The idea of how to get to the solution is what is important here.
In this step we will run several tests with different format files. Our aim is to identify any potential problems and to find the best format file, one that fits as many columns as possible from the start.
First of all, you have to find a record format that fits most of the data, as well as the columns that might be inconsistent with this format. To do that, we are going to run several tests and then implement the conclusions in the next step. We will start with a simple bulk insert and continue with some more complex formats. Using the error messages and the results, we will identify the potential problems.
Let's try this in practice
Open Notepad and copy our CSV data into the file.
1,9999999,"ronen","ariely"
2,8888888,"xxx1,xxx2",yyy
2,8888888,"xxx1,xxx2",
3,,,"yyy"
4,7777777,,
,2222222,zzz,kkk
5,1111111,"5000.5","5"
Save the file as "C:\ArielyBulkInsertTesting\Test01.csv"
* Make sure that you use ANSI format when you save the file (you can use a different format like UNICODE, but for this example we shall use ANSI).
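As a side note (not part of the original test files): if your export were saved as Unicode instead of ANSI, the import would need a small adjustment. A minimal, hedged sketch is shown below; the file name Test01_unicode.csv is hypothetical, and #test is the table created later in this step.

-- Hedged sketch only: importing a Unicode-saved copy of the file, without a format file,
-- using the documented DATAFILETYPE = 'widechar' option.
BULK INSERT #test
FROM 'C:\ArielyBulkInsertTesting\Test01_unicode.csv'   -- hypothetical Unicode copy of the file
WITH (
    DATAFILETYPE = 'widechar'   -- tells BULK INSERT the data file is Unicode character data
    , FIELDTERMINATOR = ','
    , ROWTERMINATOR = '\n'
);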
Open Notepad and copy our XML format data into the file.
* Using a format file can help with more complex formats. I highly recommend always using a format file.
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <RECORD>
  <FIELD ID="1" xsi:type="CharTerm" TERMINATOR=","/>
  <FIELD ID="2" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="7"/>
  <FIELD ID="3" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="15" COLLATION="Hebrew_CI_AS"/>
  <FIELD ID="4" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="15" COLLATION="Hebrew_CI_AS"/>
 </RECORD>
 <ROW>
  <COLUMN SOURCE="1" NAME="ID" xsi:type="SQLINT"/>
  <COLUMN SOURCE="2" NAME="PhoneNumber" xsi:type="SQLINT"/>
  <COLUMN SOURCE="3" NAME="FirstName" xsi:type="SQLNVARCHAR"/>
  <COLUMN SOURCE="4" NAME="LastName" xsi:type="SQLNVARCHAR"/>
 </ROW>
</BCPFORMAT>
Save the file as "C:\ArielyBulkInsertTesting\Test01.xml"
Open SSMS and run this DDL to create our table:
CREATE TABLE #test (
    ID int
    , PhoneNumber int
    , FirstName varchar(15)
    , LastName varchar(15)
)
GO
Try to use this simple bulk insert query to import our data:
BULK INSERT #test
FROM 'C:\ArielyBulkInsertTesting\Test01.csv'
WITH (
    FORMATFILE='C:\ArielyBulkInsertTesting\Test01.xml'
    , MAXERRORS = 1
    , KEEPNULLS
    --, DATAFILETYPE = 'native'
    --, CODEPAGE='RAW'
    --, ROWTERMINATOR='\r\n'
    --, FIRSTROW = 3
);
No error… have you got a good feeling? Let's check our data
select * from #test
Results:
Compare our results to the original data... Oops… that's bad…
In our case we can see that the first and second columns have no problem, but the problems start in the third column and continue into the fourth column. First of all, we got some quotation marks in the results; moreover, our third column was split in several records and part of its data moved into the fourth column. Since our format file says that the third column ends at the comma, every time a comma appears as part of the string data, the data gets split. That makes sense.
When we have string data, we surround the content in quotes. If our data had a consistent format, then all the string values would be surrounded with quotes, even the empty ones.
Let's demonstrate a well-formatted CSV. Save this data as "C:\ArielyBulkInsertTesting\Test02.csv":
1,9999999,"ronen","ariely"
2,8888888,"xxx1,xxx2","yyy"
2,8888888,"xxx1,xxx2",""
3,,"","yyy"
4,7777777,"",""
,2222222,"zzz","kkk"
5,1111111,"5000.5","5"
In that case the solution would be very simple. We could use this format file:
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <RECORD>
  <FIELD ID="1" xsi:type="CharTerm" TERMINATOR=","/>
  <FIELD ID="2" xsi:type="CharTerm" TERMINATOR=',\"' MAX_LENGTH="7"/>
  <FIELD ID="3" xsi:type="CharTerm" TERMINATOR='\",\"' MAX_LENGTH="15" COLLATION="Hebrew_CI_AS"/>
  <FIELD ID="4" xsi:type="CharTerm" TERMINATOR="\"\r\n" MAX_LENGTH="15" COLLATION="Hebrew_CI_AS"/>
 </RECORD>
 <ROW>
  <COLUMN SOURCE="1" NAME="ID" xsi:type="SQLINT"/>
  <COLUMN SOURCE="2" NAME="PhoneNumber" xsi:type="SQLINT"/>
  <COLUMN SOURCE="3" NAME="FirstName" xsi:type="SQLNVARCHAR"/>
  <COLUMN SOURCE="4" NAME="LastName" xsi:type="SQLNVARCHAR"/>
 </ROW>
</BCPFORMAT>
Save the file as "C:\ArielyBulkInsertTesting\Test02.xml"
Clean our table from the bad data:
truncate table #test
GO
Now try to execute the bulk insert; the data should be placed in the table correctly. If our data were formatted in this way (with a consistent format), then we would not need this article :-)
BULK INSERT #test
FROM 'C:\ArielyBulkInsertTesting\Test02.csv'
WITH (
    FORMATFILE='C:\ArielyBulkInsertTesting\Test02.xml'
    --, KEEPNULLS
);
Let's continue to work on our "real" data! First, clean the table from the previous data (truncate table #test).
In some cases we might build a format file which brings us error messages. We already know that the data will not fit all the records; this test will give us more information through the error message. Try this format file (C:\ArielyBulkInsertTesting\Test03.xml):
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <RECORD>
  <FIELD ID="1" xsi:type="CharTerm" TERMINATOR=','/>
  <FIELD ID="2" xsi:type="CharTerm" TERMINATOR=',\"' MAX_LENGTH="7"/>
  <FIELD ID="3" xsi:type="CharTerm" TERMINATOR=',' MAX_LENGTH="15" COLLATION="Hebrew_CI_AS"/>
  <FIELD ID="4" xsi:type="CharTerm" TERMINATOR='\r\n' MAX_LENGTH="15" COLLATION="Hebrew_CI_AS"/>
 </RECORD>
 <ROW>
  <COLUMN SOURCE="1" NAME="ID" xsi:type="SQLINT"/>
  <COLUMN SOURCE="2" NAME="PhoneNumber" xsi:type="SQLINT"/>
  <COLUMN SOURCE="3" NAME="FirstName" xsi:type="SQLNVARCHAR"/>
  <COLUMN SOURCE="4" NAME="LastName" xsi:type="SQLNVARCHAR"/>
 </ROW>
</BCPFORMAT>
Execute our bulk insert:
BULK INSERT #test
FROM 'C:\ArielyBulkInsertTesting\Test01.csv'
WITH (
    FORMATFILE='C:\ArielyBulkInsertTesting\Test03.xml'
    , MAXERRORS = 10
);
We get an error message which can help us a lot in this case:
Msg 4864, Level 16, State 1, Line 1 Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 4, column 2 (PhoneNumber).
Moreover, we can see that the rest of the records were inserted (using SQL Server 2012). When the data contains only a small number of inconsistent records, this can be the best and fastest way, as most of the data is inserted. Now we can just check which records do not exist in the table and fix them. The error message includes the number of the first problematic row.
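For example, a rough sketch of such a check could look like the query below. The hard-coded list of expected IDs (1 through 5) is just an illustration, not part of the original article; in practice you would compare against whatever source of expected keys you have.

-- Hedged sketch: list the expected IDs that did not make it into #test after the partial load.
select v.ID
from (values (1), (2), (3), (4), (5)) as v(ID)   -- illustrative list of IDs we expected to load
left join #test as t on t.ID = v.ID
where t.ID is null;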
In conclusion, the best format file we found fits the first and second columns without any problem. We recognized that the problems start in the third column.
This is the main step, as now we can use BULK INSERT to import the data into SQL Server. Since we found that our data does not have a consistent format, we are going to use a temporary table to import the data into.
* We don't have to use a temporary table; we can just use OPENROWSET to get the data and do the parsing on-the-fly. I will show this in step 3.
The basic idea is to bring all the data up to the problematic point (in our case, the first and second columns) into separate columns, as they should be in the final table, and to bring the rest of the data, from the problematic point to the end of the problematic part (or the end of the line if there is no other way to cut before it), into one column. So the third and fourth columns will be imported as one column.
Let's do it. We will use this format file (save as C:\ArielyBulkInsertTesting\Test04.xml), which is similar to the "Test01.xml" file, but without a separate third column:
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <RECORD>
  <FIELD ID="1" xsi:type="CharTerm" TERMINATOR=","/>
  <FIELD ID="2" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="7"/>
  <FIELD ID="3" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="30" COLLATION="Hebrew_CI_AS"/>
 </RECORD>
 <ROW>
  <COLUMN SOURCE="1" NAME="ID" xsi:type="SQLINT"/>
  <COLUMN SOURCE="2" NAME="PhoneNumber" xsi:type="SQLINT"/>
  <COLUMN SOURCE="3" NAME="FirstName_LastName" xsi:type="SQLNVARCHAR"/>
 </ROW>
</BCPFORMAT>
And execute these queries (drop the old table, create a new table with 3 columns, bulk insert the data, select and show the data):
DROP TABLE #test
GO

CREATE TABLE #test (
    ID int
    , PhoneNumber int
    , FirstName_LastName varchar(30)
)
GO

BULK INSERT #test
FROM 'C:\ArielyBulkInsertTesting\Test01.csv'
WITH (
    FORMATFILE='C:\ArielyBulkInsertTesting\Test04.xml'
    , MAXERRORS = 10
    --, KEEPNULLS
    --, DATAFILETYPE = 'native'
    --, CODEPAGE='RAW'
    --, ROWTERMINATOR='\r\n'
    --, FIRSTROW = 3
);
GO

select * from #test
GO
our results look like this:
The goal of this article is to show an optimal use of pure T-SQL to import data which is not well formatted into the database, in a structure that is appropriate and effective for parsing (step 3). I will not elaborate on step 3, parsing the data; there can be hundreds of different ways to do this for each case.
I'll show some sample solutions in step 3. Those solutions are not necessarily optimal parsing solutions, but they represent approaches in which I use different string functions to parse our data. Usually, parsing data in SQL Server is best done using a CLR function.
Now that we have imported the data, all that we need to do is parse the last column. These queries can do the job in our case:
select
    ID, PhoneNumber, FirstName_LastName
    , FN = case
            when CHARINDEX('",', FirstName_LastName, 1) > 0
            then LEFT(RIGHT(FirstName_LastName, LEN(FirstName_LastName) - 1), CHARINDEX('",', FirstName_LastName, 1) - 2)
            else ''
           END
    , LN = case
            when CHARINDEX(',"', FirstName_LastName, 1) > 0
            then SUBSTRING(FirstName_LastName, CHARINDEX(',"', FirstName_LastName, 1) + 2, LEN(FirstName_LastName) - CHARINDEX(',"', FirstName_LastName, 1) - 2)
            else ''
           END
from #test
go
-- i use the @ char, but you should use any combination of chars that cannot be in the data value!
-- i can clean all the " chars in one shot, as i know they are not part of my data
select
    ID, PhoneNumber
    , SUBSTRING(Temp, 0, charindex('@', Temp)) FN
    , SUBSTRING(Temp, charindex('@', Temp) + 1, LEN(Temp) - charindex('@', Temp)) LN
from (
    select
        ID, PhoneNumber
        , REPLACE(REPLACE(FirstName_LastName, '","', '@'), '"', '') Temp
    from #test
) T
go
After we have found the parsing approach, we can use a simple SELECT INTO query to move the data from the temporary table to the final table.
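For example, here is a minimal sketch of that step. The final table name dbo.FinalContacts is hypothetical (SELECT INTO creates it on the fly), and the expressions simply reuse the FN/LN logic from the first parsing query above.

-- Hedged sketch: move the parsed data from the temporary table into a final table.
select
    ID
    , PhoneNumber
    , FirstName = case when CHARINDEX('",', FirstName_LastName, 1) > 0
                       then LEFT(RIGHT(FirstName_LastName, LEN(FirstName_LastName) - 1), CHARINDEX('",', FirstName_LastName, 1) - 2)
                       else '' end
    , LastName  = case when CHARINDEX(',"', FirstName_LastName, 1) > 0
                       then SUBSTRING(FirstName_LastName, CHARINDEX(',"', FirstName_LastName, 1) + 2, LEN(FirstName_LastName) - CHARINDEX(',"', FirstName_LastName, 1) - 2)
                       else '' end
into dbo.FinalContacts    -- hypothetical final table name
from #test;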
Usually, if this is not a one-time operation, I prefer to use one query that does it all, without declaring a temporary table. I still need those steps to find my BULK INSERT query and format (steps 1 + 2) and to find the parsing function (step 3). Next, I convert my queries into an OPENROWSET import query like this (in our case study):
--FINAL TABLE
CREATE TABLE #FINAL (ID int, PhoneNumber int, FirstName varchar(15), LastName varchar(15))
GO

insert #FINAL (ID, PhoneNumber, FirstName, LastName)
select
    ID, PhoneNumber
    --, FirstName_LastName
    , case when CHARINDEX('",', FirstName_LastName, 1) > 0 then LEFT(RIGHT(FirstName_LastName, LEN(FirstName_LastName) - 1), CHARINDEX('",', FirstName_LastName, 1) - 2) else '' end
    , case when CHARINDEX(',"', FirstName_LastName, 1) > 0 then SUBSTRING(FirstName_LastName, CHARINDEX(',"', FirstName_LastName, 1) + 2, LEN(FirstName_LastName) - CHARINDEX(',"', FirstName_LastName, 1) - 2) else '' end
from OPENROWSET(
    BULK N'C:\ArielyBulkInsertTesting\Test01.csv'
    , FORMATFILE = N'C:\ArielyBulkInsertTesting\Test04.xml'
) a
GO
The basic idea is to bring all the data from the problematic point to the end of the problematic part (or the end of the line if there is no other way to cut before it) into one column. We can use a temporary table to store the data, and then we can parse the temporary column in any way that fits us. We can use T-SQL functions or CLR functions like SPLIT, clean some characters using REPLACE, or find a character's position using CHARINDEX, and so on. This all depends on your specific data, and it has nothing to do with bulk insert anymore :-)
We must separate the operation into two parts:
1. Insert the data using bulk insert into the database (into a temporary table, or using OPENROWSET) in such a way that we will be able to use it for step two.
2. Parse and split the text in the last column into the final columns.
* This article elaborates on step 1.
* A more complex case study in which I used this logic can be seen in the MSDN forum at this link: http://social.msdn.microsoft.com/Forums/en-US/5aab602e-1c6b-4316-9b7e-1b89d6c3aebf/bulk-insert-help-needed
* Usually it is much better to do the parsing using a CLR function. If you are not convinced by my recommendation, then you can check this link: http://www.sqlperformance.com/2012/07/t-sql-queries/split-strings
* If you can export the file in a consistent format that fits bulk insert, then you should do it! This is only a workaround solution.
* If you can build a well-formatted import file in advance from the original import file, using a small application that writes a new file, then do it! That is a much better solution, as most development languages handle text parsing better than SQL Server (T-SQL).
* If you can manage the order of the columns during the export, then try to make sure that you move all the problematic columns to the end. This will help us use the bulk insert more optimally, as we will need to parse fewer columns in step 3.
* Why not import all the data into one column in a temp table instead of STEP 1 & STEP 2?
This is always an option, but probably not a good one. In our case study we use a very simple table structure with 4 columns and only 7 records, but in real life we might get a table with 20 columns or more and several million records. If we have 2 columns with a potential problem and we can order the columns so that those columns come last, then we can import the entire data almost in the final data structure. It is much better to separate the data into as many columns as we can that fit the final table, and to minimize the use of parsing. Parsing is a heavy operation, and parsing the data after importing it will probably take much longer and use much more CPU.
In cases where you have to use complex parsing, it is much better to use a CLR solution. As I mentioned at the start, this is a pure T-SQL solution.
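For completeness, here is a rough sketch of that "single wide column" alternative, so you can see why it pushes all the work into the parsing step. The #raw table name is just for illustration.

-- Hedged sketch: load each whole line into one wide column and parse everything afterwards.
CREATE TABLE #raw (Line nvarchar(max));

BULK INSERT #raw
FROM 'C:\ArielyBulkInsertTesting\Test01.csv'
WITH (
    ROWTERMINATOR = '\n'
    , FIELDTERMINATOR = '\0'   -- a terminator that never appears in the data, so the whole line stays in one column
);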
* This article is based on several forum questions (more than 15, which I found using Google while checking only the first several result pages) that remained unanswered for too long. I did not find any solution or answer except my own answers based on this logic. This is a very easy solution, but we have to think outside the box to get it :-)
There are no other references for this solution as far as I know, and most of the forum questions that I found were closed without an answer, or by sending the questioner to a different solution like SSIS or a third-party application, or by saying that it cannot be done using bulk insert and pure T-SQL.
http://stackoverflow.com/questions/17552185/format-fields-during-bulk-insert-sql-2008
http://social.msdn.microsoft.com/Forums/en-US/5aab602e-1c6b-4316-9b7e-1b89d6c3aebf/bulk-insert-help-needed