This article provides information on how to cleanse complex source data using composite domains in DQS. How do you cleanse your data if you have multiple delimited fields in a single column or have data in delimited columns in a plain text file. For example, consider the following 2 sets of sample data that contain the list of US States and Capitals (purposely introduced errors in data for this tutorial):
The data is not correct in the above sample. For example Arizona, California, Indiana, and New Jersey are spelled incorrectly; Richmond and New York are not the capital cities for the Colorado and Washington states respectively. Since the source data is in a plain text file, we will use the DQS Cleansing transformation in SSIS to cleanse this data.
We will leverage the default database, DQS Data, in this example to cleanse the sample data.
You now have a State&Capital knowledge base that contains the StatesAndCapitals composite domain containing the following two domains: US - State and US - Places.
In this case, we will cleanse the comma delimited values as mentioned earlier in the sameple data under Scenario 1. When you have a single column with delimited values, you must map the source column directly to the composite domain.
When you have delimited data columns to be cleansed, you must use the delimter to extract the individual columns into SSIS, and then map the extracted input columns with the individual columns within the composite domain.
Kumar Vivek edited Revision 6. Comment: Minor edits
Richard Mueller edited Revision 11. Comment: Modified title casing, added tags