GETTING STARTED
Testing activities have sadly become more about verifying that every "t" is crossed and every "i" is dotted, when they should ideally focus on ascertaining the viability of the word thus formed. Within testing itself, data is an important spoke in the wheel: without good test data you are not really testing all the conditions for failure, and failure will catch you off guard later, when it happens in production.
The quality of data is critical to the quality of service (QoS) provided by applications. Bad or incomplete data can sometimes be more damaging than no data at all. Poor data quality can seriously hinder or damage the efficiency and effectiveness of organizations and businesses. If not identified and corrected early on, defective data can contaminate all downstream systems and information assets, with a strong cascading effect. Growing awareness of such repercussions has led to major public initiatives such as the "Data Quality Act" in the USA and Directive 2003/98 of the European Parliament. Enterprises must present data that meets very stringent data quality levels, especially in light of recent compliance regulations and standards.
If we wait until customers make wrong decisions based on bad or inaccurate data, we are in trouble. Data quality is often overlooked, and there is no well-defined testing process that provides guidance on ways to improve, maintain, and monitor data quality in a data-centric project. However, those bad times are now over: with SQL Server 2012 DQS we can turn the tables. This artifact will show you how.
Before we go any further, we want to take this opportunity to make it very clear that this work is not out to evangelize SQL Server 2012 DQS features, but to exhibit how they can be leveraged by a QA/Test team, making use of computer-assisted knowledge discovery from data and domain values.
In its latest avatar, Microsoft SQL Server (codenamed "Denali", released as SQL Server 2012) takes on the onus of ensuring data quality through its Data Quality Services (DQS).
The main concept behind DQS is a rapid, easy-to-deploy, and easy-to-use data quality system that can be set up and used practically in minutes. DQS is applicable across different scenarios, providing customers with capabilities that help improve the quality of their data. Data is usually generated by multiple systems and parties across organizational and geographic boundaries, and often contains inaccurate, incomplete, or stale data elements.
The following scenarios (Table-1) are the data quality problems addressed by DQS in SQL Server 2012 that are typical candidates for test scenarios in a data-centric project.
Data Quality Issue – Description

Completeness – Is all the required information available? Are data values missing, or in an unusable state? In some cases, missing data is irrelevant, but when the missing information is critical to a specific business process, completeness becomes an issue. Example: if you have an email field where only 50,000 values are present out of a total of 75,000 records, then the email field is 66.6% complete.

Conformity – Are data values expected to conform to specified formats? If so, do all the values conform to these formats? Maintaining conformance to specific formats is important in data representation, presentation, aggregate reporting, search, and establishing key relationships. Example: the gender codes in two different systems are represented differently; in one system the codes are defined as 'M', 'F', and 'U', whereas in the second system they appear as 0, 1, and 2.

Consistency – Do values represent the same meaning? Example: is revenue always presented in dollars, or also in euros?

Accuracy – Do data objects accurately represent the "real-world" values they are expected to model? Incorrect spellings of product or person names, addresses, and even untimely or stale data can impact operational and analytical applications. Example: a customer's address is a valid USPS address, but the ZIP code is incorrect and the customer name contains a spelling mistake.

Validity – Do data values fall within acceptable ranges? Example: salary values should be between 60,000 and 120,000 for position levels 51 and 52.

Duplication – Are there multiple, unnecessary representations of the same data objects within your data set? The inability to maintain a single representation for each entity across your systems poses numerous vulnerabilities and risks. Duplicates are measured as a percentage of the overall number of records. There can be duplicate individuals, companies, addresses, product lines, invoices, and so on. The following example depicts duplicate records existing in a data set:
Name            | Address                      | Postal Code | City      | State
Mag. Smith      | 545 S Valley View D. # 136   | 34563       | <Anytown> | New York
Margaret smith  | 545 Valley View ave unit 136 | 34563-2341  |           | New-York
Maggie Smith    | 545 S Valley View Dr         |             |           | NY.
*Sourced from MSDN FAQs on SQL Server 2012 DQS
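The completeness and duplication measures above are easy to sanity-check outside of DQS as well. The sketch below is plain Python, not a DQS API; the field names and normalization key are illustrative assumptions. It computes the completeness percentage from the email example in Table-1 and groups the near-duplicate records from the sample above by a crude normalized key:

```python
# Sketch: completeness % and naive duplicate-candidate grouping.
# Field names and records are illustrative, not from any real system.

def completeness(present, total):
    """Percentage of records that have a value in a given field."""
    return 100.0 * present / total

# The email example from Table-1: 50,000 values out of 75,000 records.
email_completeness = completeness(50_000, 75_000)  # ~66.7%

# Naive duplicate detection: normalize a few fields and group records
# that collapse to the same key (DQS uses far richer matching rules).
records = [
    {"name": "Mag. Smith", "postal": "34563"},
    {"name": "Margaret smith", "postal": "34563-2341"},
    {"name": "Maggie Smith", "postal": ""},
]

def key(rec):
    # Crude normalization: lowercased surname + first 5 postal digits.
    surname = rec["name"].split()[-1].lower()
    return (surname, rec["postal"][:5])

groups = {}
for rec in records:
    groups.setdefault(key(rec), []).append(rec["name"])

print(round(email_completeness, 1))  # 66.7
print(groups)
```

Note how even this crude key already pairs the first two "Smith" records; the third record falls into its own group only because its postal code is missing, which is exactly the kind of gap a real matching policy has to tolerate.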
The DQS knowledge base approach enables the organization, through its data experts, to efficiently capture and refine data quality knowledge in a Data Quality Knowledge Base (DQKB).
So that brings us to the DQKB: what exactly is it? Well, DQS is a knowledge-driven solution, and at its heart resides the DQKB. A DQKB stores all the knowledge related to a specific type of data source, and is maintained by the organization's data expert (often referred to as a data steward). For example, one DQKB can handle information about an organization's customer database, while another handles the employee database.
The DQKB contains data domains that relate to the data source (for example: name, city, state, zip code, ID). For each data domain, the DQKB stores all identified terms, spelling errors, validation and business rules, and reference data that can be used to perform data quality actions on the data source.
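To make the idea concrete, a domain in a DQKB can be pictured as a set of valid values plus correction rules. The sketch below is plain Python with made-up values; DQS stores this knowledge in its own catalog and manages it through the Data Quality Client, not through code like this. It models a "State" domain with known spelling errors mapped to their leading value:

```python
# Sketch of a DQKB-style domain: valid values plus synonym/spelling
# corrections. All values here are illustrative only.

state_domain = {
    "valid_values": {"New York", "California", "Texas"},
    # Spelling errors / synonyms mapped to their leading value.
    "corrections": {"New-York": "New York", "NY.": "New York", "NY": "New York"},
}

def cleanse(value, domain):
    """Return (output_value, status), mimicking a domain lookup."""
    if value in domain["valid_values"]:
        return value, "correct"
    if value in domain["corrections"]:
        return domain["corrections"][value], "corrected"
    return value, "invalid"

print(cleanse("New-York", state_domain))  # ('New York', 'corrected')
print(cleanse("Texas", state_domain))     # ('Texas', 'correct')
print(cleanse("Narnia", state_domain))    # ('Narnia', 'invalid')
```

The three statuses mirror the way cleansing results are typically triaged: values that are already correct, values corrected from known variants, and values that need human review.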
Test Teams & DQS: The ‘π’ Matrix
Indeed, with all of the above, we can help communities like test teams do great work on data quality. Elaborated below are ways and means to accelerate the adoption of SQL Server 2012 DQS in the testing life cycle. We are particularly looking at adoption of DQS within the testing community, largely due to its active involvement in quality assurance tasks pertaining to data quality, as well as the bouquet of features SQL Server 2012 brings to the table for the same.
The 'π' Matrix is simply a Phase vs. Issue matrix (hence 'PI'), indicating the various states a project can be in depending on the number of data quality issues encountered and the stage in the test execution cycle.
· When DQ issues observed at early stages of test execution result in stacks of bugs, immediate cleansing of the data is advised before the damage is aggravated; this is essentially the 'damage control' part.
· The client should be informed about the data quality (if the data is provided by the client), and also about the risk of having faulty data in the system. This is the 'alerting' part of the steps.
· Keep in mind that if DQ issues are not fixed by the UAT phase, client satisfaction will take a beating, with a direct impact on the company's reputation.
· Lastly, due to data issues, many bugs can remain undiscovered. It is suggested that DQ implementation be part of test planning as soon as data issues are observed in the PoC phase, to avoid major pitfalls later.
· Once you are in the Nightmare quadrant, please understand that it is almost impossible to move into the Superstar quadrant. It is clearly the lack of DQ implementation at early stages that has caused the project to take a beating.
· One should learn from the failure and religiously add DQ implementation right from the planning phase of the next release, including DQ as part of the scope, design, and testing. Measuring and monitoring DQ on a daily/weekly basis, and acting on important DQ issues to fix them early, has already been suggested.
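The matrix described above can be read as a simple two-way classification: test phase on one axis, volume of DQ issues on the other. The sketch below is our own illustration in plain Python; only the 'Superstar' and 'Nightmare' quadrants are named in the text, so the other two labels (and the issue threshold) are placeholders, not part of the article's model:

```python
# Sketch of the Phase-vs-Issue ('PI') matrix as a lookup.
# Only 'Superstar' and 'Nightmare' are named in the article; the other
# two quadrant labels and the threshold are our own placeholders.

def pi_quadrant(phase, dq_issue_count, threshold=10):
    """Classify project state by test phase and volume of DQ issues."""
    early = phase in ("PoC", "early")
    many = dq_issue_count > threshold
    if early and not many:
        return "Superstar"        # few issues, caught early
    if early and many:
        return "Damage control"   # placeholder label
    if not early and not many:
        return "Watch"            # placeholder label
    return "Nightmare"            # many issues, late (e.g. UAT)

print(pi_quadrant("PoC", 2))   # Superstar
print(pi_quadrant("UAT", 40))  # Nightmare
```

The point of the classification is the one made above: the only transition into Superstar runs through the early phases, so DQ measurement has to start in PoC.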
· Teams assume that data quality is only important for Business Intelligence and Data Warehousing projects and does not apply to them.
· A lot of emphasis goes into testing the UI and the code/logic/business rules, but data is treated as a second-class citizen.
· Until recently, DQ did not get its due attention and hence was ignored and unfashionable.
· Lack of testing tools to do it (well, just wait: now there is one).
· Lack of skill set (our testers are traditionally trained in testing APIs or functional aspects of applications more than the data flowing through them).
· Even if teams want to do it, there is no good guidance or documentation around it.
· The effort required to test data quality is often underestimated.
1. SQL Server 2012 RC0 – DQSInstaller.exe should be available.
Please refer to the installation instructions on the DQS Forum.
2. Sample database (refer to the next page for the sample DB used in this document). By sample DB we mean the database that needs to be cleansed. It could be your client's data, or a sample of the client's data.
1. As per the business requirement, Gender should be either male or female; any other value is not acceptable.
2. Age should be between 21 and 30.
3. Duplicate rows should be identified.
1. As per the business requirement, Email should be in the proper format.
2. Degree should be exactly 3 characters.
3. Null values should be identified.
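Before wiring these rules into a DQS knowledge base, they can be sanity-checked in a few lines of code. The sketch below is plain Python, not a DQS artifact; the column names, the email pattern, and the sample rows are our own assumptions. It expresses the rules listed above as simple predicates over one record:

```python
import re

# Sketch: the business rules above as simple checks on one record.
# Column names, email regex, and sample rows are illustrative assumptions.

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate(row):
    """Return the list of rule violations for one record."""
    issues = []
    if row.get("gender") not in ("male", "female"):
        issues.append("gender must be male or female")
    if row.get("age") is None or not (21 <= row["age"] <= 30):
        issues.append("age must be between 21 and 30")
    if not row.get("email") or not EMAIL_RE.match(row["email"]):
        issues.append("email is missing or malformed")
    if not row.get("degree") or len(row["degree"]) != 3:
        issues.append("degree must be exactly 3 characters")
    return issues

good = {"gender": "female", "age": 25, "email": "a@b.com", "degree": "BSc"}
bad = {"gender": "X", "age": 45, "email": "not-an-email", "degree": None}

print(validate(good))       # []
print(len(validate(bad)))   # 4
```

Null handling falls out of the same checks: `row.get(...)` returns None for missing columns, so a null Gender, Age, Email, or Degree is reported as a violation rather than silently passing.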
1. Creating the knowledge base – not a core testing-team activity, but the test team can do it if the project team is not adopting SQL Server 2012 DQS formally.
2. Creating the data quality project.
3. Executing the data quality project.