A Data Quality Firewall Architecture

Organisations know that Data Quality (DQ) is a key foundation for a successful business. Business Intelligence, Reporting, Analytics, Data Warehouses and Master Data Management are pretty much wasted effort if you cannot trust your data. To make matters worse, systems integration efforts can lead to ‘bad’ data spreading like a virus through your Enterprise Service Bus (ESB) into all connected systems, even those that had fairly good data quality to begin with.

This article discusses the architectural concept of a Data Quality Firewall (DQF), which allows organisations to enforce their data quality standards on data in flight, anywhere: Up-Stream, In-Stream and Down-Stream.

Data Quality Lifecycle

When data enters an organisation it is sometimes said to be ‘at the beginning’ of the data quality lifecycle. Data can enter through all sorts of different means, e.g. emails, online portals, phone, paper forms or automated B2B processes.

Up-Stream refers to new data entering the organisation, whereas In-Stream means data being transferred between systems (e.g. through your ESB or Hub). Down-Stream systems are typically data stores that already contain potentially unclean data, such as repositories, databases or existing CRM/ERP systems.

Some people regard a Data Warehouse as the end of the data quality lifecycle, meaning there is no further data quality work necessary because all the logic and rules have already been applied Up-, In- or Down-Stream.

However, when you start your DQ initiative you need to get a view of your data quality across all your systems, including your data warehouse. You achieve this through profiling. Some software vendors offer ‘free’ initial profiling as a conversation starter, which may be a worthwhile first step to get your DQ indicators.

Data Quality Rules and Logic

Profiling your vital systems allows you to extract data quality rules which you can implement centrally, so that the same rules and standards can be re-used enterprise (or domain) wide. Profiling also equips you with data quality indicators, showing you how good your data really is on a per-system basis.

Understanding your business processes and looking at the data quality indicators enables you to associate a $-value with your bad data. From this point onwards it becomes much easier to pick and choose which systems/repositories to focus on (unless your organisation is undergoing a major strategic revamp, in which case you need to consider the target Enterprise Architecture as well).
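
As a rough illustration, the sketch below computes two simple data quality indicators (completeness and format validity) over a handful of records. The field names, validation rules and sample data are hypothetical and not taken from any specific profiling tool.

```python
import re

# Hypothetical sample extracted from one source system (e.g. a CRM export).
records = [
    {"customer_id": "C001", "postcode": "2000", "email": "jane@example.com"},
    {"customer_id": "C002", "postcode": "",     "email": "not-an-email"},
    {"customer_id": "C003", "postcode": "ABCD", "email": None},
]

# Example data quality rules you might extract from profiling (assumed for illustration).
rules = {
    "postcode": lambda v: bool(v) and re.fullmatch(r"\d{4}", v) is not None,
    "email":    lambda v: bool(v) and re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
}

def profile(records, rules):
    """Return the share of records passing each rule - a simple per-field DQ indicator."""
    indicators = {}
    for field, rule in rules.items():
        passed = sum(1 for r in records if rule(r.get(field)))
        indicators[field] = passed / len(records)
    return indicators

print(profile(records, rules))  # e.g. {'postcode': 0.33..., 'email': 0.33...}
```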

Another question has always been when and how to control the quality of the data. In the early days we began to implement data quality with input field verification, spell checks, confirmation responses and message standards (e.g. EDIFACT). Organisations then found themselves duplicating the same rules in different places: front-end, middleware and backend. Then field length changes came along (à la Year 2000, entering global markets, or mergers and acquisitions) and you had to start all over again.

At the last APAC Gartner conference in Sydney I heard people suggest that data quality rules only need to be applied at the warehouse. I personally think this can be dangerous and needs to be evaluated carefully. If the warehouse is the only system that stores data, this might make sense; in any other case it means you cannot trust the data outside the warehouse.

Zooming In – The Data Quality Firewall

A DQ firewall is a centrally deployed system which offers domain- or enterprise-wide data quality functionality. The job of this system is cleansing, normalisation and standardisation (and possibly match and merge if it is part of MDM).
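
To make the firewall's job a little more concrete, here is a minimal sketch of what cleansing and standardisation could look like for an address record. The abbreviation list and rules are purely illustrative assumptions, not a real cleansing engine.

```python
# Hypothetical standardisation rules a DQ firewall might apply to an address record.
STREET_ABBREVIATIONS = {"st": "Street", "rd": "Road", "ave": "Avenue"}

def cleanse_address(record):
    """Trim whitespace, normalise case and expand common abbreviations (illustrative only)."""
    cleaned = dict(record)
    street = " ".join(record.get("street", "").split())  # collapse repeated whitespace
    words = [STREET_ABBREVIATIONS.get(w.lower().rstrip("."), w) for w in street.split()]
    cleaned["street"] = " ".join(w.title() if w.islower() else w for w in words)
    cleaned["country"] = record.get("country", "").strip().upper() or "UNKNOWN"
    return cleaned

print(cleanse_address({"street": "12  george st.", "country": " au "}))
# {'street': '12 George Street', 'country': 'AU'}
```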

In an Event Driven Architecture (EDA) all messages are simply routed through the data quality rules engine first. This is done by making the DQ firewall the (possibly only) subscriber to all originating messages from the core systems (#1 in the image). All the other interested systems then subscribe to the messages emitted by the DQ firewall, which means they receive messages with quality data (#2).

Data Quality Architecture Diagram

The diagram shows the Core Systems as both publishers and subscribers, emitting an event message (e.g. CustomerAddress) which is picked up by the Semantic Hub. The Semantic Hub transforms it into the appropriate message format and routes it to the Data Quality Firewall. The DQF then does its job and re-emits the CustomerAddress message with qualified data, which is then routed to the subscribing systems via the Semantic Hub.
The subscribers can be other systems as well as the system that originally published the data.
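
A minimal sketch of that publish/subscribe flow is shown below, with a toy in-memory stand-in for the Semantic Hub. The topic names, message shape and cleansing step are assumptions for illustration only.

```python
from collections import defaultdict

class SemanticHub:
    """Toy in-memory pub/sub hub standing in for the ESB / Semantic Hub."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        for handler in self._subscribers[topic]:
            handler(message)

hub = SemanticHub()

# 1) The DQ firewall is the (possibly only) subscriber to raw CustomerAddress events.
def dq_firewall(message):
    cleaned = {**message, "postcode": message["postcode"].strip(), "originator": "DQF"}
    hub.publish("CustomerAddress.clean", cleaned)   # 2) re-emit with quality data

hub.subscribe("CustomerAddress.raw", dq_firewall)

# Interested systems (including the original publisher) subscribe to the clean topic.
hub.subscribe("CustomerAddress.clean", lambda m: print("SystemB received:", m))
hub.subscribe("CustomerAddress.clean", lambda m: print("SystemC received:", m))

# SystemA publishes a raw event; subscribers only ever see the cleansed version.
hub.publish("CustomerAddress.raw", {"originator": "SystemA", "postcode": " 2000 "})
```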

In an SOA scenario the architecture is similar, using service calls to the appropriate service offered by the Data Quality engine. Care needs to be taken if the DQ service is required to take part in transactions (consider time-outs, system availability, scalability, etc.).
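
As a rough sketch of the SOA variant, the snippet below calls a placeholder DQ cleansing service with an explicit time-out and a fallback, so an unavailable DQ service does not block the business transaction. It uses the third-party requests library purely for brevity; the URL, payload shape and dq_status flag are assumptions.

```python
import requests

DQ_SERVICE_URL = "https://dq.example.internal/cleanse"   # placeholder endpoint

def cleanse_via_service(record, timeout_seconds=2.0):
    """Call the central DQ service; fall back to the raw record if it is unavailable."""
    try:
        response = requests.post(DQ_SERVICE_URL, json=record, timeout=timeout_seconds)
        response.raise_for_status()
        return response.json()
    except requests.RequestException:
        # Time-out / availability handling: do not block the caller,
        # flag the record so it can be re-cleansed later.
        return {**record, "dq_status": "uncleansed"}
```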

Your New Data Quality Ready Target Architecture?

The benefits of a centrally deployed Data Quality engine include re-usability of rules, ease of maintenance, natural consolidation of rules, quick response to change, pervasiveness of change, and clear assignment of ownership and responsibility to Data Stewards (who owns which rules and the associated data entities).

Things to consider are a feedback mechanism (in case of bad data) back to the originating system, as it might affect the current system design, or the introduction of a Data Quality/Issue Tracker Portal which allows Data Stewards to intervene in cases where cleansing cannot be done automatically.

Compared with the overhead of distributed approaches such as input field validation duplicated across multiple systems, a central Data Quality Firewall architecture is far more enterprise ready, delivers more long-term benefits, and is cheaper in terms of setup, maintenance and overall ROI.

The Small Bang Approach

The beauty of the EDA approach is that you can easily route messages through the Data Quality Firewall on a system-by-system or message-by-message basis. You simply change the routing of the messages in the routing table of the Semantic Hub.

Below is an example for the message type ‘CustomerAddress’ emitted by SystemA:

SystemB and SystemC subscribe to CustomerAddress messages emitted by SystemA.

Message Type       Content Based Routing           Subscribers
CustomerAddress    Xpath(/Originator)=’SystemA’    B, C
Account            Xpath(/Originator)=’SystemC’    A

To enable the Data Quality Firewall we change the subscriber to DQF. From then on, all CustomerAddress messages from SystemA are delivered to the DQF. Once the DQF has applied the data quality rules, SystemB and SystemC receive clean and trustworthy data. The Account data remains unchanged in this example.

Message Type       Content Based Routing           Subscribers
CustomerAddress    Xpath(/Originator)=’SystemA’    DataQualityFirewall (DQF)
CustomerAddress    Xpath(/Originator)=’DQF’        B, C, A
Account            Xpath(/Originator)=’SystemC’    A
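
A minimal sketch of how such a content-based routing table could be represented and evaluated in code is shown below; the XPath predicate is simplified to a plain field comparison, and the system names follow the tables above.

```python
# Routing table after enabling the DQ firewall for CustomerAddress (see table above).
routing_table = [
    {"message_type": "CustomerAddress", "originator": "SystemA", "subscribers": ["DQF"]},
    {"message_type": "CustomerAddress", "originator": "DQF",     "subscribers": ["B", "C", "A"]},
    {"message_type": "Account",         "originator": "SystemC", "subscribers": ["A"]},
]

def route(message):
    """Content-based routing: match message type and originator, return the subscribers."""
    for rule in routing_table:
        if (rule["message_type"] == message["type"]
                and rule["originator"] == message["originator"]):
            return rule["subscribers"]
    return []

print(route({"type": "CustomerAddress", "originator": "SystemA"}))  # ['DQF']
print(route({"type": "CustomerAddress", "originator": "DQF"}))      # ['B', 'C', 'A']
```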

A possible next step could then be to quality control the Account message data. This approach allows you to consolidate your Data Quality step by step across your entire organisation.

Cheers,
Andy