Testing a Data Vault
Introduction
In general, you should always test your data, no matter what modeling approach you follow, no matter which tool you use. The combination of dbt and Data Vault 2.0 makes it very easy to test the data early on, in the Raw Data Vault.
Testing your data typically consists out of two parts, technical and business tests. Business tests are closely tied to the data concepts and contents of your organization and are not to be discussed here. This section focuses on technical tests of a Raw Data Vault, and how to implement them in dbt.
Hard Constraints vs. Soft Constraints
Hard Constraints
A Hard constraint is a constraint enforced on one or multiple columns, directly on the database. When inserting into a column with a defined constraint, the database will check, whether the inserted data violates the constraint or not. If it would violate it, the data would not be inserted. A couple of examples for database constraints:
- Primary Key constraint on Hub or Link Hashkeys, within Hubs or Links
- Primary Key constraint on Hashkey + load_date within a Satellite
- Foreign Key constraint between Hub Hashkey in Link, and Hub Hashkey in Hub
- Foreign Key constraint between Hashkey in Satellite, and Hashkey in Parent entity
- Not Null constraint on specific columns in RDV or BDV
Soft Constraints
Compared to Hard Constraints, Soft Constraints do not stop data, that violates a constraint, to be loaded into the target entity. Instead, a warning is shown. This ensures, that no data is lost.
Soft Constraints can be implemented as tests, that are applied on your data after loading a table.
Constraints in Data Vault 2.0
In Data Vault 2.0, hard-constraints on the database are typically seen as a bad practice, since they would lead to the possibility, that raw data is not captured, if it would violate a constraint. You always want to catch the good, bad, and ugly data, no matter if it violates some assumptions. If for some reason, you really need to use hard database constraints, you need to be clear about the risk of loosing data, and it is recommended to only use hard constraints, if Error Marts are implemented.
As a general recommendation, Soft Constraints should be the first choice, when using Data Vault 2.0
Soft Constraints in dbt
As said before, a soft constraint can be implemented as a test. That makes soft constraints a perfect use case for dbt tests. Every constraint can be defined as a test on a column, or on multiple columns, inside .yml files.
Testing the Raw Data Vault in dbt
To start basic testing of a Raw Data Vault, lets turn all Primary Key and Foreign Key assumptions of Data Vault 2.0 into dbt tests. This creates a list of tests per entity type:
- Hubs
- Hashkey
- not_null
- unique
- Hashkey
- Links
- Link-Hashkey
- not_null
- unique
- Foreign Hashkeys
- relationship to all connected Hubs
- Link-Hashkey
- Satellites (v0)
- General
- unique_combination_of_columns
- Hashkey
- LoadDate
- unique_combination_of_columns
- Hashkey
- relationship to parent Hub/Link
- General
- Non-Historized Links
- Link-Hashkey
- not_null
- unique
- Foreign Hashkeys
- relationship to all connected Hubs
- Link-Hashkey
- Non-Historized Satellites (v0)
- Hashkey
- not_null
- unique
- relationship to parent NH-Link
- Hashkey
- Multi Active Satellites (v0)
- General
- unique_combination_of_columns
- Hashkey
- LoadDate
- Multi Active Key(s)
- unique_combination_of_columns
- Hashkey
- relationship to parent Hub/Link
- General
- Reference Hubs
- Reference Key(s)
- unique (optionally unique_combination_of_columns, if multiple Reference Keys)
- not_null
- Reference Key(s)
- Reference Satellites (v0)
- General
- unique_combination_of_columns
- Reference Key(s)
- LoadDate
- unique_combination_of_columns
- Reference Key(s)
- relationship to parent Hub/Link
- General
- Record Tracking Satellites
- General
- unique_combination_of_columns
- Hashkey
- LoadDate
- unique_combination_of_columns
- Hashkey
- relationship to parent Hub/Link
- General