2015-08-17

7 tips to improve data quality using database constraints

When it comes to validating application's data, most Java developers think about frontend/backend validation, JSR 303 etc. Sadly, many ignore the database, even though it allows some easy to use, declarative validations.

NOT NULL constraint (for the sake of completeness only)

CREATE TABLE USERS
(
  USER_ID NUMERIC PRIMARY KEY,
  USER_LOGIN VARCHAR(255) NOT NULL,
  ...
);
If an attribute must have a value, we should specify it at DB level. Plain and simple.

1. Foreign key

CREATE TABLE USERS_GROUPS
(
  USGR_ID NUMERIC PRIMARY KEY,
  USER_ID NUMERIC NOT NULL CONSTRAINT USGR_USER_FK REFERENCES USERS(USER_ID),
  GROUP_ID NUMERIC NOT NULL CONSTRAINT USGR_GROUP_FK REFERENCES GROUPS(GROUP_ID),
  ...
);
If an attribute is a PK from another table, it should be marked as such. Foreign keys enforce referential integrity. Combined with NOT NULL they allow to use JOINs instead of LEFT JOINs.

2. Unique index


Very useful in case of many-to-many relationship. It ensures uniqueness of the relationship.
CREATE TABLE USERS_GROUPS
(
  USGR_ID NUMERIC PRIMARY KEY,
  USER_ID NUMERIC NOT NULL CONSTRAINT USGR_USER_FK REFERENCES USERS(USER_ID),
  GROUP_ID NUMERIC NOT NULL CONSTRAINT USGR_GROUP_FK REFERENCES GROUPS(GROUP_ID),
  ...
);

CREATE UNIQUE INDEX USERS_GROUPS_DATA_UQ ON USERS_GROUPS (USER_ID, GROUP_ID);
This index guarantees that a user belongs to a group no more than once.

3. Partial index


Suppose that when a user leaves a group, we don't delete row from USER_GROUPS, but only mark the association as invalid (using USGR_VALID_TO column).
CREATE TABLE USERS_GROUPS
(
  USGR_ID NUMERIC PRIMARY KEY,
  USER_ID NUMERIC NOT NULL CONSTRAINT USGR_USER_FK REFERENCES USERS(USER_ID),
  GROUP_ID NUMERIC NOT NULL CONSTRAINT USGR_GROUP_FK REFERENCES GROUPS(GROUP_ID),
  USGR_VALID_FROM DATE NOT NULL,
  USGR_VALID_TO DATE,
  ...
);
In this case we cannot use the previous index - we need a different one. A user may be assigned to a group multiple times, but at most one of those associations may have date USGR_VALID_TO = NULL:
CREATE UNIQUE INDEX USERS_GROUPS_DATA_UQ ON USERS_GROUPS (USER_ID, GROUP_ID)
WHERE USGR_VALID_TO IS NULL;
Note that Oracle's syntax for partial index is rather strange:
CREATE UNIQUE INDEX USERS_GROUPS_DATA_UQ ON USERS_GROUPS (
  CASE WHEN USGR_VALID_TO IS NULL THEN USER_ID ELSE NULL END, 
  CASE WHEN USGR_VALID_TO IS NULL THEN GROUP_ID ELSE NULL END
);

4. Check constraint - enums


Suppose we have the following table:
CREATE TABLE ORDERS
(
  ORDER_ID NUMERIC PRIMARY KEY,
  ORDER_STATUS VARCHAR(20) NOT NULL,
  ...
);
If column ORDER_STATUS is mapped to an enum in Java, we can (and should) limit its allowed values:
ALTER TABLE ORDERS ADD CONSTRAINT ORDER_STATUS_CHK 
CHECK (ORDER_STATUS IN ('NEW', 'CONFIRMED', 'SENT', 'DELIVERED', 'CANCELLED'));
Just think how many times your ORM mappings exploded because someone put invalid value in such column.

5. Check constraint - implication

CREATE TABLE ORDERS
(
  ORDER_ID NUMERIC PRIMARY KEY,
  ORDER_STATUS VARCHAR(20) NOT NULL,
  ORDER_DELIVERY_ADDR_ID NUMERIC,
  ...
);
Suppose we have a requirement, that if ORDER_STATUS is CONFIRMED, SENT or DELIVERED, then we must have a delivery address (we may have it in any other status though).
ALTER TABLE ORDERS ADD CONSTRAINT ORDER_DELIVERY_ADDR_CHK CHECK (
ORDER_STATUS NOT IN ('CONFIRMED', 'SENT', 'DELIVERED') 
OR ORDER_DELIVERY_ADDR_ID IS NOT NULL
);
We use the fact that p => q <=> ~p v q.

6. Check constraint - if and only if

CREATE TABLE ORDERS
(
  ORDER_ID NUMERIC PRIMARY KEY,
  ORDER_STATUS VARCHAR(20) NOT NULL,
  ORDER_CANCEL_REASON VARCHAR(255),
  ...
);
Suppose we have a requirement, that if (and only if) ORDER_STATUS is CANCELLED, then ORDER_CANCEL_REASON must be given:
ALTER TABLE ORDERS ADD CONSTRAINT ORDER_CANCEL_REASON_CHK CHECK (
(ORDER_STATUS = 'CANCELLED' AND ORDER_CANCEL_REASON IS NOT NULL)
OR (ORDER_STATUS != 'CANCELLED' AND ORDER_CANCEL_REASON IS NULL)
);

7. Check constraint - any invariant


The last examples can actually be generalized to protecting any invariant within a single table. Anything that evaluates to a boolean can be validated using check constraint, for example:

Date "to" must be >= date "from":
ALTER TABLE USERS_GROUPS ADD CONSTRAINT USGR_DATES_CHK CHECK (
USGR_VALID_TO IS NULL OR USGR_VALID_TO >= USGR_VALID_FROM
);
An order can be placed by either user or company:
ALTER TABLE ORDERS ADD CONSTRAINT ORDER_USER_OR_COMP_CHK CHECK (
(USER_ID IS NOT NULL AND COMP_ID IS NULL) OR (USER_ID IS NULL AND COMP_ID IS NOT NULL)
);
The list may go on and on - check constraints are probably the most underestimated ones.

Summary


In today hipster times of NoSQL databases writing about constraints in relational databases might seem old fashioned. However, I believe it's good to be paranoid about your app's data validation and quality.

Once invalid data gets into your DB it is much harder to fix it. You may be forced to write complex migration scripts for it. Or maybe you will have to include code in your application's logic, that can handle faulty data. Either way, prevention is better than cure.

Obviosuly DB should not be the only place where you validate data. You should rather treat it as the last line of defense. Keep in mind that sometimes it can be your only line of defense:
  • frontend validation can be easily bypassed and backend validation may contain errors
  • an administrator may modify data directly in DB
  • some sort of migration script executed at new version deployment may modify data directly in DB
  • your DB may be populated with data coming from other systems