Improve Entity-Attribute-Value models with proper data types

Entity-Attribute-Value (EAV) models are a contentious topic in database design.  Flexibility is important but some things should not be sacrificed.

EAVs should use a table like this:

	NAME          VARCHAR2(4000)
	NUMBER_VALUE  NUMBER
	STRING_VALUE  VARCHAR2(4000)
	DATE_VALUE    DATE

EAVs should avoid a simpler table structure like this:

	NAME   VARCHAR2(4000)
	VALUE  VARCHAR2(4000)

Those extra columns don’t take away any of the benefits and alleviate many of the costs.

Pros and Cons

Here are some common arguments in EAV debates:

Pros

  1. Flexibility to add anything without schema changes.
  2. Looks nicer on the schema diagrams.

Cons

  1. Data type errors
  2. Bad performance.
  3. Lack of constraints.

Architects and application developers prefer EAV and DBAs tend to hate it.

The best design is a compromise.  Be honest about the importance of each pro and con.  While it is important to compromise on high level schema designs we should never compromise on low level type safety.

Flexible

Applications can still store anything.  Those 3 columns cover 99% of all data.  It’s easy to add more columns if necessary.

Some people might complain that those extra columns create more work because you need to identify the type.  But in practice, if you have a single value you will almost always need to know the type anyway.  Even if the value will only be displayed on a screen the type is useful for justifying the text.

If you truly have unstructured data it should still have a data type, such as JSON or XMLType.

Looks Nice

This reason usually isn’t that important. Developers and DBAs typically use an IDE as the primary interface for the database, not the PDFs generated by the data modeling tool.

But with the compromise solution this isn’t an issue, it only requires a few extra columns one time.

Data Type Errors

This is a subtly dangerous issue that you may not discover until it’s too late.

It is difficult to control the order a SQL query is executed.  Without that guaranteed order most developers will write unsafe queries like this:

	select *
	from eav
	where name = 'Date of Birth'
	and value = date '2000-01-01';

That query will usually work.  But it is not guaranteed to work because Oracle may check the value predicate before the name predicate, and not all values can convert to dates.

There is only one safe way to write that query:

	select *
	from
	(
		select *
		from eav
		where name = 'Date of Birth'
		and rownum >= 1
	)
	where value = date '2000-01-01';

That is the only safe way because ROWNUM is a special function that Oracle assumes means “these rows must be displayed in order”.  Oracle will never push a predicate or view merge into an inline view with a ROWNUM.

No other fix will work reliably.  A simple inline view can be trivially rewritten by the optimizer.  The physical order of predicates is ignored.  Hints are difficult to get right.  Even a CASE statement does not always guarantee short circuit evaluation in SQL (despite what the manual says).

Use the extra type-safe columns unless you want to write the ROWNUM trick a hundred times and are willing to thoroughly warn all SQL developers about the dangers in your data model.

Bad Performance

The extra type columns can significantly boost performance.

The exact type will be smaller and slightly faster to compare than a converted string.

The optimizer will have a little better information with the right types.

Most importantly, the right types let you avoid the horrible ROWNUM trick.  That gives the optimizer more flexibility to transform the query and choose a better plan.

Lack of Constraints

Using the right types doesn’t allow for smart constraints but at least it supplies some basic type checking.

Validating dates and numbers is much harder than most developers realize.

For dates, very few applications properly deal with internationalization.  For example, the month abbreviations are not the same in every language.

For numbers, very few applications can handle all number formats. Number validation sounds simple until you look at the Oracle manual’s complex syntax diagrams.  (Tip - those diagrams are helpful but contain a mistake.  If you can’t find it, don’t try to write a regular expression to handle number validation).  The Method5 PL/SQL lexer uses this regular expression. But don’t use it, this is only an example of how horrible the regular expression will be. It also doesn’t work with “+” or “-“, because the lexer considers those operators and not part of the number.

	([0-9]+\.[0-9]+|\.[0-9]+|[0-9]+)((e|E)(\+|-)?[0-9]+)?(f|F|d|D)?

Always Use the Right Type

Using the default Oracle data types costs almost nothing, avoids type errors, improves performance, and simplifies validation. Adding a few extra columns now can prevent a lot of problems later.

Written on November 14, 2017