RDBMS Fundamentals: Third Normal Form
Requirements for 3NF
A table satisfies third normal form if:
- the table satisfies the requirements for second normal form, and
- it contains only columns that are non-transitively dependent on the primary key.
What does this mean? Well, it means we are starting to get deep into the semantic meaning of your data.
Let’s start with an address table. Let’s create entries for the Marion County Clerk’s Office and a local yoga studio.
|1234||200 E WASHINTON ST||STE W122||INDIANAPOLIS||IN||46204|
|5678||1032 N MAIN ST||null||SPEEDWAY||IN||46224|
If we have spent any time around data, we have probably seen this data structure way too many times. However, did you know that this does not satisfy 3NF? There are tons of interdependent data values in this one line.
- Indianapolis is a city in Indiana. There is no Seattle, Indiana, and there is no Indianapolis, Washington.
- There are only so many valid ZIP codes in Indianapolis. The ZIP code and city are dependent on each other. In the grand scheme of addresses, there are relatively few valid city + state + ZIP code combinations.
- Not all 5 digit numbers are valid ZIP codes. Ideally, we could create a table of all valid ZIP codes.
- Not all building have multiple units (suites, apartment numbers, floors, etc.).
Let’s start breaking down this table so it satisfies 3NF.
Revision 1: City and State are related
First, we need a table for states.
Our states table should have unique indexes on both the
Second, we need a table for cities. Cities must belong to a single state.
Here’s our updated address table. The combination of
state_key should be unique.
|1234||200 E WASHINTON ST||STE W122||1||46204|
|5678||1032 N MAIN ST||null||2||46224|
Revision 2: ZIP Code and City are related
There are only so many valid combinations of city and ZIP code. A ZIP code can represent multiple cities, and a city can have multiple ZIP codes. This means that we require a many-to-many relationship to represent this model.
We already have a city table. This is our ZIP code table. (There are 63 ZIP codes in Indianapolis, IN. This table has been greatly reduced in size.)
zip_code value should be unique.
This is our many-to-many table for cities and ZIP codes. A ZIP code can have multiple cities, and a city can have multiple ZIP codes. The ZIP code for 46204 is only Indianapolis. The ZIP code for 46224 has both Indianapolis and Speedway addresses.
Every combination of
zip_code_key should be unique.
Our address table now looks like the following.
|1234||200 E WASHINGTON ST||STE W122||1|
|5678||1032 N MAIN ST||null||3|
We’re getting closer!
Revision 3: Unit number and street address are related
We’re almost there. We still have one dependency: the unit number and the street address are dependent. If this builder were to be demolished, all of the units would be destroyed with it. If a new building were to built in its place, the availability and numbers of units would also change.
Let’s call our this new thing the street address, since that’s what it really represents.
|1||200 E WASHINGTON ST||1|
|2||1032 N MAIN ST||3|
As expected, every combination of
city2zip_code_key should be unique.
And here’s our completed address table. We will make
line_2 non-nullable. Typically, a null value means an unknown value. In this case, it is a known empty value. Let’s make our data reflect that.
Finally, every combination of
line_2 should be unique.
Wait! This is a pain in the butt!
Ugh! I really just want to see line 1, line 2, city, state, ZIP code. Guess what! That’s why views exist!
CREATE VIEW [dbo].[vw_address] AS SELECT [dbo].[address].[address_key], [dbo].[street_address].[line_1], [dbo].[address].[line_2], [dbo].[city].[city_name], [dbo].[state].[state_name], [dbo].[state].[state_abbreviation], [dbo].[zip_code].[zip_code], [dbo].[city].[city_name] + ', ' + [dbo].[state].[state_abbreviation] + ' ' + [dbo].[zip_code].[zip_code] AS line_3 FROM [dbo].[address] INNER JOIN [dbo].[street_address] ON [dbo].[address].[street_address_key] = [dbo].[street_address].[street_address_key] INNER JOIN [dbo].[city2zip_code] ON [dbo].[street_address].[city2zip_code_key] = [dbo].[city2zip_code].[city2zip_code_key] INNER JOIN [dbo].[city] ON [dbo].[city2zip_code].[city_key] = [dbo].[city].[city_key] INNER JOIN [dbo].[state] ON [dbo].[city].[state_key] = [dbo].[state].[state_key] INNER JOIN [dbo].[zip_code] ON [dbo].[city2zip_code].[zip_code_key] = [dbo].[zip_code].[zip_code_key]
Remember, tools exist for a reason. The code-level manifestation of an address (line 1, line 2, city, state, ZIP) is not the same as an address in your RDBMS. As you can see, in this type of design, there are no null values in any tables and there is no duplication of values.
Note 1: If we were being really picky, we could make the street it’s own table. Streets do - from time to time - get renamed. Ideally, if a street were to get renamed, we would want everything on that street to get updated in one operation.
Note 2: I have never seen an address broken up like this. This is the way a data engineer thinks about data; this is not the typical way a software developer thinks about data.
Note 3: Nine-digit ZIP codes (ZIP+4) are beyond the scope of this post. It should be clear that the same rules apply. ZIP+4 have rules that make them dependent on address line 1.