"Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowchart; it'll be obvious." — Fred Brooks, The Mythical Man-Month
We've chosen to use relational databases, specifically PostgreSQL, for storing some of our data. We like ACID, we like the ease of ad-hoc queryability, and we like the fact that databases add an additional layer of security and data-quality control. To make the most of this we should adopt some conventions, so that when we access PG from an ORM we don't bring too many ORM-isms into our data model. This ensures that other staff, who might be using other ORMs, can still work with our data, and it also keeps us from leaning on the ORM or application layer for work that an RDBMS can already do.
These are my preferred conventions; we will evolve them over time. I'll try to provide a justification for each one, but this is a discussion.
- All names (table, column, sequence, index, constraint, role, etc.) should be lowercase with underscores. Postgres does support AnYSortOF casing you'd like, but only if you quote the identifiers everywhere (unquoted identifiers are folded to lowercase), which makes manual querying painful.
- Table names should be a singular noun that describes one row: "account", not "accounts". Some people prefer plural; we just need a standard. My vote is for singular, as it makes SQL a little more natural to read, e.g., "SELECT * FROM account WHERE account.balance > 5000;".
- We’re using a relational database. Have relations. Very few tables should be islands.
- Foreign keys should be named "<table>_id", e.g., if the "account" table links to the "person" table, there should be a column in "account" called "person_id". Where there are multiple foreign keys to the same table, prefix the ids, e.g., "from_person_id" and "to_person_id".
- Foreign keys must have foreign key constraints. It makes the schema more readable, both by humans and introspection tools. It also prevents mistakes at the application layer.
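As a sketch of both conventions, here is a hypothetical "transfer" table with two foreign keys into "person", each enforced by a constraint (all names here are illustrative):

```sql
CREATE TABLE person (
    id SERIAL PRIMARY KEY
);

-- Two foreign keys to the same table, so the ids get prefixes,
-- and REFERENCES gives each one a real foreign key constraint.
CREATE TABLE transfer (
    id             SERIAL PRIMARY KEY,
    from_person_id integer NOT NULL REFERENCES person (id),
    to_person_id   integer NOT NULL REFERENCES person (id)
);
```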
- Serial columns should have the sequence as the default value for that column. E.g., if the "account" table has a primary key of "id", it should be defined (in SQL) as "id SERIAL PRIMARY KEY", which is a shortcut for "INTEGER NOT NULL DEFAULT nextval('account_id_seq')".
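To make the shortcut concrete, this is roughly what SERIAL expands to (note the sequence name already follows the table_column_seq convention):

```sql
CREATE TABLE account (
    id SERIAL PRIMARY KEY
);

-- ...is roughly equivalent to:
CREATE SEQUENCE account_id_seq;
CREATE TABLE account (
    id integer NOT NULL DEFAULT nextval('account_id_seq') PRIMARY KEY
);
-- Tie the sequence's lifetime to the column:
ALTER SEQUENCE account_id_seq OWNED BY account.id;
```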
- Never expose serial columns outside of the model layer. If any table is going to be exposed in any way via an API, it should have a UUID column that will be exposed instead of using the “id”.
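A sketch of the pattern, assuming we generate the UUID in the database ("gen_random_uuid()" is built in from Postgres 13; on older versions it comes from the pgcrypto extension):

```sql
-- The serial "id" stays internal; only "uuid" ever leaves the model layer.
CREATE TABLE account (
    id   SERIAL PRIMARY KEY,
    uuid uuid NOT NULL DEFAULT gen_random_uuid() UNIQUE
);
```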
- Index, constraint and sequence names should take the form table_column_[idx | uidx | seq | ck], for indexes, unique indexes, sequences and check constraints respectively.
- Unique indexes should encompass all the rules for uniqueness. If the "user" table can only have one copy of each user, consider a unique constraint on first_name, last_name, address and zip, and another on SSN, or whatever fits. There is nothing wrong with having the front end, back end and database all check this.
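A sketch of the naming and uniqueness conventions together ("app_user" and its columns are illustrative; "user" itself is a reserved word in Postgres, so it would need quoting):

```sql
CREATE TABLE app_user (
    id         SERIAL PRIMARY KEY,
    first_name text NOT NULL,
    last_name  text NOT NULL,
    address    text NOT NULL,
    zip        text NOT NULL,
    ssn        text NOT NULL
);

-- table_column_uidx naming for unique indexes:
CREATE UNIQUE INDEX app_user_ssn_uidx ON app_user (ssn);
CREATE UNIQUE INDEX app_user_identity_uidx
    ON app_user (first_name, last_name, address, zip);
```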
- Constraints should reflect business rules. Just because your application does sanity checking doesn't mean that some bozo at the terminal will do it. FYI, Josh has access to most of our machines and is a bozo.
- Postgres has a rich selection of native types (IP addresses, UUIDs, time intervals, polygons). Use them where appropriate. If your data is an IP address, stick it in an INET. If it is a UUID, there's a type for that.
- Postgres also supports enumerated types. If we have a relatively immutable small list of possible values for a column, use an enum.
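A sketch of both ideas on a hypothetical "account" table (the status values and column names are illustrative):

```sql
-- A small, relatively immutable list of values: use an enum, not free text.
CREATE TYPE account_status AS ENUM ('open', 'frozen', 'closed');

CREATE TABLE account (
    id            SERIAL PRIMARY KEY,
    status        account_status NOT NULL DEFAULT 'open',
    last_login_ip inet  -- native IP address type, not a varchar
);
```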
- If the SQL type is not descriptive enough of the data stored in a column, put the units of measurement in the column name. E.g., "height_meters" if we are storing a height in meters. God knows why we'd do that, but you get the idea.
- Don't be afraid of TEXT. If you want to store free-form text, VARCHAR(2048) isn't what you want. Postgres stores TEXT and VARCHAR identically, and it is smart enough to move large chunks of text out of the table either way, so VARCHAR(n) buys you nothing except a length check. If there is a genuine, strict length constraint, declare it; otherwise use TEXT.
- Don't be afraid of NUMERIC. We are dealing with money. Bigints would work, but then we rely on the application layer to do the right thing: each application needs to know what 12345 means in dollars. When we start having interest-bearing accounts, 4 decimal places may not be enough. Postgres supports arbitrary-precision numbers. We should standardize on NUMERIC(18,6) for money. And please be sure that your application doesn't silently translate arbitrary-precision numbers into IEEE-754 floats or similar. We all saw Superman III.
- Set reasonable DEFAULTs. If you have a column called "created" which records when a row was created, a reasonable default would be now(). I saw this as a default on one of our tables: "not null default ''::character varying". Not reasonable. If it's not supposed to be null, setting the default to '' is silly. At the very least, decide whether each column should be NULL.
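A sketch of reasonable defaults on a hypothetical "account" table, with the timestamp stamped by the database rather than the application:

```sql
CREATE TABLE account (
    id      SERIAL PRIMARY KEY,
    balance NUMERIC(18,6) NOT NULL DEFAULT 0,
    created timestamptz NOT NULL DEFAULT now()
);
```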
- Don't be afraid of schemas. Postgres supports multiple object namespaces within the same database. If you're unaware of schemas, you are probably creating objects in the "public" schema. If we ever get to a point where any database has dozens of tables, schemas are a good way to clarify the roles of each table. Look into it.
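As a sketch, with schema names that are purely illustrative:

```sql
-- Namespaces separating concerns within one database:
CREATE SCHEMA ledger;
CREATE SCHEMA audit;

CREATE TABLE ledger.account (
    id SERIAL PRIMARY KEY
);
-- Unqualified names resolve via search_path; "public" is just the default.
```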
- By default, don't denormalize. At our scale, it's bad form to have the same column in two tables that are joined by a 1:1 relationship. Doing this forces the application layer to carry extra logic just to keep the copies consistent.
- Many-to-many join tables should be named with the names of the two tables they join, e.g., "account_person".
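A sketch of such a join table, assuming "account" and "person" tables with serial "id" primary keys; the composite primary key doubles as the uniqueness rule for the pairing:

```sql
CREATE TABLE account_person (
    account_id integer NOT NULL REFERENCES account (id),
    person_id  integer NOT NULL REFERENCES person (id),
    PRIMARY KEY (account_id, person_id)
);
```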
- Log modifications. If tables have mutable columns, provide a _history table that keeps track of changes. If you’re so inclined, you can do this with triggers.
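A minimal sketch of the trigger approach, assuming an "account" table with a mutable "balance" column (all names here are illustrative):

```sql
CREATE TABLE account_history (
    id         SERIAL PRIMARY KEY,
    account_id integer NOT NULL REFERENCES account (id),
    balance    NUMERIC(18,6) NOT NULL,
    changed_at timestamptz NOT NULL DEFAULT now()
);

CREATE FUNCTION account_record_history() RETURNS trigger AS $$
BEGIN
    -- Record the pre-update state of the row.
    INSERT INTO account_history (account_id, balance)
    VALUES (OLD.id, OLD.balance);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER account_history_trg
    AFTER UPDATE ON account
    FOR EACH ROW EXECUTE PROCEDURE account_record_history();
```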