Introduction

If you ever wondered when it makes sense to have a separate table versus fields on a table, this is the article for you.

We're gonna start and gradually move to a more complex evolving schema. It's an ecommerce website with products.

I did this for myself to understand trade offs and when things make sense.

1. Simple products table

-- Initial: "Just store products!"
products
  id
  name
  description
  price

First, we just store products.

Ignore the concept of stocks for now.

We wanna add a color to each product, let's do it!

2. Adding colors to products

-- "Let's add colors!"
products
  id
  name
  description
  price
  color  -- Simply added a column

We realize now that a product can't just have one color, it can have many.

And then we have sizes plus quantity for each color... oops

To model multiple colors on the current schema would be a pain and super messy.

3. Introducing a new table for variants

Taking size, colors and quantity into account, let's create a new table for variants.

-- "Oh wait, same product, different colors/sizes..."
products
  id
  name
  description
  base_price

product_variants
  id
  product_id
  color
  size
  price
  stock_quantity

We have the base for a product, including it's base price.

But if you've ever shopped, you've probably noticed that some products in bigger sizes are more expensive. This is also something we need to take into account.

We can look at each product_id and that way get the variants for each product.

A single variant contains:

Color
Size
Price
Stock quantity

For it to be effective, we'd have an index on product_id for the product_variants table.

CREATE INDEX ON product_variants (product_id);

SELECT * FROM product_variants WHERE product_id = 1; -- Fast

-- If you want to JOIN

SELECT * FROM products JOIN product_variants ON products.id = product_variants.product_id; -- Fast

The JOIN result looks something like this with for each row in the result:

id | name | description | base_price | id | product_id | color | size | price | stock_quantity

We have a new problem though. We got a call. Every time description changes, we need to keep track of the old descriptions due to legal reasons.

To keep track of all the descriptions, we need to add a new table for product descriptions.

4. Introducing a new table for descriptions

-- "Legal needs version history of descriptions..."
products
  id
  name
  current_version_id  -- Key change!
  base_price

product_variants
  id
  product_id
  color
  size
  price
  stock_quantity

product_versions
  id
  product_id
  description
  version_number -- monotonically increasing number - 1,2,3,4,5
  edited_at
  edited_by

current_version_id is a foreign key to the product_versions table. So whenever we show the user the product description, we immediately know which version to show.

And yes, we're gonna need a new index on current_version_id in the products table. And an index on product_id in the product_versions table.

CREATE INDEX ON products (current_version_id);
CREATE INDEX ON product_versions (product_id);

The first index is for the products table to quickly find the current version of a product.

The second index is for the product_versions table to quickly find all versions of a product. To be fair, this may be unnecessary. Assuming it'll be needed for an internal admin tool to see the history of a product. Latency should be fine.

version_number is a monotonically increasing number - 1,2,3,4,5. Monotonic means that it either only goes up or only goes down. In our case, it only goes up. These type of numbers are also called logical timestamps. Their job is to keep track of the order of events.

Let's go over indexes

I've noticed that I haven't explicitly explained indexes yet. More or less glossed over them. The problem without indexes is that we'd have to scan the entire table to find the data we want. Think of before hand knowing the index of the item you're looking for in an array. It's a quick lookup.

Indexes are a data structure that allows us to quickly look up data in a database. They are optimized for specific queries.

Let's go over some relevant ones for us:

CREATE INDEX idx_products_current ON products(current_version_id);
-- Problem it solves: When loading product pages, we need to quickly get current description
-- Without it: Every product lookup would require scanning product_versions table (VERY slow)

CREATE INDEX idx_product_versions ON product_versions(product_id, version_number); -- This is a composite index!
-- Problems it solves:
-- 1. "Show version history for product X" -> Relevant for our admin tool
-- 2. "Get specific version of product X" -> Relevant for our admin tool
-- We can use this for both because of composite index rule: left-most column rule!

A composite index is an index that has multiple columns. In our case, it's (product_id, version_number).

The left-most column rule means that we can use the index for queries that only use the first column. That's why "Show version history for product X" works, even though we only use product_id.

Now, for a composite index, you can't start with the second column. You always need to start with the first column.

Our composite index would NOT optimize this:

SELECT * FROM product_versions WHERE version_number = 1; -- Not optimized

Another thing I wanna cover about composite indexes is picking the more unique columns first.

For example, if we have color and id, color can be of 10 different colors let's say:

CREATE INDEX idx_product_variants_color ON product_variants(color, product_id);

This is a composite index, but it's not optimized.

Because color is not very unique, we literally have to scan all 10 colors, for ALL product_variants. Then we can find the ones we want based on product_id.

In JS with some pseudo code it might look like this:

const colorsSet = new Set();
const productVariants = [];

// 1. We literally start by scanning all product variants to get all 10 colors
const productVariantsWithRightColors = productVariants.filter(
  (productVariant) => colorsSet.has(productVariant.color)
);

// 2. Now we filter based on product_id
const productVariantsWithRightProductId = productVariantsWithRightColors.filter(
  (productVariant) => productVariant.product_id === productId
);

Now, .filter does go through the entire array, so it's not the best example, but my point here is that it's more efficient to first filter based on the most unique column. And with an index, this would be fast.