Linking Data in PostgreSQL Without Explicit Keys: A Practical Guide

Databases often evolve in unpredictable ways. Tables that were initially designed to operate independently might later need to be connected. This situation can arise in various scenarios, such as when data is generated by separate processes, systems, or applications that lack coordination.

In this article, we’ll explore how to link two tables in PostgreSQL based on their creation timestamps when no explicit keys exist. Specifically, we’ll demonstrate how to establish a relationship between records where the difference in creation times is no more than 2 seconds.

The Problem #

Let’s consider a realistic example: an e-commerce platform where orders and payments were initially tracked independently.

  • No explicit linking keys were added between these tables because they were designed for separate workflows.
  • Payments and orders are created by different processes, introducing potential time delays.
  • The only feasible way to connect them is by using their created_at timestamps, assuming payments happen within a few seconds of the corresponding order being placed.

The Data #

Here’s how the tables are structured:

Orders Table #

ColumnTypeDescription
idSERIALPrimary key
order_numberVARCHARUnique identifier for the order
created_atTIMESTAMPTimestamp of order creation

Payments Table #

ColumnTypeDescription
idSERIALPrimary key
transaction_idVARCHARUnique identifier for the payment
created_atTIMESTAMPTimestamp of payment creation

Sample data:

INSERT INTO orders (order_number, created_at)
VALUES
    ('ORD001', '2024-12-01 10:00:00'),
    ('ORD002', '2024-12-01 10:00:05');

INSERT INTO payments (transaction_id, created_at)
VALUES
    ('TXN12345', '2024-12-01 10:00:01'),
    ('TXN67890', '2024-12-01 10:00:07');

The Solution #

Therefore, we need to create that relation among these tables where the created time difference between the created_at values in both tables is not longer than 2 seconds apart. This can be achieved with a query calculated on time difference via EXTRACT(EPOCH) filtering out the rows that fulfill that condition.

Creating a View #

CREATE VIEW order_payment_links AS
SELECT
    o.id AS order_id,
    o.order_number,
    p.id AS payment_id,
    p.transaction_id,
    o.created_at AS order_created_at,
    p.created_at AS payment_created_at
FROM
    orders o
JOIN
    payments p
ON
    ABS(EXTRACT(EPOCH FROM (o.created_at - p.created_at))) <= 2;

This view dynamically generates the linked data without modifying the original tables.

Querying the View You can retrieve the linked data with a simple query:

SELECT * FROM order_payment_links;

Result:

order_idorder_numberpayment_idtransaction_idorder_created_atpayment_created_at
1ORD0011TXN123452024-12-01 10:00:002024-12-01 10:00:01

Here, the first order (ORD001) is linked to the first payment (TXN12345) because the timestamps differ by only 1 second. The second order and payment are not linked due to the greater time difference.

Why This Approach Works #

This solution is effective in scenarios where explicit foreign keys are missing, and retrofitting them isn’t feasible. By leveraging timestamp-based linking:

  • No schema changes are required, making it safe for existing systems.
  • Flexible linking is possible, based on business-specific rules.

However, this method has limitations:

  • Timestamp precision is critical; inaccurate timestamps could lead to incorrect links.
  • Performance may degrade as the dataset grows because the query compares every possible pair.

Conclusion #

When working with systems where tables were not originally designed to be related, linking records by timestamps can provide a practical workaround. PostgreSQL’s robust query capabilities make it easy to establish relationships dynamically.

This approach isn’t perfect for every use case, but it’s a valuable tool when retrofitting connections into an existing database without disrupting the current setup.

Comments