Using AI to automate foreign key discovery

Check out the code on Github: https://github.com/321k/FKScout

When setting up a data warehouse, understanding the relationships between various objects—both within a system and across systems—is essential. Foreign keys are the connective tissue, linking tables and datasets. When dealing with undocumented legacy systems, discovering these relationships manually can be tedious. Automated foreign key discovery, aided by AI, helps you get up to speed in a new data environment quickly.

When there is no documentation or expert to refer to, there are two main approaches to identifying foreign keys:

Symbolic analysis
Statistical analysis

Symbolic analysis

Symbolic analysis means you inspect the names of columns in the tables to identify patterns. A table called customer might have a primary key called customer_id or id. A table called purchase might have a column called purchase.customer_id pointing to the customer table.

Prefix and suffix conventions can be another clue. A column might be named customer_fk, indicating that it’s pointing to the customer object.

If the system uses strong naming conventions (e.g. Ruby on Rails) you can review the documentation of the framework to understand how foreign keys should be named.

Systems with database constraints or indices, foreign keys often have explicit or implicit constraints like foreign_key or index. Scanning database metadata for these keywords can help uncover relationships.

If you have access to query logs or usage data, you can analyse join patterns. Frequently joined columns across tables are often strong candidates for foreign key relationships.

AI excels at symbolic analysis and will be able to find these patterns in column names or logs easily.

Statistical analysis

Symbolic analysis helps narrow down foreign key candidates, but statistical analysis is how you validate those assumptions

This method requires you to inspect the values in the columns. For example, if a column identified through symbolic analysis appears to contain UUIDs, you can leverage the uniqueness property of UUIDs to detect relationships. The unique property of these allow you to accurately detect relationships.

Checking that all foreign keys have a match in the table it’s referencing is another validation point.

You can also use the timestamp when a record was created to identify records created at the same time to infer relationships, and count the number of IDs present in the foreign key candidate from the primary key table.

Doing symbolic analysis with AI

Thanks to tools like ChatGPT and Gemini, it’s easy to get the computer to perform the symbolic analysis. With powerful compute platforms such as BigQuery, the statistical analysis to evaluate foreign key candidates can also be applied at scale.

I have prepared a sample data set to show how it’s done.

First, I export the tables and columns from my data set.

SELECT 
  table_name,
  column_name
FROM 
  `your_project_id.your_dataset.INFORMATION_SCHEMA.COLUMNS`
ORDER BY 
  table_name, column_name;

I ask ChatGPT to use symbolic analysis to identify primary keys and foreign keys.

Using symbolic analysis, please find foreign key and primary key candidates from this export of a database:

table_name,column_name
Country,alpha2isocode
Country,country
Country,countryid
account,accountid
account,accountnumber
...

Then, I ask ChatGPT to create a diagram of the relationships using Mermaid markdown. I get this comprehensive illustration that I can now start interrogating using statistical analysis.

Using AI to support statistical analysis

While AI isn’t great at statistical analysis, it’s great at explaining how to carry out the tests. Some methods that can be applied with the support of ChatGPT:

Validate uniqueness of primary keys
Foreign key integrity checks
Nullability analysis
Cardinality analysis
Distribution analysis
Anomaly detection

o1 versus 4o

I ran the same prompts with the latest OpenAI model o1. o1 identified 52 relationships where 4o identified nine relationships, generating a much more comprehensive relationship map.

About Erik Edin