When setting up a data warehouse, understanding the relationships between various objects—both within a system and across systems—is essential. Foreign keys are the connective tissue, linking tables and datasets. When dealing with undocumented legacy systems, discovering these relationships manually can be tedious and time-consuming. Automated foreign key discovery, aided by AI, helps you get up to speed in a new data environment quickly.
When there is no documentation or expert to refer to, there are two main approaches to identifying foreign keys:
- Symbolic analysis
- Statistical analysis
Symbolic analysis
Symbolic analysis means you inspect the names of columns in the tables to identify patterns. A table called customer
might have a primary key called customer_id
or id
. A table called purchase
might have a column called purchase.customer_id
pointing to the customer table.
Prefix and suffix conventions can be another clue. A column might be named customer_fk
, indicating that it’s pointing to the customer object.
If the system uses strong naming conventions (e.g. Ruby on Rails) you can review the documentation of the framework to understand how foreign keys should be named.
Systems with database constraints or indices, foreign keys often have explicit or implicit constraints like foreign_key
or index
. Scanning database metadata for these keywords can help uncover relationships.
If you have access to query logs or usage data, you can analyse join patterns. Frequently joined columns across tables are often strong candidates for foreign key relationships.
AI excels at symbolic analysis and will be able to find these patterns in column names or logs easily.
Statistical analysis
Symbolic analysis helps narrow down foreign key candidates, but statistical analysis is how you validate those assumptions
This method requires you to inspect the values in the columns. For example, if a column identified through symbolic analysis appears to contain UUIDs, you can leverage the uniqueness property of UUIDs to detect relationships. The unique property of these allow you to accurately detect relationships.
You can also use the timestamp when a record was created to identify records created at the same time to infer relationships, and count the number of IDs present in the foreign key candidate from the primary key table.
Doing symbolic analysis with AI
Thanks to tools like ChatGPT and Gemini, it’s easy to get the computer to perform the symbolic analysis. With powerful compute platforms such as BigQuery, the statistical analysis can to evaluate foreign key candidates can also be applied at scale.
I have prepared a sample data set to show how it’s done.
First, I export the tables and columns from my data set.
SELECT
table_name,
column_name
FROM
`your_project_id.your_dataset.INFORMATION_SCHEMA.COLUMNS`
ORDER BY
table_name, column_name;
I ask ChatGPT to use symbolic analysis to identify primary keys and foreign keys.
Using symbolic analysis, please find foreign key and primary key candidates from this export of a database:
table_name,column_name
Country,alpha2isocode
Country,country
Country,countryid
account,accountid
account,accountnumber
...
Then, I ask ChatGPT to create a diagram of the relationships using Mermaid markdown. I get this comprehensive illustration that I can now start interrogating using statistical analysis.
Using AI to support statistical analysis
While AI isn’t great at statistical analysis, it’s great at explaining how to carry out the tests. Some methods that can be applied with the support of ChatGPT:
- Validate uniqueness of primary keys
- Foreign key integrity checks
- Nullability analysis
- Cardinality analysis
- Distribution analysis
- Anomaly detection
o1 versus 4o
I ran the same prompts with the latest OpenAI model o1. o1 identified 52 relationships where 4o identified nine relationships, generating a much more comprehensive relationship map.
Leave a Reply