LLMs like ChatGPT can be a great assistant for automating foreign key discovery. Here, I walk through a Python implementation with GPT-4 that uses a public dataset on BigQuery to demonstrate how it works in practice.
Check out the code on GitHub: https://github.com/321k/FKScout.
We’ll use the Overture Maps dataset, which provides free and open map data. The end result will be a Mermaid domain model diagram of its tables and relationships.
Exploring a new dataset can take some time, and being able to generate a domain model diagram automatically is both useful and fun.
1. Database schema
The first step is to download the database schema.
query = f"""
SELECT
table_name,
column_name
FROM
`{project_id}.{dataset_id}.INFORMATION_SCHEMA.COLUMNS`
ORDER BY
table_name, ordinal_position
"""
# execute the query
query_job = client.query(query)
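The prompt in the next step expects the schema as a single plain-text string. A minimal way to build it, assuming the schema_df DataFrame above:

# Serialize the schema into a compact CSV string for the LLM prompt
schema_string = schema_df.to_csv(index=False)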
2. Primary keys
We then get GPT-4 to identify the primary keys in our dataset. Here is the message I use:
messages = [
    {
        "role": "system",
        "content": "You are an AI assistant specializing in database schema analysis."
    },
    {
        "role": "user",
        "content": (
            "Identify all potential primary keys in the following database schema. "
            "Provide the results in JSON format.\n\n"
            f"{schema_string}\n\n"
        ),
    },
]
Function calling with GPT-4 is critical here, as it lets us get structured data back. This is the function definition I use:
functions = [
    {
        "name": "primary_keys",
        "description": "Find primary keys in the database schema.",
        "parameters": {
            "type": "object",
            "properties": {
                "keys": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "table_name": {"type": "string", "description": "Name of the table."},
                            "column_name": {"type": "string", "description": "Name of the primary key."},
                        },
                        "required": ["table_name", "column_name"],
                    },
                },
            },
            "required": ["keys"],  # Ensure the "keys" array is always provided
        },
    }
]
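Putting the messages and the function definition together, the API call looks roughly like this. A minimal sketch using the openai v1 Python client; I call it openai_client to keep it separate from the BigQuery client, and collect the result into a pandas DataFrame for the later steps:

import json

import pandas as pd
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = openai_client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    functions=functions,
    function_call={"name": "primary_keys"},  # force the structured function response
)

# The function arguments come back as a JSON string matching the schema above
arguments = json.loads(response.choices[0].message.function_call.arguments)
primary_keys_df = pd.DataFrame(arguments["keys"])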
The result is a table of primary keys identified by GPT-4. It does a good job with keys that are not simply named “id”; a column like address_id, for example, would also be caught.
| table_name | column_name |
| --- | --- |
| address | id |
| administrative_boundary | id |
| boundary | id |
| building | id |
| building_part | id |
| connector | id |
| division | id |
| division_area | id |
| division_boundary | id |
| infrastructure | id |
| land | id |
| land_cover | id |
| land_use | id |
| locality | id |
| locality_area | id |
| place | id |
| segment | id |
| water | id |
3. Foreign keys
The third step in the symbolic analysis is to identify the foreign keys. In this step, I make one API call to OpenAI per table with the following message:
messages = [
    {
        "role": "system",
        "content": "You are an AI assistant specializing in database schema analysis."
    },
    {
        "role": "user",
        "content": (
            "Identify all potential foreign keys in the following database table:\n\n"
            f"{table_data_csv}\n\n"
            "Please use the list of primary keys provided below as a reference:\n\n"
            f"{primary_keys_csv}\n\n"
            "Provide the results in JSON format.\n\n"
        ),
    },
]
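Looping over the tables then looks roughly like this. A minimal sketch that reuses openai_client from the earlier snippet and assumes schema_df and primary_keys_df from the previous steps, a hypothetical build_messages helper that wraps the message template above, and a foreign_keys function definition (here foreign_key_functions) analogous to primary_keys, returning a keys array with table_name, column_name, referenced_table and referenced_column:

import json

import pandas as pd

primary_keys_csv = primary_keys_df.to_csv(index=False)
all_foreign_keys = []

for table_name, table_df in schema_df.groupby("table_name"):
    table_data_csv = table_df.to_csv(index=False)
    response = openai_client.chat.completions.create(
        model="gpt-4",
        messages=build_messages(table_data_csv, primary_keys_csv),  # hypothetical helper
        functions=foreign_key_functions,  # analogous to the primary_keys definition above
        function_call={"name": "foreign_keys"},
    )
    arguments = json.loads(response.choices[0].message.function_call.arguments)
    all_foreign_keys.extend(arguments["keys"])  # assumes the same "keys" array shape

foreign_keys_df = pd.DataFrame(all_foreign_keys)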
We now have a list of candidate database relationships, ready to test with statistical analysis.
| table_name | column_name | key_type | referenced_table | referenced_column |
| --- | --- | --- | --- | --- |
| building_part | building_id | foreign | building | id |
| division_boundary | parent_division_id | foreign | division | id |
| division_boundary | division_id | foreign | division | id |
| division_boundary | division_ids | foreign | division | id |
| division_boundary | capital_division_ids | foreign | division | id |
| division | parent_division_id | foreign | division | id |
| division | division_id | foreign | division | id |
| division | capital_division_ids | foreign | division | id |
| boundary | parent_division_id | foreign | division | id |
| boundary | division_id | foreign | division | id |
| boundary | capital_division_ids | foreign | division | id |
| locality_area | context_id | foreign | locality | id |
| locality_area | locality_id | foreign | locality | id |
| segment | connectors | foreign | connector | id |
| division_area | division_id | foreign | division | id |
| division_area | parent_division_id | foreign | division | id |
| division_area | capital_division_ids | foreign | division | id |
| place | addresses | foreign | address | id |
| administrative_boundary | context_id | foreign | administrative_boundary | id |
| administrative_boundary | locality_id | foreign | locality | id |
4. Statistical analysis
As LLMs don’t always get it right, it’s essential to validate the results. I find that a few checks are usually enough:
- The primary key is unique
- The foreign key column actually exists
- The relationship is valid (the foreign key values appear in the referenced table)
This is what the primary key uniqueness check looks like:
SELECT
COUNT(*) AS total_rows,
COUNT(DISTINCT {primary_key}) AS unique_rows
FROM `{project_id}.{dataset}.{table}`;
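The other two checks can be rolled into one query per candidate relationship: count the non-null foreign key values that have no match in the referenced table. A minimal sketch, assuming the same placeholder variables plus foreign_key, referenced_table and referenced_column; array-valued columns such as division_ids would need an UNNEST first:

SELECT
  COUNT(*) AS orphan_rows
FROM `{project_id}.{dataset}.{table}` AS child
LEFT JOIN `{project_id}.{dataset}.{referenced_table}` AS parent
  ON child.{foreign_key} = parent.{referenced_column}
WHERE child.{foreign_key} IS NOT NULL
  AND parent.{referenced_column} IS NULL;

A relationship passes when orphan_rows comes back as zero (or below whatever tolerance you are comfortable with).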
How much better is o1 versus GPT-4?
Once we have established the relationships, we have an interesting benchmark for LLMs: how many valid relationships can a model identify? Using the same steps, I ask the newer o1 model to repeat the analysis.
An initial review suggests o1 is much better, but once we run the statistical validations, it turns out that GPT-4 had already captured the relationships correctly. A more complex database schema might yield more interesting results.
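Finally, the validated relationships can be rendered as the Mermaid domain model diagram mentioned at the start. A minimal sketch, assuming the relationships that passed validation sit in a pandas DataFrame called validated_fks with the same columns as the foreign key table above:

def to_mermaid(validated_fks):
    # Build a Mermaid erDiagram definition from the validated relationships
    lines = ["erDiagram"]
    for _, row in validated_fks.iterrows():
        # One-to-many edge from the referenced table to the referencing table,
        # labelled with the foreign key column
        lines.append(
            f'    {row["referenced_table"]} ||--o{{ {row["table_name"]} : "{row["column_name"]}"'
        )
    return "\n".join(lines)

print(to_mermaid(validated_fks))

Pasting the output into any Mermaid renderer produces the domain model diagram.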