LLMs like ChatGPT can be a great assistant for automating foreign key discovery. Here, I walk through a Python implementation with GPT-4 that uses a public dataset on BigQuery to demonstrate how it works in practice.
Check out the code on GitHub: https://github.com/321k/FKScout.
We’ll use the Overture Maps dataset, which provides free and open map data. The end result will be a Mermaid domain model diagram of its tables and relationships.
Exploring a new dataset can take some time, and being able to generate a domain model diagram automatically is both useful and fun.
1. Database schema
The first step is to download the database schema.
query = f"""
SELECT
table_name,
column_name
FROM
`{project_id}.{dataset_id}.INFORMATION_SCHEMA.COLUMNS`
ORDER BY
table_name, ordinal_position
"""
# execute the query
query_job = client.query(query)
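The prompt in the next step expects the schema as a single plain-text string. A minimal way to build it, assuming the schema_df DataFrame above:

# Serialize the schema into a compact CSV string for the LLM prompt
schema_string = schema_df.to_csv(index=False)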
2. Primary keys
We then get GPT-4 to identify the primary keys in our dataset. Here is the message I use:
messages = [
    {
        "role": "system",
        "content": "You are an AI assistant specializing in database schema analysis."
    },
    {
        "role": "user",
        "content": (
            "Identify all potential primary keys in the following database schema. "
            "Provide the results in JSON format.\n\n"
            f"{schema_string}\n\n"
        ),
    },
]
Function calling with GPT-4 is critical here, as it lets us get structured data back. This is the function definition I use:
functions = [
    {
        "name": "primary_keys",
        "description": "Find primary keys in the database schema.",
        "parameters": {
            "type": "object",
            "properties": {
                "keys": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "table_name": {"type": "string", "description": "Name of the table."},
                            "column_name": {"type": "string", "description": "Name of the primary key."},
                        },
                        "required": ["table_name", "column_name"],
                    },
                },
            },
            "required": ["keys"],  # Ensure the "keys" array is always provided
        },
    }
]
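Putting the messages and the function definition together, the API call looks roughly like this. A minimal sketch using the openai v1 Python client; I call it openai_client to keep it separate from the BigQuery client, and collect the result into a pandas DataFrame for the later steps:

import json

import pandas as pd
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = openai_client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    functions=functions,
    function_call={"name": "primary_keys"},  # force the structured function response
)

# The function arguments come back as a JSON string matching the schema above
arguments = json.loads(response.choices[0].message.function_call.arguments)
primary_keys_df = pd.DataFrame(arguments["keys"])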
The result is a table of primary keys identified by GPT-4. It does a good job with keys that are not simply named “id”; a column like address_id, for example, would also be caught.
| table_name | column_name |
| --- | --- |
| address | id |
| administrative_boundary | id |
| boundary | id |
| building | id |
| building_part | id |
| connector | id |
| division | id |
| division_area | id |
| division_boundary | id |
| infrastructure | id |
| land | id |
| land_cover | id |
| land_use | id |
| locality | id |
| locality_area | id |
| place | id |
| segment | id |
| water | id |
3. Foreign keys
The third step in the symbolic analysis is to identify the foreign keys. In this step, I make one API call to OpenAI per table with the following message:
messages = [
    {
        "role": "system",
        "content": "You are an AI assistant specializing in database schema analysis."
    },
    {
        "role": "user",
        "content": (
            "Identify all potential foreign keys in the following database table:\n\n"
            f"{table_data_csv}\n\n"
            "Please use the list of primary keys provided below as a reference:\n\n"
            f"{primary_keys_csv}\n\n"
            "Provide the results in JSON format.\n\n"
        ),
    },
]
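Looping over the tables then looks roughly like this. A minimal sketch that reuses openai_client from the earlier snippet and assumes schema_df and primary_keys_df from the previous steps, a hypothetical build_messages helper that wraps the message template above, and a foreign_keys function definition (here foreign_key_functions) analogous to primary_keys, returning a keys array with table_name, column_name, referenced_table and referenced_column:

import json

import pandas as pd

primary_keys_csv = primary_keys_df.to_csv(index=False)
all_foreign_keys = []

for table_name, table_df in schema_df.groupby("table_name"):
    table_data_csv = table_df.to_csv(index=False)
    response = openai_client.chat.completions.create(
        model="gpt-4",
        messages=build_messages(table_data_csv, primary_keys_csv),  # hypothetical helper
        functions=foreign_key_functions,  # analogous to the primary_keys definition above
        function_call={"name": "foreign_keys"},
    )
    arguments = json.loads(response.choices[0].message.function_call.arguments)
    all_foreign_keys.extend(arguments["keys"])  # assumes the same "keys" array shape

foreign_keys_df = pd.DataFrame(all_foreign_keys)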
We now have a list of candidate database relationships, ready to test with statistical analysis.
| table_name | column_name | key_type | referenced_table | referenced_column |
| --- | --- | --- | --- | --- |
| building_part | building_id | foreign | building | id |
| division_boundary | parent_division_id | foreign | division | id |
| division_boundary | division_id | foreign | division | id |
| division_boundary | division_ids | foreign | division | id |
| division_boundary | capital_division_ids | foreign | division | id |
| division | parent_division_id | foreign | division | id |
| division | division_id | foreign | division | id |
| division | capital_division_ids | foreign | division | id |
| boundary | parent_division_id | foreign | division | id |
| boundary | division_id | foreign | division | id |
| boundary | capital_division_ids | foreign | division | id |
| locality_area | context_id | foreign | locality | id |
| locality_area | locality_id | foreign | locality | id |
| segment | connectors | foreign | connector | id |
| division_area | division_id | foreign | division | id |
| division_area | parent_division_id | foreign | division | id |
| division_area | capital_division_ids | foreign | division | id |
| place | addresses | foreign | address | id |
| administrative_boundary | context_id | foreign | administrative_boundary | id |
| administrative_boundary | locality_id | foreign | locality | id |
4. Statistical analysis
As LLMs don’t always get it right, it’s essential to validate the results. I find that a few checks are usually enough:
- The primary key is unique
- The foreign key column actually exists
- The relationship is valid (the foreign key values appear in the referenced table)
This is what the primary key uniqueness check looks like:
SELECT
COUNT(*) AS total_rows,
COUNT(DISTINCT {primary_key}) AS unique_rows
FROM `{project_id}.{dataset}.{table}`;
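The other two checks can be rolled into one query per candidate relationship: count the non-null foreign key values that have no match in the referenced table. A minimal sketch, assuming the same placeholder variables plus foreign_key, referenced_table and referenced_column; array-valued columns such as division_ids would need an UNNEST first:

SELECT
  COUNT(*) AS orphan_rows
FROM `{project_id}.{dataset}.{table}` AS child
LEFT JOIN `{project_id}.{dataset}.{referenced_table}` AS parent
  ON child.{foreign_key} = parent.{referenced_column}
WHERE child.{foreign_key} IS NOT NULL
  AND parent.{referenced_column} IS NULL;

A relationship passes when orphan_rows comes back as zero (or below whatever tolerance you are comfortable with).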
How much better is o1 versus GPT-4?
Once we have established the relationships, we have an interesting benchmark for LLMs: how many valid relationships can a model identify? Using the same steps, I ask the newer o1 model to repeat the analysis.
An initial review suggests o1 is much better, but once we run the statistical validations, it turns out that GPT-4 had already captured the relationships correctly. A more complex database schema might yield more interesting results.
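Finally, the validated relationships can be rendered as the Mermaid domain model diagram mentioned at the start. A minimal sketch, assuming the relationships that passed validation sit in a pandas DataFrame called validated_fks with the same columns as the foreign key table above:

def to_mermaid(validated_fks):
    # Build a Mermaid erDiagram definition from the validated relationships
    lines = ["erDiagram"]
    for _, row in validated_fks.iterrows():
        # One-to-many edge from the referenced table to the referencing table,
        # labelled with the foreign key column
        lines.append(
            f'    {row["referenced_table"]} ||--o{{ {row["table_name"]} : "{row["column_name"]}"'
        )
    return "\n".join(lines)

print(to_mermaid(validated_fks))

Pasting the output into any Mermaid renderer produces the domain model diagram.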