DDL

Kubling uses SQL-like DDL files to define TABLEs, which are organized within SCHEMAs. A SCHEMA serves as a container for metadata describing various objects, such as TABLEs, VIEWs, or PROCEDUREs.

Each SCHEMA is part of a Virtual Database (VDB), defined within the primary Descriptor bundle.

The real advantage of Kubling’s database virtualization engine lies in its flexibility: you can redefine the structure of TABLEs regardless of the underlying system’s actual model. This allows you to tailor data representations to your needs while still reflecting the source system's objects.

This flexibility is crucial for creating viewpoints tailored to specific data interactions, enabling you to adapt the data model to your needs rather than conforming to the system’s structure.

Additionally, Kubling allows the definition of TABLEs based on APIs, even those returning complex JSON documents. This capability simplifies working with deeply nested data structures, making it easier to transform them into a tabular format.

💡

DDL files are parsed as templates, meaning that you can use context properties.
See more about template parser.

Define a `TABLE`

CREATE TABLE {table-name} (
    <table-element> (,<table-element>)*
    (,<constraint>)*
) [OPTIONS (<options-clause>)]
 
 
<table-element> ::=
    {column-name} <data-type> <element-attr> <options-clause>
 
<data-type> ::=
    varchar | boolean | integer | double | date | timestamp .. (see Data Types)
 
<element-attr> ::=
    [AUTO_INCREMENT] [NOT NULL] [PRIMARY KEY] [UNIQUE] [INDEX] [DEFAULT {expr}]
 
<constraint> ::=
    CONSTRAINT {constraint-name} (
        PRIMARY KEY <columns> |
        FOREIGN KEY (<columns>) REFERENCES tbl (<columns>)
        UNIQUE <columns> |
        INDEX <columns>
 
<columns> ::=
    ( {column-name} [,{column-name}]* )
 
<options-clause> ::=
    <key> <value>[,<key>, <value>]*

Sample

CREATE FOREIGN TABLE NAMESPACE
   (clusterName string OPTIONS(val_variable 'cluster_name'),
    clusterUrl string OPTIONS(val_constant '{{ schema.properties.kubernetes_api_url }}'),
    schema string OPTIONS(val_constant '{{ schema.name }}'),
    metadata__name string,
    metadata__labels json OPTIONS(parser_format 'asJsonPretty'),
    status__phase string,
    PRIMARY KEY(metadata__name),
    UNIQUE(clusterUrl, metadata__name))
OPTIONS(updatable true,
        supports_idempotency false,
        tags 'kubernetes;{{ schema.properties.cluster_name }};namespace');

Double underscore in field name has a special meaning in the context of a Script Document Data Source. See here for more information.

Options

⚠️

Following options work ony in document data sources. However, we are implementing them also in non-document Adapters, like POSTGRESQL.
You can follow the progress in roadmap section.

Options is the mechanism used by the Engine to pass meta-information from the TABLE definition to the query processor (QP).
Internally, each listed OPTION but tags is referred as DIRECTIVE since, as mentioned before, it can change the behavior of the QP

Common `TABLE` Directives

Directive	Type	Options	Description
`ANNOTATION`	String	Any	A short natural language description that provides semantic meaning for the object. This is used to build the Semantic Schema and helps LLMs understand the role and purpose of the object.
`updatable`	Boolean	True/False	Indicates whether the QP allows `INSERT`, `UPDATE` and `DELETE` operations on this `TABLE`
`tags`	String	Any	Defines tags for the `TABLE`, which can be used to get `TABLE`s based on tags criteria.
`cache`	String	cache 'option' Options: `enabled`, `disabled`, `schemaDefault`	Defines the cache behavior for this `TABLE` when caching is enabled at the data source level.
`relationships`	String	Comma-separated list of table names	General associations between this object and others. Use when the direction or type of relationship is not essential.
`relationship_affects`	String	Comma-separated list of table names	Indicates which objects this one can influence or change.
`relationship_affected_by`	String	Comma-separated list of table names	Indicates which objects can influence or change this one.
`relationship_generates`	String	Comma-separated list of table names	Denotes outputs or artifacts generated by this object.
`relationship_caused_by`	String	Comma-separated list of table names	Used to express root-cause-like relationships, especially in diagnostic or event-driven systems.

Common Field Directives

Directive	Type	Options	Description
`ANNOTATION`	String	Any	A short natural language description that provides semantic meaning for the object. This is used to build the Semantic Schema and helps LLMs understand the role and purpose of the object.
`val_constant`	String	Any	Defines a constant field value.
`val_variable`	String	Any	The value of the field is get from the properties. More information.
`val_pk`	String	[field+field+...]	Creates a calculated `PK` based on a field set. Since the value is not stored anywhere, when filtering using this field, the QP replaces the `WHERE` section by the original list of field's values.
`call_func`	String	call_func 'fn_name<field, field,...>'	Calls a function (built-in or user-defined) to get the field value. Function parameters can be other's fields values.

Specific directives are listed in each data source documentation.

Schema post-precessing in Document data sources

Suppose our data source interacts with the GitHub API, and we are retrieving a specific repository. GitHub’s response would be something like the following

{
    "id": 1296269,
    "node_id": "MDEwOlJlcG9zaXRvcnkxMjk2MjY5",
    "name": "app-backend",
    "full_name": "demo/app-backend",
    "owner": {
      "login": "demo",
      "id": 1,
      "node_id": "MDQ6VXNlcjE=",
      "avatar_url": "https://github.com/images/error/demo_happy.gif",
      "gravatar_id": "",
      "url": "https://api.github.com/users/demo",
      "html_url": "https://github.com/demo",
      ...
    },
    "private": false,
    "html_url": "https://github.com/demo/app-backend",
    "description": "This your first repo!",
    "fork": false,
    "url": "https://api.github.com/repos/demo/app-backend",
    "archive_url": "https://api.github.com/repos/demo/app-backend/{archive_format}{/ref}",
    "assignees_url": "https://api.github.com/repos/demo/app-backend/assignees{/user}",
    "blobs_url": "https://api.github.com/repos/demo/app-backend/git/blobs{/sha}",
    "branches_url": "https://api.github.com/repos/demo/app-backend/branches{/branch}",
    ...
    "topics": [
      "demo",
      "atom",
      "electron",
      "api"
    ],
    "has_issues": true,
    "has_projects": true,
    "has_wiki": true,
    "has_pages": false,
    "has_downloads": true,
    "has_discussions": false,
    "archived": false,
    "disabled": false,
    "visibility": "public",
    "pushed_at": "2011-01-26T19:06:43Z",
    "created_at": "2011-01-26T19:01:12Z",
    "updated_at": "2011-01-26T19:14:43Z",
    "permissions": {
      "admin": false,
      "push": false,
      "pull": true
    },
    "security_and_analysis": {
      "advanced_security": {
        "status": "enabled"
      },
      "secret_scanning": {
        "status": "enabled"
      },
      "secret_scanning_push_protection": {
        "status": "disabled"
      }
    }
  }

The challenge we face when translating a document schema into Kubling's Schema is handling nested fields.

Before diving into how to manage nested fields, it's important to note that while you can create a Kubling Schema that mirrors the entire document schema, it's generally not recommended. Most applications use only a subset of the data provided by upstream systems via their APIs. Therefore, the design phase remains critical when implementing Kubling.

When dealing with nested fields, there are three options:

Store the entire nested document in a json field.
Use a path-like access naming convention.
Ungroup fields into synthetic tables.

`json` field

A json field is treated by the engine as a nested document but behaves like any other regular field.

Path-like access

Suppose that you only need a couple of fields of owner rather than the entire nested document. When defining the schema, you can simply do the following:

CREATE FOREIGN TABLE CODE_REPO
(
    id long,
    name string,
    ...
    owner__login string,
    owner__html_url string,
    ...
) OPTIONS(...);

When a document-based adapter retrieves the information, the special double underscore character (__) is interpreted as a JSON path, equivalent to .[field]. Essentially, it replaces the dot (.) typically used in path notation.

`synthetic` tables

synthetic are one of the most powerful features in Kubling because they allow the creation of tables that do not have a direct physical entity in the remote data source.

Continuing with this example, we can define a synthetic table called CODE_REPO_OWNER to represent the nested document, as shown below:

CREATE FOREIGN TABLE CODE_REPO_OWNER
(
    repoId long NOT NULL OPTIONS(synthetic_type 'parent', synthetic_parent_field 'id'),
    repoName long NOT NULL OPTIONS(synthetic_type 'parent', synthetic_parent_field 'name'),
    id long,
    node_id string,
    "login" string,
    html_url string,
    ...
    identifier string OPTIONS(val_pk 'repoId+id' ),
    PRIMARY KEY(identifier),
    UNIQUE(repoId, id)
) OPTIONS(
    synthetic_parent 'CODE_REPO',
    synthetic_path 'owner');

CODE_REPO_OWNER is a table derived from the synthetic_path field of synthetic_parent. It includes two fields inherited from its parent, marked as synthetic_type 'parent', and a generated identifier that also serves as the primary key.

Additionally, we mark repoId and id as UNIQUE, ensuring that no other tuple will have the exact same combination of values.

The UNIQUE constraint plays an important role: when a Create, Update, or Delete (CUD) operation is performed on a synthetic table, the engine needs to trace the exact document lineage. It does this by relying on the combination of fields in the UNIQUE constraint.

Document arrays

When a field is an array of documents, the main difference is that the synthetic table will retrieve multiple rows for each parent document. However, you must ensure that the path built using the UNIQUE constraint leads to the exact element, meaning both the parent document(s) and the precise index or indices of the array.

`INSERT` on a `synthetic` table

Performing an INSERT on a synthetic table means that a new element must be added to the parent document's synthetic_path, whether it's a field or an array. The engine will create this new element based on the field values that are not associated with any synthetic_type. These values will then be used to trace the document lineage and identify the correct node where the new element should be inserted.

`UPDATE` on a `synthetic` table

Similar to INSERT, but instead of adding a new element (node), the values specified in the SET clause will replace an existing node.

`DELETE` on a `synthetic` table

Similar to INSERT, but after identifying the node, the engine removes it from the document.

Data Types Queries

DDL

Define a TABLE

Sample

Options

Common TABLE Directives

Common Field Directives

Schema post-precessing in Document data sources

json field

Path-like access

synthetic tables

Document arrays

INSERT on a synthetic table

UPDATE on a synthetic table

DELETE on a synthetic table