Engine
DDL

DDL

Kubling uses SQL-like DDL files to define TABLEs, which are organized within SCHEMAs. A SCHEMA serves as a container for metadata describing various objects, such as TABLEs, VIEWs, or PROCEDUREs.

Each SCHEMA is part of a Virtual Database (VDB), defined within the primary Descriptor bundle.

The real advantage of Kubling’s database virtualization engine lies in its flexibility: you can redefine the structure of TABLEs regardless of the underlying system’s actual model. This allows you to tailor data representations to your needs while still reflecting the source system's objects.

This flexibility is crucial for creating viewpoints tailored to specific data interactions, enabling you to adapt the data model to your needs rather than conforming to the system’s structure.

Additionally, Kubling allows the definition of TABLEs based on APIs, even those returning complex JSON documents. This capability simplifies working with deeply nested data structures, making it easier to transform them into a tabular format.

💡

DDL files are parsed as templates, meaning that you can use context properties.
See more about template parser.

Define a TABLE

CREATE TABLE {table-name} (
    <table-element> (,<table-element>)*
    (,<constraint>)*
) [OPTIONS (<options-clause>)]
 
 
<table-element> ::=
    {column-name} <data-type> <element-attr> <options-clause>
 
<data-type> ::=
    varchar | boolean | integer | double | date | timestamp .. (see Data Types)
 
<element-attr> ::=
    [AUTO_INCREMENT] [NOT NULL] [PRIMARY KEY] [UNIQUE] [INDEX] [DEFAULT {expr}]
 
<constraint> ::=
    CONSTRAINT {constraint-name} (
        PRIMARY KEY <columns> |
        FOREIGN KEY (<columns>) REFERENCES tbl (<columns>)
        UNIQUE <columns> |
        INDEX <columns>
 
<columns> ::=
    ( {column-name} [,{column-name}]* )
 
<options-clause> ::=
    <key> <value>[,<key>, <value>]*

Sample

CREATE FOREIGN TABLE NAMESPACE
   (clusterName string OPTIONS(val_variable 'cluster_name'),
    clusterUrl string OPTIONS(val_constant '{{ schema.properties.kubernetes_api_url }}'),
    schema string OPTIONS(val_constant '{{ schema.name }}'),
    metadata__name string,
    metadata__labels json OPTIONS(parser_format 'asJsonPretty'),
    status__phase string,
    PRIMARY KEY(metadata__name),
    UNIQUE(clusterUrl, metadata__name))
OPTIONS(updatable true,
        supports_idempotency false,
        tags 'kubernetes;{{ schema.properties.cluster_name }};namespace');

Double underscore in field name has a special meaning in the context of a Script Document Data Source. See here for more information.

Options

⚠️

Following options work ony in document data sources. However, we are implementing them also in non-document Adapters, like POSTGRESQL.
You can follow the progress in roadmap section.

Options is the mechanism used by the Engine to pass meta-information from the TABLE definition to the query processor (QP).
Internally, each listed OPTION but tags is referred as DIRECTIVE since, as mentioned before, it can change the behavior of the QP

Common TABLE Directives

DirectiveTypeOptionsDescription
updatableBooleanTrue/FalseIndicates whether the QP allows INSERT, UPDATE and DELETE operations on this TABLE
tagsStringAnyDefines tags for the TABLE, which can be used to get TABLEs based on tags criteria.
cacheStringcache 'option'
Options:
enabled, disabled, schemaDefault
Defines the cache behavior for this TABLE when caching is enabled at the data source level.

Common Field Directives

DirectiveTypeOptionsDescription
val_constantStringAnyDefines a constant field value.
val_variableStringAnyThe value of the field is get from the properties. More information.
val_pkString[field+field+...]Creates a calculated PK based on a field set.
Since the value is not stored anywhere, when filtering using this field, the QP replaces the WHERE section by the original list of field's values.
call_funcStringcall_func 'fn_name<field, field,...>'Calls a function (built-in or user-defined) to get the field value.
Function parameters can be other's fields values.

Specific directives are listed in each data source documentation.

Schema post-precessing in Document data sources

Suppose our data source interacts with the GitHub API, and we are retrieving a specific repository. GitHub’s response would be something like the following

{
    "id": 1296269,
    "node_id": "MDEwOlJlcG9zaXRvcnkxMjk2MjY5",
    "name": "app-backend",
    "full_name": "demo/app-backend",
    "owner": {
      "login": "demo",
      "id": 1,
      "node_id": "MDQ6VXNlcjE=",
      "avatar_url": "https://github.com/images/error/demo_happy.gif",
      "gravatar_id": "",
      "url": "https://api.github.com/users/demo",
      "html_url": "https://github.com/demo",
      ...
    },
    "private": false,
    "html_url": "https://github.com/demo/app-backend",
    "description": "This your first repo!",
    "fork": false,
    "url": "https://api.github.com/repos/demo/app-backend",
    "archive_url": "https://api.github.com/repos/demo/app-backend/{archive_format}{/ref}",
    "assignees_url": "https://api.github.com/repos/demo/app-backend/assignees{/user}",
    "blobs_url": "https://api.github.com/repos/demo/app-backend/git/blobs{/sha}",
    "branches_url": "https://api.github.com/repos/demo/app-backend/branches{/branch}",
    ...
    "topics": [
      "demo",
      "atom",
      "electron",
      "api"
    ],
    "has_issues": true,
    "has_projects": true,
    "has_wiki": true,
    "has_pages": false,
    "has_downloads": true,
    "has_discussions": false,
    "archived": false,
    "disabled": false,
    "visibility": "public",
    "pushed_at": "2011-01-26T19:06:43Z",
    "created_at": "2011-01-26T19:01:12Z",
    "updated_at": "2011-01-26T19:14:43Z",
    "permissions": {
      "admin": false,
      "push": false,
      "pull": true
    },
    "security_and_analysis": {
      "advanced_security": {
        "status": "enabled"
      },
      "secret_scanning": {
        "status": "enabled"
      },
      "secret_scanning_push_protection": {
        "status": "disabled"
      }
    }
  }

The challenge we face when translating a document schema into Kubling's Schema is handling nested fields.

Before diving into how to manage nested fields, it's important to note that while you can create a Kubling Schema that mirrors the entire document schema, it's generally not recommended. Most applications use only a subset of the data provided by upstream systems via their APIs. Therefore, the design phase remains critical when implementing Kubling.

When dealing with nested fields, there are three options:

  • Store the entire nested document in a json field.
  • Use a path-like access naming convention.
  • Ungroup fields into synthetic tables.

json field

A json field is treated by the engine as a nested document but behaves like any other regular field.

Path-like access

Suppose that you only need a couple of fields of owner rather than the entire nested document. When defining the schema, you can simply do the following:

CREATE FOREIGN TABLE CODE_REPO
(
    id long,
    name string,
    ...
    owner__login string,
    owner__html_url string,
    ...
) OPTIONS(...);

When a document-based adapter retrieves the information, the special double underscore character (__) is interpreted as a JSON path, equivalent to .[field]. Essentially, it replaces the dot (.) typically used in path notation.

synthetic tables

synthetic are one of the most powerful features in Kubling because they allow the creation of tables that do not have a direct physical entity in the remote data source.

Continuing with this example, we can define a synthetic table called CODE_REPO_OWNER to represent the nested document, as shown below:

CREATE FOREIGN TABLE CODE_REPO_OWNER
(
    repoId long NOT NULL OPTIONS(synthetic_type 'parent', synthetic_parent_field 'id'),
    repoName long NOT NULL OPTIONS(synthetic_type 'parent', synthetic_parent_field 'name'),
    id long,
    node_id string,
    "login" string,
    html_url string,
    ...
    identifier string OPTIONS(val_pk 'repoId+id' ),
    PRIMARY KEY(identifier),
    UNIQUE(repoId, id)
) OPTIONS(
    synthetic_parent 'CODE_REPO',
    synthetic_path 'owner');

CODE_REPO_OWNER is a table derived from the synthetic_path field of synthetic_parent. It includes two fields inherited from its parent, marked as synthetic_type 'parent', and a generated identifier that also serves as the primary key.

Additionally, we mark repoId and id as UNIQUE, ensuring that no other tuple will have the exact same combination of values.

The UNIQUE constraint plays an important role: when a Create, Update, or Delete (CUD) operation is performed on a synthetic table, the engine needs to trace the exact document lineage. It does this by relying on the combination of fields in the UNIQUE constraint.

Document arrays

When a field is an array of documents, the main difference is that the synthetic table will retrieve multiple rows for each parent document. However, you must ensure that the path built using the UNIQUE constraint leads to the exact element, meaning both the parent document(s) and the precise index or indices of the array.

INSERT on a synthetic table

Performing an INSERT on a synthetic table means that a new element must be added to the parent document's synthetic_path, whether it's a field or an array. The engine will create this new element based on the field values that are not associated with any synthetic_type. These values will then be used to trace the document lineage and identify the correct node where the new element should be inserted.

UPDATE on a synthetic table

Similar to INSERT, but instead of adding a new element (node), the values specified in the SET clause will replace an existing node.

DELETE on a synthetic table

Similar to INSERT, but after identifying the node, the engine removes it from the document.