Introduction to Database Data Sources

When a new Database is added to the list of data sources and no DDL is defined, the Kubling engine treats that data source as passthrough (also called direct), which has several implications for how the engine parses and resolves queries.

Passthrough

Objects loaded in this mode come directly from the metadata read from the remote source during engine initialization. SCHEMAs and TABLEs are still created in Kubling but serve only as a 1:1 mirror of the original TABLEs identified in the source.

This mode is useful when:

  • A connection is needed to join existing information with TABLEs defined in Kubling.
  • The source does not support complex custom functions, such as Kubling’s, and some transformations need to be applied via functions at query time.
  • Reducing load on the source, such as network traffic and cache storage, is beneficial.
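
For illustration, a minimal passthrough configuration might look like the sketch below. The names and bundle paths are placeholders; the key point is that no schema DDL is provided, so the engine mirrors the remote metadata automatically:

- name: "pg-passthrough"
  dataSourceType: "POSTGRESQL"
  config: "bundle:datasource/psql-datasource.yaml"
  translatorConfig: "bundle:translator/base-translator-config.yaml"
  # No schema.ddl or schema.ddlFilePaths here: SCHEMAs and TABLEs are
  # created as a 1:1 mirror of the remote source's metadata.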

Queries

For query resolution in passthrough mode, the engine pushes down as much as possible to the source, including functions and subqueries, relying on the source’s implementation.

As a result, the query the source receives is almost identical to the one Kubling received.

In regular mode, however, the DQP resolves the AST differently. Subqueries may be resolved as independent units, resulting in multiple queries being sent to the data source.
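
To make the difference concrete, consider the sketch below, which assumes a passthrough PostgreSQL source exposed as the schema pg (table names are hypothetical):

-- Passthrough mode: the whole statement, including the subquery and the
-- UPPER function, is pushed down to the source almost verbatim.
-- Regular mode: the DQP may plan the subquery as an independent unit and
-- combine the results itself, sending multiple queries to the source.
SELECT UPPER(c.name)
FROM pg.customers c
WHERE c.id IN (SELECT customer_id FROM pg.orders WHERE total > 100);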


Additional Columns (v25.1+)

In passthrough mode, it is possible to add specific fields without defining the entire schema.
This is particularly useful for making small amendments to tables, such as adding constant values using the val_constant directive.

Example:

dataSources:
  - name: "psql-1"
    dataSourceType: "POSTGRESQL"
    configObject:
      dataSourceName: "psql-1"
      databaseName: "psql-1"
      ...
    schema:
      ...
      addColumns:
        - tableNamePattern: ".+"
          name: "data_center"
          type: "string"
          directives:
            val_constant: "DC1"

tableNamePattern selects which tables are affected by the column addition, using a Java regular expression.
directives is equivalent to adding OPTIONS next to the field via DDL. As in DDL, it is optional.
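
Once added, the column can be used like any other field. A short usage sketch, assuming the source is exposed as the schema "psql-1" and has a table named app_component:

-- data_center does not exist in the remote table; it is appended by
-- Kubling and always carries the constant value 'DC1'.
SELECT identifier, data_center FROM "psql-1".app_component;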

Operations in Regular Mode (non passthrough)

In this mode, all objects (particularly SCHEMAs and TABLEs) are defined explicitly in a DDL file, as specified in the Virtual Database information file under the schema.ddl or schema.ddlFilePaths property.

As a result, Kubling will not auto-generate the schema based on the remote database’s metadata.
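
For reference, a sketch of how the DDL might be wired into the Virtual Database information file (the bundle paths are placeholders):

- name: "app_db"
  dataSourceType: "POSTGRESQL"
  config: "bundle:datasource/psql-datasource.yaml"
  schema:
    type: "PHYSICAL"
    # Inline DDL via schema.ddl, or one or more files via schema.ddlFilePaths:
    ddlFilePaths:
      - "bundle:ddl/app_db.ddl"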

Schema Definition

The NAMEINSOURCE (or name_in_source) directive is required at both the table and field levels. This directive specifies the corresponding object in the remote data source.

When a TABLE is defined in the DDL to extend the original schema, CUD (Create, Update, Delete) operations are translated into a set of commands that the remote source understands, even if the fields involved are generated and do not physically exist in the source.

For example, given a PostgreSQL table:

CREATE TABLE itops.app_component (
	identifier varchar(36) NOT NULL,
	application_identifier varchar(255) NOT NULL,
	"name" varchar(255) NOT NULL,
	CONSTRAINT app_component_pkey PRIMARY KEY (identifier)
);
CREATE INDEX idx_app_component_search ON itops.app_component USING btree (name, application_identifier);

You can import it into Kubling using:

CREATE FOREIGN TABLE app_component
(
  -- fields that match the source
  identifier string OPTIONS(NAMEINSOURCE '"identifier"'),
  application_identifier string OPTIONS(NAMEINSOURCE '"application_identifier"'),
  name string OPTIONS(NAMEINSOURCE '"name"'),
 
  -- extended fields
  ext_rnd string OPTIONS(call_func 'get_random<>'),
  ext_name string OPTIONS(call_func 'just_print_field<name>'),
  ext_combined string OPTIONS(call_func 'combine<identifier, name>'),
  n_identifier string NOT NULL OPTIONS(val_pk 'identifier+name' ),
  PRIMARY KEY(n_identifier),
  UNIQUE(identifier)        
) 
OPTIONS(updatable true, name_in_source 'itops.app_component');

Fields like ext_rnd or n_identifier are extended fields: they exist only within Kubling and are not pushed to the remote database. However, they can still be used in WHERE clauses.
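
For example, the following query mixes both kinds of fields. The condition on application_identifier can be pushed down, while the condition on ext_combined is evaluated by Kubling after the rows are fetched (this assumes the source is exposed as the schema pg):

SELECT identifier, ext_combined
FROM pg.app_component
WHERE application_identifier = 'billing'  -- exists in the source: pushable
  AND ext_combined LIKE '%prod%';         -- extended field: evaluated in Kubling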

To determine how to apply these operations, the engine uses one of the following strategies:

  • primary_key: Used when the table defines a primary key that does not include any generated fields.
  • unique_key: Used when the table defines a UNIQUE constraint, again without involving generated fields.
  • brute_force: Fallback strategy used when neither of the above applies. This requires one of the following:
    • allowBruteForceOperations enabled at the data source level, or
    • allow_brute_force enabled at the table level.
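
As a sketch, enabling the fallback at the table level could look like this (the table and fields are hypothetical; only the relevant OPTIONS are shown):

CREATE FOREIGN TABLE audit_log
(
  entry string,
  note string
)
-- No PRIMARY KEY or UNIQUE constraint over non-generated fields, so
-- UPDATE and DELETE must rely on the brute_force strategy.
OPTIONS(updatable true, allow_brute_force true, name_in_source 'itops.audit_log');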

Defining a TABLE with the name_in_source directive is functionally equivalent to referencing the full source table name. However, in the latter case, when querying, the name must be quoted, e.g., SELECT * FROM pg."itops.app_component".

💡 Defining complex schemas can be challenging and error-prone, even when relying on LLMs, due to the amount of context required.

To simplify this, we recommend:

  1. First connect a Kubling instance to the remote data source in passthrough mode.
  2. Use the Kubling API to fetch the auto-generated underlying schema:
http[s]://[kubling_host:kubling_port]/api/v1/admin/ddl/[vdb_name]/[schema_name]
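
For example, with curl (host, port, VDB, and schema names below are placeholders):

curl http://my-kubling-host:8282/api/v1/admin/ddl/app_vdb/app_db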

Once the schema is generated, you can safely customize it: add fields, update types, annotate columns, and more.

Performance Implications

Using regular mode introduces performance considerations, primarily due to limitations in what the DQP (Data Query Processor) can push down to the underlying data source.

In this mode, most functions, including aggregation functions, are not pushed down. This is because of extended fields: columns defined in Kubling’s schema that don’t exist in the remote data source.

When a query includes such fields, the query translator strips them out before forwarding the request. As a result, the data returned by the remote source may not fully satisfy all conditions from the original WHERE clause. The DQP must then re-evaluate those conditions locally after retrieving the data.

This works well for basic filtering but breaks down for function-based logic, especially in two key cases:

  • Function Decomposition Complexity
    Kubling allows functions to be applied to non-pushable fields, even within filtering conditions. This makes it difficult to decompose a function and push it down only when its arguments are pushable.

  • Aggregation Semantics
    Aggregations like AVG, SUM, or COUNT rely on access to the full set of matching records. For example, computing AVG(AVG(...)), where the inner AVG is executed remotely and the outer one in Kubling, is not equivalent to a single AVG(...) over the full population.
    If the remote source returns pre-aggregated (partial) values, Kubling cannot guarantee correctness. To ensure accuracy, aggregations must be performed entirely within Kubling.
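
A small worked example illustrates the aggregation point. Suppose the matching rows hold the values 1, 2, and 3 in one remote batch and 10 in another:

AVG over the full population:  (1 + 2 + 3 + 10) / 4 = 4
AVG of partial AVGs:           (AVG(1,2,3) + AVG(10)) / 2 = (2 + 10) / 2 = 6

The results differ, which is why Kubling computes aggregations over the full set of retrieved rows instead of combining partial aggregates from the source.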

In short, regular mode prioritizes correctness over performance, and while it maintains flexibility in how queries are written, it often requires post-processing in Kubling, which may lead to increased execution time for complex queries.

How to Avoid Performance Issues

When working in regular mode, there’s a simple rule of thumb to help avoid performance bottlenecks:
Always write queries that filter using primary or unique keys defined in the remote data source.

This practice minimizes the size of result sets fetched over the network, leading to:

  • Lower network I/O
  • Reduced buffer usage in Kubling
  • Faster query evaluation when non-pushable fields are involved, since fewer rows need processing

In the example above, prioritize filters on columns like identifier, application_identifier, or name, as they are indexed in the source system.
Combine them with additional conditions that reduce result set cardinality (i.e., remove irrelevant rows), which helps minimize data transfer and processing overhead.
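
Continuing with the app_component example, a query shaped like the sketch below keeps the fetched result set small: the key-based predicate is pushed down to the source, and only the few surviving rows are post-processed for the extended field (the literal values are placeholders):

SELECT identifier, name, ext_combined
FROM pg.app_component
WHERE identifier = 'a0b1c2d3'             -- unique key in the source: pushed down
  AND ext_combined LIKE '%prod%';         -- evaluated in Kubling on few rows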

Determining what should be pushed down to the remote source versus what should be processed in Kubling requires deeper analysis and can have a significant positive impact on your setup.
The answer depends on aspects that vary from case to case, but there is one important thing to consider: Kubling is inexpensive and runs on stock hardware, making it an ideal candidate for offloading work from your databases, especially when it runs close to them, with fast and inexpensive networking.
We plan to release a dedicated guide covering optimization strategies for mid- to large-scale deployments.


Specific TABLE directives

The following directives work only in Database Data Sources:

Directive          Type     Options      Description
name_in_source     String   Any          Indicates how the TABLE is named in the source, if it differs from Kubling’s TABLE.
allow_brute_force  Boolean  True/False   Overrides the data source’s decision to use brute force when the engine cannot find a way to identify affected rows in an UPDATE or DELETE.

Available Data Sources

PostgreSQL Data Source (POSTGRESQL)

A Database Data Source that translates and adapts queries to PostgreSQL.

Configuration

Uses the Standard DB Data Source Configuration.

Sample configuration in VDB file:

- name: "app_db"
  dataSourceType: "POSTGRESQL"
  config: "bundle:datasource/psql-datasource.yaml"
  translatorConfig: "bundle:translator/base-translator-config.yaml"
  schema:
    type: "PHYSICAL"
    properties:
      'importer.useFullSchemaName': 'false'
      'importer.useCatalogName': 'false'
      'importer.schemaName': 'public'

When no DDL is defined and importer properties are present, the engine automatically creates all the FOREIGN TABLEs that match the import strategy.
In this specific sample, all the TABLEs of the public schema will have a local Kubling TABLE.

MySQL Data Source (MYSQL)

Configuration is exactly the same as for POSTGRESQL (above). The only difference is that dataSourceType must have the value MYSQL.