Engine
Aggregators

Aggregators

The aggregator is a special schema type that enables automatic view creation based on other data sources. This feature simplifies querying across multiple sources by aggregating data into views, which are easier to manage.

Consider a scenario where we have several Kubernetes clusters, each defined as a data source within our VDB (Virtual Database). If we want to know, for example, which cluster a specific deployment is running on, one approach is to write a query that joins tables from all relevant schemas. However, this process becomes increasingly complex when managing dozens of clusters.

Relational databases address such scenarios with VIEWs, and Kubling follows the same pattern. A VIEW is a virtual table constructed from the result of a query, which can be defined in DDL. But in our Kubernetes example, where each cluster has its own schema, it’s unclear where to define a VIEW that spans all clusters.

This is where aggregators come in. They automatically generate views from tables across multiple schemas, streamlining multi-cluster data queries.

Example:

aggregatorSchemaModels:
  - name: "k8s"
    schemas:
      - "k8s_1"
      - "k8s_2"
    tables:
      - name: "DEPLOYMENT"
        options:
          updatable: false
      - name: "NAMESPACE"
        options:
          updatable: false
      - name: "DEPLOYMENT_CONDITIONS"
        options:
          updatable: false
      - name: "DEPLOYMENT_CONTAINER"
        options:
          updatable: false
      - name: "DEPLOYMENT_CONTAINER_VOLS"
        options:
          updatable: false

This configuration creates a new SCHEMA called k8s in the VDB. The tables in this schema, such as DEPLOYMENT and NAMESPACE, are actually VIEWs that aggregate data from the k8s_1 and k8s_2 schemas using a UNION ALL.

Primary Keys

When using aggregators, be cautious when selecting the primary key for your SCHEMA definitions, as key duplication may occur.

If a data source does not provide a globally unique key, you should use the Kubling val_pk directive while considering the aggregation implications.

For instance, in Kubernetes, if you have only one cluster, it may not be necessary to create primary keys that concatenate the cluster name. However, if you plan to add more clusters in the future, you should construct the primary key to include a cluster identifier, as shown below:

...
identifier string OPTIONS(val_pk 'clusterName+metadata__namespace+metadata__name+name'),
PRIMARY KEY(identifier),
...

Errors or empty results

Since VIEWS in the aggregator context rely on UNION ALL (which includes duplicates) to merge results, the query planner (DQP) evaluates costs differently compared to standard queries. In the case of Kubernetes data sources, this behavior might be influenced by the blankNamespaceStrategy configuration.

For more information on blankNamespaceStrategy, check out the detailed explanation here.