Derived Columns

A stage model is capable of defining Derived Columns. As the name indicates, this allows users to apply Hard Rules for transformation.

The metadata structure within a stage model looks like this:

derived_columns: 
    <col_alias_1>:
        value: <expression_1>
        datatype: <datatype_1>
        src_cols_required: <src_col_1>
    <col_alias_2>:
        value: <expression_2>
        datatype: <datatype_2>
        src_cols_required:
            - <src_col_2>
            - <src_col_3>

Depending on how ‘col_alias’ and ‘src_col’ are called, two different behaviours can be achieved:

Overwriting existing columns

When ‘col_alias’ equals ‘src_col’, the original input column will be overwritten with the transformation configured in ‘expression’. Use with caution!

Adding new columns

When ‘col_alias’ deviates from the’src_col’, the transformation will be added as a new column.

Expressions

An expression is defined under the key ‘value’ and can basically do three different things:

  • Inserting a static string that will be the same across all rows. Needs to begin with ‘!’ followed by the string.
  • Renaming a column. Expression would just be the name of another column.
  • Applying SQL. The Expression needs to hold valid SQL, typically based on one or multiple columns. Note: All used columns should be listed under ‘src_cols_required’.

Datatypes

Defining a datatype of the expressions is mandatory to properly generate Ghost Records. Please note, that setting a datatype does not automatically cast the expression to this datatype. You manually have to ensure that your expression matches the datatype defined.

It must only be set for SQL expressions. For static strings, it will be set to STRING (and the database correspondents) automatically. For column renaming, datatype will be set to the datatype of the input column.

Required Source Columns

The parameter ‘src_cols_required’ is only required when the Stage model is configured to not include the source columns by setting the parameter ‘include_source_columns’ to false.

If this is the case, you have to list all columns used within the SQL expressions under the parameter ‘src_cols_required’. This information is required to properly generate the model SQL.