Transformations

Transformations #

The Transformations components are used to modify the extracted data before loading it to the Target

Each Transformation requires data tables to be provided as input.
All Transformations will generate a data table as an output.

Every data table contains data and metadata (column names and their data types). For optimalization, every Transformation can return a data table with the same columns names and data types - it does not matter whether the Transformation received a data table with data or not.

Join #

This Transformation generates a combination of records based on a related column in two tables. The Join Transformation requires exactly two input tables. The data is joined based on the same values in the columns from both tables. You have to choose the columns used in the join and the type of join to make:

  • Left - only join data from the first table and matching data from the second table; not matching data will be replaced by None,

  • Right - only join data from the second table and matching data from the first table; not matching data will be replaced by None,

  • Outer - join all data from both tables; not matching data will be replaced by None,

  • Inner - only join matching data from both tables; not matching data will be omitted,

Each column in the resulted table will have a prefix that tells from which data table it came from (left_ for the first table and right_ for the second table).

Union #

This Transformation generates a table that has data from at least 2 data tables (maximum 5) which have the same columns (the data does not need to be the same).

The generated table will have data from all input tables in order from the first to last.

The generated table will have an additional column added as the first column called DATA_SOURCE which will indicate the table from which the data is.

Data tables used as input must follow the following requirements:

  1. They must have the same number of columns,
  2. They must have the same column names,
  3. The corresponding columns in all data tables must have the same data type,
  4. They cannot have a column named DATA_SOURCE.

Either load data tables that complies with the requirements or use the Convert transformation to generate correct data tables.

By default, the values in the DATA_SOURCE column are generated from names given to the components in the Modeler. Additionally, you can provide a list of Aliases to use instead of the default values.

Filter #

This Transformation only takes one table as input and generates a table with filtered data. You must provide how to chain all provided filters:

  • And to filter data that matches all filters,
  • Or to filter data that matches any of the filters,

Then, add the filters by selecting the column, condition and value for each filter. The available conditions are:

  • == (equals) - data matches the provided value,
  • != (not equals) - data does not match the provided value,
  • < (less than) - data is less than the provided value,
  • <= (less than or equal) - data is less than or matches the provided value,
  • > (greater than) - data is greater than the provided value,
  • >= (greater than or equal) - data is greater than or matches the provided value,
  • isnull (is null) - data does not have any value,
  • notnull (not null) - data has any value and is not None,

Script #

This Transformation is for advanced users. It allows them to use a Python script to create custom modifications to the data tables, not available by other Transformations. The Script Transformation can use more than one data table as inputs (available in the INPUT_DATA list variable) and must return a Python Pandas DataFrame.

This Transformation must work if an empty Python Pandas DataFrame is used as an input! Remember to prepare this Transformation correctly, otherwise the modifier might not work properly!

Working with the Script Transformation #

Because every Transformation can return a data table with the same columns names and data types whether with data or not, the Script Transformation must also handle such a case.

Analyze this simple example Modifier. Before we convert the columns into correct data types for the target database, we add some columns using the Script Transformation. When we edit the Convert Transformation, an empty data table is passed through the Script Transformation and input for the Convert Transformation to work. This is done because it is faster to pass just the column names and data types when you just want to work on them and not see actual data.

Refer to these two other examples:

Script modifying only structure #

We want to gather data from a remote Excel file and prepare it to load to a database table.

The database table has a column showing who added the data, so we add this column using the Script Transformation.

def script (INPUT_DATA, PARAMETERS):
    import pandas as pd


    df = INPUT_DATA[0]

    # add column
    df['new_column'] = "ETL data_snake"

    return df

As we do not operate on data, this code will work both on an empty data table and one with data. When we use the Convert Transformation to remove and rename some columns and change the rest to correct data types (data from files sets all column data types to string (normal text)) we can edit the following columns in the Convert Transformation:

Script modifying data and structure #

Now we take the data that we pushed to the database…

Date cases deaths country continent added_by
2020-12-14 27 5 Angola Africa ETL data_snake
2020-12-14 586 1 Ecuador America ETL data_snake
2020-12-14 718 7 South_Korea Asia ETL data_snake
2020-12-14 8976 188 Poland Europe ETL data_snake

… and transform it to only show all cases per continent per day:

Date Africa America Asia Europe Oceania Other
2019-12-31 0 0 27 0 0 0
2020-01-01 0 0 0 0 0 0
2020-01-02 0 0 0 0 0 0
2020-01-03 0 0 17 0 0 0
2020-01-04 0 0 0 0 0 0

As we will work on data and not just the data structure, writing a regular script will not work:

def script (INPUT_DATA, PARAMETERS):
    import pandas as pd
    df = INPUT_DATA[0]

    df = df.pivot_table(
        index='Date',
        columns='continent',
        values='cases',
        aggfunc='sum',
        fill_value=0
    )

    return df

That is why we need to take into account that a data table with no data can be input in the Script Transformation and should return a dataframe with the same structure as the one with data. That is why we need to use an if-else statement:

def script (INPUT_DATA, PARAMETERS):
    import pandas as pd
    dataframe = INPUT_DATA[0]

    if dataframe.empty:
        return pd.DataFrame(
            index=['2019-12-31'],
            columns=['Africa', 'America', 'Asia', 'Europe', 'Oceania', 'Other'],
            dtype='Int64'
        )
    else:
        return dataframe.pivot_table(
            index='Date',
            columns='continent',
            values='cases',
            aggfunc='sum',
            fill_value=0
        ).astype('Int64')

The above script makes sure that when we open the Convert Transformation we will see all the columns that are generated by the Script Transformation.

Available Packages #

Here is a list of available packages to use in your Script:

  • all packages in the Python Standard Library,
  • the pandas manipulation tool,
  • the dateutil module for date and time manipulations,
  • the NumPy fundamental package for scientific computing with Python,
  • the OpenPyXL library to read/write Excel 2010 xlsx/xlsm/xltx/xltm files,
  • the xlrd library to read old xls Excel files,
  • the odfpy library to read and write OpenDocument files (i.e. ods),
  • the pandaSDMX library for SDMX-ML data manipulation,
  • the SQLAlchemy Python SQL toolkit and Object Relational Mapper,
  • the Requests library for sending HTTP requests,
  • the Beautiful Soup library for pulling data out of HTML and XML files,
  • the pyodbc module for accessing ODBC databases,

Convert #

This Transformation takes a single data table as input and modifies its column structure. Each column from the table is presented to you and you can choose the following options for each column:

  • Alias - change the column name,
  • Casting type - change the column data type to a different type; avaliable data types: Integer ('Int64'), Float ('float64'), String (regular text - string), Boolean (having two possible values called true and false - boolean), and Datetime (object representing time - datetime64[ns]).
  • Active - whether to keep or delete the column from the table,

Add column #

This Transformation allows you to add columns to the data structure with a specific value. You choose the name, type and the place where new column will be positioned in the data structure.

Aggregation #

This Transformation allows you to split data into groups, apply an aggregation function to each grouped column and combine the results into a new data structure. This transformation allows to compute a summary statistics for each group like sums, counts or means. The resulting data structure will have only the selected columns with grouped values (dropped duplicates) and new columns representing each aggregation applied to all the grouped columns.

If you select only grouped columns, then you will receive a new data structure with only the grouped columns. If you select only aggregations, then you will receive a data structure with added columns with the same values as the columns that were to be aggregated. Remember to always provide a list of grouped columns and at least one aggregation!

Aggregation methods #

All those methods work on columns of Integer and Float type and only some of them can be used for columns with other types (e.g. String, Boolean, DateTime).

  • MIN - Minimum of group values (e.g. MIN of (5, 10, 15) will be 5),
  • MAX - Maximum of group values (e.g. MAX of (5, 10, 15) will be 15),
  • COUNT - Count of group values (values that evaluates to True), (e.g.COUNT of (0, 10, 5, 0, 12, 0, -4) will be 4),
  • SUM - Sum of group values (e.g. SUM of (5, 10, 15) will be 30),
  • AVERAGE - Average of group values (e.g. AVERAGE of (5, 10, 15) will be 10),
  • SIZE - Count of group values (True and False) (e.g. SIZE of (0, 10, 5, 0, 12, 0, -4) will be 7),
  • STANDARD DEVIATION - Compute standard deviation of groups (e.g. STANDARD VARIATION of (2, 1, 3, 1.75) will be 0.826),
  • VAR - Compute variance of groups (e.g. VARIANCE of (2, 1, 3, 1.75) will be 0.6823),
  • STANDARD ERROR - Compute standard error of the mean of groups (e.g. STANDARD ERROR of (2, 1, 3, 1.75) will be 0.413)
  • FIRST - Get first element of group values (e.g. FIRST of (0, 10, 5, 0, 12, 0, -4) will be 0),
  • LAST - Get last element of group values (e.g. LAST of (0, 10, 5, 0, 12, 0, -4) will be -4),
  • MEDIAN - Get median of group values (e.g. MEDIAN of (2, 1, 3, 1.75, 0, 2.6) will be 1.875)

This transformation ignores None values. See example below:

product price $ country
apple 3 POL
pear 11 POL
banana None POL
apple 5 FRA
pear 7 FRA
banana None FRA
apple 10 IDN
pear None IDN
banana 9 IDN

Grouped by product:

product SUM price AVARAGE price SIZE price
apple 18 6 3
pear 18 9 2
banana 9 9 1

While avarage of:

  • PEARS is (11 + 7 + 0) / 3 = 6
  • BANANAS is (0 + 0 + 9) / 3 = 3

Sort #

This Transformation allows you to sort the whole dataset by specified columns in the chosen order. The sorting can be ASCENDING or DESCENDING.

EXAMPLE

product price $ country
apple 3 POL
pear 11 POL
banana None POL
apple 5 FRA
pear 7 FRA
banana None FRA
apple 10 IDN
pear None IDN
banana 9 IDN

Sorted by country product DESCENDING:

product price $ country
pear 11 POL
banana None POL
apple 3 POL
pear None IDN
banana 9 IDN
apple 10 IDN
pear 7 FRA
banana None FRA
apple 5 FRA

Sorted by price ASCENDING:

product price $ country
apple 3 POL
apple 5 FRA
pear 7 FRA
banana 9 IDN
apple 10 IDN
pear 11 POL
banana None POL
banana None FRA
pear None IDN

As you can see, None values are ignored and left at the end in the order they appeared.