Group By

This step groups rows from a source, based on a specified field or collection of fields. A new row is generated for each group. The step can also generate one or more aggregate values for the groups. Common uses include calculating the average sales per product and counting the number of items you have in stock.

The Group By step is designed for sorted inputs. If your input is not sorted, only consecutive rows are grouped correctly.

You can use the Memory Group By step to handle unsorted input.

Options

  • Step name: Specify the unique name of the Group By step on the canvas. You can customize the name or leave it as the default.

  • Include all rows?: Select to include all rows in the output. Clear to output only aggregate rows.

    If you select this option, the following options become available:

    • Temporary files directory

    • TMP-file prefix

    • Add line number, restart in each group

    • Line number field name

  • Temporary files directory: Directory where temporary files are stored. Default: the standard temporary directory for your system.

    You must specify a directory when Include all rows? is selected and the number of grouped rows exceeds 5000.

  • TMP-file prefix: File prefix for naming temporary files.

  • Add line number, restart in each group: Adds a line number that restarts at 1 in each group. When both Include all rows? and this option are selected, all rows are included in the output with a line number for each row.

  • Line number field name: Name of the field to add line numbers for each new group.

  • Always give back a result row: Return a result row even when there is no input row. When there are no input rows, this option returns a count of 0. Clear to output a result row only when there is an input row.

Group fields

Use The fields that make up the group table to specify the fields to group.

  • Select Get Fields to add all fields from the input stream to the table.

  • Right-click a row in the table to edit that row or all rows.

circle-info

You can calculate aggregate functions over the entire dataset by leaving this table blank.

Aggregates

Use the Aggregates table to specify the field to aggregate, the aggregation method, and the name of the resulting output field.

Columns:

  • Name: Name of the aggregate output field.

  • Subject: Field to use as input to the aggregation method.

  • Type: Aggregation method.

  • Value: Aggregate value.

Aggregation methods:

  • Sum

  • Average (mean)

  • Median

  • Percentile (linear interpolation)

  • Minimum

  • Maximum

  • Number of values (N)

  • Concatenate strings separated by comma (,)

  • First non-null value

  • Last non-null value

  • First value (including null)

  • Last value (including null)

  • Cumulative sum (include all rows option only)

  • Cumulative average (include all rows option only)

  • Standard deviation (population)

  • Concatenate strings separated by the character specified in Value

  • Number of distinct values

  • Number of rows (without field argument)

  • Standard deviation (sample)

  • Percentile (nearest-rank method)

Examples

Sample transformations included in design-tools/data-integration/samples/transformations:

  • Calculate median and percentiles using the group by steps.ktr

  • General - Repeat fields - Group by - Denormalize.ktr

  • Group By - Calculate standard deviation.ktr

  • Group By - include all rows without a grouping.ktr

  • Group by - include all rows and calculations .ktr.

Metadata injection support

All fields of this step support metadata injection. You can use this step with ETL metadata injection to pass metadata to your transformation at runtime.

The metadata injection values for the aggregation type are:

  • SUM

  • AVERAGE

  • MEDIAN

  • PERCENTILE

  • MIN

  • MAX

  • COUNT_ALL

  • CONCAT_COMMA

  • FIRST

  • LAST

  • FIRST_INCL_NULL

  • LAST_INCL_NULL

  • CUM_SUM

  • CUM_AVG

  • STD_DEV

  • CONCAT_STRING

  • COUNT_DISTINCT

  • COUNT_ANY

Last updated

Was this helpful?