Understand repartitioning logic
Data distribution in the steps is shown in the following table.

As you can see, the CSV file input step divides the work between two step copies, and each copy reads 50 rows of data. However, these two step copies also need to make sure that the rows end up on the correct count by state step copy, where they arrive in a 43/57 split. As a general rule, the step that performs the repartitioning (row redistribution) of the data, that is, a non-partitioned step before a partitioned one, therefore has internal buffers from every source step copy to every target step copy, as shown below.
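The buffer layout described above can be sketched in a few lines of Python. This is only an illustrative model, not PDI code: the names `build_buffers`, `route`, and `pick_target`, the sample states, and the routing rule are all hypothetical. It shows that two source copies feeding two target copies need one buffer per (source, target) pair.

```python
from itertools import product

SOURCE_COPIES = 2  # copies of the reading step (as in the example above)
TARGET_COPIES = 2  # copies of the aggregating step


def build_buffers(n_sources, n_targets):
    """Create one buffer (a plain list here) per (source copy, target copy) pair."""
    return {(s, t): [] for s, t in product(range(n_sources), range(n_targets))}


def route(rows_per_source, pick_target, n_targets):
    """Let every source copy push each of its rows into the buffer of the chosen target."""
    buffers = build_buffers(len(rows_per_source), n_targets)
    for source, rows in enumerate(rows_per_source):
        for row in rows:
            target = pick_target(row, n_targets)
            buffers[(source, target)].append(row)
    return buffers


def rows_per_target(buffers, n_targets):
    """Sum the buffered rows that arrive at each target copy."""
    totals = [0] * n_targets
    for (_, target), buffered in buffers.items():
        totals[target] += len(buffered)
    return totals


def pick_target(row, n_targets):
    # Hypothetical routing rule, used only to make the sketch runnable.
    return sum(map(ord, row["state"])) % n_targets


# Hypothetical sample data: each source copy reads 50 rows.
STATES = ["CA", "NY", "TX", "FL"]
rows_per_source = [
    [{"state": STATES[i % 4]} for i in range(50)],
    [{"state": STATES[(i * 3) % 4]} for i in range(50)],
]

buffers = route(rows_per_source, pick_target, TARGET_COPIES)
totals = rows_per_target(buffers, TARGET_COPIES)
print(len(buffers))  # one buffer per (source, target) pair
print(totals)        # rows arriving at each target copy
```

The number of buffers grows as the product of source and target copies, which is why repartitioning becomes more expensive as more copies are started on either side.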

This is where partitioning data becomes a useful concept: it applies specific rule-based direction for aggregation, directing rows from the same state to the same step copy so that the rows are not split arbitrarily. In the example below, a partition schema called State was applied to the count by state step, and the Remainder of division partitioning rule was applied to the State field. Now the count by state aggregation step produces consistent, correct results because the rows were split up according to the partition schema and rule, as shown in the preview data.
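To illustrate why a remainder-of-division rule keeps each state together, here is a minimal Python sketch. It is not how PDI is implemented: it assumes the field value is first turned into an integer (a CRC32 checksum stands in for whatever numeric value PDI derives from a string field), and the function names and sample rows are hypothetical. For contrast, it also distributes the same rows round-robin, without any partitioning rule.

```python
import zlib
from collections import Counter

N_COPIES = 2  # number of partitions / copies of the aggregating step


def partition_for(state, n_partitions):
    """Remainder of division: integer key modulo the number of partitions."""
    key = zlib.crc32(state.encode("utf-8"))  # assumption: stand-in numeric key
    return key % n_partitions


def count_partitioned(rows, n_partitions):
    """Each copy counts only the rows whose state maps to its partition."""
    copies = [Counter() for _ in range(n_partitions)]
    for state in rows:
        copies[partition_for(state, n_partitions)][state] += 1
    return copies


def count_round_robin(rows, n_copies):
    """No partitioning rule: rows are handed out in turn, so states get split."""
    copies = [Counter() for _ in range(n_copies)]
    for index, state in enumerate(rows):
        copies[index % n_copies][state] += 1
    return copies


# Hypothetical sample rows (only the State field is modeled).
rows = ["CA", "NY", "TX", "CA", "FL", "NY", "CA", "TX"]

partitioned = count_partitioned(rows, N_COPIES)
round_robin = count_round_robin(rows, N_COPIES)

for number, counts in enumerate(partitioned):
    print("partitioned copy", number, dict(counts))
for number, counts in enumerate(round_robin):
    print("round-robin copy", number, dict(counts))
```

With the partitioning rule, every state is counted by exactly one copy, so each copy emits a complete count for its states. With round-robin distribution, the same state can appear in several copies, each holding only a partial count.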

Note: To view this transformation in the PDI client, open the Pentaho/…/design-tools/data-integration/samples/transformations/General - parallel reading and aggregation.ktr sample file.