Understand repartitioning logic

Data distribution in the steps is shown in the following table.

As you can see, the CSV file input step divides the work between two step copies, and each copy reads 50 rows of data. However, these two step copies must also make sure that each row ends up on the correct count by state step copy, where the rows arrive in a 43/57 split. For that reason, as a general rule, the step that performs the repartitioning (row redistribution) of the data, that is, a non-partitioned step before a partitioned one, has internal buffers from every source step copy to every target step copy, as shown below.
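The buffer layout described above can be pictured with a small Python sketch. This is only an illustration, not PDI's internal implementation: the `repartition` and `rows_per_target` functions, the sample rows, and the placeholder `route` rule are all hypothetical. It shows that two source copies feeding two target copies need one buffer per (source, target) pair.

```python
def repartition(source_copies, n_targets, route):
    """Distribute rows from every source step copy to every target step copy.

    source_copies: list of row lists, one list per source step copy.
    route: function mapping a row to a target copy index (assumed rule).
    Returns one buffer per (source copy, target copy) pair.
    """
    buffers = {
        (s, t): []
        for s in range(len(source_copies))
        for t in range(n_targets)
    }
    for s, rows in enumerate(source_copies):
        for row in rows:
            buffers[(s, route(row))].append(row)
    return buffers


def rows_per_target(buffers, n_targets):
    """Total number of rows that arrive at each target step copy."""
    totals = [0] * n_targets
    for (_source, target), rows in buffers.items():
        totals[target] += len(rows)
    return totals


# Hypothetical input: two source copies reading 50 rows each.
states = ["CA", "NY", "TX", "FL", "WA"]
source_copies = [
    [{"id": i, "state": states[i % 5]} for i in range(0, 50)],
    [{"id": i, "state": states[i % 5]} for i in range(50, 100)],
]

# Placeholder routing rule, chosen only so the sketch is deterministic.
buffers = repartition(
    source_copies,
    n_targets=2,
    route=lambda row: sum(map(ord, row["state"])) % 2,
)
totals = rows_per_target(buffers, 2)
```

With two source copies and two target copies, `buffers` holds four entries. The split in `totals` depends entirely on the routing rule and the data, so it will not necessarily match the 43/57 split of the sample transformation.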

This is where partitioning data becomes useful: it applies a rule-based direction to the rows, sending all rows from the same state to the same step copy so that they are not split arbitrarily. In the example below, a partition schema called State was applied to the count by state step, and the Remainder of division partitioning rule was applied to the State field. The count by state aggregation step now produces consistent, correct results because the rows were split up according to the partition schema and rule, as shown in the preview data.
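The effect of a remainder-of-division rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the State field is a string, so the sketch first converts it to a number with CRC32, which is an assumed stand-in and not necessarily how PDI derives a number from a non-numeric field. The `target_copy` and `count_by_state` names and the sample values are hypothetical.

```python
import zlib
from collections import Counter


def target_copy(state, n_copies):
    """Pick a step copy as the remainder of dividing a number by n_copies.

    Assumption: CRC32 turns the string into a stable integer; the real
    PDI rule may derive the number differently.
    """
    return zlib.crc32(state.encode("utf-8")) % n_copies


def count_by_state(states, n_copies):
    """Aggregate counts per step copy after partitioning on state."""
    counts = [Counter() for _ in range(n_copies)]
    for state in states:
        counts[target_copy(state, n_copies)][state] += 1
    return counts


# Hypothetical sample values for the State field.
sample_states = ["CA", "NY", "CA", "TX", "NY", "CA"]
per_copy = count_by_state(sample_states, 2)

# Merging the per-copy results gives the same totals as a single count.
merged = sum(per_copy, Counter())
```

Because the target copy depends only on the state value, every row for a given state lands on the same copy, and each copy's count for that state is already the complete count.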

Note: To view this transformation in the PDI client, open the Pentaho/…/design-tools/data-integration/samples/transformations/General - parallel reading and aggregation.ktr sample file.


Last updated 23 days ago
