JDBC tuning options

Spark is a massively parallel computation system that can run on many nodes and process hundreds of partitions at a time. When reading from a SQL database over JDBC, however, you may want to tune the read to reduce the risk of failure. You can specify the number of concurrent JDBC connections, the numeric column to partition on, and the minimum and maximum values to read. Spark then reads the table partitioned by that column, issuing parallel queries when the options are applied correctly. On clusters with Hive installed, these JDBC tuning options can also improve transformation performance.

The read.jdbc options construct a DataFrame representing a named database table accessible via a JDBC URL. Partitions of the table are retrieved in parallel based on the parameters described below. See the Spark API documentation for more information.

| Option | Description | Value type | Example value |
| --- | --- | --- | --- |
| read.jdbc.columnName | The name of a column of integral type that will be used for partitioning. | String | column1 |
| read.jdbc.lowerBound | The minimum value of columnName, used to decide the partition stride. Works together with read.jdbc.columnName. | Any value | |
| read.jdbc.upperBound | The maximum value of columnName, used to decide the partition stride. Works together with read.jdbc.columnName. | Any value | |
| read.jdbc.numPartitions | The number of partitions. Together with lowerBound (inclusive) and upperBound (exclusive), this forms the partition strides for the generated WHERE clause expressions that split the column columnName evenly. When the input is less than 1, it is set to 1. | Integer | 5 |
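To illustrate how these four options work together, here is a simplified Python sketch of how Spark derives per-partition WHERE clauses from the partitioning column, bounds, and partition count. The function name and simplifications are illustrative (modeled loosely on Spark's internal partitioning logic), not Spark's actual API.

```python
def partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Sketch of how per-partition WHERE clauses are derived from the
    read.jdbc partitioning options (illustrative, not Spark's exact code)."""
    # Inputs less than 1 are coerced to 1, per the option description above.
    num_partitions = max(num_partitions, 1)
    if num_partitions == 1:
        return []  # a single query reads the whole table, no WHERE clause

    # lowerBound (inclusive) and upperBound (exclusive) define the stride.
    stride = (upper_bound - lower_bound) // num_partitions
    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        # The first partition has no lower bound and the last has no upper
        # bound, so rows outside [lowerBound, upperBound) are still read.
        lower = f"{column} >= {current}" if i > 0 else None
        current += stride
        upper = f"{column} < {current}" if i < num_partitions - 1 else None
        if lower and upper:
            predicates.append(f"{lower} AND {upper}")
        else:
            predicates.append(lower or upper)
    return predicates
```

For example, with columnName column1, lowerBound 0, upperBound 100, and numPartitions 5, this yields five queries with strides of 20: the first reads column1 < 20, the last reads column1 >= 80, and the middle three read the ranges in between. Note that the bounds only shape the strides; they do not filter out rows beyond them.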
