JDBC tuning options
Spark is a massively parallel computation system that can run on many nodes and process hundreds of partitions at a time. When reading from a SQL database over JDBC, however, you may want to tune how the work is split to reduce the risk of failure. You can specify the number of concurrent JDBC connections, the numeric column to partition on, and the minimum and maximum values to read. Spark then partitions the table by the specified numeric column and issues one query per partition, so the reads run in parallel when the options are set correctly. On clusters with Hive installed, these JDBC tuning options can also improve transformation performance.
The read.jdbc method constructs a DataFrame representing the database table named table, accessible via a JDBC URL. Partitions of the table are retrieved in parallel based on the parameters passed to this method. See the Spark API documentation for more information.
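At the Spark API level, these tuning options correspond to the JDBC data source's partitioning options (partitionColumn, lowerBound, upperBound, numPartitions). The sketch below shows the mapping; the URL, table, and column names are illustrative assumptions, not values from this document:

```python
# Hypothetical mapping from the read.jdbc.* tuning options to Spark's
# documented JDBC data source options. The url/dbtable/column values
# below are made-up examples.
options = {
    "url": "jdbc:postgresql://db-host:5432/sales",  # hypothetical JDBC URL
    "dbtable": "orders",                            # hypothetical table name
    "partitionColumn": "order_id",  # <- read.jdbc.columnName
    "lowerBound": "1",              # <- read.jdbc.lowerBound (inclusive)
    "upperBound": "1000000",        # <- read.jdbc.upperBound (exclusive)
    "numPartitions": "10",          # <- read.jdbc.numPartitions
}

# With a live SparkSession, the partitioned read would look like:
# df = spark.read.format("jdbc").options(**options).load()
print(options["partitionColumn"])
```

All four partitioning options must be supplied together for Spark to parallelize the read; omitting any of them produces a single-partition (single-connection) scan.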
read.jdbc.columnName (String; default: column1)
    The name of a column of integral type that will be used for partitioning.

read.jdbc.lowerBound (any value)
    The minimum value of columnName, used to decide the partition stride. This option works with read.jdbc.columnName.

read.jdbc.upperBound (any value)
    The maximum value of columnName, used to decide the partition stride. This option works with read.jdbc.columnName.

read.jdbc.numPartitions (Integer; default: 5)
    The number of partitions. This, along with lowerBound (inclusive) and upperBound (exclusive), forms partition strides for the generated WHERE clause expressions used to split the column columnName evenly. When the input is less than 1, the number is set to 1.
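The stride behavior described above can be illustrated with a short sketch. This is a simplified, hypothetical reimplementation of how per-partition WHERE clauses are derived from lowerBound, upperBound, and numPartitions (Spark's actual logic lives in JDBCRelation.columnPartition and handles more edge cases, such as skewed bounds and date columns):

```python
def jdbc_where_clauses(column, lower_bound, upper_bound, num_partitions):
    """Sketch of per-partition WHERE clause generation for a JDBC read.

    lower_bound is inclusive, upper_bound is exclusive, and an input of
    less than 1 partition is clamped to 1 (a single full-table scan).
    Assumes an integral partition column.
    """
    n = max(num_partitions, 1)  # inputs < 1 are set to 1
    if n == 1:
        return [None]  # one partition: no WHERE clause, full scan
    stride = (upper_bound - lower_bound) // n
    clauses = []
    current = lower_bound
    for i in range(n):
        if i == 0:
            # First partition also picks up NULLs in the column.
            clauses.append(f"{column} < {current + stride} OR {column} IS NULL")
        elif i == n - 1:
            # Last partition is open-ended so no rows are dropped.
            clauses.append(f"{column} >= {current}")
        else:
            clauses.append(f"{column} >= {current} AND {column} < {current + stride}")
        current += stride
    return clauses

# Example: 4 partitions over values [0, 100) yields a stride of 25.
for clause in jdbc_where_clauses("id", 0, 100, 4):
    print(clause)
```

Note that the bounds only shape the strides; rows outside [lowerBound, upperBound) are still read, because the first and last clauses are open-ended.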