Cartesian products are very slow. Worse, they can consume a lot of memory and trigger an OOM. If the join type is not Inner, Spark SQL can pick Broadcast Nested Loop Join even when neither side of the join is small enough to broadcast, which can also generate a lot of unwanted network traffic. To prevent users from accidentally running cartesian products, Spark 2.0 introduced the flag spark.sql.crossJoin.enabled. When a query contains a cartesian product, users need to turn on this flag, whose default is false; otherwise, they hit an error message.
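For illustration, here is a minimal sketch of a query that hits this error when the flag is left at its default (the table names table1 and table2 are assumed to be registered views; see the setup sketch further below):

// With spark.sql.crossJoin.enabled at its default (false), planning this
// implicit cartesian product fails with an AnalysisException.
spark.sql("SELECT * FROM table1 JOIN table2")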
To avoid having to turn on the session-scoped configuration flag, Spark 2.1 introduces explicit syntax for cartesian products: users can write CROSS JOIN in SQL or call the new crossJoin method in the Dataset API. Below are examples:
// CROSS JOIN using SQL
spark.sql("SELECT * FROM table1 CROSS JOIN table2")

// CROSS JOIN using the Dataset API
df1.crossJoin(df2)

// Trigger a cartesian product with the regular inner join syntax
// after turning on the flag
spark.conf.set("spark.sql.crossJoin.enabled", "true")
df1.join(df2)
spark.conf.set("spark.sql.crossJoin.enabled", "false")
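For completeness, here is a self-contained sketch of the setup the snippets above assume (the data, column names, and app name are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("CrossJoinExamples")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Two small illustrative Datasets; any pair of Datasets behaves the same.
val df1 = Seq((1, "a"), (2, "b")).toDF("id", "v1")
val df2 = Seq((10, "x"), (20, "y")).toDF("id", "v2")
df1.createOrReplaceTempView("table1")
df2.createOrReplaceTempView("table2")

// Explicit cross join: 2 x 2 = 4 rows, no configuration flag needed.
df1.crossJoin(df2).show()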
Unfortunately, Spark 2.1.0 introduced a regression: not all cartesian products are correctly detected. There are two typical cases, with a sketch of Case 1 after this list:
Case 1) non-equality predicates in the join condition of an inner join.
Case 2) an equi-join whose key columns are not sortable, where neither side is small enough for broadcasting.
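As a rough illustration of Case 1 (the column names are hypothetical, and the exact behavior depends on the Spark 2.1.0 planner):

// An inner join whose condition holds only a non-equality predicate is
// effectively a cartesian product followed by a filter; under the 2.1.0
// regression the planner may fail to raise the expected error even when
// spark.sql.crossJoin.enabled is off.
val result = df1.join(df2, df1("id") < df2("id"))
result.explain()  // plan falls back to a cartesian/nested-loop strategy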
These issues will be fixed in the following fixpack and in Spark 2.2.