Cartesian products are very slow. Worse, they can consume a lot of memory and trigger an OOM. If the join type is not Inner, Spark SQL can pick Broadcast Nested Loop Join even when neither side of the join is small enough to broadcast, which can also generate a lot of unwanted network traffic. To prevent users from accidentally running cartesian products, Spark 2.0 introduced the flag spark.sql.crossJoin.enabled. When a query contains a cartesian product, users need to turn on this flag, whose default is false; otherwise, they hit an error message.
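For illustration, here is a minimal sketch of a query that hits this error when the flag is left at its default (the table names table1 and table2 are assumed to be registered views; see the setup sketch further below):

// With spark.sql.crossJoin.enabled at its default (false), planning this
// implicit cartesian product fails with an AnalysisException.
spark.sql("SELECT * FROM table1 JOIN table2")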
To avoid having to turn on the session-scoped configuration flag, Spark 2.1 introduces explicit syntax for cartesian products: users can write CROSS JOIN in SQL or call the new crossJoin method in the Dataset API. Below are examples:
// CROSS JOIN using SQL
spark.sql("SELECT * FROM table1 CROSS JOIN table2")

// CROSS JOIN using the Dataset API
df1.crossJoin(df2)

// Trigger a cartesian product with the regular inner join syntax
// after turning on the flag
spark.conf.set("spark.sql.crossJoin.enabled", "true")
df1.join(df2)
spark.conf.set("spark.sql.crossJoin.enabled", "false")
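For completeness, here is a self-contained sketch of the setup the snippets above assume (the data, column names, and app name are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("CrossJoinExamples")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Two small illustrative Datasets; any pair of Datasets behaves the same.
val df1 = Seq((1, "a"), (2, "b")).toDF("id", "v1")
val df2 = Seq((10, "x"), (20, "y")).toDF("id", "v2")
df1.createOrReplaceTempView("table1")
df2.createOrReplaceTempView("table2")

// Explicit cross join: 2 x 2 = 4 rows, no configuration flag needed.
df1.crossJoin(df2).show()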
Unfortunately, Spark 2.1.0 introduced a regression: not all cartesian products are correctly detected. There are two typical cases, with a sketch of Case 1 after this list:
Case 1) non-equality predicates in the join condition of an inner join.
Case 2) an equi-join whose key columns are not sortable, where neither side is small enough for broadcasting.
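As a rough illustration of Case 1 (the column names are hypothetical, and the exact behavior depends on the Spark 2.1.0 planner):

// An inner join whose condition holds only a non-equality predicate is
// effectively a cartesian product followed by a filter; under the 2.1.0
// regression the planner may fail to raise the expected error even when
// spark.sql.crossJoin.enabled is off.
val result = df1.join(df2, df1("id") < df2("id"))
result.explain()  // plan falls back to a cartesian/nested-loop strategy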
These issues will be fixed in the following fixpack and in Spark 2.2.