Spark 2.1: New Cross Join Syntax

Cartesian products are very slow. More importantly, it could consume a lot of memory and trigger OOM. If the join type is not Inner, Spark SQL could use Broadcast Nested Loop Join even if both sides of tables are not small enough. Thus, it also could cause lots of unwanted network traffic. For avoiding users to accidentally use cartesian products, Spark 2.0 introduced a flag spark.sql.crossJoin.enabled. When a query contains a cartesian product, users need to turn on this flag whose default is false; otherwise, users can hit an error message.

To avoid turning on the session-scoped configuration flag, Spark 2.1 introduces a new syntax for cartesian product. Users can explicitly use the new crossJoin syntax. Below are the examples

Unfortunately, Spark 2.1.0 introduced a regression. Not all the cartesian products can be correctly detected. There are two typical cases:
Case 1) having non-equal predicates in join conditiions of an inner join.
Case 2) equi-join’s key columns are not sortable and both sides are not small enough for broadcasting.

Will fix it in the following fixpack and Spark 2.2.

Leave a Reply

Your email address will not be published. Required fields are marked *