What is the order of execution of join and where clauses in Hive?

2 years ago

Benjamin Taylor

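In Hive, the order in which joins and WHERE filters actually run is not fixed; the query optimizer chooses it. Logically, a WHERE clause applies to the result of the join, but the optimizer may reorder operations as long as the query result stays the same. It weighs factors such as table size and data skew, drawing on table statistics (and, in Hive versions before 3.0, index information), to pick an efficient execution order.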

In general, Hive's optimizer tries to push the filter conditions in the WHERE clause below the join (predicate pushdown), so that fewer rows take part in the join. Shrinking the inputs before joining usually improves query efficiency. This is always safe for inner joins. For outer joins, a WHERE filter on the null-supplying table cannot be pushed below the join without changing the result, so it is applied after the join.
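
To make the effect of pushing a filter below an inner join concrete, here is a minimal sketch in plain Python. It is a toy model, not Hive code: the tables orders and customers, the helper inner_join, and the filter amount > 100 are all made up for illustration. It shows that both orders of execution return the same rows, while filtering first means fewer row comparisons during the join.

```python
# Toy tables as lists of dicts; names and data are hypothetical.
orders = [
    {"order_id": 1, "customer_id": 10, "amount": 50},
    {"order_id": 2, "customer_id": 11, "amount": 500},
    {"order_id": 3, "customer_id": 10, "amount": 700},
    {"order_id": 4, "customer_id": 12, "amount": 20},
]
customers = [
    {"customer_id": 10, "name": "Ann"},
    {"customer_id": 11, "name": "Bob"},
    {"customer_id": 12, "name": "Cy"},
]


def inner_join(left, right, key):
    """Nested-loop inner join; returns the merged rows and the comparison count."""
    out, comparisons = [], 0
    for left_row in left:
        for right_row in right:
            comparisons += 1
            if left_row[key] == right_row[key]:
                out.append({**left_row, **right_row})
    return out, comparisons


def is_large(row):
    # Stands in for a WHERE condition such as: amount > 100
    return row["amount"] > 100


# Plan A: join first, then filter (the logical SQL order).
joined, cost_a = inner_join(orders, customers, "customer_id")
plan_a = [row for row in joined if is_large(row)]

# Plan B: push the filter below the join, then join the smaller input.
filtered_orders = [row for row in orders if is_large(row)]
plan_b, cost_b = inner_join(filtered_orders, customers, "customer_id")

print(plan_a == plan_b)  # same result either way
print(cost_a, cost_b)    # comparisons made by each plan
```

The saving grows with the selectivity of the filter: the more rows the WHERE condition removes, the less work is left for the join.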

More specifically, the query optimizer may perform the following steps:

1. Apply the filter conditions in the WHERE clause to narrow down each input.
2. Based on the table statistics, decide which table to hold in memory. For a map join this is the smaller table, which is loaded into an in-memory hash table; if no table is small enough, Hive falls back to a regular shuffle join.
3. Stream the rows of the other, larger table past the hash table and output the matching records.
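
The steps above can be sketched in plain Python as a rough model of a map-side (hash) join. This is an illustration under assumptions, not Hive's implementation: the tables regions and sales, the filter qty >= 3, and the helper hash_join are hypothetical. In this sketch the smaller input is the one held in memory as a hash table, and the larger input is read once and probed against it.

```python
from collections import defaultdict


def hash_join(small, large, key):
    """Inner join: build an in-memory hash table on `small`, then stream `large`."""
    table = defaultdict(list)
    for row in small:                  # build phase: small side held in memory
        table[row[key]].append(row)
    for row in large:                  # probe phase: large side is read once
        for match in table.get(row[key], []):
            yield {**match, **row}


# Hypothetical inputs: a small dimension table and a larger fact table.
regions = [{"region_id": 1, "region": "EU"}, {"region_id": 2, "region": "US"}]
sales = [{"sale_id": i, "region_id": 1 + i % 3, "qty": i} for i in range(1, 7)]

# Step 1: apply the WHERE-style filter before joining (qty >= 3).
sales_filtered = [row for row in sales if row["qty"] >= 3]

# Step 2: pick the smaller input as the in-memory build side.
small, large = sorted([regions, sales_filtered], key=len)

# Step 3: probe the hash table with each row of the larger input.
result = list(hash_join(small, large, "region_id"))
for row in result:
    print(row)
```

Building the hash table on the smaller side keeps memory use low, which is why the choice in step 2 depends on knowing the table sizes.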

It is important to note that the optimizer's decisions are only as good as the statistics it has. Query performance in Hive can therefore often be improved by collecting and refreshing table statistics, for example with ANALYZE TABLE <table> COMPUTE STATISTICS. Creating indexes is an option only in older releases, because index support was removed in Hive 3.0.
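
Why out-of-date statistics matter can be shown with a very small toy planner in Python. It is not Hive's actual logic: the function pick_build_side, the table names, and the row counts are hypothetical. The planner holds in memory whichever table it estimates to be smallest, so a stale estimate leads it to the wrong table.

```python
def pick_build_side(estimated_rows):
    """Return the table the toy planner would hold in memory: the smallest estimate."""
    return min(estimated_rows, key=estimated_rows.get)


# True table sizes at query time (hypothetical numbers).
actual_rows = {"events": 2_000_000, "users": 50_000}

# Stale estimates, recorded when `events` was still nearly empty.
stale_stats = {"events": 1_000, "users": 50_000}
# Estimates refreshed after the data grew.
fresh_stats = dict(actual_rows)

stale_choice = pick_build_side(stale_stats)
fresh_choice = pick_build_side(fresh_stats)

print(stale_choice, actual_rows[stale_choice])
print(fresh_choice, actual_rows[fresh_choice])
```

With the stale numbers the planner tries to hold the two-million-row table in memory; with refreshed numbers it picks the much smaller one.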