Skip to main content

Three Joins in Postgres

·906 words·5 mins

When you execute a query what exactly happens? You don’t really think about it, you just go on your merry way and expect a result to get back. But lets say that your application is now at a scale where you need to think about scale, suddenly you are wondering why your query is taking a long time to return results. It use to not be that slow, so

So you have a postgres database that has been serving your application’s backend. And you have found great success in your app and has grown in popularity. Your application is grinding, some requests are getting dropped. You do some debugging. You now have slow queries but you don’t know why. Here you embark on a journey to discover what is really going on behind the scenes and start to have a deeper understanding of the abstraction that is everyone’s favorite datastore, postgres.

First of i recommend that you read the postgres internals doc, it is more precise and accurate in how it describes what is going on behind the scenes. What I offer here are my own thoughts on how I’ve come to appreciate it.

What happens when anyone executes a query?
#

When you execute a query, the string representation of the query is parsed into a query tree . This is like and abstract syntax tree that interprets code and converts into a data structure that postgres can work with. This is done by scan.l (the lexer) and gram.y (the parser), after the parser query gets checked against the rule system defined in the system, then it goes to the planner.

How does postgres work with queries?
#

The planner evaluate the query and calculates the most efficient way to retrieve the data using what it knows about each table via it’s statistics. For an RDBMS system where you have multiple tables, the planner also determines the best kind of join to use: it has 3 choices: nested loops, merge and hash join.

Nested Loop Joins
#

Nested Loop Joins are the most basic and easy to explain type of join. Say you have a sequence and youd like to join it with another sequence. In a nested loop join the algorithm for joining takes an item in one relation, and loops over every item in that other relation to find matches to join on. When resources describe this they talk about the inner and outer tables. Don’t get hung up on that, just know that there are two entities, one of them gets scanned fully while the other is scanned for each row in the other for matches. Indexes on the table matter alot here. If you are dealing with Nested Loop joins its a good idea that scans use an index in one or both tables.

for (i = 0; i < data.length; i++) {
	for (j = 0; i < data.length; j++) {
		if (i.value == j.value) {
			return match
		}
	}
}

If you are wondering why it was made like this, It is likely because there really isnt a way for the planner to know that one you have a match that its the only one so it has to make sure it continues on and on until it is sure. Nested Loop Joins are generally prefered when both sides of the joins are small since the time complexity of it is O(MN) where M and N are the size of the tables joined. You’ll see Nested Loop joins where there are joins that do not use an equality operator.

Merge Join
#

In merge joins the two tables are sorted first by their joining keys. Then both are scanned in parallel for matches. So the time here takes in sorting the two tables, and then doing a scan. In the worst case the planner scans both tables to the end. The algorithm is similar to a two pointer algorithm that moves the pointer when matches one side are found or not. Ideally, the planner has available the index in order to execute the sort for the scan. Therefore the planner can use one of the join keys to sort the values on each side and then do a join.

This has the time complexity of O(M + N) where both M and N are the size of the tables to be joined.

Hash Join
#

In hash joins a hash table is created using the joining keys on the query. Then the other table is scanned, using the joining keys, a hash is created and used to retrieve the appropriate values from the hash table. This requires space in your database for the hash map. The planner will likely use this when it is dealing with tables that are too large for a nested loop, based on the query statistics.

Keep in mind the following properties: work_mem and hash_mem_multiplier, when you see hash joins in your explain. Hash joins are generally prefered when there are equality operators and both sides of the join are large for nested loop joins, and sorting would be a nightmare.

Now what?
#

When you have many tables and joins you’ll see the planner using these types of joins to get your final resultset. Now that you know a bit more of what it is doing behind the scenes you can have a clearer direction on where you can apply your optimizations. Happy querying!