2014-05-20

How to draw a tree using window functions and recursive query in PostgreSQL

In this post I would like to show you how to draw a nice looking tree (similar to output of unix tree command) using select statement in PostgreSQL.

Suppose we have a view v_deps, which contains information about dependencies between views, materialized views and tables (the definition of this view is at the end of the post). Every row in this view means that dependent object dep_obj depends on referenced object ref_obj:
select * from v_deps;
 dep_obj | ref_obj
---------+---------
 mv_abcd | v_abcd
 v_abcd  | v_abc
 v_abcd  | v_d
 v_abc   | v_ab
 v_abc   | v_c
 v_ab    | t_a
 v_ab    | t_b
 v_c     | t_c
 v_d     | t_d
We are going to write a query against this view, which will return all dependencies of mv_abcd formatted as a tree:
 mv_abcd
 `-- v_abcd
     |-- v_abc
     |   |-- v_ab
     |   |   |-- t_a
     |   |   `-- t_b
     |   `-- v_c
     |       `-- t_c
     `-- v_d
         `-- t_d
Let's start with something simple. To find dependencies of materialized view mv_abcd we can run the following query:
select dep_obj, ref_obj from v_deps where dep_obj = 'mv_abcd';
 dep_obj | ref_obj
---------+---------
 mv_abcd | v_abcd
This returns only direct dependencies of mv_abcd. In order to see transitive dependencies we need to use a recursive query:
with recursive deps(dep_obj, ref_obj) as
(
  select '', 'mv_abcd' 
  union
  select d2.dep_obj, d2.ref_obj 
  from deps d1, v_deps d2
  where d1.ref_obj = d2.dep_obj
)
select ref_obj
from deps;
 ref_obj
---------
 mv_abcd
 v_abcd
 v_d
 v_abc
 t_d
 v_c
 v_ab
 t_c
 t_b
 t_a
The results seem to be in random order and are not very readable, so let's add sorting and indentation to form a simple tree-like structure:
with recursive deps(dep_obj, ref_obj, ind, ord) as
(
  select '', 'mv_abcd', '', array[row_number() over ()] 
  union
  select d2.dep_obj, d2.ref_obj, 
    d1.ind || '    ', d1.ord || row_number() over (order by d2.ref_obj)
  from deps d1, v_deps d2
  where d1.ref_obj = d2.dep_obj
)
select ind || ref_obj, ord
from deps
order by ord;
      ?column?       |     ord
---------------------+-------------
 mv_abcd             | {1}
     v_abcd          | {1,1}
         v_abc       | {1,1,1}
             v_ab    | {1,1,1,2}
                 t_a | {1,1,1,2,1}
                 t_b | {1,1,1,2,2}
             v_c     | {1,1,1,3}
                 t_c | {1,1,1,3,3}
         v_d         | {1,1,2}
             t_d     | {1,1,2,1}
It seems to be almost done. The only part that is left is to transform this tree formatted with spaces into a tree formatted with appropriate characters:
 mv_abcd                                  mv_abcd
     v_abcd                               `-- v_abcd
         v_abc                                |-- v_abc
             v_ab                             |   |-- v_ab
                 t_a         ====>            |   |   |-- t_a
                 t_b                          |   |   `-- t_b
             v_c                              |   `-- v_c
                 t_c                          |       `-- t_c
         v_d                                  `-- v_d
             t_d                                  `-- t_d
How to do it? For each result row of the previous query we need to analyze its ord array and instead of indenting always with four spaces, we should indent with one of these strings:
"|-- ", "`-- ", "|   ", "    "
. The algorithm is best illustrated with pictures:



This means we need to find a way to do two things in select statement:
  • for each result row of the previous query iterate over its ord array - this can be done with recursive query
  • for each iteration analyze the relation of current row to other rows - this can be done using window functions
Below is a step by step implementation of this algorithm. Every step adds a little modification to the query from the previous step and gets closer to the desired result. The description is quite long - if you just want to see the final result, feel free to skip to the last step.

Step 1 - iterate over an array using recursive query

First let's simplify the previous query so it doesn't return indented data:
with recursive deps(dep_obj, ref_obj, ord) as
(
  select '', 'mv_abcd', array[row_number() over ()] 
  union
  select d2.dep_obj, d2.ref_obj, 
    d1.ord || row_number() over (order by d2.ref_obj)
  from deps d1, v_deps d2
  where d1.ref_obj = d2.dep_obj
)
select ref_obj, ord
from deps
order by ord;
 ref_obj |     ord
---------+-------------
 mv_abcd | {1}
...
Now we can write another query which will use the results of the previous one as starting data and iterate over ord array of each result row.
with recursive tree_data(name, idx, ord) as
(
  select t.ref_obj, 1, t.ord
  from
  (
    with recursive deps(dep_obj, ref_obj, ord) as
    (
      select '', 'mv_abcd', array[row_number() over ()] 
      union
      select d2.dep_obj, d2.ref_obj, 
        d1.ord || row_number() over (order by d2.ref_obj)
      from deps d1, v_deps d2
      where d1.ref_obj = d2.dep_obj
    )
    select ref_obj, ord
    from deps
    order by ord
  ) t
  union all
  select name, idx + 1, ord
  from tree_data
  where ord[idx + 1] is not null
)
select name, idx, ord
from tree_data;
and it returns:
  name   | idx |     ord
---------+-----+-------------
 mv_abcd |   1 | {1}
...
The iteration works as expected. Every input row apperas in the query result exactly array_length(ord) times, with idx values in range [1, ..., array_length(ord)].

Step 2 - analyze relation of current row to other rows

Now we can implement the rules presented in the slides. In each iteration we are going to check some conditions and choose an appropriate indentation string (one of
"|-- ", "`-- ", "|   ", "    "
). This is achieved with the following case statement:
case
  when ord[idx + 1] != lag(ord[idx + 1], 1, -1::bigint) over w
       and ord[idx + 1] != last_value(ord[idx + 1]) over w then '|-- '
  when ord[idx + 1] != lag(ord[idx + 1], 1, -1::bigint) over w 
       and ord[idx + 1] = last_value(ord[idx + 1]) over w then '`-- '
  when ord[idx + 1] != last_value(ord[idx + 1]) over w then '|   '
  else '    '
end
We are using window functions, which are evaluated over window w defined as follows:
window w as (partition by ord[idx] 
             order by ord rows between current row and unbounded following)
This means that input rows are split into groups based on parent of current row (partition by ord[idx]). Then within a partition the rows are sorted by their order in the tree (order by ord - similar to what we've seen in previous queries) and we define current frame of rows as [current row, end of partition] in order to use last_value() window function.

This is the full query:
with recursive tree_data(name, idx, ord, ind) as
(
  select t.ref_obj, 1, t.ord, ''
  from
  (
    with recursive deps(dep_obj, ref_obj, ord) as
    (
      select '', 'mv_abcd', array[row_number() over ()] 
      union
      select d2.dep_obj, d2.ref_obj, 
        d1.ord || row_number() over (order by d2.ref_obj)
      from deps d1, v_deps d2
      where d1.ref_obj = d2.dep_obj
    )
    select ref_obj, ord
    from deps
    order by ord
  ) t
  union all
  select name, idx + 1, ord, 
  case
    when ord[idx + 1] != lag(ord[idx + 1], 1, -1::bigint) over w
         and ord[idx + 1] != last_value(ord[idx + 1]) over w then '|-- '
    when ord[idx + 1] != lag(ord[idx + 1], 1, -1::bigint) over w 
         and ord[idx + 1] = last_value(ord[idx + 1]) over w then '`-- '
    when ord[idx + 1] != last_value(ord[idx + 1]) over w then '|   '
    else '    '
  end
  from tree_data
  where ord[idx + 1] is not null
  window w as (partition by ord[idx] 
               order by ord rows between current row and unbounded following)
)
select name, ind
from tree_data;
  name   | ind
---------+------
 mv_abcd |
...


Step 3 - concatenate indentation strings in correct order

The previous query returned rows with tree node name, iteration index and indentation string. In this step we're going to concatenate these indentation strings in the correct order (iteration order).
with recursive tree_data(name, idx, ord, ind) as
(
  select t.ref_obj, 1, t.ord, ''
  from
  (
    with recursive deps(dep_obj, ref_obj, ord) as
    (
      select '', 'mv_abcd', array[row_number() over ()] 
      union
      select d2.dep_obj, d2.ref_obj, 
        d1.ord || row_number() over (order by d2.ref_obj)
      from deps d1, v_deps d2
      where d1.ref_obj = d2.dep_obj
    )
    select ref_obj, ord
    from deps
    order by ord
  ) t
  union all
  select name, idx + 1, ord, 
  case
    when ord[idx + 1] != lag(ord[idx + 1], 1, -1::bigint) over w
         and ord[idx + 1] != last_value(ord[idx + 1]) over w then '|-- '
    when ord[idx + 1] != lag(ord[idx + 1], 1, -1::bigint) over w 
         and ord[idx + 1] = last_value(ord[idx + 1]) over w then '`-- '
    when ord[idx + 1] != last_value(ord[idx + 1]) over w then '|   '
    else '    '
  end
  from tree_data
  where ord[idx + 1] is not null
  window w as (partition by ord[idx] 
               order by ord rows between current row and unbounded following)
)
select name, 
string_agg(ind, '') over (partition by name 
    order by idx rows between unbounded preceding and unbounded following) ind
from tree_data;
  name   |       ind
---------+------------------
 mv_abcd |
...


Step 4 - use group by to eliminate duplicates; concatenate indentation string and node name

The last step is to remove duplicates from result of previous query and concatenate indentation string and node name to generate final result.
select ind || name 
from
(
  with recursive tree_data(name, idx, ord, ind) as
  (
    select t.ref_obj, 1, t.ord, ''
    from
    (
      with recursive deps(dep_obj, ref_obj, ord) as
      (
        select '', 'mv_abcd', array[row_number() over ()] 
        union
        select d2.dep_obj, d2.ref_obj, 
          d1.ord || row_number() over (order by d2.ref_obj)
        from deps d1, v_deps d2
        where d1.ref_obj = d2.dep_obj
      )
      select ref_obj, ord
      from deps
      order by ord
    ) t
    union all
    select name, idx + 1, ord, 
    case
      when ord[idx + 1] != lag(ord[idx + 1], 1, -1::bigint) over w
           and ord[idx + 1] != last_value(ord[idx + 1]) over w then '|-- '
      when ord[idx + 1] != lag(ord[idx + 1], 1, -1::bigint) over w 
           and ord[idx + 1] = last_value(ord[idx + 1]) over w then '`-- '
      when ord[idx + 1] != last_value(ord[idx + 1]) over w then '|   '
      else '    '
    end
    from tree_data
    where ord[idx + 1] is not null
    window w as (partition by ord[idx] 
                 order by ord rows between current row and unbounded following)
  )
  select name, 
  string_agg(ind, '') over (partition by name 
      order by idx rows between unbounded preceding and unbounded following) ind,
  ord
  from tree_data
) t
group by ind, name, ord
order by ord;
      ?column?
---------------------
 mv_abcd
 `-- v_abcd
     |-- v_abc
     |   |-- v_ab
     |   |   |-- t_a
     |   |   `-- t_b
     |   `-- v_c
     |       `-- t_c
     `-- v_d
         `-- t_d

Summary

In the beginning of the post I promised to show definition of v_deps view. This is DDL for the whole schema, which I used in this post:
create table t_a (a integer primary key);
...
Note two things:
  • v_deps contains qualified names of objects (with schema name) - for example public.mv_abcd instead of mv_abcd. For simplicity I removed "public." from examples in this post (it introduced a lot of noise). If you create schema from DDL above and try to run queries from this post, remember to substitute 'mv_abcd' with 'public.mv_abcd'.
  • the solution presented in this post is quite general - it is ready for use with any hierarchical data for which you can create a view similar to v_deps


TL;DR

This post describes how to transform any parent-child hierarchy data into a tree using recursive select and window functions.
 parent  | child
---------+--------                                                mv_abcd
 mv_abcd | v_abcd                                                 `-- v_abcd
 v_abcd  | v_abc                                                      |-- v_abc
 v_abcd  | v_d                    RECURSIVE SELECT                    |   |-- v_ab
 v_abc   | v_ab        ===>               +              ===>         |   |   |-- t_a
 v_abc   | v_c                    WINDOW FUNCTIONS                    |   |   `-- t_b
 v_ab    | t_a                                                        |   `-- v_c
 v_ab    | t_b                                                        |       `-- t_c
 v_c     | t_c                                                        `-- v_d
 v_d     | t_d                                                            `-- t_d