The background to this is explained in https://gist.github.com/wolfgangwalther/5425d64e7b0d20aad71f6f68474d9f19
This gist explains the algorithm to transform a pg_node_tree into a JSON format that can be used by PostgREST to obtain the :targetList of the view.
Author: Wolfgang Walther https://github.com/wolfgangwalther
License: MIT
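For context, such a pg_node_tree comes from the system catalogs: a view's rewritten query is stored in pg_rewrite.ev_action as part of its _RETURN rule. A minimal sketch of how to fetch it (the view name is just a placeholder, not part of the example below):

  -- Fetch the pg_node_tree of a view's ON SELECT rule from the catalogs.
  -- 'projects_view' is a hypothetical name used for illustration only.
  SELECT r.ev_action
  FROM pg_rewrite r
  JOIN pg_class c ON c.oid = r.ev_class
  WHERE c.relname = 'projects_view'
    AND r.rulename = '_RETURN';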
We're starting with the same simple example that Steve used in part #1:
({QUERY
:commandType 1 :querySource 0 :canSetTag true :utilityStmt <> :resultRelation 0 :hasAggs false :hasWindowFuncs false :hasSubLinks false
:hasDistinctOn false :hasRecursive false :hasModifyingCTE false :hasForUpdate false :hasRowSecurity false :cteList <>
:rtable (
{RTE
:alias {ALIAS :aliasname old :colnames <>} :eref {ALIAS :aliasname old :colnames ("id" "name" "client_id")} :rtekind 0
:relid 564854 :relkind v :tablesample <> :lateral false :inh false :inFromCl false :requiredPerms 0 :checkAsUser 0 :selectedCols (b)
:insertedCols (b) :updatedCols (b) :securityQuals <>}
{RTE
:alias {ALIAS :aliasname new :colnames <>} :eref {ALIAS :aliasname new :colnames ("id" "name" "client_id")} :rtekind 0
:relid 564854 :relkind v :tablesample <> :lateral false :inh false :inFromCl false :requiredPerms 0 :checkAsUser 0 :selectedCols (b)
:insertedCols (b) :updatedCols (b) :securityQuals <>}
{RTE
:alias <> :eref {ALIAS :aliasname projects :colnames ("id" "name" "client_id")} :rtekind 0 :relid 564848 :relkind r
:tablesample <> :lateral false :inh true :inFromCl true :requiredPerms 2 :checkAsUser 0 :selectedCols (b 9 10 11)
:insertedCols (b) :updatedCols (b) :securityQuals <>})
:jointree {FROMEXPR :fromlist ({RANGETBLREF :rtindex 3}) :quals <>}
:targetList (
{TARGETENTRY
:expr {VAR :varno 3 :varattno 1 :vartype 23 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 3 :varoattno 1 :location 37}
:resno 1 :resname id :ressortgroupref 0 :resorigtbl 564848 :resorigcol 1 :resjunk false}
{TARGETENTRY
:expr {VAR :varno 3 :varattno 2 :vartype 25 :vartypmod -1 :varcollid 100 :varlevelsup 0 :varnoold 3 :varoattno 2 :location 54}
:resno 2 :resname name :ressortgroupref 0 :resorigtbl 564848 :resorigcol 2 :resjunk false}
{TARGETENTRY
:expr {VAR :varno 3 :varattno 3 :vartype 23 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 3 :varoattno 3 :location 73}
:resno 3 :resname client_id :ressortgroupref 0 :resorigtbl 564848 :resorigcol 3 :resjunk false})
:onConflict <> :returningList <> :groupClause <> :groupingSets <> :havingQual <> :windowClause <> :distinctClause <> :sortClause <> :limitOffset <>
:limitCount <> :rowMarks <> :setOperations <> :constraintDeps <>})
Let's first note a couple of similarities between the pg_node_tree format and JSON:
- nodes: Each node has a type in capital letters (QUERY, RTE, TARGETENTRY, ...) and several key-value pairs (:commandType 1, :resorigtbl 564848, :resorigcol 1, ...) as parameters. Each node is enclosed by {...}. Nodes are very similar to JSON objects.
- lists: Lists are enclosed by (...) with a space as separator between items. Some examples: :rtable ({RTE ...} {RTE ...} {RTE ...}) or :targetList ({TARGETENTRY ...} {TARGETENTRY ...} {TARGETENTRY ...}). Lists are very similar to JSON arrays.
- emptiness: Missing values are represented with <>. Some examples are :utilityStmt <> (would be a node if not empty) or :cteList <> (would be a list). These could be represented as either JSON null or as empty objects/arrays.
There are a couple of pitfalls as well, though:
- strings #1: Some strings are not quoted, e.g. :aliasname projects. If we wanted to transform those, we would need to quote them for proper JSON.
- strings #2: When strings are quoted, they can contain a lot of characters that could interfere with our replace regex, e.g. :colnames ("id" "name" "client_id"). Column names can contain any character.
- lists: There are some other types of lists that carry a type indicator, e.g. :selectedCols (b 9 10 11). We need to either quote or remove it.
- constants: This is not part of the example above, but a constant expression (e.g. text) in the query will be represented like this: :constvalue 12 [ 48 0 0 0 99 111 110 115 116 97 110 116 ].
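For illustration, such a :constvalue would show up if the view's query contained a literal. The 12 appears to be the size of the datum in bytes; the first four bytes look like the varlena length header and the remaining bytes are the ASCII codes of the constant (here they spell out "constant"). A hypothetical view that would produce such a CONST node (the view name and column alias are made up):

  -- Hypothetical example: a text literal in the select list ends up as a CONST
  -- node whose :constvalue is dumped as "<size> [ <raw bytes> ]".
  CREATE VIEW project_constants AS
  SELECT 'constant'::text AS some_label, id
  FROM projects;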
The full pg_node_tree format can be reverse engineered quite easily from PG source code: https://github.com/postgres/postgres/blob/master/src/backend/nodes/outfuncs.c#L43-L288
The approach we're going to take to transform this into JSON will be as follows:
- We care a lot about performance, so we will throw out everything we don't need to make the JSON small and fast to parse:
  - Node types. We don't need them, because the structure of pg_node_tree only allows one type of node to appear in each place. We know what to look at.
  - All keys and values that we don't strictly need to access the info we're looking for. We need to access the :targetList and in there we need :resorigtbl and :resorigcol - and then one of either :resno or :resname. The latter two describe the column position and name in the view. We choose :resno here, because it's a lot easier to parse an integer than to parse arbitrary strings in :resname. We later get the column name from the system catalogs. All other keys can be thrown out.
- The whole point of using the JSON format to parse this is to easily represent the nested structures in pg_node_tree. Resolving those is not possible with regex alone. That means we can't delete any {} or () - those provide the structure of the document. We will keep all those empty objects and give them an empty key where needed. For example, we don't need the RTE nodes at all, so we transform this:
  {RTE :alias {ALIAS :aliasname old :colnames <>} :eref {ALIAS :aliasname old :colnames ("id" "name" "client_id")} :rtekind 0 :relid 564854 :relkind v :tablesample <> :lateral false :inh false :inFromCl false :requiredPerms 0 :checkAsUser 0 :selectedCols (b) :insertedCols (b) :updatedCols (b) :securityQuals <>}
  to this:
  {"":{},"":{}}
  Both of those empty objects are what remains of the two ALIAS nodes. We can repeat the same empty key - this is not a problem for the JSON parser (see the snippet after this list), and we don't care to access those keys anyway.
- To achieve the best performance, we try to limit the number of regexp_replace calls and use some additional replace calls instead. Those are a lot faster.
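A quick way to convince ourselves that repeated empty keys are fine, assuming we parse with the json type (which stores the document verbatim), rather than jsonb (which would deduplicate keys):

  -- Duplicate empty keys parse without error; we never look them up anyway.
  SELECT '{"":{},"":{}}'::json;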
Now, we're going through the replacement steps to turn the above example into a JSON-parseable format. The core part is a regexp_replace that removes all of the key-value pairs that we don't need. This saves us a lot of trouble with quoting values properly and so on.
1. Before the regex, we need to do some preparatory replacements. This is about the column names, which can contain characters that could throw off our main regex. The regex uses the three characters {, } and ,. Commas are not part of the pg_node_tree format, so we can just throw all of them out. { and } are of course a major part of the format, but inside column names they only appear in escaped form.
  REPLACE(..., ',' , '')
  REPLACE(..., '\{', '')
  REPLACE(..., '\}', '')
2. Because the regex will match :key value pairs, we manually replace the keys we want to keep with proper JSON format.
  REPLACE(..., ' :targetList ', ',"targetList":')
  REPLACE(..., ' :resno '     , ',"resno":')
  REPLACE(..., ' :resorigtbl ', ',"resorigtbl":')
  REPLACE(..., ' :resorigcol ', ',"resorigcol":')
  Because we removed all commas in step 1, the commas we just introduced are the only commas around and can serve as the "stop" signal for our regex, so that it does not touch what follows them.
3. To get rid of the node types, e.g. {QUERY, we use a little trick: we add a space and a colon, so that it looks like { :QUERY. This looks just like the start of a regular key-value pair, which the regex will remove.
  REPLACE(..., '{' , '{ :')
4. There are three types of lists that can show up in key-value pairs: lists of nodes, lists of lists of nodes, and lists of other values (strings, integers, ...). We need to keep the first two to keep the nested structure intact, but we want to get rid of the last one, to avoid the need to quote strings properly. We can achieve this by temporarily "protecting" all lists that start with either ({ or (( with an extra {. This is cheaper than adding branching to the regex.
  REPLACE(..., '((' , '{((')
  REPLACE(..., '({' , '{({')
With all those replacements done so far, just before the regex is applied, the intermediate result will look like this:
{({ :QUERY
:commandType 1 :querySource 0 :canSetTag true :utilityStmt <> :resultRelation 0 :hasAggs false :hasWindowFuncs false :hasSubLinks false
:hasDistinctOn false :hasRecursive false :hasModifyingCTE false :hasForUpdate false :hasRowSecurity false :cteList <>
:rtable {(
{ :RTE
:alias { :ALIAS :aliasname old :colnames <>} :eref { :ALIAS :aliasname old :colnames ("id" "name" "client_id")} :rtekind 0
:relid 564854 :relkind v :tablesample <> :lateral false :inh false :inFromCl false :requiredPerms 0 :checkAsUser 0 :selectedCols (b)
:insertedCols (b) :updatedCols (b) :securityQuals <>}
{ :RTE
:alias { :ALIAS :aliasname new :colnames <>} :eref { :ALIAS :aliasname new :colnames ("id" "name" "client_id")} :rtekind 0
:relid 564854 :relkind v :tablesample <> :lateral false :inh false :inFromCl false :requiredPerms 0 :checkAsUser 0 :selectedCols (b)
:insertedCols (b) :updatedCols (b) :securityQuals <>}
{ :RTE
:alias <> :eref { :ALIAS :aliasname projects :colnames ("id" "name" "client_id")} :rtekind 0
:relid 564848 :relkind r :tablesample <> :lateral false :inh true :inFromCl true :requiredPerms 2 :checkAsUser 0 :selectedCols (b 9 10 11)
:insertedCols (b) :updatedCols (b) :securityQuals <>})
:jointree { :FROMEXPR :fromlist {({ :RANGETBLREF :rtindex 3}) :quals <>}
,"targetList":{(
{ :TARGETENTRY
:expr { :VAR :varno 3 :varattno 1 :vartype 23 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 3 :varoattno 1 :location 37}
,"resno":1 :resname id :ressortgroupref 0,"resorigtbl":564848,"resorigcol":1 :resjunk false}
{ :TARGETENTRY
:expr { :VAR :varno 3 :varattno 2 :vartype 25 :vartypmod -1 :varcollid 100 :varlevelsup 0 :varnoold 3 :varoattno 2 :location 54}
,"resno":2 :resname name :ressortgroupref 0,"resorigtbl":564848,"resorigcol":2 :resjunk false}
{ :TARGETENTRY
:expr { :VAR :varno 3 :varattno 3 :vartype 23 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 3 :varoattno 3 :location 73}
,"resno":3 :resname client_id :ressortgroupref 0,"resorigtbl":564848,"resorigcol":3 :resjunk false})
:onConflict <> :returningList <> :groupClause <> :groupingSets <> :havingQual <> :windowClause <> :distinctClause <> :sortClause <> :limitOffset <>
:limitCount <> :rowMarks <> :setOperations <> :constraintDeps <>})
5. Now we apply the following regex replace:
  REGEXP_REPLACE(..., ' :[^{,}]+', ',"":', 'g')
  This removes all the unused keys and values. Keys start with :. The match stops at , for the fields we want to keep, and at { or }, because those change the level of nesting, which we need to keep. The replacement inserts an empty string as the key for the nested objects, as mentioned above.
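As a quick sanity check, the regex can be tried on a shortened fragment of the intermediate text above (the fragment is illustrative; the comment shows the output it produces):

  SELECT REGEXP_REPLACE(
    '{ :TARGETENTRY :expr { :VAR :varno 3},"resno":1 :resname id,"resorigtbl":564848}',
    ' :[^{,}]+', ',"":', 'g');
  -- Result: {,"":{,"":},"resno":1,"":,"resorigtbl":564848}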
The result after the regex looks like this:
{({
,"":{(
{,"":{,"":},"":{,"":},"":}
{,"":{,"":},"":{,"":},"":}
{,"":{,"":},"":})
,"":{,"":{({,"":}),"":},
"targetList":{(
{,"":{,"":},"resno":1,"":,"resorigtbl":564848,"resorigcol":1,"":}
{,"":{,"":},"resno":2,"":,"resorigtbl":564848,"resorigcol":2,"":}
{,"":{,"":},"resno":3,"":,"resorigtbl":564848,"resorigcol":3,"":})
,"":})
It would be possible to recursively delete all the empty keys and objects and receive a "nice" JSON output from this - but that is much more expensive compared to just parsing it as-is. To be able to parse it as-is, we need to apply a couple of "clean-up" replacements first.
6. Because of how the regex worked, some of the empty keys don't have a value associated with them: ,"":} and ,"":,. Those are removed first, because they are not valid JSON.
  REPLACE(..., ',"":}', '}')
  REPLACE(..., ',"":,', ',')
7. Next we reverse step 4, where we protected some of the lists. We currently have a couple of { too many in place, so the braces are not balanced.
  REPLACE(..., '{(', '(')
8. Because we added every key (both the empty and the kept ones) with a , in front of it, we now have some cases where an object starts with {,. This is invalid JSON, so we need to remove the comma here.
  REPLACE(..., '{,', '{')
9. Now we just need to replace a few characters with their JSON counterparts: ( and ) need to become [ and ], and spaces between list items need to be replaced with commas.
  REPLACE(..., '(', '[')
  REPLACE(..., ')', ']')
  REPLACE(..., ' ', ',')
10. Now there is only one case remaining: in some cases there are empty targetLists - not in the example above, though. The empty value in pg_node_tree is <>, so this would currently still be there, and it is invalid JSON as well. We replace it with an empty list, so that our main query can just do json_array_elements(...) on it in the same way as with every other targetList.
  REPLACE(..., '<>', '[]')
The result now looks like this:
[{
"":[
{"":{},"":{}},
{"":{},"":{}},
{"":{}}
],
"":{"":[{}]},
"targetList":[
{"":{},"resno":1,"resorigtbl":564848,"resorigcol":1},
{"":{},"resno":2,"resorigtbl":564848,"resorigcol":2},
{"":{},"resno":3,"resorigtbl":564848,"resorigcol":3}
]
}]
This is valid JSON. The targetList can easily be extracted from it.
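To make the whole pipeline concrete, here is a sketch that chains all of the steps above into a single helper function. The function name and signature are made up, the nesting order follows the step numbers (the innermost call is applied first), and the actual query used by PostgREST may be organized differently:

  -- Sketch only: wraps the replacements described above into one expression.
  -- Assumes standard_conforming_strings = on (the default), so '\{' and '\}'
  -- are the two-character sequences backslash + brace.
  CREATE FUNCTION pg_node_tree_to_json(tree text) RETURNS json
  LANGUAGE sql IMMUTABLE AS $$
    SELECT
      REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE( -- steps 6-10, applied last
        REGEXP_REPLACE(                                                -- step 5
          REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE( -- steps 1-4, applied first
            tree
          , ',' , '')                                   -- step 1: drop all commas
          , '\{', ''), '\}', '')                        -- step 1: drop escaped braces in column names
          , ' :targetList ', ',"targetList":')          -- step 2: keys we want to keep
          , ' :resno '     , ',"resno":')
          , ' :resorigtbl ', ',"resorigtbl":')
          , ' :resorigcol ', ',"resorigcol":')
          , '{' , '{ :')                                -- step 3: node types look like key-value pairs
          , '((', '{((') , '({', '{({')                 -- step 4: protect nested lists
        , ' :[^{,}]+', ',"":', 'g')                     -- step 5: remove unused keys and values
      , ',"":}', '}') , ',"":,', ',')                   -- step 6: empty keys without values
      , '{(' , '(')                                     -- step 7: undo the protection from step 4
      , '{,' , '{')                                     -- step 8: no comma right after an opening brace
      , '(' , '[') , ')' , ']') , ' ' , ',')            -- step 9: lists become arrays
      , '<>', '[]')::json                               -- step 10: empty targetLists
  $$;

Combined with the catalog lookup from the beginning, extracting the target list of a view could then look roughly like this (again, the view name is a placeholder):

  SELECT json_array_elements(
           pg_node_tree_to_json(r.ev_action::text) -> 0 -> 'targetList'
         ) AS target_entry
  FROM pg_rewrite r
  JOIN pg_class c ON c.oid = r.ev_class
  WHERE c.relname = 'projects_view'
    AND r.rulename = '_RETURN';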