Modified Preorder Tree Traversal in Django
Introduction
This article is an extension to a previous article titled, Recursive Model Relationships in Django, which demonstrated a way to utilize the bare-bones Django capabilities to define database-backed Classes that model a common use-case for a recursive relationship. The use case I intend to satisfy is the common relationship between employees and managers of employees, which are also employees themselves.
Evaluating the Prior Implementation
The prior article defined an Employee
class that translates into a database table of the structure “employee(id, first_name, last_name, role, manager_id)” where manager_id is a foreign key that references the employee ID representing the manager of the current employee. This type of implementation of storing recursive data in a database is known as the adjacent list method.
To make this more clear, the result set below lists out employees of a fictitious company, which is listed in hierarchical order from the president at the top, then two managers and the employees they manage below them.
SELECT id, first_name, last_name, role, manager_id FROM employee ORDER BY id;
Employee Table
id | first_name | last_name | role | manager_id |
---|---|---|---|---|
1 | Jane | Doe | PRES | |
2 | John | Doe | MGR | 1 |
3 | Joe | Schmo | STD | 2 |
4 | John | Brown | STD | 2 |
5 | Adam | Smith | MGR | 1 |
6 | Milt | Friedman | STD | 5 |
Looking at the employee table listed above you can identify the hierarchical nature of the data. For example, you can tell that Jane Doe is the president (the top of the hierarchy) because her manager_id entry is empty and you can also tell that two employees report to her, John Doe and Adam Smith, because their manager_id entries equal to Jane’s employee ID of 1.
Below I demonstrate using an instance of the Employee
class from the prior article, representing Jane Doe, to retrieve the employees that report directly to her.
(venv) $ python manage.py shell
Python 3.6.2 (default, Jul 17 2017, 16:44:45)
>>> from hrmgmt.models import Employee
>>> jane_doe = Employee.objects.get(pk=1)
>>> managers = jane_doe.employee.all()
>>> for m in managers:
... print(m.first_name, m.last_name, m.role, m.manager_id, m.manager_id)
...
John Doe MGR 1
Adam Smith MGR 1
>>>
Under the hood the Django ORM issues a query similar to the following to get the employees directly under Jane Doe when the employee
property is called on an instance of the Employee
class.
SELECT * FROM htmgmt_employee WHERE manager_id = 1
id | first_name | last_name | role | manager_id |
---|---|---|---|---|
1 | John | Doe | MGR | 1 |
5 | Adam | Smith | MGR | 1 |
Similarly, to get the employees that report to John Doe you would call the employee
relationship field on an Employee
class instance representing John Doe, and under the hood the ORM would issue a query similar to this:
SELECT * FROM hrmgmt_employee WHERE manager_id = 2
id | first_name | last_name | role | manager_id |
---|---|---|---|---|
3 | Joe | Schmo | STD | 2 |
4 | John | Brown | STD | 2 |
In this way we can identify the hierarchy of the company starting with the top (Jane Doe) and working our way down the reporting chain. However, for each new manager you identify you will again have to call the employee
relationship property and the Django ORM will issue yet another query to retrieve the new set of employees reporting to the prior manager.
While this approach will certainly work – yielding the information we desire when we want to walk our way down the company listing – there is a performance concern. Each new level of management we encounter requires a yet another trip to the database, and these queries accumulate, consuming more and more resources leading to longer wait times for the client calling the program. Users will quickly become aggravated while staring at the spinning wheel of patience in the browser tab.
The same problem occurs when we try to walk up the employee listing from a regular employee up the levels of management and ending with the president. For example, consider when you want to determine the ascending line of management starting from John Brown.
You would identify the manager ID for John Brown, which is 2, then make a call to the database to determine the manager of the employee with an ID of 2.
/* Get John Brown and determine his associated manager_id */
SELECT * FROM htmgmt_employee WHERE first_name = 'John' AND last_name = 'Brown';
id | first_name | last_name | role | manager_id |
---|---|---|---|---|
4 | John | Brown | STD | 2 |
/* Get the employee with id of 2 */
SELECT * FROM htmgmt_employee WHERE id = 2;
id | first_name | last_name | role | manager_id |
---|---|---|---|---|
2 | John | Doe | MGR | 1 |
This returns John Doe, the manager of John Brown, and we see that his manager_id is 1 indicating that there is at least one more level of management above him. Once again we issue another query to determine if the employee with ID of 1 yields the top of the management hierarchy, or if there is yet another level of management.
/* Get the employee with id of 1 */
SELECT * FROM htmgmt_employee WHERE id = 1;
id | first_name | last_name | role | manager_id |
---|---|---|---|---|
1 | Jane | Doe | PRES | NULL |
Only now, after making multiple trips to the database, you can determine the hierarchy of management. In a much larger company this method will clearly have some scaling issues.
Modified Preorder Tree Traversal
Fortunately there exists another method of storing and retrieving hierarchical data in a database known, as the Modified Preorder Tree Traversal (MPTT). This second way utilizes a tree like data structure to model the data, along with some intuitive labeling of the associated nodes of the tree, enabling traversing based off of the labels.
Below is a tree representation of the data in the previous employee listing table.
The labeling scheme begins by placing a 1 to the left of the root node, president Jane Doe in this example, then you go down one node to the left of the root. At this node immediately below and to the left increment the count and label this new node with a 2. This process continues all the way down to the lowest child (leaf) node, Joe Schmo in this example. You then label the right side of the child node with the next increment and move laterally through the siblings to the right labeling left and right sides, incrementing as you go.
Once you reach the edge of the subtree, John Brown, you traverse up the tree until reaching a level that has siblings, then you again move laterally and back up the tree, similar to the prior subtree encountered until reaching the root again.
The next thing to do is to translate this nested tree into a flat table structure. This is accomplished by defining two additional columns of “left” and “right” values. However, since left and right are reserved keywords in the SQL language the actual implementations use abbreviations, such as “lft” and “rgt”.
Below is an example table of a minimal implementation of an MPTT structured table for the employee listing.
employee_mptt
id | first_name | last_name | role | manager_id | lft | rgt |
---|---|---|---|---|---|---|
1 | Jane | Doe | PRES | 1 | 12 | |
2 | John | Doe | MGR | 1 | 2 | 7 |
3 | Joe | Schmo | STD | 2 | 3 | 4 |
4 | John | Brown | STD | 2 | 5 | 6 |
5 | Adam | Smith | MGR | 1 | 8 | 11 |
6 | Milt | Friedman | STD | 5 | 9 | 10 |
Now that the data is organized and annotated with the values in the lft and rgt columns we have gained more flexibility, control, and efficiency in how we retrieve data.
Using the MPTT-structured table above you can list the employees that report to manager John Doe using the following SQL query.
SELECT * FROM employee_mptt WHERE lft > 2 and rgt < 7 ORDER BY lft;
However, to demonstrate the efficiency of the MPTT structure I'll again trace the accession of management starting from John Brown. I can accomplish this by including a few predicates in the WHERE section of the query, specifying that lft be less than 6 and rgt be greater than 6 and then ORDER
-ing by rgt will list the management hierarchy in ascending order, all in one trip to the database.
SELECT * FROM employee_mptt WHERE lft < 5 AND rgt > 6 ORDER BY rgt;
id | first_name | last_name | role | manager_id | lft | rgt |
---|---|---|---|---|---|---|
2 | John | Doe | MGR | 1 | 2 | 7 |
1 | Jane | Doe | PRES | 1 | 12 |
Annotating the employee records with the lft and rgt columns according to the MPTT structure provides us an enhanced way to traverse the data and glean useful information with more efficient and fewer interactions with the database. For example, if we wanted to know how many employees are under John Doe in the structure, assuming we already have the information for John, we can apply this simple formula:
abs((rgt - lft - 1)) / 2 = # of managed employees
Plugging in John's rgt and lft values, we get:
abs((2 - 7 - 1)) / 2 = 2
This provides us with the answer and didn't require any additional interactions with the database at all.
Django-mptt
The awesome community using and developing the Django web framework has produced the Django-MPTT project which extends Django's base functionalities and implements MPTT. The Django-MPTT project offers a number of conveniences that makes interacting with hierarchical data in the MPTT structure very convenient while achieving the efficiencies associated with MPTT data retrieval.
Implementing our employee listing of hierarchical data using Django-MPTT is quite simple. To demonstrate this I will use the existing code from the prior article's discussion of using Django to model recursive employee relationships.
If you'd like to follow along you can download the code from my GitHub account here starting at the tag for the beginning of this tutorial called "mptt-start".
Open up your command terminal, create a new virtual environment, and install the following requirements:
(venv) $ pip install django django-mptt
After running the initial migrations as described in the prior article, load the project in your favorite integrated development environment or text editor and open the model's Python script in the "hrmgmt" directory and add the following code.
# hrmgmt/models.py
from django.db import models
from mptt.models import MPTTModel, TreeForeignKey
class EmployeeMptt(MPTTModel):
STANDARD = 'STD'
MANAGER = 'MGR'
SR_MANAGER = 'SRMGR'
PRESIDENT = 'PRES'
EMPLOYEE_TYPES = (
(STANDARD, 'base employee'),
(MANAGER, 'manager'),
(SR_MANAGER, 'senior manager'),
(PRESIDENT, 'president'))
role = models.CharField(max_length=25, choices=EMPLOYEE_TYPES)
first_name = models.CharField(max_length=100)
last_name = models.CharField(max_length=100)
parent = TreeForeignKey('self', null=True, related_name='employee')
def __str__(self):
return "".format(self.first_name, self.last_name)
def __repr__(self):
return self.__str__()
The first new statement adds imports for the MPTTModel
and TreeForeignKey
classes from the django-mptt library. Then the EmployeeMptt
class is defined.
The EmployeeMptt
class inherits from the MPTTModel
which adds the class fields lft
, rght
, level
, and tree_id
to the subclass (EmployeeMptt
). The fields work as follows:
lft
: an integer field as described in the previous sectionrght
: an integer field as described in the previous sectionlevel
: an integer field what indicates the level of hierarchy for each instancetree_id
: an integer field similar to the prior article'sEmployee
class field manager_id
However, a more useful feature resulting from inheriting from MPTTModel
are the methods that come along with it, which abstracts the implementation of the aforementioned fields and provide the preferred functionalities for working with the tree structure.
- get_ancestors(ascending=False, include_self=False)
- get_children()
- get_descendants(include_self=False)
- get_descendant_count()
- get_family()
- get_next_sibling()
- get_previous_sibling()
- get_root()
- get_siblings(include_self=False)
- insert_at(target, position='first-child', save=False)
- is_child_node()
- is_leaf_node()
- is_root_node()
- move_to(target, position='first-child')
The TreeForeignKey
field behaves essentially the same as the regular django.db.models.ForeignKey
but it also displays the options of a tree's hierarchy with nesting in Django forms.
Now that we've written the code to define the EmployeeMptt
, let us translate the model code into database tables according to the MPTT structure. In your terminal make and run a migration for the EmployeeMptt
class:
(venv) $ python manage.py makemigrations
Migrations for 'hrmgmt':
hrmgmt/migrations/0002_employeemptt.py
- Create model EmployeeMptt
Inspect the DDL SQL that will be issued:
(venv) $ python manage.py sqlmigrate hrmgmt 0002
BEGIN;
--
-- Create model EmployeeMptt
--
CREATE TABLE "hrmgmt_employeemptt" ("id" integer NOT NULL PRIMARY KEY AUTOINCREMENT, "role" varchar(25) NOT NULL, "first_name" varchar(100) NOT NULL, "last_name" varchar(100) NOT NULL, "lft" integer unsigned NOT NULL, "rght" integer unsigned NOT NULL, "tree_id" integer unsigned NOT NULL, "level" integer unsigned NOT NULL, "parent_id" integer NULL REFERENCES "hrmgmt_employeemptt" ("id"));
CREATE INDEX "hrmgmt_employeemptt_lft_c82902c3" ON "hrmgmt_employeemptt" ("lft");
CREATE INDEX "hrmgmt_employeemptt_rght_c6110254" ON "hrmgmt_employeemptt" ("rght");
CREATE INDEX "hrmgmt_employeemptt_tree_id_7abd1eb2" ON "hrmgmt_employeemptt" ("tree_id");
CREATE INDEX "hrmgmt_employeemptt_level_687f7b49" ON "hrmgmt_employeemptt" ("level");
CREATE INDEX "hrmgmt_employeemptt_parent_id_88909826" ON "hrmgmt_employeemptt" ("parent_id");
COMMIT;
Run the migration:
(venv) $ python manage.py migrate
Operations to perform:
Apply all migrations: admin, auth, contenttypes, hrmgmt, sessions
Running migrations:
Applying hrmgmt.0002_employeemptt... OK
Now utilize the Django shell to populate the new "hrmgmt_employeemptt" table while simultaneously getting familiar with the Django-MPTT API:
(venv) $ python manage.py shell
Python 3.6.2 (default, Jul 17 2017, 16:44:45)
(InteractiveConsole)
>>> from hrmgmt.models import EmployeeMptt
>>> jane_doe = EmployeeMptt.objects.create(first_name='Jane', last_name='Doe', role=EmployeeMptt.PRESIDENT)
>>> john_doe = EmployeeMptt.objects.create(first_name='John', last_name='Doe', role=EmployeeMptt.MANAGER, parent=jane_doe)
>>> joe_schmo = EmployeeMptt.objects.create(first_name='Joe', last_name='Schmo', role=EmployeeMptt.STANDARD, parent=john_doe)
>>> john_brown = EmployeeMptt.objects.create(first_name='John', last_name='Brown', role=EmployeeMptt.STANDARD, parent=john_doe)
>>> adam_smith = EmployeeMptt.objects.create(first_name='Adam', last_name='Smith', role=EmployeeMptt.MANAGER, parent=jane_doe)
>>> milt_friedman = EmployeeMptt.objects.create(first_name='Milt', last_name='Friedman', role=EmployeeMptt.STANDARD, parent=adam_smith)
Not too complicated, right? So far the only thing that is relevant to the Django-MPTT API is the use of the parent
field. This is necessary for the Django-MPTT library to annotate the records with the appropriate lft, rght, tree_id, and level fields which leads to a table named "hrmgmt_employeemptt", populated as follows.
htmgmt_employeemptt
id | first_name | last_name | role | lft | rght | tree_id | level | parent_id |
---|---|---|---|---|---|---|---|---|
1 | Jane | Doe | PRES | 1 | 12 | 1 | 0 | NULL |
2 | John | Doe | MGR | 2 | 7 | 1 | 1 | 1 |
3 | Joe | Schmo | STD | 3 | 4 | 1 | 2 | 2 |
4 | John | Brown | STD | 5 | 6 | 1 | 2 | 2 |
5 | Adam | Smith | MGR | 8 | 11 | 1 | 1 | 1 |
6 | Milt | Friedman | STD | 9 | 10 | 1 | 2 | 5 |
Now let us gain some appreciation for this fine library by playing with the great utility methods that Django-MPTT has to offer.
Say we want to get a listing of the employees that directly report to President Jane Doe (ie, John Doe and Adam Smith), the root node of the MPTT tree.
>>> jane_doe.get_children()
, ]>
Ok, so far not too special, right? This basically got us the same result as our earlier jane_doe.employee.all()
and we already established that this has basically the same performance as the adjacent list implementation. However, say I wanted to get all employees lower in the company, relative to Jane Doe:
>>> jane_doe.get_descendants()
, , , , ]>
Well that was pretty slick, being that we got all that in one trip to the database.
Something else that might be interesting would be to see all employees on the same level as another, say John Brown:
>>> john_brown.get_siblings()
]>
Now we will take a look at something a little more interesting. Let us see if we can list the employees that are above John Brown, so we are basically walking up the management hierarchy, which I already described before as something that is both expensive (in terms of trips to the database) but also would inevitably require some sort of looping construct.
>>> john_brown.get_ancestors()
, ]>
Pretty simple, right? And again, only one trip to the database.
The other utility methods provided by Django-MPTT are pretty straightforward with intuitive names. I invite you to further investigate the other utility methods in the official documentation.
Tradeoffs Between Adjacent List and MPTT
As is the case with many tasks software developers face, we often need to make important decisions with regards to implementation strategy. In the first article on recursive relationships with Django I demonstrated a method of implementation known as the "adjacent list". While in this follow-up article I presented another implementation method, known as "Modified Preorder Tree Traversal (MPTT)". Both satisfy the basic requirements for our use case. So, when you are faced with a programming task that is inherently recursive, as in the use case being demonstrated here, which should you choose?
The adjacent list method is relatively straightforward to reason about and interact with from a coding-with-Django perspective, as well as using raw SQL and procedural programming. However, looking critically at the level of the database (regular SELECT
queries) this tends to be a bit repetitive and expensive with many trips to the database.
On the other hand, MPTT is a bit more of an elaborate implementation in its theoretical perspective, but thanks to Django-MPTT we have a nice layer of abstraction to free us from the need to think in terms of tree data structures. We clearly have seen that retrieving data from a database table implementing the MPTT structure is significantly more performant than the adjacent list method.
However, there is one major gotcha to be aware of and consider before you go on implementing MPTT in all your Django apps:
MPTT is best suited for use cases where you have relatively static hierarchical data that is frequently accessed via SELECT
statements.
Updating the entries in a MPTT structured table is expensive because you have to change the left and right values of nearly every entry, but it is also a rather complex process. Luckily Django-MPTT comes with some nice methods that take care of the complexity, but this does not alleviate the issue of having to update nearly every entry's left, right, and level values.
To summarize, I suggest implementing the adjacent list in cases where you expect the data to be updated semi-frequently or more and pulling out Django-MPTT when data is expected to stay fairly static so you can enjoy the great retrieval performance boosts.
I hope you enjoyed the article and as always, please feel free to comment or critique where necessary.