How to Store Directory Structures in a Database?
Introduction
If we want to store directory structures in a database, which type of database should be used? — Either NoSQL or RDBMS. If we choose NoSQL, we should select one that supports tree structures; if we choose RDBMS, the focus is on table design and query operations.
The key question is: What operations do we need to perform on the directory structure stored in the database? Obviously, these operations essentially mimic file operations in a file system.
- Operation 1: CRUD operations on single files/directories, such as copying, deleting, or moving a file/directory;
- Operation 2: Recursively traversing a directory;
Considering these two points is crucial for designing an efficient solution. This article presents solutions for both NoSQL and RDBMS, with a focus on the latter. Additionally, it explores using JSON as a solution.
NoSQL
Options include:
- Using a file system like HDFS as the database;
- Using graph databases like Neo4j;
There’s no need to introduce new components just to store the directory structure.
RDBMS
The following Item
tables represent directory items, i.e., nodes in the directory tree.
Tree Representation
- Storing the parent node pointer:
1 | CREATE TABLE Item ( |
- Storing a list of child node pointers (handling the
children
field is also tricky):
1 | CREATE TABLE Item ( |
Both methods essentially represent adjacency lists of a directed graph. For both operations 1 and 2, multiple table scans are required, making it highly inefficient.
Using File Paths as Indexes
Since a file path can uniquely locate a file, it can also serve as a unique index.
This method is called Materialized Paths.
1 | CREATE TABLE Item ( |
In practice, there’s no need to make path
a unique index because: (1) Unique indexes are inefficient; (2) path
is too long to be a primary key. Therefore, path
can be a regular index.
For operation 1:
- Adding, deleting, or querying a file triggers the index, with a time complexity of $O(\log n)$:
1 | SELECT * FROM Item WHERE path = 'path'; |
- Adding or querying a directory has the same time complexity as file operations, $O(\log n)$.
- Renaming a directory has a time complexity of $O(n)$ because it requires updating the paths of all subdirectory items.
- Deleting a directory also has a time complexity of $O(n)$ because all subdirectory items must be deleted.
For operation 2: Only one index table traversal is needed, and due to the leftmost prefix principle, a full table scan is not required, making it very efficient.
In summary, using file paths as indexes is highly efficient.
Nested Sets
This data structure is suitable for static data — designed for efficient querying, not for modification.
1 | CREATE TABLE Item ( |
Nested Sets have the following characteristics:
-
Each node has
lft
andrgt
attributes, recording the left and right endpoints of the interval. For example, the interval for the node with id 1 is(1, 22)
. -
The intervals of all child nodes of a node are sub-intervals of its interval. Thus, an interval corresponds to a subtree. For example, the intervals of all child nodes of the node with id 1 must be sub-intervals of
(1, 22)
. -
The intervals of nodes at the same level do not overlap. For example, the intervals at the second level are
(2, 9)
,(10, 15)
,(16, 21)
.
The advantage of this design is that operation 2 is efficient. Consider recursively traversing the subtree of the node with id 1:
1 | SELECT * FROM Item WHERE lft > 1 AND rgt < 22; |
For static structures, the lft
and rgt
of all nodes are pre-calculated and directly inserted into the database in batches. However, performing operation 1, i.e., dynamically modifying the lft
and rgt
of related nodes, is very costly.
JSON
A directory is a structured object and is recursively defined. JSON is similar in this regard. Therefore, a directory can be represented in JSON.
JSON is essentially a string, so it can be stored in any database or simply in a file.
In this case, the operations on the directory structure are consistent with those of file system. So, the time complexity of the operations is low. Since operations are performed in memory, performance is good.
The drawbacks of this approach are that when the directory tree is large, (1) the conversion time between strings and objects increases; (2) memory usage increases; (3) network overhead between the client and server, and between the server and database, becomes excessive.
The third issue is the main drawback of the JSON solution in a C/S architecture, making it inferior to the RDBMS solution:
- If the directory tree is small, both RDBMS and JSON can be used.
- If the directory tree is large, with most selects and few updates, RDBMS is better.
- If the directory tree is large and updates are frequent, although JSON is faster for in-memory operations, frequent network communication and large network packets lead to high bandwidth usage, potentially making the overall performance worse than RDBMS.
However, for a local database of a program, JSON is undoubtedly the best choice.
Conclusion
The issue of how to store tree-structured data (such as a file directory) in an RDBMS was actually discussed and concluded thoroughly between 2005 and 2010. From today’s perspective, the rise of NoSQL seems to obscure the essence of this problem, leading me to detour in exploring this issue.
At the same time, a database is also a component. Just as we wouldn’t introduce a new NoSQL database solely to store a directory structure, if we only need to store a directory structure, we don’t even need to introduce a database and can directly use a JSON file.
References
- How to store directory / hierarchy / tree structure in the database?
- Hierarchical data in RDBMSs
- Trees in SQL: Nested Sets and Materialized Path
- Trees in SQL (Internet Archive)
- Tree structures in ASP.NET and SQL Server - Storing Trees in SQL Server (Internet Archive)
- What are the options for storing hierarchical data in a relational database?
- One more Nested Intervals vs. Adjacency List comparison
- MongoDB Manual: Model Tree Structures
- 多级目录树(森林)的三种数据库存储结构介绍
- Using Nested Set Model to Build Hierarchical Data