How to Store Directory Structures in a Database?

Posted on 2024-07-19 In tech , DBMS Word count in article: 1.2k Reading time ≈ 4 mins.

Introduction

If we want to store directory structures in a database, which type of database should be used? — Either NoSQL or RDBMS. If we choose NoSQL, we should select one that supports tree structures; if we choose RDBMS, the focus is on table design and query operations.

The key question is: What operations do we need to perform on the directory structure stored in the database? Obviously, these operations essentially mimic file operations in a file system.

Operation 1: CRUD operations on single files/directories, such as copying, deleting, or moving a file/directory;
Operation 2: Recursively traversing a directory;

Considering these two points is crucial for designing an efficient solution. This article presents solutions for both NoSQL and RDBMS, with a focus on the latter. Additionally, it explores using JSON as a solution.

NoSQL

Options include:

Using a file system like HDFS as the database;
Using graph databases like Neo4j;

There’s no need to introduce new components just to store the directory structure.

RDBMS

The following Item tables represent directory items, i.e., nodes in the directory tree.

Tree Representation

Storing the parent node pointer:

CREATE TABLE Item (
  id INT AUTO_INCREMENT PRIMARY KEY, 
  name VARCHAR(255) NOT NULL COMMENT 'File name', 
  type TINYINT NOT NULL COMMENT '0 for file, 1 for directory', 
  parent_id INT NOT NULL DEFAULT -1 COMMENT 'Parent node id'
);

Storing a list of child node pointers (handling the children field is also tricky):

CREATE TABLE Item (
  id INT AUTO_INCREMENT PRIMARY KEY, 
  name VARCHAR(255) NOT NULL COMMENT 'File name',
  type TINYINT NOT NULL COMMENT '0 for file, 1 for directory', 
  children VARCHAR(1023) NOT NULL DEFAULT "" COMMENT 'Comma-separated list of child node ids'
);

Both methods essentially represent adjacency lists of a directed graph. For both operations 1 and 2, multiple table scans are required, making it highly inefficient.

Using File Paths as Indexes

Since a file path can uniquely locate a file, it can also serve as a unique index.

This method is called Materialized Paths.

CREATE TABLE Item (
  id INT AUTO_INCREMENT PRIMARY KEY, 
  name VARCHAR(255) NOT NULL COMMENT 'File name', 
  type TINYINT NOT NULL COMMENT '0 for file, 1 for directory', 
  path VARCHAR(1023) NOT NULL COMMENT 'File path'
);
CREATE INDEX path_idx ON Item(path);

In practice, there’s no need to make path a unique index because: (1) Unique indexes are inefficient; (2) path is too long to be a primary key. Therefore, path can be a regular index.

For operation 1:

Adding, deleting, or querying a file triggers the index, with a time complexity of $O(\log n)$:

1	SELECT * FROM Item WHERE path = 'path';

Adding or querying a directory has the same time complexity as file operations, $O(\log n)$.
Renaming a directory has a time complexity of $O(n)$ because it requires updating the paths of all subdirectory items.
Deleting a directory also has a time complexity of $O(n)$ because all subdirectory items must be deleted.

For operation 2: Only one index table traversal is needed, and due to the leftmost prefix principle, a full table scan is not required, making it very efficient.

In summary, using file paths as indexes is highly efficient.

Nested Sets

This data structure is suitable for static data — designed for efficient querying, not for modification.

CREATE TABLE Item (
  id INT AUTO_INCREMENT PRIMARY KEY, 
  name VARCHAR(255) NOT NULL COMMENT 'File name', 
  type TINYINT NOT NULL COMMENT '0 for file, 1 for directory',
  lft INT NOT NULL, 
  rgt INT NOT NULL
);

Nested Sets have the following characteristics:

Each node has lft and rgt attributes, recording the left and right endpoints of the interval. For example, the interval for the node with id 1 is (1, 22).
The intervals of all child nodes of a node are sub-intervals of its interval. Thus, an interval corresponds to a subtree. For example, the intervals of all child nodes of the node with id 1 must be sub-intervals of (1, 22).
The intervals of nodes at the same level do not overlap. For example, the intervals at the second level are (2, 9), (10, 15), (16, 21).

The advantage of this design is that operation 2 is efficient. Consider recursively traversing the subtree of the node with id 1:

1	SELECT * FROM Item WHERE lft > 1 AND rgt < 22;

For static structures, the lft and rgt of all nodes are pre-calculated and directly inserted into the database in batches. However, performing operation 1, i.e., dynamically modifying the lft and rgt of related nodes, is very costly.

JSON

A directory is a structured object and is recursively defined. JSON is similar in this regard. Therefore, a directory can be represented in JSON.

JSON is essentially a string, so it can be stored in any database or simply in a file.

In this case, the operations on the directory structure are consistent with those of file system. So, the time complexity of the operations is low. Since operations are performed in memory, performance is good.

The drawbacks of this approach are that when the directory tree is large, (1) the conversion time between strings and objects increases; (2) memory usage increases; (3) network overhead between the client and server, and between the server and database, becomes excessive.

The third issue is the main drawback of the JSON solution in a C/S architecture, making it inferior to the RDBMS solution:

If the directory tree is small, both RDBMS and JSON can be used.
If the directory tree is large, with most selects and few updates, RDBMS is better.
If the directory tree is large and updates are frequent, although JSON is faster for in-memory operations, frequent network communication and large network packets lead to high bandwidth usage, potentially making the overall performance worse than RDBMS.

However, for a local database of a program, JSON is undoubtedly the best choice.

Conclusion

The issue of how to store tree-structured data (such as a file directory) in an RDBMS was actually discussed and concluded thoroughly between 2005 and 2010. From today’s perspective, the rise of NoSQL seems to obscure the essence of this problem, leading me to detour in exploring this issue.

At the same time, a database is also a component. Just as we wouldn’t introduce a new NoSQL database solely to store a directory structure, if we only need to store a directory structure, we don’t even need to introduce a database and can directly use a JSON file.