File System

The term for a low-level programming model for persistent data storage.

In traditional operating systems, persistence of data has to be explicitly managed. Persistent storage is made of physical partitions, each of which is organized as a filesystem, out of which programs can allocate files, that are raw sequences of bytes that can be read of modified.

Typically, files are named by small character strings with many limitations.

In a hierarchical filesystem, as most every filesystem is, files are grouped into directories, that can be named like files, and recursively contain other subdirectories and subfiles. These directories are naturally structured as a finite pointed tree, that can be isomorphically thought as a finite partial order whose bounded intervals are all total orders.

In such filesystems, files and directories are named by the "path" of successive names of directories to go through to reach the said file or directory, from the root of the tree (or any other "current" directory). Those paths are themselves encoded as character strings, where elementary names are concatenated, with as a separator a special character that is otherwise forbidden in file names (typically "/" on Unix systems, "\" on bad subclones, and ":" or "." in a few other systems).

Because a raw sequence of byte is so poor an abstraction, and because even the simplest paranoid protection system in a multiprogramming environment requires some way of tracking "ownership" and "rights" to access files and directories, filesystems generally associate various "attributes" to files. While most (primitive) file systems have a finite space of possible attributes, as described by a few fields (numeric owner identifier, bitfields for access rights, etc), some allow for longer attributes (access control lists or whatever).

Now, because filesystems have proven so difficult to use, many tricks were invented to make things better:

To overcome the limitations of paths, and as a way to both allow quicker access, and to provide fixes to file and directory motion, good filesystems allow for special "alias" or "symbolic links" to exist, that actually point to a path in another directory. "symbolic links" (or "symlinks") are associated to a path, so that the system tries to follow the path when the file/directory is accessed. Symlinks allow for a standardized directory hierarchy, instead of having to adapt programs to the physical organization of disks, that changes so much from site to site, and so fast on a same site.
Some filesystems also allow for "hard links", so that many paths on the same filesystem are associated to the same file contents; unlike symlinks, hardlinks may only point to paths on same filesystems; however, hardlinks do not have to be cared for, whereas symlinks may be dangling if the target path becomes invalidated by a file move or removal.
Some "advanced" filesystems even allow for arbitrary "resources" to be associated with files, to describe their contents.
A few filesystems, but none available on any common operating system, have a builtin versioning system to track divergent or erroneous modifications.
A few are resilient to power shutdowns, and ensure the filesystem on disk will always be in a consistent state; but most are subject to crashes and severe data loss in the case the computer happens not to have been shutdown "properly". This is all an implementation problem rather than general problem of the filesystem design. But the same reasons of compatibility with low-level standards that caused the promotion of the filesystem design means that once a filesystem is first implemented, further implementations have to be compatible at a low level; and since hard resilience to ungraceful power failures is not generally considered essential at first, the filesystem layout on disk is not designed with it in mind, and it becomes too difficult to ensure resilience afterwards, since it might completely slow-down most operations on files.

So after all, here is what we can reproach to filesystem design:

Its basic abstraction of a persistent object, a file, is a very low-level abstraction, that isn't fit to describe even the simplest data types, not to talk about recursive types and higher-order types. Every program that uses files (and that means every program, in traditional OSes that only allow for filesystem interfaces to persistent data) must then explicitly encode whatever high-level objects it wants to represent into raw sequences of bytes.
To every file must hence be associated a particular encoding, but there is no infallible way to determine which it is precisely. Resources sometimes associated with files are meant to help determine the encoding, But then appears the problem of giving meaning to the resources themselves; most such systems allow for new resource handlers to be installed, but they cannot guarantee existence, availability, and compatibility of a similar resource handler on other hosts of same system (not to talk about different systems) when exchanging data. Hence, any meaning associated by such resources is purely advisory, as is any user-maintained equivalent thereof (a suffix is often added at the end of filenames to informally designate the encoding of its contents). The fact that resources can be modified or damaged as easily as files themselves only makes the problem worse.
All in all, the users of filesystems are facing an essential problem: that they must manually manage conventions of encoding of high-level objects into low-level byte sequences. There is no possible system-enforced consistency of file encoding.
Because even a few encodings is too much for users to manage, who are completely uninterested in such details after all, the encodings must be as few as possible, and cannot be easily changed for betterment, since any incompatibility would bring havoc or require careful manual update of all programs from the user. Hence, the filesystem design is also a great practical brake to good and efficient data encoding. The fact that most filesystem implementations can't be trusted, the fact that encoders, decoders and specifications must be mostly hand-written, encourage the use of "simple" (hence inefficient) encodings.
Also, the lack of any system-wide encoding standard lack of structure editor => human readability => inefficiency commercial systems choose inexpressivity instead.
Files and directories are a coarse-grained abstraction; because to every file is associated lots of "metadata", one can't afford to put every small individual object in its own file. This encourages grouping data in common files, makes data sharing difficult (since accessing one object requires being aware of all other objects in same file). This induces lots of similar artificial barriers in the programming model, so that programs in general have lots of trouble exchanging even the simplest data.

The low-level programming model of filesystems is deeply linked to the low-level programming model of enclosing Operating Systems in general. Again, there are lots of historical reasons why filesystems exist; but no good technical reason. Operating systems with much better paradigms have existed and run successfully for just decades. See orthogonal persistence for a better paradigm.

This page is linked from: Distributed Microkernel Debate Persistence Plan9