Tuesday, August 31, 2004

The File System as a Database and the Last Post of Summer

software :: python


Heh, that title reads like the title of a Victorian-era
pseudo-science journal article. However, I really do feel that August
31st is the last day of Summer... that come September first, Everything
Changes. But not in a bad way -- Fall is amazing. It's exhilarating in
a funny, eerie kind of way.



So, this is the last blog entry of the Summer, 'nough said.



As for the rest of the title, I've been discussing this topic with
friends of late. Basically, how can we access the file system in
Python, treat it like a database, search on it like a database, and
write code for it in such a way that when we are ready, we can
migrate to a database with no code changes (only config or module
import changes)?



Well, we've debated the issue(s) back and forth. I even asked
Phillip Eby about it on the PEAK mail list (we will be using PEAK for
this project... once we learn it!). But I think we've all been trying
to make the problem and solution too general. A very simple and rigid
API would work for us right now. It means less time spent on R&D,
and since this is for a paying customer, that means more money for us
in the long run.... as long as what we implement leaves enough
flexibility for future change.



So, without further ado, an adaptation from a post to the mail list today:



If we have a file on the filesystem, then the full path + the file
name uniquely identifies that file. In my limited knowledge of OODBs,
this is pretty standard (path-to-object = UID). Then there's the file
itself, which contains some data. But there is also the path itself:
it contains data that is just as important as the data inside the file.



UID: full path + filename

Data: stored in file at /fullpath/filename

Data: stored in path and filename
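
To make those three pieces concrete, here is a minimal sketch that reads one file as a "record". The function name `read_record` and the customer/host/metric layout are made up for illustration; nothing here is PEAK API.

```python
from pathlib import Path
import tempfile


def read_record(root, path):
    """Return one file as a record: UID, path-derived data, file contents."""
    path = Path(path)
    rel = path.relative_to(root)
    return {
        "uid": str(path),             # full path + filename
        "path_data": rel.parts[:-1],  # data stored in the directory names
        "name_data": rel.name,        # data stored in the filename
        "blob": path.read_bytes(),    # data stored in the file itself
    }


# Hypothetical layout: <root>/<customer>/<host>/<metric file>
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    sample = root / "acme" / "web01" / "cpu.txt"
    sample.parent.mkdir(parents=True)
    sample.write_bytes(b"load=0.42\n")
    record = read_record(root, sample)
```

The point is only that the path and the contents travel together: throw away either one and the record is incomplete.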



How do we think about this problem? If this were a table, we might be looking at a schema like this:



Table

-----

id: full_path + filename

blob: rrd file/text file/ini file/xml file with DTD/whatever

additional field 1:

additional field 2:

additional field 3:

...

additional field n:
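
For comparison, here is roughly what that schema could look like as a real table, sketched with the standard library's `sqlite3`. The table name `records`, the column names, and the idea that each additional field holds one directory level are all assumptions on my part.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE records (
        id      TEXT PRIMARY KEY,  -- full_path + filename
        payload BLOB,              -- the blob: file contents, any format
        field_1 TEXT,              -- additional fields; assumed here to be
        field_2 TEXT,              -- one per directory level
        field_3 TEXT
    )
    """
)
# Hypothetical row for a file at /data/acme/web01/cpu.txt
conn.execute(
    "INSERT INTO records VALUES (?, ?, ?, ?, ?)",
    ("/data/acme/web01/cpu.txt", b"load=0.42\n", "acme", "web01", "cpu.txt"),
)
row = conn.execute(
    "SELECT id, field_1 FROM records WHERE field_2 = ?", ("web01",)
).fetchone()
```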



I'm not proposing an OR mapping here: that's complicated shit. Way
beyond me. Some of the biggest brains in the software development world
are working on ways to do that which make sense and work right.



I'm just talking about doing something simple and straightforward.
Something that's easy to configure and easy to migrate from a
filesystem to an RDBMS.
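
As a rough sketch of what "simple and rigid" could mean, here are two interchangeable backends behind the same two-method API. `FileStore`, `SqlStore`, and the `put`/`get` names are hypothetical, and `sqlite3` stands in for whatever RDBMS we'd end up on.

```python
import sqlite3
import tempfile
from pathlib import Path


class FileStore:
    """Records live as files; the key is the path relative to root."""

    def __init__(self, root):
        self.root = Path(root)

    def put(self, key, data):
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key):
        try:
            return (self.root / key).read_bytes()
        except FileNotFoundError:
            raise KeyError(key) from None


class SqlStore:
    """Same API, but records live as rows keyed by the same string."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS records "
            "(id TEXT PRIMARY KEY, payload BLOB)"
        )

    def put(self, key, data):
        self.conn.execute(
            "INSERT OR REPLACE INTO records VALUES (?, ?)", (key, data)
        )

    def get(self, key):
        row = self.conn.execute(
            "SELECT payload FROM records WHERE id = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]


def round_trip(store):
    """Calling code only sees put/get, so it works with either backend."""
    store.put("acme/web01/cpu.txt", b"load=0.42\n")
    return store.get("acme/web01/cpu.txt")


with tempfile.TemporaryDirectory() as tmp:
    from_files = round_trip(FileStore(tmp))
from_sql = round_trip(SqlStore(sqlite3.connect(":memory:")))
```

Since `round_trip` never mentions files or SQL, swapping backends is a one-line change at the place where the store gets constructed, which is about as close to "only config or module import changes" as I know how to get.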



This shouldn't be as hard as I was thinking originally. The only
issue is that for every implementation, there will need to be a
configuration. This is because, by their nature, database tables and
fields defined therein are fixed; directory structures aren't/don't
have to be. A configuration would "lock" a directory structure... you'd
have to have an API (or something) that defined what each level of the
directory structure indicated, as these would have to be mappable to
fields defined in a table (for migration to a SQL framework).
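
A minimal sketch of such a "locking" configuration, assuming one field name per directory level. The `LEVELS` tuple and the `path_to_fields` helper are hypothetical names, and the customer/host/metric layout is just an example.

```python
from pathlib import PurePosixPath

# Hypothetical configuration: one field name per level below the root.
LEVELS = ("customer", "host", "metric")


def path_to_fields(root, path, levels=LEVELS):
    """Map each path component below root to the field named in the config."""
    parts = PurePosixPath(path).relative_to(root).parts
    if len(parts) != len(levels):
        # The config "locks" the depth: anything else is not a valid row.
        raise ValueError(
            f"{path} has {len(parts)} levels, expected {len(levels)}"
        )
    return dict(zip(levels, parts))


fields = path_to_fields("/data", "/data/acme/web01/cpu.txt")
```

A path with the wrong depth raises `ValueError`, which is exactly the rigidity a table would impose anyway.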



Additionally, if you wanted to move the data stored in your blob
field out of its own little format into SQL, you'd have to define an
additional config/API for mapping its data to more fields in the
table...
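
Here is one way that second mapping might look, assuming the blob happens to be an INI file and using the standard library's `configparser`. The column names, sections, and options in `BLOB_FIELDS` are invented for the example.

```python
import configparser

# Hypothetical mapping: table column -> (section, option) inside an INI blob.
BLOB_FIELDS = {
    "load_avg": ("cpu", "load"),
    "mem_free": ("memory", "free"),
}


def blob_to_fields(text, mapping=BLOB_FIELDS):
    """Pull the configured options out of an INI blob as column values."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        column: parser.get(section, option)
        for column, (section, option) in mapping.items()
    }


sample = "[cpu]\nload = 0.42\n\n[memory]\nfree = 2048\n"
columns = blob_to_fields(sample)
```

Values come back as strings; converting them to proper column types would be one more thing the config has to describe.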



So what you'd really have here is a directory structure schema and
then a file storage schema. Using the two together, you'd get what I
originally asked Phillip Eby about...



PEAK already has 'peak.storage.files' which lets you interact with
text files transactionally. We could do something like this for other
types of data-containing files at the end of whatever directory tree.
The combination of this with an implementation of a queryable filesystem
data interface should leave us with a fairly powerful tool for many of
our projects.
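
To give a feel for the "queryable" part, here is a small sketch of a select-style query over a directory tree, written with plain `pathlib` rather than anything from PEAK. The `select` function, the `LEVELS` config, and the sample layout are all hypothetical.

```python
import tempfile
from pathlib import Path

# Hypothetical configuration: one field name per level below the root.
LEVELS = ("customer", "host", "metric")


def select(root, levels=LEVELS, **where):
    """Yield (row, path) for every file whose path fields match `where`."""
    root = Path(root)
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        parts = path.relative_to(root).parts
        if len(parts) != len(levels):
            continue  # wrong depth, so not a complete row under this config
        row = dict(zip(levels, parts))
        if all(row.get(key) == value for key, value in where.items()):
            yield row, path


with tempfile.TemporaryDirectory() as tmp:
    for rel in (
        "acme/web01/cpu.txt",
        "acme/web02/cpu.txt",
        "initech/db01/cpu.txt",
    ):
        target = Path(tmp, rel)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(b"load=0.42\n")
    hosts = [row["host"] for row, _ in select(tmp, customer="acme")]
```

This walks the whole tree on every query, so it's only a starting point; the interesting part is that the calling code looks the same as it would against a table.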



Questions to ask:

* What constitutes a database? (root dir and below?)

* What constitutes a table? (all directories at the first level, inside the root dir?)

* What constitutes a row? (every branch from root? this means all paths
from the root dir have to have the same number of dirs, subdirs, etc.)

* Can there be no file at the end of a path?

* Can there be empty files at the end of a path?

* Can there be multiple files at the end of a path?

* Can there be files in intermediate directories? (dirs that aren't the end of a path; good place for metadata?)


