2005-06-01 – A new file system idea for modern life

I’ve been thinking that the major fault with a lot of the systems around at the moment is that they were designed for “toy”-scale problems. Let’s face it, we all have more electronic storage now than the entire world had 30 years ago.

DOS had 8.3 names to save space, and years were stored as two digits. Folders made sense when there were only 20 of them on a hard disk. Now I download more than a floppy’s worth of stuff in five minutes, and I have hundreds of CDs, DVDs and (still) floppies. I can see at least 8 different storage devices from here, most of them not even in use. Who the hell wants a 200 MB HDD, or even a 0.5 GB one?

I have terabytes of data stored about the place; hence things like PGP to protect it from prying eyes, and things like Google Desktop to help find the needle in the haystack. The trouble is that the file I’m saving might belong under “Work” or “Pleasure”, or under the client’s name, or under the file type, or the language it is written in. And, of course, although you might have your file system worked out, things still get lost, or they wind up in a new folder called “Bob/work/cobol” instead of “work/cobol/Bob”, and so on.

Tools like Picasa are letting us tag images with keywords and store them in a database, so that we can find them by date or by keyword, and this is a big step up from the basic file/folder/volume system. I propose we take it a step further.

File systems like ext3 and NTFS are already based on a database-like table that tells the computer where the actual data is stored. EFS encrypts on the fly to keep things fairly safe. However, none of them really help with finding things.

I propose a new database-like file system. Each blob of data would be held on the disk at some location. The database table then records which part of the disk holds what: blobs 1 to 132 are parts of the file “What I did on my holidays”, blobs 133 to 145 are a plain-text shopping list, and so on. Each blob also carries meta-data and, in the case of text, is indexed. There are also time stamps for creation and last modification.
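To make that concrete, here is a minimal sketch of the index as a SQLite schema. It is purely illustrative: the table and column names are my own invention, and a real file system would manage raw disk blocks itself rather than live in a .db file.

```python
# A toy sketch of the index described above, using SQLite for illustration only.
import sqlite3

con = sqlite3.connect("fsindex.db")
con.executescript("""
CREATE TABLE IF NOT EXISTS files (
    file_id  INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,          -- e.g. 'What I did on my holidays'
    created  TEXT NOT NULL,          -- ISO-8601 time stamps
    modified TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS blobs (
    blob_id  INTEGER PRIMARY KEY,    -- blobs 1-132 belong to one file, 133-145 to another
    file_id  INTEGER NOT NULL REFERENCES files(file_id),
    seq      INTEGER NOT NULL,       -- position of the blob within its file
    device   TEXT    NOT NULL,       -- which physical disk holds it
    location INTEGER NOT NULL,       -- offset on that device
    length   INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS tags (    -- arbitrary key/value meta-data per file
    file_id  INTEGER NOT NULL REFERENCES files(file_id),
    key      TEXT NOT NULL,          -- 'author', 'keyword', 'path', 'source-url', ...
    value    TEXT NOT NULL
);
-- Full-text index over any extracted text; the rowid doubles as the file_id.
CREATE VIRTUAL TABLE IF NOT EXISTS file_text USING fts5(body);
""")
con.commit()
```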

Searches on this database system would take but moments, as the master index would be consulted, and it would always be correct. Text searches would be instant, and searches on meta-data, such as author, creation time, keywords tagged to the file, even the “path”, would also run quite rapidly.
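Keeping the index “always correct” just means the rows, tags and text all go in as part of the same save, and a meta-data search is then an ordinary join. Continuing the made-up SQLite schema from above:

```python
import datetime

def record_file(con, name, text, tags):
    """Register a new file in the toy index: one row in files, one full-text row,
    and one tags row per (key, value) pair -- all in a single transaction."""
    now = datetime.datetime.now().isoformat(timespec="seconds")
    with con:  # commits on success, rolls back on error, so the index stays consistent
        cur = con.execute("INSERT INTO files (name, created, modified) VALUES (?, ?, ?)",
                          (name, now, now))
        file_id = cur.lastrowid
        con.execute("INSERT INTO file_text (rowid, body) VALUES (?, ?)", (file_id, text))
        con.executemany("INSERT INTO tags (file_id, key, value) VALUES (?, ?, ?)",
                        [(file_id, k, v) for k, v in tags])
    return file_id

# A meta-data search: everything tagged with the keyword 'filesystems'.
hits = con.execute("""
    SELECT f.name, f.created FROM files f
    JOIN tags t ON t.file_id = f.file_id
    WHERE t.key = 'keyword' AND t.value = 'filesystems'
""").fetchall()
```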

Imagine being able to follow the “path” year/2005/05/31/11PM and see the files that were created or accessed at that time, or following work/ideas/computers/filesystems and seeing those, instantly. You could group things in more than one way, too, via the meta-data, so a picture of your dog with your friend could be grouped under both, and be there when you looked under either path! And, of course, this allows queries! All files created since June that are spreadsheets containing the words “home accounts” would come back as fast as any other database query, and far faster than any non-indexed search.
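In the sketch above, both lookups are plain queries: the “path” is just another tag, and the spreadsheet example combines a date filter, a tag filter and the full-text index (again, all the names are the invented ones from the schema above).

```python
# Follow a "path": every file tagged with path = 'work/ideas/computers/filesystems'.
by_path = con.execute("""
    SELECT f.name FROM files f
    JOIN tags t ON t.file_id = f.file_id
    WHERE t.key = 'path' AND t.value = ?
""", ("work/ideas/computers/filesystems",)).fetchall()

# All spreadsheets created since June that contain the phrase "home accounts".
spreadsheets = con.execute("""
    SELECT f.name, f.created FROM files f
    JOIN tags t    ON t.file_id = f.file_id
    JOIN file_text ON file_text.rowid = f.file_id
    WHERE t.key = 'type' AND t.value = 'spreadsheet'
      AND f.created >= '2005-06-01'
      AND file_text MATCH '"home accounts"'
    ORDER BY f.created
""").fetchall()
```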

A small prompt would appear at creation or save time, asking for a few keywords, passwords, etc., in much the same way as OOo does, to build meta-data for things like photos, images and music. Much of this could be done via drop-downs and context. One tag could be an auto-magic entry that stores the URL you are saving the picture from, another could be the link that sent you to that picture, another could be the actual file name. You get the idea. Then you could search for “slashdot pictures” and see images that you saved from websites linked to from Slashdot, with the usual criteria available to tweak the search rankings.
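A sketch of the auto-magic part, reusing the made-up record_file helper from earlier: the save hook collects the tags it can work out from context, and only asks the user for the rest.

```python
import os

def save_with_context(con, local_path, text="", source_url=None, referrer=None, keywords=()):
    """Save-time hook: build meta-data from context automatically, then prompt
    only for the keywords the machine cannot guess. Tag names are invented."""
    tags = [("filename", os.path.basename(local_path))]   # the actual file name
    if source_url:
        tags.append(("source-url", source_url))           # where it was downloaded from
    if referrer:
        tags.append(("referrer", referrer))               # the page that linked to it
    tags += [("keyword", k) for k in keywords]            # the human-supplied part
    return record_file(con, os.path.basename(local_path), text, tags)

# e.g. a picture saved from a page that Slashdot linked to:
# save_with_context(con, "dog.jpg", source_url="http://example.com/dog.jpg",
#                   referrer="http://slashdot.org/", keywords=["dog", "holiday"])
```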

Now, a real advantage of this system is that there is no reason at all to cluster the blobs together in a logical order! By accepting a slight reduction in access and write times, we can ensure that logically consecutive blobs are not located contiguously on the physical device! Hell, they don’t even need to be on the same disk. Combined with the encryption key(s) needed to read the index and the data, even a low-level scan of the disk surface couldn’t tell the size of a file! A neat trick would be to put sections of a file on a removable disk, and tell the system to fail if all parts of the file could not be found. This would allow a hardware key to be made for certain files: either part on its own would be useless if stolen.
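Here is a rough sketch of the scattering, with plain directories standing in for physical devices (the function names are invented): each blob lands on a randomly chosen device, the placement is recorded for the index, and a read that cannot find every blob fails outright – which is exactly what makes the removable-disk “hardware key” work.

```python
import os, random

BLOB_SIZE = 4096  # fixed blob size for the sketch

def store_scattered(data: bytes, devices: list[str]) -> list[tuple[str, str]]:
    """Split data into blobs and write each to a randomly chosen 'device'
    (a directory standing in for a disk). Returns the placement list that
    would live in the index: (device, blob name) per blob, in file order."""
    placement = []
    for offset in range(0, len(data), BLOB_SIZE):
        device = random.choice(devices)            # no attempt at contiguity
        name = f"blob-{offset // BLOB_SIZE}-{random.getrandbits(32):08x}"
        with open(os.path.join(device, name), "wb") as f:
            f.write(data[offset:offset + BLOB_SIZE])
        placement.append((device, name))
    return placement

def read_scattered(placement: list[tuple[str, str]]) -> bytes:
    """Reassemble the file; fail hard if any blob (say, one on a removable
    disk that isn't plugged in) cannot be found."""
    parts = []
    for device, name in placement:
        path = os.path.join(device, name)
        if not os.path.exists(path):
            raise FileNotFoundError(f"missing blob {name} on {device}; file unreadable")
        with open(path, "rb") as f:
            parts.append(f.read())
    return b"".join(parts)
```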

Because each blob would be quite small, its encryption could use a rather smaller and faster key than normal, as it would, in effect, be a one-time pad, and hence impossible to brute-force. Your index would, of course, require far higher key-length standards! Again, however, part of the index could be stored physically apart from the rest – 1 MB of index “missing” would effectively prevent any brute-force attack, as it would be distributed carefully through the entire index, and not just one block.
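Taken literally, the per-blob scheme is a one-time pad: each blob is XORed with a fresh random pad of its own length, the ciphertext goes to disk, and the pad lives only in the (strongly encrypted) index. A toy version:

```python
import os

def pad_encrypt(blob: bytes) -> tuple[bytes, bytes]:
    """One-time pad for a single blob: XOR with a fresh random pad of equal length.
    The ciphertext goes to the disk; the pad is stored only in the encrypted index."""
    pad = os.urandom(len(blob))
    ciphertext = bytes(b ^ p for b, p in zip(blob, pad))
    return ciphertext, pad

def pad_decrypt(ciphertext: bytes, pad: bytes) -> bytes:
    """Recover the blob; without the pad the ciphertext is indistinguishable from noise."""
    return bytes(c ^ p for c, p in zip(ciphertext, pad))
```

The price of reading it this literally is that the index has to carry a pad as large as the data itself, which is part of why it grows as big as the next paragraph admits.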

Yes, this index would become a huge thing, but that matters not, because it is still only a list of numbers stored in a database table. Without the physical disk, the index alone would yield only garbage, and, even if unlocked, it would reveal only the keys for the disk and the file names. Even the file names could, of course, be hidden further, though that might be going too far. The final, perhaps crowning, glory of this system is that because it is encrypted, it can be given several access passwords. One password allows all files to be seen, but another “duress” password shows only a sub-set of “safe” files. Nothing more can ever be found without brute-forcing the (possibly non-existent) key, and so laws that force you to hand over your key become worthless. 🙂
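As a minimal sketch of the duress idea (seal/unlock and the key-derivation choices here are mine, and a real design would use proper authenticated encryption rather than this toy construction): each view of the index is sealed under its own password, whichever password is typed opens at most one view, and a sealed view that does not verify is indistinguishable from random data, so nothing betrays the existence of the other.

```python
import hashlib, hmac, os
from typing import Optional

def _keystream(key: bytes, n: int) -> bytes:
    """SHA-256 in counter mode -- a simple stream cipher, good enough for a sketch."""
    out, ctr = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return out[:n]

def seal(index_bytes: bytes, password: str, salt: bytes) -> bytes:
    """Encrypt one view of the index under one password (toy encrypt-then-MAC)."""
    key = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
    body = bytes(a ^ b for a, b in zip(index_bytes, _keystream(key, len(index_bytes))))
    tag = hmac.new(key, body, "sha256").digest()
    return salt + tag + body

def unlock(sealed: bytes, password: str) -> Optional[bytes]:
    """Return the view this password opens, or None -- revealing nothing either way."""
    salt, tag, body = sealed[:16], sealed[16:48], sealed[48:]
    key = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
    if not hmac.compare_digest(tag, hmac.new(key, body, "sha256").digest()):
        return None
    return bytes(a ^ b for a, b in zip(body, _keystream(key, len(body))))

# Two sealed views of the index: the full one and the "safe" duress subset.
# full_view = seal(full_index_bytes, "real passphrase", os.urandom(16))
# safe_view = seal(safe_index_bytes, "duress passphrase", os.urandom(16))
# At login, try the typed password against each and hand back whichever view it opens.
```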
