Tuesday, July 14, 2009

So maybe databases aren't half bad afterall

So I've been working on a budgeting program for the last few years (none of the available ones had the features that I needed/wanted) and when I first started I put all of the data in XML. It was easy/quick to get it up and running and handled the hierarchical nature of the data very well, but as is the case with XML is was very space heavy and took a while to save, load, and query the information (I later changed the program to just load the XML to mirrored data structures in memory to speed things up, but querying transactions by date or bill assignment still meant that I had to touch every single transaction in memory).

So I eventually made the switch to SQLite and since the hierarchy was fixed and I didn't have any trees in my data structures, the weaknesses of hierarchical data in databases weren't a concern. Currently there's only 2 levels in my bills structure (categories and then bills), but a few weeks ago I decided that I wanted to get rid of categories and change the way I did bills to allow infinite depth/sub-bills to my bill organization.

This then had me wondering if I should go back to a more hierarchical friendly data format. After the compactness and speed of SQLite, I didn't want to go back to XML and a little searching turned up Protocol Buffers. They are actually really cool and I was amazed by them that I was about to redo my whole program to make use of them (I'm a sucker for strict typing), but then I started thinking about the whole querying thing again. Right now, I can query in so many ways in a very easy and efficient manner that it's almost impossible to match with my own data structures. I thought about reorganizing the data so that querying was more efficient, but then I was just recreating a database by hand in my code and I just couldn't justify all of the programming time that that would require just to get some potential speed improvements (and then all of the data would have to be resident in memory all the time (or I'd have to write my own read/write/cache system) and the cons definitely started outweighing the potential pros).

So I was then left with getting the hierarchical data working in SQLite. The standard parent-child tree structure was the simple approach that popped into my head first and apparently some new SQL to handle such a thing in some DBs (but not SQLite). I figured that someone smarter than myself must have already solved this problem, and some digging revealed that I was right about that. I stumbled on the nested set model, and it's one of those beautifully elegant solutions that I love so much about programming (but that's another post). It's pretty cool, because you actually encode all of the links in a separate table (or you can do the same table if you like) and you do it by doing a pre-order tree traversal and numbering as you go. So insert, moving, and deleting nodes is just a matter of updating the numbering in linkage. The one disadvantage is that when you add/delete/change a node in the tree, you are going to have to touch half the entries in the table on average, but in my program adding/moving bills is infrequent and there are FAR fewer bills than transactions, so paying that small hit in performance on the bills is more than worth it for the big gains with transactions.

No comments: