tech-nickel: IT

The needs of many

These days companies have a lot of records floating around. At the last eCommerce company I worked for there were many millions of records floating around our Oracle database. I feel that as you get into that sort of situation two things happen: first, you have very thin lines of interconnectivity between the records, as they are spread across all those tables, and second, you find yourself yielding more and more to the whims of the database system. I've started toying with CouchDB (which is itself only a toy at this point).

I've thrown over 12k documents into CouchDB so far. I've taken my company's whole product catalog and serialized it from simple XML to even simpler JSON. Each document was roughly 300kb. Each document contains things like product data, the many records that make up the production history of the product, and just about anything else related to a product. Performance is good: with 12k docs in the system I notice no difference in request times compared to 10 documents. It appears that CouchDB doesn't care how many documents you are storing; it simply relies on the underlying system to provide the space on disk.

It handles lots of documents

That's great and all, but how big can these documents get? I'm looking to move from relational to document based, which means that a customer document holds order history, purchased items and many other things. Rather than have the summation of a customer, the most important business entity, spread out all over the DB system, why not put their 'story' into one document? Yes, some of this data could end up being redundant, but that's the point. Don't worry too much about the small stuff; worry about the stuff that really hurts later down the road, like being able to provide a correct SQL statement to put together a Customer model! At my big eCommerce company that was damn near impossible, with many people having separate interpretations! Ouch! A document can be self describing, a big plus when the data model gets complex.

To keep perspective going forward I will use the metric of WAP (or "War and Peace"). This is an epic document written by the Russian author Leo Tolstoy. Here it is over at Wikipedia; yes, it is huge. I feel this makes a good metric because it is so big. YOU will never write your own "War and Peace". In the real world this book is a freak of nature, say the Andre the Giant of books. I downloaded a free version off the net, and the size was 3.2 MB. Keep this number in mind.

So we need to determine if documents could get to these really large sizes without worries. I mean, we can think up a reasonable size beyond which no document will ever grow. I asked myself "Could my company's product records increase in size by 100x and CouchDB still serve 'em up?". At this size would the client even be able to parse and hold the object in memory? Does that matter? Can I use views (in the CouchDB sense) to pare down a document to just the parts the client needs?

Well, I started by simply increasing the size of a larger product document by 3600x. Yeah, I'm shooting for the moon here. The document's size was ~46 MB, or ~15 WAP. On POST of the document to CouchDB things looked good, but after a minute CouchDB crashed:

eheap_alloc: Cannot allocate 729810240 bytes of memory (of type "heap").
Aborted

I'm sorry CouchDB... not! I am running the db on a virtual slice that doesn't have too much memory available. Still, it doesn't have to deal with multiple clients making requests. So maybe there needs to be a feature where documents above a certain size are handled differently. I'm guessing that currently the whole document is parsed into memory, which is probably not a great idea for large documents, but good for performance overall.

Next I tried halving the document's size, from 3600x to 1800x, with a relative size of 7 WAP. Again this blew up with problems allocating memory. OK, next up 900x... hey, it worked! Using my unoptimized vanilla install on a Gentoo slice I can upload a document that is 3.5-4 WAP in size. That is a lot of reading :) I now ask myself "so at that size what can you expect from clients?".
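To keep the WAP numbers straight, here is a small back-of-the-envelope sketch. It assumes the ~46 figure for the 3600x document is in megabytes and that document size scales linearly with the multiplier; the function names are made up for illustration. Under those assumptions the three attempts come out at roughly 14, 7 and 3.6 WAP, in line with the rough figures quoted above.

```python
# Size of the free "War and Peace" download quoted in the post, in MB.
WAP_MB = 3.2


def to_wap(size_mb):
    """Convert a document size in MB to 'War and Peace' units."""
    return size_mb / WAP_MB


def scaled_size_mb(multiplier, ref_multiplier=3600, ref_size_mb=46.0):
    """Estimate document size at a given multiplier.

    Assumption: size grows linearly with the multiplier, anchored on the
    ~46 MB measured for the 3600x document.
    """
    return ref_size_mb * multiplier / ref_multiplier


# The three upload attempts described in the post.
for multiplier in (3600, 1800, 900):
    size = scaled_size_mb(multiplier)
    print(f"{multiplier}x: ~{size:.1f} MB, ~{to_wap(size):.1f} WAP")
```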

Clients the other white meat

Right off, no current browser can handle this 900x document. Just too freaking big. Requesting such a doc wouldn't make much sense in most cases. Really, what kind of web page needs all that data at once? Take into consideration that the browser will have to pass it on to a JavaScript engine and allow it to parse the content. Parsing always takes time. Depending on the JavaScript engine being used this time could range from slow to really slow.

If we could get just the pieces that are needed from the document, a client would be much better off. Think of lazy fetching in SQLAlchemy or Hibernate, where the ORM layer doesn't load everything at once, but can get things when asked for. Well, I think that the CouchDB view feature will allow you to cut down the size of a document; maybe for a customer document you get just the order history, or contact info. Then you update that doc and merge it back into the document.
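As a rough idea of what such a view could look like, here is a minimal sketch in Python that only builds the design document and the request URL, without talking to a server. The field names (type, order_history), the design document name, the database name and the customer id are all hypothetical. The map function itself is JavaScript, since that is what CouchDB views run, and the URL follows the layout used by current CouchDB releases; older releases may lay out the view path differently.

```python
import json
from urllib.parse import quote

# JavaScript map function, stored as a string inside the design document.
# It emits only the order history of each customer document.
# "type" and "order_history" are hypothetical field names.
ORDER_HISTORY_MAP = """
function (doc) {
  if (doc.type === "customer" && doc.order_history) {
    emit(doc._id, doc.order_history);
  }
}
"""


def build_design_doc():
    """Build a design document holding one view (names are hypothetical)."""
    return {
        "_id": "_design/customers",
        "language": "javascript",
        "views": {"order_history": {"map": ORDER_HISTORY_MAP.strip()}},
    }


def view_url(base_url, db, customer_id):
    """Build the URL that asks the view for a single customer's rows.

    View keys are JSON values, so the id is JSON-encoded and then URL-quoted.
    """
    key = quote(json.dumps(customer_id))
    return f"{base_url}/{db}/_design/customers/_view/order_history?key={key}"


print(json.dumps(build_design_doc(), indent=2))
print(view_url("http://localhost:5984", "shop", "customer-42"))
```

The design document would be stored in the database like any other document, and the client would then request the view URL instead of the full customer document.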

I put a 200x document into the DB. Using the metric this is about 0.7 WAP, still very large. Firefox 3 loaded the structure in 32-33 seconds in Futon. At this size the request took 1-2 seconds when using cURL via the command line to request the object. Using this CouchDB Python library it took 4-5 seconds to get the 200x doc from CouchDB. Not bad if you really needed to parse that much data. To again put things in perspective, this is 0.7 War and Peace books. That's pretty big.
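For anyone who wants to get a feel for the client side without a CouchDB server, here is a hedged sketch of the same kind of experiment done locally. It inflates a made-up product document by repeating its history list, then times only the JSON parse. The document shape and field names are invented for illustration, and the timings it prints say nothing about the network or about CouchDB itself.

```python
import json
import time


def inflate(base_doc, factor):
    """Return a copy of base_doc whose 'history' list is repeated factor times."""
    doc = dict(base_doc)
    doc["history"] = list(base_doc["history"]) * factor
    return doc


def measure(doc):
    """Serialize doc to JSON, then time how long parsing it back takes.

    Returns (size of the JSON body in bytes, parse time in seconds).
    """
    body = json.dumps(doc)
    start = time.perf_counter()
    parsed = json.loads(body)
    elapsed = time.perf_counter() - start
    assert parsed == doc  # sanity check: the round trip preserved the data
    return len(body.encode("utf-8")), elapsed


# A made-up product document; every field name here is hypothetical.
base = {
    "_id": "product-1",
    "name": "Example product",
    "history": [
        {"rev": i, "changed_by": "someone", "note": "x" * 200}
        for i in range(50)
    ],
}

for factor in (1, 10, 200):
    size_bytes, seconds = measure(inflate(base, factor))
    print(f"{factor}x: {size_bytes / 1e6:.2f} MB, parsed in {seconds:.4f} s")
```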

So, are there even applications for documents that get so big? I can think of a certain historical report that really could use all this info to generate a complete history of who changed what and when, how things changed (more in a minute), give all the data surrounding a product, who has what access, and more.

On keeping track

Now I haven't spoken of it yet, but there is also a feature built into CouchDB where a document's changes can be tracked. Basically you can review the changes over time. I'm not going to do any testing of performance in history; I'll save that for later. I just wanted to point out that this feature is very cool and useful.

What I need to do

I need to actually build and maintain something using CouchDB. I've looked and listened for stories on using CouchDB and have found little. I can relate, though, to using MarkLogic, an XML database which is therefore document based, and having good results. The site SafariU (maybe gone by now) used mostly XQuery to manage documents that contained things such as user info like subscriptions. It worked pretty well. You did end up replacing SQL with XQuery, but I feel XQuery was a much better tool than SQL was. Anyway, I will be toying around and will post more on CouchDB if and when I find out something more.

I've been fishing around with Perl over the last month. Got a big legacy PIM application to deal with. Oh my! This app is running on Apache and mod_perl. These are some old school technologies. I'm sure there are lots of burnt out mod_perl programmers out there who are much more qualified to work on this shit. I bet they don't bother to put this stuff on their resume, not as long as they have money in the bank, and/or a roof over their head. Scratch that, maybe even without those... having gotten acquainted with this stuff, homelessness may be preferable. Sad sad. I don't dislike Perl. It has its charms and especially evokes finer feelings of classic-ness. It's the language that helped build the net into what it is... a giant commercial cesspool slash platform for extremists of the opinionated sort and trans-generational masturbatory aid. Wow, what a legacy.

So there it is. Why should I respect something merely because its author has a rad mustache and sweet Hawaiian shirt collection? Perl in my very humble opinion is too prickly for the newer generation (which I kinda am in). It has nice features: hashes real nice... lists real nice... CPAN amazing... come on though, for each thing it does right it does numerous things wrong or in poor taste. Arguments to functions, blah. References, why are you so lousy looking and overly complicated? And strict, dumbest vestigial lousy feature ever. Classes are disgustingly complicated to pick up... bless?!? Lots of sharp edges. Despite all this Perl has lit the way for many. I've met people that are now programmers because they found Perl. Let's face it though, languages have severely changed since Perl's founding. Most people who program Perl have changed too.

I recently read that Perl is not dead; sure, I agree, it is just legacy. Cobol and Fortran are not REALLY dead either. But you couldn't lure me into doing either, not with a six pack of Mickey's and a Snickers bar. There are few golden things left for those with great Perl skills to look forward to. All the cool kids have moved on to Python and Ruby, even Java I may argue. These people of skill would have little difficulty picking up another language. The true hacker/scholar/intellectual will thrive on a new language, enjoying the diversity of human creation. They also should have little holding them back from trying something new. I wager those who make the effort will quickly make the transition to the new language and never look back. This may include a change of job.

To wrap it up, Perl is like the obnoxious slob at a party whom you want to avoid. Difficult to deal with, disheveled and unkempt, but also sometimes fun. If you are learning to program, or have done a bit of programming, learn Python (or maybe Ruby). Stay away from Perl. Also don't go near PHP, please!


Friday, August 1, 2008

Hurting the CouchDB

Friday, May 2, 2008

Perl how you offend me (sometimes)
