Friday, August 1, 2008

Hurting the CouchDB

The needs of many

These days companies have a lot of records floating around. At the last eCommerce company I worked for there were many millions of records floating around our Oracle database. I feel that as you get into that sort of situation two things happen: you have very thin lines of interconnectivity between the records, since they are spread across all those tables, and you find yourself yielding more and more to the whims of the database system. I've started toying with CouchDB (which is itself only a toy at this point).

I've thrown over 12k documents into CouchDB so far. I took my company's whole product catalog and serialized it from simple XML to even simpler JSON. Each document was roughly 300KB and contains things like product data, the many records that make up the production history of the product, and just about anything else related to a product. Performance is good: with 12k docs in the system I notice no difference in request times compared to when there were only 10. It appears that CouchDB doesn't much care how many documents you are storing; it simply relies on the underlying system to provide the space on disk.
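
For reference, loading a document is nothing more than an HTTP PUT of the JSON. Here's a minimal sketch of the kind of loader script I mean, assuming a local CouchDB on port 5984 and a database named 'products' that already exists (the names and fields here are made up):

import json
import urllib.request

COUCH = "http://localhost:5984"   # assumed local CouchDB instance
DB = "products"                   # hypothetical database, created beforehand

def put_doc(doc_id, doc):
    """PUT a JSON document into CouchDB under the given id."""
    body = json.dumps(doc).encode("utf-8")
    req = urllib.request.Request(
        "%s/%s/%s" % (COUCH, DB, doc_id),
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# e.g. one product serialized from the XML catalog (fields are illustrative)
print(put_doc("product-12345", {"sku": "12345", "name": "Widget", "history": []}))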

It handles lots of documents

That's great and all, but how big can these documents get? I'm looking to move from relational to document-based, which means that a customer document holds order history, purchased items and many other things. Rather than have the summation of a customer, the most important business entity, spread out all over the DB system, why not put their 'story' into one document? Yes, some of this data could end up being redundant, but that's the point. Don't worry too much about the small stuff; worry about the stuff that really hurts later down the road, like being able to provide a correct SQL statement to put together a Customer model! At my big eCommerce company that was damn near impossible, with many people having their own separate interpretations. Ouch! A document can be self-describing, a big plus when the data model gets complex.
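
To make that concrete, here's a rough sketch in Python of what an all-in-one, self-describing customer document might look like before it gets serialized to JSON; every field name is made up for illustration:

# A hypothetical all-in-one customer document; the structure and field
# names are illustrative, not taken from any real schema.
customer = {
    "_id": "customer-1001",
    "type": "customer",
    "name": "Jane Doe",
    "contact": {"email": "jane@example.com", "phone": "555-0100"},
    "orders": [
        {"order_id": "A-17", "date": "2008-06-02", "total": 54.20,
         "items": [{"sku": "12345", "qty": 2}]},
    ],
    "notes": ["called about late shipment"],
}
# json.dumps(customer) is what would actually get PUT into CouchDB.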

To keep perspective going forward I will use the metric of WAP (or "War and Peace"). This is an epic novel written by the Russian author Leo Tolstoy. Here it is over at Wikipedia; yes, it is huge. I feel this makes a good metric precisely because it is so big. You will never write your own "War and Peace". In the real world this book is a freak of nature, the Andre the Giant of books. I downloaded a free version off the net, and the size was 3.2 MB. Keep this number in mind.

So we need to determine whether documents can get to these really large sizes without worries. I mean, we can't just dream up a reasonable size that no document will ever grow past. I asked myself, "Could my company's product records increase in size by 100x and still have CouchDB serve them up?" At that size would the client even be able to parse and hold the object in memory? Does that matter? Can I use views (in the CouchDB sense) to pare down a document to just the parts the client needs?

Well, I started by simply increasing the size of one of the larger product documents by 3600x. Yeah, I'm shooting for the moon here. The document's size was ~46MB, or ~15 WAP. The POST of the document to CouchDB looked good, but after a minute CouchDB crashed:

eheap_alloc: Cannot allocate 729810240 bytes of memory (of type "heap").
Aborted

I'm sorry CouchDB... not! I am running the db on a virtual slice that doesn't have much memory available, though at least it doesn't have to deal with multiple clients making requests. So maybe there needs to be a feature where documents above a certain size are handled differently. I'm guessing that currently the whole document is parsed into memory, which is probably not a great idea for large documents, but good for performance overall.

Next I tried halving the document's size, from 3600x to 1800x, a relative size of about 7 WAP. Again this blew up with problems allocating memory. OK, next up 900x... hey, it worked! Using my unoptimized vanilla install on a Gentoo slice I can upload a document that is 3.5-4 WAP in size. That is a lot of reading :) I now ask myself, "So at that size, what can you expect from clients?"
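
If you want to try this yourself, the experiment is nothing fancy: pad a small base document out by some factor and try to PUT it. A rough sketch, with made-up database, document and field names:

import json
import urllib.request

# Pad a small base document out to a target multiple and try to PUT it.
# All names (db, doc ids, fields) are made up for illustration.
base = {"sku": "12345", "history": [{"event": "produced", "qty": 100}] * 50}

for factor in (3600, 1800, 900):
    doc = dict(base, history=base["history"] * factor)
    body = json.dumps(doc).encode("utf-8")
    req = urllib.request.Request(
        "http://localhost:5984/products/big-%d" % factor,
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    try:
        urllib.request.urlopen(req)
        print("stored %dx (%.1f MB)" % (factor, len(body) / 1048576.0))
    except Exception as exc:  # the two biggest sizes died on my slice
        print("failed at %dx: %s" % (factor, exc))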

Clients the other white meat

Right off, no current browser can handle this 900x document. It's just too freaking big. Requesting such a doc wouldn't make much sense in most cases anyway; really, what kind of web page needs all that data at once? Take into consideration that the browser will have to pass it on to a JavaScript engine and let it parse the content. Parsing always takes time, and depending on the JavaScript engine being used that time could range from slow to really slow.

If we could get just the pieces that are needed from the document, a client would be much better off. Think of lazy fetching in SQLAlchemy or Hibernate, where the ORM layer doesn't load everything at once but can fetch things when asked. I think the CouchDB view feature will let you cut a document down: maybe for a customer document you get just the order history, or just the contact info. Then you'd update that slice on the client and merge it back into the full document when you save.
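
I haven't tried this at scale yet, but the idea would be a design document whose map function emits just the slice you care about. A sketch, assuming a recent CouchDB's URL layout for views, a 'customers' database, and the customer document shape from earlier (the design doc and view names are my own):

import json
import urllib.request
from urllib.parse import urlencode

COUCH = "http://localhost:5984"
DB = "customers"  # hypothetical database holding the customer docs from above

# A design document whose map function emits only each customer's order
# history. The design doc name, view name and field names are all assumed.
design = {
    "language": "javascript",
    "views": {
        "orders": {
            "map": "function(doc) { if (doc.type == 'customer') { emit(doc._id, doc.orders); } }"
        }
    },
}
req = urllib.request.Request(
    "%s/%s/_design/slices" % (COUCH, DB),
    data=json.dumps(design).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
urllib.request.urlopen(req)

# Pull back just one customer's orders instead of the whole giant document.
params = urlencode({"key": json.dumps("customer-1001")})
url = "%s/%s/_design/slices/_view/orders?%s" % (COUCH, DB, params)
rows = json.loads(urllib.request.urlopen(url).read())["rows"]
print(rows)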

I put a 200x document into the DB. Using the metric this is about 0.7 WAP, still very large. Firefox 3 loaded the structure in 32-33 seconds in Futon. At this size the request took 1-2 seconds when using cURL from the command line to fetch the object. Using this CouchDB Python library it took 4-5 seconds to get the 200x doc from CouchDB. Not bad if you really need to parse that much data. To again put things in perspective, this is 0.7 War and Peace books. That's pretty big.
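
In case anyone wonders how I was timing the fetches, it was nothing fancier than this sort of thing (the database and document names are made up):

import json
import time
import urllib.request

url = "http://localhost:5984/products/big-200"  # the 200x test document (name assumed)

start = time.perf_counter()
raw = urllib.request.urlopen(url).read()        # roughly what curl is doing
fetched = time.perf_counter()
doc = json.loads(raw)                           # the parse step a client also pays for
parsed = time.perf_counter()

print("fetch: %.2fs  parse: %.2fs  size: %.1f MB"
      % (fetched - start, parsed - fetched, len(raw) / 1048576.0))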

So, are there even applications for documents that get this big? I can think of a certain historical report that really could use all this info: generate a complete history of who changed what and when, show how things changed (more in a minute), give all the data surrounding a product, show who has what access, and more.

On keeping track

Now, I haven't spoken of it yet, but there is also a feature built into CouchDB where a document's changes can be tracked: basically you can review its revisions over time. I'm not going to do any performance testing of that history here; I'll save that for later. I just wanted to point out that this feature is very cool and useful.
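
This is exposed over plain HTTP like everything else. A sketch, with the usual caveat that old revisions only stick around until the database gets compacted (the database and document names are made up):

import json
import urllib.request

base = "http://localhost:5984/products/product-12345"  # doc id assumed

# Ask CouchDB which revisions it still knows about for this document.
info = json.loads(urllib.request.urlopen(base + "?revs_info=true").read())
for rev in info["_revs_info"]:
    print(rev["rev"], rev["status"])  # 'available' until compaction throws it out

# Fetch the document as it looked at a particular, still-available revision.
available = [r["rev"] for r in info["_revs_info"] if r["status"] == "available"]
old = json.loads(urllib.request.urlopen(base + "?rev=" + available[-1]).read())
print(old)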

What I need to do

I need to actually build and maintain something using CouchDB. I've looked and listened for stories on using CouchDB and have found little. I can relate, though, to using Mark Logic, an XML database and therefore document-based, and having good results. The site SafariU (maybe gone by now) used mostly XQuery to manage documents that contained things such as user info like subscriptions. It worked pretty well. You did end up replacing SQL with XQuery, but I feel XQuery was a much better tool than SQL. Anyway, I will keep toying around and will post more on CouchDB if and when I find out more.

2 comments:

JanL said...

Just a quick note: CouchDB is in alpha state and lacks optimizations. The problems you are seeing will be gone in a stable release. Large documents won't be that big a problem.

Also, sending mails with problem descriptions to the CouchDB mailing lists (http://incubator.apache.org/couchdb/community/lists.html) might have helped :)

Cheers
Jan
--

robottaway said...

Ah, I hope my review didn't come off as negative! In fact I feel quite positive about Couch. Even Superman is susceptible to kryptonite; I wasn't expecting performance at the levels I was testing at. I just wanted to see how far I could push the envelope. CouchDB has a lot of potential, and even right now I feel it could be used in some smaller applications around my work. You can count me as someone who thinks you all are onto something big. Thank you for the work you've put into CDB.