Sunday, August 24, 2008

Hacking Eclipse plugin configurations


Reading Neal Ford's generally excellent book The Productive Programmer, I experimented with his multiple-plugin-configuration hack. In short, creating multiple plugin configurations in a single Eclipse install allows a team to keep its plugin configurations in version control, so everyone has exactly the same configurations as everyone else on the team. No more "works on my machine but not on yours" weirdness. You can even manage them on a project-by-project basis, which is good.

But there are two aspects of working with multiple plugin configurations that are strange: specifically, creating and deleting them. In order to create an additional plugin configuration, you have to:
  1. Create a folder to hold the configuration. It must be named "eclipse" and it must not be in Eclipse's directory structure.
  2. Within your "eclipse" folder, you have to make an empty file called .eclipseextension, and two empty folders, features and plugins.
As Ford points out, Eclipse (inexplicably) won't do this for you. You have to do this by hand. It's not hard, just strange. At least this way you have more control over where the folders and files are located. (As I have multiple Eclipse installs, I used nesting to keep track of everything. I created a top-level folder called "eclipse-configurations"... under that, I made another folder for each of my named installs, and under each of those, I placed the "eclipse" folder as mentioned above. So, the versioned configuration for my "xquery" install of Eclipse is at /Users/haren/eclipse-configurations/xquery/eclipse.)
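For reference, here's what that setup boils down to at the command line, using my "xquery" install as the example (adjust the paths for your own layout):

mkdir -p ~/eclipse-configurations/xquery/eclipse/features
mkdir -p ~/eclipse-configurations/xquery/eclipse/plugins
touch ~/eclipse-configurations/xquery/eclipse/.eclipseextension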

From there, you can go to Help -> Software Updates -> Manage Configuration and add your configuration location(s). Then it's a simple matter of installing your plugins to the desired config locations. You can then enable and disable multiple plugins as a group, switch between versions, etc. It's very handy.

But I'd mentioned that there were two strange things about the process. Creating additional configuration locations was one, deleting them was the second. Just as Eclipse gives you no love in creating them, it makes it even harder to get rid of them.

Let's say, for example, that you've added your new location as an extension not to the top-level list, but as an extension to an extension. (Yes, you can do this.) But let's also say that's not what you wanted. Well, you can disable your extension-within-an-extension, but you can't get Eclipse to ignore its existence entirely. If you then try to add it to the top level, Eclipse won't let you, complaining that you've already added it elsewhere. Arg.

Well, there's a way around that, too (but Ford doesn't mention it). Under ${ECLIPSE_HOME}/configuration/org.eclipse.update there's a file called platform.xml. Up at the top there are "site" nodes, and one of those will be your offender. Delete the bad guy and restart Eclipse. Now you can place your configuration elsewhere. (Or you can just change the path in the node.)
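For reference, a site node in platform.xml looks something like the one below. I'm sketching from memory and the exact attributes vary between Eclipse versions, so treat this as a rough guide; the url attribute is the part you care about:

<site enabled="true" updateable="true" policy="USER-EXCLUDE" url="file:/Users/haren/eclipse-configurations/xquery/eclipse/"/>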

Anyway, as noted, there's a lot to gain by using multiple plugin configurations, once you get around Eclipse's strange reluctance to make it intuitive. Happy hacking!

Tuesday, August 19, 2008

Windoze tools - Infra Recorder

If you somehow find yourself on a Windows machine and need to burn an image (ISO) to a CD or DVD, then I recommend Infra Recorder. I used it a year or two ago, but found it again today and have to say it is looking and working well.

Mojave Experiment - an open insult to Windows users

If you haven't heard of Microsoft's newest advertising venture, the Mojave Experiment, let me fill you in. It is an 'experiment' (their words, not mine) in which Windows XP users are asked to test out a cutting-edge new M$ OS. Well, it turns out this is just Vista. The punchline is that users love Vista: just give it a chance, all you mean naysayers. Well, I think users don't like Vista, and for good reason. That is why it has not caught fire like M$ thinks it should have. I feel safe saying the execs up in Redmond must have lost their shit, having drunk their own Kool-Aid far too long. For a thorough examination check out this article. Peace!

Monday, August 4, 2008

Installing CouchDB on Gentoo

So I recently installed CouchDB on Gentoo at work, and I figured for others' sake I would post clear, concise directions.

Portage setup

You are going to want a Portage overlay in which to put your own ebuild scripts. Having such a place keeps your own dabbling separate from the core Gentoo ebuilds. Let's start by adding your Portage overlay: 'sudo mkdir -p /usr/local/portage'. This is where we are going to put your custom ebuilds. If you are interested, look in '/usr/portage'; there you will see the many ebuilds that come down through Gentoo's official tree.

To notify Gentoo (more specifically Portage) of this new overlay you will want to add the following line into the '/etc/make.conf' file:
PORTDIR_OVERLAY="/usr/local/portage"
Next up, we need a category. Categories separate ebuilds by function or purpose. For instance, the web server Apache is found in the 'www-servers' category; you can probably find it at '/usr/portage/www-servers' on your machine. You can pick any category name you like for this exercise; I'm going to be using 'ottaway'. For the category, create a folder in '/usr/local/portage'. In my case I do 'sudo mkdir /usr/local/portage/ottaway'; substitute your own category name for 'ottaway'.

To make Portage aware of this new category, I add the line 'ottaway' to the '/etc/portage/categories' file.
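From the shell that's a one-liner (tee gets around sudo not applying to a plain >> redirect):

echo 'ottaway' | sudo tee -a /etc/portage/categories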

Getting CouchDB

You will need to get the ebuild script for CouchDB. It is found as an attachment on this page. I used the following to download the script:
curl https://bugs.gentoo.org/attachment.cgi?id=159315 > couchdb-0.8.0.ebuild
You could pretty easily use wget also. I put this in my Portage overlay in my custom category at '/usr/local/portage/ottaway'; you must do the same for your category.

Next up, you are going to have to tell Gentoo that you are OK with certain development ebuilds being installed. I did this by adding the following lines to the '/etc/portage/package.keywords' file:
# couchdb stuff
dev-lang/erlang
dev-util/svn2cl
dev-lang/spidermonkey
ottaway/couchdb ~x86
You can put those lines anywhere in the file. If you used a category name other than 'ottaway', change the value in the last line of the example above.

Next up I was ready to install the whole thing. You can do so using:
sudo emerge =<yourcategory>/couchdb-0.8.0
Where <yourcategory> is the name of the category you created earlier. Once this starts moving you can sit back and relax. When it finishes, you can use "sudo -u couchdb couchdb" to get things started. When you see "time to relax" pop onto the screen, go ahead and hit your instance at http://<yourdomain>:5984/_utils/index.html, where <yourdomain> is the network name of the machine CouchDB is running on.
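As a quick sanity check, you can also hit the root of the server:

curl http://<yourdomain>:5984/

A running CouchDB answers with a little JSON greeting, something like {"couchdb": "Welcome", "version": "0.8.0"} (the exact version string will vary).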

Friday, August 1, 2008

Hurting the CouchDB

The needs of many

These days companies have a lot of records floating around. At the last eCommerce company I worked for, there were many millions of records floating around our Oracle database. I feel that as you get into that sort of situation, two things happen: first, you have only thin lines of interconnectivity between the records, since they are spread across all those tables; and second, you find yourself yielding more and more to the whims of the database system. I've started toying with CouchDB (which is itself only a toy at this point).

I've thrown over 12k documents into CouchDB so far. I've taken my company's whole product catalog and serialized it from simple XML to even simpler JSON. Each document was roughly 300 KB. Each document contains things like product data, the many records that make up the production history of the product, and just about anything else related to a product. Performance is good: with 12k docs in the system I notice no difference in request times compared to 10 documents. It appears that CouchDB doesn't care how many documents you are storing; it simply relies on the underlying system to provide the space on disk.
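The loading itself is nothing fancy; it amounts to something like the lines below, where 'products' is a made-up database name and product.json is one of the converted catalog files:

curl -X PUT http://localhost:5984/products
curl -X POST http://localhost:5984/products -H 'Content-Type: application/json' -d @product.json

The PUT creates the database, and each POST stores one document with a server-generated id.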

It handles lots of documents

That's great and all, but how big can these documents get? I'm looking to move from relational to document-based, which means a customer document holds order history, purchased items, and many other things. Rather than have the summation of a customer, the most important business entity, spread out all over the DB system, why not put their 'story' into one document? Yes, some of this data could end up being redundant, but that's the point. Don't worry too much about the small stuff; worry about the stuff that really hurts later down the road, like being able to provide a correct SQL statement to put together a Customer model! At my big eCommerce company that was damn near impossible, with many people having separate interpretations! Ouch! A document can be self-describing, a big plus when the data model gets complex.
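To make that concrete, a self-describing customer document might look roughly like this (all the fields here are invented for illustration):

{
  "_id": "customer-1042",
  "type": "customer",
  "name": "Jane Doe",
  "contact": {"email": "jane@example.com"},
  "orders": [
    {"order_id": "A-1", "placed": "2008-07-30", "items": ["widget", "gadget"]}
  ]
}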

To keep perspective going forward, I will use the metric of WAP (or "War and Peace"). This is an epic novel by the Russian author Leo Tolstoy. Here it is over at Wikipedia; yes, it is huge. I feel this makes a good metric because it is so big. YOU will never write your own "War and Peace". In the real world this book is a freak of nature, say the Andre the Giant of books. I downloaded a free version off the net, and the size was 3.2 MB; keep this number in mind.

So we need to determine whether documents could get to these really large sizes without worries. I mean, we can think up a reasonable size beyond which no document will ever grow. I asked myself, "Could my company's product records increase in size by 100x and still have CouchDB serve 'em up?" At this size would the client even be able to parse and hold the object in memory? Does that matter? Can I use views (in the CouchDB sense) to pare down a document to just the parts the client needs?

Well, I started by simply increasing the size of a larger product document by 3600x. Yeah, I'm shooting for the moon here. The document's size was ~46 MB, or ~15 WAP. On POST of the document to CouchDB things looked good, but after a minute CouchDB crashed:

eheap_alloc: Cannot allocate 729810240 bytes of memory (of type "heap").
Aborted

I'm sorry, CouchDB... not! I am running the DB on a virtual slice that doesn't have too much memory available. Still, it doesn't have to deal with multiple clients making requests. So maybe there needs to be a feature where documents over a certain size are handled differently. I'm guessing that currently the whole document is parsed into memory, probably not a great idea for large documents, but good for performance overall.

Next I tried halving the document's size, from 3600x to 1800x, with a relative size of 7 WAP. Again this blew up with problems allocating memory. OK, next up, 900x... hey, it worked! Using my unoptimized vanilla install on a Gentoo slice, I can upload a document that is 3.5-4 WAP in size. That is a lot of reading :) I now ask myself, "So at that size, what can you expect from clients?"

Clients, the other white meat

Right off, no current browser can handle this 900x document. It's just too freaking big. Requesting such a doc wouldn't make much sense in most cases anyway. Really, what kind of web page needs all that data at once? Take into consideration that the browser will have to pass it on to a JavaScript engine and let it parse the content. Parsing always takes time. Depending on the JavaScript engine being used, this could range from slow to really slow.

If we could get just the pieces that are needed from the document, a client would be much better off. Think of lazy fetching in SQLAlchemy or Hibernate, where the ORM layer doesn't load everything at once but can get things when asked for. Well, I think that the CouchDB view feature will let you cut down the size of a document: maybe for a customer document you get just the order history, or the contact info. Then you update that slice and merge it back into the document.
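As a sketch of what I mean (the database and field names are invented, and the design-document format has been a moving target across CouchDB releases, so check the docs for your version): you store a map function in a design document, then query the view instead of fetching the whole customer.

curl -X PUT http://localhost:5984/customers/_design/slim -d '{
  "language": "javascript",
  "views": {
    "order_history": "function(doc) { if (doc.orders) emit(doc._id, doc.orders); }"
  }
}'
curl http://localhost:5984/customers/_view/slim/order_history

The map function emits just the orders field for each customer, so the client downloads only that slice.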

I put a 200x document into the DB. Using the metric, this is about 0.7 WAP, still very large. Firefox 3 loaded the structure in 32-33 seconds in Futon. At this size the request took 1-2 seconds when using curl via the command line to fetch the object. Using this CouchDB Python library it took 4-5 seconds to get the 200x doc from CouchDB. Not bad if you really needed to parse that much data. To again put things in perspective, this is 0.7 "War and Peace" books. That's pretty big.
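For the curious, the command-line timing was nothing more than time wrapped around curl, along these lines (the document id here is made up):

time curl -s http://localhost:5984/products/some-200x-doc -o /dev/null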

So, are there even applications for documents that get so big? I can think of a certain historical report that really could use all this info: generate a complete history of who changed what and when, show how things changed (more in a minute), give all the data surrounding a product, show who has what access, and more.

On keeping track

Now, I haven't spoken of it yet, but there is also a feature built into CouchDB where a document's changes can be tracked. Basically, you can review the changes over time. I'm not going to do any performance testing on history; I'll save that for later. I just wanted to point out that this feature is very cool and useful.
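If you want to poke at it over HTTP, something like the following should show a document's revision trail (the document id is made up, and I haven't tested these parameters on every version, so treat them as a starting point):

curl 'http://localhost:5984/products/some-doc?revs_info=true'
curl 'http://localhost:5984/products/some-doc?rev=<old-revision>'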

What I need to do

I need to actually build and maintain something using CouchDB. I've looked and listened for stories on using CouchDB and have found little. I can relate, though, to using Mark Logic, an XML database and therefore document-based, and having good results. The site SafariU (maybe gone by now) used mostly XQuery to manage documents that contained things such as user info like subscriptions. It worked pretty well. You did end up replacing SQL with XQuery, but I feel XQuery was a much better tool than SQL. Anyway, I will be toying around and will post more on CouchDB if and when I find out more.