Sunday, March 21, 2010

Best Practices with Maven: OSS forks

Recently I came across a company that is forking several open source Java projects. I saw they were making a mistake that I also made a few years ago and have since learned from.

In Maven's distributed repository architecture, project artifacts such as JAR files are uniquely identified by a coordinate system composed of a group identifier, an artifact identifier, a version number, optionally a classifier, and a packaging type. For instance, the most recent version of the Apache Commons Lang project has a Maven coordinate (i.e., groupId:artifactId:version:classifier:type) of commons-lang:commons-lang:2.5::jar.

A few years ago, if I wanted to make custom changes to this project, I would get the source, make my changes, and then deploy the result to our private Nexus repository under a new groupId, such as com.jaxzin.oss:commons-lang:2.5::jar. That might seem reasonable. Then, a year or so later, I tried something different and changed the artifactId instead, like this: commons-lang:commons-lang-jaxzin:2.5::jar.

Unfortunately there is a serious problem with both of these approaches. Maven supports transitive dependencies, which means that if you include a dependency you get its dependencies 'for free'. But what happens when you depend on com.jaxzin.oss:commons-lang and indirectly include commons-lang:commons-lang? With either approach, Maven has lost all knowledge that these two artifacts are actually related. And when I say 'related' I mean they include different versions of the same classes. When Maven loses this relationship, it can't perform version conflict resolution and will include both versions in the output. It will compile with both on the classpath. If you are building a WAR file, it will include both in the WEB-INF/lib directory. If you are assembling or shading an "uber" jar, it will include the classes from both in your giant jar with all its dependencies. And unfortunately, the one that 'wins' is nearly nondeterministic.

So what's the solution? How do you properly fork an open-source project privately?

The trick is to change the version and leave the groupId and artifactId alone. That way, Maven can still detect the relationship and perform version conflict resolution. So to complete the example, I would fork Commons Lang 2.5 to a new coordinate: commons-lang:commons-lang:2.5-jaxzin-1::jar.
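
As a rough sketch of what that looks like in practice, the fragments below show the fork's own coordinates and a consuming project that depends on it. This assumes the fork has already been deployed to a private repository the consuming build can reach; the rest of each pom.xml is omitted.

```xml
<!-- Fragment of the fork's pom.xml: groupId and artifactId are untouched, -->
<!-- only the version carries the fork qualifier. -->
<groupId>commons-lang</groupId>
<artifactId>commons-lang</artifactId>
<version>2.5-jaxzin-1</version>
<packaging>jar</packaging>

<!-- Fragment of a consuming project's pom.xml: depend on the fork directly. -->
<dependencies>
  <dependency>
    <groupId>commons-lang</groupId>
    <artifactId>commons-lang</artifactId>
    <version>2.5-jaxzin-1</version>
  </dependency>
</dependencies>
```

Because the fork is declared directly in the consuming project, it sits closer to the root of the dependency tree than any transitive commons-lang:commons-lang, so Maven should pick it as the single version of that artifact.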

Now I do have one further suggestion, but it's a questionable practice and I'm not sure how well it works. You might consider forking version 2.5 to version 2.6-jaxzin. This way, if Maven attempts to resolve version conflicts, it will know that your fork is 'newer/better' than 2.5. Maven sees versions with qualifiers as being older than the unqualified version. I think the assumption is that if you are qualifying a version, it's a pre-release version like 1.0-alpha-1, 1.0-beta-1, or 1.0-rc-1. You can read more about how Maven version conflict resolution works, and I know they have a major overhaul of this logic available in Maven 3.0 with the Mercury project.

But, in practice, when I've run into version conflicts like this I will add an exclusion clause where I depend on an artifact that is including the conflict transitively.
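
For illustration, an exclusion looks roughly like the fragment below. The com.example:some-library:1.0 coordinate is a made-up placeholder for whichever dependency pulls in the unwanted commons-lang transitively; substitute the real one from your own build.

```xml
<dependency>
  <!-- Hypothetical dependency that brings in commons-lang transitively -->
  <groupId>com.example</groupId>
  <artifactId>some-library</artifactId>
  <version>1.0</version>
  <exclusions>
    <!-- Drop the transitive copy; exclusions match on groupId and artifactId only -->
    <exclusion>
      <groupId>commons-lang</groupId>
      <artifactId>commons-lang</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

With the transitive copy excluded, the only commons-lang left on the classpath is the one you declare yourself.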

Thursday, March 18, 2010

Not for Adoption

Last night was my first session as a volunteer at the Danbury Animal Welfare Society (DAWS). I had attended an orientation a few weeks back and that's when I saw the facility for the first time, learned about the standard operating procedures and policies, and got to meet some of the cats I'll be working with. Now I'm not a person that enjoys change or meeting new people, but other than my immediate family I don't think many people are aware because I try very hard to hide my discomfort. Who knows, I could be wrong so feel free to call me out in the comments!

So far, this entire experience is quite out of my normal comfort zone, but I'm forcing myself to do this for several reasons. I learned about DAWS after attending their Puppy Love Ball, a fund-raiser they held in February, in support of a friend of mine who was honored as their Person of the Year. They premiered this mission video at the event and it had me hooked. To learn that DAWS is a shelter that doesn't euthanize animals for non-medical reasons was a real inspiration to me. I didn't know the idea of a 'no-kill' shelter even existed. Another reason is that I have wanted to volunteer somewhere simply to give back to a community. I'm not sure where my misunderstanding came from, but I've always imagined volunteering to be an unpleasant activity, like cleaning or manual labor. So when I found out that I could volunteer to be a 'cat socializer', it sounded like a perfect fit. Since my wife is very allergic to animal hair and dander, this would also be my chance to interact with cats on a regular basis without owning one.

This next statement may come off as brutally honest, but another reason I was outside my comfort zone is the facility. Inside, it's a clean, spacious and wonderful facility for the animals, but when I first arrived, its exterior wasn't exactly what I had expected, especially since my introduction to DAWS was the grandeur of the Ball held at the Ethan Allen. The building looks like a large residential home and could easily be overlooked as a place of business. After visiting the DAWS location in Bethel I understood the importance of the sign and landscaping that my friend Paul created. Hearing about their goal to fundraise for a brand-new state-of-the-art facility by 2019 made me even more impressed by and connected with DAWS. I'd love to be a small part of them reaching that goal.

I arrived last night a few minutes early, uncertain what to do. Seeing others waiting outside for the door to open, I introduced myself and found out they were waiting to become adopters. I was starting to get a little nervous since none were other volunteers. Was I late? Was there a different entrance? Did I not read an email fully? After a few minutes I discovered I hadn't made a mistake when other volunteers arrived.

After a few minutes we were let in, and I introduced myself and was introduced to the other cat program volunteers who were there. Unfortunately I'm joining the volunteer program at a time when everyone needs to be more hands-off with the cats because of a skin infection that is making the rounds in the shelter. They have done an amazing job handling the situation. I hope in a few weeks the issue will be behind DAWS.

From what I gathered from my first session, a two-hour evening consists of feeding the cats, playing with them, cleaning litter boxes and then finishing the night by getting everyone back in their cages. This is the kind of volunteering I can handle!

The thing I really took away from last night though was my experience with a cat named Daphne. She's a beautiful black & white cat and very reminiscent of the family cat I grew up with for 14 years named Purrina. She has some trouble with her hind legs, and she seems a bit of a special-needs case because of it. On her cage is a sign that reads "Not for Adoption" and those three words seemed to say so much about DAWS. Here is a cat that in many cases would be euthanized because no one goes to a shelter looking for a special-needs pet. But instead DAWS, with their no-kill policy, chose to give this cat a chance to be adopted and she's so close to going home.

Daphne cemented why I want to give two hours of my week to DAWS, in a time when I feel like I don't have 5 extra minutes to spare. If you are located in the greater Bethel/Danbury area, I encourage you to donate your time or donate your money.



Friday, March 12, 2010

First Impressions from NoSQL Live

Today I drove up to Boston for the day to attend NoSQL Live. My experience so far within the NoSQL community has been limited to what we've built in-house at Disney and ESPN over the past decade to solve our scaling issues, ESPN's more recent use of WebSphere eXtreme Scale, and, latest of all, my own experimentation with HBase, which hasn't gotten much further than setting up a four-node cluster. I've read a little about Cassandra, memcached, and Tokyo Cabinet, and that's about it. So before the sandman wipes away most of my first impressions of the technologies discussed today, I wanted to record my thoughts for posterity or, at the very least, tomorrow.


Cassandra
Cassandra seems to be the hottest NoSQL solution this month, with press about both Twitter and Digg running implementations. My impression: I'm wary of "eventual consistency". I don't feel I understand the risk and ramifications well enough to design a system properly. When Jonathan Ellis of Rackspace Cloud mentioned that Digg needed to implement ZooKeeper-based locking on top of Cassandra so that diggs get recorded correctly, I realized how poorly I understand eventual consistency and how risky it could be. But my impression of Cassandra isn't all negative. It definitely seems to have less baggage than HBase by not being built on top of HDFS. I'll get into what that means a little later.

Memcached
Unfortunately the speaker that 'represented' memcached gave off a vibe that really turned me off to the product. I know that's incredibly shallow, but this is first impressions after all and not perfectly-evaluated impressions. Mark Atwood sat on the first panel of the day "Scaling with NoSQL" and his whole attitude seemed to say "memcached is all you'll ever need and these guys next to me are just overdesigning hacks". His answers were short and his tone was quite condescending even when addressing audience questions. Not a very good first impression of him. But luckily today wasn't my first impression of memcached as I was pointed in its direction just last week by a Disney colleague. My research before today has me intrigued about using it as a replacement for ehcache as a second-level cache provider to Hibernate which we use as an ORM in one system at ESPN right now.

Document Oriented Databases (Riak, CouchDB, MongoDB)
Wow, this is a subgroup of NoSQL technology that I had heard of in passing, but I was really unaware of what problem they were trying to solve. Riak had the best answers for scaling and operational ease. With homogeneous nodes and consistent hashing, Riak promises that adding and removing nodes is seamless. CouchDB and MongoDB sounded like a 'me too' answer, so I'm interested to find out what that really means for each, or better yet what it doesn't mean. But the concept of a document-oriented database meshes really well with ESPN's current fantasy user database. Our fantasy user profiles are stored in a traditional RDBMS as serialized maps of maps, one row for each user. Since it's serialized to a BLOB column, it's completely opaque to reporting and analytics. To keep that model but have vendor support for divining information and having transparency into it sounds exciting. I really need to look into these. Riak definitely won this round of first impressions.

Tokyo Cabinet
This was a technology I was referred to by a colleague, and I read through their site last week. I was far from impressed then since it seems much too low-level for my taste, similar to my impressions of Carbonado, which we use at ESPN. The lightning talk by Flinn Mueller got me a little more interested. He seems to be doing interesting things, but from an analytics and reporting perspective. He was vague on how he loads the data from his primary store and what the scale of the data is, so my first impression: it's a toy. I'm sure that's an unfair characterization but I'm not trying to be fair tonight. But honestly, Tokyo Cabinet makes no bones that it punts on horizontal scaling, which is the deal breaker for me.

Hypertable
I looked at Hypertable (as in read their website) about 18 months ago on the suggestion of a colleague when discussing HBase around the same time. This conference didn't change my opinion, which is "It's HBase but written in C". It doesn't seem to bring anything else to the table which to me is a blocker. JVM implementations are available for all the operating systems I use and so I don't like the idea of needing to find the right binary to download for a given box. When it comes to Java vs. C, I choose Java but I'm also extremely biased as I've been a Java developer nearly my entire career.

Full-stack JavaScript
This was my favorite of the lightning talks, and possibly my favorite of the conference. It felt a little tangential to the NoSQL topic, mainly because Jim Wilson covered more than just data storage. The idea: what if you could use JavaScript on your server and in your client, and use JSON for talking between the layers and as the storage format? Crazy, right? I say brilliant. His few slides, which dissected each of today's popular stacks by how many languages you need to learn (Java, XML, etc.) as well as the various impedance mismatches between layers (ORM, object marshaling to JSON or XML, etc.), were mildly embarrassing. "ORM is an antipattern" was an enlightening take on something I've accepted as necessary. Full-stack JavaScript is something I'll be lusting after for a long time, especially since he made it sound so attainable with node.js, Rhino and MongoDB. As soon as his slides are online I'll be linking to them as well as passing them around the office.

HBase
Well, I saved HBase for last. It's the one I've had the most experience with, though that experience can still be measured in hours. As I hinted at earlier, this conference gave me the first impression that HDFS is a weight around the neck of HBase. I was surprised to get that feeling from the room, since my impression has been purely positive so far. It is also getting a lot of flak for the 'single point of failure' problem associated with the current HDFS architecture's NameNode. Apparently performance is a dog since it was "only" designed to be highly distributed with no promise of when you'll get your data. This burden seems to carry over to HBase. But after talking to Ryan Rawson one-on-one at the end of the HBase lab, it's clear he is of the strong opinion that it's getting a bad rap. He also makes very convincing arguments about the scale of what HBase is currently doing in real production environments vs. competitors like Cassandra. It's very persuasive and you can read more of the details in a very active thread I kicked off on the HBase user group earlier this week.

Conclusion
HBase is still the front-runner of my personal candidates for a NoSQL option for ESPN, as it has been for a long time. Cassandra's design choice of eventual consistency is a little scary to me because I don't know yet how to design for it, not because it is an inherently bad choice. Document-oriented databases just made a big blip on my radar. Memcached is interesting if I want to stick with a traditional ORM-based architecture. Tokyo Cabinet and Hypertable are all but off my radar. And the lusty vixen of them all is a full-stack JavaScript architecture.

Disclaimer: Though I mention my employer ESPN in this post, these are my own personal opinions and don't represent the opinions of the company. The final decision on this stuff is "above my pay-grade" as they say.