FOSS Release Management Examined – pick a release interval and the rest will follow
This presentation reflects the ideas presented in this PhD thesis.
A great presentation by Kirk Wylie. He didn’t explicitly make the point of the presentation about how RESTful architecture solves client configuration headaches, but he definitely alluded to it.
Consider what browser agents do today. They get their HTML and then query the document to see which “mode” they need to process it in. This is typically based on the DOCTYPE in the HTML; see https://developer.mozilla.org/en/Mozilla’s_DOCTYPE_sniffing for an example from one browser.
HTML is built on the principle of “if I don’t understand it, I’ll ignore it.” Browsers also try their best to render the intent. Anyone involved in those projects understands how much effort is required to overcome junk input. Making REST services more accepting requires a change in development style. application/xml is descriptive enough for opaque XML blobs, but I would expect HTTP 415 more often than not. For this, creating media-types makes sense (eschewing version) and I would create a media-type, such as application/vnd.example.profile or application/profile+xml, and rely on the content to indicate the version. A DOCTYPE or a namespace could serve the role for detecting a version, requiring an inspection of the document before processing it.
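As a rough illustration of inspecting the document before processing it, the sketch below reads only as far as the root element and reports its namespace, which a service could treat as the version marker. The class name VersionSniffer and the urn:example namespaces are made up for this example; the parsing uses the standard StAX API.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: peeks at the root element to find a version marker.
class VersionSniffer {

    /** Returns the namespace URI of the root element, or "" if it has none. */
    static String rootNamespace(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                // Skip comments, processing instructions, etc. until the root element.
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        String ns = reader.getNamespaceURI();
                        return ns == null ? "" : ns;
                    }
                }
                return "";
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }
}
```

Because the reader stops at the first start tag, the rest of the payload is never parsed just to decide which version arrived.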
What does this look like in practice? Let’s start by examining the typical approach using XSD. Assume that you have an XML document; for example, a list of customers. In XML-Schema, you would likely have a ns0:CustomersType that contains 0..many ns0:CustomerType elements. Traditionally, this is mapped to a collection of type customer (List for those familiar with Java) and marshaled up to a handler. Abstraction is good(tm) the developer says, I won’t ever be bothered by invalid input…sweet! But not sweet once the app is deployed. Someone revs the schema, perhaps simply changing the namespace, so ns1:CustomerType and ns0:CustomerType are no longer equivalent even though their shape is exactly the same! This means an entire rev of the application is required to accommodate something as trivial as a change in namespace.
Let’s say the developer threw out the marshaling framework and worked directly at the request/response level where they are able to inspect the byte stream. Now they could take the request and stuff it into an XSD validating parser, but this won’t buy them anything beyond the recently discarded framework. Instead, using XPath, a developer finds all the CustomerTypes (e.g. //Customer) and processes each element, regardless of its location in the document. The code is not as short as when the documents are marshaled into an object, but it’s much more accommodating. It also avoids the marshaling overhead when only a handful of values are needed.
But wait, doesn’t this boil into a big ball of mud? Perhaps, but only if you let it. It could lead to a huge if/then/else mess, but there are plenty of ways to avoid such branching. Better yet, the service can send a redirect (3xx?) to another service that can handle an unknown or older type.
In short, I would recommend that the handler make its best effort to accommodate the input, trying not to make any assumptions about the structure of the input. This means ditching marshaling stacks and handling the bytes directly, querying the document to find matches based on intent, dumping namespaces, ignoring case, planning for parent/child relationships being more than one generation apart, and when in doubt, finding a meaningful response code in HTTP to signal to the user what went wrong.
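Here is a minimal sketch of that query-first style, using the JDK’s built-in DOM and XPath support. The class name CustomerFinder and the Name child element are assumptions made for illustration; the expression matches on local-name() so that a change of namespace prefix or URI does not break the lookup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical handler helper: finds customers wherever they sit in the document.
class CustomerFinder {

    /** Returns the text of every Name under a Customer, ignoring namespaces and depth. */
    static List<String> customerNames(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // so local-name() strips the prefix
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Match on local names only: ns0:Customer and ns1:Customer both qualify.
            String query = "//*[local-name()='Customer']/*[local-name()='Name']";
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(query, doc, XPathConstants.NODESET);

            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(nodes.item(i).getTextContent().trim());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not read customer document", e);
        }
    }
}
```

A Customer nested under some new wrapper element is still found, which is exactly the tolerance the marshaled version lacks.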
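One way to avoid the if/then/else mess is a lookup table of handlers keyed by whatever marks the version (here, a namespace string). This is only a sketch: the VersionDispatcher class, the string-based responses, and the choice of 307 for the redirect are my assumptions, not something HTTP mandates for this case.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical dispatcher: one handler per known version, no branching chain.
class VersionDispatcher {
    private final Map<String, Function<String, String>> handlers = new HashMap<>();
    private final String fallbackLocation; // assumed URL of a service for other versions; may be null

    VersionDispatcher(String fallbackLocation) {
        this.fallbackLocation = fallbackLocation;
    }

    void register(String namespace, Function<String, String> handler) {
        handlers.put(namespace, handler);
    }

    /** Returns a status line plus detail; a real service would build an HTTP response. */
    String handle(String namespace, String body) {
        Function<String, String> handler = handlers.get(namespace);
        if (handler != null) {
            return "200 " + handler.apply(body);
        }
        if (fallbackLocation != null) {
            // Unknown or older type: point the client at a service that may understand it.
            return "307 Location: " + fallbackLocation;
        }
        return "415 Unsupported Media Type";
    }
}
```

Supporting a new version then means registering one more handler rather than editing a growing conditional.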
My loose analysis of the issue is that most enterprise developers are dealing with new types that flow directly from their code. Tools bang out a schema that looks like structured data. The class/record fidelity is very close. Transforming from angle brackets to code, using a framework, is comfortable for a rank and file enterprise developer. What happens after this is where trouble begins. A small change to the class results in a mismatch and the need to rev the version of the XML-Schema to accommodate the change, hence the question on this mailing list.
Someone made the assumption (correctly, IMO) that media types will not rev all that often in the wild because it limits adoption. image/jpeg, text/html, and application/atomsvc+xml are all nice, but life is different inside the enterprise. Often there are efforts to canonicalise the “core” business objects and this helps, but as it was pointed out earlier, these rev as business definitions change…and they’re typically built on XML-Schema so they’re just bigger types. The vicious reality of versioning is typically delayed, but not for long enough.
Extensions in XML-Schema provide some relief, but these black holes in a schema definition force the developer into an unnatural position of having to query a document for a value instead of using offsets (either numeric or via “getters”). If you throw XML-Schema out, it forces engineers to query the document with the assumption that you’re only interested in what you understand, which I believe is the first step toward HATEOAS. The catch is that by not using XML-Schema you’ll get laughed out of a design meeting in any enterprise.
So where is the middle ground here?
One idea I’ve been kicking around is to look at XHTML and how the XML-Schema works for creating a forward extensible media-format. I can’t point to successful adoption in the enterprise from my experience, but it works (granted not perfectly) for the web. It passes the sniff test from an enterprise point of view because the safety of XML-Schema is there, but the structure of the resulting document is very loose, forcing the developer to query the document. Querying can be done via XQuery or DOM navigation. Because it’s still XML, developers will not find this approach completely foreign.
In thinking about this, application/xhtml+xml becomes the media-type, but it raises the question of how you find the “entity” you’re searching for in this mishmash of angle brackets. I see the question of types being pushed to the rel tags. Does <link rel="mytype-v2"/> still suffer the issues surrounding media-types? HTML5 has an open discussion for adding proposed rel types. This part of the specification seems pretty wide open, so enterprises can go nuts and add their own (versioned) type. I don’t know if this is a better place to allow type proliferation, but it keeps the discussion away from the Content-Type HTTP header.
Enough dragging of feet…it is time to learn Ruby and RoR so I can find out what all the excitement is about. Out of the gate, I tried to use the stock Rails provided with the OS X distro. That was a quick lesson in package rot. Doing a gem update rails left it in a broken state where rake would not work.
I found this nifty guide, and built rails old school. Everything works like a champ now.
It definitely reminded me of my days building PHP 3/4 on Apache. Good old ./configure && make.
I came across it while reading Sam Ruby’s thoughts on HTML5 and RDFa. I like RDFa because it allows for using one presentation format while hiding another within it. I’m sure those that like steganography also like this approach.
Having used semantic technologies in anger (i.e. I’m comfortable using them to solve actual problems, not just hypothetical), I find this view on semantic reasoning quite refreshing, albeit hard hitting and pretty dim for those who promote the virtues of the semantic web and what it will deliver.
I haven’t read the rebuttals so any links you throw my way I’ll gladly read.
Atwood’s Law: any application that can be written in JavaScript, will eventually be written in JavaScript.
Jeff Atwood has a great blog called Coding Horror and is a founder of serverfault.com and stackoverflow.com, two very cool Q&A sites.
I find this principle of Least Power to be completely true and a major motivator in my design decisions. RESTful architecture fits into this camp. However, that doesn’t mean you can be sloppy as a developer; implementing REST faithfully takes quite a bit of energy, though not quite as much as WS-*, and it benefits from the “web multiplier” that loosely coupled web applications enjoy.
PubSub is nothing new, but getting it to work over HTTP has always been an interesting endeavor. Publication via syndication has been achieved on a massive scale with RSS and Atom, relying on a client’s ability to poll the server for new content. Services like FeedBurner (owned by Google) provide mass syndication of a particular feed. They provide the benefit of offloading the bandwidth for your feed by peeking, well…a thorough inspection really, of those who are interested in your content.
Those that work with me know that I’m a proponent of polling until it’s absolutely necessary to use pub/sub, and even then I’d argue that you can still do it with polling. The reason why I keep this stance is that I’ve waded through many JMS implementations that tend to abstract away the pub/sub architectural style into a request/response RPC style, effectively ignoring the benefits. I also observed that the primary reason for this inefficiency was a lack of understanding by the developer. This isn’t meant as a slam, because new architectural styles take time to understand and until they’ve been used in anger, they haven’t been stressed. I remember when I first tried to apply REST to a project and how much it looked like WS-*/RPC.
What pubsubhubbub provides is a way to continue the RPC-style polling while consolidating the multiple requests into one. It also provides a callback mechanism. Callbacks on private networks from third parties have always been problematic because of firewall restrictions, but pubsubhubbub provides a nice model to control the callbacks to a single server (or at least a controlled number of managed servers) and the rest of the clients can continue to bang away on the internal server, or more appropriately the internal server can bang away on all those that have subscribed to the topic. The BIG difference is that the client now must be a server to take unknown requests. Since most developers understand request/response from the server’s perspective, I think this is easier for developers to pick up.
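To make “the client now must be a server” a little more concrete, here is a sketch of the decision a subscriber’s callback has to make when the hub verifies a subscription. As I read the spec linked below, the hub calls the callback with hub.mode, hub.topic and hub.challenge parameters and expects the challenge echoed back in a 2xx response, or a 404 if the subscriber does not agree; treat those details as my reading rather than gospel. The SubscriptionVerifier class and the topic URL are invented for the example.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical callback logic for a pubsubhubbub subscriber.
class SubscriptionVerifier {

    /**
     * Decides how to answer a hub's verification request.
     * Returns the challenge to echo in a 2xx body, or empty to signal a 404.
     */
    static Optional<String> verify(Map<String, String> query, Set<String> wantedTopics) {
        String mode = query.get("hub.mode");
        String topic = query.get("hub.topic");
        String challenge = query.get("hub.challenge");

        boolean knownMode = "subscribe".equals(mode) || "unsubscribe".equals(mode);
        if (!knownMode || challenge == null || !wantedTopics.contains(topic)) {
            return Optional.empty(); // we never asked for this, so refuse
        }
        return Optional.of(challenge);
    }
}
```

The check against wantedTopics is what keeps a stranger from subscribing your server to feeds you never requested.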
http://pubsubhubbub.googlecode.com/svn/trunk/pubsubhubbub-core-0.1.html
While digging around pubsubhubbub, I came across webhooks.org. There is nothing terribly exciting about the technique because it is simply following HTTP, but the difference is that the callback is potentially hitting the “client”, something that isn’t normally done. Again, the pesky firewall comes into play when callbacks need to move from a less secure to a more secure environment where trust hasn’t been established. However, combining pubsubhubbub with the pattern of webhooks could allow engineers to take advantage of this style of notification.
As I commented previously, it is time to brush up on my algorithms. The place to start is the trusted, but not too savvy, insertion sort. The runtime for this sorting algorithm is O(n²).
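public static <T extends Comparable<? super T>> void sort(T[] array) {
    // Everything to the left of j is already sorted.
    for (int j = 1; j < array.length; j++) {
        T key = array[j];
        int i = j - 1;
        // Shift larger elements one slot to the right. The strict > 0 test
        // leaves equal elements where they are, which keeps the sort stable.
        while (i >= 0 && array[i].compareTo(key) > 0) {
            array[i + 1] = array[i];
            i--;
        }
        array[i + 1] = key; // drop the key into the gap
    }
}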
I chose to make this generic using Java generics. This requires that the element type implement the Comparable interface so I can invoke compareTo. Luckily the semantics of the compareTo method work for insertion sort. Since insertion sort is a stable sorting algorithm, I needn’t worry about duplicates.
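To see what stability buys you with duplicates, here is a small variant that takes a Comparator instead of relying on Comparable. The StableInsertionSort class name is mine; the loop is the same insertion sort, just with the ordering passed in.

```java
import java.util.Comparator;

// Illustrative variant: insertion sort driven by an explicit Comparator.
class StableInsertionSort {

    static <T> void sort(T[] array, Comparator<? super T> order) {
        for (int j = 1; j < array.length; j++) {
            T key = array[j];
            int i = j - 1;
            // Only move elements that are strictly greater than the key,
            // so elements that compare equal keep their original order.
            while (i >= 0 && order.compare(array[i], key) > 0) {
                array[i + 1] = array[i];
                i--;
            }
            array[i + 1] = key;
        }
    }
}
```

Sorting {"bb", "a", "cc", "dd", "e"} by length with Comparator.comparingInt(String::length) leaves "bb", "cc", "dd" in their original relative order, and "a" stays ahead of "e".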
For nearly sorted arrays, this algorithm is extremely fast since the inner while loop is only executed for a few clicks of the cpu to correct the sort order before moving on. Definitely a good property to remember when dealing with nearly sorted data.
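A quick way to convince yourself of this is to count how many times the inner loop actually shifts an element. The sketch below does that for int arrays; the InsertionSortShifts class is just a throwaway for the experiment.

```java
// Throwaway experiment: count the shifts insertion sort performs.
class InsertionSortShifts {

    /** Sorts a copy of the input and returns how many element shifts were needed. */
    static long countShifts(int[] input) {
        int[] a = input.clone(); // leave the caller's array alone
        long shifts = 0;
        for (int j = 1; j < a.length; j++) {
            int key = a[j];
            int i = j - 1;
            while (i >= 0 && a[i] > key) {
                a[i + 1] = a[i];
                i--;
                shifts++;
            }
            a[i + 1] = key;
        }
        return shifts;
    }
}
```

An already sorted array costs zero shifts, one swapped neighbouring pair costs a single shift, and a fully reversed array of n elements costs n(n-1)/2, which is where the O(n²) worst case comes from.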
I’ve been very interested in non-SQL-based architecture ever since I started programming Java in 2001. Thinking in objects for business processing was a nice change from thinking only in SQL and stored procedures. It started with TopLink, one of the first commercial ORM tools I used on a COBOL/Java conversion project. Generally, I trust the access patterns encoded by the ORM are more efficient than my own hand-crafted version.
In the last year and a half, I’ve been reaching for Python as a quick prototype tool to get my projects up and running. Again, I kept wanting to reach for an ORM tool in Python, but they’re all relatively light. It is often easier to just jam out the SQL and move on.
So why bag on ORMs and SQL? The reason is that they have generated architectures that scale up to the limits of the database. These are typically shallow compared to the needs of today’s large scale internet sites. A favorite blog, High Scalability, chronicles the toils of many large scale internet sites and what they had to do to scale their infrastructure. The darlings of these success stories are typically *not* the database, but tools like memcached, reverse proxies, and/or sharding techniques. The database is not forgotten, it’s just sharing the spotlight. Alternative data storage engines are starting to emerge. Tokyo Cabinet, Voldemort, and Cassandra are examples of such “new” databases.
Using such technologies in lieu of SQL puts a heavier emphasis on the programmer’s ability to substitute for what has traditionally been done in the database. However, the substitute can be tailored to the specific requirements of their environment and business problem. Problems like sorting must be solved again, perhaps modified slightly to exploit certain properties that are true in their environment but not in the general case.
Understanding your algorithms is more important now than it was when you could simply point to the database. It’s easy to talk about big O notation for quicksort (O(n lg n), but did you know the worst case is O(n²)?), but what makes it perform better than mergesort, which is O(n lg n) too? Why did Sun decide to use a modified mergesort for Arrays.sort()? It’s been some time (15 years) since I last evaluated various sorting algorithms. I’m long overdue for a refresher, how about you?
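If the O(n²) worst case for quicksort sounds abstract, a deliberately naive version makes it easy to observe. The sketch below always picks the last element as the pivot and counts comparisons; the QuickSortCount class is my own toy and not how any library implements it.

```java
// Toy quicksort with a fixed last-element pivot, instrumented to count comparisons.
class QuickSortCount {
    private static long comparisons;

    /** Sorts a copy of the input and returns the number of comparisons made. */
    static long countComparisons(int[] input) {
        int[] a = input.clone();
        comparisons = 0;
        quicksort(a, 0, a.length - 1);
        return comparisons;
    }

    private static void quicksort(int[] a, int lo, int hi) {
        if (lo >= hi) {
            return;
        }
        int pivot = a[hi]; // naive choice: always the last element
        int i = lo;
        for (int j = lo; j < hi; j++) {
            comparisons++;
            if (a[j] < pivot) {
                swap(a, i, j);
                i++;
            }
        }
        swap(a, i, hi); // pivot lands in its final position
        quicksort(a, lo, i - 1);
        quicksort(a, i + 1, hi);
    }

    private static void swap(int[] a, int x, int y) {
        int tmp = a[x];
        a[x] = a[y];
        a[y] = tmp;
    }
}
```

Feed it an already sorted array of 100 distinct values and every partition is as lopsided as possible, giving 99 + 98 + … + 1 = 4950 comparisons, i.e. n(n-1)/2. That is the quadratic case that smarter pivot selection is meant to dodge.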