It was bound to happen sooner or later — a new blog post! (only took me 1,207 days.)
Anyways, I'd like to talk a bit about a project I've been working on to extract dates from their natural language representation. The project is called natty, and I've decided to open source it under an MIT license.
What natty does is attempt to define a formal grammar describing all the crazy ways people refer to dates that are relative (or not) to the current date. Some typical examples are: monday of next week, next tuesday, or the 28th of february at 6pm. The casual observer may suggest a series of regular expressions to do the job, but will soon find that the grammar describing such things is far from regular, and tends to be quite ambiguous.
ANTLR fits the bill nicely, as we can use syntactic predicates to deal with ambiguities (by enabling arbitrary lookahead) and semantic predicates to deal with the parts of our grammar that inch towards context sensitivity.
Defining the grammar is one thing, but actually extracting meaningful information is another. To solve this, I'm using ANTLR's abstract syntax tree (AST) rewrite rules to build an intermediate representation that defines the date as a generic, tree-like structure. Once built, this structure can be walked, invoking common date manipulation methods along the way to arrive at the correct date.
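To make the walking step a bit more concrete, here's a minimal sketch. It is not natty's actual code: the Node class, the node types (RELATIVE_DATE, SEEK_WEEKS, DAY_OF_WEEK), and the helper mondayOfNextWeek are all made up for illustration, and the date math leans on java.time rather than whatever the real walker calls. The idea is just that each node in the tree maps to one small date manipulation, applied in order.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.util.Arrays;
import java.util.List;

public class DateTreeWalker {

    // Hypothetical tree node: a type label, an optional value, and children.
    public static final class Node {
        final String type;
        final String value;
        final List<Node> children;

        public Node(String type, String value, Node... children) {
            this.type = type;
            this.value = value;
            this.children = Arrays.asList(children);
        }
    }

    // Walk the tree, applying one date manipulation per node.
    public static LocalDate walk(Node node, LocalDate date) {
        switch (node.type) {
            case "RELATIVE_DATE":
                // Container node: apply each child to the running date, in order.
                for (Node child : node.children) {
                    date = walk(child, date);
                }
                return date;
            case "SEEK_WEEKS":
                // Move forward (or backward, if negative) by whole weeks.
                return date.plusWeeks(Long.parseLong(node.value));
            case "DAY_OF_WEEK":
                // Snap to the named day within the same Monday-to-Sunday week.
                return date.with(DayOfWeek.valueOf(node.value));
            default:
                throw new IllegalArgumentException("Unknown node type: " + node.type);
        }
    }

    // Hand-built tree standing in for what a parser might emit
    // for "monday of next week" (assumed shape, for illustration only).
    public static Node mondayOfNextWeek() {
        return new Node("RELATIVE_DATE", null,
                new Node("SEEK_WEEKS", "1"),
                new Node("DAY_OF_WEEK", "MONDAY"));
    }
}
```

Calling DateTreeWalker.walk(DateTreeWalker.mondayOfNextWeek(), LocalDate.of(2024, 5, 15)) starts from a Wednesday, jumps ahead one week, then snaps back to that week's Monday. The nice part is that the walker never sees the original text, only the tree.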
This approach has several advantages over a more naive, non-grammar-based approach. The most significant is that the messy parsing code gets written by a code generator instead of a human. We still have to deal with a complex grammar specification, but the grammar is far less complex than the generated code, and should prove far less brittle when making future changes.
Another significant advantage is the theoretical ease with which the parser can be ported to a new language: choose a new ANTLR target, implement a few generic date manipulation methods, and call it a day.