Python Rocks! and other rants
Weblog of Kent S Johnson

2004-10-28 09:15:44

Jython + dom4j = High octane development

I have been working on a project called Meccano2 since October. Meccano2 is written primarily in Jython, the Java implementation of Python. The Meccano2 data model is implemented using dom4j, a Java framework for working with XML data.

I have used both Jython and dom4j before and found them to be very useful tools. Now, in Meccano, I am using them both together. The combination is extraordinarily powerful and flexible. It is so good, I just have to share.

Introduction to Meccano2

Meccano is a specialized editor for internal use at my work. It uses XML for its native data format. (Meccano1 used a domain-specific data model.) Meccano has a custom tree display and domain-specific editors for the data. Both the tree display and the editors have been extensively customized to meet user requirements.

Meccano1 was written entirely in Java. The existing Meccano1 codebase is incorporated into Meccano2 as a library. The current Java code base has 154 modules (including modules added for Meccano2).

The Meccano2 application is written in primarily in Jython with a Swing GUI. It makes extensive use of Java libraries. Currently Meccano2 has about 6500 lines of Jython code in 62 modules.

Jython

Jython is an implementation of the Python language in Java. There are many reasons why I think Python is a great language. For the purposes of this paper I will focus on just a few.

Python has strong support for data structures

Lists and dictionaries are built in to the language. It is very easy to create and use ad-hoc, heterogeneous data structures. Iteration and lookup for both lists and dictionaries are supported by the syntax of the language. This benefits all kinds of programs because lists, dictionaries and compound data structures are useful everywhere.

This strong support for structured data really shines in data-driven programs, because it is easy to specify and use the actual data. Complex data structures can be created statically, in declarations, or dynamically, in code. Unlike XML or other text-based data formats, no user-written parser is needed - the Python runtime parses the data for you and yields a native data structure. Using the data is supported by the core language.

Python has first-class functions

In Python, you can have standalone functions (not part of a class). Functions can be put in a list and called later, so a data structure can include callback functions. And a function can return another function as a value, so you can create a function that binds some variables while leaving others unassigned.

For example suppose you have a function adder that takes two arguments and returns their sum:

def adder(x, y):
    return x + y

Now you can make a function factory - given a value n, it returns a function of one argument that adds n to its argument:

def makeAddN(n):
    return lambda x: adder(x, n)

So a call makeAddN(5) returns a function that adds 5 to its argument.

This is a little hard to explain but it is very powerful. Do you ever miss function pointers from C or C++? Python can do everything you can do with a function pointer and more. I will show you a real-world example of this below.

In addition to Python's benefits, Jython provides easy access to Java libraries, both Java standard APIs (Swing), third party libraries (dom4j, Velocity, Jetty, etc) and our own Java code (Meccano1 codebase with additions for Meccano2)

dom4j

Meccano uses dom4j to create its runtime data model. dom4j is "an Open Source XML framework for Java. dom4j allows you to read, write, navigate, create and modify XML documents. dom4j is seamlessly integrated with full XPath support." dom4j is similar to W3C DOM but it is designed specifically for Java. As a result it has a simpler interface than DOM. dom4j also has an event-driven parser similar to SAX but at a higher level of abstraction.

dom4j's integrated XPath engine is its killer feature. It allows many types of queries on the data to be done with one or two lines of code. For example, you can easily find, count or check for the existence of nodes matching an XPath expression. XPath expressions can look at the name, value, attributes and children of a node. For example to find all LearningPoints whose type attribute has the value "software_simulation" and which contain a child slide whose type is "exercise", you could use the expression

node.selectNodes('LearningPoint[@type="software_simulation" and Slide/@type="exercise"]')

Using a generic data model means the model is homogeneous in the sense that every element supports the same operations. The homogeneous data model allows tools to be applied consistently to the whole model and facilitates creating data-driven engines that operate on the data. Meccano has several examples of such engines including a module that builds tree model of the data for visual display and a word count engine.

Synergy

The combination of Jython's expressiveness and flexible data structures with integrated queries against the data model using XPath allows some types of problems to be solved with remarkable ease.

Example: Meccano audit engine

The Meccano audit engine takes advantage of Jython's flexible data and first-class functions, and dom4j's integrated XPath engine, to make a very flexible, table-driven course audit tool.

The core of the engine is a recursive data structure consisting of XPath expressions and audit rules. The structure can be defined by these production rules:

A RuleItem is either

  • A callable (a function or function-like object) which is called with the current dom4j node as its only argument
  • A pair (tuple) containing an xpath expression and a RuleItem to apply to all the nodes that match that expression, or
  • A list of zero or more RuleItems

The function which applies a set of rules is short and sweet - just 12 lines of code. It is passed a RuleItem structure and a dom4j node to apply the rules to. At each step it looks to see if it has a callable, a pair or a list, and takes the appropriate action:

def walkByRule(ruleItem, data):
    if callable(ruleItem):
        # This is an actual rule, just call it on the current data
        ruleItem(data)

    elif type(ruleItem) == type( () ):
        # ruleItem is a tuple ( xpathexpression, childRuleItem )
        # childRuleItem is applied to each node selected by the xpath expression
        childExpr, childRuleItem = ruleItem
        for child in data.selectNodes(childExpr):
            walkByRule(childRuleItem, child)

    elif type(ruleItem) == type( [] ):
        # A list of rules, evaluate it recursively
        for rule in ruleItem:
            walkByRule(rule, data)

    else:
        print 'Unknown rule type:', ruleItem

Here is a short snippet of the actual audit rules table. At this point the current node is a Learning Point. The first part of the snippet is an xpath expression that selects only Learning Points whose type is "summary". This is followed by a list containing two rules. The first rule evaluates an XPath expression againt the current node. The particular expression checks to see if the summary point is the last point. The second rule checks for the existence of an exercise point in the same topic, which is not allowed.

('.[@type="summary"]',  [
    xpathRule('count(following-sibling::LearningPoint) = 0',
       'Summary Learning Point must be the last Learning Point in a Topic'),

    limit('count(../LearningPoint[@type="software_simulation" and Slide/@type="exercise"]')', 0,
       'Topic with Exercise Point must not contain a Summary Learning Point'),
] ),

Note that the rules are actual function calls. How is this possible? The rule engine requires that the rules be instances of functions with one argument.

Well, xpathRule is a function that returns a function of one argument. To see how this works, first look at the function that implements the actual rule. It is a pretty straightforward function that evaluates an XPath expression against a data node and logs an error if the result is not 'true':

def _xpathRuleImpl(data, xpathExpr, desc):
    ''' data.valueOf(xpathExpr) must be 'true' '''
    result = data.valueOf(xpathExpr)
    if not result == 'true':
        _logError(desc, data)

To turn this function of three arguments into a function of one argument, we create a new function that binds the xpathExpr and desc args and just needs the data argument. That is exactly what xpathRule does:

def xpathRule(xpathExpr, desc):
    return lambda data: _xpathRuleImpl(data, xpathExpr, desc)

So the call to xpathRule in the audit data actually creates the function that will be called by the rule engine. The call to limit() does much the same thing.

The complete table of audit rules contains over 100 rules. They are all created using the same simple building blocks. Many of the rules reuse xpathRule, limit and other generic rules. In a few cases, special-purpose functions check a single specific rule.

In Meccano, the same core rule engine is used to create word counts. In this case, the rules at the leaves of the rule data are instructions for how to count the text in the corresponding node. Similar techniques are used in the part of Meccano that builds the visual representation of the dom tree for display in the GUI.

I hope these examples have given you some idea of the power of Jython and dom4j when used together for data-driven programming.

Detailed example

Here is a complete DomUtil module containing walkByRule and an extensive example:

''' Module DomUtil '''
import org.dom4j as dom


def walkByRule(ruleItem, data):
  '''
  walkByRule does a specialized tree walk of a dom tree.
  The walk is specified by xpath expressions in a Python structure.
  The actual work is done by callback functions embedded in the structure.

  The arguments to walkByRule are the rule structure and the top-level
  dom node to which it should be applied.

  ruleItem contains the actual rules. This is a recursive data structure.
  Here is the production rule for the structure.
  Note that the parentheses and brackets are part of the Python syntax,
  not part of the rule syntax, and the distinction
  between lists and tuples is significant.

  RuleItem := [ RuleItem* ] | ( XPathChildExpression, RuleItem ) | callable

  In English:
  A RuleItem is either
  - A list of zero or more RuleItems
  - A 2-tuple containing an xpath expression and a RuleItem to apply to all the nodes that
    match that expression, or
  - A callable which is called with the current node as its only argument

  data is the top-level dom node. So the walk starts by applying the given ruleItem
  to the given data.

  Here are some examples. First some data to play with:

  >>> data = """
  ... <root>
  ...     <branch id="1">
  ...         <leaf type="text">Leaf text</leaf>
  ...         <leaf type="number">123</leaf>
  ...     </branch>
  ...     <branch id="2">
  ...         <leaf type="text">Leaf text 2</leaf>
  ...     </branch>
  ... </root>
  ... """
  ...

  >>> doc = dom.DocumentHelper.parseText(data)
  >>> root = doc.getRootElement()


  The simplest use of walkByRule is to pass a callable as the rule.
      walkByRule(fn, node)

  is the same as

      fn(node)

  >>> def printName(node):
  ...     print node.getName()
  ...

  >>> walkByRule(printName, root)
  root

  This is not very interesting by itself, but callables will be the leaf nodes of
  a more complex rule structure.


  If a rule is a tuple, the first element must be an XPath expression.
  It is used to select nodes from the current target. The full dom4j XPath
  syntax is supported.

  The second element is a rule to evaluate against the selected nodes.

      walkByRule( (xpath, rule), node)

  is the same as

      for child in node.selectNodes(xpath):
          walkByRule(rule, child)

  For example,

  >>> rule = ('branch', printName)
  >>> walkByRule(rule, root)
  branch
  branch


  If the rule is a list, each element of the list must be a rule.
  The rules are passed to walkByRule using the current target node.

      walkByRule( [ rule1, rule2 ], node)

  is the same as

      walkByRule(rule1, node); walkByRule(rule2, node)

  The component rules can themselves be callables, tuples or lists.
  We can apply both rules from the preceding two examples by putting them in a list:

  >>> rule = [ printName, ('branch', printName) ]
  >>> walkByRule(rule, root)
  root
  branch
  branch


  The combination of list rules and tuple rules is very powerful.
  Tuple rules allow you to dig down into the dom tree to get to a node of interest.
  List rules allow you to apply many rules to a node.

  Here is a more complex example:

  >>> def printInt(node):
  ...     print 'Integer value:', int(node.valueOf('text()'))
  ...

  >>> def printText(node):
  ...     print 'Text value:', str(node.valueOf('text()'))
  ...

  >>> rule = [
  ...     printName,
  ...     ('branch', [
  ...         printName,
  ...         ('leaf[@type="text"]', printText),
  ...         ('leaf[@type="number"]', printInt),
  ...     ] ),
  ... ]
  ...

  >>> walkByRule(rule, root)
  root
  branch
  Text value: Leaf text
  Integer value: 123
  branch
  Text value: Leaf text 2


  Notice that in all cases the value in the rules for a callable rule is
  a reference to the actual function - printName - not a call of the function - printName().
  A callable rule must be a reference to a function of a single argument.

  You may find that you have several similar rules and you would like to pass more than
  one argument to the rule. You can do this by defining a function that is a closure -
  it binds the extra arguments together with the actual implementation function
  to make a function of one argument.

  For example, here is a test function that compares the node text to an argument:

  >>> def containsTextImpl(node, text):
  ...     if node.getText().find(text) >= 0:
  ...         print 'Found', text
  ...     else:
  ...         print 'No', text
  ...

  This function binds a text value and containsTextImpl into a function of one argument:

  >>> def containsText(text):
  ...     return lambda node, text=text: containsTextImpl(node, text)
  ...

  To use this, put a _call_ to containsText in the rule:

  >>> rule = ('//leaf', [ containsText('Leaf'), containsText('123') ] )
  >>> walkByRule(rule, root)
  Found Leaf
  No 123
  No Leaf
  Found 123
  Found Leaf
  No 123


  '''

  if callable(ruleItem):
      # This is an actual rule, just call it
      ruleItem(data)

  elif type(ruleItem) == type( () ):
      # ruleItem is a tuple ( xpathexpression, childRuleItem )
      # childRuleItem is applied to each node selected by the xpath expression
      childExpr, childRuleItem = ruleItem
      for child in data.selectNodes(childExpr):
          walkByRule(childRuleItem, child)

  elif type(ruleItem) == type( [] ):
      # A list of rules, evaluate it recursively
      for rule in ruleItem:
          walkByRule(rule, data)

  else:
      print 'Unknown rule type:', ruleItem



if __name__ == "__main__":
  import doctest
  import DomUtil
  doctest.testmod(DomUtil)
 
© Kent S Johnson Creative Commons License

Comments about life, the universe and Python, from the imagination of Kent S Johnson.

kentsjohnson.com

Weblog home

All By Date

All By Category

Essays

XML-Image

BlogRoll