Objective-C I: a high-performance Postscript interpreter

A crucial part of the metaobject product suite is EGOS, which heavily relies on a custom interpreter compatible with Postscript Level 3. This project was initiated for several reasons: as a safeguard against the possibility of Apple discontinuing DisplayPostscript, because our existing Postscript virtualization technique was reaching its limits, and to simplify the extraction of Objective-C objects from the interpreter.

At its heart, Postscript is a stack-oriented, dynamically typed, and highly polymorphic interpreted programming language. Implementing it in terms of Objective-C objects is therefore not just convenient for getting at those objects, but also a natural fit for the language’s semantics.
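
For instance, a polymorphic Postscript operator such as mul could be expressed as little more than a message send, with each number class supplying its own arithmetic. The following is a purely illustrative sketch, not the actual EGOS code; the stack accessors and the multipliedBy: method are made-up names:

  // Illustrative only: a Postscript operator as an Objective-C method.
  // Dynamic dispatch on the operand objects picks the right arithmetic
  // (integer, real, ...), which is exactly what Postscript's polymorphism needs.
  - (void)mul
  {
      id b = [self popOperand];                 // hypothetical operand-stack accessors
      id a = [self popOperand];
      [self pushOperand:[a multipliedBy:b]];    // each number class implements multipliedBy:
  }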

So everything is perfect, right? Not quite. We also need to ensure competitive performance; otherwise, the effort is pointless. How do we determine if the performance meets our standards? Luckily, we have a readily available benchmark: Adobe’s interpreter, used in NeXT’s DisplayPostscript and available as the PS Normalizer on Mac OS X. Let’s evaluate performance using this simple Postscript program:

  %!
  usertime
  0 1 1000000 { 4 mul pop } bind for
  usertime exch sub dup ==
  20 20 moveto /Times-Roman 24 selectfont
  100 string cvs show ( ms) show
  showpage

This program times a loop that performs multiplication one million times. It exercises a significant portion of the fundamental execution machinery of the Postscript language: stack manipulation, procedure invocation, array access (a procedure is essentially an array flagged as executable), looping, and arithmetic. The loop’s duration is measured with the usertime operator, which returns the CPU time consumed in milliseconds.

This test completes in 513 ms (513 ns per iteration) in Preview, which is quite respectable.

1. The Challenge

To test the concept, let’s create an Objective-C equivalent of what the Postscript interpreter handles in this loop. This should provide a reasonable lower limit for the execution time. It’ll be a lower bound because there will be extra interpretation overhead, and Postscript’s semantics are a bit more complex. We require a stack, numerical objects, and some basic arithmetic. Straightforward:

 id stack=[NSMutableArray array];             // the operand stack
 id startcounter=[NSNumber numberWithInt:0];
 id endcounter=[NSNumber numberWithInt:1000000];
 id counter=startcounter;
 id four=[NSNumber numberWithInt:4];
 while ( [counter intValue] < [endcounter intValue] ) {
  int intResult;
  id result;
  [stack addObject:counter];                  // push the loop counter
  [stack addObject:four];                     // push the constant 4
  intResult = [[stack lastObject] intValue] * [[stack objectAtIndex:[stack count]-2] intValue];
  result=[NSNumber numberWithInt:intResult];
  [stack removeLastObject];                   // mul pops both operands ...
  [stack removeLastObject];
  [stack addObject:result];                   // ... and pushes the product
  [stack removeLastObject];                   // pop discards the result
  counter=[NSNumber numberWithInt:[counter intValue]+1];
 }
 

Unfortunately, this takes 4.8 µs per iteration. This means our “lower” bound is nearly 10 times slower than our objective, even before considering interpretation overhead. Clearly, this isn’t sufficient. What if we eliminate all that unnecessary stack manipulation code and use a simple C loop?

  int i;
  id b=[NSNumber numberWithInt:4];
  for (i=0;i < 10000000;i++) {
   id a=[NSNumber numberWithInt:i];
   id c=[NSNumber numberWithInt:[a intValue] * [b intValue]];   // result discarded, like the Postscript pop
  }

2. Mutable State

Objective-C is an imperative object-oriented language, meaning objects can undergo state changes. However, we’ve been treating numbers as immutable value objects, necessitating their recreation from scratch. Allocating objects is roughly 25 times more expensive than an Objective-C message send. So, what happens if we stop allocating new integer objects and instead reuse an existing one, simply modifying its value? It turns out we can’t use NSNumber for this because it doesn’t allow its value to be modified. Therefore, we need a (simple) wrapper class for a single integer.
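
For illustration, here is a minimal sketch of what such a wrapper might look like; the actual MPWInteger class in MPWFoundation is more elaborate, and this sketch only covers what the benchmark below needs:

  // Minimal sketch of a mutable integer wrapper (illustrative, not the real MPWInteger).
  @interface MPWInteger : NSObject
  {
      int value;                              // the wrapped, mutable value
  }
  + numberWithInt:(int)anInt;
  - (int)intValue;
  - (void)setIntValue:(int)newValue;
  @end

  @implementation MPWInteger

  + numberWithInt:(int)anInt
  {
      id integer = [[[self alloc] init] autorelease];
      [integer setIntValue:anInt];
      return integer;
  }

  - (int)intValue                   { return value; }
  - (void)setIntValue:(int)newValue { value = newValue; }

  @end

With a mutable integer like this, the loop no longer has to allocate anything: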

 
   int i;
   id b=[MPWInteger numberWithInt:4];
   id a=[MPWInteger numberWithInt:0];
   id c=[MPWInteger numberWithInt:0];
   for (i=0;i < 10000000;i++) {
    [a setIntValue:i];
    [c setIntValue:[a intValue] * [b intValue]];
   }

That’s more like it: 50ns per iteration is 100 times faster than our initial attempt and 10 times faster than our target. So, leveraging mutable state makes our basic approach viable, at least conceptually. Of course, we now need to reintroduce the stack and incorporate interpretation.

3. Conserving Resources

Unfortunately, it turns out that the interpreter does require fresh instances. While it discards them quickly in most scenarios, it occasionally stores them, meaning we can’t statically reuse objects as we did before.

Instead, we need a way to recycle temporary objects, enabling reuse without significant overhead. The standard approach is maintaining an object pool to fulfill requests for new MPWInteger instances. However, the unpredictable nature of interpreted code prevents us from using the explicit check-in/check-out policy typically required for such pools.

Instead, we make the pool a circular buffer and use the retain count to determine if an object is reusable. When encountering an object in the pool, it can be reused if its retain count is one, indicating the pool holds the only valid reference. If the retain count exceeds one, another entity besides the pool references the object, making it currently unusable. In this case, we need a new instance.
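
A sketch of what that lookup might look like, assuming a fixed-size C array of cached instances; the instance variables and details here are illustrative, not the actual MPWObjectCache code:

  // Sketch of the circular-buffer lookup. Assumed ivars (illustrative):
  //   id *objects;  unsigned capacity, objectIndex;  Class objectClass;
  - getObject
  {
      unsigned slot = objectIndex;
      objectIndex = (objectIndex + 1) % capacity;       // advance the circular buffer
      id candidate = objects[slot];
      if ( candidate != nil && [candidate retainCount] == 1 ) {
          return candidate;                             // cache holds the only reference: reuse it
      }
      // slot is empty or the object is still referenced elsewhere:
      // create a fresh instance and let this slot retain it for later reuse
      [objects[slot] release];
      objects[slot] = [[objectClass alloc] init];
      return objects[slot];
  }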

This logic is encapsulated in the MPWObjectCache class, which is used much like a class, that is, as a factory object for creating instances.

 MPWObjectCache* intCache=[[MPWObjectCache alloc] initWithCapacity:20 
        class:[MPWInteger class]];
 id b=[MPWInteger integer:5];
 int i;
 for (i=0;i < 1000000;i++) {
  id a=GETOBJECT(intCache);                  // new or recycled instance from the cache
  id d=GETOBJECT(intCache);
  [a setIntValue:i];
  [d setIntValue:[a intValue] * [b intValue]];
 }

This code executes in 100 ns per iteration, giving us new or safely recycled objects quickly enough to build on, with confidence that the final result will perform adequately.

4. Results

Running the initial Postscript test program in PostView results in 260ns per iteration. This means our Objective-C Postscript interpreter is almost twice as fast as Adobe’s, at least for this specific workload. While I wouldn’t extrapolate this isolated result to claim EGOS is universally faster, it clearly demonstrates its competitiveness, which was the objective.

The fact that the entire interpreter required a mere 20 KLOC highlights the leverage provided by Objective-C. In comparison, Ghostscript weighs in at over 250 KLOC (excluding drivers).

5. Conclusion

One aspect I’ve always appreciated about Objective-C is its ability to provide both expressiveness and performance. You can develop elegant solutions effectively while retaining the ability to optimize for performance without sacrificing the original solution’s structure.

Object allocation tends to be the most critical factor influencing performance. In this case, managing it with a transparent object-cache yielded an overall performance improvement of around 10-20x for our Postscript interpreter, elevating its performance from unacceptably slow to exceeding the industry standard.

Naturally, this isn’t the only factor, and Postscript interpretation isn’t the only application. Stay tuned for more!
