Jawk - Compiler Module

Jawk supports compilation of AWK scripts to JVM bytecode. This release provides it through an implementation of the AwkCompiler interface (see the Javadoc for details on this interface). Download the latest Jawk release if your version does not support compilation.

Jawk accepts the -z argument to compile the script and the -Z argument to compile and execute the script. You may provide the -o argument to specify the output class name. By default, the name used is "AwkScript". Also by default, Jawk assumes no package name and writes the class file into the present working directory. To override this behavior, use the -d argument. This assigns a package name to the script and creates the script class in the package-named directory.

For the compiler to work, you must have the Apache Byte Code Engineering Library (BCEL) JAR file in your classpath. Download it from the Apache BCEL site; note that the JAR file is embedded within the ZIP file you download. Then add the JAR file to your classpath. Without the BCEL JAR file in your classpath, you will get a NoClassDefFoundError, because the AWK compiler implementation cannot find the BCEL support classes and services it depends on.
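
One way to confirm that BCEL is visible before invoking the compiler is to probe for one of its classes. The sketch below is not part of Jawk: the class name ClasspathCheck and the method isOnClasspath are hypothetical, and org.apache.bcel.generic.ClassGen is used only as a representative BCEL class, on the assumption that the compiler depends on it.

```java
// Hypothetical helper, not part of Jawk: reports whether a class is visible on the classpath.
final class ClasspathCheck {

	// Returns true if the named class can be resolved; the class is not initialized.
	static boolean isOnClasspath(String className) {
		try {
			Class.forName(className, false, ClasspathCheck.class.getClassLoader());
			return true;
		} catch (ClassNotFoundException | LinkageError e) {
			return false;
		}
	}

	public static void main(String[] args) {
		// Assumption: ClassGen is a representative class from the BCEL JAR.
		String bcelClass = "org.apache.bcel.generic.ClassGen";
		if (isOnClasspath(bcelClass)) {
			System.out.println("BCEL is available.");
		} else {
			System.err.println("BCEL not found; add the BCEL JAR to the classpath.");
		}
	}
}
```

Running this check with the same -cp value you intend to pass to org.jawk.Awk shows whether the NoClassDefFoundError described above is likely to occur.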

The script class relies on the Jawk runtime environment for execution. As a result, you must include jrt.jar in the classpath when executing the script class. (Note: jawk.jar includes the classes contained in jrt.jar, so either JAR file will do. However, jrt.jar is smaller and thus the better choice.)

Compiler Usage

To use, type:

For Mac/Unix:
java -cp "jawk.jar:bcel.jar:$CLASSPATH" org.jawk.Awk -Z -f script.awk

For Windows:
java -cp "jawk.jar;bcel.jar;%classpath%" org.jawk.Awk -Z -f script.awk

This compiles script.awk into AwkScript.class and, if there were no errors, executes AwkScript.class. As a side effect of the -Z argument, AwkScript.class remains in the present working directory. Thus, to execute the script again without reparsing and recompiling, type:

For Mac/Unix:
java -cp "jrt.jar:$CLASSPATH" AwkScript

For Windows:
java -cp "jrt.jar;%classpath%" AwkScript
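
The two command lines above differ only in the classpath separator. When a compiled script is launched from another Java program, java.io.File.pathSeparator supplies the correct separator for the current platform. The following is a minimal sketch under stated assumptions: the class name CompiledScriptLauncher is hypothetical, and jrt.jar and AwkScript.class are assumed to be in the working directory (hence the "." classpath entry).

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical launcher, not part of Jawk.
final class CompiledScriptLauncher {

	// Builds: java -cp "<jrtJar><separator>." <className> <scriptArgs...>
	static List<String> buildCommand(String jrtJar, String className, String... scriptArgs) {
		List<String> cmd = new ArrayList<>();
		cmd.add("java");
		cmd.add("-cp");
		// ":" on Mac/Unix, ";" on Windows
		cmd.add(jrtJar + File.pathSeparator + ".");
		cmd.add(className);
		for (String arg : scriptArgs) {
			cmd.add(arg);
		}
		return cmd;
	}

	public static void main(String[] args) throws Exception {
		List<String> cmd = buildCommand("jrt.jar", "AwkScript", args);
		// Run the compiled script as a child process and pass its exit code through.
		Process p = new ProcessBuilder(cmd).inheritIO().start();
		System.exit(p.waitFor());
	}
}
```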

Extensions, such as _sleep, _dump, _INTEGER, etc., are supported. Keep in mind, however, that the arguments which enable most extensions must be provided at compile time. For example, to compile scripts which contain the _dump and _sleep keywords, use:

java -cp "jawk.jar;bcel.jar;%classpath%" org.jawk.Awk -z -x -f script.awk

Executing

java -cp "jrt.jar;%classpath%" AwkScript -h

outputs (within a usage statement) the extensions which are enabled.

Note: the -t extension is the only runtime, not compile-time, extension. Therefore, to use sorted key maps for associative arrays, the -t argument is needed along with the -Z argument, or when executing AwkScript itself.

Command Line Arguments

Compiled scripts take command-line parameters in a similar, but not identical, manner to their interpreted counterparts. For instance, it doesn't make sense for a compiled script to accept the -f argument. Likewise, there is no decompiler from which an AST or the intermediate code could be generated, which eliminates the -S and -s arguments. Most extensions are enabled at compile time rather than at run time, which eliminates the -x, -y and -r arguments. Finally, compiled scripts cannot compile scripts themselves, so the -Z, -z and -d arguments are missing. With all these eliminations, the -o argument is no longer useful. This leaves the following:

Note: Key sorting for associative arrays (-t) could have been a compiled feature rather than a run-time feature. The reasoning was that associative arrays are created relatively rarely, so their creation does not require high throughput. Therefore, -t has been kept as a runtime switch for maximum flexibility.
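
The effect of a runtime switch such as -t can be pictured as a choice of map implementation made each time an associative array is created. The sketch below uses java.util.HashMap and java.util.TreeMap as stand-ins and makes no claim about the classes Jawk uses internally; the names AssocArrays and create are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not Jawk's implementation.
final class AssocArrays {

	// sortedKeys stands in for the -t switch, which is read at run time.
	static Map<String, Object> create(boolean sortedKeys) {
		if (sortedKeys) {
			return new TreeMap<String, Object>();
		}
		return new HashMap<String, Object>();
	}

	public static void main(String[] args) {
		Map<String, Object> arr = create(true);
		arr.put("pear", 3);
		arr.put("apple", 1);
		arr.put("fig", 2);
		// With sorted keys, iteration (as in "for (k in arr)") follows key order.
		System.out.println(arr.keySet()); // prints [apple, fig, pear]
	}
}
```

Because the decision is made once per array creation rather than on every access, deferring it to run time costs little.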

Architecture

It is important to understand how the compiled class file is constructed, especially if you plan on executing the compiled script in a Java application where a direct call is made to the compiled result. Below is a near-identical Java representation of the compiled result:

import org.jawk.jrt.*;
import org.jawk.util.AwkParameters;

import java.util.*;
import java.util.regex.*;
import java.io.*;

public class AwkScript implements VariableManager {

	// use this field as the third argument to the AwkParameters
	// constructor when executing ScriptMain directly

	public static final String EXTENSION_DESCRIPTION = extension-description-string;

	private static final Integer ZERO = new Integer(0);
	private static final Integer ONE = new Integer(1);
	private static final Integer MINUS_ONE = new Integer(-1);

	public static void main(String args[]) {
		AwkScript as = new AwkScript();
		// this is why org.jawk.util.AwkParameters is in jrt.jar ...
		AwkParameters ap = new AwkParameters(AwkScript.class, args, EXTENSION_DESCRIPTION);
		// to send the error code back to the calling process
		System.exit(as.ScriptMain(ap));
	}

	// to satisfy the VariableManager interface

	// Note: field names here correspond to a global_N field
	// which are assigned upon compilation of the script.

	public final Object getARGC() { if (argc_field == null) return ""; else return argc_field; }
	public final Object getCONVFMT() { if (convfmt_field == null) return ""; else return convfmt_field; }
	public final Object getFS() { if (fs_field == null) return ""; else return fs_field; }
	public final Object getARGV() { return argv_field; }
	public final Object getOFS() { if (ofs_field == null) return ""; else return ofs_field; }
	public final Object getRS() { if (rs_field == null) return ""; else return rs_field; }
	public final void setFILENAME(String arg) { filename_field = arg; }
	public final void setNF(String arg) { nf_field = arg; }

	private final Object getNR() { if (nr_field == null) return ""; else return nr_field; }
	private final Object getFNR() { if (fnr_field == null) return ""; else return fnr_field; }

	public final void incNR() { nr_field = (int) JRT.toDouble(JRT.inc(getNR())); }
	public final void incFNR() { fnr_field = (int) JRT.toDouble(JRT.inc(getFNR())); }
	public final void resetFNR() { fnr_field = ZERO; }

	public final void assignField(String name, Object value) {
		if (name.equals("scalar1")) scalar1_field = value;
		else if (name.equals("scalar2")) scalar2_field = value;
		else if (name.equals("scalar3")) scalar3_field = value;
		...
		else if (name.equals("scalarN")) scalarN_field = value;
		else if (name.equals("funcName1")) throw an exception;
		else if (name.equals("funcName2")) throw an exception;
		...
		else if (name.equals("funcNameX")) throw an exception;
		else if (name.equals("assocArrayName1")) throw an exception;
		else if (name.equals("assocArrayName2")) throw an exception;
		...
		else if (name.equals("assocArrayNameM")) throw an exception;
	}

	private JRT input_runtime;
	private HashMap regexps;
	private HashMap pattern_pairs;
	private int exit_code;

	private int oldseed;

	private Random random_number_generator;

	// global_N refers to all the global variables,
	// those which are defined by default
	// (i.e., ARGC, ARGV, ENVIRON, NF, etc.)
	// and vars declared by the script.
	// The _SET_NUM_GLOBALS_ opcode allocates
	// these fields.

	private Object global_0;
	private Object global_1;
	private Object global_2;
	...
	private Object global_N;

	// Call this method to invoke the Jawk script.
	// Refer to the static main method implementation
	// and Javadocs on how to build the AwkParameters.
	// Use the public static String EXTENSION_DESCRIPTION
	// field (within AwkScript) as the third parameter
	// to the AwkParameters constructor to ensure proper
	// extension description in the usage statement.

	public final int ScriptMain(AwkParameters awk_parameters) {

		// local variables

		double dregister;
		StringBuffer sb = new StringBuffer();

		// Field Allocation
		// ----------------
		// Could have been done in the class constructor,
		// but placed here to ensure proper repeat initialization
		// if repeat execution is required
		// within the same JVM instance.  Because
		// if these were within the class constructor, each of these
		// data structures / int values would have to be
		// reinitialized in some way anyway.

		input_runtime = new JRT(this);	// this = VariableManager
		regexps = new HashMap();
		pattern_pairs = new HashMap();
		oldseed = 0;
		random_number_generator = new Random(null);
		exit_code = 0;

		// script execution

		try {
			///
			/// Compiled BEGIN and input rule blocks code here.
			/// (EndException is thrown when exit() is encountered.)
			///
		} catch (EndException ee) {
			// do nothing
		}

		try {
			runEndBlocks();
		} catch (EndException ee) {
			// do nothing
		}

		return exit_code;
	}

	public void runEndBlocks() {
		double dregister;
		StringBuffer sb = new StringBuffer();

		///
		/// Compiled END blocks code here.
		/// (EndException is thrown when exit() is encountered.)
		///
	}

	// One of these exists for every function definition.
	// Arguments are reversed from its Jawk source.
	public Object FUNC_function_name(Object oN, Object oN-1, ... Object o2, Object o1) {
		Object _return_value_ = null;
		StringBuffer sb = new StringBuffer();
		double dregister = 0.0;

		///
		/// Compiled function function_name code here.
		/// (A return() sets the _return_value_ and falls out of this
		///  function code block.)
		/// (EndException is thrown when exit() is encountered.)
		///

		return _return_value_;
	}

	// The following is created for every optarg version of the function
	// call that exists within the script for this function_name.
	// X > 0 && N > X
	public final Object FUNC_function_name(Object oX, Object oX-1, ... Object o2, Object o1) {
		return FUNC_function_name(null, null, ..., null, oX, oX-1, ..., o2, o1);
	}
}

All italicized variable/function names are placeholders. The actual variable/function names are described in comments which are generally above the placeholders. Please refer to these comments to understand exactly what Jawk uses as the variable/function names.
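
The control flow of ScriptMain can be tried out without Jawk on the classpath. The toy model below is a sketch, not generated code: ToyScript and ExitSignal are hypothetical names, with ExitSignal standing in for EndException. It shows the two properties the listing relies on: state is reallocated on every call, and an exit raised while processing rules still lets the END blocks run.

```java
// Toy model of the ScriptMain control flow; all names are hypothetical.
final class ToyScript {

	// Stand-in for Jawk's EndException.
	static final class ExitSignal extends RuntimeException {}

	private int exitCode;
	private StringBuilder log;

	// State is allocated here rather than in a constructor,
	// so repeated calls on one instance each start clean.
	int scriptMain(boolean exitEarly) {
		exitCode = 0;
		log = new StringBuilder();
		try {
			log.append("BEGIN;");
			if (exitEarly) exit(3);   // models exit(3) inside a rule
			log.append("RULES;");
		} catch (ExitSignal e) {
			// exit() only stops rule processing; END blocks still run
		}
		try {
			runEndBlocks();
		} catch (ExitSignal e) {
			// exit() inside an END block ends the script
		}
		return exitCode;
	}

	private void runEndBlocks() { log.append("END;"); }

	private void exit(int code) { exitCode = code; throw new ExitSignal(); }

	String log() { return log.toString(); }
}
```

Calling scriptMain(true) and then scriptMain(false) on the same instance returns 3 and then 0, with logs "BEGIN;END;" and "BEGIN;RULES;END;" respectively.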

Here are several points that should be considered:
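
One point that benefits from a concrete case is the reversed argument order combined with the optional-argument overloads. The sketch below hand-writes what the listing describes for a hypothetical AWK function weigh(a, b, c); the class name OptArgDemo and the helper num are illustrative, the methods are static only to keep the example short, and the null-as-zero coercion is a simplification of AWK's handling of unset values.

```java
// Hand-written illustration of the FUNC_ naming scheme; not generated by Jawk.
final class OptArgDemo {

	// AWK: function weigh(a, b, c) { return a + 10*b + 100*c }
	// Full version: parameters appear in reverse order (c, b, a).
	static Object FUNC_weigh(Object c, Object b, Object a) {
		return num(a) + 10 * num(b) + 100 * num(c);
	}

	// Overload for calls of the form weigh(a, b): the missing c is passed as null.
	// Per the listing, such an overload exists only for arities the script uses.
	static Object FUNC_weigh(Object b, Object a) {
		return FUNC_weigh(null, b, a);
	}

	// Overload for calls of the form weigh(a).
	static Object FUNC_weigh(Object a) {
		return FUNC_weigh(null, null, a);
	}

	// Toy coercion: an unset (null) value counts as 0 in arithmetic.
	private static double num(Object o) {
		return o == null ? 0.0 : ((Number) o).doubleValue();
	}
}
```

A Java caller therefore has to pass arguments last-to-first: the AWK call weigh(1, 2, 3) corresponds to FUNC_weigh(3, 2, 1).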

Benchmarks

Note: The following benchmark analysis is very informal. All timings are visually observed and estimated to the nearest second.

For the following AWK script:

BEGIN { print fib(30) }
function fib(i) {
	if (i<2) return i
	return fib(i-1) + fib(i-2)
}

it takes 30 seconds to interpret the script, while the compiled script executes in a little over a second. This dramatic result is possible because the script compiles to a collection of repeated (recursive) function calls and simple arithmetic operations, which the JVM performs very quickly.

However, if we run the grep.awk script (supplied in the examples) to count the number of "public" keywords in the Jawk source, the interpreted version takes roughly 17 seconds while the compiled version takes 14 seconds. The gain is slight because grep.awk spends the bulk of its time in IO operations, which are handled almost identically in the interpreted version.
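
For comparison, the fib computation can be written by hand in Java. This is not what the Jawk compiler emits (the generated code operates on Object values through the Jawk runtime); it is only a sketch showing that the workload consists entirely of method calls and arithmetic. The class name FibBench is hypothetical, and no timing is claimed for it.

```java
// Hand-written Java equivalent of the fib AWK script; not Jawk output.
final class FibBench {

	// Same recurrence as the AWK fib(); AWK numbers are doubles, so double is used here.
	static double fib(double i) {
		if (i < 2) return i;
		return fib(i - 1) + fib(i - 2);
	}

	public static void main(String[] args) {
		long start = System.nanoTime();
		double result = fib(30);
		long elapsedMs = (System.nanoTime() - start) / 1_000_000;
		System.out.println((long) result);   // prints 832040
		System.out.println("elapsed ms: " + elapsedMs);
	}
}
```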

The bottom line is that, on average, you'll see a noticeable increase in speed with the compiled version. If the script spends most of its time doing computations and calling AWK functions, then you'll see large gains in execution efficiency. If the script spends most of its time doing IO, then you won't see much gain.

Compilation vs. Interpretation

Even if you don't see much improvement in execution efficiency, it is still worthwhile compiling scripts to JVM. By doing so:

Changes to Jawk

Various changes have been made to Jawk, mainly to accommodate compilation to JVM.

Package Reorganization

Many classes used by both the interpreter and the compiler moved from org.jawk.backend to org.jawk.jrt. The exception to this is AwkParameters. It is the only class in jrt.jar that is not part of org.jawk.jrt (it is part of the org.jawk.util package).

Intermediate Code Changes

The new AwkTuple opcode _THIS_ is added to support function calls from within the compiled script. It is necessary because the JVM operand stack cannot be randomly accessed, and a local FUNC_ call requires the "this" pointer on the operand stack prior to its arguments.
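
The ordering requirement is visible in any instance method call. In the sketch below (the class ThisOnStack and the function FUNC_twice are hypothetical), the comments show the instruction sequence javac produces for the call, which can be checked with javap -c: the receiver is pushed first, then the argument, then the method is invoked.

```java
// Hypothetical class illustrating receiver-before-arguments ordering.
final class ThisOnStack {

	Object FUNC_twice(Object x) {
		return ((Number) x).doubleValue() * 2;
	}

	Object caller(Object arg) {
		// The call below compiles roughly to:
		//   aload_0        push "this" (the receiver) first
		//   aload_1        then push the argument
		//   invokevirtual  ThisOnStack.FUNC_twice(Object)
		return this.FUNC_twice(arg);
	}
}
```

An opcode that pushes "this" ahead of the argument-evaluating tuples gives the compiler a place to emit that first instruction.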

PositionForCompilation

A new Position subinterface, PositionForCompilation, is introduced to Jawk. It does not provide random access into the intermediate code list, because compilation only requires sequential access to the intermediate instructions. Random access is required only when interpreting branch statements, such as _GOTO_ and _IFFALSE_.

Better Runtime Error Reporting

Script line numbers have always been captured within the abstract syntax tree. To use these in better error reporting, AwkTuples now contain line numbers, which are subsequently used by both the Jawk interpreter and compiler modules. Now, the script (if the -f argument is used) and line numbers are reported on stack traces.

Interpreter Stack Leaks

Work on the compiler uncovered several bugs in the interpreter. One major issue was stack leaks, where certain opcodes, such as _REGEXP_PAIR_, _SUB_FOR_DOLLAR_REFERENCE_, and _SUBSTR_, did not pop all of their operands off the operand stack. To guard against these issues, assertions in the AVM check upon termination that the stack is empty. If it is not, the contents of the stack are dumped to stdout and an AssertionError is thrown.
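
The termination check can be sketched as follows. This is an illustration rather than the AVM's code: the class OperandStackCheck and its methods are hypothetical, and the error is thrown explicitly (instead of through the assert keyword) so that the example behaves the same whether or not assertions are enabled.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration of an empty-stack check at termination.
final class OperandStackCheck {

	private final Deque<Object> stack = new ArrayDeque<>();

	void push(Object o) { stack.push(o); }

	Object pop() { return stack.pop(); }

	// Called once at termination: a non-empty stack means some opcode leaked operands.
	void assertEmptyOnExit() {
		if (!stack.isEmpty()) {
			System.out.println("Leaked operands: " + stack);
			throw new AssertionError("operand stack not empty: " + stack.size() + " item(s) left");
		}
	}
}
```

An opcode that pushes two operands but pops only one leaves a single item behind, which this check reports at the end of the run.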

Other Bug Fixes

Several other bugs have been fixed as a result of compiler development:

Also, many thanks to users who reported the following bugs. They have been repaired in this release:


http://jawk.sourceforge.net/