Jawk - AWK for Java
Introduction
Jawk is an implementation of AWK in Java.
Jawk parses, analyzes, and interprets and/or compiles
AWK scripts. Compilation is targeted at the JVM.
Jawk runs on any platform that supports, at minimum,
J2SE 5.
Usage
To use, simply download the application, copy the release jar to the
jawk.jar file and execute the following command:
java -jar jawk.jar {command-line-arguments}
If executing from an environment which does not support the -jar argument,
then you may use the following command instead of the one above:
For Mac/Unix:
java -cp "$CLASSPATH:jawk.jar" org.jawk.Awk {command-line-arguments}
For Windows:
java -cp "%classpath%;jawk.jar" org.jawk.Awk {command-line-arguments}
For Windows with AWK script compilation:
java -cp "%classpath%;jawk.jar;bcel.jar" org.jawk.Awk {command-line-arguments}
or
java org.jawk.Awk {command-line-arguments}
if you already have jawk.jar in your classpath.
For brevity, the document will continue to use the -jar argument version.
To view the command line argument usage summary, execute
java -jar jawk.jar -h
The output of this command is shown below:
java ... org.jawk.Awk [-F fs_val] [-f script-filename] [-o output-filename] [-c] [-z] [-Z] [-d dest-directory] [-S] [-s] [-x] [-y] [-r] [-ext] [-ni] [-t] [-v name=val]... [script] [name=val | input_filename]...
-F fs_val = Use fs_val for FS.
-f filename = Use contents of filename for script.
-v name=val = Initial awk variable assignments.
-t = (extension) Maintain array keys in sorted order.
-c = (extension) Compile to intermediate file. (default: a.ai)
-o = (extension) Specify output file.
-z = (extension) | Compile for JVM. (default: AwkScript.class)
-Z = (extension) | Compile for JVM and execute it. (default: AwkScript.class)
-d = (extension) | Compile to destination directory. (default: pwd)
-S = (extension) Write the syntax tree to file. (default: syntax_tree.lst)
-s = (extension) Write the intermediate code to file. (default: avm.lst)
-x = (extension) Enable _sleep, _dump as keywords, and exec as a builtin func.
(Note: exec enabled only in interpreted mode.)
-y = (extension) Enable _INTEGER, _DOUBLE, and _STRING casting keywords.
-r = (extension) Do NOT hide IllegalFormatExceptions for [s]printf.
-ext= (extension) Enable user-defined extensions. (default: not enabled)
-ni = (extension) Do NOT process stdin or ARGC/V through input rules.
(Useful for blocking extensions.)
(Note: -ext & -ni available only in interpreted mode.)
-h or -? = (extension) This help screen.
Jawk supports all of the standard AWK command line parameters:
- -v name=value - global variable assignments prior to the execution of the script.
- -F fs - input field separator assignment. This is equivalent to FS="fs" prior to its use (by getline or by input rules).
- -f filename - The script filename. If used, a script argument is not expected.
To enhance development and script execution over traditional AWK,
Jawk also supports the following command-line parameter extensions:
- -t - Maintain all associative arrays in key-sorted order. This is implemented by using a TreeMap instead of a HashMap as the backing store for the associative array.
- -c - Write the tuples (generated by the Intermediate Subsystem) to a file, and then halt. If the -o parameter is provided, use its optarg as the filename. Otherwise, write to "a.ai". This file can be used as an argument to -f to avoid the front end and intermediate steps for a particular script. It also provides a measure of script obfuscation.
- -o filename - Override the default output filename for extended parameters -c, -S, -s, -z, and -Z.
- -z - Compile the script to JVM bytecode instead of interpreting it upon Jawk invocation. The compiled result is not executed. Note: compilation requires the Byte Code Engineering Library from Apache (BCEL). The bcel.jar file must be included in your classpath, or the compilation will fail.
- -Z - Compile the script to JVM bytecode instead of interpreting it upon Jawk invocation. The compiled result is subsequently executed. Note: compilation requires the Byte Code Engineering Library from Apache (BCEL). The bcel.jar file must be included in your classpath, or the compilation will fail.
- -d - Target the compilation to a particular package name, resulting in the placement of the resultant compiled classfile into the package directory.
- -S - Dump the abstract syntax tree (constructed by the front end) to a human-readable text file. If the -o argument is not provided, the contents will be dumped into the "syntax_tree.lst" file.
- -s - Dump the intermediate code (tuples) to a human-readable text file. If the -o argument is not provided, the contents will be dumped into the "avm.lst" file.
- -x - (UPDATED) Enable the _sleep, _dump, and exec keywords. _sleep causes the execution thread to sleep for a specified number of seconds (or one second if no argument is provided). _dump dumps the global variables (names and values) to stdout. If associative array arguments are provided, _dump dumps the contents of each associative array to stdout. exec dynamically parses and executes complete AWK scripts, but in a separate variable environment.
Note: exec is only available in interpreted mode.
- -y - Enable the _INTEGER, _DOUBLE, and _STRING typecast keywords. These are particularly useful in [s]printf functions/statements to force parameters to convert to particular types.
- -r - Allow IllegalFormatExceptions to be thrown when using the java.util.Formatter class for printf/sprintf. If the argument is not provided, the interpreter/compiled result catches IllegalFormatExceptions and silently returns a blank string in its place. If the argument is provided, the interpreter/compiled result will halt by throwing this runtime exception.
- -ext - (NEW) Enables the parser/AVM to recognize extensions within scripts. Extensions allow for arbitrary Java code to be called as registered AWK functions. Please refer to the Jawk Extension Facility Description page for more information.
Note: -ext is only available in interpreted mode.
- -ni - (NEW) Conventionally used in conjunction with extensions, -ni prohibits Jawk from processing stdin or files in ARGC/V through action rules. This enables the extension facility to apply action rules to blockable events. Please refer to the Jawk Extension Facility Description page for more information.
Note: -ni is only available in interpreted mode.
- -h/-? - Displays a usage screen. The screen contains a list of command-line arguments and what each does.
If -f is not provided, a script argument is expected here.
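The effect of -t described above comes down to the iteration order of the backing map. The following is a minimal sketch in plain Java of that difference, not Jawk's actual code; SortedKeysDemo and keysInIterationOrder are hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: these names are hypothetical and not part of Jawk.
class SortedKeysDemo {
    // Fills the map in a fixed, unsorted insertion order (as a script might
    // fill an associative array) and returns the keys in iteration order.
    static String keysInIterationOrder(Map<String, Integer> array) {
        array.put("pear", 3);
        array.put("apple", 1);
        array.put("fig", 2);
        return String.join(",", array.keySet());
    }

    public static void main(String[] args) {
        // TreeMap iterates in ascending key order: apple,fig,pear
        System.out.println(keysInIterationOrder(new TreeMap<>()));
        // HashMap makes no ordering guarantee; the order may differ.
        System.out.println(keysInIterationOrder(new HashMap<>()));
    }
}
```

With a TreeMap, a for (key in array) style traversal would see keys in sorted order regardless of insertion order, at the cost of logarithmic rather than constant-time lookups.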
Finally, one or more of the following parameters are consumed by Jawk
and provided to the script via the ARGV/ARGC variables. The script can add to or remove
from this array to modify the behavior of the interpreter/compiled result.
- filename - Uses this file as input to the script. If the filename is invalid, an error is produced on stderr, but Jawk has no direct way of notifying the script.
- name=value - Performs this assignment as a global variable assignment prior to the consumption of the next input file.
If the parameter contains an =, Jawk treats it like a variable
assignment. Otherwise, it’s a filename.
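The rule above can be sketched as a small classifier. This is an illustration under stated assumptions, not Jawk's implementation: OperandDemo and classify are hypothetical names, and splitting at the first = is an assumption the text does not confirm.

```java
// Illustrative only: hypothetical names, not part of Jawk.
class OperandDemo {
    // Applies the documented rule: an operand containing '=' is a variable
    // assignment, anything else is an input filename.
    // Assumption: the name ends at the FIRST '='.
    static String classify(String operand) {
        int eq = operand.indexOf('=');
        if (eq >= 0) {
            return "assign " + operand.substring(0, eq) + " = " + operand.substring(eq + 1);
        }
        return "read " + operand;
    }

    public static void main(String[] args) {
        System.out.println(classify("n=5"));      // assign n = 5
        System.out.println(classify("data.txt")); // read data.txt
        System.out.println(classify("sep=a=b"));  // assign sep = a=b
    }
}
```

One consequence of the rule is that an input file whose name contains = cannot be passed as a plain operand, because it would be taken as an assignment.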
Note: Parameters passed into the command-line which result
in non-execution of the script (i.e., -S, -s, -h, -? and -z) cause
Jawk to ignore filename and name=value parameters.
Jawk employs the org.jawk.util.AwkParameters class for command-line
parameter management. Please refer to the Javadocs for more details.
If an invalid command-line parameter is provided, Jawk
will throw an IllegalArgumentException and terminate execution.
Java Scripting API (JSR 223)
Jawk can be invoked via the JSR 223 scripting API (J2SE 6).
The script API access mechanism was provided by Sun for previous versions
of Jawk (0.14).
To continue this support, Jawk implements a constructor
similar to that used by previous versions.
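For orientation, a JSR 223 client normally locates an engine through javax.script.ScriptEngineManager. The sketch below shows only that generic lookup. The short name "jawk" is an assumption (this page does not state the registered engine name), and Jsr223LookupDemo is a hypothetical class, so the code reports a missing engine instead of assuming one is installed.

```java
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

// Illustrative only: hypothetical class, not part of Jawk.
class Jsr223LookupDemo {
    // Looks up a script engine by short name and reports the result.
    // getEngineByName returns null when no matching engine is registered.
    static String describe(String engineName) {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName(engineName);
        if (engine == null) {
            return "no script engine registered as '" + engineName + "'";
        }
        return "found " + engine.getFactory().getEngineName();
    }

    public static void main(String[] args) {
        // "jawk" is an ASSUMED engine name; check the engine's own
        // documentation for the name it actually registers.
        System.out.println(describe("jawk"));
    }
}
```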
Compilation to JVM Byte Code
Jawk provides compilation of AWK scripts to Java bytecode.
In short, you'll need to download the Byte Code Engineering Library
from Apache and add the bcel.jar file to your classpath.
Also, to run the compiled result (by default, the class is
named "AwkScript" and it is located in the AwkScript.class file),
you'll need to add the jrt.jar file to the classpath.
For example:
For Mac/Unix:
java -cp "jrt.jar:$CLASSPATH" AwkScript
For Windows:
java -cp "jrt.jar;%CLASSPATH%" AwkScript
Note that you do not need the BCEL to execute the compiled
result. The BCEL is necessary only to compile the script.
Please refer to Jawk Compiler Module
for more detailed information on the compiler implementation.
You may download the BCEL from http://jakarta.apache.org/bcel/.
Features
As stated earlier, Jawk interprets AWK scripts in Java.
This is a full implementation of AWK, which includes:
- An intuitive text processing paradigm, tightly integrated with regular
expressions.
- Functions with local, static scoping.
- Scalar and associative array (map) variables.
- Weakly typed variables for greatest flexibility with automatic string/number
conversion.
- Powerful IPC constructs similar to those used by most UNIX shells (pipes and
IO redirect).
- Highly intuitive error diagnostics.
Jawk also offers the following features which the original AWK
does not provide:
- Output to a post-compiled, pre-interpreted format for both elimination
of the compilation step and obfuscation of Jawk scripts.
- Text dumps of abstract syntax tree and intermediate code representation
(tuples).
- Maintenance of associative arrays in key-sorted order.
- Error detection for printf/sprintf format parameters (via the -r argument).
- Compilation of scripts to bytecode executable on any modern JVM.
- An opt-in, flexible extension facility with event blocking capabilities
(in interpreted mode, only).
Because we’re using Java, the following differences exist in order to blend
easily within the J2SE environment:
- Jawk regular expressions are implemented with Java regular expressions.
Therefore, they differ from AWK’s regular expression semantics (mostly by
adding functionality over AWK’s regular expressions).
- printf/sprintf formatting is done by java.util.Formatter. This is markedly
different from C’s, and thus AWK’s printf(). Java's Formatter class
does not attempt to implicitly convert its argument datatypes.
If an argument's datatype differs from what the format specifier expects,
an IllegalFormatException will occur.
Therefore, the script developer must either keep track of implicit type
conversions by Jawk, or use _DOUBLE, _INTEGER, and _STRING
keywords to forcibly convert parameters to these types (to do so,
the -y parameter must be provided).
An example is provided below:
BEGIN {
a=3
# java.util.Formatter accepts %n as \n
printf("Integer of a = %02d%n", _INTEGER a)
printf("Double of a = %.2f%n", _DOUBLE a)
printf("String of a = %s%n", _STRING a)
}
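Without the -y argument, only "String of a = 3" is displayed.
In that mode _STRING is an ordinary variable that evaluates to a blank string,
and the concatenation of two expressions results in a string.
The same concatenation turns the other two arguments into strings,
which %02d and %.2f reject.
With the -y argument, all three lines are displayed.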
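The strictness described above is a property of java.util.Formatter and can be reproduced outside Jawk. The sketch below is plain Java, not Jawk's implementation. FormatterStrictnessDemo and formatOrBlank are hypothetical names, and the catch block only mimics the documented default behavior (without -r) of replacing a failed format with a blank string.

```java
import java.util.IllegalFormatException;
import java.util.Locale;

// Illustrative only: hypothetical names, not part of Jawk.
class FormatterStrictnessDemo {
    // Formats a single argument. On a type mismatch, returns a blank string,
    // mimicking the documented default when -r is not given.
    static String formatOrBlank(String format, Object arg) {
        try {
            return String.format(Locale.ROOT, format, arg);
        } catch (IllegalFormatException e) {
            return "";
        }
    }

    public static void main(String[] args) {
        System.out.println("[" + formatOrBlank("%02d", 3) + "]");   // [03]
        System.out.println("[" + formatOrBlank("%.2f", 3.0) + "]"); // [3.00]
        System.out.println("[" + formatOrBlank("%s", 3) + "]");     // [3]
        // A String is not converted to a number for %d, and an Integer is
        // not converted to a floating-point value for %f:
        System.out.println("[" + formatOrBlank("%02d", "3") + "]"); // []
        System.out.println("[" + formatOrBlank("%.2f", 3) + "]");   // []
    }
}
```

Rethrowing the exception instead of returning a blank string would correspond to the behavior described for the -r argument.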
Code Quality Assurance
Jawk employs various methods to ensure software quality,
several of which are listed below:
- OOD - to enforce code encapsulation / implementation hiding.
- Packaging - the front end, intermediate step, and
back ends are contained within their own Java packages. Also, their
public interfaces have been kept very small. This ensures loose
coupling between these subsystems.
- Assertions - primarily to ensure operand stack
integrity by matching resultant stack push counts with what is expected
(during intermediate code generation).
- Automated Regression Testing - described below.
Regression Testing
All builds are executed against a suite of regression tests
developed by the author. The original goal for developing these scripts
was to cover as many of the intermediate opcodes as possible.
However, the following opcodes are not covered
for reasons which are described below:
_SLEEP_
_PRINT_TO_PIPE_
_EXEC_ (new)
_SYSTEM_
_USE_AS_COMMAND_INPUT_
_CHECK_CLASS_
_PRINTF_TO_PIPE_
_EXTENSION_ (new)
_DUMP_
_SLEEP_ and _DUMP_ are extensions which cannot
be tested via the regression test script mechanism that is utilized.
_CHECK_CLASS_ exists only when assertions are turned on
(to verify that a KeyList exists on the operand stack
during a for(x in y) statement).
The rest involve executing commands on the host operating system,
which likewise cannot be tested via the existing regression test script
environment.
As for _EXEC_, we have not decided if the exec() extension
is in its final form, or if we'll change it to, perhaps, include the
current script context (variable space and runtime stack). Until major
design decisions are made and implemented, it is premature to implement
a test case for this opcode.
In the near future, we plan to construct a Jawk Extension Facility
regression test suite, to avoid coupling the existing regression test
framework with extension semantics. Until then, the _EXTENSION_ opcode will
remain out of the existing regression test framework.
As of this writing, there are 127 opcodes. Therefore, even with 7 opcodes
not covered by the test suite, the regression process still covers
94% of the opcodes used by Jawk.
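The coverage figure follows from simple arithmetic on the counts given in the text. The sketch below (OpcodeCoverageDemo is a hypothetical name, not part of Jawk) computes it for 7 uncovered opcodes, and also for the 9 opcodes named in the list above, in case the two marked (new) are not included in the count of 7.

```java
import java.util.Locale;

// Illustrative only: hypothetical class, not part of Jawk.
class OpcodeCoverageDemo {
    // Percentage of opcodes exercised by the regression suite.
    static double coveragePercent(int totalOpcodes, int uncovered) {
        return 100.0 * (totalOpcodes - uncovered) / totalOpcodes;
    }

    public static void main(String[] args) {
        // Counts stated in the text: 127 opcodes, 7 uncovered.
        System.out.println(String.format(Locale.ROOT, "%.1f%%", coveragePercent(127, 7))); // 94.5%
        // All 9 opcodes named in the list treated as uncovered.
        System.out.println(String.format(Locale.ROOT, "%.1f%%", coveragePercent(127, 9))); // 92.9%
    }
}
```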
A future goal of the regression test suite is to exercise all of the
abstract syntax tree classes. Currently, this is not considered
in the regression test suite.
Semantic Analysis
Other versions of AWK will run through a script and issue
a "runtime error" if a user-defined function is not found.
Jawk does not. It attempts to resolve all function
calls to defined functions at compile-time
(after parsing the script and prior to assembling the
intermediate code from the abstract syntax tree). This is necessary in order
to produce intermediate code with branch statements fully resolved.
Other versions of AWK provide command-line parameters to choose
compile-time or run-time checks for function name
resolution. Jawk does not, mainly to ensure that semantic
analysis is done for the reasons stated above. Also, removing
these semantic checks would leave unresolved references, most likely
resulting in NullPointerExceptions.
Other semantic checks include formal/actual parameter
analysis and array/scalar operation verification.
Again, these are necessary to produce coherent
intermediate code.
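The function-resolution check described above can be pictured as a pass that runs after parsing and before code generation. The sketch below is an illustration only and does not reflect Jawk's internal classes: FunctionResolutionDemo and unresolvedCalls are hypothetical names, and functions are reduced to plain name strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: hypothetical names, not Jawk's semantic analyzer.
class FunctionResolutionDemo {
    // Returns each called function name that has no definition, in order
    // of first appearance. An empty result means every call resolves, so
    // code generation can emit fully resolved branch targets.
    static List<String> unresolvedCalls(Set<String> defined, List<String> calls) {
        List<String> missing = new ArrayList<>();
        for (String name : calls) {
            if (!defined.contains(name) && !missing.contains(name)) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Set<String> defined = new HashSet<>(Arrays.asList("max", "trim"));
        List<String> calls = Arrays.asList("max", "min", "trim", "min");
        System.out.println(unresolvedCalls(defined, calls)); // [min]
    }
}
```

Reporting the missing names before any intermediate code is produced is what lets the error surface at compile-time rather than when the call is first reached.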
http://jawk.sourceforge.net/