Parsing binary files is no different than parsing character-based files except that the "characters" are actually bytes not 16-bit unsigned short unicode characters. From a lexer/parser point of view, there is no difference except that the characters are likely not printable. If you want to match a special 2-byte marker 0xCA then 0xFE, the following rule is sufficient.
```
MARKER : '\u00CA' '\u00FE' ;
```
The parser of course would refer to that token like any other token.
Here is a sample grammar for use with the code snippets below.
Notice that `BYTE` is using a range operator to match anything between 0 and 255. We can't use character classes like `[a-z]` naturally because we are not parsing character codes. All character specifiers must have `00` as their upper byte. E.g., `\uCAFE` is not a valid character because that 16-bit value will never be created from the input stream (bytes only remember).
If there are actual characters like `$` or `!` encoded as bytes in the binary file, you can refer to them via literals like `'$'` as you normally would. See `'.'` in the grammar.
## Binary streams
There are many targets now so I'm not sure exactly how they process text files but most targets will pull in text per the machine's locale. Much of the time this will mean UTF-8 encoding of text converted to 16-bit Unicode. ANTLR's lexers operate on `int` so we can handle any kind of character you want to send in that fits in `int`.
Once the lexer gets an input stream, it doesn't care whether the characters come from / represent bytes or actual Unicode characters.
The `ISO-8859-1` encoding is just the 8-bit char encoding for LATIN-1, which effectively tells the stream to treat each byte as a character. That's what we want. Then we have the usual test rig:
We can't just print out the text because we are not reading in text. We need to emit each byte as a decimal value. The output should be the following when you run the test code:
If you want to play around with the stream, you can. Here's an example that alters how "text" is computed from the byte stream (which changes how tokens print out their text as well):
class MyIPListenerCustomStream extends IPBaseListener {
@Override
public void exitIp(IPParser.IpContext ctx) {
List<TerminalNode> octets = ctx.BYTE();
System.out.println(octets);
}
}
```
You should get this enhanced output:
```
[172(0xAC), 0(0x0), 0(0x0), 1(0x1)]
[10(0xA), 10(0xA), 10(0xA), 1(0x1)]
[10(0xA), 10(0xA), 10(0xA), 99(0x63)]
```
## Error handling in binary files
Error handling proceeds exactly like any other parser. For example, let's alter the binary file so that it is missing one of the 0's in the first IP address: