Calculator REPL Part 2: Tokenizing the Input

This is the second post in a series about building a calculator REPL in Rust. You may want to start with the first post. Today I’ll talk about how the tokenizer is built.

The tokenizer, basically, is a function that looks at a string and recognizes chunks that are meaningful to the program. In the calculator REPL I wrote, these chunks (the tokens) are represented as an enumerated type, because each one can be a delimiter, an operator, or a value.
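To make that concrete, here is a rough sketch of what such an enumerated type could look like. The names `Token::LeftParen`, `Token::RightParen`, `Token::Operator`, `Opcode::Add`, and `Opcode::Subtract` appear in the snippets in this post; the `Multiply` and `Divide` variants, the `Operand(i64)` payload, and the `kind` helper are my assumptions for illustration and may not match the real code.

```rust
// Hypothetical sketch of the token type. Only LeftParen, RightParen,
// Operator, Opcode::Add and Opcode::Subtract are confirmed by the post.
#[derive(Debug, Clone, PartialEq)]
pub enum Opcode {
    Add,
    Subtract,
    Multiply, // assumed name
    Divide,   // assumed name
}

#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    LeftParen,
    RightParen,
    Operator(Opcode),
    Operand(i64), // assumed payload; the real operand type may differ
}

/// Illustrative helper: sorts a token into the three groups named above.
pub fn kind(token: &Token) -> &'static str {
    match token {
        Token::LeftParen | Token::RightParen => "delimiter",
        Token::Operator(_) => "operator",
        Token::Operand(_) => "value",
    }
}
```

Because every token is a variant of one enum, the compiler checks that a `match` like the one in `kind` covers every case.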

In principle, we could write the tokenizer by looping over the string and checking a bunch of if statements to see which token we’re working on, but this would be very prone to bugs and hard to read. Instead, I used a crate called nom.
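For contrast, the hand-rolled approach might look something like the sketch below. This is not code from the project: the function name `tokenize_by_hand`, the integer-only operands, and the `String` error are all assumptions I made to keep the example small.

```rust
// Hypothetical hand-written tokenizer, shown only for comparison.
#[derive(Debug, PartialEq)]
enum Opcode {
    Add,
    Subtract,
}

#[derive(Debug, PartialEq)]
enum Token {
    LeftParen,
    RightParen,
    Operator(Opcode),
    Operand(i64), // assumed: integer operands only
}

fn tokenize_by_hand(input: &str) -> Result<Vec<Token>, String> {
    let mut tokens = Vec::new();
    let mut chars = input.chars().peekable();
    while let Some(&c) = chars.peek() {
        if c.is_whitespace() {
            chars.next();
        } else if c == '(' {
            tokens.push(Token::LeftParen);
            chars.next();
        } else if c == ')' {
            tokens.push(Token::RightParen);
            chars.next();
        } else if c == '+' {
            tokens.push(Token::Operator(Opcode::Add));
            chars.next();
        } else if c == '-' {
            tokens.push(Token::Operator(Opcode::Subtract));
            chars.next();
        } else if c.is_ascii_digit() {
            // Accumulate consecutive digits into one number.
            let mut value: i64 = 0;
            while let Some(&d) = chars.peek() {
                match d.to_digit(10) {
                    Some(n) => {
                        value = value
                            .checked_mul(10)
                            .and_then(|v| v.checked_add(i64::from(n)))
                            .ok_or_else(|| "number too large".to_string())?;
                        chars.next();
                    }
                    None => break,
                }
            }
            tokens.push(Token::Operand(value));
        } else {
            return Err(format!("unexpected character: {}", c));
        }
    }
    Ok(tokens)
}
```

Every branch has to remember to advance the iterator, and each new token type means another `else if`. That is the kind of bookkeeping I wanted to avoid.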

nom lets me write this instead:

named!(left_paren<&[u8], Token>,
    do_parse!(tag!("(") >> (Token::LeftParen))
);
named!(right_paren<&[u8], Token>,
    do_parse!(tag!(")") >> (Token::RightParen))
);
named!(addition_sign<&[u8], Token>,
    do_parse!(tag!("+") >> (Token::Operator(Opcode::Add)))
);
named!(subtraction_sign<&[u8], Token>,
    do_parse!(tag!("-") >> (Token::Operator(Opcode::Subtract)))
);
// and several more

This is a basic part of the parser. It does things like map a substring containing just the character ( to the token that represents a left paren. So far it doesn’t feel very magical. The real magic comes from nom’s ability to nest parsers:

named!(single_token<&[u8], Token>,
    alt!(
        left_paren |
        right_paren |
        addition_sign |
        subtraction_sign |
        multiplication_sign |
        division_sign |
        operand
    )
);
named!(tokens<&[u8], Vec<Token>>,
    ws!(many0!(single_token))
);

So here we go. A single token is either a left paren, or a right paren, or an operator, or an operand. Nice declarative syntax. I love how the single_token function is basically just a list of legal tokens, and it works! Then, if we want more than one token, we can just tell nom, “hey, this string is a whitespace-delimited list of substrings that should match single_token,” and we get our parser for free. Very fun.
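If you're curious what those two combinators are doing conceptually, here is a plain-Rust sketch with no nom involved. It is not how nom is implemented: the `Parser` type alias, the `&str` input (nom here works on `&[u8]`), and the way whitespace is skipped are simplifications I'm assuming for the example. The idea is just that `alt` tries each parser in order and returns the first success, and `many0` repeats a parser until it fails.

```rust
// Conceptual sketch only; nom's real macros work differently under the hood.
#[derive(Debug, PartialEq)]
enum Opcode {
    Add,
    Subtract,
}

#[derive(Debug, PartialEq)]
enum Token {
    LeftParen,
    RightParen,
    Operator(Opcode),
}

// A parser takes the remaining input and, on success, returns what is left
// plus the token it recognized. (Hypothetical alias, not a nom type.)
type Parser = fn(&str) -> Option<(&str, Token)>;

fn left_paren(input: &str) -> Option<(&str, Token)> {
    input.strip_prefix('(').map(|rest| (rest, Token::LeftParen))
}

fn right_paren(input: &str) -> Option<(&str, Token)> {
    input.strip_prefix(')').map(|rest| (rest, Token::RightParen))
}

fn addition_sign(input: &str) -> Option<(&str, Token)> {
    input
        .strip_prefix('+')
        .map(|rest| (rest, Token::Operator(Opcode::Add)))
}

fn subtraction_sign(input: &str) -> Option<(&str, Token)> {
    input
        .strip_prefix('-')
        .map(|rest| (rest, Token::Operator(Opcode::Subtract)))
}

// Like alt!: try each parser in order, keep the first one that succeeds.
fn alt<'a>(parsers: &[Parser], input: &'a str) -> Option<(&'a str, Token)> {
    parsers.iter().find_map(|parse| parse(input))
}

// Like many0!: apply the parser repeatedly, collecting results until it fails.
fn many0<'a>(parse: Parser, mut input: &'a str) -> (&'a str, Vec<Token>) {
    let mut found = Vec::new();
    while let Some((rest, token)) = parse(input) {
        found.push(token);
        input = rest;
    }
    (input, found)
}

fn single_token(input: &str) -> Option<(&str, Token)> {
    let choices: [Parser; 4] = [left_paren, right_paren, addition_sign, subtraction_sign];
    // Skip leading whitespace by hand, standing in for what ws! handles.
    alt(&choices, input.trim_start())
}

// Returns the unparsed remainder along with the tokens that were recognized.
fn tokens(input: &str) -> (&str, Vec<Token>) {
    let (rest, found) = many0(single_token, input);
    (rest.trim_start(), found)
}
```

Calling `tokens("( + - )")` in this sketch yields an empty remainder and four tokens, while `tokens("( x")` stops at the `x` and hands it back as the remainder.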

This lets me have my overall parsing function be super clean:

pub fn parse(bytes: &[u8]) -> Result<Vec<Token>, TokenizationError> {
    let parse_result = tokens(bytes).to_result();
    match parse_result {
        Ok(token_vec) => Ok(token_vec),
        Err(_) => Err(TokenizationError {})
    }
}

And hurray! We call parse, pass in our input, and it either returns a vector of tokens or, if the byte array can’t be parsed as a valid set of tokens, returns an error.

One feature that I’d like to add, which would be a bit more work I believe, is trying to tell the user what specifically went wrong with the tokens that they tried to parse. Having at least “unexpected token: foo” would be a nice future improvement.
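I haven't built this yet, but one possible shape for it is sketched below. Everything here is hypothetical: the `unexpected` and `position` fields, the `describe_failure` helper, and the assumption that the tokenizer can hand back the unparsed remainder of the input as a `&str`. It leaves out the nom side entirely and only shows how the error could carry enough information to print a useful message.

```rust
use std::fmt;

// Hypothetical richer version of TokenizationError; the fields are assumptions.
#[derive(Debug, PartialEq)]
pub struct TokenizationError {
    pub unexpected: String,
    pub position: usize,
}

impl fmt::Display for TokenizationError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "unexpected token: {} (at byte {})",
            self.unexpected, self.position
        )
    }
}

impl std::error::Error for TokenizationError {}

/// Builds an error from the full input and the part that could not be parsed.
/// Assumes `remaining` is a suffix of `input`.
pub fn describe_failure(input: &str, remaining: &str) -> TokenizationError {
    let remaining = remaining.trim_start();
    // Take the first whitespace-separated chunk as the offending token.
    let unexpected = remaining
        .split_whitespace()
        .next()
        .unwrap_or("")
        .to_string();
    TokenizationError {
        unexpected,
        position: input.len().saturating_sub(remaining.len()),
    }
}
```

With input `1 + foo 2` and remainder `foo 2`, this produces the message `unexpected token: foo (at byte 4)`.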

And now we have a file that will turn user input strings into lists of tokens that mean something to the program. You can see the whole tokenizer here. Next time, we’ll learn to evaluate a list of tokens to get the result of the calculation.

Till then, happy learning!

-Will
