This is the second post in a series about building a calculator REPL in Rust. You may want to start with the first post. Today I’ll talk about how the tokenizer is built.
The tokenizer, basically, is a function that looks at a string and recognizes chunks that are meaningful to the program. In the calculator REPL I wrote, these are represented as an enumerated type, because they can be a delimiter, an operator, or a value.
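For concreteness, here’s a sketch of what that enumerated type might look like. The names `Token` and `Opcode` match the snippets below, but the exact shape (in particular the payload of `Operand`) is my assumption, not necessarily the project’s actual definition:

```rust
// Sketch of the token type: delimiters, operators, and numeric values.
// The Operand payload type (i64) is an assumption for illustration.
#[derive(Debug, PartialEq)]
enum Opcode {
    Add,
    Subtract,
    Multiply,
    Divide,
}

#[derive(Debug, PartialEq)]
enum Token {
    LeftParen,
    RightParen,
    Operator(Opcode),
    Operand(i64),
}

fn main() {
    // The tokenizer's job is to turn "(2 +" into something like:
    let toks = vec![Token::LeftParen, Token::Operand(2), Token::Operator(Opcode::Add)];
    assert_eq!(toks[0], Token::LeftParen);
}
```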
In principle, we could write the tokenizer by looping over the string and checking a pile of if statements to see which token we’re working on, but that approach would be bug-prone and hard to read. Instead, I used a crate called nom.
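To see why, here’s roughly what the hand-rolled version looks like — a sketch with a stripped-down token type, where every new token kind means another match arm and another chance for an off-by-one bug:

```rust
// A naive, hand-rolled tokenizer: walk the characters and branch on each one.
// Stripped-down token type for illustration only.
#[derive(Debug, PartialEq)]
enum Token {
    LeftParen,
    RightParen,
    Plus,
    Number(i64),
}

fn tokenize(input: &str) -> Result<Vec<Token>, String> {
    let mut tokens = Vec::new();
    let mut chars = input.chars().peekable();
    while let Some(&c) = chars.peek() {
        match c {
            ' ' => { chars.next(); }
            '(' => { chars.next(); tokens.push(Token::LeftParen); }
            ')' => { chars.next(); tokens.push(Token::RightParen); }
            '+' => { chars.next(); tokens.push(Token::Plus); }
            '0'..='9' => {
                // Accumulate consecutive digits into one number token.
                let mut n = 0i64;
                while let Some(v) = chars.peek().and_then(|d| d.to_digit(10)) {
                    n = n * 10 + v as i64;
                    chars.next();
                }
                tokens.push(Token::Number(n));
            }
            _ => return Err(format!("unexpected character: {}", c)),
        }
    }
    Ok(tokens)
}

fn main() {
    assert_eq!(
        tokenize("(1 + 23)").unwrap(),
        vec![Token::LeftParen, Token::Number(1), Token::Plus,
             Token::Number(23), Token::RightParen]
    );
    assert!(tokenize("&").is_err());
}
```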
nom lets me write this instead:
named!(left_paren<&[u8], Token>,
    do_parse!(tag!("(") >> (Token::LeftParen))
);
named!(right_paren<&[u8], Token>,
    do_parse!(tag!(")") >> (Token::RightParen))
);
named!(addition_sign<&[u8], Token>,
    do_parse!(tag!("+") >> (Token::Operator(Opcode::Add)))
);
named!(subtraction_sign<&[u8], Token>,
    do_parse!(tag!("-") >> (Token::Operator(Opcode::Subtract)))
);
// and several more
These are the basic building blocks of the tokenizer. Each one maps a substring, like the single character "(", to the token that represents it. So far, it doesn’t feel very magical. The real magic comes from nom’s ability to nest parsers:
named!(single_token<&[u8], Token>,
    alt!(
        left_paren |
        right_paren |
        addition_sign |
        subtraction_sign |
        multiplication_sign |
        division_sign |
        operand
    )
);
named!(tokens<&[u8], Vec<Token>>,
    ws!(many0!(single_token))
);
So here we go. A single token is either a left paren, or a right paren, or an operator, or an operand. Nice declarative syntax. I love how the single_token function is basically just a list of legal tokens, and it works! Then, if we want more than one token, we can just tell nom, “hey, this string is a whitespace-delimited list of substrings that should each match single_token,” and we get our parser for free. Very fun.
This lets me have my overall parsing function be super clean:
pub fn parse(bytes: &[u8]) -> Result<Vec<Token>, TokenizationError> {
    let parse_result = tokens(bytes).to_result();
    match parse_result {
        Ok(token_vec) => Ok(token_vec),
        Err(_) => Err(TokenizationError {})
    }
}
And hurray! We call parse, pass in our input, and it either produces a vector of tokens or, if the byte array can’t be parsed as a valid sequence of tokens, returns an error.
One feature I’d like to add, which I believe would take a bit more work, is telling the user what specifically went wrong with the input they tried to tokenize. Having at least “unexpected token: foo” would be a nice future improvement.
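One way to carry that information would be to give TokenizationError a field for the offending input. This is a sketch of the idea — the field name and Display format are my assumptions, not the project’s actual API:

```rust
use std::fmt;

// Sketch: an error type that remembers what it choked on, so the REPL
// can print "unexpected token: foo" instead of a bare failure.
#[derive(Debug)]
struct TokenizationError {
    unexpected: String,
}

impl fmt::Display for TokenizationError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "unexpected token: {}", self.unexpected)
    }
}

fn main() {
    let err = TokenizationError { unexpected: "foo".to_string() };
    assert_eq!(err.to_string(), "unexpected token: foo");
}
```

The trickier part is threading the leftover input from nom’s error back into this struct at the point where parse currently discards it with Err(_).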
And now we have a file that will turn user input strings into lists of tokens that mean something to the program. You can see the whole tokenizer here. Next time, we’ll learn to evaluate a list of tokens to get the result of the calculation.
Till then, happy learning!
-Will