This is the second post in a series about building a calculator REPL in Rust. You may want to start with the first post. Today I’ll talk about how the tokenizer is built.
The tokenizer, basically, is a function that looks at a string and recognizes chunks that are meaningful to the program. In the calculator REPL I wrote, these are represented as an enumerated type, because they can be a delimiter, an operator, or a value.
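To make that concrete, here is a minimal sketch of what such an enumerated type might look like. `Token::LeftParen`, `Token::RightParen`, `Token::Operator`, `Opcode::Add`, and `Opcode::Subtract` appear in the snippets later in this post; the `Multiply` and `Divide` names, the `Operand(f64)` variant, and the `describe` helper are my assumptions for illustration, not necessarily what the real code uses.

```rust
/// Arithmetic operations the calculator understands.
/// `Multiply` and `Divide` are assumed names.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Opcode {
    Add,
    Subtract,
    Multiply,
    Divide,
}

/// One meaningful chunk of the input string.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Token {
    LeftParen,        // delimiter
    RightParen,       // delimiter
    Operator(Opcode), // operator
    Operand(f64),     // value (variant name and f64 payload are assumed)
}

/// Hypothetical helper: names the category a token falls into.
pub fn describe(token: Token) -> &'static str {
    match token {
        Token::LeftParen | Token::RightParen => "delimiter",
        Token::Operator(_) => "operator",
        Token::Operand(_) => "value",
    }
}
```

Because the enum is closed, the compiler checks that every `match` over a `Token` handles all three categories.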
In principle, we could write the tokenizer by looping over the string and checking a bunch of if statements to see which token we’re working on, but this would be very prone to bugs and hard to read. Instead, I used a crate called nom.
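For contrast, here is roughly what the hand-rolled loop might look like. This is only a sketch for comparison, not code from the project: the `Token` type is a simplified stand-in, and `tokenize_by_hand` is a hypothetical name.

```rust
/// Simplified stand-in for the calculator's token type.
#[derive(Debug, PartialEq)]
enum Token {
    LeftParen,
    RightParen,
    Plus,
    Minus,
    Number(f64),
}

/// Hand-written tokenizer: one loop, one branch per kind of token.
fn tokenize_by_hand(input: &str) -> Result<Vec<Token>, String> {
    let chars: Vec<char> = input.chars().collect();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        if c.is_whitespace() {
            i += 1;
        } else if c == '(' {
            tokens.push(Token::LeftParen);
            i += 1;
        } else if c == ')' {
            tokens.push(Token::RightParen);
            i += 1;
        } else if c == '+' {
            tokens.push(Token::Plus);
            i += 1;
        } else if c == '-' {
            tokens.push(Token::Minus);
            i += 1;
        } else if c.is_ascii_digit() || c == '.' {
            // Consume the whole run of digits and dots, then let f64 parsing
            // decide whether it is a valid number.
            let start = i;
            while i < chars.len() && (chars[i].is_ascii_digit() || chars[i] == '.') {
                i += 1;
            }
            let text: String = chars[start..i].iter().collect();
            let value = text
                .parse::<f64>()
                .map_err(|_| format!("bad number: {}", text))?;
            tokens.push(Token::Number(value));
        } else {
            return Err(format!("unexpected character: {}", c));
        }
    }
    Ok(tokens)
}
```

Every new token means another branch and more index bookkeeping, which is where the bugs tend to creep in.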
nom lets me write this instead:
named!(left_paren<&[u8], Token>,
    do_parse!(tag!("(") >> (Token::LeftParen))
);
named!(right_paren<&[u8], Token>,
    do_parse!(tag!(")") >> (Token::RightParen))
);
named!(addition_sign<&[u8], Token>,
    do_parse!(tag!("+") >> (Token::Operator(Opcode::Add)))
);
named!(subtraction_sign<&[u8], Token>,
    do_parse!(tag!("-") >> (Token::Operator(Opcode::Subtract)))
);
// and several more
These are the basic building blocks of the parser. Each one does something like map a substring containing just the character ( to the token that represents a left paren. So far it doesn’t feel very magical. The real magic comes from nom’s ability to nest parsers:
named!(single_token<&[u8], Token>,
    alt!(
        left_paren |
        right_paren |
        addition_sign |
        subtraction_sign |
        multiplication_sign |
        division_sign |
        operand
    )
);
named!(tokens<&[u8], Vec<Token>>,
    ws!(many0!(single_token))
);
So here we go. A single token is either a left paren, or a right paren, or an operator, or an operand. Nice declarative syntax. I love how the single_token function is basically just a list of legal tokens, and it works! Then, if we want more than one token, we can just tell nom, “hey, this string is a whitespace delimited list of substrings that should match single_token,” and we get our parser for free. Very fun.
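If the nesting feels opaque, here is a rough, nom-free sketch of the idea behind trying alternatives in order and then repeating a parser. It is not how nom is implemented, and it works on `&str` rather than `&[u8]`. The `Parser` alias, the plain `alt` function, and the cut-down `Token` type are my own stand-ins.

```rust
/// Cut-down stand-in for the calculator's token type.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Token {
    LeftParen,
    RightParen,
    Plus,
}

/// A parser takes input and, on success, returns the remaining input plus a token.
type Parser = fn(&str) -> Option<(&str, Token)>;

fn left_paren(input: &str) -> Option<(&str, Token)> {
    input.strip_prefix('(').map(|rest| (rest, Token::LeftParen))
}

fn right_paren(input: &str) -> Option<(&str, Token)> {
    input.strip_prefix(')').map(|rest| (rest, Token::RightParen))
}

fn addition_sign(input: &str) -> Option<(&str, Token)> {
    input.strip_prefix('+').map(|rest| (rest, Token::Plus))
}

/// In the spirit of `alt!`: try each parser in order, keep the first success.
fn alt<'a>(parsers: &[Parser], input: &'a str) -> Option<(&'a str, Token)> {
    parsers.iter().find_map(|parser| parser(input))
}

/// A single token is any one of the small parsers above.
fn single_token(input: &str) -> Option<(&str, Token)> {
    let choices: [Parser; 3] = [left_paren, right_paren, addition_sign];
    alt(&choices, input)
}

/// In the spirit of `ws!(many0!(single_token))`: skip whitespace and keep
/// applying `single_token` until it stops matching. Returns the unconsumed
/// input alongside the tokens found so far.
fn tokens(input: &str) -> (&str, Vec<Token>) {
    let mut rest = input.trim_start();
    let mut found = Vec::new();
    while let Some((next, token)) = single_token(rest) {
        found.push(token);
        rest = next.trim_start();
    }
    (rest, found)
}
```

The point is that small parsers are ordinary values, so a bigger parser is just a function that combines them.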
This lets me have my overall parsing function be super clean:
pub fn parse(bytes: &[u8]) -> Result<Vec<Token>, TokenizationError> {
    let parse_result = tokens(bytes).to_result();
    match parse_result {
        Ok(token_vec) => Ok(token_vec),
        Err(_) => Err(TokenizationError {})
    }
}
And hurray! We call parse, pass in our input, and it returns either a vector of tokens or, if the byte array can’t be parsed as a valid sequence of tokens, an error.
One feature that I’d like to add, which would be a bit more work I believe, is trying to tell the user what specifically went wrong with the tokens that they tried to parse. Having at least “unexpected token: foo” would be a nice future improvement.
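One possible shape for that improvement, sketched without nom: in the code above `TokenizationError` is an empty struct, so the `unexpected` field and the `check_leftover` helper below are hypothetical. The sketch assumes the tokenizer can hand back whatever input it failed to consume.

```rust
use std::fmt;

/// An error that remembers what the tokenizer choked on.
#[derive(Debug, PartialEq)]
pub struct TokenizationError {
    /// Hypothetical field: the first chunk of text that could not be tokenized.
    pub unexpected: String,
}

impl fmt::Display for TokenizationError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "unexpected token: {}", self.unexpected)
    }
}

/// Hypothetical helper: given the input left over after tokenizing, report
/// the first whitespace-delimited chunk as the offending token, or succeed
/// if nothing but whitespace remains.
pub fn check_leftover(leftover: &str) -> Result<(), TokenizationError> {
    match leftover.split_whitespace().next() {
        None => Ok(()),
        Some(word) => Err(TokenizationError {
            unexpected: word.to_string(),
        }),
    }
}
```

With something like this in place, the REPL could print the error with `{}` and show the user which piece of their input was the problem.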
And now we have a file that will turn user input strings into lists of tokens that mean something to the program. You can see the whole tokenizer here. Next time, we’ll learn to evaluate a list of tokens to get the result of the calculation.
Till then, happy learning!
-Will