Newlines and whitespace characters require special handling in the lexer. When a newline character is encountered, the line counter should be incremented by one. Whitespace characters (spaces, tabs) are generally ignored by the compiler as they do not affect program execution. The lexer should simply advance past these characters without creating tokens for them. This ensures that the token stream contains only meaningful elements from the source code.
Lexer | Writing a C Compiler in C++ | risc-v toolchain | Day 2
Project: https://github.com/pranavmag/riscv-emulator
All right. Uh, getting it set up real quick. One more second. Yo, what's up, EXO? Right now, I was trying to watch the Truman Show and I was at the ending, bro. Come on. Hey, I won't be I won't be starting for a couple minutes, so you have some time.
Don't worry.
I need to get stuff set up. So, yeah.
All right.
Yeah, just a couple more things. Uh, just pull up like a notepad because I know we're going to need it today to map things out properly.
Got my headphones.
Okay.
Looks to be good. Uh, yeah, let's just recap what we did last time. Let's just recap what we did last time real quick.
So, uh, just making sure the audio is fine. Yeah. All right. Everything's fine. Let's just recap real quick.
All right. Yeah. So, last time we started by setting up the enum for the token type and then actually made the struct for the token, the actual components of the token. We started writing the scanner class. We need to implement this more.
We implemented a couple of the helper functions for the scanner class last time. We didn't finish though. And we still need to implement all the functions within the CPP file. And also last time we did all the main stuff for the main file, which is error handling, file handling, REPL handling. Yeah, we did that. So today we're probably just going to be writing more of the class and its components.
So we need to kind of like figure out that.
I do have some more notes to look through. So we can use this as we go and check that.
Um, all right. So, what exactly we have? So, we got the is at end function what we wrote last time. So, basically checking if we're at the end of the file or not.
Advance, which basically advances our current value by one to check the next character. Uh, we have the peek function, which basically checks the next character to see if we need to consume it or not.
And then peek next is for a special case. This peeks two characters ahead.
And then match just basically checks if the next character is what you expect it to be.
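As a rough sketch of the five helpers recapped above: the member names (`source`, `current`) and exact signatures are assumptions, since the stream only names the functions. Here `peek` returns the next unread character and `peekNext` the one after it, which is the common convention for this style of scanner.

```cpp
#include <cstddef>
#include <string>

// Hypothetical cut-down scanner: only the cursor helpers, no token output.
struct Scanner {
    std::string source;       // whole input file as one string
    std::size_t current = 0;  // index of the next unread character

    // True once every character has been consumed.
    bool isAtEnd() const { return current >= source.size(); }

    // Consume one character and return it. Caller checks isAtEnd() first.
    char advance() { return source[current++]; }

    // Look at the next unread character without consuming it.
    char peek() const { return isAtEnd() ? '\0' : source[current]; }

    // Look one character past peek(), again without consuming anything.
    char peekNext() const {
        return current + 1 >= source.size() ? '\0' : source[current + 1];
    }

    // Consume the next character only if it equals `expected`.
    bool match(char expected) {
        if (isAtEnd() || source[current] != expected) return false;
        ++current;
        return true;
    }
};
```

The difference that matters later is that `advance` and a successful `match` move the cursor, while `peek` and `peekNext` never do.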
Welcome back EXO. Welcome back.
So we can kind of map it out here. So basically, yeah, let's let's kind of map it out.
Let's just pull up like a probably best to pull up like a paint or something and map this out correctly so we can get the good idea here just so I can kind of like understand exactly what we need to do here. Uh so essentially we need we need a function that will scan all the tokens in the file. So like let's take like this program for example right we have every single thing in this program right here that we need to go through that the compiler needs to go through and parse everything and make sure that everything is properly being set.
It needs to like understand a distinction between keywords, punctuation, anything like that. So we need to actually have a function for that. So we can call that like scan actually for the text I'm probably just going to type it out. So we have something like scan probably like a scan tokens function, right? And that's going to be called let's kind of just map it out here. So we have like a scan tokens function.
Uh let's just draw like a box around it or something. So let's just draw like a box here.
Uh let's zoom out a bit.
All right. So scan tokens is the first one to be called.
And we're essentially scanning through every single every single keyword here.
Right. So every single keyword, every single string literal word, anything in this file. And we need to make sure that we scan every single one. And when we get to the end, we return like a end of file token. So that's going to be so once we call this, it's going to check to see if we're at the end or not.
So let me just write that. So we have our is at end function already, right?
We already have that function. So we can essentially just say that this is being called.
So it's going to be checking this. It's going to be checking if it is at the end. So let's just draw two paths here, a yes and a no path. So if it is at the end, we need to have an end of file token. We already have that enumeration in here. We have the end of file token there. So that's what we'll be returning if that's the case. So, yeah, we can say emit EOF and return. This is essentially what will happen if it's the end of the file. So we know that we can end it right there.
And the other case uh okay so for this case right so we have our different variables like start and current right so start will basically tell us like exactly where we are at to start the actual token. So um we would need to set start to start as our current value that we that we are at. So um let me just say like start is equal to current right here. I believe that's what my notes also says.
Yeah. Um, if it's not at end of all tokens, then it sets the start as a current and scans that individual token using a different function. Yeah, that's basically how we do it. From my research, that's how we should have that.
Um, so yeah, it's going to be setting the start to the current. So, let me just give an example. So, we have this is what I'm kind of getting of it.
I'm going to kind of I literally just wrote these notes, not going to lie. I wrote these notes like a little while ago today, earlier today. So, I'm just trying to see if I can actually remember what is actually going on here before we implement it into the code.
Uh, yeah. So, let me make sure the audio is fine.
Hopefully audio is fine. I don't want to mishap like that one time.
Seem seems to be fine. I just checked.
So yeah. Okay. So like what I'm kind of getting at is like if we have this token right here, right? So our start would be here and our current would keep moving on to the end of that token. Once we reach the the next is the starting token, we need to actually set the start back to current. So we can start from here and move on and so on for each for each token in the vector that we're essentially going through. So the final phase of this part would just be to this is where we check individual characters, right? So we're setting okay so we're checking the end first and start the current. Now we need to check for the characters, right?
Each individual character, not the token itself but each individual character, like this std::string here, we need to check every single character within this code right here. So, uh, yeah, we can essentially call that a separate scan token function. So we're checking each character within the code.
Yo, I appreciate it. Jokers, welcome to the stream, man. Yeah, so this is essentially how our loop is going to go, kind of. So, this is basically what we need to actually implement into code to actually make this work.
So, this is like the overall vector of all the tokens. We're checking if we're at the end of the file. If not, we keep going, setting the start as the current value. And then we check each individual character and keep going. So that's essentially what we're doing. And then just loop back to here.
Loop back to here if we're as we're going.
Yo, what's up KY men? Yeah, I'm making a compiler.
The compiler is going to feed like raw C code. It's going to take a subset of C and feed that into my emulator, my CPU emulator.
Okay, so this is a pretty good map that we have here. Let's try and start implementing. So, uh let's kind of like think of what we exactly need, right?
So, what are like different things that we can actually have in our file? So, we can have um so we can have like strings, right? we can have separate strings that we need to parse through.
Uh so like strings would basically just be like anything after a double quote. So any any amount of text we need to make sure that this is seen as one whole string.
So we need to actually like make sure that we can get everything in between the double quotes. And if we hit a double quote then we will essentially call that the end of the string.
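One possible way to treat everything between the double quotes as a single string is sketched below. The function name `scanString` and its out-parameter shape are hypothetical, not from the stream; escape sequences and line counting are left out on purpose.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper. `pos` must index the opening double quote.
// On success, `out` receives the text between the quotes, `pos` ends up
// just past the closing quote, and the function returns true.
// Returns false if the input ends before a closing quote is found.
bool scanString(const std::string& src, std::size_t& pos, std::string& out) {
    const std::size_t start = pos;  // index of the opening quote
    ++pos;                          // step past the opening quote
    while (pos < src.size() && src[pos] != '"') {
        ++pos;  // any character inside the string is simply skipped over
    }
    if (pos >= src.size()) return false;  // unterminated string
    ++pos;                                // consume the closing quote
    // Drop both quotes: begin one past `start`, and shorten the length by two.
    out = src.substr(start + 1, pos - start - 2);
    return true;
}
```

Because the loop only looks for the closing quote, characters such as `;` or `!` inside the string are never handed to the operator logic.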
Uh we can also have like integer literal, right?
Uh we just need to check if these are like digits.
So we just need to check if it's a number or not essentially.
Uh so like we need to just make sure that any amount like any value that we get we can check etc. Um let's go let me go back to my enum my enumerations. Yeah so we have strings numbers identifiers stuff like that we have our operations and punctuation.
This is what we need to check mainly.
This is where things will get a bit different. So essentially any punctuation, any like punctuation, let's just call them like single characters or double characters, anything like that.
And the flow for those, right? So let's say we have like exclamation mark or an equal to or like a less than sign semicolon.
We have to check for each of these characters using one of our helper functions. We can use like a peek or a match to see if that character is actually there. Well, match specifically would work for these.
So, like if we have an exclamation mark, we need to check the next character to see if it's an equal to or something like that. So if the next character after this exclamation mark is an equal to, the token will become not equal to. If it's not an equal to, then it'll become just not by itself. So that's one other edge case that we have to get. Uh, what else is there?
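The one-or-two-character decision described here could look roughly like this. The enumerator names and the free function `scanOperator` are made up for the example; the stream's real enum may be spelled differently.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical token kinds for this sketch only.
enum class TokenType { NOT, NOT_EQUAL, ASSIGN, EQUAL_EQUAL, LESS, LESS_EQUAL };

// Classify the operator starting at src[pos] and move pos past it.
// Precondition: pos < src.size().
TokenType scanOperator(const std::string& src, std::size_t& pos) {
    // Same idea as the scanner's match(): consume only on a hit.
    auto match = [&](char expected) {
        if (pos >= src.size() || src[pos] != expected) return false;
        ++pos;
        return true;
    };

    const char c = src[pos++];
    switch (c) {
        case '!': return match('=') ? TokenType::NOT_EQUAL : TokenType::NOT;
        case '=': return match('=') ? TokenType::EQUAL_EQUAL : TokenType::ASSIGN;
        case '<': return match('=') ? TokenType::LESS_EQUAL : TokenType::LESS;
        default:
            throw std::invalid_argument("character not handled by this sketch");
    }
}
```

With input `!=x` the cursor ends at index 2; with `!x` it ends at index 1, so the `x` is still there for the next token.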
There have to be some other edge cases.
Let's think.
You have to make sure also that character is not into the quotations cuz that should be read like string or command, right? Yeah, it should be read like a string. So essentially what I'm probably going to do is just once we parse through the entire string and make sure that we get through it. We have to use a substring and just grab this not the quotations. We just need to grab that that raw string and call that a string. So the compiler knows that's probably how we're going to do that. Uh I guess uh new line characters as well.
If we have like a new line, we just need to call that uh what should we do? What should we do in the case of a new line?
I'll get to the coding in a second, guys. I'm just trying to map this out.
Okay. So, uh the case of new line, I guess we just we take our line here and just increment it by one. It's not really a big deal.
Increment line by one. That's fine.
Uh, white space should just be white space should generally be ignored to be honest.
Um, oh yeah, decimals. Decimals are a bit interesting. Like we're doing floats for ours. So floating point. So any floating-point number, right? We have to make sure that if we have like 12.3, we need to check the first number here.
Make sure that's a digit. And then check after this dot. We need to check if what comes after the dot is a number, because we can also have something like 123.str or something like that, like make it a string, any helper function called after the dot. We need to make sure that it doesn't mix up these two cases. And for special characters in a string, I mean, I guess for any character in a string I just advance forward. So it doesn't really matter what you have, anything can be in the string and it won't be executed as a command, but I have to write it in a way so it doesn't do that. So I mean I guess we can start writing.
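A small sketch of the decimal check, assuming a hypothetical free function `scanNumber`: the dot is consumed only when a digit follows it, which is the two-characters-ahead lookahead that peek next exists for.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper. `pos` must index the first digit of the number.
// Returns the number's text and leaves `pos` just past it.
std::string scanNumber(const std::string& src, std::size_t& pos) {
    auto isDigitAt = [&](std::size_t i) {
        return i < src.size() &&
               std::isdigit(static_cast<unsigned char>(src[i])) != 0;
    };

    const std::size_t start = pos;
    while (isDigitAt(pos)) ++pos;  // integer part

    // Take the '.' only if a digit comes right after it; otherwise the dot
    // belongs to whatever follows the number.
    if (pos < src.size() && src[pos] == '.' && isDigitAt(pos + 1)) {
        ++pos;                         // consume '.'
        while (isDigitAt(pos)) ++pos;  // fractional part
    }
    return src.substr(start, pos - start);
}
```

So `12.3;` yields the lexeme `12.3`, while `123.str` yields only `123` and leaves the dot unread.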
Yeah, I guess we can start writing. I guess we kind of covered I'll probably think of some edge cases aside from that. But yeah.
Oh, you mean like Oh, new lines and tabs. Okay. Like that. Uh h that's interesting. Uh I mean I guess if it hits that then we just increment line by one.
That's really all like the new line does anyways.
In the case of tabs I don't know if tabs actually do anything for the compiler itself. I think that's just like a human thing just for human reading. I don't know if that's anything with the compiler but I have to check that probably.
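On the tab question: in C, tabs count as ordinary whitespace, so they separate tokens but produce none. A minimal sketch of the newline and whitespace handling follows; the function name `skipWhitespace` is hypothetical, and treating `'\r'` as whitespace is an addition not mentioned in the stream.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: move `pos` past spaces, tabs, carriage returns and
// newlines, bumping `line` once per newline. Stops at the first other char.
void skipWhitespace(const std::string& src, std::size_t& pos, int& line) {
    while (pos < src.size()) {
        switch (src[pos]) {
            case '\n':
                ++line;  // the only whitespace that changes scanner state
                ++pos;
                break;
            case ' ':
            case '\t':
            case '\r':
                ++pos;   // ignored: no token is produced
                break;
            default:
                return;  // something meaningful starts here
        }
    }
}
```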
But yeah, we can probably start. Uh, so we implemented our helper functions last time. So we should probably declare like forward declare or just declare that in here any other type of functions.
Uh so like I need a separate function for int strings something like that probably uh let me see. So yeah we would need a way to add the token as well. So we need to add the token to the thing.
So we say void add token and the token in general we can just say like for now we can just say like token type is a type and then it's literal value. Uh, anything else I want to implement?
That's fine for now. Just the literal value.
Just call it literal or did I call it literal value? No, I called it just literal.
Uh it's sleep now. You will be back. All right, man. I appreciate you coming to the stream, hanging out for a bit.
Uh okay, so token type literal literal.
Uh, why is that not I guess it's I'll take a look at that in a second.
Um, but yeah, we need, as we said, uh, I'm going to compile for now. I'll take a look at that in a second, but okay. So we need as we said void scan token like the single individual token that we're scanning that's not really going to pass anything probably a way to handle string.
So with string with number it's just to handle any numbers. I'm just going to combine I'm going to combine numbers and floats together and just in one into one function because we can just handle that there itself. I'm pretty sure um and I guess keywords identifiers stuff like that.
So that should be kind of like the path that we should take.
I guess I guess it needs to call that like that.
That's probably what was going on earlier.
Um, is there any way to, like... That's going to be kind of a hassle to write out every time. Is there a way to, like, maybe use a namespace or something? Hm.
So before the struct I guess we can say like using literal, and then define this literal value as that.
So then in here we can just call it... it'd be better to have that in a namespace, but we still need it. So I'm just going to call it literal here. Yeah. So that should be better. I don't want to write out this whole thing every time.
So now in here calling literal like that should work.
Yeah, that works.
And that should be for declaration of that. So that's just fine.
Okay, that works.
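The alias being set up here might look something like the following. This is a guess at the shape: the stream does not show the type being aliased, so the `std::variant` alternatives, the `Literal` spelling, and the enumerators are all assumptions. It needs C++17.

```cpp
#include <string>
#include <variant>

// Assumed alias: one short name instead of spelling the variant out each time.
// std::monostate stands for "this token carries no literal value".
using Literal = std::variant<std::monostate, int, double, std::string>;

// Hypothetical subset of token kinds.
enum class TokenType { NUMBER, STRING, END_OF_FILE };

// The four components mentioned on stream: type, lexeme, literal, line.
struct Token {
    TokenType type;
    std::string lexeme;  // raw text as written in the source
    Literal literal;     // parsed value, or std::monostate if none
    int line;
};
```

A default-constructed `Literal` holds `std::monostate`, since a variant default-initializes to its first alternative.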
So we have these now. So now we need to actually go into CPP and actually implement that.
So let's just say include extra.h, include some basic stuff we need, #include vector as well, and I think that's fine. I think everything is included. Oh yeah, let's include variant too.
Yeah. So now we have all our includes.
So now we need to actually go through and implement implement what we talked about. So we had scan tokens.
So scan tokens is just going to be a vector. It's just going to be a vector of all the whole file basically. Basically your program file. That's what scan tokens is going to be. It's going to just be the whole file of all your code that we're taking every single token as a vector of that.
So yeah, I'm probably going to do it like that. So call it like the vector of token and uh let's call it yeah scan tokens.
Let's get tokens.
And yo, what's up, Protham? How you doing, man? First time joining my live stream and seeing me. Yo, I appreciate it.
We're just fighting a compiler here, trying to slowly go through this, figure out everything.
So, yeah, scan tokens. All right. So, we're taking the vector of token.
No, I'm not going to use using namespace std. That can collide with a lot of things since I have functions like string and number. It's going to just make it harder for me and the compiler in general to actually see what namespace I'm working in.
It's best practice to write std:: in everything that you do.
You can use using namespace std, but it's just not a good practice.
So I'm just going to write std every time. It's not really a big deal to write std every single time.
So it's fine. I'm used to writing std every time. You're from India. Nice. I'm in the States right now. United States.
Um, I guess we won't really take anything.
Uh, oh yeah, we do need to do this though. So, we can take it from the class. Well, we don't have a function in there yet for that. So, let me add that. Void. It'll be, uh, yeah, let's just call it scan token.
Yeah. So now that should work. That should work.
You say, good practice of not using namespace std, how? Tell me. In LeetCode it's fine, but when you're making a project you generally don't want that. Maybe it won't be the best explanation, you should probably look for someone else's explanation. But let's say I have a function called string or something like cout. If I have that same name within my code somewhere, it won't know what namespace it's coming from. So it'll look through every namespace that has that function, and the compiler may or may not know which one to actually take it from. So you generally don't want to use that.
That's probably not the best explanation. You should probably look for a YouTube video that tells you why not to use using namespace std. But yeah, let's just go through. Okay, so all right, we have our scan tokens function, right? So essentially what we're doing is, okay, scan tokens will be called, and once that function is called we need to check if we're at the end of the file. So let's take this for example, the whole file here, we need to scan each token individually.
As long as we're not at the end of the file, we can scan it. So, we can say while um while we're not at the end.
Okay, as long as we're not at the end.
And I guess that's the only condition we need to really check here.
So as long as we're not at the end, we need to set the start as the current value.
So taking our start variable set that as current.
So essentially what that's doing is let's say we have let's say we have this right. So we set our start as our current value. Our current will parse through every single character until we get to the next part and then we set our start back to this line back to our cursor. So then it will go to the next token and so on. So so that that sets it like that and um what else current and then yeah we need to scan token as well.
We can just call the scan token function. So that's basically what it's doing in a loop over and over again.
Once we reach the end of that uh once we reach the end of that we just want to put that end of file token in. So we already do have um end of file token enumeration. So, I guess we can just say like uh let's just say like um I guess we just take our our vector our vector for that token. So that's going to be tokens that we had.
So just tokens, and then we can just add a value there to the end, and we want to add that enumeration. So end of file token, it's from the enumeration token type. So token type and end of file token. Yeah.
Um and that. Okay. So, what else?
I mean, I guess just that, right? I guess that's just what we need.
Um, okay.
Well, wait. Our vector does take in tokens. So we actually need to implement everything properly there.
H um yeah we need to implement the rest of it too. So we can actually have it properly because each value in our each value in our vector will have I think that's how it would work right it would have the every single part of the token itself.
So type, the lexeme, the literal and the line. So we need to actually implement, uh, I mean I guess the lexeme is just empty, I mean not the literal. The literal will be just the null value, so it's just going to be std::monostate. So I guess just like that. And the line number is just going to be the line value that we have set for that. Um, I think that should be good. And then we just return tokens, right? So now we just return that whole vector itself, so that should have recreated the file properly. Uh, what other edge cases? I think that should encapsulate this function perfectly. Okay. As long as it's not at the end, we set start to current and we keep going through each token.
And once we get to the end of the file, we emit the end of file token. So, I think that's good. Um, okay. Let's see what other functions we need to implement.
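Put together, the loop described above could be sketched like this. The class layout and enumerator names are assumptions, the literal field is left out to keep the example short, and `scanToken` is a stand-in that just splits on whitespace so the loop has something to drive.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

enum class TokenType { WORD, END_OF_FILE };  // hypothetical names

struct Token {
    TokenType type;
    std::string lexeme;
    int line;
};

class Scanner {
public:
    explicit Scanner(std::string src) : source(std::move(src)) {}

    std::vector<Token> scanTokens() {
        while (!isAtEnd()) {
            start = current;  // the next token begins where the last one ended
            scanToken();
        }
        // Past the last character: emit the end-of-file marker and stop.
        tokens.push_back(Token{TokenType::END_OF_FILE, "", line});
        return tokens;
    }

private:
    bool isAtEnd() const { return current >= source.size(); }

    // Stand-in for the real scanToken: whitespace is skipped, newlines bump
    // the line counter, and any other run of characters becomes one WORD.
    void scanToken() {
        const char c = source[current++];
        if (c == '\n') { ++line; return; }
        if (c == ' ' || c == '\t') return;
        while (!isAtEnd() && source[current] != ' ' &&
               source[current] != '\t' && source[current] != '\n') {
            ++current;
        }
        tokens.push_back(
            Token{TokenType::WORD, source.substr(start, current - start), line});
    }

    std::string source;
    std::vector<Token> tokens;
    std::size_t start = 0;
    std::size_t current = 0;
    int line = 1;
};
```

For the input `int x\n= 3` this produces four WORD tokens followed by END_OF_FILE, with the last three tokens reported on line 2.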
Uh, add token. Let's do add token.
So that's basically just setting the token that we need. So let's do that.
All right. So, okay. Um, when we're adding the token, right, we're just going to be we're just going to be adding that to the vector itself, I believe. So, Um so we're taking the type or taking the literal value of it.
I mean I guess just tokens.emplace_back.
And we're essentially just going to take type as our token type. Uh, for the lexeme, we're just going to take... Okay, wait. How do we get the lexeme though?
How do we get the lexeme value in here?
Because it's going to be like, uh, let me just comment it out. Let me just comment it out so I can write it out probably. It's going to be, let's say, like some text or something.
Some text, or if we're just moving it from somewhere, we're going to say std::move on the lexeme value, right? Something like that. And then we need the literal, that's fine. Line is just simple. Line is just the line number we take.
But how are we going to take our lexeme?
I need to see how to do that. Uh, so the lexeme would just be the value itself, right? So, I mean, okay, let's say we have like, I don't know, a float, just the keyword itself. That's the type of it, but the actual name of it would be float in the text. So how would we grab it? I guess since we're grabbing it from the file, right, we need to find a way to grab it from the file itself.
Uh, so I guess I'm going to try and map it out.
Uh, maybe I'll draw it on here or something.
Yeah. So, let me just draw it right right here. So, let's say I'm trying to think of what example I could use. I guess maybe strings would be the best way to do it. I guess I'm not sure.
I'm not exactly sure. Uh, let's just say we have float f equals whatever, something like that, right? We need to grab just this word and parse it as a keyword and add that token in. It can be the same for anything, for every single thing, but right now let's just focus on float. So how do we grab that from the text itself? Because probably before this we'll have some other stuff like int x = 3, something like that, right? We just need to grab this. But I guess since we're using the entire file itself, it's just going to be, I guess, a substring, right? Because let's think about it. We're taking it from the source itself. The source should be a string, so the whole file itself will be a string. Um, so I guess we can use substring right on it. So let's say, um, it's a string.
Let's call it lexeme.
Or should I call it lexeme? I guess. Yeah, we'll call it lexeme for now. If it confuses me on the type later, we can change it later. So we're taking lexeme, right? We need to take this value. Not the whole thing, but take this float value. We need to grab it from the file.
So let's say just before that we parsed this, or not parsed, but we went through each of these tokens separately. Now we're on this token. So our start value would be here at the F. So, uh, okay. So we can take our source string that we had, right, and call substring on it.
Uh source does source doesn't have that on it. So okay we need to include main.
Wait, that's probably not a good way.
People include main.
Uh, our source is in here, right? So, how are we going to grab that? Is our source value in here at all?
Oh, yeah, it is. Okay. Wait. So, why wasn't that working?
It's under phone.
Um.
Oh, yeah. It's cuz it's not recognizing as part of the class. All right. There we go. Okay.
Okay. So, source substring. Okay. So, what our value starts here. So, basically our start value would be there. So, uh what were the parameters of substring again?
Let me see. C++ uh substring.
Okay. So, it's going to take the position and also the the length of the substring itself.
Index of the first character is zero.
No. Yeah, that's fine. Okay. So we need to take so our start variable will already be here but we need to take in the remaining position. Okay. Uh so I guess our start value right let's assume that we parsed through let's assume that we went through the whole part this is getting kind of weird. Okay, so let's assume that we went through the whole flow to actually identify that this is a key word here, right? So that would mean our current our current variable would be at the end here, right?
That's what I'm going to assume.
So I guess we could say current minus start I guess and then that would work for us.
So I guess that would probably work, right? Uh let me just think through it again.
Right. So we reach we reach this keyword. We reach this keyword, right?
So we go character by character. Our start is our start variable uh from our class. Our start and our current variable both our start and our current variable are here. So S and C are over here. As we move through each character, our C gets moved one by one until we reach the end here. So now our current is at the end. So it should in theory work if we move it like that.
in theory to brick. We'll see though.
We'll see.
Yeah. So, that's how we add a token. So, I guess that that works.
I mean, yeah, that that should work now.
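The `current - start` reasoning can be checked with a small sketch. `std::string::substr(pos, count)` takes a starting index and a length, not an end index, which is why the subtraction is needed. The struct layout below is an assumption and the literal argument is omitted for brevity.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

enum class TokenType { KEYWORD_FLOAT, IDENTIFIER };  // hypothetical names

struct Token {
    TokenType type;
    std::string lexeme;
    int line;
};

struct Scanner {
    std::string source;
    std::size_t start = 0;    // first character of the token being scanned
    std::size_t current = 0;  // one past the last character consumed so far
    int line = 1;
    std::vector<Token> tokens;

    void addToken(TokenType type) {
        // Length of the token is the distance the cursor has travelled.
        std::string lexeme = source.substr(start, current - start);
        tokens.push_back(Token{type, std::move(lexeme), line});
    }
};
```

With the source `int x = 3; float f = 1.5;`, the `f` of `float` sits at index 11 and the cursor stops at index 16, so `substr(11, 5)` gives `float`. One caveat: `emplace_back(type, lexeme, line)` with bare arguments on a plain struct like this needs either a `Token` constructor or C++20, which is why the sketch uses `push_back` with a braced `Token`.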
Uh what else? Okay, so we implemented add token.
That took longer than I expected, but it's fine.
uh scan token. Okay.
Void. Okay. Make sure we have the scanner name uh scope identifier first before that the name space and the scope identifier and we're doing for the scan token function.
Okay. So let's try and implement scan token now. It's not taking any parameters right. No it's not. So okay.
Oops.
Uh, what exactly do we need?
Scan token. Let me look back at my notes for a second. I I believe scan token is just scanning each individual um character.
So, uh, let's take the word, let's just stick with the same thing, float, the keyword, right? So we're scanning it letter by letter to see if it actually comes up to make that identifier, right? I believe. Uh, let me read my notes real quick. Yeah, individual scan token would be to check for single character literal values or more like that.
Yeah. So, it's going to be checking every single type. So, um I'm just going to set up the functions here.
void string, need to implement that.
So that will be to identify strings for our compiler. Number, right, I keep forgetting the Scanner scope. Scanner, void number, and, uh, identifier. So that's the three types that we need to do.
Okay. I don't know why I keep forgetting this.
That's fine though. All right. Yeah. So, identifier.
All right. So, we can implement those for these edge cases right here depending on what I wanted to do. I wanted to cover some edge cases. So, numbers, we have that specific function.
Uh strings for that specific function.
Um, I guess we can for the scan token function. I'm going to do punctuation and like any single character value or double character value and probably like new line and probably white space as well. And then this is just going to be for identifiers.
So, okay, let's think of how we need to do this.
So we're essentially checking for single literal values, right? So I mean I guess we just make a switch statement and check check for each value. I'd assume uh Okay, let's let's just make like a char variable, right? So, let's say char, right? And um we basically want that token value, right? So, we need to check We need to check this uh we need to use this function here cuz this grabs the value that's up ahead that we need that's in the source. So we can call it like advance and then we can do a switch and okay so uh let's see basically I want to map it out how we kind of do this so let's see so this will be for any literal characters. So, uh it should cover it should cover anything in this enum here, comma, semicolon, left parenthesis, plus minus all these. But for the case of these ones that have two characters, we need to do a further check and make sure we cover those. So for the case of those ones, so let's say we have like an exclamation mark. We need to check if that next character is an equal to because if it is then we can set it as not equal to. If if it's not an equal sign, let's say it's like I don't know like closing bracket or something or some random value, then it will be just not.
So that's how we need to approach this.
That's a terrible one. But yeah, you you understand. I kind of understand it. Uh so that's kind of how we need to approach this. So So checking that next character, right? So let's do the simple cases first. Let's do the first let's do the one character case first.
So um comma is going to be I guess. Yeah.
You know what? I'm going to change this.
I'm going to change this. I'm going to call this single character.
I'm going to call these uh double double character. I'm just going to move this for my ease so I can not miss anything.
So, all these are single. Let's move some of these other ones though. Like, huh. I mean plus does have a plus equal to. So that can be double character.
Right. Right. So, uh, okay. I'm just going to move all these. I'm thinking if I should cover plus-equals, because that's a different type of thing, because plus-equals is just plus, like you're just adding that value to the current thing. So, okay, let's just do it. Let's just go for it. We'll think of edge cases as they arrive. Let's start with comma. We know comma doesn't have anything. So for the case of comma, we can essentially use our add token function and add that in. So add token, uh, we need our token type. So that's going to be comma. So token type comma and the literal value. So, well, that's just going to be comma, I assume, right? Something like that.
something like that, I guess.
Um, and then we just need to break out.
And we do the same for every other case.
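The switch for the single-character cases might be shaped like this. To keep the sketch standalone it returns the token type instead of calling add token, and the enumerator names are guesses at the stream's enum.

```cpp
// Hypothetical enumerators for the single-character tokens.
enum class TokenType {
    COMMA, SEMICOLON,
    LEFT_PAREN, RIGHT_PAREN,
    LEFT_BRACE, RIGHT_BRACE,
    PLUS, MINUS, STAR, PERCENT
};

// If `c` is one of the single-character tokens, store its type in `type`
// and return true. Otherwise leave `type` untouched and return false.
bool singleCharToken(char c, TokenType& type) {
    switch (c) {
        case ',': type = TokenType::COMMA;       break;
        case ';': type = TokenType::SEMICOLON;   break;
        case '(': type = TokenType::LEFT_PAREN;  break;
        case ')': type = TokenType::RIGHT_PAREN; break;
        case '{': type = TokenType::LEFT_BRACE;  break;
        case '}': type = TokenType::RIGHT_BRACE; break;
        case '+': type = TokenType::PLUS;        break;
        case '-': type = TokenType::MINUS;       break;
        case '*': type = TokenType::STAR;        break;
        case '%': type = TokenType::PERCENT;     break;
        default:  return false;  // e.g. '/', which needs a lookahead first
    }
    return true;
}
```

In the scanner itself each `case` body would be an add token call followed by `break`, as described above.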
So, I'm just going to reference this uh semicolon.
Yeah, this should be pretty fast now since I kind of understand Let's see. Colon.
Uh shoot.
Yeah. Okay. Yeah. case.
Uh, left parenthesis right at least.
Wait, what?
Oh, did I Okay, I'm I made a mistake.
Uh, they should be the same thing. So, so I made a mistake there. Left parenthesis.
Wait, wait, wait. No, no, I didn't make a mistake.
Brace should be I did not make a mistake.
Uh, I guess I meant to make that bracket I guess, right? So, yeah, left underscore parenthesis. Okay.
Uh, after this function, I'll probably check with Claude to make sure I'm doing it right.
Uh, same thing with these. So case of parenthesis. Yeah. Uh, I guess for these ones I meant brace, as in like these things.
So we can do for the case of this. So we can token type left brace which will be this value and break.
H dude, I can't type right now. Okay.
Right brace.
Okay. Uh right brace. Yeah.
Break. Plus, minus, slash. Slash is a special case because, uh, a slash can actually be the start of a comment. So we need to actually cover that properly when we get there. Oh yeah, for now we can just add the rest of the cases.
Let's call it case plus, add token, token type plus and then value and then break. Uh, you know what, I'm just going to copy. For the case of minus this is going to be minus and minus. Uh, yeah, minus. We need the star case as well, so star, star, break. Yeah, okay. Now I'll just do percent first. Slash is a special case. Uh, you can have slash slash for a comment. So we need to actually make sure that it can identify whether that's a comment or not. So let's do percent first, percent, the modulus operator, and yeah. Okay. Now here's the thing.
It's a little different for slash because we need to actually check that it's not that's not a slash for comments. So we need to actually check.
So if uh so if our next character So how would we uh exactly get that? So I guess um if we're looking for the next character, we use the match function. I believe the match checks the next character the match. So if match if match is uh so if our next character match we need a char value to be in there. So if our match is this so our next character is also slash we need to check and make sure okay so let me map it out. So, let's get rid of all this. All right.
So, let's say we have like this code right here. Let me just get rid of this.
So, let's say we have this code right here, right? Then we have a slash slash and whatever comment we have here, any amount of comment. And then we continue on.
Let's see out X or X something something something along those lines something like this.
So we have these values, then we have slash slash to indicate a comment. Our lexer shouldn't be looking at the comment text as actual values. Anything past a double slash, our lexer should not be looking at. So there's an edge case, I guess: the comment ends when we reach a new line. At the end of this comment, once we reach a new line, the comment is over and the lexer can look for the next tokens. The other case is the end of the file: the is at end function tells us we're at the end of the file, and then we're also done. So that's two edge cases we need to check. So if we take a peek at the next character, and that character is a new line...
So if we're at a new line, the peek function... oh yeah, we're not passing anything in here. We need to check if peek is equal to new line, or if we're at the end. In either of these two cases we need to, I guess, just return. Or wait, hold up. Okay.
Um I mean I guess we just have the token, right?
Okay. Wait, it's not an if. It's going to be a while, because we're doing this for every single character. I messed that up: a comment can be any number of characters. It has to be while peek is not equal to new line. Let's do not equal to, that's better. So as long as the next character is not a new line character and we're not at the end, we advance for every character.
So we want to advance.
Okay.
So this is our comment. For each character in the comment we keep advancing until we reach a new line or end of file.
Yeah, that's probably a better way. So then we advance. We keep advancing. The lexer doesn't know that we're in a comment right now. It doesn't care that we're in a comment right now. We just keep going until the comment is done.
And the only way for a comment to be done is a new line or the end of the file. So as long as we're not at either of those, we advance.
Uh yeah I guess that's fine.
And then the other case. Else, if our slash character is not followed by another slash, we're not in a comment, so we don't need any of this. We just need to add token like the others: token type slash, and the value is going to be slash. Yeah.
So that's essentially how we cover the edge case for slash.
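Here is a small sketch of the slash handling walked through above, under the assumption that `advance`, `match`, `peek`, and `isAtEnd` behave the way they are described on stream. `SlashScanner` is a hypothetical stand-in for the real scanner class and only recognizes the slash token.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the scanner class; it only knows about '/'.
class SlashScanner {
public:
    explicit SlashScanner(std::string src) : source(std::move(src)) {}

    // Returns the lexemes found. Characters other than '/' are skipped.
    std::vector<std::string> scan() {
        std::vector<std::string> lexemes;
        while (!isAtEnd()) {
            char c = advance();
            if (c != '/') continue;
            if (match('/')) {
                // Line comment: consume up to, but not including, the newline.
                while (!isAtEnd() && peek() != '\n') advance();
            } else {
                lexemes.push_back("/");  // a lone slash is a division token
            }
        }
        return lexemes;
    }

private:
    bool isAtEnd() const { return current >= source.size(); }
    char advance() { return source[current++]; }
    char peek() const { return source[current]; }  // callers check isAtEnd() first

    // Consumes the next character only if it equals `expected`.
    bool match(char expected) {
        if (isAtEnd() || source[current] != expected) return false;
        ++current;
        return true;
    }

    std::string source;
    std::size_t current = 0;
};
```

The loop condition joins the two checks with a logical AND and tests `isAtEnd()` first, so `peek()` is never called past the end of the source. The newline itself is left for the main loop, which is where the line counter would be incremented.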
Um, okay, that's good.
Well, I'm not I'm not exactly I'm not exactly using C++ for my language that I'm compiling. That'd be a bit too complicated. I'm taking a subset of C, which will be just like a couple features. So, I'm probably thinking of implementing like uh let me just take a look at my enum. So I'm probably thinking of implementing just like the the data types, if statements, while while loops, for loops, stuff like that, functions. So it's not going to be like the entire C++ or the entire C. It's just going to be a very small subset. And I can probably build off of that once we keep going.
That's that's my plan.
But yeah, you can write compilers of your same language. I mean, I think many people have done that before.
Okay. Uh I lost my train of thought for a second. So, okay. So, we have Yeah.
Yeah. We're done with this case. Uh we need to go on to the next case now. So, next case. Okay. So now we kind of get to the the double character. So like not equal to equal to equal to greater than or equal to like the equality operators.
So h we need to handle those uh in a special way. So we start off the same. So we're going to check.
So, this is probably my plan.
We're probably going to have maybe a ternary statement or something like that. So, if our current character is an exclamation mark and the next character is an equal sign, then the two together become one token, not equal to. If the next character is something else, like a keyword or whatever, it will just be not on its own.
So that's kind of how we want to approach it with every single equality thing. So I mean I guess it's similar to this, right? So if it's not, we need to check if match is equal to the equal to sign.
if that's the case.
I guess I probably should uh update my title to actually say writing a C compiler in C. That's just what I thought my in my back of my head. But yeah, let's continue with this. So if wait so if the match is equal to if the match is equal to uh it's going to be adding the token of token type uh it's going to be what did I call it?
uh exclamation equal. I'm going to make it easier to read that. Equal equal, greater than or equal to less than equal to.
Dang, it's kind of hot in this room, bro. I even got my fan and everything.
It's still really hot.
Okay, that's whatever.
All right, I'm just like sweating right now. That's why. But yeah. Okay. So, it's going to be exclamation equal.
Uh, not bad.
Make the value.
Drink some water real quick.
All right. Yeah. Not equal to. Next.
Why did I Why did I write it like that?
I didn't mean I didn't mean to do that.
It's going to be this break.
else.
If the next character after the not is not an equal sign, then it will just be add token with the single exclamation.
Um, wait, hold up. Maybe I've been writing this wrong cuz Okay, wait. I've been writing this wrong.
Does it Does that work? Because they're chars, right? They're char values, but in our thing, it's uh it's the variant for string only, not a char.
I might have messed up.
I'll come back to it.
But yeah, this has to be not equal to.
I feel like Claude's going to once I check this all with Claude, Claude's going to say this can't be single.
Oh, we'll see. We'll see. We'll see.
Actually, why not create it now? Let's let's just create it now. So, if my C++ parameter is still string and I pass in char, we're going to be fine if it is a single character.
Uh, well, okay. It didn't it didn't give me the AI summary here. I'm just going to It didn't give me the AI summary here.
So, uh, I mean, I guess it's fine, right? Like, my parameter is just string and I pass in char.
I I think it's fine.
If it's really a problem, I'll fix it later.
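On the string-versus-char question raised above: `std::string` has no constructor that takes a lone `char`, so passing a `char` where a plain `std::string` parameter is expected does not compile. The sketch below shows one way to build the one-character string explicitly. `lexemeOf` is a hypothetical stand-in for an add-token-style function, and the sketch assumes the parameter is a plain `std::string`; if the real parameter is a `std::variant`, the conversion rules are different and are not covered here.

```cpp
#include <string>

// Hypothetical stand-in for a function whose parameter is a std::string.
std::string lexemeOf(const std::string& text) {
    return text;
}

// Builds a one-character std::string from a char using the (count, char) constructor.
std::string fromChar(char c) {
    return std::string(1, c);
}

// lexemeOf('!');            // would not compile: no conversion from char to std::string
// lexemeOf(fromChar('!'));  // fine: explicit one-character string
// lexemeOf("!=");           // fine: string literals convert through const char*
```

Writing the lexeme as a string literal such as `"!"` instead of a character literal such as `'!'` avoids the question entirely.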
But yeah, so that should be that should be the right token to add.
And it's the same it's going to be the same case with the other ones, too.
Um, let's say break.
Okay.
Case uh for the next case which will be Uh yeah, let's do equal sign. Now the next case of equal sign we can essentially just copy this.
So if match if our next token is equal equal then we make this value equal equal.
If not it will just be equal.
And yeah should be the same thing for everything else.
Case greater than or not greater than less than.
So with the equal sign it will be less than or equal to, or else it would just be less than. And I forgot to change it up here.
Equal equal equal.
We just got a couple more cases of this to do then we can move on to the next edge case. next like category of edge cases. Uh yeah, I just need to make sure everything's right. The problem with copy and pasting from like previous code, you might miss some stuff and not correctly correct it properly. Not equal to not equal to equal equal equal less or equal to less than. Okay.
Uh greater than should be the same case.
So greater than or equal to greater equal or greater.
Okay, just cracking my finger. Sorry. Sorry if that bugs anyone.
Okay. Uh yeah, I think that's all the edge cases.
Exclamation equal equal equal equal.
I'm going to move these over to the single character equal greater.
So now this should accurately do it.
Yes, I think we covered everything. All these cases for the single character and then double character. Okay. Um there are some other edge cases we need to handle.
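The one-or-two-character operators described above can be sketched with a single `match` helper and a ternary per case. The enum names are hypothetical (the stream mentions "exclamation equal" but the exact spelling is not visible), and `scanOperators` is an illustrative free function, not the scanner's real method.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical enum names for the comparison and assignment operators.
enum class TokenType {
    EXCLAMATION, EXCLAMATION_EQUAL,
    EQUAL, EQUAL_EQUAL,
    LESS, LESS_EQUAL,
    GREATER, GREATER_EQUAL
};

// Scans only the operators above; every other character is ignored in this sketch.
std::vector<TokenType> scanOperators(const std::string& source) {
    std::vector<TokenType> tokens;
    std::size_t current = 0;

    // Consumes the next character only when it equals `expected`.
    auto match = [&](char expected) {
        if (current >= source.size() || source[current] != expected) return false;
        ++current;
        return true;
    };

    while (current < source.size()) {
        char c = source[current++];
        switch (c) {
            case '!':
                tokens.push_back(match('=') ? TokenType::EXCLAMATION_EQUAL : TokenType::EXCLAMATION);
                break;
            case '=':
                tokens.push_back(match('=') ? TokenType::EQUAL_EQUAL : TokenType::EQUAL);
                break;
            case '<':
                tokens.push_back(match('=') ? TokenType::LESS_EQUAL : TokenType::LESS);
                break;
            case '>':
                tokens.push_back(match('=') ? TokenType::GREATER_EQUAL : TokenType::GREATER);
                break;
            default:
                break;
        }
    }
    return tokens;
}
```

Because `match` only consumes the second character when it is an equal sign, the single-character form falls out of the same case without a separate branch.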
Uh that edge case would probably be Yeah.
New line, and also whitespace. Whitespace should generally be ignored, and a new line should increment the line by one. So yeah, let's do that. So we have a case of... oh my god. Okay. So, the case of this, just a single space character, I guess. And how do we just ignore that? We just advance forward, right? I guess we just advance forward. Let's do the same thing for... I need to actually look this up. Is \t ignored by compilers? Because I'm pretty sure that's just a user thing.
Uh, interpreted as whitespace. Okay, that's what I meant. I didn't mean ignored, I meant they treat it as whitespace. I think there's one more, \r. Is \r treated as whitespace?
I guess so.
Yeah. Okay. So, we can treat those also as white space. So, I guess we can just do like a fall through just fall through these.
So if it's \t or \r, we can fall through and break.
And then the last and final case which is our beloved new line character.
So let's see, that's just going to increment line by one. That's pretty simple. We have our value for line over here, and whenever a new line hits we just increment that by one. So it's just going to be ++line, I assume. Yeah, just ++line. We can keep it on the same line.
Plus plus line.
Oh, yeah.
There.
Okay.
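A minimal sketch of the whitespace and newline cases just described: space, `\t`, and `\r` share one break through stacked case labels, and `\n` bumps the line counter. `ScanResult` and `skipWhitespace` are hypothetical names used only for this illustration.

```cpp
#include <string>

// Hypothetical result type: the final line number and a count of
// characters that were not whitespace.
struct ScanResult {
    int line;
    int significant;
};

ScanResult skipWhitespace(const std::string& source) {
    ScanResult result{1, 0};  // line numbers start at 1
    for (char c : source) {
        switch (c) {
            case ' ':
            case '\t':
            case '\r':
                break;              // whitespace: stacked labels, no token produced
            case '\n':
                ++result.line;      // a newline only moves the line counter
                break;
            default:
                ++result.significant;
                break;
        }
    }
    return result;
}
```

In the real scanner the character has already been consumed by the time the switch runs, so ignoring whitespace needs nothing more than the break.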
any other things. Uh we can have a default case.
Uh we can just say uh error. We can call the error function.
Uh so error handling error.
Oh, wait.
We We haven't uh included that. So, we can't do that.
Wait. So, Uh yeah, let's just make that same that same that same thing in here.
I mean, I guess we can put in the class itself, right? So, let's just Okay. But no, we can't because we haven't Do people do people include that in?
I'm not actually sure. Uh, okay. For now, I'm just going to keep it simple.
I'm just going to say, let's just call std::cerr.
Why C++ instead of C? What's up, Jennifer? How you doing? Uh, I'm making a C compiler, like a subset of C, but I'm writing it in C++. I need to change the title. It's kind of uh maybe might be misleading. I'll change the title right now.
Writing a C compiler.
Okay, I changed the title.
So, it should be more accurate now because a couple people have asked me that already. So, yeah.
Uh, we'll just call it unknown operation. I need to figure out what's going on with that other error thing.
Uh, why is it not? What's going on?
Oh, I totally forgot. Oh my god, that's such a silly thing to mess up.
Okay. Yeah. All right, there we go. It's std::cerr, unknown operation, and then break out of that. So, that's going to be our default case. We've handled all our edge cases now. For any single-character token like a comma or a semicolon, it should handle it fine.
But for anything like an exclamation mark, we check if the next character is an equal sign, and if it is, we make that not equal to. If it's not, then we just make it not. And we did that for every one of these. We also handle the edge case for slash, because slash can start a comment, so we need to decide whether it's a comment or division. So we covered that properly.
Um, okay. What else? White space, we covered. Any new line, we covered. I think that's it. I think that's it for this function.
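For the default case mentioned above, one simple option is to print a message to `std::cerr` with the line number. The message wording and both function names below are assumptions made for this sketch; the stream only shows that the default case reports an error and breaks.

```cpp
#include <iostream>
#include <string>

// Builds the diagnostic text; the format is an assumption for this sketch.
std::string unknownCharMessage(char c, int line) {
    return "[line " + std::to_string(line) + "] Error: unknown character '" +
           std::string(1, c) + "'";
}

// What a default: case could call before it breaks out of the switch.
void reportUnknown(char c, int line) {
    std::cerr << unknownCharMessage(c, line) << '\n';
}
```

Keeping the message in its own function makes it easy to check the text without capturing the error stream.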
I'm not going to lie, this function took longer than I thought it would. But yeah. Okay. How long have you been streaming?
for quite a bit. I think I can do a bit more.
I think I can go a bit longer. Uh yeah, everything looks good here.
I'm going to see what's going on.
What scan token? Wait.
Well, we just finished that thing. Okay, I see what's going on. I accidentally missed um that.
There we go. Okay, I totally forgot about that. Yeah, now scan token should be implemented. Yeah, no more squiggly line and our string is fine. Okay, now here's the hard part.
It was all fun and games until strings.
Actually, it shouldn't be too bad, but let's let's see. So a string is kind of different, right? So a string takes in it starts with our double quote and everything after the double quote until it reaches the next double quote is considered part of the string. So no matter what no matter what is in the string, it shouldn't complain.
So even if you have like something like int x= 3, even if it's in the string, it shouldn't complain that something's going wrong. So string should basically just cover everything inside these little quotations. So what I'm probably going to do is probably Okay, so let me kind of map it out. So, we should probably check check that our character is check if our character is a double quote. So, we need to check if our character is double quote first.
If that's the case, we need to loop until we reach the end of it. Loop until end of string.
And the way we can tell that it's the end of the string is by reaching the next double quote. So our loop conditions essentially have to be, like, as long as we're not at the end of the file.
So something like not is at end, and the char is not equal to double quote: as long as we're not at the end of our file and the character is not a double quote, we keep going.
So that's kind of how we need to approach this.
Um I guess yeah that's how we kind of do that I assume.
And at the end of that we need to we need to add the token.
We need to add the token for the string itself. So the string we probably need to do some sort of like substring operation I believe.
uh like kind of like how we did here. We need to add the token but it needs to be it needs to be that string value itself.
It needs to be that string literal itself, right? So how would we exactly do that?
I mean kind of like how we did earlier, right? Just take a substring.
Take a substring. We encountered this earlier, actually, for grabbing the lexeme value. That's similar to how we would probably do it here, taking a substring. So I assume we'd take a substring based on our current position minus the starting position, whatever that is. We'll figure that out when we get there. A substring of the source. And you also need to add the token at the end as well.
We need to append the token for Oh, wait. We didn't actually Wait, we didn't actually cover that. We didn't cover that actually.
I I totally missed that. Uh let's let's do that actually.
That should consider that too. We should consider that too. So for the case of our double quote, that's also going to be a token itself.
Uh, so we can call token type and I'm going to make a numeration in here.
How long have I been doing C++? Uh, just for a couple months. I think I started around March.
Yeah, March. But it wasn't my first programming language. I had experience with Python beforehand for like a yearish in university.
I'm in my second year of university.
Well, I'm about to be in my third year in a couple months, but yeah, just for a couple months. I'm just trying to I still have a lot to learn.
I'm just kind of going with it, to be honest.
The best way to learn is by building.
So, I appreciate it. I appreciate it.
I mean, all you got to do is really just start. You'll get there to a certain point. I'm still really far away from where I want to be. So, uh, yeah, let's make the enumeration.
Uh, it's a single character, so let's just call it like double quote, I guess. Yeah, double quote. So, now we can actually implement that in here.
Double quote.
Okay. Yeah. So, that should cover that.
And yeah, to be honest, we don't need the add token for that in here anymore then, because the add token for that double quote is covered in this function here, in scan token.
Okay, let's start implementing our string function. Let's just try and go with it. It's going to be difficult, but let's see. All right. First we need to check if it's a double quote. I guess we could set a char, not a string, like how we did there. Or should we? We don't really need it. It's similar to how we did the peek, right? We just want to peek at the value that we're on. Okay. So while we call peek, as long as it's not equal to our double quote value, and as long as it's not at the end. Yeah. So if it's not at end.
So that's going to be the starting of the loop. So now we actually need to check.
Um h everything in the string is not going to be really taken to be honest.
So I guess we just advance we keep advancing until we reach the end of the string. So we'll start with S J and go all the way until the end until 9. And then we need to check once it hits that while while loop statement that it's a a double quote then it will exit out of the loop and then we can take it from there. Basically the lexer needs to not know what's going on with the string. The lexer needs to we need to know that the string is constructed afterwards but we don't need to actually like use a lexer to lex through each character of the string cuz everything should be ignored.
Uh okay. Okay. So, let's see what's next. Let's think of what's next.
Strings are kind of weird. Uh, yeah.
Let's think of some edge cases here for strings. What What edge cases would be there for strings?
Um, I know for sure we would need an advance here because we need to because this only covers up till the end the nine. We haven't covered this thing. So, we need to advance one more time after.
Uh, I'm going to think of some edge cases, but let me just write the substring.
I'm not going to lie, edge cases are kind of hard to think of off the top of my head. I might need to look it up or probably ask some AI for edge cases because AIS are really good at that. I can't think of the edge cases off the top of my head for strings.
But yeah, let's see. Okay. So this is very similar to what we did earlier when we needed to add the token and take the lexeme. We needed to take a substring of the source.
Let me map it out again. I'm just going to get rid of all this.
Let's say we're grabbing a string here. Let's call it I'm just going to write a string.
Uh let's say yes like this or something something like that. This will be our string value, right?
Oh my god. All right. Uh yeah. Okay. So this is a string.
What sources did you learn from? Uh I learned from a book. I'm still reading it, but yeah, this is a it basically covers everything that you need to know.
This is written in Java, so you can't really take the code unless you want to like take the code, but yeah, this basically like teaches you literally everything that you need to know about the lexer.
This is for interpreters, but um the front end of compilers and interpreters are the same. So you don't need to worry about that. Once you get to the back end of a compiler, it's different from an interpreter, but the front end does the same stuff. So I'll probably use a different book after I do the front end for the compiler.
Oh, you mean C++ in general? Oh, yeah.
That's uh learn CPP.
I can't type right now. Okay. Yeah, learn CPP. This has like everything and also YouTube videos as well because to be honest, learn CPP is kind of slow.
So if you're not really into reading huge amounts, you're not going to learn too fast. So I recommend using learn CPP to learn the nuances of stuff, but then using YouTube videos or any other book to understand other concepts like templates, OOP, data structures, stuff like that. That's what I would recommend because that's how I've been doing it.
But if you do learn CPP by itself you're going to take a while like control flow.
You only get to control flow at chapter 8. So you have to read like seven chapters before you even get to like if statements and loops and like classes and stuff is like chapter 14. So it's going to take a while if you just go learn CPP. So kind of mix mix and match. Learn CPP is good for the nuances like very the nuances of the language to understand every little edge case stuff like that. So yeah let's continue.
Yeah, I was explaining this right. So, okay. So, this earlier we we had to take a substring of the source to actually add the token. It's very very similar.
So, currently at the beginning, our start position will be at this at this um double quote, right?
We're going to be needing to return a substring of that plus one.
And as we go through the string, our current, okay, I'm getting a bit confused here. So our start value and our current value, our variables, our variables we have in here to actually go through the string. Our start and our current value are set to the start of the string. Our start value stays there, but our current value increments by one for each character that we go through. So by the end of the string our current our current variable will be here or would it be here? Yeah, it would be here because we did that. Yeah, we advanced twice. So it would be there.
Our current value would be here. But we need to grab everything in between here and append that. We don't want to append the quotation. So, we need our start + one and we need current uh minus start uh wait.
So, okay, let me just write it out. Let me write it out real quick. So, it's going to be the string.
Let's say string literal. It's going to be the source itself, like how we did earlier: substring. And substring takes the starting position and the length, right? So our starting position needs to be start plus one, because start is still stuck at the quotation mark. We need start plus one to grab the first character, and we need to grab the rest of it too. So our current value is here, right? So I guess current minus... okay, if we do minus one, then it will cover that. Because the index starts from zero, we need to do current minus two, I believe.
It's kind of hard to think about it.
Let me use a smaller string. Let me use a smaller string to actually map it out.
Probably let's say like John or something.
It'll be easier for me to do it. So our current position we moved it to be here. That's our starting position.
And we want to get the length of the string. So 1 2 3 4 right. Okay. So we want to take four as our substring. Let's say we have that I believe. So how do we achieve that? So this is going to be one, right?
Cuz this would be 0 1 2 3 4 5. So this would be five right here, right?
Uh so I guess minus one, I guess. Current minus one, I believe.
H.
Yeah, I I guess current minus one if I'm thinking about it right.
Uh let me read the advanced part again.
Let's see if I have the right thing for substring.
Let me just read the docs again real quick.
Okay, right here. Let's just use this example. So they have geeks.
0, 1, 2, 3. So yeah, they're starting at position three right there.
Uh one, two. So they're only taking K and S.
Okay, I think that's right then.
I think that's right.
I hope because I I literally mapped it out and everything. So, if it's not right, then that that'll be kind of problematic.
It should be right.
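The index arithmetic above is easier to check in code. `std::string::substr` takes a starting position and a length, so `std::string("geeks").substr(3, 2)` yields `"ks"`. For a string literal, a general form is sketched below, assuming `start` is the index of the opening quote and `current` is one past the closing quote (that is, after the extra advance). `literalBetweenQuotes` is a hypothetical helper name.

```cpp
#include <cstddef>
#include <string>

// Extracts the text between the quotes of a string literal.
// Assumes `start` indexes the opening quote and `current` is one past the
// closing quote, so current >= start + 2 always holds.
std::string literalBetweenQuotes(const std::string& source,
                                 std::size_t start,
                                 std::size_t current) {
    // Skip the opening quote, then drop both quotes from the length.
    return source.substr(start + 1, current - start - 2);
}
```

For the source `"John"` with the opening quote at index 0, `current` is 6 after the closing quote is consumed, and the length comes out as 6 - 0 - 2 = 4. A length of "current minus one" gives the same 4 only if `current` is still on the closing quote (index 5) and `start` is 0; once the literal begins later in the file, `start` has to be subtracted as well.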
Going to be right back for a bit. I'll leave your stream open. I appreciate it.
I appreciate it.
I'll probably only be streaming for a little bit, a little while longer anyways. So, I think that's right. I really hope I'm right there. Current minus one.
Okay. Uh yeah, now we need to add the token cuz now we need to add the token. So it's going to be uh adding the token takes in parameters of token type and also it's going to be token type.
I believe it was string. Yeah. String string and our string literal value.
So that should be good.
Oh my god. Okay. Yeah, my brain is starting to hurt a little bit, not going to lie. Uh yeah, I think we're I think we're doing fine, though.
Yeah. So now we uh appended the token of the string.
Uh what other Okay, let's think. Right.
I'm going to actually type uh what are the some edge cases for strings.
I probably won't get a good response here. They probably don't understand my thing here.
Let's see. Oh, actually they do give a good amount. Okay. Um single character. I guess I'm just going to have single character behave normally. Uh, empty string. That should be fine, too.
I mean, we kind of did handle that already in a way, I guess. Uh, everything inside the string should be fine. Special characters.
Oh, wait. Uh, there is that one edge case, right?
Yeah.
Um yeah for new line no input.
Oh input. Okay. Uh yeah.
Oh wait, wait.
I believe they are. Let me check real quick. Let me check for a second. Are all the strings C-style?
I believe they do have a null terminator in a C-style string literal. Yeah, null terminator.
They can contain escape sequences. So, we need to cover those edge cases as well. They can support any character. We already did that. So, yeah. Okay.
Yeah, let's cover that. So um so essentially we so we have if we have a string like this I believe it's like I believe it's within the string itself it's going to be like one second guys I'm kind of just mapping it out real quick it's going to be like null terminated like this I believe right but it's not it's It's not included within the string. It's just like implicitly there, I think.
So I don't think we need to cover that, but we need to make sure that this next character is a quotation mark. After we come out of the loop, we need to validate that, because if we hit the end of the file after coming out of the string, that means it wasn't properly closed with a closing quote.
So if it's at the end, I guess we can just call std::cerr.
It's an unterminated string.
Obviously the code myself here.
So that should cover that, and we also don't want the function to keep going after that. That should cover that edge case. Now we need to cover a new line.
I mean, I guess we have to cover that in here itself though, inside the string.
So if our next character, if we peek ahead, is a new line character, then we need to advance the line by one.
It's plus plus line. So that should cover that.
I think that's it.
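Putting the string steps together, here is a hedged sketch of the whole routine: loop until the closing quote or end of file, count newlines inside the literal, report an unterminated string, then consume the closing quote and take the substring. It is written as a free function with a hypothetical `StringScan` result type (C++17 for `std::optional`); the stream's version is a member function that calls add token instead of returning a value. Escape sequences such as `\"` are not handled here.

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical result type for this sketch.
struct StringScan {
    std::optional<std::string> value;  // empty when the literal is unterminated
    std::size_t current;               // index just past the last consumed character
    int line;                          // line number after scanning the literal
};

// `start` must be the index of the opening double quote.
StringScan scanString(const std::string& source, std::size_t start, int line) {
    std::size_t current = start + 1;

    // Both conditions must hold to keep going: not at the end AND not at a quote.
    while (current < source.size() && source[current] != '"') {
        if (source[current] == '\n') ++line;  // newlines inside the literal still count
        ++current;
    }

    // Reaching the end of the source means there was no closing quote.
    if (current >= source.size()) {
        return {std::nullopt, current, line};
    }

    ++current;  // consume the closing quote

    // Drop both quotes: skip one at the front, subtract two from the length.
    return {source.substr(start + 1, current - start - 2), current, line};
}
```

A caller could print an "unterminated string" error when `value` is empty and otherwise add a string token carrying `*value`. Checking the end-of-file condition before consuming the closing quote is what keeps the substring length from going negative.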
I think that should cover it.
How long have we been streaming?
Around two hours.
Yeah, it's getting pretty late though.
Getting pretty late here. I'd say we made really good progress today, to be honest. I don't know if we 100% did it correctly, but maybe around 95% there in terms of accuracy. We'll continue this tomorrow. So, today we basically finished setting up the class. We got the vector for scanning all the tokens in the file, adding the token to the vector whenever we scan a lexeme, and then we handled some edge cases for different types. So the single-character and double-character tokens, we did the whole function for that, and also the string. I believe the string function is done. So I believe we covered that.
Tomorrow we can do number and also identifiers and any other things I need to add. So yeah, I guess I'm going to end the end the stream off there. I appreciate you all for joining and I'll see you guys tomorrow.