Implement collecting errors while tokenizing #2911

Merged: 12 commits, Feb 3, 2020

Conversation


@Veetaha Veetaha commented Jan 26, 2020

Now we are collecting errors from rustc_lexer and returning them in ParsedToken { token, error } and ParsedTokens { tokens, errors } structures ([UPD]: this is now simplified, see the updates below).

The main changes are introduced in ra_syntax/parsing/lexer.rs. It now exposes the following functions and types:

pub fn tokenize(text: &str) -> ParsedTokens;
pub fn tokenize_append(text: &str, parsed_tokens_to_append_to: &mut ParsedTokens);
pub fn first_token(text: &str) -> Option<ParsedToken>; // allows any number of tokens in text
pub fn single_token(text: &str) -> Option<ParsedToken>; // allows only a single token in text

pub struct ParsedToken  { pub token: Token,       pub error: Option<SyntaxError> }
pub struct ParsedTokens { pub tokens: Vec<Token>, pub errors: Vec<SyntaxError>   }

pub enum TokenizeError { /* Simple enum which reflects rustc_lexer tokenization errors */ }

In the first commit I implemented it with iterators, but then decided that, since this crate is ad hoc for rust-analyzer and we clearly see all of its usage sites, it would be better to simplify it to vectors.
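For illustration, a caller could consume this API roughly as follows (a hedged sketch, not code from this PR; the import path and the Display impl for SyntaxError are assumptions):

```rust
// Hedged usage sketch of the API above: lex a file and report lexer errors.
use ra_syntax::parsing::lexer::{tokenize, ParsedTokens};

fn lex_and_report(text: &str) {
    let ParsedTokens { tokens, errors } = tokenize(text);
    println!("lexed {} tokens", tokens.len());
    for error in &errors {
        // Assumption: SyntaxError implements Display.
        eprintln!("lexer error: {}", error);
    }
}
```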

This is currently WIP, because I want to add tests for the error messages generated by the lexer.
I'd like to hear your thoughts on how to define these tests in the ra_syntax/test-data dir.

Related issues: #223

[UPD]

After the PR review the API was simplified:

pub fn tokenize(text: &str) -> (Vec<Token>, Vec<SyntaxError>);
// Both lex functions do not check for unescape errors
pub fn lex_single_syntax_kind(text: &str) -> Option<(SyntaxKind, Option<SyntaxError>)>;
pub fn lex_single_valid_syntax_kind(text: &str) -> Option<SyntaxKind>;

// This will be removed in the next PR in favour of simplifying `SyntaxError` to `(String, TextRange)`
pub enum TokenizeError { /* Simple enum which reflects rustc_lexer tokenization errors */ }

// this is private, but may be made public if such demand exists in the future (principle of least privilege)
fn lex_first_token(text: &str) -> Option<(Token, Option<SyntaxError>)>;
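For example, validating a rename candidate with the simplified API could read like this (a sketch, not code from this PR; the re-export path is an assumption, and the IDENT/UNDERSCORE check mirrors the rename snippet discussed in the review below):

```rust
// Hedged sketch: `lex_single_valid_syntax_kind` returns Some(kind) only when
// the whole input lexes as exactly one token with no errors.
use ra_syntax::{lex_single_valid_syntax_kind, SyntaxKind};

fn is_valid_new_name(new_name: &str) -> bool {
    match lex_single_valid_syntax_kind(new_name) {
        Some(SyntaxKind::IDENT) | Some(SyntaxKind::UNDERSCORE) => true,
        _ => false,
    }
}
```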

@Veetaha Veetaha requested review from matklad and kiljacken January 26, 2020 19:06
@@ -41,37 +41,42 @@ fn reparse_token<'node>(
root: &'node SyntaxNode,
edit: &AtomTextEdit,
) -> Option<(GreenNode, TextRange)> {
let token = algo::find_covering_element(root, edit.delete).as_token()?.clone();
match token.kind() {
let prev_token = algo::find_covering_element(root, edit.delete).as_token()?.clone();
Contributor Author

Spent some time figuring out what is going on in this method to properly integrate the changes from the lexer. The core implementation here didn't change; I just renamed some variables for clarity.

Contributor

Good names do a lot for code readability 👍

@@ -97,6 +102,9 @@ fn reparse_block<'node>(
fn get_text_after_edit(element: SyntaxElement, edit: &AtomTextEdit) -> String {
let edit =
AtomTextEdit::replace(edit.delete - element.text_range().start(), edit.insert.clone());

// Note: we could move this match to a method or even further: use enum_dispatch crate
// https://crates.io/crates/enum_dispatch
Contributor Author
@Veetaha Veetaha Jan 26, 2020

This is my proposal for future refactoring; I want to hear your opinion on it.

Honestly, I just liked the idea of static enum dispatch, which relieves you from making traits object-safe and replaces dynamic dispatch.
It even reduces the boilerplate of such matches.

Though you may consider this to be unnecessary overhead, and I won't argue too much.

Contributor

I think there's some consensus that we already have a bit too many dependencies, so adding another one, especially a procedural macro, is probably not a good thing if we want to keep build times down.

Member

I actually feel pretty strongly about not using "helper" crates that try to bolt idioms onto a language. I feel that the accidental complexity introduced by them is almost always larger than any boilerplate savings. The notable exception is the impl_froms macro, which we use throughout the hir, and which I feel is justified, in that it's a very simple idea, and the amount of boilerplate saved is significant.

The reason why we need an explicit match here is that the two .text() calls are two very different things in this case: the first one is a &str, and the second one is a tree structure -- a view into the node. If we had a generic trait Text, then making a type NodeOrTokenText<'a> = NodeOrToken<SyntaxText, &str>; with impl Text for NodeOrTokenText<'a> would be reasonable, but we don't have such a trait yet, and designing it right is really hard.
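For context, the explicit match being discussed looks roughly like this (a paraphrased sketch of get_text_after_edit, assuming rowan's NodeOrToken shape):

```rust
// The two arms look symmetric, but token.text() yields a &str while
// node.text() yields a SyntaxText -- a lazy view into the tree.
match element {
    NodeOrToken::Token(token) => edit.apply(token.text().to_string()),
    NodeOrToken::Node(node) => edit.apply(node.text().to_string()),
}
```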

Contributor Author
@Veetaha Veetaha Jan 27, 2020

Okay, thank you for the opinions!

// FIXME: Location should be just `Location(TextRange)`.
// The TextUnit enum member just unnecessarily complicates things;
// we shouldn't treat it specially, it's just a `TextRange { start: x, end: x + 1 }`.
// see `location_to_range()` in ra_ide/src/diagnostics
Contributor Author
@Veetaha Veetaha Jan 26, 2020

I think this should be a target for future refactoring. I am not entirely sure why we don't just use TextRange as the location of a SyntaxError instead. Or at least make the location a newtype over TextRange? What do you think?

Contributor

Seems reasonable. In the end, when used to emit a diagnostic, it gets turned into a TextRange anyways: https://github.com/rust-analyzer/rust-analyzer/blob/d1330a4a65f0113c687716a5a679239af4df9c11/crates/ra_ide/src/diagnostics.rs#L118-L123
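The linked conversion is essentially the following (a paraphrased sketch, not the exact code):

```rust
// Both variants collapse into a TextRange, which suggests Location
// could simply be a TextRange to begin with.
fn location_to_range(location: Location) -> TextRange {
    match location {
        Location::Offset(offset) => TextRange::offset_len(offset, 1.into()),
        Location::Range(range) => range,
    }
}
```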

Member

Yeah, I think this should just be a TextRange.

Contributor Author

Yeah, I just referred to that snippet in the comment.

Contributor

That you did 😆

fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
let msg = match self {
TokenizeError::EmptyInt => "Missing digits after integer base prefix",
TokenizeError::EmptyExponent => "Missing digits after the exponent symbol",
Contributor Author

I wish rustfmt were not so clippy; initially I wrote this match with each arm homogeneously and nicely wrapped in curly braces, but the formatting tool demands the shorthand syntax here...

Contributor

You could always #[rustfmt::skip] this one function. We do the same for a few match statements elsewhere.

Member

I am ok with sticking #[rustfmt::skip] on top in similar cases.
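Concretely, the opt-out is a single attribute on the item (an illustrative sketch; the elided arms and the write! tail are assumptions):

```rust
// Illustrative: skip rustfmt for one function to keep hand-aligned arms.
#[rustfmt::skip]
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
    let msg = match self {
        TokenizeError::EmptyInt      => "Missing digits after integer base prefix",
        TokenizeError::EmptyExponent => "Missing digits after the exponent symbol",
        // ... remaining variants elided ...
    };
    write!(f, "{}", msg)
}
```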

pub struct SyntaxTreeBuilder {
errors: Vec<SyntaxError>,
inner: GreenNodeBuilder<'static>,
}

impl Default for SyntaxTreeBuilder {
Contributor Author

I am not sure why the Default trait was implemented here by hand. Could you please elaborate, @matklad?

Member

At the time this was written, I forgot to add a Default impl to GreenBuilder, and was too lazy to publish a new version of rowan just for that: rust-analyzer/rowan@a3692a9#diff-1a2b23ceb6e534bd16f63d2e55e537aaR154.

In general, if something looks like it doesn't make sense, it probably indeed doesn't make sense :)
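With that fixed upstream, the hand-written impl can presumably collapse to a derive (sketch):

```rust
// Sketch: once rowan's GreenNodeBuilder implements Default,
// the manual `impl Default` can be replaced with a derive.
#[derive(Default)]
pub struct SyntaxTreeBuilder {
    errors: Vec<SyntaxError>,
    inner: GreenNodeBuilder<'static>,
}
```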

Contributor Author

I've stumbled on enough things that didn't make sense to me but actually existed for a reason, so I'd rather ask the author first before amending them. Anyway, thank you for the clarification!

Contributor Author

Veetaha commented Jan 26, 2020

I know that ra_syntax::SourceFile::parse() returns a Parse<SourceFile> type that has the same semantics as ParsedToken[s]; we could reuse it (with some changes to it) and return Parse<Token> instead. But this would be a task for refactoring, and I am not sure whether it is worth generalizing it this way... What are your thoughts?

@@ -10,7 +10,8 @@ use crate::{fuzz, SourceFile};
#[test]
fn lexer_tests() {
dir_tests(&test_data_dir(), &["lexer"], |text, _| {
let tokens = crate::tokenize(text);
// FIXME: add tests for errors (their format is up to discussion)
let tokens = crate::tokenize(text).tokens;
Contributor Author
@Veetaha Veetaha Jan 26, 2020

I'd like to make these tests data-driven, but I am not sure which format would be the most suitable...

Your input is very important here, guys!

Contributor

You could probably just update dump_tokens to take ParsedTokens and then also output the errors like we do the tokens. If you don't print anything when there are no errors, you shouldn't even need to touch the existing tests.

Member

Just dumping errors in dump_tokens after the tokens themselves seems good to me.

Unrelated, but perhaps a slightly more information-rich way to write this would be:

let ParsedTokens { tokens, errors: _ } = crate::tokenize(text);

I.e., explicitly naming the thing you are ignoring (which is usually important for errors).

Contributor Author

Okay guys, will do!
Also, @matklad, good point on explicitly dropping errors. Do you think we should use this pattern in all places where tokenize*()/single_token()/first_token() are called?

Member

I think so! In general, I just always try to destructure tuples that way:

13:49:36|~/projects/rust-analyzer|HEAD✓
λ rg  'let .*, _\w+\)'
xtask/src/codegen/gen_syntax.rs
167:    let punctuation_values = grammar.punct.iter().map(|(token, _name)| {

crates/ra_hir_def/src/generics.rs
59:        let (params, _source_map) = GenericParams::new(db, def.into());

crates/ra_hir/src/code_model.rs
1072:        let (adt, _subst) = self.ty.value.as_adt()?;

crates/ra_batch/src/lib.rs
53:    let (crate_graph, _crate_names) =
147:        let (host, _roots) = load_cargo(path).unwrap();

crates/ra_hir_ty/src/utils.rs
141:            let (_total, parent_len, _child) = self.len_split();

crates/ra_parser/src/grammar/expressions.rs
27:    let (cm, _block_like) = expr(p);

crates/ra_hir_def/src/nameres/collector.rs
947:        let (db, _file_id) = TestDB::with_single_file(&code);

crates/ra_hir_ty/src/infer/coerce.rs
253:                let (last_field_id, _data) = fields.next_back()?;

crates/ra_parser/src/grammar/expressions/atom.rs
560:    let (completed, _is_block) =

@Veetaha Veetaha mentioned this pull request Jan 27, 2020
Contributor
@kiljacken kiljacken left a comment

Overall this looks good. A few small nits to the actual code, but nothing major.

Comment on lines 20 to 24
match single_token(new_name)?.token.kind {
SyntaxKind::IDENT | SyntaxKind::UNDERSCORE => (),
_ => return None,
Contributor

I love how this turned out 👍

/// In general `self.errors.len() <= self.tokens.len()`
pub errors: Vec<SyntaxError>,
}
impl ParsedTokens {
Contributor

Nit: Add a blank line above this.

.map(|error| SyntaxError::new(SyntaxErrorKind::TokenizeError(error), token_range)),
};

type ParsedSyntaxKind = (SyntaxKind, Option<TokenizeError>);
Contributor

Defining the alias here probably takes up more space than just writing it out four times. Don't really care too much either way.

Contributor Author

Okay, I'll try to refactor it according to this comment



Comment on lines 196 to 198
// FIXME: it seems this initialization statement is unnecessary (see edit in outer scope)
// Investigate whether it should really be removed.
let edit = AtomTextEdit { delete: range, insert: replace_with.to_string() };
Contributor

Neither range nor replace_with change, and edit is always passed as reference, so I'm inclined to agree that this line could probably be removed.



Comment on lines +87 to +93
// FIXME: the obvious pattern of this enum dictates that the following enum variants
// should be wrapped into something like `SemanticError(SemanticError)`
// or `ValidateError(ValidateError)` or `SemanticValidateError(...)`
Contributor

Let's leave it at the tokenizer changes, at least for this PR. Then you can do this as a separate PR if it seems useful.

Contributor Author

Yes, I described the summary and future todos in a comment under the original issue. I'll see what should be done in the next PR.


@kiljacken
Contributor

I know that ra_syntax::SourceFile::parse() returns a Parse<SourceFile> type that has the same semantics as ParsedToken[s]; we could reuse it (with some changes to it) and return Parse<Token> instead. But this would be a task for refactoring, and I am not sure whether it is worth generalizing it this way... What are your thoughts?

Parse<T> is currently rather specialized for things that actually contain a syntax tree from rowan, so this should probably just remain separate.

Member
@matklad matklad left a comment

Left a bunch of nitpicky comments!

I don't really have strong opinions on them, but as this is partially refactoring work, I'd like to pay more attention to detail than usual, to make sure we have a somewhat shared understanding of what the "rust-analyzer" style is.

Overall, LGTM!

pub error: Option<TokenizeError>,
}
impl ParsedToken {
pub const fn new(token: Token, error: Option<TokenizeError>) -> Self {
Member

Do we actually use this function? In general, I try to avoid adding trivial `new`s for structs where all fields are public, as a struct literal is more readable at the call-site (since fields have names). The exception is if the struct is created in a lot of different places, where the method call might be more concise.
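I.e., the two call-site styles side by side (illustrative only):

```rust
// A struct literal names the fields at the call site...
let parsed = ParsedToken { token, error: None };
// ...while a trivial constructor hides what each argument means:
let parsed = ParsedToken::new(token, None);
```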

Member

Similarly for the Token above

Member

Also, in the case of ParsedToken, I would probably use type ParsedToken = (Token, Option<TokenizeError>), because semantically it is just a pair of a token and an error.

Contributor Author

Didn't expect you would dig into the commits; I did rethink this and removed those functions in later commits.

// We drop some useful information here (see patterns with double dots `..`).
// Storing that info in `SyntaxKind` is not possible due to its layout requirement
// of being a `u16`, which comes from the `rowan::SyntaxKind` type, and changing
// `rowan::SyntaxKind` would mean a hell of a rewrite.
Member

We don't do this not because we are too lazy to rewrite (I think the syntax tree was rewritten something like four times?), but because dumb, semantics-less nodes are an explicit design decision :)

Contributor Author

Sorry, I think I made this comment a bit misleading. I just wanted to explain why we drop some info here, not to propose changing the design!

TK::Caret => ok(CARET),
TK::Percent => ok(PERCENT),
TK::Unknown => ok(ERROR),
};
Member

I would probably avoid the ok, ok_if functions and do something like this:

let kind = match rustc_token_kind {
    TK::LineComment => COMMENT,
    TK::BlockComment { terminated: false } => return (COMMENT, Some(TE::UnterminatedBlockComment)),
    TK::BlockComment { terminated: true } => COMMENT,
};

(kind, None)

That is,

  • move ok to the happy path, and use return for errors
  • pull the TextUnit::from_usize(token_text.len()) bit out of the function, as it doesn't depend on the specific kind at all.

Contributor Author

Hmm, nice catch, I'll apply this.

pub fn first_token(text: &str) -> Option<ParsedToken> {
// Checking for emptiness because of the `rustc_lexer::first_token()` invariant (see its body)
if text.is_empty() {
None
Member

Suggested change:
- None
+ return None;

and remove the else. We use early returns by default.

@@ -46,8 +46,7 @@ fn reparse_token<'node>(
WHITESPACE | COMMENT | IDENT | STRING | RAW_STRING => {
if token.kind() == WHITESPACE || token.kind() == COMMENT {
// removing a new line may extend the previous token
if token.text().to_string()[edit.delete - token.text_range().start()].contains('\n')
{
if token.text()[edit.delete - token.text_range().start()].contains('\n') {
Member

👍

@@ -84,6 +84,9 @@ pub enum SyntaxErrorKind {
ParseError(ParseError),
EscapeError(EscapeError),
TokenizeError(TokenizeError),
// FIXME: the obvious pattern of this enum dictates that the following enum variants
// should be wrapped into something like `SemanticError(SemanticError)`
// or `ValidateError(ValidateError)` or `SemanticValidateError(...)`
Member

Yeah, in general, I am not sure that using an enum for SyntaxErrors is a good idea. We don't really match on them, so perhaps just an OpaqueError(String, TextRange) would be a better representation.

Contributor Author
@Veetaha Veetaha Jan 27, 2020

Okay, let's leave it for the refactoring to be done soon. Since this crate is called ra_syntax, let's keep the SyntaxError name for that and not some other OpaqueError.
If this crate is used exclusively by rust-analyzer, I am definitely for the generic (String, TextRange) error shape.
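The shape converged on here (and landed in the follow-up PR) is roughly the following sketch; the derives and the constructor are assumptions:

```rust
// Sketch of the simplified error type: just a message and a range.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct SyntaxError(String, TextRange);

impl SyntaxError {
    pub fn new(message: impl Into<String>, range: TextRange) -> Self {
        SyntaxError(message.into(), range)
    }
    pub fn message(&self) -> &str {
        &self.0
    }
    pub fn range(&self) -> TextRange {
        self.1
    }
}
```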




}
TokenizeError::UnstartedRawString => {
"Missing `\"` symbol after `#` symbols to begin the raw string literal"
"Missing \" symbol after # symbols to begin the raw string literal"
Member

I usually like to place some kind of quotes around snippets because that's helpful when a snippet contains trailing/leading whitespace or other invisible characters. But we should just stick to what rustc is doing.

Which... I am not sure what that is, exactly! For this code

λ cat main.rs
fn main() {
    let s = r###;
}

I get this error

[screenshot of the rustc error output]

Note how # is uselessly in quotes, but the empty string between : and ; isn't.

@estebank what are the current guidelines around quotes?

Contributor Author
@Veetaha Veetaha Jan 27, 2020

Heh, I initially wrapped the symbols in backticks (hoping they would be rendered as Markdown), but once I started VSCode and saw that the messages were displayed as raw text, I removed them: 34d72aa

Contributor

We try to consistently surround code and tokens (including identifiers) in backticks. No tool is actually parsing them, but having them bare makes it harder to read. Consider

found , but expected .

vs

found `,` but expected `.`

Contributor Author

Veetaha commented Jan 27, 2020

Thank you for the review and for elaborating on the "rust-analyzer" style; I will take all of that into account!

Member

matklad commented Jan 28, 2020

is this still a wip?

Contributor Author

Veetaha commented Jan 28, 2020

@matklad please don't merge yet, I'll add tests today; then I'll remove the WIP label. Or do you have a different understanding of the WIP label?

Member

matklad commented Jan 28, 2020 via email

@Veetaha Veetaha changed the title [WIP]: Implement collecting errors while tokenizing Implement collecting errors while tokenizing Feb 1, 2020
@Veetaha Veetaha requested review from kiljacken and matklad February 1, 2020 20:25
Contributor Author

Veetaha commented Feb 1, 2020

@kiljacken, @matklad
Please see the changes in the latest commit: 38f1abf
And please don't be terrified by the number of changed files; these are mostly new test_data files plus moves of old ones.
Brief changelog:

  • Extracted ok and err directories in test_data/lexer, similarly to how it is done in test_data/parser.
  • Moved a bunch of existing tests for error scenarios to the err directory; renamed the comments.rs test file to single_line_comments.rs.
  • Added a whole bunch of tests for the tokenization errors introduced in this PR.
  • Added FIXMEs (which are in fact todos) to state the plan for future improvements.

Contributor Author

Veetaha commented Feb 1, 2020

I also need to rebase this branch; I'll do it before the merge, just let me know when it is time.

let (tokens, errors) = tokenize(text);
assert_errors_are_present(&errors, path);
dump_tokens_and_errors(&tokens, &errors, text)
});
Member

👍
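For reference, a dump function along the lines discussed above could look like this (a hedged sketch: the Token field names, TextUnit::to_usize(), and the output format are assumptions, not the merged code):

```rust
// Hedged sketch: dump tokens first, then errors (only when present), so
// fixtures without errors keep their existing expected output.
fn dump_tokens_and_errors(tokens: &[Token], errors: &[SyntaxError], text: &str) -> String {
    let mut acc = String::new();
    let mut offset = 0usize;
    for token in tokens {
        let len = token.len.to_usize(); // assumes Token { kind, len: TextUnit }
        let token_text = &text[offset..offset + len];
        offset += len;
        acc.push_str(&format!("{:?} {} {:?}\n", token.kind, len, token_text));
    }
    for err in errors {
        // Assumption: SyntaxError exposes a Debug-printable location and Display.
        acc.push_str(&format!("> error{:?}: {}\n", err.location(), err));
    }
    acc
}
```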

Member

matklad commented Feb 3, 2020

bors r+

@kiljacken
Contributor

Bors is stuck, we probably need that rebase.

@Veetaha Would you mind rebasing? :)

Contributor Author

Veetaha commented Feb 3, 2020

Sure, I'll do that later today!

Contributor Author

Veetaha commented Feb 3, 2020

@matklad bors r+ ?

Member

matklad commented Feb 3, 2020

bors r+

bors bot added a commit that referenced this pull request Feb 3, 2020
2911: Implement collecting errors while tokenizing r=matklad a=Veetaha

Co-authored-by: Veetaha <[email protected]>
Contributor

bors bot commented Feb 3, 2020

Build succeeded

  • Rust (macos-latest)
  • Rust (ubuntu-latest)
  • Rust (windows-latest)
  • TypeScript

@bors bors bot merged commit a3e5663 into rust-lang:master Feb 3, 2020
bors bot added a commit that referenced this pull request Feb 18, 2020
3026: ra_syntax: reshape SyntaxError for the sake of removing redundancy r=matklad a=Veetaha

Followup of #2911; also crosses some items off the todo list of #223.

**ACHTUNG!** A big part of the diff of this PR is test data file changes.

Simplified `SyntaxError`, which was `SyntaxError { kind: { /* big enum */ }, location: Location }`, to `SyntaxError(String, TextRange)`. I am not sure whether the tuple struct here is the best fit; I am inclined to add names to the fields, because I already provide the getters `SyntaxError::message()`, `SyntaxError::range()`.
I also removed `Location` altogether ...

This is currently WIP, because the following is not done:
- [ ] ~~Add tests to the `test_data` dir for unescape errors *// I don't know where to put these errors in particular, because they are out of the scope of the lexer and parser. However, I have an idea in mind that we move all the validators we have right now to the parsing stage, but this is up to discussion...*~~ **[UPD]** I came to the conclusion that tree-validation logic, which unescape errors are a part of, should be rethought; we currently have no tests and no place to put tests for tree validations. So I'd like to extract the potential redesign (maybe moving tree validation to ra_parser) and adding tests for this into a separate task.

Co-authored-by: Veetaha <[email protected]>
Co-authored-by: Veetaha <[email protected]>
RalfJung pushed a commit to RalfJung/rust-analyzer that referenced this pull request Apr 20, 2024
RalfJung pushed a commit to RalfJung/rust-analyzer that referenced this pull request Apr 27, 2024