Implement collecting errors while tokenizing #2911

Merged: 12 commits, Feb 3, 2020

Conversation


@Veetaha Veetaha commented Jan 26, 2020

Now we are collecting errors from rustc_lexer and returning them in ParsedToken { token, error } and ParsedTokens { tokens, errors } structures ([UPD]: this is now simplified, see the updates below).

The main changes are introduced in ra_syntax/parsing/lexer.rs. It now exposes the following functions and types:

pub fn tokenize(text: &str) -> ParsedTokens;
pub fn tokenize_append(text: &str, parsed_tokens_to_append_to: &mut ParsedTokens);
pub fn first_token(text: &str) -> Option<ParsedToken>; // allows any number of tokens in text
pub fn single_token(text: &str) -> Option<ParsedToken>; // allows only a single token in text

pub struct ParsedToken  { pub token: Token,       pub error: Option<SyntaxError> }
pub struct ParsedTokens { pub tokens: Vec<Token>, pub errors: Vec<SyntaxError>   }

pub enum TokenizeError { /* Simple enum which reflects rustc_lexer tokenization errors */ }

In the first commit I implemented it with iterators, but then decided that, since this crate is ad hoc for rust-analyzer and we clearly see all of its usage sites, it would be better to simplify it to vectors.
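For illustration, a caller could consume this API roughly as follows (a hedged sketch, not code from this PR; the import path and the Display impl for SyntaxError are assumptions):

```rust
// Hedged usage sketch of the API above: lex a file and report lexer errors.
use ra_syntax::parsing::lexer::{tokenize, ParsedTokens};

fn lex_and_report(text: &str) {
    let ParsedTokens { tokens, errors } = tokenize(text);
    println!("lexed {} tokens", tokens.len());
    for error in &errors {
        // Assumption: SyntaxError implements Display.
        eprintln!("lexer error: {}", error);
    }
}
```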

This is currently WIP, because I want to add tests for the error messages generated by the lexer.
I'd like to hear your thoughts on how to define these tests in the ra_syntax/test-data dir.

Related issues: #223

[UPD]

After the PR review the API was simplified:

pub fn tokenize(text: &str) -> (Vec<Token>, Vec<SyntaxError>);
// Both lex functions do not check for unescape errors
pub fn lex_single_syntax_kind(text: &str) -> Option<(SyntaxKind, Option<SyntaxError>)>;
pub fn lex_single_valid_syntax_kind(text: &str) -> Option<SyntaxKind>;

// This will be removed in the next PR in favour of simplifying `SyntaxError` to `(String, TextRange)`
pub enum TokenizeError { /* Simple enum which reflects rustc_lexer tokenization errors */ }

// this is private, but may be made public if such demand exists in the future (principle of least privilege)
fn lex_first_token(text: &str) -> Option<(Token, Option<SyntaxError>)>;
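For example, validating a rename candidate with the simplified API could read like this (a sketch, not code from this PR; the re-export path is an assumption, and the IDENT/UNDERSCORE check mirrors the rename snippet discussed in the review below):

```rust
// Hedged sketch: `lex_single_valid_syntax_kind` returns Some(kind) only when
// the whole input lexes as exactly one token with no errors.
use ra_syntax::{lex_single_valid_syntax_kind, SyntaxKind};

fn is_valid_new_name(new_name: &str) -> bool {
    match lex_single_valid_syntax_kind(new_name) {
        Some(SyntaxKind::IDENT) | Some(SyntaxKind::UNDERSCORE) => true,
        _ => false,
    }
}
```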

@Veetaha Veetaha requested review from matklad and kiljacken January 26, 2020 19:06
@@ -41,37 +41,42 @@ fn reparse_token<'node>(
root: &'node SyntaxNode,
edit: &AtomTextEdit,
) -> Option<(GreenNode, TextRange)> {
let token = algo::find_covering_element(root, edit.delete).as_token()?.clone();
match token.kind() {
let prev_token = algo::find_covering_element(root, edit.delete).as_token()?.clone();
Contributor Author

Spent some time figuring out what is going on in this method to properly integrate the changes from the lexer. The core implementation here didn't change; I just renamed some variables for clarity.

Contributor

Good names do a lot for code readability 👍

@@ -97,6 +102,9 @@ fn reparse_block<'node>(
fn get_text_after_edit(element: SyntaxElement, edit: &AtomTextEdit) -> String {
let edit =
AtomTextEdit::replace(edit.delete - element.text_range().start(), edit.insert.clone());

// Note: we could move this match to a method or even further: use enum_dispatch crate
// https://crates.io/crates/enum_dispatch
Contributor Author
@Veetaha Veetaha Jan 26, 2020

This is my proposal for future refactoring; I want to hear your opinion on it.

Honestly, I just liked the idea of static enum dispatch, which relieves you from making traits object-safe and replaces dynamic dispatch.
It even reduces the boilerplate of such matches.

Though you may consider this to be unnecessary overhead, and I won't argue too much.

Contributor

I think there's some consensus that we already have a bit too many dependencies, so adding another one, especially a procedural macro, is probably not a good thing if we want to keep build times down.

Member

I actually feel pretty strongly about not using "helper" crates that try to bolt idioms onto a language. I feel that the accidental complexity introduced by them is almost always larger than any boilerplate savings. The notable exception is the impl_froms macro, which we use throughout the hir, and which I feel is justified, in that it's a very simple idea, and the amount of boilerplate saved is significant.

The reason why we need an explicit match here is that the two .text() calls are two very different things in this case: the first one is a &str, and the second one is a tree structure -- a view into the node. If we had a generic trait Text, then making a type NodeOrTokenText<'a> = NodeOrToken<SyntaxText, &str>; with impl Text for NodeOrTokenText<'a> would be reasonable, but we don't have such a trait yet, and designing it right is really hard.
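For context, the explicit match being discussed looks roughly like this (a paraphrased sketch of get_text_after_edit, assuming rowan's NodeOrToken shape):

```rust
// The two arms look symmetric, but token.text() yields a &str while
// node.text() yields a SyntaxText -- a lazy view into the tree.
match element {
    NodeOrToken::Token(token) => edit.apply(token.text().to_string()),
    NodeOrToken::Node(node) => edit.apply(node.text().to_string()),
}
```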

Contributor Author
@Veetaha Veetaha Jan 27, 2020

Okay, thank you for the opinions!

// FIXME: Location should be just `Location(TextRange)`.
// The TextUnit enum member just unnecessarily complicates things;
// we shouldn't treat it specially, it's just a `TextRange { start: x, end: x + 1 }`.
// see `location_to_range()` in ra_ide/src/diagnostics
Contributor Author
@Veetaha Veetaha Jan 26, 2020

I think this should be a target for future refactoring. I am not entirely sure why we don't just use TextRange as the location of a SyntaxError instead. Or at least make the location a newtype over TextRange? What do you think?

Contributor

Seems reasonable. In the end, when used to emit a diagnostic, it gets turned into a TextRange anyways: https://github.com/rust-analyzer/rust-analyzer/blob/d1330a4a65f0113c687716a5a679239af4df9c11/crates/ra_ide/src/diagnostics.rs#L118-L123
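The linked conversion is essentially the following (a paraphrased sketch, not the exact code):

```rust
// Both variants collapse into a TextRange, which suggests Location
// could simply be a TextRange to begin with.
fn location_to_range(location: Location) -> TextRange {
    match location {
        Location::Offset(offset) => TextRange::offset_len(offset, 1.into()),
        Location::Range(range) => range,
    }
}
```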

Member

Yeah, I think this should just be a TextRange.

Contributor Author

Yeah, I just referred to that snippet in the comment.

Contributor

That you did 😆

fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
let msg = match self {
TokenizeError::EmptyInt => "Missing digits after integer base prefix",
TokenizeError::EmptyExponent => "Missing digits after the exponent symbol",
Contributor Author

I wish rustfmt were not so clippy; initially I wrote this match with each arm homogeneously and nicely wrapped in curly braces, but the formatting tool demands the shorthand syntax here...

Contributor

You could always #[rustfmt::skip] this one function. We do the same for a few match statements elsewhere.

Member

I am ok with sticking #[rustfmt::skip] on top in similar cases.
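Concretely, the opt-out is a single attribute on the item (an illustrative sketch; the elided arms and the write! tail are assumptions):

```rust
// Illustrative: skip rustfmt for one function to keep hand-aligned arms.
#[rustfmt::skip]
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
    let msg = match self {
        TokenizeError::EmptyInt      => "Missing digits after integer base prefix",
        TokenizeError::EmptyExponent => "Missing digits after the exponent symbol",
        // ... remaining variants elided ...
    };
    write!(f, "{}", msg)
}
```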

pub struct SyntaxTreeBuilder {
errors: Vec<SyntaxError>,
inner: GreenNodeBuilder<'static>,
}

impl Default for SyntaxTreeBuilder {
Contributor Author

I am not sure why the Default trait was implemented here by hand. Could you please elaborate, @matklad?

Member

At the time this was written, I forgot to add a Default impl to GreenBuilder, and was too lazy to publish a new version of rowan just for that: rust-analyzer/rowan@a3692a9#diff-1a2b23ceb6e534bd16f63d2e55e537aaR154.

In general, if something looks like it doesn't make sense, it probably indeed doesn't make sense :)
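With that fixed upstream, the hand-written impl can presumably collapse to a derive (sketch):

```rust
// Sketch: once rowan's GreenNodeBuilder implements Default,
// the manual `impl Default` can be replaced with a derive.
#[derive(Default)]
pub struct SyntaxTreeBuilder {
    errors: Vec<SyntaxError>,
    inner: GreenNodeBuilder<'static>,
}
```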

Contributor Author

I've stumbled on enough things that didn't make sense to me but actually existed for a reason, so I'd rather ask the author first before amending them. Anyway, thank you for the clarification!

Contributor Author

Veetaha commented Jan 26, 2020

I know that ra_syntax::SourceFile::parse() returns a Parse<SourceFile> type that has the same semantics as ParsedToken[s]; we could reuse it (with some changes to it) and return Parse<Token> instead. But this would be a task for refactoring, and I am not sure whether it is worth generalizing it this way... What are your thoughts?

@@ -10,7 +10,8 @@ use crate::{fuzz, SourceFile};
#[test]
fn lexer_tests() {
dir_tests(&test_data_dir(), &["lexer"], |text, _| {
let tokens = crate::tokenize(text);
// FIXME: add tests for errors (their format is up to discussion)
let tokens = crate::tokenize(text).tokens;
Contributor Author
@Veetaha Veetaha Jan 26, 2020

I'd like to make these tests data-driven, but I am not sure which format would be the most suitable...

Your input is very important here, guys!

Contributor

You could probably just update dump_tokens to take ParsedTokens and then also output the errors like we do the tokens. If you don't print anything when there are no errors, you shouldn't even need to touch the existing tests.

Member

Just dumping errors in dump_tokens after the tokens themselves seems good to me.

Unrelated, but perhaps a slightly more information-rich way to write this would be:

let ParsedTokens { tokens, errors: _ } = crate::tokenize(text);

I.e., explicitly naming the thing you are ignoring (which is usually important for errors).

Contributor Author

Okay guys, will do!
Also, @matklad, good point on explicitly dropping errors. Do you think we should use this pattern in all places where tokenize*()/single_token()/first_token() are called?

Member

I think so! In general, I just always try to destructure tuples that way:

13:49:36|~/projects/rust-analyzer|HEAD✓
λ rg  'let .*, _\w+\)'
xtask/src/codegen/gen_syntax.rs
167:    let punctuation_values = grammar.punct.iter().map(|(token, _name)| {

crates/ra_hir_def/src/generics.rs
59:        let (params, _source_map) = GenericParams::new(db, def.into());

crates/ra_hir/src/code_model.rs
1072:        let (adt, _subst) = self.ty.value.as_adt()?;

crates/ra_batch/src/lib.rs
53:    let (crate_graph, _crate_names) =
147:        let (host, _roots) = load_cargo(path).unwrap();

crates/ra_hir_ty/src/utils.rs
141:            let (_total, parent_len, _child) = self.len_split();

crates/ra_parser/src/grammar/expressions.rs
27:    let (cm, _block_like) = expr(p);

crates/ra_hir_def/src/nameres/collector.rs
947:        let (db, _file_id) = TestDB::with_single_file(&code);

crates/ra_hir_ty/src/infer/coerce.rs
253:                let (last_field_id, _data) = fields.next_back()?;

crates/ra_parser/src/grammar/expressions/atom.rs
560:    let (completed, _is_block) =

@Veetaha Veetaha mentioned this pull request Jan 27, 2020
Contributor
@kiljacken kiljacken left a comment

Overall this looks good. A few small nits to the actual code, but nothing major.

Comment on lines 20 to 24
match single_token(new_name)?.token.kind {
SyntaxKind::IDENT | SyntaxKind::UNDERSCORE => (),
_ => return None,
Contributor

I love how this turned out 👍

/// In general `self.errors.len() <= self.tokens.len()`
pub errors: Vec<SyntaxError>,
}
impl ParsedTokens {
Contributor

Nit: Add a blank line above this.

.map(|error| SyntaxError::new(SyntaxErrorKind::TokenizeError(error), token_range)),
};

type ParsedSyntaxKind = (SyntaxKind, Option<TokenizeError>);
Contributor

Defining the alias here probably takes up more space than just writing it out four times. Don't really care too much either way.

Contributor Author

Okay, I'll try to refactor it according to this comment



Comment on lines 196 to 198
// FIXME: it seems this initialization statement is unnecessary (see edit in outer scope)
// Investigate whether it should really be removed.
let edit = AtomTextEdit { delete: range, insert: replace_with.to_string() };
Contributor

Neither range nor replace_with change, and edit is always passed as reference, so I'm inclined to agree that this line could probably be removed.



Comment on lines +87 to +93
// FIXME: the obvious pattern of this enum dictates that the following enum variants
// should be wrapped into something like `SemanticError(SemanticError)`
// or `ValidateError(ValidateError)` or `SemanticValidateError(...)`
Contributor

Let's leave it at the tokenizer changes, at least for this PR. Then you can do this as a separate PR if it seems useful.

Contributor Author

Yes, I described the summary and future todos in a comment under the original issue. I'll see what should be done in the next PR.


@kiljacken
Contributor

I know that ra_syntax::SourceFile::parse() returns a Parse<SourceFile> type that has the same semantics as ParsedToken[s]; we could reuse it (with some changes to it) and return Parse<Token> instead. But this would be a task for refactoring, and I am not sure whether it is worth generalizing it this way... What are your thoughts?

Parse<T> is currently rather specialized for things that actually contain a syntax tree from rowan, so this should probably just remain separate.

Member
@matklad matklad left a comment

Left a bunch of nitpicky comments!

I don't really have strong opinions on them, but as this is partially refactoring work, I'd like to pay more attention to detail than usual, to make sure we have a somewhat shared understanding of what the "rust-analyzer" style is.

Overall, LGTM!

pub error: Option<TokenizeError>,
}
impl ParsedToken {
pub const fn new(token: Token, error: Option<TokenizeError>) -> Self {
Member

Do we actually use this function? In general, I try to avoid adding trivial `new`s for structs where all fields are public, as a struct literal is more readable at the call-site (since fields have names). The exception is if the struct is created in a lot of different places, where the method call might be more concise.
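I.e., the two call-site styles side by side (illustrative only):

```rust
// A struct literal names the fields at the call site...
let parsed = ParsedToken { token, error: None };
// ...while a trivial constructor hides what each argument means:
let parsed = ParsedToken::new(token, None);
```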

Member

Similarly for the Token above

Member

Also, in the case of ParsedToken, I would probably use type ParsedToken = (Token, Option<TokenizeError>), because semantically it is just a pair of a token and an error.

Contributor Author

Didn't expect you would dig into the commits; I did rethink this and removed those functions in later commits.

// We drop some useful information here (see patterns with double dots `..`).
// Storing that info in `SyntaxKind` is not possible due to its layout requirement
// of being a `u16`, which comes from the `rowan::SyntaxKind` type, and changing
// `rowan::SyntaxKind` would mean a hell of a rewrite.
Member

We don't do this not because we are too lazy to rewrite (I think the syntax tree was rewritten something like four times?), but because dumb, semantics-less nodes are an explicit design decision :)

Contributor Author

Sorry, I think I made this comment a bit misleading. I just wanted to explain why we drop some info here, not to propose changing the design!

TK::Caret => ok(CARET),
TK::Percent => ok(PERCENT),
TK::Unknown => ok(ERROR),
};
Member

I would probably avoid the ok, ok_if functions and do something like this:

let kind = match rustc_token_kind {
    TK::LineComment => COMMENT,
    TK::BlockComment { terminated: false } => return (COMMENT, Some(TE::UnterminatedBlockComment)),
    TK::BlockComment { terminated: true } => COMMENT,
};

(kind, None)

That is,

  • move ok to the happy path, and use return for errors
  • pull the TextUnit::from_usize(token_text.len()) bit out of the function, as it doesn't depend on the specific kind at all.

Contributor Author

Hmm, nice catch, I'll apply this.

pub fn first_token(text: &str) -> Option<ParsedToken> {
// Checking for emptiness because of the `rustc_lexer::first_token()` invariant (see its body)
if text.is_empty() {
None
Member

Suggested change:
- None
+ return None;

and remove the else. We use early returns by default.

@@ -46,8 +46,7 @@ fn reparse_token<'node>(
WHITESPACE | COMMENT | IDENT | STRING | RAW_STRING => {
if token.kind() == WHITESPACE || token.kind() == COMMENT {
// removing a new line may extend the previous token
if token.text().to_string()[edit.delete - token.text_range().start()].contains('\n')
{
if token.text()[edit.delete - token.text_range().start()].contains('\n') {
Member

👍

@@ -84,6 +84,9 @@ pub enum SyntaxErrorKind {
ParseError(ParseError),
EscapeError(EscapeError),
TokenizeError(TokenizeError),
// FIXME: the obvious pattern of this enum dictates that the following enum variants
// should be wrapped into something like `SemanticError(SemanticError)`
// or `ValidateError(ValidateError)` or `SemanticValidateError(...)`
Member

Yeah, in general, I am not sure that using an enum for SyntaxErrors is a good idea. We don't really match on them, so perhaps just an OpaqueError(String, TextRange) would be a better representation.

Contributor Author
@Veetaha Veetaha Jan 27, 2020

Okay, let's leave it for the refactoring to be done soon. Since this crate is called ra_syntax, let's keep the SyntaxError name for that and not some other OpaqueError.
If this crate is used exclusively by rust-analyzer, I am definitely for the generic (String, TextRange) error shape.
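The shape converged on here (and landed in the follow-up PR) is roughly the following sketch; the derives and the constructor are assumptions:

```rust
// Sketch of the simplified error type: just a message and a range.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct SyntaxError(String, TextRange);

impl SyntaxError {
    pub fn new(message: impl Into<String>, range: TextRange) -> Self {
        SyntaxError(message.into(), range)
    }
    pub fn message(&self) -> &str {
        &self.0
    }
    pub fn range(&self) -> TextRange {
        self.1
    }
}
```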




}
TokenizeError::UnstartedRawString => {
"Missing `\"` symbol after `#` symbols to begin the raw string literal"
"Missing \" symbol after # symbols to begin the raw string literal"
Member

I usually like to place some kind of quotes around snippets because that's helpful when a snippet contains trailing/leading whitespace or other invisible characters. But we should just stick to what rustc is doing.

Which... I am not sure what that is, exactly! For this code

λ cat main.rs
fn main() {
    let s = r###;
}

I get this error

[screenshot of the rustc error output]

Note how # is uselessly in quotes, but the empty string between : and ; isn't.

@estebank what are the current guidelines around quotes?

Contributor Author
@Veetaha Veetaha Jan 27, 2020

Heh, I initially wrapped the symbols in backticks (hoping they would be rendered as Markdown), but once I started VSCode and saw that the messages were displayed as raw text, I removed them: 34d72aa

Contributor

We try to consistently surround code and tokens (including identifiers) in backticks. No tool is actually parsing them, but having them bare makes it harder to read. Consider

found , but expected .

vs

found `,` but expected `.`

Contributor Author

Veetaha commented Jan 27, 2020

Thank you for the review and for elaborating on the "rust-analyzer" style; I will take all of that into account!

Member

matklad commented Jan 28, 2020

is this still a wip?

Contributor Author

Veetaha commented Jan 28, 2020

@matklad please don't merge yet, I'll add tests today; then I'll remove the WIP label. Or do you have a different understanding of the WIP label?

Member

matklad commented Jan 28, 2020 via email

@Veetaha Veetaha changed the title [WIP]: Implement collecting errors while tokenizing Implement collecting errors while tokenizing Feb 1, 2020
@Veetaha Veetaha requested review from kiljacken and matklad February 1, 2020 20:25
Contributor Author

Veetaha commented Feb 1, 2020

@kiljacken, @matklad
Please see the changes in the latest commit: 38f1abf
And please don't be terrified by the number of changed files; these are mostly new test_data files plus moves of old ones.
Brief changelog:

  • Extracted ok and err directories in test_data/lexer, similarly to how it is done in test_data/parser.
  • Moved a bunch of existing tests for error scenarios to the err directory; renamed the comments.rs test file to single_line_comments.rs.
  • Added a whole bunch of tests for the tokenization errors introduced in this PR.
  • Added FIXMEs (which are in fact todos) to state the plan for future improvements.

Contributor Author

Veetaha commented Feb 1, 2020

I also need to rebase this branch; I'll do it before the merge, just let me know when it is time.

let (tokens, errors) = tokenize(text);
assert_errors_are_present(&errors, path);
dump_tokens_and_errors(&tokens, &errors, text)
});
Member

👍
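For reference, a dump function along the lines discussed above could look like this (a hedged sketch: the Token field names, TextUnit::to_usize(), and the output format are assumptions, not the merged code):

```rust
// Hedged sketch: dump tokens first, then errors (only when present), so
// fixtures without errors keep their existing expected output.
fn dump_tokens_and_errors(tokens: &[Token], errors: &[SyntaxError], text: &str) -> String {
    let mut acc = String::new();
    let mut offset = 0usize;
    for token in tokens {
        let len = token.len.to_usize(); // assumes Token { kind, len: TextUnit }
        let token_text = &text[offset..offset + len];
        offset += len;
        acc.push_str(&format!("{:?} {} {:?}\n", token.kind, len, token_text));
    }
    for err in errors {
        // Assumption: SyntaxError exposes a Debug-printable location and Display.
        acc.push_str(&format!("> error{:?}: {}\n", err.location(), err));
    }
    acc
}
```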

Member

matklad commented Feb 3, 2020

bors r+

@kiljacken
Contributor

Bors is stuck, we probably need that rebase.

@Veetaha Would you mind rebasing? :)

Contributor Author

Veetaha commented Feb 3, 2020

Sure, I'll do that later today!

Contributor Author

Veetaha commented Feb 3, 2020

@matklad bors r+ ?

Member

matklad commented Feb 3, 2020

bors r+

bors bot added a commit that referenced this pull request Feb 3, 2020
2911: Implement collecting errors while tokenizing r=matklad a=Veetaha

Co-authored-by: Veetaha <[email protected]>
Contributor

bors bot commented Feb 3, 2020

Build succeeded

  • Rust (macos-latest)
  • Rust (ubuntu-latest)
  • Rust (windows-latest)
  • TypeScript

@bors bors bot merged commit a3e5663 into rust-lang:master Feb 3, 2020
bors bot added a commit that referenced this pull request Feb 18, 2020
3026: ra_syntax: reshape SyntaxError for the sake of removing redundancy r=matklad a=Veetaha

Followup of #2911; also crosses some items off the todo list of #223.

**ACHTUNG!** A big part of the diff of this PR is test data file changes.

Simplified `SyntaxError`, which was `SyntaxError { kind: { /* big enum */ }, location: Location }`, to `SyntaxError(String, TextRange)`. I am not sure whether the tuple struct here is the best fit; I am inclined to add names to the fields, because I already provide the getters `SyntaxError::message()`, `SyntaxError::range()`.
I also removed `Location` altogether ...

This is currently WIP, because the following is not done:
- [ ] ~~Add tests to the `test_data` dir for unescape errors *// I don't know where to put these errors in particular, because they are out of the scope of the lexer and parser. However, I have an idea in mind that we move all the validators we have right now to the parsing stage, but this is up to discussion...*~~ **[UPD]** I came to the conclusion that tree-validation logic, which unescape errors are a part of, should be rethought; we currently have no tests and no place to put tests for tree validations. So I'd like to extract the potential redesign (maybe moving tree validation to ra_parser) and adding tests for this into a separate task.

Co-authored-by: Veetaha <[email protected]>
Co-authored-by: Veetaha <[email protected]>
RalfJung pushed a commit to RalfJung/rust-analyzer that referenced this pull request Apr 20, 2024
RalfJung pushed a commit to RalfJung/rust-analyzer that referenced this pull request Apr 27, 2024