Handling serialisation overflow #6

@Aerijo

I've been looking through external scanners, and was trying to work out how the serialisation works in this package. Originally, I thought it would try its best and discard any scopes deeper than it could hold, and that seems to be the case.

However, it does not store whether it discarded any tags. This leads to closing tags being marked as erroneous, even when they match. E.g., the following has a middle tag that's too long to be stored properly. The current behaviour is to mark its closing tag and </foo> as invalid.

<html>
<bar>
  <abcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghij>
    <foo>

    </foo>
  </abcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghij>
</bar>
</html>

This seems worse than marking invalid endings as valid. Could it use a bit to indicate if all the tags were saved, or if some were discarded? Could this be used to improve the behaviour?
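
To make the question concrete, here is a rough Python model of the idea, not the scanner's actual code. Everything in it is an assumption for illustration: the 32-byte `BUFFER_SIZE`, the length-prefixed layout, the leading flag byte, and the names `serialize`, `deserialize` and `classify_closers`. It first classifies the closing tags while ignoring the flag (roughly the behaviour described above), then again with the flag, where an unmatched closing tag is treated as "unknown" rather than erroneous.

```python
BUFFER_SIZE = 32  # hypothetical capacity, kept tiny so the demo overflows


def serialize(tags, capacity=BUFFER_SIZE):
    """Pack tag names outermost-first after a one-byte 'truncated' flag."""
    body = bytearray()
    truncated = 0
    for name in tags:
        encoded = name.encode("utf-8")
        # 1 byte for the flag, 1 byte for this name's length prefix.
        if len(encoded) > 255 or 1 + len(body) + 1 + len(encoded) > capacity:
            truncated = 1  # this tag and everything nested in it is dropped
            break
        body.append(len(encoded))
        body += encoded
    return bytes([truncated]) + bytes(body)


def deserialize(buf):
    """Return (tag stack, truncated flag) from the bytes of serialize()."""
    tags, i = [], 1
    while i < len(buf):
        length = buf[i]
        tags.append(buf[i + 1 : i + 1 + length].decode("utf-8"))
        i += 1 + length
    return tags, bool(buf[0])


def classify_closers(stack, closers, truncated=False):
    """Label each closing tag 'matched', 'unknown' or 'erroneous'."""
    stack = list(stack)
    verdicts = []
    for name in closers:
        if stack and stack[-1] == name:
            stack.pop()
            verdicts.append("matched")
        elif truncated:
            verdicts.append("unknown")  # may close a discarded tag
        else:
            verdicts.append("erroneous")
    return verdicts


long_tag = "abcdefghij" * 4  # 40 bytes, too long for the demo buffer
closers = ["foo", long_tag, "bar", "html"]
stack, truncated = deserialize(serialize(["html", "bar", long_tag, "foo"]))

ignoring_flag = classify_closers(stack, closers)
using_flag = classify_closers(stack, closers, truncated)
```

In this model `ignoring_flag` marks `foo` and the long tag as erroneous, while `using_flag` marks them as unknown. The cost is that once the flag is set, a genuinely mismatched closing tag would be let through too, which is the trade-off mentioned above.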

It gets worse when tags with the same name are nested:

<html>
<bar>
  <abcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghij>
    <foo>
      <bar>

      </bar>
    </foo>
  </abcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghij>
</bar>
</html>

Because bar was the only name that made it onto the buffer, it sees the inner closing bar and thinks that closes everything, so marks the rest as invalid. I don't even know if it's possible to account for this one, considering some tags can be self-closing (so we can't just count closing tags against the number we discarded).
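
For what it's worth, the counting idea can be sketched like this, again as a toy Python model rather than anything from the scanner. The function name `close_with_discard_count` and the idea of storing a discard count are hypothetical. It assumes the first N closing tags belong to the N discarded tags, which works for the nested example but breaks as soon as one discarded element ends without an explicit closing tag.

```python
def close_with_discard_count(stored, discarded, closers):
    """Assume the first `discarded` closing tags close dropped tags."""
    stack = list(stored)
    verdicts = []
    for name in closers:
        if discarded > 0:
            discarded -= 1
            verdicts.append((name, "assumed"))  # not checked at all
        elif stack and stack[-1] == name:
            stack.pop()
            verdicts.append((name, "matched"))
        else:
            verdicts.append((name, "erroneous"))
    return verdicts


long_tag = "abcdefghij" * 4  # stand-in for the overlong tag name
stored = ["html", "bar"]     # what survived serialisation
discarded = 3                # long tag, foo, inner bar

# Every discarded tag has an explicit closing tag: the count lines up.
nested = close_with_discard_count(
    stored, discarded, ["bar", "foo", long_tag, "bar", "html"]
)

# Hypothetical case: the inner bar ends without a closing tag, so the
# count swallows the outer </bar> and </html> no longer matches.
missing_closer = close_with_discard_count(
    stored, discarded, ["foo", long_tag, "bar", "html"]
)
```

In the second call the outer `bar` is consumed as if it closed a discarded tag, leaving `html` to be flagged as erroneous, so a bare count does not seem to be enough on its own.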

What about marking the "serialisation" as invalid or something, and requesting a full reparse from some point? (I was going to reference the TextMate parser here, but then saw it doesn't even try to match anything. So no more suggestions).
