-
-
Notifications
You must be signed in to change notification settings - Fork 91
Description
I've been looking through external scanners, and was trying to work out how the serialisation worked in this package. Originally, I thought it would try it's best and discard deeper scopes than it could hold, and that seems to be the case.
However, it does not store whether it discarded tags or not. This leads to closing tags being marked as erroneous, even when they match. E.g., the following has a middle tag that's too long to be stored properly. The current behaviour is to mark it's closing tag and </foo>
as invalid.
<html>
<bar>
<abcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghij>
<foo>
</foo>
</abcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghij>
</bar>
</html>
This seems worse than marking invalid endings as valid. Could it use a bit to indicate if all the tags were saved, or if some were discarded? Could this be used to improve the behaviour?
It gets worse when same tags are nested:
<html>
<bar>
<abcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghij>
<foo>
<bar>
</bar>
</foo>
</abcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghij>
</bar>
</html>
Because bar
was the only name that made it onto the buffer, it sees the inner closing bar
and thinks that closes everything, so marks the rest as invalid. I don't even know if it's possible to account for this one, considering some tags can be self closing (so we can't just count tag ends for the number we discarded).
What about marking the "serialisation" as invalid or something, and requesting a full reparse from some point? (I was going to reference the TextMate parser here, but then saw it doesn't even try to match anything. So no more suggestions).