-
Notifications
You must be signed in to change notification settings - Fork 871
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize raw HTML post-processor #1510
base: master
Are you sure you want to change the base?
Optimize raw HTML post-processor #1510
Conversation
6113aad
to
fc9acc0
Compare
Hmm, the list->set change could be seen as breaking. We can instead create a new |
This comment was marked as resolved.
This comment was marked as resolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good generally. However, I haven't tested it or done any performance comparisons of my own. From a quick reading of the code only the few things you pointed out stand out to me. I'll need to take a closer look when I have more time.
I'll likely try this branch on some projects to inspect for regressions, just to be sure |
This is a valid concern. I'm not sure what to think about it. |
Yes, I saw the change of blocks using a list to set and meant to comment on it. I'm pretty sure it'll break some of my plugins, so keeping the original but deprecating it in favor of a new structure is probably the way to go. |
ce408be
to
0841396
Compare
Thanks for confirming @facelessuser! I'll change it. |
0841396
to
e05cce7
Compare
I just added a new private |
Using a set allows for better performances when checking for membership of a tag within block level elements. Issue-1507: Python-Markdown#1507
Previously, the raw HTML post-processor would precompute all possible replacements for placeholders in a string, based on the HTML stash. It would then apply a regular expression substitution using these replacements. Finally, if the text changed, it would recurse, and do all that again. This was inefficient because placeholders were re-computed each time it recursed, and because only a few replacements would be used anyway. This change moves the recursion into the regular expression substitution, so that: 1. the regular expression does minimal work on the text (contrary to re-scanning text already scanned in previous frames); 2. but more importantly, replacements aren't computed ahead of time anymore (and even less *several times*), and only fetched from the HTML stash as placeholders are found in the text. The substitution function relies on the regular expression groups ordering: we make sure to match `<p>PLACEHOLDER</p>` first, before `PLACEHOLDER`. The presence of a wrapping `p` tag indicates whether to wrap again the substitution result, or not (also depending on whether the substituted HTML is a block-level tag). Issue-1507: Python-Markdown#1507
e05cce7
to
1bc8c54
Compare
markdown/util.py
Outdated
BLOCK_LEVEL_ELEMENTS: list[str] = [ | ||
# TODO: Raise errors from list methods in the future. | ||
# Later, remove this class entirely and use a regular set. | ||
class _BlockLevelElements: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Elsewhere the type is specified as set
. However, as this is not a subclass of set, that would be incorrect. Also, later when we remove this, then we will have a set
. Perhaps this class should properly reflect that now. Unfortunately, there are no specialized container datatypes for sets in collections. And sets are not a Sequence Type either. Not really sure how best to address this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My feeling was that it was OK to lie about the type, so that users actually get type warnings when they use it as a list instead of a set. That's one more (very effective) way of communicating the change. And since we implement all set functionality, the lie is not too big 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could do that, but some users run type validation on their code and this could result in bug reports. Either we need to fully document our reasoning for the inconsistency in the code comments or we need to make a change to the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Users running type checkers on their code is exactly what I expect, yes 😄 So, my opinion is, yes, lets keep lying about the type, and document that prominently (both with code comments and release/changelog notes).
Type correctness would otherwise mean that:
- we type the variables as
_BlockLevelElements
- which in turn means we'd have to document this class properly
- which in turn (probably) means we'd have to make it public
(also we'd likely want to use Self
from typing_extensions
as return annotation of methods retuning self
, which means an additional dependency depending on the Python version)
I know it's far from being exhaustive but a search on GitHub returns only one result for use of md.block_level_elements
(apart from @facelessuser's pymdownx which creates a new set
from it, and discarding forks and venvs), and it's using remove
, which is both a list and set method. That's IMO a strong indicator that md.block_level_elements
has very few uses in the wild, and so immediately advertising its type as set
should have a very small impact. I could be wrong though 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not just type validators that I'm thinking about here. What about someone who is using isinstance
(or similar) on the object in their code? They could reasonably expect isinstance(set)
to return True, which is does not. For that matter existing code could reasonable expect isisntance(list)
to return True. My objection is that neither return True. However, if _BlockLevelElements
was a subclass of something, then it could return True for one of either set
or list
. That's why I pointed out above that set is not a Sequence Type. I was hoping it was and then _BlockLevelElements
could just subclass the relevant baseclass and return True for both. I think we need to return True for at least one. The question is: which one? Using set
is a breaking change but what it will end up being when the transition is complete. Using list
avoids the breaking change, but it not forward looking. And I don't think there is any way to raise a deprecation warning if a user calls isinstance
on the objection.
My point is that I think _BlockLevelElements
needs to be either a set or a list at a minimum. Ideally it would be both, but that is probably not sensible. Any solution is going to be a compromise. I'm saying that having _BlockLevelElements
not be either a set or a list is not the compromise I want to make.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I think we could inherit from both list
and set
since we implement all their methods anyway? This way isinstance
would return True
for both. Also, we can customize the behavior of isinstance
thanks to __instancecheck__
to emit a deprecation warning when checking against list
🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK __instancecheck__
works the other way unfortunately, it's not meant for this use-case. Inheriting from set
breaks a lot of tests (121 of them), not sure exactly why. Inheriting from list
works, but as you pointed, isinstance(ref, set)
will return False
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First batch of tests added. It's pretty basic. A second batch should be added to test lists with duplicate elements within them. I updated the code to try and reduce the possibility of duplicates but it's not thorough.
About duplicates: I don't prevent their insertion, otherwise we lose backwards compatibility again. We'll just have to add test cases with duplicates 🙂 A few other remarks/suggestions:
|
I added tests with duplicates. |
markdown/util.py
Outdated
# We raise `ValueError` for backwards compatibility. | ||
try: | ||
self._set.remove(element) | ||
except KeyError: | ||
raise ValueError(f"{element!r} not in list") from None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about this. It's difficult to deprecate a raised exception in favor of another one. Maybe we should immediately start raising KeyError
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About deprecating raised exceptions: we could create new exception that inherits from both ValueError
and KeyError
, so that both can be caught in user code. I was hoping we could hook into the except
mechanism to only emit a warning when ValueError
or KeyError
is caught, but it's not possible unfortunately, see https://discuss.python.org/t/instance-subclass-check-for-except-clauses/71583. So we're stuck with either:
- not emitting any warning
- always emitting a warning, that users will have to ignore even if they catch the right exception
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm inclined to suggest that we not bother ourselves with the errors here. We can just call _list.remove
and _set.remove
and allow any errors to bubble up to the calling code. Is there any reason why we shouldn't do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree on keeping things simple. I'll try a different approach (we still want to remove all matching items from the list, not just one) 🤔
Sorry, but I was out for a couple of weeks. Back now and trying to catch up.
Yes, let's do that. I would have mentioned this earlier but missed that we weren't using
We have not traditionally done this in the past. I have always taken the position that we are not responsible for Python's default behavior. As Python by default hides deprecation warnings, then users should expect that and act accordingly. But I'm not opposed to this either.
It is easy enough to search for #TODO. Adding dependencies just creates more complexity. |
Absolutely no worries, and no need to apologize, thank you very much for your answers 🙂 I'll address the various points you've answered. |
Oh, I noticed again that you have your own |
def pop(self, index: int | None = None, /) -> str: | ||
# In-place pop should update both list and set. | ||
if index is None: | ||
index = -1 | ||
else: | ||
warnings.warn( | ||
"Using block level elements as a list is deprecated, use it as a set instead.", | ||
DeprecationWarning, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We still use warnings.warn
here because we don't want to deprecate the whole method, only the index
parameter.
Closes #1507
TODO
maybe return new instances of_BlockLevelElements
instead oflist
/set
in relevant methodsmaybe usestacklevel=2
maybe useuse existingwarnings.deprecated
(andtyping_extensions.deprecated
on Python 3.12-)deprecated
function instead ofwarnings.warn