Add Attribute Support for Paragraphs: Enabling Para Elements to Carry HTML Classes, IDs, and Custom Attributes #10768
Comments
This would be a very large change, affecting not just I don't think it would really count as a step towards supporting attributes on all elements, because the best way to do that would not be to add a separate attributes field to each constructor, but to do something like what I do in djot-hs.
Thank you for your feedback regarding the paragraph attributes feature and for pointing me to the Djot implementation. I completely understand your concerns about the potential disruption such a change would cause to the ecosystem. After reviewing the Djot.AST code and your comments, I see the elegance of the node-based approach. I've also explored existing discussions on similar topics, including [pandoc vs djot document model / footnotes (Discussion #9526)](#9526) and [From HTML to Markdown (Discussion #9324)](#9324), which deal with related architectural considerations. I've started a discussion here: #10771 to gather input from the community about different possible approaches:
This way, we can explore whether there's a path forward that provides the needed functionality without causing excessive disruption to dependent packages and users. Thanks again for your guidance - it's been helpful in understanding the architectural considerations involved.
@Valgard could you provide some concrete examples where this capability would be useful? I understand that in HTML paragraphs can have attrs (as can any other element), but pandoc is a document conversion program after all, so the format you are converting to must also support paragraph attrs for this to be useful at all. If you want to apply attrs to block-level content, the
Thank you for asking about concrete examples. I can share a specific case I'm currently working on that demonstrates the need for paragraph attributes:
Real-world example: Converting structured EPUB/HTML indexes to other formats
I'm currently working with book indexes from EPUB/HTML files that use paragraph attributes (CSS classes) to indicate the hierarchy and structure of the index:
<p class="indexmain1"><span><strong>A</strong></span></p>
<p class="indexmain"><span>Abfragen</span></p>
<p class="indexsub"><span>föderierte<span class="spacei"></span><a href="...">122</a></span></p>
<p class="indexsub"><span>Push-down-<span class="spacei"></span><a href="...">122</a></span></p>
These CSS classes (
My current workaround: After struggling with various approaches, I had to resort to a fragmented workflow:
This workflow is inefficient and error-prone, but necessary because Pandoc cannot preserve the paragraph-level semantic information during conversion. With paragraph attribute support, I could have solved this with a simple Pandoc filter that recognizes the classes and applies appropriate transformations. Currently, I'm forced to choose between complex Lua filters to identify and transform these paragraphs (without having the attribute information) or wrap everything in Divs, both with significant drawbacks:
If paragraphs could carry their attributes in the AST, filters could simply check for specific classes and transform them accordingly, leading to much cleaner and more robust conversion. Additional use cases where paragraph attributes are valuable:
When working with technical documentation or academic content, paragraphs often carry semantic meaning:
<p class="note">This is an important note for the reader.</p>
<p class="warning">Warning: This approach has limitations.</p>
<p class="example">Consider the following example: ...</p>
These could be transformed appropriately into target formats (admonition blocks in LaTeX, specifically styled paragraphs in DOCX, etc.).
<p lang="en">English text</p>
<p lang="de">Deutscher Text</p>
Preserving language attributes ensures proper hyphenation, spell-checking, and accessibility in output formats.
<p class="first-paragraph">The opening paragraph with special styling...</p>
<p class="summary">Summary of key points...</p>
<p class="quote-attribution">— Author Name</p>
Why Divs Aren't an Ideal Solution
While Divs can technically carry attributes, they create several problems:
I understand that architectural changes require careful consideration, but having spent significant time implementing workarounds for this limitation, I believe paragraph attributes would be a valuable addition that aligns with Pandoc's goal of faithful document conversion. Does this help clarify the practical need for this feature?
What if pandoc's HTML reader dealt with
@Valgard thanks for providing some use cases. Some quick thoughts for each of them below.
Note also this precedent. Djot allows attributes to be placed on paragraphs (as does commonmark with the
The
Thank you for your thoughts on the use cases and for sharing the wrapper approach used with Djot. Let me first clarify some points about my situation:
On the HTML/EPUB Index Example
The HTML is not authored by me - it comes from published EPUBs that I need to process. This is a common real-world scenario: working with content we don't control but need to convert faithfully. Many commercial publishers use CSS classes for semantic markup in EPUB files rather than nested structures (whether ideal or not). When you say:
I'm not asking Pandoc to fix anything - I'm asking it to preserve the semantic information that exists in the source document during conversion. This is precisely what Pandoc excels at for other elements (like headings and code blocks with attributes). The issue isn't about "fixing bad HTML" but about faithfully converting between formats while preserving semantic information, which is Pandoc's core purpose.
On the Wrapper Attribute Approach
The approach you shared using a Div with a special
This solution could work well for my use case and others:
This approach has several advantages:
Move Forward
I'd be happy to help implement this wrapper-based approach for the HTML reader/writer if that would be welcome. It seems like a pragmatic solution that provides the functionality needed while respecting the current architectural constraints and following the precedent already established in the Djot reader/writer. Would a pull request that implements this wrapper-based approach for HTML paragraphs be something you'd consider?
Yes, I think that's a good way to go.
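To make the agreed direction concrete for the index example earlier in the thread, here is a minimal Python sketch of what a filter could do once attributed paragraphs arrive as wrapper Divs. It works on plain dictionaries in the shape of pandoc's JSON AST. The helper names `stringify` and `index_entries` are hypothetical, and the assumption that the wrapper Div carries the paragraph's classes plus a `wrapper="1"` key-value pair is taken from the change description in this thread, not from a released pandoc.

```python
def stringify(inlines):
    """Flatten a list of inline nodes to plain text (only the node types used here)."""
    out = []
    for node in inlines:
        t, c = node["t"], node.get("c")
        if t == "Str":
            out.append(c)
        elif t == "Space":
            out.append(" ")
        elif t in ("Strong", "Emph"):   # c = [inlines]
            out.append(stringify(c))
        elif t in ("Span", "Link"):     # c = [attr, inlines, ...]
            out.append(stringify(c[1]))
    return "".join(out)


def index_entries(blocks):
    """Return (classes, text) for every wrapper Div that holds exactly one Para."""
    entries = []
    for block in blocks:
        if block["t"] != "Div":
            continue
        (_ident, classes, keyvals), inner = block["c"]
        is_wrapper = ["wrapper", "1"] in keyvals   # assumed marker pair
        if is_wrapper and len(inner) == 1 and inner[0]["t"] == "Para":
            entries.append((classes, stringify(inner[0]["c"])))
    return entries


# Two index paragraphs as the reader is assumed to emit them.
doc = [
    {"t": "Div", "c": [["", ["indexmain"], [["wrapper", "1"]]],
                       [{"t": "Para", "c": [{"t": "Str", "c": "Abfragen"}]}]]},
    {"t": "Div", "c": [["", ["indexsub"], [["wrapper", "1"]]],
                       [{"t": "Para", "c": [
                           {"t": "Str", "c": "föderierte"},
                           {"t": "Space"},
                           {"t": "Link", "c": [["", [], []],
                                               [{"t": "Str", "c": "122"}],
                                               ["#", ""]]}]}]]},
]

entries = index_entries(doc)
```

A real filter would then map classes such as `indexmain` and `indexsub` to whatever structure the target format needs; that mapping is left out because the thread does not specify it.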
This change implements support for preserving HTML paragraph attributes:
1. Readers/HTML.hs: Modified pPara to detect paragraphs with attributes and wrap them in a special Div with wrapper="1" attribute to preserve the original paragraph attributes.
2. Writers/HTML.hs: Updated blockToHtmlInner to detect wrapper Divs and apply their attributes directly to the contained paragraph.
3. HTML/Parsing.hs: Improved attribute handling for data-* attributes.
4. Added tests to verify the correct handling of paragraph attributes in both HTML to native and HTML to HTML conversions.
This maintains clean roundtripping of paragraph attributes while keeping the AST structure consistent with Pandoc's design.
- HTML reader wraps attributed `p` tags in `Div` with `wrapper="1"`.
- HTML writer unwraps `Div` with `wrapper="1"` back to attributed `p` tag.
- Add tests for HTML paragraph attribute roundtrip.
- Update EPUB golden files to reflect new AST for attributed paragraphs.
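The wrap and unwrap steps listed above can be sketched in a few lines of Python over dictionaries shaped like pandoc's JSON AST. This is only an illustration of the idea, not the Haskell code in `Readers/HTML.hs` or `Writers/HTML.hs`. The function names `wrap_para` and `unwrap_para` are hypothetical, and dropping the `wrapper` pair again on output is an assumption.

```python
WRAPPER = ["wrapper", "1"]  # marker key-value pair named in the change description


def wrap_para(ident, classes, keyvals, inlines):
    """Reader side: represent an attributed <p> as a Div (attrs + marker) around one Para."""
    attr = [ident, list(classes), [list(kv) for kv in keyvals] + [list(WRAPPER)]]
    return {"t": "Div", "c": [attr, [{"t": "Para", "c": inlines}]]}


def unwrap_para(block):
    """Writer side: for a wrapper Div around exactly one Para, return
    (attr without the marker, para); return None for any other block."""
    if block.get("t") != "Div":
        return None
    (ident, classes, keyvals), blocks = block["c"]
    if WRAPPER not in keyvals or len(blocks) != 1 or blocks[0]["t"] != "Para":
        return None
    # Assumption: the marker itself is not written back to the <p> tag.
    cleaned = [kv for kv in keyvals if kv != WRAPPER]
    return [ident, classes, cleaned], blocks[0]


# Roundtrip for <p class="indexsub" lang="de">föderierte</p>
inlines = [{"t": "Str", "c": "föderierte"}]
div = wrap_para("", ["indexsub"], [["lang", "de"]], inlines)
attr, para = unwrap_para(div)
```

Because the paragraph stays an ordinary `Para` inside an ordinary `Div`, writers that know nothing about the marker would still produce valid output, just with the extra Div.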
Describe your proposed improvement and the problem it solves.
I propose to extend the `Para` constructor in Pandoc's AST to include attributes (ID, classes, key-value pairs), similar to how `Header`, `CodeBlock`, and `Div` elements already support attributes.
Current definition:
Proposed definition:
This would solve the problem of preserving paragraph-level attributes (like IDs, classes, and other HTML attributes) during document conversion. Currently, when converting HTML with paragraphs that have CSS classes or IDs, this information is lost unless the paragraph is wrapped in a Div. This feature would make attribute handling more consistent across block elements.
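The gap described here is visible in pandoc's JSON serialization of the AST: `Header`, `CodeBlock`, and `Div` each have a slot for the `[id, classes, key-values]` triple, while `Para` holds only its inlines. The Python sketch below illustrates this with plain dictionaries in that JSON shape; `block_attr` is a hypothetical helper written for this example, not part of any pandoc library.

```python
def block_attr(block):
    """Return the [id, classes, key-values] triple of a block,
    or None when the block type has no attribute slot."""
    t, c = block["t"], block.get("c")
    if t == "Header":               # c = [level, attr, inlines]
        return c[1]
    if t in ("CodeBlock", "Div"):   # c = [attr, ...]
        return c[0]
    return None                     # e.g. Para: c = [inlines] only


para = {"t": "Para", "c": [{"t": "Str", "c": "Abfragen"}]}
header = {"t": "Header", "c": [1, ["intro", [], []], [{"t": "Str", "c": "Intro"}]]}
# The current workaround: the class lives on a Div wrapped around the paragraph.
div = {"t": "Div", "c": [["", ["indexmain"], []], [para]]}

print(block_attr(para))    # None
print(block_attr(header))  # ['intro', [], []]
print(block_attr(div))     # ['', ['indexmain'], []]
```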
This proposal addresses part of the broader goal described in issue #684 ("Permit adding attributes to all Markdown elements"), focusing specifically on paragraph elements as a first step. Paragraphs are one of the most common elements in documents, making this a high-value improvement that affects many conversion scenarios.
Implementation would include:
`paraWith` function and modifying the existing `para` function
This would be consistent with the existing pattern used for other block elements that support attributes.
Describe alternatives you've considered.
Wrapping paragraphs with attributes in Divs: This is the current workaround, but it creates an additional nesting level that complicates the document structure and isn't semantically accurate.
Using a `Maybe Attr` field: I considered using `Maybe Attr` to maintain backward compatibility, but this would be inconsistent with other block types like `Header` and `CodeBlock` which use a direct `Attr` field with `nullAttr` for no attributes.
Creating a new `ParaWith` constructor: This would preserve the existing `Para` constructor unchanged, but would introduce redundancy and complicate pattern matching throughout the codebase.
Custom filter for post-processing: A filter could be used to convert specially-marked Divs to paragraphs with attributes, but this would be a workaround rather than a proper solution.
The direct addition of an `Attr` field to the `Para` constructor seems most consistent with the existing design of the Pandoc AST and would provide the cleanest interface for users of the library.
Implementation Plan
Here's a step-by-step approach for implementing this feature:
1. Modify `Text.Pandoc.Definition`:
- Update the `Para` constructor to include `Attr` parameter.
2. Update `Text.Pandoc.Builder`:
- `paraWith :: Attr -> Inlines -> Blocks` function
- `para` function to use `nullAttr`
3. Update HTML Reader (`Text.Pandoc.Readers.HTML`):
- Modify `pPara` to capture attributes from `p` tags and use them in the created `Para` blocks.
4. Update all Writers:
- `p` tags
5. Update Pattern Matching:
- Find and update all places in the codebase that pattern match on `Para` elements.
6. Update Tests:
7. Update Documentation:
Backward Compatibility
This change would require updating pattern matching throughout the codebase, but the semantic meaning of the `Para` constructor would not change. The use of `nullAttr` would maintain behavior equivalence for paragraphs without attributes.
Benefits
I'm willing to implement this feature if the approach is acceptable to the maintainers.