Add Attribute Support for Paragraphs: Enabling Para Elements to Carry HTML Classes, IDs, and Custom Attributes #10768

Open
Valgard opened this issue Apr 8, 2025 · 9 comments

@Valgard
Valgard commented Apr 8, 2025

Describe your proposed improvement and the problem it solves.

I propose extending the Para constructor in Pandoc's AST to include attributes (ID, classes, key-value pairs), similar to how Header, CodeBlock, and Div elements already support attributes.

Current definition:

data Block
  = ...
  | Para [Inline]
  | ...

Proposed definition:

data Block
  = ...
  | Para Attr [Inline]
  | ...

This would solve the problem of preserving paragraph-level attributes (like IDs, classes, and other HTML attributes) during document conversion. Currently, when converting HTML with paragraphs that have CSS classes or IDs, this information is lost unless the attributes are carried by a Div wrapped around the paragraph. This feature would make attribute handling more consistent across block elements.

This proposal addresses part of the broader goal described in issue #684 ("Permit adding attributes to all Markdown elements"), focusing specifically on paragraph elements as a first step. Paragraphs are one of the most common elements in documents, making this a high-value improvement that affects many conversion scenarios.

Implementation would include:

  1. Updating the Block data type in Text.Pandoc.Definition
  2. Extending the builder in Text.Pandoc.Builder by adding a paraWith function and modifying the existing para function
  3. Modifying the HTML reader to capture paragraph attributes
  4. Updating all writers to respect paragraph attributes
  5. Adding appropriate tests

This would be consistent with the existing pattern used for other block elements that support attributes.

Describe alternatives you've considered.

  1. Wrapping paragraphs with attributes in Divs: This is the current workaround, but it creates an additional nesting level that complicates the document structure and isn't semantically accurate.

  2. Using a Maybe Attr field: I considered using Maybe Attr to maintain backward compatibility, but this would be inconsistent with other block types like Header and CodeBlock which use a direct Attr field with nullAttr for no attributes.

  3. Creating a new ParaWith constructor: This would preserve the existing Para constructor unchanged, but would introduce redundancy and complicate pattern matching throughout the codebase.

  4. Custom filter for post-processing: A filter could be used to convert specially-marked Divs to paragraphs with attributes, but this would be a workaround rather than a proper solution.

The direct addition of an Attr field to the Para constructor seems most consistent with the existing design of the Pandoc AST and would provide the cleanest interface for users of the library.

Implementation Plan

Here's a step-by-step approach for implementing this feature:

  1. Modify Text.Pandoc.Definition:

    • Update the Para constructor to include an Attr parameter

  2. Update Text.Pandoc.Builder:

    • Add a new paraWith :: Attr -> Inlines -> Blocks function
    • Modify the existing para function to use nullAttr

  3. Update the HTML Reader (Text.Pandoc.Readers.HTML):

    • Modify pPara to capture attributes from p tags and use them in the created Para blocks

  4. Update all Writers:

    • HTML Writer: Add attributes to p tags
    • LaTeX Writer: Consider using custom commands for paragraphs with attributes
    • Markdown Writer: Perhaps use HTML comments or attribute syntax to preserve attributes
    • Other Writers: Add appropriate handling

  5. Update Pattern Matching:

    • Find and update all places in the codebase that pattern match on Para elements

  6. Update Tests:

    • Add reader tests to verify attributes are properly captured
    • Add writer tests to verify attributes are properly rendered
    • Add round-trip tests to verify attributes are preserved

  7. Update Documentation:

    • Update relevant sections in the manual
    • Add examples of how paragraph attributes can be used

Backward Compatibility

This change would require updating every pattern match on Para throughout the codebase, but the semantic meaning of the Para constructor would not change. Using nullAttr for paragraphs without attributes would keep their behavior identical to today's.
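
To make the cost of the pattern-matching change concrete, the same constructor shape is what JSON filters see. The following is a minimal sketch, not pandoc code. It assumes pandoc's JSON encoding of Para (an object with "t": "Para" and the inline list under "c") and assumes, by analogy with how Header is encoded, that the proposed constructor would serialize as [attr, inlines]. That second shape is hypothetical, and the name para_inlines is illustrative only.

```python
# Sketch: how a JSON filter's view of Para would change.
# Current encoding (pandoc JSON):          {"t": "Para", "c": [inline, ...]}
# Proposed encoding (ASSUMPTION, modelled
# on Header's [level, attr, inlines]):     {"t": "Para", "c": [attr, [inline, ...]]}


def para_inlines(block):
    """Return (attr, inlines) for a Para block in either encoding."""
    if block.get("t") != "Para":
        raise ValueError("not a Para block")
    content = block["c"]
    # In the hypothetical new shape, "c" has two elements and the first is an
    # Attr triple (a list). Inline elements are always JSON objects (dicts),
    # so the two shapes cannot be confused.
    if len(content) == 2 and isinstance(content[0], list):
        attr, inlines = content
        return attr, inlines
    # Current shape: no attributes, so report the JSON form of nullAttr.
    return ["", [], []], content


old_para = {"t": "Para", "c": [{"t": "Str", "c": "Hello"}]}
new_para = {"t": "Para", "c": [["intro", ["note"], []], [{"t": "Str", "c": "Hello"}]]}
```

A filter written only against the current shape would treat the Attr triple as the first inline, which is why existing code that touches paragraphs would need a change of this kind.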

Benefits

  1. More consistent AST design across block elements
  2. Better preservation of document semantics during format conversion
  3. Enhanced styling capabilities for paragraph elements
  4. Reduced need for wrapper divs simply to carry paragraph attributes
  5. First step toward the broader goal of supporting attributes on all elements (issue #684, "Permit adding attributes to all Markdown elements")

I'm willing to implement this feature if the approach is acceptable to the maintainers.

@jgm
Owner

jgm commented Apr 8, 2025

This would be a very large change, affecting not just pandoc-types and pandoc but packages that depend on them. Filters would also have to be changed. I'm not sure the disruption is worth the benefit.

I don't think it would really count as a step towards supporting attributes on all elements, because the best way to do that would not be to add a separate attributes field to each constructor, but to do something like what I do in djot-hs.
https://github.com/jgm/djoths/blob/main/src/Djot/AST.hs#L95-L293

@Valgard
Author

Valgard commented Apr 9, 2025

Thank you for your feedback regarding the paragraph attributes feature and for pointing me to the Djot implementation. I completely understand your concerns about the potential disruption such a change would cause to the ecosystem.

After reviewing the Djot.AST code and your comments, I see the elegance of the node-based approach. I've also explored existing discussions on similar topics, including "pandoc vs djot document model / footnotes" (Discussion #9526) and "From HTML to Markdown" (Discussion #9324), which deal with related architectural considerations.

I've started a discussion (#10771) to gather input from the community about different possible approaches:

  1. The direct approach (adding Attr to Para)
  2. A node-based architecture similar to Djot
  3. Potential intermediate solutions that might balance functionality with migration costs

This way, we can explore whether there's a path forward that provides the needed functionality without causing excessive disruption to dependent packages and users.

Thanks again for your guidance - it's been helpful in understanding the architectural considerations involved.

@rnwst
Contributor

rnwst commented Apr 9, 2025

@Valgard could you provide some concrete examples where this capability would be useful? I understand that in HTML paragraphs can have attrs (as can any other element), but pandoc is a document conversion program after all, so the format you are converting to must also support paragraph attrs for this to be useful at all. If you want to apply attrs to block-level content, the Div element is already available to you. Note that the comments in #684 are largely outdated as a lot of the elements mentioned there already include attrs (such as headers or tables). Personally, I have never felt the need to add attrs to a paragraph, but perhaps you have a compelling use case in mind!

@Valgard
Author

Valgard commented Apr 9, 2025

Thank you for asking about concrete examples. I can share a specific case I'm currently working on that demonstrates the need for paragraph attributes:

Real-world example: Converting structured EPUB/HTML indexes to other formats

I'm currently working with book indexes from EPUB/HTML files that use paragraph attributes (CSS classes) to indicate the hierarchy and structure of the index:

<p class="indexmain1"><span><strong>A</strong></span></p>
<p class="indexmain"><span>Abfragen</span></p>
<p class="indexsub"><span>föderierte<span class="spacei"></span><a href="...">122</a></span></p>
<p class="indexsub"><span>Push-down-<span class="spacei"></span><a href="...">122</a></span></p>

These CSS classes (indexmain1, indexmain, indexsub) carry crucial semantic information about the index hierarchy. When converting to LaTeX or other formats, this structure needs to be preserved to properly render the index with appropriate indentation, formatting, and relationships.

My current workaround: After struggling with various approaches, I had to resort to a fragmented workflow:

  1. Write a custom Python script specifically to handle the index portions
  2. Convert the rest of the content using Pandoc
  3. Manually merge the results

This workflow is inefficient and error-prone, but necessary because Pandoc cannot preserve the paragraph-level semantic information during conversion. With paragraph attribute support, I could have solved this with a simple Pandoc filter that recognizes the classes and applies appropriate transformations.

Currently, I'm forced to choose between writing complex Lua filters that identify and transform these paragraphs by their content (since the attribute information is gone) and wrapping everything in Divs. Both have significant drawbacks:

  1. Content pattern matching: Fragile and breaks with content changes
  2. Div wrapping: Creates unnecessary nesting and complicates the document structure

If paragraphs could carry their attributes in the AST, filters could simply check for specific classes and transform them accordingly, leading to much cleaner and more robust conversion.
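
The kind of filter described above would be short once the classes are visible in the AST. Here is a minimal sketch over pandoc's JSON AST, assuming the classes arrive on a Div that wraps a single Para (the representation available today). The class-to-depth mapping and the name index_level are illustrative assumptions, not part of pandoc or of the source EPUB's documentation.

```python
# ASSUMED mapping from the publisher's CSS classes to index nesting depth.
INDEX_LEVELS = {"indexmain1": 0, "indexmain": 1, "indexsub": 2}


def index_level(block):
    """Return the index depth for a Div-wrapped paragraph, or None."""
    if block.get("t") != "Div":
        return None
    # A Div's content is [attr, blocks]; attr is [identifier, classes, key-values].
    (_ident, classes, _kvs), children = block["c"]
    if len(children) != 1 or children[0].get("t") != "Para":
        return None
    for cls in classes:
        if cls in INDEX_LEVELS:
            return INDEX_LEVELS[cls]
    return None


entry = {
    "t": "Div",
    "c": [["", ["indexsub"], []],
          [{"t": "Para", "c": [{"t": "Str", "c": "föderierte"}]}]],
}
```

The lookup depends only on the class list, so it does not break when the paragraph text changes, unlike content pattern matching.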

Additional use cases where paragraph attributes are valuable:

  1. Semantic Markup and Accessibility

When working with technical documentation or academic content, paragraphs often carry semantic meaning:

<p class="note">This is an important note for the reader.</p>
<p class="warning">Warning: This approach has limitations.</p>
<p class="example">Consider the following example: ...</p>

These could be transformed appropriately into target formats (admonition blocks in LaTeX, specifically styled paragraphs in DOCX, etc.).

  2. Multilingual Document Processing

<p lang="en">English text</p>
<p lang="de">Deutscher Text</p>

Preserving language attributes ensures proper hyphenation, spell-checking, and accessibility in output formats.

  3. Publishing Workflows with CSS-based Styling

<p class="first-paragraph">The opening paragraph with special styling...</p>
<p class="summary">Summary of key points...</p>
<p class="quote-attribution">— Author Name</p>

Why Divs Aren't an Ideal Solution

While Divs can technically carry attributes, they create several problems:

  1. Semantic Accuracy: A paragraph with attributes is semantically different from a Div containing a paragraph
  2. Processing Complexity: Requires extra steps to extract and process nested elements
  3. Format Compatibility: Many formats (LaTeX, DOCX, EPUB) have direct support for paragraph-level styling
  4. Round-trip Conversion: Converting HTML → Markdown → HTML loses paragraph attributes but preserves Div attributes

I understand that architectural changes require careful consideration, but having spent significant time implementing workarounds for this limitation, I believe paragraph attributes would be a valuable addition that aligns with Pandoc's goal of faithful document conversion.

Does this help clarify the practical need for this feature?

@jgm
Owner

jgm commented Apr 9, 2025

What if pandoc's HTML reader dealt with p elements with attributes by wrapping them in a Div? I think this would give you most of what you need at minimal cost. We could put a special class wrapper on the Div so that the HTML writer could unwrap it and give you a round-trip.

@rnwst
Contributor

rnwst commented Apr 9, 2025

@Valgard thanks for providing some use cases. Some quick thoughts for each of them below.

  1. Converting structured EPUB/HTML indexes to other formats
    I'm not sure I understand this example. When you talk about hierarchy and indentation, this suggests to me that perhaps these elements should have been nested to represent their hierarchical relationships semantically in HTML, but instead HTML classes were used to convey these relationships. If this is the case then this is simply badly written HTML, and people with screen readers will struggle to make sense of this. In my view, pandoc cannot be expected to be the tool of choice to fix bad HTML. In this case I would suggest perhaps a nodejs script using jsdom or a similar library to transform the HTML to something that makes sense semantically before passing it to pandoc (it seems you have done something similar with a Python script already).
  2. Semantic Markup and Accessibility
    (Sidenote: For accessibility, the first example here should be using role="note", not class="note"). If you are converting to HTML you could choose Divs and Spans to create an equivalent structure. If, however, you are reading from HTML then jgm's suggestion of altering the HTML reader would be a good option I think.
  3. Multilingual Document Processing
    Again, I think jgm's suggestion would provide a remedy here. (Btw, there is an example in the manual for switching languages with Divs and Spans.)
  4. Publishing Workflows with CSS-based Styling
    The headline here makes it sound like you are publishing to HTML, in which case you could utilise a filter (or custom writer) to convert Divs and Spans with attrs enclosing paragraphs to the appropriate HTML.

@jgm
Owner

jgm commented Apr 10, 2025

Note also this precedent. Djot allows attributes to be placed on paragraphs (as does commonmark with the attributes extension). Here is how we handle it:

% pandoc -f djot -t native
{.foo #bar}
This is a paragraph.
^D
[ Div
    ( "bar" , [ "foo" ] , [ ( "wrapper" , "1" ) ] )
    [ Para [ Str "This is a paragraph." ] ]
]
% pandoc -f native -t djot
[ Div
    ( "bar" , [ "foo" ] , [ ( "wrapper" , "1" ) ] )
    [ Para [ Str "This is a paragraph." ] ]
]
^D
{#bar .foo}
This is a paragraph.

The wrapper attribute is used to tell pandoc internally that this container has just been added as an attribute-containing wrapper, and the djot writer unwraps it automatically. The HTML reader and writer could be modified in a similar way.
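
For filter authors, the same convention is visible in pandoc's JSON output. The sketch below assumes the JSON encoding of the native Div shown above. The helper names is_wrapper_div and unwrap are hypothetical, the "exactly one Para" check is an assumption, and none of this is the djot writer's actual code.

```python
WRAPPER_KV = ["wrapper", "1"]  # JSON form of the ("wrapper", "1") pair


def is_wrapper_div(block):
    """True for a Div that carries the wrapper marker and holds exactly one Para."""
    if block.get("t") != "Div":
        return False
    (_ident, _classes, kvs), children = block["c"]
    return (WRAPPER_KV in kvs
            and len(children) == 1
            and children[0].get("t") == "Para")


def unwrap(block):
    """Split a wrapper Div into (attr without the marker, contained Para)."""
    (ident, classes, kvs), children = block["c"]
    user_kvs = [kv for kv in kvs if kv != WRAPPER_KV]
    return [ident, classes, user_kvs], children[0]


# JSON form of the Div from the native example.
wrapped = {
    "t": "Div",
    "c": [["bar", ["foo"], [["wrapper", "1"]]],
          [{"t": "Para", "c": [{"t": "Str", "c": "This is a paragraph."}]}]],
}
```

Dropping the marker during unwrapping keeps the internal bookkeeping attribute out of the output, which matches the djot round-trip shown above, where only #bar and .foo reappear.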

@Valgard
Author

Valgard commented Apr 12, 2025

Thank you for your thoughts on the use cases and for sharing the wrapper approach used with Djot. Let me first clarify some points about my situation:

On the HTML/EPUB Index Example

The HTML is not authored by me - it comes from published EPUBs that I need to process. This is a common real-world scenario: working with content we don't control but need to convert faithfully. Many commercial publishers use CSS classes for semantic markup in EPUB files rather than nested structures (whether ideal or not).

When you say:

pandoc cannot be expected to be the tool of choice to fix bad HTML

I'm not asking Pandoc to fix anything - I'm asking it to preserve the semantic information that exists in the source document during conversion. This is precisely what Pandoc excels at for other elements (like headings and code blocks with attributes).

The issue isn't about "fixing bad HTML" but about faithfully converting between formats while preserving semantic information, which is Pandoc's core purpose.

On the Wrapper Attribute Approach

The approach you shared using a Div with a special wrapper="1" attribute is actually very elegant! This effectively provides a way to represent paragraph attributes without changing the core AST structure.

This solution could work well for my use case and others:

  1. The HTML reader could be modified to detect paragraph attributes and wrap them in Divs with the wrapper="1" attribute
  2. Writers that support paragraph attributes (like Djot, CommonMark with attributes extension, HTML) could check for this wrapper attribute and unwrap it, applying the attributes directly to the paragraph
  3. Writers that don't support paragraph attributes would simply render the wrapper Div
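
The three steps can be sketched over the JSON AST. This is an illustration under stated assumptions, not the eventual HTML reader or writer: wrap_paragraph and render are hypothetical names, the renderer understands only Para and Div blocks with Str and Space inlines, and it does no HTML escaping.

```python
WRAPPER_KV = ["wrapper", "1"]


def wrap_paragraph(ident, classes, kvs, inlines):
    """Reader side (step 1): bare Para without attributes, wrapper Div with them."""
    para = {"t": "Para", "c": inlines}
    if not ident and not classes and not kvs:
        return para
    return {"t": "Div", "c": [[ident, classes, kvs + [WRAPPER_KV]], [para]]}


def attr_to_html(ident, classes, kvs):
    """Format an Attr triple as an HTML attribute string (no escaping)."""
    parts = []
    if ident:
        parts.append(' id="' + ident + '"')
    if classes:
        parts.append(' class="' + " ".join(classes) + '"')
    for key, value in kvs:
        parts.append(" " + key + '="' + value + '"')
    return "".join(parts)


def text_of(inlines):
    """Flatten Str and Space inlines to text; other inline types are ignored."""
    out = []
    for inline in inlines:
        if inline["t"] == "Str":
            out.append(inline["c"])
        elif inline["t"] == "Space":
            out.append(" ")
    return "".join(out)


def render(block, supports_para_attrs=True):
    """Writer side: unwrap (step 2) or fall back to rendering the Div (step 3)."""
    if block["t"] == "Para":
        return "<p>" + text_of(block["c"]) + "</p>"
    (ident, classes, kvs), children = block["c"]
    is_wrapper = WRAPPER_KV in kvs
    attrs = attr_to_html(ident, classes, [kv for kv in kvs if kv != WRAPPER_KV])
    if is_wrapper and supports_para_attrs:
        return "<p" + attrs + ">" + text_of(children[0]["c"]) + "</p>"
    inner = "".join(render(child, supports_para_attrs) for child in children)
    return "<div" + attrs + ">" + inner + "</div>"


block = wrap_paragraph("", ["indexsub"], [], [{"t": "Str", "c": "Abfragen"}])
```

With the defaults, render(block) puts the class on the p element. With supports_para_attrs=False, it keeps the Div around a plain p. Whether the fallback should also emit the wrapper marker is a design choice this sketch leaves out.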

This approach has several advantages:

  • No major AST changes required
  • Backward compatibility maintained
  • Clear semantics about the purpose of the wrapper
  • Elegant handling in formats that support paragraph attributes

Moving Forward

I'd be happy to help implement this wrapper-based approach for the HTML reader/writer if that would be welcome. It seems like a pragmatic solution that provides the functionality needed while respecting the current architectural constraints and following the precedent already established in the Djot reader/writer.

Would a pull request that implements this wrapper-based approach for HTML paragraphs be something you'd consider?

@jgm
Owner

jgm commented Apr 12, 2025

Yes, I think that's a good way to go.

johannhartmann added a commit to Valgard/pandoc that referenced this issue May 17, 2025
This change implements support for preserving HTML paragraph attributes:

1. Readers/HTML.hs: Modified pPara to detect paragraphs with attributes
   and wrap them in a special Div with wrapper="1" attribute to preserve
   the original paragraph attributes.

2. Writers/HTML.hs: Updated blockToHtmlInner to detect wrapper Divs
   and apply their attributes directly to the contained paragraph.

3. HTML/Parsing.hs: Improved attribute handling for data-* attributes.

4. Added tests to verify the correct handling of paragraph attributes
   in both HTML to native and HTML to HTML conversions.

This maintains clean roundtripping of paragraph attributes while
keeping the AST structure consistent with Pandoc's design.
Valgard added a commit to Valgard/pandoc that referenced this issue May 17, 2025
- HTML reader wraps attributed `p` tags in `Div` with `wrapper="1"`.
- HTML writer unwraps `Div` with `wrapper="1"` back to attributed `p` tag.
- Add tests for HTML paragraph attribute roundtrip.
- Update EPUB golden files to reflect new AST for attributed paragraphs.
johannhartmann added a commit to Valgard/pandoc that referenced this issue May 18, 2025
This change implements support for preserving HTML paragraph attributes:

1. Readers/HTML.hs: Modified pPara to detect paragraphs with attributes
   and wrap them in a special Div with wrapper="1" attribute to preserve
   the original paragraph attributes.

2. Writers/HTML.hs: Updated blockToHtmlInner to detect wrapper Divs
   and apply their attributes directly to the contained paragraph.

3. HTML/Parsing.hs: Improved attribute handling for data-* attributes.

4. Added tests to verify the correct handling of paragraph attributes
   in both HTML to native and HTML to HTML conversions.

This maintains clean roundtripping of paragraph attributes while
keeping the AST structure consistent with Pandoc's design.
johannhartmann added a commit to Valgard/pandoc that referenced this issue May 18, 2025
3 participants