Requirements for cross-reference streams in original PDFs (no incremental updates) is unclear #517

Open
petervwyatt opened this issue Jan 21, 2025 · 8 comments
Labels
documentation Improvements or additions to documentation question Further information is requested


@petervwyatt
Member

Subclause 7.5.4 Cross-reference table very clearly identifies that an original PDF (i.e. without any incremental updates) must always explicitly define the start of the free object list when using conventional cross-reference sections (i.e. with the xref keyword):

Initially, the entire table consists of a single section ...

For a PDF file that has never been incrementally updated, the cross-reference section shall contain only one subsection, whose object numbering begins at 0.

The first entry in the table (object number 0) shall always be free and shall have a generation number of 65,535; this entry shall be the head of the linked list of free objects

and hence why we always see:

...
xref
0 ???
0000000000 65535 f
...
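For illustration, the three quoted requirements can be checked mechanically. The following is a minimal Python sketch, not a full parser: the function name `check_original_xref` is hypothetical, and it assumes the section text has already been isolated (from the xref keyword up to, but not including, the trailer keyword).

```python
import re


def check_original_xref(section: str) -> list[str]:
    """Return the problems found in a conventional xref section of an original PDF."""
    lines = [ln.strip() for ln in section.strip().splitlines()]
    if not lines or lines[0] != "xref":
        return ["section does not start with the xref keyword"]

    # Subsection headers look like "<first object number> <entry count>".
    headers = [ln for ln in lines[1:] if re.fullmatch(r"\d+ \d+", ln)]
    # Entries look like "<10-digit offset> <5-digit generation> <n|f>".
    entries = [ln for ln in lines[1:] if re.fullmatch(r"\d{10} \d{5} [nf]", ln)]

    problems = []
    if len(headers) != 1:
        problems.append(f"expected exactly one subsection, found {len(headers)}")

    first_obj = int(headers[0].split()[0]) if headers else None
    if first_obj != 0:
        problems.append("object numbering does not begin at 0")
    elif not entries:
        problems.append("subsection has no entries")
    else:
        _offset, gen, kind = entries[0].split()
        if kind != "f" or int(gen) != 65535:
            problems.append("entry 0 is not a free entry with generation 65535")
    return problems


good = "xref\n0 3\n0000000000 65535 f \n0000000017 00000 n \n0000000081 00000 n \n"
bad = "xref\n1 2\n0000000017 00000 n \n0000000081 00000 n \n"
print(check_original_xref(good))
print(check_original_xref(bad))
```

The offsets 17 and 81 are made-up values; only the shape of the section matters here.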

What is unclear is if subclause 7.5.8 Cross-reference stream has similar requirements - must an original PDF file that only uses cross-reference streams do the equivalent? i.e. can the Index entry for an original PDF start from 1 (and thus object 0 as the head of the free list is never explicitly defined in the cross-reference stream)? Or, more generally, can the Index entry for an original PDF start from N with the free object list assumed?

This is also a roundabout way of asking if an original PDF that only uses cross-reference streams can also have a value of 0 for the first element in the W array and thus rely on the default value of Type 1 (in-use, uncompressed) entries as per Table 17 ("If the first element is zero, the type field shall not be present, and shall default to Type 1.").
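To make the scenario concrete, here is a rough Python sketch of how a reader might decode cross-reference stream data, including the Table 17 default when the first element of W is 0. The function name `decode_xref_stream` is hypothetical, the data is assumed to be already decompressed, and fields are read high-order byte first as unsigned integers (the unsigned reading is an assumption of this sketch).

```python
def decode_xref_stream(data: bytes, w: list[int], index: list[int]) -> dict[int, tuple[int, int, int]]:
    """Map object number -> (type, field 2, field 3) for decoded stream data."""
    w1, w2, w3 = w
    entry_len = w1 + w2 + w3
    entries = {}
    pos = 0
    # Index holds pairs: first object number, number of entries.
    for first, count in zip(index[0::2], index[1::2]):
        for obj_num in range(first, first + count):
            raw = data[pos:pos + entry_len]
            if len(raw) != entry_len:
                raise ValueError("stream data is shorter than Index implies")
            # A zero-width type field defaults to Type 1 (in use, uncompressed).
            entry_type = int.from_bytes(raw[:w1], "big") if w1 else 1
            field2 = int.from_bytes(raw[w1:w1 + w2], "big")
            # A zero-width field decodes to 0 here (empty bytes -> 0).
            field3 = int.from_bytes(raw[w1 + w2:], "big")
            entries[obj_num] = (entry_type, field2, field3)
            pos += entry_len
    return entries


# Hypothetical original PDF with W = [0 2 0] and Index = [1 2]:
# object 0 never appears, and every listed entry is implicitly Type 1.
implicit = decode_xref_stream(bytes.fromhex("00110051"), [0, 2, 0], [1, 2])

# The same two objects with an explicit object 0, W = [1 2 2], Index = [0 3].
explicit = decode_xref_stream(
    bytes.fromhex("000000ffff" "0100110000" "0100510000"), [1, 2, 2], [0, 3]
)
print(implicit)
print(explicit)
```

With W = [0 2 0] there is no way to express a Type 0 entry at all, which is why the two questions above are linked.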

PS. I am looking for what the spec should formally state - NOT what implementations may or may not support - although that may be part of the answer.

@mkl-public

As I see it, 7.5.4 contains requirements both for PDF cross reference information in general and for storing them in a cross reference table in particular, and unfortunately these requirements are not clearly separated but actually very intermingled. 7.5.8 contains requirements for alternatively storing them in a cross reference stream.

In my opinion this implies that the requirements from 7.5.4 for PDF cross reference information in general also need to be satisfied when using cross reference streams if possible. The actual challenge is to determine which requirements are general and which are cross reference table specific.

Thus, yes, this means that in the cross reference stream of the initial revision of a PDF the single range of entries must be for the object numbers from 0 to Size - 1, and entry 0 must contain the anchor of the free object list.

In particular, a value of 0 for the first element in the W array is not allowed for the cross reference stream in the initial revision of a PDF. A value smaller than 2 in the last entry of that array is not allowed either, because the required generation number 65535 needs those two bytes.
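As a sketch of what this reading implies for a writer, the following Python builds the uncompressed stream data for an original PDF. The function name `build_original_xref_data` and the offsets are hypothetical; it assumes every object is in use, uncompressed, and has generation 0, and it stores fields high-order byte first as unsigned integers.

```python
def build_original_xref_data(offsets: list[int]) -> tuple[list[int], list[int], bytes]:
    """Return (W, Index, data), where offsets[i] is the byte offset of object i + 1."""
    size = len(offsets) + 1  # object 0 plus one entry per in-use object
    # Field 2 width: enough bytes for the largest offset, and at least 1.
    w2 = max(1, (max(offsets, default=0).bit_length() + 7) // 8)
    w = [1, w2, 2]  # explicit type field; 2 bytes so that 65535 fits

    # Object 0: Type 0 (free), next free object 0, generation 65535.
    rows = [(0, 0, 65535)]
    rows += [(1, off, 0) for off in offsets]

    data = b"".join(
        t.to_bytes(1, "big") + f2.to_bytes(w2, "big") + f3.to_bytes(2, "big")
        for t, f2, f3 in rows
    )
    return w, [0, size], data


w, index, data = build_original_xref_data([17, 81, 300])
print(w, index, data.hex())
```

The single Index range runs from 0 to Size - 1, and the first row is the anchor of the free object list.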


I think it is not relevant here what implementations currently support. As far as I know most implementations also accept PDFs with cross reference tables in the initial revision that don't have the 0 entry, or that are arbitrarily segmented with gaps or even overlaps. Nonetheless, such practices are formally forbidden.

Furthermore, I don't necessarily think those requirements actually are necessary or good for PDF. This merely is how I think the current spec should be interpreted. I'd be completely fine if PDF-2.1 lifted those requirements both for cross reference tables and streams.

@petervwyatt
Member Author

I was intentionally not mentioning the 3rd W entry needing to be 2 for original files to try and avoid biasing replies. So much for that cunning strategy 😀 ...

The 2-byte width is only needed for the first entry (object 0, the start of the free list, where the generation needs to be FFFFh). Every other object in an original PDF has generation 0 and could otherwise have relied on the default, saving 2 bytes per object in the (uncompressed) stream data. That is a very significant overhead, although compression reduces it. Original PDFs also don't have free objects unless a PDF writer is inefficient, so the third W entry never needs to be anything except 2 for original PDFs from well-behaved writers.
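A quick way to see the size effect is to generate the in-use rows both ways and compare. This is only a rough Python sketch with made-up offsets: the function name `in_use_rows`, the 3-byte offset width, and the object count are all assumptions, and plain zlib stands in for whatever filter setup a real writer would use.

```python
import zlib


def in_use_rows(n_objects: int, w3: int) -> bytes:
    """Type 1 rows with a 3-byte offset and a w3-byte generation of 0."""
    rows = []
    for i in range(n_objects):
        offset = 15 + 64 * i  # made-up offsets, for illustration only
        rows.append(b"\x01" + offset.to_bytes(3, "big") + (0).to_bytes(w3, "big"))
    return b"".join(rows)


n = 1000
wide = in_use_rows(n, 2)    # W = [1 3 2]: generation stored explicitly
narrow = in_use_rows(n, 0)  # W = [1 3 0]: generation left to the default

raw_overhead = len(wide) - len(narrow)  # 2 bytes per object before compression
compressed_overhead = len(zlib.compress(wide)) - len(zlib.compress(narrow))
print("raw overhead:", raw_overhead)
print("compressed overhead:", compressed_overhead)
```

The compressed figure depends on the offsets and the compressor, so it should be read as an indication only.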

Such obvious wastefulness totally goes against the original design goal from Adobe and is thus highly surprising to me. I would have thought they would have specified things differently...

@mkl-public

mkl-public commented Jan 21, 2025

By the way, there is something strange. On the one hand, there is the statement you already quoted: "If the first element is zero, the type field shall not be present, and shall default to Type 1." On the other hand, Table 18 (Entries in a cross-reference stream), in the entry for the first element in Type 0, says "Default value: 0.", which would amount to Type 0 being the default. That latter default value is simply wrong, isn't it?

With that contradiction (and the other one in the same table identified in another issue recently) it's no wonder people shy away from relying on defaults here...

@petervwyatt
Member Author

Just linking past issues: https://pdf-issues.pdfa.org/32000-2-2020/clause07.html#Table18 from #500

@mkl-public

By the way, another piece of information is missing from the description of the cross reference stream data, or at least I cannot find it: it doesn't say in exactly which format the numbers are to be stored in the stream. Unsigned or signed? In the latter case, ones' complement or two's complement? Or something more exotic, e.g. (unpacked or packed) BCD?

The minimum of 2 bytes for the generation number 65535 holds for the regular unsigned format. A signed format would already require 3 bytes (0x00ffff), as would packed BCD (0x065535), and unpacked BCD would require 5 bytes (0x0605050305).
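These byte counts are easy to reproduce. The short Python sketch below encodes a non-negative value in each of the four candidate formats, using the shortest length that fits; the function name `encodings` is hypothetical, and the signed case assumes two's complement.

```python
def encodings(value: int) -> dict[str, bytes]:
    """Shortest big-endian encodings of a non-negative value in four formats."""
    digits = str(value)
    unsigned_len = max(1, (value.bit_length() + 7) // 8)
    signed_len = value.bit_length() // 8 + 1  # one extra bit for the sign
    # Packed BCD: two decimal digits per byte, left-padded to an even count.
    padded = digits if len(digits) % 2 == 0 else "0" + digits
    return {
        "unsigned": value.to_bytes(unsigned_len, "big"),
        "signed": value.to_bytes(signed_len, "big", signed=True),
        "packed_bcd": bytes.fromhex(padded),
        "unpacked_bcd": bytes(int(d) for d in digits),  # one digit per byte
    }


for name, raw in encodings(65535).items():
    print(f"{name}: {len(raw)} bytes, 0x{raw.hex()}")
```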

Admittedly, as there is no need for negative numbers and the intent is a compact representation, the regular unsigned format is a quite natural choice but being natural is not normative...

@gettalong

I think it is not relevant here what implementations currently support. As far as I know most implementations also accept PDFs with cross reference tables in the initial revision that don't have the 0 entry, or that are arbitrarily segmented with gaps or even overlaps. Nonetheless, such practices are formally forbidden.

As an example, it seems that the PDF viewer of Chromium-based browsers works fine with an initial cross reference section containing multiple subsections with gaps where the free objects would be, but has problems with a single subsection that explicitly lists those free entries.

@mkl-public

As an example, it seems that the PDF viewer of Chromium-based browsers works fine with an initial cross reference section containing multiple subsections with gaps where the free objects would be, but has problems with a single subsection that explicitly lists those free entries.

Hhmmm, that behavior is something one should ask them to fix. That structure after all is the only one the spec allows.

@petervwyatt
Member Author

Implementations can always go beyond the spec for broken PDFs - but at that point, what they recover vs what other implementations recover may well differ since it is outside the spec. Hopefully, they also support perfectly valid PDFs and do NOT ignore valid cross-reference info!


3 participants