Harvesting not working #561
Comments
As of today, 2024-03-11, none of these packages has been harvested. I requested the harvesting again programmatically on 2024-03-06 and received 201 HTTP responses. Given that it has been 12 days since the original harvesting requests, I am changing the issue title from "Harvesting not working or taking very long" to "Harvesting not working".
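For readers who want to repeat the programmatic request, here is a minimal sketch of how such a request body could be built. It assumes the public harvest endpoint at `https://api.clearlydefined.io/harvest` accepts a JSON array of `tool`/`coordinates` objects and that PyPI coordinates take the form `pypi/pypi/-/<name>/<version>`; the helper names and the package `example-package` are hypothetical and not taken from this thread.

```python
import json
import urllib.request

# Assumption: the harvest endpoint accepts a JSON array of
# {"tool": ..., "coordinates": ...} objects.
HARVEST_URL = "https://api.clearlydefined.io/harvest"


def pypi_coordinates(name, version):
    """Build ClearlyDefined-style coordinates for a PyPI package (assumed format)."""
    return f"pypi/pypi/-/{name}/{version}"


def build_harvest_body(packages, tool="package"):
    """Turn (name, version) pairs into a harvest request body."""
    return [
        {"tool": tool, "coordinates": pypi_coordinates(name, version)}
        for name, version in packages
    ]


def send_harvest_request(body):
    """POST the body and return the HTTP status code (not called in this sketch)."""
    request = urllib.request.Request(
        HARVEST_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# "example-package" is a placeholder, not one of the packages from this issue.
body = build_harvest_body([("example-package", "1.0.0")])
print(json.dumps(body, indent=2))
```

As this thread shows, a 201 response only indicates that the request was accepted; it does not by itself mean the harvest will produce a definition.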
@glogowski-wojciech-MSFT Thanks for reporting the issue! In ClearlyDefined, we typically download source distributions. The absence of source distributions may be the reason why the harvesting process failed for the listed packages.
During the harvesting process, we download a source distribution from PyPI to perform further analysis, such as running the licensee, reuse, and ScanCode tools. If a source package is not available, the package is currently marked as missing. This behavior was introduced in this PR to address this issue. When a package is marked as missing during the harvest, none of the downloaded registry information for that PyPI package is stored. In addition, a curation can only be created through a pull request against https://github.com/clearlydefined/curated-data rather than through the user interface. Given the recent questions about harvesting PyPI packages without source distributions, it may be worthwhile to discuss the matter further on the original issue. Should we allow the harvest to succeed even if the source PyPI package cannot be downloaded? Or should it be considered the intended behavior that no files are displayed on the component details page for PyPI packages whose source package is unavailable? @capfei @bduranc @jeffwilcox @elrayle Any thoughts?
@Jeffrey-Luszcz ☝ See comment responding to issue raised in the community meeting today.
Can I assume, in this context, that the "normal" package files (i.e. binary/deployable code) are still being retrieved and scanned? In either case, I think what's important is that we have a clear way of telling end users why they can't see the files. And if it's due to a tool error (as was discussed in clearlydefined/website#964), then I consider this different from the files just "not being available". We of course cannot consider the harvest "succeeded" in such cases.
I've just faced this issue when running ClearlyDefined integration tests. The package in question is […]. However, I wonder if we could use wheel distributions for scanning. Currently we only consider […] (crawler/providers/fetch/pypiFetch.js, line 104 in a1d12ac). At least this particular […] What do you think @elrayle @qtomlinson?
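To illustrate why wheels could be scanned at all: a wheel is a zip archive that carries its core metadata in a `<name>-<version>.dist-info/METADATA` file. The sketch below builds a tiny wheel-like archive in memory and reads it back using only the Python standard library. The package `demo` and both helper functions are hypothetical; this is not how the crawler works today, only a demonstration that the contents of a wheel are accessible.

```python
import io
import zipfile
from email.parser import Parser


def read_wheel_metadata(wheel_bytes):
    """Return (file names, parsed METADATA headers) for a wheel given as bytes."""
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as archive:
        names = archive.namelist()
        # Core metadata lives at <name>-<version>.dist-info/METADATA.
        metadata_path = next(
            (name for name in names if name.endswith(".dist-info/METADATA")),
            None,
        )
        if metadata_path is None:
            return names, None
        text = archive.read(metadata_path).decode("utf-8")
    # METADATA uses an email-header style format, so the email parser can read it.
    return names, Parser().parsestr(text)


def build_demo_wheel():
    """Build a tiny wheel-like archive in memory for the hypothetical package 'demo'."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("demo/__init__.py", "")
        archive.writestr(
            "demo-1.0.dist-info/METADATA",
            "Metadata-Version: 2.1\nName: demo\nVersion: 1.0\nLicense: MIT\n",
        )
    return buffer.getvalue()


names, metadata = read_wheel_metadata(build_demo_wheel())
print(names)
print(metadata["License"])
```

Whether the files inside a wheel are sufficient input for the license scanning tools is a separate question that this sketch does not answer.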
Historically, only source distributions of PyPI packages have been harvested. This is probably because a source distribution contains more metadata (see https://packaging.python.org/en/latest/discussions/package-formats/). .zip was considered an acceptable format for source distributions. See doc. That is why only .zip and .tar.gz files are used to harvest PyPI packages.
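The selection rule described above can be sketched as follows. This is an illustration under stated assumptions, not the crawler's actual code: it assumes each release file is a dictionary with a `filename` key, and the function name and sample file names are hypothetical.

```python
# Only these suffixes are treated as source distributions in this sketch.
SOURCE_SUFFIXES = (".tar.gz", ".zip")


def pick_source_distribution(release_files):
    """Return the first entry that looks like a source distribution, else None."""
    for entry in release_files:
        filename = entry.get("filename", "")
        if filename.endswith(SOURCE_SUFFIXES):
            return entry
    # No .tar.gz or .zip file: a wheel-only release ends up here.
    return None


wheel_only = [{"filename": "demo-1.0-py3-none-any.whl"}]
with_sdist = wheel_only + [{"filename": "demo-1.0.tar.gz"}]

print(pick_source_distribution(wheel_only))
print(pick_source_distribution(with_sdist))
```

A release that publishes only `.whl` files yields `None` under this rule, which corresponds to the case discussed in this thread where the package is marked as missing.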
I submitted harvesting requests for the following Python packages using the ClearlyDefined website, first on 2024-02-28 in a single query, and then on 2024-02-29, each one in a separate query, in a separate browser tab. As of today, 2024-03-04, none of these packages has been harvested:
Harvesting either does not work reliably or takes a very long time (5 days and counting). Either way, I believe this requires a fix or at least extra documentation. I would also appreciate help with harvesting these specific packages.