Skip to content

mistral ocr api #333

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Mar 18, 2025
Merged

mistral ocr api #333

merged 3 commits into from
Mar 18, 2025

Conversation

Daggx
Copy link
Contributor

@Daggx Daggx commented Mar 13, 2025

Summary by CodeRabbit

  • New Features
    • Introduced Optical Character Recognition (OCR) capabilities for extracting text from images (JPEG and PNG).
    • Enabled both synchronous and asynchronous OCR operations with detailed, structured results including recognized text and processing metrics.
    • Added new output formats for OCR results, providing comprehensive information about processed pages and their content.

Copy link

coderabbitai bot commented Mar 13, 2025

Walkthrough

The updates integrate Optical Character Recognition (OCR) functionality into the Mistral API. A new section in the API info JSON introduces both synchronous and asynchronous OCR capabilities. The MistralApi class now implements new methods to handle OCR processing, including launching asynchronous jobs and retrieving their results. Additionally, two new JSON output files have been added to standardize OCR responses and usage information.

Changes

File(s) Change Summary
edenai_apis/apis/mistral/.../info.json Added new "ocr" and "ocr_async" sections with file type constraints, language options, and version identifiers.
edenai_apis/apis/mistral/.../mistral_api.py Updated MistralApi to inherit from OcrInterface and added methods: ocr__ocr, ocr__ocr_async__launch_job, and ocr__ocr_async__get_job_result for handling OCR tasks.
edenai_apis/apis/mistral/outputs/ocr/ocr_* Introduced new JSON output structures for both synchronous and asynchronous OCR responses, detailing OCR results, job tracking, and usage information.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant MistralApi
    participant OCRService
    Client->>MistralApi: ocr__ocr(file, language, file_url)
    MistralApi->>OCRService: Send encoded image (base64)
    OCRService-->>MistralApi: Return OCR result
    MistralApi-->>Client: Return standardized OCR response
Loading
sequenceDiagram
    participant Client
    participant MistralApi
    participant OCRService
    Client->>MistralApi: ocr__ocr_async__launch_job(file, file_url)
    MistralApi->>OCRService: Launch async OCR job
    OCRService-->>MistralApi: Return job ID
    MistralApi-->>Client: Return job ID
    Client->>MistralApi: ocr__ocr_async__get_job_result(job_id)
    MistralApi->>OCRService: Fetch job result
    OCRService-->>MistralApi: Return OCR result data
    MistralApi-->>Client: Return standardized OCR response
Loading

Suggested reviewers

  • juandavidcruzgomez

Poem

I'm a bunny skipping through the code,
New OCR features brighten my humble abode.
Sync and async, they prance with delight,
Transforming images into text so bright.
Hop along, dear devs, in our digital glade,
Where every change is a sweet carrot parade! 🐇🥕


🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
edenai_apis/apis/mistral/mistral_api.py (3)

12-15: Remove unused imports
The imported Page, Line, and BoundingBox are never used. Consider removing them to address the static analysis warning and reduce clutter.

 from edenai_apis.features.ocr.ocr_async.ocr_async_dataclass import (
     OcrAsyncDataClass,
-    Page,
-    Line,
-    BoundingBox,
 )
🧰 Tools
🪛 Ruff (0.8.2)

12-12: edenai_apis.features.ocr.ocr_async.ocr_async_dataclass.Page imported but unused

Remove unused import

(F401)


13-13: edenai_apis.features.ocr.ocr_async.ocr_async_dataclass.Line imported but unused

Remove unused import

(F401)


14-14: edenai_apis.features.ocr.ocr_async.ocr_async_dataclass.BoundingBox imported but unused

Remove unused import

(F401)


246-256: Infer correct MIME type for base64 encoding
You currently hardcode "data:image/jpeg;base64," despite allowing PNG files. Consider detecting the file type from its extension or content to set the correct data URI prefix.

+import imghdr

 def ocr__ocr(self, file: str, language: str, file_url: str = "", **kwargs) -> ResponseType[OcrDataClass]:
     ...
     else:
         with open(file, "rb") as image_file:
             file_bytes = image_file.read()
             mime_type = imghdr.what(None, file_bytes) or "jpeg"
-            image_data = f"data:image/jpeg;base64,{base64.b64encode(file_bytes).decode('utf-8')}"
+            image_data = f"data:image/{mime_type};base64,{base64.b64encode(file_bytes).decode('utf-8')}"
     ...

344-351: Clean up or finalize commented-out code
The lines in this block are commented out. If you plan to reconstruct bounding boxes and lines, consider implementing it. Otherwise, remove it for clarity.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1061cc2 and 0d6ceb7.

📒 Files selected for processing (4)
  • edenai_apis/apis/mistral/info.json (1 hunks)
  • edenai_apis/apis/mistral/mistral_api.py (3 hunks)
  • edenai_apis/apis/mistral/outputs/ocr/ocr_async_output.json (1 hunks)
  • edenai_apis/apis/mistral/outputs/ocr/ocr_output.json (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
edenai_apis/apis/mistral/mistral_api.py

12-12: edenai_apis.features.ocr.ocr_async.ocr_async_dataclass.Page imported but unused

Remove unused import

(F401)


13-13: edenai_apis.features.ocr.ocr_async.ocr_async_dataclass.Line imported but unused

Remove unused import

(F401)


14-14: edenai_apis.features.ocr.ocr_async.ocr_async_dataclass.BoundingBox imported but unused

Remove unused import

(F401)


37-37: Redefinition of unused ChatDataClass from line 16

(F811)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: test
🔇 Additional comments (6)
edenai_apis/apis/mistral/info.json (1)

47-61: Looks good!
The new "ocr" sections are consistent with the existing JSON structure. No issues found.

edenai_apis/apis/mistral/outputs/ocr/ocr_output.json (1)

1-27: No issues found
This new JSON structure for OCR output is well-structured and consistent.

edenai_apis/apis/mistral/outputs/ocr/ocr_async_output.json (1)

1-118: No issues found
This new JSON structure for asynchronous OCR output is comprehensive and aligned with the synchronous version.

edenai_apis/apis/mistral/mistral_api.py (3)

37-37: Avoid overshadowing the previously imported ChatDataClass.
A ChatDataClass was already imported at line 16. If they refer to different classes, consider aliasing one of them. Otherwise, remove the duplicate import to resolve potential naming conflicts.

🧰 Tools
🪛 Ruff (0.8.2)

37-37: Redefinition of unused ChatDataClass from line 16

(F811)


272-272: Verify behavior for multi-page documents
Currently, the code only returns pages[0]["markdown"]. If there are multiple pages, this may cause missing text. Confirm whether you intend to process all pages.


278-301: LGTM for async job launch
The asynchronous file-upload logic and error handling appear consistent.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
edenai_apis/apis/mistral/mistral_api.py (1)

302-338: ⚠️ Potential issue

Avoid duplicated API call in get_job_result method

The current implementation has the same issue that was fixed in a previous commit (based on past review comments). You're doing a requests.get() call to get the file URL, then not reusing the response_data for extracting the URL.

 response = requests.get(url=url, headers=self.headers)
 try:
     response_data = response.json()
+    file_url = response_data["url"]
 except json.JSONDecodeError as exc:
     raise ProviderException(
         message=response.text, code=response.status_code
     ) from exc
 if response.status_code != 200:
     raise ProviderException(
         message=response_data.get("message", response.text),
         code=response.status_code,
     )
-file_url = response_data["url"]
🧹 Nitpick comments (2)
edenai_apis/apis/mistral/mistral_api.py (2)

1-2: Ensure imported modules are all being used

All the new imports for OCR functionality look good. However, I notice you're importing Line and BoundingBox (lines 13-14) but they're only used in commented-out code (lines 347-349). Consider either removing these unused imports or uncommenting the code that uses them.

Also applies to: 8-15, 29-34


246-276: Synchronous OCR implementation looks good with proper error handling

The implementation follows good practices with proper error handling for both JSON decoding errors and HTTP responses. The code correctly supports both file paths and URLs as input sources.

One minor enhancement would be to make the image MIME type detection dynamic rather than hardcoding "image/jpeg".

-                image_data = f"data:image/jpeg;base64,{base64_image}"
+                # Determine MIME type dynamically or make it configurable
+                image_data = f"data:image/jpeg;base64,{base64_image}"
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Lite

📥 Commits

Reviewing files that changed from the base of the PR and between 0d6ceb7 and b94de59.

📒 Files selected for processing (1)
  • edenai_apis/apis/mistral/mistral_api.py (3 hunks)
🔇 Additional comments (4)
edenai_apis/apis/mistral/mistral_api.py (4)

40-40: Class signature updated correctly to support OCR functionality

The class now correctly inherits from OcrInterface to implement OCR capabilities.


278-300: Asynchronous OCR job launch implementation looks good

The implementation correctly handles file uploads and error cases. Good job returning the provider job ID for tracking.


339-350: Decision needed on commented-out code

There's a significant block of commented-out code for detailed page and line processing. This should either be uncommented if it's intended functionality or removed if it's not needed. Currently, pages is initialized but remains empty while the method still returns it.

If the block is intended to be commented out, please confirm that the empty pages list is the correct implementation. If not, consider the following options:

 number_of_pages = response_data["usage_info"]["pages_processed"]
 raw_text = ""
-pages = []
+# If the commented code is meant to be removed, then either:
+# Option 1: Create empty Page objects
+pages = [Page(lines=[]) for _ in range(number_of_pages)]
+# Option 2: Set pages to None if your data model allows it
+# pages = None

351-357: Return statement leverages correct types

The use of AsyncResponseType with the correct fields and types is good.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
edenai_apis/apis/mistral/mistral_api.py (2)

300-348: Commented code should be removed or implemented

There's commented out code for handling line-by-line data. Either complete the implementation or remove the commented code to maintain cleanliness.

-            # markdown_lines = page["markdown"].split("\n")
-            # lines = []
-            # for line_text in markdown_lines:
-            #     line = Line(text=line_text, bounding_box=BoundingBox())
-            #     lines.append(line)
-            # pages.append(Page(lines=lines))

304-316: Convert error handling to a helper function

The error handling pattern is repeated in multiple places. Consider extracting it to a helper method to reduce code duplication.

Example implementation:

def _handle_response(self, response):
    """Handle API response, extract JSON and check for errors."""
    try:
        response_data = response.json()
    except json.JSONDecodeError as exc:
        raise ProviderException(
            message=response.text, code=response.status_code
        ) from exc
    if response.status_code != 200:
        raise ProviderException(
            message=response_data.get("message", response.text),
            code=response.status_code,
        )
    return response_data

Then use it in your methods:

 response = requests.get(url=url, headers=self.headers)
-try:
-    response_data = response.json()
-except json.JSONDecodeError as exc:
-    raise ProviderException(
-        message=response.text, code=response.status_code
-    ) from exc
-if response.status_code != 200:
-    raise ProviderException(
-        message=response_data.get("message", response.text),
-        code=response.status_code,
-    )
+response_data = self._handle_response(response)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Lite

📥 Commits

Reviewing files that changed from the base of the PR and between b94de59 and 9472458.

📒 Files selected for processing (1)
  • edenai_apis/apis/mistral/mistral_api.py (3 hunks)
🔇 Additional comments (5)
edenai_apis/apis/mistral/mistral_api.py (5)

1-2: Import sections look good

The new imports for OCR functionality are correctly organized, including the base modules needed for the implementation.

Also applies to: 8-15, 29-34


40-40: Class interface implementation looks correct

OcrInterface is properly added to the class inheritance list, aligning with the implementation of OCR functionality.


244-274: Synchronous OCR implementation looks solid

The method correctly handles both file input and URL-based processing with appropriate error handling. Good job encoding the image to base64 format when a file is provided.


276-298: Launch job implementation is well-structured

The async job launch functionality correctly uploads the file with proper error handling. The method returns just the job ID as required by the interface.


317-326: Reuse the response_data variable

You get the file URL and then immediately make another request. Following the previous feedback (mentioned in past_review_comments), avoid making a redundant request by reusing the variable.

 file_url = response_data["url"]
 payload = {
     "model": "mistral-ocr-latest",
     "document": {
         "type": "document_url",
         "document_url": file_url,
     },
 }
 url = "https://api.mistral.ai/v1/ocr"
-response = requests.post(url=url, headers=self.headers, json=payload)
+ocr_response = requests.post(url=url, headers=self.headers, json=payload)
 try:
-    response_data = response.json()
+    response_data = ocr_response.json()
 except json.JSONDecodeError as exc:
     raise ProviderException(
-        message=response.text, code=response.status_code
+        message=ocr_response.text, code=ocr_response.status_code
     ) from exc
-if response.status_code != 200:
+if ocr_response.status_code != 200:
     raise ProviderException(
-        message=response_data.get("message", response.text),
-        code=response.status_code,
+        message=response_data.get("message", ocr_response.text),
+        code=ocr_response.status_code,
     )

Comment on lines +349 to +355
return AsyncResponseType(
original_response=response_data,
standardized_response=OcrAsyncDataClass(
raw_text=raw_text, pages=pages, number_of_pages=number_of_pages
),
provider_job_id=provider_job_id,
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Empty pages list will cause inconsistency

The pages list is declared but never populated (the code that would populate it is commented out). This will result in an inconsistent state where number_of_pages indicates multiple pages but the pages list is empty.

Fix this by either:

  1. Implementing the commented-out code to properly populate the pages list, or
  2. Using a placeholder implementation that creates page objects with the raw text:
-        raw_text = ""
-        pages = []
-        for page in response_data["pages"]:
-            raw_text += page["markdown"]
-            # markdown_lines = page["markdown"].split("\n")
-            # lines = []
-            # for line_text in markdown_lines:
-            #     line = Line(text=line_text, bounding_box=BoundingBox())
-            #     lines.append(line)
-            # pages.append(Page(lines=lines))
+        raw_text = ""
+        pages = []
+        for page in response_data["pages"]:
+            page_text = page["markdown"]
+            raw_text += page_text
+            # Create a simple page with one line containing all text
+            line = Line(text=page_text, bounding_box=BoundingBox())
+            pages.append(Page(lines=[line]))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
return AsyncResponseType(
original_response=response_data,
standardized_response=OcrAsyncDataClass(
raw_text=raw_text, pages=pages, number_of_pages=number_of_pages
),
provider_job_id=provider_job_id,
)
raw_text = ""
pages = []
for page in response_data["pages"]:
page_text = page["markdown"]
raw_text += page_text
# Create a simple page with one line containing all text
line = Line(text=page_text, bounding_box=BoundingBox())
pages.append(Page(lines=[line]))
return AsyncResponseType(
original_response=response_data,
standardized_response=OcrAsyncDataClass(
raw_text=raw_text, pages=pages, number_of_pages=number_of_pages
),
provider_job_id=provider_job_id,
)

@juandavidcruzgomez juandavidcruzgomez merged commit 5918574 into master Mar 18, 2025
3 of 4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants