mistral ocr api #333

Daggx · 2025-03-13T09:15:29Z

Summary by CodeRabbit

New Features
- Introduced Optical Character Recognition (OCR) capabilities for extracting text from images (JPEG and PNG).
- Enabled both synchronous and asynchronous OCR operations with detailed, structured results including recognized text and processing metrics.
- Added new output formats for OCR results, providing comprehensive information about processed pages and their content.

coderabbitai · 2025-03-13T09:15:39Z

Walkthrough

The updates integrate Optical Character Recognition (OCR) functionality into the Mistral API. A new section in the API info JSON introduces both synchronous and asynchronous OCR capabilities. The MistralApi class now implements new methods to handle OCR processing, including launching asynchronous jobs and retrieving their results. Additionally, two new JSON output files have been added to standardize OCR responses and usage information.

Changes

| File(s) | Change Summary |
|---|---|
| `edenai_apis/apis/mistral/.../info.json` | Added new `"ocr"` and `"ocr_async"` sections with file type constraints, language options, and version identifiers. |
| `edenai_apis/apis/mistral/.../mistral_api.py` | Updated `MistralApi` to inherit from `OcrInterface` and added methods: `ocr__ocr`, `ocr__ocr_async__launch_job`, and `ocr__ocr_async__get_job_result` for handling OCR tasks. |
| `edenai_apis/apis/mistral/outputs/ocr/ocr_*` | Introduced new JSON output structures for both synchronous and asynchronous OCR responses, detailing OCR results, job tracking, and usage information. |

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant MistralApi
    participant OCRService
    Client->>MistralApi: ocr__ocr(file, language, file_url)
    MistralApi->>OCRService: Send encoded image (base64)
    OCRService-->>MistralApi: Return OCR result
    MistralApi-->>Client: Return standardized OCR response

sequenceDiagram
    participant Client
    participant MistralApi
    participant OCRService
    Client->>MistralApi: ocr__ocr_async__launch_job(file, file_url)
    MistralApi->>OCRService: Launch async OCR job
    OCRService-->>MistralApi: Return job ID
    MistralApi-->>Client: Return job ID
    Client->>MistralApi: ocr__ocr_async__get_job_result(job_id)
    MistralApi->>OCRService: Fetch job result
    OCRService-->>MistralApi: Return OCR result data
    MistralApi-->>Client: Return standardized OCR response
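
The asynchronous flow in the second diagram can be sketched from the caller's side as launch, then fetch. This is a minimal illustration only: `FakeOcrProvider` is a hypothetical in-memory stand-in, not the real `MistralApi`, and it returns plain strings and dicts where the real methods return the project's typed response objects.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FakeOcrProvider:
    """Hypothetical in-memory stand-in for MistralApi (assumption, not the real class)."""

    _jobs: Dict[str, str] = field(default_factory=dict)

    def ocr__ocr_async__launch_job(self, file: str, file_url: str = "") -> str:
        # The real method uploads the file and returns a provider job ID.
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = f"text extracted from {file or file_url}"
        return job_id

    def ocr__ocr_async__get_job_result(self, provider_job_id: str) -> Dict[str, str]:
        # The real method fetches the OCR result and standardizes it.
        if provider_job_id not in self._jobs:
            raise KeyError(f"unknown job id: {provider_job_id}")
        return {
            "provider_job_id": provider_job_id,
            "raw_text": self._jobs[provider_job_id],
        }


def run_async_ocr(provider: FakeOcrProvider, file: str) -> Dict[str, str]:
    """Launch a job, then fetch its result with the returned job ID."""
    job_id = provider.ocr__ocr_async__launch_job(file=file)
    return provider.ocr__ocr_async__get_job_result(job_id)
```

The point of the sketch is only the ordering: the job ID returned by the launch call is the sole handle the client needs for the follow-up call.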

Suggested reviewers

juandavidcruzgomez

Poem

I'm a bunny skipping through the code,
New OCR features brighten my humble abode.
Sync and async, they prance with delight,
Transforming images into text so bright.
Hop along, dear devs, in our digital glade,
Where every change is a sweet carrot parade! 🐇🥕

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

edenai_apis/apis/mistral/mistral_api.py (3)
12-15: Remove unused imports
The imported Page, Line, and BoundingBox are never used. Consider removing them to address the static analysis warning and reduce clutter.
 from edenai_apis.features.ocr.ocr_async.ocr_async_dataclass import (
     OcrAsyncDataClass,
-    Page,
-    Line,
-    BoundingBox,
 )
🧰 Tools

🪛 Ruff (0.8.2)

12-12: edenai_apis.features.ocr.ocr_async.ocr_async_dataclass.Page imported but unused. Remove unused import (F401)
13-13: edenai_apis.features.ocr.ocr_async.ocr_async_dataclass.Line imported but unused. Remove unused import (F401)
14-14: edenai_apis.features.ocr.ocr_async.ocr_async_dataclass.BoundingBox imported but unused. Remove unused import (F401)

246-256: Infer correct MIME type for base64 encoding
You currently hardcode "data:image/jpeg;base64," despite allowing PNG files. Consider detecting the file type from its extension or content to set the correct data URI prefix.
+import imghdr

 def ocr__ocr(self, file: str, language: str, file_url: str = "", **kwargs) -> ResponseType[OcrDataClass]:
     ...
     else:
         with open(file, "rb") as image_file:
             file_bytes = image_file.read()
             mime_type = imghdr.what(None, file_bytes) or "jpeg"
-            image_data = f"data:image/jpeg;base64,{base64.b64encode(file_bytes).decode('utf-8')}"
+            image_data = f"data:image/{mime_type};base64,{base64.b64encode(file_bytes).decode('utf-8')}"
     ...
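
A note on the suggestion above: `imghdr` was deprecated in Python 3.11 and removed in Python 3.13, so a variant based on the file extension may age better. The following is a minimal sketch under that assumption; `build_image_data_uri` and its `default_mime` parameter are hypothetical names, not part of the PR.

```python
import base64
import mimetypes


def build_image_data_uri(path: str, default_mime: str = "image/jpeg") -> str:
    """Return a base64 data URI whose MIME type is guessed from the file extension."""
    mime_type, _ = mimetypes.guess_type(path)
    # Fall back to the default when the extension is unknown or not an image.
    if mime_type is None or not mime_type.startswith("image/"):
        mime_type = default_mime
    with open(path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

Guessing from the extension does not inspect the file contents, so a mislabeled file would still get the wrong prefix; content sniffing would need a third-party library or a small magic-byte check.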
344-351: Clean up or finalize commented-out code
The lines in this block are commented out. If you plan to reconstruct bounding boxes and lines, consider implementing it. Otherwise, remove it for clarity.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1061cc2 and 0d6ceb7.

📒 Files selected for processing (4)

edenai_apis/apis/mistral/info.json (1 hunks)
edenai_apis/apis/mistral/mistral_api.py (3 hunks)
edenai_apis/apis/mistral/outputs/ocr/ocr_async_output.json (1 hunks)
edenai_apis/apis/mistral/outputs/ocr/ocr_output.json (1 hunks)

🧰 Additional context used

🪛 Ruff (0.8.2)

edenai_apis/apis/mistral/mistral_api.py

12-12: edenai_apis.features.ocr.ocr_async.ocr_async_dataclass.Page imported but unused. Remove unused import (F401)
13-13: edenai_apis.features.ocr.ocr_async.ocr_async_dataclass.Line imported but unused. Remove unused import (F401)
14-14: edenai_apis.features.ocr.ocr_async.ocr_async_dataclass.BoundingBox imported but unused. Remove unused import (F401)
37-37: Redefinition of unused ChatDataClass from line 16 (F811)

⏰ Context from checks skipped due to timeout of 90000ms (1)

GitHub Check: test

🔇 Additional comments (6)

edenai_apis/apis/mistral/info.json (1)

47-61: Looks good!
The new "ocr" sections are consistent with the existing JSON structure. No issues found.

edenai_apis/apis/mistral/outputs/ocr/ocr_output.json (1)

1-27: No issues found
This new JSON structure for OCR output is well-structured and consistent.

edenai_apis/apis/mistral/outputs/ocr/ocr_async_output.json (1)

1-118: No issues found
This new JSON structure for asynchronous OCR output is comprehensive and aligned with the synchronous version.

edenai_apis/apis/mistral/mistral_api.py (3)

37-37: Avoid shadowing the previously imported ChatDataClass.
ChatDataClass was already imported at line 16. If the two imports refer to different classes, alias one of them. Otherwise, remove the duplicate import to avoid a naming conflict.

🧰 Tools

🪛 Ruff (0.8.2)

37-37: Redefinition of unused ChatDataClass from line 16 (F811)

272-272: Verify behavior for multi-page documents
Currently, the code only returns pages[0]["markdown"]. If there are multiple pages, this may cause missing text. Confirm whether you intend to process all pages.
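
If all pages are meant to be returned, one option is to join the `markdown` field of every page. This is a sketch that assumes the response shape implied by the comment above (a `pages` list whose items carry a `markdown` string); `join_page_markdown` and the separator are hypothetical choices.

```python
from typing import Any, Dict


def join_page_markdown(response_data: Dict[str, Any], separator: str = "\n\n") -> str:
    """Concatenate the markdown of every page instead of returning only the first."""
    pages = response_data.get("pages", [])
    return separator.join(page.get("markdown", "") for page in pages)
```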

278-301: LGTM for async job launch
The asynchronous file-upload logic and error handling appear consistent.

edenai_apis/apis/mistral/mistral_api.py

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (1)

edenai_apis/apis/mistral/mistral_api.py (1)
302-338: ⚠️ Potential issue

Avoid duplicated API call in get_job_result method

The current implementation has the same issue that was fixed in a previous commit (per past review comments): it calls requests.get() to fetch the file URL but then does not reuse the parsed response_data to extract that URL.
 response = requests.get(url=url, headers=self.headers)
 try:
     response_data = response.json()
+    file_url = response_data["url"]
 except json.JSONDecodeError as exc:
     raise ProviderException(
         message=response.text, code=response.status_code
     ) from exc
 if response.status_code != 200:
     raise ProviderException(
         message=response_data.get("message", response.text),
         code=response.status_code,
     )
-file_url = response_data["url"]
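
One caveat with reading `"url"` inside the `try` block: an error response without that key would raise a `KeyError` before the status check runs. A hedged sketch of an ordering that avoids this follows; `extract_file_url` is a hypothetical helper, and `ProviderException` here is a local stand-in that only assumes the `message`/`code` arguments seen in the diff.

```python
import json
from typing import Optional


class ProviderException(Exception):
    """Local stand-in for the project's exception (message/code signature assumed)."""

    def __init__(self, message: str, code: Optional[int] = None):
        super().__init__(message)
        self.code = code


def extract_file_url(status_code: int, body: str) -> str:
    """Parse the body, check the status, and only then read the 'url' field."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError as exc:
        raise ProviderException(body, status_code) from exc
    if not isinstance(data, dict):
        raise ProviderException(body, status_code)
    if status_code != 200:
        raise ProviderException(data.get("message", body), status_code)
    if "url" not in data:
        raise ProviderException("response has no 'url' field", status_code)
    return data["url"]
```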

🧹 Nitpick comments (2)

edenai_apis/apis/mistral/mistral_api.py (2)
1-2: Ensure imported modules are all being used

All the new imports for OCR functionality look good. However, I notice you're importing Line and BoundingBox (lines 13-14) but they're only used in commented-out code (lines 347-349). Consider either removing these unused imports or uncommenting the code that uses them.

Also applies to: 8-15, 29-34

246-276: Synchronous OCR implementation looks good with proper error handling

The implementation follows good practices with proper error handling for both JSON decoding errors and HTTP responses. The code correctly supports both file paths and URLs as input sources.

One minor enhancement would be to make the image MIME type detection dynamic rather than hardcoding "image/jpeg".
-                image_data = f"data:image/jpeg;base64,{base64_image}"
+                # Determine MIME type dynamically or make it configurable
+                image_data = f"data:image/jpeg;base64,{base64_image}"

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Lite

📥 Commits

Reviewing files that changed from the base of the PR and between 0d6ceb7 and b94de59.

📒 Files selected for processing (1)

edenai_apis/apis/mistral/mistral_api.py (3 hunks)

🔇 Additional comments (4)

edenai_apis/apis/mistral/mistral_api.py (4)

40-40: Class signature updated correctly to support OCR functionality

The class now correctly inherits from OcrInterface to implement OCR capabilities.

278-300: Asynchronous OCR job launch implementation looks good

The implementation correctly handles file uploads and error cases. Good job returning the provider job ID for tracking.

339-350: Decision needed on commented-out code

There's a significant block of commented-out code for detailed page and line processing. This should either be uncommented if it's intended functionality or removed if it's not needed. Currently, pages is initialized but remains empty while the method still returns it.

If the block is intended to be commented out, please confirm that the empty pages list is the correct implementation. If not, consider the following options:
 number_of_pages = response_data["usage_info"]["pages_processed"]
 raw_text = ""
-pages = []
+# If the commented code is meant to be removed, then either:
+# Option 1: Create empty Page objects
+pages = [Page(lines=[]) for _ in range(number_of_pages)]
+# Option 2: Set pages to None if your data model allows it
+# pages = None
351-357: Return statement leverages correct types

The use of AsyncResponseType with the correct fields and types is good.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

edenai_apis/apis/mistral/mistral_api.py (2)

300-348: Commented code should be removed or implemented

There's commented out code for handling line-by-line data. Either complete the implementation or remove the commented code to maintain cleanliness.

-            # markdown_lines = page["markdown"].split("\n")
-            # lines = []
-            # for line_text in markdown_lines:
-            #     line = Line(text=line_text, bounding_box=BoundingBox())
-            #     lines.append(line)
-            # pages.append(Page(lines=lines))
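
If the line-by-line handling is completed rather than removed, it could look roughly like the sketch below. `Line` and `Page` are simplified stand-ins for the project's dataclasses (the real `Line` also takes a `bounding_box`, which is omitted here), and `build_pages` is a hypothetical helper name.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Line:
    """Simplified stand-in: text only, no bounding box."""

    text: str


@dataclass
class Page:
    """Simplified stand-in for a page made of lines."""

    lines: List[Line] = field(default_factory=list)


def build_pages(response_data: Dict[str, Any]) -> Tuple[str, List[Page]]:
    """Build the raw text and one Page per OCR page, splitting markdown into lines."""
    raw_parts: List[str] = []
    pages: List[Page] = []
    for page in response_data.get("pages", []):
        markdown = page.get("markdown", "")
        raw_parts.append(markdown)
        # Skip blank lines so each Line carries actual text.
        lines = [Line(text=text) for text in markdown.split("\n") if text.strip()]
        pages.append(Page(lines=lines))
    return "\n".join(raw_parts), pages
```

Building one `Page` per response page keeps the length of `pages` aligned with the number of pages processed, even when a page has no text.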

304-316: Convert error handling to a helper function

The error handling pattern is repeated in multiple places. Consider extracting it to a helper method to reduce code duplication.

Example implementation:

def _handle_response(self, response):
    """Handle API response, extract JSON and check for errors."""
    try:
        response_data = response.json()
    except json.JSONDecodeError as exc:
        raise ProviderException(
            message=response.text, code=response.status_code
        ) from exc
    if response.status_code != 200:
        raise ProviderException(
            message=response_data.get("message", response.text),
            code=response.status_code,
        )
    return response_data

Then use it in your methods:

 response = requests.get(url=url, headers=self.headers)
-try:
-    response_data = response.json()
-except json.JSONDecodeError as exc:
-    raise ProviderException(
-        message=response.text, code=response.status_code
-    ) from exc
-if response.status_code != 200:
-    raise ProviderException(
-        message=response_data.get("message", response.text),
-        code=response.status_code,
-    )
+response_data = self._handle_response(response)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Lite

📥 Commits

Reviewing files that changed from the base of the PR and between b94de59 and 9472458.

📒 Files selected for processing (1)

edenai_apis/apis/mistral/mistral_api.py (3 hunks)

🔇 Additional comments (5)

edenai_apis/apis/mistral/mistral_api.py (5)

1-2: Import sections look good

The new imports for OCR functionality are correctly organized, including the base modules needed for the implementation.

Also applies to: 8-15, 29-34

40-40: Class interface implementation looks correct

OcrInterface is properly added to the class inheritance list, aligning with the implementation of OCR functionality.

244-274: Synchronous OCR implementation looks solid

The method correctly handles both file input and URL-based processing with appropriate error handling. Good job encoding the image to base64 format when a file is provided.

276-298: Launch job implementation is well-structured

The async job launch functionality correctly uploads the file with proper error handling. The method returns just the job ID as required by the interface.

317-326: Reuse the response_data variable

You get the file URL and then immediately make another request. Following the previous feedback (see past review comments), avoid making a redundant request by reusing the variable.
 file_url = response_data["url"]
 payload = {
     "model": "mistral-ocr-latest",
     "document": {
         "type": "document_url",
         "document_url": file_url,
     },
 }
 url = "https://api.mistral.ai/v1/ocr"
-response = requests.post(url=url, headers=self.headers, json=payload)
+ocr_response = requests.post(url=url, headers=self.headers, json=payload)
 try:
-    response_data = response.json()
+    response_data = ocr_response.json()
 except json.JSONDecodeError as exc:
     raise ProviderException(
-        message=response.text, code=response.status_code
+        message=ocr_response.text, code=ocr_response.status_code
     ) from exc
-if response.status_code != 200:
+if ocr_response.status_code != 200:
     raise ProviderException(
-        message=response_data.get("message", response.text),
-        code=response.status_code,
+        message=response_data.get("message", ocr_response.text),
+        code=ocr_response.status_code,
     )

coderabbitai · 2025-03-18T14:20:01Z

edenai_apis/apis/mistral/mistral_api.py

+        return AsyncResponseType(
+            original_response=response_data,
+            standardized_response=OcrAsyncDataClass(
+                raw_text=raw_text, pages=pages, number_of_pages=number_of_pages
+            ),
+            provider_job_id=provider_job_id,
+        )


⚠️ Potential issue

Empty pages list will cause inconsistency

The pages list is declared but never populated (the code that would populate it is commented out). This will result in an inconsistent state where number_of_pages indicates multiple pages but the pages list is empty.

Fix this by either:

1. Implementing the commented-out code to properly populate the pages list, or
2. Using a placeholder implementation that creates page objects with the raw text:

-        raw_text = ""
-        pages = []
-        for page in response_data["pages"]:
-            raw_text += page["markdown"]
-            # markdown_lines = page["markdown"].split("\n")
-            # lines = []
-            # for line_text in markdown_lines:
-            #     line = Line(text=line_text, bounding_box=BoundingBox())
-            #     lines.append(line)
-            # pages.append(Page(lines=lines))
+        raw_text = ""
+        pages = []
+        for page in response_data["pages"]:
+            page_text = page["markdown"]
+            raw_text += page_text
+            # Create a simple page with one line containing all text
+            line = Line(text=page_text, bounding_box=BoundingBox())
+            pages.append(Page(lines=[line]))

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-        return AsyncResponseType(
-            original_response=response_data,
-            standardized_response=OcrAsyncDataClass(
-                raw_text=raw_text, pages=pages, number_of_pages=number_of_pages
-            ),
-            provider_job_id=provider_job_id,
-        )
+        raw_text = ""
+        pages = []
+        for page in response_data["pages"]:
+            page_text = page["markdown"]
+            raw_text += page_text
+            # Create a simple page with one line containing all text
+            line = Line(text=page_text, bounding_box=BoundingBox())
+            pages.append(Page(lines=[line]))
+        return AsyncResponseType(
+            original_response=response_data,
+            standardized_response=OcrAsyncDataClass(
+                raw_text=raw_text, pages=pages, number_of_pages=number_of_pages
+            ),
+            provider_job_id=provider_job_id,
+        )

mistral ocr api

0d6ceb7

Daggx requested a review from juandavidcruzgomez March 13, 2025 09:15

coderabbitai bot reviewed Mar 13, 2025

edenai_apis/apis/mistral/mistral_api.py (outdated, resolved)

juandavidcruzgomez requested changes Mar 13, 2025

edenai_apis/apis/mistral/mistral_api.py (outdated, resolved)

[Fix] get response only 1 time

b94de59

coderabbitai bot reviewed Mar 13, 2025

Daggx requested a review from juandavidcruzgomez March 13, 2025 09:58

juandavidcruzgomez approved these changes Mar 18, 2025

Merge branch 'master' into SD2-1251-add-mistral-ocr-api

9472458

coderabbitai bot reviewed Mar 18, 2025

juandavidcruzgomez merged commit 5918574 into master Mar 18, 2025
3 of 4 checks passed
