-
Notifications
You must be signed in to change notification settings - Fork 68
mistral ocr api #333
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mistral ocr api #333
Conversation
WalkthroughThe updates integrate Optical Character Recognition (OCR) functionality into the Mistral API. A new section in the API info JSON introduces both synchronous and asynchronous OCR capabilities. The MistralApi class now implements new methods to handle OCR processing, including launching asynchronous jobs and retrieving their results. Additionally, two new JSON output files have been added to standardize OCR responses and usage information. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant MistralApi
participant OCRService
Client->>MistralApi: ocr__ocr(file, language, file_url)
MistralApi->>OCRService: Send encoded image (base64)
OCRService-->>MistralApi: Return OCR result
MistralApi-->>Client: Return standardized OCR response
sequenceDiagram
participant Client
participant MistralApi
participant OCRService
Client->>MistralApi: ocr__ocr_async__launch_job(file, file_url)
MistralApi->>OCRService: Launch async OCR job
OCRService-->>MistralApi: Return job ID
MistralApi-->>Client: Return job ID
Client->>MistralApi: ocr__ocr_async__get_job_result(job_id)
MistralApi->>OCRService: Fetch job result
OCRService-->>MistralApi: Return OCR result data
MistralApi-->>Client: Return standardized OCR response
Suggested reviewers
Poem
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (3)
edenai_apis/apis/mistral/mistral_api.py (3)
12-15
: Remove unused imports
The importedPage
,Line
, andBoundingBox
are never used. Consider removing them to address the static analysis warning and reduce clutter.from edenai_apis.features.ocr.ocr_async.ocr_async_dataclass import ( OcrAsyncDataClass, - Page, - Line, - BoundingBox, )🧰 Tools
🪛 Ruff (0.8.2)
12-12:
edenai_apis.features.ocr.ocr_async.ocr_async_dataclass.Page
imported but unusedRemove unused import
(F401)
13-13:
edenai_apis.features.ocr.ocr_async.ocr_async_dataclass.Line
imported but unusedRemove unused import
(F401)
14-14:
edenai_apis.features.ocr.ocr_async.ocr_async_dataclass.BoundingBox
imported but unusedRemove unused import
(F401)
246-256
: Infer correct MIME type for base64 encoding
You currently hardcode"data:image/jpeg;base64,"
despite allowing PNG files. Consider detecting the file type from its extension or content to set the correct data URI prefix.+import imghdr def ocr__ocr(self, file: str, language: str, file_url: str = "", **kwargs) -> ResponseType[OcrDataClass]: ... else: with open(file, "rb") as image_file: file_bytes = image_file.read() mime_type = imghdr.what(None, file_bytes) or "jpeg" - image_data = f"data:image/jpeg;base64,{base64.b64encode(file_bytes).decode('utf-8')}" + image_data = f"data:image/{mime_type};base64,{base64.b64encode(file_bytes).decode('utf-8')}" ...
344-351
: Clean up or finalize commented-out code
The lines in this block are commented out. If you plan to reconstruct bounding boxes and lines, consider implementing it. Otherwise, remove it for clarity.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
edenai_apis/apis/mistral/info.json
(1 hunks)edenai_apis/apis/mistral/mistral_api.py
(3 hunks)edenai_apis/apis/mistral/outputs/ocr/ocr_async_output.json
(1 hunks)edenai_apis/apis/mistral/outputs/ocr/ocr_output.json
(1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
edenai_apis/apis/mistral/mistral_api.py
12-12: edenai_apis.features.ocr.ocr_async.ocr_async_dataclass.Page
imported but unused
Remove unused import
(F401)
13-13: edenai_apis.features.ocr.ocr_async.ocr_async_dataclass.Line
imported but unused
Remove unused import
(F401)
14-14: edenai_apis.features.ocr.ocr_async.ocr_async_dataclass.BoundingBox
imported but unused
Remove unused import
(F401)
37-37: Redefinition of unused ChatDataClass
from line 16
(F811)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: test
🔇 Additional comments (6)
edenai_apis/apis/mistral/info.json (1)
47-61
: Looks good!
The new "ocr" sections are consistent with the existing JSON structure. No issues found.edenai_apis/apis/mistral/outputs/ocr/ocr_output.json (1)
1-27
: No issues found
This new JSON structure for OCR output is well-structured and consistent.edenai_apis/apis/mistral/outputs/ocr/ocr_async_output.json (1)
1-118
: No issues found
This new JSON structure for asynchronous OCR output is comprehensive and aligned with the synchronous version.edenai_apis/apis/mistral/mistral_api.py (3)
37-37
: Avoid overshadowing the previously importedChatDataClass
.
AChatDataClass
was already imported at line 16. If they refer to different classes, consider aliasing one of them. Otherwise, remove the duplicate import to resolve potential naming conflicts.🧰 Tools
🪛 Ruff (0.8.2)
37-37: Redefinition of unused
ChatDataClass
from line 16(F811)
272-272
: Verify behavior for multi-page documents
Currently, the code only returnspages[0]["markdown"]
. If there are multiple pages, this may cause missing text. Confirm whether you intend to process all pages.
278-301
: LGTM for async job launch
The asynchronous file-upload logic and error handling appear consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
edenai_apis/apis/mistral/mistral_api.py (1)
302-338
:⚠️ Potential issueAvoid duplicated API call in get_job_result method
The current implementation has the same issue that was fixed in a previous commit (based on past review comments). You're doing a
requests.get()
call to get the file URL, then not reusing theresponse_data
for extracting the URL.response = requests.get(url=url, headers=self.headers) try: response_data = response.json() + file_url = response_data["url"] except json.JSONDecodeError as exc: raise ProviderException( message=response.text, code=response.status_code ) from exc if response.status_code != 200: raise ProviderException( message=response_data.get("message", response.text), code=response.status_code, ) -file_url = response_data["url"]
🧹 Nitpick comments (2)
edenai_apis/apis/mistral/mistral_api.py (2)
1-2
: Ensure imported modules are all being usedAll the new imports for OCR functionality look good. However, I notice you're importing
Line
andBoundingBox
(lines 13-14) but they're only used in commented-out code (lines 347-349). Consider either removing these unused imports or uncommenting the code that uses them.Also applies to: 8-15, 29-34
246-276
: Synchronous OCR implementation looks good with proper error handlingThe implementation follows good practices with proper error handling for both JSON decoding errors and HTTP responses. The code correctly supports both file paths and URLs as input sources.
One minor enhancement would be to make the image MIME type detection dynamic rather than hardcoding "image/jpeg".
- image_data = f"data:image/jpeg;base64,{base64_image}" + # Determine MIME type dynamically or make it configurable + image_data = f"data:image/jpeg;base64,{base64_image}"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Lite
📒 Files selected for processing (1)
edenai_apis/apis/mistral/mistral_api.py
(3 hunks)
🔇 Additional comments (4)
edenai_apis/apis/mistral/mistral_api.py (4)
40-40
: Class signature updated correctly to support OCR functionalityThe class now correctly inherits from
OcrInterface
to implement OCR capabilities.
278-300
: Asynchronous OCR job launch implementation looks goodThe implementation correctly handles file uploads and error cases. Good job returning the provider job ID for tracking.
339-350
: Decision needed on commented-out codeThere's a significant block of commented-out code for detailed page and line processing. This should either be uncommented if it's intended functionality or removed if it's not needed. Currently,
pages
is initialized but remains empty while the method still returns it.If the block is intended to be commented out, please confirm that the empty
pages
list is the correct implementation. If not, consider the following options:number_of_pages = response_data["usage_info"]["pages_processed"] raw_text = "" -pages = [] +# If the commented code is meant to be removed, then either: +# Option 1: Create empty Page objects +pages = [Page(lines=[]) for _ in range(number_of_pages)] +# Option 2: Set pages to None if your data model allows it +# pages = None
351-357
: Return statement leverages correct typesThe use of
AsyncResponseType
with the correct fields and types is good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (2)
edenai_apis/apis/mistral/mistral_api.py (2)
300-348
: Commented code should be removed or implementedThere's commented out code for handling line-by-line data. Either complete the implementation or remove the commented code to maintain cleanliness.
- # markdown_lines = page["markdown"].split("\n") - # lines = [] - # for line_text in markdown_lines: - # line = Line(text=line_text, bounding_box=BoundingBox()) - # lines.append(line) - # pages.append(Page(lines=lines))
304-316
: Convert error handling to a helper functionThe error handling pattern is repeated in multiple places. Consider extracting it to a helper method to reduce code duplication.
Example implementation:
def _handle_response(self, response): """Handle API response, extract JSON and check for errors.""" try: response_data = response.json() except json.JSONDecodeError as exc: raise ProviderException( message=response.text, code=response.status_code ) from exc if response.status_code != 200: raise ProviderException( message=response_data.get("message", response.text), code=response.status_code, ) return response_dataThen use it in your methods:
response = requests.get(url=url, headers=self.headers) -try: - response_data = response.json() -except json.JSONDecodeError as exc: - raise ProviderException( - message=response.text, code=response.status_code - ) from exc -if response.status_code != 200: - raise ProviderException( - message=response_data.get("message", response.text), - code=response.status_code, - ) +response_data = self._handle_response(response)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Lite
📒 Files selected for processing (1)
edenai_apis/apis/mistral/mistral_api.py
(3 hunks)
🔇 Additional comments (5)
edenai_apis/apis/mistral/mistral_api.py (5)
1-2
: Import sections look goodThe new imports for OCR functionality are correctly organized, including the base modules needed for the implementation.
Also applies to: 8-15, 29-34
40-40
: Class interface implementation looks correct
OcrInterface
is properly added to the class inheritance list, aligning with the implementation of OCR functionality.
244-274
: Synchronous OCR implementation looks solidThe method correctly handles both file input and URL-based processing with appropriate error handling. Good job encoding the image to base64 format when a file is provided.
276-298
: Launch job implementation is well-structuredThe async job launch functionality correctly uploads the file with proper error handling. The method returns just the job ID as required by the interface.
317-326
: Reuse the response_data variableYou get the file URL and then immediately make another request. Following the previous feedback (mentioned in past_review_comments), avoid making a redundant request by reusing the variable.
file_url = response_data["url"] payload = { "model": "mistral-ocr-latest", "document": { "type": "document_url", "document_url": file_url, }, } url = "https://api.mistral.ai/v1/ocr" -response = requests.post(url=url, headers=self.headers, json=payload) +ocr_response = requests.post(url=url, headers=self.headers, json=payload) try: - response_data = response.json() + response_data = ocr_response.json() except json.JSONDecodeError as exc: raise ProviderException( - message=response.text, code=response.status_code + message=ocr_response.text, code=ocr_response.status_code ) from exc -if response.status_code != 200: +if ocr_response.status_code != 200: raise ProviderException( - message=response_data.get("message", response.text), - code=response.status_code, + message=response_data.get("message", ocr_response.text), + code=ocr_response.status_code, )
return AsyncResponseType( | ||
original_response=response_data, | ||
standardized_response=OcrAsyncDataClass( | ||
raw_text=raw_text, pages=pages, number_of_pages=number_of_pages | ||
), | ||
provider_job_id=provider_job_id, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Empty pages list will cause inconsistency
The pages
list is declared but never populated (the code that would populate it is commented out). This will result in an inconsistent state where number_of_pages
indicates multiple pages but the pages
list is empty.
Fix this by either:
- Implementing the commented-out code to properly populate the pages list, or
- Using a placeholder implementation that creates page objects with the raw text:
- raw_text = ""
- pages = []
- for page in response_data["pages"]:
- raw_text += page["markdown"]
- # markdown_lines = page["markdown"].split("\n")
- # lines = []
- # for line_text in markdown_lines:
- # line = Line(text=line_text, bounding_box=BoundingBox())
- # lines.append(line)
- # pages.append(Page(lines=lines))
+ raw_text = ""
+ pages = []
+ for page in response_data["pages"]:
+ page_text = page["markdown"]
+ raw_text += page_text
+ # Create a simple page with one line containing all text
+ line = Line(text=page_text, bounding_box=BoundingBox())
+ pages.append(Page(lines=[line]))
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
return AsyncResponseType( | |
original_response=response_data, | |
standardized_response=OcrAsyncDataClass( | |
raw_text=raw_text, pages=pages, number_of_pages=number_of_pages | |
), | |
provider_job_id=provider_job_id, | |
) | |
raw_text = "" | |
pages = [] | |
for page in response_data["pages"]: | |
page_text = page["markdown"] | |
raw_text += page_text | |
# Create a simple page with one line containing all text | |
line = Line(text=page_text, bounding_box=BoundingBox()) | |
pages.append(Page(lines=[line])) | |
return AsyncResponseType( | |
original_response=response_data, | |
standardized_response=OcrAsyncDataClass( | |
raw_text=raw_text, pages=pages, number_of_pages=number_of_pages | |
), | |
provider_job_id=provider_job_id, | |
) |
Summary by CodeRabbit