[BUG] Hybrid Experiment Creation fails on large enough Query Sets #158

Open
@alexeyrodriguez

Description

What is the bug?

A Hybrid Experiment fails when it is run against a sufficiently large Query Set and set of Judgments. The error message stored with the experiment is not informative.

The error seems to be triggered by the parallel processing done at the ExperimentVariant level: searches are performed in parallel, and the population of results is also performed in parallel. If the cluster cannot keep up with the requests, new requests start being rejected.
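One possible mitigation is to cap the number of searches in flight at any time. The following is a minimal, self-contained sketch of that idea using only the JDK; it is not the plugin's code. The class name `BoundedSearchDemo`, the method `runAll`, and the `Thread.sleep` standing in for a search call are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration: limit concurrent "searches" with a semaphore.
public class BoundedSearchDemo {

    /**
     * Runs taskCount simulated searches with at most maxInFlight running
     * at once, and returns the peak concurrency that was observed.
     */
    public static int runAll(int taskCount, int maxInFlight) throws Exception {
        Semaphore permits = new Semaphore(maxInFlight);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService pool = Executors.newCachedThreadPool();
        List<CompletableFuture<Void>> futures = new ArrayList<>();
        try {
            for (int i = 0; i < taskCount; i++) {
                // Block the submitter until one of the in-flight slots frees up.
                permits.acquire();
                futures.add(CompletableFuture.runAsync(() -> {
                    try {
                        int now = inFlight.incrementAndGet();
                        peak.accumulateAndGet(now, Math::max);
                        Thread.sleep(2); // stand-in for a search request (assumption)
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        inFlight.decrementAndGet();
                        permits.release();
                    }
                }, pool));
            }
            // Wait for every submitted task to finish.
            CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("peak in-flight: " + runAll(150, 8));
    }
}
```

The submitting thread blocks in `acquire()` here, which keeps the sketch short. Inside an asynchronous transport action a non-blocking variant (for example, processing variants in fixed-size batches and chaining the next batch from the completion callback) would likely be a better fit.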

How can one reproduce the bug?

On a local cluster, first run the script demo_hybrid_optimizer.sh. It creates the query sets and judgments needed to trigger this bug. Then, using the UI, create a Hybrid Experiment that uses the ESCI Query Set (150 queries) and the ESCI judgments.

When completed, the experiment will display status ERROR. The error message in the experiment document is empty.

In the backend logs one can find:

opensearch_search_relevance  | [2025-06-30T09:29:15,686][ERROR][o.o.s.t.e.PutExperimentTransportAction] [opensearch] Failed to process metrics for experiment: 5921f9bc-6ad7-4af6-a0a4-d50e5c46e297
opensearch_search_relevance  | org.opensearch.action.search.SearchPhaseExecutionException: all shards failed
opensearch_search_relevance  |  at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:775) ~[opensearch-3.1.0.jar:3.1.0]
opensearch_search_relevance  |  at org.opensearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:395) ~[opensearch-3.1.0.jar:3.1.0]
opensearch_search_relevance  |  at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:815) ~[opensearch-3.1.0.jar:3.1.0]

The reason for this failure seems to be:

opensearch_search_relevance  | Caused by: org.opensearch.OpenSearchException$3: rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@cd7dc6d on QueueResizableOpenSearchThreadPoolExecutor[name = opensearch/search, queue capacity = 1000, org.opensearch.common.util.concurrent.QueueResizableOpenSearchThreadPoolExecutor@14b4f462[Running, pool size = 13, active threads = 13, queued tasks = 1000, completed tasks = 296]]
opensearch_search_relevance  |  at org.opensearch.OpenSearchException.guessRootCauses(OpenSearchException.java:716) ~[opensearch-core-3.1.0.jar:3.1.0]
opensearch_search_relevance  |  at org.opensearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:393) ~[opensearch-3.1.0.jar:3.1.0]
opensearch_search_relevance  |  ... 80 more
opensearch_search_relevance  | Caused by: org.opensearch.core.concurrency.OpenSearchRejectedExecutionException: rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@cd7dc6d on QueueResizableOpenSearchThreadPoolExecutor[name = opensearch/search, queue capacity = 1000, org.opensearch.common.util.concurrent.QueueResizableOpenSearchThreadPoolExecutor@14b4f462[Running, pool size = 13, active threads = 13, queued tasks = 1000, completed tasks = 296]]
opensearch_search_relevance  |  at org.opensearch.common.util.concurrent.OpenSearchAbortPolicy.rejectedExecution(OpenSearchAbortPolicy.java:67) ~[opensearch-3.1.0.jar:3.1.0]
opensearch_search_relevance  |  at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:841) ~[?:?]
opensearch_search_relevance  |  at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1376) ~[?:?]
opensearch_search_relevance  |  at org.opensearch.common.util.concurrent.OpenSearchThreadPoolExecutor.execute(OpenSearchThreadPoolExecutor.java:131) ~[opensearch-3.1.0.jar:3.1.0]
opensearch_search_relevance  |  ... 61 more
opensearch_search_relevance  | [2025-06-30T09:29:15,782][INFO ][o.o.s.t.e.PutExperimentTransportAction] [opensearch] Updated experiment 5921f9bc-6ad7-4af6-a0a4-d50e5c46e297 status to ERROR
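The nested "Caused by" entries above suggest the root cause is available on the exception chain when the experiment is marked as ERROR, so it could be stored with the experiment instead of an empty message. Below is a minimal sketch of how such a message might be built with plain JDK calls. The class `RootCauseMessage` and the method `describe` are hypothetical names, and the exceptions in `main` are stand-ins for the OpenSearch ones in the log.

```java
// Hypothetical helper: summarize an exception together with its root cause.
public class RootCauseMessage {

    public static String describe(Throwable t) {
        // Walk the cause chain to the deepest cause, with a depth guard
        // against pathological or cyclic chains.
        Throwable root = t;
        int depth = 0;
        while (root.getCause() != null && root.getCause() != root && depth < 50) {
            root = root.getCause();
            depth++;
        }
        String top = t.getClass().getSimpleName() + ": " + t.getMessage();
        if (root == t) {
            return top;
        }
        return top + " (root cause: "
            + root.getClass().getSimpleName() + ": " + root.getMessage() + ")";
    }

    public static void main(String[] args) {
        // Stand-in exceptions that mimic the shape of the logged failure.
        Throwable failure = new RuntimeException(
            "all shards failed",
            new IllegalStateException("rejected execution"));
        System.out.println(describe(failure));
    }
}
```

With a message like this, the experiment document would at least point at the rejected execution rather than leaving the user to search the backend logs.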

What is the expected behavior?

Experiment creation should either succeed or be declined up front if there are too many queries (although 150 queries is not many).
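One way to make creation succeed under load, offered only as a sketch, is to retry a rejected search after a short, growing delay instead of failing the whole experiment. The example uses the JDK's `RejectedExecutionException` as a stand-in for the rejection seen in the logs; the class `RetryOnRejection`, the method `callWithBackoff`, and its parameters are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical illustration: retry a call with exponential backoff on rejection.
public class RetryOnRejection {

    public static <T> T callWithBackoff(Callable<T> call, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (RejectedExecutionException e) {
                // Give up once the attempt budget is spent.
                if (attempt >= maxAttempts) {
                    throw e;
                }
                Thread.sleep(delay);
                delay *= 2; // double the wait before the next attempt
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated search that is rejected twice before succeeding.
        String result = callWithBackoff(() -> {
            if (++calls[0] < 3) {
                throw new RejectedExecutionException("queue full");
            }
            return "ok";
        }, 5, 1);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Retrying only treats the symptom. It would probably work best combined with a limit on how many searches are issued concurrently, so that the search queue is not filled in the first place.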

What is your host/environment?

n/a

Do you have any screenshots?

n/a

Do you have any additional context?

n/a

Metadata

Labels

bug (Something isn't working)

Projects

Status

Hot
