Commit 517dc37

feat: add bedrock translation example + conditional strategy in reducer
1 parent 6973355 commit 517dc37

93 files changed, +2467 -632 lines changed

.github/workflows/run-build-test.yml

+1 -1
@@ -12,7 +12,7 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        version: [20, 21]
+        version: [20, 22]
     steps:
       - name: Checkout repository
         uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11

docs/package-lock.json

+275 -315
Some generated files are not rendered by default.

docs/package.json

+2 -2
@@ -23,8 +23,8 @@
   },
   "license": "Apache-2.0",
   "dependencies": {
-    "@astrojs/starlight": "^0.25.0",
-    "astro": "^4.11.5",
+    "@astrojs/starlight": "^0.25.2",
+    "astro": "^4.12.2",
     "sharp": "^0.33.4"
   },
   "devDependencies": {

docs/src/content/docs/flow-control/reducer.md

+89 -11
@@ -24,7 +24,7 @@ The `Reducer` middleware is an essential flow control mechanism of Project Lakec

 At its core, this middleware allows pipeline builders to group relevant documents into a single semantical envelope, and perform combined operations on them. In combination with other middlewares this unlocks many use-cases, for example, aggregating multiple audio files together to concatenate them, zip a collection of documents on-the-fly, or insert a set of subtitles into a video.

-The `Reducer` middleware can aggregate multiple documents based on specific strategies which we will document below.
+> The `Reducer` middleware can aggregate multiple documents based on specific strategies which we document below.

 <br />

@@ -50,7 +50,7 @@ We say that a new pipeline execution is triggered when a new video is uploaded t

 #### Time Windows

-The time window strategy makes it possible to reduce events belonging to the same `chainId` within a specific time window. It defines a static time window, comprised between 1 second and 48 hours, in which all events belonging to the same `chainId` are aggregated together. When the time window reaches its end, the aggregated events are reduced into a single composite event, and forwarded to the next middlewares in the pipeline.
+The time window strategy reduces events belonging to the same `chainId` within a user-defined time window. The time window can range from 1 second to 48 hours. When the time window reaches its end, the aggregated events are reduced into a single composite event, and forwarded to the next middlewares in the pipeline.

 This strategy is a good fit for scenarios where you don't necessarily know how many documents will be produced by previous middlewares preceding the `Reducer` step.

@@ -60,7 +60,7 @@ It starts aggregating documents belonging to the same `chainId` when the first d

 ##### Jitter

-The `TimeWindowStrategy` also allows you to optionally specify a specific jitter which consists of a random number between zero and the specified jitter value. Using a jitter can be very useful to smoothen the aggregation process across multiple `chainId`.
+The `TimeWindowStrategy` allows you to optionally specify a jitter which consists of a random number between zero and an arbitrary value. Using a jitter can be useful to smooth out the aggregation process across multiple `chainId` values.

 For example, if your time window is 10 minutes, and you add a jitter of 30 seconds, each reduce operation will occur after 10 minutes + a random value comprised between zero and 30 seconds.

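The jitter arithmetic described in this hunk can be sketched as follows. This is only an illustration of the documented behavior, not the middleware's implementation: `reduceDelaySeconds` and its parameters are hypothetical names, and a uniform random distribution is assumed.

```typescript
/**
 * Hypothetical helper (not part of the Lakechain API) computing when a
 * reduce operation would occur: the time window plus a random jitter
 * between zero and the configured jitter value.
 */
const reduceDelaySeconds = (
  windowSeconds: number,
  jitterSeconds: number = 0,
  random: () => number = Math.random // Assumed uniform in [0, 1).
): number => {
  // The documentation states a time window between 1 second and 48 hours.
  if (windowSeconds < 1 || windowSeconds > 48 * 3600) {
    throw new RangeError('The time window must be between 1 second and 48 hours.');
  }
  if (jitterSeconds < 0) {
    throw new RangeError('The jitter must not be negative.');
  }
  return windowSeconds + random() * jitterSeconds;
};

// A 10 minute window with a 30 second jitter yields a delay
// between 600 seconds (inclusive) and 630 seconds (exclusive).
const delay = reduceDelaySeconds(10 * 60, 30);
console.log(`Reduce operation scheduled in ${delay.toFixed(1)} seconds`);
```

Injecting the random source as a parameter keeps the sketch deterministic under test; the real strategy may draw its jitter differently.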
@@ -101,15 +101,17 @@ class Stack extends cdk.Stack {

 #### Static Counter

-The static counter strategy allows you to reduce all events belonging to the same `chainId`, based on a static counter. It allows you to specify the number of documents to aggregate together before reducing them into a single event.
+The static counter strategy reduces all events belonging to the same `chainId`, based on a static counter. It allows you to specify the number of documents to aggregate together before reducing them into a single event.

 This strategy is a good fit when you know the exact number of documents that you expect to be reduced.

-For example, let's say that you want to translate a document in french, english, and spanish using the [Translate Text Processor](/project-lakechain/text-processing/translate-text-processor), and reduce the translated documents back together to zip them. In this case, you know that you will be expecting 3 documents associated with the 3 translated languages.
+For example, let's say that you want to translate a document into French, English, and Spanish using the [Translate Text Processor](/project-lakechain/text-processing/translate-text-processor), and reduce the translated documents back together to zip them. In this case, you know that you will be expecting exactly 3 documents associated with the translated languages.

 ##### Unmatched Events

-As the reducer awaits for the static count condition to be met, it will aggregate documents for a period of 48 hours. If the condition is unmet after this period, the aggregated documents will be dismissed, and no event will be created.
+While the reducer waits for the static count condition to be met, it will aggregate documents for a period of up to 48 hours. If the counter is not reached after this period, the aggregated documents will be dismissed.
+
+Similarly, if a reduce operation already occurred for a given chain identifier, any subsequent document that may arrive after the count condition has been met will be dismissed.

 ##### Usage

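The static counter semantics described in this hunk can be modeled in memory as below. This is a sketch under stated assumptions, not how the middleware is built: `StaticCounterSketch` and `CounterEvent` are hypothetical names, and the 48 hour expiry of unmatched events is left out for brevity.

```typescript
// Simplified stand-in for an event; the real middleware handles CloudEvent instances.
type CounterEvent = { chainId: string; document: string };

/**
 * Hypothetical in-memory model of the static counter behavior:
 * events are aggregated per `chainId` until the expected count is
 * reached, and later arrivals for a reduced chain are dismissed.
 */
class StaticCounterSketch {
  private readonly expected: number;
  private readonly pending = new Map<string, CounterEvent[]>();
  private readonly reduced = new Set<string>();

  constructor(expected: number) {
    this.expected = expected;
  }

  /**
   * @returns the aggregated events when the counter is reached, `null` otherwise.
   */
  push(event: CounterEvent): CounterEvent[] | null {
    // A reduce operation already occurred for this chain: dismiss the event.
    if (this.reduced.has(event.chainId)) {
      return null;
    }
    const events = [...(this.pending.get(event.chainId) ?? []), event];
    if (events.length < this.expected) {
      this.pending.set(event.chainId, events);
      return null;
    }
    this.pending.delete(event.chainId);
    this.reduced.add(event.chainId);
    return events;
  }
}

// Three translations are expected; the fourth document arrives too late.
const counter = new StaticCounterSketch(3);
const results = ['fr.txt', 'en.txt', 'es.txt', 'late.txt'].map((document) =>
  counter.push({ chainId: 'chain-1', document })
);
```

In this model only the third call returns the aggregated events; the first two return `null` because the counter is not reached yet, and the fourth returns `null` because the chain was already reduced.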
@@ -137,7 +139,76 @@ class Stack extends cdk.Stack {
 }
 ```

-<br>
+<br />
+
+---
+
+#### Conditional Strategy
+
+The conditional strategy reduces events based on a custom user-provided condition. It allows you to define a [funclet](/project-lakechain/guides/funclets) or a lambda function that gets called back when a new document belonging to a given `chainId` is being aggregated. This conditional expression defines when the aggregated events should be reduced.
+
+This strategy is a good fit when you want to control the reduce process based on a specific condition. For example, if you want to reduce a collection of events based on the metadata of the documents, or even based on a third-party API, you can use the conditional strategy to do that.
+
+##### Unmatched Events
+
+This strategy allows you to evaluate each aggregated document for a duration of up to 48 hours. If the condition is unmet after this period, the aggregated documents will be dismissed.
+
+If a reduce operation already occurred for a given chain identifier, any subsequent document that may arrive after the condition has been met, and having the same chain identifier, will be dismissed.
+
+##### Usage
+
+To reduce events using the `ConditionalStrategy`, you must import and instantiate the `Reducer` middleware as part of your pipeline.
+
+> 💁 Below is an example showcasing how to instantiate the reducer using the `ConditionalStrategy` with a custom condition.
+
+```typescript
+import * as cdk from 'aws-cdk-lib';
+import { CloudEvent, TextMetadata } from '@project-lakechain/sdk';
+import { Reducer, ConditionalStrategy } from '@project-lakechain/reducer';
+
+/**
+ * This conditional expression is called by the reducer middleware
+ * for every new received event. In this example, we want to reduce
+ * the events based on the total number of chunks produced by the
+ * previous middlewares.
+ * @param event the event to process.
+ * @param storedEvents the list of events stored in the table.
+ * @returns a promise resolving to a boolean value.
+ */
+export const conditional = async (event: CloudEvent, storedEvents: CloudEvent[]) => {
+  const metadata = event.data().metadata().properties?.attrs as TextMetadata;
+
+  // Return a boolean value.
+  return (storedEvents.length === metadata.chunk?.total);
+};
+
+class Stack extends cdk.Stack {
+  constructor(scope: cdk.Construct, id: string) {
+    const reducer = new Reducer.Builder()
+      .withScope(this)
+      .withIdentifier('Reducer')
+      .withCacheStorage(cache)
+      .withSources([M1, M2, M3]) // 👈 Specifies the sources.
+      .withReducerStrategy(new ConditionalStrategy.Builder()
+        .withConditional(conditional)
+        .build()
+      )
+      .build();
+  }
+}
+```
+
+##### Funclet Signature
+
+Funclet expressions use the power of a full programming language to express complex reduce conditional expressions. They are asynchronous and can be defined as TypeScript named functions, anonymous functions, or arrow functions.
+
+A reduce conditional funclet takes two arguments: a `CloudEvent` describing the document being handled by the reducer, and the collection of events stored so far, excluding the received event. It must return a promise to a boolean value representing the result of the evaluation: `true` if the reduce operation should occur, `false` otherwise.
+
+```typescript
+type ConditionalExpression = (event: CloudEvent, storedEvents: CloudEvent[]) => Promise<boolean>;
+```
+
+<br />

 ---

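To illustrate the funclet signature added in this hunk without depending on the SDK, the sketch below uses a simplified stand-in event type. `TranslatedEvent`, `SketchConditional`, `expectedLanguages`, and `allLanguagesReceived` are hypothetical names introduced for illustration; a real funclet would receive `CloudEvent` instances instead.

```typescript
// Simplified stand-in for the SDK's CloudEvent, carrying only what the sketch needs.
type TranslatedEvent = { chainId: string; language: string };

// Same shape as the documented signature, applied to the stand-in type.
type SketchConditional = (
  event: TranslatedEvent,
  storedEvents: TranslatedEvent[]
) => Promise<boolean>;

// Assumed set of languages the pipeline is expected to produce.
const expectedLanguages = ['fr', 'en', 'es'];

/**
 * Resolves to `true` once every expected language was seen.
 * The stored events exclude the received event, as documented,
 * so the received event is added before evaluating the condition.
 */
const allLanguagesReceived: SketchConditional = async (event, storedEvents) => {
  const seen = new Set([...storedEvents, event].map((e) => e.language));
  return expectedLanguages.every((language) => seen.has(language));
};

// Two translations are stored and the third one is being received.
const stored: TranslatedEvent[] = [
  { chainId: 'chain-1', language: 'fr' },
  { chainId: 'chain-1', language: 'en' }
];
allLanguagesReceived({ chainId: 'chain-1', language: 'es' }, stored)
  .then((shouldReduce) => console.log(`Reduce now: ${shouldReduce}`));
```

Deciding on the set of languages seen rather than on a plain count makes this condition tolerant to duplicate deliveries of the same translation, which is one reason to prefer a conditional expression over a static counter.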
@@ -147,7 +218,7 @@ The architecture implemented by this middleware depends on the selected strategy

 #### `TimeWindowStrategy`

-This strategy implements a serverless aggregation architecture based on DynamoDB for document event aggregation, and the [EventBridge Scheduler](https://docs.aws.amazon.com/scheduler/latest/UserGuide/what-is-scheduler.html) service for scheduling the execution of the reducer for each `chainId` group of events.
+This strategy implements a serverless aggregation architecture based on DynamoDB for document event aggregation, and the [EventBridge Scheduler](https://docs.aws.amazon.com/scheduler/latest/UserGuide/what-is-scheduler.html) service for scheduling the execution of the reducer for each `chainId` group of events when the time window is reached.

 ![Time Window Architecture](../../../assets/reduce-time-window-architecture.png)

@@ -157,13 +228,19 @@ This strategy also implements a serverless aggregation architecture based on Dyn

 ![Static Counter Architecture](../../../assets/reduce-static-counter-architecture.png)

-<br>
+#### `ConditionalStrategy`
+
+The conditional strategy implements a serverless aggregation architecture based on DynamoDB as the document aggregator, and leverages an event-driven approach to evaluate a conditional expression for each received document belonging to the same `chainId`.
+
+![Conditional Strategy Architecture](../../../assets/reduce-conditional-strategy-architecture.png)
+
+<br />

 ---

 ### 🏷️ Properties

-<br>
+<br />

 ##### Supported Inputs

@@ -183,11 +260,12 @@ This strategy also implements a serverless aggregation architecture based on Dyn
 | ----- | ----------- |
 | `CPU` | This middleware only supports CPU compute. |

-<br>
+<br />

 ---

 ### 📖 Examples

 - [Building a Generative Podcast](https://github.com/awslabs/project-lakechain/tree/main/examples/end-to-end-use-cases/building-a-podcast-generator) - Builds a pipeline for creating a generative weekly AWS news podcast.
 - [Building a Video Chaptering Service](https://github.com/awslabs/project-lakechain/tree/main/examples/end-to-end-use-cases/building-a-video-chaptering-service) - Builds a pipeline for automatic video chaptering generation.
+- [Bedrock Translation Pipeline](https://github.com/awslabs/project-lakechain/tree/main/examples/simple-pipelines/text-translation-pipelines/bedrock-translation-pipeline) - Translates documents using a large-language model hosted on Amazon Bedrock.

docs/src/content/docs/generative-ai/llama-text-processor.mdx

+3
@@ -106,6 +106,9 @@ LLAMA2_13B_CHAT_V1 | `meta.llama2-13b-chat-v1`
 LLAMA2_70B_CHAT_V1 | `meta.llama2-70b-chat-v1`
 LLAMA3_8B_INSTRUCT_V1 | `meta.llama3-8b-instruct-v1:0`
 LLAMA3_70B_INSTRUCT_V1 | `meta.llama3-70b-instruct-v1:0`
+LLAMA3_1_8B_INSTRUCT_V1 | `meta.llama3-1-8b-instruct-v1:0`
+LLAMA3_1_70B_INSTRUCT_V1 | `meta.llama3-1-70b-instruct-v1:0`
+LLAMA3_1_405B_INSTRUCT_V1 | `meta.llama3-1-405b-instruct-v1:0`

 <br />

docs/src/content/docs/text-processing/translate-text-processor.md

+1 -1
@@ -171,4 +171,4 @@ When using asynchronous translations, This middleware uses an event-driven archi

 ### 📖 Examples

-- [Text Translation Pipeline](https://github.com/awslabs/project-lakechain/tree/main/examples/simple-pipelines/text-translation-pipeline/) - An example showcasing how to translate documents using Amazon Translate.
+- [Text Translation Pipeline](https://github.com/awslabs/project-lakechain/tree/main/examples/simple-pipelines/text-translation-pipelines/translate-pipeline/) - An example showcasing how to translate documents using Amazon Translate.

examples/end-to-end-use-cases/building-a-document-index/README.md

+1 -1
@@ -76,7 +76,7 @@ The following requirements are needed to deploy the infrastructure required to r
 - You need access to a development AWS account.
 - [AWS CDK](https://docs.aws.amazon.com/cdk/latest/guide/getting_started.html#getting_started_install) is required to deploy the infrastructure.
 - [Docker](https://docs.docker.com/get-docker/) is required to be running to build middlewares.
-- [Node.js](https://nodejs.org/en/download/) v18+ and NPM.
+- [Node.js](https://nodejs.org/en/download/) v20+ and NPM.
 - [Python](https://www.python.org/downloads/) v3.8+ and [Pip](https://pip.pypa.io/en/stable/installation/).

 ## 🚀 Deploy

examples/end-to-end-use-cases/building-a-podcast-generator/README.md

+1 -1
@@ -60,7 +60,7 @@ The following requirements are needed to deploy the infrastructure required to r
 - You need access to a development AWS account.
 - [AWS CDK](https://docs.aws.amazon.com/cdk/latest/guide/getting_started.html#getting_started_install) is required to deploy the infrastructure.
 - [Docker](https://docs.docker.com/get-docker/) is required to be running to build middlewares.
-- [Node.js](https://nodejs.org/en/download/) v18+ and NPM.
+- [Node.js](https://nodejs.org/en/download/) v20+ and NPM.
 - [Python](https://www.python.org/downloads/) v3.8+ and [Pip](https://pip.pypa.io/en/stable/installation/).

 ## 🚀 Deploy

examples/end-to-end-use-cases/building-a-rag-pipeline/README.md

+1 -1
@@ -61,7 +61,7 @@ The following requirements are needed to deploy the infrastructure required to r
 - You need access to a development AWS account.
 - [AWS CDK](https://docs.aws.amazon.com/cdk/latest/guide/getting_started.html#getting_started_install) is required to deploy the infrastructure.
 - [Docker](https://docs.docker.com/get-docker/) is required to be running to build middlewares.
-- [Node.js](https://nodejs.org/en/download/) v18+ and NPM.
+- [Node.js](https://nodejs.org/en/download/) v20+ and NPM.
 - [Python](https://www.python.org/downloads/) v3.8+ and [Pip](https://pip.pypa.io/en/stable/installation/).

 <br />

examples/end-to-end-use-cases/building-a-search-engine/README.md

+1 -1
@@ -50,7 +50,7 @@ The following requirements are needed to deploy the infrastructure required to r
 - You need access to a development AWS account.
 - [AWS CDK](https://docs.aws.amazon.com/cdk/latest/guide/getting_started.html#getting_started_install) is required to deploy the infrastructure.
 - [Docker](https://docs.docker.com/get-docker/) is required to be running to build middlewares.
-- [Node.js](https://nodejs.org/en/download/) v18+ and NPM.
+- [Node.js](https://nodejs.org/en/download/) v20+ and NPM.
 - [Python](https://www.python.org/downloads/) v3.8+ and [Pip](https://pip.pypa.io/en/stable/installation/).

 <br />

examples/end-to-end-use-cases/building-a-video-chaptering-service/README.md

+1 -1
@@ -55,7 +55,7 @@ The following requirements are needed to deploy the infrastructure required to r
 - You need access to a development AWS account.
 - [AWS CDK](https://docs.aws.amazon.com/cdk/latest/guide/getting_started.html#getting_started_install) is required to deploy the infrastructure.
 - [Docker](https://docs.docker.com/get-docker/) is required to be running to build middlewares.
-- [Node.js](https://nodejs.org/en/download/) v18+ and NPM.
+- [Node.js](https://nodejs.org/en/download/) v20+ and NPM.
 - [Python](https://www.python.org/downloads/) v3.8+ and [Pip](https://pip.pypa.io/en/stable/installation/).

 ## 🚀 Deploy

examples/end-to-end-use-cases/building-a-video-subtitle-service/README.md

+1 -1
@@ -67,7 +67,7 @@ The following requirements are needed to deploy the infrastructure required to r
 - You need access to a development AWS account.
 - [AWS CDK](https://docs.aws.amazon.com/cdk/latest/guide/getting_started.html#getting_started_install) is required to deploy the infrastructure.
 - [Docker](https://docs.docker.com/get-docker/) is required to be running to build middlewares.
-- [Node.js](https://nodejs.org/en/download/) v18+ and NPM.
+- [Node.js](https://nodejs.org/en/download/) v20+ and NPM.
 - [Python](https://www.python.org/downloads/) v3.8+ and [Pip](https://pip.pypa.io/en/stable/installation/).

 ## 🚀 Deploy

examples/simple-pipelines/archive-processing-pipelines/deflate-pipeline/README.md

+2 -2
@@ -31,12 +31,12 @@ The following requirements are needed to deploy the infrastructure associated wi
 - You need access to a development AWS account.
 - [AWS CDK](https://docs.aws.amazon.com/cdk/latest/guide/getting_started.html#getting_started_install) is required to deploy the infrastructure.
 - [Docker](https://docs.docker.com/get-docker/) is required to be running to build middlewares.
-- [Node.js](https://nodejs.org/en/download/) v18+ and NPM.
+- [Node.js](https://nodejs.org/en/download/) v20+ and NPM.
 - [Python](https://www.python.org/downloads/) v3.8+ and [Pip](https://pip.pypa.io/en/stable/installation/).

 ## 🚀 Deploy

-Head to the directory [`examples/simple-pipelines/archive-processing/deflate-pipeline`](/examples/simple-pipelines/archive-processing/deflate-pipeline) in the repository and run the following commands to build the example:
+Head to the directory [`examples/simple-pipelines/archive-processing-pipelines/deflate-pipeline`](/examples/simple-pipelines/archive-processing-pipelines/deflate-pipeline) in the repository and run the following commands to build the example:

 ```bash
 npm install

examples/simple-pipelines/archive-processing-pipelines/inflate-pipeline/README.md

+2 -2
@@ -23,12 +23,12 @@ The following requirements are needed to deploy the infrastructure associated wi
 - You need access to a development AWS account.
 - [AWS CDK](https://docs.aws.amazon.com/cdk/latest/guide/getting_started.html#getting_started_install) is required to deploy the infrastructure.
 - [Docker](https://docs.docker.com/get-docker/) is required to be running to build middlewares.
-- [Node.js](https://nodejs.org/en/download/) v18+ and NPM.
+- [Node.js](https://nodejs.org/en/download/) v20+ and NPM.
 - [Python](https://www.python.org/downloads/) v3.8+ and [Pip](https://pip.pypa.io/en/stable/installation/).

 ## 🚀 Deploy

-Head to the directory [`examples/simple-pipelines/inflate-pipeline`](/examples/simple-pipelines/inflate-pipeline) in the repository and run the following commands to build the example:
+Head to the directory [`examples/simple-pipelines/archive-processing-pipelines/inflate-pipeline`](/examples/simple-pipelines/archive-processing-pipelines/inflate-pipeline) in the repository and run the following commands to build the example:

 ```bash
 npm install

examples/simple-pipelines/data-extraction-pipelines/metadata-extraction-pipeline/README.md

+2 -2
@@ -23,12 +23,12 @@ The following requirements are needed to deploy the infrastructure associated wi
 - You need access to a development AWS account.
 - [AWS CDK](https://docs.aws.amazon.com/cdk/latest/guide/getting_started.html#getting_started_install) is required to deploy the infrastructure.
 - [Docker](https://docs.docker.com/get-docker/) is required to be running to build middlewares.
-- [Node.js](https://nodejs.org/en/download/) v18+ and NPM.
+- [Node.js](https://nodejs.org/en/download/) v20+ and NPM.
 - [Python](https://www.python.org/downloads/) v3.8+ and [Pip](https://pip.pypa.io/en/stable/installation/).

 ## 🚀 Deploy

-Head to the directory [`examples/simple-pipelines/metadata-extraction-pipeline`](/examples/simple-pipelines/metadata-extraction-pipeline) in the repository and run the following commands to build the example:
+Head to the directory [`examples/simple-pipelines/data-extraction-pipelines/metadata-extraction-pipeline`](/examples/simple-pipelines/data-extraction-pipelines/metadata-extraction-pipeline) in the repository and run the following commands to build the example:

 ```bash
 npm install

examples/simple-pipelines/email-nlp-pipeline/README.md

+1 -1
@@ -35,7 +35,7 @@ The following requirements are needed to deploy the infrastructure associated wi
 - You need access to a development AWS account.
 - [AWS CDK](https://docs.aws.amazon.com/cdk/latest/guide/getting_started.html#getting_started_install) is required to deploy the infrastructure.
 - [Docker](https://docs.docker.com/get-docker/) is required to be running to build middlewares.
-- [Node.js](https://nodejs.org/en/download/) v18+ and NPM.
+- [Node.js](https://nodejs.org/en/download/) v20+ and NPM.
 - [Python](https://www.python.org/downloads/) v3.8+ and [Pip](https://pip.pypa.io/en/stable/installation/).

 ## 🚀 Deploy
