Skip to content

[Doc](partial update) use a separate config to control the behavior of newly inserted rows in partial update #2472

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
71 changes: 1 addition & 70 deletions docs/data-operate/import/handling-messy-data.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,9 +36,7 @@ This makes it easier to handle data loading problems and keeps data management s

## Strict Mode

Strict mode serves two primary purposes:
1. Filtering out data rows where column type conversion fails during load
2. Restricting updates to existing columns only in partial column update scenarios
The main function of strict mode is to filter out data rows where column type conversion fails during load.

### Filtering Strategy for Column Type Conversion Failures

Expand Down Expand Up @@ -82,73 +80,6 @@ The system employs different strategies based on the strict mode setting:
3. Although `10` exceeds the range, since its type meets decimal requirements, strict mode does not affect it.
:::

### Restricting Partial Column Updates to Existing Columns Only

In strict mode, each row in a partial column update must have its Key already exist in the table. In non-strict mode, partial column updates can both update existing rows (where Key exists) and insert new rows (where Key doesn't exist).

For example, given a table structure as follows:
```sql
CREATE TABLE user_profile
(
id INT,
name VARCHAR(10),
age INT,
city VARCHAR(10),
balance DECIMAL(9, 0),
last_access_time DATETIME
) ENGINE=OLAP
UNIQUE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1
PROPERTIES (
"enable_unique_key_merge_on_write" = "true"
);
```

The table contains one record as follows:
```sql
mysql> select * from user_profile;
+------+-------+------+----------+---------+---------------------+
| id | name | age | city | balance | last_access_time |
+------+-------+------+----------+---------+---------------------+
| 1 | kevin | 18 | shenzhen | 400 | 2023-07-01 12:00:00|
+------+-------+------+----------+---------+---------------------+
```

When using Insert Into with strict mode to perform partial column updates, the insertion will fail because the second and third rows with keys `(3)` and `(18)` do not exist in the original table:
```sql
SET enable_unique_key_partial_update=true;
SET enable_insert_strict = true;
INSERT INTO user_profile (id, balance, last_access_time) VALUES
(1, 500, '2023-07-03 12:00:01'),
(3, 23, '2023-07-03 12:00:02'),
(18, 9999999, '2023-07-03 12:00:03');
ERROR 1105 (HY000): errCode = 2, detailMessage = Insert has filtered data in strict mode
```

When using Insert Into with non-strict mode to perform partial column updates:
```sql
SET enable_unique_key_partial_update=true;
SET enable_insert_strict = false;
INSERT INTO user_profile (id, balance, last_access_time) VALUES
(1, 500, '2023-07-03 12:00:01'),
(3, 23, '2023-07-03 12:00:02'),
(18, 9999999, '2023-07-03 12:00:03');
```

The existing record will be updated, and two new records will be inserted. For columns not specified in the insert statement, if a default value is defined, it will be used; if the column allows NULL values, NULL will be used; otherwise, the insertion will fail.

The query result is as follows:
```sql
mysql> select * from user_profile;
+------+-------+------+----------+---------+---------------------+
| id | name | age | city | balance | last_access_time |
+------+-------+------+----------+---------+---------------------+
| 1 | kevin | 18 | shenzhen | 500 | 2023-07-03 12:00:01 |
| 3 | NULL | NULL | NULL | 23 | 2023-07-03 12:00:02 |
| 18 | NULL | NULL | NULL | 9999999 | 2023-07-03 12:00:03 |
+------+-------+------+----------+---------+---------------------+
```

### Enable Strict Mode

Strict mode (strict_mode) defaults to False. Here's how to set it for different load methods:
Expand Down
3 changes: 2 additions & 1 deletion docs/data-operate/import/import-way/stream-load-manual.md
Original file line number Diff line number Diff line change
Expand Up @@ -352,6 +352,7 @@ Parameter Description: The default timeout for Stream Load. The load job will be
| escape | Specify the escape character. It is used to escape characters that are the same as the enclosure character within a field. For example, if the data is "a,'b,'c'", and the enclosure is "'", and you want "b,'c" to be parsed as a single field, you need to specify a single-byte escape character, such as "", and modify the data to "a,'b','c'". |
| memtable_on_sink_node | Whether to enable MemTable on DataSink node when loading data, default is false. |
|unique_key_update_mode | The update modes on Unique tables, currently are only effective for Merge-On-Write Unique tables. Supporting three types: `UPSERT`, `UPDATE_FIXED_COLUMNS`, and `UPDATE_FLEXIBLE_COLUMNS`. `UPSERT`: Indicates that data is loaded with upsert semantics; `UPDATE_FIXED_COLUMNS`: Indicates that data is loaded through partial updates; `UPDATE_FLEXIBLE_COLUMNS`: Indicates that data is loaded through flexible partial updates.|
| partial_update_new_key_behavior | When performing partial column updates or flexible column updates on Unique tables, this parameter controls how new rows are handled. There are two types: `APPEND` and `ERROR`.<br/>- `APPEND`: Allows inserting new row data<br/>- `ERROR`: Fails and reports an error when inserting new rows |

### Load return value

Expand Down Expand Up @@ -1032,7 +1033,7 @@ Doris supports a very rich set of column transformations and filtering operation

### Enable strict mode import

The strict_mode attribute is used to set whether the import task runs in strict mode. This attribute affects the results of column mapping, transformation, and filtering, and it also controls the behavior of partial column updates. For specific instructions on strict mode, please refer to the [Handling Messy Data](../../../data-operate/import/handling-messy-data) documentation.
The strict_mode attribute is used to set whether the import task runs in strict mode. This attribute affects the results of column mapping, transformation, and filtering. For specific instructions on strict mode, please refer to the [Handling Messy Data](../../../data-operate/import/handling-messy-data) documentation.

### Perform partial column updates/flexible partial update during import

Expand Down
70 changes: 70 additions & 0 deletions docs/data-operate/update/update-of-unique-model.md
Original file line number Diff line number Diff line change
Expand Up @@ -415,3 +415,73 @@ The final data in the table is as follows:
6. Flexible partial updates are not supported on tables with Variant columns.

7. Flexible partial updates are not supported on tables with synchronous materialized views.


## Handling New Rows in Partial Column Updates

The session variable or import property `partial_update_new_key_behavior` controls the behavior when inserting new rows during partial column updates.

When `partial_update_new_key_behavior=ERROR`, each inserted row must have a key that already exists in the table. When `partial_update_new_key_behavior=APPEND`, partial column updates can update existing rows with matching keys or insert new rows with keys that do not exist in the table.

For example, consider the following table structure:
```sql
CREATE TABLE user_profile
(
id INT,
name VARCHAR(10),
age INT,
city VARCHAR(10),
balance DECIMAL(9, 0),
last_access_time DATETIME
) ENGINE=OLAP
UNIQUE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1
PROPERTIES (
"enable_unique_key_merge_on_write" = "true"
);
```

Suppose the table contains the following data:
```sql
mysql> select * from user_profile;
+------+-------+------+----------+---------+---------------------+
| id | name | age | city | balance | last_access_time |
+------+-------+------+----------+---------+---------------------+
| 1 | kevin | 18 | shenzhen | 400 | 2023-07-01 12:00:00|
+------+-------+------+----------+---------+---------------------+
```

If you use `Insert Into` for partial column updates with `partial_update_new_key_behavior=ERROR`, and try to insert the following data, the operation will fail because the keys `(3)` and `(18)` do not exist in the original table:
```sql
SET enable_unique_key_partial_update=true;
SET partial_update_new_key_behavior=ERROR;
INSERT INTO user_profile (id, balance, last_access_time) VALUES
(1, 500, '2023-07-03 12:00:01'),
(3, 23, '2023-07-03 12:00:02'),
(18, 9999999, '2023-07-03 12:00:03');
(1105, "errCode = 2, detailMessage = (127.0.0.1)[INTERNAL_ERROR]tablet error: [E-7003]Can't append new rows in partial update when partial_update_new_key_behavior is ERROR. Row with key=[3] is not in table., host: 127.0.0.1")
```

If you use `partial_update_new_key_behavior=APPEND` and perform the same partial column update:
```sql
SET enable_unique_key_partial_update=true;
SET partial_update_new_key_behavior=APPEND;
INSERT INTO user_profile (id, balance, last_access_time) VALUES
(1, 500, '2023-07-03 12:00:01'),
(3, 23, '2023-07-03 12:00:02'),
(18, 9999999, '2023-07-03 12:00:03');
```

The existing row will be updated, and two new rows will be inserted. For columns not specified in the inserted data, if a default value is defined, the default will be used; if the column is nullable, NULL will be used; otherwise, the insert will fail.

The query result will be:
```sql
mysql> select * from user_profile;
+------+-------+------+----------+---------+---------------------+
| id | name | age | city | balance | last_access_time |
+------+-------+------+----------+---------+---------------------+
| 1 | kevin | 18 | shenzhen | 500 | 2023-07-03 12:00:01 |
| 3 | NULL | NULL | NULL | 23 | 2023-07-03 12:00:02 |
| 18 | NULL | NULL | NULL | 9999999 | 2023-07-03 12:00:03 |
+------+-------+------+----------+---------+---------------------+
```
Original file line number Diff line number Diff line change
Expand Up @@ -33,9 +33,7 @@ under the License.

## 严格模式

严格模式具有两个主要功能:
1. 对导入过程中发生列类型转换失败的数据行进行过滤。
2. 在部分列更新场景中,限制只能更新已存在的列。
严格模式的主要功能是对导入过程中发生列类型转换失败的数据行进行过滤。

### 列类型转换失败的过滤策略

Expand Down Expand Up @@ -79,74 +77,6 @@ under the License.
3. `10` 虽然是一个超过范围的值,但是因为其类型符合 decimal 的要求,所以严格模式对其不产生影响。
:::


### 限定部分列更新只能更新已有的列

在严格模式下,部分列更新插入的每一行数据必须满足该行数据的 Key 在表中已经存在。而在非严格模式下,进行部分列更新时可以更新 Key 已经存在的行,也可以插入 Key 不存在的新行。

例如有表结构如下:
```sql
CREATE TABLE user_profile
(
id INT,
name VARCHAR(10),
age INT,
city VARCHAR(10),
balance DECIMAL(9, 0),
last_access_time DATETIME
) ENGINE=OLAP
UNIQUE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1
PROPERTIES (
"enable_unique_key_merge_on_write" = "true"
);
```

表中有一条数据如下:
```sql
mysql> select * from user_profile;
+------+-------+------+----------+---------+---------------------+
| id | name | age | city | balance | last_access_time |
+------+-------+------+----------+---------+---------------------+
| 1 | kevin | 18 | shenzhen | 400 | 2023-07-01 12:00:00|
+------+-------+------+----------+---------+---------------------+
```

当用户使用严格模式的 Insert Into 部分列更新向表中插入上述数据时,由于开启了严格模式且第二、三行的数据的 key(`(3)`, `(18)`) 不在原表中,所以本次插入会失败:
```sql
SET enable_unique_key_partial_update=true;
SET enable_insert_strict = true;
INSERT INTO user_profile (id, balance, last_access_time) VALUES
(1, 500, '2023-07-03 12:00:01'),
(3, 23, '2023-07-03 12:00:02'),
(18, 9999999, '2023-07-03 12:00:03');
ERROR 1105 (HY000): errCode = 2, detailMessage = Insert has filtered data in strict mode
```

当用户使用非严格模式的 Insert Into 部分列更新向表中插入如下数据时:
```sql
SET enable_unique_key_partial_update=true;
SET enable_insert_strict = false;
INSERT INTO user_profile (id, balance, last_access_time) VALUES
(1, 500, '2023-07-03 12:00:01'),
(3, 23, '2023-07-03 12:00:02'),
(18, 9999999, '2023-07-03 12:00:03');
```

表中原有的一条数据将会被更新,此外还向表中插入了两条新数据。对于插入的数据中用户没有指定的列,如果该列有默认值,则会以默认值填充;否则,如果该列可以为 NULL,则将以 NULL 值填充;否则本次插入不成功。

查询结果如下:
```sql
mysql> select * from user_profile;
+------+-------+------+----------+---------+---------------------+
| id | name | age | city | balance | last_access_time |
+------+-------+------+----------+---------+---------------------+
| 1 | kevin | 18 | shenzhen | 500 | 2023-07-03 12:00:01 |
| 3 | NULL | NULL | NULL | 23 | 2023-07-03 12:00:02 |
| 18 | NULL | NULL | NULL | 9999999 | 2023-07-03 12:00:03 |
+------+-------+------+----------+---------+---------------------+
```

### 开启严格模式
严格模式(strict_mode)默认为 False,以下是各种导入方式的设置方法:

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -348,6 +348,7 @@ Stream Load 操作支持 HTTP 分块导入(HTTP chunked)与 HTTP 非分块
| escape | 指定转义符。用于转义在字段中出现的与包围符相同的字符。例如数据为 "a,'b,'c'",包围符为 "'",希望 "b,'c 被作为一个字段解析,则需要指定单字节转义符,例如"\",将数据修改为 "a,'b,\'c'"。 |
| memtable_on_sink_node | 导入数据的时候是否开启 MemTable 前移,默认为 false。 |
| unique_key_update_mode | Unique 表上的更新模式,目前仅对 Merge-On-Write Unique 表有效,一共支持三种类型 `UPSERT`, `UPDATE_FIXED_COLUMNS`, `UPDATE_FLEXIBLE_COLUMNS`。 `UPSERT`: 表示以 upsert 语义导入数据; `UPDATE_FIXED_COLUMNS`: 表示以[部分列更新](../../../data-operate/update/update-of-unique-model)的方式导入数据; `UPDATE_FLEXIBLE_COLUMNS`: 表示以[灵活部分列更新](../../../data-operate/update/update-of-unique-model)的方式导入数据|
| partial_update_new_key_behavior | Unique 表上进行部分列更新或灵活列更新时,对新插入行的处理方式。有两种类型 `APPEND`, `ERROR`。<br/>-`APPEND`: 允许插入新行数据<br/>-`ERROR`: 插入新行时倒入失败并报错 |

### 导入返回值

Expand Down Expand Up @@ -1036,7 +1037,7 @@ Doris 可以在导入语句中支持非常丰富的列转换和过滤操作。

### 启用严格模式导入

`strict_mode` 属性用于设置导入任务是否运行在严格模式下。该属性会对列映射、转换和过滤的结果产生影响,它同时也将控制部分列更新的行为。关于严格模式的具体说明,可参阅 [严格模式](../handling-messy-data#严格模式) 文档。
`strict_mode` 属性用于设置导入任务是否运行在严格模式下。该属性会对列映射、转换和过滤的结果产生影响。关于严格模式的具体说明,可参阅 [严格模式](../handling-messy-data#严格模式) 文档。

### 导入时进行部分列更新/灵活部分列更新

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -407,3 +407,74 @@ PROPERTIES(
6. 不支持在有 Variant 列的表上进行灵活列更新。

7. 不支持在有同步物化视图的表上进行灵活列更新。


## 部分列更新/灵活列更新中对新插入的行的处理

session variable或导入属性`partial_update_new_key_behavior`用于控制部分列更新和灵活列更新中插入的新行的行为。

当`partial_update_new_key_behavior=ERROR`时,插入的每一行数据必须满足该行数据的 Key 在表中已经存在。而当`partial_update_new_key_behavior=APPEND`时,进行部分列更新或灵活列更新时可以更新 Key 已经存在的行,也可以插入 Key 不存在的新行。

例如有表结构如下:
```sql
CREATE TABLE user_profile
(
id INT,
name VARCHAR(10),
age INT,
city VARCHAR(10),
balance DECIMAL(9, 0),
last_access_time DATETIME
) ENGINE=OLAP
UNIQUE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1
PROPERTIES (
"enable_unique_key_merge_on_write" = "true"
);
```

表中有一条数据如下:
```sql
mysql> select * from user_profile;
+------+-------+------+----------+---------+---------------------+
| id | name | age | city | balance | last_access_time |
+------+-------+------+----------+---------+---------------------+
| 1 | kevin | 18 | shenzhen | 400 | 2023-07-01 12:00:00|
+------+-------+------+----------+---------+---------------------+
```

当用户在`partial_update_new_key_behavior=ERROR`的情况下使用 Insert Into 部分列更新向表中插入上述数据时,由于第二、三行的数据的 key(`(3)`, `(18)`) 不在原表中,所以本次插入会失败:
```sql
SET enable_unique_key_partial_update=true;
SET partial_update_new_key_behavior=ERROR;
INSERT INTO user_profile (id, balance, last_access_time) VALUES
(1, 500, '2023-07-03 12:00:01'),
(3, 23, '2023-07-03 12:00:02'),
(18, 9999999, '2023-07-03 12:00:03');
(1105, "errCode = 2, detailMessage = (127.0.0.1)[INTERNAL_ERROR]tablet error: [E-7003]Can't append new rows in partial update when partial_update_new_key_behavior is ERROR. Row with key=[3] is not in table., host: 127.0.0.1")
```

当用在`partial_update_new_key_behavior=APPEND`的情况下使用 Insert Into 部分列更新向表中插入如下数据时:
```sql
SET enable_unique_key_partial_update=true;
SET partial_update_new_key_behavior=APPEND;
INSERT INTO user_profile (id, balance, last_access_time) VALUES
(1, 500, '2023-07-03 12:00:01'),
(3, 23, '2023-07-03 12:00:02'),
(18, 9999999, '2023-07-03 12:00:03');
```

表中原有的一条数据将会被更新,此外还向表中插入了两条新数据。对于插入的数据中用户没有指定的列,如果该列有默认值,则会以默认值填充;否则,如果该列可以为 NULL,则将以 NULL 值填充;否则本次插入不成功。

查询结果如下:
```sql
mysql> select * from user_profile;
+------+-------+------+----------+---------+---------------------+
| id | name | age | city | balance | last_access_time |
+------+-------+------+----------+---------+---------------------+
| 1 | kevin | 18 | shenzhen | 500 | 2023-07-03 12:00:01 |
| 3 | NULL | NULL | NULL | 23 | 2023-07-03 12:00:02 |
| 18 | NULL | NULL | NULL | 9999999 | 2023-07-03 12:00:03 |
+------+-------+------+----------+---------+---------------------+
```

Loading