
Closure and closure context field loads are not folded in AOT #60068

Open
osa1 opened this issue Feb 6, 2025 · 5 comments
Labels
area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends.



osa1 commented Feb 6, 2025

In https://github.com/osa1/protobuf.dart/tree/inlining_issue I added inline pragmas to two higher-order functions and a few small functions.

Inlining these higher-order functions should eliminate the indirect calls made when invoking their function arguments, since at each call site the closure being passed is a function expression.

However, AOT currently does not seem to fold field loads from closures and contexts.

If I manually inline _readPacked (by copy-pasting it into each of the call sites), the runtime of the from_binary benchmark improves by 7%. The diff for this change is at the end.
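For reference, the pattern in question looks like this (simplified from coded_buffer.dart; this matches the diff below):

```dart
// The call site passes a function expression, so after inlining
// `_readPacked` the closure target is statically known:
_readPacked(input, () => list.add(input.readBool()));

void _readPacked(CodedBufferReader input, void Function() readFunc) {
  input._withLimit(input.readInt32(), () {
    while (!input.isAtEnd()) {
      readFunc(); // indirect call; should become direct after inlining
    }
  });
}
```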

To repro:

  • Clone the repo and switch to the inlining_issue branch.
  • In benchmarks directory:
    • ./tool/compile_protos.sh (requires protoc with Dart plugin)
    • dart pub get
    • dart compile exe bin/from_binary.dart

I'm not fluent in the VM's flow graph IR, but with --extra-gen-snapshot-options=--print-flow-graph-filter=_mergeFromCodedBufferReader,--print-flow-graph-optimized,--compiler-passes=\*Inlining and the inline pragmas, I see code like

v525 <- Constant(#Function '<anonymous closure>': static.) T{Function}
...
v519 <- AllocateContext:10(num_variables=2) T{Context}
StoreField(v519 . input = v84 T{CodedBufferReader})
StoreField(v519 . readFunc = v82 T{_Closure})
...
v524 <- AllocateClosure:14(v525 T{Function}, v519) T{_Closure}
...
v4632 <- LoadField(v524 T{_Closure} . Closure.context {final}) T{!null}
...
v4636 <- LoadField(v4632 . readFunc {final}) T{(dynamic) => void}
v4637 <- ClosureCall:18( closure=v4636<0>, v4636)

(this is in an "after inlining" block)

Here LoadField(v524, ...) could be folded to v519 (the allocated context), which would then allow folding LoadField(v4632, ...) to the stored closure v82, which I suspect would in the end enable the optimizations we want.
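Concretely, the folded result would look something like this (a hand-written sketch, not actual compiler output):

```
v4632 <- LoadField(v524 . Closure.context)  =>  v519  (from AllocateClosure:14)
v4636 <- LoadField(v519 . readFunc)         =>  v82   (from the StoreField above)
v4637 <- ClosureCall:18(closure=v82, ...)   -- now a call of a known closure
```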

Inlining _readPacked manually is not ideal, as it would potentially increase binary sizes on the browser backends without improving runtime performance there. Ideally I want to do this with backend-specific pragmas.
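The backend-specific pragmas referred to here would presumably look like the following sketch; `vm:prefer-inline` is a VM-only pragma (the web compilers ignore `vm:` pragmas), so it should not affect dart2js/dart2wasm code size:

```dart
// Sketch only: hint the VM backends to inline, without affecting web output.
@pragma('vm:prefer-inline')
void _readPacked(CodedBufferReader input, void Function() readFunc) {
  // ... body as in coded_buffer.dart ...
}
```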

Diff for manually inlining `_readPacked`
diff --git a/protobuf/lib/src/protobuf/coded_buffer.dart b/protobuf/lib/src/protobuf/coded_buffer.dart
index a102cbb..ca5d26d 100644
--- a/protobuf/lib/src/protobuf/coded_buffer.dart
+++ b/protobuf/lib/src/protobuf/coded_buffer.dart
@@ -130,7 +130,11 @@ void _mergeFromCodedBufferReader(BuilderInfo meta, _FieldSet fs,
       case PbFieldType._REPEATED_BOOL:
         final list = fs._ensureRepeatedField(meta, fi);
         if (wireType == WIRETYPE_LENGTH_DELIMITED) {
-          _readPacked(input, () => list.add(input.readBool()));
+          input._withLimit(input.readInt32(), () {
+            while (!input.isAtEnd()) {
+              list.add(input.readBool());
+            }
+          });
         } else {
           list.add(input.readBool());
         }
@@ -146,7 +150,11 @@ void _mergeFromCodedBufferReader(BuilderInfo meta, _FieldSet fs,
       case PbFieldType._REPEATED_FLOAT:
         final list = fs._ensureRepeatedField(meta, fi);
         if (wireType == WIRETYPE_LENGTH_DELIMITED) {
-          _readPacked(input, () => list.add(input.readFloat()));
+          input._withLimit(input.readInt32(), () {
+            while (!input.isAtEnd()) {
+              list.add(input.readFloat());
+            }
+          });
         } else {
           list.add(input.readFloat());
         }
@@ -154,7 +162,11 @@ void _mergeFromCodedBufferReader(BuilderInfo meta, _FieldSet fs,
       case PbFieldType._REPEATED_DOUBLE:
         final list = fs._ensureRepeatedField(meta, fi);
         if (wireType == WIRETYPE_LENGTH_DELIMITED) {
-          _readPacked(input, () => list.add(input.readDouble()));
+          input._withLimit(input.readInt32(), () {
+            while (!input.isAtEnd()) {
+              list.add(input.readDouble());
+            }
+          });
         } else {
           list.add(input.readDouble());
         }
@@ -173,7 +185,11 @@ void _mergeFromCodedBufferReader(BuilderInfo meta, _FieldSet fs,
       case PbFieldType._REPEATED_INT32:
         final list = fs._ensureRepeatedField(meta, fi);
         if (wireType == WIRETYPE_LENGTH_DELIMITED) {
-          _readPacked(input, () => list.add(input.readInt32()));
+          input._withLimit(input.readInt32(), () {
+            while (!input.isAtEnd()) {
+              list.add(input.readInt32());
+            }
+          });
         } else {
           list.add(input.readInt32());
         }
@@ -181,7 +197,11 @@ void _mergeFromCodedBufferReader(BuilderInfo meta, _FieldSet fs,
       case PbFieldType._REPEATED_INT64:
         final list = fs._ensureRepeatedField(meta, fi);
         if (wireType == WIRETYPE_LENGTH_DELIMITED) {
-          _readPacked(input, () => list.add(input.readInt64()));
+          input._withLimit(input.readInt32(), () {
+            while (!input.isAtEnd()) {
+              list.add(input.readInt64());
+            }
+          });
         } else {
           list.add(input.readInt64());
         }
@@ -189,7 +209,11 @@ void _mergeFromCodedBufferReader(BuilderInfo meta, _FieldSet fs,
       case PbFieldType._REPEATED_SINT32:
         final list = fs._ensureRepeatedField(meta, fi);
         if (wireType == WIRETYPE_LENGTH_DELIMITED) {
-          _readPacked(input, () => list.add(input.readSint32()));
+          input._withLimit(input.readInt32(), () {
+            while (!input.isAtEnd()) {
+              list.add(input.readSint32());
+            }
+          });
         } else {
           list.add(input.readSint32());
         }
@@ -197,7 +221,11 @@ void _mergeFromCodedBufferReader(BuilderInfo meta, _FieldSet fs,
       case PbFieldType._REPEATED_SINT64:
         final list = fs._ensureRepeatedField(meta, fi);
         if (wireType == WIRETYPE_LENGTH_DELIMITED) {
-          _readPacked(input, () => list.add(input.readSint64()));
+          input._withLimit(input.readInt32(), () {
+            while (!input.isAtEnd()) {
+              list.add(input.readSint64());
+            }
+          });
         } else {
           list.add(input.readSint64());
         }
@@ -205,7 +233,11 @@ void _mergeFromCodedBufferReader(BuilderInfo meta, _FieldSet fs,
       case PbFieldType._REPEATED_UINT32:
         final list = fs._ensureRepeatedField(meta, fi);
         if (wireType == WIRETYPE_LENGTH_DELIMITED) {
-          _readPacked(input, () => list.add(input.readUint32()));
+          input._withLimit(input.readInt32(), () {
+            while (!input.isAtEnd()) {
+              list.add(input.readUint32());
+            }
+          });
         } else {
           list.add(input.readUint32());
         }
@@ -213,7 +245,11 @@ void _mergeFromCodedBufferReader(BuilderInfo meta, _FieldSet fs,
       case PbFieldType._REPEATED_UINT64:
         final list = fs._ensureRepeatedField(meta, fi);
         if (wireType == WIRETYPE_LENGTH_DELIMITED) {
-          _readPacked(input, () => list.add(input.readUint64()));
+          input._withLimit(input.readInt32(), () {
+            while (!input.isAtEnd()) {
+              list.add(input.readUint64());
+            }
+          });
         } else {
           list.add(input.readUint64());
         }
@@ -221,7 +257,11 @@ void _mergeFromCodedBufferReader(BuilderInfo meta, _FieldSet fs,
       case PbFieldType._REPEATED_FIXED32:
         final list = fs._ensureRepeatedField(meta, fi);
         if (wireType == WIRETYPE_LENGTH_DELIMITED) {
-          _readPacked(input, () => list.add(input.readFixed32()));
+          input._withLimit(input.readInt32(), () {
+            while (!input.isAtEnd()) {
+              list.add(input.readFixed32());
+            }
+          });
         } else {
           list.add(input.readFixed32());
         }
@@ -229,7 +269,11 @@ void _mergeFromCodedBufferReader(BuilderInfo meta, _FieldSet fs,
       case PbFieldType._REPEATED_FIXED64:
         final list = fs._ensureRepeatedField(meta, fi);
         if (wireType == WIRETYPE_LENGTH_DELIMITED) {
-          _readPacked(input, () => list.add(input.readFixed64()));
+          input._withLimit(input.readInt32(), () {
+            while (!input.isAtEnd()) {
+              list.add(input.readFixed64());
+            }
+          });
         } else {
           list.add(input.readFixed64());
         }
@@ -237,7 +281,11 @@ void _mergeFromCodedBufferReader(BuilderInfo meta, _FieldSet fs,
       case PbFieldType._REPEATED_SFIXED32:
         final list = fs._ensureRepeatedField(meta, fi);
         if (wireType == WIRETYPE_LENGTH_DELIMITED) {
-          _readPacked(input, () => list.add(input.readSfixed32()));
+          input._withLimit(input.readInt32(), () {
+            while (!input.isAtEnd()) {
+              list.add(input.readSfixed32());
+            }
+          });
         } else {
           list.add(input.readSfixed32());
         }
@@ -245,7 +293,11 @@ void _mergeFromCodedBufferReader(BuilderInfo meta, _FieldSet fs,
       case PbFieldType._REPEATED_SFIXED64:
         final list = fs._ensureRepeatedField(meta, fi);
         if (wireType == WIRETYPE_LENGTH_DELIMITED) {
-          _readPacked(input, () => list.add(input.readSfixed64()));
+          input._withLimit(input.readInt32(), () {
+            while (!input.isAtEnd()) {
+              list.add(input.readSfixed64());
+            }
+          });
         } else {
           list.add(input.readSfixed64());
         }
@@ -269,14 +321,6 @@ void _mergeFromCodedBufferReader(BuilderInfo meta, _FieldSet fs,
   }
 }

-void _readPacked(CodedBufferReader input, void Function() readFunc) {
-  input._withLimit(input.readInt32(), () {
-    while (!input.isAtEnd()) {
-      readFunc();
-    }
-  });
-}
-
 void _readPackableToListEnum(
     List list,
     BuilderInfo meta,

cc @mraleph

@osa1 osa1 added the area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. label Feb 6, 2025
@mkustermann
Member

/cc @alexmarkov

@Manishtechbee

This comment has been minimized.


mraleph commented Feb 6, 2025

@Manishtechbee please no useless LLM generated garbage on this issue tracker or you will be banned for spam.


osa1 commented Feb 7, 2025

We tried making these changes with @mkustermann, but it didn't quite give us the same performance as manually inlining _readPacked, even though it eliminated all of the closures we wanted to eliminate:

diff --git a/runtime/vm/compiler/backend/il.cc b/runtime/vm/compiler/backend/il.cc
index 42d89ce95f1..35258314ff3 100644
--- a/runtime/vm/compiler/backend/il.cc
+++ b/runtime/vm/compiler/backend/il.cc
@@ -2969,6 +2969,11 @@ Definition* LoadFieldInstr::Canonicalize(FlowGraph* flow_graph) {
         }
       }
       break;
+    case Slot::Kind::kClosure_context:
+      if (auto closure_allocation = orig_instance->AsAllocateClosure()) {
+        return closure_allocation->context()->definition();
+      }
+      break;
     default:
       break;
   }
@@ -3008,6 +3013,22 @@ Definition* LoadFieldInstr::Canonicalize(FlowGraph* flow_graph) {
     }
   }
 
+  if (auto instr = orig_instance->AsAllocateContext()) {
+    if (IsImmutableLoad()) {
+      for (auto use : instr->input_uses()) {
+        if (auto store = use->instruction()->AsStoreField()) {
+          if (IsDominatedBy(store)) {
+            if ((use->use_index() == StoreFieldInstr::kInstancePos) &&
+                store->slot().IsIdentical(slot())) {
+              // RELEASE_ASSERT(IsDominatedBy(store));
+              return store->value()->definition();
+            }
+          }
+        }
+      }
+    }
+  }
+
   return this;
 }
 
diff --git a/runtime/vm/compiler/backend/inliner.cc b/runtime/vm/compiler/backend/inliner.cc
index 2909c6915fc..e24e17e583f 100644
--- a/runtime/vm/compiler/backend/inliner.cc
+++ b/runtime/vm/compiler/backend/inliner.cc
@@ -464,7 +464,14 @@ class CallSites : public ValueObject {
         worklist.Add(call_info.call);
       }
     }
-    RELEASE_ASSERT(worklist.length() == calls_->length());
+    for (auto& call_info : closure_calls_) {
+      if (call_info.call->HasSSATemp()) {
+        add_to_worklist(call_info.call);
+      } else {
+        worklist.Add(call_info.call);
+      }
+    }
+    // RELEASE_ASSERT(worklist.length() == calls_->length());
     add_transitive_dependencies_to_worklist(0);
 
     // Step 2: canonicalize each definition from the worklist. We process
@@ -588,13 +602,13 @@ class CallSites : public ValueObject {
           HandleStaticCall(static_call, inline_only_profitable_methods, graph,
                            depth, nesting_depth, inlined_info);
         } else if (auto closure_call = current->AsClosureCall()) {
-          if (!inline_only_profitable_methods) {
-            // Consider closure for further inlining. Note that it will
-            // still be subject to all the inlining heuristics.
-            closure_calls_.Add({graph, closure_call, depth, nesting_depth});
-          } else {
-            // No longer consider the closure because inlining is too deep.
-          }
+          // if (!inline_only_profitable_methods) {
+          // Consider closure for further inlining. Note that it will
+          // still be subject to all the inlining heuristics.
+          closure_calls_.Add({graph, closure_call, depth, nesting_depth});
+          // } else {
+          //   // No longer consider the closure because inlining is too deep.
+          // }
         }
       }
     }

We improved the benchmark from 400us to 395us, but manually inlining _readPacked gives us 370us.

There's also something strange in this change: the IsDominatedBy check before returning the store value. We thought the check shouldn't be necessary (it doesn't exist for the other cases in the same function), since you shouldn't even be able to read a field unless it was stored in a dominating block, but removing the check causes assertion errors.

It's also strange that eliminating a bunch of closure allocations in a hot loop gave us only a tiny improvement.

@alexmarkov
Contributor

This is likely a pass ordering problem: full-blown store-load forwarding happens after inlining is already finished. Simple store-load forwarding during canonicalization (which happens during inlining) is one way to solve it. However, such store-load forwarding does not respect control flow and is usually applicable only to final fields.

With captured final local variables you need to be a bit more careful, as there can be multiple assignments (stores) to a final local variable. For example:

  final int foo;
  if (...) {
    foo = 2;
  } else {
    foo = 3;
  }
  () { print(foo); } ();

That's why you probably need an IsDominatedBy check.
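In IR terms (again a hand-written sketch, not actual compiler output), the example above produces two stores into the same context slot, neither of which dominates the load:

```
B1: v1 <- AllocateContext(num_variables=1)
    Branch ...
B2: StoreField(v1 . foo = Constant(2))
B3: StoreField(v1 . foo = Constant(3))
B4: v2 <- LoadField(v1 . foo)  -- neither store dominates this load,
                               -- so forwarding either value would be wrong
```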
