
Closure and closure context field loads are not folded in AOT #60068

Open
osa1 opened this issue Feb 6, 2025 · 5 comments
Labels
area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends.



osa1 commented Feb 6, 2025

In https://github.com/osa1/protobuf.dart/tree/inlining_issue I added inline pragmas to two higher-order functions and a few small functions.

Inlining these higher-order functions should eliminate the indirect calls made when invoking their function arguments, since at each call site the closure being passed is a function expression.

However, AOT currently does not seem to fold field loads from closures and contexts.

If I manually inline _readPacked (by copy-pasting it into each of the call sites), the runtime of the from_binary benchmark improves by 7%. The diff for this change is at the end.
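For reference, the pattern in question looks like this (simplified from coded_buffer.dart; this matches the diff below):

```dart
// The call site passes a function expression, so after inlining
// `_readPacked` the closure target is statically known:
_readPacked(input, () => list.add(input.readBool()));

void _readPacked(CodedBufferReader input, void Function() readFunc) {
  input._withLimit(input.readInt32(), () {
    while (!input.isAtEnd()) {
      readFunc(); // indirect call; should become direct after inlining
    }
  });
}
```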

To repro:

  • Clone the repo and switch to the inlining_issue branch.
  • In benchmarks directory:
    • ./tool/compile_protos.sh (requires protoc with Dart plugin)
    • dart pub get
    • dart compile exe bin/from_binary.dart

I'm not fluent in the VM's flow graph IR, but with --extra-gen-snapshot-options=--print-flow-graph-filter=_mergeFromCodedBufferReader,--print-flow-graph-optimized,--compiler-passes=\*Inlining and the inline pragmas, I see code like

v525 <- Constant(#Function '<anonymous closure>': static.) T{Function}
...
v519 <- AllocateContext:10(num_variables=2) T{Context}
StoreField(v519 . input = v84 T{CodedBufferReader})
StoreField(v519 . readFunc = v82 T{_Closure})
...
v524 <- AllocateClosure:14(v525 T{Function}, v519) T{_Closure}
...
v4632 <- LoadField(v524 T{_Closure} . Closure.context {final}) T{!null}
...
v4636 <- LoadField(v4632 . readFunc {final}) T{(dynamic) => void}
v4637 <- ClosureCall:18( closure=v4636<0>, v4636)

(this is in an "after inlining" block)

Here LoadField(v524, ...) could be folded to v519 (the allocated context), which would then allow folding LoadField(v4632, ...) to the stored closure v82, which I suspect would in the end enable the optimizations we want.
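Concretely, the folded result would look something like this (a hand-written sketch, not actual compiler output):

```
v4632 <- LoadField(v524 . Closure.context)  =>  v519  (from AllocateClosure:14)
v4636 <- LoadField(v519 . readFunc)         =>  v82   (from the StoreField above)
v4637 <- ClosureCall:18(closure=v82, ...)   -- now a call of a known closure
```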

Inlining _readPacked manually is not ideal, as it would potentially increase binary sizes on the browser backends without improving runtime performance there. Ideally I want to do this with backend-specific pragmas.
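The backend-specific pragmas referred to here would presumably look like the following sketch; `vm:prefer-inline` is a VM-only pragma (the web compilers ignore `vm:` pragmas), so it should not affect dart2js/dart2wasm code size:

```dart
// Sketch only: hint the VM backends to inline, without affecting web output.
@pragma('vm:prefer-inline')
void _readPacked(CodedBufferReader input, void Function() readFunc) {
  // ... body as in coded_buffer.dart ...
}
```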

Diff for manually inlining `_readPacked`
diff --git a/protobuf/lib/src/protobuf/coded_buffer.dart b/protobuf/lib/src/protobuf/coded_buffer.dart
index a102cbb..ca5d26d 100644
--- a/protobuf/lib/src/protobuf/coded_buffer.dart
+++ b/protobuf/lib/src/protobuf/coded_buffer.dart
@@ -130,7 +130,11 @@ void _mergeFromCodedBufferReader(BuilderInfo meta, _FieldSet fs,
       case PbFieldType._REPEATED_BOOL:
         final list = fs._ensureRepeatedField(meta, fi);
         if (wireType == WIRETYPE_LENGTH_DELIMITED) {
-          _readPacked(input, () => list.add(input.readBool()));
+          input._withLimit(input.readInt32(), () {
+            while (!input.isAtEnd()) {
+              list.add(input.readBool());
+            }
+          });
         } else {
           list.add(input.readBool());
         }
@@ -146,7 +150,11 @@ void _mergeFromCodedBufferReader(BuilderInfo meta, _FieldSet fs,
       case PbFieldType._REPEATED_FLOAT:
         final list = fs._ensureRepeatedField(meta, fi);
         if (wireType == WIRETYPE_LENGTH_DELIMITED) {
-          _readPacked(input, () => list.add(input.readFloat()));
+          input._withLimit(input.readInt32(), () {
+            while (!input.isAtEnd()) {
+              list.add(input.readFloat());
+            }
+          });
         } else {
           list.add(input.readFloat());
         }
@@ -154,7 +162,11 @@ void _mergeFromCodedBufferReader(BuilderInfo meta, _FieldSet fs,
       case PbFieldType._REPEATED_DOUBLE:
         final list = fs._ensureRepeatedField(meta, fi);
         if (wireType == WIRETYPE_LENGTH_DELIMITED) {
-          _readPacked(input, () => list.add(input.readDouble()));
+          input._withLimit(input.readInt32(), () {
+            while (!input.isAtEnd()) {
+              list.add(input.readDouble());
+            }
+          });
         } else {
           list.add(input.readDouble());
         }
@@ -173,7 +185,11 @@ void _mergeFromCodedBufferReader(BuilderInfo meta, _FieldSet fs,
       case PbFieldType._REPEATED_INT32:
         final list = fs._ensureRepeatedField(meta, fi);
         if (wireType == WIRETYPE_LENGTH_DELIMITED) {
-          _readPacked(input, () => list.add(input.readInt32()));
+          input._withLimit(input.readInt32(), () {
+            while (!input.isAtEnd()) {
+              list.add(input.readInt32());
+            }
+          });
         } else {
           list.add(input.readInt32());
         }
@@ -181,7 +197,11 @@ void _mergeFromCodedBufferReader(BuilderInfo meta, _FieldSet fs,
       case PbFieldType._REPEATED_INT64:
         final list = fs._ensureRepeatedField(meta, fi);
         if (wireType == WIRETYPE_LENGTH_DELIMITED) {
-          _readPacked(input, () => list.add(input.readInt64()));
+          input._withLimit(input.readInt32(), () {
+            while (!input.isAtEnd()) {
+              list.add(input.readInt64());
+            }
+          });
         } else {
           list.add(input.readInt64());
         }
@@ -189,7 +209,11 @@ void _mergeFromCodedBufferReader(BuilderInfo meta, _FieldSet fs,
       case PbFieldType._REPEATED_SINT32:
         final list = fs._ensureRepeatedField(meta, fi);
         if (wireType == WIRETYPE_LENGTH_DELIMITED) {
-          _readPacked(input, () => list.add(input.readSint32()));
+          input._withLimit(input.readInt32(), () {
+            while (!input.isAtEnd()) {
+              list.add(input.readSint32());
+            }
+          });
         } else {
           list.add(input.readSint32());
         }
@@ -197,7 +221,11 @@ void _mergeFromCodedBufferReader(BuilderInfo meta, _FieldSet fs,
       case PbFieldType._REPEATED_SINT64:
         final list = fs._ensureRepeatedField(meta, fi);
         if (wireType == WIRETYPE_LENGTH_DELIMITED) {
-          _readPacked(input, () => list.add(input.readSint64()));
+          input._withLimit(input.readInt32(), () {
+            while (!input.isAtEnd()) {
+              list.add(input.readSint64());
+            }
+          });
         } else {
           list.add(input.readSint64());
         }
@@ -205,7 +233,11 @@ void _mergeFromCodedBufferReader(BuilderInfo meta, _FieldSet fs,
       case PbFieldType._REPEATED_UINT32:
         final list = fs._ensureRepeatedField(meta, fi);
         if (wireType == WIRETYPE_LENGTH_DELIMITED) {
-          _readPacked(input, () => list.add(input.readUint32()));
+          input._withLimit(input.readInt32(), () {
+            while (!input.isAtEnd()) {
+              list.add(input.readUint32());
+            }
+          });
         } else {
           list.add(input.readUint32());
         }
@@ -213,7 +245,11 @@ void _mergeFromCodedBufferReader(BuilderInfo meta, _FieldSet fs,
       case PbFieldType._REPEATED_UINT64:
         final list = fs._ensureRepeatedField(meta, fi);
         if (wireType == WIRETYPE_LENGTH_DELIMITED) {
-          _readPacked(input, () => list.add(input.readUint64()));
+          input._withLimit(input.readInt32(), () {
+            while (!input.isAtEnd()) {
+              list.add(input.readUint64());
+            }
+          });
         } else {
           list.add(input.readUint64());
         }
@@ -221,7 +257,11 @@ void _mergeFromCodedBufferReader(BuilderInfo meta, _FieldSet fs,
       case PbFieldType._REPEATED_FIXED32:
         final list = fs._ensureRepeatedField(meta, fi);
         if (wireType == WIRETYPE_LENGTH_DELIMITED) {
-          _readPacked(input, () => list.add(input.readFixed32()));
+          input._withLimit(input.readInt32(), () {
+            while (!input.isAtEnd()) {
+              list.add(input.readFixed32());
+            }
+          });
         } else {
           list.add(input.readFixed32());
         }
@@ -229,7 +269,11 @@ void _mergeFromCodedBufferReader(BuilderInfo meta, _FieldSet fs,
       case PbFieldType._REPEATED_FIXED64:
         final list = fs._ensureRepeatedField(meta, fi);
         if (wireType == WIRETYPE_LENGTH_DELIMITED) {
-          _readPacked(input, () => list.add(input.readFixed64()));
+          input._withLimit(input.readInt32(), () {
+            while (!input.isAtEnd()) {
+              list.add(input.readFixed64());
+            }
+          });
         } else {
           list.add(input.readFixed64());
         }
@@ -237,7 +281,11 @@ void _mergeFromCodedBufferReader(BuilderInfo meta, _FieldSet fs,
       case PbFieldType._REPEATED_SFIXED32:
         final list = fs._ensureRepeatedField(meta, fi);
         if (wireType == WIRETYPE_LENGTH_DELIMITED) {
-          _readPacked(input, () => list.add(input.readSfixed32()));
+          input._withLimit(input.readInt32(), () {
+            while (!input.isAtEnd()) {
+              list.add(input.readSfixed32());
+            }
+          });
         } else {
           list.add(input.readSfixed32());
         }
@@ -245,7 +293,11 @@ void _mergeFromCodedBufferReader(BuilderInfo meta, _FieldSet fs,
       case PbFieldType._REPEATED_SFIXED64:
         final list = fs._ensureRepeatedField(meta, fi);
         if (wireType == WIRETYPE_LENGTH_DELIMITED) {
-          _readPacked(input, () => list.add(input.readSfixed64()));
+          input._withLimit(input.readInt32(), () {
+            while (!input.isAtEnd()) {
+              list.add(input.readSfixed64());
+            }
+          });
         } else {
           list.add(input.readSfixed64());
         }
@@ -269,14 +321,6 @@ void _mergeFromCodedBufferReader(BuilderInfo meta, _FieldSet fs,
   }
 }

-void _readPacked(CodedBufferReader input, void Function() readFunc) {
-  input._withLimit(input.readInt32(), () {
-    while (!input.isAtEnd()) {
-      readFunc();
-    }
-  });
-}
-
 void _readPackableToListEnum(
     List list,
     BuilderInfo meta,

cc @mraleph

@osa1 osa1 added the area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. label Feb 6, 2025
@mkustermann
Member

/cc @alexmarkov

@Manishtechbee

This comment has been minimized.


mraleph commented Feb 6, 2025

@Manishtechbee please no useless LLM generated garbage on this issue tracker or you will be banned for spam.


osa1 commented Feb 7, 2025

We tried making these changes with @mkustermann, but it didn't quite give us the same performance as manually inlining _readPacked, even though it eliminated all of the closures we wanted to eliminate:

diff --git a/runtime/vm/compiler/backend/il.cc b/runtime/vm/compiler/backend/il.cc
index 42d89ce95f1..35258314ff3 100644
--- a/runtime/vm/compiler/backend/il.cc
+++ b/runtime/vm/compiler/backend/il.cc
@@ -2969,6 +2969,11 @@ Definition* LoadFieldInstr::Canonicalize(FlowGraph* flow_graph) {
         }
       }
       break;
+    case Slot::Kind::kClosure_context:
+      if (auto closure_allocation = orig_instance->AsAllocateClosure()) {
+        return closure_allocation->context()->definition();
+      }
+      break;
     default:
       break;
   }
@@ -3008,6 +3013,22 @@ Definition* LoadFieldInstr::Canonicalize(FlowGraph* flow_graph) {
     }
   }
 
+  if (auto instr = orig_instance->AsAllocateContext()) {
+    if (IsImmutableLoad()) {
+      for (auto use : instr->input_uses()) {
+        if (auto store = use->instruction()->AsStoreField()) {
+          if (IsDominatedBy(store)) {
+            if ((use->use_index() == StoreFieldInstr::kInstancePos) &&
+                store->slot().IsIdentical(slot())) {
+              // RELEASE_ASSERT(IsDominatedBy(store));
+              return store->value()->definition();
+            }
+          }
+        }
+      }
+    }
+  }
+
   return this;
 }
 
diff --git a/runtime/vm/compiler/backend/inliner.cc b/runtime/vm/compiler/backend/inliner.cc
index 2909c6915fc..e24e17e583f 100644
--- a/runtime/vm/compiler/backend/inliner.cc
+++ b/runtime/vm/compiler/backend/inliner.cc
@@ -464,7 +464,14 @@ class CallSites : public ValueObject {
         worklist.Add(call_info.call);
       }
     }
-    RELEASE_ASSERT(worklist.length() == calls_->length());
+    for (auto& call_info : closure_calls_) {
+      if (call_info.call->HasSSATemp()) {
+        add_to_worklist(call_info.call);
+      } else {
+        worklist.Add(call_info.call);
+      }
+    }
+    // RELEASE_ASSERT(worklist.length() == calls_->length());
     add_transitive_dependencies_to_worklist(0);
 
     // Step 2: canonicalize each definition from the worklist. We process
@@ -588,13 +602,13 @@ class CallSites : public ValueObject {
           HandleStaticCall(static_call, inline_only_profitable_methods, graph,
                            depth, nesting_depth, inlined_info);
         } else if (auto closure_call = current->AsClosureCall()) {
-          if (!inline_only_profitable_methods) {
-            // Consider closure for further inlining. Note that it will
-            // still be subject to all the inlining heuristics.
-            closure_calls_.Add({graph, closure_call, depth, nesting_depth});
-          } else {
-            // No longer consider the closure because inlining is too deep.
-          }
+          // if (!inline_only_profitable_methods) {
+          // Consider closure for further inlining. Note that it will
+          // still be subject to all the inlining heuristics.
+          closure_calls_.Add({graph, closure_call, depth, nesting_depth});
+          // } else {
+          //   // No longer consider the closure because inlining is too deep.
+          // }
         }
       }
     }

We improved the benchmark from 400us to 395us, but manually inlining _readPacked gives us 370us.

There's also something strange in this change: the IsDominatedBy check before returning the store value. We thought the check shouldn't be necessary (it doesn't exist for the other cases in the same function), since you shouldn't even be able to read a field unless it was stored in a dominating block, but removing the check causes assertion errors.

It's also strange that eliminating a bunch of closure allocations in a hot loop gave us only a tiny improvement.

@alexmarkov
Contributor

This is likely a pass ordering problem: full-blown store-load forwarding happens after inlining is already finished. Simple store-load forwarding during canonicalization (which happens during inlining) is one way to solve it. However, such store-load forwarding does not respect control flow and is usually applicable only to final fields.

With captured final local variables you need to be a bit more careful, as there can be multiple assignments (stores) to a final local variable. For example:

  final int foo;
  if (...) {
    foo = 2;
  } else {
    foo = 3;
  }
  () { print(foo); } ();

That's why you probably need an IsDominatedBy check.
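In IR terms (again a hand-written sketch, not actual compiler output), the example above produces two stores into the same context slot, neither of which dominates the load:

```
B1: v1 <- AllocateContext(num_variables=1)
    Branch ...
B2: StoreField(v1 . foo = Constant(2))
B3: StoreField(v1 . foo = Constant(3))
B4: v2 <- LoadField(v1 . foo)  -- neither store dominates this load,
                               -- so forwarding either value would be wrong
```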
