why output being slow can cause msgpack block event_loop thread #3677

frankcrc · 2022-03-16T08:55:48Z

Hello,

I use Elasticsearch to collect logs. First, in_tail collects the logs and out_forward forwards the events to a fluentd aggregator. On the aggregator, in_forward accepts the events and out_elasticsearch sends them to Elasticsearch in bulk.

I encountered a problem in which the buffer was processed extremely slowly, so the buffer kept growing.

The version of fluentd I use is 1.11.5, which is inside td-agent 3.8.1.

I checked out fluentd's source code at tag v1.11.5; the head commit is 24fe4cb.

By debugging, I found a simple way to reproduce this problem, but I am not sure it is a proper reproduction. I only saw a similar netstat result (some connections stuck at ESTABLISHED with a large Recv-Q), and the buffer was processed slowly.

Way to reproduce:

Set up the first fluentd, which collects the logs.

<system>
  root_dir /opt/apps/es_data/td-agent-root
  log_level info
</system>
<source>
  @type tail
  path /opt/apps/es_data/mock_data/*.dat
  pos_file /opt/apps/es_data/pos.log
  path_key log_path
  tag test-block
  read_from_head true
  @label @label-filter
  <parse>
    @type regexp
    expression /^(?<@timestamp>[^ ]+) +(?<severity>[^ ]+) +(?<component>[^ ]+) +\[(?<context>.+)\] +(?<message>.*?)?$/
  </parse>
</source>
<label @label-filter>
  <match **>
    @type copy
    <store>
      @type forward
      @log_level info
      heartbeat_type none
      <service>
        host 127.0.0.1
        port 24224
        weight 50
      </service>
      <buffer tag,time>
        @type file
        path /opt/apps/es_data/buffer/test-block
        chunk_limit_size 10M
        queued_chunks_limit_size 1
        timekey 1h
        timekey_wait 0s
        flush_mode interval
        flush_interval 2s
        flush_thread_count 1
      </buffer>
    </store>
  </match>
</label>

Set up the fluentd aggregator.

<system>
  root_dir /opt/apps/es_data/td-agent-root
  log_level info
</system>
<source>
  @type forward
  @id out_fwd
  bind 0.0.0.0
  port 24224
</source>
<match test-**>
  @type file
  @id out_file
  path /opt/apps/es_data/mock_data/out.${tag}.%Y-%m-%d_%H:%M:%S_%z
  <buffer tag,time>
    @type file
    path /opt/apps/es_data/buffer/forward
    chunk_limit_size 100M
    queued_chunks_limit_size 2
    overflow_action block
    timekey 8h
    timekey_wait 0s
    timekey_zone +0800
    flush_mode interval
    flush_interval 5s
    flush_thread_count 1
  </buffer>
</match>

Use a shell script to generate mock logs.

#!/bin/bash

for i in {1..400000}
do
	echo $i
	echo '2022-02-21T11:08:46.258+0800 I SHARDING [ConfigServerCatalogCacheLoader-1720] 12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890' >> mock.dat
done

Edit in_forward.rb, which should be at /opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-1.11.5/lib/fluent/plugin if fluentd was installed through td-agent, or in the source tree:

fluentd/lib/fluent/plugin/in_forward.rb

Lines 259 to 266 in 24fe4cb

    
    parser = Fluent::MessagePackFactory.msgpack_unpacker
    serializer = :to_msgpack.to_proc
    feeder = ->(d){
      parser.feed_each(d){|obj|
        block.call(obj, bytes, serializer)
        bytes = 0
      }
    }

Add log.info at around line 263, inside the block passed to parser.feed_each.

    parser = Fluent::MessagePackFactory.msgpack_unpacker
    serializer = :to_msgpack.to_proc
    feeder = ->(d){
      parser.feed_each(d){|obj|
        log.info "feeder inner" # add this line
        block.call(obj, bytes, serializer)
        bytes = 0
      }
    }

Edit buffer.rb, which should be in the same directory as in_forward.rb, or in the source tree:

fluentd/lib/fluent/plugin/buffer.rb

Lines 288 to 294 in 24fe4cb

    
    def write(metadata_and_data, format: nil, size: nil, enqueue: false)
      return if metadata_and_data.size < 1
      raise BufferOverflowError, "buffer space has too many data" unless storable?

      log.on_trace { log.trace "writing events into buffer", instance: self.object_id, metadata_size: metadata_and_data.size }

      operated_chunks = []

Add log.on_info at around line 292.

    def write(metadata_and_data, format: nil, size: nil, enqueue: false)
      return if metadata_and_data.size < 1
      raise BufferOverflowError, "buffer space has too many data" unless storable?

      log.on_info { log.info "Current Thread is #{Thread.current.name}"} # add this line
      log.on_trace { log.trace "writing events into buffer", instance: self.object_id, metadata_size: metadata_and_data.size }

      operated_chunks = []

Edit out_file.rb, which should be in the same directory as in_forward.rb, or in the source tree:

fluentd/lib/fluent/plugin/out_file.rb

Lines 195 to 199 in 24fe4cb

    
    def write(chunk)
      path = extract_placeholders(@path_template, chunk)
      FileUtils.mkdir_p File.dirname(path), mode: @dir_perm

      writer = case

Add the following code at around line 199.

    def write(chunk)
      path = extract_placeholders(@path_template, chunk)
      FileUtils.mkdir_p File.dirname(path), mode: @dir_perm

      # begin
      log.info "before loop"
      abc = 0
      # the loop is to slow down flush_thread, the slower, the easier to see the problem.
      while abc < 500000000
        # seems like io-independent operation cannot reproduce the problem
        # like sleep 10 or puts "something"
        abc += 1
      end
      log.info "after loop"
      # end

      writer = case
               when @compress_method.nil?

Start the fluentd aggregator.
Start the first fluentd.

My local results.
Log of the first fluentd:

2022-03-16 14:11:28 +0800 [info]: parsing config file is succeeded path="/opt/apps/es_data/td-agent.conf"
2022-03-16 14:11:28 +0800 [info]: gem 'fluent-plugin-elasticsearch' version '4.2.2'
2022-03-16 14:11:28 +0800 [info]: gem 'fluent-plugin-elasticsearch' version '2.12.5'
2022-03-16 14:11:28 +0800 [info]: gem 'fluent-plugin-flowcounter-simple' version '0.1.0'
2022-03-16 14:11:28 +0800 [info]: gem 'fluent-plugin-kafka' version '0.15.2'
2022-03-16 14:11:28 +0800 [info]: gem 'fluent-plugin-prometheus' version '1.8.5'
2022-03-16 14:11:28 +0800 [info]: gem 'fluent-plugin-prometheus_pushgateway' version '0.0.2'
2022-03-16 14:11:28 +0800 [info]: gem 'fluent-plugin-record-modifier' version '2.1.0'
2022-03-16 14:11:28 +0800 [info]: gem 'fluent-plugin-rewrite-tag-filter' version '2.3.0'
2022-03-16 14:11:28 +0800 [info]: gem 'fluent-plugin-s3' version '1.4.0'
2022-03-16 14:11:28 +0800 [info]: gem 'fluent-plugin-sd-dns' version '0.1.0'
2022-03-16 14:11:28 +0800 [info]: gem 'fluent-plugin-systemd' version '1.0.2'
2022-03-16 14:11:28 +0800 [info]: gem 'fluent-plugin-td' version '1.1.0'
2022-03-16 14:11:28 +0800 [info]: gem 'fluent-plugin-td-monitoring' version '0.2.4'
2022-03-16 14:11:28 +0800 [info]: gem 'fluent-plugin-webhdfs' version '1.3.1'
2022-03-16 14:11:28 +0800 [info]: gem 'fluentd' version '1.11.5'
2022-03-16 14:11:28 +0800 [info]: adding forwarding server '127.0.0.1:24224' host="127.0.0.1" port=24224 weight=50 plugin_id="object:3fd51ae2a534"
2022-03-16 14:11:28 +0800 [info]: using configuration file: <ROOT>
  <system>
    root_dir "/opt/apps/es_data/td-agent-root"
    log_level info
  </system>
  <source>
    @type tail
    path "/opt/apps/es_data/mock_data/*.dat"
    pos_file "/opt/apps/es_data/pos.log"
    path_key "log_path"
    tag "test-block"
    read_from_head true
    @label @label-filter
    <parse>
      @type "regexp"
      expression /^(?<@timestamp>[^ ]+) +(?<severity>[^ ]+) +(?<component>[^ ]+) +\[(?<context>.+)\] +(?<message>.*?)?$/
      types timeconsume:integer
      unmatched_lines 
    </parse>
  </source>
  <label @label-filter>
    <match **>
      @type copy
      <store>
        @type "forward"
        @log_level "info"
        heartbeat_type none
        <service>
          host "127.0.0.1"
          port 24224
          weight 50
        </service>
        <buffer tag,time>
          @type "file"
          path "/opt/apps/es_data/buffer/test-block"
          chunk_limit_size 10M
          queued_chunks_limit_size 1
          timekey 8h
          timekey_wait 0s
          flush_mode interval
          flush_interval 2s
          flush_thread_count 1
        </buffer>
      </store>
    </match>
  </label>
</ROOT>
2022-03-16 14:11:28 +0800 [info]: starting fluentd-1.11.5 pid=26900 ruby="2.4.10"
2022-03-16 14:11:28 +0800 [info]: spawn command to main:  cmdline=["/opt/td-agent/embedded/bin/ruby", "-Eascii-8bit:ascii-8bit", "/opt/td-agent/embedded/bin/fluentd", "--log", "/opt/apps/es_data/logs/td-agent.log", "--under-supervisor"]
2022-03-16 14:11:29 +0800 [info]: adding match in @label-filter pattern="**" type="copy"
2022-03-16 14:11:29 +0800 [info]: #0 adding forwarding server '127.0.0.1:24224' host="127.0.0.1" port=24224 weight=50 plugin_id="object:3ff8287d2d54"
2022-03-16 14:11:29 +0800 [info]: adding source type="tail"
2022-03-16 14:11:29 +0800 [info]: #0 starting fluentd worker pid=26911 ppid=26900 worker=0
2022-03-16 14:11:29 +0800 [info]: #0 following tail of /opt/apps/es_data/mock_data/mock.dat
2022-03-16 14:11:29 +0800 [info]: #0 Current Thread is 
2022-03-16 14:11:29 +0800 [info]: #0 Current Thread is 
2022-03-16 14:11:29 +0800 [info]: #0 Current Thread is 
2022-03-16 14:11:29 +0800 [info]: #0 Current Thread is 
... (too long to show; all omitted lines are "Current Thread is")
2022-03-16 14:11:40 +0800 [info]: #0 Current Thread is 
2022-03-16 14:11:40 +0800 [info]: #0 Current Thread is 
2022-03-16 14:11:40 +0800 [info]: #0 Current Thread is 
2022-03-16 14:11:40 +0800 [info]: #0 fluentd worker is now running worker=0
2022-03-16 14:13:08 +0800 [info]: Received graceful stop
2022-03-16 14:13:09 +0800 [info]: #0 fluentd worker is now stopping worker=0
2022-03-16 14:13:09 +0800 [info]: #0 shutting down fluentd worker worker=0
2022-03-16 14:13:09 +0800 [info]: #0 shutting down input plugin type=:tail plugin_id="object:3ff827835a2c"
2022-03-16 14:13:09 +0800 [info]: #0 shutting down output plugin type=:copy plugin_id="object:3ff827f86ee8"
2022-03-16 14:13:09 +0800 [info]: #0 shutting down output plugin type=:forward plugin_id="object:3ff8287d2d54"
2022-03-16 14:13:10 +0800 [info]: Worker 0 finished with status 0
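
A side note on the empty thread name in the log above: as far as I know, Ruby's Thread#name returns nil for a thread that was never given a name, and nil interpolates to an empty string. A minimal sketch, independent of fluentd (the name "event_loop" is set by hand here purely for illustration):

```ruby
# A thread that was never named: Thread#name is nil, so the message ends
# right after "is ".
UNNAMED_LINE = Thread.new { "Current Thread is #{Thread.current.name}" }.value

# A thread that names itself before building the message.
NAMED_LINE = Thread.new do
  Thread.current.name = "event_loop" # illustrative name, set manually
  "Current Thread is #{Thread.current.name}"
end.value

puts UNNAMED_LINE.inspect
puts NAMED_LINE.inspect
```

So an empty name only says that the calling thread is unnamed; it does not identify which thread that was.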

Log of the aggregator:

2022-03-16 14:11:15 +0800 [info]: parsing config file is succeeded path="/opt/apps/es_data/td-agent-forward-server.conf"
2022-03-16 14:11:15 +0800 [info]: gem 'fluent-plugin-elasticsearch' version '4.2.2'
2022-03-16 14:11:15 +0800 [info]: gem 'fluent-plugin-elasticsearch' version '2.12.5'
2022-03-16 14:11:15 +0800 [info]: gem 'fluent-plugin-flowcounter-simple' version '0.1.0'
2022-03-16 14:11:15 +0800 [info]: gem 'fluent-plugin-kafka' version '0.15.2'
2022-03-16 14:11:15 +0800 [info]: gem 'fluent-plugin-prometheus' version '1.8.5'
2022-03-16 14:11:15 +0800 [info]: gem 'fluent-plugin-prometheus_pushgateway' version '0.0.2'
2022-03-16 14:11:15 +0800 [info]: gem 'fluent-plugin-record-modifier' version '2.1.0'
2022-03-16 14:11:15 +0800 [info]: gem 'fluent-plugin-rewrite-tag-filter' version '2.3.0'
2022-03-16 14:11:15 +0800 [info]: gem 'fluent-plugin-s3' version '1.4.0'
2022-03-16 14:11:15 +0800 [info]: gem 'fluent-plugin-sd-dns' version '0.1.0'
2022-03-16 14:11:15 +0800 [info]: gem 'fluent-plugin-systemd' version '1.0.2'
2022-03-16 14:11:15 +0800 [info]: gem 'fluent-plugin-td' version '1.1.0'
2022-03-16 14:11:15 +0800 [info]: gem 'fluent-plugin-td-monitoring' version '0.2.4'
2022-03-16 14:11:15 +0800 [info]: gem 'fluent-plugin-webhdfs' version '1.3.1'
2022-03-16 14:11:15 +0800 [info]: gem 'fluentd' version '1.11.5'
2022-03-16 14:11:15 +0800 [info]: using configuration file: <ROOT>
  <system>
    root_dir "/opt/apps/es_data/td-agent-root"
    log_level info
  </system>
  <source>
    @type forward
    @id out_fwd
    bind "0.0.0.0"
    port 24224
  </source>
  <match test-**>
    @type file
    @id out_file
    path "/opt/apps/es_data/mock_data/out.${tag}.%Y-%m-%d_%H:%M:%S_%z"
    <buffer tag,time>
      @type "file"
      path "/opt/apps/es_data/buffer/forward"
      chunk_limit_size 100M
      queued_chunks_limit_size 2
      overflow_action block
      timekey 8h
      timekey_wait 0s
      timekey_zone "+0800"
      flush_mode interval
      flush_interval 5s
      flush_thread_count 1
    </buffer>
  </match>
</ROOT>
2022-03-16 14:11:15 +0800 [info]: starting fluentd-1.11.5 pid=26764 ruby="2.4.10"
2022-03-16 14:11:15 +0800 [info]: spawn command to main:  cmdline=["/opt/td-agent/embedded/bin/ruby", "-Eascii-8bit:ascii-8bit", "/opt/td-agent/embedded/bin/fluentd", "--log", "/opt/apps/es_data/logs/td-agent-forward-server.log", "--under-supervisor"]
2022-03-16 14:11:16 +0800 [info]: adding match pattern="caih-**" type="file"
2022-03-16 14:11:16 +0800 [info]: adding source type="forward"
2022-03-16 14:11:16 +0800 [info]: #0 starting fluentd worker pid=26775 ppid=26764 worker=0
2022-03-16 14:11:16 +0800 [info]: #0 [out_fwd] listening port port=24224 bind="0.0.0.0"
2022-03-16 14:11:16 +0800 [info]: #0 fluentd worker is now running worker=0
2022-03-16 14:11:29 +0800 [info]: #0 [out_fwd] feeder inner
2022-03-16 14:11:30 +0800 [info]: #0 [out_file] Current Thread is event_loop
2022-03-16 14:11:30 +0800 [info]: #0 [out_fwd] feeder inner
... Duplicated
2022-03-16 14:11:33 +0800 [info]: #0 [out_fwd] feeder inner
2022-03-16 14:11:34 +0800 [info]: #0 [out_file] Current Thread is event_loop
2022-03-16 14:11:34 +0800 [info]: #0 [out_fwd] feeder inner
2022-03-16 14:11:35 +0800 [info]: #0 [out_file] Current Thread is event_loop
================================================================
Look here: at this point the flush thread was running, and the loop took 10 seconds. During this time
there was no `feeder inner` in the log, even though `feeder inner` should be printed from the event_loop thread.

2022-03-16 14:11:35 +0800 [info]: #0 [out_file] before loop
2022-03-16 14:11:45 +0800 [info]: #0 [out_file] after loop
2022-03-16 14:11:45 +0800 [info]: #0 [out_fwd] feeder inner
2022-03-16 14:11:46 +0800 [info]: #0 [out_file] Current Thread is event_loop
================================================================
2022-03-16 14:11:46 +0800 [info]: #0 [out_file] before loop
2022-03-16 14:11:54 +0800 [info]: #0 [out_file] after loop
2022-03-16 14:11:54 +0800 [info]: #0 [out_fwd] feeder inner
... Duplicated
2022-03-16 14:12:46 +0800 [info]: #0 [out_file] before loop
2022-03-16 14:12:54 +0800 [info]: #0 [out_file] after loop
2022-03-16 14:12:56 +0800 [info]: #0 [out_file] before loop
2022-03-16 14:13:03 +0800 [info]: #0 [out_file] after loop
2022-03-16 14:13:12 +0800 [info]: Received graceful stop
2022-03-16 14:13:12 +0800 [info]: #0 fluentd worker is now stopping worker=0
2022-03-16 14:13:12 +0800 [info]: #0 shutting down fluentd worker worker=0
2022-03-16 14:13:12 +0800 [info]: #0 shutting down input plugin type=:forward plugin_id="out_fwd"
2022-03-16 14:13:12 +0800 [info]: #0 shutting down output plugin type=:file plugin_id="out_file"
2022-03-16 14:13:13 +0800 [info]: Worker 0 finished with status 0
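
The stall durations can be read off the timestamps. Below is a small sketch that pairs the `before loop` and `after loop` lines from an excerpt of the log above and prints how many seconds each flush took. The helper name loop_durations and the regular expression are my own and are not part of fluentd.

```ruby
require "time"

# Excerpt copied from the aggregator log above.
LOG = <<~LOG
  2022-03-16 14:11:35 +0800 [info]: #0 [out_file] before loop
  2022-03-16 14:11:45 +0800 [info]: #0 [out_file] after loop
  2022-03-16 14:11:45 +0800 [info]: #0 [out_fwd] feeder inner
  2022-03-16 14:11:46 +0800 [info]: #0 [out_file] Current Thread is event_loop
  2022-03-16 14:11:46 +0800 [info]: #0 [out_file] before loop
  2022-03-16 14:11:54 +0800 [info]: #0 [out_file] after loop
LOG

# Matches only the two marker lines added to out_file.rb.
LINE = /\A(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [+-]\d{4}) \[info\]: #0 \[out_file\] (before|after) loop\z/

# Returns the duration in whole seconds of each before/after pair.
def loop_durations(text)
  started = nil
  text.each_line(chomp: true).each_with_object([]) do |line, out|
    m = LINE.match(line)
    next unless m
    t = Time.strptime(m[1], "%Y-%m-%d %H:%M:%S %z")
    if m[2] == "before"
      started = t
    elsif started
      out << (t - started).to_i
      started = nil
    end
  end
end

DURATIONS = loop_durations(LOG)
puts DURATIONS.inspect
```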

According to the aggregator's log, I think the slow flush thread actually affects the event_loop thread and in turn causes in_forward to work abnormally. There is another reason I think the reproduction above resembles out_elasticsearch: out_elasticsearch also has a loop that maps the events in a chunk to the bulk request body, and I guess that is an IO-independent operation.

I am new to Ruby, and I was trying to confirm whether Ruby's thread switching mechanism causes this problem. I also wrote simple tests (including pure Ruby threads and interaction between Ruby and a C extension) to verify this, but I failed.

fujimotos · 2022-03-20T01:37:54Z

Maintainer

> I am new to Ruby, and I was trying to confirm whether Ruby's thread switching mechanism causes this problem.

Ruby is essentially single-threaded due to the infamous GIL (global interpreter lock).
For this reason, the while loop in your code:

    def write(chunk)
      path = extract_placeholders(@path_template, chunk)
      FileUtils.mkdir_p File.dirname(path), mode: @dir_perm

      # begin
      log.info "before loop"
      abc = 0
      # the loop is to slow down flush_thread, the slower, the easier to see the problem.
      while abc < 500000000
        # seems like io-independent operation cannot reproduce the problem
        # like sleep 10 or puts "something"
        abc += 1
      end
      log.info "after loop"
      # end

... will block other threads in Fluentd from running. You need to use multiple workers
to scale horizontally. For details, read:

https://docs.fluentd.org/deployment/multi-process-workers
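
To illustrate why separate worker processes help where extra threads do not, here is a rough sketch. It assumes CRuby on a Unix-like system (it uses fork) and is not Fluentd code; burn, elapsed, and the iteration count are hypothetical names and values chosen for the example. Each forked process has its own interpreter lock, so on a multi-core machine the process version usually finishes sooner, but the exact numbers depend on the machine.

```ruby
# CPU-bound work in pure Ruby; it never releases the interpreter lock voluntarily.
def burn(n)
  i = 0
  i += 1 while i < n
  i
end

# Wall-clock seconds taken by the given block.
def elapsed
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

N = 5_000_000 # illustrative amount of work

# Two threads in one process share a single interpreter lock.
THREADS_TIME = elapsed do
  2.times.map { Thread.new { burn(N) } }.each(&:join)
end

# Two child processes each get their own interpreter lock.
PROCS_TIME = elapsed do
  pids = 2.times.map do
    fork do
      burn(N)
      exit!(0) # leave the child immediately, skipping at_exit handlers
    end
  end
  pids.each { |pid| Process.wait(pid) }
end

puts format("2 threads:   %.2fs", THREADS_TIME)
puts format("2 processes: %.2fs", PROCS_TIME)
```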

frankcrc Mar 21, 2022
Author

Thanks @fujimotos.
Maybe I should learn more about Ruby before digging into this problem. I had seen the GIL mentioned before. I wrote a test that contains a Ruby thread and a C extension, but I could not make the loop in the Ruby thread block the C extension function.

Here is my test code.
Ruby code:

require "ruby_thread_test/test/version"
require 'thread'

require 'ruby_thread_test/ext/mycext'
include MyCExt

module Ruby_thread_test
    class TestClass
      def testFunc
        puts "hello in TestClass"
      end

      def testFunc2(a)
        puts "hello in TestClass, a = #{a}"
      end

    end

    a = 0
    b = 0
    c = 500000000

    t = TestClass.new

    t1 = Thread.new do
      while a < c
        a += 1
      end
      puts "[thread1] finish"
    end

    t2 = Thread.new do
      # delay calling function in c extension
      sleep 0.5

      while b < 5
        b += 1
        puts "[thread2] out"

        test4 3, t do |idx|
          puts "idxf = #{idx}"
        end
      end
      puts "[thread2] finish"
    end

    t1.join
    t2.join
end

C extension code:

#include "ruby.h"

// Defining a space for information and references about the module to be stored internally
VALUE MyTest = Qnil;

// Prototype for the initialization method - Ruby calls this, not you
void Init_mycext();

VALUE method_test4(VALUE self, VALUE c, VALUE obj);

// The initialization method for this module
void Init_mycext() {
	MyTest = rb_define_module("MyCExt");
	rb_define_method(MyTest, "test4", method_test4, 2);
}

VALUE method_test4(VALUE self, VALUE c, VALUE obj)
{
    int count = NUM2INT(c);
    for (int i = 0; i < count; ++ i) {
        rb_funcall(obj, rb_intern("testFunc"), 0);
        rb_funcall(obj, rb_intern("testFunc2"), 1, INT2NUM(10));
        rb_yield_values(1, INT2NUM(i));
    }
    return INT2NUM(0);
}

I expected that method_test4 would not be called until thread 1 was done. But the result is that method_test4 does get called, which means thread 2 is not blocked.
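
For what it's worth, this result seems consistent with how CRuby schedules threads, as far as I understand it: the GIL lets only one thread run Ruby code at a time, but the interpreter periodically switches between runnable threads, so a busy loop delays other threads rather than stopping them completely. A minimal pure-Ruby sketch of that behaviour (measure_ticker_under_cpu_load is a made-up helper name; the timings vary by machine and Ruby version):

```ruby
# Spin on the CPU for `seconds` while a second thread tries to wake up
# every 10 ms. Returns [number of wake-ups, longest gap between wake-ups].
def measure_ticker_under_cpu_load(seconds)
  clock = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }
  done = false
  ticks = 0
  max_gap = 0.0

  ticker = Thread.new do
    last = clock.call
    loop do
      sleep 0.01 # releases the interpreter lock while sleeping
      now = clock.call
      max_gap = [max_gap, now - last].max
      last = now
      ticks += 1
      break if done
    end
  end

  spinner = Thread.new do
    deadline = clock.call + seconds
    n = 0
    n += 1 while clock.call < deadline # busy loop that keeps the lock between switches
    done = true
  end

  spinner.join
  ticker.join
  [ticks, max_gap]
end

TICKS, MAX_GAP = measure_ticker_under_cpu_load(0.5)
puts "ticker woke up #{TICKS} times, longest gap #{(MAX_GAP * 1000).round} ms"
```

If the longest gap is much larger than the 10 ms sleep, the ticker thread was waiting for the busy thread to give up the lock, which may be the same kind of delay the event_loop thread sees while the flush thread is looping.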
