Logs missing during heavy log volume #4693

Open
jicowan opened this issue Nov 3, 2024 · 19 comments
Labels: bug (Something isn't working)

jicowan commented Nov 3, 2024

Describe the bug

Under heavy log volume (e.g. more than 10k log entries per second), fluentd consistently drops logs. The loss may be related to log rotation on Kubernetes. When I ran a load test, I saw the following entries in the fluentd logs:

2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 101712298) because an existing watcher for that filepath follows a different inode: 101712295 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-mzxxh_default_logger-8bb9a8d2eb65d5c07af7e194aad99176a79941a69c06b6ae390a0d8b9dd06cf1.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 97581155) because an existing watcher for that filepath follows a different inode: 97581154 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-nrq45_default_logger-2bad2e8722fb2369996c134f02dcf4a2fff8068d43863d3f7173a56ff2a8bbd0.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 111149786) because an existing watcher for that filepath follows a different inode: 111149782 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-p4rcl_default_logger-88fb9eaab07505f6d59f03e48e2993069eba82902efe44a46098c0d7d44f24c4.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 77634742) because an existing watcher for that filepath follows a different inode: 77634741 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-ps45w_default_logger-90f54592392569f72662a2dacfdca239a907c1da4c1729f7a75bb50f56bc9663.log"
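
The warnings above say that the inode behind a path changed while a watcher was still attached to the old one. Here is a minimal sketch of how that can happen, assuming rotation works by renaming the current file and creating a new one at the same path (an assumption about the node's rotation mechanism, not something the logs confirm). The file names are hypothetical.

```python
import os
import tempfile


def inode_of(path):
    """Return the inode number currently behind `path`."""
    return os.stat(path).st_ino


with tempfile.TemporaryDirectory() as log_dir:
    path = os.path.join(log_dir, "app.log")  # hypothetical log file
    with open(path, "w") as f:
        f.write("line 1\n")
    inode_before = inode_of(path)

    # Rotate: move the current file aside, then create a fresh file at the same path.
    os.rename(path, path + ".1")
    with open(path, "w") as f:
        f.write("line 2\n")

    inode_after = inode_of(path)
    rotated_inode = inode_of(path + ".1")

# The path now points at a new inode; the old inode lives on under the rotated name.
print("path changed inode:", inode_before != inode_after)
print("rotated file kept old inode:", rotated_inode == inode_before)
```

A tailer that tracks files by path alone has to notice this switch quickly. If rotations come faster than it can finish reading the old inode, entries at the end of the rotated file can be missed.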

When I added follow_inodes true and rotate_wait 0 to the in_tail sources in the container configuration, those warnings went away, but large chunks of logs were still missing and the following entries appeared in the fluentd logs:

2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-hw4ds_default_logger-aba43bbd009d1652e1961dbd30ed45f09e337bfb42d3fa247b12fde7af248909.log failed. Continuing without tailing it.
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-jtxmz_default_logger-742ba4e5339168b7b5442745705bbfed1d93c832027ca0c680b193c9c62e796f.log failed. Continuing without tailing it.
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-kmrlv_default_logger-7682a4b64550055203e19ff9387b686e316fe4e5e7884b720dede3692659c686.log failed. Continuing without tailing it.

I am running the latest version of the fluentd kubernetes daemonset for cloudwatch, fluent/fluentd-kubernetes-daemonset:v1.17.1-debian-cloudwatch-1.2.

During the test, both memory and CPU utilization for fluentd remained fairly low.

To Reproduce

Run multiple replicas of the following program:

import multiprocessing
import os
import time
import random
import sys
from datetime import datetime


def generate_log_entry():
    log_levels = ['INFO', 'WARNING', 'ERROR', 'DEBUG']
    messages = [
        'User logged in',
        'Database connection established',
        'File not found',
        'Memory usage high',
        'Network latency detected',
        'Cache cleared',
        'API request successful',
        'Configuration updated'
    ]

    timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]
    level = random.choice(log_levels)
    message = random.choice(messages)
    pod = os.getenv("POD_NAME", "unknown")

    return f"{timestamp} {pod} [{level}] {message}"


def worker(queue):
    while True:
        log_entry = generate_log_entry()
        queue.put(log_entry)
        time.sleep(0.01)  # Small delay to prevent overwhelming the system


def logger(queue, counter):
    while True:
        log_entry = queue.get()
        with counter.get_lock():
            counter.value += 1
            seq = counter.value  # read under the lock so the printed number matches this entry
        print(f"[{seq}] {log_entry}", flush=True)


if __name__ == '__main__':
    num_processes = multiprocessing.cpu_count()

    manager = multiprocessing.Manager()
    log_queue = manager.Queue()

    # Create a shared counter
    counter = multiprocessing.Value('i', 0)

    # Start worker processes
    workers = []
    for _ in range(num_processes - 1):  # Reserve one process for logging
        p = multiprocessing.Process(target=worker, args=(log_queue,))
        p.start()
        workers.append(p)

    # Start logger process
    logger_process = multiprocessing.Process(target=logger, args=(log_queue, counter))
    logger_process.start()

    try:
        # Keep the main process running
        while True:
            time.sleep(1)
            # Print the current count every second
            print(f"Total logs emitted: {counter.value}", file=sys.stderr, flush=True)
    except KeyboardInterrupt:
        print("\nStopping log generation...", file=sys.stderr)

        # Stop worker processes
        for p in workers:
            p.terminate()
            p.join()

        # Stop logger process
        logger_process.terminate()
        logger_process.join()

        print(f"Log generation stopped. Total logs emitted: {counter.value}", file=sys.stderr)
        sys.exit(0)
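
For sizing the load test, note that each worker in the program above sleeps 10 ms per entry, so one worker emits at most about 100 entries per second. The sketch below turns that into a rough upper bound per replica and an estimate of the replicas needed for a target rate. The helper names are hypothetical. It ignores queue and print overhead, so real rates will be lower. `multiprocessing.cpu_count()` reports the host's CPUs rather than the container's CPU limit, so the process count depends on the node.

```python
import math


def max_entries_per_second(num_processes, entries_per_worker_per_second=100):
    """Upper bound for one replica: one process logs, the rest produce entries."""
    workers = max(num_processes - 1, 0)
    return workers * entries_per_worker_per_second


def replicas_needed(target_rate, num_processes, entries_per_worker_per_second=100):
    """Smallest replica count whose combined upper bound reaches target_rate."""
    per_replica = max_entries_per_second(num_processes, entries_per_worker_per_second)
    if per_replica == 0:
        raise ValueError("need at least two processes to produce any entries")
    return math.ceil(target_rate / per_replica)


# Example: a node where cpu_count() returns 4 gives 3 workers per replica.
print(max_entries_per_second(4))
print(replicas_needed(10_000, 4))
```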

Here's the deployment for the test application:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: logger-deployment
  labels:
    app: logger
spec:
  replicas: 1  # Adjust the number of replicas as needed
  selector:
    matchLabels:
      app: logger
  template:
    metadata:
      labels:
        app: logger
    spec:
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - logger
              topologyKey: "kubernetes.io/hostname"
      containers:
      - name: logger
        image: jicowan/logger:v3.0
        resources:
          requests:
            cpu: 4
            memory: 128Mi
          limits:
            cpu: 4
            memory: 256Mi
        env:
          - name: POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name

Here's the containers.conf file for fluentd:

<source>
      @type tail
      @id in_tail_container_core_logs
      @label @raw.containers
      @log_level debug
      path /var/log/containers/*fluentd-cloudwatch*.log,/var/log/containers/*aws-node*.log,/var/log/containers/*kube-proxy*.log,/var/log/containers/*kube-system*.log,/var/log/containers/cloudwatch-agent*.log,/var/log/containers/policy-manager*.log,/var/log/containers/*private-ca*.log,/var/log/containers/metrics-server*.log,/var/log/containers/rbac-controller*.log,/var/log/containers/cluster-autoscaler*.log,/var/log/containers/cwagent*.log,/var/log/containers/*prometheus*.log,/var/log/containers/*nginx*.log,/var/log/containers/*kube-state*.log
      pos_file /var/log/fluentd-core-containers.log.pos
      tag corecontainers.**
      read_from_head true
      follow_inodes true
      rotate_wait 0
      <parse>
        @type "#{ENV['FLUENT_CONTAINER_TAIL_PARSER_TYPE'] || 'json'}"
        time_format %Y-%m-%dT%H:%M:%S.%N%:z
      </parse>
    </source>
    <source>
      @type tail
      @id in_tail_container_logs
      @label @raw.containers
      path /var/log/containers/*.log
      exclude_path /var/log/containers/*aws-node*.log,/var/log/containers/*coredns*.log,/var/log/containers/*kube-proxy*.log,/var/log/containers/*kube-system*.log,/var/log/containers/cloudwatch-agent*.log,/var/log/containers/policy-manager*.log,/var/log/containers/*private-ca*.log,/var/log/containers/metrics-server*.log,/var/log/containers/rbac-controller*.log,/var/log/containers/cluster-autoscaler*.log,/var/log/containers/cwagent*.log,/var/log/containers/*prometheus*.log,/var/log/containers/*nginx*.log,/var/log/containers/*opa*.log,/var/log/containers/*fluentd-cloudwatch*.log,/var/log/containers/*datadog-agent*.log,/var/log/containers/*kube-state-metrics*.log,/var/log/containers/*ebs-csi-node*.log,/var/log/containers/*ebs-csi-controller*.log,/var/log/containers/*fsx-csi-node*.log,/var/log/containers/*calico-node*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag container.**
      read_from_head true
      follow_inodes true
      rotate_wait 0
      <parse>
        @type "#{ENV['FLUENT_CONTAINER_TAIL_PARSER_TYPE'] || 'json'}"
        time_format %Y-%m-%dT%H:%M:%S.%N%:z
      </parse>
    </source>
    <source>
      @type tail
      @id in_tail_daemonset_logs
      @label @containers
      path /var/log/containers/*opa*.log,/var/log/containers/*datadog-agent*.log,/var/log/containers/*ebs-csi-node*.log,/var/log/containers/*ebs-csi-controller*.log,/var/log/containers/*fsx-csi-node*.log,/var/log/containers/*calico-node*.log
      pos_file /var/log/daemonset.log.pos
      tag daemonset.**
      read_from_head true
      follow_inodes true
      rotate_wait 0
      <parse>
        @type "#{ENV['FLUENT_CONTAINER_TAIL_PARSER_TYPE'] || 'json'}"
        time_format %Y-%m-%dT%H:%M:%S.%N%:z
      </parse>
    </source>
    <label @raw.containers>
      <match **>
        @id raw.detect_exceptions
        @type detect_exceptions
        remove_tag_prefix raw
        @label @containers
        multiline_flush_interval 1s
        max_bytes 500000
        max_lines 1000
      </match>
    </label>
    <label @containers>
      <filter corecontainers.**>
        @type prometheus
        <metric>
          name fluentd_input_status_num_corecontainer_records_total
          type counter
          desc The total number of incoming corecontainer records
        </metric>
      </filter>
      <filter container.**>
        @type prometheus
        <metric>
          name fluentd_input_status_num_container_records_total
          type counter
          desc The total number of incoming container records
        </metric>
      </filter>
      <filter daemonset.**>
        @type prometheus
        <metric>
          name fluentd_input_status_num_daemonset_records_total
          type counter
          desc The total number of incoming daemonset records
        </metric>
      </filter>
      <filter **>
        @type record_transformer
        @id filter_containers_stream_transformer
        <record>
          seal_id "110628"
          cluster_name "logging"
          stream_name ${tag_parts[4]}
        </record>
      </filter>
      <filter **>
        @type kubernetes_metadata
        @id filter_kube_metadata
        @log_level error
      </filter>
      <match corecontainers.**>
        @type copy
        <store>
          @type prometheus
          <metric>
            name fluentd_output_status_num_corecontainer_records_total
            type counter
            desc The total number of outgoing corecontainer records
          </metric>
        </store>
        <store>
          @type cloudwatch_logs
          @id out_cloudwatch_logs_core_containers
          region "us-west-2"
          log_group_name "/aws/eks/logging/core-containers"
          log_stream_name_key stream_name
          remove_log_stream_name_key true
          auto_create_stream true
          <inject>
              time_key time_nanoseconds
              time_type string
              time_format %Y-%m-%dT%H:%M:%S.%N
          </inject>
          <buffer>
            flush_interval 5s
            chunk_limit_size 2m
            queued_chunks_limit_size 32
            retry_forever true
          </buffer>
        </store>
      </match>
      <match container.**>
        @type copy
        <store>
          @type prometheus
          <metric>
            name fluentd_output_status_num_container_records_total
            type counter
            desc The total number of outgoing container records
          </metric>
        </store>
        <store>
          @type cloudwatch_logs
          @id out_cloudwatch_logs_containers
          region "us-west-2"
          log_group_name "/aws/eks/logging/containers"
          log_stream_name_key stream_name
          remove_log_stream_name_key true
          auto_create_stream true
          <inject>
              time_key time_nanoseconds
              time_type string
              time_format %Y-%m-%dT%H:%M:%S.%N
          </inject>
          <buffer>
            flush_interval 5s
            chunk_limit_size 2m
            queued_chunks_limit_size 32
            retry_forever true
          </buffer>
        </store>
      </match>
      <match daemonset.**>
        @type copy
        <store>
          @type prometheus
          <metric>
            name fluentd_output_status_num_daemonset_records_total
            type counter
            desc The total number of outgoing daemonset records
          </metric>
        </store>
        <store>
          @type cloudwatch_logs
          @id out_cloudwatch_logs_daemonset
          region "us-west-2"
          log_group_name "/aws/eks/logging/daemonset"
          log_stream_name_key stream_name
          remove_log_stream_name_key true
          auto_create_stream true
          <inject>
              time_key time_nanoseconds
              time_type string
              time_format %Y-%m-%dT%H:%M:%S.%N
          </inject>
          <buffer>
            flush_interval 5s
            chunk_limit_size 2m
            queued_chunks_limit_size 32
            retry_forever true
          </buffer>
        </store>
      </match>
    </label>

Expected behavior

The test application assigns a sequence number to each log entry. I have a Python notebook that flattens the JSON log output, sorts the logs by sequence number, then finds gaps in the sequence. This is how I know that fluentd is dropping logs. If everything were working correctly, there would be no gaps and no log loss.
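
The notebook itself is not included here. The following is only a minimal sketch of the gap check described above, not the notebook's actual code. It assumes each delivered record is a dict whose `log` field holds the raw line printed by the test program (the field name and the sample records are assumptions), and it groups by pod because every replica keeps its own counter.

```python
import re
from collections import defaultdict

# "[<seq>] <date> <time> <pod> [<level>] <message>", as printed by the test program.
LINE_RE = re.compile(r"^\[(\d+)\] \S+ \S+ (\S+) \[")


def find_gaps_per_pod(records, field="log"):
    """Map each pod name to a list of (first_missing, last_missing) ranges."""
    seqs_by_pod = defaultdict(set)
    for record in records:
        match = LINE_RE.match(record.get(field, ""))
        if match:  # lines without a sequence prefix are skipped
            seqs_by_pod[match.group(2)].add(int(match.group(1)))

    gaps = {}
    for pod, seqs in seqs_by_pod.items():
        ordered = sorted(seqs)
        gaps[pod] = [
            (prev + 1, cur - 1)
            for prev, cur in zip(ordered, ordered[1:])
            if cur - prev > 1
        ]
    return gaps


# Hypothetical sample records; pod-a is missing sequence numbers 3 and 4.
sample = [
    {"log": "[1] 2024-11-02 14:06:36.001 pod-a [INFO] Cache cleared\n"},
    {"log": "[2] 2024-11-02 14:06:36.011 pod-a [DEBUG] User logged in\n"},
    {"log": "[5] 2024-11-02 14:06:36.041 pod-a [ERROR] File not found\n"},
    {"log": "[1] 2024-11-02 14:06:36.002 pod-b [INFO] Cache cleared\n"},
    {"log": "[2] 2024-11-02 14:06:36.012 pod-b [INFO] Cache cleared\n"},
]
print(find_gaps_per_pod(sample))
```

This only finds holes between delivered entries. Loss before the first or after the last delivered sequence number would need the emitted total from the program's stderr output for comparison.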

I ran the same tests with Fluent Bit and experienced no log loss.

Your Environment

- Fluentd version: v1.17.1
- Package version:
- Operating system: Amazon Linux 2
- Kernel version: 5.10.225-213.878.amzn2.x86_64

Your Configuration

data:
  containers.conf: |-
    <source>
          @type tail
          @id in_tail_container_core_logs
          @label @raw.containers
          @log_level debug
          path /var/log/containers/*fluentd-cloudwatch*.log,/var/log/containers/*aws-node*.log,/var/log/containers/*kube-proxy*.log,/var/log/containers/*kube-system*.log,/var/log/containers/cloudwatch-agent*.log,/var/log/containers/policy-manager*.log,/var/log/containers/*private-ca*.log,/var/log/containers/metrics-server*.log,/var/log/containers/rbac-controller*.log,/var/log/containers/cluster-autoscaler*.log,/var/log/containers/cwagent*.log,/var/log/containers/*prometheus*.log,/var/log/containers/*nginx*.log,/var/log/containers/*kube-state*.log
          pos_file /var/log/fluentd-core-containers.log.pos
          tag corecontainers.**
          read_from_head true
          follow_inodes true
          rotate_wait 0
          <parse>
            @type "#{ENV['FLUENT_CONTAINER_TAIL_PARSER_TYPE'] || 'json'}"
            time_format %Y-%m-%dT%H:%M:%S.%N%:z
          </parse>
        </source>
        <source>
          @type tail
          @id in_tail_container_logs
          @label @raw.containers
          path /var/log/containers/*.log
          exclude_path /var/log/containers/*aws-node*.log,/var/log/containers/*coredns*.log,/var/log/containers/*kube-proxy*.log,/var/log/containers/*kube-system*.log,/var/log/containers/cloudwatch-agent*.log,/var/log/containers/policy-manager*.log,/var/log/containers/*private-ca*.log,/var/log/containers/metrics-server*.log,/var/log/containers/rbac-controller*.log,/var/log/containers/cluster-autoscaler*.log,/var/log/containers/cwagent*.log,/var/log/containers/*prometheus*.log,/var/log/containers/*nginx*.log,/var/log/containers/*opa*.log,/var/log/containers/*fluentd-cloudwatch*.log,/var/log/containers/*datadog-agent*.log,/var/log/containers/*kube-state-metrics*.log,/var/log/containers/*ebs-csi-node*.log,/var/log/containers/*ebs-csi-controller*.log,/var/log/containers/*fsx-csi-node*.log,/var/log/containers/*calico-node*.log
          pos_file /var/log/fluentd-containers.log.pos
          tag container.**
          read_from_head true
          follow_inodes true
          rotate_wait 0
          <parse>
            @type "#{ENV['FLUENT_CONTAINER_TAIL_PARSER_TYPE'] || 'json'}"
            time_format %Y-%m-%dT%H:%M:%S.%N%:z
          </parse>
        </source>
        <source>
          @type tail
          @id in_tail_daemonset_logs
          @label @containers
          path /var/log/containers/*opa*.log,/var/log/containers/*datadog-agent*.log,/var/log/containers/*ebs-csi-node*.log,/var/log/containers/*ebs-csi-controller*.log,/var/log/containers/*fsx-csi-node*.log,/var/log/containers/*calico-node*.log
          pos_file /var/log/daemonset.log.pos
          tag daemonset.**
          read_from_head true
          follow_inodes true
          rotate_wait 0
          <parse>
            @type "#{ENV['FLUENT_CONTAINER_TAIL_PARSER_TYPE'] || 'json'}"
            time_format %Y-%m-%dT%H:%M:%S.%N%:z
          </parse>
        </source>
        <label @raw.containers>
          <match **>
            @id raw.detect_exceptions
            @type detect_exceptions
            remove_tag_prefix raw
            @label @containers
            multiline_flush_interval 1s
            max_bytes 500000
            max_lines 1000
          </match>
        </label>
        <label @containers>
          <filter corecontainers.**>
            @type prometheus
            <metric>
              name fluentd_input_status_num_corecontainer_records_total
              type counter
              desc The total number of incoming corecontainer records
            </metric>
          </filter>
          <filter container.**>
            @type prometheus
            <metric>
              name fluentd_input_status_num_container_records_total
              type counter
              desc The total number of incoming container records
            </metric>
          </filter>
          <filter daemonset.**>
            @type prometheus
            <metric>
              name fluentd_input_status_num_daemonset_records_total
              type counter
              desc The total number of incoming daemonset records
            </metric>
          </filter>
          <filter **>
            @type record_transformer
            @id filter_containers_stream_transformer
            <record>
              seal_id "110628"
              cluster_name "logging"
              stream_name ${tag_parts[4]}
            </record>
          </filter>
          <filter **>
            @type kubernetes_metadata
            @id filter_kube_metadata
            @log_level error
          </filter>
          <match corecontainers.**>
            @type copy
            <store>
              @type prometheus
              <metric>
                name fluentd_output_status_num_corecontainer_records_total
                type counter
                desc The total number of outgoing corecontainer records
              </metric>
            </store>
            <store>
              @type cloudwatch_logs
              @id out_cloudwatch_logs_core_containers
              region "us-west-2"
              log_group_name "/aws/eks/logging/core-containers"
              log_stream_name_key stream_name
              remove_log_stream_name_key true
              auto_create_stream true
              <inject>
                  time_key time_nanoseconds
                  time_type string
                  time_format %Y-%m-%dT%H:%M:%S.%N
              </inject>
              <buffer>
                flush_interval 5s
                chunk_limit_size 2m
                queued_chunks_limit_size 32
                retry_forever true
              </buffer>
            </store>
          </match>
          <match container.**>
            @type copy
            <store>
              @type prometheus
              <metric>
                name fluentd_output_status_num_container_records_total
                type counter
                desc The total number of outgoing container records
              </metric>
            </store>
            <store>
              @type cloudwatch_logs
              @id out_cloudwatch_logs_containers
              region "us-west-2"
              log_group_name "/aws/eks/logging/containers"
              log_stream_name_key stream_name
              remove_log_stream_name_key true
              auto_create_stream true
              <inject>
                  time_key time_nanoseconds
                  time_type string
                  time_format %Y-%m-%dT%H:%M:%S.%N
              </inject>
              <buffer>
                flush_interval 5s
                chunk_limit_size 2m
                queued_chunks_limit_size 32
                retry_forever true
              </buffer>
            </store>
          </match>
          <match daemonset.**>
            @type copy
            <store>
              @type prometheus
              <metric>
                name fluentd_output_status_num_daemonset_records_total
                type counter
                desc The total number of outgoing daemonset records
              </metric>
            </store>
            <store>
              @type cloudwatch_logs
              @id out_cloudwatch_logs_daemonset
              region "us-west-2"
              log_group_name "/aws/eks/logging/daemonset"
              log_stream_name_key stream_name
              remove_log_stream_name_key true
              auto_create_stream true
              <inject>
                  time_key time_nanoseconds
                  time_type string
                  time_format %Y-%m-%dT%H:%M:%S.%N
              </inject>
              <buffer>
                flush_interval 5s
                chunk_limit_size 2m
                queued_chunks_limit_size 32
                retry_forever true
              </buffer>
            </store>
          </match>
        </label>
  fluent.conf: |
    @include containers.conf
    @include systemd.conf
    @include host.conf

    <match fluent.**>
      @type null
    </match>
  host.conf: |
    <source>
      @type tail
      @id in_tail_dmesg
      @label @hostlogs
      path /var/log/dmesg
      pos_file /var/log/dmesg.log.pos
      tag host.dmesg
      read_from_head true
      <parse>
        @type syslog
      </parse>
    </source>

    <source>
      @type tail
      @id in_tail_secure
      @label @hostlogs
      path /var/log/secure
      pos_file /var/log/secure.log.pos
      tag host.secure
      read_from_head true
      <parse>
        @type syslog
      </parse>
    </source>

    <source>
      @type tail
      @id in_tail_messages
      @label @hostlogs
      path /var/log/messages
      pos_file /var/log/messages.log.pos
      tag host.messages
      read_from_head true
      <parse>
        @type syslog
      </parse>
    </source>

    <label @hostlogs>
      <filter **>
        @type kubernetes_metadata
        @id filter_kube_metadata_host
        watch false
      </filter>

      <filter **>
        @type record_transformer
        @id filter_containers_stream_transformer_host
        <record>
          stream_name ${tag}-${record["host"]}
        </record>
      </filter>

      <match host.**>
        @type cloudwatch_logs
        @id out_cloudwatch_logs_host_logs
        region "#{ENV.fetch('AWS_REGION')}"
        log_group_name "/aws/containerinsights/#{ENV.fetch('CLUSTER_NAME')}/host"
        log_stream_name_key stream_name
        remove_log_stream_name_key true
        auto_create_stream true
        <buffer>
          flush_interval 5
          chunk_limit_size 2m
          queued_chunks_limit_size 32
          retry_forever true
        </buffer>
      </match>
    </label>
  kubernetes.conf: |
    kubernetes.conf
  systemd.conf: |
    <source>
      @type systemd
      @id in_systemd_kubelet
      @label @systemd
      filters [{ "_SYSTEMD_UNIT": "kubelet.service" }]
      <entry>
        field_map {"MESSAGE": "message", "_HOSTNAME": "hostname", "_SYSTEMD_UNIT": "systemd_unit"}
        field_map_strict true
      </entry>
      path /var/log/journal
      <storage>
        @type local
        persistent true
        path /var/log/fluentd-journald-kubelet-pos.json
      </storage>
      read_from_head true
      tag kubelet.service
    </source>

    <source>
      @type systemd
      @id in_systemd_kubeproxy
      @label @systemd
      filters [{ "_SYSTEMD_UNIT": "kubeproxy.service" }]
      <entry>
        field_map {"MESSAGE": "message", "_HOSTNAME": "hostname", "_SYSTEMD_UNIT": "systemd_unit"}
        field_map_strict true
      </entry>
      path /var/log/journal
      <storage>
        @type local
        persistent true
        path /var/log/fluentd-journald-kubeproxy-pos.json
      </storage>
      read_from_head true
      tag kubeproxy.service
    </source>

    <source>
      @type systemd
      @id in_systemd_docker
      @label @systemd
      filters [{ "_SYSTEMD_UNIT": "docker.service" }]
      <entry>
        field_map {"MESSAGE": "message", "_HOSTNAME": "hostname", "_SYSTEMD_UNIT": "systemd_unit"}
        field_map_strict true
      </entry>
      path /var/log/journal
      <storage>
        @type local
        persistent true
        path /var/log/fluentd-journald-docker-pos.json
      </storage>
      read_from_head true
      tag docker.service
    </source>

    <label @systemd>
      <filter **>
        @type kubernetes_metadata
        @id filter_kube_metadata_systemd
        watch false
      </filter>

      <filter **>
        @type record_transformer
        @id filter_systemd_stream_transformer
        <record>
          stream_name ${tag}-${record["hostname"]}
        </record>
      </filter>

      <match **>
        @type cloudwatch_logs
        @id out_cloudwatch_logs_systemd
        region "#{ENV.fetch('AWS_REGION')}"
        log_group_name "/aws/containerinsights/#{ENV.fetch('CLUSTER_NAME')}/dataplane"
        log_stream_name_key stream_name
        auto_create_stream true
        remove_log_stream_name_key true
        <buffer>
          flush_interval 5
          chunk_limit_size 2m
          queued_chunks_limit_size 32
          retry_forever true
        </buffer>
      </match>
    </label>

Your Error Log

2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 101712298) because an existing watcher for that filepath follows a different inode: 101712295 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-mzxxh_default_logger-8bb9a8d2eb65d5c07af7e194aad99176a79941a69c06b6ae390a0d8b9dd06cf1.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 97581155) because an existing watcher for that filepath follows a different inode: 97581154 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-nrq45_default_logger-2bad2e8722fb2369996c134f02dcf4a2fff8068d43863d3f7173a56ff2a8bbd0.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 111149786) because an existing watcher for that filepath follows a different inode: 111149782 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-p4rcl_default_logger-88fb9eaab07505f6d59f03e48e2993069eba82902efe44a46098c0d7d44f24c4.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 77634742) because an existing watcher for that filepath follows a different inode: 77634741 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-ps45w_default_logger-90f54592392569f72662a2dacfdca239a907c1da4c1729f7a75bb50f56bc9663.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 101712298) because an existing watcher for that filepath follows a different inode: 101712295 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-mzxxh_default_logger-8bb9a8d2eb65d5c07af7e194aad99176a79941a69c06b6ae390a0d8b9dd06cf1.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 97581155) because an existing watcher for that filepath follows a different inode: 97581154 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-nrq45_default_logger-2bad2e8722fb2369996c134f02dcf4a2fff8068d43863d3f7173a56ff2a8bbd0.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 111149786) because an existing watcher for that filepath follows a different inode: 111149782 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-p4rcl_default_logger-88fb9eaab07505f6d59f03e48e2993069eba82902efe44a46098c0d7d44f24c4.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 77634742) because an existing watcher for that filepath follows a different inode: 77634741 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-ps45w_default_logger-90f54592392569f72662a2dacfdca239a907c1da4c1729f7a75bb50f56bc9663.log"

2024-11-02 14:15:49 +0000 [warn]: #0 [in_tail_container_logs] Skip update_watcher because watcher has been already updated by other inotify event path="/var/log/containers/logger-deployment-57cc6745c7-ps45w_default_logger-90f54592392569f72662a2dacfdca239a907c1da4c1729f7a75bb50f56bc9663.log" inode=77634746 inode_in_pos_file=77634747

***After setting rotate_wait 0 and follow_inodes true***
2024-11-02 17:26:28 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/logger-deployment-57cc6745c7-ckckh_default_logger-1c52a92c2e1ef377d9b6c95dc693b86645d93fbfcf13832ee5337cc9ab201b0b.log
2024-11-02 17:26:32 +0000 [info]: #0 [in_tail_container_logs] detected rotation of /var/log/containers/logger-deployment-57cc6745c7-ckckh_default_logger-1c52a92c2e1ef377d9b6c95dc693b86645d93fbfcf13832ee5337cc9ab201b0b.log; waiting 0.0 seconds
2024-11-02 17:26:32 +0000 [warn]: #0 [in_tail_container_logs] Skip update_watcher because watcher has been already updated by other inotify event path="/var/log/containers/logger-deployment-57cc6745c7-ckckh_default_logger-1c52a92c2e1ef377d9b6c95dc693b86645d93fbfcf13832ee5337cc9ab201b0b.log" inode=152064097 inode_in_pos_file=0
2024-11-02 17:26:32 +0000 [info]: #0 [in_tail_container_logs] detected rotation of /var/log/containers/logger-deployment-57cc6745c7-ckckh_default_logger-1c52a92c2e1ef377d9b6c95dc693b86645d93fbfcf13832ee5337cc9ab201b0b.log; waiting 0.0 seconds
2024-11-02 17:26:32 +0000 [warn]: #0 [in_tail_container_logs] Skip update_watcher because watcher has been already updated by other inotify event path="/var/log/containers/logger-deployment-57cc6745c7-ckckh_default_logger-1c52a92c2e1ef377d9b6c95dc693b86645d93fbfcf13832ee5337cc9ab201b0b.log" inode=152064099 inode_in_pos_file=0
2024-11-02 17:26:32 +0000 [info]: #0 [in_tail_container_logs] detected rotation of /var/log/containers/logger-deployment-57cc6745c7-zwzxv_default_logger-af33706631b5c04250aa71c6956fde092559f09f8891e007dd8d454b12e89135.log; waiting 0.0 seconds
2024-11-02 17:26:32 +0000 [warn]: #0 [in_tail_container_logs] Skip update_watcher because watcher has been already updated by other inotify event path="/var/log/containers/logger-deployment-57cc6745c7-zwzxv_default_logger-af33706631b5c04250aa71c6956fde092559f09f8891e007dd8d454b12e89135.log" inode=112237023 inode_in_pos_file=0

2024-11-02 17:27:48 +0000 [debug]: #0 [in_tail_container_core_logs] tailing paths: target = /var/log/containers/fluentd-cloudwatch-ztwmc_amazon-cloudwatch_copy-fluentd-config-dc7b79cd11ccf90f5b8c512c1552ae13b28abfb2400b2ecd03c12d0ae7ceb564.log,/var/log/containers/fluentd-cloudwatch-ztwmc_amazon-cloudwatch_fluentd-cloudwatch-bc8e8da1056c6759e099f6b5b983d44ae7940a4963e376940b3ccacb18a6ab26.log,/var/log/containers/fluentd-cloudwatch-ztwmc_amazon-cloudwatch_update-log-driver-992ee8554687722124787066407ad9b21e97e3382b08a216205fda34259a0e03.log,/var/log/containers/aws-node-vgl9d_kube-system_aws-eks-nodeagent-d43b788731adaea1b1e53e23b0cd6c6aa4c15b41afd3f61ccb4f0fe466ae8d30.log,/var/log/containers/aws-node-vgl9d_kube-system_aws-node-688f632cd4bffd057003bcfa31b3546f4d64546e737645174cebc611f97e8e15.log,/var/log/containers/aws-node-vgl9d_kube-system_aws-vpc-cni-init-f59e23252a414c9f2041222c095f86766775eb70d37dd3fd89690978f2f554d0.log,/var/log/containers/kube-proxy-6z8zd_kube-system_kube-proxy-a1aae65c089af12b388a0527ebf25f7418eed956da5b284dace2702d58f422df.log,/var/log/containers/coredns-787cb67946-6dfhl_kube-system_coredns-f8b53737ad2d4133a9d9ac69f9f56bfbc9e7afb54d3dc91e6f7489009365ea17.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_csi-attacher-6530ac17c228aeca7e39958a1aa2f02da5878bf3b6b2fb643b5f43b53fcdf0b9.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_csi-provisioner-d3d1c4db5b0837aabf2cb3676951e85bd63c8d432b47b07770ad3d226f3be522.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_csi-resizer-ea911f783028d85009ebe185d03d602a8eb64fa2fe80da03082703caa69584d8.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_ebs-plugin-db350e781604de4725003c8f38a03f4ca2a1eec021c61005565a3caff3cd4733.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_liveness-probe-db10e53f8e6ecef8fab33ca7e68db83f3070dc406680fc4eb6858bffe431a37f.log,/var/log/containers/ebs-csi-node-5w6n2_kube-syst
em_ebs-plugin-bb331132e02cb3ee93c1a2cf5225cd14b2b2d063846e5e1e578665d0679d23ec.log,/var/log/containers/ebs-csi-node-5w6n2_kube-system_liveness-probe-a5f50e5e9490b16833b6fed1d29caf9ccb352dbb8852ec4cf5c93781ad61afd2.log,/var/log/containers/ebs-csi-node-5w6n2_kube-system_node-driver-registrar-9d0b426f9ebb91798f1d9d444a6d728b09f926794c471229e6f5f4d54891a07a.log,/var/log/containers/eks-pod-identity-agent-p9szr_kube-system_eks-pod-identity-agent-b93a02fa5321cba6f33ca5b809c948f9469ea8ffa2f320443960009196ba520a.log,/var/log/containers/eks-pod-identity-agent-p9szr_kube-system_eks-pod-identity-agent-init-b02cdb94178b436faaaf7f9a1e97d131046b38716434e2db474b1d5026a66ff0.log | existing = /var/log/containers/fluentd-cloudwatch-ztwmc_amazon-cloudwatch_copy-fluentd-config-dc7b79cd11ccf90f5b8c512c1552ae13b28abfb2400b2ecd03c12d0ae7ceb564.log,/var/log/containers/fluentd-cloudwatch-ztwmc_amazon-cloudwatch_fluentd-cloudwatch-bc8e8da1056c6759e099f6b5b983d44ae7940a4963e376940b3ccacb18a6ab26.log,/var/log/containers/fluentd-cloudwatch-ztwmc_amazon-cloudwatch_update-log-driver-992ee8554687722124787066407ad9b21e97e3382b08a216205fda34259a0e03.log,/var/log/containers/aws-node-vgl9d_kube-system_aws-eks-nodeagent-d43b788731adaea1b1e53e23b0cd6c6aa4c15b41afd3f61ccb4f0fe466ae8d30.log,/var/log/containers/aws-node-vgl9d_kube-system_aws-node-688f632cd4bffd057003bcfa31b3546f4d64546e737645174cebc611f97e8e15.log,/var/log/containers/aws-node-vgl9d_kube-system_aws-vpc-cni-init-f59e23252a414c9f2041222c095f86766775eb70d37dd3fd89690978f2f554d0.log,/var/log/containers/kube-proxy-6z8zd_kube-system_kube-proxy-a1aae65c089af12b388a0527ebf25f7418eed956da5b284dace2702d58f422df.log,/var/log/containers/coredns-787cb67946-6dfhl_kube-system_coredns-f8b53737ad2d4133a9d9ac69f9f56bfbc9e7afb54d3dc91e6f7489009365ea17.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_csi-attacher-6530ac17c228aeca7e39958a1aa2f02da5878bf3b6b2fb643b5f43b53fcdf0b9.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_ku
be-system_csi-provisioner-d3d1c4db5b0837aabf2cb3676951e85bd63c8d432b47b07770ad3d226f3be522.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_csi-resizer-ea911f783028d85009ebe185d03d602a8eb64fa2fe80da03082703caa69584d8.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_ebs-plugin-db350e781604de4725003c8f38a03f4ca2a1eec021c61005565a3caff3cd4733.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_liveness-probe-db10e53f8e6ecef8fab33ca7e68db83f3070dc406680fc4eb6858bffe431a37f.log,/var/log/containers/ebs-csi-node-5w6n2_kube-system_ebs-plugin-bb331132e02cb3ee93c1a2cf5225cd14b2b2d063846e5e1e578665d0679d23ec.log,/var/log/containers/ebs-csi-node-5w6n2_kube-system_liveness-probe-a5f50e5e9490b16833b6fed1d29caf9ccb352dbb8852ec4cf5c93781ad61afd2.log,/var/log/containers/ebs-csi-node-5w6n2_kube-system_node-driver-registrar-9d0b426f9ebb91798f1d9d444a6d728b09f926794c471229e6f5f4d54891a07a.log,/var/log/containers/eks-pod-identity-agent-p9szr_kube-system_eks-pod-identity-agent-b93a02fa5321cba6f33ca5b809c948f9469ea8ffa2f320443960009196ba520a.log,/var/log/containers/eks-pod-identity-agent-p9szr_kube-system_eks-pod-identity-agent-init-b02cdb94178b436faaaf7f9a1e97d131046b38716434e2db474b1d5026a66ff0.log
2024-11-02 17:27:49 +0000 [debug]: #0 [in_tail_container_core_logs] tailing paths: target = /var/log/containers/fluentd-cloudwatch-bwdpf_amazon-cloudwatch_copy-fluentd-config-e1c4560f70a672f811586c42239cd8f823c2da7afe504f49af7965f019091f57.log,/var/log/containers/fluentd-cloudwatch-bwdpf_amazon-cloudwatch_fluentd-cloudwatch-0e493d532c0a48ae46aed7b6500431b93b0403acd74dd6ff92049c571be9e402.log,/var/log/containers/fluentd-cloudwatch-bwdpf_amazon-cloudwatch_update-log-driver-a7799851e03ac287f48cbc63552c5b31016106061ba40493ad644e8a10016e62.log,/var/log/containers/aws-node-9b2rk_kube-system_aws-eks-nodeagent-2a82275bdf85fdb8ac57a6d9e4c927919eb8472e10ffaf77a0290c291111d629.log,/var/log/containers/aws-node-9b2rk_kube-system_aws-eks-nodeagent-a410bd11314ce2fff148d5effd863b8502f0aadf4d492c94c5d841c388b927f4.log,/var/log/containers/aws-node-9b2rk_kube-system_aws-node-0f0417f969145e80e9de2474148256bf009ac84094d26453c53fd5c1c1b0ad6d.log,/var/log/containers/aws-node-9b2rk_kube-system_aws-vpc-cni-init-ffcd1ff811ff67d406fe64096ef05cd9db75666ed1c8efbfbd303f7d09e3c95e.log,/var/log/containers/kube-proxy-4xl5d_kube-system_kube-proxy-32285f83bc32feb2f06700f235ff9db332b23c355b1b7c17b9deaab4a3bcf531.log,/var/log/containers/kube-proxy-4xl5d_kube-system_kube-proxy-a3726048ebd5dceb76fe36e6fadeff5010c6e242aef6bc8f73f4e935a1f4f88c.log,/var/log/containers/coredns-787cb67946-c7jg2_kube-system_coredns-170f21c4cd43ac571eadd5d2f7992734ac46ef62cfca08ae3b4dd9b0bcb7657c.log,/var/log/containers/coredns-787cb67946-c7jg2_kube-system_coredns-cd01a35e8ddbb4255538b165a64aede38b23cc6926a02dc606f7a568edd3a54d.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_csi-attacher-d572d6f311a78a938f22648838d5b85c7c757c0b4cfba2d23f88721a4d969181.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_csi-provisioner-8bb2b99746ddac4a5c72285e2a887bad3d733c5ad66e4f139326a5d8e3bca70e.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_csi-resizer-8ea3c5ce40e31197c5f1
f1b922a9b976a5f6bffe499c4a4c6b6db468bc2a421d.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_ebs-plugin-dc900b9e6db16ea65db1bad89d640664140423a92868735f45e1389af16a4233.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_liveness-probe-ddb3d10390ebe8b9457ffddf7e375e4d5d42ae9b7c3d0f52f94baa459527f2fd.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_ebs-plugin-922bec251cadd0bc8c39edddceedaa48fc978968533bef0e47f4cfe1a9bc06b7.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_ebs-plugin-acb6c394d637726269f1fd5ea9818ecc1706596091338e60a4d3720d1e39deac.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_liveness-probe-3ef28982a1e8ed79e8500e05a07f203af6f379f4cd10f31d0dcbe30649271b68.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_liveness-probe-7fb635bdc56be11e79798b4e93150a933da72a0e5c17c13ab04e542ee474b651.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_node-driver-registrar-3dbefb298de8507fced55cfa673fc5513c4b9aecfcefb864196de4885bc180b9.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_node-driver-registrar-cf3ab228b12f1509984a0fc9ece0cb77672cd535936bf7aff366ffdce70cd4b6.log,/var/log/containers/eks-pod-identity-agent-lkbzw_kube-system_eks-pod-identity-agent-27e3fe2cdbb873aef975b154c8007f769c5992b59226c8c3f059db1dc197ab4a.log,/var/log/containers/eks-pod-identity-agent-lkbzw_kube-system_eks-pod-identity-agent-6b685d7c878bed82856f3adb5a4cc0587f114cc3af38e378504540166215c69a.log,/var/log/containers/eks-pod-identity-agent-lkbzw_kube-system_eks-pod-identity-agent-init-3f779997a0b284a999b0505f1424a4b30af12d143a2a243a74dde7e2c9bd0de9.log,/var/log/containers/prometheus-0_lens-metrics_chown-394770bcd616d0c3d8380fcdbd07ca09fc00738fe17e5f15e5315c9d17312e25.log,/var/log/containers/prometheus-0_lens-metrics_prometheus-e713ff6ca1cb5d4e3d09fb1c07d70f4778efe32f94a4a4f89c7d5e3086ed866b.log | existing = 
/var/log/containers/fluentd-cloudwatch-bwdpf_amazon-cloudwatch_copy-fluentd-config-e1c4560f70a672f811586c42239cd8f823c2da7afe504f49af7965f019091f57.log,/var/log/containers/fluentd-cloudwatch-bwdpf_amazon-cloudwatch_fluentd-cloudwatch-0e493d532c0a48ae46aed7b6500431b93b0403acd74dd6ff92049c571be9e402.log,/var/log/containers/fluentd-cloudwatch-bwdpf_amazon-cloudwatch_update-log-driver-a7799851e03ac287f48cbc63552c5b31016106061ba40493ad644e8a10016e62.log,/var/log/containers/aws-node-9b2rk_kube-system_aws-eks-nodeagent-2a82275bdf85fdb8ac57a6d9e4c927919eb8472e10ffaf77a0290c291111d629.log,/var/log/containers/aws-node-9b2rk_kube-system_aws-eks-nodeagent-a410bd11314ce2fff148d5effd863b8502f0aadf4d492c94c5d841c388b927f4.log,/var/log/containers/aws-node-9b2rk_kube-system_aws-node-0f0417f969145e80e9de2474148256bf009ac84094d26453c53fd5c1c1b0ad6d.log,/var/log/containers/aws-node-9b2rk_kube-system_aws-vpc-cni-init-ffcd1ff811ff67d406fe64096ef05cd9db75666ed1c8efbfbd303f7d09e3c95e.log,/var/log/containers/kube-proxy-4xl5d_kube-system_kube-proxy-32285f83bc32feb2f06700f235ff9db332b23c355b1b7c17b9deaab4a3bcf531.log,/var/log/containers/kube-proxy-4xl5d_kube-system_kube-proxy-a3726048ebd5dceb76fe36e6fadeff5010c6e242aef6bc8f73f4e935a1f4f88c.log,/var/log/containers/coredns-787cb67946-c7jg2_kube-system_coredns-170f21c4cd43ac571eadd5d2f7992734ac46ef62cfca08ae3b4dd9b0bcb7657c.log,/var/log/containers/coredns-787cb67946-c7jg2_kube-system_coredns-cd01a35e8ddbb4255538b165a64aede38b23cc6926a02dc606f7a568edd3a54d.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_csi-attacher-d572d6f311a78a938f22648838d5b85c7c757c0b4cfba2d23f88721a4d969181.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_csi-provisioner-8bb2b99746ddac4a5c72285e2a887bad3d733c5ad66e4f139326a5d8e3bca70e.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_csi-resizer-8ea3c5ce40e31197c5f1f1b922a9b976a5f6bffe499c4a4c6b6db468bc2a421d.log,/var/log/containers/ebs-csi-controller-5ddc
98b494-zksgf_kube-system_ebs-plugin-dc900b9e6db16ea65db1bad89d640664140423a92868735f45e1389af16a4233.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_liveness-probe-ddb3d10390ebe8b9457ffddf7e375e4d5d42ae9b7c3d0f52f94baa459527f2fd.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_ebs-plugin-922bec251cadd0bc8c39edddceedaa48fc978968533bef0e47f4cfe1a9bc06b7.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_ebs-plugin-acb6c394d637726269f1fd5ea9818ecc1706596091338e60a4d3720d1e39deac.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_liveness-probe-3ef28982a1e8ed79e8500e05a07f203af6f379f4cd10f31d0dcbe30649271b68.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_liveness-probe-7fb635bdc56be11e79798b4e93150a933da72a0e5c17c13ab04e542ee474b651.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_node-driver-registrar-3dbefb298de8507fced55cfa673fc5513c4b9aecfcefb864196de4885bc180b9.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_node-driver-registrar-cf3ab228b12f1509984a0fc9ece0cb77672cd535936bf7aff366ffdce70cd4b6.log,/var/log/containers/eks-pod-identity-agent-lkbzw_kube-system_eks-pod-identity-agent-27e3fe2cdbb873aef975b154c8007f769c5992b59226c8c3f059db1dc197ab4a.log,/var/log/containers/eks-pod-identity-agent-lkbzw_kube-system_eks-pod-identity-agent-6b685d7c878bed82856f3adb5a4cc0587f114cc3af38e378504540166215c69a.log,/var/log/containers/eks-pod-identity-agent-lkbzw_kube-system_eks-pod-identity-agent-init-3f779997a0b284a999b0505f1424a4b30af12d143a2a243a74dde7e2c9bd0de9.log,/var/log/containers/prometheus-0_lens-metrics_chown-394770bcd616d0c3d8380fcdbd07ca09fc00738fe17e5f15e5315c9d17312e25.log,/var/log/containers/prometheus-0_lens-metrics_prometheus-e713ff6ca1cb5d4e3d09fb1c07d70f4778efe32f94a4a4f89c7d5e3086ed866b.log
2024-11-02 17:27:54 +0000 [info]: #0 [filter_kube_metadata_host] stats - namespace_cache_size: 0, pod_cache_size: 0
2024-11-02 17:27:54 +0000 [info]: #0 [filter_kube_metadata_host] stats - namespace_cache_size: 0, pod_cache_size: 0
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-hw4ds_default_logger-aba43bbd009d1652e1961dbd30ed45f09e337bfb42d3fa247b12fde7af248909.log failed. Continuing without tailing it.
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-jtxmz_default_logger-742ba4e5339168b7b5442745705bbfed1d93c832027ca0c680b193c9c62e796f.log failed. Continuing without tailing it.
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-kmrlv_default_logger-7682a4b64550055203e19ff9387b686e316fe4e5e7884b720dede3692659c686.log failed. Continuing without tailing it.
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-ptf4k_default_logger-88c30f214da39c81d5fc04466eacddf79278dcd9f99402e5c051243e26b7218f.log failed. Continuing without tailing it.
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-rnm4s_default_logger-df9566f71c1fd7ab074850d94ee4771ea24d9b653599a61cce791f7e221224c2.log failed. Continuing without tailing it.
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-vvrtx_default_logger-37eb38772106129b0925b5fdb8bc20f378c6156ef510d787ec35c57fd3bd68bc.log failed. Continuing without tailing it.
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-z9cxt_default_logger-c49720681936856bf6d2df5df3f35561a56d62f4c6a7d65aea8c7e0d70c37ad8.log failed. Continuing without tailing it.

Additional context

No response

@jicowan
Author

jicowan commented Nov 4, 2024

Consistently seeing the following errors in the logs (changed the wait time to 60s):

2024-11-04 15:27:48 +0000 [info]: #0 [in_tail_container_logs] detected rotation of /var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log; waiting 60.0 seconds
2024-11-04 15:27:48 +0000 [warn]: #0 [in_tail_container_logs] Skip update_watcher because watcher has been already updated by other inotify event path="/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log" inode=100695028 inode_in_pos_file=0
2024-11-04 15:27:48 +0000 [info]: #0 [in_tail_container_logs] detected rotation of /var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log; waiting 60.0 seconds
2024-11-04 15:27:48 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log

Contents of containers.log.pos file:

/var/log/containers/karpenter-76785c6874-gjsjq_karpenter-system_controller-2008ed03f1b7010e3a10bd6249585a91ea4f52b7bb807abdbffee2012e3634e5.log 0000000000009870        00000000025000d3
/var/log/containers/aws-guardduty-agent-kct6q_amazon-guardduty_aws-guardduty-agent-ce5502b765a04b99c5bc04c9cb3d110d6be023626430780c03c0df7ac25360fb.log 0000000000000e1f        000000000070bb0f
/var/log/containers/aws-guardduty-agent-kct6q_amazon-guardduty_aws-guardduty-agent-7b219cd69b2abd4809d569dd8810052a2c1cc2c139f42589b879db518fb42c98.log 0000000000000e1f        000000000070c062
/var/log/containers/karpenter-76785c6874-gjsjq_karpenter-system_controller-b7642902d9be8cf37a8f2e0e05bf858cdaa6e226a89947538d3856bf25d669a4.log 0000000000009018        00000000025000c1
/var/log/containers/node-exporter-tvjg5_lens-metrics_node-exporter-429a1e98cabdf9227e3d222649c64cbd37200d42148f0aa3c461a6293d25c57f.log 0000000000001fc2        0000000002e00b5e
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log      ffffffffffffffff        0000000006007bf4
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log      ffffffffffffffff        0000000006007bf9
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log      ffffffffffffffff        0000000000000000
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log      ffffffffffffffff        0000000006007bfb
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log      ffffffffffffffff        0000000000000000
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log      ffffffffffffffff        0000000006007bfc
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log      ffffffffffffffff        0000000000000000
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log      ffffffffffffffff        0000000006007bfd
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log      ffffffffffffffff        0000000000000000
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log      ffffffffffffffff        0000000006007bfb
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log      ffffffffffffffff        0000000000000000
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log      0000000000b83ff5        0000000006007bfc
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log      0000000000000000        0000000000000000
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log      00000000003865bb        0000000006007bfd
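To make the pos file above easier to read, here is a minimal Python sketch. It assumes the in_tail pos file layout of a path, a hexadecimal byte offset, and a hexadecimal inode separated by whitespace, with an offset of ffffffffffffffff marking an entry that in_tail has stopped watching. The names `PosEntry`, `parse_pos_line`, and `summarize` are illustrative and not part of Fluentd.

```python
from typing import NamedTuple

# Offset value that marks an entry in_tail no longer watches (assumed from the pos file format).
UNWATCHED = 0xFFFFFFFFFFFFFFFF


class PosEntry(NamedTuple):
    path: str
    offset: int
    inode: int

    @property
    def unwatched(self) -> bool:
        return self.offset == UNWATCHED


def parse_pos_line(line: str) -> PosEntry:
    # Split from the right so only the last two fields are treated as hex numbers.
    path, offset_hex, inode_hex = line.rsplit(None, 2)
    return PosEntry(path, int(offset_hex, 16), int(inode_hex, 16))


def summarize(lines):
    """Return {path: (live_entries, unwatched_entries)} for the given pos file lines."""
    summary = {}
    for line in lines:
        if not line.strip():
            continue
        entry = parse_pos_line(line)
        live, dead = summary.get(entry.path, (0, 0))
        if entry.unwatched:
            summary[entry.path] = (live, dead + 1)
        else:
            summary[entry.path] = (live + 1, dead)
    return summary
```

Decoded this way, the first unwatched logger entry has inode 0x6007bf4, which is 100695028 in decimal, the same inode reported in the Skip update_watcher warning above.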

@daipom
Contributor

daipom commented Nov 8, 2024

Thanks for this report.
We need to figure out the possible cause.
I will investigate this weekend.

@jicowan
Author

jicowan commented Nov 8, 2024

Thanks. I've tried different combinations of settings since opening this issue, e.g. using a file buffer, increasing the chunk size, increasing the mem/CPU allocated to the fluentd daemonset, etc. None of them seems to have an impact on Fluentd's ability to tail the logs. It's as if it's losing track of the files it's supposed to tail. I have the notebook I've been using to find gaps in the sequence. Let me know if you want me to post it here.
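The notebook itself is not shown in this thread. As a rough sketch of the kind of gap check described, the Python below assumes each log entry carries an integer sequence number that increases by one per entry; that format and the function name `find_gaps` are assumptions for illustration only.

```python
def find_gaps(seqs):
    """Return (first_missing, last_missing) ranges between the received sequence numbers."""
    gaps = []
    ordered = sorted(set(seqs))  # tolerate duplicates and out-of-order delivery
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > 1:
            gaps.append((prev + 1, cur - 1))
    return gaps


# Example: entries 4-6 and 9 never arrived at the backend.
print(find_gaps([1, 2, 3, 7, 8, 10]))
```

Sorting a de-duplicated set first means that retries or reordering at the backend are not reported as loss; only sequence numbers that never arrive show up as gaps.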

@jicowan
Author

jicowan commented Nov 8, 2024

@daipom I just ran a test where I set the kubelet's containerLogMaxSize to 50Mi (the default is 10Mi). After doing that I saw zero log loss. I'm not totally sure why that would be. My only guess is that the files are being rotated less often and so there are fewer files for fluentd to keep track of.
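A back-of-the-envelope Python sketch of how containerLogMaxSize changes the rotation frequency. The 10,000 entries per second figure comes from the issue description; the 200 bytes per entry on disk is a made-up assumption, so treat the absolute numbers as illustrative only.

```python
MIB = 1024 * 1024


def seconds_per_rotation(max_size_bytes, entries_per_sec, bytes_per_entry):
    """Estimate how long one container log file takes to reach the rotation size."""
    return max_size_bytes / (entries_per_sec * bytes_per_entry)


for size_mib in (10, 50):
    secs = seconds_per_rotation(size_mib * MIB, entries_per_sec=10_000, bytes_per_entry=200)
    print(f"{size_mib}Mi -> one rotation roughly every {secs:.1f} s")
```

Whatever the real entry size is, raising the limit from 10Mi to 50Mi makes rotations five times less frequent at the same write rate, which fits the guess that fewer rotations mean fewer files for Fluentd to keep track of.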

@jicowan
Author

jicowan commented Nov 11, 2024

@daipom Do you think increasing the number of workers and assigning them to the source block for @type tail would help with smaller log files?

@Watson1978
Contributor

I tried it briefly in my local environment, but I could not reproduce this.
Do we need Kubernetes to reproduce it?

@jicowan Can you reproduce this without Kubernetes?

@jicowan
Author

jicowan commented Nov 20, 2024

I only tried this on k8s. I ran multiple replicas of it (at least 10). When the logs grew to 10MB, they were rotated by the kubelet. That's where I saw the issue. Fluentd lost track of the inodes because the files were being rotated so quickly.
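To illustrate what "losing track of the inodes" refers to, here is a small Python sketch of a rename-and-recreate rotation on a POSIX filesystem. It is not Fluentd's implementation; the file names and the `rotated` helper are hypothetical. After the rotation the same path points at a new inode, while the data a reader was following lives on under the old inode.

```python
import os
import tempfile


def rotated(path, known_inode):
    """True if path now refers to a different inode than the one being followed."""
    return os.stat(path).st_ino != known_inode


with tempfile.TemporaryDirectory() as d:
    live = os.path.join(d, "0.log")
    old = live + ".rotated"  # hypothetical name for the rotated file

    with open(live, "w") as f:
        f.write("first\n")
    followed = os.stat(live).st_ino  # inode a tailer would remember

    # Rotate: rename the current file, then create a fresh file at the same path.
    os.rename(live, old)
    with open(live, "w") as f:
        f.write("second\n")

    path_changed_inode = rotated(live, followed)
    old_file_keeps_inode = os.stat(old).st_ino == followed

print(path_changed_inode, old_file_keeps_inode)
```

A tailer that only checks the path periodically sees the latest inode at each check, so if several rotations happen between two checks, the intermediate files are never observed under that path.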

@daipom moved this to Work-In-Progress in Fluentd Kanban on Nov 27, 2024
@daipom
Contributor

daipom commented Nov 27, 2024

@jicowan
We are trying to reproduce this issue.
Could you please tell us how to reproduce this in detail?
I can run the node with the test application, but I don't know how to collect the output.
Do we need another Fluentd node to reproduce this?
Or should we use a sidecar?

@daipom
Contributor

daipom commented Nov 27, 2024

I think I need to have a file like /var/log/containers/... and collect it by in_tail, but I don't know how to do that.
If I set up a pod as in To Reproduce, the logs will be output to standard output.
Sorry, I'm not familiar with K8s, so I need a detailed procedure to reproduce this.

@jicowan
Author

jicowan commented Dec 2, 2024

First you need a Kubernetes cluster (try not to use KIND, MiniKube, or another single-node version of Kubernetes). Then you need to install the Fluentd DaemonSet. You can download the manifests from here. I used the version for Amazon CloudWatch, but you can use a different backend if you like. So long as it can absorb the volume of logs that you're sending to it, the choice of backend shouldn't affect the results of the tests. The default log file size is 10MB. At 10MB the kubelet (the Kubernetes "agent") will rotate the log file.

You can use the Kubernetes Deployment I created to deploy the logging application:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: logger-deployment
  labels:
    app: logger
spec:
  replicas: 1  # Adjust the number of replicas as needed
  selector:
    matchLabels:
      app: logger
  template:
    metadata:
      labels:
        app: logger
    spec:
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - logger
              topologyKey: "kubernetes.io/hostname"
      containers:
      - name: logger
        image: jicowan/logger:v3.0
        resources:
          requests:
            cpu: 4
            memory: 128Mi
          limits:
            cpu: 4
            memory: 256Mi
        env:
          - name: POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name

The configuration for Fluentd is typically stored in a ConfigMap. If this isn't descriptive enough, I can walk you through the configuration during a web conference.

@jicowan
Author

jicowan commented Dec 13, 2024

I can't verify this is happening yet, but it may be that the files are being rotated so fast that fluentd doesn't have enough time to read them before they are compressed. As the kubelet rotates the logs, it renames the file 0.log to 0.log.. It keeps that log uncompressed for 1 log rotation and then compresses it. If fluentd falls too far behind, it may not be able to read the log before it is compressed. Here is the kubelet code where this happens, https://github.com/kubernetes/kubernetes/blob/f545438bd347d2ac853b03983576bf0a6f1cc98b/pkg/kubelet/logs/container_log_manager.go#L400-L421. I assume there would be an error in the fluentd logs if it could no longer read the log file, but I haven't seen such an error in my testing so far. There is no way to disable compression, so I don't have a way to test my theory.

@Watson1978
Contributor

Watson1978 commented Dec 18, 2024

I'm trying to reproduce this in a local environment using minikube.
However, I haven't been able to reproduce it yet.
I'd like to know what your environment runs on.
Is your environment on AWS?

I'm going to try to reproduce on that environment.

@jicowan
Author

jicowan commented Dec 18, 2024

Yes, the environment was on AWS. You can use this eksctl configuration file to provision a similar environment. You can adjust the maximum size of the log file by changing the value of containerLogMaxSize. The default is 10Mi. The default containerLogMaxWorkers is 1. I also changed the storage type from gp3 to io1 because I was using a file buffer and wanted a disk with better IO characteristics. You can change it back to gp3 if you want.

# An advanced example of ClusterConfig object with customised nodegroups:
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: logging
  region: us-west-2
  version: "1.30"

nodeGroups:
  - name: ng3
    instanceType: m5.4xlarge
    desiredCapacity: 2
    privateNetworking: true
    ssh:
      enableSsm: true 
    kubeletExtraConfig:
      containerLogMaxWorkers: 5
      containerLogMaxSize: "50Mi"
    ebsOptimized: true
    volumeType: io1
      
iam:
  withOIDC: true

accessConfig:
  authenticationMode: API_AND_CONFIG_MAP

vpc:
  nat:
    gateway: Single

If you send the logs to CloudWatch, you'll need to use IRSA or pod identities to assign an IAM role to the pod.

@Watson1978
Contributor

If the log files are rotated at intervals shorter than refresh_interval, the rotation may not be handled properly.
The workaround would be to shorten refresh_interval, or to increase the size limit at which files are rotated so that rotations happen less often.

@slopezxrd

If you increase the size limit of the rotated file, Fluentd still reads more slowly than the logs are written, so at some point you will lose one of the rotated files.

@jicowan
Author

jicowan commented Jan 7, 2025

The refresh interval is set to 1 @Watson1978. @slopezxrd I can't verify this yet, but if Fluentd is unable to read the logs fast enough, they will get compressed [by the Kubelet] before it has had time to read the whole file which will result in lost logs. If you look at the code for the Kubelet, it has already accounted for this once before, https://github.com/kubernetes/kubernetes/blob/f1b3fdf7e6d40714b1a43757221832aa1c4a49d1/pkg/kubelet/logs/container_log_manager.go#L451-L472.

@Watson1978
Contributor

Watson1978 commented Jan 24, 2025

Sorry for the late response.
I have been investigating this issue for a while.

I now recommend the following configuration for running on Kubernetes.

Recommended configuration

<source>
  @type tail

  follow_inodes false
  rotate_wait 0
  path /var/log/containers/...path to your app logs...

...
</source>

follow_inodes false

With follow_inodes true, if a log file rotation is detected, a new log file may not be read until the refresh_interval has elapsed.
I recommend setting follow_inodes false to avoid this behavior.

rotate_wait 0

With follow_inodes false, Fluentd will display many Skip update_watcher because watcher has been already updated... warning messages.
Setting rotate_wait 0 might suppress these messages, and you can ignore any Skip update_watcher because watcher has been already updated... warnings that remain.

There is no problem with Fluentd's behavior when that message is displayed.

path /var/log/containers/...path to your app logs...

There is a symbolic link to the application log under /var/log/containers/.
It would be sufficient to use that as the read target.

Warning messages

You can ignore the following warning messages. There is no problem with Fluentd's behavior when that message is displayed.

  • Skip update_watcher because watcher has been already updated...
  • Could not follow a file (inode: 101712298) because an existing watcher for that filepath follows a different inode...

I will fix these warning messages or relax warning log level.

@daipom
Contributor

daipom commented Jan 27, 2025

@Watson1978 Thanks for investigating!
So, the problem is that the rotation occurs at very high speed.
In that case, it is certainly better to set follow_inodes false (default) and rotate_wait 0.

Warning messages

You can ignore the following warning messages. There is no problem with Fluentd's behavior when that message is displayed.

* `Skip update_watcher because watcher has been already updated...`

* `Could not follow a file (inode: 101712298) because an existing watcher for that filepath follows a different inode...`

I will fix these warning messages or relax warning log level.

Yes!
There was a bug in older versions that could cause in_tail collection to stop without an error log.
These warning logs were placed at that time as a precaution.

In this case, the fast rotation causes this warning, but there seems to be no problem with the collection.
So, as @Watson1978 says, you can ignore these warnings.
These logs should be fixed, considering the case of fast rotations.

@daipom
Contributor

daipom commented Jan 27, 2025

During heavy log volumes, e.g. >10k log entries per second, fluentd consistently drops logs.

Hmm, does setting follow_inodes false and rotate_wait 0 cause log loss?
It looks like we need to investigate the log loss problem further.

Projects
Status: Work-In-Progress
Development

No branches or pull requests

4 participants