How to pull data from Elasticsearch to Splunk without data duplication?

Hello,
I have data in Elasticsearch that I’ve stored across 3 indices.
I want to copy the data to Splunk — I want only the data from the last minute, and I want to store the data from each Elasticsearch index into a separate index in Splunk.

I used Logstash and created 3 pipelines — one for each index — but I noticed that there is duplication in the data.

Is there a solution to the duplication issue, or is there another way to pull the data?

These are the pipelines I created.


#--------------------First pipeline----------------


input {
  elasticsearch {
    hosts => ["https://host:9200"]
    index => "index-2125211241830947454*"
    query => '{
      "query": {
        "range": {
          "@timestamp": {
            "gte": "now-2m",
            "lt": "now"
          }
        }
      },
      "sort": [
        { "@timestamp": "asc" }
      ]
    }'
    size => 1000
    scroll => "2m"
    schedule => "* * * * *"
    docinfo => true
    docinfo_target => "[docinfo]"
    user => "elastic"
    password => "pass"
    ssl_certificate_verification => false
  }
}

filter {
  fingerprint {
    source => ["[docinfo][_id]"]
    target => "[@metadata][dedup_fingerprint]"
    method => "SHA1"
  }
#  ruby {
#    init => 'require "set"; @@seen ||= Set.new'
#    code => '
#      fp = event.get("@metadata")["dedup_fingerprint"]
#      if @@seen.include?(fp)
#        event.cancel
#      else
#        @@seen.add(fp)
#      end
#    '
#  }

}

output {
  http {
    format => "json"
    content_type => "application/json"
    http_method => "post"
    ssl_verification_mode => "none"
    url => "http://host:8088/services/collector/raw"
    headers => ["Authorization", "Splunk ","X-Splunk-Request-Id", "%{[@metadata][dedup_fingerprint]}"]
    retry_failed => true
    pool_max => 50
  }
}


#----------------Second pipeline---------------------

input {
  elasticsearch {
    hosts => ["https://host:9200"]
    index => "index-2125211241830947466*"
    query => '{
      "query": {
        "range": {
          "@timestamp": {
            "gte": "now-2m",
            "lt": "now"
          }
        }
      },
      "sort": [
        { "@timestamp": "asc" }
      ]
    }'
    size => 1000
    scroll => "2m"
    schedule => "* * * * *"
    docinfo => true
    docinfo_target => "[docinfo]"
    user => "elastic"
    password => "pass"
    ssl_certificate_verification => false
  }
}

filter {
  fingerprint {
    source => ["[docinfo][_id]"]
    target => "[@metadata][dedup_fingerprint]"
    method => "SHA1"
  }
#  ruby {
#    init => 'require "set"; @@seen ||= Set.new'
#    code => '
#      fp = event.get("@metadata")["dedup_fingerprint"]
#      if @@seen.include?(fp)
#        event.cancel
#      else
#        @@seen.add(fp)
#      end
#    '
#  }

}

output {
  http {
    format => "json"
    content_type => "application/json"
    http_method => "post"
    ssl_verification_mode => "none"
    url => "http://host:8088/services/collector/raw"
    headers => ["Authorization", "Splunk ","X-Splunk-Request-Id", "%{[@metadata][dedup_fingerprint]}"]
    retry_failed => true
    pool_max => 50
  }
}



#---------------Third pipeline------------

input {
  elasticsearch {
    hosts => ["https://host:9200"]
    index => "index-2125211241830947108*"
    query => '{
      "query": {
        "range": {
          "@timestamp": {
            "gte": "now-2m",
            "lt": "now"
          }
        }
      },
      "sort": [
        { "@timestamp": "asc" }
      ]
    }'
    size => 1000
    scroll => "2m"
    schedule => "* * * * *"
    docinfo => true
    docinfo_target => "[docinfo]"
    user => "elastic"
    password => "pass"
    ssl_certificate_verification => false
  }
}

filter {
  fingerprint {
    source => ["[docinfo][_id]"]
    target => "[@metadata][dedup_fingerprint]"
    method => "SHA1"
  }
#  ruby {
#    init => 'require "set"; @@seen ||= Set.new'
#    code => '
#      fp = event.get("@metadata")["dedup_fingerprint"]
#      if @@seen.include?(fp)
#        event.cancel
#      else
#        @@seen.add(fp)
#      end
#    '
#  }

}

output {
  http {
    format => "json"
    content_type => "application/json"
    http_method => "post"
    ssl_verification_mode => "none"
    url => "http://host:8088/services/collector/raw"
    headers => ["Authorization", "Splunk ","X-Splunk-Request-Id", "%{[@metadata][dedup_fingerprint]}"]
    retry_failed => true
    pool_max => 50
  }
}





Are you running these as separate pipelines using pipelines.yml?

No, I run them all from a single file, pipelines.conf.

In that case the "pipelines" are not independent. Data from every input will go through every filter and be sent to every output. Use three separate files and configure each one as an independent pipeline.

I created each pipeline in a separate file. When I ran the pipelines, the first one ran without any issues. However, when I tried to run the second and third, I got an error saying that a pipeline was already running. That's why I merged all three into a single file.

Okay, I understand what you mean. Thank you.

Okay, what about duplicates? Can I add something to the filter to fetch data without duplication, or is there a solution to this problem?

Sorry, most of us won't know Splunk in great detail. Doesn't it offer a way to reject documents based on some "id" it has already seen/indexed, i.e. the equivalent of Elasticsearch's _id field? Isn't that what the "X-Splunk-Request-Id", "%{[@metadata][dedup_fingerprint]}" part is all about?

btw, what is the resolution of your @timestamp field? To the second, millisecond, nanosecond?

Yes, but the duplication still exists. I don't know exactly what the problem is.

millisecond

Here is an example of an @timestamp:

2025-01-15T23:56:39.736Z

Does Splunk support deduplication at ingestion time in the way you configured it? I have never used Splunk, so I'm not sure it does, but according to this answer on a Splunk forum it seems that it does not.

Deduplication needs to be done on the destination side, not in Logstash; for Logstash, each event is independent of the others.

Also, did you change Logstash to use pipelines.yml and run each pipeline from a different file?

If you are running Logstash with the three configurations in a single file, each event will be sent to the destination three times, since every event goes through every output. That alone can lead to duplication.

You need to run each pipeline from a different file using pipelines.yml.

Yes, Splunk does not support deduplication at ingestion time, which is why I want to send non-duplicated data from Elasticsearch via Logstash. I want to send only the data from the last one or two minutes, but even with a single pipeline it still returns duplicate values. I believe I have misconfigured one of the time-related settings, so the query window does not match the data returned by the pipeline.

Yes, I ran each pipeline separately, and the same issue occurs — even with just one pipeline, some duplicates still appear.

You have

        "range": {
          "@timestamp": {
            "gte": "now-2m",
            "lt": "now"
          }
        }
...
    schedule => "* * * * *"

Maybe I'm being slow, but am I missing something here? Every minute (according to the schedule) you are querying for data within a 2-minute time window, and you are wondering why there are duplicates in the results from these (necessarily overlapping) time windows?

Something like

        "range": {
          "@timestamp": {
            "gte": "now-3m/m",
            "lt": "now-2m/m"
          }
        }

might work a bit better with that schedule; at least the time windows should not overlap.
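
For example, slotted into your first scheduled input it would look something like this (host and index copied from your config, so adjust as needed):

input {
  elasticsearch {
    hosts => ["https://host:9200"]
    index => "index-2125211241830947454*"
    # each minute-aligned run fetches the fixed one-minute window that ended two minutes ago;
    # the /m rounding pins both bounds to minute boundaries, so consecutive runs don't overlap
    query => '{
      "query": {
        "range": {
          "@timestamp": {
            "gte": "now-3m/m",
            "lt": "now-2m/m"
          }
        }
      },
      "sort": [ { "@timestamp": "asc" } ]
    }'
    schedule => "* * * * *"
    # ... keep the rest of your existing settings (size, scroll, docinfo, credentials) unchanged
  }
}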

There's also the question of lag: documents with an @timestamp in any one-minute window might not all appear in the query results at the same time; some may take longer to be ingested and become searchable. So I'd also be concerned that the data in Splunk could be incomplete.

But IMHO a better solution is to change the architecture here: the original data should be ingested into Splunk and Elasticsearch via independent (parallel) paths. Making one dependent on the other does not seem like the best design.

You understood what I meant — this is exactly what’s happening.

Okay, thank you. I will adjust the time window and see if it works well; otherwise, I will ingest the full data into both.
The problem is that the data size is large and I only want the latest data to perform some analyses, but I will try both methods.

Good luck.

If you want "near to realtime" availability in splunk, going via elasticsearch+logstash is not the way to go IMO.

Also, if using Logstash you need to at least use the round-to-the-minute style, "now-1m/m", since "now" is evaluated at query time and will naturally vary a bit. Here are examples of what it evaluated to for me over a few minutes:

"2025-07-14T13:39:00.177Z"
"2025-07-14T13:40:00.276Z"
"2025-07-14T13:41:00.372Z"
"2025-07-14T13:42:00.479Z"
"2025-07-14T13:43:00.559Z"
"2025-07-14T13:44:00.640Z"
"2025-07-14T13:45:00.700Z"
"2025-07-14T13:46:00.763Z"
"2025-07-14T13:47:00.862Z"
"2025-07-14T13:48:01.061Z" <-- a second later already
"2025-07-14T13:49:00.069Z"
"2025-07-14T13:50:00.143Z"
"2025-07-14T13:51:00.211Z"
"2025-07-14T13:52:00.320Z"
"2025-07-14T13:53:00.420Z"
"2025-07-14T13:54:00.479Z"
"2025-07-14T13:55:00.544Z"
"2025-07-14T13:56:00.676Z"

i.e. from "now()-1m" to "now()" would leave small gaps / have small overlaps.

Thank you for this note.

I will try it; it might be a suitable solution.