How to pull data from Elasticsearch to Splunk without data duplication?

Hello,
I have data in Elasticsearch that I’ve stored across 3 indices.
I want to copy the data to Splunk — I want only the data from the last minute, and I want to store the data from each Elasticsearch index into a separate index in Splunk.

I used Logstash and created 3 pipelines — one for each index — but I noticed that there is duplication in the data.

Is there a solution to the duplication issue, or is there another way to pull the data?

These are the pipelines I created.


#--------------------First pipeline----------------


input {
  elasticsearch {
    hosts => ["https://host:9200"]
    index => "index-2125211241830947454*"
    query => '{
      "query": {
        "range": {
          "@timestamp": {
            "gte": "now-2m",
            "lt": "now"
          }
        }
      },
      "sort": [
        { "@timestamp": "asc" }
      ]
    }'
    size => 1000
    scroll => "2m"
    schedule => "* * * * *"
    docinfo => true
    docinfo_target => "[docinfo]"
    user => "elastic"
    password => "pass"
    ssl_certificate_verification => false
  }
}

filter {
  fingerprint {
    source => ["[docinfo][_id]"]
    target => "[@metadata][dedup_fingerprint]"
    method => "SHA1"
  }
#  ruby {
#    init => 'require "set"; @@seen ||= Set.new'
#    code => '
#      fp = event.get("@metadata")["dedup_fingerprint"]
#      if @@seen.include?(fp)
#        event.cancel
#      else
#        @@seen.add(fp)
#      end
#    '
#  }

}

output {
  http {
    format => "json"
    content_type => "application/json"
    http_method => "post"
    ssl_verification_mode => "none"
    url => "http://host:8088/services/collector/raw"
    headers => ["Authorization", "Splunk ","X-Splunk-Request-Id", "%{[@metadata][dedup_fingerprint]}"]
    retry_failed => true
    pool_max => 50
  }
}


#----------------Second pipeline---------------------

input {
  elasticsearch {
    hosts => ["https://host:9200"]
    index => "index-2125211241830947466*"
    query => '{
      "query": {
        "range": {
          "@timestamp": {
            "gte": "now-2m",
            "lt": "now"
          }
        }
      },
      "sort": [
        { "@timestamp": "asc" }
      ]
    }'
    size => 1000
    scroll => "2m"
    schedule => "* * * * *"
    docinfo => true
    docinfo_target => "[docinfo]"
    user => "elastic"
    password => "pass"
    ssl_certificate_verification => false
  }
}

filter {
  fingerprint {
    source => ["[docinfo][_id]"]
    target => "[@metadata][dedup_fingerprint]"
    method => "SHA1"
  }
#  ruby {
#    init => 'require "set"; @@seen ||= Set.new'
#    code => '
#      fp = event.get("@metadata")["dedup_fingerprint"]
#      if @@seen.include?(fp)
#        event.cancel
#      else
#        @@seen.add(fp)
#      end
#    '
#  }

}

output {
  http {
    format => "json"
    content_type => "application/json"
    http_method => "post"
    ssl_verification_mode => "none"
    url => "http://host:8088/services/collector/raw"
    headers => ["Authorization", "Splunk ","X-Splunk-Request-Id", "%{[@metadata][dedup_fingerprint]}"]
    retry_failed => true
    pool_max => 50
  }
}



#---------------Third pipeline------------

input {
  elasticsearch {
    hosts => ["https://host:9200"]
    index => "index-2125211241830947108*"
    query => '{
      "query": {
        "range": {
          "@timestamp": {
            "gte": "now-2m",
            "lt": "now"
          }
        }
      },
      "sort": [
        { "@timestamp": "asc" }
      ]
    }'
    size => 1000
    scroll => "2m"
    schedule => "* * * * *"
    docinfo => true
    docinfo_target => "[docinfo]"
    user => "elastic"
    password => "pass"
    ssl_certificate_verification => false
  }
}

filter {
  fingerprint {
    source => ["[docinfo][_id]"]
    target => "[@metadata][dedup_fingerprint]"
    method => "SHA1"
  }
#  ruby {
#    init => 'require "set"; @@seen ||= Set.new'
#    code => '
#      fp = event.get("@metadata")["dedup_fingerprint"]
#      if @@seen.include?(fp)
#        event.cancel
#      else
#        @@seen.add(fp)
#      end
#    '
#  }

}

output {
  http {
    format => "json"
    content_type => "application/json"
    http_method => "post"
    ssl_verification_mode => "none"
    url => "http://host:8088/services/collector/raw"
    headers => ["Authorization", "Splunk ","X-Splunk-Request-Id", "%{[@metadata][dedup_fingerprint]}"]
    retry_failed => true
    pool_max => 50
  }
}





Are you running these as separate pipelines using pipelines.yml?

No, I run them all from a single file, pipelines.conf.

In that case the "pipelines" are not independent. Data from every input will go through every filter and be sent to every output. Use three separate files and configure each one as an independent pipeline.

I created each pipeline in a separate file. When I ran the pipelines, the first one ran without any issues. However, when I tried to run the second and third, I got an error saying that a pipeline was already running. That's why I merged all three into a single file.

Okay, I understand what you mean. Thank you.

Okay, what about duplicates? Can I add something to the filter to fetch data without duplication, or is there a solution to this problem?

Sorry, most of us won't know Splunk in great detail. Doesn't it offer a way to reject documents based on some "id" it has already seen/indexed, i.e. the equivalent of Elasticsearch's _id field? Isn't that what the "X-Splunk-Request-Id", "%{[@metadata][dedup_fingerprint]}" part is all about?

btw, what is the resolution of your @timestamp field? To the second, millisecond, nanosecond?

Yes, but the duplication still exists. I don't know exactly what the problem is.

millisecond

Here is an example of an @timestamp:

2025-01-15T23:56:39.736Z

Does Splunk support deduplication at ingestion time in the way you configured it? I have never used Splunk, so I'm not sure it does, but according to this answer on a Splunk forum it seems that it does not.

Deduplication needs to be done on the destination side, not in Logstash; for Logstash, each event is independent of the others.

Also, did you change Logstash to use pipelines.yml and run each pipeline from a different file?

If you are running Logstash with the three configurations in a single file, each event will be sent to the destination three times, since every event goes through every output. That alone can lead to duplication.

You need to run each pipeline from a different file using pipelines.yml.

Yes, Splunk does not support deduplication at ingestion time, which is why I want to send non-duplicated data from Elasticsearch via Logstash. I want to send only the data from the last one or two minutes, but even with a single pipeline it still returns duplicate values. I believe I have misconfigured one of the time-related settings, so the query window does not match the data returned by the pipeline.

Yes, I ran each pipeline separately, and the same issue occurs — even with just one pipeline, some duplicates still appear.

You have

        "range": {
          "@timestamp": {
            "gte": "now-2m",
            "lt": "now"
          }
        }
...
    schedule => "* * * * *"

Maybe I'm being slow, but am I missing something here? Every minute (according to the schedule) you are querying for data within a 2-minute time window, and you are wondering why there are duplicates in the results from these (necessarily overlapping) time windows?

Something like

        "range": {
          "@timestamp": {
            "gte": "now-3m/m",
            "lt": "now-2m/m"
          }
        }

might work a bit better with that schedule; at least the time windows should not overlap.
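
For example, slotted into your first scheduled input it would look something like this (host and index copied from your config, so adjust as needed):

input {
  elasticsearch {
    hosts => ["https://host:9200"]
    index => "index-2125211241830947454*"
    # each minute-aligned run fetches the fixed one-minute window that ended two minutes ago;
    # the /m rounding pins both bounds to minute boundaries, so consecutive runs don't overlap
    query => '{
      "query": {
        "range": {
          "@timestamp": {
            "gte": "now-3m/m",
            "lt": "now-2m/m"
          }
        }
      },
      "sort": [ { "@timestamp": "asc" } ]
    }'
    schedule => "* * * * *"
    # ... keep the rest of your existing settings (size, scroll, docinfo, credentials) unchanged
  }
}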

There's also the question of lag: documents with an @timestamp in any one-minute window might not all appear in the query results at the same time; some may take longer to be ingested and become searchable. So I'd also be concerned that the data in Splunk could be incomplete.

But IMHO a better solution is to change the architecture here: the original data should be ingested into Splunk and Elasticsearch via independent (parallel) paths. Making one dependent on the other does not seem like the best design.

You understood what I meant — this is exactly what’s happening.

Okay, thank you. I will adjust the time window and see if it works well; otherwise, I will ingest the full data into both.
The problem is that the data size is large and I only want the latest data to perform some analyses, but I will try both methods.

Good luck.

If you want "near to realtime" availability in splunk, going via elasticsearch+logstash is not the way to go IMO.

Also, if using Logstash you need to at least use the round-to-the-minute style, "now-1m/m", since "now" is evaluated at query time and will naturally vary a bit. Here are examples of what it evaluated to for me over a few minutes:

"2025-07-14T13:39:00.177Z"
"2025-07-14T13:40:00.276Z"
"2025-07-14T13:41:00.372Z"
"2025-07-14T13:42:00.479Z"
"2025-07-14T13:43:00.559Z"
"2025-07-14T13:44:00.640Z"
"2025-07-14T13:45:00.700Z"
"2025-07-14T13:46:00.763Z"
"2025-07-14T13:47:00.862Z"
"2025-07-14T13:48:01.061Z" <-- a second later already
"2025-07-14T13:49:00.069Z"
"2025-07-14T13:50:00.143Z"
"2025-07-14T13:51:00.211Z"
"2025-07-14T13:52:00.320Z"
"2025-07-14T13:53:00.420Z"
"2025-07-14T13:54:00.479Z"
"2025-07-14T13:55:00.544Z"
"2025-07-14T13:56:00.676Z"

i.e. from "now()-1m" to "now()" would leave small gaps / have small overlaps.

Thank you for this note.

I will try it; it might be a suitable solution.