Counting the number of unique IP addresses in a very large file

You have a simple text file with IPv4 addresses, one address per line:

145.67.23.4
8.34.5.23
89.54.3.124
89.54.3.124
3.45.71.5
...

The file is unlimited in size and can occupy tens or hundreds of gigabytes.

You should calculate the number of unique addresses in this file using as little memory and time as possible. There is a "naive" algorithm for solving this problem (read line by line, put lines into HashSet). It's better if your implementation is more complicated and faster than this naive algorithm.

Solutions

In theory, there can be 2^32 unique IP addresses.

Using a map[uint32]bool to record the IP addresses seen so far would be inefficient: in Go a uint32 key takes 4 bytes and a bool value takes 1 byte, so even ignoring the map's own overhead the worst case is 2^32*5 = 21474836480 bytes (about 21.47 GB).
The most efficient approach using only built-in structures would be a []bool slice, where the index is the integer representation of the IP address and the value records whether that address has occurred. A bool takes 1 byte in Go, so this needs 2^32*1 = 4294967296 bytes (about 4.29 GB).
We can take this a step further and use a single bit, rather than a whole byte, to mark each IP address. Go has no built-in bit type, but a bitset fills that gap: 2^32/8 = 536870912 bytes (about 0.54 GB).
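
The bitset idea can be sketched roughly as follows. This is a minimal illustration, not the code in this repository's bitset package: the names bitset, newBitset, add, parseIPv4 and countUnique are hypothetical, and the sketch assumes that each line holds one dotted-decimal IPv4 address and that malformed lines are simply skipped.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
)

// bitset stores one bit per value, packed into 64-bit words (hypothetical type).
type bitset []uint64

// newBitset allocates room for nbits bits; 1<<32 bits is 512 MiB.
func newBitset(nbits uint64) bitset {
	return make(bitset, (nbits+63)/64)
}

// add sets bit i and reports whether it was previously unset.
func (b bitset) add(i uint32) bool {
	word, mask := i/64, uint64(1)<<(i%64)
	if b[word]&mask != 0 {
		return false
	}
	b[word] |= mask
	return true
}

// parseIPv4 converts "a.b.c.d" to its 32-bit value without allocating.
func parseIPv4(s []byte) (uint32, bool) {
	var ip, octet uint32
	digits, dots := 0, 0
	for _, c := range s {
		switch {
		case c >= '0' && c <= '9':
			octet = octet*10 + uint32(c-'0')
			digits++
			if digits > 3 || octet > 255 {
				return 0, false
			}
		case c == '.':
			if digits == 0 || dots == 3 {
				return 0, false
			}
			ip = ip<<8 | octet
			octet, digits = 0, 0
			dots++
		default:
			return 0, false
		}
	}
	if dots != 3 || digits == 0 {
		return 0, false
	}
	return ip<<8 | octet, true
}

// countUnique reads one address per line and counts distinct ones.
// The result is a uint64 because all 2^32 addresses may be present.
func countUnique(r io.Reader) (uint64, error) {
	seen := newBitset(1 << 32)
	var count uint64
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		if ip, ok := parseIPv4(sc.Bytes()); ok && seen.add(ip) {
			count++
		}
	}
	return count, sc.Err()
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: ipcount <file>")
		return
	}
	f, err := os.Open(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer f.Close()
	n, err := countUnique(f)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println(n)
}
```

Counting at insertion time avoids a second pass over the 512 MiB bitset, and parsing straight from the scanner's byte slice avoids allocating a string per line.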
