area-51 is a tool for detecting anomalies in a set of data using a Z-test. Statistical properties of the reference data are used for scoring the raw data. The data sources are files in CSV format.
area-51 is a CLI tool for processing data stored in files. Reference data and raw data should be stored in two different files. Raw data is split by Z-score and put into two files:
- clean data is stored to `clean.csv`;
- records with data deviating more than expected are stored to `anomalies.csv`.
The tool launches, processes the existing files, and keeps listening to the reference and raw files until a user stops it. It processes changes in the data (appending, replacing a file, etc.) in real time.
A user can launch the tool even if the source files are not yet in place. The tool will process them once the files are put at the paths identified by the CLI options `--reference` and `--raw`.
Log messages inform the user when the existing data has been completely processed. The user stops the tool at the proper time, e.g. when the files will no longer change, or for any other reason.
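Internally, this kind of real-time behaviour is typically driven by OS file-system events. The sketch below is an illustration only, assuming the fsnotify package and the test directory from the examples below; the actual implementation of area-51 may differ:

```go
package main

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

func main() {
	// Watch the directory that is expected to contain the CSV files,
	// so the tool is notified when a file appears, grows, or is replaced.
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()

	if err := watcher.Add("/test-data/in"); err != nil {
		log.Fatal(err)
	}

	for {
		select {
		case event, ok := <-watcher.Events:
			if !ok {
				return
			}
			if event.Op&(fsnotify.Create|fsnotify.Write) != 0 {
				log.Printf("file changed: %s - processing new data", event.Name)
			}
		case err, ok := <-watcher.Errors:
			if !ok {
				return
			}
			log.Printf("watch error: %v", err)
		}
	}
}
```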
There are three CLI options to configure the tool:
- `--reference` - path to the reference file;
- `--raw` - path to the raw data file;
- `--output` - path to the directory for output files.
There are several options to run the tool.
The Docker image of area-51 is available on Docker Hub. Use this command to start it:
docker run -v /test-data:/mnt/test-data 7phs/area-51 --reference /mnt/test-data/in/ref.csv --raw /mnt/test-data/in/raw.csv --output /mnt/test-data/output/
Building area-51 from source requires Go 1.17+:
git clone git@github.com:7phs/area-51.git
cd ./area-51
go run ./cmd/server --reference /test-data/in/ref.csv --raw /test-data/in/raw.csv --output /test-data/output/
There are two sides of the solution that I would like to describe in detail.
The Z-test is used to estimate the quality of a data record. Calculating a Z-score requires the mean and standard deviation of each feature of the partition assigned to a data record.
There are several ways to obtain them:
- collect all data included in the partition and calculate the statistics afterwards;
- calculate the mean and standard deviation on a stream of data.
The solution implements the second approach, based on the description at Rapid calculation methods for the standard deviation:
This is a "one pass" algorithm for calculating variance of n samples without the need to store prior data during the calculation. Applying this method to a time series will result in successive values of standard deviation corresponding to n data points as n grows larger with each new sample, rather than a constant-width sliding window calculation.
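As a rough illustration only (not the repository's actual code), a streaming accumulator based on the power sums referenced above might look like this in Go; the `RunningStat` type and its methods are hypothetical names:

```go
package main

import (
	"fmt"
	"math"
)

// RunningStat accumulates power sums of a stream of values, so the
// mean and standard deviation are available at any moment without
// storing prior samples (the "one pass" approach quoted above).
type RunningStat struct {
	n    float64 // number of samples seen so far
	sum  float64 // sum of the samples
	sum2 float64 // sum of the squared samples
}

// Add consumes one sample from the stream.
func (r *RunningStat) Add(x float64) {
	r.n++
	r.sum += x
	r.sum2 += x * x
}

// Mean returns the current mean of the stream.
func (r *RunningStat) Mean() float64 {
	if r.n == 0 {
		return 0
	}
	return r.sum / r.n
}

// StdDev returns the current population standard deviation.
func (r *RunningStat) StdDev() float64 {
	if r.n == 0 {
		return 0
	}
	return math.Sqrt(r.n*r.sum2-r.sum*r.sum) / r.n
}

// ZScore reports how many standard deviations x lies from the mean.
func (r *RunningStat) ZScore(x float64) float64 {
	sd := r.StdDev()
	if sd == 0 {
		return 0
	}
	return (x - r.Mean()) / sd
}

func main() {
	var stat RunningStat
	for _, x := range []float64{1.2, 0.9, 1.1, 1.0, 1.3} {
		stat.Add(x)
	}
	// A value far from the reference mean yields a large Z-score.
	fmt.Printf("mean=%.3f sd=%.3f z(5.0)=%.2f\n", stat.Mean(), stat.StdDev(), stat.ZScore(5.0))
}
```

Each feature of a raw record can then be scored against the reference statistics via the Z-score, which is what the split into clean and anomalous records is based on.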
The solution breaks the implementation into two big components:
- reading data - responsible for listening for file changes, reading, and representing the data stored in files as a stream of bytes;
- processing data - responsible for parsing a stream of bytes into data records, scoring them, and storing them in the destination files.
A major reason for this is to represent the data as an infinite stream for the processor. It makes it easy to update a data source and, possibly, to replace it with a stream from a network service, etc.
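A minimal sketch of how such a split could be expressed in Go is shown below; `DataStream`, `Record`, and `Processor` are hypothetical names used only to illustrate the idea of an infinite byte stream feeding a record processor:

```go
package pipeline

import "context"

// DataStream is an abstraction over a source of bytes: a file being
// appended to, a file that has been replaced, or a network stream.
type DataStream interface {
	// Chunks emits raw byte chunks until the context is cancelled
	// or the underlying source is closed.
	Chunks(ctx context.Context) <-chan []byte
}

// Record is one parsed CSV line: its key, the numeric features used
// for scoring, and the raw bytes kept for cheap serialization.
type Record struct {
	Key      string
	Features []float64
	Raw      []byte
}

// Processor parses the byte stream into records, scores them, and
// routes each record to the clean or anomalies output.
type Processor interface {
	Process(ctx context.Context, in DataStream) error
}
```

With this shape, the processor never needs to know whether the bytes come from a local file or some other source.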
Important parts of the solution that were implemented to reduce the processing time of the data files:
- Listen to OS file events to handle all changes of the raw and reference files;
- Use a buffer to read data from a file and serialize it;
- Use a custom CSV reader to reduce the overhead of the standard solution;
- Use slices of the data buffer as record fields instead of copying data while fetching a record;
- Parse only the significant part of a data record (the key and the features);
- Keep the raw representation of a data record as a slice of bytes to serialize it easily.
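As an illustration of the zero-copy and partial-parsing ideas only (this is not the repository's actual reader), record fields can be kept as sub-slices of the read buffer and only the key and features parsed:

```go
package main

import (
	"bytes"
	"fmt"
	"strconv"
)

// splitFields splits one CSV line into fields that are sub-slices of
// the original buffer, so no per-field copies are made.
func splitFields(line []byte, delim byte) [][]byte {
	return bytes.Split(line, []byte{delim})
}

func main() {
	// One raw record: a key followed by feature values.
	line := []byte("record-42,0.95,1.20,3.40")

	fields := splitFields(line, ',')
	key := fields[0] // still a view into line, not a copy

	// Only the significant part of the record is parsed: the features.
	features := make([]float64, 0, len(fields)-1)
	for _, f := range fields[1:] {
		v, err := strconv.ParseFloat(string(f), 64)
		if err != nil {
			// A malformed value is simply skipped.
			continue
		}
		features = append(features, v)
	}

	fmt.Printf("key=%s features=%v raw=%q\n", key, features, line)
}
```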
Data processing time measurements were taken to determine the overall performance level of the solution.
The test reference data contains ~100 000 records, and the raw data contains ~100 000 records.
Hardware:
- MacBook Pro (15-inch, 2018); 2.6 GHz 6-Core Intel Core i7
Results:
- Loading and processing the reference file (before processing the first line of raw data): 100 000 records / from 150 ms to 250 ms
- Scoring raw data: 50 000 records / from 65 ms to 100 ms
- Check for an existing file for each queue and open and read it immediately
- Parse CSV
- Parse features float array
- Possibly send a command, or add handling on the data stream for the case when a file is closed
- Skip header
- Inter-buffer processing
- Handle CSV format errors - just skip the record
- Check a record for anomalies (call dummy preference)
- Split output stream into two
- Write record to file
- Reference collector
- Calculate Z-test of reference
- Wait for the reference before calculating the Z-test
- Anomalies detector, which uses the reference collector
- Dockerfile and build image
- Push image to hub.docker.com
- Description of the project
- Configuration for the delimiter, skipping the first line, and the buffer size
- Output stream as a dedicated entity with its own interface
- Pool for read buffer
- Unit tests for the happy paths of the watcher, data-stream, detector, reference, etc.
- Logger with levels