Anomalies 🛸 detector

area-51 is a tool for detecting anomalies in a data set using a Z-test. Statistical properties of reference data are used to score raw data. The data sources are files in CSV format.

How to use

area-51 is a CLI tool for processing data stored in files.

Reference data and raw data should be stored in two separate files.

Raw data is split by Z-score and written to two files:

  • clean data is stored in clean.csv;
  • records whose data exceeds the expected deviation are stored in anomalies.csv.
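
For illustration only (the threshold and the exact CSV schema here are assumptions, not taken from the tool): if a reference feature is distributed around 10 with a small deviation, a raw record like key-42,10.1 would be appended to clean.csv, while key-43,97.0 would end up in anomalies.csv.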

On launch, the tool processes the existing files and then keeps listening to the reference and raw files until the user stops it. Changes to the data (appending, replacing a file, etc.) are processed in real time.

The tool can be launched even if the source files are not yet in place. It will process them as soon as they appear at the paths given by the CLI options --reference and --raw.

Log messages inform the user when the existing data has been completely processed, so the user can stop the tool at the proper time, e.g. once the files will no longer change.

Run

There are three CLI options to configure the tool:

  • --reference - path to the reference file;
  • --raw - path to the raw file;
  • --output - path to the directory for output files.

There are several ways to run the tool.

Run as a Docker image

A Docker image of area-51 is published on Docker Hub. Use this command to start it:

docker run -v /test-data:/mnt/test-data 7phs/area-51 --reference /mnt/test-data/in/ref.csv --raw /mnt/test-data/in/raw.csv --output /mnt/test-data/output/
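
Here the host directory /test-data is mounted into the container at /mnt/test-data, so the --reference, --raw, and --output paths refer to locations inside the container.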

Run with the go tool

Building area-51 requires Go 1.17+.

git clone git@github.com:7phs/area-51.git
cd ./area-51
go run ./cmd/server --reference /test-data/in/ref.csv --raw /test-data/in/raw.csv --output /test-data/output/

Description of the project

There are two aspects of the solution that I would like to describe in detail.

Statistics

A Z-test is used to estimate the quality of a data record.

To calculate a Z-score, one needs the mean and standard deviation of the feature within the partition assigned to a data record.
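
For a feature value x in a partition with mean μ and standard deviation σ, the Z-score is z = (x - μ) / σ.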

There are several ways to obtain them:

  • collect all the data included in the partition;
  • calculate the mean and standard deviation on a stream of data.

The solution implements the second one, based on the description in Rapid calculation methods for the standard deviation [2]:

This is a "one pass" algorithm for calculating variance of n samples without the need to store prior data during the calculation. Applying this method to a time series will result in successive values of standard deviation corresponding to n data points as n grows larger with each new sample, rather than a constant-width sliding window calculation.
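
Below is a minimal Go sketch of that one-pass (Welford) method. The type and method names are illustrative, not the repository's actual API:

package main

import (
	"fmt"
	"math"
)

// runningStat accumulates the count, mean, and sum of squared
// deviations in a single pass over the data (Welford's method).
type runningStat struct {
	n    int
	mean float64
	m2   float64 // sum of squared deviations from the running mean
}

// push folds one sample into the running statistics.
func (s *runningStat) push(x float64) {
	s.n++
	delta := x - s.mean
	s.mean += delta / float64(s.n)
	s.m2 += delta * (x - s.mean)
}

// stdDev returns the sample standard deviation of the data seen so far.
func (s *runningStat) stdDev() float64 {
	if s.n < 2 {
		return 0
	}
	return math.Sqrt(s.m2 / float64(s.n-1))
}

// zScore scores a value against the accumulated reference statistics.
func (s *runningStat) zScore(x float64) float64 {
	sd := s.stdDev()
	if sd == 0 {
		return 0
	}
	return (x - s.mean) / sd
}

func main() {
	var ref runningStat
	for _, x := range []float64{9.8, 10.1, 10.0, 9.9, 10.2} {
		ref.push(x)
	}
	fmt.Printf("z(10.05) = %.2f, z(42.0) = %.2f\n", ref.zScore(10.05), ref.zScore(42.0))
}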

Engineering

The solution breaks the implementation into two big components:

  • reading data - responsible for listening to file changes, reading the files, and representing the stored data as a stream of bytes;
  • processing data - responsible for parsing the stream of bytes into data records, scoring them, and storing them in the destination files.

Component diagram

A major reason for this split is that data is represented to the processor as an infinite stream. This makes it easy to update a data source, and possibly to replace it with a stream from a network service, etc.
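
As a rough illustration (the interface below is hypothetical, not the repository's actual API), the processor only needs to depend on something like:

package stream

// DataStream is a hypothetical shape for the reading side: the
// processor consumes an endless sequence of byte chunks and does not
// care whether they come from a watched file or a network service.
type DataStream interface {
	// Chunks delivers raw bytes as the underlying source produces them;
	// the channel stays open for as long as the source is watched.
	Chunks() <-chan []byte
	// Close stops watching the source and releases its resources.
	Close() error
}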

Several important parts of the solution were implemented to reduce the processing time of the data files:

  • Listen to OS file events to handle all changes of the raw and reference files;
  • Use a buffer to read data from a file and to serialize it;
  • Use a custom CSV reader to reduce the overhead of the standard solution;
  • Assign slices of the read buffer as record fields instead of copying data while fetching a record (see the sketch after this list);
  • Parse only the significant parts of a data record (the key and the features);
  • Keep the raw representation of a data record as a slice of bytes so it is easy to serialize.
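
A minimal sketch of the zero-copy field splitting idea (the names are illustrative, and quoted fields are ignored for brevity; the repository's reader is more complete):

package main

import "fmt"

// splitFields splits one CSV line into fields without copying:
// each returned field is a sub-slice of the input buffer, so the
// field contents are never duplicated.
func splitFields(line []byte, delim byte) [][]byte {
	fields := make([][]byte, 0, 8)
	start := 0
	for i, c := range line {
		if c == delim {
			fields = append(fields, line[start:i])
			start = i + 1
		}
	}
	return append(fields, line[start:])
}

func main() {
	record := []byte("key-1,0.5,1.2,3.4")
	for _, field := range splitFields(record, ',') {
		fmt.Printf("%s\n", field)
	}
}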

Benchmark

Data processing time measurements were taken to determine the overall performance level of the solution.

The test reference data contains ~100 000 records, and the raw data contains ~100 000 records.

Hardware:

  • MacBook Pro (15-inch, 2018); 2.6 GHz 6-Core Intel Core i7

Result:

Load and process the reference file (before the first line of raw data is processed):
      100 000 records / from 150 ms to 250 ms

Scoring raw data:
       50 000 records / from 65 ms to 100 ms  

TODO

  • Check existing file per queue and open and read it immediately
  • Parse CSV
  • Parse features float array
  • Probably send a command, or add handling for a closed file on the data stream
  • Skip header
  • Inter-buffer processing
  • Handle CSV format errors - just skip the record
  • Check a record for anomalies (call dummy preference)
  • Split output stream into two
  • Write record to file
  • Reference collector
  • Calculate Z-test of reference
  • Waiting for reference before calculating Z-test
  • Anomalies detector, which uses the reference collector
  • Dockerfile and build image
  • Push image to hub.docker.com
  • Description of the project
  • Configuration for delimiter, skipping the first line, size of buffer
  • Output stream is a dedicated entity with its own interface
  • Pool for read buffer
  • Unit tests for the happy paths of the watcher, data-stream, detector, reference, etc.
  • Logger with levels

References:

  1. Z-test
  2. Rapid calculation methods for the standard deviation
  3. Comparing means: z and t tests
