Probabilistic and On-Line Data Structures

August 11, 2021 One-minute read

systems_design • notes

Probabilistic data structures
- Often used for real-time analytics and large datasets
- Bloom filters for compact existence check:
  - Always tells if item is in set, slight change of false positives
  - bitmap
  - Check if a website is malicious (bloom filter of giant set of malicious sites)
- hyperloglog
- t-digest
  - Ted Dunning talk
  - on-line accumuluation of rank-based statistics
  - sort of like a histogram with many quintiles, more accuracy in very high and very low quintiles, less in middle quintiles
  - variable accuracy, constant relative accuracy
  - computes clusters, large in middle, small towards end of distribution
  - continuously compute median (50th percentile) of a large number of integers from distributed nodes
  - order points, collect points together given scale of quintile, continually merging points, short buffer for new points, sort and merge
  - <100 nanosecond overhead per measurement
  - under the hood uses avl trees
- semigroups and monoids