Probabilistic and On-Line Data Structures

  • Probabilistic data structures
    • Often used for real-time analytics and large datasets
    • Bloom filters for compact existence check:
      • Always tells if item is in set, slight change of false positives
      • bitmap
      • Check if a website is malicious (bloom filter of giant set of malicious sites)
    • hyperloglog
    • t-digest
      • Ted Dunning talk
      • on-line accumuluation of rank-based statistics
      • sort of like a histogram with many quintiles, more accuracy in very high and very low quintiles, less in middle quintiles
      • variable accuracy, constant relative accuracy
      • computes clusters, large in middle, small towards end of distribution
      • continuously compute median (50th percentile) of a large number of integers from distributed nodes
      • order points, collect points together given scale of quintile, continually merging points, short buffer for new points, sort and merge
      • <100 nanosecond overhead per measurement
      • under the hood uses avl trees
    • semigroups and monoids