# MacroBase: Analytic Monitoring for the Internet of Things

*Reference: Bailis, Peter, et al. "Macrobase: Prioritizing attention in fast data." Proceedings of the 2017 ACM International Conference on Management of Data. ACM, 2017.*

## 0. Abstract

An increasing proportion of data today is generated by automated processes, sensors, and devices (collectively, the Internet of Things, or IoT). IoT applications’ rising data volume, demands for time-sensitive analysis, and heterogeneity exacerbate the challenge of identifying and highlighting important trends in IoT deployments. In response, we present MacroBase, a data analytics engine that performs statistically-informed *analytic monitoring* of IoT data streams by identifying deviations within streams and generating potential explanations for each. MacroBase is the first analytics engine to combine streaming outlier detection and streaming explanation operators, allowing cross-layer optimizations that deliver order-of-magnitude speedups over existing, primarily non-streaming alternatives. As a result, MacroBase can deliver accurate results at speeds of up to 2M events per second per query on a single core. MacroBase has delivered meaningful analytic monitoring results in production, including at an IoT company monitoring hundreds of thousands of vehicles.

## 1. Introduction

- Three key characteristics of **Internet of Things** (IoT) applications: (1) they generate immense data volume; (2) they often require timely analyses; (3) they frequently exhibit heterogeneous behaviors.
- MacroBase: a new data analysis system specialized for high-performance analytic monitoring of IoT streams.

## 2. Target Environment

### 2.1 Analytic Monitoring Scenarios

- Mobile applications (e.g., CMT)
- Datacenter operation (e.g., Amazon AWS)
- Industrial monitoring (e.g., explosion at Horsehead Holding Corp.’s Monaca)

### 2.2 Related Work

- Streaming and Specialized Analytics
- Outlier Detection
- Data Explanation

## 3. MacroBase Architecture and APIs

**Principles of Pipeline Design**: (1) all operators operate over streams; (2) MacroBase uses the compiler’s type system to enforce a particular pipeline structure.

**Pipelines**

- **Ingestion**: MacroBase ingests data streams from a number of external data sources. Each data point contains a set of *metrics*, corresponding to IoT measurements (e.g., trip time, battery drain), and *attributes*, corresponding to metadata about the measurements (e.g., user ID and device ID). MacroBase uses metrics to detect abnormal or unusual events, and attributes to explain behaviors. In this paper, the authors only consider real-valued metrics and categorical attributes.
- **Feature Transformation**: MacroBase executes a series of optional domain-specific data transformations over the stream, including time-series operations (e.g., windowing, seasonality removal, autocorrelation, frequency analysis) and datatype-specific operations (e.g., hue extraction for images, optical flow for video).
- **Outlier Detection**: The detector labels each *Point* as an inlier or an outlier based on its input metrics and produces a stream of labeled *Point* outputs.
- **Explanation**: MacroBase performs summarization to generate *explanations* by finding attributes that are common among outliers and uncommon among stream inliers. MacroBase’s explanation operators are designed to emit explanations on demand: they act as streaming view maintainers whose output can be queried at any time.
- **Presentation**: Most pipelines rank explanations before presentation according to statistics corresponding to each explanation. MacroBase’s default presentation mode is a static report rendered via a REST API or via a GUI.
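The stage ordering above can be illustrated with a minimal Python sketch. Everything here is hypothetical: the names `Point`, `LabeledPoint`, `transform`, `detect`, `explain`, and `run_pipeline` are invented for illustration and are not MacroBase's API, the detector is a fixed cutoff standing in for a real estimator, and Python type hints are only checked by an external static checker rather than enforced by a compiler.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Tuple


@dataclass
class Point:
    metrics: Tuple[float, ...]   # real-valued measurements, e.g. trip time
    attributes: Dict[str, str]   # categorical metadata, e.g. device ID


@dataclass
class LabeledPoint:
    point: Point
    is_outlier: bool


def transform(points: Iterable[Point]) -> Iterator[Point]:
    """Placeholder feature transformation: passes points through unchanged."""
    yield from points


def detect(points: Iterable[Point], threshold: float) -> Iterator[LabeledPoint]:
    """Stand-in detector: a fixed cutoff on the first metric (an assumption)."""
    for p in points:
        yield LabeledPoint(p, p.metrics[0] > threshold)


def explain(labeled: Iterable[LabeledPoint]) -> Counter:
    """Stand-in explainer: counts attribute values seen among outliers."""
    counts: Counter = Counter()
    for lp in labeled:
        if lp.is_outlier:
            counts.update(lp.point.attributes.items())
    return counts


def run_pipeline(points: Iterable[Point], threshold: float) -> Counter:
    # Each stage consumes the output type of the previous stage.
    return explain(detect(transform(points), threshold))
```

Because each stage is a generator over the previous stage's output, points flow through one at a time, which loosely mirrors the "all operators operate over streams" principle.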

**Extensibility**: (1) users can add new domain-specific feature transformations to the start of a pipeline without modifying the other operators; (2) users can input rules and/or labels to MacroBase to perform supervised classification; (3) users can write their own feature transformations, outlier detectors, data explanation operators, and pipelines.

**Operating Modes**: (1) graphical front-end; (2) one-shot queries; (3) streaming queries.

## 4. MDP Outlier Detection

MDP stands for MacroBase’s Default Operator Pipeline.

### 4.1 Use of Robust Statistics

- Limitation of *Z-Score*: not robust to outliers.
- *Robust statistical estimation*: finding statistical distributions for data that is mostly well-behaved but may contain a number of ill-behaved data points.
- Robust replacement for univariate data: the Median Absolute Deviation (MAD) is the median of the absolute distances from each point in the sample to the sample median.
- Robust replacement for multivariate data: the Minimum Covariance Determinant (MCD) estimator finds the tightest group of points that best represents a sample and summarizes that group by its location and scatter (i.e., covariance) in metric space. Each point is then scored by its *Mahalanobis* distance from that location.
- *Classifying Outliers*: MDP uses a percentile-based cutoff over scores to identify the most extreme points in the sample. Points with scores above the cutoff are classified as outliers, reflecting their distance from the body of the distribution.
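A small univariate sketch of the MAD score and percentile cutoff described above. The function names are hypothetical, the score is the unscaled ratio |x - median| / MAD, and the nearest-rank percentile rule is an assumption chosen for simplicity rather than MDP's actual thresholding code.

```python
import math
import statistics


def mad_scores(values):
    """Score each value by its distance from the median, in units of MAD."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; more than half the values are identical")
    return [abs(v - med) / mad for v in values]


def percentile_cutoff(scores, percentile):
    """Nearest-rank percentile of the scores (percentile in (0, 100])."""
    ordered = sorted(scores)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


def classify(values, percentile=99.0):
    """Label values whose score lies strictly above the percentile cutoff."""
    scores = mad_scores(values)
    cutoff = percentile_cutoff(scores, percentile)
    return [s > cutoff for s in scores]
```

Unlike a Z-score, a single extreme reading barely moves the median or the MAD, so the extreme point cannot mask itself by inflating the estimate of spread.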

### 4.2 MDP Streaming Execution

Training MAD or MCD in a streaming context is problematic because the distribution within a data stream changes over time.

**ADR: Adaptable Damped Reservoir**

- The ADR maintains a sample of input data that is exponentially weighted towards more recent points. The key difference from traditional reservoir sampling is that the ADR operates over arbitrary window sizes.
- MDP maintains an ADR sample of the detector input to periodically recompute its estimator, and a second ADR sample of the outlier scores to periodically recompute its classification threshold.
- The ADR employs a decay policy that decays in *time*, not in number of tuples, because tuple-at-a-time decay may skew the reservoir towards periods of high stream volume. MacroBase currently supports two decay policies: (1) time-based decay; (2) batch-based decay.
- MacroBase’s MDP uses the ADR to: (1) maintain inputs for training; (2) maintain the percentile threshold.
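A minimal sketch of a damped reservoir in the spirit of the ADR. The class and method names are hypothetical, and the insertion rule (replace a random slot with probability `capacity / weight`) is an assumption modeled on weighted reservoir sampling. The point illustrated is that decay is a separate call, so the caller can trigger it on a timer or per batch rather than per tuple.

```python
import random


class AdaptableDampedReservoir:
    def __init__(self, capacity, decay_rate, rng=None):
        self.capacity = capacity        # maximum number of sampled items
        self.decay_rate = decay_rate    # fraction of weight forgotten per decay
        self.weight = 0.0               # decayed total weight observed so far
        self.items = []
        self.rng = rng or random.Random()

    def observe(self, item, weight=1.0):
        """Offer one item to the reservoir."""
        self.weight += weight
        if len(self.items) < self.capacity:
            self.items.append(item)
        elif self.rng.random() < self.capacity / self.weight:
            # Overwrite a uniformly chosen slot with the new item.
            self.items[self.rng.randrange(self.capacity)] = item

    def decay(self):
        """Shrink the running weight so later items are more likely to enter.

        Called by the owner on a time or batch boundary, not per tuple.
        """
        self.weight *= 1.0 - self.decay_rate
```

After each `decay()` call the running weight is smaller, so subsequent arrivals replace existing samples with higher probability, which biases the reservoir toward recent data regardless of how many tuples arrived in between.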

## 5. MDP Outlier Explanation

### 5.1 Semantics: Support and OI-Ratio

- **OI-Ratio**: MDP finds combinations with a high *outlier-inlier* ratio (**OI-Ratio**), i.e., the ratio of a combination's rate of occurrence (support) in the outliers to its rate of occurrence in the inliers. This lets MacroBase identify combinations of attribute values that are relatively uncommon in the inliers.
- **Support**: MDP finds combinations with high support, or occurrence (by count) in the outliers. This lets MacroBase optionally eliminate explanations corresponding to rare but non-systemic combinations.
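As a worked illustration of the two measures, the sketch below computes them from raw counts. The function names are hypothetical, and returning infinity when a combination never appears among inliers is an assumption about how to handle the zero-denominator case.

```python
import math


def support(outlier_count, total_outliers):
    """Fraction of outlier points that contain the attribute combination."""
    return outlier_count / total_outliers


def oi_ratio(outlier_count, total_outliers, inlier_count, total_inliers):
    """Rate of occurrence among outliers divided by the rate among inliers."""
    if inlier_count == 0:
        return math.inf  # assumption: never seen in inliers means unbounded ratio
    # Equivalent to (outlier_count / total_outliers) / (inlier_count / total_inliers).
    return (outlier_count * total_inliers) / (inlier_count * total_outliers)
```

For example, if a device model appears in 8 of 10 outliers and in 99 of 990 inliers, its support is 0.8 and its OI-ratio is 0.8 / 0.1 = 8: the value is eight times more common among outliers than among inliers.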

### 5.2 Basic Explanation Strategy

**Optimization**

- *Item ratios are cheap*. MDP first computes OI-ratios for single attribute values, then computes attribute combinations using only attribute values with sufficient OI-ratios.
- *Exploit cardinality imbalance*. The cardinality of the outlier input stream is by definition much smaller than that of the input stream. Accordingly, MDP only searches for attributes that were supported in the outliers, which reduces the space of inlier attributes to explore.
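The two optimizations can be sketched together as follows. This is a simplified illustration, not MDP's implementation: the function name and parameters are hypothetical, rows are modeled as sets of `(attribute, value)` tuples, support is treated as a fraction of the outliers, and combinations are limited to pairs.

```python
from collections import Counter
from itertools import combinations


def explain(outliers, inliers, min_support, min_ratio):
    """Return (single items, item pairs) that pass both thresholds."""
    n_out, n_in = len(outliers), len(inliers)

    def ratio(out_count, in_count):
        if in_count == 0:
            return float("inf")
        return (out_count * n_in) / (in_count * n_out)

    # Cardinality imbalance: count items in the small outlier set first.
    out_counts = Counter(item for row in outliers for item in row)
    supported = {i for i, c in out_counts.items() if c >= min_support * n_out}

    # Only items supported in the outliers are ever counted in the inliers.
    in_counts = Counter(i for row in inliers for i in row if i in supported)

    # Item ratios are cheap: filter single items before forming combinations.
    singles = {i for i in supported
               if ratio(out_counts[i], in_counts[i]) >= min_ratio}

    pair_out = Counter()
    for row in outliers:
        pair_out.update(combinations(sorted(i for i in row if i in singles), 2))
    candidates = {p for p, c in pair_out.items() if c >= min_support * n_out}

    pair_in = Counter()
    for row in inliers:
        kept = sorted(i for i in row if i in singles)
        pair_in.update(p for p in combinations(kept, 2) if p in candidates)

    pairs = {p for p in candidates if ratio(pair_out[p], pair_in[p]) >= min_ratio}
    return singles, pairs
```

The large inlier set is scanned only for items and pairs that already qualified in the outliers, so the inlier-side counters stay small even when the inliers contain many distinct attribute values.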

**Outlier-Aware Explanation Strategy**

**Algorithm and Data Structures**

The authors decided on prefix-tree-based approaches inspired by FPGrowth. In brief, the FPGrowth algorithm maintains a frequency-descending prefix tree of attributes that can subsequently be mined by recursively generating a set of “conditional” trees. This approach was fast and proved extensible to a streaming implementation.
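To make the "frequency-descending prefix tree" concrete, here is a minimal construction sketch. It covers only tree building, not the recursive mining of conditional trees. The names `Node` and `build_prefix_tree` are hypothetical, and ties in frequency are broken by item order as an arbitrary assumption.

```python
from collections import Counter


class Node:
    def __init__(self, item=None):
        self.item = item
        self.count = 0        # number of transactions passing through this node
        self.children = {}    # item -> Node


def build_prefix_tree(transactions, min_count=1):
    """Insert each transaction with its items sorted by descending frequency."""
    freq = Counter(item for t in transactions for item in set(t))
    root = Node()
    for t in transactions:
        items = [i for i in set(t) if freq[i] >= min_count]
        # Most frequent items first, so common items share prefixes near the root.
        items.sort(key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root
```

Sorting by descending frequency means transactions that share common attribute values share a path from the root, which keeps the tree compact.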

### 5.3 Streaming Explanation

**Single Attribute Summarization**

*AMC: Amortized Maintenance Counter*

- Limitation of the **SpaceSaving** algorithm: not efficient to update.
- By separating insertion from maintenance, AMC allows: (1) constant-time insertion; (2) a range of maintenance policies, including a size-based policy and a variable-period policy.
- AMC has three major differences compared to SpaceSaving: (1) AMC updates are constant time (hash table insertion) compared to O() for SpaceSaving; (2) AMC has an additional maintenance step, which is amortized across all items seen in a window; (3) AMC has higher space overhead since it must maintain all items it has seen between maintenance windows.
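A minimal sketch of the insertion/maintenance split described above. The class name, method names, and the rule of seeding new items with the largest previously pruned count (borrowed from how SpaceSaving bounds undercounting) are assumptions for illustration. The sketch also uses a full sort during maintenance for brevity, where a partial selection would suffice.

```python
class AmortizedMaintenanceCounter:
    def __init__(self, stable_size):
        self.stable_size = stable_size   # items retained after maintenance
        self.counts = {}                 # item -> approximate count
        self.max_pruned = 0.0            # largest count discarded so far

    def observe(self, item, weight=1.0):
        """Constant-time update: a single hash-table lookup and write."""
        if item in self.counts:
            self.counts[item] += weight
        else:
            # A new item may have been pruned earlier; start from that bound.
            self.counts[item] = self.max_pruned + weight

    def maintain(self):
        """Prune to the stable size; cost is amortized over prior inserts."""
        if len(self.counts) <= self.stable_size:
            return
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        kept = ranked[:self.stable_size]
        self.max_pruned = ranked[self.stable_size][1]
        self.counts = dict(kept)
```

Between calls to `maintain()` the table grows with every distinct item observed, which is the extra space overhead noted in point (3). The caller decides when to run maintenance, for example after a fixed number of inserts or once the table exceeds a size limit.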

**Streaming Combinations**

- MDP combines two data structures: AMC for counting the frequent attributes, and a novel adaptation of the CPS-tree to store frequent attributes.

## 6. Evaluation

- MacroBase’s MDP pipeline is accurate.
- MacroBase can process up to 2M points per second per query on a range of real-world datasets.
- MacroBase’s cardinality-aware explanation strategy produces meaningful speedups (average: 3.2x speedup).
- MacroBase’s use of AMC is up to 500x faster than existing sketches on production IoT data.
- MacroBase’s architecture is extensible.