Monitoring the global state of a cloud application deployed over distributed computing nodes is critical for administering cloud applications and services in datacenters. State monitoring requires meeting two demanding objectives: a high level of correctness, which ensures a zero or very low error rate, and high communication efficiency, which demands minimal communication cost in detecting critical state updates. Most existing work follows an instantaneous state monitoring model, which triggers a state alert whenever a constraint is violated. This model may cause frequent and unnecessary state alerts due to momentary bursts and outliers in the monitored values. Countermeasures taken in response to such alerts may lead to unnecessary and problematic operations.
In our research, we have tried to bridge this gap. We have developed a window-based state monitoring system, WISE, that reports an alert only when a state violation is continuous within a specified time window. We show that window-based state monitoring is not only more resilient to temporary value bursts and outliers, but can also save considerable communication when implemented in a distributed manner. This project makes three contributions. First, WISE employs distributed filtering time windows and avoids collecting global information to achieve communication efficiency, while guaranteeing monitoring correctness. Second, WISE is equipped with a scalable parameter tuning scheme that searches for the best local monitoring parameters to minimize communication cost. Compared with centralized tuning schemes, the distributed state monitoring framework presented in this paper is highly scalable because it distributes tuning tasks to each monitoring node, and is thus more suitable for large-scale datacenter monitoring. Third, WISE provides a set of performance optimization techniques that further reduce the communication cost of state violation detection. Experimental results show that WISE reduces communication by 50%-90% compared with instantaneous monitoring approaches and simple alternative schemes.
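The difference between the two alerting models can be illustrated with a small sketch. This is not the WISE implementation: it ignores the distributed filtering and parameter tuning described above and only contrasts the alert semantics on a single, already aggregated value stream. The function names, the sample values, the threshold of 80, and the window of 3 samples are all assumptions made for illustration.

```python
from typing import List, Sequence


def instantaneous_alerts(values: Sequence[float], threshold: float) -> List[int]:
    """Return every time index at which the value exceeds the threshold."""
    return [t for t, v in enumerate(values) if v > threshold]


def window_alerts(values: Sequence[float], threshold: float, window: int) -> List[int]:
    """Return time indices at which the value has exceeded the threshold
    for at least `window` consecutive samples, including the current one."""
    alerts = []
    run = 0  # length of the current run of consecutive violations
    for t, v in enumerate(values):
        run = run + 1 if v > threshold else 0
        if run >= window:
            alerts.append(t)
    return alerts


# Hypothetical aggregate samples: one momentary burst (index 2)
# and one sustained violation (indices 5 through 8).
samples = [10, 12, 95, 11, 13, 90, 92, 91, 93, 12]
instant = instantaneous_alerts(samples, threshold=80)
windowed = window_alerts(samples, threshold=80, window=3)
print("instantaneous:", instant)
print("window-based: ", windowed)
```

In this toy stream, the instantaneous model fires on the isolated burst as well as on every sample of the sustained violation, while the window-based model stays silent until the violation has lasted the full window.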
Shicong Meng
Ting Wang
Ling Liu
Shicong Meng, Ting Wang and Ling Liu, "Monitoring Continuous State Violation in Datacenters: Exploring the Time Dimension", 26th IEEE International Conference on Data Engineering (ICDE'10), March 1-6, 2010, Long Beach, California, USA.
Shicong Meng, Srinivas Kashyap, Chitra Venkatramani and Ling Liu, "REMO: Resource-Aware Application State Monitoring for Large-Scale Distributed Systems", Proceedings of IEEE Int. Conf. on Distributed Computing Systems (ICDCS'09), June 22-26, 2009, Montreal, Quebec, Canada.
Shicong Meng <smeng[AT]cc.gatech.edu>
This work is partially supported by grants from NSF CyberTrust and NSF NetSE, an IBM faculty award, an IBM SUR grant, and a grant from the Intel research council. The authors would also like to thank the anonymous ICDE reviewers for their insightful and valuable comments that helped improve this paper.
2010 Shicong Meng