Transforming data to information for automated diagnosis of distributed systems performance

Ira Cohen
Thursday, 18.5.2006, 12:30
Taub 337

As systems and distributed applications grow in complexity and scale, managing them becomes more difficult and sometimes infeasible for human operators. Recent research has shown encouraging results for performance debugging and for failure detection and diagnosis using machine learning approaches that leverage the vast amount of data collected on distributed systems. In this talk I'll describe some of the recent work at HP-Labs applying machine learning and probabilistic modeling to system performance diagnosis. In particular, I will present in detail our work, published at SOSP'05, aimed at transforming the system diagnosis problem into an information retrieval one. The method automatically extracts from a running system an indexable signature that distills the essential characteristics of a system state and that can be subjected to automated clustering and similarity-based retrieval. This allows operators to identify and quantify the frequency of recurrent problems, to leverage previous diagnostic efforts, and to establish whether problems seen at different installations of the same site are similar or distinct. I'll describe our signature representation method, which builds on characterizing performance problems using an ensemble of probabilistic classifiers.

Ira Cohen is a senior researcher at Hewlett-Packard research labs, where he works on applying machine learning and pattern recognition techniques to system diagnosis, management and control. Ira joined HP-Labs in 2003 after receiving his PhD in Electrical and Computer Engineering from the University of Illinois at Urbana-Champaign, where he worked on semi-supervised learning and computer vision applications. Ira holds a BSc from Ben-Gurion University, Israel. His research interests are in machine learning, probabilistic models, systems management and control, computer vision and human-computer interaction.
