Pixel Club: Clustering and Approximating High-Dimensional Streaming Data using Coresets

דן פלדמן (מכון טכנולוגי, קליפורניה)
יום ראשון, 22.5.2011, 11:30
חדר 337, בניין טאוב למדעי המחשב

Data analysis of massive data sets is common today for web-search (e.g. Google), social networking (e.g. Facebook), financial applications, supermarkets, bioinformatics and many other fields.

A coreset (or, core-set) for a given problem is a "compressed" representation of its input, in the sense that a solution for the problem with the (small) coreset as input would yield an approximate solution to the problem with the original (large) input. Using traditional techniques, a coreset usually implies provable linear time algorithms for the corresponding optimization problem, which can be computed in parallel, via one pass over the data, and using only logarithmic space (i.e, in the streaming model).

In this talk I will present a unified framework that yields the efficient construction of succinct coresets for several problems. Representing the data by a set F of positive functions over a ground set X, our framework forges a link between the combinatorial complexity of the function family F at hand (measured in terms of classical VC dimension) and the paradigm of coresets. Our coresets are obtained via randomized sampling according to a delicately designed sampling distribution. Examples in which our framework has been successful (and significantly improves over previous works) include the k-median, the k-line median, projective clustering, linear regression, low rank approximation, and subspace approximation problems.

בחזרה לאינדקס האירועים