We devise new cut sparsifiers that are related to the classical sparsification of Nagamochi and Ibaraki [Algorithmica, 1992], which is an algorithm that, given an unweighted graph G on n nodes and a parameter k, computes a subgraph with O(nk) edges that preserves all cuts of value up to k. We put forward the notion of a friendly cut sparsifier, which is a minor of $G$ that preserves all friendly cuts of value up to k, where a cut in $G$ is called friendly if every node has more edges connecting it to its own side of the cut than to the other side. We present an algorithm that, given a simple graph G, computes in almost-linear time a friendly cut sparsifier with $\tilde{O}(n \sqrt{k})$ edges. Using similar techniques, we also show how, given in addition a terminal set T, one can compute in almost-linear time a \emph{terminal sparsifier}, which preserves the minimum $st$-cut between every pair of terminals, with $\tilde{O}(n \sqrt{k} + |T| k)$ edges.
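To make the definition concrete, here is a small self-contained check (the name `is_friendly` and the toy graphs are our own illustration, not from the paper) that a given bipartition of a simple graph is a friendly cut:

```python
# Check whether a 2-sided cut of a simple graph is "friendly": every node must
# have strictly more edges to its own side than to the other side.
def is_friendly(adj, side):
    for u, neighbors in adj.items():
        same = sum(1 for v in neighbors if (v in side) == (u in side))
        cross = len(neighbors) - same
        if same <= cross:
            return False
    return True

# Two triangles joined by the bridge edge (2,3); the cut {0,1,2} vs {3,4,5}
# is friendly, since every node keeps both of its triangle edges on its side.
two_triangles = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(is_friendly(two_triangles, {0, 1, 2}))   # True

# In a 4-cycle, the cut {0,1} vs {2,3} gives each node exactly one edge on
# each side, so the "strictly more" condition fails.
four_cycle = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(is_friendly(four_cycle, {0, 1}))         # False
```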
Plugging these sparsifiers into the recent $n^{2+o(1)}$-time algorithms for constructing a Gomory-Hu tree of simple graphs, along with a relatively simple procedure for handling the unfriendly minimum cuts, we improve the running time for moderately dense graphs (e.g., with $m=n^{1.75}$ edges). In particular, assuming a linear-time Max-Flow algorithm, the new state-of-the-art for Gomory-Hu tree is the minimum between our $(m+n^{1.75})^{1+o(1)}$ and the known $m n^{1/2+o(1)}$.
We further investigate the limits of this approach and the possibility of better sparsification. Under the hypothesis that an $\tilde{O}(n)$-edge sparsifier that preserves all friendly minimum $st$-cuts can be computed efficiently, our upper bound improves to $\tilde{O}(m+n^{1.5})$, which is the best possible without breaking the cubic barrier for constructing Gomory-Hu trees in non-simple graphs.
We devise the first coreset for kernel k-Means, and use it to obtain new, more efficient algorithms. Kernel k-Means has superior clustering capability compared to classical k-Means, particularly when clusters are non-linearly separable, but it also introduces significant computational challenges. We address this computational issue by constructing a coreset, which is a reduced dataset that accurately preserves the clustering costs.
Our main result is the first coreset for kernel k-Means, whose size is independent of the number of input points $n$, and moreover is constructed in time near-linear in $n$. This result immediately implies new algorithms for kernel k-Means, such as a $(1+\epsilon)$-approximation in time near-linear in $n$, and a streaming algorithm using space and update time $\mathrm{poly}(k\epsilon^{-1}\log n)$.
We validate our coreset on various datasets with different kernels. Our coreset performs consistently well, achieving small errors while using very few points. We show that our coresets can speed up kernel k-Means++ (the kernelized version of the widely used k-Means++ algorithm), and we further use this faster kernel k-Means++ for spectral clustering. In both applications, we achieve up to 1000x speedup while the error is comparable to baselines that do not use coresets.
We consider an approximate version of the trace reconstruction problem, where the goal is to recover an unknown string $s\in\{0,1\}^n$ from $m$ traces (each trace is generated independently by passing $s$ through a probabilistic insertion-deletion channel with rate $p$). We present a deterministic algorithm for the average-case model, where $s$ is random, that uses only \emph{three} traces. It runs in near-linear time $\tilde O(n)$ and with high probability reports a string within edit distance $O(\epsilon p n)$ from $s$ for $\epsilon=\tilde O(p)$, which significantly improves over the straightforward bound of $O(pn)$.
Technically, our algorithm computes a $(1+\epsilon)$-approximate median of the three input traces. To prove its correctness, our probabilistic analysis shows that an approximate median is indeed close to the unknown $s$. To achieve a near-linear time bound, we have to bypass the well-known dynamic programming algorithm that computes an optimal median in time $O(n^3)$.
We provide the first coreset for clustering points in $R^d$ that have multiple missing values (coordinates). Previous coreset constructions only allow one missing coordinate. The challenge in this setting is that objective functions, like k-Means, are evaluated only on the set of available (non-missing) coordinates, which varies across points. Recall that an $\epsilon$-coreset of a large dataset is a small proxy, usually a reweighted subset of points, that $(1+\epsilon)$-approximates the clustering objective for every possible center set.
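The objective described above can be sketched in a few lines (our own illustration, not the paper's code), with missing coordinates encoded as NaNs and each squared distance evaluated only on the coordinates available in that point:

```python
import numpy as np

# k-Means cost when points may have missing coordinates (NaN): each squared
# distance is evaluated only on the coordinates available in that point.
def kmeans_cost_missing(points, centers):
    total = 0.0
    for x in points:
        mask = ~np.isnan(x)
        # squared distance to each center, restricted to x's available coordinates
        d2 = [np.sum((x[mask] - c[mask]) ** 2) for c in centers]
        total += min(d2)
    return total

points = np.array([[1.0, np.nan], [0.0, 0.0], [np.nan, 9.0]])
centers = np.array([[0.0, 0.0], [2.0, 10.0]])
print(kmeans_cost_missing(points, centers))  # 1.0 + 0.0 + 1.0 = 2.0
```

Note that the mask varies from point to point, which is exactly why standard coreset arguments do not apply directly.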
Our coresets for k-Means and k-Median clustering have size $(jk)^{O(\min(j,k))} (\epsilon^{-1} d \log n)^2$, where $n$ is the number of data points, $d$ is the dimension and $j$ is the maximum number of missing coordinates for each data point. We further design an algorithm to construct these coresets in near-linear time, and consequently improve a recent quadratic-time PTAS for k-Means with missing values [Eiben et al., SODA 2021] to near-linear time.
We validate our coreset construction, which is based on importance sampling and is easy to implement, on various real data sets. Our coreset exhibits a flexible tradeoff between coreset size and accuracy, and generally outperforms the uniform-sampling baseline. Furthermore, it significantly speeds up a Lloyd's-style heuristic for k-Means with missing values.
We design an $n^{2+o(1)}$-time algorithm that constructs a cut-equivalent (Gomory-Hu) tree of a simple graph on n nodes. This bound is almost-optimal in terms of n, and it improves on the recent $\tilde{O}(n^{2.5})$ bound by the authors (STOC 2021), which was the first to break the cubic barrier. Consequently, the All-Pairs Maximum-Flow (APMF) problem has time complexity $n^{2+o(1)}$, and for the first time in history, this problem can be solved faster than All-Pairs Shortest Paths (APSP). We further observe that an almost-linear time algorithm (in terms of the number of edges m) is not possible without first obtaining a subcubic algorithm for multigraphs.
Finally, we derandomize our algorithm, obtaining the first subcubic deterministic algorithm for Gomory-Hu Tree in simple graphs, showing that randomness is not necessary for beating the 1961 bound of $n-1$ max-flow computations. The upper bound is $\tilde{O}(n^{2\frac{2}{3}})$, and it would improve to $n^{2+o(1)}$ given a deterministic single-pair maximum-flow algorithm that runs in almost-linear time. The key novelty is in using a ``dynamic pivot'' technique instead of the randomized pivot selection that was central in recent works.
Graph sparsification has been studied extensively over the past two decades, culminating in spectral sparsifiers of optimal size (up to constant factors). Spectral hypergraph sparsification is a natural analogue of this problem, for which optimal bounds on the sparsifier size are not known, mainly because the hypergraph Laplacian is non-linear, and thus lacks the linear-algebraic structure and tools that have been so effective for graphs.
Our main contribution is the first algorithm for constructing $\epsilon$-spectral sparsifiers for hypergraphs with $O^*(n)$ hyperedges, where $O^*$ suppresses $(\epsilon^{-1} \log n)^{O(1)}$ factors. This bound is independent of the rank $r$ (maximum cardinality of a hyperedge), and is essentially best possible due to a recent bit-complexity lower bound of $\Omega(nr)$ for hypergraph sparsification.
This result is obtained by introducing two new tools. First, we give a new proof of spectral concentration bounds for sparsifiers of graphs; it avoids linear-algebraic methods, replacing, e.g., the usual application of the matrix Bernstein inequality, and therefore applies to the (non-linear) hypergraph setting. To achieve the result, we design a new sequence of hypergraph-dependent $\epsilon$-nets on the unit sphere in $R^n$. Second, we extend the weight-assignment technique of Chen, Khanna and Nagda [FOCS'20] to the spectral sparsification setting. Surprisingly, the number of spanning trees after the weight assignment can serve as a potential function guiding the reweighting process in the spectral setting.
Large matrices are often accessed as a row-order stream. We consider the setting where rows are time-sensitive (i.e. they expire), which can be described by the sliding-window row-order model, and provide the first (1+epsilon)-approximation of Schatten p-norms in this setting. Our main technical contribution is a proof that Schatten p-norms in row-order streams are smooth, and thus fit the smooth-histograms technique of Braverman and Ostrovsky (FOCS 2007) for sliding-window streams.
We consider the problem of sparse normal means estimation in a distributed setting with communication constraints. We assume there are M machines, each holding a d-dimensional observation of a K-sparse vector \mu corrupted by additive Gaussian noise. A central fusion machine is connected to the M machines in a star topology, and its goal is to estimate the vector \mu with a low communication budget. Previous works have shown that to achieve the centralized minimax rate for the l2 risk, the total communication must be high - at least linear in the dimension d. This phenomenon occurs, however, at very weak signals. We show that once the signal-to-noise ratio (SNR) is slightly higher, the support of \mu can be correctly recovered with much less communication. Specifically, we present two algorithms for the distributed sparse normal means problem, and prove that above a certain SNR threshold, with high probability, they recover the correct support with total communication that is sublinear in the dimension d. Furthermore, the communication decreases exponentially as a function of signal strength. If in addition KM<
Every undirected graph G has a (weighted) cut-equivalent tree, known as a Gomory-Hu tree: a tree on the same set of nodes such that, for every pair of nodes s,t, the minimum st-cut in the tree is also a minimum st-cut in G, of the same value.
We give the first subcubic-time algorithm that constructs such a tree for an unweighted graph G. Its time complexity is $\tilde{O}(n^{2.75})$, for n=|V(G)|; previously, only $\tilde{O}(n^3)$ was known, except for restricted cases like sparse graphs. Consequently, we obtain the first algorithm for All-Pairs Max-Flow in unweighted graphs that breaks the cubic-time barrier.
Under the hypothesis that single-pair Max-Flow can be computed in almost-linear time, our time bound improves further to $n^{2.5+o(1)}$. Even under this assumption, no previously known algorithm is subcubic.
Cut and spectral sparsification of graphs have numerous applications, including e.g. speeding up algorithms for cuts and Laplacian solvers. These powerful notions have recently been extended to hypergraphs, which are much richer and may offer new applications. However, the current bounds on the size of hypergraph sparsifiers are not as tight as the corresponding bounds for graphs.
Our first result is a polynomial-time algorithm that, given a hypergraph on $n$ vertices with maximum hyperedge size r, outputs an $\epsilon$-spectral sparsifier with $O^*(nr)$ hyperedges, where $O^*$ suppresses $(\epsilon^{-1} \log n)^{O(1)}$ factors. This size bound improves the two previous bounds: $O^*(n^3)$ [Soma and Yoshida, SODA'19] and $O^*(nr^3)$ [Bansal, Svensson and Trevisan, FOCS'19]. Our main technical tool is a new method for proving concentration of the nonlinear analogue of the quadratic form of the Laplacians for hypergraph expanders.
We complement this with lower bounds on the bit complexity of any compression scheme that $(1+\epsilon)$-approximates all the cuts in a given hypergraph, and hence also on the bit complexity of every $\epsilon$-cut/spectral sparsifier. These lower bounds are based on Ruzsa-Szemeredi graphs, and a particular instantiation yields an $\Omega(nr)$ lower bound on the bit complexity even for fixed constant $\epsilon$. This is tight up to polylogarithmic factors in $n$, due to recent hypergraph cut sparsifiers of [Chen, Khanna and Nagda, FOCS'20].
Finally, for directed hypergraphs, we present an algorithm that computes an $\epsilon$-spectral sparsifier with $O^*(n^2r^3)$ hyperarcs, where $r$ is the maximum size of a hyperarc. For small $r$, this improves over $O^*(n^3)$ known from [Soma and Yoshida, SODA'19], and is getting close to the trivial lower bound of $\Omega(n^2)$ hyperarcs.
We consider a natural generalization of the Steiner tree problem, the Steiner forest problem, in the Euclidean plane: the input is a multiset $X \subseteq R^2$, partitioned into k color classes $C_1, C_2, \ldots, C_k \subseteq X$. The goal is to find a minimum-cost Euclidean graph G such that every color class C_i is connected in G. We study this Steiner forest problem in the streaming setting, where the stream consists of insertions and deletions of points to $X$. Each input point $x\in X$ arrives with its color $color(x) \in [k]$, and as usual for dynamic geometric streams, the input points are restricted to the discrete grid $\{0,\ldots, \Delta\}^2$.
We design a single-pass streaming algorithm that uses $poly(k \log\Delta)$ space and time, and estimates the cost of an optimal Steiner forest solution within ratio arbitrarily close to the famous Euclidean Steiner ratio $\alpha_2$ (currently $1.1547 \leq \alpha_2 \leq 1.214$). Our approach relies on a novel combination of streaming techniques, like sampling and linear sketching, with the classical dynamic-programming framework for geometric optimization problems, which usually requires large memory and has so far not been applied in the streaming setting.
We complement our streaming algorithm for the Steiner forest problem with simple arguments showing that any finite approximation requires $\Omega(k)$ bits of space. In addition, our approximation ratio is currently the best even for streaming Steiner tree, i.e., $k=1$.
Many real-world data sets are sparse or almost sparse. One method to measure this for a matrix $A\in R^{n\times n}$ is the numerical sparsity, denoted $\ns(A)$, defined as the minimum $k\geq 1$ such that $||a||_1/||a||_2 \leq \sqrt{k}$ for every row and every column $a$ of $A$. This measure is smooth, and it clearly never exceeds the number of non-zeros in the row/column $a$.
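The definition above can be evaluated directly: since the constraint is $||a||_1/||a||_2 \leq \sqrt{k}$ for every row and column, the numerical sparsity is the squared $\ell_1/\ell_2$ ratio, maximized over all rows and columns. A minimal sketch (the helper name is ours):

```python
import numpy as np

# Numerical sparsity ns(A): the minimum k >= 1 with ||a||_1/||a||_2 <= sqrt(k)
# for every row and column a, i.e. the max of (||a||_1/||a||_2)^2 (at least 1).
def numerical_sparsity(A):
    vectors = list(A) + list(A.T)          # all rows, then all columns
    ratios = [np.sum(np.abs(a)) ** 2 / np.sum(a ** 2) for a in vectors if np.any(a)]
    return max([1.0] + ratios)

print(numerical_sparsity(np.eye(4)))        # 1.0 (exactly 1-sparse rows/columns)
print(numerical_sparsity(np.ones((4, 4))))  # 4.0 (dense rows of length 4)
```

As the two examples show, the measure coincides with the usual sparsity on exactly-sparse vectors and degrades gracefully for dense-but-flat ones.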
The seminal work of Achlioptas and McSherry [2007] has put forward the question of approximating an input matrix $A$ by entrywise sampling. More precisely, the goal is to quickly compute a sparse matrix $\tilde{A}$ satisfying $||A - \tilde{A}||_2 \leq \epsilon ||A||_2$ (i.e., additive spectral approximation) given an error parameter $\epsilon>0$. The known schemes sample and rescale a small fraction of entries from $A$.
We propose a scheme that sparsifies an almost-sparse matrix $A$ --- it produces a matrix $\tilde{A}$ with $O(\epsilon^{-2}\ns(A) n\ln n)$ non-zero entries with high probability. We also prove that this upper bound on $\nnz(\tilde{A})$ is tight up to logarithmic factors. Moreover, our upper bound improves when the spectrum of $A$ decays quickly (roughly replacing $n$ with the stable rank of $A$). Our scheme can be implemented in time $O(\mathsf{nnz}(A))$ when $||A||_2$ is given. Previously, a similar upper bound was obtained by Achlioptas et al. [2013], but only for a restricted class of inputs that does not even include symmetric or covariance matrices. Finally, we demonstrate two applications of these sampling techniques, to faster approximate matrix multiplication, and to ridge regression using sparse preconditioners.
We study approximation algorithms for variants of the median string problem, which asks for a string that minimizes the sum of edit distances from a given set of m strings of length n. Only the straightforward 2-approximation is known for this NP-hard problem. This problem is motivated e.g. by computational biology, and belongs to the class of median problems (over different metric spaces), which are fundamental tasks in data analysis.
Our main result is for the Ulam metric, where all strings are permutations over $[n]$ and each edit operation moves a symbol (deletion plus insertion). We devise for this problem an algorithm that breaks the 2-approximation barrier, i.e., computes a $(2-\delta)$-approximate median permutation for some constant $\delta>0$ in time $\tilde{O}(nm^2+n^3)$. We further use these techniques to achieve a $(2-\delta)$-approximation for the median string problem in the special case where the median is restricted to length $n$ and the optimal objective is large, namely $\Omega(mn)$.
We also design an approximation algorithm for the following probabilistic model of the Ulam median: the input consists of m perturbations of an (unknown) permutation x, each generated by moving every symbol to a random position with probability (a parameter) \epsilon>0. Our algorithm computes with high probability a (1+o(1/\epsilon))-approximate median permutation in time O(mn^2+n^3).
Min-Cut queries are fundamental: Preprocess an undirected edge-weighted graph, to quickly report a minimum-weight cut that separates a query pair of nodes s,t. The best data structure known for this problem simply builds a cut-equivalent tree, discovered 60 years ago by Gomory and Hu, who also showed how to construct it using n-1 minimum st-cut computations. Using state-of-the-art algorithms for minimum $st$-cut (Lee and Sidford, FOCS 2014), one can construct the tree in time $\tilde{O}(mn^{3/2})$, which is also the preprocessing time of the data structure. (Throughout, we focus on polynomially-bounded edge weights, noting that faster algorithms are known for small/unit edge weights, and use n and m for the number of nodes and edges in the graph.)
Our main result shows the following equivalence: Cut-equivalent trees can be constructed in near-linear time if and only if there is a data structure for Min-Cut queries with near-linear preprocessing time and polylogarithmic (amortized) query time, and even if the queries are restricted to a fixed source. That is, cut-equivalent trees are an essentially optimal solution for Min-Cut queries. This equivalence holds even for every minor-closed family of graphs, such as bounded-treewidth graphs, for which a two-decade old data structure (Arikati, Chaudhuri, and Zaroliagis, J. Algorithms 1998) implies the first near-linear time construction of cut-equivalent trees.
Moreover, unlike all previous techniques for constructing cut-equivalent trees, ours is robust to relying on approximation algorithms. In particular, using the almost-linear time algorithm for $(1+\epsilon)$-approximate minimum $st$-cut (Kelner, Lee, Orecchia, and Sidford, SODA 2014), we can construct a $(1+\epsilon)$-approximate flow-equivalent tree (which is a slightly weaker notion) in time $n^{2+o(1)}$. This leads to the first $(1+\epsilon)$-approximation for All-Pairs Max-Flow that runs in time $n^{2+o(1)}$, and matches the output size almost-optimally.
Coresets are modern data-reduction tools that are widely used in data analysis to improve efficiency in terms of running time, space and communication complexity. Our main result is a fast algorithm to construct a small coreset for k-Median in (the shortest-path metric of) an excluded-minor graph. Specifically, we give the first coreset of size that depends only on $k$, $\epsilon$ and the excluded-minor size, and our running time is quasi-linear (in the size of the input graph).
The main innovation in our new algorithm is that it is iterative: it first reduces the $n$ input points to roughly $O(\log n)$ reweighted points, then to $O(\log\log n)$, and so forth until the size is independent of $n$. Each step in this iterative size reduction is based on the importance sampling framework of Feldman and Langberg (STOC 2011), with a crucial adaptation that reduces the number of \emph{distinct points}, by employing a terminal embedding (where low distortion is guaranteed only for the distance from every terminal to all other points). Our terminal embedding is technically involved and relies on shortest-path separators, a standard tool in planar and excluded-minor graphs.
Furthermore, our new algorithm is applicable also in Euclidean metrics, by simply using a recent terminal embedding result of Narayanan and Nelson, (STOC 2019), which extends the Johnson-Lindenstrauss Lemma. We thus obtain an efficient coreset construction in high-dimensional Euclidean spaces, thereby matching and simplifying state-of-the-art results (Sohler and Woodruff, FOCS 2018; Huang and Vishnoi, STOC 2020).
In addition, we employ terminal embedding with additive distortion to obtain small coresets in graphs with bounded highway dimension, and apply our coresets to obtain improved approximation schemes, e.g., an improved PTAS for planar k-Median via a new centroid set.
We consider the rooted orienteering problem in Euclidean space: Given $n$ points P in R^d, a root point $s\in P$ and a budget B>0, find a path that starts from s, has total length at most B, and visits as many points of P as possible. This problem is known to be NP-hard, hence we study (1-\delta)-approximation algorithms. The previous Polynomial-Time Approximation Scheme (PTAS) for this problem, due to Chen and Har-Peled (2008), runs in time $n^{O(d\sqrt{d}/\delta)}(\log n)^{(d/\delta)^{O(d)}}$, and improving on this time bound was left as an open problem. Our main contribution is a PTAS with a significantly improved time complexity of $n^{O(1/\delta)}(\log n)^{(d/\delta)^{O(d)}}$.
A known technique for approximating the orienteering problem is to reduce it to solving 1/\delta correlated instances of rooted k-TSP (a k-TSP tour is one that visits at least $k$ points). However, the k-TSP tours in this reduction must achieve a certain excess guarantee (namely, their length can surpass the optimum length only in proportion to a parameter of the optimum called excess) that is stronger than the usual (1+\delta)-approximation. Our main technical contribution is to improve the running time of these k-TSP variants, particularly their dependence on the dimension $d$. Indeed, our running time is polynomial even for a moderately large dimension, roughly up to $d=O(\log\log n)$ instead of $d=O(1)$.
The edit distance quantifies how similar two strings are by counting the minimum number of character insertions, deletions, and substitutions required to transform one string into the other. A simple dynamic program computes the edit distance between two strings of length $n$ in $O(n^2)$ time, and a more sophisticated algorithm runs in time $O(n+t^2)$ when the edit distance is $t$ [Landau, Myers and Schmidt, SICOMP 1998]. In pursuit of faster running times, the last couple of decades have seen a flurry of research on approximating edit distance, including a polylogarithmic approximation in near-linear time [Andoni, Krauthgamer and Onak, FOCS 2010], and a constant-factor approximation in subquadratic time [Chakrabarty, Das, Goldenberg, Kouck\'y and Saks, FOCS 2018].
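For reference, the simple $O(n^2)$ dynamic program mentioned above is the textbook Wagner-Fischer recurrence:

```python
# Textbook O(n^2) dynamic program for edit distance (insertions, deletions,
# substitutions); dp[i][j] is the edit distance between the first i
# characters of s and the first j characters of t.
def edit_distance(s, t):
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete s[i-1]
                           dp[i][j - 1] + 1,         # insert t[j-1]
                           dp[i - 1][j - 1] + cost)  # substitute (or match)
    return dp[m][n]

print(edit_distance("kitten", "sitting"))  # 3
```

Even reading the whole input takes $\Omega(n)$ time, which is why the sublinear-time algorithms below must sample rather than scan.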
We study sublinear-time algorithms for small edit distance, which has been investigated extensively because of its numerous applications. Our main result is an algorithm for distinguishing whether the edit distance is at most $t$ or at least $t^2$ (the quadratic gap problem) in time $\tilde{O}(\frac{n}{t}+t^3)$. This time bound is sublinear roughly for all $t \in [\omega(1), o(n^{1/3})]$, which was not known before. The best previous algorithms solve this problem in sublinear time only for $t=\omega(n^{1/3})$ [Andoni and Onak, STOC 2009].
Our algorithm is based on a new approach that adaptively switches between uniform sampling and reading contiguous blocks of the input strings. In contrast, all previous algorithms choose which coordinates to query non-adaptively. Moreover, it can be extended to solve the t vs t^{2-\epsilon} gap problem in time $\tilde{O}(\frac{n}{t^{1-\epsilon}}+t^3)$.
We investigate for which metric spaces the performance of distance labeling and of $\ell_\infty$-embeddings differ, and how significant can this difference be. Recall that a distance labeling is a distributed representation of distances in a metric space $(X,d)$, where each point $x\in X$ is assigned a succinct label, such that the distance between any two points $x,y \in X$ can be approximated given only their labels. A highly structured special case is an embedding into $\ell_\infty$, where each point $x\in X$ is assigned a vector $f(x)$ such that $\|f(x)-f(y)\|_\infty$ is approximately $d(x,y)$. The performance of a distance labeling or an $\ell_\infty$-embedding is measured via its distortion and its label-size/dimension.
We also study the analogous question for the prioritized versions of these two measures. Here, a priority order $\pi=(x_1,\dots,x_n)$ of the point set $X$ is given, and higher-priority points should have shorter labels. Formally, a distance labeling has prioritized label-size $\alpha(.)$ if every $x_j$ has label size at most $\alpha(j)$. Similarly, an embedding $f: X \to \ell_\infty$ has prioritized dimension $\alpha(.)$ if $f(x_j)$ is non-zero only in the first $\alpha(j)$ coordinates. In addition, we compare these prioritized measures to their classical (worst-case) counterparts.
We answer these questions in several scenarios, uncovering a surprisingly diverse range of behaviors. First, in some cases labelings and embeddings have very similar worst-case performance, but in other cases there is a huge disparity. However, in the prioritized setting, we most often find a strict separation between the performance of labelings and embeddings. Finally, when comparing the classical and prioritized settings, we find that the worst-case bound for label size often ``translates'' to a prioritized one, but we also find a surprising exception to this rule.
The spectrum of a matrix contains important structural information about the underlying data, and hence there is considerable interest in computing various functions of the matrix spectrum. A fundamental example for such functions is the l_p-norm of the spectrum, called the Schatten p-norm of the matrix. Large matrices representing real-world data are often sparse (most entries are zeros) or doubly sparse, i.e., sparse in both rows and columns. These large matrices are usually accessed as a stream of updates, typically organized in row-order. In this setting, where space (memory) is the limiting resource, computing spectral functions is an expensive task and known algorithms require space that is polynomial in the dimension of the matrix, even for sparse matrices. Thus, it is highly desirable to design algorithms requiring significantly smaller space.
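As a reference definition: the Schatten p-norm is the $\ell_p$-norm of the singular values. The naive dense computation below (our own sketch) needs the full matrix in memory, which is exactly the cost the streaming algorithm avoids:

```python
import numpy as np

# Schatten p-norm of A: the l_p norm of the vector of singular values of A.
def schatten_norm(A, p):
    s = np.linalg.svd(A, compute_uv=False)
    return np.sum(s ** p) ** (1.0 / p)

A = np.diag([3.0, 4.0])
print(schatten_norm(A, 2))  # 5.0  (Schatten-2 = Frobenius norm)
print(schatten_norm(A, 1))  # 7.0  (Schatten-1 = nuclear/trace norm)
```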
We answer this challenge by providing the first algorithm that uses space independent of the matrix dimension to compute the Schatten p-norm of a doubly-sparse matrix presented in row order. Instead, our algorithm uses space polynomial in the sparsity parameter k and makes O(p) passes over the data stream. We further prove that multiple passes are unavoidable in this setting and show several extensions of our primary technique, including stronger upper bounds for special matrix families, algorithms for the more difficult turnstile model, and a trade-off between space requirements and number of passes.
We initiate the study of coresets for clustering in graph metrics, i.e., the shortest-path metric of edge-weighted graphs. Such clustering problems are essential to data analysis and used for example in road networks and data visualization. A coreset is a compact summary of the data that approximately preserves the clustering objective for every possible center set, and it offers significant efficiency improvements in terms of running time, storage, and communication, including in streaming and distributed settings. Our main result is a near-linear time construction of a coreset for k-Median in a general graph G, with size $O_{\epsilon, k}(tw(G))$ where $tw(G)$ is the treewidth of G, and we complement the construction with a nearly-tight size lower bound. The construction is based on the framework of Feldman and Langberg [STOC 2011], and our main technical contribution, as required by this framework, is a uniform bound of $O(tw(G))$ on the shattering dimension under any point weights. We validate our coreset on real-world road networks, and our scalable algorithm constructs tiny coresets with high accuracy, which translates to a massive speedup of existing approximation algorithms such as local search for graph k-Median.
Orthogonal Matching Pursuit (OMP) is a popular algorithm to estimate an unknown sparse vector from multiple linear measurements of it. Assuming exact sparsity and that the measurements are corrupted by additive Gaussian noise, the success of OMP is often formulated as exactly recovering the support of the sparse vector. Several authors derived a sufficient condition for exact support recovery by OMP with high probability depending on the signal-to-noise ratio, defined as the magnitude of the smallest non-zero coefficient of the vector divided by the noise level. We make two contributions. First, we derive a slightly sharper sufficient condition for two variants of OMP, in which either the sparsity level or the noise level is known. Next, we show that this sharper sufficient condition is tight, in the following sense: for a wide range of problem parameters, there exists a dictionary of linear measurements and a sparse vector with a signal-to-noise ratio slightly below that of the sufficient condition, for which with high probability OMP fails to recover its support. Finally, we present simulations which illustrate that our condition is tight for a much broader range of dictionaries.
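A minimal sketch of the OMP iteration (standard textbook form, not the paper's specific variants): greedily pick the dictionary column most correlated with the residual, refit by least squares on the selected support, and repeat.

```python
import numpy as np

# Orthogonal Matching Pursuit: k greedy rounds of column selection, each
# followed by a least-squares refit on the current support.
def omp(A, y, k):
    residual, support = y.astype(float), []
    for _ in range(k):
        j = int(np.argmax(np.abs(A.T @ residual)))   # most correlated column
        support.append(j)
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef          # orthogonalize against support
    return sorted(support), coef  # coef is ordered by selection, not by index

# Noiseless toy example with an orthonormal dictionary: exact support recovery.
A = np.eye(6)
y = np.zeros(6); y[1], y[4] = 2.0, 3.0
support, coef = omp(A, y, k=2)
print(support)  # [1, 4]
```

In the noisy setting the abstract studies, the question is precisely when this greedy loop still returns the true support with high probability.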
We study algorithms for the sliding-window model, an important variant of the data-stream model, in which the goal is to compute some function of a fixed-length suffix of the stream. We extend the smooth-histogram framework of Braverman and Ostrovsky (FOCS 2007) to almost-smooth functions, which includes all subadditive functions. Specifically, we show that if a subadditive function can be $(1+\epsilon)$-approximated in the insertion-only streaming model, then it can be $(2+\epsilon)$-approximated also in the sliding-window model with space complexity larger by factor $O(\epsilon^{-1}\log w)$, where $w$ is the window size.
We demonstrate how our framework yields new approximation algorithms with relatively little effort for a variety of problems that do not admit the smooth-histogram technique. For example, in the frequency-vector model, a symmetric norm is subadditive and thus we obtain a sliding-window $(2+\epsilon)$-approximation algorithm for it. Another example is for streaming matrices, where we derive a new sliding-window $(\sqrt{2}+\epsilon)$-approximation algorithm for Schatten $4$-norm. We then consider graph streams and show that many graph problems are subadditive, including maximum submodular matching, minimum vertex-cover, and maximum $k$-cover, thereby deriving sliding-window $O(1)$-approximation algorithms for them almost for free (using known insertion-only algorithms). Finally, we design for every $d\in (1,2]$ an artificial function, based on the maximum-matching size, whose almost-smoothness parameter is exactly $d$.
We design coresets for Ordered k-Median, a generalization of classical clustering problems such as k-Median and k-Center, that offers a more flexible data analysis, like easily combining multiple objectives (e.g., to increase fairness or for Pareto optimization). Its objective function is defined via the Ordered Weighted Averaging (OWA) paradigm of Yager (1988), where data points are weighted according to a predefined weight vector, but in order of their contribution to the objective (distance from the centers).
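The OWA objective can be stated in a few lines; here is a one-dimensional sketch (our own illustration), showing how the choice of weight vector recovers k-Median and k-Center as special cases:

```python
# Ordered k-Median objective via Ordered Weighted Averaging: sort the points'
# distances to their nearest center in non-increasing order, then take the
# inner product with a predefined weight vector w.
def ordered_cost(points, centers, w):
    dists = sorted((min(abs(p - c) for c in centers) for p in points), reverse=True)
    return sum(wi * di for wi, di in zip(w, dists))

points, centers = [0.0, 1.0, 10.0], [0.0]
print(ordered_cost(points, centers, [1, 1, 1]))  # 11.0 -> k-Median (all-ones w)
print(ordered_cost(points, centers, [1, 0, 0]))  # 10.0 -> k-Center (top-1 w)
```

Because the weights are applied to the *sorted* distances, which point receives which weight depends on the center set; a simultaneous coreset must approximate this for all weight vectors at once.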
A powerful data-reduction technique, called a coreset, is to summarize a point set X in R^d into a small (weighted) point set X', such that for every set of k potential centers, the objective value of the coreset X' approximates that of X within factor $1\pm \epsilon$. When there are multiple objectives (weights), the above standard coreset might have limited usefulness, whereas in a simultaneous coreset, which was introduced recently by Bachem, Lucic and Lattanzi (2018), the above approximation holds for all weights (in addition to all centers). Our main result is a construction of a simultaneous coreset of size $O_{\epsilon, d}(k^2 \log^2 |X|)$ for Ordered k-Median.
To validate the efficacy of our coreset construction we ran experiments on a real geographical data set. We find that our algorithm produces a small coreset, which translates to a massive speedup of clustering computations, while maintaining high accuracy for a range of weights.
We investigate the time-complexity of the All-Pairs-Max-Flow problem: Given a graph with $n$ nodes and $m$ edges, compute for all pairs of nodes the maximum-flow value between them. If Max-Flow (the version with a given source-sink pair $s,t$) can be solved in time $T(m)$, then $O(n^2) \cdot T(m)$ is a trivial upper bound. But can we do better?
For directed graphs, recent results in fine-grained complexity suggest that this time bound is essentially optimal. In contrast, for undirected graphs with edge capacities, a seminal algorithm of Gomory and Hu (1961) runs in much faster time $O(n)\cdot T(m)$. Under the plausible assumption that Max-Flow can be solved in near-linear time $m^{1+o(1)}$, this half-century-old algorithm yields an $nm^{1+o(1)}$ bound. Several other algorithms have been designed through the years, including $\tilde{O}(mn)$ time for unit-capacity edges (unconditionally), but none of them break the $O(mn)$ barrier. Meanwhile, no super-linear lower bound was shown for undirected graphs.
We design the first hardness reductions for All-Pairs-Max-Flow in undirected graphs, giving an essentially optimal lower bound for the node-capacities setting. For edge capacities, our efforts to prove similar lower bounds have failed, but we have discovered a surprising new algorithm that breaks the $O(mn)$ barrier for graphs with unit-capacity edges! Assuming $T(m)=m^{1+o(1)}$, our algorithm runs in time $m^{3/2 +o(1)}$ and outputs a cut-equivalent tree (similarly to the Gomory-Hu algorithm). Even with current Max-Flow algorithms we improve the state of the art as long as $m=O(n^{5/3-\varepsilon})$. Finally, we explain the lack of lower bounds by proving a non-reducibility result. This result is based on a new quasi-linear time $\tilde{O}(m)$ non-deterministic algorithm for constructing a cut-equivalent tree and may be of independent interest.
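The trivial $O(n^2)\cdot T(m)$ upper bound is simply one Max-Flow call per node pair; a minimal sketch using networkx on an illustrative example graph (not from the paper):

```python
import itertools
import networkx as nx

# Trivial all-pairs scheme: one max-flow call per unordered node pair,
# i.e., O(n^2) invocations of a Max-Flow routine (the O(n^2)*T(m) bound).
G = nx.Graph()
G.add_edge("a", "b", capacity=3)
G.add_edge("b", "c", capacity=2)
G.add_edge("a", "c", capacity=1)

flows = {
    (s, t): nx.maximum_flow_value(G, s, t, capacity="capacity")
    for s, t in itertools.combinations(G.nodes, 2)
}
print(flows)  # {('a','b'): 4, ('a','c'): 3, ('b','c'): 3}
```

Note that on this undirected example the three values take only two distinct numbers, a redundancy that the Gomory-Hu approach exploits to get away with $n-1$ Max-Flow calls instead of $\binom{n}{2}$.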
Most known algorithms in the streaming model of computation aim to approximate a single function such as an $\ell_p$-norm. In 2009, Nelson [https://sublinear.info, Open Problem 30] asked whether it is possible to design universal algorithms that simultaneously approximate multiple functions of the stream. In this paper we answer the question of Nelson for the class of subset $\ell_0$-norms in the insertion-only frequency-vector model. Given a family of subsets $F\subset 2^{[n]}$, we provide a single streaming algorithm that can $(1\pm \epsilon)$-approximate the subset-norm for every $S\in F$. Here, the subset-$\ell_p$-norm of $v\in R^n$ with respect to set $S\subseteq [n]$ is the $\ell_p$-norm of vector $v_{|S}$ (which denotes restricting $v$ to $S$, by zeroing all other coordinates).
Our main result is a near-tight characterization of the space complexity of every family $F\subset 2^{[n]}$ of subset-$\ell_0$-norms in insertion-only streams, expressed in terms of the ``heavy-hitter dimension'' of $F$, a new combinatorial quantity that is related to the VC-dimension of $F$. In contrast, we show that the more general turnstile and sliding-window models require a much larger space usage. All these results easily extend to $\ell_1$.
In addition, we design algorithms for two other subset-$\ell_p$-norm variants. These can be compared to the Priority Sampling algorithm of Duffield, Lund and Thorup [JACM 2007], which achieves additive approximation $\epsilon\|v\|$ for all possible subsets ($\mathcal{S}=2^{[n]}$) in the entry-wise update model. One of our algorithms extends this algorithm to handle turnstile updates, and another one achieves multiplicative approximation given a family $\mathcal{S}$.
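For concreteness, the subset-$\ell_p$-norm from the definition above can be sketched as follows (a hypothetical helper, assuming 0-indexed coordinates):

```python
import numpy as np

def subset_norm(v, S, p):
    """Subset-l_p norm: the l_p norm of v restricted to coordinates in S
    (all other coordinates zeroed).  p=0 counts nonzero entries."""
    v = np.asarray(v, dtype=float)
    restricted = np.zeros_like(v)
    idx = list(S)
    restricted[idx] = v[idx]
    if p == 0:
        return int(np.count_nonzero(restricted))
    return float(np.sum(np.abs(restricted) ** p) ** (1.0 / p))

v = np.array([3.0, 0.0, -4.0, 1.0])
print(subset_norm(v, {0, 2}, 0))  # 2 nonzeros among coordinates {0, 2}
print(subset_norm(v, {0, 2}, 1))  # |3| + |-4| = 7.0
```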
We study sublinear algorithms that solve linear systems locally. In the classical version of this problem the input is a matrix $S\in R^{n\times n}$ and a vector $b\in R^n$ in the range of $S$, and the goal is to output $x\in R^n$ satisfying $Sx=b$. For the case when the matrix $S$ is symmetric diagonally dominant (SDD), the breakthrough algorithm of Spielman and Teng [STOC 2004] approximately solves this problem in near-linear time (in the input size which is the number of non-zeros in $S$), and subsequent papers have further simplified, improved, and generalized the algorithms for this setting.
Here we focus on computing one (or a few) coordinates of $x$, which potentially allows for sublinear algorithms. Formally, given an index $u\in [n]$ together with $S$ and $b$ as above, the goal is to output an approximation $\hat{x}_u$ for $x^*_u$, where $x^*$ is a fixed solution to $Sx=b$.
Our results show that there is a qualitative gap between SDD matrices and the more general class of positive semidefinite (PSD) matrices. For SDD matrices, we develop an algorithm that approximates a single coordinate $x_{u}$ in time that is polylogarithmic in $n$, provided that $S$ is sparse and has a small condition number (e.g., Laplacian of an expander graph). The approximation guarantee is additive $| \hat{x}_u-x^*_u | \le \epsilon \| x^* \|_\infty$ for accuracy parameter $\epsilon>0$. We further prove that the condition-number assumption is necessary and tight.
In contrast to the SDD matrices, we prove that for certain PSD matrices $S$, the running time must be at least polynomial in $n$. This holds even when one wants to obtain the same additive approximation, and $S$ has bounded sparsity and condition number.
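As a toy illustration of why a small condition number helps (this is a dense global iteration, not the paper's sublinear algorithm): for a well-conditioned SDD matrix, $x=S^{-1}b$ equals a Neumann series whose truncation error decays geometrically, and one can read off a single coordinate of the truncated series:

```python
import numpy as np

# Illustration only: for a well-conditioned SDD matrix, x = S^{-1} b equals
#   x = sum_{i>=0} (I - D^{-1} S)^i D^{-1} b,   D = diag(S),
# and truncating the series gives a geometrically accurate estimate of x_u.
def neumann_coordinate(S, b, u, T=100):
    D_inv = 1.0 / np.diag(S)
    M = np.eye(len(b)) - D_inv[:, None] * S   # iteration matrix I - D^{-1}S
    x, term = np.zeros_like(b), D_inv * b
    for _ in range(T):
        x += term
        term = M @ term
    return x[u]

# strictly diagonally dominant symmetric matrix (hence SDD and invertible)
S = np.array([[ 3.0, -1.0, -1.0],
              [-1.0,  3.0, -1.0],
              [-1.0, -1.0,  3.0]])
b = np.array([1.0, 0.0, -1.0])
x_star = np.linalg.solve(S, b)
print(abs(neumann_coordinate(S, b, 0, T=100) - x_star[0]) < 1e-8)  # True
```

Sublinear local solvers evaluate such series terms by short explorations (e.g., random walks) around the queried coordinate rather than by dense matrix products, which is where the sparsity and condition-number assumptions enter.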
We reprove three known (algorithmic) bounds for terminal-clustering problems, using a single framework that leads to simpler proofs. The input in terminal-clustering problems is a metric space (X,d) (possibly arising from a graph) and a subset $K\subset X$ of terminals, and the goal is to partition the points X, such that each part, called cluster, contains exactly one terminal (possibly with connectivity requirements), so as to minimize some objective. The three bounds we reprove are for Steiner Point Removal on trees [Gupta, SODA 2001], Metric 0-Extension for bounded doubling dimension [Lee and Naor, unpublished 2003], and Connected Metric 0-Extension [Englert et al., SICOMP 2014].
A natural approach is to cluster each point with its closest terminal, which would partition X into so-called Voronoi cells, but this approach can fail miserably due to its stringent cluster boundaries. A now-standard fix is to enlarge each Voronoi cell computed in some order to obtain disjoint clusters, which defines the Noisy-Voronoi algorithm. This method, first proposed by Calinescu, Karloff and Rabani [SICOMP 2004], was employed successfully to provide state-of-the-art results for terminal-clustering problems on general metrics. However, for restricted families of metrics, e.g., trees and doubling metrics, only more complicated, ad-hoc algorithms are known. Our main contribution is to demonstrate that the Noisy-Voronoi algorithm is not only applicable to restricted metrics, but actually leads to relatively simple algorithms and analyses.
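A simplified sketch of the Noisy-Voronoi idea (with a fixed hypothetical enlargement factor $1+\delta$; the actual analyses use carefully chosen random magnitudes):

```python
import math, random

def noisy_voronoi(points, terminals, dist, delta=0.25, seed=0):
    """Simplified Noisy-Voronoi sketch: terminals, in random order, claim all
    still-unassigned points within (1+delta) times their Voronoi distance.
    (Real analyses use carefully chosen random magnitudes, not a fixed delta.)"""
    rng = random.Random(seed)
    order = list(terminals)
    rng.shuffle(order)
    cluster = {}
    for t in order:
        for p in points:
            if p in cluster:
                continue
            d_vor = min(dist(p, s) for s in terminals)  # Voronoi distance
            if dist(p, t) <= (1 + delta) * d_vor:
                cluster[p] = t
    return cluster

points = [(x, 0) for x in range(6)]
terminals = [(0, 0), (5, 0)]
assignment = noisy_voronoi(points, terminals, math.dist)
print(all(p in assignment for p in points))  # every point gets a cluster
```

Every point is claimed at the latest by its own closest terminal, so the clusters partition the whole point set while tolerating the "noisy" enlarged boundaries.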
The relationship between the sparsest cut and the maximum concurrent multi-flow in graphs has been studied extensively. For general graphs, the worst-case gap between these two quantities is now settled: When there are $k$ terminal pairs, the flow-cut gap is O(log k), and this is tight. But when topological restrictions are placed on the flow network, the situation is far less clear. In particular, it has been conjectured that the flow-cut gap in planar networks is O(1), while the known bounds place the gap somewhere between 2 (Lee and Raghavendra, 2003) and O(sqrt{log k}) (Rao, 1999).
A seminal result of Okamura and Seymour (1981) shows that when all the terminals of a planar network lie on a single face, the flow-cut gap is exactly $1$. This setting can be generalized by considering planar networks where the terminals lie on one of $\gamma > 1$ faces in some fixed planar drawing. Lee and Sidiropoulos (2009) proved that the flow-cut gap is bounded by a function of $\gamma$, and Chekuri, Shepherd, and Weibel (2013) showed that the gap is at most $3^{\gamma}$. We significantly improve these asymptotics by establishing that the flow-cut gap is $O(\log \gamma)$. This is achieved by showing that the edge-weighted shortest-path metric induced on the terminals admits a stochastic embedding into trees with distortion $O(\log \gamma)$. The latter result is tight, e.g., for a square planar lattice on $\Theta(\gamma)$ vertices.
The preceding results refer to the setting of edge-capacitated networks. For vertex-capacitated networks, it can be significantly more challenging to control flow-cut gaps. While there is no exact vertex-capacitated version of the Okamura-Seymour Theorem, an approximate version holds; Lee, Mendel, and Moharrami (2015) showed that the vertex-capacitated flow-cut gap is O(1) on planar networks whose terminals lie on a single face. We prove that the flow-cut gap is $O(\gamma)$ for vertex-capacitated instances when the terminals lie on at most $\gamma$ faces. In fact, this result holds in the more general setting of submodular vertex capacities.
We introduce a batch version of sparse recovery, where the goal is to report a sequence of vectors $A_1',\ldots,A_m' \in R^n$ that estimate unknown signals $A_1,\ldots,A_m \in R^n$ using a few linear measurements, each involving exactly one signal vector, under an assumption of average sparsity. More precisely, we want to have $$ \sum_{j \in [m]}{||A_j- A_j'||_p^p} \le C \cdot \min \{ \sum_{j \in [m]}{||A_j - A_j^*||_p^p} \} $$ for predetermined constants $C \ge 1$ and $p$, where the minimum is over all $A_1^*,\ldots,A_m^*\in R^n$ that are $k$-sparse on average. We assume $k$ is given as input, and ask for the minimal number of measurements required to satisfy the inequality above. The special case $m=1$ is known as stable sparse recovery and has been studied extensively.
We resolve the question for $p=1$ up to polylogarithmic factors, by presenting a randomized adaptive scheme that performs $\tilde{O}(km)$ measurements and with high probability has output satisfying the inequality above, for arbitrarily small $C > 1$. Finally, we show that adaptivity is necessary for every non-trivial scheme.
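For intuition about the benchmark on the right-hand side: when $p=1$, the best approximation that is $k$-sparse on average simply keeps the $km$ largest-magnitude entries across all $m$ signals, so the optimum has a closed form (an illustrative helper, not part of the measurement scheme):

```python
import numpy as np

def batch_l1_benchmark(A, k):
    """Optimal p=1 benchmark: with a total support budget of k*m across the
    m signals, the best 'k-sparse on average' approximation keeps the k*m
    largest-magnitude entries, so the error is the l1 mass of the rest."""
    A = np.asarray(A, dtype=float)
    m = A.shape[0]
    mags = np.sort(np.abs(A).ravel())[::-1]
    return float(mags[k * m:].sum())  # mass outside the k*m largest entries

A = np.array([[5.0, 1.0, 0.0],
              [0.0, 2.0, 0.5]])
print(batch_l1_benchmark(A, k=1))  # keep {5, 2}; residual = 1 + 0.5 = 1.5
```

Of course, the recovery algorithm cannot compute this directly, since it sees the signals only through a few linear measurements.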
Given a directed graph, the vertex connectivity from $u$ to $v$ is the maximum number of internally vertex-disjoint paths from $u$ to $v$. We design faster algorithms that, given as input a directed graph $G$ with unit node capacities and a threshold $k$, report for all vertex pairs $(s,t)$ the size of a minimum $st$-vertex cut (equivalently, maximum $st$-flow or vertex connectivity) if it is less than $k$, or report that it is at least $k$ otherwise. We abbreviate this problem kAPMVC, and the more general, unit edge capacities version as kAPMC.
First, we present a randomized algorithm for kAPMVC that runs in time $O((nk)^{\omega})$, where $\omega$ is the fast matrix multiplication exponent. This result stems from an application of the network coding method by Cheung, Lau, and Leung [SICOMP 2013] to vertex-capacitated digraphs. Second, we present two deterministic algorithms for the harder problem kAPMC on DAGs, where we also want to compute min-cut witnesses. The first algorithm is combinatorial (it does not involve matrix multiplication) and runs in time $O(2^{O(k^2)} \cdot mn)$. The second algorithm is faster on dense DAGs and runs in time $O((k\log n)^{4^k+o(k)}\cdot n^{\omega})$. Notice that a solution even to kAPMVC, for any $k\geq 1$, implies a solution to triangle finding and to transitive closure: thus, our bounds for $k=o(\sqrt{\log n})$ and for $k=o(\log\log n)$ are tight up to subpolynomial factors in $n$, where the former applies to combinatorial algorithms [Abboud and Williams, FOCS 2014]. To obtain those results, we exploit new insights on the structure of $k$-cuts and we apply suitable algebraic tools. Our final set of results rules out the possibility that kAPMVC can be solved as efficiently as a transitive closure computation for all $k$. We design a novel reduction showing a lower bound of $n^{\omega-1-o(1)} k^2$ for kAPMVC assuming that $4$-Clique requires $n^{\omega+1-o(1)}$ time. For combinatorial algorithms, our reduction implies an $n^{2-o(1)} k^2$ conditional lower bound. These lower bounds are significantly higher than previously known ones (under SETH) even for the general case of $k=n$ and for the All-Pairs Max-Flow problem.
We study the problem of a decision-maker having to select one of many competing alternatives (e.g., choosing between projects, designs, or suppliers) whose future revenues are a priori unknown and modeled as random variables of known probability distributions. The decision-maker can pay to test each alternative to reveal its specific revenue realization (e.g., by conducting market research), and her goal is to maximize the expected revenue of the selected alternative minus the testing costs. This model captures an interesting trade-off between gaining revenue of a high-yield alternative and spending resources to reduce the uncertainty in selecting it. The combinatorial nature of the problem leads to a dynamic programming (DP) formulation with high-dimensional state space that is computationally intractable. By characterizing the structure of the optimal policy, we derive efficient optimal and near-optimal policies that are simple and easy-to-compute. In fact, these policies are also myopic -- they only consider a limited horizon of one test. Moreover, our policies can be described using intuitive `testing intervals' around the expected revenue of each alternative, and in many cases, the dynamics of an optimal policy can be explained by the interaction between the testing intervals of various alternatives.
In the Set Cover problem, the input is a ground set of n elements and a collection of m sets, and the goal is to find the smallest sub-collection of sets whose union is the entire ground set. The fastest algorithm known runs in time $O(mn 2^n)$ [Fomin et al., WG 2004], and the Set Cover Conjecture (SeCoCo) [Cygan et al., TALG 2016] asserts that for every fixed $\epsilon>0$, no algorithm can solve Set Cover in time $2^{(1-\epsilon)n} \mathrm{poly}(m)$, even if set sizes are bounded by $\Delta=\Delta(\epsilon)$. We show strong connections between this problem and kTree, a special case of Subgraph Isomorphism where the input is an n-node graph G and a k-node tree $T$, and the goal is to determine whether G has a subgraph isomorphic to T.
First, we propose a weaker conjecture Log-SeCoCo, that allows input sets of size $\Delta=O(1/\epsilon \cdot\log n)$, and show that an algorithm breaking Log-SeCoCo would imply a faster algorithm than the currently known $2^n \mathrm{poly}(n)$-time algorithm [Koutis and Williams, TALG 2016] for Directed nTree, which is kTree with k=n and arbitrary directions to the edges of G and T. This would also improve the running time for Directed Hamiltonicity, for which no algorithm significantly faster than $2^n \mathrm{poly}(n)$ is known despite extensive research.
Second, we prove that if p-Partial Cover, a parameterized version of Set Cover that requires covering at least p elements, cannot be solved significantly faster than $2^{n} \mathrm{poly}(m)$ (an assumption even weaker than Log-SeCoCo) then kTree cannot be computed significantly faster than $2^{k} \mathrm{poly}(n)$, the running time of the algorithm of Koutis and Williams.
We study the following version of cut sparsification. Given a large edge-weighted network $G$ with $k$ terminal vertices, compress it into a small network $H$ with the same terminals, such that the minimum cut in $H$ between every bipartition of the terminals approximates the corresponding one in $G$ within factor $q\geq 1$, called the quality.
We provide two new insights about the structure of cut sparsifiers, and then apply them to obtain improved cut sparsifiers (and data structures) for planar graphs. Our first main contribution identifies a subset of the minimum terminal cuts that generates all the other ones. Restricting attention to these cuts, which we call elementary, is effective in reducing the number of requirements from the sparsifier $H$. Our applications of this result are new cut sparsifiers of quality $q=1$ (also called mimicking networks): (1) for planar networks $G$, we design a cut sparsifier of size $O(k 2^{2k})$, slightly improving the known bound from [Krauthgamer and Rika, SODA 2013]; and (2) for planar networks $G$ whose terminals are incident to at most $\gamma=\gamma(G)$ faces, we design a cut sparsifier of size $O(\gamma^5 2^{2\gamma} k^4)$, which was previously unknown.
Our second main contribution is a duality between minor cut sparsification and minor distance sparsification (i.e., when the sparsifier $H$ is required to be a minor of $G$), applicable to planar networks $G$ with all their terminals on the same face. This duality connects problems that were previously studied separately, implying new results, new proofs of known results, and equivalences between open gaps. All our cut sparsifiers are minors of the input network $G$.
We provide evidence that computing the maximum flow value between every pair of nodes in a directed graph on n nodes, m edges, and capacities in the range [1..n], which we call the All-Pairs Max-Flow problem, cannot be solved in time that is significantly faster (i.e., by a polynomial factor) than O(n^3) even for sparse graphs, namely m=O(n); thus for general m, it cannot be solved significantly faster than O(n^2 m). Since a single maximum st-flow can be solved in time $\tilde O(m\sqrt{n})$ [Lee and Sidford, FOCS 2014], we conclude that the all-pairs version might require time equivalent to $\tilde\Omega(n^{3/2})$ computations of maximum st-flow, which strongly separates the directed case from the undirected one. Moreover, if maximum st-flow can be solved in time $\tilde O(m)$, then time equivalent to $\tilde\Omega(n^2)$ such computations is needed. This is in contrast to a conjecture of Lacki, Nussbaum, Sankowski, and Wulff-Nilsen [FOCS 2012] that All-Pairs Max-Flow in general graphs can be solved faster than the time of O(n^2) computations of maximum st-flow.
Specifically, we show that in sparse graphs G=(V,E,w), if one can compute the maximum st-flow from every s in an input set of sources $S\subseteq V$ to every t in an input set of sinks $T\subseteq V$ in time $O((|S| |T| m)^{1-\varepsilon})$, for some $|S|$, $|T|$ and a constant $\varepsilon>0$, then MAX-CNF-SAT (maximum satisfiability of conjunctive normal form formulas) with n' variables and m' clauses can be solved in time ${m'}^{O(1)}2^{(1-\delta)n'}$ for a constant $\delta(\varepsilon)>0$, a problem for which not even $2^{n'}/\mathrm{poly}(n')$ algorithms are known. Such running time for MAX-CNF-SAT would in particular refute the Strong Exponential Time Hypothesis (SETH). Hence, we improve the lower bound of Abboud, Vassilevska-Williams, and Yu [STOC 2015], who showed that for every fixed $\varepsilon>0$ and $|S|=|T|=O(\sqrt{n})$, if the above problem can be solved in time $O(n^{3/2-\varepsilon})$, then some incomparable (and intuitively weaker) conjecture is false. Furthermore, a larger lower bound than ours implies strictly super-linear time for maximum st-flow problem, which would be an amazing breakthrough.
In addition, we show that All-Pairs Max-Flow in uncapacitated networks with every edge-density m=m(n), cannot be computed in time significantly faster than O(mn), even for acyclic networks. The gap to the fastest known algorithm by Cheung, Lau, and Leung [FOCS 2011] is a factor of $O(m^{\omega-1}/n)$, and for acyclic networks it is $O(n^{\omega-1})$, where $\omega$ is the matrix multiplication exponent.
Finally, we extend our lower bounds to the version that asks only for the maximum-flow values below a given threshold (over all source-sink pairs).
A central problem in data streams is to characterize which functions of an underlying frequency vector can be approximated efficiently. Recently there has been considerable effort in extending this problem to that of estimating functions of a matrix that is presented as a data-stream. This setting generalizes classical problems to the analogous ones for matrices. For example, instead of estimating frequent-item counts, we now wish to estimate "frequent-direction" counts. A related example is to estimate norms, which now correspond to estimating a vector norm on the singular values of the matrix. Despite recent efforts, the current understanding for such matrix problems is considerably weaker than that for vector problems.
We study a number of aspects of estimating matrix norms in a stream that have not previously been considered: (1) multi-pass algorithms, (2) algorithms that see the underlying matrix one row at a time, and (3) time-efficient algorithms. Our multi-pass and row-order algorithms use less memory than what is provably required in the single-pass and entrywise-update models, and thus give separations between these models (in terms of memory). Moreover, all of our algorithms are considerably faster than previous ones. We also prove a number of lower bounds, and obtain for instance, a near-complete characterization of the memory required of row-order algorithms for estimating Schatten p-norms of sparse matrices.
In the snippets problem we are interested in preprocessing a text T so that given two pattern queries P1 and P2, one can quickly locate the occurrences of the patterns in T that are the closest to each other. A closely related problem is that of constructing a color-distance oracle, where the goal is to preprocess a set of points from some metric space, in which every point is associated with a set of colors, so that given two colors one can quickly locate two points associated with those colors, that are as close as possible to each other.
We introduce efficient data structures for both color-distance oracles and the snippets problem. Moreover, we prove conditional lower bounds for these problems from both the 3SUM conjecture and the Combinatorial Boolean Matrix Multiplication conjecture.
We characterize the streaming space complexity of every symmetric norm l (a norm on R^n invariant under sign-flips and coordinate-permutations), by relating this space complexity to the measure-concentration characteristics of l. Specifically, we provide matching upper and lower bounds (up to polylog(n) factors) on the space complexity of approximating the norm of the stream, where both bounds depend on the median of l(x), when x is drawn uniformly from the l2 unit sphere. The same quantity governs many phenomena in high-dimensional spaces, such as large-deviation bounds and the critical dimension in Dvoretzky's Theorem.
The family of symmetric norms contains several well-studied norms, such as all l_p norms, and indeed we provide a new explanation for the disparity in space complexity between p <= 2 and p>2. In addition, we apply our general results to easily derive bounds for several norms that were not studied before in the streaming model, including for example the top-k norm and the k-support norm, which was recently shown to be effective for machine learning tasks.
Overall, these results make progress on two outstanding problems in the area of sublinear algorithms (Problems 5 and 30 on sublinear.info).
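The governing quantity, the median of l(x) over the unit sphere, can be estimated empirically; a Monte-Carlo sketch for the top-k norm (illustrative, with hypothetical sample sizes):

```python
import numpy as np

def median_on_sphere(norm_fn, n, samples=2000, seed=0):
    """Monte-Carlo estimate of the median of l(x) for x uniform on the l2
    unit sphere in R^n -- the quantity governing the streaming space
    complexity of a symmetric norm l."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((samples, n))
    x = g / np.linalg.norm(g, axis=1, keepdims=True)  # uniform on the sphere
    return float(np.median([norm_fn(row) for row in x]))

n, k = 100, 5
top_k = lambda v: np.sort(np.abs(v))[::-1][:k].sum()  # top-k norm
m = median_on_sphere(top_k, n)
print(0.0 < m <= np.sqrt(k))  # Cauchy-Schwarz: top-k norm <= sqrt(k) on the sphere
```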
By a classical result of Gomory and Hu (1961), in every edge-weighted graph G=(V,E,w), the minimum st-cut values, when ranging over all $s,t\in V$, take at most |V|-1 distinct values. That is, these $\binom{|V|}{2}$ instances exhibit redundancy factor $\Omega(|V|)$. They further showed how to construct from G a tree (V,E',w') that stores all minimum st-cut values. Motivated by this result, we obtain tight bounds for the redundancy factor of several generalizations of the minimum st-cut problem.
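This redundancy can be observed directly; a small sketch using networkx's `gomory_hu_tree` on an illustrative example graph (the min st-cut in G equals the minimum edge weight on the s-t path in the tree, so all $\binom{|V|}{2}$ values come from at most |V|-1 tree edges):

```python
import itertools
import networkx as nx

# The Gomory-Hu tree stores all minimum st-cut values: the min st-cut in G
# equals the minimum edge weight on the (unique) s-t path in the tree.
G = nx.cycle_graph(6)
nx.set_edge_attributes(G, 1, "capacity")
T = nx.gomory_hu_tree(G, capacity="capacity")

def min_cut_via_tree(T, s, t):
    path = nx.shortest_path(T, s, t)
    return min(T[u][v]["weight"] for u, v in zip(path, path[1:]))

values = {min_cut_via_tree(T, s, t) for s, t in itertools.combinations(G, 2)}
print(len(values) <= G.number_of_nodes() - 1)  # at most |V|-1 distinct values
```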
Update: The latest version (on arXiv) contains additional references to previous work (which have some overlap with our results).
A valued constraint satisfaction problem (VCSP) instance $(V,\Pi,w)$ is a set of variables $V$ with a set of constraints $\Pi$ weighted by $w$. Given a VCSP instance, we are interested in a re-weighted sub-instance $(V,\Pi'\subset \Pi,w')$ that preserves the value of the given instance (under every assignment to the variables) within factor $1\pm\epsilon$. A well-studied special case is cut sparsification in graphs, which has found various applications.
We show that a VCSP instance consisting of a single boolean predicate $P(x,y)$ (e.g., for cut, $P=XOR$) can be sparsified into $O(|V|/\epsilon^2)$ constraints if and only if the number of inputs that satisfy $P$ is anything but one (i.e., $|P^{-1}(1)| \neq 1$). Furthermore, this sparsity bound is tight unless $P$ is a relatively trivial predicate. We conclude that also systems of 2SAT (or 2LIN) constraints can be sparsified.
We study the problem of reconstructing a low-rank matrix, where the input is an $n\times m$ matrix $M$ over a field F and the goal is to reconstruct a (near-optimal) matrix $M'$ that is low-rank and close to $M$ under some distance function $\Delta$. Furthermore, the reconstruction must be local, i.e., provide access to any desired entry of $M'$ by reading only a few entries of the input $M$ (ideally, independent of the matrix dimensions $n$ and $m$). Our formulation of this problem is inspired by the local reconstruction framework of Saks and Seshadhri (SICOMP, 2010).
Our main result is a local reconstruction algorithm for the case where $\Delta$ is the normalized Hamming distance (between matrices). Given $M$ that is $\epsilon$-close to a matrix of rank $d<1/\epsilon$ (together with $d$ and $\epsilon$), this algorithm computes with high probability a rank-$d$ matrix $M'$ that is $O(\sqrt{d\epsilon})$-close to $M$. This is a local algorithm that proceeds in two phases. The preprocessing phase reads only $\tilde O(\sqrt{d/\epsilon^3})$ random entries of $M$, and stores a small data structure. The query phase deterministically outputs a desired entry $M'_{i,j}$ by reading only the data structure and $2d$ additional entries of $M$.
We also consider local reconstruction in an easier setting, where the algorithm can read an entire matrix column in a single operation. When $\Delta$ is the normalized Hamming distance between vectors, we derive an algorithm that runs in polynomial time by applying our main result for matrix reconstruction. For comparison, when $\Delta$ is the truncated Euclidean distance and F=R, we analyze sampling algorithms by using statistical learning tools.
We study resistance sparsification of graphs, in which the goal is to find a sparse subgraph (with reweighted edges) that approximately preserves the effective resistances between every pair of nodes. We show that every dense regular expander admits a $(1+\epsilon)$-resistance sparsifier of size $\tilde O(n/\epsilon)$, and conjecture this bound holds for all graphs on $n$ nodes. In comparison, spectral sparsification is a strictly stronger notion and requires $\Omega(n/\epsilon^2)$ edges even on the complete graph.
Our approach leads to the following structural question on graphs: Does every dense regular expander contain a sparse regular expander as a subgraph? Our main technical contribution, which may be of independent interest, is a positive answer to this question in a certain setting of parameters. Combining this with a recent result of von Luxburg, Radl, and Hein (JMLR, 2014) leads to the aforementioned resistance sparsifiers.
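For reference, the effective resistances that such a sparsifier must preserve can be computed exactly from the pseudoinverse of the graph Laplacian; a minimal sketch:

```python
import numpy as np

def effective_resistance(L, u, v):
    """Effective resistance between u and v: R(u,v) = (e_u-e_v)^T L^+ (e_u-e_v),
    where L^+ is the Moore-Penrose pseudoinverse of the graph Laplacian L."""
    Lp = np.linalg.pinv(L)
    e = np.zeros(L.shape[0])
    e[u], e[v] = 1.0, -1.0
    return float(e @ Lp @ e)

# Path on 3 nodes with unit-weight edges: two unit resistors in series.
L = np.array([[ 1.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  1.0]])
print(round(effective_resistance(L, 0, 2), 6))  # 2.0
```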
We investigate the problem of approximate Nearest-Neighbor Search (NNS) in graphical metrics: The task is to preprocess an edge-weighted graph $G=(V,E)$ on $m$ vertices and a small ``dataset'' $D\subset V$ of size $n \ll m$, so that given a query point $q\in V$, one can quickly approximate $dist(q,D)$ (the distance from $q$ to its closest vertex in $D$) and find a vertex $a\in D$ within this approximated distance. We assume the query algorithm has access to a distance oracle, that quickly evaluates the exact distance between any pair of vertices.
For planar graphs $G$ with maximum degree $\Delta$, we show how to efficiently construct a compact data structure -- of size $\tilde{O}(n(\Delta+1/\epsilon))$ -- that answers $(1+\epsilon)$-NNS queries in time $\tilde{O}(\Delta+1/\epsilon)$. Thus, as far as NNS applications are concerned, metrics derived from bounded-degree planar graphs behave as low-dimensional metrics, even though planar metrics do not necessarily have a low doubling dimension, nor can they be embedded with low distortion into $\ell_2$. We complement our algorithmic result by lower bounds showing that the access to an exact distance oracle (rather than an approximate one) and the dependency on $\Delta$ (in query time) are both essential.
A prominent tool in many problems involving metric spaces is a notion of randomized low-diameter decomposition. Loosely speaking, $\beta$-decomposition refers to a probability distribution over partitions of the metric into sets of low diameter, such that nearby points (parameterized by $\beta>0$) are likely to be ``clustered'' together. Applying this notion to the shortest-path metric in edge-weighted graphs, it is known that $n$-vertex graphs admit an $O(\ln n)$-padded decomposition [Bartal96], and that excluded-minor graphs admit $O(1)$-padded decomposition [KPR93,FT03,AGGNT14].
We design decompositions to the family of $p$-path-separable graphs, which was defined by Abraham and Gavoille [AG06] and refers to graphs that admit vertex-separators consisting of at most $p$ shortest paths in the graph.
Our main result is that every $p$-path-separable $n$-vertex graph admits an $O(\ln (p \ln n))$-decomposition, which refines the $O(\ln n)$ bound for general graphs, and provides new bounds for families like bounded-treewidth graphs. Technically, our clustering process differs from previous ones by working in (the shortest-path metric of) carefully chosen subgraphs.
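A minimal sketch of a CKR-style random decomposition, the general-graph scheme behind the $O(\ln n)$ bound (a simplified illustration on an explicit metric, not the subgraph-based clustering of the paper):

```python
import random

def ckr_decomposition(points, dist, delta, seed=0):
    """CKR-style random decomposition sketch: pick a random radius in
    [delta/4, delta/2] and a random order of the points; each point joins
    the cluster of the first center (in the order) within that radius."""
    rng = random.Random(seed)
    order = list(points)
    rng.shuffle(order)
    r = rng.uniform(delta / 4, delta / 2)
    cluster = {}
    for c in order:
        for p in points:
            if p not in cluster and dist(p, c) <= r:
                cluster[p] = c
    return cluster

points = list(range(10))
dist = lambda a, b: abs(a - b)  # line metric
parts = ckr_decomposition(points, dist, delta=4)
# every cluster lies in a ball of radius r <= delta/2, so diameter <= delta
diam_ok = all(dist(p, q) <= 4 for p in points for q in points
              if parts[p] == parts[q])
print(diam_ok)
```

The padding analysis then bounds the probability that a point is separated from its close neighbors, which is exactly where working inside carefully chosen subgraphs refines the bound.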
An outstanding open question [sublinear.info, Question #5] asks to characterize metric spaces in which distances can be estimated using efficient sketches. Specifically, we say that a sketching algorithm is efficient if it achieves constant approximation using constant sketch size. A well-known result of Indyk (J. ACM, 2006) implies that a metric that admits a constant-distortion embedding into $\ell_p$ for $p\in(0,2]$ also admits an efficient sketching scheme. But is the converse true, i.e., is embedding into $\ell_p$ the only way to achieve efficient sketching?
We address these questions for the important special case of normed spaces, by providing an almost complete characterization of sketching in terms of embeddings. In particular, we prove that a finite-dimensional normed space allows efficient sketches if and only if it embeds (linearly) into $\ell_{1-\epsilon}$ with constant distortion. We further prove that for norms that are closed under sum-product, efficient sketching is equivalent to constant-distortion embedding into $\ell_1$. Examples of such norms include the Earth Mover's Distance (specifically its norm variant, called Kantorovich-Rubinstein norm), and the trace norm (a.k.a. Schatten 1-norm or the nuclear norm). Using known non-embeddability theorems for these norms by Naor and Schechtman (SICOMP, 2007) and by Pisier (Compositio Math., 1978), we then conclude that these spaces do not admit efficient sketches either, making progress towards answering another open question [sublinear.info, Question #7].
Finally, we observe that resolving whether ``sketching is equivalent to embedding into $\ell_1$ for general norms'' (i.e., without the above restriction) is equivalent to resolving a well-known open problem in Functional Analysis posed by Kwapien in 1969.
Sketching and streaming algorithms are in the forefront of current research directions for cut problems in graphs. In the streaming model, we show that $(1-\epsilon)$-approximation for Max-Cut must use $n^{1-O(\epsilon)}$ space; moreover, beating $4/5$-approximation requires polynomial space. For the sketching model, we show that $r$-uniform hypergraphs admit a $(1+\epsilon)$-cut-sparsifier (i.e., a weighted subhypergraph that approximately preserves all the cuts) with $O(\epsilon^{-2}n(r+\log n))$ edges. We also make first steps towards sketching general CSPs (Constraint Satisfaction Problems).
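For concreteness, here is the cut value that a hypergraph cut-sparsifier must (approximately) preserve for every bipartition: a hyperedge is cut iff its vertices do not all lie on one side (an illustrative helper):

```python
def hypergraph_cut(hyperedges, weights, side):
    """Weight of a hypergraph cut: a hyperedge is cut iff its vertices do
    not all lie on the same side of the bipartition `side` (a vertex set)."""
    total = 0.0
    for e, w in zip(hyperedges, weights):
        inside = sum(1 for v in e if v in side)
        if 0 < inside < len(e):  # the hyperedge straddles the bipartition
            total += w
    return total

# 3-uniform hypergraph on {0, 1, 2, 3}
edges = [(0, 1, 2), (1, 2, 3), (0, 2, 3)]
w = [1.0, 1.0, 2.0]
print(hypergraph_cut(edges, w, side={0}))  # edges (0,1,2) and (0,2,3) are cut: 3.0
```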
We introduce the st-cut version of the Sparsest-Cut problem, where the goal is to find a cut of minimum sparsity in a graph $G(V,E)$ among those separating two distinguished vertices $s,t\in V$. Clearly, this problem is at least as hard as the usual (non-st) version. Our main result is a polynomial-time algorithm for the product-demands setting, which produces a cut of sparsity $O(\sqrt{OPT})$, where OPT denotes the optimum, and the total edge capacity and the total demand are assumed (by normalization) to be 1.
Our result generalizes the recent work of Trevisan [arXiv, 2013] for the non-st version of the same problem (Sparsest-Cut with product demands), which in turn generalizes the bound achieved by the discrete Cheeger inequality, a cornerstone of Spectral Graph Theory that has numerous applications. Indeed, Cheeger's inequality handles graph conductance, the special case of product demands that are proportional to the vertex (capacitated) degrees. Along the way, we obtain an $O(\log |V|)$-approximation for the general-demands setting of Sparsest st-Cut.
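For context, the discrete Cheeger inequality referenced above relates the conductance $\phi(G)$ to the second-smallest eigenvalue $\lambda_2$ of the graph's normalized Laplacian (stated here under the standard normalization); the $O(\sqrt{OPT})$ sparsity bound above is the analogue of the right-hand inequality:

```latex
\frac{\lambda_2}{2}
\;\le\;
\phi(G) := \min_{\emptyset \ne S \subsetneq V}
\frac{c(S, V\setminus S)}{\min\{\operatorname{vol}(S),\, \operatorname{vol}(V\setminus S)\}}
\;\le\;
\sqrt{2\lambda_2}.
```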
We study spectral algorithms for the high-dimensional Nearest Neighbor Search problem (NNS). In particular, we consider a semi-random setting where a dataset is chosen arbitrarily from an unknown subspace of low dimension, and then perturbed by full-dimensional Gaussian noise. We design spectral NNS algorithms whose query time depends polynomially on the dimension and logarithmically on the size of the point set. These spectral algorithms use a repeated computation of the top PCA vector/subspace, and are effective even when the random-noise magnitude is much larger than the interpoint distances. Our motivation is that in practice, a number of spectral NNS algorithms outperform the random-projection methods that seem otherwise theoretically optimal on worst-case datasets. In this paper we aim to provide theoretical justification for this disparity.
We undertake a systematic study of sketching a quadratic form: given an $n \times n$ matrix $A$, create a succinct sketch $sk(A)$ which can produce (without further access to $A$) a multiplicative $(1+\epsilon)$-approximation to $x^T A x$ for any desired query $x \in R^n$. While a general matrix does not admit non-trivial sketches, positive semi-definite (PSD) matrices admit sketches of size $\Theta(\epsilon^{-2} n)$, via the Johnson-Lindenstrauss lemma, achieving the ``for each'' guarantee, namely, for each query $x$, with constant probability the sketch succeeds. (For the stronger ``for all'' guarantee, where the sketch succeeds for all $x$'s simultaneously, again there are no non-trivial sketches.)
We design significantly better sketches for the important subclass of graph Laplacian matrices, which we also extend to symmetric diagonally dominant matrices. A sequence of work culminating in that of Batson, Spielman, and Srivastava (SIAM Review, 2014) shows that by choosing and reweighting $O(\epsilon^{-2} n)$ edges in a graph, one achieves the ``for all'' guarantee. Our main results advance this front.
Our results provide the first separation between the sketch size needed for the ``for all'' and ``for each'' guarantees for Laplacian matrices.
In edge orientations, the goal is usually to orient (direct) the edges of an undirected $n$-vertex graph $G$ such that all out-degrees are bounded. When the graph $G$ is fully dynamic, i.e., admits edge insertions and deletions, we wish to maintain such an orientation while keeping the update time small. Low out-degree orientations have turned out to be a surprisingly useful tool, with several algorithmic applications involving static or dynamic graphs.
Brodal and Fagerberg (1999) initiated the study of the edge orientation problem in terms of the graph's arboricity, which is very natural in this context. They provided a solution with constant out-degree and amortized logarithmic update time for all graphs with constant arboricity, which include all planar and excluded-minor graphs. However, it remained an open question (first posed by Brodal and Fagerberg, and later by others) to obtain similar bounds with worst-case update time.
We resolve this 15-year-old question in the affirmative, by providing a simple algorithm with worst-case bounds that nearly match the previous amortized bounds. Our algorithm is based on a new approach of maintaining a combinatorial invariant, and achieves logarithmic out-degree with logarithmic worst-case update time. This result has applications in various dynamic graph problems such as maintaining a maximal matching, where we obtain $O(\log n)$ worst-case update time, compared to the $O(\frac{\log n}{\log\log n})$ amortized update time of Neiman and Solomon (2013).
The problem of partitioning an edge-capacitated graph on n vertices into k balanced parts has been amply researched. Motivated by applications such as load balancing in distributed systems and market segmentation in social networks, we propose a new variant of the problem, called Multiply Balanced k-Partitioning, where the vertex-partition must be balanced under d vertex-weight functions simultaneously.
We design bicriteria approximation algorithms for this problem, i.e., they partition the vertices into up to $k$ parts that are nearly balanced simultaneously for all weight functions, and their approximation factor for the capacity of cut edges matches the bounds known for a single weight function, up to a factor of $d$. For the case where d=2, for vertex weights that are integers bounded by a polynomial in n and any fixed $\epsilon>0$, we obtain a $(2+\epsilon, O(\sqrt{\log n \log k}))$-bicriteria approximation, namely, we partition the graph into parts whose weight is at most $2+\epsilon$ times that of a perfectly balanced part (simultaneously for both weight functions), and whose cut capacity is $O(\sqrt{\log n \log k})\cdot OPT$. For unbounded (exponential) vertex weights, we achieve approximation $(3, O(\log n))$.
Our algorithm generalizes to d weight functions as follows: For vertex weights that are integers bounded by a polynomial in $n$ and any fixed $\epsilon>0$, we obtain a $(2d +\epsilon, O(d\sqrt{\log n \log k}))$-bicriteria approximation. For unbounded (exponential) vertex weights, we achieve approximation $(2d+ 1, O(d\log n))$.
A useful approach to "compress" a large network G is to represent it with a flow-sparsifier, i.e., a small network H that supports the same flows as G, up to a factor $q \geq 1$ called the quality of the sparsifier. Specifically, we assume the network G contains a set of k terminals T, shared with the network H, i.e., $T\subseteq V(G)\cap V(H)$, and we want H to preserve all multicommodity flows that can be routed between the terminals T. The challenge is to construct H that is small.
These questions have received a lot of attention in recent years, leading to some known tradeoffs between the sparsifier's quality $q$ and its size |V(H)|. Nevertheless, it remains an outstanding question whether every $G$ admits a flow-sparsifier H with quality $q=1+\epsilon$, or even $q=O(1)$, and size $|V(H)|\leq f(k,\epsilon)$ (in particular, independent of |V(G)| and the edge capacities).
Making a first step in this direction, we present new constructions for several scenarios:
We consider the problem of Non-Uniform Graph Partitioning, where the input is an edge-weighted undirected graph G=(V,E) and k capacities n_1,...,n_k, and the goal is to find a partition {S_1,S_2,...,S_k} of V satisfying |S_j| \leq n_j for all 1\leq j\leq k, that minimizes the total weight of edges crossing between different parts. This natural graph partitioning problem arises in practical scenarios, and generalizes well-studied balanced partitioning problems such as Minimum Bisection, Minimum Balanced Cut, and Minimum k-Partitioning. Unlike these problems, Non-Uniform Graph Partitioning seems to be resistant to many of the known partitioning techniques, such as spreading metrics, recursive partitioning, and Raecke's tree decomposition, because k can be a function of n and the capacities could be of different magnitudes.
We present a bicriteria approximation algorithm for Non-Uniform Graph Partitioning that approximates the objective within O(log n) factor while deviating from the required capacities by at most a constant factor. Our approach is to apply stopping-time based concentration results to a simple randomized rounding of a configuration LP. These concentration bounds are needed as the commonly used techniques of bounded differences and bounded conditioned variances do not suffice.
Estimating the leading principal components of data, assuming they are sparse, is a central task in modern high-dimensional statistics. Many algorithms have been suggested for this sparse PCA problem, from simple diagonal thresholding to sophisticated semidefinite programming (SDP) methods. A key theoretical question asks under what conditions can such algorithms recover the sparse principal components. We study this question for a single-spike model, with a spike that is $\ell_0$-sparse, and dimension $p$ and sample size $n$ that tend to infinity. Amini and Wainwright (2009) proved that for sparsity levels $k\geq\Omega(n/\log p)$, no algorithm, efficient or not, can reliably recover the sparse eigenvector. In contrast, for sparsity levels $k\leq O(\sqrt{n/\log p})$, diagonal thresholding is asymptotically consistent.
It was further conjectured that the SDP approach may close this gap between computational and information limits. We prove that when $k \geq \Omega(\sqrt{n})$, the SDP approach, at least in its standard usage, cannot recover the sparse spike. In fact, we conjecture that in the single-spike model, no computationally-efficient algorithm can recover a spike of $\ell_0$-sparsity $k\geq \Omega(\sqrt{n})$. Finally, we present empirical results suggesting that up to sparsity levels $k=O(\sqrt{n})$, recovery is possible by a simple covariance thresholding algorithm.
Our main result is that the Steiner Point Removal (SPR) problem can always be solved with polylogarithmic distortion, which answers in the affirmative a question posed by Chan, Xia, Konjevod, and Richa (2006). Specifically, we prove that for every edge-weighted graph $G = (V,E,w)$ and a subset of terminals $T \subseteq V$, there is a graph $G'=(T,E',w')$ that is isomorphic to a minor of $G$, such that for every two terminals $u,v\in T$, the shortest-path distances between them in $G$ and in $G'$ satisfy $d_{G,w}(u,v) \le d_{G',w'}(u,v) \le O(\log^5|T|) \cdot d_{G,w}(u,v)$. Our existence proof actually gives a randomized polynomial-time algorithm.
Our proof features a new variant of metric decomposition. It is well-known that every finite metric space $(X,d)$ admits a $\beta$-separating decomposition for $\beta=O(\log |X|)$, which means that for every $\Delta>0$ there is a randomized partitioning of $X$ into clusters of diameter at most $\Delta$, satisfying the following separation property: for every $x,y \in X$, the probability they lie in different clusters of the partition is at most $\beta\,d(x,y)/\Delta$. We introduce an additional requirement in the form of a tail bound: for every shortest-path $P$ of length $d(P) \leq \Delta/\beta$, the number of clusters of the partition that meet the path $P$, denoted $Z_P$, satisfies $\Pr[Z_P > t] \le 2e^{-\Omega(t)}$ for all $t>0$.
We initiate the study of dimensionality reduction in general metric spaces in the context of supervised learning. Our statistical contribution consists of tight Rademacher bounds for Lipschitz functions in metric spaces that are doubling, or nearly doubling. As a by-product, we obtain a new theoretical explanation for the empirically reported improvements gained by pre-processing Euclidean data by PCA (Principal Components Analysis) prior to constructing a linear classifier. On the algorithmic front, we describe an analogue of PCA for metric spaces, namely an efficient procedure that approximates the data's intrinsic dimension, which is often much lower than the ambient dimension. Thus, our approach can exploit the dual benefits of low dimensionality: (1) more efficient proximity search algorithms, and (2) more optimistic generalization bounds.
Given a large edge-weighted network G with k terminal vertices, we wish to compress it and store, using little memory, the value of the minimum cut (or equivalently, maximum flow) between every bipartition of terminals. One appealing methodology to implement a compression of G is to construct a mimicking network: a small network G' with the same k terminals, in which the minimum cut value between every bipartition of terminals is the same as in G. This notion was introduced by Hagerup, Katajainen, Nishimura, and Ragde [JCSS '98], who proved that such G' of size at most $2^{2^k}$ always exists. Obviously, by having access to the smaller network G', certain computations involving cuts can be carried out much more efficiently.
We provide several new bounds, which together narrow the previously known gap from doubly-exponential to only singly-exponential, both for planar and for general graphs. Our first and main result is that every $k$-terminal planar network admits a mimicking network G' of size $O(k^2\cdot 2^{2k})$, which is moreover a minor of G. On the other hand, some planar networks $G$ require $|E(G')| \ge \Omega(k^2)$. For general networks, we show that certain bipartite graphs only admit mimicking networks of size $|V(G')| \geq 2^{\Omega(k)}$, and moreover, every data structure that stores the minimum cut value between all bipartitions of the terminals must use $2^{\Omega(k)}$ machine words.
We examine the efficiency of clustering a set of points, when the encompassing metric space may be preprocessed in advance. In computational problems of this genre, there is a first stage of preprocessing, whose input is a collection of points $M$; the next stage receives as input a query set $Q\subset M$, and should report a clustering of $Q$ according to some objective, such as 1-median, in which case the answer is a point $a\in M$ minimizing $\sum_{q\in Q} d_M(a,q)$.
We design fast algorithms that approximately solve such problems under standard clustering objectives like p-center and p-median, when the metric $M$ has low doubling dimension. By leveraging the preprocessing stage, our algorithms achieve query time that is near-linear in the query size $n=|Q|$, and is (almost) independent of the total number of points $m=|M|$.
The significant progress in constructing graph spanners that are sparse (small number of edges) or light (low total weight) has skipped spanners that are everywhere-sparse (small maximum degree). This disparity is in line with other network design problems, where the maximum-degree objective has been a notorious technical challenge. Our main result is for the Lowest Degree 2-Spanner (LD2S) problem, where the goal is to compute a 2-spanner of an input graph so as to minimize the maximum degree. We design a polynomial-time algorithm achieving approximation factor $\tilde O(\Delta^{3-2\sqrt{2}}) \approx \tilde O(\Delta^{0.172})$, where $\Delta$ is the maximum degree of the input graph. The previous $\tilde O(\Delta^{1/4})$-approximation was proved nearly two decades ago by Kortsarz and Peleg [SODA 1994, SICOMP 1998].
Our main conceptual contribution is to establish a formal connection between LD2S and a variant of the Densest k-Subgraph (DkS) problem. Specifically, we design for both problems strong relaxations based on the Sherali-Adams linear programming (LP) hierarchy, and show that "faithful" randomized rounding of the DkS-variant can be used to round LD2S solutions. Our notion of faithfulness intuitively means that all vertices and edges are chosen with probability proportional to their LP value, but the precise formulation is more subtle.
Unfortunately, the best algorithms known for DkS use the Lov\'asz-Schrijver LP hierarchy in a non-faithful way [Bhaskara, Charikar, Chlamtac, Feige, and Vijayaraghavan, STOC 2010]. Our main technical contribution is to overcome this shortcoming, while still matching the gap that arises in random graphs by planting a subgraph with the same log-density.
We introduce the following notion of compressing an undirected graph G with edge-lengths and terminal vertices $R\subseteq V(G)$. A distance-preserving minor is a minor G' (of G) with possibly different edge-lengths, such that $R\subseteq V(G')$ and the shortest-path distance between every pair of terminals is exactly the same in G and in G'. What is the smallest $f^*(k)$ such that every graph G with k=|R| terminals admits a distance-preserving minor G' with at most $f^*(k)$ vertices?
Simple analysis shows that $f^*(k)\le O(k^4)$. Our main result proves that $f^*(k)\ge \Omega(k^2)$, significantly improving over the trivial $f^*(k)\ge k$. Our lower bound holds even for planar graphs G, in contrast to graphs G of constant treewidth, for which we prove that O(k) vertices suffice.
The Traveling Salesman Problem (TSP) is among the most famous NP-hard optimization problems. We design for this problem a randomized polynomial-time algorithm that computes a (1+epsilon)-approximation to the optimal tour, for any fixed epsilon>0, in TSP instances that form an arbitrary metric space with bounded intrinsic dimension.
The celebrated results of Arora [JACM 1998] and Mitchell [SICOMP 1999] prove that the above result holds in the special case of TSP in a fixed-dimensional Euclidean space. Thus, our algorithm demonstrates that the algorithmic tractability of metric TSP depends on the dimensionality of the space and not on its specific geometry. This result resolves a problem that has been open since the quasi-polynomial time algorithm of Talwar [STOC 2004].
We present a framework for performing efficient regression in general metric spaces. Roughly speaking, our regressor predicts the value at a new point by computing a Lipschitz extension --- the smoothest function consistent with the observed data --- after performing structural risk minimization to avoid overfitting. We obtain finite-sample risk bounds with minimal structural and noise assumptions, and a natural speed-precision tradeoff. The offline (learning) and online (prediction) stages can be solved by convex programming, but this naive approach has runtime complexity $O(n^3)$, which is prohibitive for large datasets. We design instead a regression algorithm whose speed and generalization performance depend on the intrinsic dimension of the data, to which the algorithm adapts. While our main innovation is algorithmic, the statistical results may also be of independent interest.
We study graph partitioning problems from a min-max perspective, in which an input graph on $n$ vertices should be partitioned into $k$ parts, and the objective is to minimize the maximum number of edges leaving a single part. The two main versions we consider are where the $k$ parts need to be of equal size, and where they must separate a set of $k$ given terminals. We consider a common generalization of these two problems, and design for it an $O(\sqrt{\log n\log k})$-approximation algorithm. This improves over an $O(\log^2 n)$ approximation for the second version due to Svitkina and Tardos (2004), and roughly $O(k\log n)$ approximation for the first version that follows from other previous work. We also give an improved $O(1)$-approximation algorithm for graphs that exclude any fixed minor.
Along the way, we study the Unbalanced-Cut problem, whose goal is to find, in an input graph $G=(V,E)$, a set $S\subseteq V$ of size $|S|=\rho n$ that minimizes the number of edges leaving $S$. We provide a bicriteria approximation of $O(\sqrt{\log{n}\log{(1/\rho)}})$; when the input graph excludes a fixed minor we improve this guarantee to $O(1)$. Note that the special case $\rho = 1/2$ is the well-known Minimum-Bisection problem, and indeed our bounds generalize those of Arora, Rao and Vazirani (2008) and of Klein, Plotkin, and Rao (1993). Our algorithms work also for the closely related Small-Set Expansion (SSE) problem, which asks for a set $S\subseteq V$ of size $0<|S| \leq \rho n$ with minimum edge-expansion, and was suggested recently by Raghavendra and Steurer (2010). In fact, our algorithm handles more general, weighted, versions of both problems. Previously, an $O(\log n)$ true approximation for both Unbalanced-Cut and SSE follows from Raecke (2008).
In STOC 2005, Indyk and Woodruff introduced a new technique that yielded a near-optimal space algorithm for the $F_k$ moment estimation problem. Since then, the technique has inspired a number of advances in streaming algorithmics. We show that at least several of these results follow easily from the application of a single probabilistic technique, called Precision Sampling. Using it, we obtain simple streaming algorithms that maintain a randomized sketch of an input vector $x=(x_1,\ldots x_n)$, useful for the following applications:
Precision Sampling itself addresses the problem of estimating a sum $\sum_{i=1}^n a_i$ from weak estimates of each real $a_i\in[0,1]$. More precisely, one chooses in advance ``precisions'' $w_i\ge 1$ for each $i\in[n]$ and obtains an estimate of $a_i$ within additive $1/w_i$. The core question is: what is the best trade-off between the approximation to $\sum a_i$ and the total precision, $\sum_i w_i$? In previous work [Andoni, Krauthgamer, and Onak, FOCS 2010], we showed that, as long as $\sum a_i=\Omega(1)$, one can achieve good multiplicative approximation using total precision of only $O(n\log n)$.
A natural requirement of many distributed structures is fault-tolerance: after some failures, whatever remains from the structure should still be effective for whatever remains from the network. In this paper we examine spanners of general graphs that are tolerant to vertex failures, and significantly improve their dependence on the number of faults $r$, for all stretch bounds.
For stretch $k \geq 3$ we design a simple transformation that converts every k-spanner construction with at most $f(n)$ edges into an $r$-fault-tolerant $k$-spanner construction with at most $O(r^3 \log n) \cdot f(2n/r)$ edges. Applying this to standard greedy spanner constructions gives r-fault tolerant k-spanners with $\tilde O(r^{2} n^{1+\frac{2}{k+1}})$ edges. The previous construction by Chechik, Langberg, Peleg, and Roditty [STOC 2009] depends similarly on n but exponentially on r (approximately like $k^r$).
For the case $k=2$ and unit-length edges, an $O(r \log n)$-approximation algorithm is known from recent work of Dinitz and Krauthgamer [arXiv 2010], where several spanner results are obtained using a common approach of rounding a natural flow-based linear programming relaxation. Here we use a different (stronger) LP relaxation and improve the approximation ratio to $O(\log n)$, which is, notably, independent of the number of faults r. We further strengthen this bound in terms of the maximum degree by using the Lov\'asz Local Lemma.
Finally, we show that most of our constructions are inherently local by designing equivalent distributed algorithms in the LOCAL model of distributed computation.
We examine directed spanners through flow-based linear programming relaxations. We design an $\tilde{O}(n^{2/3})$-approximation algorithm for the directed $k$-spanner problem that works for all $k\geq 1$, which is the first sublinear approximation for arbitrary edge-lengths. Even in the more restricted setting of unit edge-lengths, our algorithm improves over the previous $\tilde{O}(n^{1-1/k})$ approximation of Bhattacharyya, Grigorescu, Jung, Raskhodnikova, and Woodruff [SODA 2009] when $k\geq 4$. For the special case of $k=3$ we design a different algorithm achieving an $\tilde{O}(\sqrt{n})$-approximation, improving the previous $\tilde{O}(n^{2/3})$ of Elkin and Peleg [TCS 2005] and Bhattacharyya et al. [SODA 2009]. Both of our algorithms easily extend to the fault-tolerant setting, which has recently attracted attention but not from an approximation viewpoint. We also prove a nearly matching integrality gap of $\tilde{\Omega}(n^{1/3 - \epsilon})$ for any constant $\epsilon > 0$.
A virtue of all our algorithms is that they are relatively simple. Technically, we introduce a new yet natural flow-based relaxation, and show how to approximately solve it even when its size is not polynomial. The main challenge is to design a rounding scheme that ``coordinates'' the choices of flow-paths between the many demand pairs while using few edges overall. We achieve this, roughly speaking, by randomization at the level of vertices.
We give the first constant-factor approximation algorithm for Sparsest-Cut with general demands in bounded treewidth graphs. In contrast to previous algorithms, which rely on the flow-cut gap and/or metric embeddings, our approach exploits the Sherali-Adams hierarchy of linear programming relaxations.
Given a capacitated graph G = (V,E) and a set of terminals $K \subseteq V$, how should we produce a graph H only on the terminals K so that every (multicommodity) flow between the terminals in G could be supported in H with low congestion, and vice versa? (Such a graph H is called a flow-sparsifier for G.) What if we want H to be a ``simple'' graph? What if we allow H to be a convex combination of simple graphs?
Improving on results of Moitra [FOCS 2009] and Leighton and Moitra [STOC 2010], we give efficient algorithms for constructing: (a) a flow-sparsifier $H$ that maintains congestion up to a factor of $\alpha_{FHRT}$, where k = |K|. (b) a convex combination of trees over the terminals K that maintains congestion up to a factor of O(log k). (c) for a planar graph G, a convex combination of planar graphs that maintains congestion up to a constant factor. This requires us to give a new algorithm for the 0-extension problem, the first one in which the preimages of each terminal are connected in G. Moreover, this result extends to minor-closed families of graphs.
Our bounds immediately imply improved approximation guarantees for several terminal-based cut and ordering problems.
We introduce a new problem in the study of doubling spaces: Given a point set $S$ and a target dimension $d^*$, remove from $S$ the fewest points so that the remaining set has doubling dimension at most $d^*$. We present a bicriteria approximation for this problem, and extend this algorithm to solve a group of proximity problems.
Recent advances in large-margin classification of data residing in general metric spaces (rather than Hilbert spaces) enable classification under various natural metrics, such as edit and earthmover distance. The general framework developed for this purpose by von Luxburg and Bousquet [JMLR, 2004] left open the question of computational efficiency and providing direct bounds on classification error.
We design a new algorithm for classification in general metric spaces, whose runtime and accuracy depend on the doubling dimension of the data points. It thus achieves superior classification performance in many common scenarios. The algorithmic core of our approach is an approximate (rather than exact) solution to the classical problems of Lipschitz extension and of Nearest Neighbor Search. The algorithm's generalization performance is established via the fat-shattering dimension of Lipschitz classifiers.
We present a near-linear time algorithm that approximates the edit distance between two strings within a polylogarithmic factor; specifically, for strings of length n and every fixed $\epsilon>0$, it can compute a $(\log n)^{O(1/\epsilon)}$-approximation in $n^{1+\epsilon}$ time. This is an {\em exponential} improvement over the previously known factor, $2^{\tilde O(\sqrt{\log n})}$, with a comparable running time [OR05,AO09].
Previously, no efficient polylogarithmic approximation algorithm was known for any computational task involving edit distance (e.g., nearest neighbor search or sketching).
This result arises naturally in the study of a new asymmetric query model. In this model, the input consists of two strings x and y, and an algorithm can access y in an unrestricted manner, while being charged for querying every symbol of x. Indeed, we obtain our main result by designing an algorithm that makes a small number of queries in this model. We then provide also a nearly-matching lower bound on the number of queries. Our lower bound is the first to expose hardness of edit distance stemming from the input strings being ``repetitive'', which means that many of their substrings are approximately identical. Consequently, our lower bound provides the first rigorous separation between edit distance and Ulam distance, which is edit distance on non-repetitive strings, i.e., permutations.
The snowflake metric $d^{1/2}$ of a doubling set $S\subset \ell_2$ embeds with constant distortion into $\ell_2^D$, for dimension $D$ that depends solely on the doubling constant of the metric. In fact, the target dimension is polylogarithmic in the doubling constant. Our techniques are robust and extend to the more difficult spaces $\ell_1$ and $\ell_\infty$, although the approximation guarantees are slightly weaker.
The quest for a Polynomial Time Approximation Scheme (PTAS) for Nash equilibrium in a two-player game, which emerged as a major open question in Algorithmic Game Theory, seeks to circumvent the PPAD-completeness of finding an (exact) Nash equilibrium by finding an approximate equilibrium. The closely related problem of finding an equilibrium maximizing a certain objective, such as the social welfare, was shown to be NP-hard [Gilboa and Zemel, Games and Economic Behavior, 1989]. However, this NP-hardness is unlikely to extend to approximate equilibria, since the latter admits a quasi-polynomial time algorithm [Lipton, Markakis, and Mehta, EC 2003].
We show that this optimization problem, namely, finding in a two-player game an approximate equilibrium achieving a large social welfare, is unlikely to have a polynomial time algorithm. One interpretation of our results is that a PTAS for Nash equilibrium (if it exists) should not extend to a PTAS for finding the best Nash equilibrium.
Technically, our result is a reduction from the notoriously difficult problem in modern combinatorics of finding a planted (but hidden) clique in a random graph G(n,1/2). Our reduction starts from an instance with planted clique size O(log n). For comparison, the currently known algorithms are effective only for a much larger clique size, Omega(sqrt n).
We consider the k-balanced partitioning problem, where the goal is to partition the vertices of an input graph G into k equally sized components, while minimizing the total weight of the edges connecting different components. We allow k to be part of the input and denote the cardinality of the vertex set by n. This problem is a natural and important generalization of well-known graph partitioning problems, including minimum bisection and minimum balanced cut.
We present a (bi-criteria) approximation algorithm achieving an approximation of $O(\sqrt{\log n \log k})$, which matches or improves over previous algorithms for all relevant values of $k$. Our algorithm uses a semidefinite relaxation which combines $\ell_2^2$ metrics with spreading metrics.
We consider the problem of approximate nearest neighbors in high dimensions, when the queries are lines. In this problem, given n points in R^d, we want to construct a data structure to support efficiently the following queries: given a line L, report the point p closest to L. This problem generalizes the more familiar nearest neighbor problem. From a practical perspective, lines, and low-dimensional flats in general, may model data under linear variation, such as physical objects under different lighting.
For approximation 1+epsilon, we achieve a query time of O(n^{0.5+t}), for arbitrarily small t>0, using space n^{O(1/epsilon^2+1/t^2)}. To the best of our knowledge, this is the first algorithm for this problem with polynomial space and sub-linear query time.
A common approach for solving computational problems over a difficult metric space is to embed the ``hard'' metric into L_1, which admits efficient algorithms and is thus considered an ``easy'' metric. This approach has proved successful or partially successful for important spaces such as the edit distance, but it also has inherent limitations: it is provably impossible to go below a certain approximation factor for some metrics.
We propose a new approach, of embedding the difficult space into richer host spaces, namely iterated products of standard spaces like L_1 and L_\infty. We show that this class is rich, as useful metric spaces embed into it with only constant distortion, and, at the same time, it is tractable and admits efficient algorithms. Using this approach, we obtain for example the first nearest neighbor data structure with O(log log d) approximation for edit distance on non-repetitive strings (the Ulam metric). This approximation is exponentially better than the lower bound for embedding into L_1. Furthermore, we give constant factor approximations for two other computational problems. Along the way, we answer positively a question posed in [Ajtai, Jayram, Kumar, and Sivakumar, STOC 2002]. One of our algorithms has already found applications for smoothed edit distance over 0-1 strings [Andoni and Krauthgamer, ICALP 2008].
We initiate the study of the smoothed complexity of sequence alignment, by proposing a semi-random model of edit distance between two input strings, generated as follows. First, an adversary chooses two binary strings of length d and a longest common subsequence A of them. Then, every character is perturbed independently with probability p, except that A is perturbed in exactly the same way inside the two strings.
We design two efficient algorithms that compute the edit distance on smoothed instances up to a constant factor approximation. The first algorithm runs in near-linear time, namely d^{1+epsilon} for any fixed epsilon>0. The second one runs in time sublinear in d, assuming the edit distance is not too small. These approximation and runtime guarantees are significantly better than the bounds known for worst-case inputs, e.g., the near-linear time algorithm achieving approximation roughly d^{1/3}, due to Batu, Ergun, and Sahinalp [SODA 2006].
Our technical contribution is twofold. First, we rely on finding matches between substrings in the two strings, where two substrings are considered a match if their edit distance is relatively small, a prevailing technique in commonly used heuristics, such as PatternHunter of Ma, Tromp and Li [Bioinformatics, 2002]. Second, we effectively reduce the smoothed edit distance to a simpler variant of (worst-case) edit distance, namely, edit distance on permutations (a.k.a. Ulam's metric). We are thus able to build on algorithms developed for the Ulam metric, whose much better algorithmic guarantees usually do not carry over to general edit distance.
A common technique for processing conjunctive queries is to first match each predicate separately using an index lookup, and then compute the intersection of the resulting row-id lists, via an AND-tree. The performance of this technique depends crucially on the order of lists in this tree: it is important to compute early the intersections that will produce small results. But this optimization is hard to do when the data or predicates have correlation.
We present a new algorithm for ordering the lists in an AND-tree by sampling the intermediate intersection sizes. We prove that our algorithm is near-optimal and validate its effectiveness experimentally on datasets with a variety of distributions.
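As a minimal illustration of this idea (not the paper's exact algorithm; the function name and parameters are ours), one can estimate each candidate intersection size by probing a random sample of the current result, and greedily pick next the list expected to shrink the result the most:

```python
import random

def estimate_intersection_order(lists, sample_size=100, seed=0):
    """Greedily order row-id lists so that small intersections are
    computed early, using sampled intersection-size estimates.
    Returns (ordering of list indices, final intersection)."""
    rng = random.Random(seed)
    sets = [set(l) for l in lists]
    # Start from the smallest list.
    order = [min(range(len(sets)), key=lambda i: len(sets[i]))]
    current = sets[order[0]]
    remaining = set(range(len(sets))) - set(order)
    while remaining:
        # Sample from the current result and probe each candidate list.
        sample = rng.sample(sorted(current), min(sample_size, len(current))) if current else []
        def est(i):
            if not sample:
                return 0
            hits = sum(1 for x in sample if x in sets[i])
            return hits / len(sample) * len(current)
        nxt = min(remaining, key=est)
        order.append(nxt)
        current = current & sets[nxt]
        remaining.remove(nxt)
    return order, current
```

The greedy rule is the point: when predicates are correlated, static per-list sizes mislead, while sampled estimates of the intermediate intersections do not.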
How should a seller price his goods in a market where each buyer prefers a single good among his desired goods, and will buy the cheapest such good, as long as it is within his budget? We provide efficient algorithms that compute near-optimal prices for this problem, focusing on a commodity market, where the range of buyer budgets is small. We also show that our technique (which is based on LP-rounding) easily extends to a different scenario, in which the buyers want to buy all the desired goods, as long as they are within budget.
We design approximation algorithms for a number of fundamental optimization problems in metric spaces, namely computing separating and padded decompositions, sparse covers, and metric triangulations. Our work is the first to emphasize relative guarantees that compare the produced solution to the optimal one for the input at hand. By contrast, the extensive previous work on these topics has sought absolute bounds that hold for every possible metric space (or for a family of metrics). While absolute bounds typically translate to relative ones, our algorithms provide significantly better relative guarantees, using rather different techniques.
Our technical approach is to cast a number of metric clustering problems that have been well studied---but almost always as disparate problems---into a common modeling and algorithmic framework, which we call the consistent labeling problem. Having identified the common features of all of these problems, we provide a family of linear programming relaxations and simple randomized rounding procedures that achieve provably good approximation guarantees.
We prove the first non-trivial communication complexity lower bound for the problem of estimating the edit distance (aka Levenshtein distance) between two strings. A major feature of our result is that it provides the first setting in which the complexity of computing the edit distance is provably larger than that of Hamming distance.
Our lower bound exhibits a trade-off between approximation and communication, asserting, for example, that protocols with O(1) bits of communication can only obtain approximation Omega(log d/loglog d), where d is the length of the input strings. This case of O(1) communication is of particular importance, since it captures constant-size sketches as well as embeddings into spaces like L_1 and squared-L_2, two prevailing algorithmic approaches for dealing with edit distance. Furthermore, the bound holds not only for strings over alphabet Sigma={0,1}, but also for strings that are permutations (called the Ulam metric).
Besides being applicable to a much richer class of algorithms than all previous results, our bounds are near-tight in at least one case, namely that of embedding permutations into L_1. The proof uses a new technique that relies on Fourier analysis in a rather elementary way.
The Earth Mover Distance (EMD) between two equal-size sets of points in R^d is defined to be the minimum cost of a bipartite matching between the two pointsets. It is a natural metric for comparing sets of features, and as such, it has received significant interest in computer vision. Motivated by recent developments in that area, we address computational problems involving EMD over high-dimensional pointsets.
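To make the definition concrete, the following brute-force sketch (function name and interface are ours, for illustration only) computes EMD exactly on tiny pointsets by minimizing the matching cost over all permutations; it is exponential and serves only to pin down what the efficient algorithms approximate:

```python
from itertools import permutations

def emd(A, B):
    """Exact Earth Mover Distance between two equal-size pointsets in R^d:
    the minimum cost of a bipartite (perfect) matching between them.
    Brute force over all matchings -- exponential, for illustration only."""
    assert len(A) == len(B)
    def dist(p, q):
        # Euclidean distance between two points.
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
    return min(sum(dist(a, b) for a, b in zip(A, perm))
               for perm in permutations(B))
```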
A natural approach is to embed the EMD metric into l_1, and use the algorithms designed for the latter space. However, Khot and Naor [FOCS 2005] show that any embedding of EMD over the d-dimensional Hamming cube into l_1 must incur a distortion Omega(d), thus practically losing all distance information. We circumvent this roadblock by focusing on sets with cardinalities upper-bounded by a parameter s, and achieve a distortion of only O(log s log d). Since in applications the feature sets have bounded size, the resulting distortion is much smaller than the Omega(d) lower bound. Our approach is quite general and easily extends to EMD over R^d.
We then provide a strong lower bound on the multi-round communication complexity of estimating EMD, which in particular strengthens the known non-embeddability result of Khot and Naor. Our bound exhibits a smooth tradeoff between approximation and communication, and for example implies that every algorithm that estimates EMD using constant size sketches can only achieve Omega(log s) approximation.
Network triangulation is a method for estimating distances between nodes in a network, by letting every node measure its distance to a few beacon nodes, and deducing the distance between every two nodes x,y by using their measurements to their common beacons and applying the triangle inequality. Kleinberg, Slivkins and Wexler [FOCS 2004] initiated a theoretical study of triangulation in metric spaces, and Slivkins [PODC 2005] subsequently showed that metrics of bounded doubling dimension admit a triangulation that approximates arbitrarily well all pairwise distances using only O(log n) beacons per point, where n is the number of points in the network. He then asked whether this O(log n) term is necessary (for doubling metrics).
We provide the first lower bounds on the number of beacons required for a triangulation in some specific simple networks. In particular, these bounds (i) answer Slivkins' question positively, even for one-dimensional metrics, and (ii) prove that, up to constants, Slivkins' triangulation achieves an optimal number of beacons (as a function of the approximation guarantee and the doubling dimension).
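For concreteness, the triangulation estimate itself is elementary: given each node's measured distances to the common beacons, the triangle inequality pins d(x,y) to an interval. A minimal sketch (names are ours):

```python
def triangulation_bounds(dx, dy):
    """Given two nodes' measured beacon distances (dicts beacon -> distance),
    return the (lower, upper) bounds on d(x, y) implied by the triangle
    inequality over their common beacons."""
    beacons = dx.keys() & dy.keys()
    # |d(x,b) - d(y,b)| <= d(x,y) <= d(x,b) + d(y,b) for every common beacon b.
    lower = max(abs(dx[b] - dy[b]) for b in beacons)
    upper = min(dx[b] + dy[b] for b in beacons)
    return lower, upper
```

The lower bounds in the paper concern how many beacons are needed before this interval becomes tight for all pairs.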
The distance to monotonicity of a sequence is the minimum number of edit operations required to transform the sequence into increasing order; this measure is complementary to the length of the longest increasing subsequence (LIS). We address the question of estimating these quantities in the one-pass data stream model and present the first sub-linear space algorithms for both problems.
We first present O(sqrt{n})-space deterministic algorithms that approximate the distance to monotonicity and the LIS to within a factor that is arbitrarily close to $1$. We also show a lower bound of Omega(n) on the space required by any randomized algorithm to compute the LIS (or alternatively the distance from monotonicity) exactly, demonstrating that approximation is necessary for sub-linear space computation; this bound improves upon the existing lower bound of Omega(sqrt{n}) [Liben-Nowell, Vee, and Zhu, J. Combinatorial Optimization, 2006].
Our main result is a randomized algorithm that uses only O(log^2 n) space and approximates the distance to monotonicity to within a factor that is arbitrarily close to 4. In contrast, we believe that significantly reducing the space complexity of approximating the length of the LIS is considerably harder. We conjecture that any deterministic 1 + epsilon approximation algorithm for LIS requires Omega(sqrt{n}) space, and as a step towards this conjecture, prove a space lower bound of $\Omega(\sqrt n)$ for a restricted yet natural class of deterministic algorithms.
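The complementarity underlying these problems is easy to state exactly: over an unbounded ordered domain, the distance to monotonicity of a length-n sequence equals n minus its LIS length, and patience sorting computes the latter in O(n log n) time. This is the offline computation (not the streaming algorithms above); a minimal sketch:

```python
from bisect import bisect_left

def lis_length(seq):
    """Length of the longest strictly increasing subsequence (patience sorting)."""
    tails = []  # tails[i] = smallest possible tail of an increasing subseq of length i+1
    for x in seq:
        i = bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(tails)

def distance_to_monotonicity(seq):
    """Minimum number of elements to edit so that seq becomes increasing."""
    return len(seq) - lis_length(seq)
```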
We initiate the study of approximation algorithms on negatively curved spaces. These spaces have recently become of interest in various domains of computer science including networking and vision. The classical example of such a space is the real-hyperbolic space H^d for d>=2, but our approach applies to a more general family of spaces characterized by Gromov's (combinatorial) hyperbolic condition. We give efficient algorithms and data structures for problems like approximate nearest-neighbor search and compact, low-stretch routing on subsets of negatively curved spaces of fixed dimension (including H^d as a special case). In a different direction, we show that there is a PTAS for the Traveling Salesman Problem when the cities lie, for example, in H^d. This generalizes Arora's results for R^d.
Most of our algorithms use the intrinsic distance geometry of the data set, and only need the existence of an embedding into some negatively curved space in order to function properly. In other words, our algorithms regard the inter-point distance function as a black box, and are independent of the representation of the input points.
Edit distance is a fundamental measure of distance between strings, the extensive study of which has recently focused on computational problems such as nearest neighbor search, sketching and fast approximation. A very powerful paradigm is to map the metric space induced by the edit distance into a normed space (e.g., ℓ_{1}) with small distortion, and then use the rich algorithmic toolkit known for normed spaces. Although the minimum distortion required to embed edit distance into ℓ_{1} has received a lot of attention lately, there is a large gap between known upper and lower bounds. We make progress on this question by considering large, well-structured submetrics of the edit distance metric space.
Our main technical result is that the Ulam metric, namely, the edit distance on permutations of length at most n, embeds into ℓ_{1} with distortion O(log n). This immediately leads to sketching algorithms with constant size sketches, and to efficient approximate nearest neighbor search algorithms, with approximation factor O(log n). The embedding and its algorithmic consequences present a big improvement over those previously known for the Ulam metric, and they are significantly better than the state of the art for edit distance in general. Further, we extend these results for the Ulam metric to edit distance on strings that are (locally) non-repetitive, i.e., strings where (close by) substrings are distinct.
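To make the Ulam metric concrete: under the standard character-move operations, the distance between two permutations equals n minus the length of their longest common subsequence, which for permutations reduces to a longest-increasing-subsequence computation. A minimal sketch (names are ours):

```python
from bisect import bisect_left

def ulam_distance(p, q):
    """Ulam distance between two permutations of the same symbols:
    n minus their LCS length; for permutations, LCS reduces to LIS
    after relabeling q's symbols by their positions in p."""
    pos = {v: i for i, v in enumerate(p)}
    relabeled = [pos[v] for v in q]
    # Patience sorting computes the LIS length of the relabeled sequence.
    tails = []
    for x in relabeled:
        i = bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(p) - len(tails)
```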
We simplify and improve upon recent lower bounds on the minimum distortion of embedding certain finite metric spaces into L_1. In particular, we show that for infinitely many values of n there are n-point metric spaces of negative type that require a distortion of Omega(log log n) for such an embedding, implying the same lower bound on the integrality gap of a well-known SDP relaxation for Sparsest-Cut. This result builds upon and improves the recent lower bound of (log log n)^{1/6-o(1)} due to Khot and Vishnoi [STOC 2005]. We also show that embedding the edit distance on {0,1}^n into L_1 requires a distortion of Omega(log n). This result simplifies and improves a very recent lower bound due to Khot and Naor [FOCS 2005].
We show that the Multicut, Sparsest-Cut, and Min-2CNF≡ Deletion problems are NP-hard to approximate within every constant factor, assuming the Unique Games Conjecture of Khot [STOC, 2002]. A quantitatively stronger version of the conjecture implies an inapproximability factor of $\Omega(\sqrt{\log\log n})$.
We address the problems of pattern matching and approximate pattern matching in the sketching model. We show that it is impossible to compress the text into a small sketch and use only the sketch to decide whether a given pattern occurs in the text. We also prove a sketch size lower bound for approximate pattern matching, and show it is tight up to a logarithmic factor.
Edit distance has been extensively studied for the past several years. Nevertheless, no linear-time algorithm is known to compute the edit distance between two strings, or even to approximate it to within a modest factor. Furthermore, for various natural algorithmic problems such as low-distortion embeddings into normed spaces, approximate nearest-neighbor schemes, and sketching algorithms, known results for the edit distance are rather weak. We develop algorithms that solve gap versions of the edit distance problem: given two strings of length $n$ with the promise that their edit distance is either at most $k$ or greater than $l$, decide which of the two holds. We present two sketching algorithms for gap versions of edit distance. Our first algorithm solves the $k$ vs. $(k n)^{2/3}$ gap problem, using a constant-size sketch. A more involved algorithm solves the stronger $k$ vs. $l$ gap problem, where $l$ can be as small as $O(k^2)$---still with a constant-size sketch---but works only for strings that are mildly ``non-repetitive''. Finally, we develop an $n^{3/7}$-approximation algorithm running in quasi-linear time for edit distance, improving the previous best factor of $n^{3/4}$ due to Cole and Hariharan [SIAM J. on Computing, 2002]; if the input strings are assumed to be non-repetitive, then the approximation factor can be strengthened to $n^{1/3}$.
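For reference, the exact quadratic-time dynamic program that these sketching and approximation algorithms aim to beat can be sketched as follows (standard Wagner–Fischer recurrence, with a rolling row to keep space linear):

```python
def edit_distance(s, t):
    """Exact edit distance (insertions, deletions, substitutions)
    via dynamic programming in O(|s| * |t|) time and O(|t|) space."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))  # distances from s[:0] to every prefix of t
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                           # delete s[i-1]
                         cur[j - 1] + 1,                        # insert t[j-1]
                         prev[j - 1] + (s[i - 1] != t[j - 1]))  # substitute
        prev = cur
    return prev[n]
```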
We devise a new embedding technique, which we call measured descent, based on decomposing a metric space locally, at varying speeds, according to the density of some probability measure. This provides a refined and unified framework for the two primary methods of constructing Fr\'echet embeddings for finite metrics, due to [Bourgain, 1985] and [Rao, 1999]. We prove that any $n$-point metric space $(X,d)$ embeds in Hilbert space with distortion $O(\sqrt{\alpha_X \cdot \log n})$, where $\alpha_X$ is a geometric estimate on the decomposability of $X$. As an immediate corollary, we obtain an $O(\sqrt{\log \lambda_X \log n})$ distortion embedding, where $\lambda_X$ is the doubling constant of $X$. Since $\lambda_X\le n$, this result recovers Bourgain's theorem, but when the metric $X$ is, in a sense, ``low-dimensional,'' improved bounds are achieved.
Our embeddings are volume-respecting for subsets of arbitrary size. One consequence is the existence of $(k, \log n)$ volume-respecting embeddings for all $1 \leq k \leq n$, which is the best possible, and answers positively a question posed in [Feige, 1998]. Our techniques are also used to answer positively a question of Y. Rabinovich, showing that any weighted $n$-point planar graph embeds in $\ell_\infty^{O(\log n)}$ with $O(1)$ distortion. The $O(\log n)$ bound on the dimension is optimal, and improves upon the previously known bound of $O(\log^2 n)$.
We define a natural notion of efficiency for approximate nearest-neighbor (ANN) search in general $n$-point metric spaces, namely the existence of a randomized algorithm which answers (1+ε)-approximate nearest neighbor queries in polylog(n) time using only polynomial space. We then study which families of metric spaces admit efficient ANN schemes in the black-box model, where only oracle access to the distance function is given, and any query consistent with the triangle inequality may be asked.
For ε < 2/5, we offer a complete answer to this problem. Using the notion of metric dimension defined in [Gupta, Krauthgamer, and Lee, 2003] (\`a la [Assouad, 1983]), we show that a metric space $X$ admits an efficient (1+ε)-ANN scheme for any ε < 2/5 if and only if $\dim(X) = O(\log \log n)$. For coarser approximations, clearly the upper bound continues to hold, but there is a threshold at which our lower bound breaks down---this is precisely when points in the ``ambient space'' may begin to affect the complexity of ``hard'' subspaces $S \subseteq X$. Indeed, we give examples which show that $\dim(X)$ does not characterize the black-box complexity of ANN above the threshold.
Our scheme for ANN in low-dimensional metric spaces is the first to yield efficient algorithms without relying on any heuristic assumptions on the input. In previous approaches (e.g., [Clarkson, 1999], [Karger and Ruhl, 2002], [Krauthgamer and Lee, 2004]) even spaces with $\dim(X) = O(1)$ sometimes required $\Omega(n)$ query times.
Aggregate statistical data about the web is very useful in numerous scenarios, such as market research, intelligence, and social studies. In many of these applications, one is interested not in generic data about the whole web but rather in highly focused information pertinent to a specific domain or topic. Furthermore, timely acquisition of this information provides a competitive advantage.
Focused statistical data can be gathered by a brute force crawl of the whole web, or by a "focused crawl", which collects mainly pages that are relevant to the topic of interest. Crawling, however, is an expensive enterprise, requiring substantial resources. For the purpose of gathering statistical data, random sampling of web pages is a much faster, cheaper, and even more reliable approach. We develop the first efficient method for generating a random sample of web pages relevant to a given user-specified topic. Previously, techniques for getting only an unfocused sample of pages from the whole web were known. Our method is based on a new random walk on (a modified version of) a subgraph of the web graph.
We present a simple deterministic data structure for maintaining a set $S$ of points in a general metric space, while supporting proximity search (nearest neighbor and range queries) and updates to $S$ (insertions and deletions). Our data structure consists of a sequence of progressively finer $\epsilon$-nets of $S$, with pointers that allow us to navigate easily from one scale to the next. We analyze the worst-case complexity of this data structure in terms of the ``abstract dimensionality'' of the metric $S$. Our data structure is extremely efficient for metrics of bounded dimension and is essentially optimal in a certain model of distance computation. Finally, as a special case, our approach improves over one recently devised by Karger and Ruhl [STOC, 2002].
Given a metric space $(X,d)$, a natural distance measure on probability distributions over $X$ is the {\em earthmover metric}. We use randomized rounding of earthmover metrics to devise new approximation algorithms for two well-known classification problems, namely, metric labeling and 0-extension.
Our first result is for the 0-extension problem. We show that if the terminal metric is decomposable with parameter $\alpha$ (e.g., planar metrics are decomposable with $\alpha=O(1)$), then the earthmover based linear program (for $0$-extension) can be rounded to within an $O(\alpha)$ factor.
Our second result is an $O(\log n)$-approximation for metric labeling, using probabilistic tree embeddings in a way very different from the $O(\log k)$-approximation of Kleinberg and Tardos [JACM, 2002]. (Here, $n$ is the number of nodes, and $k$ is the number of labels.) The key element is rounding the earthmover based linear program (for metric labeling) without increasing the solution's cost, when the input graph is a tree. This rounding method also provides an alternate proof to a result stated in Chekuri et al. [SODA, 2001], that the earthmover based linear program is integral when the input graph is a tree.
Our simple and constructive rounding techniques contribute to the understanding of earthmover metrics and may be of independent interest.
In the Asymmetric $k$-Center problem, the input is an integer $k$ and a complete digraph over $n$ points together with a distance function obeying the directed triangle inequality. The goal is to choose a set of $k$ points to serve as centers and to assign all the points to the centers, so that the maximum distance of any point from its center is as small as possible.
We show that the Asymmetric $k$-Center problem is hard to approximate up to a factor of $\log^* n - O(1)$ unless $NP \subseteq DTIME(n^{\log \log n})$. Since an $O(\log^* n)$-approximation algorithm is known for this problem, this resolves the asymptotic approximability of Asymmetric $k$-Center. This is the first natural problem whose approximability threshold does not polynomially relate to the known approximation classes. We also resolve the approximability threshold of the metric (symmetric) $k$-Center problem with costs.
The doubling constant of a metric space $(X,d)$ is the smallest value $\lambda$ such that every ball in $X$ can be covered by $\lambda$ balls of half the radius. The doubling dimension of $X$ is then defined as $dim(X) = \log_2 \lambda$. A metric (or sequence of metrics) is called doubling precisely when its doubling dimension is bounded. This is a robust class of metric spaces which contains many families of metrics that occur in applied settings.
We give tight bounds for embedding doubling metrics into (low-dimensional) normed spaces. We consider both general doubling metrics, as well as more restricted families such as those arising from trees, from graphs excluding a fixed minor, and from snowflaked metrics. Our techniques include decomposition theorems for doubling metrics, and an analysis of a fractal in the plane due to Laakso [Bull. London Math. Soc., 2002]. Finally, we discuss some applications and point out a central open question regarding dimensionality reduction in $L_2$.
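On a finite metric the doubling constant can be computed directly from the definition. A minimal sketch (exact but exponential, so only for tiny spaces; since ball contents change only at pairwise distances, those suffice as candidate radii):

```python
from itertools import combinations

def doubling_constant(points, dist):
    """Smallest lambda such that every ball B(c, r) in the finite metric
    (points, dist) is covered by lambda balls of radius r/2.
    Brute force over all center sets -- exponential, for tiny metrics only."""
    radii = {dist(p, q) for p in points for q in points if dist(p, q) > 0}
    lam = 1
    for c in points:
        for r in radii:
            ball = [p for p in points if dist(c, p) <= r]
            # Find the minimum number k of half-radius balls covering B(c, r).
            for k in range(1, len(points) + 1):
                if any(all(any(dist(m, p) <= r / 2 for m in centers) for p in ball)
                       for centers in combinations(points, k)):
                    lam = max(lam, k)
                    break
    return lam
```

For instance, four equally spaced points on a line already need three half-radius balls to cover the radius-1 ball around an interior point, since each radius-1/2 ball is a singleton there.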
Motivation: Comparing two protein databases is a fundamental task in biosequence annotation. Given two databases, one must find all pairs of proteins that align with high score under a biologically meaningful substitution score matrix, such as a BLOSUM matrix (Henikoff and Henikoff, 1992). Distance-based approaches to this problem map each peptide in the database to a point in a metric space, such that peptides aligning with higher scores are mapped to closer points. Many techniques exist to discover close pairs of points in a metric space efficiently, but the challenge in applying this work to proteomic comparison is to find a distance mapping that accurately encodes all the distinctions among residue pairs made by a proteomic score matrix. Buhler (2002) proposed one such mapping but found that it led to a relatively inefficient algorithm for protein-protein comparison.
Results: This work proposes a new distance mapping for peptides under the BLOSUM matrices that permits more efficient similarity search. We first propose a new distance function on peptides derived from a given score matrix. We then show how to map peptides to bit vectors such that the distance between any two peptides is closely approximated by the Hamming distance (i.e., number of mismatches) between their corresponding bit vectors. We combine these two results with the LSH-ALL-PAIRS-SIM algorithm of Buhler (2002) to produce an improved distance-based algorithm for proteomic comparison. An initial implementation of the improved algorithm exhibits sensitivity within 5% of that of the original LSH-ALL-PAIRS-SIM, while running up to eight times faster.
We devise the first constant factor approximation algorithm for minimum quotient vertex-cuts in planar graphs. Our algorithm achieves approximation ratio $3.5 (1+\epsilon)^2$ with running time $O(W n^{3+2/\epsilon})$. The approximation ratio improves to $\frac{4}{3} (1+\epsilon)$ if there is an optimal quotient vertex-cut $(A^*,B^*,C^*)$ where $C^*$ has relatively small weight compared to $A^*$ and $B^*$; this holds, for example, when the input graph has uniform (or close to uniform) weight and cost. The approximation ratio further improves to $1+\epsilon$ if, in addition, $\min( w(A^*),w(B^*) ) \le \frac{1}{3} W$.
We use our quotient cut algorithm to find the first constant-factor pseudo-approximation for vertex separators in planar graphs.
Our technical contribution is two-fold. First, we prove a structural theorem for vertex-cuts in planar graphs, showing the existence of a near-optimal vertex-cut whose high-level structure is that of a bounded-depth tree. Second, we develop an algorithm for optimizing over such complex structures, whose running time depends (exponentially) not on the size of the structure, but rather only on its depth. These techniques may be applicable in other problems.
We resolve the following conjecture raised by Levin together with Linial, London, and Rabinovich [Combinatorica, 1995]. Let $\dim(G)$ be the smallest $d$ such that $G$ occurs as a (not necessarily induced) subgraph of the infinite graph $Z_\infty^d$, whose vertex set is $Z^d$ and which has an edge $(u,v)$ whenever $||u-v||_\infty = 1$. The growth rate of $G$, denoted $\rho_G$, is the minimum $\rho$ such that every ball of radius $r>1$ in $G$ contains at most $r^\rho$ vertices. By simple volume arguments, $\dim(G) = \Omega(\rho_G)$. Levin conjectured that this lower bound is tight, i.e., that $\dim(G) = O(\rho_G)$ for every graph $G$.
We show that a weaker form of Levin's conjecture, namely, $\dim(G) = O(\rho_G \log \rho_G)$, holds for every graph $G$. We disprove, however, the specific bound of the conjecture and show that our upper bound is tight by exhibiting graphs for which $\dim(G) =\Omega(\rho_G \log \rho_G)$. For families of graphs which exclude a fixed minor, we salvage the strong form, showing that $\dim(G) = O(\rho_G)$. This holds also for graphs without long induced simple cycles. Our results extend to a variant of the conjecture for finite-dimensional Euclidean spaces due to Linial [Haifa Workshop, 2002].
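To make the growth rate concrete, here is a minimal sketch that computes $\rho_G$ for a small unweighted graph by BFS from every vertex, taking integer radii $r \ge 2$ (a simplifying assumption consistent with the definition above):

```python
import math
from collections import deque

def growth_rate(adj):
    """Growth rate rho_G of an unweighted graph given as {vertex: neighbors}:
    the smallest rho such that every ball of radius r > 1 contains at most
    r^rho vertices, maximized over BFS balls at integer radii r >= 2."""
    rho = 0.0
    for src in adj:
        # BFS distances from src.
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        maxd = max(dist.values())
        for r in range(2, max(maxd, 2) + 1):
            ball = sum(1 for d in dist.values() if d <= r)
            rho = max(rho, math.log(ball) / math.log(r))
    return rho
```

On a path with five vertices, the worst ball is the radius-2 ball around the center, which contains all 5 vertices, giving rho = log 5 / log 2.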
We provide the first hardness result for a polylogarithmic approximation ratio (for a natural NP-hard optimization problem). We show that for every fixed $\epsilon>0$, the Group Steiner Tree problem admits no efficient $\log^{2-\epsilon} k$--approximation, where $k$ denotes the number of groups, unless NP has quasi-polynomial Las-Vegas algorithms. This result holds even for input graphs which are Hierarchically Well-Separated Trees, introduced by Bartal [FOCS, 1996], in which case our bound is nearly tight with the $O(\log^2 k)$--approximation currently known. Our results imply that for every fixed $\epsilon>0$, the Directed Steiner Tree problem admits no $\log^{2-\epsilon} n$--approximation, where $n$ is the number of vertices in the graph, under the same complexity assumption.
We present an $\Omega(\log^2 k)$ lower bound on the integrality ratio of the flow-based relaxation for the Group Steiner Tree problem, where $k$ denotes the number of groups; this holds even for input graphs that are Hierarchically Well-Separated Trees, introduced by Bartal [FOCS, 1996], in which case this lower bound is tight. This relaxation appears to be the only one that has been studied for the problem, as well as for its generalization, the Directed Steiner Tree problem. For the latter problem, our results imply an $\Omega(\frac{\log^2 n}{(\log \log n)^2})$ integrality ratio, where $n$ is the number of vertices in the graph. For both problems, this is the first known lower bound on the integrality ratio that is superlogarithmic in the input size. We also show algorithmically that the integrality ratio for Group Steiner Tree is much better for certain families of instances, which helps pinpoint the types of instances that appear to be most difficult to approximate.
Data dimensionality is a crucial issue in a variety of settings, where it is desirable to determine whether a data set given in a high-dimensional space adheres to a low-dimensional structure. We study this problem in the framework of property testing: Given a query access to a data set $S$, we wish to determine whether $S$ is low-dimensional, or whether it should be modified significantly in order to have the property. Allowing a constant probability of error, we aim at algorithms whose complexity does not depend on the size of $S$.
We present algorithms for testing the low-dimensionality of a set of vectors and for testing whether a matrix is of low rank. We then address low-dimensionality in metric spaces. For vectors in the metric space $l_1$, we show that low-dimensionality is not testable. For $l_2$, we show that a data set can be tested for having a low-dimensional structure, but that the property of approximately having such a structure is not testable.
In the survivable network design problem (SNDP), the goal is to find a minimum-cost subgraph satisfying certain connectivity requirements. We study the vertex-connectivity variant of SNDP in which the input specifies, for each pair of vertices, a required number of vertex-disjoint paths connecting them.
We give the first lower bound on the approximability of SNDP, showing that the problem admits no efficient $2^{\log^{1-\epsilon} n}$-approximation for any fixed $\epsilon>0$ unless $NP \subseteq DTIME(n^{\mathrm{polylog}(n)})$. We also show hardness of approximation results for several important special cases of SNDP, including constant factor hardness for the $k$-vertex connected spanning subgraph problem ($k$-VCSS) and for the vertex-connectivity augmentation problem, even when the edge costs are severely restricted.
The extensive study of metric spaces and their embeddings has so far focused on embeddings that preserve pairwise distances. A very intriguing concept introduced by Feige [JCSS, 2000] allows us to quantify the extent to which larger structures are preserved by a given embedding. We investigate this concept, focusing on several major graph families such as paths, trees, cubes and expanders. We find some similarities to the regular (pairwise) distortion, as well as some striking differences.
Lovasz and Schrijver [SIAM J. Optim., 1991] devised a lift-and-project method that produces a sequence of convex relaxations for the problem of finding in a graph an independent set (or a clique) of maximum size. Each relaxation in the sequence is tighter than the one before it, while the first relaxation is already at least as strong as the Lovasz theta function [IEEE Trans. Inform. Theory, 1979]. We show that on a random graph $G_{n,1/2}$, the value of the $r$th relaxation in the sequence is roughly $\sqrt{n/2^r}$, almost surely. It follows that for those relaxations known to be efficiently computable, namely for $r=O(1)$, the value of the relaxation is comparable to the theta function. Furthermore, a perfectly tight relaxation is almost surely obtained only at the $r=\Theta(\log n)$ relaxation in the sequence.
The notion of private approximation was introduced recently by Feigenbaum, Fong, Strauss and Wright. Informally, a private approximation of a function $f$ is another function $F$ that approximates $f$ in the usual sense, but does not yield any information on $x$ other than what can be deduced from $f(x)$. As such, $F(x)$ is useful for private computation of $f(x)$ (assuming that $F$ can be computed more efficiently than $f$).
In this work we examine the properties and limitations of this new notion. Specifically, we show that for many NP-hard problems, the privacy requirement precludes non-trivial approximation. This is the case even for problems that otherwise admit very good approximation (e.g., problems with PTAS). On the other hand, we show that slightly relaxing the privacy requirement, by means of leaking ``just a few bits of information'' about $x$, again permits good approximation.
A web content hosting service provider needs to dynamically allocate servers in a server farm to its customers' web sites. Ideally, the allocation to a site should always suffice to handle its load. However, due to a limited number of servers and the overhead incurred in changing the allocation of a server from one site to another, the system may become overloaded. The problem faced by the web hosting service provider is how to allocate the available servers in the most profitable way. Adding to the complexity of this problem is the fact that future loads of the sites are either unknown or known only for the very near future.
In this paper we model this server allocation problem, and consider both its offline and online versions. We give a polynomial time algorithm for computing the optimal offline allocation. In the online setting, we give almost-optimal algorithms (both deterministic and randomized) for any positive lookahead; the quality of the solution improves as the lookahead increases. We also consider several special cases of practical interest. Finally, we present experimental results using actual trace data, showing that one of our online algorithms performs very close to optimal.
Interestingly, the online server allocation problem can be cast as a more general benefit task system that we define. Our results extend to this task system, which captures also the benefit maximization variants of the $k$-server problem and the metrical task system problem. It follows that the benefit maximization variants of these problems are more tractable than their cost minimization variants.
The achromatic number problem is to legally color the vertices of an input graph with the maximum number of colors, denoted $\psi^*$, so that every two color classes share at least one edge. This problem is known to be NP-hard.
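As a concrete illustration (our own sketch, not part of the paper), the feasibility condition behind $\psi^*$ can be checked directly: a coloring counts only if it is proper and every two color classes share an edge. The function name `is_complete_coloring` is ours.

```python
from itertools import combinations

def is_complete_coloring(edges, coloring):
    """Check that `coloring` (node -> color) is legal (no edge is
    monochromatic) and complete (every two color classes share an edge)."""
    # Proper: no edge joins two nodes of the same color.
    if any(coloring[u] == coloring[v] for u, v in edges):
        return False
    # Complete: every pair of colors appears on some edge.
    colors = set(coloring.values())
    seen = {frozenset((coloring[u], coloring[v])) for u, v in edges}
    return all(frozenset((a, b)) in seen for a, b in combinations(colors, 2))

# A 5-cycle admits a complete coloring with 3 colors, so psi* >= 3 here.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
coloring = {0: 0, 1: 1, 2: 0, 3: 2, 4: 1}
```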
For general graphs we give an algorithm that approximates the achromatic number within a ratio of $O(n \log\log n/\log n)$. This improves over the previously known approximation ratio of $O(n/\sqrt{\log n})$, due to Chaudhary and Vishwanathan (SODA, 1997).
For graphs of girth at least 5 we give an algorithm with approximation ratio $O(\min\{n^{1/3},\sqrt{\psi^*}\})$. This improves over the approximation ratio $O(\sqrt{\psi^*})=O(n^{3/8})$ obtained for the more restricted case of graphs of girth at least 6 by Krysta and Lorys (ESA, 1999).
We also give the first hardness result for approximating the achromatic number. We show that for every fixed $\epsilon>0$ there is no $(2-\epsilon)$-approximation algorithm, unless $P=NP$.
A bisection of a graph with $n$ vertices is a partition of its vertices into two sets, each of size $n/2$. The bisection cost is the number of edges connecting the two sets. Finding the bisection of minimum cost is NP-hard. We present an algorithm that finds a bisection whose cost is within a ratio of $O(\log^2 n)$ of the optimal. For graphs excluding any fixed graph as a minor (e.g. planar graphs) we obtain an improved approximation ratio of $O(\log n)$. The previously known approximation ratio for bisection was roughly $\sqrt{n}$.
A bisection of a graph with $n$ vertices is a partition of its vertices into two sets, each of size $n/2$. The bisection size is the number of edges connecting the two sets. Finding the bisection of minimum size is NP-hard. We present an algorithm that finds a bisection whose size is within a factor of $O(\sqrt{n}\log n)$ of the minimum. No sublinear approximation ratio for bisection was previously known.
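For concreteness (our own sketch, not the papers' algorithms), here is the object being approximated, together with an exhaustive search that is feasible only for tiny graphs, consistent with the problem's NP-hardness:

```python
from itertools import combinations

def bisection_size(edges, half):
    """Number of edges crossing between `half` and its complement."""
    return sum(1 for u, v in edges if (u in half) != (v in half))

def min_bisection(n, edges):
    """Exhaustive minimum bisection; exponential time, tiny n only."""
    return min(bisection_size(edges, set(h))
               for h in combinations(range(n), n // 2))

# On the 4-cycle, every balanced split cuts 2 or 4 edges; the minimum is 2.
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
```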
We consider the problem of finding in an undirected graph a minimum cut that separates exactly a given number $k$ of vertices. For general $k$ (i.e. $k$ is part of the input and may depend on $n$) this problem is NP-hard.
We present for this problem a randomized approximation algorithm, which is useful when $k$ is relatively small. In particular, for $k = O(\log n)$ we obtain a PTAS (polynomial time approximation scheme), and for $k=\Omega(\log n)$ we obtain an approximation ratio $O(k/\log n)$.
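A brute-force reference implementation (our own sketch) makes the objective precise: over all vertex sets of size exactly $k$, minimize the number of crossing edges. The search takes $\binom{n}{k}$ time, which is polynomial only for constant $k$, hence the interest in approximation algorithms.

```python
from itertools import combinations

def min_cut_exactly_k(n, edges, k):
    """Minimum number of edges separating some set of exactly k vertices
    from the rest; exhaustive, so usable only for small n and k."""
    return min(sum(1 for u, v in edges if (u in s) != (v in s))
               for s in map(set, combinations(range(n), k)))

# On the path 0-1-2-3-4, cutting off {0, 1} costs a single edge.
path = [(0, 1), (1, 2), (2, 3), (3, 4)]
```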
The motivation for our work is the observation that Web pages on a particular topic are often linked to other pages on the same topic. We model and analyze the problem of how to improve the classification of Web pages (that is, determining the topic of the page) by using link information. In our setting, an initial classifier examines the text of a Web page and assigns to it some classification, possibly mistaken. We investigate how to reduce the error probability using the observation above, thus building an improved classifier. We present a theoretical framework for this problem based on a random graph model and suggest two linear time algorithms, based on similar methods that have been proven effective in the setting of error-correcting codes. We provide simulation results to verify our analysis and to compare the performance of our suggested algorithms.
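One natural linear-time correction rule in this spirit (a hedged sketch; the paper's two algorithms follow similar error-correcting ideas, but this exact rule is our own illustration): every page re-labels itself by a majority vote over its initial label and those of its neighbors.

```python
from collections import Counter

def majority_relabel(adj, labels):
    """One synchronous round: each node adopts the most frequent label
    among itself and its neighbors (its own label wins ties)."""
    new = {}
    for v, nbrs in adj.items():
        counts = Counter(labels[u] for u in nbrs)
        counts[labels[v]] += 1  # the node's own initial label also votes
        best = labels[v]
        for lab, c in counts.items():
            if c > counts[best]:
                best = lab
        new[v] = best
    return new

# A triangle of pages on topic "A", one of them initially misclassified.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
labels = {0: "A", 1: "A", 2: "B"}
```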
Alon, Krivelevich and Sudakov (Random Structures and Algorithms, 1998) designed an algorithm based on spectral techniques that almost surely finds a clique of size $\Omega(\sqrt{n})$ hidden in an otherwise random graph. We show that a different algorithm, based on the Lovasz theta function, almost surely both finds the hidden clique and certifies its optimality. Our algorithm has an additional advantage of being more robust: it also works in a semi-random hidden clique model, in which an adversary can remove edges from the random portion of the graph.
Hot-potato routing is a form of synchronous routing which makes no use of buffers at intermediate nodes. Packets must move at every time step, until they reach their destination. If contention prevents a packet from taking its preferred outgoing edge, it is deflected on a different edge. Two simple design principles for hot-potato routing algorithms are minimum advance, which advances at least one packet towards its destination from every nonempty node (and possibly deflects all other packets), and maximum advance, which advances the maximum possible number of packets.
Livelock is a situation in which packets keep moving indefinitely in the network without any packet ever reaching its destination. It is known that even maximum advance algorithms might livelock on some networks. We show that minimum advance algorithms never livelock on tree networks, and that maximum advance algorithms never livelock on triangulated networks.
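To make the minimum-advance principle concrete, here is one simple single-node scheduler satisfying it (our own sketch; any rule that advances at least one packet per nonempty node qualifies): packets whose preferred edge is still free advance, and the rest are deflected onto leftover edges, so no packet ever waits.

```python
def minimum_advance_step(packets, out_edges):
    """Assign each packet at a node to a distinct outgoing edge.
    `packets` is a list of (packet_id, preferred_edge) pairs with
    len(packets) <= len(out_edges).  The first packet's preferred edge is
    always free, so at least one packet advances (minimum advance)."""
    assignment = {}
    free = list(out_edges)
    # Advance every packet whose preferred edge is still available.
    for pid, pref in packets:
        if pref in free:
            assignment[pid] = pref
            free.remove(pref)
    # Deflect the remaining packets onto leftover edges.
    for pid, pref in packets:
        if pid not in assignment:
            assignment[pid] = free.pop()
    return assignment

# Three packets contend at a node of degree 3; packets 1 and 3 advance
# along their preferred edges, packet 2 is deflected onto edge "c".
assignment = minimum_advance_step([(1, "a"), (2, "a"), (3, "b")],
                                  ["a", "b", "c"])
```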
A stereoscopic family of permutations maps an $m$-dimensional mesh into several one-dimensional lines, in a way that jointly preserves distance information. Specifically, consider any two points and denote their distance on the $m$-dimensional mesh by $d$. Then the distance between their images, on the line on which these images are closest together, is $O(d^m)$.
We initiate a systematic study of stereoscopic families of permutations. We show a construction of these families that involves the use of $m+1$ images. We also show that under some additional restrictions (namely, adjacent points on the image lines originate at points which are not too far away on the mesh), three images are necessary in order to construct such a family for the two-dimensional mesh.
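The defining guarantee can be phrased as a checkable property (our own verifier sketch, not the paper's construction): for every pair of mesh points at distance $d$, some image line places the pair within $O(d^m)$. On a $2\times 2$ mesh with row-major and column-major orders as the two images, the worst ratio is already 1.

```python
def worst_stereo_ratio(points, images, mesh_dist, m):
    """Largest value of (closest image distance) / d**m over all point
    pairs; a stereoscopic family keeps this bounded by a constant."""
    pts = list(points)
    worst = 0.0
    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):
            p, q = pts[i], pts[j]
            d = mesh_dist(p, q)
            closest = min(abs(img[p] - img[q]) for img in images)
            worst = max(worst, closest / d**m)
    return worst

# 2x2 mesh, L1 distance, two images: row-major and column-major orders.
pts = [(x, y) for x in range(2) for y in range(2)]
row_major = {(x, y): 2 * x + y for x, y in pts}
col_major = {(x, y): 2 * y + x for x, y in pts}
l1 = lambda p, q: abs(p[0] - q[0]) + abs(p[1] - q[1])
ratio = worst_stereo_ratio(pts, [row_major, col_major], l1, 2)
```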
We present two applications for stereoscopic families of permutations. One application is an algorithm for routing on the mesh that guarantees delivery of each packet within a number of steps that depends upon the distance between this packet's source and destination, but is independent of the size of the mesh. Our algorithm is exceptionally simple, involves no queues, and can be used in dynamic settings in which packets are continuously generated. Another application is an extension of the construction of non-expansive hash functions of Linial and Sasson (STOC 1996) from the case of one-dimensional metrics to arbitrary dimensions.