Abstract:
We consider the problem of learning unions of rectangles over the domain $[b]^n$, in the uniform distribution membership query learning setting, where both b and n are "large". We obtain poly$(n, \log b)$-time algorithms for the following classes: - poly$(n \log b)$-way Majority of $O(\frac{\log(n \log b)} {\log \log(n \log b)})$-dimensional rectangles. - Union of poly$(\log(n \log b))$ many $O(\frac{\log^2 (n \log b)} {(\log \log(n \log b) \log \log \log (n \log b))^2})$-dimensional rectangles. - poly$(n \log b)$-way Majority of poly$(n \log b)$-Or of disjoint $O(\frac{\log(n \log b)} {\log \log(n \log b)})$-dimensional rectangles. Our main algorithmic tool is an extension of Jackson's boosting- and Fourier-based Harmonic Sieve algorithm [Jackson 1997] to the domain $[b]^n$, building on work of [Akavia, Goldwasser, Safra 2003]. Other ingredients used to obtain the results stated above are techniques from exact learning [Beimel, Kushilevitz 1998] and ideas from recent work on learning augmented $AC^{0}$ circuits [Jackson, Klivans, Servedio 2002] and on representing Boolean functions as thresholds of parities [Klivans, Servedio 2001].

Abstract:
In this article we develop quantum algorithms for learning and testing juntas, i.e. Boolean functions which depend only on an unknown set of k out of n input variables. Our aim is to develop efficient algorithms: - whose sample complexity has no dependence on n, the dimension of the domain the Boolean functions are defined over; - with no access to any classical or quantum membership ("black-box") queries. Instead, our algorithms use only classical examples generated uniformly at random and fixed quantum superpositions of such classical examples; - which require only a few quantum examples but possibly many classical random examples (which are considered quite "cheap" relative to quantum examples). Our quantum algorithms are based on a subroutine FS which enables sampling according to the Fourier spectrum of f; the FS subroutine was used in earlier work of Bshouty and Jackson on quantum learning. Our results are as follows: - We give an algorithm for testing k-juntas to accuracy $\epsilon$ that uses $O(k/\epsilon)$ quantum examples. This improves on the number of examples used by the best known classical algorithm. - We establish the following lower bound: any FS-based k-junta testing algorithm requires $\Omega(\sqrt{k})$ queries. - We give an algorithm for learning $k$-juntas to accuracy $\epsilon$ that uses $O(\epsilon^{-1} k\log k)$ quantum examples and $O(2^k \log(1/\epsilon))$ random examples. We show that this learning algorithms is close to optimal by giving a related lower bound.

Abstract:
We prove two main results on how arbitrary linear threshold functions $f(x) = \sign(w\cdot x - \theta)$ over the $n$-dimensional Boolean hypercube can be approximated by simple threshold functions. Our first result shows that every $n$-variable threshold function $f$ is $\eps$-close to a threshold function depending only on $\Inf(f)^2 \cdot \poly(1/\eps)$ many variables, where $\Inf(f)$ denotes the total influence or average sensitivity of $f.$ This is an exponential sharpening of Friedgut's well-known theorem \cite{Friedgut:98}, which states that every Boolean function $f$ is $\eps$-close to a function depending only on $2^{O(\Inf(f)/\eps)}$ many variables, for the case of threshold functions. We complement this upper bound by showing that $\Omega(\Inf(f)^2 + 1/\epsilon^2)$ many variables are required for $\epsilon$-approximating threshold functions. Our second result is a proof that every $n$-variable threshold function is $\eps$-close to a threshold function with integer weights at most $\poly(n) \cdot 2^{\tilde{O}(1/\eps^{2/3})}.$ This is a significant improvement, in the dependence on the error parameter $\eps$, on an earlier result of \cite{Servedio:07cc} which gave a $\poly(n) \cdot 2^{\tilde{O}(1/\eps^{2})}$ bound. Our improvement is obtained via a new proof technique that uses strong anti-concentration bounds from probability theory. The new technique also gives a simple and modular proof of the original \cite{Servedio:07cc} result, and extends to give low-weight approximators for threshold functions under a range of probability distributions beyond just the uniform distribution.

Abstract:
In this article we give several new results on the complexity of algorithms that learn Boolean functions from quantum queries and quantum examples. Hunziker et al. conjectured that for any class C of Boolean functions, the number of quantum black-box queries which are required to exactly identify an unknown function from C is $O(\frac{\log |C|}{\sqrt{{\hat{\gamma}}^{C}}})$, where $\hat{\gamma}^{C}$ is a combinatorial parameter of the class C. We essentially resolve this conjecture in the affirmative by giving a quantum algorithm that, for any class C, identifies any unknown function from C using $O(\frac{\log |C| \log \log |C|}{\sqrt{{\hat{\gamma}}^{C}}})$ quantum black-box queries. We consider a range of natural problems intermediate between the exact learning problem (in which the learner must obtain all bits of information about the black-box function) and the usual problem of computing a predicate (in which the learner must obtain only one bit of information about the black-box function). We give positive and negative results on when the quantum and classical query complexities of these intermediate problems are polynomially related to each other. Finally, we improve the known lower bounds on the number of quantum examples (as opposed to quantum black-box queries) required for $(\epsilon,\delta)$-PAC learning any concept class of Vapnik-Chervonenkis dimension d over the domain $\{0,1\}^n$ from $\Omega(\frac{d}{n})$ to $\Omega(\frac{1}{\epsilon}\log \frac{1}{\delta}+d+\frac{\sqrt{d}}{\epsilon})$. This new lower bound comes closer to matching known upper bounds for classical PAC learning.

Abstract:
We consider quantum versions of two well-studied classical learning models: Angluin's model of exact learning from membership queries and Valiant's Probably Approximately Correct (PAC) model of learning from random examples. We give positive and negative results for quantum versus classical learnability. For each of the two learning models described above, we show that any concept class is information-theoretically learnable from polynomially many quantum examples if and only if it is information-theoretically learnable from polynomially many classical examples. In contrast to this information-theoretic equivalence betwen quantum and classical learnability, though, we observe that a separation does exist between efficient quantum and classical learnability. For both the model of exact learning from membership queries and the PAC model, we show that under a widely held computational hardness assumption for classical computation (the intractability of factoring), there is a concept class which is polynomial-time learnable in the quantum version but not in the classical version of the model.

Abstract:
We make progress on two important problems regarding attribute efficient learnability. First, we give an algorithm for learning decision lists of length $k$ over $n$ variables using $2^{\tilde{O}(k^{1/3})} \log n$ examples and time $n^{\tilde{O}(k^{1/3})}$. This is the first algorithm for learning decision lists that has both subexponential sample complexity and subexponential running time in the relevant parameters. Our approach establishes a relationship between attribute efficient learning and polynomial threshold functions and is based on a new construction of low degree, low weight polynomial threshold functions for decision lists. For a wide range of parameters our construction matches a 1994 lower bound due to Beigel for the ODDMAXBIT predicate and gives an essentially optimal tradeoff between polynomial threshold function degree and weight. Second, we give an algorithm for learning an unknown parity function on $k$ out of $n$ variables using $O(n^{1-1/k})$ examples in time polynomial in $n$. For $k=o(\log n)$ this yields a polynomial time algorithm with sample complexity $o(n)$. This is the first polynomial time algorithm for learning parity on a superconstant number of variables with sublinear sample complexity.

Abstract:
A $k$-modal probability distribution over the discrete domain $\{1,...,n\}$ is one whose histogram has at most $k$ "peaks" and "valleys." Such distributions are natural generalizations of monotone ($k=0$) and unimodal ($k=1$) probability distributions, which have been intensively studied in probability theory and statistics. In this paper we consider the problem of \emph{learning} (i.e., performing density estimation of) an unknown $k$-modal distribution with respect to the $L_1$ distance. The learning algorithm is given access to independent samples drawn from an unknown $k$-modal distribution $p$, and it must output a hypothesis distribution $\widehat{p}$ such that with high probability the total variation distance between $p$ and $\widehat{p}$ is at most $\epsilon.$ Our main goal is to obtain \emph{computationally efficient} algorithms for this problem that use (close to) an information-theoretically optimal number of samples. We give an efficient algorithm for this problem that runs in time $\mathrm{poly}(k,\log(n),1/\epsilon)$. For $k \leq \tilde{O}(\log n)$, the number of samples used by our algorithm is very close (within an $\tilde{O}(\log(1/\epsilon))$ factor) to being information-theoretically optimal. Prior to this work computationally efficient algorithms were known only for the cases $k=0,1$ \cite{Birge:87b,Birge:97}. A novel feature of our approach is that our learning algorithm crucially uses a new algorithm for \emph{property testing of probability distributions} as a key subroutine. The learning algorithm uses the property tester to efficiently decompose the $k$-modal distribution into $k$ (near-)monotone distributions, which are easier to learn.

Abstract:
We consider a basic problem in unsupervised learning: learning an unknown \emph{Poisson Binomial Distribution}. A Poisson Binomial Distribution (PBD) over $\{0,1,\dots,n\}$ is the distribution of a sum of $n$ independent Bernoulli random variables which may have arbitrary, potentially non-equal, expectations. These distributions were first studied by S. Poisson in 1837 \cite{Poisson:37} and are a natural $n$-parameter generalization of the familiar Binomial Distribution. Surprisingly, prior to our work this basic learning problem was poorly understood, and known results for it were far from optimal. We essentially settle the complexity of the learning problem for this basic class of distributions. As our first main result we give a highly efficient algorithm which learns to $\eps$-accuracy (with respect to the total variation distance) using $\tilde{O}(1/\eps^3)$ samples \emph{independent of $n$}. The running time of the algorithm is \emph{quasilinear} in the size of its input data, i.e., $\tilde{O}(\log(n)/\eps^3)$ bit-operations. (Observe that each draw from the distribution is a $\log(n)$-bit string.) Our second main result is a {\em proper} learning algorithm that learns to $\eps$-accuracy using $\tilde{O}(1/\eps^2)$ samples, and runs in time $(1/\eps)^{\poly (\log (1/\eps))} \cdot \log n$. This is nearly optimal, since any algorithm {for this problem} must use $\Omega(1/\eps^2)$ samples. We also give positive and negative results for some extensions of this learning problem to weighted sums of independent Bernoulli random variables.

Abstract:
We consider the problem of learning an unknown product distribution $X$ over $\{0,1\}^n$ using samples $f(X)$ where $f$ is a \emph{known} transformation function. Each choice of a transformation function $f$ specifies a learning problem in this framework. Information-theoretic arguments show that for every transformation function $f$ the corresponding learning problem can be solved to accuracy $\eps$, using $\tilde{O}(n/\eps^2)$ examples, by a generic algorithm whose running time may be exponential in $n.$ We show that this learning problem can be computationally intractable even for constant $\eps$ and rather simple transformation functions. Moreover, the above sample complexity bound is nearly optimal for the general problem, as we give a simple explicit linear transformation function $f(x)=w \cdot x$ with integer weights $w_i \leq n$ and prove that the corresponding learning problem requires $\Omega(n)$ samples. As our main positive result we give a highly efficient algorithm for learning a sum of independent unknown Bernoulli random variables, corresponding to the transformation function $f(x)= \sum_{i=1}^n x_i$. Our algorithm learns to $\eps$-accuracy in poly$(n)$ time, using a surprising poly$(1/\eps)$ number of samples that is independent of $n.$ We also give an efficient algorithm that uses $\log n \cdot \poly(1/\eps)$ samples but has running time that is only $\poly(\log n, 1/\eps).$

Abstract:
For $f$ a weighted voting scheme used by $n$ voters to choose between two candidates, the $n$ \emph{Shapley-Shubik Indices} (or {\em Shapley values}) of $f$ provide a measure of how much control each voter can exert over the overall outcome of the vote. Shapley-Shubik indices were introduced by Lloyd Shapley and Martin Shubik in 1954 \cite{SS54} and are widely studied in social choice theory as a measure of the "influence" of voters. The \emph{Inverse Shapley Value Problem} is the problem of designing a weighted voting scheme which (approximately) achieves a desired input vector of values for the Shapley-Shubik indices. Despite much interest in this problem no provably correct and efficient algorithm was known prior to our work. We give the first efficient algorithm with provable performance guarantees for the Inverse Shapley Value Problem. For any constant $\eps > 0$ our algorithm runs in fixed poly$(n)$ time (the degree of the polynomial is independent of $\eps$) and has the following performance guarantee: given as input a vector of desired Shapley values, if any "reasonable" weighted voting scheme (roughly, one in which the threshold is not too skewed) approximately matches the desired vector of values to within some small error, then our algorithm explicitly outputs a weighted voting scheme that achieves this vector of Shapley values to within error $\eps.$ If there is a "reasonable" voting scheme in which all voting weights are integers at most $\poly(n)$ that approximately achieves the desired Shapley values, then our algorithm runs in time $\poly(n)$ and outputs a weighted voting scheme that achieves the target vector of Shapley values to within error $\eps=n^{-1/8}.$