-
$\texttt{metabench}$ -- A Sparse Benchmark to Measure General Ability in Large Language Models
Authors:
Alex Kipnis,
Konstantinos Voudouris,
Luca M. Schulze Buschoff,
Eric Schulz
Abstract:
Large Language Models (LLMs) vary in their abilities on a range of tasks. Initiatives such as the $\texttt{Open LLM Leaderboard}$ aim to quantify these differences with several large benchmarks (sets of test items to which an LLM can respond either correctly or incorrectly). However, high correlations within and between benchmark scores suggest that (1) there exists a small set of common underlying abilities that these benchmarks measure, and (2) items tap into redundant information and the benchmarks may thus be considerably compressed. We use data from $n > 5000$ LLMs to identify the most informative items of six benchmarks, ARC, GSM8K, HellaSwag, MMLU, TruthfulQA and WinoGrande (with $d=28,632$ items in total). From them we distill a sparse benchmark, $\texttt{metabench}$, that has less than $3\%$ of the original size of all six benchmarks combined. This new sparse benchmark goes beyond point scores by yielding estimators of the underlying benchmark-specific abilities. We show that these estimators (1) can be used to reconstruct each original $\textit{individual}$ benchmark score with, on average, $1.5\%$ root mean square error (RMSE), (2) reconstruct the original $\textit{total}$ score with $0.8\%$ RMSE, and (3) have a single underlying common factor whose Spearman correlation with the total score is $r = 0.93$.
Submitted 4 July, 2024;
originally announced July 2024.
-
EntropyRank: Unsupervised Keyphrase Extraction via Side-Information Optimization for Language Model-based Text Compression
Authors:
Alexander Tsvetkov,
Alon Kipnis
Abstract:
We propose an unsupervised method to extract keywords and keyphrases from texts based on a pre-trained language model (LM) and Shannon's information maximization. Specifically, our method extracts phrases having the highest conditional entropy under the LM. The resulting set of keyphrases turns out to solve a relevant information-theoretic problem: if provided as side information, it leads to the expected minimal binary code length in compressing the text using the LM and an entropy encoder. Alternately, the resulting set is an approximation via a causal LM to the set of phrases that minimize the entropy of the text when conditioned upon it. Empirically, the method provides results comparable to the most commonly used methods in various keyphrase extraction benchmark challenges.
Submitted 29 August, 2023; v1 submitted 25 August, 2023;
originally announced August 2023.
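The selection rule in the abstract above (keep the phrases with the highest conditional entropy under the LM) can be sketched on toy data. This is only an illustration under stated assumptions: the next-token distributions below are made up rather than produced by a real language model, the function names are hypothetical, and single positions are scored instead of multi-token phrases.

```python
import math

def entropy_bits(dist):
    """Shannon entropy (in bits) of a probability distribution given as a list."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def top_entropy_positions(next_token_dists, k):
    """Return the indices (in text order) of the k highest-entropy positions."""
    scores = [entropy_bits(d) for d in next_token_dists]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Toy predictive distributions for four positions (assumed, not from a real LM).
toy_dists = [
    [0.97, 0.01, 0.01, 0.01],  # highly predictable position
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain position
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
]
selected = top_entropy_positions(toy_dists, 2)
```

Positions that the model finds hard to predict are the ones whose disclosure as side information would save the most bits, which is the intuition behind the compression view in the abstract.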
-
Separating the Human Touch from AI-Generated Text using Higher Criticism: An Information-Theoretic Approach
Authors:
Alon Kipnis
Abstract:
We propose a method to determine whether a given article was entirely written by a generative language model versus an alternative situation in which the article includes some significant edits by a different author, possibly a human. Our process involves many perplexity tests for the origin of individual sentences or other text atoms, combining these multiple tests using Higher Criticism (HC). As a by-product, the method identifies parts suspected to be edited. The method is motivated by the convergence of the log-perplexity to the cross-entropy rate and by a statistical model for edited text saying that sentences are mostly generated by the language model, except perhaps for a few sentences that might have originated via a different mechanism. We demonstrate the effectiveness of our method using real data and analyze the factors affecting its success. This analysis raises several interesting open challenges whose resolution may improve the method's effectiveness.
Submitted 24 August, 2023;
originally announced August 2023.
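The step of combining many per-sentence tests with Higher Criticism can be sketched as follows. This uses one common form of the HC statistic (the maximum standardized gap between sorted p-values and their uniform expectations); the exact variant used in the paper, and how its per-sentence perplexity p-values are computed, are not stated in the abstract, so the function name, the `gamma` cutoff, and the toy p-values are all assumptions.

```python
import math

def higher_criticism(pvalues, gamma=0.5):
    """HC statistic over the smallest gamma-fraction of the sorted p-values."""
    n = len(pvalues)
    sorted_p = sorted(pvalues)
    limit = max(1, int(gamma * n))
    best = -math.inf
    for i in range(1, limit + 1):
        p = sorted_p[i - 1]
        if p <= 0.0 or p >= 1.0:
            continue  # skip degenerate p-values to avoid division by zero
        z = math.sqrt(n) * (i / n - p) / math.sqrt(p * (1.0 - p))
        best = max(best, z)
    return best

n = 100
# Evenly spread p-values, as expected when every sentence fits the model.
null_pvalues = [(i + 0.5) / n for i in range(n)]
# Same, but five "sentences" receive very small p-values (hypothetical edits).
edited_pvalues = [1e-4] * 5 + null_pvalues[5:]

hc_null = higher_criticism(null_pvalues)
hc_edited = higher_criticism(edited_pvalues)
```

A few very small p-values among many unremarkable ones push the statistic up sharply, which matches the sparse-edits model described in the abstract.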
-
The Minimax Risk in Testing Uniformity of Poisson Data under Missing Ball Alternatives within a Hypercube
Authors:
Alon Kipnis
Abstract:
We study the problem of testing the goodness of fit of occurrences of items from many categories to an identical Poisson distribution over the categories. As a class of alternative hypotheses, we consider the removal of an $\ell_p$ ball, $p \leq 2$, of radius $ε$ from a hypercube around the sequence of uniform Poisson rates. When the expected number of samples $n$ and number of categories $N$ go to infinity while $ε$ is small, the minimax risk asymptotes to $2Φ(-n N^{2-2/p} ε^2/\sqrt{8N})$; $Φ(x)$ is the normal CDF. This result allows the comparison of the many estimators previously proposed for this problem at the constant level, rather than at the rate of convergence of the risk or the scaling order of the sample complexity. The minimax test relies exclusively on collisions in the small sample limit but behaves like the chi-squared test otherwise. Empirical studies over a range of problem parameters show that the asymptotic risk estimate is accurate in finite samples and that the minimax test is significantly better than the chi-squared test or a test that only uses collisions. Our analysis combines standard ideas from non-parametric hypothesis testing with new results in the low count limit of multiple Poisson distributions, including the convexity of certain kernels and a central limit theorem of linear test statistics.
Submitted 17 July, 2024; v1 submitted 29 May, 2023;
originally announced May 2023.
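The asymptotic risk expression quoted in the abstract above, $2Φ(-n N^{2-2/p} ε^2/\sqrt{8N})$, is easy to evaluate numerically. The sketch below simply plugs numbers into that formula; the function names are hypothetical, the guard on `p` is an assumption added to avoid division by zero, and the formula is only claimed to hold in the asymptotic regime the abstract describes.

```python
import math

def normal_cdf(x):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def asymptotic_minimax_risk(n, N, eps, p):
    """Evaluate 2 * Phi(-n * N^(2 - 2/p) * eps^2 / sqrt(8 N)) for 0 < p <= 2."""
    if not (0.0 < p <= 2.0):
        raise ValueError("this sketch assumes 0 < p <= 2")
    arg = n * N ** (2.0 - 2.0 / p) * eps ** 2 / math.sqrt(8.0 * N)
    return 2.0 * normal_cdf(-arg)

# Illustrative parameter values (not taken from the paper).
risk_small_n = asymptotic_minimax_risk(n=1000, N=100, eps=0.001, p=2)
risk_large_n = asymptotic_minimax_risk(n=2000, N=100, eps=0.001, p=2)
```

With `eps = 0` the alternative is indistinguishable from the null and the expression returns 1; increasing `n` with everything else fixed lowers the risk.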
-
Gaussian Approximation of Quantization Error for Estimation from Compressed Data
Authors:
Alon Kipnis,
Galen Reeves
Abstract:
We consider the distributional connection between the lossy compressed representation of a high-dimensional signal $X$ using a random spherical code and the observation of $X$ under an additive white Gaussian noise (AWGN). We show that the Wasserstein distance between a bitrate-$R$ compressed version of $X$ and its observation under an AWGN-channel of signal-to-noise ratio $2^{2R}-1$ is sub-linear in the problem dimension. We utilize this fact to connect the risk of an estimator based on an AWGN-corrupted version of $X$ to the risk attained by the same estimator when fed with its bitrate-$R$ quantized version. We demonstrate the usefulness of this connection by deriving various novel results for inference problems under compression constraints, including minimax estimation, sparse regression, compressed sensing, and the universality of linear estimation in remote source coding.
Submitted 12 December, 2021; v1 submitted 9 January, 2020;
originally announced January 2020.
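The signal-to-noise ratio $2^{2R}-1$ in the abstract above can be sanity-checked in the simplest textbook case. For a scalar Gaussian source, the MMSE under AWGN at that SNR equals the Gaussian distortion-rate function $σ^2 2^{-2R}$. The sketch below checks only this identity; it is not the paper's Wasserstein-distance result, and the function names are hypothetical.

```python
def equivalent_snr(rate_bits):
    """SNR of the AWGN channel matched to a bitrate of rate_bits per symbol."""
    return 2.0 ** (2.0 * rate_bits) - 1.0

def awgn_mmse_gaussian(variance, snr):
    """MMSE when estimating a N(0, variance) scalar from an AWGN observation."""
    return variance / (1.0 + snr)

def gaussian_distortion_rate(variance, rate_bits):
    """Distortion-rate function of a Gaussian source under squared error."""
    return variance * 2.0 ** (-2.0 * rate_bits)

variance = 1.5
pairs = []
for rate in (0.5, 1.0, 2.0):
    mmse = awgn_mmse_gaussian(variance, equivalent_snr(rate))
    pairs.append((mmse, gaussian_distortion_rate(variance, rate)))
```

Each pair in `pairs` holds two numbers that agree up to floating-point error, which is the scalar Gaussian instance of the compression-to-noise correspondence.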
-
Higher Criticism for Discriminating Word-Frequency Tables and Testing Authorship
Authors:
Alon Kipnis
Abstract:
We adapt the Higher Criticism (HC) goodness-of-fit test to measure the closeness between word-frequency tables. We apply this measure to authorship attribution challenges, where the goal is to identify the author of a document using other documents whose authorship is known. The method is simple yet performs well without handcrafting and tuning, reporting state-of-the-art accuracy in various current challenges. As an inherent side effect, the HC calculation identifies a subset of discriminating words. In practice, the identified words have low variance across documents belonging to a corpus of homogeneous authorship. We conclude that in comparing the similarity of a new document and a corpus of a single author, HC is mostly affected by words characteristic of the author and is relatively unaffected by topic structure.
Submitted 21 June, 2022; v1 submitted 30 October, 2019;
originally announced November 2019.
-
Mean Estimation from One-Bit Measurements
Authors:
Alon Kipnis,
John C. Duchi
Abstract:
We consider the problem of estimating the mean of a symmetric log-concave distribution under the constraint that only a single bit per sample from this distribution is available to the estimator. We study the mean squared error as a function of the sample size (and hence the number of bits). We consider three settings: first, a centralized setting, where an encoder may release $n$ bits given a sample of size $n$, and for which there is no asymptotic penalty for quantization; second, an adaptive setting in which each bit is a function of the current observation and previously recorded bits, where we show that the optimal relative efficiency compared to the sample mean is precisely the efficiency of the median; lastly, we show that in a distributed setting where each bit is only a function of a local sample, no estimator can achieve optimal efficiency uniformly over the parameter space. We additionally complement our results in the adaptive setting by showing that one round of adaptivity is sufficient to achieve optimal mean-square error.
Submitted 9 May, 2022; v1 submitted 10 January, 2019;
originally announced January 2019.
-
Analog-to-Digital Compression: A New Paradigm for Converting Signals to Bits
Authors:
Alon Kipnis,
Yonina C. Eldar,
Andrea J. Goldsmith
Abstract:
Processing, storing and communicating information that originates as an analog signal involves conversion of this information to bits. This conversion can be described by the combined effect of sampling and quantization, as illustrated in Fig. 1. The digital representation is achieved by first sampling the analog signal so as to represent it by a set of discrete-time samples and then quantizing these samples to a finite number of bits. Traditionally, these two operations are considered separately. The sampler is designed to minimize information loss due to sampling based on characteristics of the continuous-time input. The quantizer is designed to represent the samples as accurately as possible, subject to a constraint on the number of bits that can be used in the representation. The goal of this article is to revisit this paradigm by illuminating the dependency between these two operations. In particular, we explore the requirements on the sampling system subject to constraints on the available number of bits for storing, communicating or processing the analog information.
Submitted 20 January, 2018;
originally announced January 2018.
-
Compress-and-Estimate Source Coding for a Vector Gaussian Source
Authors:
Ruiyang Song,
Stefano Rini,
Alon Kipnis,
Andrea Goldsmith
Abstract:
We consider the remote vector source coding problem in which a vector Gaussian source is to be estimated from noisy linear measurements. For this problem, we derive the performance of the compress-and-estimate (CE) coding scheme and compare it to the optimal performance. In the CE coding scheme, the remote encoder compresses the noisy source observations so as to minimize the local distortion measure, independent from the joint distribution between the source and the observations. In reconstruction, the decoder estimates the original source realization from the lossy-compressed noisy observations. For the CE coding in the Gaussian vector case, we show that, if the code rate is less than a threshold, then the CE coding scheme attains the same performance as the optimal coding scheme. We also introduce lower and upper bounds for the performance gap above this threshold. In addition, an example with two observations and two sources is studied to illustrate the behavior of the performance gap.
Submitted 3 July, 2017;
originally announced July 2017.
-
The Distortion-Rate Function of Sampled Wiener Processes
Authors:
Alon Kipnis,
Andrea J. Goldsmith,
Yonina C. Eldar
Abstract:
We consider the recovery of a continuous-time Wiener process from a quantized or lossy compressed version of its uniform samples under limited bitrate and sampling rate. We derive a closed-form expression for the optimal tradeoff among sampling rate, bitrate, and quadratic distortion in this setting. This expression is given in terms of a reverse waterfilling formula over the asymptotic spectral distribution of a sequence of finite-rank operators associated with the optimal estimator of the Wiener process from its samples. We show that the ratio between this expression and the standard distortion rate function of the Wiener process, describing the optimal tradeoff between bitrate and distortion without a sampling constraint, is only a function of the number of bits per sample. For example, using one bit per sample on average, the expected distortion is approximately 1.2 times the standard distortion rate function, indicating a performance loss of about 20% due to sampling. We next consider the distortion when the continuous-time process is estimated from the output of an encoder that is optimal with respect to the discrete-time samples. We show that while the latter is strictly greater than the distortion under optimal encoding, the ratio between the two does not exceed 1.027. We therefore conclude that nearly optimal performance is attained even if the encoder is unaware of the sampling rate and encodes the samples without taking into account the continuous-time underlying process.
Submitted 26 July, 2018; v1 submitted 16 August, 2016;
originally announced August 2016.
-
Optimal Rate Allocation in Mismatched Multiterminal Source Coding
Authors:
Ruiyang Song,
Stefano Rini,
Alon Kipnis,
Andrea J. Goldsmith
Abstract:
We consider a multiterminal source coding problem in which a source is estimated at a central processing unit from lossy-compressed remote observations. Each lossy-encoded observation is produced by a remote sensor which obtains a noisy version of the source and compresses this observation minimizing a local distortion measure which depends only on the marginal distribution of its observation. The central node, on the other hand, has knowledge of the joint distribution of the source and all the observations and produces the source estimate which minimizes a different distortion measure between the source and its reconstruction. In this correspondence, we investigate the problem of optimally choosing the rate of each lossy-compressed remote estimate so as to minimize the distortion at the central processing unit subject to a bound on the overall communication rate between the remote sensors and the central unit. We focus, in particular, on two models of practical relevance: the case of a Gaussian source observed in additive Gaussian noise and reconstructed under quadratic distortion, and the case of a binary source observed in bit-flipping noise and reconstructed under Hamming distortion. In both scenarios we show that there exist regimes under which having more remote encoders does not reduce the source distortion: in other words, having fewer, high-quality remote estimates provides a smaller distortion than having more, lower-quality estimates.
Submitted 12 May, 2016;
originally announced May 2016.
-
The Rate-Distortion Risk in Estimation from Compressed Data
Authors:
Alon Kipnis,
Stefano Rini,
Andrea J. Goldsmith
Abstract:
Consider the problem of estimating a latent signal from a lossy compressed version of the data when the compressor is agnostic to the relation between the signal and the data. This situation arises in a host of modern applications when data is transmitted or stored prior to determining the downstream inference task. Given a bitrate constraint and a distortion measure between the data and its compressed version, let us consider the joint distribution achieving Shannon's rate-distortion (RD) function. Given an estimator and a loss function associated with the downstream inference task, define the rate-distortion risk as the expected loss under the RD-achieving distribution. We provide general conditions under which the operational risk in estimating from the compressed data is asymptotically equivalent to the RD risk. The main theoretical tools to prove this equivalence are transportation-cost inequalities in conjunction with properties of compression codes achieving Shannon's RD function. Whenever such equivalence holds, a recipe for designing estimators from datasets undergoing lossy compression without specifying the actual compression technique emerges: design the estimator to minimize the RD risk. Our conditions simplify in the special cases of discrete memoryless or multivariate normal data. For these scenarios, we derive explicit expressions for the RD risk of several estimators and compare them to the optimal source coding performance associated with full knowledge of the relation between the latent signal and the data.
Submitted 10 January, 2021; v1 submitted 5 February, 2016;
originally announced February 2016.
-
Fundamental Distortion Limits of Analog-to-Digital Compression
Authors:
Alon Kipnis,
Yonina C. Eldar,
Andrea J. Goldsmith
Abstract:
Representing a continuous-time signal by a set of samples is a classical problem in signal processing. We study this problem under the additional constraint that the samples are quantized or compressed in a lossy manner under a limited bitrate budget. To this end, we consider a combined sampling and source coding problem in which an analog stationary Gaussian signal is reconstructed from its encoded samples. These samples are obtained by a set of bounded linear functionals of the continuous-time path, with a limitation on the average number of samples obtained per unit time available in this setting. We provide a full characterization of the minimal distortion in terms of the sampling frequency, the bitrate, and the signal's spectrum. Assuming that the signal's energy is not uniformly distributed over its spectral support, we show that for each compression bitrate there exists a critical sampling frequency smaller than the Nyquist rate, such that the distortion in signal reconstruction when sampling at this frequency is minimal. Our results can be seen as an extension of the classical sampling theorem for bandlimited random processes in the sense that they describe the minimal amount of excess distortion in the reconstruction due to lossy compression of the samples, and provide the minimal sampling frequency required in order to achieve this distortion. Finally, we compare the fundamental limits in the combined source coding and sampling problem to the performance of pulse code modulation (PCM), where each sample is quantized by a scalar quantizer using a fixed number of bits.
Submitted 10 April, 2018; v1 submitted 24 January, 2016;
originally announced January 2016.
-
The Distortion Rate Function of Cyclostationary Gaussian Processes
Authors:
Alon Kipnis,
Andrea J. Goldsmith,
Yonina C. Eldar
Abstract:
A general expression for the distortion rate function (DRF) of cyclostationary Gaussian processes in terms of their spectral properties is derived. This expression can be seen as the result of orthogonalization over the different components in the polyphase decomposition of the process. We use this expression to derive, in a closed form, the DRF of several cyclostationary processes arising in practice. We first consider the DRF of a combined sampling and source coding problem. It is known that the optimal coding strategy for this problem involves source coding applied to a signal with the same structure as one resulting from pulse amplitude modulation (PAM). Since a PAM-modulated signal is cyclostationary, our DRF expression can be used to solve for the minimal distortion in the combined sampling and source coding problem. We also analyze in more detail the DRF of a source with the same structure as a PAM-modulated signal, and show that it is obtained by reverse waterfilling over an expression that depends on the energy of the pulse and the baseband process modulated to obtain the PAM signal. This result is then used to study the information content of a PAM-modulated signal as a function of its symbol time relative to the bandwidth of the underlying baseband process. In addition, we also study the DRF of sources with an amplitude-modulation structure, and show that the DRF of a narrow-band Gaussian stationary process modulated by either a deterministic or a random phase sine-wave equals the DRF of the baseband process.
Submitted 10 August, 2016; v1 submitted 20 May, 2015;
originally announced May 2015.
-
Indirect Rate-Distortion Function of a Binary i.i.d Source
Authors:
Alon Kipnis,
Stefano Rini,
Andrea J. Goldsmith
Abstract:
The indirect source-coding problem in which a Bernoulli process is compressed in a lossy manner from its noisy observations is considered. These noisy observations are obtained by passing the source sequence through a binary symmetric channel so that the channel crossover probability controls the amount of information available about the source realization at the encoder. We use classic results in rate-distortion theory to compute an expression of the rate-distortion function for this model, where the Bernoulli source is not necessarily symmetric. The indirect rate-distortion function is given in terms of a solution to a simple equation. In addition, we derive an upper bound on the indirect rate-distortion function which is given in closed form. These expressions capture precisely the expected behavior that the noisier the observations, the smaller the return from increasing bit-rate to reduce distortion.
Submitted 3 June, 2015; v1 submitted 19 May, 2015;
originally announced May 2015.
-
Distortion-Rate Function of Sub-Nyquist Sampled Gaussian Sources
Authors:
Alon Kipnis,
Andrea J. Goldsmith,
Yonina C. Eldar,
Tsachy Weissman
Abstract:
The amount of information lost in sub-Nyquist sampling of a continuous-time Gaussian stationary process is quantified. We consider a combined source coding and sub-Nyquist reconstruction problem in which the input to the encoder is a noisy sub-Nyquist sampled version of the analog source. We first derive an expression for the mean squared error in the reconstruction of the process from a noisy and information rate-limited version of its samples. This expression is a function of the sampling frequency and the average number of bits describing each sample. It is given as the sum of two terms: the minimum mean square error in estimating the source from its noisy but otherwise fully observed sub-Nyquist samples, and a second term obtained by reverse waterfilling over an average of spectral densities associated with the polyphase components of the source. We extend this result to multi-branch uniform sampling, where the samples are available through a set of parallel channels with a uniform sampler and a pre-sampling filter in each branch. Further optimization to reduce distortion is then performed over the pre-sampling filters, and an optimal set of pre-sampling filters associated with the statistics of the input signal and the sampling frequency is found. This results in an expression for the minimal possible distortion achievable under any analog-to-digital conversion scheme involving uniform sampling and linear filtering. These results thus unify the Shannon-Whittaker-Kotelnikov sampling theorem and Shannon rate-distortion theory for Gaussian sources.
Submitted 6 November, 2015; v1 submitted 21 May, 2014;
originally announced May 2014.
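Several abstracts in this listing express distortion through reverse waterfilling. A minimal sketch of the classical discrete version, for independent Gaussian components with given variances and a total bit budget, is shown below. The papers apply the idea to spectral densities (integrals over frequency) rather than to a finite list, so this is a simplified stand-in; the function name and the bisection solver are assumptions.

```python
import math

def reverse_waterfilling(variances, total_rate_bits, iters=200):
    """Return (water_level, total_distortion, per_component_rates).

    Components with variance above the water level are coded down to it;
    components below it receive no bits and contribute their full variance.
    """
    def rate_at(theta):
        return sum(0.5 * math.log2(v / theta) for v in variances if v > theta)

    lo, hi = 0.0, max(variances)
    for _ in range(iters):  # bisection: the rate decreases as the level rises
        mid = 0.5 * (lo + hi)
        if rate_at(mid) > total_rate_bits:
            lo = mid
        else:
            hi = mid
    theta = hi
    distortion = sum(min(theta, v) for v in variances)
    rates = [max(0.0, 0.5 * math.log2(v / theta)) for v in variances]
    return theta, distortion, rates

# Two equal components sharing 2 bits: each is coded at 1 bit.
level_eq, dist_eq, rates_eq = reverse_waterfilling([1.0, 1.0], 2.0)
# One strong and one weak component with 1 bit: the weak one gets nothing.
level_uneq, dist_uneq, rates_uneq = reverse_waterfilling([4.0, 0.01], 1.0)
```

In the second example the whole budget goes to the strong component, which illustrates why uneven spectra can make some components not worth coding at a given bitrate.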