
Estimation, Optimization, and Parallelism when
Data is Sparse or Highly Varying
John C. Duchi Michael I. Jordan H. Brendan McMahan
November 10, 2013
Abstract
We study stochastic optimization problems when the data is sparse, which is in a sense
dual to the current understanding of high-dimensional statistical learning and optimization.
We highlight both the difficulties—in terms of increased sample complexity that sparse data
necessitates—and the potential benefits, in terms of allowing parallelism and asynchrony in the
design of algorithms. Concretely, we derive matching upper and lower bounds on the minimax
rate for optimization and learning with sparse data, and we exhibit algorithms achieving these
rates. We also show how leveraging sparsity leads to (still minimax optimal) parallel and
asynchronous algorithms, providing experimental evidence complementing our theoretical results
on several medium to large-scale learning tasks.
1 Introduction and problem setting
In this paper, we investigate stochastic optimization problems in which the data is sparse. Formally,
let {F(·; ξ), ξ ∈ Ξ} be a collection of real-valued convex functions, each of whose domains contains
the convex set X ⊂ R^d. For a probability distribution P on Ξ, we consider the following optimization
problem:

    minimize_{x ∈ X}   f(x) := E[F(x; ξ)] = ∫_Ξ F(x; ξ) dP(ξ).    (1)
By data sparsity, we mean that the sampled data ξ is sparse: samples ξ are assumed to lie in R^d,
and if we define the support supp(x) of a vector x as the set of indices of its non-zero components,
we assume that

    supp ∇F(x; ξ) ⊂ supp ξ.    (2)

The sparsity condition (2) means that F(x; ξ) does not “depend” on the values of x_j for indices j
such that ξ_j = 0.^1 This type of data sparsity is prevalent in statistical optimization problems and
machine learning applications, though in spite of its prevalence, study of such problems has been
somewhat limited.
As a motivating example, consider a text classification problem: data ξ ∈ R^d represents words
appearing in a document, and we wish to minimize a logistic loss F(x; ξ) = log(1 + exp(⟨ξ, x⟩)) on
the data (we encode the label implicitly with the sign of ξ). Such generalized linear models satisfy
the sparsity condition (2), and while instances are of very high dimension, in any given instance
very few entries of ξ are non-zero [8]. From a modelling perspective, it thus makes sense to allow
a dense predictor x: any non-zero entry of ξ is potentially relevant and important. In a sense, this
is dual to the standard approaches to high-dimensional problems; one usually assumes that the
data ξ may be dense, but there are only a few relevant features, and thus a parsimonious model
x is desirable [2]. So while such sparse data problems are prevalent—natural language processing,
information retrieval, and other large data settings all have significant data sparsity—they do not
appear to have attracted as much study as their high-dimensional “duals” of dense data and sparse
predictors.

^1 Formally, if we define π_ξ as the coordinate projection that zeros all indices j of its argument where ξ_j = 0, then
F(π_ξ(x); ξ) = F(x; ξ) for all x, ξ. This is implied by first-order conditions for convexity [6, Chapter VI.2].
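To make the sparsity condition (2) concrete for this example, here is a minimal sketch of the
logistic-loss subgradient for a sparse example ξ (Python with NumPy is used purely for illustration;
no code appears in the paper). Since the gradient is σ(⟨ξ, x⟩) ξ, its support is contained in the
support of ξ.

```python
import numpy as np

def logistic_loss_subgradient(x, xi):
    """Gradient of F(x; xi) = log(1 + exp(<xi, x>)) for a sparse example xi.

    The gradient equals sigma(<xi, x>) * xi, so supp(grad F) is contained in
    supp(xi), which is exactly the sparsity condition (2).
    """
    sigma = 1.0 / (1.0 + np.exp(-np.dot(xi, x)))
    return sigma * xi  # zero wherever xi is zero
```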
In this paper, we investigate algorithms and their inherent limitations for solving problem (1)
under natural conditions on the data generating distribution. Recent work in the optimization and
machine learning communities has shown that data sparsity can be leveraged to develop parallel
optimization algorithms [12, 13, 14], but the authors do not study the statistical effects of data
sparsity. In recent work, Duchi et al. [4] and McMahan and Streeter [9] develop “adaptive” stochastic
gradient algorithms designed to address problems in sparse data regimes (2). These algorithms
exhibit excellent practical performance and have theoretical guarantees on their convergence, but
it is not clear if they are optimal—in the sense that no algorithm can attain better statistical
performance—or whether they can leverage parallel computing as in the papers [12, 14].
In this paper, we take a two-pronged approach. First, we investigate the fundamental limits
of optimization and learning algorithms in sparse data regimes. In doing so, we derive lower
bounds on the optimization error of any algorithm for problems of the form (1) with sparsity
condition (2). These results have two main implications. They show that in some scenarios,
learning with sparse data is quite difficult, as essentially each coordinate j ∈ [d] can be relevant
and must be optimized for. In spite of this seemingly negative result, we are also able to show that
the AdaGrad algorithms of [4, 9] are optimal, and we show examples in which their dependence
on the dimension d can be made exponentially better than standard gradient methods.
As the second facet of our two-pronged approach, we study how sparsity may be leveraged
in parallel computing frameworks to give substantially faster algorithms that still achieve optimal
sample complexity in terms of the number of samples ξ used. We develop two new algorithms,
asynchronous dual averaging (AsyncDA) and asynchronous AdaGrad (AsyncAdaGrad), which
allow asynchronous parallel solution of the problem (1) for general convex f and X . Combining
insights of Niu et al.’s Hogwild! [12] with a new analysis, we prove our algorithms can achieve
linear speedup in the number of processors while maintaining optimal statistical guarantees. We
also give experiments on text-classification and web-advertising tasks to illustrate the benefits of
the new algorithms.
Notation.  For a convex function x ↦ f(x), we let ∂f(x) denote its subgradient set at x (if f has
two arguments, we say ∂_x f(x, y) is the subgradient w.r.t. x). For a positive semi-definite matrix A,
we let ‖·‖_A be the (semi)norm defined by ‖v‖²_A := ⟨v, Av⟩, where ⟨·, ·⟩ is the standard inner product.
We let 1{·} be the indicator function, which is 1 when its argument is true and 0 otherwise.
2 Minimax rates for sparse optimization
We begin our study of sparse optimization problems by establishing their fundamental statistical
and optimization-theoretic properties. To do this, we derive bounds on the minimax convergence
rate of any algorithm for such problems. Formally, let x̂ denote any estimator for a minimizer of the
objective (1). We define the optimality gap ε_N for the estimator x̂ based on N samples ξ^1, . . . , ξ^N
from the distribution P as

    ε_N(x̂, F, X, P) := f(x̂) − inf_{x∈X} f(x) = E_P[F(x̂; ξ)] − inf_{x∈X} E_P[F(x; ξ)].

This quantity is a random variable, since x̂ is a random variable (it is a function of ξ^1, . . . , ξ^N). To
define the minimax error, we thus take expectations of the quantity ε_N, though we require a bit
more than simply E[ε_N]. We let P denote a collection of probability distributions, and we consider
a collection of loss functions F specified by a collection F of convex losses F : X × Ξ → R. We can
then define the minimax error for the family of losses F and distributions P as

    ε*_N(X, P, F) := inf_{x̂} sup_{P∈P} sup_{F∈F} E_P[ε_N(x̂, F, X, P)],    (3)

where the infimum is taken over all possible estimators (optimization schemes) x̂.
2.1 Minimax lower bounds
Let us now give a more precise characterization of the (natural) set of sparse optimization problems
we consider to provide the lower bound. For the next proposition, we let P consist of distributions
supported on Ξ = {−1, 0, 1}^d, and we let p_j := P(ξ_j ≠ 0) be the marginal probability of appearance
of feature j (j ∈ {1, . . . , d}). For our class of functions, we set F to consist of functions F satisfying
the sparsity condition (2) and with the additional constraint that for g ∈ ∂_x F(x; ξ), we have that
the jth coordinate satisfies |g_j| ≤ M_j for a constant M_j < ∞. We obtain

Proposition 1. Let the conditions of the preceding paragraph hold. Let R be a constant such that
X ⊃ [−R, R]^d. Then

    ε*_N(X, P, F) ≥ (1/8) R Σ_{j=1}^d M_j min{ p_j ,  √p_j / √(N log 3) }.
We provide the proof of Proposition 1 in Appendix A.1, providing a few remarks here. We
begin by giving a corollary to Proposition 1 that follows when the data ξ obeys a type of power
law: let p_0 ∈ [0, 1], and assume that P(ξ_j ≠ 0) = p_0 j^{−α}. We have

Corollary 2. Let α ≥ 0. Let the conditions of Proposition 1 hold with M_j ≡ M for all j, and
assume the power law condition P(ξ_j ≠ 0) = p_0 j^{−α} on coordinate appearance probabilities. Then

(1) If d > (p_0 N)^{1/α},

    ε*_N(X, P, F) ≥ (MR/8) [ (2/(2 − α)) √(p_0/N) ( (p_0 N)^{(2−α)/(2α)} − 1 ) + (p_0/(1 − α)) ( d^{1−α} − (p_0 N)^{(1−α)/α} ) ].

(2) If d ≤ (p_0 N)^{1/α},

    ε*_N(X, P, F) ≥ (MR/8) √(p_0/N) ( (1/(1 − α/2)) d^{1−α/2} − 1/(1 − α/2) ).
For simplicity, assume that the features are not too extraordinarily sparse, say, that α ∈ [0, 2],
and that the number of samples is large enough that d ≤ (p_0 N)^{1/α}. Then we find ourselves in regime (2)
of Corollary 2, so that the lower bound on optimization error is of order

    MR √(p_0/N) d^{1−α/2}  when α < 2,    MR √(p_0/N) log d  when α → 2,    and    MR √(p_0/N)  when α > 2.    (4)

These results beg the question of tightness: are they improvable? As we see presently, they are
not.
2.2 Algorithms for attaining the minimax rate
The lower bounds specified by Proposition 1 and the subsequent specializations are sharp, meaning
that they are unimprovable by more than constant factors. To show this, we review a few stochastic
gradient algorithms. We first recall stochastic gradient descent, after which we review the dual
averaging methods and an extension of both.
We begin with stochastic gradient descent (SGD): for this algorithm, we repeatedly sample
ξ ∼ P, compute g ∈ ∂_x F(x; ξ), then perform the update x ← Π_X(x − ηg), where η is a stepsize
parameter and Π_X denotes Euclidean projection onto X. Then standard analyses of stochastic
gradient descent (e.g. [10]) show that after N samples ξ, in our setting the SGD estimator x̂(N)
satisfies

    E[f(x̂(N))] − inf_{x∈X} f(x) ≤ O(1) R_2 M √(Σ_{j=1}^d p_j) / √N,    (5)
where R_2 denotes the ℓ_2-radius of X. Dual averaging, due to Nesterov [11] and referred to as “follow
the regularized leader” in the machine learning literature (see, e.g., the survey article by Hazan
[5]), is somewhat more complex. In dual averaging, one again samples g ∈ ∂_x F(x; ξ), but instead of
updating the parameter vector x one updates a dual vector z by z ← z + g, then computes

    x ← argmin_{x∈X} { ⟨z, x⟩ + (1/η) ψ(x) },

where ψ(x) is a strongly convex function defined over X (often one takes ψ(x) = ½‖x‖²₂). The
dual averaging algorithm, as we shall see, is somewhat more natural in asynchronous and parallel
computing environments, and it enjoys the same type of convergence guarantees (5) as SGD.
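As a point of reference, the following sketch implements both updates for the Euclidean case
ψ(x) = ½‖x‖²₂. The sample, subgradient, and project callables are placeholders for the problem at
hand rather than anything defined in the paper.

```python
import numpy as np

def sgd(sample, subgradient, project, d, eta, num_steps, seed=0):
    """Projected stochastic gradient descent: x <- Pi_X(x - eta * g)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    for _ in range(num_steps):
        xi = sample(rng)            # draw xi ~ P
        g = subgradient(x, xi)      # some g in the subgradient set of F(.; xi) at x
        x = project(x - eta * g)    # Euclidean projection onto X
    return x

def dual_averaging(sample, subgradient, project, d, eta, num_steps, seed=0):
    """Dual averaging with psi(x) = 0.5 * ||x||_2^2.

    With this psi, argmin_{x in X} <z, x> + (1/eta) * psi(x) is the Euclidean
    projection of -eta * z onto X.
    """
    rng = np.random.default_rng(seed)
    z = np.zeros(d)                 # running sum of subgradients
    x = project(np.zeros(d))
    for _ in range(num_steps):
        xi = sample(rng)
        g = subgradient(x, xi)
        z += g
        x = project(-eta * z)
    return x
```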
The AdaGrad algorithm [4, 9] is a slightly more complicated extension of the preceding stochastic
gradient methods. It maintains a diagonal matrix S, where upon receiving a new sample ξ,
AdaGrad performs the following: it computes g ∈ ∂_x F(x; ξ), then updates

    S_j ← S_j + g_j²   for j ∈ [d].

Depending on whether the dual averaging or stochastic gradient descent (SGD) variant is being
used, AdaGrad performs one of two updates. In the dual averaging case, it maintains the dual
vector z, which is updated by z ← z + g; in the SGD case, the parameter x is maintained. The
updates for the two cases are then

    x ← argmin_{x′∈X} { ⟨g, x′⟩ + (1/(2η)) ⟨x′ − x, S^{1/2}(x′ − x)⟩ }

for stochastic gradient descent and

    x ← argmin_{x′∈X} { ⟨z, x′⟩ + (1/(2η)) ⟨x′, S^{1/2} x′⟩ }
for dual averaging, where η is a stepsize. Then appropriate choice of η shows that after N samples
ξ, the averaged parameter x̂(N) AdaGrad returns satisfies

    E[f(x̂(N))] − inf_{x∈X} f(x) ≤ O(1) (R_∞ M / √N) Σ_{j=1}^d √p_j,    (6)

where R_∞ denotes the ℓ_∞-radius of X (e.g. [4, Section 1.3 and Theorem 5], where one takes
η ≈ R_∞). By inspection, the AdaGrad rate (6) matches the lower bound in Proposition 1 and
is thus optimal. It is interesting to note, though, that in the power law setting of Corollary 2
(recall the error order (4)), a calculation shows that the multiplier for the SGD guarantee (5)
becomes R_∞ √d · max{d^{(1−α)/2}, 1}, while AdaGrad attains a rate multiplier of at worst
R_∞ max{d^{1−α/2}, log d} (by evaluation of Σ_j √p_j). Thus for α > 1, the AdaGrad rate is no worse,
and for α ≥ 2, is more than √d / log d better than SGD—an exponential improvement in the dimension.
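The following sketch shows the dual-averaging form of AdaGrad specialized to the box
X = [−R_∞, R_∞]^d, for which the argmin above has a per-coordinate closed form. The callables
and the simple iterate averaging are illustrative assumptions, not prescriptions from the paper.

```python
import numpy as np

def adagrad_dual_averaging(sample, subgradient, d, eta, r_inf, num_steps,
                           s_init=1e-8, seed=0):
    """AdaGrad (dual-averaging form) on the box X = [-r_inf, r_inf]^d.

    Coordinate-wise, argmin_{x in X} <z, x> + (1/(2*eta)) <x, S^{1/2} x> is
    x_j = clip(-eta * z_j / sqrt(S_j), -r_inf, r_inf).
    """
    rng = np.random.default_rng(seed)
    z = np.zeros(d)            # sum of subgradients
    s = np.full(d, s_init)     # diagonal of S: running sum of squared gradient entries
    x = np.zeros(d)
    x_avg = np.zeros(d)
    for t in range(1, num_steps + 1):
        xi = sample(rng)
        g = subgradient(x, xi)
        nz = g != 0            # with sparse data, only these coordinates move
        s[nz] += g[nz] ** 2
        z[nz] += g[nz]
        x = np.clip(-eta * z / np.sqrt(s), -r_inf, r_inf)
        x_avg += (x - x_avg) / t
    return x_avg               # averaged iterate, as in the rate (6)
```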
3 Parallel and asynchronous optimization with sparsity
As we note in the introduction, recent works [12, 14] have suggested that sparsity can yield benefits
in our ability to parallelize stochastic gradient-type algorithms. Given the optimality of AdaGrad-type
algorithms, it is natural to focus on their parallelization in the hope that we can leverage their
ability to “adapt” to sparsity in the data. To provide the setting for our further algorithms, we
first revisit Niu et al.’s Hogwild!.

The Hogwild! algorithm of Niu et al. [12] is an asynchronous (parallelized) stochastic gradient
algorithm that proceeds as follows. To apply Hogwild!, we must assume the domain X in problem
(1) is a product space, meaning that it decomposes as X = X_1 × · · · × X_d, where X_j ⊂ R. Fix
a stepsize η > 0. Then a pool of processors, each running independently, performs the following
updates asynchronously to a centralized vector x:

1. Sample ξ ∼ P
2. Read x and compute g ∈ ∂_x F(x; ξ)
3. For each j s.t. g_j ≠ 0, update x_j ← Π_{X_j}(x_j − η g_j)

Here Π_{X_j} denotes projection onto the jth coordinate of the domain X. The key of Hogwild! is
that in step 2, the parameter x at which g is calculated may be somewhat inconsistent—it may
have received partial gradient updates from many processors—though for appropriate problems,
this inconsistency is negligible. Indeed, Niu et al. [12] show a linear speedup in optimization time as
the number of independent processors grows; they show this empirically in many scenarios, providing
a proof under the somewhat restrictive assumption that there is at most one non-zero entry in any
gradient g.
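A toy illustration of the Hogwild! update pattern (lock-free, in-place coordinate writes to a shared
vector) is sketched below. The thread-based harness and the callables are assumptions made for
illustration; Python threads are used only to show the structure, not to claim any speedup.

```python
import threading
import numpy as np

def hogwild(sample, subgradient, project_coord, x, eta, steps_per_worker,
            num_workers=4):
    """Hogwild!-style asynchronous SGD on a shared parameter vector x.

    x is a shared numpy array modified in place. Workers write coordinates
    without locking; only coordinates with non-zero gradient entries are
    touched. Python threads serialize on the GIL, so this shows structure only.
    """
    def worker(seed):
        rng = np.random.default_rng(seed)
        for _ in range(steps_per_worker):
            xi = sample(rng)
            g = subgradient(x, xi)      # x may be mid-update; that is the point
            for j in np.flatnonzero(g):
                x[j] = project_coord(j, x[j] - eta * g[j])

    threads = [threading.Thread(target=worker, args=(s,)) for s in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x
```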
3.1 Asynchronous dual averaging

One of the weaknesses of Hogwild! is that, as written, it appears to be applicable only to problems
for which the domain X is a product space, and the known analysis assumes that ‖g‖_0 = 1
for all gradients g. In an effort to alleviate these difficulties, we now develop and present our asynchronous
dual averaging algorithm, AsyncDA. In AsyncDA, instead of asynchronously updating
a centralized parameter vector x, we maintain a centralized dual vector z. A pool of processors performs
asynchronous additive updates to z, where each processor repeatedly performs the following
updates:

1. Read z and compute x := argmin_{x∈X} { ⟨z, x⟩ + (1/η) ψ(x) }   // Implicitly increment “time” counter t and let x(t) = x
2. Sample ξ ∼ P and let g ∈ ∂_x F(x; ξ)   // Let g(t) = g
3. For j ∈ [d] such that g_j ≠ 0, update z_j ← z_j + g_j
Because the actual computation of the vector x in asynchronous dual averaging (AsyncDA)
is performed locally on each processor in step 1 of the algorithm, the algorithm can be executed
with any proximal function ψ and domain X . The only communication point between any of the
processors is the addition operation in step 3. As noted by Niu et al. [12], this operation can often
be performed atomically on modern processors.
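The sketch below mirrors the three AsyncDA steps for the special case ψ(x) = ½‖x‖²₂ and a box
domain, where step 1 has a closed form; any other ψ or X would only change the local argmin.
As before, the threading harness is purely illustrative.

```python
import threading
import numpy as np

def async_dual_averaging(sample, subgradient, d, eta, r_inf,
                         steps_per_worker, num_workers=4):
    """AsyncDA sketch: shared dual vector z, local computation of x.

    Uses psi(x) = 0.5 * ||x||_2^2 and the box X = [-r_inf, r_inf]^d, so step 1
    reduces to x = clip(-eta * z, -r_inf, r_inf).
    """
    z = np.zeros(d)                                    # shared dual vector

    def worker(seed):
        rng = np.random.default_rng(seed)
        for _ in range(steps_per_worker):
            x = np.clip(-eta * z, -r_inf, r_inf)       # step 1: local primal point
            xi = sample(rng)                           # step 2: sample and subgradient
            g = subgradient(x, xi)
            for j in np.flatnonzero(g):                # step 3: additive update to z
                z[j] += g[j]

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return np.clip(-eta * z, -r_inf, r_inf)
```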
In our analysis of AsyncDA, and in our subsequent analysis of the adaptive methods, we
require a measurement of time elapsed. With that in mind, we let t denote a time index that exists
(roughly) behind-the-scenes. We let x(t) denote the vector x ∈ X computed in the tth step 1 of the
AsyncDA algorithm, that is, whichever is the tth x actually computed by any of the processors.
We note that this quantity exists and is recoverable from the algorithm, and that it is possible to
track the running sum Σ_{τ=1}^t x(τ).
Additionally, we require two assumptions encapsulating the conditions underlying our analysis.
Assumption A. There is an upper bound m on the delay of any processor. In addition, for each
j ∈ [d] there is a constant p_j ∈ [0, 1] such that P(ξ_j ≠ 0) ≤ p_j.
We also require an assumption about the continuity (Lipschitzian) properties of the loss functions
being minimized; the assumption amounts to a second moment constraint on the sub-gradients of
the instantaneous F along with a rough measure of the sparsity of the gradients.
Assumption B. There exist constants M and (M_j)_{j=1}^d such that the following bounds hold for all
x ∈ X: E[‖∂_x F(x; ξ)‖²₂] ≤ M² and for each j ∈ [d] we have E[|∂_{x_j} F(x; ξ)|] ≤ p_j M_j.

With these definitions, we have the following theorem, which captures the convergence behavior
of AsyncDA under the assumption that X is a Cartesian product, meaning that X = X_1 × · · · × X_d,
where X_j ⊂ R, and that ψ(x) = ½‖x‖²₂. Note the algorithm itself can still be efficiently parallelized
for more general convex X, even if the theorem does not apply.
Theorem 3. Let Assumptions A and B and the conditions in the preceding paragraph hold. Then
    E[ Σ_{t=1}^T ( F(x(t); ξ^t) − F(x*; ξ^t) ) ] ≤ (1/(2η)) ‖x*‖²₂ + (η/2) T M² + η T m Σ_{j=1}^d p_j² M_j².
We provide the proof of Theorem 3 in Appendix B.

As stated, the theorem is somewhat unwieldy, so we provide a corollary and a few remarks
to explain and simplify the result. Under a more stringent condition that |∂_{x_j} F(x; ξ)| ≤ M_j,
Assumption A implies E[‖∂_x F(x; ξ)‖²₂] ≤ Σ_{j=1}^d p_j M_j². Thus, without loss of generality, for the
remainder of this section we take M² = Σ_{j=1}^d p_j M_j², which serves as an upper bound on the
Lipschitz continuity constant of the objective function f. We then obtain the following corollary.
Corollary 4. Define x̂(T) = (1/T) Σ_{t=1}^T x(t), and set η = ‖x*‖₂/(M√T). Then

    E[f(x̂(T)) − f(x*)] ≤ M ‖x*‖₂/√T + m (‖x*‖₂/(2M√T)) Σ_{j=1}^d p_j² M_j².

Corollary 4 is almost immediate. To see the result, note that since ξ^t is independent of x(t), we
have E[F(x(t); ξ^t) | x(t)] = f(x(t)); applying Jensen’s inequality to f(x̂(T)) and performing an
algebraic manipulation give the corollary.
If the data is suitably “sparse,” meaning that p_j ≤ 1/m (which may also occur if the data is of
relatively high variance in Assumption B), the bound in Corollary 4 simplifies to

    E[f(x̂(T)) − f(x*)] ≤ (3/2) M ‖x*‖₂ / √T = (3/2) √(Σ_{j=1}^d p_j M_j²) ‖x*‖₂ / √T,    (7)

which is the convergence rate of stochastic gradient descent (and dual averaging) even in non-asynchronous
situations (5). In non-sparse cases, setting η ∝ ‖x*‖₂ / √(m M² T) in Theorem 3 recovers
the bound

    E[f(x̂(T)) − f(x*)] ≤ O(1) √m · M ‖x*‖₂ / √T.

The convergence guarantee (7) shows that after T timesteps, we have error scaling 1/√T; however,
if we have k processors, then updates can occur roughly k times as quickly, as all updates are
asynchronous. Thus in time scaling as n/k, we can evaluate n gradient samples: a linear speedup.
3.2 Asynchronous AdaGrad
We now turn to extending AdaGrad to asynchronous settings, developing AsyncAdaGrad (asynchronous
AdaGrad). As in the AsyncDA algorithm, AsyncAdaGrad maintains a shared dual
vector z among the processors, which is the sum of gradients observed; AsyncAdaGrad also maintains
the matrix S, which is the diagonal sum of squares of gradient entries (recall Section 2.2). The
matrix S is initialized as diag(δ²), where δ_j ≥ 0 is an initial value. Each processor asynchronously
performs the following iterations:

1. Read S and z and set G = S^{1/2}. Compute x := argmin_{x∈X} { ⟨z, x⟩ + (1/(2η)) ⟨x, Gx⟩ }   // Implicitly increment “time” counter t and let x(t) = x, S(t) = S
2. Sample ξ ∼ P and let g ∈ ∂F(x; ξ)
3. For j ∈ [d] such that g_j ≠ 0, update S_j ← S_j + g_j² and z_j ← z_j + g_j
As in the description of AsyncDA, we note that x(t) is the vector x ∈ X computed in the tth
“step” of the algorithm (step 1), and we similarly associate ξ^t with x(t).
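Concretely, the per-processor loop of AsyncAdaGrad might be sketched as follows, again
specialized to a box domain so that step 1 has a per-coordinate closed form; the harness and
callables are illustrative assumptions.

```python
import threading
import numpy as np

def async_adagrad(sample, subgradient, d, eta, r_inf, delta,
                  steps_per_worker, num_workers=4):
    """AsyncAdaGrad sketch: shared dual vector z and shared diagonal S.

    Assumes the box X = [-r_inf, r_inf]^d, for which step 1 has the closed
    form x_j = clip(-eta * z_j / sqrt(S_j), -r_inf, r_inf). A positive delta
    keeps the division well defined.
    """
    z = np.zeros(d)
    s = np.full(d, float(delta) ** 2)                  # S initialized as diag(delta^2)

    def worker(seed):
        rng = np.random.default_rng(seed)
        for _ in range(steps_per_worker):
            x = np.clip(-eta * z / np.sqrt(s), -r_inf, r_inf)   # step 1
            xi = sample(rng)                                     # step 2
            g = subgradient(x, xi)
            for j in np.flatnonzero(g):                          # step 3
                s[j] += g[j] ** 2
                z[j] += g[j]

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return np.clip(-eta * z / np.sqrt(s), -r_inf, r_inf)
```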
To analyze AsyncAdaGrad, we make a somewhat stronger assumption on the sparsity properties
of the losses F than Assumption B.

Assumption C. There exist constants (M_j)_{j=1}^d such that for any x ∈ X and ξ ∈ Ξ, we have
E[(∂_{x_j} F(x; ξ))² | ξ_j ≠ 0] ≤ M_j².
Indeed, taking M² = Σ_j p_j M_j² shows that Assumption C implies Assumption B with specific constants.
We then have the following convergence result, whose proof we defer to Appendix C.
Theorem 5. In addition to the conditions of Theorem 3, let Assumption C hold. Assume that
δ² ≥ M_j² m for all j and that X ⊂ [−R_∞, R_∞]^d. Then

    Σ_{t=1}^T E[ F(x(t); ξ^t) − F(x*; ξ^t) ]
        ≤ Σ_{j=1}^d min{ (1/η) R_∞² E[ ( δ² + Σ_{t=1}^T g_j(t)² )^{1/2} ] + η E[ ( Σ_{t=1}^T g_j(t)² )^{1/2} ] (1 + p_j m),  M_j R_∞ p_j T }.
We can also relax the condition on the initial constant diagonal term δ slightly, which gives a
qualitatively similar result (see Appendix C.3).

Corollary 6. Under the conditions of Theorem 5, assume that δ² ≥ M_j² min{m, 6 max{log T, m p_j}}
for all j. Then

    Σ_{t=1}^T E[ F(x(t); ξ^t) − F(x*; ξ^t) ]
        ≤ Σ_{j=1}^d min{ (1/η) R_∞² E[ ( δ² + Σ_{t=1}^T g_j(t)² )^{1/2} ] + (3/2) η E[ ( Σ_{t=1}^T g_j(t)² )^{1/2} ] (1 + p_j m),  M_j R_∞ p_j T }.
It is natural to ask in which situations the bounds that Theorem 5 and Corollary 6 provide are optimal.
We note that, as in the case with Theorem 3, we may take an expectation with respect to ξ^t and
obtain a convergence rate for f(x̂(T)) − f(x*), where x̂(T) = (1/T) Σ_{t=1}^T x(t). By Jensen’s inequality,
we have for any δ that

    E[ ( δ² + Σ_{t=1}^T g_j(t)² )^{1/2} ] ≤ ( δ² + Σ_{t=1}^T E[g_j(t)²] )^{1/2} ≤ √(δ² + T p_j M_j²).
For interpretation, let us now make a few assumptions on the probabilities p_j. If we assume that
p_j ≤ c/m for a universal (numerical) constant c, then Theorem 5 guarantees that

    E[f(x̂(T)) − f(x*)] ≤ O(1) ( (1/η) R_∞² + η ) Σ_{j=1}^d M_j min{ √(log(T)/T + p_j) / √T ,  p_j },    (8)
Auditory Sparse Coding
Steven R. Ness
University of Victoria
Thomas Walters
Google Inc.
Richard F. Lyon
Google Inc.
CONTENTS
1.1 Summary
1.2 Introduction
    1.2.1 The stabilized auditory image
1.3 Algorithm
    1.3.1 Pole–Zero Filter Cascade
    1.3.2 Image Stabilization
    1.3.3 Box Cutting
    1.3.4 Vector Quantization
    1.3.5 Machine Learning
1.4 Experiments
    1.4.1 Sound Ranking
    1.4.2 MIREX 2010
1.5 Conclusions
1.1 Summary
The concept of sparsity has attracted considerable interest in the field of
machine learning in the past few years. Sparse feature vectors contain mostly
zero values, with only one or a few non-zero values. Although these feature vectors
can be classified by traditional machine learning algorithms, such as SVM,
there are various recently-developed algorithms that explicitly take advantage
of the sparse nature of the data, leading to massive speedups in time, as
well as improved performance. Some fields that have benefited from the use
of sparse algorithms are finance, bioinformatics, text mining [1], and image
classification [4]. Because of their speed, these algorithms perform well on
very large collections of data [2]; large collections are becoming increasingly
relevant given the huge amounts of data collected and warehoused by Internet
businesses.
In this chapter, we discuss the application of sparse feature vectors in the
field of audio analysis, and specifically their use in conjunction with preprocessing
systems that model the human auditory system. We present early results
that demonstrate the applicability of the combination of auditory-based
processing and sparse coding to content-based audio analysis tasks.
We present results from two different experiments: a search task in which
ranked lists of sound effects are retrieved from text queries, and a music information
retrieval (MIR) task dealing with the classification of music into
genres.
1.2 Introduction
Traditional approaches to audio analysis problems typically employ a short-window
fast Fourier transform (FFT) as the first stage of the processing
pipeline. In such systems a short, perhaps 25ms, segment of audio is taken
from the input signal and windowed in some way, then the FFT of that segment
is taken. The window is then shifted a little, by perhaps 10ms, and the
process is repeated. This technique yields a two-dimensional spectrogram of
the original audio, with the frequency axis of the FFT as one dimension, and
time (quantized by the step-size of the window) as the other dimension.
While the spectrogram is easy to compute, and a standard engineering tool,
it bears little resemblance to the early stages of the processing pipeline in the
human auditory system. The mammalian cochlea can be viewed as a bank of
tuned filters, the output of which is a set of band-pass filtered versions of the
input signal that are continuous in time. Because of this property, fine-timing
information is preserved in the output of the cochlea, whereas in the spectrogram
described above, there is no fine-timing information available below the 10ms
hop-size of the windowing function.
This fine-timing information from the cochlea can be made use of in later
stages of processing to yield a three-dimensional representation of audio, the
stabilized auditory image (SAI)[11], which is a movie-like representation of
sound which has a dimension of ‘time-interval’ in addition to the standard
dimensions of time and frequency in the spectrogram. The periodicity of the
waveform gives rise to a vertical banding structure in this time interval dimension,
which provides information about the sound which is complementary to
that available in the frequency dimension. A single example frame of a stabilized
auditory image is shown in Figure 1.1.
While we believe that such a representation should be useful for audio
analysis tasks, it does come at a cost. The data rate of the SAI is many times
that of the original input audio, and as such some form of dimensionality
reduction is required in order to create features at a suitable data rate for
use in a recognition system. One approach to this problem is to move from
the dense representation of the SAI to a sparse representation, in which the
overall dimensionality of the features is high, but only a limited number of the
dimensions are nonzero at any time.
In recent years, machine learning algorithms that utilize the properties
of sparsity have begun to attract more attention and have been shown to
outperform approaches that use dense feature vectors. One such algorithm is
the passive-aggressive model for image retrieval (PAMIR), a machine learning
algorithm that learns a ranking function from the input data, that is, it takes
an input set of documents and orders them based on their relevance to a
query. PAMIR was originally developed as a machine vision method and has
demonstrated excellent results in this field.
There is also growing evidence that in the human nervous system sensory
inputs are coded in a sparse manner; that is, only small numbers of neurons
are active at a given time [10]. Therefore, when modeling the human auditory
system, it may be advantageous to investigate this property of sparseness in
relation to the mappings that are being developed. The nervous systems of
animals have evolved over millions of years to be highly efficient in terms of
energy consumption and computation. Looking into the way sound signals are
handled by the auditory system could give us insights into how to make our
algorithms more efficient and better model the human auditory system.
One advantage of using sparse vectors is that such coding allows very
fast computation of similarity, with a trainable similarity measure [4]. The
efficiency results from storing, accessing, and doing arithmetic operations on
only the non-zero elements of the vectors. In one study that examined the
performance of sparse representations in the field of natural language processing,
a 20- to 80-fold speedup over LIBSVM was found [7]. The authors comment
that kernel-based methods, like SVM, scale quadratically with the number
of training examples and discuss how sparsity can allow algorithms to scale
linearly based on the number of training examples.
In this chapter, we use the stabilized auditory image (SAI) as the basis of
a sparse feature representation which is then tested in a sound ranking task
and a music information retrieval task. In the sound ranking task, we generate
a two-dimensional SAI for each time slice, and then sparse-code those images
as input to PAMIR. We use the ability of PAMIR to learn representations of
sparse data in order to learn a model which maps text terms to audio features.
This PAMIR model can then be used to rank a list of unlabeled sound effects
according to their relevance to some text query. We present results that show
that in certain tasks our methods can outperform highly tuned FFT-based
approaches. We also use similar sparse-coded SAI features as input to a music
genre classification system. This system uses an SVM classifier on the sparse
features, and learns text terms associated with music. The system was entered
into the annual music information retrieval evaluation exchange (MIREX 2010).
Results from the sound-effects ranking task show that sparse auditory-model-based
features outperform standard MFCC features, reaching a precision of
about 73% for the top-ranked sound, compared to about 60% for standard
MFCC and 67% for the best MFCC variant. These experiments involved
ranking sounds in response to text queries through a scalable online machine
learning approach to ranking.
1.2.1 The stabilized auditory image
In our system we have taken inspiration from the human auditory system in
order to come up with a rich set of audio features that are intended to more
closely model the features that humans use to listen to and process music,
including fine timing relations, which are discarded by traditional spectral techniques.
A motivation for using auditory models is that the auditory system is very effective
at identifying many sounds. This capability may be partially attributed
to acoustic features that are extracted at the early stages of auditory processing.
We feel that there is a need to develop a representation of sounds that
captures the full range of auditory features that humans use to discriminate
and identify different sounds, so that machines have a chance to do so as well.
FIGURE 1.1
An example of a single SAI of a sound file of a spoken vowel sound. The
vertical axis is frequency with lower frequencies at the bottom of the figure
and higher frequencies on the top. The horizontal axis is the autocorrelation
lag. From the positions of the vertical features, one can determine the pitch
of the sound.
This SAI representation generates a 2D image from each section of waveform
from an audio file. We then reduce each image in several steps: first
cutting the image into overlapping boxes converted to fixed resolution per
box; second, finding row and column sums of these boxes and concatenating
those into a vector; and finally vector quantizing the resulting medium-dimensionality
vector, using a separate codebook for each box position. The
VQ codeword index is a representation of a 1-of-N sparse code for each box,
and the concatenation of all of those sparse vectors, for all the box positions,
makes the sparse code for the SAI image. The resulting sparse code is accumulated
across the audio file, and this histogram (count of number of occurrences
of each codeword) is then used as input to an SVM [5] classifier[3]. This approach
is similar to that of the “bag of words” concept, originally from natural
language processing, but used heavily in computer vision applications as “bag
of visual words”; here we have a “bag of auditory words”, each “word” being
an abstract feature corresponding to a VQ codeword. The bag representation
is a list of occurrence counts, usually sparse.
1.3 Algorithm
In our experiments, we generate a stream of SAIs using a series of modules that
process an incoming audio stream through the various stages of the auditory
model. The first module filters the audio using the pole–zero filter cascade
(PZFC) [9], then subsequent modules find strobe points in this audio, and
generate a stream of SAIs at a rate of 50 per second. The SAIs are then
cut into boxes and are transformed into a high dimensional dense feature
vector [12] which is vector quantized to give a high dimensional sparse feature
vector. This sparse vector is then used as input to a machine learning system
which performs either ranking or classification. This whole process is shown
in diagrammatic form in Figure 1.2.
1.3.1 Pole–Zero Filter Cascade
We first process the audio with the pole–zero filter cascade (PZFC) [9], a
model inspired by the dynamics of the human cochlea. The PZFC is a cascade
of a large number of simple filters with an output tap after each stage. The
effect of this filter cascade is to transform an incoming audio signal into a
set of band-pass filtered versions of the signal. In our case we used a cascade
with 95 stages, leading to 95 output channels. Each output channel is half-wave
rectified to simulate the output of the inner hair cells along the length
of the cochlea. The PZFC also includes an automatic gain control (AGC)
system that mimics the effect of the dynamic compression mechanisms seen
in the cochlea. A smoothing network, fed from the output of each channel,
dynamically modifies the characteristics of the individual filter stages. The
AGC can respond to changes in the output on the timescale of milliseconds,
leading to very fast-acting compression. One way of viewing this filter cascade
is that its outputs are an approximation of the instantaneous neuronal firing
rate as a function of cochlear place, modeling both the frequency filtering and
FIGURE 1.2
A flowchart describing the flow of data in our system. First, either the pole–
zero filter cascade (PZFC) or gammatone filterbank filters the input audio
signal. Filtered signals then pass through a half-wave rectification module
(HCL), and trigger points in the signal are then calculated by the local-max
module. The output of this stage is the SAI, the image in which each signal is
shifted to align the trigger time to the zero lag point in the image. The SAI
is then cut into boxes with the box-cutting module, and the resulting boxes
are then turned into a codebook with the vector-quantization module.
the automatic gain control characteristics of the human cochlea [8]. The PZFC
parameters used for the sound-effects ranking task are described in [9]. We
did not do any further tuning of this system to the problems of genre, mood
or song classification; this would be a fruitful area of further research.
1.3.2 Image Stabilization
The output of the PZFC filterbank is then subjected to a process of strobe
finding where large peaks in the PZFC signal are found. The temporal locations
of these peaks are then used to initiate a process of temporal integration
whereby the stabilized auditory image is generated. These strobe points
“stabilize” the signal in a manner analogous to the trigger mechanism in an
oscilloscope. When these strobe points are found, a modified form of autocorrelation
known as strobed temporal integration is applied; this is like a sparse version
of autocorrelation in which only the strobe points are correlated against the
FIGURE 1.3
The cochlear model, a filter cascade with half-wave rectifiers at the output
taps, and an automatic gain control (AGC) filter network that controls the
tuning and gain parameters in response to the sound.
signal. Strobed temporal integration has the advantage of being considerably less
computationally expensive than full autocorrelation.
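A toy sketch of the strobed-temporal-integration idea for a single filterbank channel is given below.
The exponential decay and the simple rectangular windows are assumptions made for illustration
and do not reproduce the weighting used in the real SAI implementation.

```python
import numpy as np

def strobed_temporal_integration(channel, strobes, num_lags, decay=0.97):
    """Toy strobed temporal integration for one filterbank channel.

    For each strobe point, the num_lags samples that follow it are added into
    an exponentially decayed lag profile; aligning every strobe to lag zero is
    what "stabilizes" the image.
    """
    sai_row = np.zeros(num_lags)
    for t in strobes:
        window = channel[t:t + num_lags]
        sai_row *= decay
        sai_row[:len(window)] += window
    return sai_row
```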
1.3.3 Box Cutting
We then divide each image into a number of overlapping boxes using the
same process described in [9]. We start with rectangles of size 16 lags by 32
frequency channels, and cover the SAI with these rectangles, with overlap.
Each of these rectangles is added to the set of rectangles to be used for vector
quantization. We then successively double the height of the rectangle up to
the largest size that fits in an SAI frame, but always reducing the contents of
each box back to 16 by 32 values. Each of these doublings is added to the set
of rectangles. We then double the width of each rectangle up to the width of
the SAI frame and add these rectangles to the set as well. The output of this
step is a set of 44 overlapping rectangles. The process of box-cutting is shown
in Figure 1.4.
In order to reduce the dimensionality of these rectangles, we then take
their row and column marginals and join them together into a single vector.
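A rough sketch of this box-and-marginals reduction is shown below. The block-averaging resize
and the fixed 32-by-16 target are simplifying assumptions; the full 44-box layout over the SAI
frame is not reproduced.

```python
import numpy as np

def box_marginals(sai, top, left, height, width, target_hw=(32, 16)):
    """Cut one rectangle out of an SAI frame and reduce it to its marginals.

    The box is downsampled to a fixed 32 (channels) x 16 (lags) grid by block
    averaging (assuming height and width are multiples of the target sizes),
    then the 32 row sums and 16 column sums are concatenated into a
    48-dimensional vector.
    """
    box = sai[top:top + height, left:left + width]
    th, tw = target_hw
    box = box.reshape(th, height // th, tw, width // tw).mean(axis=(1, 3))
    return np.concatenate([box.sum(axis=1), box.sum(axis=0)])  # 32 + 16 values
```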
1.3.4 Vector Quantization
The resulting dense vectors from all the boxes of a frame are then converted
to a sparse representation by vector quantization.
We first preprocessed a collection of 1000 music files from 10 genres using
a PZFC filterbank followed by strobed temporal integration to yield a set of
FIGURE 1.4
The boxes, or multi-scale regions, used to analyze the stabilized auditory
images are generated in a variety of heights, widths, and positions.
SAI frames for each file. We then take this set of SAIs and apply the box-cutting
technique described above, followed by the calculation of row and
column marginals. These vectors are then used to train dictionaries of 200
entries, representing abstract “auditory words”, for each box position, using
a k-means algorithm.

This process requires the processing of large amounts of data, just to train
the VQ codebooks on a training corpus.

The resulting dictionaries for all boxes are then used in the MIREX experiment
to convert the dense features from the box-cutting step on the test
corpus songs into a set of sparse features where each box is represented by
a vector of 200 elements with only one element being non-zero. The sparse
vectors for each box are then concatenated, and these long sparse vectors are
histogrammed over the entire audio file to produce a sparse feature vector
for each song or sound effect. This operation of constructing a sparse bag of
auditory words was done for both the training and testing corpora.
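A minimal sketch of the codebook training and sparse-coding steps is given below, using SciPy's
k-means as a stand-in for the clustering actually used; the array shapes and names are assumptions
made for illustration.

```python
import numpy as np
from scipy.cluster.vq import kmeans2, vq

def train_codebooks(dense_features, codebook_size=200, seed=0):
    """Train one k-means codebook per box position.

    dense_features[b] is an (n_frames, 48) array of marginal vectors for box
    position b, pooled over the training corpus.
    """
    return [kmeans2(np.asarray(feats, dtype=float), codebook_size,
                    minit='++', seed=seed)[0]
            for feats in dense_features]

def sparse_histogram(file_features, codebooks):
    """Turn one audio file into a sparse bag-of-auditory-words histogram.

    For every frame and box position the nearest codeword is found (a 1-of-N
    sparse code per box); the concatenated counts over the whole file form the
    file-level sparse feature vector.
    """
    sizes = [cb.shape[0] for cb in codebooks]
    offsets = np.cumsum([0] + sizes[:-1])
    hist = np.zeros(sum(sizes))
    for offset, feats, cb in zip(offsets, file_features, codebooks):
        codes, _ = vq(np.asarray(feats, dtype=float), cb)
        np.add.at(hist, offset + codes, 1)
    return hist
```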
1.3.5 Machine Learning
For this system, we used the support vector machine learning system from
libSVM which is included in the Marsyas[13] framework. Standard Marsyas
SVM parameters were used in order to classify the sparse bag of auditory
words representation of each song. It should be noted that SVM is not the
ideal algorithm for doing classification on such a sparse representation, and
if time permitted, we would have instead used the PAMIR machine learning
algorithm as described in [9]. This algorithm has been shown to outperform
SVM on ranking tasks, both in terms of execution speed and quality of results.
1.4 Experiments
1.4.1 Sound Ranking
We performed an experiment in which we examined a quantitative ranking
task over a diverse set of audio files using tags associated with the audio files.
For this experiment, we collected a dataset of 8638 sound effects, which
came from multiple places. 3855 of the sound files were from commercially
available sound effect libraries, of these 1455 were from the BBC sound effects
library. The other 4783 audio files were collected from a variety of sources on
the internet, including findsounds.com, partnersinrhyme.com, acoustica.com,
ilovewaves.com, simplythebest.net, wav-sounds.com, wav-source.com and
wavlist.com.
We then manually annotated this dataset of sound effects with a small
number of tags for each file. Some of the files were already assigned tags and
for these, we combined our tags with this previously existing tag information.
In addition, we added higher level tags to each file, for example, files with
the tags “cat”, “dog” and “monkey” were also given the tags “mammal” and
“animal”. We found that the addition of these higher level tags assist retrieval
by inducing structure over the label space. All the terms in our database were
stemmed, and we used the Porter stemmer for English, which left a total of
3268 unique tags for an average of 3.2 tags per sound file.
In order to estimate the performance of the learned ranker, we used a
standard three-fold cross-validation experimental setup. In this scheme, two
thirds of the data is used for training and one third is used for testing; this
process is then repeated for all three splits of the data and results of the three
are averaged. We removed any queries that had fewer than 5 documents in
either the training set or the test set, and if the corresponding documents had
no other tags, these documents were removed as well.
To determine the values of the hyperparameters for PAMIR we performed
a second level of cross-validation where we iterated over values for the aggressiveness
parameter C and the number of training iterations. We found that in
general system performance was good for moderate values of C and that lower
values of C required a longer training time. For the aggressiveness parameter,
we selected a value of C=0.1, a value which was also found to be optimal in
other research [6]. For the number of iterations, we chose 10M, and found that
in our experience, the system was not very sensitive to the exact values of these
parameters.
We evaluated our learned model by looking at the precision within the top
k audio files from the test set as ranked by each query. Precision at top k is
a commonly used measure in retrieval tasks such as these and measures the
fraction of positive results within the top k results from a query.
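For reference, precision at top k reduces to a few lines; the ranking and tag structures in the usage
comment are hypothetical.

```python
def precision_at_k(ranked_doc_ids, relevant_ids, k):
    """Fraction of the top-k ranked documents that carry the query's tag."""
    top = ranked_doc_ids[:k]
    return sum(doc in relevant_ids for doc in top) / float(k)

# Hypothetical usage: precision_at_k(ranked_ids_for("applause"), tagged["applause"], k=1)
```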
The stabilized auditory image generation process has a number of parameters
which can be adjusted including the parameters of the PZFC filter and
the size of rectangles that the SAI is cut into for subsequent vector quantization.
We created a default set of parameters and then varied these parameters
in our experiments. The default SAI box-cutting was performed with 16 lags
and 32 channels, which gave a total of 49 rectangles. These rectangles were
then reduced to their marginal values which gives a 48 dimension vector, and a
codebook of size 256 was used for each box, giving a total of 49 x 256 = 12544
feature dimensions. Starting from these, we then made systematic variations
to a number of different parameters and measured their effect on precision of
retrieval. For the box-cutting step, we adjusted various parameters including
the smallest sized rectangle, and the maximum number of rectangles used for
segmentation. We also varied the codebook sizes that we used in the sparse
coding step.
In order to evaluate our method, we compared it with results obtained
using a very common feature extraction method for audio analysis, MFCCs
(mel-frequency cepstral coefficients). In order to compare this type of feature
extraction with our own, we turned these MFCC coefficients into a sparse
code. These MFCC coefficients were calculated with a Hamming window with
initial parameters based on a setting optimized for speech. We then changed
various parameters of the MFCC algorithm, including the number of cepstral
coefficients (13 for speech), the length of each frame (25ms for speech), and the
number of codebooks that were used to sparsify the dense MFCC features for
each frame. We obtained the best performance with 40 cepstral coefficients, a
window size of 40ms and codebooks of size 5000.
We investigated the effect of various parameters of the SAI feature extraction
process on test-set precision; these results are displayed graphically in
Figure 1.5, where the precision of the top-ranked sound file is plotted against
the number of features used. As one can see from this graph, performance
saturates when the number of features approaches 10^5, which results from
the use of 4000 code words per codebook, with a total of 49 codebooks. This
particular set of parameters led to a performance of 73%, significantly better
than the best MFCC result, which achieved a performance of 67%; this represents
an 18% relative reduction in error (from 33% to 27% error). It is also notable
FIGURE 1.5
Ranking at top-1 retrieved result for all the experimental runs described in
this section. A few selected experiment names are plotted next to each point,
and different experiments are shown by different icons. The convex hull that
connects the best-performing experiments is plotted as a solid line.
that SAI can achieve better precision-at-top-k consistently for all values of k,
albeit with a smaller improvement in relative precision.
Table 1.2 shows the results of three queries along with the top five sound files that
were returned by the best SAI-based and MFCC-based systems. From this
table, one can see that the two systems perform in different ways; this can be
expected when one considers the basic audio features that these two systems
extract. For example, for the query “gulp”, the SAI system returns “pouring”
and “water-dripping”; all three of these share the similarity of involving the
movement of water or liquids.
When we calculated performance, it was based on textual tags, which
are often noisy and incomplete. Due to the nature of human language and
perception, people often use different words to describe sounds that are very
similar, for example, a Chopin Mazurka could be described with the words “piano”,
“soft”, “classical”, “Romantic”, and “mazurka”. To compound this diffi-
culty, a song that had a female vocalist singing could be labelled as “woman”,
“women”, “female”, “female vocal”, or “vocal”. This type of multi-label problem
is common in the field of content based retrieval. It can be alleviated
by a number of techniques, including the stemming of words, but due to the
varying nature of human language and perception, will continue to remain an
issue.
In Figure 1.6 the performance of the SAI and MFCC based systems is
FIGURE 1.6
A comparison of the average precision of the SAI and MFCC based systems.
Each point represents a single query, with the horizontal position being the
MFCC average precision and the vertical position being the SAI average precision.
More of the points appear above the y=x line, which indicates that the
SAI based system achieved a higher mean average precision.
compared with respect to their average precision. A few selected
full tag names are placed on this diagram; for the rest, only a plus sign is shown.
This is required because otherwise the text would overlap to such a great
degree that it would be impossible to read.
In this diagram we plot the average precision of the SAI based system
against that of the MFCC based system, with the SAI precision shown along
the vertical axis and the MFCC precision shown along the horizontal axis.
If the performance of the two systems were identical, all points would lie on
the line y=x. Because more points lie above the line than below it, the
performance of the SAI based system is better than that of the MFCC based
system.
top-k SAI MFCC percent error reduction
1 27 33 18 %
2 39 44 12 %
5 60 62 4 %
10 72 74 3 %
20 81 84 4 %
TABLE 1.1
A comparison of the best SAI and MFCC configurations. This table shows
the percent error at top-k, where error is defined as 1 - precision.
Query SAI file (labels) MFCC file (labels)
tarzan Tarzan-2 (tarzan, yell) TARZAN (tarzan, yell)
tarzan2 (tarzan, yell) 175orgs (steam, whistle)
203 (tarzan) mosquito-2 (mosquito)
wolf (mammal, wolves, wolf, ...) evil-witch-laugh (witch, evil, laugh)
morse (morse, code) Man-Screams (horror, scream, man)
applause 27-Applause-from-audience 26-Applause-from-audience
audience 30-Applause-from-audience phase1 (trek, phaser, star)
golf50 (golf) fanfare2 (fanfare, trumpet)
firecracker 45-Crowd-Applause (crowd, applause)
53-ApplauseLargeAudienceSFX golf50
gulp tite-flamn (hit, drum, roll) GULPS (gulp, drink)
water-dripping (water, drip) drink (gulp, drink)
Monster-growling (horror, monster, growl) california-myotis-search (blip)
Pouring (pour,soda) jaguar-1 (bigcat, jaguar, mammal, ...)
TABLE 1.2
Results of three example queries, showing the top-ranked sound files (and their
labels) returned by the best SAI-based and MFCC-based configurations.
Algorithm Classification Accuracy
SAI/VQ 0.4987
Marsyas MFCC 0.4430
Best 0.6526
Average 0.455
TABLE 1.3
Classical composer train/test classification task
Algorithm Classification Accuracy
SAI/VQ 0.4861
Marsyas MFCC 0.5750
Best 0.6417
Average 0.49
TABLE 1.4
Music mood train/test classification task
1.4.2 MIREX 2010
All of these algorithms were then ported to the Marsyas music information
retrieval framework from AIM-C, and extensive tests were written as described
above. These algorithms were submitted to the MIREX 2010 competition as
C++ code, which was then run by the organizers on blind data. As of this
date, only results for two of the four train/test tasks have been released. One
of these is for the task of classifying classical composers and the other is for
classifying the mood of a piece of music. There were 40 groups participating in
this evaluation, the most ever for MIREX, which gives some indication about
how this classification task is increasingly important in the real world. Below
I present the results for the best entry, the average of all entries, our entry,
and the other entry for the Marsyas system. It is instructive to compare our
result to that of the standard Marsyas system because in large part we would
like to compare the SAI audio feature to the standard MFCC features, and
since both of these systems use the SVM classifier, we partially negate the
influence of the machine learning part of the problem.
For the classical composer task the results are shown in Table 1.3, and for
the mood classification task, results are shown in Table 1.4.
From these results we can see that in the classical composer task we outperformed
the traditional Marsyas system, which has been tuned over the course
of a number of years to perform well. This gives us the indication that the
use of these SAI features has promise. However, we underperform the best
algorithm, which means that there is work to be done in terms of testing different
machine learning algorithms that would be better suited to this type
of data. However, in a more detailed analysis of the results, shown in
Figure 1.7, it is evident that each of the algorithms has a wide range of performance
on different classes. This graph shows that the classes best predicted by
our SAI/VQ classifier overlap significantly with those from the highest scoring
classification engines.
FIGURE 1.7
Per class results for classical composer
In the mood task, we underperform both Marsyas and the leading algorithm.
This is interesting and might speak to the fact that we did not tune the
parameters of this algorithm for the task of music classification, but instead
used the parameters that worked best for the classification of sound effects.
Music mood might be a feature that has spectral aspects that evolve over
longer time periods than other features. For this reason, it would be important
to search for other parameters in the SAI algorithm that would perform
well for other tasks in music information retrieval.
For these results, due to time constraints, we only used the SVM classifier
on the SAI histograms. This has been shown in [9] to be an inferior classifier
for this type of sparse, high-dimensional data than the PAMIR algorithm. In
the future, we would like to add the PAMIR algorithm to Marsyas and to
try these experiments using this new classifier. It was observed that the MIR
community is increasingly becoming focused on advanced machine learning
techniques, and it is clear that it will be critical to both try different machine
learning algorithms on these audio features as well as to perform wider sweeps
of parameters for these classifiers. Both of these will be important in increasing
the performance of these novel audio features.
1.5 Conclusions
The use of physiologically-plausible acoustic models combined with a sparsification
approach has shown promising results in both the sound effects ranking
and MIREX 2010 experiments. These features are novel and hold great
promise in the field of MIR for the classification of music as well as other tasks.
Some of the results obtained were better than those of a highly tuned MIR system
on blind data. In this task we were able to expose the MIR community
to these new audio features. These new audio features have been shown to
outperform MFCC features in a sound-effects ranking task, and by evaluating
these features with machine learning algorithms more suited for these high-dimensional,
sparse features, we have great hope that we will obtain even better
results in future MIREX evaluations.

Bibliography
[1] Suhrid Balakrishnan and David Madigan. Algorithms for sparse linear
classifiers in the massive data setting. J. Mach. Learn. Res., 9:313–337,
2008.
[2] Léon Bottou, Olivier Chapelle, Dennis DeCoste, and Jason Weston.
Large-Scale Kernel Machines (Neural Information Processing). The MIT
Press, 2007.
[3] O. Chapelle, B. Schölkopf, and A. Zien. Semi-Supervised Learning. MIT
Press, Cambridge, MA, 2006.
[4] Gal Chechik, Varun Sharma, Uri Shalit, and Samy Bengio. Large scale
online learning of image similarity through ranking. J. Mach. Learn.
Res., 11:1109–1135, 2010.
[5] Yasser EL-Manzalawy and Vasant Honavar. WLSVM: Integrating LibSVM
into Weka Environment, 2005.
[6] David Grangier and Samy Bengio. A discriminative kernel-based approach
to rank images from text queries. IEEE Trans. Pattern Anal.
Mach. Intell., 30(8):1371–1384, 2008.
[7] Patrick Haffner. Fast transpose methods for kernel learning on sparse
data. In ICML ’06: Proceedings of the 23rd international conference on
Machine learning, pages 385–392, New York, NY, USA, 2006. ACM.
[8] R. F. Lyon. Automatic gain control in cochlear mechanics. In P Dallos
et al., editor, The Mechanics and Biophysics of Hearing, pages 395–420.
Springer-Verlag, 1990.
[9] Richard F. Lyon, Martin Rehn, Samy Bengio, Thomas C. Walters, and
Gal Chechik. Sound retrieval and ranking using auditory sparse-code
representations. Neural Computation, 22, 2010.
[10] Bruno A. Olshausen and David J. Field. Sparse coding of sensory inputs.
Current opinion in neurobiology, 14(4):481–487, 2004.
[11] R. D. Patterson. Auditory images: how complex sounds are represented
in the auditory system. The Journal of the Acoustical Society of America,
21:183–190, 2000.
[12] Martin Rehn, Richard F. Lyon, Samy Bengio, Thomas C. Walters, and
Gal Chechik. Sound ranking using auditory sparse-code representations.
ICML 2009 Workshop on Sparse Methods for Music Audio, 2009.
[13] G. Tzanetakis. Marsyas-0.2: A case study in implementing music information
retrieval systems, chapter 2, pages 31–49. Intelligent Music
Information Systems: Tools and Methodologies. Information Science Reference,
2008. Shen, Shepherd, Cui, Liu (eds).
Extracting Patterns from Location History
Andrew Kirmse
Google Inc
Mountain View, California
akirmse@google.com
Tushar Udeshi
Google Inc
Boulder, Colorado
tudeshi@google.com
Jim Shuma
Google Inc
Mountain View, California
jshuma@google.com
Pablo Bellver
Google Inc
Mountain View, California
pablob@google.com
ABSTRACT
In this paper, we describe how a user's location history (recorded
by tracking the user's mobile device location with his permission)
is used to extract the user's location patterns. We describe how we
compute the user's commonly visited places (including home and
work), and commute patterns. The analysis is displayed on the
Google Latitude history dashboard [7] which is only accessible to
the user.
Categories and Subject Descriptors
D.0 [General]: Location based services.
General Terms
Algorithms.
Keywords
Location history analysis, commute analysis.
1. INTRODUCTION
Location-based services have been gaining in popularity. Most
services [4,5] utilize a “check-in” model where a user takes some
action on the phone to announce that he has reached a particular
place. He can then advertise this to his friends and also to the
business owner, who might give him some loyalty points. Google
Latitude [6] utilizes a more passive model. The mobile device
periodically sends the user's location to a server, which shares it with his
registered friends. The user also has the option of opting into
Latitude location history. This allows Google to store the user's
location history. This history is analyzed and displayed for the
user on a dashboard [7].
A user's location history can be used to provide several useful
services. We can cluster the points to determine the places he
frequents and how much time he spends at each place. We can
determine the common routes the user drives on, for instance, his
daily commute to work. This analysis can be used to provide
useful services to the user. For instance, one can use real-time
traffic services to alert the user when there is traffic on the route
he is expected to take and suggest an alternate route.
We expect many more useful services to arise from location
history. It is important to note that a user's location history is
stored only if he explicitly opts into this feature. However, once
signed in, he can get several useful services without any
additional work on his part (like checking in).
2. PREVIOUS WORK
Much previous work assumes clean location data sampled at very
high frequency. Ashbrook and Starner [2] cluster a user's
significant locations from GPS traces by identifying locations
where the GPS signal reappears after an absence of 10 minutes or
longer. This approach is unable to identify important outdoor
places and is also susceptible to spurious GPS signal loss (e.g. in
urban canyons or when the recording device is off). In addition
they use a Markov model to predict where the user is likely to go
next from where he is. Liao et al. [11] attempt to segment a user's
day into everyday activities such as “working”, “sleeping” etc.
using a hierarchical activity model. Both these papers obtain one
GPS reading per second. This is impractical with today's mobile
devices due to battery usage. Kang et al [10] use time-based
clustering of locations obtained using a “Place Lab Client” to infer
the user's important locations. The “Place lab client” infers
locations by listening to RF-emissions from known wi-fi access
points. This requires less power than GPS. However, their
clustering algorithm assumes a continuous trace of one sample per
second. Real-world data is not so reliable and often has missing
and noisy data as illustrated in Section 3.2.
Ananthanarayanan et al [1] describe a technique to infer a user's
driving route. They also match users having similar routes to
suggest carpool partners. Liao et al [12] use a hierarchical Markov
Model to infer a user's transportation patterns including different
modes of transportation (e.g. bus, on foot, car etc.). Both these
papers use clean regularly-sampled GPS traces as input.
3. LOCATION ANALYSIS
3.1 Input Data
For every user, we have a list of timestamped points. Each point
has a geolocation (latitude and longitude), an accuracy radius and
an input source: 17% of our data points are from GPS and these
have an accuracy in the 10 meter range. Points derived from wifi
signatures have an accuracy in the 100 meter range and represent
57% of our data. The remaining 26% of our points are derived
from cell tower triangulation and these have an accuracy in the
1000 meter range.
We have a test database of location history of around a thousand
users. We used this database to generate the data in this paper.
3.2 Location Filtering
The raw geolocations reported from mobile devices may contain
errors beyond the measurable uncertainties inherent in the
collection method. Hardware or software bugs in the mobile
device may induce spurious readings, or variations in signal
strength or terrain may cause a phone to connect to a cell tower
that is not the one physically closest to the device. Given a stream
of input locations, we apply the following filters to account for
these errors:
1. Reject any points that fall outside the boundaries of
international time zones over land. While this discards
some legitimate points over water (presumably collected
via GPS), in practice it removes many more false
readings.
2. Reject any points with timestamps before the known
public launch of the collection software.
3. Identify cases of “jitter”, where the reported location
jumps to a distant point and soon returns. As shown in
Figure 1, this is surprisingly common. We look for a
sequence of consecutive locations {P1, P2, …, Pn} where
the following conditions hold:
◦ P1 and Pn are within a small distance threshold D of
each other.
◦ P1 and Pn have timestamps within a few hours of
each other.
◦ P1 and Pn have high reported accuracy.
◦ P2, …, Pn-1 have low reported accuracy.
◦ P2, …, Pn-1 are farther than D from P1.
In such a case, we conclude that the points P2, …, Pn-1
are due to jitter, and discard them.
4. If a pair of consecutive points implies a non-physical
velocity, reject the later one.
Any points that are filtered are discarded, and are not used in the
remaining algorithms described in this paper.
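The jitter rule above can be written down directly. The following Python sketch is only an illustration of that filter: the point attributes (timestamp), the helpers distance_meters and is_high_accuracy, and the numeric thresholds are assumptions, since the paper only speaks of a small distance threshold D, "a few hours", and high versus low reported accuracy.

from datetime import timedelta

D_METERS = 200                    # assumed value for the threshold D
MAX_GAP = timedelta(hours=3)      # assumed reading of "within a few hours"

def remove_jitter(points, distance_meters, is_high_accuracy):
    """Drop runs of low-accuracy points that jump far from P1 and return near Pn."""
    keep = [True] * len(points)
    i = 0
    while i < len(points):
        if not is_high_accuracy(points[i]):
            i += 1
            continue
        # Scan over low-accuracy interior points farther than D from P1.
        j = i + 1
        while (j < len(points) and not is_high_accuracy(points[j])
               and distance_meters(points[i], points[j]) > D_METERS):
            j += 1
        if (i + 1 < j < len(points)
                and is_high_accuracy(points[j])
                and distance_meters(points[i], points[j]) < D_METERS
                and points[j].timestamp - points[i].timestamp < MAX_GAP):
            for k in range(i + 1, j):   # P2 .. Pn-1 are jitter
                keep[k] = False
        i = j
    return [p for p, kept in zip(points, keep) if kept]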
3.3 Computing Frequently Visited Places
In this section, we describe the algorithms we use to compute
places frequented by a user from his location history. We first
select the points at which the user is stationary, i.e. moving at
a very low velocity. These stationary points are then clustered to
extract interesting locations.
3.3.1 Clustering Stationary Points
We use two different algorithms for clustering stationary points.
3.3.1.1 Leader-based Clustering
For every point, we determine if it belongs to any of the already
generated clusters by computing its distance to the cluster leader's
location. If this is below a threshold radius, the point is added to
the cluster. Pseudocode is shown in Figure 2. This algorithm is simple
and efficient. It runs in O(NC) where N is the number of points
and C is the number of clusters. However, the output clusters are
dependent on the order of the input points. For example, consider
3 points P1, P2 and P3 which lie on a straight line with a distance of
radius between them as shown in Figure 3. If the input points are
ordered {P1, P2, P3}, we would get 2 clusters: {P1} and {P2, P3}.
But if they are ordered {P2, P1, P3} we would get only 1 cluster
containing all 3 points.
3.3.1.2 Mean Shift Clustering
Mean shift [3] is an iterative procedure that moves every point to
the average of the data points in its neighborhood. The iterations
are stopped when all points move less than a threshold. This
algorithm is guaranteed to converge. We use a weighted average
to compute the move position where the weights are inversely
proportional to the accuracy. This causes the points to gravitate
towards high accuracy points. Once the iterations converge, the
moved locations of the points are chosen as cluster centers. All
input points within a threshold radius to a cluster center are added
to its cluster. We revert to leader-based clustering if the
iterations do not converge or some of the input points remain
unclustered. Pseudocode is shown in Figure 4.
This algorithm does not suffer from the input-order dependency of
leader-based clustering. For the input point set of Figure 3, it will
always return a cluster comprising all 3 points. The algorithm
generates a smaller number of better located clusters compared to
leader-based clustering. For example consider 4 points on the
vertices of a square with a diagonal of 2*radius as shown in
Figure 5. Leader-based clustering would generate 4 clusters, 1 per
point. Mean-shift clustering would return only 1 cluster whose
centroid is at the center of the square.

Figure 1. A set of reported locations exhibiting “jitter”. One of
the authors was actually stationary during the time interval
represented by these points.

Figure 2. Leader-based Clustering Algorithm:
Let points = input points
Let clusters = []
foreach p in points:
  foreach c in clusters:
    if distance(c.leader(), p) < radius:
      Add p to c
      break
  else:
    Create a new cluster c with p as leader
    clusters.add(c)

Figure 3. Three equidistant points P1, P2, P3 on a line, with a
distance of radius between consecutive points.
The iterative nature of this algorithm makes it expensive. We
therefore limit the maximum number of iterations and revert to
leader-based clustering if the algorithm does not converge quickly
enough.
When we ran mean-shift clustering on our test database, the
algorithm converged in 2.4 iterations on average. 3% of the
input points could not be clustered (i.e. we had to revert to
leader-based clustering for them). However, mean shift did not significantly
reduce the number of computed clusters (< 1%). We concluded that the
marginal improvement in quality did not justify the increased
computational cost.
3.3.1.3 Adaptive Radius Clustering
The two clustering algorithms described above return clusters of
the input points. One possibility would be to deem clusters larger
than a threshold as interesting locations. However, this is not ideal
since the input points have varying accuracy. For instance, three
stationary GPS points in close proximity give us much higher confidence
that the user visited that place than three stationary cell tower
points would. We run the
clustering algorithms multiple times, increasing the radius as well
as the minimum cluster size after every iteration. When a cluster
is generated, we check to see if it overlaps an already computed
cluster (generated from a smaller radius). If that is the case, we
merge it into the larger cluster. Note that adaptive radius
clustering can be used in conjunction with any clustering
algorithm.
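As a rough sketch of how this adaptive scheme can wrap any base clustering routine, the code below runs a supplied cluster_fn over an increasing (radius, minimum size) schedule and merges each new cluster into an overlapping cluster found at a smaller radius. The schedule values, the helper names, and the merge direction are assumptions; the paper only states that the radius grows from 20 to 500 meters and the minimum size from 2 to 4.

def adaptive_radius_clustering(points, cluster_fn, overlaps,
                               schedule=((20, 2), (100, 3), (500, 4))):
    """Run cluster_fn(points, radius) at several radii, keeping clusters that
    meet the minimum size and folding overlapping ones together.

    cluster_fn  -- any base clustering algorithm, e.g. leader-based clustering
    overlaps    -- predicate deciding whether two clusters cover the same place
    schedule    -- assumed (radius_meters, min_cluster_size) progression
    """
    accepted = []   # clusters found at smaller radii appear first
    for radius, min_size in schedule:
        for cluster in cluster_fn(points, radius):
            if len(cluster) < min_size:
                continue
            for existing in accepted:
                if overlaps(existing, cluster):
                    # Merge: here the earlier (tighter) cluster absorbs the new points.
                    existing.extend(p for p in cluster if p not in existing)
                    break
            else:
                accepted.append(list(cluster))
    return accepted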
From our test database, we found that adaptively increasing the
clustering radius from 20 meters to 500 meters and the minimum
cluster size from 2 to 4 increased the number of computed visited
places by 81% as compared to clustering with a fixed radius of
500 meters and a minimum cluster size of 4. We also surveyed
users and found that the majority of the new visited places
generated were correct and useful enough to display on the
Latitude history dashboard [7].
3.3.2 Computing Home and Work Locations
We use a simple heuristic to determine the user's home and work
locations. A user is likely to be at home at night. We select the
user's points which occur at night and cluster them. The largest
cluster is deemed the user's home location.
Similarly, the work location is derived by selecting points which
occur on weekdays in the middle of the day and clustering them.
Note that this heuristic will not work for users with non-standard
schedules (e.g. work at night or work in multiple locations). Such
users have the option of correcting their home and work location
on the Latitude history dashboard [7]. These updated locations
will be used for other analyses (e.g. commute analysis described
in Section 3.4).
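A minimal sketch of this heuristic, assuming each point exposes a timestamp and lat/lng and that cluster_fn is one of the clustering routines above; the specific hour windows are guesses, since the paper only says "at night" and "in the middle of the day on weekdays".

def infer_home_and_work(points, cluster_fn):
    """Home = largest cluster of night points; work = largest cluster of
    weekday midday points. Hour windows below are assumed, not from the paper."""
    night = [p for p in points if p.timestamp.hour >= 22 or p.timestamp.hour < 6]
    midday = [p for p in points
              if p.timestamp.weekday() < 5 and 10 <= p.timestamp.hour < 16]

    def centroid(cluster):
        lat = sum(p.lat for p in cluster) / len(cluster)
        lng = sum(p.lng for p in cluster) / len(cluster)
        return (lat, lng)

    def largest_center(pts):
        clusters = cluster_fn(pts)
        return centroid(max(clusters, key=len)) if clusters else None

    return largest_center(night), largest_center(midday)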
3.3.3 Computing Visited Places
We do some additional filtering of the input points before
clustering for visited places:
1. We remove points which are within a threshold
distance of home and work locations.
2. We remove points which are on the user's commute
between home and work. These points are determined
using the algorithm described in Section 3.4.1.
3. We remove points near airports since these are reported
as flights as described in Section 3.5.
Without these filters, we get spurious visited places. Even when a
user is stationary at home or work, the location reported can jump
around, as described in Section 3.2. Without the first filter, we
would get multiple visited places near home and work. If a user
regularly stops at a long traffic signal on his commute to work, it
has a good chance of being clustered to a visited place. This is
why we need the second filter.
3.4 Commute Analysis
We can deduce a user's driving commute patterns from his
location history. The main challenge here is that points are
reported infrequently and we have to derive the path the user has
taken in between these points. Also, the accuracy of the points can
be very low and so one needs to snap the points to the road he is
likely to be on. The commutes are analyzed in three steps: (1)
Extract sets of commute points from the input. (2) Fit the
commute points to a road path. (3) Cluster paths together spatially
and (optionally) temporally to generate the most common
commutes taken by the user.
3.4.1 Extracting Commute Points
Given a source and destination location (e.g. home and work), we
extract the points from the user's location history which likely
occurred on the user's driving commute from source to
destination. We first filter out all points with low accuracy from
the input set. We then find pairs of source-destination points. All
points between a pair are candidate points for a single commute.
The input points are noisy and therefore we do some sanity checks
on the commute candidate points: (1) The commute distance
should be reasonable. (2) The commute duration should be
reasonable. (3) The commute should be at reasonable driving
velocity.
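The three sanity checks can be phrased as a single predicate over one candidate commute. The numeric bounds below are placeholders chosen only for illustration; the paper merely requires that distance, duration, and implied driving speed be reasonable.

def plausible_commute(points, source, destination, distance_meters):
    """Return True if the candidate commute passes the three sanity checks."""
    if len(points) < 2:
        return False
    straight_line = distance_meters(source, destination)
    path_length = sum(distance_meters(a, b) for a, b in zip(points, points[1:]))
    duration_s = (points[-1].timestamp - points[0].timestamp).total_seconds()
    if duration_s <= 0:
        return False
    avg_speed_kmh = (path_length / 1000.0) / (duration_s / 3600.0)
    return (path_length < 5 * straight_line        # (1) distance is reasonable
            and 60 <= duration_s <= 4 * 3600       # (2) duration is reasonable
            and 5 <= avg_speed_kmh <= 130)         # (3) reasonable driving speed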
Figure 5. Four points on a square with a diagonal of 2*radius.

Figure 4. Mean-Shift Clustering Algorithm:
Let points = input points
Let clusters = []
foreach p in points:
  Compute Weight(p)
Let cluster_centers = points
repeat until all cluster_centers move less than shift_threshold:
  foreach p in cluster_centers:
    Find all points np which are within a threshold distance of p
    p = (Σ_i Weight(np_i) × np_i) / (Σ_i Weight(np_i))
while not cluster_centers.empty():
  Choose p with highest accuracy from cluster_centers
  Find all points, say rp, in points which are within radius of p
  Create a new cluster c with rp
  clusters.add(c)
  points = points – rp
  cluster_centers = cluster_centers – Moved(rp)
Cluster remaining points with leader-based clustering.

3.4.2 Fitting a Road Path to Commute Points
We use the routing engine used to compute driving directions in
Google Maps [8] to fit a path to the commute points. For the rest of
this paper, we will refer to this routing engine as “Pathfinder”.
This is an iterative algorithm. We first query Pathfinder for the
route between source and destination. If all the commute points
are within the accuracy threshold distance (used in Section 3.4.1),
we terminate and return this path. If not, we add the point which is
furthest away from the path as a waypoint and query Pathfinder
again. If Pathfinder fails, we assume that the waypoint is not valid
(for example it might be in water) and drop it. We continue
iterating until all commute points are within the accuracy distance
threshold. Pseudocode is shown in Figure 6. Two iterations of this
algorithm are shown in Figure 7. The small blue markers are the
points from the user's history. The large green markers are the
Pathfinder waypoints used to generate the path. After the second
iteration all the user points are within the accuracy threshold.
3.4.3 Clustering Commutes
Given a bag of commute paths and the time intervals during which the
commutes occurred, we cluster them to determine the most
frequent commutes. Two commutes are deemed temporally close
if they start and end within some threshold of each other on the
same day of the week. Two commutes are deemed spatially close
if their Hausdorff distance [9] is within a threshold. We use a
variant of leader-based clustering (described in Section 3.3.1.1) to
generate commute clusters.
The largest commute cluster on a particular day of week is the
most common route taken by the user. This can be used for traffic
alerts as described in Section 4.1.
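For reference, the spatial-closeness test can be implemented with the discrete Hausdorff distance over the two polylines; the sketch below assumes a distance_meters helper and an illustrative 500 m threshold (the paper just says "within a threshold").

def hausdorff_distance(path_a, path_b, distance_meters):
    """Discrete Hausdorff distance between two polylines (lists of points)."""
    def directed(src, dst):
        return max(min(distance_meters(p, q) for q in dst) for p in src)
    return max(directed(path_a, path_b), directed(path_b, path_a))

def spatially_close(path_a, path_b, distance_meters, threshold_m=500):
    # threshold_m is an assumed value, not taken from the paper.
    return hausdorff_distance(path_a, path_b, distance_meters) < threshold_m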
4. ACKNOWLEDGMENTS
Our thanks to Matthieu Devin, Max Braun, Jesse Rosenstock, Will
Robinson, Dale Hawkins, Jim Guggemos, and Baris Gultekin for
contributing to this project.
5. REFERENCES
[1] Ananthanarayanan, G., Haridasan, M., Mohomed, I., Terry,
D., Thekkath, C. A. 2009. StarTrack: A Framework for
Enabling Track-Based Applications. Proceedings of the 7th
international conference on Mobile systems, applications,
and services.
[2] Ashbrook, D., Starner, T. 2003. Using GPS to Learn
Significant Locations and Predict Movement across Multiple
Users. Personal and Ubiquitous Computing, Vol. 7, Issue 5.
[3] Cheng, Y. C. 1995. Mean Shift, Mode Seeking, and
Clustering. IEEE Transactions on Pattern Analysis and
Machine Intelligence, Vol. 17, No. 8.
[4] Facebook places http://www.facebook.com/places.
[5] Foursquare http://www.foursquare.com.
[6] Google Latitude http://www.google.com/latitude.
[7] Google Latitude History Dashboard.
http://www.google.com/latitude/history/dashboard
[8] Google Maps http://maps.google.com.
[9] Huttenlocher D. P., Klanderman, G. A., and Rucklidge W. J.
1993. Comparing Images using the Hausdorff Distance.
IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol 15, No 9.
[10] Kang, J.H., Welbourne, W., Stewart, B. and Borriello, G.
2005. Extracting Places from Traces of Locations.
SIGMOBILE Mob. Comput. Commun. Rev. 9, 3.
[11] Liao, L., Fox, D., and Kautz, H. 2007. Extracting Places and
Activities from GPS traces using Hierarchical Conditional
Random Fields. International Journal of Robotics Research.
[12] Liao, L., Patterson D. J., Fox, D., and Kautz, H. 2007,
Learning and Inferring Transportation Routines. Artificial
Intelligence. Vol. 171, Issues 5-6.
Figure 7. Two iterations of the path fitting algorithm. The small
markers are the user's points. The large markers are
Pathfinder waypoints used to generate the path. (Panels: Iteration 1
and Iteration 2.)
Let points = input points
Let waypoints = [source, destination]
Let current_path = Pathfinder route for waypoints
while points.size() > 2:
  Let p be the point in points farthest away from current_path
  if distance(p, current_path) < threshold:
    Record current_path as a commute
    break
  Add p to waypoints in the correct position
  current_path = Pathfinder route for waypoints
  if Pathfinder fails:
    Erase p from waypoints
  Erase p from points
Figure 6. Algorithm to fit a path to commute points.
Performance Trade-offs Implementing Refactoring Support for Objective-C
Robert Bowdidge*
rbowdidge@mac.com
* This work was performed while the author was at Apple, and discusses
the initial implementation of refactoring for Xcode 3.0. The author is
currently at Google.
Abstract
When we started implementing a refactoring tool for real-world
C programs, we recognized that preprocessing and parsing in
straightforward and accurate ways would result in unacceptably
slow analysis times and an overly-complicated parsing system.
Instead, we traded some accuracy so we could parse, analyze, and
change large, real programs while still making the refactoring
experience feel interactive and fast. Our tradeoffs fell into three
categories: using different levels of accuracy in different parts of
the analysis, recognizing that collected wisdom about C programs
didn't hold for Objective-C programs, and finding ways to exploit
delays in typical interaction with the tool.
Categories and Subject Descriptors D.2.6 [Software Engineering]:
Programming Environments
General Terms Design, Language
Keywords: refactoring, case study, scalability, Objective-C
1. Introduction
Taking software engineering tools from research to development
requires addressing the practical details of software development:
huge amounts of source code, the nuances of real languages,
and multiple build configurations. Making tools useful for
real programmers requires either addressing all these sorts of issues,
or accepting various trade-offs in order to ship a reasonable
software tool.
In our case, we wanted to add refactoring to Apple’s Xcode IDE
(integrated development environment).1 The refactoring feature
would manipulate programs written in Objective-C. Objective-C
is an object-oriented extension to C, and Apple’s primary development
language [1]. In past research [2], I’d found it acceptable
to take multiple minutes to perform a transformation on a small
Scheme program. The critical requirements for our commercial
tool were quite different:
• Support the most common and useful transformations. Renaming
declarations, replacing a block of code with a call to a
new function, and moving declarations up and down a class
hierarchy were mandatory features.
• Refactor 200,000 line programs. The feature had to work on
real, medium-sized applications. The actual amount of code to
parse was much larger than the program’s size. Most Mac OS X
compilation units pull in headers for common system libraries,
requiring at least another 60,000 to 120,000 lines of code that would
need to be parsed for every compilation unit. Such large sets of
headers are not unique to Mac OS X. C programs using large
libraries like the Qt user interface library would encounter similar
scalability issues.
• Interactive behavior. Xcode’s refactoring feature would be
part of the source code editor. Users will expect transformations
to complete in seconds rather than minutes, and the whole experience
would need to feel interactive [3]. Parsing and analyzing
programs of this size in straightforward ways would result
in an unacceptable user experience. In one of my first experiences
with a similar product, renaming a declaration in a 4,200
line C program (with the previously-mentioned 60,000 lines of
headers) took two minutes.
• Don't force the user to change the program in order to refactor.
The competing product previously mentioned could
provide much more acceptable performance if the user specified
a pre-compiled header—a single header included by all
compilation units. However, converting a large existing project
to use a pre-compiled header is not a trivial task, and the additional
and hidden setup step discourages new users.
• Be aware of use of C's preprocessor. The programs being
manipulated would make common use of preprocessor macros
and conditionally compiled code. If we did not fully address
how the preprocessor affected refactoring, we would at least
need to be aware of the potential issues.
• Reuse existing parsing infrastructure. We realized there
wasn’t sufficient time or resources to write a new parser from
scratch. Analysis would need to be done by an existing
Objective-C parser used for indexing global declarations.
Refactoring had to work best for our third-party developers—primarily
developers writing GUI applications. It should also
work well for developers within Apple, but not for those writing
low-level operating system or device driver code.
Performance and interactivity were key—we wanted to provide
an excellent refactoring experience. In order to meet these performance
and interactivity goals, we attacked three areas: using
different levels of accuracy in different parts of the tool, recognizing
differences between our target programmers and typical C
programmers, and finding ways to exploit delays in the user’s
interaction with the tool.
2. Different Levels of Accuracy
In C, each source file is preprocessed and compiled independently
as a “compilation unit”. Each can include different headers,
or can include the same headers with different inclusion order or
initial macro settings. As a result, each compilation unit may
interpret the same headers different ways, and may parse different
declarations in those same headers. For correct parsing, the compiler
needs to compile every source file independently, read in
header files anew each time, and fully parse all headers.
For small programs, this may not matter, but with Mac OS X,
each source file includes between 60,000 and 120,000 lines of code from
header files. Precompiled headers and other optimizations could
speed compile times, but not all developers use precompiled
headers, nor could we demand that developers use such schemes
in order to use refactoring. Naively parsing all source code was
not acceptable; we saw parse times of around five seconds to
parse a typical set of headers, so five seconds minimum per file
per build configuration would be completely unacceptable.
We realized two facts about programs that made us question
whether we needed compilation-unit-level accuracy. We realized
that although programmers have the opportunity for header files
to be interpreted differently in each compilation unit, most programmers
intend for the headers to be processed the same in all
compilation units. (When header files are not processed uniformly,
it can cause subtle, nasty bugs that can take days to track
down.) We also realized that system header files are not really
part of the project, and not targets for refactoring. We needed to
correctly parse system header files merely for their information on
types and external function declarations. For most refactoring
operations, we didn’t care if the my_integer_t type was 4
bytes long or 8; we just needed to know that the name referred to
a type. We also knew that correct refactoring transformations
shouldn’t change the write-protected system header files.
We thus made two assumptions about headers we parsed. First,
we decided to parse each header file at most once, and would
assume that the files were interpreted the same in each compilation
unit. This meant that we could shorten parsing times from at
least five seconds per file to five seconds total (for all system header
files), plus the additional time to parse only the project's own source
files and headers.
Second, we gathered less position information for system
header files. We knew that changes in system header files were
both incorrect (because we couldn’t change the existing code in
libraries) and uninteresting (because we couldn’t change all other
clients of the header file.) We gathered less exact position information
for such files, and would flag errors if a transformation
would change code in a system header file.
We also realized that the user interface needed information
about the source code to identify whether refactoring was possible
for a given selection, which transformations were possible, and
what the default parameters for the transformation would be.
Because we wanted the user interface to make these suggestions
immediately without waiting for parsing to complete, we used
saved information from the Xcode’s declaration index when helping
the user propose a refactoring transformation. We did have
some issues where indexer information had inaccuracies (when its
less accurate parser misparsed certain constructs), but in general
we found the information good enough for our first release.
3. The Typical Programmer
Dealing with conditional code and multiple build configurations
is another major issue for refactoring and source code analysis
of C programs. We realized that many of the assumptions
about C code did not hold for Objective-C programs, and changed
our expectations of what we would implement.
C’s preprocessor supports conditional code—code only compiled
if certain macros are set. Although some conventions exist
for using conditional directives, the criteria triggering a particular
block of code usually can be understood only by evaluating the
values of the controlling macros at the point the preprocessor
would have interpreted the directive. If source code with conditional
code was refactored without considering all potential conditions,
syntax errors or changed behavior could be introduced.
Others have proposed various solutions for handling conditional
code. Garrido and Johnson expanded the conditional code
to cover entire declarations, and annotated the ASTs to mark the
configurations including each declaration [4]. Vittek suggested
parsing only the feasible sets of configuration macros, parsing
each condition separately, and merging the resulting parse trees
[5]. McCloskey and Brewer proposed a new preprocessor amenable
to analysis and change, with tools to migrate existing programs
to the new preprocessor [6].
We instead chose to parse for a single build configuration—a
single set of macros, compiler flags, and include paths. Parsing a
single build configuration appeared reasonable because Objective-C
programs use the preprocessor less than typical C programs,
because occurrences of conditional code were unlikely to be refactored,
and because remaining uses of conditional code were
insensitive to the refactoring changes.
Ernst’s survey of preprocessor use found that UNIX utilities
varied in their use of preprocessor directives. He found the percentage
of preprocessor directives to total non-comment, non-blank
(NCNB) lines ranged between 4% and 22% [7]. By contrast,
only 3-8% of lines in typical Objective-C programs were
preprocessor directives. (Measurements were made on sources for
the Osirix medical visualization application, Adium multiprotocol
chat client, and Xcode itself.)
Within those Objective-C programs, preprocessor directives
and conditional code also occurred much more frequently in the
code unlikely to be refactored. Many were either in third-party
utility code, or in cross-platform C++ code. The utility code was
often public-domain source code intended for multiple operating
systems. Such code is unlikely to be refactored for fear of complicating
merges of newer versions. For applications designed for
multiple operating systems, often a core C++ library would be the
basis of all versions, and separate user interface code would be
written for each operating system. Because our first release would
not refactor or parse C++ code, such core code would be irrelevant
to refactoring. For the Objective-C portions of the projects,
only 2-4% of all lines were preprocessor directives.
The preprocessor directives that do appear in Objective-C code
are often irrelevant to refactoring. Of Ernst’s eleven categories of
conditional code, many are either unlikely to affect the target
audience, or are irrelevant to refactoring in general. Include
guards are less frequently used in Objective-C because a separate
directive (#import) ensures a file is included only once. Conditional
directives that always disabled code (“#if (0)”) can be handled
in the same way comments are processed. Operating-system-specific
conditional code is unlikely in Objective-C code because
the language is used only on Mac OS X.
However, there are three problematic conditional code directives
that appear in Objective-C programs: code for debugging,
architecture-specific code, and conditional code for enabling and
disabling features in the project.
Conditional code for debugging is unlikely to be troublesome.
The rename transformation will make an incorrect change if a
declaration is referenced in conditional code that is not parsed. If
the condition is parsed, then the conditional code is not a concern.
The most dangerous case occurs when code that needs to be manipulated
exists in two conditionally compiled sections of code
never parsed at the same time. Luckily, most conditional code
controlled by debugging macros only adds code to the debug case, and does not add code to the non-debug case. As long as we parse
the program with debug macros set (which should be the default
during development), then we should parse all necessary code.
Architecture-specific code is more common at Apple because
we support two architectures (x86 and PowerPC), both in 32 and
64 bit versions. Most of the architecture-specific conditional code
is found in low level system code and device drivers. The external
developers we are targeting with refactoring would be working on
application software, and would be unlikely to have
architecture-specific code.
Project-specific features controlled by conditional compilation
directives represent a larger risk. Some of these may actually be in
use (such as code shared between an iPhone and Mac application),
and others may represent dead code. Code may exist on both sides
of a condition. For the first release, we only changed code in the
current build configuration, and relied on the user to be aware of
and avoid changes in project-specific conditional code.
4. Exploiting Interaction Delays
A final area for optimization was deciding when parsing and
refactoring work would begin during actual use. Even with our
previous decisions, parsing speed still wasn’t acceptable. Our
rough numbers were that we could parse all the system header
files in about 5 seconds, and then could parse an additional ten
files a second on a typical machine. Caching the results of the
header file parsing was an obvious solution, but we weren’t sure
we had the time to implement such caching.
A straightforward implementation would start parsing after the
user specified the transformation to be performed, and only show
results when the transformation was complete. We realized we
could speed perceived performance by starting parsing early, and
showing partial results before the transformation completed.
4.1 Optimistically Starting Parsing
It usually takes a few seconds for a programmer to specify a
refactoring transformation. Even for the simple rename, the user
needs to indicate that he wants to rename a declaration, then needs
to type in the new name. For “extract function”, the additional
choices for parameter name and order require additional time.
To improve perceived performance, we began parsing the currently
active file and header files as soon as the programmer had
selected the “refactor” menu item. For refactoring transformations
that only affected a single file, this often meant that as soon
as the user specified the parameters for refactoring, the parsing
had already been completed, and the transformation would be
ready immediately.
4.2 Showing Partial Results
When performing transformations changing multiple files, we
similarly exploited how programmers would interact with the
refactoring tool. We knew that most programmers beginning to
use refactoring might want to examine the changes being made to
double-check that the transformation was correct. If we assumed
that most transformations would be successful (because the programmer
was unlikely to try a transformation they thought would
break their code), then we could begin showing partial results
immediately rather than waiting for the entire transformation to be
complete and validated to be safe.
Most descriptions of refactoring break each transformation into
two parts: the pre-conditions (which indicate the requirements that
must be met before a transformation may be performed) and the
changes to the source code (which are only performed after the
change is believed safe [8]). Because parse times are liable to be
longer than a few seconds, the “check, then perform” approach
would not have been interactive. The user would have to wait
until all source code was parsed and all refactoring complete before
examining any results. Similarly, parse trees for all functions
would need to be generated before any refactoring work could
begin. If the project being manipulated was particularly large, then
the parse trees could consume huge amounts of memory.
To make refactoring more palatable on large projects, we designed
our transformations to work in several phases so that
changes could be shown after only some of the code
had been parsed and portions of the transformation performed
(see Figure 1). We could also dispose of some parse trees as soon
as that code had been analyzed. The seven phases for our transformations
are as follows (a sketch of a driver that interleaves them appears after the list):
• check user input: precondition checks that could be done with
the initial inputs to the transformation only.
• check first file: precondition checks to do after the file containing
the selection is parsed. Generally, the analysis performed in
this phase only performs initial sanity checks requiring parse
trees. For the rename transformation, the phase checks that the
declaration can be renamed, if the name is a valid C identifier,
and if the declaration is not in a system header file.
• perform first file: apply any changes that can be determined
after the first file is parsed. Few transformations do work in this
phase.
• check per-file: precondition checks to do after parsing each
compilation unit.
• perform per-file: changes to apply after parsing each compilation
unit. Most transformations do the bulk of their work in the
per-file category. The check and perform parts both look at
newly found uses of relevant declarations, and make appropriate
changes. Each transformation specifies if the memory for
parsed representations of function bodies can be freed before
beginning the next file.
• check final: precondition checks to do after parsing all files.
The after-parsing checks tend to involve existence or non-existence
tests—whether any situations exist that indicate the transformation is
unsafe, such as “did we ever see any declarations with this name
already?” Some of these checks could be done incrementally as each
file is parsed.
Figure 1: Order of processing of an interleaved refactoring transformation on
three source files a.c, b.c, and c.c. Results of the transformation are incrementally
updated after each perform phase is complete.
• perform final: changes to apply after parsing all files. The
perform final phase is typically used for edits that cannot be
constructed until all sources have been parsed. For example,
when converting references to a structure’s field to call getter or
setter functions, the transformation needs to determine where to
place the new accessor functions. The accessors need to be
placed in a source file (rather than a header), preferably near
existing references to the field or the definition of the structure.
Typically, the transformation can place the functions as soon as
a likely location is found. If no appropriate location for the new
code is found in any source file, the perform final phase
chooses an arbitrary location.
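As a sketch only, the driver loop below shows how the seven phases might be interleaved so that edits appear file by file; the Transformation methods, the parse and show_edits hooks, and can_discard_function_bodies are illustrative names, not Xcode's actual interfaces.

def run_interleaved(transformation, first_file, other_files, parse, show_edits):
    """Drive a phased transformation, publishing edits as each file is processed
    instead of waiting for the whole project to be parsed and validated."""
    transformation.check_user_input()

    tree = parse(first_file)
    transformation.check_first_file(tree)
    show_edits(first_file, transformation.perform_first_file(tree))

    for path in other_files:
        tree = parse(path)
        transformation.check_per_file(tree)
        show_edits(path, transformation.perform_per_file(tree))
        if transformation.can_discard_function_bodies():
            tree = None          # allow parse trees to be freed before the next file

    transformation.check_final()                 # e.g. global non-existence tests
    show_edits(None, transformation.perform_final())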
By breaking up each transformation in this way, the user experience
of refactoring becomes more interactive. The refactoring
user interface can show the list of files which must be parsed for a
transformation. As each file is parsed and changes are identified,
the user interface indicates completion and notes the number of
changes in that file. Selecting the filename shows a side-by-side
view of the source before and after the change. As the transformation
progresses, more files and edits are displayed. The user can
examine proposed changes as soon as each file is processed.
While examining the changes, the user can also choose not to
include some changes, or can make additional edits to the changed
source code. In this way, the user can both measure progress and
can be working productively as the transformation progresses.
The interleaved transformation approach has the risk of declaring
a transformation unsafe after the user has already examined
some changes. This turned out not to be a problem in actual use.
Programmers weren’t bothered by the delayed negative answer.
We also found very few transformations where we could outright
refuse to do a transformation. We might warn the result is incorrect,
but we found programmers often wanted the chance to apply
those incorrect changes and then fix remaining problems with
straight edits.
5. Conclusions
Overall, our progress on refactoring matched effort described
on similar projects. Our first prototype was completed in three
months by one person, and our first release required two years and
three people. We found the transformations tended to be easy to
write. Most of our parsing effort focused on scalability: keeping
parse times and memory use low, and making sure it
worked well inside the IDE. We also found that implementing a
polished user interface took the majority of the overall effort, with
two of the engineers working full time on refactoring workflow
and on making the file comparison view as polished as possible.
With the trade-offs described here, we met our performance
goals. Our goal at the beginning of the project was to permit refactoring
on 200,000 line projects, and be able to rename a
frequently-referenced declaration within 30 seconds. On a 2.2
GHz Dual Xeon PowerMac with 1 GB of memory, we renamed
declarations in a 270,000 line Objective-C project. We found we
could rename a class referenced in 382 places through 123 files in
28 seconds. We could rename a class used in 65 files in 15 seconds.
Operations involving only a single file took around 8 seconds;
this time was irrespective of the source file because parsing
the headers dominated. Most transformations only require parsing
a small subset of source files in a project. However, one of the
transformations searches all code for iterators that can be converted
to use a new language feature. Parsing the entire 270,000
line project for this transformation takes around 90 seconds. This
is not acceptable for the interactive transformations, but is adequate
for an infrequently run transformation that changes all
source files. The refactoring feature as described shipped as part
of Xcode 3.0 and Mac OS X 10.5.
Building software development tools in industry requires making
tradeoffs in both requirements and design. Some are driven
by the expected needs of users such as the size of programs to be
refactored, or response times expected. Some are driven by scalability
issues such as whether to save pre-processed header files in
the IDE between refactorings, or whether to re-parse headers from
scratch each time. Other tradeoffs occur for business, timing, or
staffing reasons, affecting whether a feature might even be implemented,
or whether a new parser is written from scratch.
As described in this paper, our requirements strongly affected
what we could and did implement. The particular tradeoffs we
made may not appear to be the "right" or "perfect" decision in all
cases, but they are representative of the sorts of decisions that
must be made during the process of commercial development.
Our three themes of trade-offs—identifying where different levels
of accuracy were acceptable, recognizing differences between
"our typical user" and "a typical user", and exploiting delays in
user interaction to improve responsiveness—suggest ways that
other tools can meet their own goals.
Acknowledgements
Thanks to Michael Van De Vanter and Todd Fernandez for their
feedback on a previous version of this paper. Dave Payne originally
suggested applying the transformations file-by-file. Andrew
Pontious and Yuji Akimoto implemented the refactoring user interface,
and kept us focused on an interactive experience.
Our approach for incrementally showing refactoring results is
also described in U.S. Patent Application 20080052684, “ Stepwise
source code refactoring”.
References
[1] Apple, "Apple Developer Documentation: Objective-C Programming
Language," Cupertino, CA 2007.
[2] R. W. Bowdidge and W. G. Griswold, "Supporting the Restructuring
of Data Abstractions through Manipulation of a Program
Visualization," ACM Transactions on Software Engineering
and Methodology, vol. 7(2), 1998.
[3] D. Bäumer, E. Gamma, and A. Kiezun, "Integrating refactoring
support into a Java development tool," in OOPSLA 2001
Companion, 2001.
[4] A. Garrido and R. Johnson, "Analyzing Multiple Configurations
of a C Program," in 21st IEEE International Conference on
Software Maintenance (ICSM), 2005.
[5] M. Vittek, "Refactoring Browser with Preprocessor," in 7th
European Conference on Software Maintenance and Reengineering,
Benevento, Italy, 2003.
[6] B. McCloskey and E. Brewer, "ASTEC: a new approach to
refactoring C," in 13th ACM SIGSOFT international symposium
on Foundations of Software Engineering ESEC/FSE-13, 2005.
[7] M. D. Ernst, G. J. Badros, and D. Notkin, "An Empirical
Analysis of C Preprocessor Use," IEEE Transactions on Software
Engineering, vol. 28, pp. 1146-1170, December 2002.
[8] W. F. Opdyke, "Refactoring: A Program Restructuring Aid
in Designing Object-Oriented Application Frameworks," University
of Illinois, Urbana-Champaign, 1991.
Linear-Space Computation of the Edit-Distance
between a String and a Finite Automaton
Cyril Allauzen1 and Mehryar Mohri2,1
1 Google Research
76 Ninth Avenue, New York, NY 10011, US.
2 Courant Institute of Mathematical Sciences
251 Mercer Street, New York, NY 10012, US.
Abstract. The problem of computing the edit-distance between a string
and a finite automaton arises in a variety of applications in computational
biology, text processing, and speech recognition. This paper presents
linear-space algorithms for computing the edit-distance between a string
and an arbitrary weighted automaton over the tropical semiring, or an
unambiguous weighted automaton over an arbitrary semiring. It also
gives an efficient linear-space algorithm for finding an optimal alignment
of a string and such a weighted automaton.
1 Introduction
The problem of computing the edit-distance between a string and a finite automaton
arises in a variety of applications in computational biology, text processing,
and speech recognition [8, 10, 18, 21, 14]. This may be to compute the
edit-distance between a protein sequence and a family of protein sequences compactly
represented by a finite automaton [8, 10, 21], or to compute the error rate
of a word lattice output by a speech recognition system with respect to a reference
transcription [14]. A word lattice is a weighted automaton, which further motivates
the need for computing the edit-distance between a string and a weighted
automaton. In all these cases, an optimal alignment is also typically sought. In
computational biology, this may be to infer the function and various properties
of the original protein sequence from the one it is best aligned with. In speech
recognition, this determines the best transcription hypothesis contained in the
lattice.
This paper presents linear-space algorithms for computing the edit-distance
between a string and an arbitrary weighted automaton over the tropical semiring,
or an unambiguous weighted automaton over an arbitrary semiring. It also gives
an efficient linear-space algorithm for finding an optimal alignment of a string
and such a weighted automaton. Our linear-space algorithms are obtained by
using the same generic shortest-distance algorithm but by carefully defining different
queue disciplines. More precisely, our meta-queue disciplines are derived
in the same way from an underlying queue discipline defined over states with the
same level.
The connection between the edit-distance and the shortest distance in a
directed graph was made very early on (see [10, 4–6] for a survey of string algorithms).
This paper revisits some of these algorithms and shows that they are all
special instances of the same generic shortest-distance algorithm using different
queue disciplines. We also show that the linear-space algorithms all correspond
to using the same meta-queue discipline using different underlying queues. Our
approach thus provides a better understanding of these classical algorithms and
makes it possible to easily generalize them, in particular to weighted automata.
The first algorithm to compute the edit-distance between a string x and
a finite automaton A as well as their alignment was due to Wagner [25] (see
also [26]). Its time complexity was in O(|x||A|Q²) and its space complexity in
O(|A|Q²|Σ| + |x||A|Q), where Σ denotes the alphabet and |A|Q the number of
states of A. Sankoff and Kruskal [23] pointed out that the time and space complexity
O(|x||A|) can be achieved when the automaton A is acyclic. Myers and
Miller [17] significantly improved on previous results. They showed that when A
is acyclic or when it is a Thompson automaton, that is an automaton obtained
from a regular expression using Thompson’s construction [24], the edit-distance
between x and A can be computed in O(|x||A|) time and O(|x|+|A|) space. They
also showed, using a technique due to Hirschberg [11], that the optimal alignment
between x and A can be obtained in O(|x| + |A|) space, and in O(|x||A|) time if
A is acyclic, and in O(|x||A| log |x|) time when A is a Thompson automaton.
The remainder of the paper is organized as follows. Section 2 introduces the
definition of semirings, and weighted automata and transducers. In Section 3,
we give a formal definition of the edit-distance between a string and a finite
automaton, or a weighted automaton. Section 4 presents our linear-space algorithms,
including the proof of their space and time complexity and a discussion
of an improvement of the time complexity for automata with some favorable
graph structure property.
2 Preliminaries
This section gives the standard definition and specifies the notation used for
weighted transducers and automata which we use in our computation of the
edit-distance.
Finite-state transducers are finite automata [20] in which each transition is
augmented with an output label in addition to the familiar input label [2, 9].
Output labels are concatenated along a path to form an output sequence and
similarly input labels define an input sequence. Weighted transducers are finitestate
transducers in which each transition carries some weight in addition to the
input and output labels [22, 12]. Similarly, weighted automata are finite automata
in which each transition carries some weight in addition to the input label. A
path from an initial state to a final state is called an accepting path. A weighted
transducer or weighted automaton is said to be unambiguous if it admits no two
accepting paths with the same input sequence.
The weights are elements of a semiring (K, ⊕, ⊗, 0, 1), that is a ring that
may lack negation [12]. Some familiar semirings are the tropical semiring (R+ ∪
{∞}, min, +, ∞, 0) and the probability semiring (R+ ∪ {∞}, +, ×, 0, 1), where
R+ denotes the set of non-negative real numbers. In the following, we will only
consider weighted automata and transducers over the tropical semiring. However,
all the results of section 4.2 hold for an unambiguous weighted automaton A over
an arbitrary semiring.
The following gives a formal definition of weighted transducers.
Definition 1. A weighted finite-state transducer T over the tropical semiring
(R+ ∪ {∞}, min, +, ∞, 0) is an 8-tuple T = (Σ, ∆, Q, I, F, E, λ, ρ) where Σ is
the finite input alphabet of the transducer, ∆ its finite output alphabet, Q is a
finite set of states, I ⊆ Q the set of initial states, F ⊆ Q the set of final states,
E ⊆ Q × (Σ ∪ {ǫ}) × (∆ ∪ {ǫ}) × (R+ ∪ {∞}) × Q a finite set of transitions,
λ : I → R+ ∪ {∞} the initial weight function, and ρ : F → R+ ∪ {∞} the final
weight function mapping F to R+ ∪ {∞}.
We define the size of T as |T | = |T |Q + |T |E where |T |Q = |Q| is the number of
states and |T |E = |E| the number of transitions of T .
The weight of a path π in T is obtained by summing the weights of its
constituent transitions and is denoted by w[π]. The weight of a pair of input and
output strings (x, y) is obtained by taking the minimum of the weights of the
paths labeled with (x, y) from an initial state to a final state.
For a path π, we denote by p[π] its origin state and by n[π] its destination
state. We also denote by P(I, x, y, F) the set of paths from the initial states I
to the final states F labeled with input string x and output string y. The weight
T (x, y) associated by T to a pair of strings (x, y) is defined by:
T(x, y) = min_{π ∈ P(I,x,y,F)} λ(p[π]) + w[π] + ρ[n[π]].   (1)
Figure 1(a) shows an example of weighted transducer over the tropical semiring.
Weighted automata can be defined as weighted transducers A with identical
input and output labels, for any transition. Thus, only pairs of the form (x, x)
can have a non-zero weight by A, which is why the weight associated by A to
(x, x) is abusively denoted by A(x) and identified with the weight associated by
A to x. Similarly, in the graph representation of weighted automata, the output
(or input) label is omitted. Figure 1(b) shows an example.
3 Edit-distance
We first give the definition of the edit-distance between a string and a finite
automaton.
Let Σ be a finite alphabet, and let Ω be defined by Ω = (Σ ∪ {ǫ}) ×
(Σ ∪ {ǫ}) − {(ǫ, ǫ)}. An element of Ω can be seen as a symbol edit operation:
(a, ǫ) is a deletion, (ǫ, a) an insertion, and (a, b) with a ≠ b a substitution.
Fig. 1. (a) Example of a weighted transducer T. (b) Example of a weighted automaton
A. T(aab, bba) = A(aab) = min(.1 + .2 + .6 + .8, .2 + .4 + .5 + .8). A bold circle indicates
an initial state and a double-circle a final state. The final weight ρ[q] of a final state q
is indicated after the slash symbol representing q.
We will denote by h the natural morphism between Ω∗ and Σ∗ × Σ∗ defined by
h((a1, b1) · · · (an, bn)) = (a1 · · · an, b1 · · · bn). An alignment ω between two strings
x and y is an element of Ω∗
such that h(ω) = (x, y).
Let c : Ω → R+ be a function associating a non-negative cost to each edit operation.
The cost of an alignment ω = ω1 · · · ωn is defined as c(ω) = Σ_{i=1}^{n} c(ωi).
Definition 2. The edit-distance d(x, y) of two strings x and y is the minimal
cost of a sequence of symbol insertions, deletions or substitutions transforming
one string into the other:
d(x, y) = min_{h(ω)=(x,y)} c(ω).   (2)
When c is the function defined by c(a, a) = 0 and c(a, ǫ) = c(ǫ, a) = c(a, b) =
1 for all a, b in Σ such that a ≠ b, the edit-distance is also known as the
Levenshtein distance. The edit-distance d(x, A) between a string x and a finite
automaton A can then be defined as
d(x, A) = min_{y ∈ L(A)} d(x, y),   (3)
where L(A) denotes the regular language accepted by A. The edit-distance d(x, A)
between a string x and a weighted automaton A over the tropical semiring is
defined as:
d(x, A) = min_{y ∈ Σ∗} (A(y) + d(x, y)).   (4)
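When both arguments are strings, definition (2) with unit costs reduces to the classical Levenshtein dynamic program; the short Python sketch below is included only to fix intuition before moving to the automaton case, and is not part of the algorithms of Section 4.

def levenshtein(x, y):
    """Edit-distance of Definition 2 with unit insertion, deletion and
    substitution costs (the Levenshtein distance)."""
    prev = list(range(len(y) + 1))        # d("", y[:j]) = j insertions
    for i, a in enumerate(x, start=1):
        cur = [i]                          # d(x[:i], "") = i deletions
        for j, b in enumerate(y, start=1):
            cur.append(min(prev[j] + 1,                  # delete a
                           cur[j - 1] + 1,               # insert b
                           prev[j - 1] + (a != b)))      # substitute a by b
        prev = cur
    return prev[-1]

assert levenshtein("aba", "ab") == 1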
4 Algorithms
In this section, we present linear-space algorithms both for computing the edit-distance
d(x, A) between an arbitrary string x and an automaton A, and an
optimal alignment between x and A, that is an alignment ω such that c(ω) =
d(x, A).
We first briefly describe two general algorithms that we will use as subroutines.
4.1 General algorithms
Composition. The composition of two weighted transducers T1 and T2 over the
tropical semiring with matching input and output alphabets Σ, is a weighted
transducer denoted by T1 ◦ T2 defined by:
(T1 ◦ T2)(x, y) = min_{z ∈ Σ∗} T1(x, z) + T2(z, y).   (5)
T1 ◦ T2 can be computed from T1 and T2 using the composition algorithm for
weighted transducers [19, 15]. States in the composition T1 ◦ T2 are identified
with pairs of a state of T1 and a state of T2. In the absence of transitions with
ǫ inputs or outputs, the transitions of T1 ◦ T2 are obtained as a result of the
following matching operation applied to the transitions of T1 and T2:
(q1, a, b, w1, q1′) and (q2, b, c, w2, q2′) → ((q1, q2), a, c, w1 + w2, (q1′, q2′)).   (6)
A state (q1, q2) of T1 ◦T2 is initial (resp. final) iff q1 and q2 are initial (resp. final)
and, when it is initial (resp. final), its initial (resp. final) weight is the sum of the initial (resp.
final) weights of q1 and q2. In the worst case, all transitions of T1 leaving a state
q1 match all those of T2 leaving state q2, thus the space and time complexity of
composition is quadratic, that is O(|T1||T2|).
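A minimal sketch of the ǫ-free matching rule (6), with transitions represented as (source, input, output, weight, destination) tuples and weights combined by + as in the tropical semiring; a practical implementation would only construct states reachable from the initial pairs rather than enumerating all transition pairs as done here.

def compose_transitions(E1, E2):
    """Apply rule (6) to every matching pair of transitions of T1 and T2."""
    result = []
    for (q1, a, b, w1, q1_next) in E1:
        for (q2, b2, c, w2, q2_next) in E2:
            if b == b2:    # output label of T1 matches input label of T2
                result.append(((q1, q2), a, c, w1 + w2, (q1_next, q2_next)))
    return result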
Shortest distance. Let A be a weighted automaton over the tropical semiring.
The shortest distance from p to q is defined as
d[p, q] = min_{π ∈ P(p,q)} w[π].   (7)
It can be computed using the generic single-source shortest-distance algorithm
of [13], a generalization of the classical shortest-distance algorithms. This generic
shortest-distance algorithm works with an arbitrary queue discipline, that is the
order according to which elements are extracted from a queue. We shall make use
of this key property in our algorithms. The pseudocode of a simplified version
of the generic algorithm for the tropical semiring is given in Figure 2.
The complexity of the algorithm depends on the queue discipline selected for
S. Its general expression is
O(|Q| + C(A) max_{q ∈ Q} N(q) |E| + (C(I) + C(X)) Σ_{q ∈ Q} N(q)),   (8)
where N(q) denotes the number of times state q is extracted from queue S, C(X)
the cost of extracting a state from S, C(I) the cost of inserting a state in S, and
C(A) the cost of an assignment.
With a shortest-first queue discipline implemented using a heap, the algorithm
coincides with Dijkstra's algorithm [7] and its complexity is O((|E| + |Q|) log |Q|).
For an acyclic automaton and with the topological order queue
discipline, the algorithm coincides with the standard linear-time (O(|Q| + |E|))
shortest-distance algorithm [3].

Shortest-Distance(A, s)
  for each p ∈ Q do
      d[p] ← ∞
  d[s] ← 0
  S ← {s}
  while S ≠ ∅ do
      q ← Head(S)
      Dequeue(S)
      for each e ∈ E[q] do
          if d[q] + w[e] < d[n[e]] then
              d[n[e]] ← d[q] + w[e]
              if n[e] ∉ S then
                  Enqueue(S, n[e])

Fig. 2. Pseudocode of the generic shortest-distance algorithm.
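The following Python sketch mirrors the relaxation loop of Figure 2, instantiated with the shortest-first (heap) queue discipline; the adjacency-list encoding of the automaton is an assumption made for the example, and other queue disciplines would only change how states are extracted.

import heapq
from math import inf

def shortest_distance(num_states, edges, source):
    """Generic shortest-distance over the tropical semiring (shortest-first queue).

    `edges[q]` is a list of (weight, next_state) pairs.
    """
    d = [inf] * num_states
    d[source] = 0
    queue = [(0, source)]                     # the queue S, kept as a heap
    while queue:
        dq, q = heapq.heappop(queue)          # Head(S) followed by Dequeue(S)
        if dq > d[q]:
            continue                          # stale entry; q was relaxed again since
        for w, nxt in edges[q]:
            if d[q] + w < d[nxt]:             # relaxation test of Figure 2
                d[nxt] = d[q] + w
                heapq.heappush(queue, (d[nxt], nxt))
    return d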
4.2 Edit-distance algorithms
The edit cost function c can be naturally represented by a one-state weighted
transducer over the tropical semiring Tc = (Σ, Σ, {0}, {0}, {0}, Ec, 1, 1), or T in
the absence of ambiguity, with each transition corresponding to an edit operation:
Ec = {(0, a, b, c(a, b), 0)|(a, b) ∈ Ω}.
Lemma 1. Let A be a weighted automaton over the tropical semiring and let X
be the finite automaton representing a string x. Then, the edit-distance between
x and A is the shortest-distance from the initial state to a final state in the
weighted transducer U = X ◦ T ◦ A.
Proof. Each transition e in T corresponds to an edit operation (i[e], o[e]) ∈ Ω,
and each path π corresponds to an alignment ω between i[π] and o[π]. The cost of
that alignment is, by definition of T , c(ω) = w[π]. Thus, T defines the function:
T(u, v) = min_{ω ∈ Ω*} {c(ω) : h(ω) = (u, v)} = d(u, v), (9)
for any strings u, v in Σ*. Since A is an automaton and x is the only string
accepted by X, it follows from the definition of composition that U(x, y) =
T(x, y) + A(y) = d(x, y) + A(y). The shortest-distance from the initial state to
a final state in U is then:
min_{π ∈ P_U(I,F)} w[π] = min_{y ∈ Σ*} min_{π ∈ P_U(I,x,y,F)} w[π] = min_{y ∈ Σ*} U(x, y) (10)
= min_{y ∈ Σ*} (d(x, y) + A(y)) = d(x, A), (11)
that is, the edit-distance between x and A. ⊓⊔

[Figure 3 omitted: transition diagrams of the automata and transducers described in the caption below.]

Fig. 3. (a) Finite automaton X representing the string x = aba. (b) Finite automaton
A. (c) Edit transducer T over the alphabet {a, b} where the cost of any insertion,
deletion and substitution is 1. (d) Weighted transducer U = X ◦ T ◦ A.
Figure 3 shows an example illustrating Lemma 1. Using the lateral strategy
of the 3-way composition algorithm of [1] or an ad hoc algorithm exploiting the
structure of T, U = X ◦ T ◦ A can be computed in O(|x||A|) time. The shortest-distance
algorithm presented in Section 4.1 can then be used to compute the
shortest distance from an initial state of U to a final state and thus the edit-distance
of x and A. Let us point out that different queue disciplines in the computation
of that shortest distance lead to different algorithms and complexities.
In the next section, we shall give a queue discipline enabling us to achieve a
linear-space complexity.
4.3 Edit-distance computation in linear space
Using the shortest-distance algorithm described in Section 4.1 leads to an algorithm
with space complexity linear in the size of U, i.e. in O(|x||A|). However,
taking advantage of the topology of U, it is possible to design a queue discipline
that leads to a linear space complexity O(|x| + |A|).
We assume that the finite automaton X representing the string x is topologically
sorted. A state q in the composition U = X ◦ T ◦ A can be identified with a
triplet (i, 0, j) where i is a state of X, 0 the unique state of T, and j a state of A.
Since T has a unique state, we further simplify the notation by identifying each
state q with a pair (i, j). For a state q = (i, j) of U, we will refer to i as the level
of q. A key property of the levels is that there is a transition in U from q to q′
iff level(q′) = level(q) or level(q′) = level(q) + 1. Indeed, a transition from (i, j)
to (i′, j′) in U corresponds to taking a transition in X (in that case i′ = i + 1
since X is topologically sorted) or staying at the same state in X and taking an
input-ε transition in T (in that case i′ = i).
From any queue discipline ≺ on the states of U, we can derive a new queue
discipline ≺l over U, defined for all q, q′ in U as follows:
q ≺l q′ iff level(q) < level(q′), or level(q) = level(q′) and q ≺ q′. (12)
Proposition 1. Let ≺ be a queue discipline that requires at most O(|V|) space
to maintain a queue over any set of states V. Then, the edit-distance between x
and A can be computed in linear space, O(|x| + |A|), using the queue discipline
≺l.
Proof. The benefit of the queue discipline ≺l is that when computing the shortest
distance to q = (i, j) in U, only the shortest distances to the states in U of level
i and i − 1 need to be stored in memory. The shortest distances to the states of
level strictly less than i − 1 can be safely discarded. Thus, the space required to
store the shortest distances is in O(|A|_Q).
Similarly, there is no need to store in memory the full transducer U. Instead,
we can keep in memory only the two levels currently active in the shortest-distance
algorithm. This is possible because the computation of the outgoing transitions
of a state with level i only requires knowledge of the states with
level i and i + 1. Therefore, the space used to store the active part of U is
in O(|A|_E + |A|_Q) = O(|A|). Thus, it follows that the space required to compute
the edit-distance of x and A is linear, that is in O(|x| + |A|). ⊓⊔
The time complexity of the algorithm depends on the underlying queue discipline
≺. A natural choice for ≺ is the shortest-first queue discipline, that
is, the queue discipline used in Dijkstra's algorithm. This yields the following
corollary.
Corollary 1. The edit-distance between a string x and an automaton A can
be computed in time O(|x||A| log |A|_Q) and space O(|x| + |A|) using the queue
discipline ≺l.
Proof. A shortest-first queue is maintained for each level and contains at most
|A|_Q states. The cost for the global queue of an insertion, C(I), or an assignment,
C(A), is in O(log |A|_Q) since it corresponds to inserting into or updating one of the
underlying level queues. Since N(q) = 1, the general expression of the complexity
(8) leads to an overall time complexity of O(|x||A| log |A|_Q) for the shortest-distance
algorithm. ⊓⊔
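As a concrete illustration of Lemma 1 and Corollary 1, the sketch below computes d(x, A) for an unweighted automaton A with unit edit costs, exploring the product U = X ◦ T ◦ A implicitly with a shortest-first queue rather than materializing it. The dictionary encoding of A is ours, and the two-level space optimization of Proposition 1 is omitted for brevity.

import heapq

def edit_distance_string_automaton(x, arcs, initial, finals):
    """Edit-distance between string x and an unweighted automaton (unit costs).

    `arcs[q]` is a list of (label, next_state) pairs, `initial` the start state,
    `finals` a set of accepting states.  States of the implicit product are
    pairs (i, j): position i in x and state j of A.
    """
    n = len(x)
    dist = {(0, initial): 0}
    heap = [(0, 0, initial)]
    while heap:
        c, i, j = heapq.heappop(heap)
        if c > dist.get((i, j), float("inf")):
            continue
        if i == n and j in finals:
            return c                              # first accepting state popped is optimal
        moves = []
        if i < n:
            moves.append((1, i + 1, j))           # deletion of x[i]
        for label, k in arcs.get(j, []):
            moves.append((1, i, k))               # insertion of `label`
            if i < n:
                sub = 0 if x[i] == label else 1   # match or substitution
                moves.append((sub, i + 1, k))
        for cost, i2, j2 in moves:
            if c + cost < dist.get((i2, j2), float("inf")):
                dist[(i2, j2)] = c + cost
                heapq.heappush(heap, (c + cost, i2, j2))
    return float("inf")

For example, with arcs = {0: [("a", 0), ("b", 1)], 1: [("a", 0), ("b", 2)], 2: [("a", 0), ("b", 2)]} and finals = {2} (strings ending in "bb"), the call edit_distance_string_automaton("aba", arcs, 0, {2}) returns 1.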
When the automaton A is acyclic, the time complexity can be further improved
by using for ≺ the topological order queue discipline.
Corollary 2. If the automaton A is acyclic, the edit-distance between x and
A can be computed in time O(|x||A|) and space O(|x| + |A|) using the queue
discipline ≺l with the topological order queue discipline for ≺.
Proof. Computing the topological order for U would require O(|U|) space. Instead,
we use the topological order on A, which can be computed in O(|A|),
to define the underlying queue discipline. The order inferred by (12) is then a
topological order on U. ⊓⊔
Myers and Miller [17] showed that when A is a Thompson automaton, the
time complexity can be reduced to O(|x||A|) even when A is not acyclic. This is
possible because of the following observation: in a weighted automaton over the
tropical semiring, there always exists a shortest path that is simple, that is, with
no cycle, since cycle weights cannot decrease the weight of a path.
In general, it is not clear how to take advantage of this observation. However,
a Thompson automaton additionally has the following structural property: its
loop-connectedness is one. The loop-connectedness of A is k if in any depth-first
search of A, a simple path goes through at most k back edges. [17] showed
that this property, combined with the observation made previously, can be used
to improve the time complexity of the algorithm. The results of [17] can be
generalized as follows.
Corollary 3. If the loop-connectedness of A is k, then the edit-distance between
x and A can be computed in O(|x||A|k) time and O(|x| + |A|) space.
Proof. We first use a depth-first search of A to identify back edges and mark them
as such. We then compute the topological order for A, ignoring these back edges.
Our underlying queue discipline ≺ is defined such that a state q = (i, j) is ordered
first based on the number of times it has been enqueued and second based on
the position of j in the topological order ignoring back edges. This underlying queue
can be implemented in O(|A|_Q) space with constant time costs for the insertion,
extraction and update operations. The order ≺l derived from ≺ is then not
topological for a transition e iff e was obtained by matching a back edge in A and
level(p[e]) = level(n[e]). When such a transition e is visited, n[e] is reinserted into
the queue.
When state q is dequeued for the l-th time, the value of d[q] is the weight of
the shortest path from the initial state to q that goes through at most l − 1 back
edges. Thus, the inequality N(q) ≤ k + 1 holds for all q and, since the costs for
managing the queue, C(I), C(A), and C(X), are constant, the time complexity of
the algorithm is in O(|x||A|k). ⊓⊔
4.4 Optimal alignment computation in linear space
The algorithm presented in the previous section can also be used to compute an
optimal alignment by storing a back pointer at each state in U. However, this
can increase the space complexity up to O(|x||A|_Q). The use of back pointers
to compute the best alignment can be avoided by using a technique due to
Hirschberg [11], also used by [16, 17].
As pointed out in previous sections, an optimal alignment between x and A
corresponds to a shortest path in U = X ◦ T ◦ A. We will say that a state q in U
is a midpoint of an optimal alignment between x and A if q belongs to a shortest
path in U and level(q) = ⌊|x|/2⌋.
Lemma 2. Given a pair (x, A), a midpoint of the optimal alignment between x
and A can be computed in O(|x| + |A|) space with a time complexity in O(|x||A|)
if A is acyclic and in O(|x||A| log |A|_Q) otherwise.
Proof. Let us consider U = X ◦ T ◦ A. For a state q in U, let d[q] denote the
shortest distance from the initial state to q, and d^R[q] the shortest distance
from q to a final state. For a given state q = (i, j) in U, d[(i, j)] + d^R[(i, j)] is the
cost of the shortest path going through (i, j). Thus, for any i, the edit-distance
between x and A is d(x, A) = min_j (d[(i, j)] + d^R[(i, j)]).
For a fixed i0, we can compute both d[(i0, j)] and d^R[(i0, j)] for all j in
O(|x||A| log |A|_Q) time (or O(|x||A|) time if A is acyclic) and in linear space
O(|x| + |A|) using the algorithm from the previous section forward and backward
and stopping at level i0 in each case. Running the algorithm backward
(exchanging initial and final states and permuting the origin and destination of
every transition) can be seen as computing the edit-distance between x^R and
A^R, the mirror images of x and A.
Let us now set i0 = ⌊|x|/2⌋ and j0 = argmin_j (d[(i0, j)] + d^R[(i0, j)]). It
then follows that (i0, j0) is a midpoint of the optimal alignment. Hence, for a
pair (x, A), the running-time complexity of determining the midpoint of the
alignment is in O(|x||A|) if A is acyclic and O(|x||A| log |A|_Q) otherwise. ⊓⊔
The algorithm proceeds recursively by first determining the midpoint of the
optimal alignment. At step 0 of the recursion, we first find the midpoint (i0, j0)
between x and A. Let x^1 and x^2 be such that x = x^1 x^2 and |x^1| = i0, and let
A^1 and A^2 be the automata obtained from A by respectively changing the final
state to j0 in A^1 and the initial state to j0 in A^2. We can now recursively find
the alignment between x^1 and A^1 and between x^2 and A^2.
Theorem 1. An optimal alignment between a string x and an automaton A can
be computed in linear space O(|x| + |A|) and in time O(|x||A|) if A is acyclic,
O(|x||A| log |x| log |A|_Q) otherwise.
Proof. We can assume without loss of generality that the length of x is a
power of 2. At step k of the recursion, we need to compute the midpoints
for 2^k string-automaton pairs (x^i_k, A^i_k), 1 ≤ i ≤ 2^k. Thus, the complexity of step k
is in O(Σ_{i=1}^{2^k} |x^i_k| |A^i_k| log |A^i_k|_Q) = O((|x|/2^k) Σ_{i=1}^{2^k} |A^i_k| log |A^i_k|_Q) since |x^i_k| = |x|/2^k
for all i. When A is acyclic, the log factor can be avoided and the equality
Σ_{i=1}^{2^k} |A^i_k| = O(|A|) holds, thus the time complexity of step k is in O(|x||A|/2^k).
In the general case, each |A^i_k| can be of the order of |A|, thus the complexity of
step k is in O(|x||A| log |A|_Q).
Since there are at most log |x| steps in the recursion, this leads to an overall
time complexity in O(|x||A|) if A is acyclic and O(|x||A| log |A|_Q log |x|) in general.
⊓⊔
When the loop-connectedness of A is k, the time complexity can be improved to
O(k|x||A| log |x|) in the general case.
5 Conclusion
We presented general algorithms for computing in linear space both the edit-distance
between a string and a finite automaton and their optimal alignment.
Our algorithms are conceptually simple and make use of existing generic algorithms.
Our results further provide a better understanding of previous algorithms
for more restricted automata by relating them to shortest-distance algorithms
and general queue disciplines.
References
1. C. Allauzen and M. Mohri. 3-way composition of weighted finite-state transducers.
In O. Ibarra and B. Ravikumar, editors, Proceedings of CIAA 2008, volume
5148 of Lecture Notes in Computer Science, pages 262–273. Springer-Verlag Berlin
Heidelberg, 2008.
2. J. Berstel. Transductions and Context-Free Languages. Teubner Studienbucher:
Stuttgart, 1979.
3. T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. The MIT
Press: Cambridge, MA, 1992.
4. M. Crochemore, C. Hancart, and T. Lecroq. Algorithms on Strings. Cambridge
University Press, 2007.
5. M. Crochemore and W. Rytter. Text Algorithms. Oxford University Press, 1994.
6. M. Crochemore and W. Rytter. Jewels of Stringology. World Scientific, 2002.
7. E. W. Dijkstra. A note on two problems in connexion with graphs. Numerische
Mathematik, 1:269–271, 1959.
8. R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis:
Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press,
Cambridge, UK, 1998.
9. S. Eilenberg. Automata, Languages and Machines, volume A–B. Academic Press,
1974–1976.
10. D. Gusfield. Algorithms on Strings, Trees and Sequences. Cambridge University
Press, Cambridge, UK, 1997.
11. D. S. Hirschberg. A linear space algorithm for computing maximal common subsequences.
Communications of the ACM, 18(6):341–343, June 1975.
12. W. Kuich and A. Salomaa. Semirings, Automata, Languages. Number 5 in EATCS
Monographs on Theoretical Computer Science. Springer-Verlag, 1986.
13. M. Mohri. Semiring frameworks and algorithms for shortest-distance problems.
Journal of Automata, Languages and Combinatorics, 7(3):321–350, 2002.
14. M. Mohri. Edit-distance of weighted automata: General definitions and algorithms.
International Journal of Foundations of Computer Science, 14(6):957–982, 2003.
15. M. Mohri, F. C. N. Pereira, and M. Riley. Weighted automata in text and speech
processing. In Proceedings of the 12th biennial European Conference on Artificial
Intelligence (ECAI-96), Workshop on Extended finite state models of language,
Budapest, Hungary. John Wiley and Sons, Chichester, 1996.
16. E. W. Myers and W. Miller. Optimal alignments in linear space. CABIOS, 4(1):11–
17, 1988.
17. E. W. Myers and W. Miller. Approximate matching of regular expressions. Bulletin
of Mathematical Biology, 51(1):5–37, 1989.
18. G. Navarro and M. Raffinot. Flexible pattern matching. Cambridge University
Press, 2002.
19. F. Pereira and M. Riley. Finite State Language Processing, chapter Speech Recognition
by Composition of Weighted Finite Automata. The MIT Press, 1997.
20. D. Perrin. Finite automata. In J. V. Leuwen, editor, Handbook of Theoretical
Computer Science, Volume B: Formal Models and Semantics, pages 1–57. Elsevier,
Amsterdam, 1990.
21. P. A. Pevzner. Computational Molecular Biology: an Algorithmic Approach. MIT
Press, 2000.
22. A. Salomaa and M. Soittola. Automata-Theoretic Aspects of Formal Power Series.
Springer-Verlag, 1978.
23. D. Sankoff and J. B. Kruskal. Time Warps, String Edits and Macromolecules:
The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, MA,
1983.
24. K. Thompson. Regular expression search algorithm. Communications of the ACM,
11(6):365–375, 1968.
25. R. A. Wagner. Order-n correction for regular languages. Communications of the
ACM, 17(5):265–268, May 1974.
26. R. A. Wagner and J. I. Seiferas. Correcting counter-automaton-recognizable languages.
SIAM Journal on Computing, 7(3):357–375, August 1978.
JMLR: Workshop and Conference Proceedings vol 23 (2012) 44.1–44.3 25th Annual Conference on Learning Theory
Open Problem:
Better Bounds for Online Logistic Regression
H. Brendan McMahan MCMAHAN@GOOGLE.COM
Google Inc., Seattle, WA
Matthew Streeter MSTREETER@GOOGLE.COM
Google Inc., Pittsburgh, PA
Editor: Shie Mannor, Nathan Srebro, Robert C. Williamson
Abstract
Known algorithms applied to online logistic regression on a feasible set of L2 diameter D achieve
regret bounds like O(e^D log T) in one dimension, but we show a bound of O(√D + log T) is
possible in a binary 1-dimensional problem. Thus, we pose the following question: Is it possible
to achieve a regret bound for online logistic regression that is O(poly(D) log(T))? Even if this is
not possible in general, it would be interesting to have a bound that reduces to our bound in the
one-dimensional case.
Keywords: online convex optimization, online learning, regret bounds
1. Introduction and Problem Statement
Online logistic regression is an important problem, with applications like click-through-rate prediction
for web advertising and estimating the probability that an email message is spam. We formalize
the problem as follows: on each round t the adversary selects an example (xt, yt) ∈ R^n × {−1, 1},
the algorithm chooses model coefficients wt ∈ R^n, and then incurs loss
ℓ(wt; xt, yt) = log(1 + exp(−yt wt · xt)), (1)
the negative log-likelihood of the example under a logistic model. For simplicity we assume
‖xt‖2 ≤ 1 so that any gradient satisfies ‖∇ℓ(wt)‖2 ≤ 1. While conceptually any w ∈ R^n could be used as
model parameters, for regret bounds we consider competing with a feasible set W = {w : ‖w‖2 ≤ D/2},
the L2 ball of diameter D centered at the origin.
Existing algorithms for online convex optimization can immediately be applied. First-order
algorithms like online gradient descent (Zinkevich, 2003) achieve bounds like O(D√T). On a
bounded feasible set the logistic loss (Eq. (1)) is exp-concave, and so we can use second-order algorithms
like Follow-The-Approximate-Leader (FTAL), which has a general bound of O((1/α + GD) n log T)
(Hazan et al., 2007) when the loss functions are α-exp-concave on the feasible set; we
have α = e^{−D/2} for the logistic loss (see Appendix A), which leads to a bound of O((exp(D) + D) n log T)
in the general case, or O(exp(D) log T) in the one-dimensional case. The exponential
dependence on the diameter of the feasible set can make this bound worse than the O(D√T)
bounds for practical problems where the post-hoc optimal probability can be close to zero or one.
We suggest that better bounds may be possible. In the next section, we show that a simple
Follow-The-Regularized-Leader (FTRL) algorithm can achieve a much better result, namely
O(√D + log T), for one-dimensional problems where the adversary is further constrained¹ to pick
xt ∈ {−1, 0, +1}. A single mis-prediction can cost about D/2, and so the additive dependence on
the diameter of the feasible set is less than the cost of one mistake. The open question is whether
such a bound is achievable for problems of arbitrary finite dimension n. Even the general one-dimensional
case, where xt ∈ [−1, 1], is not obvious.
1. Constraining the adversary in this way is reasonable in many applications. For example, re-scaling each xt so
‖xt‖2 = 1 is a common pre-processing step, and many problems also are naturally featurized by xt,i ∈ {0, 1},
where xt,i = 1 indicates some property i is present on the t'th example.
2. Analysis in One Dimension
We analyze an FTRL algorithm. We can ignore any rounds when xt = 0, and then since only the
sign of yt xt matters, we assume xt = 1 and the adversary picks yt ∈ {−1, 1}. The cumulative loss
function on P positive examples and N negative examples is
c(w; N, P) = P log(1 + exp(−w)) + N log(1 + exp(w)).
Let Nt denote the number of negative examples seen through the t'th round, with Pt the corresponding
number of positive examples. We play FTRL, with
wt+1 = argmin_w c(w; Nt + λ, Pt + λ),
for a constant λ > 0. This is just FTRL with a regularization function r(w) = c(w; λ, λ). Using the
FTRL lemma (e.g., McMahan and Streeter (2010, Lemma 1)), we have
Regret ≤ r(w*) + Σ_{t=1}^{T} (ft(wt) − ft(wt+1)),
where ft(w) = ℓ(w; xt, yt).
It is easy to verify that r(w) ≤ λ(|w| + 2 log 2). It remains to bound ft(wt) − ft(wt+1). Fix
a round t. For compactness, we write N = Nt−1 and P = Pt−1. Suppose that yt = −1, so
Nt = N + 1 and Pt = P (the case when yt = +1 is analogous). Since ft is convex, by definition
ft(w) ≥ ft(wt) + gt(w − wt) where gt = ∇ft(wt). Taking w = wt+1 and re-arranging, we have
ft(wt) − ft(wt+1) ≤ gt(wt − wt+1) ≤ |gt||wt − wt+1|.
It is easy to verify that |gt| ≤ 1, and also that
wt = log((P + λ)/(N + λ)).
Since yt = −1, wt+1 < wt, and so
|wt − wt+1| = log((P + λ)/(N + λ)) − log((P + λ)/(N + 1 + λ)) = log(N + 1 + λ) − log(N + λ)
= log(1 + 1/(N + λ)) ≤ 1/(N + λ).
Thus, if we let T⁻ = {t : yt = −1}, we have
Σ_{t ∈ T⁻} (ft(wt) − ft(wt+1)) ≤ Σ_{N=0}^{N_T} 1/(N + λ) ≤ 1/λ + Σ_{N=1}^{N_T} 1/N ≤ 1/λ + log(N_T) + 1.
Applying a similar argument to rounds with positive labels and summing over the rounds with
positive and negative labels independently gives
Regret ≤ λ(|w*| + 2 log 2) + log(P_T) + log(N_T) + 2/λ + 2.
Note log(P_T) + log(N_T) ≤ 2 log T. We wish to compete with w* where |w*| ≤ D/2, so we can
choose λ = 1/√(D/2), which gives
Regret ≤ O(√D + log T).
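A minimal sketch of the algorithm analyzed above, assuming the adversary picks xt ∈ {−1, 0, +1} (rounds with xt = 0 dropped). The function name and the returned cumulative loss are ours; the update uses the closed form wt+1 = log((Pt + λ)/(Nt + λ)) of the FTRL minimizer, and λ is set as in the analysis.

import math

def run_ftrl_1d(labels, D):
    """FTRL sketch for the 1-d logistic problem with x_t in {-1, +1}.

    `labels` is the sequence of y_t in {-1, +1}; D > 0 is the feasible-set
    diameter.  Returns the cumulative logistic loss of the played iterates.
    """
    lam = 1.0 / math.sqrt(D / 2.0)       # lambda from the analysis above
    pos = neg = 0                        # running counts P_t and N_t
    total_loss = 0.0
    for y in labels:
        w = math.log((pos + lam) / (neg + lam))      # play w_t before seeing y_t
        total_loss += math.log(1.0 + math.exp(-y * w))
        if y > 0:
            pos += 1
        else:
            neg += 1
    return total_loss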
References
Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex
optimization. Mach. Learn., 69, December 2007.
H. Brendan McMahan and Matthew Streeter. Adaptive bound optimization for online convex optimization.
In COLT, 2010.
Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In
ICML, 2003.
Appendix A. The Exp-Concavity of the Logistic Loss
Theorem 1 The logistic loss function ℓ(wt; xt, yt) = log(1 + exp(−yt wt · xt)), from Eq. (1), is
α-exp-concave with α = exp(−D/2) over the set W = {w : ‖w‖2 ≤ D/2} when ‖xt‖2 ≤ 1 and
yt ∈ {−1, 1}.
Proof Recall that a function ℓ is α-exp-concave if ∇² exp(−α ℓ(w)) ⪯ 0. When ℓ(w) = g(w · x) for
x ∈ R^n, we have ∇² exp(−α ℓ(w)) = f''(z) x x^⊤ with z = w · x, where f(z) = exp(−α g(z)). For the logistic
loss, we have g(z) = log(1 + exp(z)) (without loss of generality, we consider a negative example),
and so f(z) = (1 + exp(z))^{−α}. Then,
f''(z) = α e^z (1 + e^z)^{−α−2} (α e^z − 1).
We need the largest α such that f''(z) ≤ 0, given a fixed z. We can see by inspection that α = 0
is a zero. Since e^z (1 + e^z)^{−α−2} > 0, from the term (α e^z − 1) we conclude that α = e^{−z} is the largest
value of α for which f''(z) ≤ 0. Note that z = wt · xt, and so |z| ≤ D/2 since ‖xt‖2 ≤ 1; taking
the worst case over wt ∈ W and xt with ‖xt‖2 ≤ 1, we have α = exp(−D/2).
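A quick numeric sanity check of the sign condition just derived; the value of D, the grid, and the tolerance are arbitrary choices for the example.

import math

def f_second_derivative(z, alpha):
    """f''(z) for f(z) = (1 + e^z)^(-alpha), as derived above."""
    return alpha * math.exp(z) * (1.0 + math.exp(z)) ** (-alpha - 2) * (alpha * math.exp(z) - 1.0)

D = 4.0
alpha = math.exp(-D / 2.0)
# f'' should be non-positive for every |z| <= D/2 when alpha = exp(-D/2).
assert all(f_second_derivative(z / 100.0, alpha) <= 1e-12
           for z in range(-int(50 * D), int(50 * D) + 1))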
Online Microsurveys for User
Experience Research
Abstract
This case study presents a critical analysis of
microsurveys as a method for conducting user
experience research. We focus specifically on Google
Consumer Surveys (GCS) and analyze a combination of
log data and GCSs run by the authors to investigate
how they are used, who the respondents are, and the
quality of the data. We find that such microsurveys can
be a great way to quickly and cheaply gather large
amounts of survey data, but that there are pitfalls that
user experience researchers should be aware of when
using the method.
Author Keywords
Microsurveys; user experience research; user research
methods
ACM Classification Keywords
H.5.2. User Interfaces: Theory and methods.
Introduction
To keep up with fast paced design and development
teams, user researchers must develop a toolkit of
methods to quickly and efficiently address research
questions. One such method is the microsurvey, or a
short survey of only one to three questions. There are
several commercial microsurveys—including Google
Consumer Surveys (GCS), SlimSurveys, and Survata—
David Huffaker
Google Inc.
Mountain View, CA 94043
huffaker@google.com
Gueorgi Kossinets
Google Inc.
Mountain View, CA 94043
gkossinets@google.com
Kerwell Liao
Google Inc.
Mountain View, CA 94043
kerwell@google.com
Paul McDonald
Google Inc.
Mountain View, CA 94043
pmcdonald@google.com
Aaron Sedley
Google Inc.
Mountain View, CA 94043
asedley@google.com
Victoria Schwanda Sosik
Cornell University
301 College Ave.
Ithaca, NY 14850
vsosik@cs.cornell.edu
Elie Bursztein
Google Inc.
Mountain View, CA 94043
elieb@google.com
Sunny Consolvo
Google Inc.
Mountain View, CA 94043
sconsolvo@google.com
that promise to provide people with large amounts of data quickly and at a
relatively low cost. In this case study, we present a critical analysis of one type of
microsurvey, Google Consumer Surveys (GCS), addressing questions about how they
are being used, who their respondents are, and of what quality is the data they
collect. We conclude with some current best practices for using this method in user
research.

One Example of a Microsurvey: GCS
Since we use GCS in this case study, we first provide a brief overview of how it
works. Each GCS respondent is shown only one question, two if there is a screening
question. If a survey has more than one question, each respondent is randomly shown
only one of the survey questions. The survey designer can choose one of twelve
predefined question formats that include open ended, single answer, multiple answer,
and rating scale responses. Certain question formats allow for images in the question
or responses. Questions and responses must be short, with 125 character and 44
character limits respectively; multiple choice questions are limited to showing 5
response options to each respondent. Survey designers can request a representative
population, or target respondents based on specific demographics (as inferred by IP
addresses and DoubleClick cookies) or by using a screening question. Questions are
then shown to people trying to access a publisher's premium content—primarily in
the categories of News, Arts & Entertainment, and Reference—and people answer
the question in order to continue reading the content (see Figure 1); in this way,
these microsurveys are acting as a surveywall between the respondent and the
content they want to access.

Figure 1. An example of how a respondent encounters GCS. The respondent is asked
to answer a short survey question, or share the page they are reading via social
media, in order to continue reading the publisher's content.

After data is collected, survey designers can view the results in the GCS interface,
which provides users with basic analysis tools including comparison of results by
different demographics and automatic, editable clustering of open-ended text
responses (see Figure 2).

Figure 2. Part of the GCS results interface. To the left are controls to filter responses
by demographics, and results to a multiple choice question are shown to the right.

Results: Analysis of GCS
We analyzed GCS log data and data from several surveys run by the authors. Some
of the surveys were run specifically to gather data about GCS as a method, and
others were run to answer user research questions for our product teams, however
we analyzed them from a methodological perspective for this case study.

GCS by the Numbers
GCS log data shows that the two most frequently used types of questions are
multiple-choice questions (see Table 1). Together, single and multiple answers make
up over 80% of all deployed GCS questions. However the most common question
type—multiple answer—has the lowest completion rate (see Table 1).

Table 1. Rate of usage among survey designers and completion rate among
respondents for the 12 different types of GCS questions.
Question type: usage, completion rate
Multiple answers: 62.04%, 20.56%
Single answer: 21.71%, 39.37%
Open ended: 4.62%, 27.03%
Rating: 3.81%, 34.19%
Numeric open ended: 1.60%, 25.30%
Rating with text: 1.50%, 34.09%
Rating with image: 1.30%, 27.20%
Large image choice: 0.99%, 28.49%
Side-by-side images: 0.92%, 29.37%
Image with menu: 0.82%, 36.57%
Open ended with image: 0.69%, 27.79%
Two choices with image: --, --

On average, respondents spend 9.7 seconds responding to a GCS question, and the
modal response time is 4 seconds (see Figure 3). GCSs also collect data very
quickly—on average, surveys are approved to start collecting data between one and
four hours after being created, and complete data collection in about two to four
days. General population surveys finish data collection on the lower end of that
range, whereas targeted surveys tend to take the four days.

Figure 3. Distribution of response times in seconds to GCS survey questions.

Who are GCS Respondents?
In November 2012, PEW Research ran a study to compare GCS demographics with
that of their telephone panels. Their overall findings were that GCS respondents
"conform closely to the demographic composition of the overall internet population,"
and that there is little evidence that GCS is biased towards heavy internet users. [4]

We ran a series of GCSs to dig deeper into demographic and technology-use
questions. We found that the rate of tablet ownership (PEW = 34%, GCS = 28%),
cell phone ownership (91%, 67%) and use of cell phones (35%, 33%) or the internet
for banking (61%, 48%) was lower among GCS than PEW respondents. In terms of
demographics, GCS shows lower rates across age and gender (see Table 2). With
respect to social networking site usage among older Americans, our findings using
GCS were close to PEW (see Table 3).

Table 2. Inferred GCS demographics compared to PEW demographics.
Demographic: PEW, GCS
Men: 32%, 27%
Women: 35%, 27%
18–24: 33%, 18%
25–34: 37%, 30%
35–44: 49%, 32%
45–54: 38%, 28%
55–64: 28%, 26%
65+: 18%, 23%
Unknown Age: —, 27%

Table 3. Social network usage among older Americans, using PEW and GCS survey
samples.
"Do you ever use the internet to use a social networking site like MySpace,
Facebook, or LinkedIn.com?": PEW 42% [age 50+], GCS 46% [age 45+]
"What is the primary social networking site you use?" (GCS): Facebook (85%),
LinkedIn (6%), Twitter (4%), Google+ (3%), MySpace (1%)

We also compared GCS respondents to respondents from Survey Sampling
International (SSI) and Knowledge Networks (KN) panels with respect to internet use
and technology adoption. Results across the panels were similar, with SSI
respondents tending to be the heavier internet users and technology adopters, and
KN being the lowest (see Table 4). Overall, while we notice demographic differences
between the survey samples—likely due to the number of unknowns in GCS—technology
usage and adoption is similar across all four samples, with PEW and KN
representing the high and low extremes, respectively.

Respondents' Attitudes Toward Surveywalls
We ran a GCS to explore respondents' attitudes toward surveywalls that stand
between them and content they are trying to access. We asked them which of five
options they would prefer when trying to access premium content. We found that the
most popular response was taking a short microsurvey (47%), followed by having
content sponsored by an advertiser (34%), making a small one-time payment (10%),
purchasing a subscription (6%), and other (3%; which they then had to specify as
open ended text).
Data Quality: Survey Attentiveness
As one measure of data quality, we ran a GCS that
asked respondents one of several trap questions. For a
summary of how respondents performed, see the "Trap Questions in GCS" sidebar.
We find that our GCS respondents answered the "Very Often" trap question correctly
less often (73%) than in an example of the same question being asked on a paper
survey (97%) [3]. A trap
survey run in Mechanical Turk found only 61% of
respondents answering correctly when asked to read an
email and answer two questions [2], but this task is
arguably harder than the questions we asked.
Data Quality: Garbage Open Ended Responses
We also analyzed data quality by looking at the rate of garbage responses that we
received across 25 GCS questions run for other projects. Examples of these questions
include: "which web browser(s) do you use?" and "what does clicking on this image
allow you to do?" We counted garbage responses such as "blah", "who cares", and
"zzzzz", and found that the percentage of garbage responses ranged from 1.8% to
23.4% (mean = 7.8%). Our analysis
revealed that the percentage of “I don’t know”
responses tended to correlate with the percentage of
garbage responses, suggesting that people were more
likely to provide such garbage responses when they
were not sure of what the question was asking of them.
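A sketch of the kind of per-question analysis described above; the rate values below are made-up placeholders for illustration, not the study's data, and statistics.correlation requires Python 3.10+.

import statistics

# Hypothetical per-question rates (fractions); placeholders only.
dont_know_rate = [0.05, 0.12, 0.02, 0.20, 0.08]
garbage_rate = [0.03, 0.10, 0.02, 0.23, 0.06]

# Pearson correlation between the "I don't know" rate and the garbage-response rate.
r = statistics.correlation(dont_know_rate, garbage_rate)
print(f"correlation between 'I don't know' rate and garbage rate: {r:.2f}")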
Conclusion: Best Practices for Microsurveys
We find that microsurveys such as Google Consumer
Surveys can quickly provide large amounts of data with
relatively low setup costs. We also see that the GCS
population is fairly representative as compared to other
large-scale survey panels.
However there are also pitfalls to keep in mind. Our
findings from the trap question survey suggest that
being concise is important to maximize data quality,
which supports GCS’s question length constraints. We
also suggest that it is important to appropriately target
surveys to a population in order to keep garbage open
ended responses to a minimum. If respondents are
being asked about something they are unfamiliar with,
they are less likely to provide meaningful responses.
Finally, multiple answer questions had the lowest
completion rate—which is often used as a measure of
data quality (e.g. [1])—so we suggest that people think
critically about the types of questions they use, and
consider using other question types if at all appropriate.
With respect to analyzing microsurveys, first it is
important to remember that demographics are inferred,
and there are many “unknowns”. We also suggest using
built-in text clustering tools to categorize open-ended
responses, and if desired, following up with multiple
choice questions to determine how frequent these
categories are.
References
[1] Dillman, D. A. & Schaefer, D. R. (1998).
Development of a standard e-mail methodology: results
of an experiment. Public Opinion Quarterly, 62, 3.
[2] Downs, J.S., Holbrook, M. B., Sheng, S., & Cranor,
L. F. (2010). Are your participants gaming the system?
Screening mechanical turk workers. In Proceedings of
CHI '10.
[3] Hargittai, E. (2005). Survey Measures of Web-Oriented Digital Literacy. Social
Science Computer Review, 23, 3, 371–379.
[4] Pew Research (November 2012). A Comparison of
Results from Surveys by the Pew Research Center and
Google Consumer Surveys.
Table 4. Technology use and adoption among 3 different survey panels. Values are
listed as KN, GCS, SSI.
For personal purposes, I normally use the Internet (5 = every hour or more, 1 = once
per week or less): 3.2, 3.5, 3.8
Other people often seek my ideas and advice regarding technology (5 = describes me
very well, 1 = describes me very poorly): 2.7, 3.1, 3.2
I am willing to pay more for the latest technology (same as above): 2.3, 2.6, 3.1
Which of the following best describes when you buy or try out new technology? (5 =
Among the first people, 1 = I am usually not interested): 2.5, 2.6, 3.1
How frequently do you post on social networks? (5 = multiple times a day, 1 = once a
month or less): 1.7, 2.1, 2.4

Trap Questions in GCS (percentage answering correctly)
What is the color of a red ball? (90.3% correct)
What is the shape of a red ball? (85.7%)
The purpose of this question is to assess your attentiveness to question wording. For
this question please mark the 'Very Often' response. (72.5%)
The purpose of this question is to assess your attentiveness to question wording.
Ignore the question below, and select "blue" from the answers. What color is a
basketball? (57%)
Minimizing off-target signals in RNA fluorescent
in situ hybridization
Aaron Arvey (1,2), Anita Hermann (3), Cheryl C. Hsia (3), Eugene Ie (2,4), Yoav Freund (2) and
William McGinnis (3,*)
(1) Computational and Systems Biology Center, Memorial Sloan-Kettering Cancer Center, New York, NY 10065,
(2) Department of Computer Sciences and Engineering, (3) Division of Biological Sciences, University of California,
San Diego, La Jolla, CA 92093 and (4) Google Inc., Mountain View, CA 94043, USA
Received November 4, 2009; Revised December 11, 2009; Accepted January 17, 2010
ABSTRACT
Fluorescent in situ hybridization (FISH) techniques
are becoming extremely sensitive, to the point
where individual RNA or DNA molecules can be
detected with small probes. At this level of sensitivity,
the elimination of ‘off-target’ hybridization is of
crucial importance, but typical probes used for RNA
and DNA FISH contain sequences repeated elsewhere
in the genome. We find that very short (e.g.
20 nt) perfect repeated sequences within much
longer probes (e.g. 350–1500 nt) can produce significant
off-target signals. The extent of noise is surprising
given the long length of the probes and the
short length of non-specific regions. When we
removed the small regions of repeated sequence
from either short or long probes, we find that the
signal-to-noise ratio is increased by orders of magnitude,
putting us in a regime where fluorescent
signals can be considered to be a quantitative
measure of target transcript numbers. As the
majority of genes in complex organisms contain
repeated k-mers, we provide genome-wide annotations
of k-mer-uniqueness at http://cbio.mskcc.org/aarvey/repeatmap.
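A simplified sketch of the kind of k-mer screen suggested above, flagging exact 20-mers of a candidate probe that occur in a reference sequence set. The function name is ours, and a genome-scale version would use a precomputed index rather than an in-memory set of all reference k-mers.

def repeated_kmers(probe, reference_seqs, k=20):
    """Return (position, k-mer) pairs for probe k-mers found in the reference set.

    `probe` and each element of `reference_seqs` are plain nucleotide strings.
    """
    background = set()
    for seq in reference_seqs:
        for i in range(len(seq) - k + 1):
            background.add(seq[i:i + k])
    hits = []
    for i in range(len(probe) - k + 1):
        kmer = probe[i:i + k]
        if kmer in background:
            hits.append((i, kmer))        # position and offending repeated k-mer
    return hits

Regions returned by such a scan could then be trimmed from the probe before synthesis.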
INTRODUCTION
The gene expression profiles of individual cells can be
drastically different from that of adjacent cells. This is
particularly true in developing or heterogeneous tissues
such as embryos (1), proliferative adult epithelia (2) and
tumors (3). Visualization of RNA expression patterns in
fields of cells is often accomplished with fluorescence
in situ hybridization (FISH) using antisense probes.
Analysis of cellular patterns of gene expression by FISH
has provided insight into prognosis (3) and cell fate (4) of
tissues. A challenge for the future is to use FISH in tissues
to quantify RNA expression levels on a cell-by-cell basis,
which requires high resolution, high sensitivity and high
signal-to-noise ratios (1,5–8).
A major hurdle in making RNA FISH methods quantitative
has been increasing sensitivity and specificity to the
point where genuine target RNA signals can be distinguished
from background. One way to produce probes
of high specificity has been to produce chemically-synthesized
oligonucleotides that are directly labeled
with fluorophores, and tiled along regions of RNA
sequence (6–8). Although directly-labeled oligo probes
are elegant, they have not yet been widely applied, in
part due to their expense, and in part due to their relatively
low signal strength (6–8). One alternative method
for single RNA molecule detection employs long
haptenylated riboprobes that are enzymatically
synthesized from cDNAs (1,5). Such probes are cheaply
and easily produced, and when detected with primary and
fluorescently-labeled secondary antibodies, they have
higher signal intensities and equivalent resolution when
compared to probes that are directly labeled with
fluorophores (1,5).
However, tiling probes have a natural advantage with
respect to specificity: if a single probe ‘tile’ hybridizes to an
off-target transcript, it is unlikely to generate sufficient
signal to pass an intensity threshold that is characteristic
of genuine RNA transcripts, which contain multiple tiled
binding sites. In contrast, a single haptenylated probe,
even if fragmented to sizes in the range of hundreds of
nucleotides, may yield strong off-target signals due to
the amplification conferred by primary and secondary
antibodies. One traditional approach to determine background
levels of fluorescence, and thus act as a crude
estimate of specificity, has been the use of sense
*To whom correspondence should be addressed. Tel: 858 822 0461; Fax: 858 822 3021; Email: wmcginnis@ucsd.edu
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
Published online 17 February 2010 Nucleic Acids Research, 2010, Vol. 38, No. 10 e115
doi:10.1093/nar/gkq042
The Author(s) 2010. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/
by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Collaboration in the Cloud at Google
Yunting Sun, Diane Lambert, Makoto Uchida, Nicolas Remy
Google Inc.
January 8, 2014
Abstract
Through a detailed analysis of logs of activity for
all Google employees1
, this paper shows how the
Google Docs suite (documents, spreadsheets and
slides) enables and increases collaboration within
Google. In particular, visualization and analysis
of the evolution of Google’s collaboration network
show that new employees2 have started
collaborating more quickly and with more people
as usage of Docs has grown. Over the last two
years, the percentage of new employees who collaborate
on Docs per month has risen from 70%
to 90% and the percentage who collaborate with
more than two people has doubled from 35% to
70%. Moreover, the culture of collaboration has
become more open, with public sharing within
Google overtaking private sharing.
1 Introduction
Google Docs is a cloud productivity suite and it
is designed to make collaboration easy and natural,
regardless of whether users are in the same
or different locations, working at the same or different
times, or working on desktops or mobile
devices. Edits and comments on the document
are displayed as they are made, even if many people
are simultaneously writing and commenting
on or viewing the document. Comments enable
real-time discussion and feedback on the document,
without changing the document itself. Authors
are notified when a new comment is made
or replied to, and authors can continue a conversation
by replying to the comment, or end
the discussion by resolving it, or re-start the discussion
by re-opening a closed discussion stream.
Because documents are stored in the cloud, users
can access any document they own or that has
been shared with them anywhere, any time and
on any device. The question is whether this enriched
model of collaboration matters.
There have been a few previous qualitative analyses
of the effects of Google Docs on collaboration.
For example, the review of Google Docs in
[1] suggested that its features should improve collaboration
and productivity among college students.
A technical report [2] from the University
of Southern Queensland, Australia argued that
Google Docs can overcome barriers to usability
such as difficulty of installation and document
version control and help resolve conflicts among
co-authors of research papers. There has also
been at least one rigorous study of the effect of
Google Docs on collaboration. Blau and Caspi
[3] ran a small experiment that was designed to
compare collaboration on writing documents to
merely sharing documents. In their experiment,
118 undergraduate students of the Open University
of Israel were randomized to one of five
groups in which they shared their written assignments
and received feedback from other students
to varying degrees, ranging from keeping texts
1. Full-time Google employees, excluding interns, part-time employees, vendors, etc.
2. Full-time employees who have been at Google for less than 90 days.
private to allowing in-text suggestions or allowing
in-text edits. None of the students had used
Google Docs previously. The authors found that
only students in the collaboration group perceived
the quality of their final document to be
higher after receiving feedback, and students in
all groups thought that collaboration improves
documents.
This paper takes a different approach, and looks
for the effects of collaboration on a large, diverse
organization with thousands of users over a much
longer period of time. The first part of the paper
describes some of the contexts in which Google
Docs is used for collaboration, and the second
part analyzes how collaboration has evolved over
the last two years.
2 Collaboration Visualization
2.1 The Data
This section introduces a way to visualize the
events during a collaboration and some simple
statistics that summarize how widespread collaboration
using Google Docs is at Google. The
graphics and metrics are based on the view, edit
and comment actions of all full-time employees
on tens of thousands of documents created in
April 2013.
2.2 A Simple Example
To start, a document with three collaborators
Adam (A), Bryant (B) and Catherine (C) is
shown in Figure 1. The horizontal axis represents
time during the collaboration. The vertical
axis is broken into three regions representing
viewing, editing and commenting. Each contributor
is assigned a color. A box with the contributor’s
color is drawn in any time interval in
which the contributor was active, at a vertical
position that indicates what the user was doing
in that time interval. This allows us to see when
contributors were active and how often they contributed
to the document. Stacking the boxes allows
us to show when contributors were acting at
the same time. Only time intervals in which at
least one contributor was active are shown, and
gaps in time that are shorter than a threshold
are ignored. Gray vertical bars of fixed width
are used to represent periods of no activity that
are longer than the threshold. In this paper, the
threshold is set to be 12 hours in all examples.
In Figure 1, an interval represents an hour.
Adam and Bryant edited the document together
during the hour of 10 AM May 4 and Bryant
edited alone in the following hour. The collaboration
paused for 8 days and resumed during
the hour of 2 pm on May 12. Adam, Bryant and
Catherine all viewed the document during that
hour. Catherine commented on the document
in the next hour. Altogether, the collaboration
had two active sessions, with a pause of 8 days
between them.
Figure 1: This figure shows an example of the
collaboration visualization technique. Each colored
block except the gray one represents an hour and the
gray one represents a period of no activity. The Y
axis is the number of users for each action type. This
document has three contributors, each assigned a different
color.
Although we have used color to represent collaborators
here, we could instead use color to
represent the locations of the collaborators, their
organizations, or other variables. Examples with
different colorings are given in Sections 2.5 and
2.6.
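A minimal sketch of the binning and gap-collapsing step behind these timelines; the (hour, user, action) event encoding is illustrative only, not the format of the internal logs.

from collections import defaultdict

def build_timeline(events, gap_hours=12):
    """Bin events into hourly blocks and collapse idle gaps longer than `gap_hours`.

    `events` is an iterable of (hour, user, action) with `hour` an integer hour
    index.  Long idle periods are replaced by a single ("gap", length) marker,
    drawn as a gray bar in the figures.
    """
    by_hour = defaultdict(set)
    for hour, user, action in events:
        by_hour[hour].add((user, action))
    timeline = []
    prev = None
    for hour in sorted(by_hour):
        if prev is not None and hour - prev > gap_hours:
            timeline.append(("gap", hour - prev))
        timeline.append((hour, sorted(by_hour[hour])))
        prev = hour
    return timeline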
2.3 Collaboration Metrics
To estimate the percentage of users who concurrently
edit a document and the percentage of
documents which had concurrent editing, we discretize
the timestamps of editing actions into 15
minute intervals and consider editing actions by
different contributors in the same 15 minute interval
to be concurrent. Two users who edit the
same document but always more than 15 minutes
apart would not be considered as concurrent, although
they would still be considered collaborators.
Edge cases in which two collaborators edit
the same document within 15 minutes of each
other but in two adjacent 15 minute intervals
would not be counted as concurrent events.
The choice of 15 minutes is arbitrary; however,
metrics based on a 15 minute discretization and
a 5 minute discretization are little different. The
choice of 15 minute intervals makes computation
faster. A more accurate approach would be to
look for sequences of editing actions by different
users with gaps below 15 minutes, but that
requires considerably more computing.
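A sketch of the discretization just described; the (doc_id, user_id, timestamp) schema is an assumption for the example, not the internal log format.

from collections import defaultdict

def concurrent_edit_sessions(edit_events, bucket_minutes=15):
    """Discretize edit timestamps into fixed intervals and report the
    (document, interval) buckets edited by more than one employee.

    `edit_events` is an iterable of (doc_id, user_id, timestamp_seconds).
    """
    editors = defaultdict(set)
    for doc_id, user_id, ts in edit_events:
        bucket = int(ts // (bucket_minutes * 60))
        editors[(doc_id, bucket)].add(user_id)
    # Keep only buckets with two or more distinct editors (concurrent sessions).
    return {key: users for key, users in editors.items() if len(users) > 1}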
2.4 Collaborative Editing
Collaborative editing is common at Google. 53%
of the documents that were created and shared
in April 2013 were edited by more than one employee,
and half of those had at least one concurrent
editing session in the following six months.
Looking at employees instead of documents, 80%
of the employees who edited any document contributed
content to a document owned by others
and 65% participated in at least one 15 minute
concurrent editing session in April 2013. Concurrent
editing is sticky, in the sense that 76% of the
employees who participate in a 15 minute concurrent
editing session in April will do so again
the following month.
There are many use cases for collaborative editing,
including weekly reports, design documents,
and coding interviews. The following three plots
show an example of each of these use cases.
Figure 2: Collaboration activity on a design document. The X axis is time in hours and the Y axis is the
number of users for each action type. The document was mainly edited by 3 employees, commented on by
18 and viewed by 50+.
18 and viewed by more than 50 employees.
Figure 2 shows the life of a design document created
by engineers. The X axis is time in hours
and the Y axis is the number of employees working
on the document for each action type. The
document was mainly edited by three employees,
commented on by 18 employees and viewed
by more than 50 employees from three major locations.
This document was completed within
two weeks and viewed many times in the subsequent
month. Design documents are common at
Google, and they typically have many contributors.
Figure 3 shows the life of a weekly report document.
Each bar represents a day and the Y
axis is the number of employees who edited and
viewed the document in a day. This document
has the following submission rules:
• Wednesday, AM: Reminder for submissions
• Wednesday, PM: All teams submit updates
• Thursday, AM: Document is locked
The activities on the document exhibit a pronounced
weekly pattern that mirrors the submission
rules. Weekly reports and meeting notes
that are updated regularly are often used by employees
to keep everyone up-to-date as projects
progress.
Figure 3: Collaboration on a weekly report. The
X axis is time in days and the Y axis is the number
of users for each action type. The activities exhibit
a pronounced weekly pattern and reflect the submission
rules of the document.
Finally, Figure 4 shows the life of a document
used in an interview. The X axis represents time
in minutes. The document was prepared by a recruiter
and then viewed by an engineer. At the
beginning of the interview, the engineer edited
the document and the candidate then wrote code
in the document. The engineer was able to watch
the candidate typing. At the end of the interview,
the candidate’s access to the document was
revoked so no further change could be made, and
the document was reviewed by the engineer. Collaborative
editing allows the coding interview to
take place remotely, and it is an integral part of
interviews for software engineers at Google.
Figure 4: The activity on a phone interview document.
The X axis is time in minutes and the Y axis
is the number of users for each action type. The engineer
was able to watch the candidate typing on the
document during a remote interview.
2.5 Commenting
Commenting is common at Google. 30% of the
documents created in April 2013 that are shared
received comments within six months of creation.
57% of the employees who used Google Docs in
April commented at least once in April, and 80%
of the users who commented in April commented
again in the following month.
Figure 5: Commenting and editing on a design document. The X axis is time in hours and the Y axis
is the number of user actions for each user location. There are four user actions, each assigned a different
color. Timestamps are in Pacific time.
Figure 5 shows the life of a design document.
Here color represents the type of user action (create
a comment, reply to a comment, resolve a
comment and edit the document), and the Y axis
is split into two locations. The document was
written by one engineering team and reviewed
by another. The review team used commenting
to raise many questions, which the engineering
team resolved over the next few days. Collaborators
were located in London, UK and Mountain
View, California, with a nine hour time zone difference,
so the two teams were almost ”taking
turns” working on the document (timestamps
are in Pacific time). There are many similar
communication patterns between engineers via
commenting to ask questions, have discussions
and suggest modifications.
2.6 Collaboration Across Sites
Employees use the Docs suite to collaborate with
colleagues across the world, as Figure 6 shows.
In that figure, employees working from nine locations
in eight countries across the globe contributed
to a document that was written within a
week. The document was either viewed or edited
with gaps of less than 12 hours (the threshold for
suppressing gaps in the plot) in the first seven
days as people worked in their local timezones.
After final changes were made to the document,
it was reviewed by people in Dublin, Mountain
View, and New York.
Figure 7 shows one month of global collaborations
for full-time employees using Google Docs.
The blue dots show the locations of the employees
and a line connects two locations if a document
is created in one location and viewed in the
other. The warmer the color of the line, moving
from green to red, the more documents shared
between the two locations.
Figure 6: Activity on a document. Each user location is assigned a different color. The X axis is time in
hours and the Y axis is the number of locations for each action type. Users from nine different locations
contributed to the document.
Figure 7: Global collaboration on Docs. The blue dots are locations and the dots are connected if there is
collaboration on Google Docs between the two locations.
2.7 Cross Device Work
The advantage of cloud-based software and storage
is that a document can be accessed from any
device. Figure 8 shows one employee’s visits to
a document from multiple devices and locations.
When the employee was in Paris, a desktop or
laptop was used during working hours and a mobile
device during non-working hours. Apparently,
the employee traveled to Aix-En-Provence
on August 18. On August 18 and the first part of
August 19, the employee continued working on
the same document from a mobile device while
on the move.
Figure 8: Visits to a document by one user working
on multiple devices and from multiple locations.
Not surprisingly, the pattern of working on desktops
or laptops during working hours and on mobile
devices out of business hours holds generally
at Google, as Figure 9 shows. The day of week
is shown on the X axis and hour of day in local
time on the Y axis. Each pixel is colored
according to the average number of employees
working in Google Docs in a day of week and
time of day slot, with brighter colors representing
higher numbers. Pixel values are normalized
within each plot separately. Desktop and laptop
usage of Google Docs peaks during conventional
working hours (9:00 AM to 11:00 AM and
1:00 PM to 5:00 PM), while mobile device usage
peaks during conventional commuting and other
out-of-office hours (7:00 AM to 9:00 AM and 6:00
PM to 8:00 PM).
Figure 9: The average number of active users working
in Google Docs in each day of week and time of
day slot. The X axis is day of the week and the Y
axis is time of the day in local time. Desktop/Laptop
usage peaks during working hours while mobile usage
peaks at out-of-office working hours.
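As a rough illustration of the normalization described above (not the original analysis code; the counts and names such as desktop_counts are invented for illustration), the following Python sketch scales each device class's day-of-week by hour-of-day matrix by its own maximum, so that both heatmaps use the full brightness range regardless of absolute usage levels:

    import numpy as np

    # Hypothetical average active-user counts per (day of week, hour of day) slot,
    # one matrix per device class; real values would come from the usage logs.
    desktop_counts = np.random.poisson(lam=50.0, size=(7, 24)).astype(float)
    mobile_counts = np.random.poisson(lam=5.0, size=(7, 24)).astype(float)

    def normalize_per_plot(counts):
        # Normalize within each plot separately so the brightest pixel is always 1.0.
        return counts / counts.max()

    desktop_pixels = normalize_per_plot(desktop_counts)
    mobile_pixels = normalize_per_plot(mobile_counts)
    print(desktop_pixels.max(), mobile_pixels.max())  # both print 1.0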
3 The Evolution of Collaboration
3.1 The Data
This section explores changes in the usage of
Google Docs over time. Section 2 defined collaborators
as users who edited or commented on the
same document and used logs of employee editing,
viewing and commenting actions to describe
collaboration within Google. This section defines
collaborators differently using metadata on documents.
Metadata is much less rich than the
event history logs used in Section 2, but metadata
is retained for a much longer period of time.
Document metadata includes the document creation
time and the last time that the document
was accessed, but no other information about its
revision history. However, the metadata does include
the identification numbers for employees
who have subscribed to the document, where a
subscriber is anyone who has permission to view,
edit or comment on a document and who has
viewed the document at least once. Here we use
metadata on documents, slides and spreadsheets.
We call two employees collaborators (or subscription
collaborators to be clear) if one is a subscriber
to a document owned by the other and
has viewed the document at least once and the
document has fewer than 20 subscribers. The
owner of the document is said to have shared
the document with the subscriber. The number
of subscribers is capped at 20 to avoid overcounting
collaborators. The more subscribers
the document has, the less likely it is that all
the subscribers contributed to the document.
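As a rough sketch of this definition (not the actual analysis pipeline; the record fields owner, subscribers, and created are hypothetical simplifications of the document metadata), subscription collaborator pairs could be derived in Python as follows:

    from collections import namedtuple
    from datetime import datetime

    # Simplified stand-in for a document metadata record; subscribers are assumed
    # to have permission and to have viewed the document at least once.
    Doc = namedtuple("Doc", ["owner", "subscribers", "created"])

    MAX_SUBSCRIBERS = 20  # cap used to avoid overcounting collaborators

    def subscription_collaborations(docs):
        # Yield (owner, subscriber, month) triples; the document creation month
        # stands in for the (unknown) time of the collaboration.
        for doc in docs:
            if len(doc.subscribers) >= MAX_SUBSCRIBERS:
                continue  # too many subscribers; likely not all contributed
            month = (doc.created.year, doc.created.month)
            for subscriber in doc.subscribers:
                if subscriber != doc.owner:
                    yield doc.owner, subscriber, month

    docs = [Doc("owner_id_1", ["emp_2", "emp_3"], datetime(2012, 7, 9))]
    print(list(subscription_collaborations(docs)))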
There is no timestamp for when the employee
subscribed to the document in the metadata, so
the exact time of the collaboration is not known.
Instead, the document creation time, which is
known, is taken to be the time of the collaboration.
An analysis (not shown here) of the event
history data discussed in Section 2 showed that
most collaborators join a collaboration soon after
a document is created, so taking collaboration
time to be document creation time is not
unreasonable. To make this assumption even
more tenable, we exclude documents for which
the time of the last view, comment or edit is more
than six months after the document was created.
This section uses metadata on documents created
between January 1, 2011 and March 31,
2013. We say that two employees had a subscription
collaboration in July if they collaborated on
a document that was created in July.
3.2 Collaboration for New Employees
Here we define the new employees for a given
month to be all the employees who joined Google
no more than 90 days before the beginning of
the month and started using Google Docs in
the given month. For example, employees called
new in the month of January 2011 must have
joined Google no more than 90 days before January
1, 2011 and used Google Docs in January
2011. Each month can include different employees.
New employees are said to share a document
if they own a document that someone else subscribed
to, whether or not the person subscribed
to the document is a new employee. Similarly, a
new employee is counted as a subscriber, regardless
of the tenure of the document creator.
Figure 10 shows that collaboration among new
employees has increased since 2011. Over the
last two years, subscribing has risen from 55% to
85%, sharing has risen from 30% to 50%, and the
fraction of users who either share or subscribe
has risen from 70% to 90%. In other words, new
employees are collaborating earlier in their career,
so there is a faster ramp-up and easier access
to collective knowledge.
Figure 10: This figure shows the percentage of new
employees who share, subscribe to others’ documents
and either share or subscribe in each one-month period
over the last two years.
Not only do new employees start collaborating
more often (as measured by subscription and
sharing), they also collaborate with more people.
Figure 11 shows the percentage of new employees
with at least a given number of collaborators
by month. For example, the percentage of
new employees with at least three subscription
collaborators was 35% in January 2011 (the bottom
red curve) and 70% in March 2013 (the top
blue curve), a doubling over two years. It is interesting
that the curves hardly cross each other
and the curves for the farthest back months lie
below those for recent months, suggesting that
there has been steady growth in the number of
subscription collaborators per new employee over
this period.
Figure 11: This figure shows the proportion of new
employees who have at least a given number of collaborators
in each one-month period. Each period is
assigned a different color. The cooler the color of the
curve, moving from red to blue, the more recent the
month. The legend only shows the labels for a subset
of curves. The percentage of new employees who have
at least three collaborators has doubled from 35% to
70%.
To present the data in Figure 11 in another way,
Table 1 shows percentiles of the distribution of
the number of subscription collaborators per new
employee using Google Docs in January 2011 and
in January 2013. For example, the lowest 25% of
new employees using Google Docs had no such
collaborators in January 2011 and two such collaborators
in January 2013.
25% 50% 75% 90% 95%
January 2011 0 1 4 7 11
January 2013 2 5 10 17 22
Table 1: This table shows percentiles of the number of subscription collaborators per new employee in January 2011 and in January 2013. The entire distribution shifts to the right.
3.3 Collaboration in Sales and Marketing
Section 3.2 compared new employees who joined
Google in different months. This section follows
current employees in Sales and Marketing who
joined Google before January 1, 2011. That is,
the previous section considered changes in new
employee behavior over time and this section
considers changes in behavior for a fixed set of
employees over time. We only analyze subscription
collaborations among this fixed set of employees
and collaborations with employees not
in this set are excluded.
Figure 12: This figure shows the percentage of current
employees in Sales and Marketing who have at
least a given number of collaborators in each one-month
period.
Figure 12 shows the percentage of current employees
in Sales and Marketing who have at least
a given number of collaborators at several times
in the past. There we see that more employees
are sharing and subscribing over time because
the fraction of the group with at least one subscription
collaborator has increased from 80%
to 95%. And the fraction of the group with
at least three subscription collaborators has increased
from 50% to 80%. It shows that many of
the employees who used to have no or very few
subscription collaborators have migrated to having
multiple subscription collaborators. In other
words, the distribution of number of subscription
collaborators for employees who have been
in Sales and Marketing since January 1, 2011 has
shifted right over time, which implies that collaboration
in that group of employees has increased
over time.
Finally, the number of documents shared by the
employees who have been in Sales and Marketing
at Google since January 1, 2011 has nearly doubled
over the last two years. Figure 13 shows the
number of shared documents normalized by the
number of shared documents in January 2011.
Figure 13: This figure shows the number of shared
documents created by employees in Sales and Marketing
each month normalized by the number of shared
documents in January 2011. The number has almost
doubled over the last two years.
3.4 Collaboration Between Organizations
Collaboration between organizations has increased
over time. To show that, we consider
hundreds of employees in nine teams within the
Sales and Marketing group and the Engineering
and Product Management group who joined
Google before January 1, 2011, were still active
as of March 31, 2013 and used Google Docs in that
period. Figure 14 represents the Engineering and
Product Management employees as red dots and
the Sales and Marketing employees as blue dots.
The same dots are included in all three plots
in Figure 14 because the employees included in
this analysis do not change. A line connects two
dots if the two employees had at least one subscription
collaboration in the month shown. The
denser the lines in the graph, the more collaboration,
and the more lines connecting red and blue
dots, the more collaboration between organizations.
Clearly, subscription collaboration has increased
both within and across organizations in
the past two years. Moreover, the network shows
more pronounced communities (groups of connected
dots) over time. Although there are nine
individual teams, there seem to be only three
major communities in the network. Figure 14
indicates that teams can work closely with each
other even though they belong to separate departments.
We also sampled 187 teams within the Sales and
Marketing group and the Engineering and Product
Management group. Figure 15 represents
teams in Engineering and Product Management
as red dots and teams in Sales and Marketing
as blue dots. Two dots are connected if the two
teams had at least one subscription collaboration
between their members in the month. Figure
15 shows that the collaboration between those
teams has increased and the interaction between
the two organizations has become stronger over
the past two years.
Figure 14: An example of collaboration across organizations.
Red dots represent employees in Engineering
and Product Management and blue dots represent
employees in Sales and Marketing
Figure 15: An example of collaboration between
teams. Red dots represent teams in Engineering and
Product Management and blue dots represent teams
in Sales and Marketing
3.5 Cultural Changes in Collaboration
Google Docs allows users to specify the access
level (visibility) of their documents. The default
access level in Google Docs is private, which
means that only the user who created the document
or the current owner of the document can
view it. Employees can change the access level on
a document they own and allow more people to
access it. For example, the document owner can
specify particular employees who are allowed to
access the document, or the owner can mark the
document as public within Google, in which case
any employee can access the document. Clearly,
not all documents created in Google can be visible
to everyone at Google, but the more documents
are widely shared, the more open the environment
is to collaboration.
Figure 16: This figure shows the percentage of
shared documents that are "public within Google"
created in each month. Public sharing is overtaking
private sharing at Google.
Figure 16 shows the percentage of shared documents
in Google created each month between
January 1, 2012 and March 31, 2013 that are
public within Google. The red line, which is a
curve fit to the data to smooth out variability,
shows that the percentage has increased from 48% to 54%, a relative increase of about 12%, in the last year alone. In
that sense, the culture of sharing is changing in
Google from private sharing to public sharing.
4 Conclusions
We have examined how Google employees collaborate
with Docs and how that collaboration has
evolved using logs of user activity and document
metadata. To show the current usage of Docs in
Google, we have developed a visualization technique
for the revision history of a document and
analyzed key features in Docs such as collaborative
editing, commenting, access from anywhere
and on any device. To show the evolution
of collaboration in the cloud, we have analyzed
new employees and a fixed group of employees
in Sales and Marketing, and computed collaboration
network statistics each month. We find
that employees are engaged in using the Docs
suite, and collaboration has grown rapidly over
the last two years.
It would also be interesting to conduct a similar
analysis for other enterprises and see how long it
would take them to reach the benchmark Google
has set for collaboration on Docs. Not only has
the collaboration on Docs changed at Google,
the number of emails, comments on G+, and calendar meetings between people who work together has also changed significantly over the past few
years. How those changes reinforce each other
over time would also be an interesting topic to
study.
Acknowledgements
We would like to thank Ariel Kern for her
insights about collaboration on Google Docs,
Penny Chu and Tony Fagan for their encouragement
and support and many thanks to Jim
Koehler for his constructive feedback.
[March, 2013] WORKING GROUP 4
Network Security Best Practices
FINAL Report – BGP Security Best Practices
Table of Contents
1 RESULTS IN BRIEF...................................................................................................................................... 3
1.1 CHARTER ........................................................................................................................................................ 3
1.2 EXECUTIVE SUMMARY................................................................................................................................. 3
2 INTRODUCTION........................................................................................................................................... 4
2.1 CSRIC STRUCTURE......................................................................................................................................... 5
2.2 WORKING GROUP [#4] TEAM MEMBERS ..................................................................................................... 7
3 OBJECTIVE, SCOPE, AND METHODOLOGY ................................................................................... 8
3.1 OBJECTIVE ..................................................................................................................................................... 8
3.2 SCOPE .............................................................................................................................................................. 9
3.3 METHODOLOGY ............................................................................................................................................ 9
4 BACKGROUND.............................................................................................................................................. 9
4.1 DEPLOYMENT SCENARIOS .............................................................................................................................. 9
5 ANALYSIS, FINDINGS AND RECOMMENDATIONS .............................................................................10
5.1 BGP SESSION-LEVEL VULNERABILITY ........................................................................................................10
5.1.1 SESSION HIJACKING ...................................................................................................................................................10
5.1.2 DENIAL OF SERVICE (DOS) VULNERABILITY ........................................................................................................12
5.1.3 SOURCE-ADDRESS FILTERING ..................................................................................................................................17
5.2 BGP INJECTION AND PROPAGATION VULNERABILITY...............................................................................20
5.2.1 BGP INJECTION AND PROPAGATION COUNTERMEASURES.................................................................................22
5.2.2 BGP INJECTION AND PROPAGATION RECOMMENDATIONS ................................................................................25
5.3 OTHER ATTACKS AND VULNERABILITIES OF ROUTING INFRASTRUCTURE..............................................26
5.3.1 HACKING AND UNAUTHORIZED 3RD PARTY ACCESS TO ROUTING INFRASTRUCTURE.....................................26
5.3.2 ISP INSIDERS INSERTING FALSE ENTRIES INTO ROUTERS ...................................................................................28
5.3.3 DENIAL-OF-SERVICE ATTACKS AGAINST ISP INFRASTRUCTURE.......................................................................28
5.3.4 ATTACKS AGAINST ADMINISTRATIVE CONTROLS OF ROUTING IDENTIFIERS....................................................30
6 CONCLUSIONS............................................................................................................................................32
7 APPENDIX .....................................................................................................................................................33
7.1 BACKGROUND................................................................................................................................................33
7.1.1 SALIENT FEATURES OF BGP OPERATION..............................................................................................................33
7.1.2 REVIEW OF ROUTER OPERATIONS..........................................................................................................................34
7.2 BGP SECURITY INCIDENTS AND VULNERABILITIES ...................................................................................35
7.3 BGP RISKS MATRIX......................................................................................................................................38
7.4 BGP BCP DOCUMENT REFERENCES............................................................40
1 Results in Brief
1.1 Charter
This Working Group was convened to examine and make recommendations to the Council
regarding best practices to secure the Domain Name System (DNS) and routing system of the
Internet during the period leading up to some significant deployment of protocol extensions such
as the Domain Name System Security Extensions (DNSSEC), Secure BGP (Border Gateway
Protocol) and the like. The focus of the group is limited to what is possible using currently
available and deployed hardware and software. Development and refinement of protocol
extensions for both systems is ongoing, as is the deployment of such extensions, and is the
subject of other FCC working groups. The scope of Working Group 4 is to focus on currently
deployed and available feature-sets and processes and not future or non-widely deployed
protocol extensions.
1.2 Executive Summary
Routing is what provides reachability between the various end-systems on the Internet be they
servers hosting web or email applications; home user machines; VoIP (Voice over Internet
Protocol) equipment; mobile devices; connected home monitoring or entertainment systems.
Across the length and breadth of the global network it is inter-domain routing that allows for a
given network to learn of the destinations available in a distant network. BGP (Border Gateway
Protocol) has been used for inter-domain routing for over 20 years and has proven itself a
dynamic, robust, and manageable solution to meet these goals.
BGP is configured within a network and between networks to exchange information about
which IP address ranges are reachable from that network. Among its many features, BGP allows
for a flexible and granular expression of policy between a given network and other networks that
it exchanges routes with. Implicit in this system is required trust in information learned from
distant entities. That trust has been the source of problems from time to time causing
reachability and stability problems. These episodes have typically been short-lived but
underscored the need for expanding the use of Best Current Practices (BCPs) for improving the
security of BGP and the inter-domain routing system.
These mechanisms have been described in a variety of sources and this document does not seek
to re-create the work done elsewhere but to provide an overview and gloss on the vulnerabilities
and methods to address each. Additionally, the applicability of these BCPs can vary somewhat
given different deployment scenarios such as the scale of a network’s BGP deployment and the
number of inter-domain neighbors. By tailoring advice for these various scenarios,
recommendations that may seem confusing or contradictory can be clarified. Further, an
appendix includes a table that indexes the risks and countermeasures according to different
deployment scenarios.
Issues that the working group considered included:
• Session hijacking
• Denial of service (DoS) vulnerabilities
• Source-address filtering
• BGP injection and propagation vulnerabilities
• Hacking and unauthorized access to routing infrastructure
• Attacks against administrative controls of routing identifiers
Working Group 4 recommends that the FCC encourage adoption of numerous best practices for
protecting ISPs’ routing infrastructures and addressing risks related to routing that are
continuously faced by ISPs. Inter-domain routing via BGP is a fundamental requirement for
ISPs and their customers to connect and interoperate with the Internet. As such, it is a critical
service that ISPs must ensure is resilient to operational challenges and protect from abuse by
miscreants.
SPECIAL NOTE: For brevity, and to address the remit of the CSRIC committee to make
recommendations for ISPs, the term ISP is used throughout the paper. However, in most
instances the reference or the recommendations are applicable to any BGP service components
whether implemented by an ISP or by other organizations that peer to the Internet such as
business enterprises, hosting providers, and cloud providers.
2 Introduction
CSRIC was established as a federal advisory committee designed to provide recommendations
to the Commission regarding Best Practices and actions the Commission may take to ensure
optimal operability, security, reliability, and resiliency of communications systems, including
telecommunications, media, and public safety communications systems.
Due to the large scope of the CSRIC mandate, the committee then divided into a set of Working
Groups, each of which was designed to address individual issue areas. In total, some 10
different Working Groups were created, including Working Group 4 on Network Security Best
Practices. This Working Group will examine and make recommendations to the Council
regarding best practices to secure the Domain Name System (DNS) and routing system of the
Internet during the period leading up to an anticipated widespread implementation of protocol
updates such as the Domain Name System Security Extensions (DNSSEC) and Secure Border
Gateway Protocol (BGPsec) extensions.
The Working Group presented its report on DNS Security issues in September 2012.
This Final Report – BGP Best Practices documents the efforts undertaken by CSRIC Working
Group 4 Network Security Best Practices with respect to securing the inter-domain routing
infrastructure that is within the purview of ISPs, enterprises, and other BGP operators. Issues
affecting the security of management systems that provide control and designation of routing
and IP-space allocation records that BGP is based on were also considered.
Routing and BGP related services are necessary and fundamental components of all ISP
operations, and there are many established practices and guidelines available for operators to
consult. Thus most ISPs have mature BGP/routing management practices and infrastructure in place.
Still, there remain many issues and exposures that introduce major risk elements to ISPs, since
the system itself is largely insecure and unauthenticated, yet provides the fundamental traffic
control system of the Internet. This report enumerates the issues the group identified as most
critical and/or that may need more attention.
2.1 CSRIC Structure
The Communications Security, Reliability, and Interoperability Council (CSRIC) III is organized as a CSRIC Steering Committee overseeing ten Working Groups, each led by a Chair or Co-Chairs:
Working Group 1: Next Generation 911
Working Group 2: Next Generation Alerting
Working Group 3: E911 Location Accuracy
Working Group 4: Network Security Best Practices
Working Group 5: DNSSEC Implementation Practices for ISPs
Working Group 6: Secure BGP Deployment
Working Group 7: Botnet Remediation
Working Group 8: E911 Best Practices
Working Group 9: Legacy Broadcast Alerting Issues
Working Group 10: 911 Prioritization
2.2 Working Group [#4] Team Members
Working Group [#4] consists of the members listed below for work on this report.
Name Company
Rodney Joffe – Co-Chair Neustar, Inc.
Rod Rasmussen – Co-Chair Internet Identity
Mark Adams ATIS (Works for Cox Communications)
Steve Bellovin Columbia University
Donna Bethea-Murphy Iridium
Rodney Buie TeleCommunication Systems, Inc.
Kevin Cox Cassidian Communications, an EADS NA Comp
John Crain ICANN
Michael Currie Intrado, Inc.
Dale Drew Level 3 Communications
Chris Garner CenturyLink
Joseph Gersch Secure64 Software Corporation
Jose A. Gonzalez Sprint Nextel Corporation
Kevin Graves TeleCommunication Systems (TCS)
Tom Haynes Verizon
Chris Joul T-Mobile
Mazen Khaddam Cox
Kathryn Martin Access Partnership
Ron Mathis Intrado, Inc.
Danny McPherson Verisign
Doug Montgomery NIST
Chris Oberg ATIS (Works for Verizon Wireless)
Victor Oppleman Packet Forensics
Elman Reyes Internet Identity
Ron Roman Applied Communication Sciences
Heather Schiller Verizon
Jason Schiller Google
Marvin Simpson Southern Company Services, Inc.
Tony Tauber Comcast
Paul Vixie Internet Systems Consortium
Russ White Verisign
Bob Wright AT&T
Table 1 - List of Working Group Members
3 Objective, Scope, and Methodology
3.1 Objective
This Working Group was convened to examine and make recommendations to the Council
regarding best practices to secure the Domain Name System (DNS) and routing system of the
Internet during the period leading up to what some anticipate might be widespread
implementation of protocol updates such as the Domain Name System Security Extensions
(DNSSEC) and Secure Border Gateway Protocol (BGPsec) extensions (though the latter outcome remains somewhat controversial).
DNS is the directory system that associates a domain name with an IP (Internet Protocol)
address. In order to achieve this translation, the DNS infrastructure makes hierarchical inquiries
to servers that contain this global directory. As DNS inquiries are made, their IP packets rely on
routing protocols to reach their correct destination. BGP is the protocol utilized to identify the
best available paths for packets to take between points on the Internet at any given moment. This
foundational system was built upon a distributed unauthenticated trust model that has been
mostly sufficient for over two decades but has some room for improvement.
These foundational systems are vulnerable to compromise through operator procedural mistakes
as well as through malicious attacks that can suspend a domain name or IP address's availability,
or compromise their information and integrity. While there are formal initiatives under way
within the IETF (which has been chartered to develop Internet technical standards and protocols)
that will improve this situation significantly, global adoption and implementation will take some
time.
This Working Group will examine vulnerabilities within these areas and recommend best
practices to better secure these critical functions of the Internet during the interval of time
preceding deployment of more robust, secure protocol extensions.
This report covers the BGP portion of these overall group objectives.
3.2 Scope
Working Group 4’s charter clearly delineates its scope to focus on two subsets of overall
network security, DNS and routing. It further narrows that scope to exclude consideration of the
implementation of DNSSEC (tasked to Working Group 5) and secure extensions of BGP (tasked
to Working Group 6). While those groups deal with protocol modifications requiring new
software and/or hardware deployments, WG4 is geared toward items that either do not require
these extensions or are risks which are outside the scope of currently contemplated extensions.
For this report regarding BGP, the focus is on using known techniques within the Operator
community. Some of these methods and the risks they seek to address are useful even in cases
where protocol extensions are used in some future landscape.
3.3 Methodology
With the dual nature of the work facing Working Group 4, the group was divided into two sub-groups, one focused on DNS security and the other on routing security. Starting in
December 2011, the entire Working Group met every two weeks via conference call(s) to review
research and discuss issues, alternating between sub-groups. The group created a mailing list to
correspond and launched a wiki to gather documents and to collectively collaborate on the
issues. Additional subject matter experts were occasionally tapped to provide information to the
working group via conference calls.
The deliverables schedule called for a series of reports starting in June 2012 that would first
identify issues for both routing and DNS security, then enumerate potential solutions, and finally
present recommendations. The initial deliverables schedule was updated in March in order to
concentrate efforts in each particular area for separate reports. The first report, on DNS security issues, was presented in September 2012, and this second report, on routing issues, is being
published in March 2013.
Based on the discussions of the group, a list of BGP risks, potential solutions, and relevant BCP
documents was created and refined over the course of the work. Subject matter experts in BGP
then drove development of the initial documentation of issues and recommendations. These
were then brought together into a full document for review and feedback. Text contributions, as
completed, were reviewed, edited and approved by the full membership of Working Group 4.
4 Background
4.1 Deployment Scenarios
BGP is deployed in many different kinds of networks of different size and profiles. Many
different recommendations exist to improve the security and resilience of the inter-domain
routing system. Some of the advice can even appear somewhat contradictory and often the key
decision can come down to understanding what is most important or appropriate for a given
network considering its size, the number of external connections, number of BGP routers, size
and expertise of the staff and so forth.
We attempt to tailor the recommendations and highlight which are most significant for a given
network operator’s situation. Further background and information on routing operations can be
found in the Appendix (Section 7) of this document for readers unfamiliar with this area of
practice.
5 Analysis, Findings and Recommendations
The primary threats to routing include:
• Risks to the routers and exchange of routing information
• Routing information that is incorrect or propagates incorrectly
• General problems with network operations
5.1 BGP Session-Level Vulnerability
When two routers are connected and properly configured, they form a BGP peering session.
Routing information is exchanged over this peering session, allowing the two peers to build a
local routing table, which is then used to forward actual packets of information. The first BGP4
attack surface is the peering session between two individual routers, along with the routers
themselves. Two classes of attacks are included here, session hijacking and denial of service.
5.1.1 Session Hijacking
The BGP session between two routers is based on the Transport Control Protocol (TCP), a
session protocol also used to transfer web pages, naming information, files, and many other
types of data across the Internet. Like all these other connection types, BGP sessions can be
hijacked or disrupted, as shown in Figure 1.
Figure 1: Session Hijacking
In this diagram, the attack host can either take over the existing session between Routers A and
B, or build an unauthorized session with Router A. By injecting itself into the peering between
Routers A and B, the attacker can inject new routing information, change the routing
information being transmitted to Router A, or even act as a “man in the middle,” modifying the
routing information being exchanged between these two routers.
5.1.1.1 Session-Level Countermeasures
Current solutions to these types of attacks center on secure hash mechanisms, such as HMAC-MD5 (which has been deprecated) and HMAC-SHA. These mechanisms rely on the peering
routers sharing a key (a shared key – essentially a password) that is used to calculate a
cryptographic hash across each packet (or message) transmitted between the two routers, and
included in the packet itself. The receiving router can use the same key to calculate the hash. If
the hash in the packet matches the locally calculated hash, the packet could have only been
transmitted by another router that knows the shared key.
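The keyed-hash check can be sketched in Python as follows. This is only a conceptual illustration of the mechanism, not the actual TCP MD5 (RFC 2385) or TCP-AO wire format; the key and message contents are placeholders:

    import hashlib
    import hmac

    SHARED_KEY = b"example-shared-key"  # agreed out of band by both peering routers

    def sign(message: bytes, key: bytes = SHARED_KEY) -> bytes:
        # The sender computes a keyed hash over the transmitted message.
        return hmac.new(key, message, hashlib.sha256).digest()

    def verify(message: bytes, received_digest: bytes, key: bytes = SHARED_KEY) -> bool:
        # The receiver recomputes the hash with the same key and compares it in
        # constant time; a match implies the sender knows the shared key.
        return hmac.compare_digest(sign(message, key), received_digest)

    update = b"BGP UPDATE ..."
    digest = sign(update)
    print(verify(update, digest))            # True: accepted
    print(verify(b"forged update", digest))  # False: rejected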
This type of solution is subject to a number of limitations. First, the key must actually be shared
among the routers building peering sessions. In this case, the routers involved in the peering
session are in different administrative domains. Coming to some uniform agreement about how
keys are generated and communicated (e.g. phone, email, etc.) with the often hundreds of
partners and customers of an ISP is an impractical task.
There is the possibility that any key sharing mechanism deployed to ease this administrative
burden could, itself, come under attack (although such attacks have never been seen in the wild).
Lastly, some concerns have been raised that the burden of cryptographic calculations could itself
become a vector for a Denial-of-Service (DoS) attack by a directed stream of packets with
invalid hash components. One way to deny service is to make the processor that is responsible
for processing routing updates and maintaining liveness too busy to reliably process these
updates in a timely manner. In many routers the processor responsible for calculating the
cryptographic hash is also responsible for processing new routing information learned, sending
out new routing information, and even transmitting keep-alive messages to keep all existing
sessions up. Since calculating the cryptographic hash is computationally expensive, a smaller
flood of packets with an invalid hash can consume all the resources of the processor, thus making it easier to cause the processor to become too busy.
Other mechanisms, such as the Generalized TTL Security Mechanism (GTSM, described in
RFC 5082), focus on reducing the scope of such attacks. This technique relies on a feature of the
IP protocol that would prevent an attacker from effectively reaching the BGP process on a router
with forged packets from some remote point on the Internet. Since most BGP sessions are built
across point-to-point links (on which only two devices can communicate), this approach would
prevent most attackers from interfering in the BGP session. Sessions built over a shared LAN,
such as is the case in some Internet exchanges, will be protected from those outside the LAN,
but will remain vulnerable to all parties that are connected to the LAN.
This solution is more complicated to implement when BGP speaking routers are not directly
connected. It is possible to count the number of hops between routers and limit the TTL value to
only that number of hops. This will provide some protection, limiting the scope of possible
attackers to be within that many hops. If this approach is used, consider the failure scenarios of devices between the pair of BGP-speaking routers, what impact those failures will have on the hop count between the routers, and whether you want to expand the TTL allowance so the session remains up during a failure that increases the hop count.
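A minimal sketch of the GTSM check, assuming the peer originates packets with TTL 255 as RFC 5082 specifies (the function and parameter names are illustrative, not vendor syntax):

    MAX_TTL = 255  # GTSM peers originate packets with TTL (or IPv6 hop limit) 255

    def gtsm_accept(received_ttl: int, max_router_hops: int = 1) -> bool:
        # Each router in the path decrements the TTL by one, so a packet arriving
        # with TTL below 255 - (max_router_hops - 1) must have originated farther
        # away than the configured radius and is discarded before BGP processes it.
        return received_ttl >= MAX_TTL - (max_router_hops - 1)

    print(gtsm_accept(255))                     # True: directly connected peer
    print(gtsm_accept(250))                     # False: too many hops away (likely spoofed)
    print(gtsm_accept(254, max_router_hops=2))  # True: multihop session with a widened radius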
5.1.1.2 Session-Level Current Recommendations
For a network with a small (e.g., single-digit) number of eBGP neighbors, it is reasonable to
follow the lead of what is specified by the upstream ISPs who may have a blanket policy of how
they configure their eBGP sessions. A network with larger numbers of eBGP neighbors may be
satisfied that they can manage the number of keys involved either through data-store or rubric.
Note that a rubric may not always be feasible as you cannot ensure that your neighbors will
always permit you to choose the key.
Managing the keys for a large number of routers involved in BGP sessions (a large organization
may have hundreds or thousands of such routers) can be an administrative burden. Questions
and issues can include:
• In what system should the keys be stored and who should have access?
• Should keys be unique per usage having only one key for internal usage and another key
that is shared for all external BGP sessions?
• Should keys be unique per some geographical or geo-political boundary say separate
keys per continent or per country or per router?
• Should keys be unique to each administrative domain, for example a separate key for
each Autonomous System a network peers with?
• There is no easy way to roll over keys, so changing a key is quite painful: it disrupts the transmission of routing information and requires simultaneous involvement
from parties in both administrative domains. This makes questions of how to deal with
the departure of an employee who had access to the keys, or what keys to use when
peering in a hostile country more critical.
Another consideration is the operational cost of having a key. Some routing domains will
depend on their peers to provide the key each time a new session is established, and not bother
to make a record of the key. This avoids the problems of how to store the key, and ensure the
key remains secure. However, if a session needs to be recreated because configuration information is lost, either through accidental deletion of the configuration or hardware replacement, then the key is no longer known. The session will remain down until the peer can be contacted and the key is re-shared. Oftentimes this communication does not occur, and the peer may simply try removing the key as a troubleshooting step and note that the session re-establishes.
When this happens the peer will often prefer for the session to remain up, leaving
the peering session unsecured until the peer can be contacted, and a maintenance window can be
scheduled. For unresponsive peers, an unsecured peering session could persist, especially
considering that the urgency to address the outage has now passed.
Despite these vulnerabilities having been widely known for a decade or more, they have not
been implicated in any notable number of incidents. As a result some network operators have
not found the cost/benefit trade-offs to warrant the operational cost of deploying such
mechanisms while others have. Given these facts, the Working Group recommends that
individual network operators continue to make their own determinations in using these countermeasures.
5.1.2 Denial of Service (DoS) Vulnerability
Because routers are specialized hosts, they are subject to the same sorts of Denial of Service
(DoS) attacks any other host or server is subject to. These attacks fall into three types:
1. Attacks that seek to consume all available interface bandwidth making it difficult for
enough legitimate traffic to get through such as UDP floods and reflective attacks
2. Attacks that seek to exhaust resources such as consume all available CPU cycles,
memory, or ports so that the system is too busy to respond such as TCP SYN attacks
3. Attacks utilizing specially crafted packets in an attempt to cause the system to crash or
operate in an unexpected way such as buffer overflow attacks, or malformed packet
attacks that create an exception that is not properly dealt with
Bandwidth exhaustion attacks attempt to use so much bandwidth that there is not enough
available bandwidth for services to operate properly. This type of an attack can cause routers to
fail to receive routing protocol updates or keep-alive messages resulting in out-of-date routing
information, routing loops, or interruption of routing altogether, such as happens when a BGP
session goes down and the associated routing information is lost.
Resource exhaustion attacks target traffic to the router itself, and attempt to make the router
exhaust its CPU cycles or memory. In the case of the former, the router’s CPU becomes too
busy to properly process routing keepalives and updates causing the adjacencies to go down. In
the case of the latter, the attacker sends so much routing protocol information that the router has
no available memory to store all of the required routing information.
Crafted-packet attacks attempt to send a relatively small number of packets that the router does not deal with appropriately. When a router receives this type of packet, it may fill up interface buffers and stop forwarding traffic on that interface, causing routing protocols to crash and restart, reboot, or hang. In some cases the router CPU may restart, reboot, or hang, likely causing loss of all topological and routing state. One example was the “protocol 55” attack, where some router vendors simply did not properly handle this traffic type.
Some routers are specialized to forward high rates of traffic. These routers often implement
their forwarding capabilities in hardware that is optimized for high throughput, and implement
the less demanding routing functions in software. As such, bandwidth exhaustion attacks are targeted at the router's interfaces, the backplane between those interfaces, or the hardware
responsible for making forwarding decisions. The other types of attack target the software
responsible for making the routing decisions.
Due to the separation between routing and forwarding, a fourth class of attacks is targeted at
exhausting the bandwidth of the internal interconnection between the forwarding components
and the routing components.
The section below on “Denial-of-Service Attacks on ISP Infrastructure” contains a discussion of
disruptive attacks besides those targeting the exchange of BGP routing information.
5.1.2.1 Denial of Service Countermeasures
GTSM, described above, can be an effective counter to some forms of DoS attacks against
routers, by preventing packets originating outside the direct connection between two BGP peers
from being processed by the router under attack. GTSM cannot resolve simple buffer overflow
problems, or DoS attacks that exploit weaknesses in packet processing prior to the TTL check,
however.
Another mechanism currently used to prevent DoS attacks against routers is to simply make the
interfaces on which the BGP session is running completely unreachable from outside the local
network or the local segment. Using link-local addresses in IPv6 is one technique (with
obviously limited applicability). Another approach is applying packet-filters on the relevant
address ranges at the network edge. (This process is called infrastructure filtering).
Other well-known and widely deployed DoS mitigation techniques can be used to protect routers
from attack just as they can be used to protect other hosts. For instance, Control Plane Policing
can prevent the routing process on a router from being overwhelmed with high levels of traffic
by limiting the amount of traffic accepted by the router directed at the routing processor itself.
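Control Plane Policing is configured in vendor-specific syntax, but the underlying idea is a rate limiter in front of the routing processor. A toy token-bucket sketch in Python (illustrative only; the rate and burst values are arbitrary):

    import time

    class ControlPlanePolicer:
        # Toy token bucket limiting how many packets per second may be punted
        # to the routing processor; excess packets are dropped before the CPU.
        def __init__(self, rate_pps: float, burst: float):
            self.rate_pps = rate_pps
            self.burst = burst
            self.tokens = burst
            self.last = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate_pps)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True   # packet is handed to the routing processor
            return False      # packet is dropped before it can consume CPU cycles

    policer = ControlPlanePolicer(rate_pps=1000.0, burst=200.0)
    print(policer.allow())  # True while the bucket still has tokens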
5.1.2.2 Denial-of-Service Current Recommendations
Since routers are essentially specialized hosts, mechanisms that can be used to protect individual
routers and peering sessions from attack are widely studied and well understood. What prevents
these techniques from being deployed on a wide scale?
Two things: the perception that the problem space is not large enough to focus on, and the
administrative burden of actually deploying such defenses. For instance, when GTSM is used
with infrastructure filtering, cryptographic measures may appear to be an administrative burden
without much increased security. Smaller operators, and end customers, often believe the
administrative burden too great to configure and manage any of these techniques.
Despite these vulnerabilities having been widely known for a decade or more, they have not
been implicated in any notable number of incidents. As a result some network operators have
not found the cost/benefit trade-offs to warrant the operational cost of deploying such
mechanisms while others have. Given these facts, the Working Group recommends that
individual network operators continue to make their own determinations in using these countermeasures.
In dealing with vulnerabilities due to “crafted packets”, the vendor should provide notification to
customers as the issues are discovered as well as providing fixed software in a timely manner.
Customers should make it a point to keep abreast of notifications from their vendors and from
various security information clearing-houses.
5.1.2.2.1 Interface Exhaustion Attacks
Recommendations include:
1. Understanding the actual forwarding capabilities of your equipment in your desired
configuration
2. Examining your queuing configuration
3. Carefully considering which types of traffic share a queue with your routing protocols,
and if that traffic can be blocked, rate-limited or forced to another queue
4. Understanding packet filtering capabilities of your equipment, and under what scenarios
it is safe to deploy packet filters
5. When it is safe to do so, tactically deploy packet filters upstream from a router that is being attacked
The first thing to consider with regard to attacks that attempt to consume all available bandwidth
is to determine the actual throughput of the router. It is not safe to assume that a router with two
or more 100 Gigabit Ethernet interfaces can receive 100G on one interface and transmit that
same 100G out another interface. Some routers can forward at line rate, and some routers
cannot. Performance may vary with packet size, for example forwarding traffic in software is
more taxing on the CPU as the number of packets increases.
The next thing to consider is outbound queuing of routers upstream from the router that is being
attacked. Routers typically place routing protocol traffic in a separate Network Control (NC)
queue. Determine the characteristics of this queue such as the queue depth and frequency of it
being serviced. These values may be tunable. Also consider what types of traffic are placed in
this queue, and specifically what traffic an outside attacker can place in this queue. Consider
preventing users of the network from being able to place traffic in this queue if they do not need
to exchange routing information with your network. For direct customers running eBGP, limit
traffic permitted into the NC queue to only traffic required to support their routing protocols.
Consider rate-limiting this traffic so no one customer can fill up the queue. Note that rate limits
will increase convergence time, so test a customer configuration that is advertising and receiving
the largest set of routes, and measure how long it takes to re-learn the routing table after the
BGP session is reset with and without the rate limits.
If the attack traffic has a particular profile, and all traffic matching that profile can be dropped without impacting legitimate traffic, then a packet filter can be deployed upstream from the router that is under attack. Ensure that deploying a packet filter will not impact the performance of your router by testing packet filters with various types of attacks and packet sizes on your equipment in a lab environment. Ensure that total throughput is not decreased and that there is not a particular packet-per-second rate that causes the router to crash, become unresponsive, stop forwarding traffic reliably, cause routing protocols to time out, etc.
If the upstream router belongs to a non-customer network, you will need to work with them to
mitigate the attack. Additional bandwidth on the interconnect may allow you to move the
bottleneck deeper into your network where you can deal with it.
Often the IP destination of these attacks is something downstream from the router. It is possible
that some or all of the attack traffic may be destined to the router. In that case some of the
mitigation techniques in the next section may also be helpful.
5.1.2.2.2 Resource Exhaustion Attacks
Recommendations include:
1. Consider deploying GTSM
2. Consider making router interfaces only reachable by directly connected network
3. Consider only permitting traffic sourced from configured neighbors
4. Consider deploying MD5
5. Deploy maximum prefixes
The first set of recommendations is to consider deploying mechanisms that restrict who can send
routing protocol traffic to a router. The second set of recommendations restricts how much
routing protocol state a neighbor can cause a router to hold.
GTSM, described above, can be an effective counter to some forms of DoS attacks against
routers, by limiting who can send routing protocol traffic to the router by a configured hop-count
radius. GTSM works by preventing packets originating outside the direct connection between
two BGP peers from being processed by the router under attack. GTSM cannot resolve simple
buffer overflow problems, or DoS attacks that exploit weaknesses in packet processing prior to
the TTL check, however.
Another mechanism currently used to prevent DoS attacks against routers is to simply make the
interfaces on which the BGP session is running completely unreachable from outside the local
network or the local segment. Using link-local addresses in IPv6 is one technique (with
obviously limited applicability). Another approach is applying packet-filters on the relevant
address ranges at the network edge. (This process is called infrastructure filtering).
Some routers can dynamically generate packet filters from other portions of the router
configuration. This enables one to create an interface packet filter that only allows traffic on the
BGP ports from source IP addresses that belong to a configured neighbor. This means attempts
to send packets to the BGP port by IP addresses that are not a configured neighbor will be
dropped right at the interface.
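Conceptually, such a dynamically generated filter amounts to an allow-list keyed on the configured neighbor addresses. A simplified Python sketch (the addresses are documentation prefixes rather than real neighbors, and details such as which side initiates the TCP connection are ignored):

    BGP_PORT = 179  # TCP port used for BGP sessions

    # Hypothetical set of configured BGP neighbor addresses.
    CONFIGURED_NEIGHBORS = {"192.0.2.1", "198.51.100.7"}

    def permit_on_interface(src_ip: str, dst_port: int) -> bool:
        # Only configured neighbors may send traffic to the BGP port; everything
        # else destined to that port is dropped at the interface. Non-BGP traffic
        # is left to other filters and policies.
        if dst_port != BGP_PORT:
            return True
        return src_ip in CONFIGURED_NEIGHBORS

    print(permit_on_interface("192.0.2.1", 179))    # True: configured neighbor
    print(permit_on_interface("203.0.113.9", 179))  # False: dropped at the interface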
The same session-level protections discussed earlier, such as MD5, can also limit who can send
routing information to only those routers or hosts that have the appropriate key. As such this is
also an effective mechanism to limit who can send routing protocol traffic. While these packets
will be processed by the router, and could possibly tax the CPU, they cannot cause the router to
create additional routing state such as adding entries to the Routing Information Base (RIB) or
Forwarding Information Base (FIB).
Lastly, each neighbor should be configured to limit the number of prefixes it can send to a
reasonable value. A single neighbor accidentally or intentionally de-aggregating all of the
address space they are permitted to send could consume a large amount of RIB and FIB
memory, especially with the large IPv6 allocations.
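The effect of a per-neighbor maximum-prefix setting can be sketched as follows; the limits, neighbor names, and warning threshold are illustrative, and real implementations differ in whether they warn, dampen, or tear down the session when the limit is crossed:

    # Hypothetical per-neighbor prefix limits.
    MAX_PREFIXES = {"customer-a": 500, "transit-b": 600000}

    def prefix_limit_action(neighbor: str, prefixes_received: int, warn_fraction: float = 0.9) -> str:
        # Return the action to take as routes are learned from this neighbor.
        limit = MAX_PREFIXES[neighbor]
        if prefixes_received > limit:
            return "teardown"   # stop accepting routes / reset the session
        if prefixes_received > warn_fraction * limit:
            return "warn"       # log so operators can react before the hard limit
        return "ok"

    print(prefix_limit_action("customer-a", 460))  # warn
    print(prefix_limit_action("customer-a", 501))  # teardown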
5.1.2.2.3 Crafted Packet Attacks
Crafted packet attacks typically occur when a router receives some exception traffic that the
vendor did not plan for. In some cases it may be possible to mitigate these attacks by filtering
the attack traffic if that traffic has a profile that can be matched on and all traffic matching that
profile can be discarded. More often than not, this is not the case.
In all cases the vendor should provide new code that deals with the exception.
Recommendations include keeping current with all SIRT advisories from your vendors. When a vulnerability is published, move quickly to upgrade vulnerable versions of the code. This may
require an upgrade to a newer version of the code or a patch to an existing version.
For larger organizations that have extensive and lengthy software certification programs, it is
often more reasonable to ask the vendor to provide a patch for the specific version(s) of code
that the organization is running. If possible, the vendor should indicate the extent to which the code
is modified, to quantify how substantial the change is and to help the provider plan what
should be included in the abbreviated software certification tests.
For smaller organizations, or organizations that complete little or no software certification, the
newer version of code with the fix in place should be deployed. Generally this is deployed
cautiously at first to see if issues are raised with a limited field trial, followed by a more
widespread deployment.
5.1.2.2.4 Internal Bandwidth Exhaustion Attacks
Other well-known and widely deployed DoS mitigation techniques can be used to protect routers
from attack just as they can be used to protect other hosts. For instance, Control Plane Policing
can prevent the routing process on a router from being overwhelmed with high levels of traffic
by limiting the amount of traffic accepted by the router directed at the routing processor itself.
One should consider not only the impact on the router CPU, but also the impact on the
bandwidth between the forwarding components and the router CPU. There may be some internal
queuing in place on the interconnect between the forwarding components and the router CPU. It
may be possible to influence which queue routing protocol traffic is placed in, which queue
traffic generated by customers is placed in, and the depths and/or servicing of these queues, in order
to separate and minimize the ability of non-routing traffic to impact routing traffic. If traffic
generated by customers (for routing protocols or otherwise) can crowd out a network’s internal
routing protocol traffic, then operators may consider separately rate limiting this customer
traffic.
5.1.3 Source-address filtering
Many Internet security exploits hinge on the ability of an attacker to send packets with spoofed
source IP addresses. Masquerading in this way can give the attacker entry to unauthorized
access at the device or application level, and some BGP vulnerabilities fall into this category.
The problem of source-spoofing has long been recognized, and countermeasures for
filtering at the interface level have long been available.
5.1.3.1 Source-address spoofing example
Consider the diagram below which illustrates the legitimate bi-directional traffic flow between
two hosts on the left-hand side. An attacker connected to another network can send IP packets
with a source address field set to the address of one of the other machines, unless filtering is
applied at the point where that attacker’s host or network attaches to AS65002.
5.1.3.2 Source-address spoofing attacks
Though most IP transactions are bi-directional, attacks utilizing spoofed source IPs do not
require bi-directional communication but instead exploit particular protocol or programmatic
semantic weaknesses.
Exploits using this technique have covered many areas over the years, including the following
types, each followed by an example.
• Attacks against services which rely only on source IP of the incoming packet for
authorization
• rsh, rlogin, NFS, Xwindows, etc.
• Attacks where the unreachability of the source can be exploited
• TCP SYN floods which exhaust resources on the server
• Attacks where the attacker masquerades as the “victim”
• Small DNS or SNMP requests resulting in highly asymmetric data flow back toward the
victim
• Abusive traffic which results in the legitimate user getting blocked from the server or
network
5.1.3.3 Source-address filtering challenges
The barriers to implementing these countermeasures have ranged from lack of vendor support to
lack of solid motivation to implement them.
• Lack of proper vendor support: In older implementations of network devices, filtering
based on the source address of a packet was performed in software, rather than hardware,
and thus had a major impact on the rate of forwarding through the device. Most modern
network equipment can perform source filtering in the hardware switching path,
eliminating this barrier to deployment.
• Lack of scalable deployment and configuration management: In older deployments,
filters based on the source of traffic were configured and managed manually, adding a
large expense to the entity running the network. This barrier has largely been resolved
through remote triggered black hole, unicast Reverse Path Forwarding (uRPF), and loose
uRPF options.
• Fear of interrupting legitimate traffic, for example in multi-homed situations: Vendors
have created flexibility in uRPF filtering to reduce or eliminate this barrier. Future
possible additions include “white lists,” which would allow traffic to pass through a
uRPF check even though it didn’t meet the rules.
• Lack of business motivation: Unilateral application of these features does not benefit or
protect a network or its customers directly; rather, it contributes to the overall security of
the Internet. Objections to incurring this “cost” are being overcome by the realization
that if everyone performs this type of filtering, then everyone benefits.
5.1.3.4 Source-address filtering recommendations
Filtering should be applied as close as possible to the entry point of traffic. Wherever one host,
network, or subnet is attached, a feature such as packet filtering, uRPF, or source-address
validation should be used. Ensure adequate support from equipment vendors for subscriber-management
systems (e.g. for Cable and DSL providers) or data-center or campus routers and
switches.
Stub networks should also filter traffic at their borders to ensure IP ranges assigned to them do
not appear in the source field of incoming packets and only those ranges appear in the source
field of outgoing packets.
Transit networks should likewise use features such as uRPF. Strict mode should be used at a
border with a topological stub network and loose mode between transit networks.
Transit networks that provide connectivity primarily to stub networks, such as consumer ISPs,
should consider uRPF strict mode on interfaces facing their customers. If these providers
provide a home router to their customers they should consider making uRPF part of the default
home router configuration.
Transit networks that provide connectivity to a mix of stub networks and multi-homed networks
must consider the administrative burden of configuring uRPF strict mode only on stub customers
and uRPF loose mode, or no uRPF, on customers that are, or become, multi-homed.
When using uRPF loose mode in the presence of a default route, one must take special care to
consider configuration options to include or exclude the default route.
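A highly simplified model of the strict/loose distinction, and of the default-route caveat just mentioned, is sketched below (Python; the routing table and interface names are invented):

    # Simplified uRPF sketch (illustration only; routes and interfaces are invented).
    # routes maps each prefix to the interface its best route points out of.
    import ipaddress

    routes = {
        ipaddress.ip_network("203.0.113.0/24"): "cust-1",   # stub customer prefix
        ipaddress.ip_network("0.0.0.0/0"): "transit-1",     # default route
    }

    def urpf_strict(src_ip, in_iface):
        # Strict mode: the source must be reachable back out the receiving interface.
        src = ipaddress.ip_address(src_ip)
        return any(src in net and iface == in_iface
                   for net, iface in routes.items() if net.prefixlen > 0)

    def urpf_loose(src_ip):
        # Loose mode: the source merely has to exist somewhere in the routing table.
        # Including the default route would make every source "valid", which is why
        # the configuration option mentioned above matters.
        src = ipaddress.ip_address(src_ip)
        return any(src in net for net in routes if net.prefixlen > 0)

    print(urpf_strict("203.0.113.9", "cust-1"))  # True: forwarded
    print(urpf_strict("192.0.2.55", "cust-1"))   # False: dropped as likely spoofed
    print(urpf_loose("192.0.2.55"))              # False here; True if the default route were counted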
The value of loose mode uRPF with networks in the default free zone is debatable. It will only
prevent traffic with a source address of RFC-1918 space and dark IPs (IP addresses that are not
routed on the Internet). Often these dark IP addresses are useful for backscatter techniques and
tracing the source(s) of a spoofed DoS attack. It is also important to consider if RFC-1918
addresses are used internal to the transit provider’s network. This practice may become more
common if ISPs implement Carrier Grade NAT.
It is also worth pointing out that some business customers depend on VPN software that is
poorly implemented, and only changes the destination IP address when re-encapsulating a
packet. If these customers are using non-routed IP addresses in their internal network then
enabling uRPF will break these customers.
It is important to measure the impact on forwarding when enabling uRPF. Even when uRPF is
implemented in hardware, the router must look up the destination as well as the source. A
double lookup will cause forwarding throughput to be reduced by half. This may have no impact on the
forwarding rate if the throughput of the forwarding hardware is more than twice the rate of all
the interfaces it supports.
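That consideration amounts to simple arithmetic, sketched below with invented hardware figures:

    # Illustrative arithmetic only; the capacity figures are invented.
    forwarding_capacity_lookups_per_sec = 60_000_000  # lookups/second the forwarding hardware can do
    aggregate_interface_packets_per_sec = 25_000_000  # worst-case packets/second across all interfaces

    lookups_per_packet = 2  # destination lookup plus uRPF source lookup
    effective_packet_capacity = forwarding_capacity_lookups_per_sec / lookups_per_packet

    # uRPF halves the lookup budget, but there is no impact on forwarding as long as
    # the halved capacity still exceeds the aggregate interface rate.
    print(effective_packet_capacity >= aggregate_interface_packets_per_sec)  # True in this example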
Further, more detailed advice and treatment of this subject can be found in:
• IETF BCP38/RFC 2827 Network Ingress Filtering1
• BCP 84/RFC 3704 Ingress Filtering for Multihomed Networks2
• ICANN SAC004 Securing the Edge
5.2 BGP Injection and Propagation Vulnerability
A second form of attack against the routing information provided by BGP4 is through injection
of misleading routing information, as shown in figure 2.
Figure 2: A Prefix Hijacking Attack
In this network, AS65000 has authority to originate 192.0.2.0/24. Originating a route, in this
context, means that computers having addresses within the address space advertised are actually
reachable within your network — that a computer with the address 192.0.2.1, for instance, is
physically attached to your network.
Assume AS65100 would like to attract traffic normally destined to a computer within the
192.0.2.0/24 address range. Why would AS65100 want to do this? There are a number of
possible motivations, including:
1 http://tools.ietf.org/html/bcp38
2 http://tools.ietf.org/html/bcp84
• A server with an address in this range accepts logins from customers or users, such as a
financial web site, or a site that hosts other sensitive information, or information of value
• A server with an address in this range processes information crucial to the operation of a
business the owner of AS65100 would like to damage in some way, such as a
competitor, or a political entity under attack
AS65100, the attacker, can easily attract packets normally destined to 192.0.2.0/24 within
AS65000 by simply advertising a competing route for 192.0.2.0/24 to AS65002. From within
BGP itself, there’s no way for the operators in AS65002 to know which of these two
advertisements is correct (or whether both origins are valid – a configuration which does see
occasional legitimate use). The impact of the bogus information may be limited to the directly
neighboring AS(es), depending on the routing policy of the nearby ASes. The likelihood of the
incorrect route being chosen can be improved by several attributes of the route:
• A shorter AS Path
A shorter AS Path has the semantic value of indicating a topologically “closer” network. In the
example above, the normal propagation of the route would show AS65100 as “closer” to
AS65001 and thus, other factors being equal, more preferred than the legitimate path via AS65000.
• A longer prefix
Longer prefixes represent more-specific routing information, so a longer prefix is always
preferred over a shorter one. For instance, in this case the attacker might advertise 192.0.2.0/25,
rather than 192.0.2.0/24, to make the false route to the destination appear more desirable than
the real one.
• A higher local-preference setting
Local-preference is the non-transitive BGP attribute that most network operators use to
administratively influence their local routing. Typically, routes learned from a “customer” (i.e.,
paying) network are preferred over those where the neighboring network has a non-transit
relationship or where the operator is paying for transit from the neighboring network. This
attribute is more important in the decision algorithm for BGP than AS-path length so routes
learned over such a session can draw traffic even without manipulation of the AS-path attribute.
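To make the interplay of these attributes concrete, the sketch below (Python; a drastic simplification of the BGP decision process, with invented routes and values) compares candidate routes to the same prefix by local-preference first and AS-path length second. The preference for a longer prefix is not part of this comparison; it applies when the forwarding lookup chooses among different prefixes.

    # Drastically simplified best-path comparison for routes to the SAME prefix.
    # Real BGP has many more tie-breakers; the routes below are invented.

    candidate_routes = [
        {"via": "AS65000 (legitimate)", "local_pref": 100, "as_path": [65001, 65000]},
        {"via": "AS65100 (attacker)",   "local_pref": 100, "as_path": [65100]},
    ]

    def best_path(candidates):
        # Higher local-preference wins; among equals, the shorter AS path wins.
        return max(candidates, key=lambda r: (r["local_pref"], -len(r["as_path"])))

    print(best_path(candidate_routes)["via"])
    # -> "AS65100 (attacker)": the shorter, bogus AS path is preferred, and a higher
    #    local-preference on a customer session would override AS-path length entirely.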
This illustration can be used to help describe some related types of risks:
Figure 3: BGP Propagation Vulnerabilities
Route Leak: In this case, AS65100 is a transit customer of both AS65000 and AS65002. The
operator of AS65100 accidentally leaks routing information advertised by AS65000 into its
peering session with AS65002. This could possibly draw traffic passing from AS65002 towards
a destination reachable through AS65000 through AS65100 when this path was not intended to
provide transit between these two networks. As the name suggests, such leaks are most often the
result of inadvertent misconfiguration and can overload the links between AS65100 and the other
ASes. Occasionally, however, similar propagation behavior has more malicious outcomes:
• Man in the Middle: In this case, all the autonomous systems shown have non-transit
relationships. For policy reasons, AS65000 would prefer traffic destined to 192.0.2.0/24
pass through AS65100. To enforce this policy, AS65000 filters the route for 192.0.2.0/24
towards AS65100. In order to redirect traffic through itself (for instance, in order to
snoop on the traffic stream), AS65100 generates a route advertisement that makes it
appear as if AS65000 has actually advertised 192.0.2.0/24, advertising this route to
AS65002, and thus drawing traffic destined to 192.0.2.0/24 through itself.
• False Origination: This attack is similar to the man in the middle explained above,
however in this case there is no link between AS65000 and AS65100. Any traffic
destined to 192.0.2.0/24 that is drawn into AS65100 is discarded rather than being delivered.
Note that all three of these vulnerabilities are variations on a single theme: routing information
that should not be propagated based on compliance with some specific policy nonetheless is.
5.2.1 BGP Injection and Propagation Countermeasures
5.2.1.1 Prefix Filtering
The key bulwark against entry and propagation of illegitimate routing announcements into the
global routing system is prefix-level filtering, typically at the edge between the ISP and its
customers. The usual method involves the customer communicating a list of prefixes and
downstream ASes which they expect to be reachable through the connection to the ISP. The ISP
will then craft a filter applied to the BGP session which explicitly enumerates this list of
expected prefixes (a “prefix-list”), perhaps allowing for announcement of some more-specific
prefixes within the ranges such as might be needed by the customer to achieve some goals in
adjusting the load of the customer’s inbound traffic across their various connections. This
configuration forms a “white-list”, in security parlance, of possible downstream destinations but
does not validate the overall semantic correctness of the resulting routing table.
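A minimal sketch of building such a white-list from customer-supplied prefixes follows (Python; the prefixes, the more-specific allowance, and the IOS-like rule syntax are invented for illustration):

    # Illustrative prefix-list generation for a customer-facing BGP session.
    import ipaddress

    customer_prefixes = ["203.0.113.0/24", "198.51.100.0/22"]  # hypothetical customer ranges
    MAX_MORE_SPECIFIC = 24  # allow more-specific announcements down to /24 for traffic engineering

    def build_prefix_list(name, prefixes, max_len=MAX_MORE_SPECIFIC):
        lines = []
        for p in prefixes:
            net = ipaddress.ip_network(p)
            entry = f"ip prefix-list {name} permit {net}"
            if max_len > net.prefixlen:
                entry += f" le {max_len}"  # permit more-specifics within the range
            lines.append(entry)
        lines.append(f"ip prefix-list {name} deny 0.0.0.0/0 le 32")  # everything else is rejected
        return lines

    for line in build_prefix_list("CUST-EXAMPLE-IN", customer_prefixes):
        print(line)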
5.2.1.1.1 Manual Prefix-Filter Limitations
The validity of the information in this list is obviously important and if the customer is either
malicious or simply mistaken in the prefixes they communicate, the prefix filter could obviously
still leave open a vulnerability to bogus route injection. Thus, typically, the information
communicated by the customer is checked against registration records such as those offered in the
“whois” information available from Regional Internet Registries (RIRs) and/or others in the
address assignment hierarchy. However, there is nothing in the RIR records
explicitly indicating a mapping between the address assignment and the origin AS.
The reality is that the process of manually checking a filter that is a few thousand lines long,
with hundreds of changes a week is tedious and time consuming. Many transit providers do not
check at all. Others have a policy to always check, but support staff may grow complacent with
updates from certain customers that have long filter lists or have filter lists that change weekly,
especially if those customers have never had a questionable prefix update in the past. In these
cases they may only spot-check, only check the first few changes and then give up, or grow
fatigued and be less diligent the more lines they check.
Moreover, the entity name fields in the “whois” information are free-form and often can’t be
reliably matched to the entity name used by the ISP’s customer records. Typically it is a fuzzy
match between the customer name on record and the company name listed in the “whois”
record. A truly malicious actor could order service with a name which is intentionally similar to
a company name whose IP addresses they intend to use. Another possibility is that the company
name on record is legitimate and an exact match to the company name on the “whois” record,
but the customer is a branch office, and the legitimate holder of the IP addresses is the corporate
office which has not authorized the branch office to use their space. It is also possible that the
customer is the legitimate holder of the address space, but the individual who called in to the
provider support team is not authorized to change the routing of the IP block in question. This
problem is further complicated when a transit provider’s customer has one or more downstream
customers of its own. These relationships are typically hard or impossible to verify.
If every transit provider accurately filtered all of the prefixes their customers advertised, and
each network that a transit provider peers with could be trusted to also accurately filter all of the
prefixes of their customers, then route origination and propagation problems could be virtually
eliminated. However, managing filters requires thousands of operators examining, devising, and
adjusting the filters on millions of devices throughout the Internet. While there are processes and
tools within any given network, such highly inconsistent processes, particularly when handling
large amounts of data (a tedious process in and of itself), tend to produce an undesirable rate of
errors. Each time an individual operator misjudges a particular piece of information, or simply
makes a mistake in building a filter, the result is a set of servers (or services) that are
unreachable until the mistake is found and corrected.
5.2.1.2 Internet Routing Registry (IRR)
The second source of information a provider can use as a basis for filtering received routing
information is a voluntary set of databases of routing policy and origination called Internet
Routing Registries (IRRs). These IRRs allow providers to register information about their
policies towards customers and other providers, and also allow network operators to register
which address space they intend to originate. Some providers require their customers to register
their address space in an IRR before accepting the customer’s routes; oftentimes the provider
will “proxy register” information on the customer’s behalf, since most customers are not versed
in IRR details.
5.2.1.2.1 IRR Limitations
Because IRRs are voluntary, there is some question about the accuracy and timeliness of the
information they contain (see Research on Routing Consistency in the Internet Routing Registry
by Nagahashi and Esaki for a mostly negative view, and How Complete and Accurate is the
Internet Routing Registry by Khan for a more positive view). Anecdotally, RIPE’s IRR is in
widespread use today, and some large providers actually build their filtering off this database, so
the accuracy level is at least operationally acceptable for some number of operators. Some IRR
repositories use an authorization model as well as authentication but none that primarily serve
North America perform RPSL authorization using the scheme described in RFC 2725 – Routing
Policy System Security3.
5.2.1.3 AS-Path filtering
Filters on the AS_PATH contents of incoming BGP announcements can also be part of a
defensive strategy to guard against improper propagation of routing information. Some ISPs
have used AS-path filters on customer-facing BGP sessions instead of prefix-filters. This
approach is generally inadequate to protect against even the most naïve misconfigurations, much
less a deliberate manipulation. Often a leak has involved inadvertently redistributing BGP routes
from one of a stub network’s ISPs to the other. Another problem in the past has
involved redistributing BGP routes into an internal routing protocol and back to BGP.
Where AS-path filters can be useful is to guard against an egregious leak. For instance an ISP
would not expect ASNs belonging to known large ISPs to show up in the AS_PATH of updates
from an enterprise-type customer network. Applying an AS-path filter to such a BGP session
could act as a second line of defense to the specific prefix-list filter. Similarly, if there are
networks which the ISP has non-transit relationships with, applying a similar AS-path filter to
those sessions (which wouldn’t be candidates for prefix filters) could help guard against a leak
resulting in an unintended transit path.
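A sketch of this second line of defense is shown below (Python; the ASNs used for the customer and the "known large ISPs" are invented placeholders):

    # Illustrative AS-path sanity check for an enterprise-customer BGP session.
    LARGE_ISP_ASNS = {65001, 65002, 65003}  # placeholder ASNs we never expect behind this customer

    def accept_from_enterprise_customer(as_path):
        # Reject updates whose AS_PATH contains a known large ISP; anything that
        # passes still has to clear the prefix-list filter.
        return not (set(as_path) & LARGE_ISP_ASNS)

    print(accept_from_enterprise_customer([64512]))         # True: the customer's own ASN only
    print(accept_from_enterprise_customer([64512, 65002]))  # False: looks like a leak of a large ISP's routes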
5.2.1.3.1 AS-Path filtering limitations
Maintaining such a list of “known” networks which aren’t expected to show up in transit
adjacencies can be fairly manual, incomplete and error-prone. Again, applying a filter which
merely validates that the neighbor AS is in the path is useless, since that is the expected state.
5.2.1.4 Maximum-prefix cut-off threshold
Many router feature-sets include the ability to limit the number of prefixes that are accepted
from a neighbor via BGP advertisements. When the overall limit is exceeded, the BGP session
is torn down on the presumption that this situation is a dangerous error condition. Typically, a
lower threshold can also be set at which a warning notification (e.g. a log message) to the Operations staff
3 https://tools.ietf.org/html/rfc2725
is issued. This way, a gradual increase in the number of advertisements will trigger a sensible
manual increase in the cut-off threshold without causing an outage.
This tool can be used to guard against the most egregious leaks which can, if the numbers are
large enough, exhaust the routing table memory on the recipient’s routers and/or otherwise cause
widespread network instability.
Typical deployments will set the threshold based on the current observed number of
advertisements within different bands; for instance, 1-100, 100-1,000, 1,000-5,000, 5,000-10,000,
10,000-50,000, 50,000-100,000, 100,000-150,000, 150,000-200,000, 200,000-250,000, and
250,000-300,000.
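One way to express this band-based choice of a cut-off is sketched below (Python; the band boundaries mirror the list above, while the 80% warning margin is an assumption for the example):

    # Illustrative selection of a maximum-prefix cut-off from the observed prefix count.
    BAND_LIMITS = [100, 1_000, 5_000, 10_000, 50_000, 100_000,
                   150_000, 200_000, 250_000, 300_000]

    def maximum_prefix_settings(observed_count):
        limit = next((b for b in BAND_LIMITS if observed_count <= b), BAND_LIMITS[-1])
        warning_at = int(limit * 0.8)  # warn well before the session would be torn down
        return limit, warning_at

    print(maximum_prefix_settings(740))     # (1000, 800)
    print(maximum_prefix_settings(12_500))  # (50000, 40000)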
5.2.1.4.1 Maximum-prefix limitations
When the threshold is exceeded, the session is shut down and manual intervention is required to
bring it back up. In the case where a network has multiple interconnection points to another
network (thus multiple BGP neighbors), all sessions will typically go down at the same time,
assuming all are announcing the same number of prefixes. In this case, all connectivity between
the two networks may be lost during this period. Obviously this measure is
an attempt to balance two different undesirable outcomes and so must be weighed judiciously.
Above 10,000 or perhaps 50,000 prefixes (e.g., a full Internet routing table from a transit provider),
applying maximum-prefix thresholds provides limited protection. A small number of neighbors
each advertising a unique set of 300,000 routes would fill the memory of the receiving router
anyway. However if these neighbors are all advertising a large portion of the Internet routes,
with many routes overlapping, then the limit offers some protection.
5.2.1.5 Monitoring
Aside from a proactive filtering approach, a network operator can use various vantage points
external to their own network (e.g., “route servers” or “looking glasses”) to monitor the prefixes
for which they have authority, watching for competing announcements which may have entered
the BGP system. Some tools, such as BGPmon, have been devised to automate such monitoring.
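As a toy illustration of what such monitoring does, the sketch below (Python; the observed announcements and authorized origins are invented) compares the origin AS seen for each covered prefix against the origin the operator authorized:

    # Toy route-origin monitor; observed data and authorized origins are invented.
    import ipaddress

    authorized = {ipaddress.ip_network("192.0.2.0/24"): 65000}

    # Announcements as seen from external vantage points (route servers, looking glasses).
    observed = [
        (ipaddress.ip_network("192.0.2.0/24"), 65000),
        (ipaddress.ip_network("192.0.2.0/25"), 65100),  # a more-specific from an unexpected origin
    ]

    def alerts(observed_routes, authorized_origins):
        found = []
        for prefix, origin in observed_routes:
            for auth_prefix, auth_origin in authorized_origins.items():
                if prefix.subnet_of(auth_prefix) and origin != auth_origin:
                    found.append(f"ALERT: {prefix} seen with origin AS{origin}, expected AS{auth_origin}")
        return found

    for line in alerts(observed, authorized):
        print(line)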
5.2.1.5.1 Monitoring Limitations
Obviously, this approach is reactive rather than proactive and steps would then need to be taken
to contact the offending AS and/or intermediate AS(es) to stop the advertisement and/or
propagation of the misinformation. Also, the number of such vantage points is limited so a
locally impacting bogus route may or may not be detected with this method.
5.2.2 BGP Injection and Propagation Recommendations
The most common router software implementations of BGP do not perform filtering of route
advertisements, either inbound or outbound, by default. While this situation eases the burden of
configuration on network operators (the customers of the router vendors), it has also caused the
majority of unintentional inter-domain routing problems to date. Thus it is recommended that
network operators of all sizes take extra care in configuration of BGP sessions to keep
unintentional routes from being injected and propagated.
Stub network operators should configure their outbound sessions to only explicitly allow the
prefixes which they expect to be advertising over a particular session.
ISPs should explicitly filter their inbound sessions at the boundary with their “customer edge”.
The inter-provider connections between large ISPs are impractical locations for filtering given
the requirement for significant dynamism in BGP routing and traffic-engineering across the
global Internet. However, the cumulative gains accrued when each ISP filters at this “customer
edge” are significant enough to lessen the residual risk of not filtering on these “non-customer”
BGP sessions.
ISPs (and even stub networks) should also consider using AS-path filters and maximum-prefix
limits on sessions as a second line of defense to guard against leaks or other pathological
conditions.
5.3 Other Attacks and Vulnerabilities of Routing Infrastructure
There are many vulnerabilities and attack vectors that can be used to disrupt the routing
infrastructure of an ISP outside of the BGP protocol and routing-specific operations. These are
just as important to address as the issues the working group has identified within the routing space
itself.
The largest attack surface for routing infrastructure likely lies within the standard operational
security paradigm that applies to any critical networked asset. Therefore the working group
looked at including BCPs relating to network and operational security as part of addressing these
issues, and ISPs should be aware that they are likely to see attacks against their routing
infrastructure based on these “traditional” methods of computer and network intrusion.
5.3.1 Hacking and unauthorized 3rd party access to routing infrastructure
ISPs and all organizations with an Internet presence face the ever-present risk of hacking and
other unauthorized access attempts on their infrastructure from various actors, both on and off
network. This was already identified as a key risk for ISPs, and CSRIC 2A – Cyber Security
Best Practices was published in March 2011 to provide advice to address these types of attacks
and other risks for any ISP infrastructure elements, including routing infrastructure. The current
CSRIC III has added a new Working Group 11 that will report out an update to prior CSRIC
work in light of recent advancements in cybersecurity practices and a desire of several US
government agencies to adopt consensus guidelines to protect government and critical
infrastructure computers and networks.
A recent SANS publication, Twenty Critical Security Controls for Effective Cyber Defense:
Consensus Audit Guidelines (CAG)4, lays out these principles and maps them against prior
work, including another relevant document, NIST SP-800-53 Recommended Security Controls
for Federal Information Systems and Organizations.5 The SANS publication appears to be a
primary driver for Working Group 11’s work. The entire document is available for review, and
we have included the 20 topic areas here for reference:
4 http://www.sans.org/critical-security-controls/
5 http://csrc.nist.gov/publications/PubsSPs.html
Critical Control 1: Inventory of Authorized and Unauthorized Devices
Critical Control 2: Inventory of Authorized and Unauthorized Software
Critical Control 3: Secure Configurations for Hardware and Software on Laptops,
Workstations, and Servers
Critical Control 4: Continuous Vulnerability Assessment and Remediation
Critical Control 5: Malware Defenses
Critical Control 6: Application Software Security
Critical Control 7: Wireless Device Control
Critical Control 8: Data Recovery Capability
Critical Control 9: Security Skills Assessment and Appropriate Training to Fill Gaps
Critical Control 10: Secure Configurations for Network Devices such as Firewalls, Routers, and
Switches
Critical Control 11: Limitation and Control of Network Ports, Protocols, and Services
Critical Control 12: Controlled Use of Administrative Privileges
Critical Control 13: Boundary Defense
Critical Control 14: Maintenance, Monitoring, and Analysis of Audit Logs
Critical Control 15: Controlled Access Based on the Need to Know
Critical Control 16: Account Monitoring and Control
Critical Control 17: Data Loss Prevention
Critical Control 18: Incident Response Capability
Critical Control 19: Secure Network Engineering
Critical Control 20: Penetration Tests and Red Team Exercises
Because this work is being analyzed directly by Working Group 11 to address the generic risk to
ISPs of various hacking and unauthorized access issues, Working Group 4 will not be
commenting in depth in this area, and refers readers to reports from Working Group 11 for
comprehensive and updated coverage of these risks when that group issues its report. We will
comment upon current BCPs for ISPs to look to adopt in the interim, and provide further
background around risks unique to running BGP servers/routers in this area.
An ISP’s routing infrastructure is an important asset to protect, as gaining control of it can lead
to a wide variety of harms to ISP customers. Further, an ISP’s staff computers, servers, and
networking infrastructure also rely upon their own routers to correctly direct traffic to its
intended destinations. The ISP’s own sensitive data and processes could be compromised via
hacked routers/servers. Thus routers should be included on the list of network assets that are
assigned the highest level of priority for protection under any type of ISP security program.
There are many industry standard publications pertaining to overall cybersecurity best practices
available for adoption by ISPs or any organization at risk of attack, including prior CSRIC
reports. It is incumbent upon ISPs to maintain their overall security posture and be up-to-date
on the latest industry BCPs and adopt the practices applicable to their organization. Of
particular note is the IETF’s RFC 4778 - Current Operational Security Practices in Internet
Service Provider Environments6 which offers a comprehensive survey of ISP security practices.
An older IETF publication, but still an active BCP, that applies to ISP environments is BCP 46,
aka RFC 3013 Recommended Internet Service Provider Security Services
and Procedures7. NIST also puts out highly applicable advice and BCPs for running
6 http://www.ietf.org/rfc/rfc4778.txt
7 http://www.apps.ietf.org/rfc/rfc3013.txt
government networks, with the most currently relevant special report, NIST SP-800-53.
The ultimate goal of someone attempting unauthorized access to routing infrastructure would be
either to deny customer use of those servers or, more likely, to insert false entries within the router
to misdirect the users of those routers. This is functionally equivalent to the route injection and
propagation attacks already described in section 5.2. So the analysis and recommendations
presented in section 5.2.1.5 with respect to monitoring for and reacting to route injection and
propagation attacks apply in the scenario where an attacker has breached a router to add
incorrect entries.
5.3.1.1 Recommendations
1) ISPs should refer to and implement the practices found in CSRIC 2A – Cyber Security
Best Practices that apply to securing servers and ensure that routing infrastructure is
protected.
2) ISPs should adopt applicable BCPs found in other relevant network security industry
approved/adopted publications, and monitor for applicable documents and updates. Three
documents were identified that currently apply to protecting ISP networks: IETF RFC
4778; BCP 46 (RFC 3013); and the NIST special publications series, NIST SP-800-53.
3) ISPs should ensure that methods exist within the ISP’s operations to respond to detected
or reported successful route injection and propagation attacks, so that such entries can
be rapidly remediated.
4) ISPs should consider implementing routing-specific monitoring regimes to assess the
integrity of data being reported by the ISP’s routers that meet the particular operational
and infrastructure environments of the ISP.
5.3.2 ISP insiders inserting false entries into routers
While insider threats can be considered a subset of the more general security threat of
unauthorized access and hacking, they deserve special attention in the realm of routing security.
ISP insiders have unparalleled access to any systems run by an ISP, and in the case of routers,
the ability to modify entries is both trivially easy and potentially difficult to detect. Since routers
don’t typically have company-sensitive information, are accessed by thousands of machines
continuously, and are not usually hardened or monitored like other critical servers, it is relatively
easy for an insider to alter a router’s configuration in a way that adversely affects routing.
The analysis and recommendations for this particular threat do not differ significantly from
those presented in Section 5.3.1 of this report - Hacking and unauthorized 3rd party access to
routing infrastructure. However, it is worth paying special attention to this particular exposure
given the liabilities an ISP may be exposed to from such difficult-to-detect activities of its own
employees.
5.3.2.1 Recommendations
1) Refer to section 5.3.1.1 for generic hacking threats.
5.3.3 Denial-of-Service Attacks against ISP Infrastructure
Denial-of-Service (DoS) and Distributed Denial-of-Service (DDoS) are some of the oldest and
most prolific attacks that ISPs have faced over the years and continue to defend against today.
Typically, an external actor who is targeting some Internet presence or infrastructure to make it
unusable is behind such attacks. However, DoS/DDoS attacks come in many flavors that can be
broadly lumped into two primary categories: logic attacks and resource exhaustion/flooding
attacks.8
Logic attacks exploit vulnerabilities to cause a server or service to crash or reduce
performance below usable thresholds. Resource exhaustion or flooding attacks cause server or
network resources to be consumed to the point where the targeted service no longer responds or
service is reduced to the point it is operationally unacceptable. We will examine the latter type
of attack, resource exhaustion, in this section. Logic attacks are largely directed at
breaking services/servers and can be largely addressed with the analysis and recommendations
described above with respect to BGP-specific issues and also put forward in section 5.3.1, which
cover protecting networked assets from various hacking and other attacks.
There is a large variety of flooding attacks that an ISP could face in daily operations. These can
be targeted at networks or any server, machine, router, or even user of an ISP’s network. From
the perspective of routing operations, it is helpful to differentiate between “generic” DoS attacks
that could affect any server, and those that exploit some characteristic of BGP that can be
utilized to affect routers in particular, which have already been covered.
Due to the long history, huge potential impact, and widespread use of various DoS and DDoS
attacks, there is an abundance of materials, services, techniques and BCPs available for dealing
with these attacks. ISPs will likely have some practices in place for dealing both with attacks
originating from their networks and with attacks directed at their networks and impacting their
services. The IETF’s RFC 4732 Internet Denial-of-Service Considerations9 provides an ISP
with a thorough overview of DoS/DDoS attacks and mitigation strategies and provides a solid
foundational document. The SANS Institute has published a useful document for ISPs that is
another reference document of BCPs against DoS/DDoS attacks entitled A Summary of
DoS/DDoS Prevention, Monitoring and Mitigation Techniques in a Service Provider
Environment10.
As mentioned in section 5.3.1, there are several documents that cover general ISP security
concerns, and those typically include prescriptive advice for protecting a network against
DoS/DDoS attacks. Such advice can be found in previously cited documents including prior
CSRIC reports: CSRIC 2A – Cyber Security Best Practices11, the IETF’s RFC 4778 - Current
Operational Security Practices in Internet Service Provider Environments12, BCP 46, RFC 3013
Recommended Internet Service Provider Security Services and Procedures13 and NIST’s special
report, NIST SP-800-53.
For the most part, an ISP’s routers for interdomain routing must be publicly available in order
for the networks they serve to be reachable across the Internet. Thus measures to restrict access
8 http://static.usenix.org/publications/library/proceedings/sec01/moore/moore.pdf
9 http://tools.ietf.org/rfc/rfc4732.txt
10 http://www.sans.org/reading_room/whitepapers/intrusion/summary-dos-ddos-preventionmonitoring-mitigation-techniques-service-provider-enviro_1212
11 http://www.fcc.gov/pshs/docs/csric/WG2A-Cyber-Security-Best-Practices-Final-Report.pdf
12 http://www.ietf.org/rfc/rfc4778.txt
13 http://www.apps.ietf.org/rfc/rfc3013.txt
that can be implemented for an ISP’s internal infrastructure are unavailable as options for these
connecting routers. This leaves an ISP with limited choices for DDoS protection, including the
traditional approaches of overprovisioning of equipment and bandwidth and various DoS/DDoS
protection services and techniques.
5.3.3.1 Recommendations
1) ISPs should implement BCPs and recommendations for securing an ISP’s infrastructure
against DoS/DDoS attacks that are enumerated in the IETF’s RFC 4732 Internet
Denial-of-Service Considerations and consider implementing BCPs enumerated in the
SANS Institute reference document of BCPs against DoS/DDoS attacks entitled A
Summary of DoS/DDoS Prevention, Monitoring and Mitigation Techniques in a Service
Provider Environment.
2) ISPs should refer to and implement the BCPs related to DoS/DDoS protection found in
CSRIC 2A – Cyber Security Best Practices that apply to protecting servers from
DoS/DDoS attacks.
3) ISPs should consider adopting BCPs found in other relevant network security industry
approved/adopted publications that pertain to DoS/DDoS issues, and monitor for
applicable documents and updates. Four that currently apply to protecting ISP networks
from DoS/DDoS threats are IETF RFC 4778 and BCP 46 (RFC 3013); NIST special
publications series: NIST SP-800-53; and ISOC Publication Towards Improving DNS
Security, Stability, and Resiliency.
4) ISPs should review and apply BCPs for protecting network assets against DoS/DDoS
attacks carefully to ensure they are appropriate to protect routing infrastructure.
5.3.4 Attacks against administrative controls of routing identifiers
Blocks of IP space and Autonomous System Numbers (ASNs) are allocated by various
registries around the world. Each of these Regional Internet Registries (RIRs) is provided IP
space and ASN allocation blocks by IANA, to manage under its own rules and practices. In turn,
several of these registries allow country or other region/use-specific registries to sub-allocate
IP space based on their own rules, processes and systems. Each RIR maintains a
centralized “whois” database that designates the “owner” of IP spaces or ASNs within its
remit. Access to the databases that control these designations, and thus the “rights” to use a
particular space or ASN, is provided and managed by the RIRs and sub-RIR registries,
depending upon the region. Processes for authentication and management of these identifier
resources are not standardized and, until recently, were relatively unsophisticated and insecure.
This presents an administrative attack vector allowing a miscreant to use a variety of account
attack methods, from hacking to password guessing to social engineering and more that could
allow them to assume control over an ASN or IP space allocation. In other fields, such attacks
would be considered “hijacking” or “account take-over” attacks, but the use of the word
“hijacking” in the BGP space to include various injection and origin announcements complicates
the common taxonomy. Thus for this section, we will refer to account “hijacking” as “account
take-over”.
The primary concern for most ASN and prefix block owners and the ISPs that service them in
such scenarios is the take-over of active space they are using. A miscreant could literally “take
over" IP space being routed and used by the victim, much like an origin attack as described in
section 5.2. In this case, it would be equivalent to a full take-over, with the majority of the
global routing system recognizing the miscreant’s announcement as being the new “legitimate”
one, with all the inherent risks previously described. The real owner will have to prove their
legitimacy and actual legal ownership/control of the resource that has been taken over.
Depending upon the authentication scheme the registry uses, this can prove difficult, especially
for legacy space and older ASN registrations.
A corollary of this attack scenario is a miscreant taking over “dormant” IP space or an unused
ASN, and thus “squatting” in unused territory14. While not impactful on existing Internet
presence, squatting on IP space can lead to many forms of abuse, including the announcement of
bogus peering arrangements, if the squatted resource is an ASN.
In a take-over scenario, a miscreant typically impersonates or compromises the registrant of the
ASN and/or IP space in order to gain access to the management account for that ASN or CIDR
block. Until recently, nearly all RIRs and registries used an e-mail authentication scheme to
manage registrant change requests. Thus, if the registrant’s e-mail address uses an available
domain name, the miscreant can register the domain name, recreate the administration email
address, and authenticate himself with the registry. If the domain isn’t available, the criminal
could still try to hijack the domain name registration account to gain control of that same
domain. If a registry or RIR requires more verification for registrant account management, the
criminal can use various social engineering tricks against the registry staff to get into the
management account.
Once a criminal has control of the registration account, they can update the information there to
allow them to move to a new peering ISP, create new announcements from their “new” space, or
launch any sort of BGP-type attack as listed above. Even more basically, the criminal can
simply utilize their new control of the ASN/prefix to have their own abusive infrastructure
announced on the Internet for whatever purpose they would like. This includes direct abuse
against the Internet in general (e.g. hosting malware controllers, phishing, on-line scams, etc.),
but also the ability to impersonate the original holder of the space they have taken over. Of
course they can also intercept traffic originally destined for the legitimate holder of the space as
previously described for various route-hijacking scenarios.
The end result of an administrative account take-over is likely to be similar to other injection
attacks against routing infrastructure as covered in section 5.2. Thus, ISPs will want to consult
BCPs covering techniques for monitoring and reacting to those types of attacks. These BCPs
cover the general effects of BGP origin attacks – dealing with service interruptions and the
worldwide impacts. ISPs typically do not have direct access to, or control of, RIR or other registry
account information that has been compromised in most hijacking attacks. The ISP is dependent
upon the affected registry to restore control of the ISP’s management account, or in the case of a
serious breach, the registrar/registry’s own services. Once control is re-established, the original,
correct information needs to be re-entered and published again. This will usually mean updating
14 For a full description of the taxonomy of hijacking, squatting, and spoofing attacks in routing
space, see Internet Address Hijacking, Spoofing and Squatting Attacks -
http://securityskeptic.typepad.com/the-security-skeptic/2011/06/internet-address-hijackingspoofing-and-squatting-attacks.html
routing table entries/BGP announcements, and fixing any account information that has been
modified.
The industry has largely been slow to adopt security measures to protect account access for
controlling ASN and CIDR block management that are found in other online services like
financial services, e-commerce, or even some ISP management systems. The industry also has
many participants across a wide variety of geographical regions, with few standards and
requirements for the security of registration systems, and very limited oversight. This means it
is often difficult to find support for typical online security tools like multi-factor authentication,
multi-channel authentication, and verification of high-value transactions. This has been
changing in recent years at the RIR level, with ARIN, RIPE, and several other IP registries at
various levels of authority implementing new controls and auditing account information.
Despite this, gaps exist, especially with “legacy” data entered many years ago before current
management systems and authentication processes were implemented.
While there is scant guidance on this topic area for ASN/IP block management, ICANN’s
Security and Stability Advisory Committee (SSAC) has released two documents to address these
issues in the domain name space which is quite analogous to the provisioning of IP space.
These documents provide BCPs for avoiding and mitigating many of these issues. SAC 40
Measures to Protect Domain Registration Services Against Exploitation or Misuse15, addresses
issues faced by domain name registrars and offers numerous BCPs and recommendations for
securing a registrar against the techniques being used by domain name hijackers. Many of the
BCPs presented there would be applicable to RIRs and other IP address provisioning
authorities, including ISPs managing their own customers. SSAC 44, A Registrant's Guide to
Protecting Domain Name Registration Accounts16, provides advice to domain name registrants
to put in place to better protect their domains from hijacking. Similar techniques could be used
by operators to protect their own IP space allocations. Given the limited choices and practices
followed by various IP space allocators, ISPs need to carefully evaluate their security posture
and the practices of their RIRs or other IP space allocators with these BCPs in mind.
5.3.4.1 Recommendations
1) ISPs and their customers should refer to the BCPs and recommendations found in SSAC
44 A Registrant's Guide to Protecting Domain Name Registration Accounts with respect
to managing the ASNs and IP spaces they register and use to provide services.
2) ISPs should review the BCPs and recommendations found in SAC 40 Measures to
Protect Domain Registration Services Against Exploitation or Misuse to provide similar
protections for IP space they allocate to their own customers.
6 Conclusions
Working Group 4 has recommended the adoption of numerous best practices for protecting the
inter-domain BGP routing system. As a distributed infrastructure requiring several actors to both
enable and protect it, network operators face challenges outside of their direct control in tackling
many of the issues identified. The more widely Best Current Practices are utilized, the more
robust the whole system will be to both bad actors and simple mistakes. See Appendix 7.3 for a
15 http://www.icann.org/en/groups/ssac/documents/sac-040-en.pdf
16 http://www.icann.org/en/committees/security/sac044.pdf
tabular display of risks indexed with the appropriate countermeasures as discussed in the body
text of the document.
7 Appendix
7.1 Background
Note that in order to remain consistent with other CSRIC III reports, considerable portions of
“Salient Features of BGP Operation” have been taken verbatim from the Appendix section of
the CSRIC III, Working Group 6 Interim Report published March 8, 2012. Other parts have
been taken from the NIST (National Institute of Standards and Technologies) report entitled
Border Gateway Protocol Security.
7.1.1 Salient Features of BGP Operation
This section is intended for non-experts who have a need to understand the origins of BGP
security problems.
Although unknown to most users, the Border Gateway Protocol (BGP) is critical to keeping the
Internet running. BGP is a routing protocol, which means that it is used to update routing
information between major systems. BGP is in fact the primary inter-domain routing protocol,
and has been in use since the commercialization of the Internet. Because systems connected to
the Internet change constantly, the most efficient paths between systems must be updated on a
regular basis. Otherwise, communications would quickly slow down or stop. Without BGP,
email, Web page transmissions, and other Internet communications would not reach their
intended destinations. Securing BGP against attacks by intruders is thus critical to keeping the
Internet running smoothly.
Many organizations do not need to operate BGP routers because they use Internet service
providers (ISPs) that take care of these management functions. But larger organizations with
large networks have routers that run BGP and other routing protocols. The collection of routers,
computers, and other components within a single administrative domain is known as an
autonomous system (AS). An ISP typically represents a single AS. In some cases, corporate
networks tied to the ISP may also be part of the ISP’s AS, even though some aspects of their
administration are not under the control of the ISP.
Participating in the global BGP routing infrastructure gives an organization some control over
the path traffic traverses to and from its IP addresses (Internet destinations). To participate in the
global BGP routing infrastructure, an organization needs:
• Assigned IP addresses, grouped into IP network addresses (aka prefixes) for routing.
• A unique integer identifier called an Autonomous System Number (ASN).
• A BGP router ready to connect to a neighbor BGP router on an Internet Service
Provider’s network (or another already connected AS) that is willing to establish a BGP
session and exchange routing information and packet traffic with the joining
organization.
The basic operation of BGP is remarkably simple – each BGP-speaking router can relay
messages to its neighbors about routes to network addresses (prefixes) that it already knows,
either because it “owns” these prefixes, or it already learned routes to them from another
neighbor. As part of traveling from one border router to another, a BGP route announcement
incrementally collects information about the ASes that the route “update” traversed in an
attribute called AS_PATH. Therefore, every BGP route is constructed hop-by-hop according to
local routing policies in each AS. This property of BGP is a source of its flexibility in serving
diverse business needs, and also a source of vulnerabilities.
The operators of BGP routers can configure routing policy rules that determine which received
routes will be rejected, which will be accepted, and which will be propagated further – possibly
with modified attributes, and can specify which prefixes will be advertised as allocated to, or
reachable through, the router’s AS. In contrast to the simplicity of the basic operation of BGP, a
routing policy installed in a BGP router can be very complex. A BGP router can have very
extensive capabilities for manipulating and transforming routes to implement the policy, and
such capabilities are not standardized, but instead, are largely dictated by AS interconnection
and business relationships. A route received from a neighbor can be transformed before a
decision is made to accept or reject the route, and can be transformed again before the route is
relayed to other neighbors; or, the route may not be disseminated at all.
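The hop-by-hop construction described above can be illustrated very simply (Python; the topology and the policy hooks are invented): each AS that relays an update applies its local policy and then prepends its own ASN to the AS_PATH before re-advertising.

    # Toy illustration of AS_PATH accumulation as an update propagates; topology invented.

    def relay(update, local_asn, accept=lambda u: True, transform=lambda u: u):
        # Apply local policy; if accepted, prepend the local ASN and re-advertise.
        if update is None or not accept(update):
            return None
        update = transform(dict(update))
        update["as_path"] = [local_asn] + update["as_path"]
        return update

    origin = {"prefix": "192.0.2.0/24", "as_path": [65000]}  # originated by AS65000
    at_65001 = relay(origin, 65001)     # what AS65001 advertises onward
    at_65002 = relay(at_65001, 65002)   # what AS65002 advertises onward
    print(at_65002["as_path"])          # [65002, 65001, 65000] -- most recent AS first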
All this works quite well most of the time – largely because of certain historically motivated
trust and established communication channels among human operators of the global BGP
routing system. This is the trust that a route received from a neighbor accurately describes a path
to a prefix legitimately reachable through the neighboring ASes' networks, and its attributes have
not been tampered with. Notwithstanding the above, the “trust but verify” rule applies: Best
Current Practices recommend filtering the routes received from neighbors. While this can be
done correctly for well-known direct customers, currently there is no validated repository of the
“ground truth” allowing for correct filtering of routes to all networks in the world.
Now observe that the BGP protocol itself provides a perfect mechanism for spreading
malformed or maliciously constructed routes, unless the BGP players are vigilant in filtering
them out from further propagation. However, adequate route filtering may not be in place, and
from time to time a malicious or inadvertent router configuration change creates a BGP security
incident: malformed or maliciously constructed routing messages will propagate from one AS to
another simply by exploiting legitimate route propagation rules, and occasionally can spread to
virtually all BGP routers in the world. Because some BGP-speaking routers advertise all local
BGP routes to all external BGP peers by default, another example that commonly occurs
involves a downstream customer of two or more upstream ASes advertising routes learned from one
upstream ISP to another ISP – both the customer and the ISPs should put controls in place to
scope the propagation of all routes to those explicitly allocated to the customer AS, but this is
difficult given the lack of “ground truth”. The resulting routing distortions can cause very severe
Internet service disruptions, in particular effective disconnection of victim networks or third
parties from parts or all of the Internet, or forcing traffic through networks that shouldn’t carry it,
potentially opening higher-level Internet transactions up to packet snooping or man-in-the-middle
attacks.
7.1.2 Review of Router Operations
In a small local area network (LAN), data packets are sent across the wire, typically using
Ethernet hardware, and all hosts on the network see the transmitted packets. Packets addressed
to a host are received and processed, while all others are ignored. Once networks grow beyond a
few hosts, though, communication must occur in a more organized manner. Routers perform the
task of communicating packets among individual LANs or larger networks of hosts.
To make internetworking possible, routers must accomplish these primary functions:
• Parsing address information in received packets
• Forwarding packets to other parts of the network, sometimes filtering out packets that
should not be forwarded
• Maintaining tables of address information for routing packets.
BGP is used in updating routing tables, which are essential in assuring the correct operation of
networks. BGP is a dynamic routing scheme—it updates routing information based on packets
that are continually exchanged between BGP routers on the Internet. Routing information
received from other BGP routers (often called “BGP speakers”) is accumulated in a routing
table. The routing process uses this routing information, plus local policy rules, to determine
routes to various network destinations. These routes are then installed in the router’s forwarding
table. The forwarding table is actually used in determining how to forward packets, although the
term routing table is often used to describe this function (particularly in documentation for home
networking routers).
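As a purely illustrative sketch of the forwarding-table lookup just described, the following Python fragment performs a longest-prefix match against a small table; the entries and next-hop names are hypothetical.

import ipaddress

forwarding_table = {
    "0.0.0.0/0":         "upstream-ISP",
    "198.51.100.0/24":   "customer-A",
    "198.51.100.128/25": "customer-A-datacenter",
}

def lookup(dst_ip: str) -> str:
    # Pick the most specific (longest-prefix) entry containing the destination.
    addr = ipaddress.ip_address(dst_ip)
    best = max((ipaddress.ip_network(p) for p in forwarding_table
                if addr in ipaddress.ip_network(p)),
               key=lambda net: net.prefixlen)
    return forwarding_table[str(best)]

print(lookup("198.51.100.200"))  # -> "customer-A-datacenter"
print(lookup("8.8.8.8"))         # -> "upstream-ISP"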
7.2 BGP Security Incidents and Vulnerabilities
In this section we classify the observed BGP security incidents, outline the known worst-case
scenarios, and attempt to tie the incidents to features of proposed solutions that could prevent
them. Many of the larger incidents are believed to have been the result of misconfigurations or
mistakes rather than intentional malice or criminal intent. It has long been suspected that more
frequent, less visible incidents have been happening without attracting much attention.
BGP security incidents usually originate in just one particular BGP router, or a group of related
BGP routers in an AS, by means of a configuration change that leads to
announcements of a peculiar route or routes that introduce new paths towards a given
destination, or that trigger bugs or other misbehaviors in neighboring routers in the course of
propagation.
There are no generally accepted criteria for labeling a routing incident as an “attack”, and – as
stressed in the recommendations – there is a lack of broadly accepted routing security metrics that could
automatically identify certain routing changes as “routing security violations”.
BGP security incidents that were observed to date can be classified as follows:
• Route origin hijacking (unauthorized announcements of routes to IP space not assigned
to the announcer). Such routing integrity violations may happen under various scenarios:
malicious activity, inadvertent misconfigurations (“fat fingers”), or errors in traffic
engineering. There are further sub-categories of such suspected security violations:
o Hijacking of unused IP space such as repetitive hijacks of routes to prefixes
within large IP blocks assigned to an entity such as the US government but
normally not routed on the public Internet. Temporarily using these “unused”
addresses enables criminal or antisocial activities (spam, network attacks) while
complicating efforts to detect and diagnose the perpetrators.
o Surgically targeted hijacks of specific routes and de-aggregation attacks on
specific IP addresses. They may be hard to identify unless anomaly detection is
unambiguous, or the victim is important enough to create a large commotion.
Examples: Pakistan Hijacks YouTube17 (advertisement of a more specific is
globally accepted, and totally black-holes the traffic to the victim). There may be
significantly more such attacks than publicly reported, as they may be difficult to
distinguish from legitimate traffic engineering or network re-engineering
activities.
o Unambiguous massive hijacks of many routes where many distinct legitimate
origin ASes are replaced by a new unauthorized origin AS advertising the
hijacked routes. Significant recent incidents include a 2010 “China's 18-minute
Mystery”18, or a hijacking of a very large portion of the Internet for several hours
by TTNet in 200419, or a 2006 ConEd incident20. Without knowing the
motivations of the implicated router administrators it is difficult to determine if
these and similar incidents were due to malicious intent, or to errors in
implementations of routing policy changes.
• Manipulation of AS_PATH attribute in transmitted BGP messages executed by
malicious, selfish, or erroneous policy configuration. The intention of such attacks is to
exploit BGP routers’ route selection algorithms dependent on AS_PATH properties, such
as immediate rejection of a route with the router’s own ASN in the AS_PATH
(mandated to prevent routing loops), or AS_PATH length. Alternatively, such attacks
may target software bugs in distinct BGP implementations (of which quite a few were
triggered in recent years with global impact).
o For routing incidents triggered by long AS_PATHs see House of Cards21,
AfNOG Takes Byte Out of Internet22, Longer is Not Always Better23 for actual
examples.
o Route leaks - A possibility of “man in the middle” (MITM) AS_PATH attacks
detouring traffic via a chosen AS was publicly demonstrated at DEFCON in
17 http://www.renesys.com/blog/2008/02/pakistan-hijacks-youtube-1.shtml
18 http://www.renesys.com/blog/2010/11/chinas-18-minute-mystery.shtml
19 Alin C. Popescu, Brian J. Premore, and Todd Underwood, Anatomy of a Leak: AS9121.
NANOG 34, May 16, 2005.
20 http://www.renesys.com/blog/2006/01/coned-steals-the-net.shtml
21 http://www.renesys.com/blog/2010/08/house-of-cards.shtml
22 http://www.renesys.com/blog/2009/05/byte-me.shtml
23 http://www.renesys.com/blog/2009/02/longer-is-not-better.shtml
200824. Two other similar incidents were found in a 7-month period surrounding
the DEFCON demo by mining of a BGP update repository conducted in 200925
but were not confirmed as malicious. This can occur either by accident, as
detailed above (sometimes referred to as route “leaks”), or may be
intentional. Additionally, such attacks may or may not attempt to obscure the
presence of additional ASes in the AS path, should they exist. These are
particularly problematic to identify as they require some knowledge of intent by
the resource holder and intermediate ASes.
o AS_PATH poisoning – sometimes used by operators to prevent their traffic
from reaching and/or transiting a selected AS, or to steer the traffic away from
certain paths. It is technically a violation of the BGP protocol and could be used
harmfully as well.
• Exploitations of router packet forwarding bugs, router performance degradation, bugs in
BGP update processing
o Example of a transient global meltdown caused by a router bug tickled by deaggregation26
and several other cases cited there.
There are also BGP vulnerabilities that may have not been exploited in the wild so far, but that
theoretically could do a lot of damage. The BGP protocol does not have solid mathematical
foundations, and certain bizarre behaviors – such as persistent route oscillations – are quite
possible.
There have been several RFCs and papers addressing BGP vulnerabilities in the context of
protocol standard specification and threat modeling, see the following Request For Comments
(RFCs):
• RFC 4272 “BGP Security Vulnerabilities Analysis” S. Murphy, Jan 2006.
• RFC 4593 “Generic Threats to Routing Protocols”, A. Barbir, S. Murphy and Y. Yang,
Oct 2006.
• Internet draft draft-foo-sidr-simple-leak-attack-bgpsec-no-help-01 “Route Leak Attacks
Against BGPSEC”, D. McPherson and S. Amante, Nov 2011.
• Internet draft draft-ietf-sidr-bgpsec-threats-01 “Threat Model for BGP Path Security”, S.
Kent and A. Chi, Feb 2012.
24 A. Pilosov and T. Kapela, Stealing the internet, DEFCON 16 August 10, 2008
25C. Hepner and E. Zmijewski, Defending against BGP Man-in-the-Middle attacks, Black Hat
DC February 2009
26 J. Cowie, The Curious Incident of 7 November 2011, NANOG 54, February 7, 2012
7.3 BGP Risks Matrix
BGP Routing Security Risks Examined by WG 4
(Risks, report section, and recommendations, listed by network operator role)

Stub network (e.g. Enterprise, Data Center)
• Session-level threats (Sect. 5.1.1): Consider MD5 or GTSM if neighbor recommends it
• DoS (routers and routing info) (Sect. 5.1.2): Control-Plane Policing (rate-limiting); keep up-to-date router software
• Spoofed Source IP Addresses (Sect. 5.1.3): Use uRPF (unicast Reverse Path Forwarding) in strict mode or other similar features at access edge of network (e.g. datacenter or campus); filter source IP address on packets at network edge to ISPs
• Incorrect route injection and propagation (Sect. 5.2.1): Keep current information in “whois” and IRR (Internet Routing Registry) databases; outbound prefix filtering; use monitoring services to check for incorrect routing announcements and/or propagation
• Other Attacks (e.g., hacking, insider, social engineering) (Sect. 5.3): Consider many recommendations about operational security processes

Internet Service Provider Network
• Session-level threats (Sect. 5.1.1): Consider a plan to use MD5 or GTSM including flexibility to adjust to different deployment scenario specifics
• DoS (routers and routing info) (Sect. 5.1.2): Control-Plane Policing (rate-limiting); keep up-to-date router software
• Spoofed Source IP Addresses (Sect. 5.1.3): Use uRPF (unicast Reverse Path Forwarding) in strict or loose mode as appropriate (e.g. strict mode at network ingress such as data-center or subscriber edge, loose mode at inter-provider border)
• Incorrect route injection and propagation (Sect. 5.2.1): Keep current information in “whois” and IRR (Internet Routing Registry) databases; consult current information in “whois” and IRR databases when provisioning or updating customer routing; implement inbound prefix filtering from customers; consider AS-path filters and maximum-prefix limits as second line of defense; use monitoring services to check for incorrect routing announcements and/or propagation
• Other Attacks (e.g., hacking, insider, social engineering) (Sect. 5.3): Consider many recommendations about operational security processes
7.4 BGP BCP Document References
Network Protection Documents
NIST Special Publication 800-54 Border Gateway Protocol (BGP) Security Recommendations
WG2A - Cyber Security Best Practices
SANS: Twenty Critical Security Controls for Effective Cyber Defense: Consensus Audit
Guidelines (CAG)
NIST Special Publication 800-53 Recommended Security Controls for Federal Information
Systems and Organizations
IETF RFC 4778 - Current Operational Security Practices in Internet Service Provider
Environments
IETF RFC 3013 Recommended Internet Service Provider Security Services and Procedures
Source Address verification/filtering
IETF BCP38/RFC 2827 Network Ingress Filtering: Defeating Denial of Service Attacks which
employ IP Source Address Spoofing
BCP 84/RFC 3704 Ingress Filtering for Multihomed Networks
BCP 140/RFC 5358 Preventing Use of Recursive Nameservers in Reflector Attacks
ICANN SAC004 Securing the Edge
DoS/DDoS Considerations
IETF RFC 4732 Internet Denial-of-Service Considerations
SANS A Summary of DoS/DDoS Prevention, Monitoring and Mitigation Techniques in a
Service Provider Environment
Scalable Dynamic Nonparametric Bayesian Models of Content and Users ∗
Amr Ahmed1 Eric Xing2
1Research @ Google , 2Carnegie Mellon University
amra@google.com, epxing@cs.cmu.edu
Abstract
Online content has become an important medium
to disseminate information and express opinions.
With its proliferation, users are faced with the
problem of missing the big picture in a sea of
irrelevant and/or diverse content. In this paper,
we address the problem of information organization
of online document collections, and provide
algorithms that create a structured representation
of the otherwise unstructured content. We
leverage the expressiveness of latent probabilistic
models (e.g., topic models) and non-parametric
Bayes techniques (e.g., Dirichlet processes), and
give online and distributed inference algorithms
that scale to terabyte datasets and adapt the inferred
representation with the arrival of new documents.
This paper is an extended abstract of the 2012
ACM SIGKDD best doctoral dissertation award of
Ahmed [2011].
1 Introduction
Our online infosphere is evolving at an astonishing rate. It
is reported that there are 50 million scientific journal articles
published thus far [Jinha, 2010], 126 million blogs 1
, an average
of one news story published per second, and around 500
million tweets per day. With the proliferation of such content,
users are faced with the problem of missing the big picture
in a sea of irrelevant and/or diverse content. Thus several
unsupervised techniques were proposed to build a structured
representation of users and content.
Traditionally, clustering is used as a popular unsupervised
technique to explore and visualize a document collection.
When applied in document modeling, it assumes that each
document is generated from a single component (cluster or
topic) and that each cluster is a uni-gram distribution over
a given vocabulary. This assumption limits the expressive
power of the model, and does not allow for modeling documents
as a mixture of topics.
∗The dissertation on which this extended abstract is based was
the recipient of the 2012 ACM SIGKDD best doctoral dissertation
award, [Ahmed, 2011].
1
http://www.blogpulse.com/
Recently, mixed membership models [Erosheva et al.,
2004], also known as admixture models, have been proposed
to remedy the aforementioned deficiency of mixture models.
Statistically, an object wd is said to be derived from an admixture
if it consists of a bag of elements, say {wd1, . . . , wdN },
each sampled independently or coupled in some way, from
a mixture model, according to an admixing coefficient vector
θ, which represents the (normalized) fraction of contribution
from each of the mixture components to the object being
modeled. In a typical text modeling setting, each document
corresponds to an object, the words thereof correspond to the
elements constituting the object, and the document-specific
admixing coefficient vector is often known as a topic vector
and the model is known as latent Dirichlet allocation (LDA)
model due to the choice of a Dirichlet distribution as the prior
for the topic vector θ [Blei et al., 2003].
Notwithstanding these developments, existing models cannot
faithfully model the dynamic nature of online content,
represent multiple facets of the same topic, or scale to the
size of the data on the internet. In this paper, we highlight
several techniques to build a structured representation
of content and users. First we present a flexible dynamic
non-parametric Bayesian process called the Recurrent Chinese
Restaurant Process for modeling longitudinal data and
then present several applications in modeling scientific publication,
social media and tracking of user interests.
2 Recurrent Chinese Restaurant Process
Standard clustering techniques assume that the number of
clusters is known a priori or can be determined using cross
validation. Alternatively, one can consider non-parametric
techniques that adapt the number of clusters as new data arrives.
The power of non-parametric techniques is not limited
to model selection, but they endow the designer with necessary
tools to specify priors over sophisticated (possibly infinite)
structures like trees, and provide a principled way of
learning these structures from data. A key non-parametric
distribution is the Dirichlet process (DP). DP is a distribution
over distributions [Ferguson, 1973]. A DP denoted by
DP(G0, α) is parameterized by a base measure G0 and a
concentration parameter α. We write G ∼ DP(G0, α) for
a draw of a distribution G from the Dirichlet process. G itself
is a distribution over a given parameter space θ, therefore
we can draw parameters θ1:N from G.

[Figure 1: Left: the NIPS conference timeline as discovered by the iDTM. Right: the evolution of the topic "Kernel Methods".]

Integrating out G, the parameters θ follow a Polya urn distribution [Blackwell and
MacQueen, 1973], also known as the Chinese restaurant process
(CRP), in which the previously drawn values of θ have
strictly positive probability of being redrawn again, thus making
the underlying probability measure G discrete with probability
one. More formally,

θi | θ1:i−1, G0, α ∼ Σ_k [mk / (i − 1 + α)] δ(φk) + [α / (i − 1 + α)] G0.   (1)
where φ1:k denotes the distinct values among the parameters
θ, and mk is the number of parameters θ having value φk. By
using the DP at the top of a hierarchical model, one obtains
the Dirichlet process mixture model, DPM [Antoniak, 1974].
The generative process thus proceeds as follows:
G|α, G0 ∼ DP(α, G0), θd|G ∼ G, wd|θd ∼ F(.|θd), (2)
where F is a given likelihood function parameterized by θ.
Dirichlet process mixture (or CRP) models provide a flexible
Bayesian framework; however, the full exchangeability assumption
they employ makes them an unappealing choice for
modeling longitudinal data such as text streams that can arrive
or accumulate as epochs, where data points inside the same
epoch can be assumed to be fully exchangeable, whereas
across the epochs both the structure (i.e., the number of mixture
components) and the parametrization of the data distributions
can evolve and are therefore not exchangeable. In this section,
we present the Recurrent Chinese Restaurant Process (
RCRP ) [Ahmed and Xing, 2008] as a framework for modeling
these complex longitudinal data, in which the number
of mixture components at each time point is unbounded; the
components themselves can retain, die out or emerge over
time; and the actual parametrization of each component can
also evolve over time in a Markovian fashion.
In RCRP, documents are assumed to be divided into epochs
(e.g., one hour or one day); we assume exchangeability only
within each epoch. For a new document at epoch t, a probability
mass proportional to α is reserved for generating a new
cluster. Each existing cluster may be selected with probability
proportional to the sum mkt + m′kt, where mkt is the number
of documents at epoch t that belong to cluster k, and m′kt is
the prior weight for cluster k at time t. If we let ctd denote
the cluster assignment of document d at time t, then:

ctd | c1:t−1, ct,1:d−1 ∼ RCRP(α, λ, ∆)   (3)

to indicate the distribution

P(ctd = k | c1:t−1, ct,1:d−1) ∝  m′kt + m^{−td}_kt   for an existing cluster k
                                 α                   for a new cluster       (4)
As in the original CRP, the count m^{−td}_kt is the number of documents
in cluster k at epoch t, not including d. The temporal
aspect of the model is introduced via the prior m′kt, which is
defined as

m′kt = Σ_{δ=1}^{∆} e^{−δ/λ} mk,t−δ.   (5)
This prior defines a time-decaying kernel, parametrized by
∆ (width) and λ (decay factor). When ∆ = 0 the RCRP
degenerates to a set of independent Chinese Restaurant Processes
at each epoch; when ∆ = T and λ = ∞ we obtain
a global CRP that ignores time. In between, the values of
these two parameters affect the expected life span of a given
component, such that the lifespan of each storyline follows a
power law distribution [Ahmed and Xing, 2008]. In addition,
the distribution φk of each component changes over time in a
Markovian fashion, i.e., φkt | φk,t−1 ∼ P(·|φk,t−1). In the following
three sections we give various models built on top of
RCRP and highlight how inference is performed and scaled
to the size of data over the internet.
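As a minimal sketch of the assignment rule in Eqs. (3)-(5), and not the inference code used in the cited work, the following Python fragment computes the RCRP selection weights from exponentially decayed historical counts plus current-epoch counts; the cluster names, toy counts, and hyperparameter values are illustrative.

import math
import random

def rcrp_weights(counts_by_epoch, t, alpha, lam, width):
    """counts_by_epoch[s][k] = number of documents in cluster k at epoch s."""
    weights = {}
    # m'_kt: exponentially decayed counts from the previous `width` epochs (Eq. 5).
    for delta in range(1, width + 1):
        for k, m in counts_by_epoch.get(t - delta, {}).items():
            weights[k] = weights.get(k, 0.0) + math.exp(-delta / lam) * m
    # add m_kt: documents already assigned in the current epoch t.
    for k, m in counts_by_epoch.get(t, {}).items():
        weights[k] = weights.get(k, 0.0) + m
    weights["new"] = alpha  # mass reserved for a brand-new cluster (Eq. 4)
    return weights

def sample_assignment(weights):
    total = sum(weights.values())
    r, acc = random.uniform(0, total), 0.0
    for k, w in weights.items():
        acc += w
        if r <= acc:
            return k
    return "new"

history = {0: {"sports": 5, "politics": 2}, 1: {"sports": 3}}
print(rcrp_weights(history, t=2, alpha=1.0, lam=0.5, width=2))
print(sample_assignment(rcrp_weights(history, t=2, alpha=1.0, lam=0.5, width=2)))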
3 Modeling Scientific Publications
With the large number of research publications available
online, it is important to develop automated methods that
can discover salient topics (research areas), when each topic
started, how each topic developed over time, and what the
representative publications in each topic are in each year. Mixed-membership
models (such as LDA) are static in nature, and
while several dynamic extensions have been proposed ([Blei
and Lafferty, 2006]), none of them can deal with evolving all
of the aforementioned aspects. While the RCRP model can
be used for modeling the temporal evolution of research topics,
it assumes that each document is generated from a single
topic (cluster). To marry these two approaches, we first
introduce Hierarchical Dirichlet Processes (HDP [Teh et al.,
2006]) and then illustrate our proposed model.
Instead of modeling each document wd as a single data
point, we could model each document as a DP. In this setting, each
word wdn is a data point and thus will be associated
with a topic sampled from the random measure Gd, where
Gd ∼ DP(α, G0). The random measure Gd thus represents
the document-specific mixing vector over a potentially infinite
number of topics. To share the same set of topics across
documents, we tie the document-specific random measures
by modeling the base measure G0 itself as a random measure
sampled from a DP(γ, H). The discreteness of the base
measure G0 ensures topic sharing between all the documents.
Now we proceed to introduce our model, iDTM [Ahmed
and Xing, 2010b], which allows for an infinite number of topics
with variable durations. The documents in epoch t are
modeled using an epoch-specific HDP with a high-level base
measure denoted as G^t_0. These epoch-specific base measures
{G^t_0} are tied together using the RCRP of Section 2
to evolve the topics’ popularity and distribution over words
as time proceeds. To enable the evolution of the topic distribution
over words, we model each topic as a logistic normal
distribution and evolve its parameters using a Kalman filter.
This choice introduces non-conjugacy between the base
measure and the likelihood function and we deal with it using
a Laplace approximate inference technique proposed in
[Ahmed and Xing, 2007].
We applied this model to the collection of papers published
in the NIPS conference over 18 years. In Figure 1 we depict
the conference timeline and the evolution of the topic ‘Kernel
Methods’ along with popular papers in each year.
In addition to modeling temporal evolution of topics, in
[Ahmed et al., 2009] we developed a mixed-membership
model for retrieving relevant research papers based on multiple
modalities: for example figures or key entities in the
paper such as genes or protein names (as in biomedical papers).
Figures in biomedical papers pose various modeling
challenges that we omit here for space limitations.
4 Modeling Social Media
News portals and blogs/twitter are the main means to disseminate
news stories and express opinions. With the sheer volume
of documents and blog entries generated every second, it
is hard to stay informed. This section explores methods that
create a structured representation of news and opinions.
Storylines emerge from events in the real world, such as
the Tsunami in Japan, and have certain durations. Each story
can be classified under multiple topics such as disaster, rescue
and economics. In addition, each storyline focuses on certain
words and named entities such as the name of the cities or
people involved in the event.

[Figure 2: Some example storylines and topics extracted by our system. For each storyline we list the top words in the left column and the top named entities at the right; the plot at the bottom shows the storyline strength over time. For topics we show the top words. The lines between storylines and topics indicate that at least 10% of terms in a storyline are generated from the linked topic.]

In [Ahmed et al., 2011b,a] we
used RCRP to model storylines. In a nutshell, we emulate the
process of generating news articles. A story is characterized
by a mixture of topics and the names of the key entities involved
in it. Any article discussing this story then draws its
words from the topic mixture associated with the story, the
associated named entities, and any story-specific words that
are not well explained by the topic mixture. The latter modification
allows us to improve our estimates for a given story
once it becomes popular. In summary, we model news story
clustering by applying a topic model to the clusters, while
simultaneously allowing for cluster generation using RCRP.
Such a model has a number of advantages. First, estimates in
topic models improve with the amount of data available.
Second, modeling a story by its mixture of topics ensures that we have
a plausible cluster model right from the start, even after observing
only one article for a new story. Third, the RCRP
ensures a continuous flow of new stories over time. Finally, a
distinct named entity model ensures that we capture the characteristic
terms rapidly. In order to infer storylines from the text
stream, we developed a Sequential Monte Carlo (SMC) algorithm
that assigns news articles to storylines in real time.
Applying our online model to a collection of news articles extracted
from a popular news portal, we discovered the structure
shown in Figure 2. This structure enables the user to
browse the storylines by topics as well as retrieve relevant storylines
based on any combination of the storyline attributes.
Note that named entities are extracted by a preprocessing step
using standard extractors. Quantitatively, we compared the
accuracy of our online clustering with a strong offline algorithm
[Vadrevu et al., 2011] with favorable outcome.
Second, we address the problem of ideology-bias detection
in user generated content such as microblogs. We follow the
notion of ideology as defined by Van Dijk [Dijk, 1998]: “a
set of general abstract beliefs commonly shared by a group
of people.”

[Figure 3: Ideology-detection. Middle topics represent the unbiased portion of each topic, while each side gives the Israeli and Palestinian perspective.]

[Figure 4: A fully evolving non-parametric process. The top-level process evolves the global topics via an RCRP. Each row represents a user process evolving using an RCRP whose topics depend both on the global topics at each epoch and the previous state of the user at previous epochs. The user process is sparser than the global process as users need not appear in each epoch; moreover, users can appear (and leave) at any epoch.]

In other words, an ideology is a set of ideas that
directs one’s goals, expectations, and actions. For instance,
freedom of choice is a general aim that directs the actions of
“liberals”, whereas conservation of values is the parallel for
“conservatives”. In Ahmed and Xing [2010a] we developed a
multi-view mixed-membership model that utilizes a factored
representation of topics, where each topic is composed of two
parts: an unbiased part (shared across ideologies) and a biased
part (different for each ideology). Applying this model
on a few ideologically labelled documents as seeds and many
unlabeled documents, we were able to identify how each ideology
stands with respect to mainstream topics. For instance
in Figure 3 we show the result of applying the model to a set
of articles written on the middle east conflict by both Israeli
and Palestinian writers. Given a new document, the model
can 1) detect its ideological bias (if any), 2) point out where the
bias appears (i.e., highlight biased words and/or sentences),
and 3) retrieve documents written on the same topic from the
opposing ideology. Our model achieves state-of-the-art results
in tasks 1 and 3 while being unique in solving task 2.
[Figure 5: Dynamic interests of two users.]
5 Modeling User Interests
Historical user activity is key for building user profiles to predict
the user behaviour and affinities in many web applications
such as targeting of online advertising, content personalization
and social recommendations. User profiles are temporal,
and changes in a user’s activity patterns are particularly
useful for improved prediction and recommendation. For instance,
an increased interest in car-related web pages suggests
that the user might be shopping for a new vehicle.
In Ahmed et al. [2011c] we present a comprehensive statistical
framework for user profiling based on the RCRP model
which is able to capture such effects in a fully unsupervised
fashion. Our method models topical interests of a user dynamically
where both the user association with the topics and
the topics themselves are allowed to vary over time, thus ensuring
that the profiles remain current. For instance if we represent
each user as a bag of the words in their search history,
we could use the iDTM model described in Section 3. However,
unlike research papers that exist in a given epoch, users
exist across multiple epochs (where each epoch here might denote
a day). To solve this problem we extend iDTM by modeling
each user as an RCRP that evolves over time as
shown in Figure 4. To deal with the size of data on the internet,
we developed a streaming, distributed inference algorithm
that distributes users over multiple machines and synchronizes
the model parameters using an asynchronous consensus
protocol described in more detail in [Ahmed et al.,
2012; Smola and Narayanamurthy, 2010]. Figure 5 shows
qualitatively the output of our model for two users. Quantitatively,
the discovered interests, when used as features in
an advertising task, result in significant improvement over a
strong deployed system.
6 Conclusions
Our infosphere is diverse and dynamic. Automated methods
that create a structured representation of users and content are
key to helping users stay informed. We presented a flexible
nonparametric Bayesian model called the Recurrent Chinese
Restaurant Process and showed how using this formalism (in
addition to mixed-membership models) can solve this task.
We validated our approach on many domains and showed
how to scale the inference to the size of the data on the internet
and how to perform inference in online settings.

References
A. Ahmed and E. P. Xing. On tight approximate inference
of the logistic normal topic admixture model. In AISTATS,
2007.
A. Ahmed and E. P. Xing. Dynamic non-parametric mixture
models and the recurrent chinese restaurant process:
with applications to evolutionary clustering. In SDM, pages
219–230. SIAM, 2008.
A. Ahmed and E. P. Xing. Staying informed: Supervised and
semi-supervised multi-view topical analysis of ideological
perspective. In EMNLP, 2010.
A. Ahmed and E. P. Xing. Timeline: A dynamic hierarchical
dirichlet process model for recovering birth/death and
evolution of topics in text stream. In UAI, 2010.
A. Ahmed, E. P. Xing, W. W. Cohen, and R. F. Murphy.
Structured correspondence topic models for mining captioned
figures in biological literature. In KDD, pages 39–
48. ACM, 2009.
A. Ahmed, Q. Ho, J. Eisenstein, E. P. Xing, A. J. Smola, and
C. H. Teo. Unified analysis of streaming news. In WWW,
2011.
A. Ahmed, Q. Ho, C. Hui, J. Eisenstein, A. Smola, and E. P.
Xing. Online inference for the infinite topic-cluster model:
Storylines from text stream. In AISTATS, 2011.
A. Ahmed, Y. Low, M. Aly, V. Josifovski, and A. J. Smola.
Scalable distributed inference of dynamic user interests for
behavioral targeting. In KDD, pages 114–122, 2011.
A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A.J.
Smola. Scalable inference in latent variable models. In Web
Science and Data Mining (WSDM), 2012.
A. Ahmed. Modeling Users and Content: Structured Probabilistic
Representation, and Scalable Online Inference
Algorithms. PhD thesis, School of Computer Science,
Carnegie Mellon University, 2011.
C. E. Antoniak. Mixtures of dirichlet processes with applications
to bayesian nonparametric problems. The Annals of
Statistics, 2(6):1152–1174, 1974.
D. Blackwell and J. MacQueen. Ferguson distributions via
polya urn schemes. The Annals of Statistics, 1(2):353–355,
1973.
D. M. Blei and J. D. Lafferty. Dynamic topic models. In
ICML, volume 148, pages 113–120. ACM, 2006.
D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation.
Journal of Machine Learning Research, 3:993–1022,
2003.
T. A. Van Dijk. Ideology: A multidisciplinary approach.
1998.
E. Erosheva, S. Fienberg, and J. Lafferty. Mixed membership
models of scientific publications. PNAS, 101(1), 2004.
T. S. Ferguson. A bayesian analysis of some nonparametric
problems. The Annals of Statistics, 1(2):209–230, 1973.
A. E. Jinha. Article 50 million: an estimate of the number
of scholarly articles in existence. Learned Publishing,
23(3):258–263, 2010.
A. J. Smola and S. Narayanamurthy. An architecture for parallel
topic models. In Very Large Databases (VLDB), 2010.
Y. Teh, M. Jordan, M. Beal, and D. Blei. Hierarchical dirichlet
processes. Journal of the American Statistical Association,
101(576):1566–1581, 2006.
S. Vadrevu, C. H. Teo, S. Rajan, K. Punera, B. Dom, A. J.
Smola, Y. Chang, and Z. Zheng. Scalable clustering of
news search results. In WSDM, 2011.
Towards A Unified Modeling and Verification of
Network and System Security Configurations
Mohammed Noraden Alsaleh, Ehab Al-Shaer
University of North Carolina at Charlotte
Charlotte, NC, USA
Email: {malsaleh, ealshaer}@uncc.edu
Adel El-Atawy
Google Inc
Mountain View, CA, USA
Email: aelatawy@google.com
Abstract—System and network access control configurations
are usually analyzed independently, although they are logically
combined to define the end-to-end security property. While
systems and applications security policies define access control
based on user identity or group, request type and the requested
resource, network security policies use flow information such
as host and service addresses for source and destination to
define access control. Therefore, both network and systems access
control have to be configured consistently in order to enforce end-to-end
security policies. Much previous research attempts to verify
either side separately, but does not provide a unified approach
to automatically validate the logical consistency between both
of them. Thus, using existing techniques requires error-prone
manual and ad-hoc analysis to validate this link.
In this paper, we introduce a cross-layer modeling and verifi-
cation system that can analyze the configurations and policies
across both application and network components as a single
unit. It combines policies from different devices such as firewalls,
NAT, routers and IPSec gateways, as well as basic RBAC-based
policies of higher service layers. This allows analyzing,
for example, firewall polices in the context of application access
control and vice versa. Thus, by incorporating policies across
the network and over multiple layers, we provide a true end-to-end
configuration verification tool. Our model represents the
system as a state machine where the packet header, service request
and location determine the state, and transitions that conform
with the configuration, device operations, and packet values are
established. We encode the model as Boolean functions using
binary decision diagrams (BDDs). We used an extended version of
computational tree logic (CTL) to provide more useful operators
and then use it with symbolic model checking to prove or find
counter examples to needed properties. The tool is implemented
and we gave special consideration to efficiency and scalability.
I. INTRODUCTION
Users inadvertently trigger a long sequence of operations in
many locations and devices by just a simple request. The application
requests are encapsulated inside network packets which
in turn are routed through the network devices and subjected
to different types of routing, access control and transformation
policies. Misconfigurations at the different layers in any of
the network devices can affect the end-to-end connection
between the hosts themselves and the communicating services
running on top of them. Moreover, applications may need
to transform a request into one or more other requests with
different characteristics. This means that the network layer
should guarantee more than one packet flow at the same time
in order for the application request to be successful. Although
it is already very hard to verify that only the legitimate packets
can pass through the network successfully, the consistency
between network and application layer access configuration
adds another challenge. The different natures of policies from
network layer devices to the logic of application access control
makes it more complex.
In this paper we have extended ConfigChecker [3] to
include application layer access control. The ConfigChecker
is a model checker for network configuration. It implemented
many network devices including: routers, firewalls, IPSec
gateways, hosts, and NAT/PAT nodes. ConfigChecker models
the transitions based on the packet forwarding at the network
layer where the packet header fields along with the location
represent the variables for the model checker. We define
application layer requests following a loose RBAC model:
a 4-tuple of <user, role, object, action>. The request can
be created by users or services running on top of hosts in
the network. The services in our model can also transform a
request into another one or more requests and forward them to
a different destination. We have implemented a parallel model
checker for application layer configuration. The transitions in
the application layer model are determined by the movement of
the requests between different services. However, the network
and application layer model checkers are not operating separately.
Requirements regarding both models can be verified
in a single query using our unified query interface. Moreover,
inconsistency between network configuration and application
layer configuration can be detected.
The nature of the problem of verifying network-wide configurations
necessitates having a very scalable system in terms of
time and space requirements. Larger networks, more complex
configurations, and a richer variety of devices are all dimensions
which the system should handle gracefully. The
application-layer access control depends on different variables
than most of network layer policies. We chose to implement
a parallel model checker for application access control rather
than adding the application layer variables (which correspond
to request fields) to the network model checker itself. This
can decrease the number of system states and improve the
performance. As is the case in ConfigChecker, both model
checkers are represented as state machines and encoded as
Boolean functions using Binary Decision Diagrams (BDDs). We
use an extended version of computational tree logic (CTL) to
provide more useful operators and then use it with symbolic
model checking to prove or find counter examples for needed
properties regarding both models.

[Fig. 1. A simple overview of the framework design and flow.]
The rest of this paper is organized as follows. We first
briefly describe our framework components in Section II. We
then present the model used for capturing the network and
application layer configuration in Section III and Section IV
respectively. Section V shows how to query the model for
properties, and lists some sample queries. The related work
is presented in Section VI. We finally present our conclusion
and future remarks in Section VII.
II. FRAMEWORK OVERVIEW
The framework consists of a few key components: configuration
loader, model compiler, and query engine. The duty of
each component is described briefly below:
• The Configuration Loader parses the main configuration
file that points out to the configuration files of network
devices. Each file represents a device or entity (e.g.,
firewall, router, application-layer service, etc). Each con-
figuration file consists mainly of two sections: meta-data
directives, and policy. The initial directives act as an
initialization step to configure the device properties like
default gateway, service port, host address, etc. The policy
is listed afterwards as a simple list of rules.
• The Model Compiler translates the configuration into a
Boolean expression for each policy rule and builds a
single expression for each device. These expressions are
then combined into a single expression representing the
whole network.
• The Query Engine is responsible for verifying properties
of the system by executing simple scripts based on CTL
expressions. Scripts are written using a very limited set of
primitives for refining the user output, and for defining
the property itself.
The model compiler component builds two separate expressions.
The first represents the network layer configuration that
reflects the packets forwarding and transformation through the
network core and end points as described in Section III. The
other expression represents the application layer configuration
including services and users. This reflects how requests are
forwarded and transformed in the service level. Section IV
describes this process in more detail. Although we could
integrate the two expressions and build only one expression
that accommodates both the network and application layer
configuration, we chose to split them into two expressions.
The variables used in each of them are generally independent
except for the location variable. Building one expression
takes more space because the network configuration would be
duplicated for each different combination of application layer
variables, generating more and more states. Splitting helps our
model scale and avoids state explosion.
III. NETWORK MODEL
We model the network as a single monolithic finite state
machine. The state space is the cross-product of the packet
properties by its possible locations in the network. The packet
properties include the header information that determines the
network response to a specific packet.
A. State representation
Initially, the only information we need about the packet
is the source and destination information contained in the IP
header and the current location of the packet in the network.
Therefore, we can encode the state of the network with the
following characteristic function:
σn : IPs × ports × IPd × portd × loc → {true, false}

  IPs    the 32-bit source IP address
  ports  the 16-bit source port number
  IPd    the 32-bit destination IP address
  portd  the 16-bit destination port number
  loc    the 32-bit IP address of the device currently processing the packet
The function σn encodes the state of the network by
evaluating to true whenever the parameters used as input to
the function correspond to a packet that is in the network and
false otherwise. If the network contains 5 different packets,
then exactly five assignments to the parameters of the function
σn will result in true. Note that because we abstract payload
information, we cannot distinguish between 2 packets that
are at the same device if they also have the same IP header
information.
Each device in the network can then be modeled by describing
how it changes a packet that is currently located at
the device. For example, a firewall might remove the packet
from the network or it might allow it to move on to the device
on the other side of the firewall. A router might change the
location of the packet but leave all the header information
unchanged. A device performing network address translation
might change the location of the packet as well as some of the
IP header information. A hub might copy the same packet to
multiple new locations. The behavior of each of these devices
can be described by a list of rules. Each rule has a condition
and an action. The rule condition can be described using a
Boolean formula over the bits of the state (the parameters
of the characteristic function σn). If the packet at the device matches (satisfies) a rule condition, then the appropriate action
is taken. As described above, the action could involve changing
the packet location as well as changing IP header information.
In all cases, however, the change can be described by a
Boolean formula over the bits of the state. Sometimes the new
values are constant (completely determined by the rule itself),
and sometimes they may depend on the values of some of the
bits in the current state. In either case, a transition relation can
be constructed as a relation or characteristic function over two
copies of the state bits. An assignment to the bits/variables in
the transition relation yields true if the packet described by
the first copy of the bits will be transformed into a packet
described by the second copy of the bits when it is at the
device in question.
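As a purely illustrative, explicit-state sketch, and not the symbolic BDD encoding the system actually uses, the following Python fragment represents a packet state as a tuple (IPs, ports, IPd, portd, loc) and a device as a first-match list of condition/action rules; the addresses and rules are hypothetical.

from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

State = Tuple[str, int, str, int, str]  # (ip_s, port_s, ip_d, port_d, loc)

@dataclass
class Rule:
    condition: Callable[[State], bool]
    action: Callable[[State], Optional[State]]  # None models "drop"

def step(device_rules: List[Rule], state: State) -> Optional[State]:
    """Apply the first matching rule of the device currently holding the packet."""
    for rule in device_rules:
        if rule.condition(state):
            return rule.action(state)
    return None  # no rule matched: the packet is dropped

# Hypothetical firewall at 10.0.0.1: forward web traffic onward, drop the rest.
firewall = [
    Rule(condition=lambda s: s[4] == "10.0.0.1" and s[3] == 80,
         action=lambda s: (s[0], s[1], s[2], s[3], "10.0.0.2")),  # move to next device
    Rule(condition=lambda s: s[4] == "10.0.0.1",
         action=lambda s: None),                                   # drop everything else
]

print(step(firewall, ("192.0.2.7", 5555, "10.0.0.9", 80, "10.0.0.1")))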
B. Network devices
We integrate the policies of different network devices
including firewalls, routers, NAT and IPSec gateways. The
details of their policies and how they are encoded into BDDs
are discussed thoroughly in [3]. However, we have modified
the encoding of hosts to reflect the request transformation
performed by the services running on top of each host.
The host may be configured to run one or multiple services,
each of which has its own access-control list, as will be
discussed in Section IV. The service configuration may also
specify a set of possible request transformations where the
incoming request is transformed into another (sometimes completely
new) request. For example, a request to a web server
can be translated into an NFS request to load a users home
page. The new request will be carried through the network over
packets. In our initial model the host receives packets and then
forwards them to the application layer within the host itself
and cannot forward them to another host. We need to modify
this model so that the host will be able to forward packets to
other hosts in order to support the request transformations
performed by the services running on top of it.
IV. APPLICATION LAYER MODEL
We also model the application layer as a finite state machine.
The state space is the cross-product of the application layer
request properties by its possible locations in the network.
The request properties include its fields that determine the
service response to a specific request.

σp : usr × role × obj × act × loc × srv → {true, false}

  usr   the 32-bit user ID
  role  the 32-bit role ID to which the user belongs
  obj   the 32-bit object ID
  act   the 16-bit action ID
  loc   the 32-bit IP address of the device currently processing the request
  srv   the 16-bit service ID
In the application layer model the devices in the network are
modeled by describing how they change the requests. Only the
devices that operates on the application level are considered
(i.e., the devices who has a defined users list or services
running on top of them). Here, we describe how we define
access-control rights for service requests, and how we model
these services and integrate them into the application layer
state transition diagram.
A. Application layer access-control
In order to have a homogeneous policy definition across
applications, we revert to a simplified RBAC model as a
way to specify all application requests and consequently the
access-control policy. As in firewall policy, the access-control
list of application layer services is defined by specifying
an action like (permit or deny) to the requests satisfying
certain criteria. This criteria is defined using the request fields
< user, role, object, action > or < u, r, o, a > for short. We
assume that each host has a list of potential users who can
use it to send requests. This list can simply be set to ”any”,
to indicate that all defined users can access the host which
enables a more powerful model for an adversary against which
we want to verify the robustness of the policy. Also, another
assumption is that any user can assume any role. This enables
a more flexible usage of the model to incorporate more types
of services. It is even possible to use one of the request fields
in a slightly different meaning. For example, in a web server
model; an action can be a POST or GET, the role can be a
logged in versus guest visitor and the object and user will
have their obvious meanings. On the other hand, for database
servers, we might have users, roles, actions, and resources used
in their original meaning.
When a service receives a request from another device, it
first verifies it against its access-control policy. If it satisfies
the access-control policy it will be forwarded to the service to
be executed or transformed as shown in Fig. 2. If the request
does not satisfy the access policy it will be dropped (i.e., there
is no valid transition in the finite state machine that goes to
the required service). A policy typically is defined as a list of
tuples with an assigned action:
user role resource action decision
;user black listing
1 * * * deny
2 * * * deny
;admin account
100 1 * * permit
;guests can only read
* guest 1-50 3 permit
;a read only resource
* * 60 4 permit
* * 60 * deny
As in firewall policies, we use a first-match methodology.
For example, the last two rules allow read access, and then
deny every other action. Also, the first few black-listing rules
do not conflict with the guest account rule that appears later
in the policy. From the common practices in the area of
application level and RBAC policies, we believe that it should
never be the case that a user or role is specified as a range.
The values of user and role IDs are irrelevant, so a range in
the specification is unlikely to have practical value. However,
this fact does not affect our implementation (i.e., we support
single values, ranges, or “any” values in all four fields of the
access-control rules).

[Fig. 2. Service Model. S1, S2, S3, and S4 represent services running on different hosts. The dashed lines represent application requests. Requests are subjected to the access control list of the target service.]
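As a purely illustrative sketch of the first-match semantics over the four request fields, and not the BDD-based encoding used by the tool, the following Python fragment evaluates the example policy listed above; the field encodings are hypothetical.

def matches(field_value, pattern):
    """'*' matches anything; 'lo-hi' is a numeric range; otherwise exact match."""
    if pattern == "*":
        return True
    if str(field_value) == str(pattern):
        return True
    if "-" in str(pattern):
        lo, hi = (int(x) for x in str(pattern).split("-"))
        try:
            return lo <= int(field_value) <= hi
        except (TypeError, ValueError):
            return False
    return False

# (user, role, resource, action, decision), evaluated top to bottom.
policy = [
    ("1",   "*",     "*",    "*", "deny"),    # user black listing
    ("2",   "*",     "*",    "*", "deny"),
    ("100", "1",     "*",    "*", "permit"),  # admin account
    ("*",   "guest", "1-50", "3", "permit"),  # guests can only read
    ("*",   "*",     "60",   "4", "permit"),  # a read-only resource
    ("*",   "*",     "60",   "*", "deny"),
]

def decide(user, role, resource, action, default="deny"):
    for u, r, o, a, decision in policy:
        if all(matches(v, p) for v, p in
               [(user, u), (role, r), (resource, o), (action, a)]):
            return decision  # first match wins
    return default

print(decide(100, 1, 7, 2))        # -> "permit" (admin rule)
print(decide(5, "guest", 60, 9))   # -> "deny" (read-only resource)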
B. State representation
Requests that pass the access-control phase are forwarded
to the execution phase of the service. We have simplified the
execution phase to one of two options as suggested in Fig. 2.
A request can be executed on the service itself, which means
that the request life-time ends at this phase, and no further
events are triggered. The other possibility is that the service
transforms the request into another form by modifying one
or more fields (i.e., user, role, object or action) and sends it
to another service running on the same or a different host.
For example, a request to a web server can be translated into
an NFS request to load a user’s home page. Each request
transformation is associated with a packet flow in the network
level, the host should be able to send the appropriate packet
to its gateway based on the service transformation.
Only the network devices that support application layer
services represented by hosts in our model are included in
the application layer model. The requests that leave a host
may come from two sources: either a user operating directly
on the host or through a service, or a request is transformed
from another one. We do not require the destination of each
request to be defined in the configuration. We assume that any
request instantiated in any host can be directed to any service
in the network whose access control list allows the request
to pass. To build the transition relation for each host in the
network we need the following inputs.
• The set U of users who can access the host. Each user
is represented by its unique ID in the system. It can also
be expressed as a <user, role> pair.
• The set T of possible request transformations that can be
performed in the host. The configuration may specify the
exact ID and location of the target service, or it can be
anonymous.
• The set P of access control policies for each service
running in the network. These policies are encoded as
BDD expressions before building the transition relation
of the application layer. Each policy in the set corresponds
to a particular service ID (The port number of the service
can be used as its ID) running on a particular host.
Let us assume that the list UH of <user, role> pairs represents
the users who can access the host H. We first need to encode
the possible states that result from having these users on the
host H (recall that the state is the product of the request
properties, in this case, by the location in which
the request exists).

UBDD = ⋁_{i∈UH} (usr = ui ∧ role = ri ∧ loc = H)   (1)
where u_i and r_i are the user and role IDs of item i in
the user list UH. To find the transitions, we determine
which services can be reached starting from the states defined
by the expression UBDD (i.e., any service whose access-control
policy allows the defined requests to pass).
Tu = ⋁_{i ∈ indices(P)} (UBDD ∧ P(i) ∧ loc′ = l_i ∧ srv′ = s_i)    (2)
where P(i), l_i, and s_i are the policy, location, and service
ID of the target service i, respectively. The variables loc′ and
srv′ represent the location and service ID in the
next state of the transition.
The expression Tu does not include the transitions that
result from request transformations performed by the service running on
the host. The following represents the transition that results
from one transformation performed by the service i:

(usr = u_i ∧ role = r_i ∧ obj = o_i ∧ act = a_i ∧ loc = H ∧ srv = s_i)
∧ (usr′ = u′_i ∧ role′ = r′_i ∧ obj′ = o′_i ∧ act′ = a′_i ∧ P′(s′_i) ∧ loc′ = l′_i ∧ srv′ = s′_i)    (3)
The values ⟨u_i, r_i, o_i, a_i⟩ are the properties of the initial
request and the values ⟨u′_i, r′_i, o′_i, a′_i⟩ represent the properties
of the transformed request. The values P′(s′_i), l′_i and s′_i
are the policy, location and service ID of the target service
to which the request should be transformed. Note that we use
P′(s′_i) instead of P(s′_i) to indicate that the transformed request
(and not the initial request) must pass the target service's
access-control policy in order to complete the transition. The
disjunction of all the transitions caused by all possible
transformations, together with the expression Tu calculated earlier,
forms the total transition relation of the host H.
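To make the construction concrete, the following explicit-state Python sketch mirrors expressions (1)-(3); it is an illustration only (the data structures and names are ours), since the actual system encodes states, policies and transitions as BDDs.

def initial_states(users, host):
    """UBDD: states created by the users who can access the host (expression (1))."""
    return [{"usr": u, "role": r, "loc": host} for (u, r) in users]

def user_transitions(states, services):
    """Tu: a state can move to any service whose policy admits the request (expression (2))."""
    trans = []
    for s in states:
        for srv in services:                      # each srv: {"id", "loc", "policy"} (illustrative)
            if srv["policy"](s):
                nxt = dict(s, loc=srv["loc"], srv=srv["id"])
                trans.append((s, nxt))
    return trans

def transformation_transitions(transforms, services_by_id):
    """Transitions caused by request transformations (expression (3)):
    the transformed request must pass the target service's policy."""
    trans = []
    for t in transforms:                          # each t: {"in", "out", "host", "srv", "target_srv"}
        target = services_by_id[t["target_srv"]]
        if target["policy"](t["out"]):
            src = dict(t["in"], loc=t["host"], srv=t["srv"])
            dst = dict(t["out"], loc=target["loc"], srv=target["id"])
            trans.append((src, dst))
    return trans

The total transition relation of a host is then the union of the user-initiated transitions and the transformation transitions, matching the disjunction described above.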
V. QUERYING THE MODEL
A query in our system takes the form of a Boolean expression
that specifies some properties over packet flows, requests
and locations, with temporal logic criteria specified using CTL
operators. By evaluating the given expression in the context of
the built state machine (i.e., states and transitions), we obtain
the satisfying assignments to that expression represented in
the same symbolic representation as the model itself. The
simplest results are the constant expressions “true” (e.g., the
property is always satisfied) and “false” (e.g., no one violates
the required property); otherwise the result is the subset of the
space that satisfies the property (e.g., only flows with port 80,
or only traffic that starts from a given location).
In the following subsections, we will go over a few examples
for reachability and security properties. For each, we
show how to construct the query (i.e., the Boolean expression),
and what the results look like. Moreover, we discuss how to
write the script that extracts the results and data fields in the
format appropriate to each specific query. The aim of this
presentation is to show the applicability of the system to many
types of properties, as well as the expressive power of the model.
A. Model Checking
We have described how to construct a transition relation
for each device in the network. Each such transition relation
describes a list of outgoing transitions for the device it models.
The formulas are constructed with the requirement that the
current location be equal to the device being modeled, so
these transitions can only be taken when a packet or a request
is at the device. To get the transition relation for the entire
network, we simply take the disjunction of the formulas
for the individual devices. This is applied to both models
(the network and the application layer models). The current
location of a packet or a request will match the location of
at most one device in the network, and so only that device's
transitions will apply.
Recall that this global transition relation is a characteristic
function for transitions in the model. If we substitute the
values for a packet that is in the system into the current state
variables of the transition relation, what we are left with is a
formula describing what the possible next states of that packet
look like. We have all the machinery to perform symbolic
model checking. We use BDDs for all the formulas described
above and we use standard model checking algorithms to
explore the state space and compute states that satisfy various
CTL properties. The BuDDy BDD package provides all the
required operations (including quantification). For a much
more complete description of symbolic model checking, the
reader is encouraged to see [7].
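As an illustration of how a CTL operator is evaluated, the following Python sketch computes EF over an explicit set of transitions; it is a toy, explicit-state analogue of the symbolic fixpoint computation that the BDD-based checker performs, and the state representation is ours.

def ef(transitions, target_states):
    """EF(target): all states from which some path reaches a target state.
    Least fixpoint Z = target ∪ pre(Z), computed here over an explicit
    set of (state, next_state) pairs instead of BDDs."""
    reach = set(target_states)
    changed = True
    while changed:
        changed = False
        for s, t in transitions:
            if t in reach and s not in reach:
                reach.add(s)
                changed = True
    return reach

# toy usage with hashable states
edges = {("A", "B"), ("B", "C"), ("D", "A")}
print(ef(edges, {"C"}))   # {'A', 'B', 'C', 'D'}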
The network-layer model checker and the application-layer
model checker are encoded separately, and each operates on its
own set of variables. However, we may need to verify
some requirements using both of them together. Although
application-layer requests are transmitted from one application
to another, the requests are encapsulated inside network packets.
We do not require a static one-to-one mapping between
each request and the packet flow that should be used to transfer
the request; the mapping can be expressed in the query
itself by specifying precise network packet characteristics.
For this purpose we use the variable loc, which is
common to the two models and has the same meaning in both.
Fig. 3 shows how the two models are used together to verify
the requirements.

Fig. 3. Using two models to run a query.
B. Query structure and features
The query in our model checkers retrieves the states that
satisfy a given condition. The condition is expressed by
restricting some variables in the model checker to a given
value or by using CTL operators to express a temporal condition.
We also need to specify what information should be retrieved about
the states that satisfy the query (i.e., a list of variables to be
retrieved). An example query can look like this:
Q3 = [loc(10.12.13.14) ∧ EF((¬loc(10.0.0.0/8)) ∧ (¬Q2))]
Q3 : extractF ield loc dport
Q3 : listBounded 20 loc dport
The query Q3 is defined by the given expression (i.e., which
flows are at a given location (10.12.13.14) and in the future
will be outside the domain (10.0.0.0/8) and do not satisfy
a previously defined query (Q2)). The second and third lines
format the result of the query: the second line tells the query
engine that we are only interested in the variables loc and
dport, and the third line specifies that only the first 20
satisfying assignments should be displayed. If there is no
satisfying assignment for the given query, nothing is returned.
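The result-formatting step can be mimicked with a small Python sketch (ours, not the tool's implementation); representing satisfying assignments as plain dictionaries is an assumption made only for illustration.

def format_result(assignments, fields, limit):
    """Keep only the requested fields of each satisfying assignment
    (cf. extractField) and return at most `limit` of them (cf. listBounded)."""
    projected = [{f: a[f] for f in fields if f in a} for a in assignments]
    return projected[:limit]

# hypothetical satisfying assignments for Q3
result = format_result(
    [{"loc": "10.12.13.14", "dport": 80, "src": "10.1.2.3"},
     {"loc": "10.12.13.14", "dport": 443, "src": "10.1.2.4"}],
    fields=["loc", "dport"],
    limit=20,
)
# -> [{'loc': '10.12.13.14', 'dport': 80}, {'loc': '10.12.13.14', 'dport': 443}]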
To handle queries over both the network-layer and application-layer
models, we introduce the concept of a sub-query.
Each sub-query is applied to one model. An application-layer
sub-query should not include variables related to the network-layer
model, such as source or destination addresses and port
numbers, and vice versa. A query can include one or more
sub-queries according to the following cases.
• It can include only one sub-query. In this case the query is
applied only on the appropriate model. The query engine
detects the appropriate model based on the variables used.
• It can include more than one sub-query of the same
type linked by logical operators such as AND, OR,
IMPLIES, etc. In this case all the sub-queries are
executed on the appropriate model and the final result is
calculated by applying the specified operation. The results
of the different sub-queries in this case are identical in
terms of the number and type of the variables returned.
The linking operation can be directly applied.
• It can include multiple sub-queries of different types
linked by logical operators. In this case, the results of
the different types of sub-queries have different variables
and we cannot apply the linking operation directly. The
location variable (loc) is common between the two models
and has the same meaning and value for the same
device. For different types of sub-queries we therefore apply the
logical operations based on the location variable only. For
example, if an application-layer sub-query is combined
with a network-layer sub-query by the AND operation,
we calculate the result of both sub-queries and then
compute the intersection between the location values
in both results. Only requests and packet flows whose
location falls within the intersection are returned.
The result of a query is a list of states that satisfy the query
expression. In our model we may have two types of results:
the first is a set of network-layer states, each represented as
packet flow characteristics and a location; the second is a
set of application-layer states, each represented as request
characteristics and a location. Which of these types appears
depends on the types of sub-queries included within the query
script.
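A minimal sketch of the location-based combination rule, assuming each satisfying state is represented as a dictionary that contains the shared loc variable (an assumption for illustration):

def combine_by_location(app_states, net_states):
    """AND-combine an application-layer and a network-layer sub-query result
    by intersecting on the shared loc variable, as described above."""
    shared_locs = {s["loc"] for s in app_states} & {s["loc"] for s in net_states}
    return ([s for s in app_states if s["loc"] in shared_locs],
            [s for s in net_states if s["loc"] in shared_locs])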
C. Example properties
Property 1: Conflicting network and application access-control.
(a): Given a user location and userID, does the current
configuration allow the user to access the server machine,
while the application-layer access control blocks the
connection?
The query shown in Table I specifies the initial properties
of flows with certain user information (e.g., a specific source
IP "user addr" and a user identifier in the application-layer
request) that are targeted towards a service residing elsewhere
(i.e., server, port′). If there is an inconsistency in the
configuration, the query returns a list of requests that cannot
access the specified service and another list of packet flows
that can eventually reach the host's network layer. We can see
that the query combines two different types of sub-queries
and restricts the location to a particular source machine. Each
sub-query is surrounded by square brackets "[ ]".
(b): Does the current configuration block the user’s access to
the server machine through network-layer filtering, while the
application’s access-control layer permits such a connection?
As in the previous property (a), we try to see whether a request
that is permitted by the application-layer access control
will never reach the service. This means that somewhere
before reaching the server hosting the service there is a
network-layer device that blocks the traffic, or fails to route
it correctly. We use the application-layer model to find
those requests that can pass from the source to a particular
service, and we use the network model to find whether the
underlying packet flows that should carry the requests are
allowed to flow from the source to the appropriate destination.
Property 2: Can a user access a resource under different
credentials, if he is prohibited from accessing it under his
original identity?
In this property, we check whether a certain user can masquerade
under another identity to access a resource. This forms
a back-door to this specific object-action pair. A straightforward
example is a user accessing an NFS server to which he does not
have access via a web server that can retrieve the content in
the form of web pages. This can be achieved by an improper
request transformation in a service that should not be reachable
by the specified user. Formally, this is defined by evaluating an
expression specifying which users cannot access an object directly,
while the object can eventually be accessed if the constraint on
the user identity is removed.
Property 3: What access rights does an object require?
(a): What roles can user u use to access object o?
It is sometimes essential to know what roles a user can
assume when accessing a specific object, or a group of
objects. The query consists of checking the space of requests
that can pass through the network and RBAC filtering to
reach our object of interest. By restricting the user part of
the space, we get the possible roles that can be used. We
can also restrict the action if needed. A practically useful
addition is to restrict the origin of the request, e.g., by
requiring that the location and source address from which
the request originates differ from those of the server.
In Table I we show only the condition part of the query,
omitting the result-format part; we can specify that only the
roles and/or any other fields be returned in the results.
(b): Which users can access object o via a role r?
This is a similar query to the previous one. This query
concerns the different users who can access a given object in
a certain capacity. For example, we might want to know who
can access a critical file as an administrator. Also, we can
add extra restrictions to see who can access this object for
writing rather than just reading.
Property 4: Are there any conflicts within the application-layer
access control?
(a): Are there any inconsistencies in the allowed actions for a
specific object?
Such conflicts can arise within the same policy or across
policies. For example, if a user is granted write access
to an object then, most probably, read access should
be allowed as well. This query is application dependent,
and the priority between actions has to be specified explicitly
(e.g., ‘delete’ > ‘write’ > ‘read’). We write the query
for a service to check whether it is possible for some user/role
to reach the service via a higher action, but not with a
lower one. We give the general form of such a query
in Table I; the profiles [high security requirements] and
[low security requirements] can be replaced with any
combination of constraints on the request fields. For example,
to compare rights for reading and writing, the high security
profile may be [obj(o)∧act(wr)] and the low security profile
may be represented as [obj(o) ∧ act(rd)] for the particular
object (o).
(b): Are role-role relations consistent?
As in the previous property (a), we might need to verify that
the order of role privileges is maintained. In other words, a
more powerful role should always be capable of performing
all actions possible for a weaker role. For example, an administrator
should be able to do at least everything doable by a staff
member, and guests should never have more access than other
roles. This is checked by testing whether the space of possible
actions-over-objects that can be performed by role1 but not by
role2 is empty (given that role1 < role2). In this case the high
and low security profiles can be represented as [role(role1) ∧
obj(o) ∧ act(a)] and [role(role2) ∧ obj(o) ∧ act(a)], respectively.

TABLE I
EXAMPLES FOR REACHABILITY AND SECURITY PROPERTIES

P1 (a): Requests reaching the host but not the service it is running.
loc(user addr) ∧ [src(user addr) ∧ dest(server) ∧ dport(port′) ∧ EF(loc(server))] ∧ [usr(userID) ∧ ¬EF(loc(server) ∧ srv(port′))]

P1 (b): Requests reaching the service but unable to reach the host itself.
loc(user addr) ∧ [usr(userID) ∧ EF(loc(server) ∧ srv(port′))] ∧ [src(user addr) ∧ dest(server) ∧ dport(port′) ∧ ¬EF(loc(server))]

P2: Backdoor: a user is denied direct access to a service, but can use another service to indirectly access it.
loc(user addr) ∧ [usr(userID) ∧ ¬EF(usr(userID) ∧ obj(o) ∧ loc(server) ∧ srv(port′)) ∧ EF(usr(¬userID) ∧ obj(o) ∧ loc(server) ∧ srv(port′))] ∧ [EF(loc(server) ∧ dport(port′))]

P3 (a): What roles and actions can a user use to access a specific object from outside the server domain?
¬loc(server) ∧ [EF(loc(server) ∧ srv(port′) ∧ usr(u) ∧ obj(o))] ∧ [¬src(server) ∧ EF(loc(server) ∧ dport(port′))]

P3 (b): What users can access a given object?
¬loc(server) ∧ [¬src(server) ∧ EF(loc(server) ∧ dport(port′))] ∧ [EF(loc(server) ∧ srv(port′) ∧ role(r) ∧ obj(o) ∧ act(a))]

P4: Is there any inconsistency between rights of low- and high-privilege requests?
EF(loc(server) ∧ srv(port′) ∧ [high security requirements]) ∧ ¬EF(loc(server) ∧ srv(port′) ∧ [low security requirements])
VI. RELATED WORK
There has been significant research effort in the area
of configuration verification and management in the past
few years. We can classify the work in this area into
two main approaches: top-down and bottom-up. The top-down
approaches [20], [5] create clean-slate configurations
based on high-level requirements, whereas the bottom-up
approaches [13], [1], [24] analyze the existing configuration to
verify desired properties. We focus our discussion on the
bottom-up approach as it is closer to our work in this paper.
There has been considerable work recently on detecting
misconfiguration in routing and firewalls. Many of these approaches
are specific to BGP misconfiguration [9], [17], [11],
[4], while [23], [13], [1], [24] focus on conflict analysis of
firewall configurations. A BDD-based modeling and taxonomy of
IPSec configuration conflicts was presented in [2], [13]. FIREMAN
[24] uses BDDs to detect conflicts in Linux iptables
configurations. In [19] and [21], the authors developed a
firewall analysis tool to perform customized queries on a set of
filtering rules of a firewall, but no general model of network
connections is used in that work.
In the field of distributed firewalls, current research mainly
focuses on the management of distributed firewall policies.
The first generation of global policy management technology
is presented in [12], which proposes a global policy definition
language along with algorithms for verifying the policy and
generating filtering rules. In [6], the authors adopted a better
approach by using a modular architecture that separates the
security policy and the underlying network topology to allow
for flexible modification of the network topology without the
need to update the security policy. Similar work has been
done in [14] with a procedural policy definition language,
and in [16] with an object-oriented policy definition language.
In terms of distributed firewall policy enforcement, a novel
architecture is proposed in [15] where the authors suggest
using a trust management system to enforce a centralized
security policy at individual network endpoints based on
access rights granted to users or hosts. We found that none
of the published work in this area addressed the problem of
discovering conflicts in distributed firewall environments.
A variety of approaches have been proposed in the area of
policy conflict analysis. The most significant attempt for IPSec
policy analysis is proposed in [10]. The technique simulates
IPSec processing by tracking the protection applied on the
traffic in every IPSec device. At any point in the simulation,
if packet protection violates the security policy requirements,
a policy conflict is reported. Although this approach can
discover IPSec policy violations in a certain simulation scenario,
there is no guarantee that it discovers every possible
violation that may exist. In addition, the proposed technique
only discovers IPSec conflicts resulting from incorrect tunnel
overlapping, but does not address the other types of conflicts that we study in this research.
Other works attempt to create general models for analyzing
network configuration [8], [22]. An approach for formulating
and deriving sufficient conditions for connectivity constraints
is presented in [8]. The static analysis approach [22] is one
of the most interesting works close to ConfigChecker.
This work uses a graph-based approach to model the connectivity
of a network configuration and uses set operations to perform
static analysis. The transitive closure, as opposed to a fixed
point in our approach, is computed; thus, it seems that all
possible paths are computed explicitly. In addition, considering
security devices and properties, providing a rich query
interface based on our CTL extension, and utilizing BDD
optimizations are major advantages of our work. Anteater [18]
is another interesting tool for checking invariants in the data
plane. It checks high-level network invariants, represented
as instances of Boolean satisfiability (SAT) problems, against
the network state using a SAT solver, and reports counterexamples
for violations, if any exist.
In conclusion, although this body of work has had a
significant impact on the field, each approach provides limited
analysis due to its restriction to a specific network function or application.
Unlike previous work, our work offers a global configuration
verification that is comprehensive, scalable and highly
expressive.
VII. CONCLUSION
We presented an extension to the ConfigChecker tool to
incorporate both network and application configurations in a
unified system across the entire network. Our extended system
models the configuration of various devices in the network
layer (hubs, switches, routers, firewalls, IPsec gateways) and
the access control of application-layer services, including
multiple levels of request translation. Network and system configuration
can be modeled together and used to verify properties using
CTL-embedded functions translated into Boolean operations.
We show that we can separate variables into two model checkers
to reduce the state space and required resources, yet both
models can be used to run combined queries.
Our future work includes enhancements to the model's performance
for even faster execution and lower construction
time. We also plan to extend the supported device and
node types to add more virtual devices and compound devices
that can incorporate multi-node functionality, as in some
modern network-based devices. Moreover, we plan a user interface
that facilitates interactive execution of queries as well as
updating and editing of configurations, for more practical
deployment of the tool. We will also try to find a
practical mapping scheme between application requests and
corresponding packet flows to automatically detect the flows
required to communicate a request between different services.
REFERENCES
[1] E. Al-Shaer and H. Hamed. Discovery of policy anomalies in distributed
firewalls. In Proceedings of IEEE INFOCOM’04, March 2004.
[2] E. Al-Shaer and H. Hamed. Taxonomy of conflicts in network security
policies. IEEE Communications Magazine, 44(3), March 2006.
[3] E. Al-Shaer, W. Marrero, A. El-Atawy, and K. Elbadawi. Network
configuration in a box: Towards end-to-end verification of network
reachability and security. In ICNP, pages 123–132, 2009.
[4] R. Alimi, Y. Wang, and Y. R. Yang. Shadow configuration as a network
management primitive. In SIGCOMM ’08: Proceedings of the ACM
SIGCOMM 2008 conference on Data communication, pages 111–122,
New York, NY, USA, 2008. ACM.
[5] H. Ballani and P. Francis. Conman: a step towards network manageability.
SIGCOMM Comput. Commun. Rev., 37(4):205–216, 2007.
[6] Y. Bartal, A. Mayer, K. Nissim, and A. Wool. Firmato: A novel firewall
management toolkit. ACM Trans. Comput. Syst., 22(4):381–420, 2004.
[7] J. Burch, E. Clarke, K. McMillan, D. Dill, and J. Hwang. Symbolic
model checking: 10^20 states and beyond. Journal of Information and
Computation, 98(2):1–33, June 1992.
[8] R. Bush and T. Griffin. Integrity for virtual private routed networks. In
IEEE INFOCOM 2003, volume 2, pages 1467– 1476, 2003.
[9] N. Feamster and H. Balakrishnan. Detecting BGP configuration faults
with static analysis. In NSDI, 2005.
[10] Z. Fu, F. Wu, H. Huang, K. Loh, F. Gong, I. Baldine, and C. Xu.
IPSec/VPN security policy: Correctness, conflict detection and resolution.
In Policy’2001 Workshop, pages 39–56, January 2001.
[11] T. G. Griffin and G. Wilfong. On the correctness of IBGP configuration.
In SIGCOMM ’02: Proceedings of the ACM SIGCOMM 2002 conference
on Data communication, pages 17–29, 2002.
[12] J. Guttman. Filtering posture: Local enforcement for global policies. In
IEEE Symposium on Security and Privacy, pages 120–129, May 1997.
[13] H. Hamed, E. Al-Shaer, and W. Marrero. Modeling and
verification of IPSec and VPN security policies. In IEEE International
Conference on Network Protocols (ICNP’2005), Nov. 2005.
[14] S. Hinrichs. Policy-based management: Bridging the gap. In 15th
Annual Computer Security Applications Conference (ACSAC’99), pages
209–218, December 1999.
[15] S. Ioannidis, A. Keromytis, S. Bellovin, and J. Smith. Implementing
a distributed firewall. In 7th ACM Conference on Computer and
Communications Security (CCS’00), pages 190–199, November 2000.
[16] I. Luck, C. Schafer, and H. Krumm. Model-based tool assistance for
packet-filter design. In IEEE Workshop on Policies for Distributed
Systems and Networks (POLICY’01), pages 120–136, January 2001.
[17] R. Mahajan, D. Wetherall, and T. Anderson. Understanding BGP misconfiguration.
In SIGCOMM ’02: Proceedings of the ACM SIGCOMM
2002 conference on Data communications, pages 3–16, New York, NY,
USA, 2002. ACM.
[18] H. Mai, A. Khurshid, R. Agarwal, M. Caesar, P. B. Godfrey, and S. T.
King. Debugging the data plane with anteater. SIGCOMM Comput.
Commun. Rev., 41(4):290–301, Aug. 2011.
[19] A. Mayer, A. Wool, and E. Ziskind. Fang: A firewall analysis engine.
In IEEE Symposium on Security and Privacy (SSP’00), pages 177–187,
May 2000.
[20] S. Narain. Network configuration management via model finding. In
LISA, pages 155– 168, 2005.
[21] A. Wool. A quantitative study of firewall configuration errors. IEEE
Computer, 37(6):62–67, 2004.
[22] G. G. Xie, J. Zhan, D. Maltz, H. Zhang, A. Greenberg, G. Hjalmtysson,
and J. Rexford. On static reachability analysis of ip networks. In IEEE
INFOCOM 2005, volume 3, pages 2170– 2183, 2005.
[23] Y. Yang, C. U. Martel, and S. F. Wu. On building the minimum number
of tunnels: An ordered-split approach to manage IPsec/VPN tunnels.
In 9th IEEE/IFIP Network Operations and Management Symposium
(NOMS2004), pages 277–290, May 2004.
[24] L. Yuan, J. Mai, Z. Su, H. Chen, C. Chuah, and P. Mohapatra.
FIREMAN: A toolkit for firewall modeling and analysis. In IEEE
Symposium on Security and Privacy (SSP’06), May 2006.
JMLR: Workshop and Conference Proceedings 2012 11th International Conference on Grammatical Inference
Bootstrapping Dependency Grammar Inducers
from Incomplete Sentence Fragments via Austere Models
Valentin I. Spitkovsky valentin@cs.stanford.edu
Computer Science Department, Stanford University and Google Research, Google Inc.
Hiyan Alshawi hiyan@google.com
Google Research, Google Inc., 1600 Amphitheatre Parkway, Mountain View, CA, 94043
Daniel Jurafsky jurafsky@stanford.edu
Departments of Linguistics and Computer Science, Stanford University, Stanford, CA, 94305
Editors: Jeffrey Heinz, Colin de la Higuera, and Tim Oates
Abstract
Modern grammar induction systems often employ curriculum learning strategies that begin
by training on a subset of all available input that is considered simpler than the full data.
Traditionally, filtering has been at granularities of whole input units, e.g., discarding entire
sentences with too many words or punctuation marks. We propose instead viewing inter-punctuation
fragments as atoms, initially, thus making some simple phrases and clauses of
complex sentences available to training sooner. Splitting input text at punctuation in this
way improved our state-of-the-art grammar induction pipeline. We observe that resulting
partial data, i.e., mostly incomplete sentence fragments, can be analyzed using reduced
parsing models which, we show, can be easier to bootstrap than more nuanced grammars.
Starting with a new, bare dependency-and-boundary model (DBM-0), our grammar inducer
attained 61.2% directed dependency accuracy on Section 23 (all sentences) of the Wall
Street Journal corpus: more than 2% higher than previous published results for this task.
Keywords: Dependency Grammar Induction; Unsupervised Dependency Parsing;
Curriculum Learning; Partial EM; Punctuation; Unsupervised Structure Learning.
1. Introduction
“Starting small” strategies (Elman, 1993) that gradually increase complexities of training
models (Lari and Young, 1990; Brown et al., 1993; Frank, 2000; Gimpel and Smith, 2011)
and/or input data (Brent and Siskind, 2001; Bengio et al., 2009; Krueger and Dayan, 2009;
Tu and Honavar, 2011) have long been known to aid various aspects of language learning.
In dependency grammar induction, pre-training on sentences up to length 15 before moving
on to full data can be particularly effective (Spitkovsky et al., 2010a,b, 2011a,b). Focusing
on short inputs first yields many benefits: faster training, better chances of guessing larger
fractions of correct parse trees, and a preference for more local structures, to name a few.
But there are also drawbacks: notably, unwanted biases, since many short sentences are not
representative, and data sparsity, since most typical complete sentences can be quite long.
We propose starting with short inter-punctuation fragments of sentences, rather than
with small whole inputs exclusively. Splitting text on punctuation allows more and simpler
word sequences to be incorporated earlier in training, alleviating data sparsity and complexity
concerns. Many of the resulting fragments will be phrases and clauses, since punctuation
correlates with constituent boundaries (Ponvert et al., 2010, 2011; Spitkovsky et al., 2011a),
and may not fully exhibit sentence structure. Nevertheless, we can accommodate these and
other unrepresentative short inputs using our dependency-and-boundary models (DBMs),
which distinguish complete sentences from incomplete fragments (Spitkovsky et al., 2012).
DBMs consist of overlapping grammars that share all information about head-dependent
interactions, while modeling sentence root propensities and head word fertilities separately,
for different types of input. Consequently, they can glean generalizable insights about local
substructures from incomplete fragments without allowing their unrepresentative lengths
and root word distributions to corrupt grammars of complete sentences. In addition, chopping
up data plays into other strengths of DBMs — which learn from phrase boundaries,
such as the first and last words of sentences — by increasing the number of visible edges.
Figure 1: Three types of input: (a) fragments lacking sentence-final punctuation are always
considered incomplete; (b) sentences with trailing but no internal punctuation
are considered complete though unsplittable; and (c) text that can be split on
punctuation yields several smaller incomplete fragments, e.g., Bach’s, Air and
followed. In modeling stopping decisions, Bach’s is still considered left-complete
(and followed right-complete) since the original input sentence was complete.

(a) An incomplete fragment: Odds and Ends
(b) A complete sentence that cannot be split on punctuation: “It happens.”
(c) A complete sentence that can be split into three fragments: Bach’s “Air” followed.
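To make the splitting and completeness bookkeeping concrete, the following minimal Python sketch (ours, not the authors' code) cuts a tokenized input at punctuation and assigns the flags described in Figure 1; the tokenization and the punctuation inventory are simplifying assumptions.

PUNCT = set(",.;:!?\"'()")   # simplified punctuation inventory (an assumption)

def split_fragments(tokens):
    """Split a tokenized input at punctuation and tag each fragment with the
    completeness flags of Figure 1: comp (root factor) and left-/right-
    completeness (stopping factors)."""
    sent_complete = bool(tokens) and tokens[-1] in ".!?"
    frags, cur = [], []
    for t in tokens:
        if t in PUNCT:
            if cur:
                frags.append(cur)
                cur = []
        else:
            cur.append(t)
    if cur:
        frags.append(cur)
    tagged = []
    for k, frag in enumerate(frags):
        unsplit = len(frags) == 1
        tagged.append({
            "tokens": frag,
            "comp": sent_complete and unsplit,            # only unsplit sentences stay complete
            "left_complete": sent_complete and k == 0,    # prefixes keep left-completeness
            "right_complete": sent_complete and k == len(frags) - 1,  # suffixes keep it on the right
        })
    return tagged

print(split_fragments(["Bach's", '"', "Air", '"', "followed", "."]))

Running this on example (c) yields three fragments, of which only the first is left-complete and only the last is right-complete, matching the caption above.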
2. Methodology
All of our experiments make use of DBMs, which are head-outward (Alshawi, 1996) class-based
models, to generate projective dependency parse trees for Penn English Treebank’s
Wall Street Journal (WSJ) portion (Marcus et al., 1993). Instead of gold parts-of-speech,
we use context-sensitive unsupervised tags,1 obtained by relaxing a hard clustering produced
by Clark’s (2003) algorithm using an HMM (Goldberg et al., 2008). As in our original setup
without gold tags (Spitkovsky et al., 2011b), training is split into two stages of Viterbi
EM (Spitkovsky et al., 2010b): first on shorter inputs (15 or fewer tokens), then on most
sentences (up to length 45). Evaluation is against the reference parse trees of Section 23.2
Our baseline system learns DBM-2 in Stage I and DBM-3 (with punctuation-induced
constraints) in Stage II, starting from uniform punctuation-crossing attachment probabilities
(see Appendix A for details of DBMs). Smoothing and termination of both stages are
as in Stage I of the original system. This strong baseline achieves 59.7% directed dependency
accuracy — somewhat higher than our previous state-of-the-art result (59.1%, see
also Table 1). In all experiments we will only make changes to Stage I’s training, initialized
from the same exact trees as in the baselines and affecting Stage II only via its initial trees.
1. http://nlp.stanford.edu/pubs/goldtags-data.tar.bz2:untagger.model
2. Unlabeled dependencies are converted from labeled constituents using deterministic “head-percolation”
rules (Collins, 1999) — after discarding punctuation marks, tokens that are not pronounced where they
appear (i.e., having gold part-of-speech tags $ and #) and any empty nodes, as is standard practice.

Table 1: Directed dependency and exact tree accuracies (DDA / TA) for our baseline,
experiments with split data, and previous state-of-the-art on Section 23 of WSJ.
| | Stage I | Stage II | DDA | TA |
| Baseline (§2) | DBM-2 | constrained DBM-3 | 59.7 | 3.4 |
| Experiment #1 (§3) | split DBM-2 | constrained DBM-3 | 60.2 | 3.5 |
| Experiment #2 (§4) | split DBM-i | constrained DBM-3 | 60.5 | 4.9 |
| Experiment #3 (§5) | split DBM-0 | constrained DBM-3 | 61.2 | 5.0 |
| (Spitkovsky et al., 2011b, §5.2) | constrained DMV | constrained L-DMV | 59.1 | — |
Table 2: Feature-sets parametrizing dependency-and-boundary models three, two, i and
zero: if comp is false, then so are comproot and both of compdir; otherwise, comproot is
true for unsplit inputs, compdir for prefixes (if dir = L) and suffixes (when dir = R).
| Model | PATTACH (root-head) | PATTACH (head-dependent) | PSTOP (adjacent/not) |
| DBM-3 (Appendix A) | (⋄, L, cr, comproot) | (ch, dir, cd, cross) | (compdir, ce, dir, adj) |
| DBM-2 (§3, Appendix A) | (⋄, L, cr, comproot) | (ch, dir, cd) | (compdir, ce, dir, adj) |
| DBM-i (§4, Appendix B) | (⋄, L, cr, comproot) | (ch, dir, cd) | (compdir, ce, dir) |
| DBM-0 (§5, Appendix B) | (⋄, L, cr) iff comproot | (ch, dir, cd) | (compdir, ce, dir) |
3. Experiment #1 (DBM-2): Learning from Fragmented Data
In our experience (Spitkovsky et al., 2011a), punctuation can be viewed as implicit partial
bracketing constraints (Pereira and Schabes, 1992): assuming that some (head) word from
each inter-punctuation fragment derives the entire fragment is a useful approximation in
the unsupervised setting. With this restriction, splitting text at punctuation is equivalent
to learning partial parse forests — partial because longer fragments are left unparsed, and
forests because even the parsed fragments are left unconnected (Moore et al., 1995). We
allow grammar inducers to focus on modeling lower-level substructures first,3 before forcing
them to learn how these pieces may fit together. Deferring decisions associated with potentially
long-distance inter-fragment relations and dependency arcs from longer fragments to a
later training stage is thus a variation on the “easy-first” strategy (Goldberg and Elhadad,
2010), which is a fast and powerful heuristic from the supervised dependency parsing setting.
We bootstrapped DBM-2 using snippets of text obtained by slicing up all input sentences
at punctuation. Splitting data increased the number of training tokens from 163,715
to 709,215 (and effective short training inputs from 15,922 to 34,856). Ordinarily, tree generation
would be conditioned on an exogenous sentence-completeness status (comp), using
presence of sentence-final punctuation as a binary proxy. We refined this notion, accounting
for new kinds of fragments: (i) for the purposes of modeling roots, only unsplit sentences
could remain complete; as for stopping decisions, (ii) leftmost fragments (prefixes of complete
original sentences) are left-complete; and, analogously, (iii) rightmost fragments (suffixes)
retain their status vis-à-vis right stopping decisions (see Figure 1). With this set-up,
performance improved from 59.7 to 60.2% (from 3.4 to 3.5% for exact trees — see Table 1).
Next, we will show how to make better use of the additional fragmented training data.
3. About which our loose and sprawl punctuation-induced constraints agree (Spitkovsky et al., 2011a, §2.2).

4. Experiment #2 (DBM-i): Learning with a Coarse Model
In modeling head word fertilities, DBMs distinguish between the adjacent case (adj = T,
deciding whether or not to have any children in a given direction, dir ∈ {L, R}) and nonadjacent
cases (adj = F, whether to cease spawning additional daughters — see PSTOP in
Table 2). This level of detail can be wasteful for short fragments, however, since nonadjacency
will be exceedingly rare there: most words will not have many children. Therefore,
we can reduce the model by eliding adjacency. On the down side, this leads to some loss of
expressive power; but on the up side, pooled information about phrase edges could flow more
easily inwards from input boundaries, since it will not be quite so needlessly subcategorized.
We implemented DBM-i by conditioning all stopping decisions only on the direction in
which a head word is growing, the input’s completeness status in that direction and the
identity of the head’s farthest descendant on that side (the head word itself, in the adjacent
case — see Table 2 and Appendix B). With this smaller initial model, directed dependency
accuracy on the test set improved only slightly, from 60.2 to 60.5%; however, performance
at the granularities of whole trees increased dramatically, from 3.5 to 4.9% (see Table 1).
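To make the change in conditioning context concrete, here is a small count-keeping sketch (our illustration based on Table 2; the variable names are ours, not the authors' code): DBM-2 indexes stopping decisions by (comp_dir, c_e, dir, adj), while DBM-i elides the adjacency flag so that adjacent and non-adjacent cases share statistics.

from collections import defaultdict

stop_counts_dbm2 = defaultdict(lambda: [0, 0])  # key: (comp_dir, c_e, dir, adj)
stop_counts_dbmi = defaultdict(lambda: [0, 0])  # key: (comp_dir, c_e, dir)

def observe_stop(comp_dir, c_e, direction, adj, stopped):
    """Record one stopping decision under both parameterizations."""
    stop_counts_dbm2[(comp_dir, c_e, direction, adj)][int(stopped)] += 1
    stop_counts_dbmi[(comp_dir, c_e, direction)][int(stopped)] += 1  # adjacency elided

def p_stop(counts, key):
    """Maximum-likelihood stopping probability for a conditioning context."""
    cont, stop = counts[key]
    total = cont + stop
    return stop / total if total else 0.5  # uninformed default for unseen contexts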
5. Experiment #3 (DBM-0): Learning with an Ablated Model
DBM-i maintains separate root distributions for complete and incomplete sentences (see
PATTACH for ⋄ in Table 2), which can isolate verb and modal types heading typical sentences
from the various noun types deriving captions, headlines, titles and other fragments that
tend to be common in news-style data. Heads of inter-punctuation fragments are less homogeneous
than actual sentence roots, however. Therefore, we can simplify the learning task
by approximating what would be a high-entropy distribution with a uniform multinomial,
which is equivalent to updating DBM-i via a “partial” EM variant (Neal and Hinton, 1999).
We implemented DBM-0 by modifying DBM-i to hardwire the root probabilities as one
over the number of word classes (1/200, in our case), for all incomplete inputs. With this
more compact, asymmetric model, directed dependency accuracy improved substantially,
from 60.5 to 61.2% (though only slightly for exact trees, from 4.9 to 5.0% — see Table 1).
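The root-factor change in DBM-0 can be summarized in a few lines (our sketch; the function and table names are illustrative): for incomplete inputs the root probability is hardwired to a uniform value over the word classes, which corresponds to the partial-EM update described above.

NUM_WORD_CLASSES = 200  # number of unsupervised word classes used in the experiments

def p_root(root_class, comp_root, learned_root_probs):
    """DBM-0 root factor: re-estimated for complete sentences,
    hardwired to a uniform multinomial for incomplete inputs."""
    if comp_root:
        return learned_root_probs[root_class]
    return 1.0 / NUM_WORD_CLASSES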
6. Conclusion
We presented an effective divide-and-conquer strategy for bootstrapping grammar inducers.
Our procedure is simple and efficient, achieving state-of-the-art results on a standard English
dependency grammar induction task by simultaneously scaffolding on both model and
data complexity, using a greatly simplified dependency-and-boundary model with interpunctuation
fragments of sentences. Future work could explore inducing structure from
sentence prefixes and suffixes — or even bootstrapping from intermediate n-grams, perhaps
via novel parsing models that may be better equipped for handling distituent fragments.
Acknowledgments
We thank the anonymous reviewers and conference organizers for their help and suggestions.
Funded, in part, by Defense Advanced Research Projects Agency (DARPA) Machine Reading
Program, under Air Force Research Laboratory (AFRL) prime contract no. FA8750-09-C-0181.

References
H. Alshawi. Head automata for speech translation. In ICSLP, 1996.
Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, 2009.
M. R. Brent and J. M. Siskind. The role of exposure to isolated words in early vocabulary development.
Cognition, 81, 2001.
P. F. Brown, V. J. Della Pietra, S. A. Della Pietra, and R. L. Mercer. The mathematics of statistical
machine translation: Parameter estimation. Computational Linguistics, 19, 1993.
A. Clark. Combining distributional and morphological information for part of speech induction. In
EACL, 2003.
M. Collins. Head-Driven Statistical Models for Natural Language Parsing. PhD thesis, University
of Pennsylvania, 1999.
J. L. Elman. Learning and development in neural networks: The importance of starting small.
Cognition, 48, 1993.
R. Frank. From regular to context-free to mildly context-sensitive tree rewriting systems: The path
of child language acquisition. In A. Abeillé and O. Rambow, editors, Tree Adjoining Grammars:
Formalisms, Linguistic Analysis and Processing. CSLI Publications, 2000.
K. Gimpel and N. A. Smith. Concavity and initialization for unsupervised dependency grammar
induction. Technical report, CMU, 2011.
Y. Goldberg and M. Elhadad. An efficient algorithm for easy-first non-directional dependency
parsing. In NAACL-HLT, 2010.
Y. Goldberg, M. Adler, and M. Elhadad. EM can find pretty good HMM POS-taggers (when given
a good start). In HLT-ACL, 2008.
K. A. Krueger and P. Dayan. Flexible shaping: How learning in small steps helps. Cognition, 110,
2009.
K. Lari and S. J. Young. The estimation of stochastic context-free grammars using the inside-outside
algorithm. Computer Speech and Language, 4, 1990.
M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz. Building a large annotated corpus of English:
The Penn Treebank. Computational Linguistics, 19, 1993.
R. Moore, D. Appelt, J. Dowding, J. M. Gawron, and D. Moran. Combining linguistic and statistical
knowledge sources in natural-language processing for ATIS. In SLST, 1995.
R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and
other variants. In M. I. Jordan, editor, Learning in Graphical Models. MIT Press, 1999.
F. Pereira and Y. Schabes. Inside-outside reestimation from partially bracketed corpora. In ACL,
1992.
E. Ponvert, J. Baldridge, and K. Erk. Simple unsupervised identification of low-level constituents.
In ICSC, 2010.
E. Ponvert, J. Baldridge, and K. Erk. Simple unsupervised grammar induction from raw text with
cascaded finite state models. In ACL-HLT, 2011.
V. I. Spitkovsky, H. Alshawi, and D. Jurafsky. From Baby Steps to Leapfrog: How “Less is More”
in unsupervised dependency parsing. In NAACL-HLT, 2010a.
V. I. Spitkovsky, H. Alshawi, D. Jurafsky, and C. D. Manning. Viterbi training improves unsupervised
dependency parsing. In CoNLL, 2010b.
V. I. Spitkovsky, H. Alshawi, and D. Jurafsky. Punctuation: Making a point in unsupervised
dependency parsing. In CoNLL, 2011a.
V. I. Spitkovsky, A. X. Chang, H. Alshawi, and D. Jurafsky. Unsupervised dependency parsing
without gold part-of-speech tags. In EMNLP, 2011b.
V. I. Spitkovsky, H. Alshawi, and D. Jurafsky. Three dependency-and-boundary models for grammar
induction. In EMNLP-CoNLL, 2012.
K. Tu and V. Honavar. On the utility of curricula in unsupervised learning of probabilistic grammars.
In IJCAI, 2011.

Appendix A. The Dependency-and-Boundary Models (DBMs 1, 2 and 3)
All DBMs begin by choosing a class for the root word (cr). Remainders of parse structures,
if any, are produced recursively. Each node spawns off ever more distant left dependents by
(i) deciding whether to have more children, conditioned on direction (left), the class of the
(leftmost) fringe word in the partial parse (initially, itself), and other parameters (such as
adjacency of the would-be child); then (ii) choosing its child’s category, based on direction,
the head’s own class, etc. Right dependents are generated analogously, but using separate
factors. Unlike traditional head-outward models, DBMs condition their generative process
on more observable state: left and right end words of phrases being constructed. Since left
and right child sequences are still generated independently, DBM grammars are split-head.
DBM-2 maintains two related grammars: one for complete sentences (comp = T), approximated
by presence of final punctuation, and another for incomplete fragments. These
grammars communicate through shared estimates of word attachment parameters, making
it possible to learn from mixtures of input types without polluting root and stopping factors.
DBM-3 conditions attachments on additional context, distinguishing arcs that cross
punctuation boundaries (cross = T) from lower-level dependencies. We allowed only heads of
fragments to attach other fragments as part of (loose) constrained Viterbi EM; in inference,
entire fragments could be attached by arbitrary external words (sprawl). All missing families
of factors (e.g., those of punctuation-crossing arcs) were initialized as uniform multinomials.
Appendix B. Partial Dependency-and-Boundary Models (DBMs i and 0)
Since dependency structures are trees, few heads get to spawn multiple dependents on the
same side. High fertilities are especially rare in short fragments, inviting economical models
whose stopping parameters can be lumped together (because in adjacent cases heads and
fringe words coincide: adj = T → h = e, hence ch = ce). Eliminating inessential components,
such as the likely-heterogeneous root factors of incomplete inputs, can also yield benefits.
Consider the sentence a z. It admits two structures: one rooted at a (with a attaching z) and one
rooted at z (with z attaching a). In theory, neither should be preferred. In practice, if the first
parse occurs 100p% of the time, a multi-component model could re-estimate total probability as
p^n + (1 − p)^n, where n may exceed its number of independent components. Only root and adjacent
stopping factors are nondeterministic here: PROOT(a) = PSTOP(z, L) = p and PROOT(z) = PSTOP(a, R) = 1 − p;
attachments are fixed (a can only attach z and vice-versa). Tree probabilities are thus cubes (n = 3):
a root and two stopping factors (one for each word, on different sides),

P(a z) = P(tree rooted at a) + P(tree rooted at z)
       = PROOT(a) · PSTOP(a, L) · (1 − PSTOP(a, R)) · PATTACH(a, R, z) · PSTOP(z, L) · PSTOP(z, R)
       + PROOT(z) · PSTOP(z, R) · (1 − PSTOP(z, L)) · PATTACH(z, L, a) · PSTOP(a, R) · PSTOP(a, L)
       = (p · 1) · (p · 1) · (p · 1) + ((1 − p) · 1) · ((1 − p) · 1) · ((1 − p) · 1)
       = p^3 + (1 − p)^3.

For p ∈ [0, 1] and n ∈ Z+, p^n + (1 − p)^n ≤ 1, with strict inequality if p ∉ {0, 1} and n > 1. Clearly, as
n grows above one, optimizers will more strongly prefer extreme solutions p ∈ {0, 1}, despite
lacking evidence in the data. Since the exponent n is related to numbers of input words
and independent modeling components, a recipe of short inputs combined with simpler,
partial models could help alleviate some of this pressure towards arbitrary determinism.
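A quick numeric illustration of this pressure (ours): evaluating p^n + (1 − p)^n for a few values of p and n shows how quickly the objective comes to favor deterministic p as n grows.

def total_prob(p, n):
    """Total re-estimated probability p^n + (1 - p)^n from Appendix B."""
    return p ** n + (1 - p) ** n

for n in (1, 3, 6):
    print(n, [round(total_prob(p, n), 4) for p in (0.5, 0.7, 0.9, 1.0)])
# n = 1: all values equal 1.0 (no preference)
# n = 3: 0.25, 0.37, 0.73, 1.0   -> extreme p already favored
# n = 6: 0.0312, 0.1184, 0.5314, 1.0 -> much stronger pull toward p in {0, 1}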
Building high-level features using large scale unsupervised learning
Quoc V. Le
Stanford University and Google
Joint work with: Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg Corrado, Jeff Dean, Andrew Y. Ng

Hierarchy of feature representations (Lee et al., 2009, Sparse DBNs): pixels -> edges -> face parts (combinations of edges) -> face detectors.
Faces vs. random images from the Internet.

Key results: face detector, human body detector, cat detector.

Algorithm
Each RICA layer = 1 filtering layer + pooling layer + local contrast normalization layer.
See Le et al., NIPS '11 and Le et al., CVPR '11 for applications to action recognition, object recognition, and biomedical imaging.
Very large model -> cannot fit in a single machine -> model parallelism, data parallelism.

[Architecture diagram: 200x200 input image, 3 input channels; one layer with RF size 18, 8 maps / output channels, pooling size 5, LCN size 5; its output (an image with 8 channels) is input to another layer above.]

Image -> sparse autoencoder -> sparse autoencoder -> sparse autoencoder (stacked layers).

Local receptive field networks: features computed from the image, partitioned across Machine #1, Machine #2, Machine #3, Machine #4.
Le, et al., Tiled Convolutional Neural Networks. NIPS 2010.
Asynchronous Parallel SGDs: parameter server.
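The parameter-server picture can be sketched as follows; this is a minimal, illustrative Python threading example of asynchronous SGD on a toy objective, not the actual distributed implementation, and all names in it are ours.

import threading
import numpy as np

class ParameterServer:
    """Holds the global parameters; replicas push gradients and pull weights."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def pull(self):
        with self.lock:
            return self.w.copy()

    def push(self, grad):
        with self.lock:
            self.w -= self.lr * grad  # apply each update as soon as it arrives

def replica(server, data, steps=100):
    """One model replica: compute gradients on its own data shard, asynchronously."""
    for _ in range(steps):
        w = server.pull()
        x = data[np.random.randint(len(data))]
        grad = 2 * (w - x)            # gradient of ||w - x||^2 (toy objective)
        server.push(grad)

server = ParameterServer(dim=3)
shards = [np.random.randn(50, 3) + 1.0 for _ in range(4)]   # 4 data shards
threads = [threading.Thread(target=replica, args=(server, s)) for s in shards]
for t in threads: t.start()
for t in threads: t.join()
print("learned parameters:", server.pull())  # approximately [1, 1, 1]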
Training
Dataset: 10 million 200x200 unlabeled images from YouTube/Web.
Train on 1,000 machines (16,000 cores) for 1 week.
1.15 billion parameters: 100x larger than previously reported, yet small compared to the visual cortex.
Face detector: top stimuli from the test set; optimal stimulus via optimization.
Face detector, human body detector, cat detector.
Histogram of feature values (frequency vs. feature value): faces vs. random distractors.

Invariance properties: feature response vs. horizontal shifts (0 to 20 pixels), vertical shifts (0 to 20 pixels), 3D rotation angle (0 to 90 degrees), and scale factor (0.4x to 1.6x).

ImageNet classification: 20,000 categories, 16,000,000 images.
Prior approaches: hand-engineered features (SIFT, HOG, LBP), spatial pyramid, sparse coding/compression, kernel SVMs.

20,000 is a lot of categories…
…
smoothhound, smoothhound shark, Mustelus mustelus
American smooth dogfish, Mustelus canis
Florida smoothhound, Mustelus norrisi
whitetip shark, reef whitetip shark, Triaenodon obseus
Atlantic spiny dogfish, Squalus acanthias
Pacific spiny dogfish, Squalus suckleyi
hammerhead, hammerhead shark
smooth hammerhead, Sphyrna zygaena
smalleye hammerhead, Sphyrna tudes
shovelhead, bonnethead, bonnet shark, Sphyrna tiburo
angel shark, angelfish, Squatina squatina, monkfish
electric ray, crampfish, numbfish, torpedo
smalltooth sawfish, Pristis pectinatus
guitarfish
roughtail stingray, Dasyatis centroura
butterfly ray
eagle ray
spotted eagle ray, spotted ray, Aetobatus narinari
cownose ray, cow-nosed ray, Rhinoptera bonasus
manta, manta ray, devilfish
Atlantic manta, Manta birostris
devil ray, Mobula hypostoma
grey skate, gray skate, Raja batis
little skate, Raja erinacea
…
Stingray
Manta ray
ImageNet results: random guess 0.005%; state-of-the-art (Weston, Bengio '11) 9.5%; feature learning from raw pixels: ?

ImageNet 2009 (10k categories): best published result 17% (Sanchez & Perronnin '11); our method: 19%.
Using only 1000 categories, our method > 50%.

Random guess 0.005%; state-of-the-art (Weston, Bengio '11) 9.5%; feature learning from raw pixels: 15.8%.

Feature 1, Feature 2, Feature 3, Feature 4, Feature 5.
Feature 6, Feature 7, Feature 8, Feature 9.
Feature 10, Feature 11, Feature 12, Feature 13.

Conclusions
- RICA learns invariant features
- Face neuron obtained with totally unlabeled data, given enough training and data
- State-of-the-art performances on: action recognition, cancer image classification, ImageNet
(Panels: cancer classification, action recognition, feature visualization, face neuron; ImageNet: random guess 0.005%, best published result 9.5%, our method 15.8%.)

Additional thanks: Samy Bengio, Zhenghao Chen, Tom Dean, Pangwei Koh, Mark Mao, Jiquan Ngiam, Patrick Nguyen, Andrew Saxe, Mark Segal, Jon Shlens, Vincent Vanhouke, Xiaoyun Wu, Peng Xe, Serena Yeung, Will Zou.
Joint work with: Kai Chen, Greg Corrado, Jeff Dean, Matthieu Devin, Rajat Monga, Andrew Ng, Marc'Aurelio Ranzato, Paul Tucker, Ke Yang.
References
• Q.V. Le, M.A. Ranzato, R. Monga, M. Devin, G. Corrado, K. Chen, J. Dean, A.Y. Ng. Building high-level features using large-scale unsupervised learning. ICML, 2012.
• Q.V. Le, J. Ngiam, Z. Chen, D. Chia, P. Koh, A.Y. Ng. Tiled Convolutional Neural Networks. NIPS, 2010.
• Q.V. Le, W.Y. Zou, S.Y. Yeung, A.Y. Ng. Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis. CVPR, 2011.
• Q.V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, A.Y. Ng. On optimization methods for deep learning. ICML, 2011.
• Q.V. Le, A. Karpenko, J. Ngiam, A.Y. Ng. ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning. NIPS, 2011.
• Q.V. Le, J. Han, J. Gray, P. Spellman, A. Borowsky, B. Parvin. Learning Invariant Features for Tumor Signatures. ISBI, 2012.
• I.J. Goodfellow, Q.V. Le, A.M. Saxe, H. Lee, A.Y. Ng. Measuring invariances in deep networks. NIPS, 2009.
http://ai.stanford.edu/~quocle
Tera-scale deep learning
Quoc V. Le
Stanford University and Google

Joint work with: Kai Chen, Greg Corrado, Jeff Dean, Matthieu Devin, Rajat Monga, Andrew Ng, Marc'Aurelio Ranzato, Paul Tucker, Ke Yang.
Additional thanks: Samy Bengio, Zhenghao Chen, Tom Dean, Pangwei Koh, Mark Mao, Jiquan Ngiam, Patrick Nguyen, Andrew Saxe, Mark Segal, Jon Shlens, Vincent Vanhouke, Xiaoyun Wu, Peng Xe, Serena Yeung, Will Zou.

Machine learning successes: face recognition, OCR, autonomous cars, recommendation systems, web page ranking, email classification.

Typical pipeline: feature extraction (mostly hand-crafted features) followed by a classifier.
Hand-crafted features. Computer vision: SIFT/HOG, SURF, … Speech recognition: MFCC, spectrogram, ZCR, …

New feature-designing paradigm: unsupervised feature learning / deep learning (Reconstruction ICA); expensive and typically applied to small problems.

The trend of Big Data. No matter the algorithm, more features are always more successful.
Outline
- Reconstruction ICA
- Applications to videos, cancer images
- Ideas for scaling up
- Scaling-up results

Topographic Independent Component Analysis (TICA)
[Figure: 1. Feature computation from input data using squared filter responses (W_1^T x)^2, …, (W_9^T x)^2; 2. Learning of the filter matrix W = [W_1; W_2; …; W_10000].]

Invariance explained
Features F1 and F2 detect the same edge at two locations, Loc1 and Loc2. For Image1 the feature responses are (1, 0); for Image2 they are (0, 1). The pooled feature of F1 and F2 is sqrt(1^2 + 0^2) = 1 for Image1 and sqrt(0^2 + 1^2) = 1 for Image2:
the same value regardless of the location of the edge.

TICA vs. Reconstruction ICA:
- Equivalence between sparse coding, autoencoders, RBMs and ICA.
- Build a deep architecture by treating the output of one layer as input to another layer.
- Data whitening.
Le, et al., ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning. NIPS 2011.
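A minimal NumPy sketch of a reconstruction-ICA-style objective (ours, assuming the standard form: a smooth L1 sparsity term on the features plus a reconstruction penalty that replaces ICA's orthonormality constraint; the pooling used for invariance is omitted for brevity):

import numpy as np

def rica_cost(W, X, lam=0.5, eps=1e-6):
    """Reconstruction-ICA-style cost on a batch: smooth-L1 sparsity on the
    features W @ X plus a reconstruction penalty pushing W.T @ W @ X toward X."""
    F = W @ X                                  # features: (n_features, n_examples)
    sparsity = np.sum(np.sqrt(F ** 2 + eps))   # smooth L1 penalty
    recon = W.T @ F                            # reconstruction W^T W x
    recon_err = np.sum((recon - X) ** 2)
    return sparsity + lam * recon_err

# toy usage: 64 overcomplete features on 16-dimensional whitened data
rng = np.random.default_rng(0)
X = rng.standard_normal((16, 100))
W = 0.1 * rng.standard_normal((64, 16))
print(rica_cost(W, X))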
Why RICA?
Comparison of algorithms (speed, ease of training, invariant features): sparse coding, RBMs/autoencoders, TICA, Reconstruction ICA.
Le, et al., ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning. NIPS 2011.

Summary of RICA:
- Two-layered network
- Reconstruction cost instead of orthogonality constraints
- Learns invariant features

Applications of RICA
Action recognition (example classes: sit up, drive car, get out of car, eat, answer phone, kiss, run, stand up, shake hands).
Le, et al., Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis. CVPR 2011.
[Bar charts: classification accuracy on the KTH, Hollywood2, UCF and YouTube action-recognition benchmarks, comparing learned features (alone and combined with engineered features) against engineered baselines such as Hessian/SURF, HOG, HOF, HOG/HOF, HOG3D, HMAX, GRBMs, 3DCNN and pLSA.]
Le, et al., Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis. CVPR 2011.
Cancer classification: apoptotic, viable tumor region, necrosis, …
[Bar chart (accuracy axis 84% to 92%): hand-engineered features vs. RICA.]
Le, et al., Learning Invariant Features of Tumor Signatures. ISBI 2012.

Scaling up deep RICA networks

Scaling up Deep Learning: real data vs. deep-learning data. No matter the algorithm, more features are always more successful; it's better to have more features!
Coates, et al., An Analysis of Single-Layer Networks in Unsupervised Feature Learning. AISTATS '11.
Most are local features.
Local receptive field networks: RICA features computed from the image, partitioned across Machine #1, Machine #2, Machine #3, Machine #4.
Le, et al., Tiled Convolutional Neural Networks. NIPS 2010.
Challenges with 1000s of machines.

Asynchronous Parallel SGDs: parameter server.
Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012.

Summary of Scaling up
- Local connectivity
- Asynchronous SGDs
… and more:
- RPC vs MapReduce
- Prefetching
- Single vs. double
- Removing slow machines
- Optimized softmax
- …

10 million 200x200 images, 1 billion parameters.

Training
Dataset: 10 million 200x200 unlabeled images from YouTube/Web.
Train on 2,000 machines (16,000 cores) for 1 week.
1.15 billion parameters:
- 100x larger than previously reported
- Small compared to the visual cortex
[Architecture diagram: 200x200 input image, 3 input channels; one layer with RF size 18, 8 maps / output channels, pooling size 5, LCN size 5; its output (an image with 8 channels) is input to another layer above.]
Le,%et%al.,%Building$high>level$features$using$large>scale$unsupervised$learning.%ICML%2012%
[Stacked architecture: Image -> RICA -> RICA -> RICA.]

The face neuron

[Figure: top stimuli from the test set and the optimal stimulus found by numerical optimization; histogram of feature values for faces vs. random distractors.]

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012
Invariance properties

[Plots: feature response of the face neuron under horizontal and vertical shifts (0 to 20 pixels), 3D rotation angle (0 to 90 degrees), and scale factor (0.4x to 1.6x).]

[Figure: top stimuli from the test set and the optimal stimulus by numerical optimization; histogram of feature values for pedestrians vs. random distractors.]

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012
[Figure: top stimuli from the test set and the optimal stimulus by numerical optimization; histogram of feature values for cat faces vs. random distractors.]

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012

ImageNet classification
- 22,000 categories
- 14,000,000 images
- Hand-engineered features (SIFT, HOG, LBP), spatial pyramid, sparse coding / compression

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012

22,000 is a lot of categories...
...
smoothhound, smoothhound shark, Mustelus mustelus
American smooth dogfish, Mustelus canis
Florida smoothhound, Mustelus norrisi
whitetip shark, reef whitetip shark, Triaenodon obesus
Atlantic spiny dogfish, Squalus acanthias
Pacific spiny dogfish, Squalus suckleyi
hammerhead, hammerhead shark
smooth hammerhead, Sphyrna zygaena
smalleye hammerhead, Sphyrna tudes
shovelhead, bonnethead, bonnet shark, Sphyrna tiburo
angel shark, angelfish, Squatina squatina, monkfish
electric ray, crampfish, numbfish, torpedo
smalltooth sawfish, Pristis pectinatus
guitarfish
roughtail stingray, Dasyatis centroura
butterfly ray
eagle ray
spotted eagle ray, spotted ray, Aetobatus narinari
cownose ray, cow-nosed ray, Rhinoptera bonasus
manta, manta ray, devilfish
Atlantic manta, Manta birostris
devil ray, Mobula hypostoma
grey skate, gray skate, Raja batis
little skate, Raja erinacea
...

Stingray vs. Manta ray

Best stimuli
Feature 1
Feature 2
Feature 3
Feature 4
Feature 5
Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012

Feature 6
Feature 7
Feature 8
Feature 9

Best stimuli
Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012

Feature 10
Feature 11
Feature 12
Feature 13

Best stimuli
Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012

ImageNet (22,000 categories): random guess 0.005%; state-of-the-art (Weston, Bengio '11) 9.5%; feature learning from raw pixels 15.8%.

ImageNet 2009 (10k categories): best published result 17% (Sanchez & Perronnin '11), our method 20%. Using only 1,000 categories, our method > 50%.
Other results
- We also have great features for
  - Speech recognition
  - Word-vector embedding for NLP

Conclusions
- RICA learns invariant features
- Face neuron with totally unlabeled data, given enough training and data
- State-of-the-art performances on
  - Action recognition
  - Cancer image classification
  - ImageNet
[Summary figures: cancer classification and action recognition benchmark results (accuracy in the 80-94% range); feature visualization and the face neuron; ImageNet: random guess 0.005%, best published result 9.5%, our method 15.8%.]
References
- Q.V. Le, M.A. Ranzato, R. Monga, M. Devin, G. Corrado, K. Chen, J. Dean, A.Y. Ng. Building high-level features using large-scale unsupervised learning. ICML, 2012.
- Q.V. Le, J. Ngiam, Z. Chen, D. Chia, P. Koh, A.Y. Ng. Tiled Convolutional Neural Networks. NIPS, 2010.
- Q.V. Le, W.Y. Zou, S.Y. Yeung, A.Y. Ng. Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis. CVPR, 2011.
- Q.V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, A.Y. Ng. On optimization methods for deep learning. ICML, 2011.
- Q.V. Le, A. Karpenko, J. Ngiam, A.Y. Ng. ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning. NIPS, 2011.
- Q.V. Le, J. Han, J. Gray, P. Spellman, A. Borowsky, B. Parvin. Learning Invariant Features for Tumor Signatures. ISBI, 2012.
- I.J. Goodfellow, Q.V. Le, A.M. Saxe, H. Lee, A.Y. Ng. Measuring invariances in deep networks. NIPS, 2009.

http://ai.stanford.edu/~quocle
On the Predictability of Search Trends
Yair Shimshoni Niv Efron Yossi Matias
Google, Israel Labs
Draft date: August 17, 2009
1. Introduction
Since Google Trends and Google Insights for Search were launched, they provide a daily
insight into what the world is searching for on Google, by showing the relative volume of
search traffic in Google for any search query. An understanding of web search trends can be
useful for advertisers, marketers, economists, scholars, and anyone else interested in knowing
more about their world and what's currently top-of-mind.
The trends of some search queries are quite seasonal and have repeated patterns. See, for
instance, the search trends for ski in the US and in Australia peak during the winter season; or
check out how search trends for basketball correlate with annual league events and how
consistent it is year-over-year. When looking at trends of the aggregated volume of search
queries related to particular categories, one can also observe regular patterns in at least some
of hundreds of categories, like the Food & Drink or Automotive categories. Such trends
sequences appear quite predictable, and one would naturally expect the patterns of previous
years to repeat looking forward.
On the other hand, for many other search queries and categories, the trends are quite
irregular and hard to predict. For example, the search trends for Obama, Twitter, Android, or
global warming, and trends of aggregate searches in the News & Current Events category.
Having predictable trends for a search query or for a group of queries could have interesting
ramifications. One could forecast the trends into the future, and use it as a "best guess" for
various business decisions such as budget planning, marketing campaigns and resource
allocations. One could identify deviation from such forecasting and identify new factors that are
influencing the search volume like in the detection of influenza epidemics using search queries
[Ginsberg et al. 2009], known as Flu Trends.
We were therefore interested in the following questions:
• How many search queries have trends that are predictable?
• Are some categories more predictable than others? How is the distribution of
predictable trends between the various categories?
• How predictable are the trends of aggregated search queries for different categories?
Which categories are more predictable and which are less so?
To learn about the predictability of search trends, and so as to overcome our basic limitation of
not knowing what the future will entail, we characterize the predictability of a Trends series
based on its historical performance. That is, we use the a posteriori predictability of a sequence, determined by the discrepancy between forecast trends computed at some point in the past and the actual performance since then.

Specifically, we have used a simple forecasting model that learns basic seasonality and general
trend. For each trends sequence of interest, we take a point in time, t, which is about a year
back, compute a one year forecasting for t based on historical data available at time t, and
compare it to the actual trends sequence that has occurred since time t. The discrepancy between the forecast trends and the actual trends characterizes the predictability level of a sequence,
and when the discrepancy is smaller than a predefined threshold, we denote the trends query
as predictable.
We investigate time series of search trends provided by Google Insights for Search (I4S),
which represent query shares of given search terms (or for aggregations of terms). A query
share is the total number of queries for a search term (or an entire search category) in a given
geographic region divided by the total number of queries in that region at a given point in
time. The query share represents the popularity of a query, or the aggregated search interest
that users have in a query, and we will therefore use the term search interest interchangeably
with query share.
The highlights of our observations can be summarized as follows:
• Over half of the most popular Google search queries were found to be predictable in a 12-month-ahead forecast, with a mean absolute prediction error of approximately 12% on
average.
• Nearly half of the most popular queries are not predictable, with respect to the
prediction model and evaluation framework that we have used.
• Some categories have particularly high fraction of predictable queries; for instance,
Health (74%), Food & Drink (67%) and Travel (65%).
• Some categories have particularly low fraction of predictable queries; for instance,
Entertainment (35%) and Social Networks & Online Communities (27%).
• The trends of aggregated queries per categories are much more predictable: 88% of
the aggregated category search trends of over 600 categories in Insights for Search
are predictable with a mean absolute prediction error of less than 6% on average.
• There is a clear association between the existence of seasonality patterns and higher
predictability as well as an association between high levels of outliers and lower
predictability.
Recently, the research community has started to use Google search data, provided publicly by Google Insights for Search (I4S), as auxiliary indicators for economic forecasting. [Choi & Varian 2009] have shown that aggregated search trends of Google categories can be used as extra indicators that effectively improve several US econometric prediction models. [Askitas & Zimmermann 2009] and [Suhoy 2009] have shown similar findings on German and Israeli economic data, respectively. Getting better insight into the behavior of relevant search trends therefore has high potential applicability in these domains.
For queries or aggregated sets of queries for which the search trends are predictable, one can use forecasted trends based on the prediction model as a baseline for identifying deviations
in actual trends. Such deviations are of particular interest as they are often indicative of
material changes in the domain of the queries. We consider a few examples with observed
deviation of actual trends relative to the forecasted trends, including:
• Automotive Industry
We show that in the recent 12 months there is a positive deviation relative to the
forecast baseline (i.e., an increased query share) in the searches of Auto Parts and
Vehicle Maintenance while there is a negative deviation (i.e., a decrease in query share)
in the searches of Vehicle Shopping and Auto Financing.
• US Unemployment
In relation with the recent research that showed an improvement in prediction of
unemployment rates using Google query shares [Choi and Varian 2009 b], we show that
the search interest in the category of Welfare & Unemployment has substantially risen in
the last year above the forecast based on the prediction model. We also show that the
search interest in the category Jobs has significantly decreased according to the
prediction model.
• Mexico as Vacation Destination
We examine the large decrease in the query share of the category Mexico as a vacation
destination, compared to the predictions for the last 12 months. We show that a similar
deviation of (actual vs. forecast) query share is not observed for other related
categories.
• Recession Markers
We show several examples that demonstrate possible influences of the recent recession
on search behavior, like an observed increase of query share for the category Coupons &
Rebate compared to the forecast. We also show a negative deviation of the query share for the category Restaurants compared to the forecast, whereas the category Cooking & Recipes shows a similar positive deviation.
Outline
The rest of this paper is organized as follows. In Section 2 we formulate the notion of
predictability and describe the method of estimating it along with the evaluation measures,
prediction model and the time series data that we use. In Section 3 we describe the
experiments we conducted and present their results. In Section 4 we examine the association
between the predictability of search interest and the level of seasonality or internal deviation of
the underlying search trends. Section 5 will present sensitivity analysis and error diagnostics
and in section 6 we discuss the potential use of forecasting as a baseline for identifying
deviations from regular search behavior; we demonstrate with some examples that the
discrepancies from model predictions can act as signals for recent changes in the query share.
2. Time Series Predictability
In this section we define the notion of predictability as we use it in our experiments.
Predictability.
We characterize the predictability of a time series with respect to a prediction model and a
discrepancy measure, as follows.
Assume we have:
• A time series X = {x_{t-H}, ..., x_{t+F}} with a history of size H and a future horizon of size F.
  Denote X^H = {x_{t-H}, ..., x_t} and X^F = {x_{t+1}, ..., x_{t+F}}.
• A prediction model M, which computes a forecast Y = M(X^H), where Y = {y_{t+1}, ..., y_{t+F}}.
• A discrepancy measure D = D(X, Y).
• A threshold D'.
Then, we say that X is predictable w.r.t. (M, D, D', t, H, F) iff D = D(X, Y) < D'.
The size of the discrepancy also characterizes the level of predictability of a series.
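To make the definition concrete, here is a minimal, model-agnostic sketch in Python; the function names, and the choice of the last F points of the series as the "future" part, are illustrative assumptions rather than the authors' implementation.

def is_predictable(x, horizon, model, discrepancy, threshold):
    # Split the series into the history X^H and the future X^F.
    history = x[:-horizon]
    # Forecast Y = M(X^H); the model must return `horizon` values.
    forecast = model(history)
    # Predictable iff D(X, Y) < D'; D may compare Y with the last `horizon`
    # actual values and also check consistency against the full series X.
    return discrepancy(x, forecast) < threshold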
We will often refer to a trends sequence as predictable (or not-predictable) where the various
parameters are implied by the context.

Data.
The time series that are used in the following experiments are based on Google Insights for
Search1 (I4S), which reports the query share for search terms for any time, location and category, and which is also capable of reporting the most popular queries within a given time / location / category.
A query share is defined as the total number of queries for a search term or a set of terms
(e.g., an entire search category) in a given geographic region, divided by the total number of
queries in that region, at a given point in time. The I4S categories are organized in a tree-like
hierarchical structure, with about 30 root level categories that are further divided into sub-categories in a 3-level taxonomy, to a total of about 600 categories and sub-categories. Each search query is classified by I4S into a single category; nevertheless, it is also counted as part of the query share of all its 'parent' categories. For each category, I4S calculates an
aggregated times series which represents the overall query share of this category (i.e., the
combined search interest of all the queries in the category).
In order to stay focused on the most influential patterns of the yearly seasonality and overall
trend (direction), we are using time series of monthly granularity (i.e., one data point per
calendar month) and refer to the entire available period (2004-2009). Obviously search trends
with finer granularity (e.g., weekly or daily search data) do capture more patterns of search behavior, in particular intra-monthly and especially day-of-week effects; however, the finer-resolution data is also noisier and thus calls for prediction models with higher complexity and a less homogeneous model space. We leave that for future research.
We have extracted time series over the entire available time range (2004-2009)2, consisting of 67 data points, which were partitioned into two parts:
1. The History Period - 55 monthly data points (January 2004 - July 2008)
2. The Forecast Period - 12 monthly data points (August 2008 - July 2009)
Throughout the work, we will refer to 3 data sets of time series (with a similar format):
1. Country Data - Includes time series of the query shares for the 10,000 most popular
queries in each of these countries: USA, UK, Germany, France and Brazil.
2. Category Data - Includes time series of the query shares for the 1,000 most popular
queries in the US, for 10 major I4S categories: Automotive, Entertainment, Finance &
Insurance, Food & Drink, Health, Social Networks & Online Communities, Real Estate,
Shopping, Telecommunications, and Travel.
3. Aggregated Categories Data - Includes time series of aggregated query shares for
about 600 I4S categories, which represent the normalized combined search volume in
the US for each respective category.
Generic Prediction Model.
Our prediction process is based on the STL procedure [Cleveland et al. 1990], which is a
filtering procedure based on locally weighted least squares for decomposing a given time series
X into the Trend, Seasonal and Residual components.
STL is basically an EM-like algorithm that calculates the seasonal part assuming knowledge of
the trend part (iteratively). To compute the forecast of the future values, we extrapolate the
trend sub-series using regression, and use the last seasonal period of the seasonal component.
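As a rough illustration of this scheme (not the authors' exact configuration), the sketch below uses the STL implementation in the Python statsmodels package, extrapolates the trend with a linear fit, and repeats the last seasonal period:

import numpy as np
from statsmodels.tsa.seasonal import STL

def stl_forecast(history, horizon=12, period=12):
    # Decompose the history into trend + seasonal + residual components.
    res = STL(history, period=period).fit()
    # Extrapolate the trend with a simple linear regression over time.
    t = np.arange(len(history))
    slope, intercept = np.polyfit(t, res.trend, 1)
    future_t = np.arange(len(history), len(history) + horizon)
    trend_part = slope * future_t + intercept
    # Reuse the last observed seasonal period for the future months.
    seasonal_part = np.tile(res.seasonal[-period:], int(np.ceil(horizon / period)))[:horizon]
    return trend_part + seasonal_part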
1. URL: http://www.google.com/insights/search/#
2. The time series data was pulled during July 2009, thus the value for this last month might change.

The STL procedure uses 6 configuration parameters, 3 of which are smoothing parameters for the three components, which in general should be chosen per time series. The prediction process in our experiments used a fixed STL configuration for the forecast of all the time
series. Given a sampled archive of search time series, we have used an exhaustive exploration
and evaluation process that was searching for the best parameter set from a pre-defined set of
optional parameter values. The optimality criterion was minimal mean absolute error and the
output was a single parameter set w.r.t. the given sampled archive. By choosing to use a
particular (fixed) configuration, rather than adjusting an individual parameters set for each
given time series, we are adjusting the configuration to a large set of time series thus
simplifying the prediction model and enabling much faster forecast.
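A sketch of that selection process: score each candidate configuration by its mean absolute forecast error over a sampled archive of series, and keep the single best one. The parameter grids and the forecast_with helper below are illustrative assumptions, not the configuration space used by the authors.

import itertools
import numpy as np

def pick_fixed_config(archive, forecast_with, horizon=12,
                      seasonal_grid=(7, 13, 25), trend_grid=(15, 25, 37)):
    # archive: list of 1-D numpy arrays; forecast_with(history, horizon, seasonal, trend)
    # must return a `horizon`-long forecast for the given smoothing parameters.
    best_cfg, best_err = None, np.inf
    for seasonal, trend in itertools.product(seasonal_grid, trend_grid):
        errs = [np.mean(np.abs(s[-horizon:] - forecast_with(s[:-horizon], horizon, seasonal, trend)))
                for s in archive]
        if np.mean(errs) < best_err:                 # minimal mean absolute error wins
            best_cfg, best_err = (seasonal, trend), float(np.mean(errs))
    return best_cfg, best_err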
Prediction Discrepancy function.
We define the discrepancy D as a combination of several error metrics between the forecast Y and the actual trends X^F, as well as seasonal consistency metrics determined by the difference in the auto-correlation between X^H and X.
Specifically, D is defined as a tuple of the metrics defined below:
D = < MAPE, MaxAPE, NMSSE, MeanAbsACFDiff, MaxAbsACFDiff >
Thus, we say that a given time series is predictable within the available time frame, w.r.t. the
prediction model we use and the above error and consistency metrics, if all the following
conditions are fulfilled:
1) The Mean Absolute Prediction Error (MAPE) < 25%
2) The Max Absolute Prediction Error (MaxAPE) < 100%
3) The Normalized Mean Sum of Squared Errors (NMSSE) < 10.0
4) The Mean Absolute Difference of the ACF Coef. Sets (MeanAbsACFDiff) < 0.2
5) The Max Absolute Difference of the ACF Coef. Sets (MaxAbsACFDiff) < 0.4
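For illustration, here are two of these metrics and their thresholds in Python; the NMSSE and ACF-difference metrics are analogous and omitted for brevity.

import numpy as np

def mape(actual, forecast):
    # Mean Absolute Prediction Error over the forecast horizon.
    return float(np.mean(np.abs(actual - forecast) / np.abs(actual)))

def max_ape(actual, forecast):
    # Maximum Absolute Prediction Error over the forecast horizon.
    return float(np.max(np.abs(actual - forecast) / np.abs(actual)))

def passes_error_conditions(actual, forecast):
    # Conditions (1) and (2) above: MAPE < 25% and MaxAPE < 100%.
    return mape(actual, forecast) < 0.25 and max_ape(actual, forecast) < 1.00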
Predictability Ratio.
Given a set A of time series, denote its predictability ratio as the number of predictable time
series in A, divided by the total number of time series in A.

3. Experiments and Results
Comparing the Predictability of Top Queries in Different Countries.
We have conducted an experiment to test the predictability of search trends with regard to the
10,000 most popular search queries in five countries:
Country   Predictability Ratio   Avg. MAPE (predictable queries)   Avg. MaxAPE (predictable queries)
USA       54.1                   11.8                              27.1
UK        51.4                   12.7                              32.1
Germany   56.1                   11.8                              28.2
France    46.9                   12.8                              28.8
Brazil    46.3                   13.7                              30.5
Although the above results show some variability among the different countries, one can see
that in general, about half of the time series that correspond to popular queries in Google Web
Search are predictable with respect to the given prediction model and discrepancy function /
threshold. One can see that among the predictable queries, the mean absolute prediction error
(MAPE) is about 12% on average, while the maximum absolute prediction error (MaxAPE) is
about 30% on average.
The Seasonality of Time Series.
Time series in general often include various forms of regularity, like a consistent trend
(straight, upward or downward) or seasonal patterns (daily, weekly, monthly, etc). In seasonal
time series, the amplitude changes along the time in a regular recurring fashion according to
the relevant season. In many practical cases, it is common to use a seasonality adjustment
where the seasonal component is subtracted from the time series before the analysis, where
there are procedures that decompose time series into their seasonal and trend components
[Cleveland and Tiao 1976], [Lytras etal. 2007]. We use such a decomposition to compute a
metric that represents the relative portion of seasonality within a time series as follows:
Given a time series X = {x1, ... ,xT} and a decomposition of X into a seasonal component S and
a Trend (i.e., directional) component Tr then:
Seasonality Ratio(X) = ( ∑ |Si| ) / ( ∑ |Tri| )
For each time series we forecast, we compute the respective seasonality ratio.
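A short sketch of this metric, using an STL decomposition as the seasonal/trend split (the particular choice of decomposition here is an assumption):

import numpy as np
from statsmodels.tsa.seasonal import STL

def seasonality_ratio(x, period=12):
    # Ratio of the total absolute seasonal component to the total absolute trend component.
    res = STL(x, period=period).fit()
    return float(np.sum(np.abs(res.seasonal)) / np.sum(np.abs(res.trend)))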
For example, let us examine the time series which represents the search interest for the query
Cheesecake (in the US, 2004-2009). The blue curve in the following plot shows the original
time series which has a significant seasonal component. The red curve is the seasonality
adjusted time series; i.e., the trend component that is left after subtracting the seasonality
component. It has an upward trend (with Slope:0.18) plus some variability. The seasonality
ratio is 2.64 (which is at the 96th percentile of the 10,000 tested queries) and approximates the ratio between the area between the red and blue curves and the area underneath the red curve.
The Deviation of a Time Series.
In order to assess the extent to which a time series contains extreme values or outliers with
large deviation from the overall pattern, we calculate for each time series the deviation ratio.
In general, we compute the sum of the top values in the series divided by the total sum of the
series, assuming that a large ratio would indicate the existence of considerable extreme values
in the series. We normalize by the relative number of top values under consideration.
Given a time series X = {x1, ..., xT} and an integer w, s.t. 1 <= w <= 100, let Prc(X, w) denote the w'th percentile of the values of X. Then:
Deviation Ratio(X) = ( ( ∑_{i: xi >= Prc(X,w)} xi ) / ( ∑_i xi ) ) / ( 1 - (w/100) )
We use w=90. Notice that the normalization term, (1-(w/100)) in the denominator, is setting
the minimal ratio to be 1. Due to the relatively short time series (of 67 points) and since many
cases show seasonal patterns with high and narrow peaks (e.g., like in the plot above), it is
possible that these sharp peaks will be considered as outliers, although they are a regular part
of the time series' recurring dynamics. To mitigate this, we computed the deviation ratio on
the seasonal adjusted time series (i.e., on the Trend component that is left after the seasonal
component is subtracted by the decomposition we have described above).
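A sketch of the deviation ratio with w = 90, to be applied to the seasonally adjusted series; the exact percentile convention is an assumption:

import numpy as np

def deviation_ratio(x, w=90):
    # Share of the series' total mass contained in the top (100 - w)% of values,
    # normalized so that a series with no extreme values scores close to 1.
    threshold = np.percentile(x, w)
    top_share = x[x >= threshold].sum() / x.sum()
    return float(top_share / (1.0 - w / 100.0))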
The Predictability of Search Categories.
In order to assess the predictability of categories, we have extracted the 1000 most popular
queries in the US for a selection of 10 root level categories and tested their predictability. In
the following table, we present the summary results, where the Predictability column on the
left refers to the entire 1,000 queries (per category), and the two error metrics (MAPE and
MaxAPE in columns 3 & 4) refer only to the sub-set of Predictable queries within each category.
The seasonality and deviation ratios are also referring to the entire category sets.
The Predictability per category spans from 74% for the Health category, to 27% for the Social
Networks & Online Communities category. In the third column we can see that the Mean
Absolute Prediction Error (MAPE) varies from 9% (in the Health category) to 14.1% (in the Social Networks & Online Communities category). The average MAPE for the 10 categories is 12.35%. Notice that the order of the predictability ratios is not equal to the order of the MAPE errors, since the Predictability is based on several other metrics as described in Section 2; however, the correlation between them is high (r= -0.85).
The variability within the columns of seasonality ratio and deviation ratio represents the
differences between the search profiles of the various categories, which correspond to the
variability of the categories' predictability ratio. For example notice the relatively high
seasonality ratio and low deviation ratio of the Food & Drink category which has 66.7%
predictability ratio vs. the opposite situation of the Entertainment category that has 35.4%
predictability ratio with a relatively low seasonality ratio and high deviation ratio.
Category Name            Predictability Ratio   MAPE (predictable queries)   MaxAPE (predictable queries)   Seasonality Ratio   Deviation Ratio
Health                   74.00                  9.00                         20.00                          0.73                1.58
Food & Drink             66.70                  11.90                        26.00                          1.20                1.74
Travel                   64.70                  11.80                        27.00                          1.09                1.61
Shopping                 63.30                  12.40                        28.00                          1.21                1.78
Automotive               57.60                  11.20                        24.90                          0.71                1.84
Finance & Insurance      52.90                  13.30                        30.60                          0.65                2.00
Real Estate              49.50                  12.90                        29.90                          0.72                1.82
Telecommunications       45.60                  12.90                        29.40                          0.32                2.34
Entertainment            35.40                  14.00                        32.30                          0.46                2.49
Social Networks          27.50                  14.10                        30.10                          0.19                2.95
For the above summary results of the 10 categories, the correlation between the Predictability
and the Seasonality Ratio is r= 0.80 while the Deviation Ratio has a (negative) correlation of
r= -0.94 with the Predictability. In the next section we will further examine the association
between these regularity characteristics and the predictability.
The Predictability of Aggregated Time Series that represent Categories.
We now show the results of an experiment of forecasting aggregated times series that
represent the overall query share of categories (i.e., the combined search interest of all the
queries in the category).
We ran the experiment on the aggregated time series of over 600 I4S categories and
computed the average absolute prediction error over a period of 12 months ahead. We found
88% of the aggregated category time series to be predictable. The average MAPE for the entire
set of aggregated category time series is 8.15%. (6.7% for Predictable queries only), with
STD=4.18%. The Average Maximum Prediction Error (MaxAPE) for the entire set was 19.2%
(16.6% for Predictable queries only).
In the table below, we show the prediction errors for the aggregated time series for the same
10 root categories we examined above. Notice that the prediction errors are now smaller,
which was expected. However, we can also see that the order of the categories is not the same
as the respective order in the table of the previous experiment. In general, the aggregated
time series should have a higher predictability due to the noise reduction effect of the
aggregation. The rightmost column shows the MAPE Reduction Rate, which is the relative
improvement between the average prediction error (MAPE) of the 1,000 queries per category (in the previous experiment) and the single MAPE of the aggregated category time series here. All categories
(except Social Networks & Online Communities) had their MAPE reduced, starting from 47%
improvement for the Finance & Insurance category up to 85% for the Food & Drink Category.
Category              MAPE   MaxAPE   Seasonality Ratio   Deviation Ratio   MAPE Reduction Rate
Food & Drink          1.76   4.52     0.70                1.18              0.85
Shopping              2.72   6.02     2.77                1.11              0.78
Entertainment         2.74   5.95     0.30                1.16              0.80
Health                2.99   7.69     1.04                1.11              0.67
Automotive            3.27   7.36     1.69                1.12              0.71
Travel                3.94   7.61     1.92                1.12              0.67
Telecommunications    5.2    9.07     0.74                1.20              0.60
Real Estate           5.62   12.8     2.95                1.11              0.56
Finance & Insurance   7.08   17.8     0.61                1.26              0.47
Social Networks       38.6   50.4     0.06                2.46              -1.74
The I4S classification into search categories is based on a hierarchical tree-like taxonomy
where each category at the root level of the tree has several sub-categories under it. Thus, a
combination of all the categories' prediction errors into an overall evaluation of the prediction error can consist of the average MAPE values of the 27 root level categories. However, a 'regular' (uniform) average, which gives the same weight to each category, might be
inaccurate. Therefore, we have computed a weighted average of the root categories' MAPE,
where the weights are the overall relative search interest of each root category. The MAPE
Weighted Average is 4.25%.
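For example, given per-category MAPE values and their relative search-interest weights (both placeholders here), the weighted average is simply:

import numpy as np

def weighted_average_mape(mapes, weights):
    # Weight each root category's MAPE by its overall relative search interest.
    mapes, weights = np.asarray(mapes, dtype=float), np.asarray(weights, dtype=float)
    return float(np.sum(mapes * weights) / np.sum(weights))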
The following table shows the prediction errors for the I4S root categories (sorted by the
MAPE):
Root Category MAPE MaxAPE
Food & Drink 1.76 4.52
Beauty & Personal Care 2.2 7.41
Home & Garden 2.21 4.9
Photo & Video 2.34 8.31
Lifestyles 2.38 5.27
Games 2.59 4.45
Shopping 2.72 6.02
Entertainment 2.74 5.95
Business 2.91 11.5
Health 2.99 7.69
Local 3.24 5.49
Automotive 3.27 7.36
Reference 3.7 8.2
Industries 3.77 7.14
Recreation 3.81 7.58
Computers & Electronics 3.93 7.83
Travel 3.94 7.61
Internet 4.87 15
Telecommunications 5.2 9.07
Society 5.57 12.6
Real Estate 5.62 12.8
Sports 5.81 29.3
Arts & Humanities 6.98 11.8
Finance & Insurance 7.08 17.8
Science 10.1 15.5
News & Current Events 16.6 47
Social Networks 38.6 50.4
Average 5.81 12.5
Comparing the Predictability of a Category and its Sub-Categories.
It is reasonable to expect that a time series of the aggregated search of a set of queries should
in general be more predictable than single queries. The larger the aggregation set is, the
smaller the variability of the aggregated time series. This has implications for the predictability of categories vs. sub-categories, but also for aggregated time series of groups of queries, such as campaign-related or brand/topic-related queries in general. In order to demonstrate this, we have explored the MAPE and MaxAPE prediction
errors of the I4S category Vehicle Brands (in the Automotive category), compared to all its 31
'children' sub-categories.
The variability of Prediction Errors (MAPE) within the 31 vehicle brands sub-categories is
substantial and varies from 3% to 38%. The average MAPE of the 31 brands is 11.4% (with
STD=7.7%) which is quite similar to the average MAPE for the 1,000 most popular queries in
the Automotive category (11.2%) as we presented above.
As expected, the average MAPE of the 31 sub-categories is larger than the MAPE of the
aggregated time series of the Vehicle Brands category which is only 3.39%. We have also
calculated the median MAPE (9.3%), as well as the weighted average MAPE (with relative
search interest per category as weights) (9.7%). Both the median and the weighted average
are lower than the regular average but still much larger than the MAPE for the overall
aggregated category of Vehicle Brands.

4. Predictability vs. Seasonality and Deviation Ratios
Among the 10 categories whose 1,000 most popular queries we have analyzed, we calculated a correlation of r= 0.80 between the Predictability and the Seasonality Ratio and r= -0.94 between the Predictability and the Deviation Ratio (see table in Section 3).
Below, we examine the association between these two time series characteristics and the
MAPE prediction error in the experiment we conducted on the 10,000 most popular queries in
the US.
Seasonality and Prediction Errors.
Many patterns of search behavior have a strong seasonal component (e.g. holidays shopping,
summer vacation, etc.) as implied from the specific market they are in. Occasionally, there is
also a directional trend effect (up, down or changing) which might be less visually pronounced
due to the confounding seasonal pattern. We have used the Seasonality Ratio (described
above) as a representation for the 'level of seasonality' of the queries.
Among the 10,000 most popular queries in the US, the Seasonality Ratio varies in the rather
large range [0.01,13], from time series with no seasonal component up to extremely seasonal
time series. The median Seasonality Ratio is 0.4 and its mean value is 0.8.
We could see no significant correlation between the prediction error and the seasonality ratio.
In order to visualize this possible association, we have sorted the values of seasonality ratio
and created a ('smoothed') array of 10 average points3
. Similarly, we have computed a
'smoothed' array of averages for the 10,000 corresponding MAPE prediction errors which were
sorted according to the corresponding seasonality ratio.
We show here a scatter plot of the 'smoothed' MAPE vs the 'smoothed' seasonality ratio. The
plot shows a non-stable 'negative' association between prediction errors and the seasonality.
The correlation coefficient between the 'smoothed' arrays is substantial (r=0.55), compared to
the insignificant correlation we saw for the entire set.
3. Given a time series {Y_i}, i = 1, ..., N, with N = 10,000; K = 10; M = N/K = 1,000, we compute an array A = {A_1, A_2, ..., A_K} of the averages of K consecutive non-overlapping windows of size M over the time series, such that A_k = (1/M) ∑ Y_i, where k = 1, ..., K and i ∈ {1+(k-1)M, ..., kM}.

For the next plot we have repeated the same process, but for Predictable time series only.
The result shows a stronger 'negative' association between the MAPE prediction error and the
seasonality ratio for Predictable queries.
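The 'smoothing' used for these scatter plots (footnote 3) is just a block average over sorted values; a compact sketch:

import numpy as np

def block_averages(y, k=10):
    # Average K consecutive non-overlapping windows of size M = N / K.
    m = len(y) // k
    return y[: k * m].reshape(k, m).mean(axis=1)

# Usage: sort by seasonality ratio, then smooth both the ratios and the
# correspondingly ordered MAPE values before plotting one against the other.
# order = np.argsort(seasonality_ratios)
# smoothed_ratio = block_averages(seasonality_ratios[order])
# smoothed_mape = block_averages(mape_values[order])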
Deviation Ratio and Prediction Errors.
The Deviation Ratio, which represents the level of outliers and irregular extreme values in a time series, was found to be associated with the Predictability of the search interest time series. For the 10,000 queries we tested, the average deviation ratio was 2.08 (STD=1.9). Only 5% of the Predictable time series had a deviation ratio in the upper quartile, and 73% of the Predictable time series had a deviation ratio under the median. The correlation coefficient between the deviation ratio and the MAPE error was r=0.29. The average deviation ratio for the Predictable time series was 1.50, whereas for the non-Predictable queries the average was 2.77.
We have applied the same process as above in order to visually demonstrate the association
between MAPE and the deviation ratio. The following plot shows a clear positive association
between the (sorted) 'smoothed' array of the deviation Ratio and the corresponding 'smoothed'
array of prediction errors (MAPE). The correlation coefficient calculated for the 'smoothed'
arrays was r=0.88 (compared to r=0.29 which was computed with the original values). Hence,
we can say that the larger the deviation level in the time series, the larger the prediction error. This can also be seen in the next plot, for the Predictable queries only.

5. Sensitivity Analysis and Error Diagnostics
Sensitivity of the Predictability Thresholds.
As described earlier, we have chosen a predefined set of thresholds which correspond to the
three prediction error metrics (MAPE, MaxAPE, NMSSE) and two consistency metrics. These
thresholds are responsible for the trade-off between the Predictability Ratio and the
distribution of errors within the Predictable time series. In the following figure we see a
sensitivity plot for the Mean Absolute Prediction Error (MAPE), that shows how the
Predictability Ratio behaves as a function of the Predictability Threshold. We present a separate
analysis for each error measure and not as a conjunction of all the conditions as appears in our
Predictability definition.
The following plot shows that choosing a Predictability Threshold [MAPE<0.25] 'qualifies' more
than 60% of the queries (for a single metric condition). Raising the MAPE threshold by 100%, to 0.5, would imply that the Predictability Ratio would rise by ~30% (using only the MAPE error metric). Raising the MAPE threshold even more, by 200% to 0.75, would imply that the Predictability Ratio would rise by ~50% and would qualify approximately 90% of the queries.

The next plots are the sensitivity plots for the MaxAPE and NMSSE error metrics. We can see
that both chosen Predictability thresholds (1.0, 10.0) are located much farther into the
"Predictable Region" and qualify almost 90% of the queries. Thus, in our experiments we use
the MAPE as our primary 'filter' where the MaxAPE and the NMSSE play a secondary role.
The following plot displays a similar presentation by showing the number of Predictable time series as a function of the Predictability Threshold (using only the MAPE error measure).

Prediction Errors Diagnostics.
In this section we show diagnostics plots for the US data (top 10,000 queries). The following
figure shows the actual values vs. the predicted values (in log scale), for each of the 12
months in the Forecast Period. The top 12 diagrams refer to the entire set of queries, followed
by 12 diagrams for the Predictable queries only. One can clearly see the better prediction
performance for the Predictable queries (at the bottom part) as expected. Notice that the
performance for the different months deteriorates with time (higher average and STD of the
prediction errors), especially towards the later months.

In order to learn more about the distribution of the average and maximum prediction errors within
the top 10,000 most popular queries in the US, we present the histogram of the MAPE and
MaxAPE error measures, with the density estimation superimposed (in red). We can see that
both distributions are positively skewed and that the value of the average error is largely
affected by the extreme error values. Notice that we have trimmed the data at 0.75 and 3.0
for MAPE and MaxAPE respectively (i.e., 3 x the chosen thresholds), to stay focused on the
major part of the distribution.

Comparison of the Forecast Performance along the Future Horizon.
Since in our experiments we are simultaneously predicting 12 months ahead, it is expected that the forecasts for the later months may have larger prediction errors. We have compared the prediction performance for the 12 consecutive months in the forecast period. The following plot
shows the distribution of MAPE prediction errors for each future month. We are showing the
average monthly MAPE for the Predictable queries only (among the 10,000 most popular in the
US). Notice that the first month is predicted with greater accuracy than the rest; then there is an
approximately constant error level for months 2-9, with some increase of the error rate in the
last 3 months in the Forecast period.
The following plot shows the same type of diagram, but for the Mean Prediction Error (i.e., the
'directional' error measure with the sign). We can learn from this plot that there was a positive
bias (upward) in the predictions along all months except the 11th month. Such a systematic
tendency of the errors can be explained by a reduction of query share for many queries in the
Forecast period (Aug 2008 - July 2009) due to the global economic crisis. Hence the actual
search interest values were lower than expected by the prediction model that was based on
the previous years.
In the following section we present examples of categories (and queries) regarding various
markets and brands, for which the actual monthly query shares for the recent 12 months are different from the model predictions.

6. Search Interest Forecasting as a Baseline for Identifying Deviations
The aggregated query shares of the Google Insights for Search (I4S) categories were used in recent work by Choi and Varian (2009), which showed how data taken from Google I4S can help to predict economic time series. For example, in their analysis of US Retail Trade they used the weekly aggregated time series of categories like Automotive, Computers & Electronics, Apparel, Sporting Goods, Mass Merchants & Department Stores, etc. In a later work, [Choi and Varian 2009b] applied the same methodology to the U.S. unemployment time series using two sub-categories, Jobs and Welfare & Unemployment. They did not attempt to forecast the Google query share; rather, they successfully used it as a predictor for external economic time series. Other works have shown similar results,
regarding the capability of aggregated categories' query share to predict econometrics and
unemployment data from Germany [Askitas and Zimmermann 2009] as well as from Israel
[Suhoy 2009].
In the following, we will show time series of monthly query share of categories, where the
forecast values (in red) are superimposed on the actual values (in blue). The errors made by the prediction model express the deviation between the expected and the actual search behavior, which conveys valuable information regarding the current state of search interest in the respective categories. Choi and Varian have shown that the users' search interest in
several categories as represented by the aggregated query shares indeed have a short term
predictive power regarding the actual underlying. The following plots show the aggregated time
series of various categories that relate to some major US markets. These category plots, which
are ordered by their average MAPE, vary in their Predictability level.
From the 10 category plots, we can see that many present a clear seasonal pattern. The first 7
time series showed a relatively low error rate (MAPE<6%), which is in accordance with the
substantial regularity of search behavior of the respective categories that was maintained
throughout the Forecast period. However, notice that the category Finance & Insurance, which shows a seasonal pattern with some medium irregularities (its seasonality ratio is well above the median), underwent a considerable change in the recent 12 months, highlighting observed discrepancies between the predicted and the actual monthly search interest. The months of September-October 2008, which were low months in each year during the entire history period, are observed as peak months in the Forecast period. This is an example where the prediction model could not anticipate unexpected exogenous events.
The category of Energy and Utility showed the most irregular search behavior (with the lowest
Seasonality Ratio and the highest Deviation Ratio among the first 9 categories). In addition to
the low regularity of its history, it seems that this category has also undergone a change in the dynamics of search interest, probably since mid-2008. These factors contributed to the low prediction results for this category.
Another good example of lack of Predictability w.r.t. the prediction model is the last plot, of the Social Networks & Online Communities category, which has shown considerable exponential growth in the forecast period (due to the growing popularity of social networks like Facebook and Twitter) that could not be captured by the prediction model (notice the high deviation ratio). We will show below several other examples of the relation between the prediction
performance and the external market events.

Next, we show several examples where one can use the (posterior) prediction results in order
to explore the changing dynamics of users' search behavior and possibly get insights on the
relevant markets. Whenever we observe substantial prediction errors, i.e., discrepancies
between the actual values vs the predicted values, we can conclude that the regularities in the
time series (e.g., seasonality and trend) which were captured by the prediction model, were
disturbed in the Forecast period. In cases where the actual values show a regularity that is not
in accordance with the history's regularity, one could investigate the reasons for such deviation
with relation to known external factors. It is important to emphasize that users' search interest
is not necessarily always related to consumer preferences, buying intentions, etc., and can sometimes be related to news or other associated events. A full discussion of the background
and reasons for the following market observations is beyond the scope of this paper.
Example: The Automotive Industry.
We can see that the forecast for the entire Vehicle Brands category for the 12-month period between Aug-08 and Jul-09 shows a relatively low prediction error rate of -2.3% on average. However, as we show below, there are some noticeable deviations in different sub-categories. We can see in the next 4 plots that the category Vehicle Shopping shows an average negative deviation of 6% from the prediction model in the last 12 months and that the category Auto Financing shows a small negative deviation of 2.3% on average. Notice that both categories, Vehicle Maintenance and Auto Parts, show positive average deviations of 4.3% and 5.2% respectively, compared to the predictions.

4. The time series data was pulled during July 2009, thus the value for this last month is partial and might be biased.

Example: US Unemployment.
Choi and Varian (2009 b) have used weekly time series of the I4S aggregated categories
Welfare & unemployment and Jobs, to help in short term prediction of "Initial Jobless Claims”
reports which are issued by the US Department of Labor. In the following plots, we show that the search interest in the category Welfare & Unemployment has risen substantially above the forecast of the prediction model. The deviation of Welfare & Unemployment is systematic and relatively large. While the average MAPE for the entire set of (aggregated)
categories' query shares is 8.1%, with STD 8.2%, the MAPE for Welfare & Unemployment is
31.2% which is 2.8 standard deviations above the overall average MAPE.
The actual monthly values for the aggregated query share of the category Jobs are also all
higher than forecasted by the model. The time series shows a seasonal pattern with a
distinguishable low value in December each year and a relatively constant level in between. At
the end of the History period and throughout the Forecast, this regularity is shifted upwards by
a confounding volatile factor, which causes large positive prediction errors. The Average Error
is almost 9% per month.

We also present here the aggregated query share of the category Recruitment & Staffing, for
which we can observe a corresponding negative deviation where the model expectations are
larger than the actual search interest values. Interestingly, despite a similar seasonal pattern
as in the Jobs category, it seems that the change in the users' search behavior in this category
did not start until March 2009. Beforehand, the predictions were rather accurate, and the average monthly deviation is therefore only about -4.8%.
Example: Mexico as Vacation Destination.
In this example we show that the search interest for Mexico as a vacation destination has
decreased substantially in the recent months. The I4S category Mexico is a sub-category of the
Vacation Destinations category (in the Travel root category) which aggregates only the
vacation related searches on Mexico. In the next plots we can see that the search interest in
the category Mexico is down by almost 15% compared to the prediction. In comparison, we
show the respective deviation in the entire category of Vacation Destinations, which is only
-1.6% on average in the same forecast period. Notice, for reference, that the search interest in another related vacation destination, the Caribbean Islands (with a similar seasonal pattern), also has not shown a deviation of similar magnitude (only -2.5%).

We considered the recent outbreak of the Swine Flu pandemic that started to spread in April
2009 as a possible contributor for such a negative deviation of actual-vs-forecast query share
for Mexico. We examined the time series of the query share for H1N1 and found it to be highly
(anti) correlated (r = -0.93) with the observed deviations for Mexico.
As a reference, we show the aggregated query share for the category Infectious Diseases,
demonstrating the magnitude of the search interest in this subject (in blue) that was spiking
following the Swine Flu outbreak:

Example: Recession Markers.
The following plots present the aggregated query share for some I4S sub-categories in
subjects that might demonstrate the influence of the recent recession on search behavior of
consumers, and often appear in articles and blog posts. The change in search interest for the
category Coupons & Rebates is visible in the following plot, where one can see an average monthly deviation of 15.9% between the observed query share in the recent 12 months and the values predicted by the model. The model has captured the general seasonal pattern; however, it accounted only for a lower holiday peak and a much more moderate upward trend.
Next we see the observed query share of the I4S category Restaurants, which is systematically lower than the model predictions. The time series for the aggregated search interest in this category does not show a seasonal pattern; however, there has been an upward trend since 2004, which was apparently broken in September 2008, hence causing a negative actual-vs-forecast deviation with an average of -7.8% per month.

Below we can see for reference that the Cooking & Recipes category has a systematic positive
deviation of actual-vs-forecast query share. The average monthly deviation of 6.15%
represents a higher observed search interest in this category for the entire Forecast period
compared to model prediction, with almost a constant deviation since January 2009.
Another example is the category Gifts, for which the query share has decreased in the recent
12 months compared to the model predictions, by 11% per month on average. Below we can
also see that the category Luxury Goods is showing a negative deviation in the actual-vs-forecast query share, of 5.8% per month on average.

7. Conclusions
We studied the predictability of search trends. We found that over half of the most popular
Google search queries are predictable w.r.t. the method we have selected, and that several
search categories were considerably more predictable than others; that the aggregated queries
of the different categories are more predictable than the individual queries and that almost
90% of I4S categories have predictable query shares. In particular we showed that queries
with seasonal time series and lower levels of outliers are more predictable.
We considered forecasting as a baseline for identifying actual-vs-forecast deviations, and examined some concrete examples from the automotive, travel and labor verticals.
Further research can include an improved implementation of the prediction model as well as
incorporating other forecasting models. We would also like to examine short-term forecasting
in finer time granularity. Further analysis on actual-vs-forecast (including confidence
estimation) could be conducted in various domains, like market analysis, economy, health, etc.
In conjunction with this study, a basic forecasting capability was introduced into Google
Insights For Search, which provides forecasting for trends that are identified as predictable.
Researchers, marketers, journalists, and others, can use I4S to get a wide picture on search
trends which now also includes predictability of single queries and aggregated categories in
any area of interest.
Acknowledgments
We would like to thank Yannai Gonczarowsky for designing and implementing the forecasting
capabilities in I4S as well as Nir Andelman, Yuval Netzer and Amit Weinstein for creating the
forecasting model library. We thank Hal Varian for his helpful comments. Special thanks to the
entire team of Google Insights for Search that made this research possible.

References
[Askitas and Zimmermann 2009] Nikos Askitas and Klaus F. Zimmermann.
Google Econometrics and Unemployment Forecasting. Applied Economics Quarterly, 55:107-120, 2009.
URL http://ftp.iza.org/dp4201.pdf

[Choi and Varian 2009] Hyunyoung Choi and Hal Varian.
Predicting the Present with Google Trends. Technical report, Google, 2009.
URL http://google.com/googleblogs/pdfs/google_predicting_the_present.pdf

[Choi and Varian 2009b] Hyunyoung Choi and Hal Varian.
Predicting Initial Claims for Unemployment Insurance Using Google Trends. Technical report, Google, 2009.
URL http://research.google.com/archive/papers/initialclaimsUS.pdf

[Cleveland and Tiao 1976] W.P. Cleveland and G.C. Tiao.
Decomposition of Seasonal Time Series: A Model for the Census X-11 Program. Journal of the American Statistical Association, Vol. 71, No. 355, 1976, pp. 581-587.

[Cleveland et al. 1990] R.B. Cleveland, W.S. Cleveland, J.E. McRae and Irma Terpenning.
STL: A Seasonal-Trend Decomposition Procedure Based on Loess. Journal of Official Statistics, Vol. 6, No. 1, 1990, pp. 3-73.

[Ginsberg et al. 2009] Jeremy Ginsberg, Matthew H. Mohebbi, Rajan S. Patel, Lynnette Brammer, Mark S. Smolinski and Larry Brilliant.
Detecting influenza epidemics using search engine query data. Nature 457, 1012-1014 (2009).
URL http://www.nature.com/nature/journal/v457/n7232/full/nature07634.html

[Lytras et al. 2007] Demetra P. Lytras, Roxanne M. Feldpausch, and William R. Bell.
Determining Seasonality: A Comparison of Diagnostics from X-12-ARIMA. Presented at ICES III, June 2007.

[Suhoy 2009] Tanya Suhoy.
Query indices and a 2008 downturn: Israeli data. Technical report, Bank of Israel, 2009.
URL http://www.bankisrael.gov.il/deptdata/mehkar/papers/dp0906e.pdf
Building High-level Features
Using Large Scale Unsupervised Learning
Quoc V. Le quocle@cs.stanford.edu
Marc’Aurelio Ranzato ranzato@google.com
Rajat Monga rajatmonga@google.com
Matthieu Devin mdevin@google.com
Kai Chen kaichen@google.com
Greg S. Corrado gcorrado@google.com
Jeff Dean jeff@google.com
Andrew Y. Ng ang@cs.stanford.edu
Abstract
We consider the problem of building high-level,
class-specific feature detectors from
only unlabeled data. For example, is it possible
to learn a face detector using only unlabeled
images? To answer this, we train a 9-
layered locally connected sparse autoencoder
with pooling and local contrast normalization
on a large dataset of images (the model has
1 billion connections, the dataset has 10 million
200x200 pixel images downloaded from
the Internet). We train this network using
model parallelism and asynchronous SGD on
a cluster with 1,000 machines (16,000 cores)
for three days. Contrary to what appears to
be a widely-held intuition, our experimental
results reveal that it is possible to train a face
detector without having to label images as
containing a face or not. Control experiments
show that this feature detector is robust not
only to translation but also to scaling and
out-of-plane rotation. We also find that the
same network is sensitive to other high-level
concepts such as cat faces and human bodies.
Starting with these learned features, we
trained our network to obtain 15.8% accuracy
in recognizing 22,000 object categories
from ImageNet, a leap of 70% relative improvement
over the previous state-of-the-art.
Appearing in Proceedings of the 29 th International Conference
on Machine Learning, Edinburgh, Scotland, UK, 2012.
Copyright 2012 by the author(s)/owner(s).
1. Introduction
The focus of this work is to build high-level, class-specific
feature detectors from unlabeled images. For
instance, we would like to understand if it is possible to
build a face detector from only unlabeled images. This
approach is inspired by the neuroscientific conjecture
that there exist highly class-specific neurons in the human
brain, generally and informally known as “grandmother
neurons.” The extent of class-specificity of
neurons in the brain is an area of active investigation,
but current experimental evidence suggests the possibility
that some neurons in the temporal cortex are
highly selective for object categories such as faces or
hands (Desimone et al., 1984), and perhaps even specific
people (Quiroga et al., 2005).
Contemporary computer vision methodology typically
emphasizes the role of labeled data to obtain these
class-specific feature detectors. For example, to build
a face detector, one needs a large collection of images
labeled as containing faces, often with a bounding box
around the face. The need for large labeled sets poses
a significant challenge for problems where labeled data
are rare. Although approaches that make use of inexpensive
unlabeled data are often preferred, they have
not been shown to work well for building high-level
features.
This work investigates the feasibility of building high-level
features from only unlabeled data. A positive
answer to this question will give rise to two significant
results. Practically, this provides an inexpensive way
to develop features from unlabeled data. But perhaps
more importantly, it answers an intriguing question as
to whether the specificity of the “grandmother neuron”
could possibly be learned from unlabeled data. Informally,
this would suggest that it is at least in principle
possible that a baby learns to group faces into one class
because it has seen many of them and not because it
is guided by supervision or rewards.
Unsupervised feature learning and deep learning have
emerged as methodologies in machine learning for
building features from unlabeled data. Using unlabeled
data in the wild to learn features is the key idea behind
the self-taught learning framework (Raina et al.,
2007). Successful feature learning algorithms and their
applications can be found in recent literature using
a variety of approaches such as RBMs (Hinton et al.,
2006), autoencoders (Hinton & Salakhutdinov, 2006;
Bengio et al., 2007), sparse coding (Lee et al., 2007)
and K-means (Coates et al., 2011). So far, most of
these algorithms have only succeeded in learning low-level
features such as “edge” or “blob” detectors. Going
beyond such simple features and capturing complex
invariances is the topic of this work.
Recent studies observe that it is quite time intensive
to train deep learning algorithms to yield state of the
art results (Ciresan et al., 2010). We conjecture that
the long training time is partially responsible for the
lack of high-level features reported in the literature.
For instance, researchers typically reduce the sizes of
datasets and models in order to train networks in a
practical amount of time, and these reductions undermine
the learning of high-level features.
We address this problem by scaling up the core components
involved in training deep networks: the dataset,
the model, and the computational resources. First,
we use a large dataset generated by sampling random
frames from random YouTube videos.1 Our input data
are 200x200 images, much larger than typical 32x32
images used in deep learning and unsupervised feature
learning (Krizhevsky, 2009; Ciresan et al., 2010;
Le et al., 2010; Coates et al., 2011). Our model, a
deep autoencoder with pooling and local contrast normalization,
is scaled to these large images by using
a large computer cluster. To support parallelism on
this cluster, we use the idea of local receptive fields,
e.g., (Raina et al., 2009; Le et al., 2010; 2011b). This
idea reduces communication costs between machines
and thus allows model parallelism (parameters are distributed
across machines). Asynchronous SGD is employed
to support data parallelism. The model was
trained in a distributed fashion on a cluster with 1,000
machines (16,000 cores) for three days.
Experimental results using classification and visualization
confirm that it is indeed possible to build high-level
features from unlabeled data. In particular, using
a hold-out test set consisting of faces and distractors,
we discover a feature that is highly selective for faces.
1This is different from the work of (Lee et al., 2009) who
trained their model on images from one class.
This result is also validated by visualization via numerical
optimization. Control experiments show that
the learned detector is not only invariant to translation
but also to out-of-plane rotation and scaling.
Similar experiments reveal the network also learns the
concepts of cat faces and human bodies.
The learned representations are also discriminative.
Using the learned features, we obtain significant leaps
in object recognition with ImageNet. For instance, on
ImageNet with 22,000 categories, we achieved 15.8%
accuracy, a relative improvement of 70% over the state-of-the-art.
Note that random guess achieves less than
0.005% accuracy for this dataset.
2. Training set construction
Our training dataset is constructed by sampling frames
from 10 million YouTube videos. To avoid duplicates,
each video contributes only one image to the dataset.
Each example is a color image with 200x200 pixels.
A subset of training images is shown in Appendix
A. To check the proportion of faces in
the dataset, we run an OpenCV face detector on
60x60 randomly-sampled patches from the dataset
(http://opencv.willowgarage.com/wiki/). This experiment
shows that patches, being detected as faces by
the OpenCV face detector, account for less than 3% of
the 100,000 sampled patches.
3. Algorithm
In this section, we describe the algorithm that we use
to learn features from the unlabeled training set.
3.1. Previous work
Our work is inspired by recent successful algorithms
in unsupervised feature learning and deep
learning (Hinton et al., 2006; Bengio et al., 2007;
Ranzato et al., 2007; Lee et al., 2007). It is strongly
influenced by the work of (Olshausen & Field, 1996)
on sparse coding. According to their study, sparse
coding can be trained on unlabeled natural images
to yield receptive fields akin to V1 simple
cells (Hubel & Wiesel, 1959).
One shortcoming of early approaches such as sparse
coding (Olshausen & Field, 1996) is that their architectures
are shallow and typically capture low-level
concepts (e.g., edge “Gabor” filters) and simple invariances.
Addressing this issue is a focus of recent work in
deep learning (Hinton et al., 2006; Bengio et al., 2007;
Bengio & LeCun, 2007; Lee et al., 2008; 2009) which
build hierarchies of feature representations. In particular,
Lee et al (2008) show that stacked sparse RBMs
can model certain simple functions of the V2 area of
the cortex. They also demonstrate that convolutional
DBNs (Lee et al., 2009), trained on aligned images of
faces, can learn a face detector. This result is interesting,
but unfortunately requires a certain degree of
supervision during dataset construction: their training
images (i.e., Caltech 101 images) are aligned, homogeneous
and belong to one selected category.
Figure 1. The architecture and parameters in one layer of
our network. The overall network replicates this structure
three times. For simplicity, the images are in 1D.
3.2. Architecture
Our algorithm is built upon these ideas and can be
viewed as a sparse deep autoencoder with three important
ingredients: local receptive fields, pooling and local
contrast normalization. First, to scale the autoencoder
to large images, we use a simple idea known as
local receptive fields (LeCun et al., 1998; Raina et al.,
2009; Lee et al., 2009; Le et al., 2010). This biologically
inspired idea proposes that each feature in the
autoencoder can connect only to a small region of the
lower layer. Next, to achieve invariance to local deformations,
we employ local L2 pooling (Hyvärinen et al.,
2009; Gregor & LeCun, 2010; Le et al., 2010) and local
contrast normalization (Jarrett et al., 2009). L2
pooling, in particular, allows the learning of invariant
features (Hyvärinen et al., 2009; Le et al., 2010).
Our deep autoencoder is constructed by replicating
three times the same stage composed of local filtering,
local pooling and local contrast normalization. The
output of one stage is the input to the next one and
the overall model can be interpreted as a nine-layered
network (see Figure 1).
The first and second sublayers are often known as filtering
(or simple) and pooling (or complex) respectively.
The third sublayer performs local subtractive
and divisive normalization and it is inspired by biological
and computational models (Pinto et al., 2008;
Lyu & Simoncelli, 2008; Jarrett et al., 2009).2
As mentioned above, central to our approach is the use
of local connectivity between neurons. In our experiments,
the first sublayer has receptive fields of 18x18
pixels and the second sub-layer pools over 5x5 overlapping
neighborhoods of features (i.e., pooling size).
The neurons in the first sublayer connect to pixels in all
input channels (or maps) whereas the neurons in the
second sublayer connect to pixels of only one channel
(or map).3 While the first sublayer outputs linear filter
responses, the pooling layer outputs the square root of
the sum of the squares of its inputs, and therefore, it
is known as L2 pooling.
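To make the pooling operation concrete, the following is a minimal Python sketch of L2 pooling over a 1D strip of filter responses; the 1D layout and window size are simplifications of the 2D, 5x5 overlapping pooling used in the network:
import numpy as np

def l2_pool_1d(responses, pool_size=5):
    # L2 pooling over overlapping 1D neighborhoods: each output is the square
    # root of the sum of squares of `pool_size` neighboring filter responses.
    r2 = responses ** 2
    out = [np.sqrt(r2[i:i + pool_size].sum())
           for i in range(len(responses) - pool_size + 1)]
    return np.array(out)

filters = np.array([0.2, -1.0, 0.5, 0.0, 0.3, -0.4, 1.2])
print(l2_pool_1d(filters, pool_size=5))   # three overlapping pooled outputs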
Our style of stacking a series of uniform modules,
switching between selectivity and tolerance
layers, is reminiscent of Neocognitron and
HMAX (Fukushima & Miyake, 1982; LeCun et al.,
1998; Riesenhuber & Poggio, 1999). It has also
been argued to be an architecture employed by the
brain (DiCarlo et al., 2012).
Although we use local receptive fields, they are
not convolutional: the parameters are not shared
across different locations in the image. This is
a stark difference between our approach and previous
work (LeCun et al., 1998; Jarrett et al., 2009;
Lee et al., 2009). In addition to being more biologically
plausible, unshared weights allow the learning
of more invariances other than translational invariances
(Le et al., 2010).
In terms of scale, our network is perhaps one of the
largest known networks to date. It has 1 billion trainable
parameters, which is more than an order of magnitude
larger than other large networks reported in literature,
e.g., (Ciresan et al., 2010; Sermanet & LeCun,
2011) with around 10 million parameters. It is
worth noting that our network is still tiny compared
to the human visual cortex, which is $10^6$
times larger in terms of the number of neurons and
synapses (Pakkenberg et al., 2003).
3.3. Learning and Optimization
Learning: During learning, the parameters of the
second sublayers (H) are fixed to uniform weights,
2The subtractive normalization removes the weighted average of
neighboring neurons from the current neuron:
$g_{i,j,k} = h_{i,j,k} - \sum_{iuv} G_{uv}\, h_{i,j+u,i+v}$.
The divisive normalization computes
$y_{i,j,k} = g_{i,j,k} / \max\{c, (\sum_{iuv} G_{uv}\, g_{i,j+u,i+v}^2)^{0.5}\}$,
where c is set to be a small number, 0.01, to prevent numerical
errors. G is a Gaussian weighting window. (Jarrett et al., 2009)
3For more details regarding connectivity patterns and
parameter sensitivity, see Appendix B and E.
whereas the encoding weights W1 and decoding
weights W2 of the first sublayers are adjusted using
the following optimization problem
$$\min_{W_1, W_2} \; \sum_{i=1}^{m} \left( \bigl\| W_2 W_1^T x^{(i)} - x^{(i)} \bigr\|_2^2 \; + \; \lambda \sum_{j=1}^{k} \sqrt{\epsilon + H_j \bigl(W_1^T x^{(i)}\bigr)^2} \right). \qquad (1)$$
Here, λ is a tradeoff parameter between sparsity and
reconstruction; m, k are the number of examples and
pooling units in a layer respectively; Hj is the vector of
weights of the j-th pooling unit. In our experiments,
we set λ = 0.1.
This optimization problem is also known as reconstruction
Topographic Independent Component Analysis
(Hyvärinen et al., 2009; Le et al., 2011a).4 The
first term in the objective ensures the representations
encode important information about the data, i.e.,
they can reconstruct input data; whereas the second
term encourages pooling features to group similar features
together to achieve invariances.
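As a concrete illustration, the following NumPy sketch evaluates the objective in Eq. (1) on a mini-batch; the array shapes, the fixed uniform pooling matrix H, and the function name are assumptions made for illustration rather than the paper's actual implementation:
import numpy as np

def rica_objective(W1, W2, H, X, lam=0.1, eps=1e-8):
    # Objective of Eq. (1) for a batch X of shape (m, d).  W1, W2 have shape
    # (d, n_features); H has shape (k, n_features) with fixed uniform weights.
    codes = X @ W1                               # W1^T x^(i) for each example
    recon = codes @ W2.T                         # W2 W1^T x^(i)
    recon_err = np.sum((recon - X) ** 2)         # sum of squared L2 reconstruction errors
    pooled = codes ** 2 @ H.T                    # H_j (W1^T x^(i))^2 per pooling unit
    sparsity = np.sum(np.sqrt(eps + pooled))     # L2-pooling sparsity penalty
    return recon_err + lam * sparsity

# Toy check with random data and a uniform pooling matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))                      # m=4 examples, d=6 pixels
W1 = rng.normal(size=(6, 8)); W2 = rng.normal(size=(6, 8))
H = np.ones((3, 8)) / 8.0                        # k=3 pooling units, fixed uniform weights
print(rica_objective(W1, W2, H, X))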
Optimization: All parameters in our model were
trained jointly with the objective being the sum of the
objectives of the three layers.
To train the model, we implemented model parallelism
by distributing the local weights W1, W2 and H to
different machines. A single instance of the model
partitions the neurons and weights out across 169 machines
(where each machine had 16 CPU cores). A
set of machines that collectively make up a single copy
of the model is referred to as a “model replica.” We
have built a software framework called DistBelief that
manages all the necessary communication between the
different machines within a model replica, so that users
of the framework merely need to write the desired upwards
and downwards computation functions for the
neurons in the model, and don’t have to deal with the
low-level communication of data across machines.
We further scaled up the training by implementing
asynchronous SGD using multiple replicas of the core
model. For the experiments described here, we divided
the training into 5 portions and ran a copy of
the model on each of these portions. The models communicate
updates through a set of centralized “parameter
servers,” which keep the current state of all parameters
for the model in a set of partitioned servers
(we used 256 parameter server partitions for training
the model described in this paper). In the simplest
implementation, before processing each mini-batch, a
model replica asks the centralized parameter servers
for an updated copy of its model parameters. It then
processes a mini-batch to compute a parameter gradient,
and sends the parameter gradients to the appropriate
parameter servers, which then apply each
gradient to the current value of the model parameters.
4In (Bengio et al., 2007; Le et al., 2011a), the encoding
weights and the decoding weights are tied: W1 = W2.
However, for better parallelism and better features, our
implementation does not enforce tied weights.
We can reduce the communication overhead by
having each model replica request updated parameters
every P steps and by sending updated gradient
values to the parameter servers every G steps (where
G might not be equal to P). Our DistBelief software
framework automatically manages the transfer of parameters
and gradients between the model partitions
and the parameter servers, freeing implementors of the
layer functions from having to deal with these issues.
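The following Python sketch illustrates the fetch-every-P / push-every-G protocol described above with a single in-process "parameter server"; the class and function names are illustrative only and do not correspond to DistBelief's actual API:
import numpy as np

class ParameterServer:
    # Toy in-process stand-in for the sharded parameter servers: holds the
    # current parameters and applies whatever gradients replicas push.
    def __init__(self, dim):
        self.params = np.zeros(dim)

    def fetch(self):
        return self.params.copy()

    def push(self, grad, learning_rate=0.01):
        self.params -= learning_rate * grad   # apply gradient to current values

def run_replica(server, data, grad_fn, fetch_every=5, push_every=1):
    # One model replica: refresh parameters every `fetch_every` mini-batches
    # (P in the text) and push accumulated gradients every `push_every` (G).
    local = server.fetch()
    pending = np.zeros_like(local)
    for step, batch in enumerate(data):
        if step % fetch_every == 0:
            local = server.fetch()            # possibly stale between fetches
        pending += grad_fn(local, batch)
        if (step + 1) % push_every == 0:
            server.push(pending)
            pending[:] = 0.0

# Several replicas would call run_replica concurrently (e.g., in separate
# processes), each on its own shard of the training data.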
Asynchronous SGD is more robust to failure and slowness
than standard (synchronous) SGD. Specifically,
for synchronous SGD, if one of the machines is slow,
the entire training process is delayed; whereas for asynchronous
SGD, if one machine is slow, only one copy
of SGD is delayed while the rest of the optimization
can still proceed.
In our training, at every step of SGD, the gradient is
computed on a minibatch of 100 examples. We trained
the network on a cluster with 1,000 machines for three
days. See Appendix B, C, and D for more details regarding
our implementation of the optimization.
4. Experiments on Faces
In this section, we describe our analysis of the learned
representations in recognizing faces (“the face detector”)
and present control experiments to understand
invariance properties of the face detector. Results for
other concepts are presented in the next section.
4.1. Test set
The test set consists of 37,000 images sampled
from two datasets: Labeled Faces In the
Wild dataset (Huang et al., 2007) and ImageNet
dataset (Deng et al., 2009). There are 13,026 faces
sampled from non-aligned Labeled Faces in The Wild.5
The rest are distractor objects randomly sampled from
ImageNet. These images are resized to fit the visible
areas of the top neurons. Some example images are
shown in Appendix A.
4.2. Experimental protocols
After training, we used this test set to measure the
performance of each neuron in classifying faces against
distractors. For each neuron, we found its maximum
5http://vis-www.cs.umass.edu/lfw/lfw.tgz
and minimum activation values, then picked 20 equally
spaced thresholds in between. The reported accuracy
is the best classification accuracy among 20 thresholds.
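A minimal sketch of this per-neuron evaluation protocol, assuming a vector of activations on the test set and boolean face labels, is:
import numpy as np

def best_threshold_accuracy(activations, is_face, n_thresholds=20):
    # Sweep 20 equally spaced thresholds strictly between the minimum and
    # maximum activation and return the best classification accuracy.
    lo, hi = activations.min(), activations.max()
    thresholds = np.linspace(lo, hi, n_thresholds + 2)[1:-1]
    accs = [np.mean((activations > t) == is_face) for t in thresholds]
    return max(accs)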
4.3. Recognition
Surprisingly, the best neuron in the network performs
very well in recognizing faces, despite the fact that no
supervisory signals were given during training. The
best neuron in the network achieves 81.7% accuracy in
detecting faces. There are 13,026 faces in the test set,
so guessing all negative only achieves 64.8%. The best
neuron in a one-layered network only achieves 71% accuracy
while the best linear filter, selected among 100,000
filters sampled randomly from the training set, only
achieves 74%.
To understand their contribution, we removed the local
contrast normalization sublayers and trained the
network again. Results show that the accuracy of the
best neuron drops to 78.5%. This agrees with previous
studies showing the importance of local contrast
normalization (Jarrett et al., 2009).
We visualize histograms of activation values for face
images and random images in Figure 2. It can be seen
that, even with exclusively unlabeled data, the neuron learns
to differentiate between faces and random distractors.
Specifically, when we give a face as an input image, the
neuron tends to output a value larger than the threshold,
0. In contrast, if we give a random image as the input,
the neuron tends to output a value less than 0.
Figure 2. Histograms of faces (red) vs. no faces (blue).
The test set is subsampled such that the ratio between
faces and no faces is one.
4.4. Visualization
In this section, we will present two visualization techniques
to verify if the optimal stimulus of the neuron is
indeed a face. The first method is visualizing the most
responsive stimuli in the test set. Since the test set
is large, this method can reliably detect near optimal
stimuli of the tested neuron. The second approach
is to perform numerical optimization to find the optimal
stimulus (Berkes & Wiskott, 2005; Erhan et al.,
2009; Le et al., 2010). In particular, we find the normbounded
input x which maximizes the output f of the
tested neuron, by solving:
$$x^* = \arg\max_{x} f(x; W, H), \quad \text{subject to } \|x\|_2 = 1.$$
Here, f(x; W, H) is the output of the tested neuron
given learned parameters W, H and input x. In our
experiments, this constraint optimization problem is
solved by projected gradient descent with line search.
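A simplified sketch of this procedure is given below: projected gradient ascent on the unit sphere with a backtracking line search, demonstrated on a toy quadratic "neuron" rather than the trained network (the function and its name are illustrative assumptions):
import numpy as np

def maximize_neuron_response(f, grad_f, dim, steps=200, seed=0):
    # Projected gradient ascent on the unit sphere: find x with ||x||_2 = 1
    # that locally maximizes the neuron response f(x).  A backtracking line
    # search shrinks the step until the projected point improves f.
    rng = np.random.default_rng(seed)
    x = rng.normal(size=dim)
    x /= np.linalg.norm(x)                    # start on the unit sphere
    for _ in range(steps):
        g = grad_f(x)
        step = 1.0
        while step > 1e-8:
            candidate = x + step * g
            candidate /= np.linalg.norm(candidate)   # project back to ||x|| = 1
            if f(candidate) > f(x):
                x = candidate
                break
            step *= 0.5
        else:
            break                             # no ascent direction found
    return x

# Toy usage with a stand-in "neuron": a quadratic form whose maximizer on the
# sphere is the top eigenvector (the real experiment uses the trained network).
A = np.diag([3.0, 1.0, 0.5])
f = lambda x: x @ A @ x
grad_f = lambda x: 2 * A @ x
print(np.round(maximize_neuron_response(f, grad_f, dim=3), 3))  # ≈ [±1, 0, 0]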
These visualization methods have complementary
strengths and weaknesses. For instance, visualizing
the most responsive stimuli may suffer from fitting to
noise. On the other hand, the numerical optimization
approach can be susceptible to local minima. Results,
shown in Figure 3, confirm that the tested neuron indeed
learns the concept of faces.
Figure 3. Top: Top 48 stimuli of the best neuron from the
test set. Bottom: The optimal stimulus according to numerical
constraint optimization.
4.5. Invariance properties
We would like to assess the robustness of the face detector
against common object transformations, e.g.,
translation, scaling and out-of-plane rotation. First,
we chose a set of 10 face images and performed distortions
on them, e.g., scaling and translation. For out-of-plane
rotation, we used 10 images of faces rotating
in 3D (“out-of-plane”) as the test set. To check the robustness
of the neuron, we plot its averaged response
over the small test set with respect to changes in scale,
3D rotation (Figure 4), and translation (Figure 5).6
6Scaled, translated faces are generated by standard
cubic interpolation. For 3D rotated faces, we used 10 sequences
of rotated faces from The Sheffield Face Database –
http://www.sheffield.ac.uk/eee/research/iel/research/face.
See Appendix F for a sample sequence.
Figure 4. Scale (left) and out-of-plane (3D) rotation (right)
invariance properties of the best feature.
Figure 5. Translational invariance properties of the best
feature. The x-axis is in pixels.
The results show that the neuron is robust against
complex and difficult-to-hard-wire invariances such as
out-of-plane rotation and scaling.
Control experiments on dataset without faces:
As reported above, the best neuron achieves 81.7% accuracy
in classifying faces against random distractors.
What if we remove all images that have faces from the
training set?
We performed the control experiment by running a
face detector in OpenCV and removing those training
images that contain at least one face. The recognition
accuracy of the best neuron dropped to 72.5% which
is as low as simple linear filters reported in section 4.3.
5. Cat and human body detectors
Having achieved a face-sensitive neuron, we would like
to understand if the network is also able to detect other
high-level concepts. For instance, cats and body parts
are quite common in YouTube. Did the network also
learn these concepts?
To answer this question and quantify selectivity properties
of the network with respect to these concepts,
we constructed two datasets, one for classifying human
bodies against random backgrounds and one for
classifying cat faces against other random distractors.
For the ease of interpretation, these datasets have a
positive-to-negative ratio identical to the face dataset.
Figure 6. Visualization of the cat face neuron (left) and
human body neuron (right).
The cat face images are collected from the dataset described
in (Zhang et al., 2008). In this dataset, there
are 10,000 positive images and 18,409 negative images
(so that the positive-to-negative ratio is similar to the
case of faces). The negative images are chosen randomly
from the ImageNet dataset.
Negative and positive examples in our human body
dataset are subsampled at random from a benchmark
dataset (Keller et al., 2009). In the original dataset,
each example is a pair of stereo black-and-white images.
But for simplicity, we keep only the left images.
In total, like in the case of human faces, we have 13,026
positive and 23,974 negative examples.
We then followed the same experimental protocols as
before. The results, shown in Figure 6, confirm that
the network learns not only the concept of faces but
also the concepts of cat faces and human bodies.
Our high-level detectors also outperform standard
baselines in terms of recognition rates, achieving 74.8%
and 76.7% on cat and human body respectively. In
comparison, best linear filters (sampled from the training
set) only achieve 67.2% and 68.1% respectively.
In Table 1, we summarize all previous numerical results
comparing the best neurons against other baselines
such as linear filters and random guesses. To understand
the effects of training, we also measure the
performance of best neurons in the same network at
random initialization.
We also compare our method against several
other algorithms such as deep autoencoders
(Hinton & Salakhutdinov, 2006; Bengio et al.,
2007) and K-means (Coates et al., 2011). Results of
these baselines are reported in the bottom of Table 1.
6. Object recognition with ImageNet
We applied the feature learning method to the
task of recognizing objects in the ImageNet
dataset (Deng et al., 2009). We started from a
network that already learned features from YouTube
and ImageNet images using the techniques described
in this paper. We then added one-versus-all logistic
classifiers on top of the highest layer of this network.
This method of initializing a network by unsupervised
Table 1. Summary of numerical comparisons between our algorithm against other baselines. Top: Our algorithm vs.
simple baselines. Here, the first three columns are results for methods that do not require training: random guess,
random weights (of the network at initialization, without any training) and best linear filters selected from 100,000
examples sampled from the training set. The last three columns are results for methods that have training: the best
neuron in the first layer, the best neuron in the highest layer after training, the best neuron in the network when the
contrast normalization layers are removed. Bottom: Our algorithm vs. autoencoders and K-means.
Concept         Random   Same architecture     Best           Best first     Best     Best neuron without
                guess    with random weights   linear filter  layer neuron   neuron   contrast normalization
Faces           64.8%    67.0%                 74.0%          71.0%          81.7%    78.5%
Human bodies    64.8%    66.5%                 68.1%          67.2%          76.8%    71.8%
Cats            64.8%    66.0%                 67.8%          67.1%          74.6%    69.3%

Concept         Our network   Deep autoencoders   Deep autoencoders   K-means on
                              3 layers            6 layers            40x40 images
Faces           81.7%         72.3%               70.9%               72.5%
Human bodies    76.7%         71.2%               69.8%               69.3%
Cats            74.8%         67.5%               68.3%               68.5%
Table 2. Summary of classification accuracies for our method and other state-of-the-art baselines on ImageNet.
Dataset version    2009 (∼9M images, ∼10K categories)         2011 (∼14M images, ∼22K categories)
State-of-the-art   16.7% (Sanchez & Perronnin, 2011)          9.3% (Weston et al., 2011)
Our method         16.1% (without unsupervised pretraining)   13.6% (without unsupervised pretraining)
                   19.2% (with unsupervised pretraining)      15.8% (with unsupervised pretraining)
learning is also known as “unsupervised pretraining.”
During supervised learning with labeled ImageNet
images, the parameters of lower layers and the logistic
classifiers were both adjusted. This was done by first
adjusting the logistic classifiers and then adjusting
the entire network (also known as “fine-tuning”). As
a control experiment, we also train a network starting
with all random weights (i.e., without unsupervised
pretraining: all parameters are initialized randomly
and only adjusted by ImageNet labeled data).
We followed the experimental protocols specified
by (Deng et al., 2010; Sanchez & Perronnin, 2011), in
which, the datasets are randomly split into two halves
for training and validation. We report the performance
on the validation set and compare against state-of-the-art
baselines in Table 2. Note that the splits are not
identical to previous work but validation set performances
vary slightly across different splits.
The results show that our method, starting from
scratch (i.e., raw pixels), bests many state-of-the-art
hand-engineered features. On ImageNet with 10K categories,
our method yielded a 15% relative improvement
over the previous best published result. On ImageNet
with 22K categories, it achieved a 70% relative
improvement over the highest other result of which we
are aware (including unpublished results known to the
authors of (Weston et al., 2011)). Note that random guess
achieves less than 0.005% accuracy for this dataset.
7. Conclusion
In this work, we simulated high-level class-specific neurons
using unlabeled data. We achieved this by combining
ideas from recently developed algorithms to
learn invariances from unlabeled data. Our implementation
scales to a cluster with thousands of machines
thanks to model parallelism and asynchronous SGD.
Our work shows that it is possible to train neurons to
be selective for high-level concepts using entirely unlabeled
data. In our experiments, we obtained neurons
that function as detectors for faces, human bodies, and
cat faces by training on random frames of YouTube
videos. These neurons naturally capture complex invariances
such as out-of-plane and scale invariances.
The learned representations also work well for discriminative
tasks. Starting from these representations, we
obtain 15.8% accuracy for object recognition on ImageNet
with 20,000 categories, a significant leap of 70%
relative improvement over the state-of-the-art.
Acknowledgements: We thank Samy Bengio,
Adam Coates, Tom Dean, Jia Deng, Mark Mao, Peter
Norvig, Paul Tucker, Andrew Saxe, and Jon Shlens for
helpful discussions and suggestions.
References
Bengio, Y. and LeCun, Y. Scaling learning algorithms towards
AI. In Large-Scale Kernel Machines, 2007.
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H.
Greedy layerwise training of deep networks. In NIPS,
2007.
Berkes, P. and Wiskott, L. Slow feature analysis yields
a rich repertoire of complex cell properties. Journal of
Vision, 2005.
Ciresan, D. C., Meier, U., Gambardella, L. M., and
Schmidhuber, J. Deep big simple neural nets excel on
handwritten digit recognition. CoRR, 2010.
Coates, A., Lee, H., and Ng, A. Y. An analysis of singlelayer
networks in unsupervised feature learning. In AISTATS
14, 2011.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and FeiFei,
L. ImageNet: A Large-Scale Hierarchical Image
Database. In CVPR, 2009.
Deng, J., Berg, A., Li, K., and Fei-Fei, L. What does
classifying more than 10,000 image categories tell us?
In ECCV, 2010.
Desimone, R., Albright, T., Gross, C., and Bruce, C.
Stimulus-selective properties of inferior temporal neurons
in the macaque. The Journal of Neuroscience, 1984.
DiCarlo, J. J., Zoccolan, D., and Rust, N. C. How does
the brain solve visual object recognition? Neuron, 2012.
Erhan, D., Bengio, Y., Courville, A., and Vincent, P. Visualizing
higher-layer features of deep networks. Technical
report, University of Montreal, 2009.
Fukushima, K. and Miyake, S. Neocognitron: A new algorithm
for pattern recognition tolerant of deformations
and shifts in position. Pattern Recognition, 1982.
Gregor, K. and LeCun, Y. Emergence of complex-like cells
in a temporal product network with local receptive fields.
arXiv:1006.0448, 2010.
Hinton, G. E. and Salakhutdinov, R.R. Reducing the dimensionality
of data with neural networks. Science,
2006.
Hinton, G. E., Osindero, S., and Teh, Y. W. A fast learning
algorithm for deep belief nets. Neural Computation,
2006.
Huang, G. B., Ramesh, M., Berg, T., and Learned-Miller,
E. Labeled faces in the wild: A database for studying
face recognition in unconstrained environments. Technical
Report 07-49, University of Massachusetts, Amherst,
October 2007.
Hubel, D. H. and Wiesel, T. N. Receptive fields of single
neurons in the cat’s visual cortex. Journal of Physiology,
1959.
Hyvärinen, A., Hurri, J., and Hoyer, P. O. Natural Image
Statistics. Springer, 2009.
Jarrett, K., Kavukcuoglu, K., Ranzato, M.A., and LeCun,
Y. What is the best multi-stage architecture for object
recognition? In ICCV, 2009.
Keller, C., Enzweiler, M., and Gavrila, D. M. A new benchmark
for stereo-based pedestrian detection. In Proc. of
the IEEE Intelligent Vehicles Symposium, 2009.
Krizhevsky, A. Learning multiple layers of features from
tiny images. Technical report, University of Toronto,
2009.
Le, Q. V., Ngiam, J., Chen, Z., Chia, D., Koh, P. W., and
Ng, A. Y. Tiled convolutional neural networks. In NIPS,
2010.
Le, Q. V., Karpenko, A., Ngiam, J., and Ng, A. Y. ICA
with Reconstruction Cost for Efficient Overcomplete
Feature Learning. In NIPS, 2011a.
Le, Q.V., Ngiam, J., Coates, A., Lahiri, A., Prochnow,
B., and Ng, A.Y. On optimization methods for deep
learning. In ICML, 2011b.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based
learning applied to document recognition.
Proceedings of the IEEE, 1998.
Lee, H., Battle, A., Raina, R., and Ng, A. Y. Efficient
sparse coding algorithms. In NIPS, 2007.
Lee, H., Ekanadham, C., and Ng, A. Y. Sparse deep belief
net model for visual area V2. In NIPS, 2008.
Lee, H., Grosse, R., Ranganath, R., and Ng, A.Y. Convolutional
deep belief networks for scalable unsupervised
learning of hierarchical representations. In ICML, 2009.
Lyu, S. and Simoncelli, E. P. Nonlinear image representation
using divisive normalization. In CVPR, 2008.
Olshausen, B. and Field, D. Emergence of simple-cell receptive
field properties by learning a sparse code for natural
images. Nature, 1996.
Pakkenberg, B., Pelvig, D., Marner, L., Bundgaard, M. J.,
Gundersen, H. J. G., Nyengaard, J. R., and Regeur, L.
Aging and the human neocortex. Experimental Gerontology,
2003.
Pinto, N., Cox, D. D., and DiCarlo, J. J. Why is real-world
visual object recognition hard? PLoS Computational
Biology, 2008.
Quiroga, R. Q., Reddy, L., Kreiman, G., Koch, C., and
Fried, I. Invariant visual representation by single neurons
in the human brain. Nature, 2005.
Raina, R., Battle, A., Lee, H., Packer, B., and Ng, A.Y.
Self-taught learning: Transfer learning from unlabelled
data. In ICML, 2007.
Raina, R., Madhavan, A., and Ng, A. Y. Large-scale
deep unsupervised learning using graphics processors. In
ICML, 2009.
Ranzato, M., Huang, F. J, Boureau, Y., and LeCun, Y. Unsupervised
learning of invariant feature hierarchies with
applications to object recognition. In CVPR, 2007.
Riesenhuber, M. and Poggio, T. Hierarchical models of
object recognition in cortex. Nature Neuroscience, 1999.
Sanchez, J. and Perronnin, F. High-dimensional signature
compression for large-scale image-classification. In
CVPR, 2011.
Sermanet, P. and LeCun, Y. Traffic sign recognition with
multiscale convolutional neural networks. In IJCNN,
2011.
Weston, J., Bengio, S., and Usunier, N. Wsabie: Scaling up
to large vocabulary image annotation. In IJCAI, 2011.
Zhang, W., Sun, J., and Tang, X. Cat head detection -
how to effectively exploit shape and texture features. In
ECCV, 2008.
A. Training and test images
A subset of training images is shown in Figure 7. As
can be seen, the positions, scales, orientations of faces
in the dataset are diverse. A subset of test images for
identifying the face neuron is shown in Figure 8.
Figure 7. Thirty randomly-selected training images (shown
before the whitening step).
Figure 8. Some example test set images (shown before the
whitening step).
B. Models
Central to our approach in this paper is the use of
locally-connected networks. In these networks, neurons
only connect to a local region of the layer below.
In Figure 9, we show the connectivity patterns of the
neural network architecture described in the paper.
The actual images in the experiments are 2D, but for
simplicity, our images in the visualization are in 1D.
Figure 9. Diagram of the network we used with more detailed
connectivity patterns. Color arrows mean that
weights only connect to only one map. Dark arrows mean
that weights connect to all maps. Pooling neurons only
connect to one map whereas simple neurons and LCN neurons
connect to all maps.
C. Model Parallelism
We use model parallelism to distribute the storage of
parameters and gradient computations to different machines.
In Figure 10, we show how the weights are
divided and stored in different “partitions,” or more
simply, machines (see also (Krizhevsky, 2009)).
D. Further multicore parallelism
Machines in our cluster have many cores which allow
further parallelism. Hence, we split these cores to perform
different tasks. In our implementation, the cores
are divided into three groups: reading data, sending
(or writing) data, and performing arithmetic computations.
At every time instance, these groups work in
parallel to load data, compute numerical results, and
send data over the network or write it to disk.
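The division of labor can be illustrated with the following Python sketch, in which three groups (reader, computation, writer) communicate through queues; this is a toy single-machine illustration, not the actual implementation, which assigns physical CPU cores to each group:
import queue, threading

def reader(in_q, batches):
    for b in batches:                          # group 1: load data
        in_q.put(b)
    in_q.put(None)                             # signal end of stream

def computer(in_q, out_q):
    while (b := in_q.get()) is not None:       # group 2: arithmetic computation
        out_q.put(sum(b))                      # stand-in for the real numerics
    out_q.put(None)

def writer(out_q, results):
    while (r := out_q.get()) is not None:      # group 3: send/write results
        results.append(r)

in_q, out_q, results = queue.Queue(8), queue.Queue(8), []
threads = [threading.Thread(target=reader, args=(in_q, [[1, 2], [3, 4], [5]])),
           threading.Thread(target=computer, args=(in_q, out_q)),
           threading.Thread(target=writer, args=(out_q, results))]
for t in threads: t.start()
for t in threads: t.join()
print(results)                                 # [3, 7, 5]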
E. Parameter sensitivity
The hyper-parameters of the network are chosen to
fit computational constraints and optimize the training
time of our algorithm. These parameters can be
changed at the expense of longer training time or more
computational resources.
Figure 10. Model parallelism with the network architecture
in use. Here, it can be seen that the weights are divided according
to the locality of the image and stored on different
machines. Concretely, the weights that connect to the left
side of the image are stored in machine 1 (“partition 1”).
The weights that connect to the central part of the image
are stored in machine 2 (“partition 2”). The weights that
connect to the right side of the image are stored in machine
3 (“partition 3”).
For instance, one could increase
the size of the receptive fields at the expense of
using more memory, more computation, and more network
bandwidth per machine; or one could increase the
number of maps at the expense of using more machines
and more memory.
These hyper-parameters also could affect the performance
of the features. We performed control experiments
to understand the effects of the two hyperparameters:
the size of the receptive fields and the
number of maps. By varying each of these parameters
and observing the test set accuracies, we can gain
an understanding of how much they affect the performance
on the face recognition task. Results, shown
in Figure 11, confirm that the results are only slightly
sensitive to changes in these control parameters.
Figure 11. Left: effects of receptive field sizes on the test
set accuracy. Right: effects of number of maps on the test
set accuracy.
F. Example out-of-plane rotated face
sequence
In Figure 12, we show an example sequence of 3D
(out-of-plane) rotated faces. Note that the faces
are black and white but treated as a color picture
in the test. More details are available at the
webpage for The Sheffield Face Database dataset –
http://www.sheffield.ac.uk/eee/research/
iel/research/face
Figure 12. A sequence of 3D (out-of-plane) rotated face of
one individual. The dataset consists of 10 sequences.
G. Best linear filters
In the paper, we performed control experiments to
compare our features against “best linear filters.”
This baseline works as follows. The first step is to sample
100,000 random patches (or filters) from the training
set (each patch has the size of a test set image).
Then for each patch, we compute the cosine distances
between it and the test set images. The cosine distances
are treated as the feature values. Using these
feature values, we then search among 20 thresholds to
find the best accuracy of a patch in classifying faces
against distractors. Each patch gives one accuracy for
our test set.
The reported accuracy is the best accuracy among
100,000 patches randomly-selected from the training
set.
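A NumPy sketch of this baseline is given below; the flattened (n, d) image arrays, boolean labels, and function name are assumptions made for illustration:
import numpy as np

def best_linear_filter_accuracy(train_patches, test_images, is_face,
                                n_filters=100_000, n_thresholds=20, seed=0):
    # Sample random training patches as filters, use cosine similarity to each
    # filter as a feature, sweep 20 thresholds per filter, and report the best
    # accuracy found over all sampled filters.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(train_patches),
                     size=min(n_filters, len(train_patches)), replace=False)
    norm = lambda m: m / (np.linalg.norm(m, axis=1, keepdims=True) + 1e-12)
    scores = norm(train_patches[idx]) @ norm(test_images).T   # cosine similarities
    best = 0.0
    for s in scores:                                          # one filter at a time
        for t in np.linspace(s.min(), s.max(), n_thresholds + 2)[1:-1]:
            best = max(best, np.mean((s > t) == is_face))
    return best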
H. Histograms on the entire test set
Here, we also show the detailed histograms for the neurons
on the entire test sets.
The fact that the histograms are distinctive for positive
and negative images suggests that the network
has learned the concept detectors.
Figure 13. Histograms of neuron’s activation values for the
best face neuron on the test set. Red: the histogram for
face images. Blue: the histogram for random distractors.
Figure 14. Histograms for the best human body neuron on
the test set. Red: the histogram for human body images.
Blue: the histogram for random distractors.
I. Most responsive stimuli for cats and
human bodies
In Figure 16, we show the most responsive stimuli for
cat and human body neurons on the test sets. Note
that, the top stimuli for the human body neuron are
black and white images because the test set images are
black and white (Keller et al., 2009).
J. Implementation details for
autoencoders and K-means
In our implementation, deep autoencoders are also locally
connected and use a sigmoidal activation function.
For K-means, we downsample images to 40x40 in order
to lower computational costs. We also varied the
parameters of the autoencoders and K-means and chose them
to maximize performance given resource constraints.
In our experiments, we used 30,000 centroids for K-means.
These models also employed parallelism in a
similar fashion described in the paper. They also used
1,000 machines for three days.
Figure 15. Histograms for the best cat neuron on the test
set. Red: the histogram for cat images. Blue: the histogram
for random distractors.
Figure 16. Top: most responsive stimuli on the test set for
the cat neuron. Bottom: Most responsive human body
stimuli on the test set for the human body neuron.
On-Demand Language Model Interpolation for Mobile Speech Input
Brandon Ballinger1, Cyril Allauzen2, Alexander Gruenstein1, Johan Schalkwyk2
1Google, 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA
2Google, 76 Ninth Avenue, New York, NY 10011, USA
brandonb@google.com, allauzen@google.com, alexgru@google.com, johans@google.com
Abstract
Google offers several speech features on the Android mobile
operating system: search by voice, voice input to any text field,
and an API for application developers. As a result, our speech
recognition service must support a wide range of usage scenarios
and speaking styles: relatively short search queries, addresses,
business names, dictated SMS and e-mail messages,
and a long tail of spoken input to any of the applications users
may install. We present a method of on-demand language
model interpolation in which contextual information about each
utterance determines interpolation weights among a number of
n-gram language models. On-demand interpolation results in
an 11.2% relative reduction in WER compared to using a single
language model to handle all traffic.
Index Terms: language modeling, interpolation, mobile
1. Introduction
Entering text on mobile devices is often slow and error-prone
in comparison to typing on a full-sized keyboard. Google offers
several features on Android aimed at making speech a
viable alternative input method: search by voice, voice input
into any text field, and a speech API for application developers.
To search by voice, users simply tap a microphone icon on
the desktop search box, or hold down the physical search button.
They can speak any query, and are then shown the Google
search results. To use the Voice Input feature, users tap the
microphone key on the on-screen keyboard, and then speak to
enter text virtually anywhere they would normally type. Users
may dictate e-mail and SMS messages, fill in forms on web
pages, or enter text into any application. Finally, the Android
Speech API is a simple way for developers to integrate speech
recognition capabilities into their own applications.
While a large portion of usage of the speech recognition
service is comprised of spoken queries and dictation of SMS
messages, there is a long tail of usage from thousands of other
applications. Due to this diversity, choosing an appropriate language
model for each utterance (recorded audio) is challenging.
Two viable options are to build a single language model to handle
all traffic, or to train a language model appropriate to each
major use case and then choose the “best” one for each utterance,
depending on the context of that utterance.
We develop and compare a third option in this paper, in
which a development set of utterances from each context is
used to optimize interpolation weights among a small number
of component language models. Since there may be thousands
of such “contexts”, the language models are interpolated ondemand,
either during decoding or as a post-processing rescoring
phase. On-demand interpolation is performed efficiently via
the use of a “compact interpolated” finite state transducer (FST),
in which transition weights are dynamically computed.
Feature            Percent of utterances
Voice input        49%
Search by Voice    44%
Speech API         7%
Table 1: Breakdown of speech traffic on Android devices that
support Voice Input, Search by Voice, and Speech API.
2. Related Work
The technique of creating interpolated language models for different
contexts has been used with success in a number of conversational
interfaces [1, 2, 3]. In this case, the pertinent context
is the system’s “dialogue state”, and it’s typical to group
transcribed utterances by dialogue state and build one language
model per state. Typically, states with little data are merged, and
the state-specific language models are interpolated, or otherwise
merged. Language models corresponding to multiple states may
also be interpolated, to share information across similar states.
The technique we develop here differs in two key respects.
First, we derive interpolation weights for thousands of recognition
contexts, rather than a handful of dialogue states. This
makes it impractical to create each interpolated language model
offline and swap in the desired one at runtime. Our language
models are large, and we only learn the recognition context for
a particular utterance when the audio starts to arrive. Second,
rather than relying on transcribed utterances from each recognition
context to train state-specific language models, we instead
interpolate a small number of language models trained from
large corpora.
3. Android Speech Usage Analysis
The challenge of supporting a variety of use cases is illustrated
by examining the usage of the speech features available on
Android. Table 1 breaks down the portion of utterances from
the Android platform associated with the three speech features:
voice input, search by voice, and the speech API. We note that
this distinction isn’t perfect, as some users might, for example,
speak a search query into a text box in the browser using
the voice input feature. In addition, a large majority of the
speech API utterances come from built-in Google applications –
Google Maps provides a popular voice-enabled search box, for
example. Overall, we observe roughly an even split between
searching and dictation.
The voice input feature encourages a wide range of usage.
Since its launch in January, 2010, users have dictated text into
over 8,000 distinct text fields. Table 2 shows the 10 most popular
text fields. SMS is extremely popular, with usage levels an
order of magnitude greater than any other application. Moreover,
among the top 10 fields, 4 of them come from either the
built-in SMS application, or one of the many SMS applications
available on the Android Market.
Text Field                               Usage
SMS - Compose 63.1%
An SMS app from Market - Compose 4.9%
Browser 4.8%
Google Talk 4.5%
Gmail - Compose 3.3%
Android Market - Search 2.4%
Email - Compose 1.8%
SMS - To 1.3%
Maps - Directions Endpoint 1.0%
An SMS app from Market - Compose 1.0%
Table 2: The 10 most popular voice input text fields and their
percent usage.
Figure 1: Cumulative usage for the most popular 100 text fields,
rank ordered by usage (x-axis: number of fields, sorted by usage;
y-axis: cumulative percent of utterances).
Also popular are other
dictation-style applications: Gmail, Email, and Google Talk.
Android Market and Maps, both of which also appear in the top
10, represent different kinds of utterances – search queries. Finally,
the Browser category here actually encompasses a wide
range of fields – any text field on any web page.
Figure 1 shows the cumulative usage per text field of the
100 most popular text fields, rank ordered by usage. Although
the usage is certainly concentrated among a handful of applications,
there remains a significant tail. While increasing accuracy
for the tail may not have a huge effect on the overall accuracy
of the system, it’s important for users to have a seamless experience
using voice input: users will have a difficult time discerning
that voice input may work better in some text fields than
others.
4. Compact Interpolated FST
In this setting, we have a relatively small set of language models
that is fixed and known in advance. At recognition time,
each utterance comes with a custom set of interpolation (or mixture)
weights and we need to be able to efficiently compute ondemand
the corresponding interpolated model.
In a backoff language model, the conditional probability of
w ∈ Σ given context h ∈ Σ∗ is recursively defined as
$$P(w \mid h) = \begin{cases} \bar{P}(w \mid h) & \text{if } hw \in S, \\ \alpha_h\, P(w \mid h') & \text{otherwise,} \end{cases}$$
where $\bar{P}$ is the adjusted maximum likelihood probability (derived
from the training corpus), S is the skeleton of the model,
αh is the backoff weight for the context h, and h′ is the longest
proper suffix of h. The order of the model is $\max_{hw \in S} |hw|$.
Such a language model can naturally be represented by
a weighted automaton over the real semiring (R, +, ×, 0, 1)
using failure transitions [4]: the set of states is Q =
{h ∈ Σ∗ | ∃w ∈ Σ such that hw ∈ S}; for each state h, there
is a failure transition from h to h′ labeled by φ and with weight
αh, and for each hw ∈ S, there is a transition from h to the
longest suffix of hw that belongs to Q, labeled by w and with
weight P(w | h).
Figure 2: Outgoing transitions from state x in (a) G1, (b) G2
and (c) I. For λ = (.6, .4)^T, P_Iλ(a | x) = .6 × .5 + .4 × .24.
Given a set G = {G1,...,Gm} of m backoff language
models and a vector of mixture weights λ = (λ1, . . . , λm)^T,
the linear interpolation of G by λ is defined as the language
model Iλ assigning the conditional probability:
$$P_{I_\lambda}(w \mid h) = \sum_{i=1}^{m} \lambda_i P_{G_i}(w \mid h). \qquad (1)$$
Using (1) directly to perform on-demand interpolation would
be inefficient because for a given pair (w, h) we might need to
backoff several times in several of the models and this can become
rather expensive when using the automata representation.
Instead, we chose to reformulate the interpolated model as a
backoff model:
$$P_{I_\lambda}(w \mid h) = \begin{cases} \lambda^T p_{hw} & \text{if } hw \in S(G), \\ f(\lambda, \alpha_h)\, P_{I_\lambda}(w \mid h') & \text{otherwise,} \end{cases}$$
where $p_{hw} = (P_{G_1}(w \mid h), \ldots, P_{G_m}(w \mid h))^T$, $S(G) =
\cup_{i=1}^{m} S(G_i)$ and $\alpha_h = (\alpha_h(G_1), \ldots, \alpha_h(G_m))^T$. There exists
a closed-form expression for f(λ, α) that ensures the proper
normalization of the model. However, in practice we decided
to approximate it by the dot product of λ and αh: f(λ, αh) =
λ^T αh.
The benefit of this formulation is that it perfectly fits our
requirement. Since the set of models is known in advance we
can precompute S(G) and all the relevant vectors (phw and αh)
effectively building a generic interpolated model I as a model
over Rm. Given a new utterance and a corresponding vector
of mixture weights λ, we can obtain the relevant interpolated
model Iλ by taking the dot product of each component vector
of I with λ.
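The following Python sketch illustrates this on-demand interpolation, with dictionaries standing in for the weighted automaton; the data structures and function name are illustrative assumptions, not the production FST implementation:
def interp_prob(word, history, lam, p_vec, alpha_vec):
    # P_I_lambda(word | history) for the interpolated backoff model.
    # p_vec[(history, word)] holds (P_G1(w|h), ..., P_Gm(w|h)) on the union
    # skeleton S(G); alpha_vec[history] holds the per-model backoff weights.
    if (history, word) in p_vec:
        p = p_vec[(history, word)]
        return sum(l * pi for l, pi in zip(lam, p))          # lambda^T p_hw
    if not history:                                          # empty context: give up
        return 0.0
    alpha = alpha_vec.get(history, (1.0,) * len(lam))
    backoff = sum(l * a for l, a in zip(lam, alpha))         # approx f(lam, alpha_h)
    return backoff * interp_prob(word, history[1:], lam, p_vec, alpha_vec)

# Tiny example mirroring Figure 2: two models, context x, lambda = (.6, .4).
p_vec = {(("x",), "a"): (0.5, 0.24)}
alpha_vec = {("x",): (0.5, 0.4)}
print(interp_prob("a", ("x",), (0.6, 0.4), p_vec, alpha_vec))  # 0.6*0.5 + 0.4*0.24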
Moreover, this approach also allows for an efficient
representation of I as a weighted automaton over the
semiring (Rm, +, ◦, 0, 1) (◦ denotes componentwise multiplication),
the weight of each transition in the automaton
being a vector in Rm. The set of states is Q =
{h ∈ Σ∗ | ∃w ∈ Σ such that hw ∈ S(G)}. For each state h,
there is a failure transition from h to h′ labeled by φ and with
weight αh, and for each hw ∈ S(G), there is a transition from
h to the longest suffix of hw that belongs to Q, labeled by w
and with weight phw. Figure 2 illustrates this construction.
Given a new utterance and a corresponding vector of mixture
weights λ, this automaton can be converted on-demand into
a weighted automaton over the real semiring by taking the dot
product of λ and the weight vector of each visited transition.
ReFr: An Open-Source Reranker Framework
Daniel M. Bikel, Keith B. Hall
Google Research, New York, NY
{dbikel,kbhall}@google.com
Abstract
ReFr (http://refr.googlecode.com) is a software
architecture for specifying, training and using reranking models,
which take the n-best output of some existing system and
produce new scores for each of the n hypotheses that potentially
induce a different ranking, ideally yielding better results than
the original system. The Reranker Framework has some special
support for building discriminative language models, but can be
applied to any reranking problem. The framework is designed
with parallelism and scalability in mind, being able to run on
any Hadoop cluster out of the box. While extremely efficient,
ReFr is also quite flexible, allowing researchers to explore a
wide variety of features and learning methods. ReFr has been
used for building state-of-the-art discriminative LM’s for both
speech recognition and machine translation systems.
Index Terms: language modeling, discriminative language
modeling, reranking, structured prediction
1. Introduction
Creating effective software tools for research is a tricky business.
The classic tension between flexibility and efficiency
arises with greater urgency. We want researchers to be able to
try out many different ideas easily, but we also want them to be
able to have a quick code-test-evaluate cycle.
ReFr grew out of the 2011 Johns Hopkins Summer Workshop,
from the team using automatically generated confusions
to synthesize training data for discriminative language models
for speech and machine translation, led by Prof. Brian Roark
of OHSU. That approach required tools that would scale up to
training data sizes orders of magnitude larger than had previously
been used to build discriminative language models, so
we not only needed our training and inference to be inherently
fast, but we needed to design tools with distributed computing
in mind from the outset.
This paper describes the tools we have developed to solve
not only the immediate research problem of exploring confusions
for discriminative language modeling, but also the more
general problem of reranking approaches to speech and language
processing, including structured prediction. We designed
ReFr to have the following properties:
• “library quality” code
• industrial strength
• academic flexibility
• easy exploration of different types of features, different
update methods (e.g., MIRA-style, direct loss minimization,
loss-sensitive) and different learning methods (e.g.,
perceptron-style, log-linear, kernel methods)
• modern, object-oriented design, complete with dynamic
factories and dynamic composition for flexibility
• parallelizable, especially for distributed-computing environments
2. Data Format for I/O
There are two main choices when building discriminative
reranking models for speech or machine translation: (a) rescore
a lattice or hypergraph or (b) simply use a strict reranking approach
applied to n-best lists. For ReFr, early on we decided
to use (b) reranking n-best lists. The primary reasons were the
flexibility this would allow us in designing features and tools.
N-best lists readily allow for sentence-level features in a way
that, say, lattices do not. Additionally, it is far easier to de-
fine generic schemes of passing around n-best lists than it is for
designing schemes to take speech lattices as well as machine
translation hypergraphs or other, problem-specific data types.
ReFr is meant to be flexible enough to allow for a variety of
data sources. In order to avoid the need for overly complex data
formats, we have chosen to adopt a formalism which allows
one to augment the input format, allowing for flexible feature
extraction and data manipulation/analysis. We opted to use a
data format which mirrors the data-structures that are used internally
for training. The Google protocol buffers[1] provide a
programming-language independent specification framework to
define data formats. The protocol buffers specification language
is used by the protocol buffer tools to generate source-code for
serializing and deserializing the data stored in the format. Code
is generated to allow for native programming-language encapsulation
of the data. For example, in C++ each item of data is
stored in an object based on an object-oriented data specification
(a C++ class), allowing for access to the data.1
3. Core learning framework
Consider Algorithm 1, which describes the training procedure
for a generic online-learning algorithm. Each training example
ei comprises a set of candidate hypotheses, each of which is
projected via some function Φ into a feature space, $\mathbb{R}^F$. We
typically think of Φ as being a suite of feature functions, one
per dimension. The model itself is defined as a weight vector
in this space, w. Decoding, or inference, is carried out simply
by taking the dot product of the model and a test instance.
More generally, any kernel function K may be used. The training
procedure iterates over the training data T—each iteration
is called an epoch—until the NEEDTOKEEPTRAINING() predicate
returns false. Often, such a predicate is based on the
average loss of the current model on some held-out development
data D, which is the purpose of the EVALUATE(D) line
in the TRAIN(T) procedure.
1For the 2011 Johns Hopkins Workshop, we were targeting multiple
tasks (ASR and MT), and so our toolkit provides a means to convert
from two types of text-based n-best formats, one the output of an ASR
system, the other the output of an MT system. These conversion tools
are not only useful in their own right, but serve as example implementations
for any developer converting from their own, proprietary format
to the Google Protocol Buffer format used by ReFr.
Algorithm 1 Training algorithm for online-learning reranking
models.
Let ei = {c1, . . . , ck} be a training example, where each cj is a candidate
hypothesis.
Similarly, let di = {c1, . . . , ck} be a held-out development data example, also
consisting of k candidate hypotheses.
Finally, let K be a kernel function.
procedure TRAIN(T = {e1, . . . , en}, D = {d1, . . . , dm})
while NEEDTOKEEPTRAINING() do
TRAINONEEPOCH(T)
EVALUATE(D)
end while
end procedure
procedure TRAINONEEPOCH(T)
foreach training example ei do
SCORECANDIDATES(ei)
if NEEDTOUPDATE() then
UPDATE()
end if
end for
end procedure
procedure SCORECANDIDATES(ei)
foreach candidate hypothesis cj ∈ ei do
cj .score ← K(wt, cj )
end for
end procedure
Figure 1: A pictorial view of how a Model wraps instances of
other interfaces (Candidate Scorer, Update Predicate, Updater, ...)
that specify the predicates and functions needed to carry out
model training.
For the basic perceptron, the model starts out at time step 0
as the zero vector; that is, $w_0 = \vec{0}$. The update is
$$w_{t+1} = w_t + R_t \left[ \Phi(y_{\text{oracle}}(e_i)) - \Phi(\hat{y}(e_i)) \right], \qquad (1)$$
where $y_{\text{oracle}}$ is a function that picks out the hypothesis towards
which we want to bias our model, $\hat{y}$ is a function that picks out
the candidate hypothesis we want to bias our model against, and
$R_t$ is a learning rate or step size. Most often, $y_{\text{oracle}}$ is defined
to pick the hypothesis with the lowest loss relative to some gold-standard
truth, and $\hat{y}$ is defined to pick the candidate hypothesis
that scores highest under the current model $w_t$.
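For concreteness, a minimal Python sketch of this update and of decoding by dot product, assuming dense numpy feature vectors; the function names and the fixed learning rate are illustrative and not ReFr's actual API:

import numpy as np

def score(w, phi):
    """Decoding/inference: dot product of the model and a candidate's features."""
    return float(np.dot(w, phi))

def perceptron_update(w, phi_oracle, phi_best, learning_rate=1.0):
    """One step of Eq. (1): w_{t+1} = w_t + R_t [Phi(y_oracle(e_i)) - Phi(y_hat(e_i))]."""
    return w + learning_rate * (phi_oracle - phi_best)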
Most of the variations of this basic learning method involve
finding different ways of defining Rt, Φ, yoracle and yˆ, along
with the various procedures and predicates shown in Algorithm
1. Therefore, we would like our Reranker Framework to make
it easy for the researcher to define these various functions, as
well as to specify which ones to use at run-time.
ReFr defines a Model interface with virtual methods for all
of the functions shown in Algorithm 1. To avoid the exponential
blow-up of overriding different combinations of these methods,
ReFr also employs dynamic composition. That is, we keep the
idea of a Model interface, but additionally have each Model
instance wrap a set of predicate/manipulator objects, each of
which itself conforms to an interface. Figure 1 shows a pictorial
representation of this scheme.
As we discussed above, we employ dynamic composition
to avoid defining a new subclass of Model every time we wish
to explore a new combination of learning method functions. To
do this, ReFr includes a very lightweight and yet powerful interpreter
for a language that allows for assignment statements for
primitives, vectors of primitives, Factory-constructible objects
and vectors of Factory-constructible objects. Figure 2 shows
an example ReFr configuration file. The syntax is intentionally
very similar to that of C++. This lightweight language provides
a flexible mechanism by which to specify how feature extraction,
training and inference shall occur.

model_file = "my model file";  // model output file
model =
    PerceptronModel(
        name("my model"),
        score_comparator(DirectLossScoreComparator()));
exec_feature_extractor =
    ExecutiveFeatureExtractorImpl(
        feature_extractors({NgramFeatureExtractor(n(2)),
                            RankFeatureExtractor()}));
training_efe = exec_feature_extractor;
dev_efe = exec_feature_extractor;
training_files = {"training1.gz", "training2.gz"};
devtest_files = {"dev1.gz", "dev2.gz"};

Figure 2: An example ReFr configuration file, read by its
Interpreter class.
4. Cluster-based distributed training
As Algorithm 1 shows, the basic perceptron algorithm involves
“online” updating, and thus it is possible to read in each training
example from file each time it is needed, only keeping
the model’s parameters persistently in memory. The Reranker
Framework allows both the memory-intensive way of training
as well as this “streaming mode” version of training, essential
for distributed learning.
The structured perceptron [2] and its variants have proven
to be effective in supervised, discriminative language modeling
work [3]. We have centered the development of our open-source
discriminative learning toolkit around perceptron-style
algorithms, which are, by definition, online learning algorithms.
Identifying the optimal solution for a distributed online optimization
algorithm is still an open research question. We borrow
from our previous work on distributed perceptron training
in [4, 5] and use the Iterative Parameter Mixtures algorithm
for distributed computation. The Reranker Framework makes
it easy to switch between single processor and distributed training,
which uses the Hadoop implementation of MapReduce [6].
5. Demo Plan
Our demo will consist of a walk-through of all ReFr’s features,
followed by a hands-on demonstration of how easy it is to implement
a new class of features for the reranker based on the
rank of each candidate hypothesis. We will also show how easy
it is to integrate that new class of features into training and inference.
We will then demonstrate the ease with which one can
use the API and the interpreted configuration language to alter
the training algorithm. Finally, we will demonstrate the simple
way that a user can switch from single processor training to
large-scale distributed training.
6. Acknowledgements
The authors would like to thank Prof. Brian Roark of Oregon Health and Science
University for leading a fantastic team at the 2011 Johns Hopkins Workshop, and
we would also like to thank all of our teammates, especially Prof. Izhak Shafran
of OHSU and Ph.D. candidate Maider Lehr, who are actively working with and
helping us improve ReFr.
7. References
[1] Google, “Protocol buffers,”
http://code.google.com/apis/protocolbuffers/.
[2] M. Collins, “Discriminative training methods for hidden Markov
models: Theory and experiments with perceptron algorithms,” in
Proc. EMNLP, 2002, pp. 1–8.
[3] B. Roark, M. Saraçlar, and M. Collins, “Discriminative n-gram
language modeling,” Computer Speech and Language, vol. 21,
no. 2, pp. 373–392, 2007. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0885230806000271
[4] R. McDonald, K. Hall, and G. Mann, “Distributed training strategies
for the structured perceptron,” in HLT-NAACL, 2010.
[5] K. Hall, S. Gilpin, and G. Mann, “Mapreduce/bigtable for distributed
optimization,” in NIPS Workshop on Leaning on Cores,
Clusters, and Clouds, 2010.
[6] J. Dean and S. Ghemawat, “Mapreduce: simplified data processing
on large clusters,” CACM, vol. 51:1, 2008.
Accurate and Compact Large Vocabulary Speech Recognition
on Mobile Devices
Xin Lei1 Andrew Senior2 Alexander Gruenstein1
Jeffrey Sorensen2
1Google Inc., Mountain View, CA USA
2Google Inc., New York, NY USA
{xinlei,andrewsenior,alexgru,sorenj}@google.com
Abstract
In this paper we describe the development of an accurate, small-footprint,
large vocabulary speech recognizer for mobile devices.
To achieve the best recognition accuracy, state-of-the-art
deep neural networks (DNNs) are adopted as acoustic models.
A variety of speedup techniques for DNN score computation
are used to enable real-time operation on mobile devices. To
reduce the memory and disk usage, on-the-fly language model
(LM) rescoring is performed with a compressed n-gram LM.
We were able to build an accurate and compact system that runs
well below real-time on a Nexus 4 Android phone.
Index Terms: Deep neural networks, embedded speech recognition,
SIMD, LM compression.
1. Introduction
Smartphones and tablets are rapidly overtaking desktop and laptop
computers as people’s primary computing device. They are
heavily used to access the web, read and write messages, interact
on social networks, etc. This popularity comes despite the
fact that it is significantly more difficult to input text on these
devices, predominantly by using an on-screen keyboard.
Automatic speech recognition (ASR) is a natural, and increasingly
popular, alternative to typing on mobile devices.
Google offers the ability to search by voice [1] on Android,
iOS, and Chrome; Apple’s iOS devices come with Siri, a conversational
assistant. On both Android and iOS devices, users
can also speak to fill in any text field where they can type (see,
e.g., [2]), a capability heavily used to dictate SMS messages and
e-mail.
A major limitation of these products is that speech recognition
is performed on a server. Mobile network connections are
often slow or intermittent, and sometimes non-existent. Therefore,
in this study, we investigate techniques to build an accurate,
small-footprint speech recognition system that can run in
real-time on modern mobile devices.
Previously, speech recognition on handheld computers
and smartphones has been studied in the DARPA sponsored
Transtac Program, where speech-to-speech translation systems
were developed on the phone [3, 4, 5]. In the Transtac systems,
Gaussian mixture models (GMMs) were used as acoustic
models. While the task was a small domain with limited
training data, the memory usage in the resulting systems was
moderately high.
In this paper, we focus on large vocabulary on-device dictation.
We show that deep neural networks (DNNs) can provide
large accuracy improvements over GMM acoustic models,
with a significantly smaller footprint. We also demonstrate how
memory usage can be significantly reduced by performing on-the-fly
rescoring with a compressed language model during decoding.
The rest of this paper is organized as follows. In Section 2,
the embedded GMM acoustic model is described. Section 3
presents the training of embedded DNNs, and the techniques we
employed to speed up DNN inference at runtime. Section 4 describes
the compressed language models for on-the-fly rescoring.
Section 5 shows the experimental results of recognition
accuracy and speed on the Nexus 4 platform. Finally, Section 6
concludes the paper and discusses future work.
2. GMM Acoustic Model
Our embedded GMM acoustic model is trained on 4.2M utterances,
or more than 3,000 hours of speech data containing randomly
sampled anonymized voice search queries and other dictation
requests on mobile devices. The acoustic features are
9 contiguous frames of 13-dimensional PLP features spliced
and projected to 40 dimensions by linear discriminant analysis
(LDA). Semi-tied covariances [6] are used to further diagonalize
the LDA transformed features. Boosted-MMI [7] was used
to train the model discriminatively.
The GMM acoustic model contains 1.3k clustered acoustic
states, with a total of 100k Gaussians. To reduce model size
and speed up computation on embedded platforms, the floating-point
GMM model is converted to a fixed-point representation,
similar to that described in [8]. Each dimension of the Gaussian
mean vector is quantized to 8 bits, and each dimension of the
precision vector to 16 bits. The resulting fixed-point GMM model size is about 1/3
of the floating-point model, and there is no loss of accuracy due
to this conversion in our empirical testing.
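As an illustration, a minimal Python sketch of such a per-dimension fixed-point conversion; the uniform, max-absolute-value scaling is an assumption, since the paper only specifies the bit widths and refers to [8] for the exact scheme:

import numpy as np

def quantize_linear(x, num_bits):
    """Uniformly quantize a float vector to signed integers of the given width,
    using a per-vector scale derived from the max absolute value (an assumption)."""
    qmax = 2 ** (num_bits - 1) - 1
    peak = float(np.max(np.abs(x)))
    scale = peak / qmax if peak > 0 else 1.0
    return np.round(x / scale).astype(np.int32), scale

mean = np.random.randn(40)                      # one 40-dimensional Gaussian mean
precision = np.abs(np.random.randn(40)) + 0.1   # one 40-dimensional precision vector
q_mean, mean_scale = quantize_linear(mean, 8)        # 8 bits per mean dimension
q_prec, prec_scale = quantize_linear(precision, 16)  # 16 bits per precision dimension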
3. DNNs for Embedded Recognition
We have previously described the use of deep neural networks
for probability estimation in our cloud-based mobile voice
recognition system [9]. We have adopted this system for developing
DNN models for embedded recognition, and summarize
it here.
The model is a standard feed-forward neural network with
k hidden layers of nh nodes, each computing a nonlinear function
of the weighted sum of the outputs of the previous layer.
The input layer is the concatenation of ni consecutive frames
of 40-dimensional log filterbank energies calculated on 25ms
windows of speech every 10ms. The no softmax outputs estimate
the posterior of each acoustic state. We have experimented
with conventional logistic nonlinearities and rectified
linear units that have recently shown superior performance in
our large scale task [10], while also reducing computation.
While our server-based model has 50M parameters (k = 4,
nh = 2560, ni = 26 and no = 7969), to reduce the memory
and computation requirement for the embedded model, we experimented
with a variety of sizes and chose k = 6, nh = 512,
ni = 16 and no = 2000, or 2.7M parameters. The input window
is asymmetric; each additional frame of future context adds
10ms of latency to the system so we limit ourselves to 5 future
frames, and choose around 10 frames of past context, trading
off accuracy and computation.
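A quick sanity check of the quoted embedded model sizes, under the assumption of fully connected layers with one bias per unit (the helper name is ours, not part of the described system):

def dnn_param_count(feat_dim, input_frames, hidden_layers, hidden_nodes, num_outputs):
    """Weights plus biases for a plain feed-forward net with a stacked-frame input."""
    d_in = feat_dim * input_frames
    total = d_in * hidden_nodes + hidden_nodes                            # input -> first hidden
    total += (hidden_layers - 1) * (hidden_nodes * hidden_nodes + hidden_nodes)
    total += hidden_nodes * num_outputs + num_outputs                     # last hidden -> softmax
    return total

print(dnn_param_count(40, 16, 6, 512, 2000))  # 2667472, i.e. ~2.7M as quoted
print(dnn_param_count(40, 16, 4, 480, 1000))  # 1481320, i.e. ~1.5M (the DNN 4x480 row in Table 1)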
Our context dependency (CD) phone trees were initially
constructed using a GMM training system that gave 14,247
states. By pruning this system using likelihood gain thresholds,
we can choose an arbitrary number of CD states. We used
an earlier large scale model with the full state inventory that
achieved around 14% WER to align the training data, then map
the 14k states to the desired smaller inventory. Thus we use a
better model to label the training data to an accuracy that cannot
be achieved with the embedded scale model.
3.1. Training
Training uses conventional backpropagation of gradients from a
cross entropy error criterion. We use minibatches of 200 frames
with an exponentially decaying learning rate and a momentum
of 0.9. We train our neural networks on a dedicated GPU based
system. With all of the data available locally on this system, the
neural network trainer can choose minibatches and calculate the
backpropagation updates.
3.2. Decoding speedup
Mobile CPUs are designed primarily for lower power usage and
do not have as many or as powerful math units as CPUs used
in server or desktop applications. This makes DNN inference,
which is mathematically computationally expensive, a particular
challenge. We exploit a number of techniques to speed up
the DNN score computation on these platforms.
As described in [11], we use a fixed-point representation
of DNNs. All activations and intermediate layer weights are
quantized into 8-bit signed char, and biases are encoded as
32-bit int. The input layer remains floating-point, to better accommodate
the larger dynamic ranges of input features. There
is no measured accuracy loss resulting from this conversion to
fixed-point format.
Single Instruction Multiple Data (SIMD) instructions are
used to speed up the DNN computation. With our choice of
smaller-sized fixed-point integer units, the SIMD acceleration is
significantly more efficient, exploiting up to 8 way parallelism
in each computation. We use a combination of inline assembly
to speed up the most expensive matrix multiplication functions,
and compiler intrinsics in the sigmoid and rectified linear calculations.
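A scalar Python reference for the fixed-point layer computation that the SIMD kernels accelerate; the rescaling and re-quantization conventions here are assumptions made for illustration:

import numpy as np

def quantized_layer(x_q, x_scale, w_q, w_scale, bias_i32, relu=True):
    """One fixed-point layer: int8 activations and weights, int32 biases and
    accumulation. Re-quantizing the output to int8 for the next layer (and the
    choice of scales) is assumed; the real kernels do this work with SIMD."""
    acc = w_q.astype(np.int32) @ x_q.astype(np.int32) + bias_i32   # int32 accumulate
    y = acc.astype(np.float64) * (x_scale * w_scale)               # real-valued pre-activation
    if relu:
        y = np.maximum(y, 0.0)
    y_scale = max(float(y.max()) / 127.0, 1e-8)                    # next layer's activation scale
    y_q = np.clip(np.round(y / y_scale), -128, 127).astype(np.int8)
    return y_q, y_scale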
Batched lazy computation [11] is also performed. To exploit
the multiple cores present on modern smartphones, we
compute the activations up to the last layer in a dedicated thread.
The output posteriors of the last layer are computed only when
needed by the decoder in a separate thread. Each thread computes
results for a batch of frames at a time. The choice of batch
size is a tradeoff between computation efficiency and recognition
latency.
Finally, frame skipping [12] is adopted to further reduce
computation. Activations and posteriors are computed only every
nb frames and used for nb consecutive frames. In experiments
we find that for nb = 2, the accuracy loss is negligible;
however for nb ≥ 3, the accuracy degrades quickly.
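A minimal sketch of frame skipping, with compute_posteriors standing in for the DNN forward pass (a hypothetical callable, not part of the described system):

def posteriors_with_frame_skipping(frames, compute_posteriors, nb=2):
    """Compute posteriors only every nb-th frame and reuse them for the
    following nb - 1 frames, as in the frame-skipping scheme of [12]."""
    out, cached = [], None
    for t, frame in enumerate(frames):
        if t % nb == 0:
            cached = compute_posteriors(frame)
        out.append(cached)
    return out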
4. Language Model Compression
We create n-gram language models appropriate for embedded
recognition by first training a 1M word vocabulary and
18M n-gram Katz-smoothed 4-gram language model using
Google’s large-scale LM training infrastructure [13]. The language
model is trained using a very large corpus (on the order
of 20 billion words) from a variety of sources, including search
queries, web documents and transcribed speech logs.
To reduce memory usage, we use two language models during
decoding. First, a highly-pruned LM is used to build a small
CLG transducer [14] that is traversed by the decoder. Second,
we use a larger LM to perform on-the-fly lattice rescoring during
search, similar to [15]. We have observed that a CLG transducer
is generally two to three times larger than a standalone
LM, so this rescoring technique significantly reduces the memory
footprint.
Both language models used in decoding are obtained by
shrinking the 1M vocabulary and 18M n-gram LM. We aggressively
reduce the vocabulary to the 50K terms with the highest unigram counts.
We then apply relative entropy pruning [16] as implemented
in the OpenGrm toolkit [17]. The resulting finite state model
for rescoring LM has 1.4M n-grams, with just 280K states and
1.7M arcs. The LM for first pass decoding contains only unigrams
and about 200 bigrams.
We further reduce the memory footprint of the rescoring
LM by storing it in an extremely memory-efficient manner, discussed
below.
4.1. Succinct storage using LOUDS
If you consider a backoff language model’s structure, the failure
arcs from (n + 1)-gram contexts to n-gram contexts and,
ultimately, to the unigram state form a tree. Trees can be stored
using 2 bits per node using a level-order unary degree sequence
(LOUDS), where we visit the nodes breadth-first writing 1s for
the number of (n + 1)-gram contexts and then terminating with
a 0 bit. We build a bit sequence similarly for the degree of outbound
non-φ arcs.
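A small Python sketch of this breadth-first LOUDS construction on a toy context tree; the dictionary-based tree representation and the omission of a super-root bit pair are simplifying assumptions:

from collections import deque

def louds_bits(children, root):
    """LOUDS encoding of a tree: visit nodes breadth-first, emitting one 1 bit
    per child of the current node followed by a terminating 0 bit, so a tree
    with n nodes needs roughly 2n bits. 'children' maps node -> list of children."""
    bits, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        kids = children.get(node, [])
        bits.extend([1] * len(kids))
        bits.append(0)
        queue.extend(kids)
    return bits

# Toy backoff-context tree (hypothetical): the unigram state has two bigram
# contexts, "a" and "b", and "a" has one trigram context "a b".
tree = {"<eps>": ["a", "b"], "a": ["a b"], "b": [], "a b": []}
print(louds_bits(tree, "<eps>"))   # [1, 1, 0, 1, 0, 0, 0]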
The LOUDS data structure provides first-child, last-child,
and parent navigation, so we are able to store a language model
without storing any next-state values. As a contiguous, index-free
data object, the language model can be easily memory
mapped.
The implementation of this model is part of the OpenFst
library [18] and covered in detail in [19]. The precise storage
requirements, measured in bits, are
$$4n_s + n_a + (W + L)(n_s + n_a) + W n_f + c$$
where $n_s$ is the number of states, $n_f$ the number of final states,
$n_a$ the number of arcs, $L$ the number of bits per word id,
and $W$ the number of bits per probability value. This
is approximately one third the storage required by OpenFst's
vector representation. For the models discussed here, we use 16
bits for both labels and weights.
During run time, to support fast navigation in the language
model, we build additional indexes of the LOUDS bit sequences
to support the operations $\mathrm{rank}_b(i)$, the number of $b$-valued bits
before index $i$, and its inverse $\mathrm{select}_b(r)$. We maintain a two-level
index that adds an additional $0.251(4n_s + n_a)$ bits. Here
it is important to make use of fast assembly operations such as
find-first-set during decoding, which we do through compiler
intrinsics.
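A naive Python sketch of these rank/select operations over a LOUDS bit sequence; a production implementation would precompute the two-level index described above and use popcount / find-first-set instructions rather than scanning:

class BitIndex:
    """rank(b, i) counts b-valued bits before index i; select(b, r) is its inverse."""

    def __init__(self, bits):
        self.bits = list(bits)

    def rank(self, b, i):
        return sum(1 for x in self.bits[:i] if x == b)

    def select(self, b, r):
        seen = 0
        for i, x in enumerate(self.bits):
            if x == b:
                seen += 1
                if seen == r:
                    return i
        raise ValueError("fewer than r occurrences of b")

idx = BitIndex([1, 1, 0, 1, 0, 0, 0])
print(idx.rank(1, 3))    # 2 one-bits before index 3
print(idx.select(0, 2))  # index of the second 0 bit -> 4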
4.2. Symbol table compression
The word symbol table for an LM is used to map words to
unique identifiers. Symbol tables are another example of a
data structure that can be represented as a tree. In this case
we relied upon the implementation contained in the MARISA
library [20].
This produces a symbol table that fits in just one third the
space of the concatenated strings of the vocabulary, yet provides
a bidirectional mapping between integers and vocabulary
strings. We are able to store our vocabulary in about 126K
bytes, less than 3 bytes per entry in a memory mappable image.
The MARISA library assigns the string to integer ids during
compression, so we relabel all of the other components in our
system to match this assignment.
5. Experimental Results
To evaluate accuracy performance, we use a test set of 20,000
anonymized transcribed utterances from users speaking in order
to fill in text fields on mobile devices. This biases the test set
towards dictation, as opposed to voice search queries, because
dictation is more useful than search when no network connection
is available.
To measure speed performance, we decode a subset of 100
utterances on an Android Nexus 4 (LG) phone. The Nexus 4 is
equipped with a 1.5GHz quad-core Qualcomm Snapdragon S4
pro CPU, and 2GB of RAM. It runs the Android 4.2 operating
system. To reduce start up loading time, all data files, including
the acoustic model, the CLG transducer, the rescoring LM
and the symbol tables are memory mapped on the device. We
use a background thread to “prefetch” the memory mapped resources
when decoding starts, which mitigates the slowdown in
decoding for the first several utterances.
5.1. GMM acoustic model
The GMM configuration achieves a word error rate (WER) of
20.7% on this task, with an average real-time (RT) factor of
0.63. To achieve this speed, the system uses integer arithmetic
for likelihood calculation and decoding. The Mahalanobis distance
computation is accelerated using fixed-point SIMD instructions.
Gaussian selection is used to reduce the burden
of likelihood computation, and further efficiencies come from
computing likelihoods for batches of frames.
5.2. Accuracy with DNNs
We compare the accuracy of DNNs with different configurations
to the baseline GMM acoustic model in Table 1. A DNN
with 1.48M parameters already outperforms the GMM in accuracy,
with a disk size of only 17% of the GMM’s. By increasing
the number of hidden layers from 4 to 6 and number of outputs
from 1000 to 2000, we obtain a large improvement of 27.5%
relative in WER compared to the GMM baseline. The disk size
of this DNN is 26% of the size of the GMMs.
For comparison, we also evaluate a server-sized DNN with
an order of magnitude more parameters, and it gives 12.3%
WER. Note that all experiments in Table 1 use smaller LMs in
decoding. In addition, with an un-pruned server LM, the server
DNN achieves 9.9% WER while the server GMM achieves
13.5%. Therefore, compared to a full-size DNN server system,
there is a 2.4% absolute loss due to smaller LMs, and 2.8% due
to smaller DNN. Compared to the full-size GMM server system,
the embedded DNN system is about 10% relatively worse
in WER.
The impact of frame skipping is evaluated with the
DNN 6×512 model. As shown in Table 2, the accuracy performance
quickly degrades when nb is larger than 2.
Table 2: Accuracy results with frame skipping in a DNN system.
nb 1 2 3 4 5
WER (%) 15.1 15.2 15.6 16.0 16.7
5.3. Speed benchmark
For the speed benchmark, we measure the average RT factor as well
as the 90-percentile RT factor. As shown in Table 3, the baseline
GMM system with SIMD optimization gives an average RT
factor of 0.63. The fixed-point DNN gives 1.32×RT without
SIMD optimization, and 0.75×RT with SIMD. Batched lazy
computation improves average RT by 0.06 but degrades the 90-
percentile RT performance, probably due to less efficient on-demand
computation for difficult utterances. After frame skipping
with nb = 2, the speed of DNN system is further improved
slightly to 0.66×RT. Finally, the overhead of the compact
LOUDS based LM is about 0.13×RT on average.
Table 3: Average real-time (RT) and 90-percentile RT factors of
different system settings.
Average RT RT(90)
GMM 0.63 0.90
DNN (fixed-point) 1.32 1.43
+ SIMD 0.75 0.87
+ lazy batch 0.69 1.01
+ frame skipping 0.66 0.97
+ LOUDS 0.79 1.24
5.4. System Footprint
Compared to the baseline GMM system, the new system with
LM compression and DNN acoustic model achieves a much
smaller footprint. The data file sizes are listed in Table 4. Note
that conversion of the 34MB floating-point GMM model to a
14MB fixed-point GMM model itself provides a large reduction
in size.
The use of DNN reduces the size by 10MB, and the LM
compression contributed to another 18MB reduction. Our final
embedded DNN system size is reduced from 46MB to 17MB,
while achieving a substantial WER reduction from 20.7% to 15.2%.
6. Conclusions
In this paper, we have described a fast, accurate and small-footprint
speech recognition system for large vocabulary dictation
on the device. DNNs are used as the acoustic model, which
provides a 27.5% relative WER improvement over the baseline
GMM models. The use of DNNs also significantly reduces the
memory usage. Various techniques are adopted to speed up the
DNN inference at decoding time. In addition, a LOUDS based
language model compression reduces the rescoring LM size by
more than 60% relative. Overall, the size of the data files of the
system is reduced from 46MB to 17MB.
Table 1: Comparison of GMM and DNNs with different sizes. The input layer is denoted by number of filterbank energies × the context
window size (left + current + right). The hidden layers are denoted by number of hidden layers × number of nodes per layer. The
number of outputs is the number of HMM states in the model.
Model WER (%) Input Layer Hidden Layers # Outputs # Parameters Size
GMM 20.7 - - 1314 8.08M 14MB
DNN 4×400 22.6 40×(8+1+4) 4×400 512 0.9M 1.5MB
DNN 4×480 20.3 40×(10+1+5) 4×480 1000 1.5M 2.4MB
DNN 6×512 15.1 40×(10+1+5) 6×512 2000 2.7M 3.7MB
Server DNN 12.3 40×(20+1+5) 4×2560 7969 49.3M 50.8MB
Table 4: Comparison of data file sizes (in MB) in baseline GMM
system and DNN system with and without LOUDS LM compression.
AM denotes acoustic model, CLG is the transducer for decoding,
LM denotes the rescoring LM, and symbols denote the
word symbol table.
System AM CLG LM Symbols Total
GMM 14 2.7 29 0.55 46
+ LOUDS 14 2.7 10.7 0.13 27
DNN 3.7 2.8 29 0.55 36
+ LOUDS 3.7 2.8 10.7 0.13 17
Future work includes speeding up rescoring using the
LOUDS LM as well as further compression techniques. We
also continue to investigate the accuracy performance with different
sizes of LM for CLG and rescoring.
7. Acknowledgements
The authors would like to thank our former colleague Patrick
Nguyen for implementing the portable neural network runtime
engine used in this study. Thanks also to Vincent Vanhoucke
and Johan Schalkwyk for helpful discussions and support during
this work.
8. References
[1] J. Schalkwyk, D. Beeferman, F. Beaufays, B. Byrne, C. Chelba, M. Cohen, M. Kamvar, and B. Strope, “Google search by voice: A case study,” in Advances in Speech Recognition: Mobile Environments, Call Centers and Clinics, pp. 61–90. Springer, 2010.
[2] B. Ballinger, C. Allauzen, A. Gruenstein, and J. Schalkwyk, “On-demand language model interpolation for mobile speech input,” in Proc. Interspeech, 2010.
[3] J. Zheng et al., “Implementing SRI’s Pashto speech-to-speech translation system on a smart phone,” in SLT, 2010.
[4] J. Xue, X. Cui, G. Daggett, E. Marcheret, and B. Zhou, “Towards high performance LVCSR in speech-to-speech translation system on smart phones,” in Proc. Interspeech, 2012.
[5] R. Prasad et al., “BBN Transtalk: Robust multilingual two-way speech-to-speech translation for mobile platforms,” Computer Speech and Language, vol. 27, pp. 475–491, February 2013.
[6] M. J. F. Gales, “Semi-tied covariance matrices for hidden Markov models,” IEEE Trans. Speech and Audio Processing, vol. 7, pp. 272–281, 1999.
[7] D. Povey, D. Kanevsky, B. Kingsbury, B. Ramabhadran, G. Saon, and K. Visweswariah, “Boosted MMI for model and feature-space discriminative training,” in Proc. ICASSP, 2008.
[8] E. Bocchieri, “Fixed-point arithmetic,” Automatic Speech Recognition on Mobile Devices and over Communication Networks, pp. 255–275, 2008.
[9] N. Jaitly, P. Nguyen, A. W. Senior, and V. Vanhoucke, “Application of pretrained deep neural networks to large vocabulary speech recognition,” in Proc. Interspeech, 2012.
[10] M. D. Zeiler et al., “On rectified linear units for speech processing,” in Proc. ICASSP, 2013.
[11] V. Vanhoucke, A. Senior, and M. Z. Mao, “Improving the speed of neural networks on CPUs,” in Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, 2011.
[12] V. Vanhoucke, M. Devin, and G. Heigold, “Multiframe deep neural networks for acoustic modeling,” in Proc. ICASSP, 2013.
[13] T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean, “Large language models in machine translation,” in EMNLP, 2007, pp. 858–867.
[14] M. Mohri, F. Pereira, and M. Riley, “Speech recognition with weighted finite-state transducers,” Handbook of Speech Processing, pp. 559–582, 2008.
[15] T. Hori and A. Nakamura, “Generalized fast on-the-fly composition algorithm for WFST-based speech recognition,” in Proc. Interspeech, 2005.
[16] A. Stolcke, “Entropy-based pruning of backoff language models,” in DARPA Broadcast News Transcription and Understanding Workshop, 1998, pp. 8–11.
[17] B. Roark, R. Sproat, C. Allauzen, M. Riley, J. Sorensen, and T. Tai, “The OpenGrm open-source finite-state grammar software libraries,” in Proceedings of the ACL 2012 System Demonstrations, 2012, pp. 61–66, Association for Computational Linguistics.
[18] C. Allauzen, M. Riley, J. Schalkwyk, W. Skut, and M. Mohri, “OpenFst: A general and efficient weighted finite-state transducer library,” in Proceedings of the Ninth International Conference on Implementation and Application of Automata (CIAA 2007), 2007, vol. 4783 of Lecture Notes in Computer Science, pp. 11–23, Springer, http://www.openfst.org.
[19] J. Sorensen and C. Allauzen, “Unary data structures for language models,” in Proc. Interspeech, 2011.
[20] S. Yata, “Prefix/Patricia trie dictionary compression by nesting prefix/Patricia tries (Japanese),” in Proceedings of the 17th Annual Meeting of the Association for Natural Language Processing, Toyohashi, Japan, 2011, NLP2011, https://code.google.com/p/marisa-trie/.
Backoff Inspired Features for Maximum Entropy Language Models
Fadi Biadsy, Keith Hall, Pedro Moreno and Brian Roark
Google, Inc.
{biadsy,kbhall,pedro,roark}@google.com
Abstract
Maximum Entropy (MaxEnt) language models [1, 2] are linear
models that are typically regularized via well-known L1 or L2
terms in the likelihood objective, hence avoiding the need for
the kinds of backoff or mixture weights used in smoothed n-gram
language models using Katz backoff [3] and similar techniques.
Even though backoff cost is not required to regularize
the model, we investigate the use of backoff features in MaxEnt
models, as well as some backoff-inspired variants. These
features are shown to improve model quality substantially, as
shown in perplexity and word-error rate reductions, even in very
large scale training scenarios of tens or hundreds of billions of
words and hundreds of millions of features.
Index Terms: maximum entropy modeling, language modeling,
n-gram models, linear models
1. Introduction
A central problem in language modeling is how to combine information
from various model components, e.g., mixing models
trained with differing Markov orders for smoothing or on
distinct corpora for adaptation. Smoothing (regularization) for
n-gram language models is typically presented as a mechanism
whereby higher-order models are combined with lower-order
models so as to achieve both the specificity of the higher-order
model and the more robust generality of the lower-order model.
Most commonly, this combination is effected via an interpolation
or backoff mechanism, in which each prefix (history) of an
n-gram has a parameter which dictates how much cost is associated
with making use of lower-order n-gram estimates, often
called the “backoff cost”. This becomes a parameter estimation
problem in its own right, either through discounting or mixing
parameters; and these are often estimated via extensive parameter
tying, heuristics based on count histograms, or both.
Log linear models provide an alternative to n-gram backoff
or interpolated models for combining evidence from multiple,
overlapping sources of evidence, with very different regularization
methods. Instead of defining a specific model structure with
backoff costs and/or mixing parameters, these models combine
features from many sources into a single linear feature vector,
and score a word by taking the dot product of the feature vector
with a learned parameter vector. Learning can be via locally
normalized likelihood objective functions, as in Maximum Entropy
(MaxEnt) models [1, 2, 4] or global “whole sentence” objectives
[5, 6, 7]. For locally normalized MaxEnt models, which
estimate a conditional distribution over a vocabulary given the
prefix history (just as the backoff smoothed n-gram models do),
the brute-force local normalization over the vocabulary obviates
the need for complex backoff schemes to avoid zero probabilities.
One can simply toss in n-gram features of all the orders,
and learn their relative contribution.
Recall, however, that the standard backoff n-gram models
do not only contain parameters associated with n-grams; they
also contain parameters associated with the backoff weights for
each prefix history. For every proper prefix of an n-gram in
the model, there will be an associated backoff weight, which
penalizes to a greater or lesser extent words that have been previously
unseen following that prefix history. For some histories
we should have a relatively high expectation of seeing something
new, either because the history itself is rare (hence we
do not have enough observations yet to be strongly predictive)
or it simply predicts relatively open classes of possible words,
e.g., “the”, which can precede many possible words, including
many that were presumably unobserved following “the” in the
training corpus. Other prefixes may be highly predictive so that
the expectation of seeing something previously unobserved is
relatively low, e.g., “Barack”.
Granted, MaxEnt language models (LMs) do not need this
information about prefix histories to estimate regularized probabilities.
Chen and Rosenfeld [4] survey various smoothing and
regularization methods for MaxEnt language models, including
reducing the number of features (as L1 regularization does), optimizing
to match expected frequencies to discounted counts, or
optimizing to modified objectives, such as L2 regularization. In
none of these methods are there parameters in the model associated
with the sort of “otherwise” semantics of conventional
n-gram backoffs. Because such features are not required for
smoothing, they are not part of the typical feature set used in
log linear language modeling, yet our results demonstrate that
they should be. The ultimate usefulness of such features likely
depends on the amount of training data available, and we have
thus applied highly optimized MaxEnt training to very large
data sets. In large scale n-gram modeling, it has been shown
that the specific details of the smoothing algorithm are typically
less important than the scale. So-called “stupid backoff” [8]
is an efficient, scalable estimation method that, despite lack of
normalization guarantees, is shown to be extremely effective
in very large data set scenarios. While this has been taken to
demonstrate that the specifics of smoothing are unimportant as
the data gets large, those parameters are still important components
of the modeling approach, even if their usefulness is
robust to variation in parameter value.
We demonstrate that features patterned after backoff
weights, and several related generalizations of these features,
can in fact make a large difference to a MaxEnt language model,
even if the amount of training data is very large. In the next section,
we present background for language modeling and cover
related work. We then present our MaxEnt training approach,
and the new features. Finally, we present experimental results
on a range of large scale speech tasks.
2. Background and Related Work
Let $w_i$ be the word at position $i$ in the string, let
$w_{i-k}^{i-1} = w_{i-k} \ldots w_{i-1}$ be the prefix history of the string prior to $w_i$,
and let $P$ be a probability estimate assigned to seen n-grams by the
specific smoothing method. Then the standard backoff language
model formulation is as follows:
$$P(w_i \mid w_{i-k}^{i-1}) =
\begin{cases}
P(w_i \mid w_{i-k}^{i-1}) & \text{if } c(w_{i-k}^{i}) > 0 \\
\alpha(w_{i-k}^{i-1})\, P(w_i \mid w_{i-k+1}^{i-1}) & \text{otherwise}
\end{cases}$$
This recursive smoothing formulation has two kinds of parameters:
n-gram probabilities $P(w_i \mid w_{i-k}^{i-1})$ and backoff weights
$\alpha(w_{i-k}^{i-1})$, which are parameters associated with the prefix history
$w_{i-k}^{i-1}$.
MaxEnt models are log linear models that score alternatives
by taking the exponential of the dot product between a
feature vector and a parameter vector, and normalizing. Let
$\Phi(w_{i-k} \ldots w_i)$ be a $d$-dimensional feature vector, $\theta$ a
$d$-dimensional parameter vector, and $V$ a vocabulary. Then
$$P(w_i \mid w_{i-k}^{i-1}) = \frac{\exp(\Phi(w_{i-k} \ldots w_i) \cdot \theta)}{Z(w_{i-k} \ldots w_{i-1}, \theta)}$$
where $Z$ is a partition function (normalization constant):
$$Z(w_{i-k}, \ldots, w_{i-1}, \theta) = \sum_{v \in V} \exp(\Phi(w_{i-k} \ldots w_{i-1} v) \cdot \theta).$$
Training with a likelihood objective function is a convex optimization
problem, with well-studied efficient estimation techniques,
such as stochastic gradient descent. Regularization
techniques are also well-studied, and include L1 and L2 regularization,
or their combination, which are modifications of the
likelihood objective to either keep parameter values as close to
zero as possible (L2) or reduce the number of features with nonzero
parameter weights by pushing many parameters to zero
(L1). We employ a distributed approximation to L1, see Section
3.1.
The most expensive part of this optimization is the calculation
of the partition function, since it requires summing over the
entire vocabulary, which can be very large. Efficient methods
to enable training with very large corpora and large vocabularies
have been investigated over the past decades, from methods
to exploit structural overlap between features [9, 10] to methods
for decomposing the multi-class language modeling problem
into many binary language modeling problems (one versus
the rest) and sampling less data to effectively learn the models
[11]. For this paper, we employed many optimizations to enable
training with very large vocabularies (several hundred thousand
words) and very large training sets (>100B words).
3. Methods
3.1. Maximum Entropy training
Many features have been used in MaxEnt language models, including
standard n-grams and trigger words [1], topic-based
features [12] and morphological and sub-word based features
[13, 14]. Feature engineering is a major consideration in this
sort of modeling, and in Section 3.2 we detail our newly designed
feature templates. Before we do so, we present the training
methods that allow us to scale up to a very large vocabulary
and many training instances. In this work, we wish to scale up
MaxEnt language model training to learn from the same amount
of data used for standard backoff n-gram language models. We
achieve this by exploiting recent work on gradient-based distributed
optimization; specifically, distributed stochastic gradient
descent (SGD) [15, 16, 17, 18, 19].
We differ slightly from previous work in multiple aspects:
(1) we apply a final L1 regularization step at the end of each
reducer using statistics collected from the mappers; (2) we estimate
the gradient using a mini-batch of 16 samples, where the
mini-batch is processed in parallel via multi-threading; (3) we
do not perform any binarization or subsampling as in [20]; (4)
unlike [21], we do not perform any clustering of our vocabulary.
Algorithm 1 presents our variant of the iterative parameter
mixtures (IPM) algorithm based on sampling. This presents a
merging of concepts from the original IPM algorithm described
in [16] and the distributed sample-based algorithm in [18] as
well as the lazy L1 SGD computation from [22].
Algorithm 1 Sample-based Iterative Parameter Mixtures
Require: n is the number of samples per worker per epoch
Require: Break S into K partitions
1:  S ← {D^1, . . . , D^j, . . . , D^K}
2:  t ← 0
3:  Θ_t ← 0
4:  repeat
5:    t ← t + 1
6:    {θ^1_1, . . . , θ^K_L} ← IPMMAP(D^1, . . . , D^K, Θ_{t−1}, n)
7:    Θ′_t ← IPMREDUCE(θ^1_1, . . . , θ^j_l, . . . , θ^K_L)
8:    Θ_t ← APPLYL1(Θ′_t)
9:  until converged
10: function IPMMAP(D, Θ, n)        ▷ IPMMAP processes training data in parallel
11:   Θ_0 ← Θ
12:   for i = 1 . . . n do          ▷ n examples from D
13:     Sample d_i from D
14:     Θ′_i ← ApplyLazyL1(ActiveFeatures(d_i, Θ_{i−1}))
15:     Θ_i ← Θ′_i − α ∇F_{d_i}(Θ′_i)
16:     α ← UpdateAlpha(α, i)
17:   end for
18:   return Θ_n
19: end function
20: function IPMREDUCE(θ^1_l, . . . , θ^j_l, . . . , θ^K_l)   ▷ IPMREDUCE processes model parameters in parallel
21:   θ_l ← (1/K) Σ_j θ^j_l
22:   return θ_l
23: end function
While this is a general paradigm for distributed optimization,
we show the MapReduce [23] implementation in Algorithm 1.
We begin the process by partitioning the training data S into
multiple units $D^j$, processing each of these units with the
IPMMAP function on separate processing nodes. On each of these
nodes, IPMMAP samples a subset of $D^j$ which we call $d_i$. This
can be a single example or a mini-batch of examples. We perform
the lazy L1 regularization update to the model, compute the
gradient of the regularized loss associated with the mini-batch
(which can also be done in parallel), update the local copy of the
model parameters Θ, and update the learning rate α. Each node
samples n examples from its data partition. Finally, IPMREDUCE
collects the local model parameters from each IPMMAP and
averages them in parallel. Parallelization here can be done over
subsets of the parameter indices (each IPMREDUCE node averages
a subset of the parameter space). We refer to each full MapReduce
pass as an epoch of training. Starting with the second epoch, the
IPMMAP nodes are initialized with the previous epoch’s merged,
regularized model.
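A serial Python sketch of one epoch of this sample-based IPM scheme; the gradient callback, the learning-rate decay, and the soft-threshold standing in for the lazy L1 update are assumptions made to keep the sketch self-contained:

import numpy as np

def ipm_map(shard, theta, n, grad_fn, alpha=0.05):
    """Mapper: run SGD on n examples sampled from one data shard, starting from
    the merged model theta. grad_fn(example, theta) returns the loss gradient."""
    theta = theta.copy()
    for _ in range(n):
        example = shard[np.random.randint(len(shard))]
        theta = theta - alpha * grad_fn(example, theta)
        alpha *= 0.999                      # simple learning-rate decay (assumption)
    return theta

def ipm_reduce(local_models):
    """Reducer: average the per-worker parameter vectors."""
    return np.mean(local_models, axis=0)

def apply_l1(theta, lam=1e-6):
    """Soft-threshold step standing in for the final L1 regularization."""
    return np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)

def train(shards, dim, grad_fn, epochs=3, n=1000):
    theta = np.zeros(dim)
    for _ in range(epochs):
        # Under MapReduce the mappers run in parallel on separate machines.
        local = [ipm_map(shard, theta, n, grad_fn) for shard in shards]
        theta = apply_l1(ipm_reduce(local))
    return theta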
In a general shared distributed framework, which is used
at Google, some machines may be slower than others (due to
hardware or overload), machines may fail, or jobs may be preempted.
When using a large number of machines this is inevitable.
To avoid starting the training process over in these
cases, and to avoid making all other workers wait for the lagging machines,
we enforce a timeout on our trainers. In other words, all mappers
have to finish within a certain amount of time. Therefore,
the reducer merges all models once they have either finished processing
their samples or timed out.
3.2. Backoff inspired features
MaxEnt language models commonly have n-gram features,
which we denote here as a function of the string, the position,
and the order as follows:
$$\mathrm{NGram}(w_1 \ldots w_n, i, k) = \langle w_{i-k}, \ldots, w_{i-1}, w_i \rangle$$
We now introduce some features inspired by the backoff parameters
$\alpha(w_{i-k}^{i-1})$ presented in Section 2. We begin with the most
directly related features, which we term suffix backoff features:
$$\mathrm{SuffixBackoff}(w_1 \ldots w_n, i, k) = \langle w_{i-k}, \ldots, w_{i-1}, \mathrm{BO} \rangle$$
These fire if and only if the full n-gram
$\mathrm{NGram}(w_1 \ldots w_n, i, k)$ is not in the feature dictionary
(see Section 4.1). This is directly analogous to the backoff
weights in standard n-gram models, since it is a parameter
associated with the prefix history that fires when the particular
n-gram is unobserved.
Inspired by the form of this feature, we can introduce other
general backoff features. First, rather than just replacing the
suffix, we can replace the prefix:
$$\mathrm{PrefixBackoff}(w_1 \ldots w_n, i, k) = \langle \mathrm{BO}, w_{i-k+1}, \ldots, w_i \rangle$$
Next, we can replace multiple words in the feature, to generalize
across several such contexts:
$$\mathrm{PrefixBackoff}_j(w_1 \ldots w_n, i, k) = \langle \mathrm{BO}_k, w_{i-j}, \ldots, w_i \rangle$$
$$\mathrm{SuffixBackoff}_j(w_1 \ldots w_n, i, k) = \langle w_{i-k}, \ldots, w_{i-k+j}, \mathrm{BO}_k \rangle$$
These features indicate that an n-gram of length k + 1 ending
with (PrefixBackoff_j), or beginning with (SuffixBackoff_j), the
particular words retained in the feature is not in the feature dictionary.
Note that if j = k − 1, then PrefixBackoff_j is identical to
the earlier defined PrefixBackoff feature, and SuffixBackoff_j is
identical to SuffixBackoff.
For example, suppose that we have the following string
S=“we will save the quail eggs” and that the 4-gram “will save
the quail” does not exist in our feature dictionary. Then we can
fire the following features at word wi=5 = “quail”:
SuffixBackoff(S, 5, 3) = < will, save, the, BO >
PrefixBackoff(S, 5, 3) = < BO, save, the, quail >
SuffixBackoff0(S, 5, 3) = < will, BO3 >
SuffixBackoff1(S, 5, 3) = < will, save, BO3 >
PrefixBackoff0(S, 5, 3) = < BO3, quail >
PrefixBackoff1(S, 5, 3) = < BO3, the, quail >
As with n-gram feature templates, we include all such features
up to some specified length, e.g., if we have a trigram model,
that includes n-grams up to length 3, including unigrams, bigrams
and trigrams. Similarly, for our prefix and suffix backoff
features, we will have a maximum length and include in our
possible feature set all such features of that length or shorter.
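A small Python sketch of these feature templates, reproducing the worked example above (1-based word positions, with "BO" and "BOk" as the backoff tokens):

def ngram(words, i, k):
    """<w_{i-k}, ..., w_i>, with 1-based word positions as in the paper."""
    return tuple(words[i - k - 1:i])

def suffix_backoff(words, i, k):
    return tuple(words[i - k - 1:i - 1]) + ("BO",)

def prefix_backoff(words, i, k):
    return ("BO",) + tuple(words[i - k:i])

def suffix_backoff_j(words, i, k, j):
    return tuple(words[i - k - 1:i - k + j]) + ("BO%d" % k,)

def prefix_backoff_j(words, i, k, j):
    return ("BO%d" % k,) + tuple(words[i - j - 1:i])

S = "we will save the quail eggs".split()
print(suffix_backoff(S, 5, 3))       # ('will', 'save', 'the', 'BO')
print(prefix_backoff(S, 5, 3))       # ('BO', 'save', 'the', 'quail')
print(suffix_backoff_j(S, 5, 3, 1))  # ('will', 'save', 'BO3')
print(prefix_backoff_j(S, 5, 3, 0))  # ('BO3', 'quail')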
4. Experimental results
We performed two experiments to evaluate the utility of these
new backoff-inspired features in maximum entropy language
models trained on very large corpora. First, we examine perplexity
improvements when such features are included in the
model alongside n-gram features. Next, we look at Word Error
Rate (WER) performance when reranking the output of a baseline
recognizer, again using different backoff feature templates.
In all cases, we fixed the vocabulary and feature budget of the
model so that improvements are not simply due to having more
parameters in the model. We set the vocabulary of our model to
200 thousand words, by selecting all words from the 2M words
in the baseline recognizer vocabulary that had been emitted by
the recognizer in the last 6 months of log files. All other words
are mapped to “”. We use the same vocabulary in all
of our experiments.
For our experiments, we focus on the voice search task. Our
data sets are assembled and pooled from anonymized supervised
and unsupervised spoken queries (such as, search queries,
questions, and voice actions) and typed queries to google.com,
YouTube, and Google Maps, from desktop and mobile devices.
Our overall training set is about 305 billion words (including
end of sentence symbols). We divide this set into K subsets. We
assign subset $D^k$ to trainer k (where 1 ≤ k ≤ K). Then, we
run our distributed training (Algorithm 1) using K machines.
Since the amount of training data is very large, trainer k randomly
samples data points from its subset $D^k$. Each epoch uses
a different seed for sampling, which equals the epoch
number. As mentioned above, a trainer may terminate due to
completing its subsample or due to a timeout. We fix the timeout
threshold for each epoch across all our experiments; in our
experiments, the timeout is 6 hours.
4.1. Feature Dictionary
A feature dictionary maps each feature key (e.g., trigram: “save
the quail”) to an index in the parameter vector Θ. As described
in Algorithm 2, we build this dictionary by iterating over all
strings in our training data and making use of the NGram function
(defined above) to build the n-gram feature keys (for every
k = 0 . . . 4). Also, for each string, we build the required backoff
feature keys (depending on the experiment).
Upon collecting all of these keys, we compute the total observed
count for each feature key and then retain only the most
frequent ones. We assign a different count cutoff for each feature
template. We determine these cutoffs based on a classical
cross-entropy pruned n-gram model trained on the same data.
Afterwards, our dictionary maps each key to a unique consecutive
index = 0 . . . Dim. In all our experiments, we allocated the
same budget of 228 million parameters. It is important to note
that the number of features dedicated for backoff features may
significantly vary across backoff-feature types.
Note that, while the backoff inspired features detailed in
section 3.2 are defined to fire only when the corresponding ngram
does not appear in the feature dictionary, they themselves
must appear in the feature dictionary in order to fire. If one
of these features does not appear frequently enough, it will not
appear in the feature dictionary and neither the original n-gram
nor the backoff feature will fire.
4.2. Feature Sets
In these experiments, all MaxEnt language models include ngrams
up to 5-grams. Our backoff inspired features are also
Algorithm 2 Dictionary Construction
for all w1, w2, . . . , wn ∈ Data do
for i ← 1 . . . n do
. We use 5-gram features.
for k ← 0 . . . 4 do
key ← NGram(w1, . . . , wn, i, k)
dictk ← dictk ∪ {key}
countk[key] ← countk[key] + 1
. Call the backoff functions above.
bo key ← SuffixBackoff(w1, . . . , wn, i, k)
dictk ← dictk ∪ {bo key}
countk[bo key] ← countk[bo key] + 1
end for
end for
end for
. Retain the most frequent features in dictk and map each
feature to a unique index, for each k = 0, . . . , 4.
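A Python sketch of this dictionary construction, reusing the ngram and suffix_backoff helpers from the sketch in Section 3.2; the per-order cutoff values here are placeholders, since the paper derives them from a pruned n-gram model:

from collections import Counter

def build_dictionary(corpus, max_order=5, cutoffs=None):
    """Count n-gram and suffix-backoff feature keys over the corpus, keep keys
    above a per-order count cutoff, and map survivors to consecutive indices.
    Cutoff values are placeholder assumptions."""
    cutoffs = cutoffs or {k: 2 for k in range(max_order)}
    counts = {k: Counter() for k in range(max_order)}
    for sentence in corpus:
        words = sentence.split()
        for i in range(1, len(words) + 1):        # 1-based positions
            for k in range(min(max_order, i)):    # k = 0 ... 4 context words
                counts[k][ngram(words, i, k)] += 1
                if k > 0:                         # backoff keys need a non-empty prefix
                    counts[k][suffix_backoff(words, i, k)] += 1
    dictionary = {}
    for k in range(max_order):
        for key, c in counts[k].items():
            if c >= cutoffs[k]:
                dictionary[key] = len(dictionary) # unique consecutive index
    return dictionary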
Figure 1: Perplexity (y-axis) versus number of epochs of training (x-axis) for various
feature sets under the same feature budget constraint. Feature sets include:
(1) n-gram features (NG); (2) PrefixBackoff (P); (3) SuffixBackoff
(S); (4) PrefixBackoff-k (Pk); and (5) SuffixBackoff-k (Sk).
based on substrings up to length 5, i.e., up to 4 words, either
preceded (prefix) or followed (suffix) by the “BO” token in the
case of PrefixBackoff and SuffixBackoff features; or “BOj ” up
to j = 4 preceding (prefix) or following (suffix) the word.
We examine several feature set pools: (1) n-gram features
alone (NG); (2) n-gram features plus PrefixBackoff (NG+P) or
SuffixBackoff (NG+S); (3) n-gram features plus PrefixBackoffj
(NG+Pk) or SuffixBackoffj (NG+Sk); and (4) n-gram features
plus PrefixBackoffj and SuffixBackoff (NG+Pk+S) or
SuffixBackoffj (NG+Pk+Sk). In each case, feature dictionaries
are built, so they may contain more or fewer n-grams as
required to include the backoff features in the dictionary.
For the current experiments, trials with PrefixBackoffj or
SuffixBackoffj only include features with j = 0, i.e., a single
word alongside the “BOk” token. Note that the number of such
features is relatively constrained compared to the n-gram features
and other backoff features – at most k|V | possible features
for a vocabulary V .
4.3. Perplexity
Perplexity was measured on a held-aside random sample of 5
million words from our pool of data. Figure 1 plots perplexity
versus number of epochs (up to 11) for different possible feature
sets. Recall that data is randomly sampled from the overall
training set, so that this plot also shows behavior as the amount
of training data is increased.
Table 1 presents perplexities after epoch 11, along with
the number of samples used during the training and number of
active features with non-zero parameters. The number of samples
varies because some trainers may run faster than others
depending on the number and type of features used; since we
enforce a timeout, an epoch may vary in the number of samples
processed in time. Nonetheless, Figure 1 shows that most models
have approached or reached convergence before completing
all the 11 epochs. A notable exception is the n-gram only model,
which seems to require a few more epochs before reaching convergence
– though clearly performance will not reach that of the
other trials. This points to another benefit of the backoff features
– they also seem to speed convergence for these models. Interestingly,
they also seem to considerably reduce the number of
active features.
The results show a large perplexity improvement due to
the use of backoff features, and in particular the generalized
Prefix/SuffixBackoff-k features. One potential reason for the
improved performance with these generalized backoff features
is that there are relatively few of them and that they fire more often,
as discussed in the previous section.
Feature Set   Description                                    Pplx    Samp   ActFt
NG            N-grams only                                   167.0   137B   197.8M
NG+P          N-grams + PrefixBackoff                        122.6   112B   189.5M
NG+S          N-grams + SuffixBackoff                        109.8   125B   188.9M
NG+Pk         N-grams + PrefixBackoffk                        88.0   100B   170.1M
NG+Pk+S       N-grams + PrefixBackoffk + SuffixBackoff        85.5   113B   172.6M
NG+Sk         N-grams + SuffixBackoffk                        82.7   126B   160.2M
NG+Pk+Sk      N-grams + PrefixBackoffk + SuffixBackoffk       80.2    96B   162.4M
Table 1: Perplexity (Pplx) after 11 epochs of training, with a fixed
feature budget. Also given are the number of samples (Samp) used to
train each model, in billions, and the number of active features
(ActFt), in millions.
4.4. Speech Recognition Rescoring Results
We evaluated our models by rescoring n-best outputs from a
baseline recognizer. In our experiments, we set n to 500. The
acoustic model of the baseline system is a deep neural network-based
model with 85M parameters, consisting of eight hidden
layers with 2560 Rectified Linear hidden units each and softmax
outputs for the 14,000 context-dependent state posteriors. The
network processes a context window of 26 (20 past and 5 future)
frames of speech, each represented with 40 dimensional log mel
filterbank energies taken from 25ms windows every 10ms. The
system is trained to a Cross-Entropy criterion on a US English
data set of 3M anonymized utterances (1,700 hours or about
600 million frames) collected from live voice search dictation
traffic. The utterances are hand-transcribed and force-aligned
with a previously trained DNN. See [24] for Google’s VoiceSearch
system design. The baseline LM is a Katz [3] smoothed
5-gram model pruned to 23M n-grams, trained on the same data
using Bayesian interpolation to balance multiple sources. It has
a vocabulary size of 2M and an OOV rate of 0.57% [25].
The score assigned to each hypothesis by our MaxEnt LM
is linearly interpolated with the baseline recognizer’s LM score
(with an untuned mixture factor of 0.33). Table 2 presents WER
results for multiple voice-search data sets collected
from anonymized and manually transcribed live traffic from
mobile devices. These data sets contain regular spoken search
queries, questions, and YouTube queries. We achieve modest
gains over the baseline system and over rescoring with just n-gram
features in all of the test sets, achieving, in aggregate,
half a point of improvement over the baseline system.
5. Conclusion
In this paper we introduced and explored features for maximum
entropy language models inspired by the backoff mechanism
of standardly smoothed language models. We found large perplexity
improvements over using n-gram features alone, for the
same feature budget; and a 0.5% absolute (3.4% relative) WER
improvement over the baseline system for our best performing
model. Future work will include exploring further variants of
our general backoff feature templates and combining with other
features beyond n-grams.
Table 2: WER results on 7 sub-corpora and overall, for the baseline
recognizer (no reranking) versus reranking models trained with
different reranking feature sets.
Test Set   Utts / Wds (×1000)   None   NG     NG+Pk   NG+Sk   NG+Pk+Sk
1          22.5 / 98.0          12.7   12.6   12.4    12.4    12.4
2          17.8 / 74.0          12.7   12.5   12.4    12.4    12.3
3          16.2 / 61.1          17.3   17.1   16.7    16.8    16.7
4          18.0 / 64.0          12.8   12.7   12.6    12.6    12.5
5          7.4 / 50.7           16.8   16.6   16.2    16.2    16.2
6          7.3 / 31.9           15.1   15.0   14.8    14.8    14.9
7          19.6 / 69.1          16.5   16.2   15.9    15.9    15.9
all        108.9 / 448.8        14.6   14.4   14.2    14.2    14.1
6. References
[1] R. Lau, R. Rosenfeld, and S. Roukos, “Trigger-based language
models: a maximum entropy approach,” in Proceedings
of the International Conference on Acoustics,
Speech, and Signal Processing (ICASSP), 1993, pp. 45–
48.
[2] R. Rosenfeld, “A maximum entropy approach to adaptive
statistical language modeling,” Computer Speech and
Language, vol. 10, pp. 187–228, 1996.
[3] S. M. Katz, “Estimation of probabilities from sparse data
for the language model component of a speech recogniser,”
IEEE Transactions on Acoustics, Speech, and Signal
Processing, vol. 35, no. 3, pp. 400–401, 1987.
[4] S. F. Chen and R. Rosenfeld, “A survey of smoothing techniques
for ME models,” IEEE Transactions on Speech and
Audio Processing, vol. 8, pp. 37–50, 2000.
[5] R. Rosenfeld, “A whole sentence maximum entropy language
model,” in Proceedings of IEEE Workshop on
Speech Recognition and Understanding, 1997, pp. 230–
237.
[6] R. Rosenfeld, S. F. Chen, and X. Zhu, “Whole-sentence
exponential language models: a vehicle for linguisticstatistical
integration,” Computer Speech and Language,
vol. 15, no. 1, pp. 55–73, Jan. 2001.
[7] B. Roark, M. Saraclar, and M. Collins, “Discriminative ngram
language modeling,” Computer Speech & Language,
vol. 21, no. 2, pp. 373–392, 2007.
[8] T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean,
“Large language models in machine translation,” in In
Proceedings of the Joint Conference on Empirical Methods
in Natural Language Processing (EMNLP) and Computational
Natural Language Learning (CoNLL), 2007.
[9] J. Wu and S. Khudanpur, “Efficient training methods
for maximum entropy language modeling.” in INTERSPEECH,
2000, pp. 114–118.
[10] T. Alumäe and M. Kurimo, “Efficient estimation of maximum
entropy language models with n-gram features: an
SRILM extension,” in INTERSPEECH, 2010, pp. 1820–
1823.
[11] P. Xu, A. Gunawardana, and S. Khudanpur, “Efficient subsampling
for training complex language models,” in Proceedings
of the Conference on Empirical Methods in Natural
Language Processing. Association for Computational
Linguistics, 2011, pp. 1128–1136.
[12] J. Wu and S. Khudanpur, “Building a topic-dependent
maximum entropy model for very large corpora,” in
Acoustics, Speech, and Signal Processing (ICASSP), 2002
IEEE International Conference on, vol. 1. IEEE, 2002,
pp. I–777.
[13] R. Sarikaya, M. Afify, Y. Deng, H. Erdogan, and Y. Gao,
“Joint morphological-lexical language modeling for processing
morphologically rich languages with application
to dialectal arabic,” Audio, Speech, and Language Processing,
IEEE Transactions on, vol. 16, no. 7, pp. 1330–
1339, 2008.
[14] M. A. B. Shaik, A. E.-D. Mousa, R. Schluter, and H. Ney, ¨
“Feature-rich sub-lexical language models using a maximum
entropy approach for german LVCSR,” in INTERSPEECH,
2013.
[15] J. N. Tsitsiklis, D. P. Bertsekas, and M. Athans, “Distributed
asynchronous deterministic and stochastic gradient
optimization algorithms,” IEEE Transactions on Automatic
Control, vol. 31:9, 1986.
[16] K. Hall, S. Gilpin, and G. Mann, “Mapreduce/bigtable for
distributed optimization,” in Neural Information Processing
Systems Workshop on Leaning on Cores, Clusters, and
Clouds, 2010.
[17] R. McDonald, K. Hall, and G. Mann, “Distributed training
strategies for the structured perceptron,” in Human Language
Technologies: The 2010 Annual Conference of the
North American Chapter of the Association for Computational
Linguistics, 2010, pp. 456–464.
[18] M. Zinkevich, M. Weimer, A. Smola, and L. Li, “Parallelized
stochastic gradient descent,” in Advances in Neural
Information Processing Systems 23, J. Lafferty, C. K. I.
Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta,
Eds., 2010, pp. 2595–2603.
[19] F. Niu, B. Recht, C. Re, and S. J. Wright, “Hogwild: A ´
lock-free approach to parallelizing stochastic gradient descent,”
in Advances in Neural Information Processing Systems,
2011.
[20] P. Xu, A. Gunawardana, and S. Khudanpur, “Efficient
subsampling for training complex language models.” in
EMNLP. ACL, 2011, pp. 1128–1136. [Online]. Available:
http://dblp.uni-trier.de/db/conf/emnlp/emnlp2011.
html#XuGK11
[21] F. Morin and Y. Bengio, “Hierarchical probabilistic neural
network language model,” in AISTATS05, 2005, pp. 246–
252.
[22] Y. Tsuruoka, J. Tsujii, and S. Ananiadou, “Stochastic gradient
descent training for l1-regularized log-linear models
with cumulative penalty,” in Proceedings of the Joint Conference
of the 47th Annual Meeting of the ACL, 2009, pp.
477–485.
[23] J. Dean and S. Ghemawat, “Mapreduce: Simplified data
processing on large clusters,” in Proceedings of the 6th
Conference on Symposium on Opearting Systems Design
& Implementation - Volume 6, ser. OSDI’04, 2004, pp.
10–10.
[24] J. Schalkwyk, D. Beeferman, F. Beaufays, B. Byrne,
C. Chelba, M. Cohen, M. Kamvar, and B. Strope, “your
word is my command: Google search by voice: A case
study,” in Advances in Speech Recognition. Springer,
2010, pp. 61–90.
[25] C. Allauzen and M. Riley, “Bayesian language model interpolation
for mobile speech input.” in INTERSPEECH,
2011, pp. 1429–1432.
2649
Unsupervised Testing Strategies for ASR
Brian Strope, Doug Beeferman, Alexander Gruenstein, Xin Lei
Google, Inc.
{bps, dougb, alexgru, xinlei}@google.com
Abstract
This paper describes unsupervised strategies for estimating
relative accuracy differences between acoustic models or language
models used for automatic speech recognition. To test
acoustic models, the approach extends ideas used for unsupervised
discriminative training to include a more explicit validation
on held out data. To test language models, we use a
dual interpretation of the same process, this time allowing us to
measure differences by exploiting expected ‘truth gradients’ between
strong and weak acoustic models. The paper shows correlations
between supervised and unsupervised measures across
a range of acoustic model and language model variations. We
also use unsupervised tests to assess the non-stationary nature
of mobile speech input.
Index Terms: speech recognition, unsupervised testing, nonstationary
distributions
1. Introduction
Current commercial speech recognition systems can use years
of unsupervised data to train relatively large, discriminatively
optimized, acoustic models (AM). Similarly, web-scale text corpora
for estimating language models (LM) are often available
online, and unsupervised recognition results themselves can
provide an additional source of LM training data.
Since there is no human transcription in any of these steps,
the remaining use for manual human transcription is for generating
test sets, as a final sanity check for validating system
parameters and models. In this paper, we augment that strategy
with unsupervised evaluations and begin the discussion of
whether eventually we might be able to get rid of the need for
any explicit human transcription.
The motivation for human transcription for testing is obvious.
Despite steady advances and relative commercial successes,
it is generally accepted that humans are much more accurate
transcribers than automatic speech recognition systems
[1]. While there are a few notable exceptions where machines
were more accurate than humans [2], human transcription accuracy
is so much better that we use it unquestioningly as our best
approximation of absolute truth.
But there are equally obvious disadvantages to relying on
human transcription. While it may feel premature, accepting
human performance as absolute truth imposes an upper bound
on accuracy. The absolute truth is not absolute, and so we’ll
eventually have to figure out how to beat it. In fact, with our
current processes and tasks, we show below that human transcribers
can be merely comparable in accuracy to current ASR
systems. Absolute truth is already a problem. In response, we
are improving transcription processes, but also considering unsupervised
ways to augment traditional testing.
Another obvious disadvantage of human transcription is
that the tests themselves have to be limited in size and type.
Even in a commercially successful research lab, getting extensive
tests across every combination of speaker and channel type,
recognition context, language, and time period is prohibitive.
But a detailed characterization of those types of variations could
help prioritize efforts. Similarly when tests are unsupervised,
it is easier to update development and evaluation sets to avoid
problems related to stale, over-fit tests.
This is mostly an empirical paper. The next section describes
some of the experiments we ran trying to assess our
existing human transcription accuracy. Then we describe the
generalizations of unsupervised discriminative training that enable
a new evaluation strategy. Next the paper includes evaluations
that show correlations between supervised and unsupervised
tests, and concludes with unsupervised tests that start to
characterize the non-stationary distribution of spoken data coming
through Google mobile applications.
2. Problems with human transcriptions
Recent efforts have begun to consider human transcription accuracy
in the context of increased efficiency. These studies have
generally shown that depending on the amount of effort, and the
task, individual word error rates can vary from 2-15% [3, 4]. Efficiency
pressures on human transcription can lead to transcription
noise and bias.
2.1. Early experiments
Over the last few years we have seen several simple experiments
not work: we have added matched data to our language models
and seen error rates get worse; we have added unsupervised
acoustic modeling data matched to a new fielded acoustic condition,
and seen the error rates on new matched tests go up,
but surprisingly, error rates on an old test, with slightly mismatched
conditions, go down.
For each of these, after tediously examining errors, we
found the problem was that we typically “seed” our transcription
process with the recognition result from the field. This is mostly
a matter of expedience: it is easier for the transcriber to hit return
than to type “home depot in palo alto california” yet again,
and it can improve reliability since retyping can be error prone.
But the power of the suggested transcription is also enough to
bias the transcribers into rubber-stamping some of the fielded
recognition results. When the transcriber rubber-stamps an error
we potentially get penalized twice. The baseline gets credit
where it should not, and a new system that corrects that error is
falsely penalized for adding an error.
The surprising improvement noted on the older, slightly
mis-matched test happened because the transcriptions for the
older test were seeded with transcriptions from an older system,
decorrelating some of the transcription bias with the current
baseline. In this case, transcription bias toward the baseline
model was a bigger effect than the change in acoustics.
2.2. Multiple attempts
To measure the human transcription accuracy more directly we
started sending the same data for multiple attempts at human
transcription, and we intentionally reduced the quality of our
starting seeds to move any bias away from our best systems.
For one test we sent 200K Voice Search utterances to be transcribed
twice. Ignoring trivial differences like spaces, apostrophes,
function words, and others, half of the transcripts agreed,
which implies a sentence transcription accuracy of 71%, assuming
independence of the attempts.
Similarly when we sent the remaining 100K utterances,
where transcriptions did not agree, back for two more attempts,
we were still left with about 10% of the original set with 4 distinct
human transcriptions. Again assuming independence, 10%
disagreement in 4 attempts is consistent with 68% accuracy for
each attempt. But we believe our system has a sentence accuracy
higher than 70%.
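The independence arithmetic behind these figures can be checked with a few lines of Python (a back-of-the-envelope sketch, assuming attempts are independent and that incorrect transcriptions never happen to coincide):

import math

# Two attempts: if transcripts only agree when both are correct,
# P(agree) ~ p**2, and 50% agreement gives p ~ sqrt(0.5) ~ 0.71.
p_two = math.sqrt(0.5)

# Four attempts: all four transcriptions are distinct when at most one
# attempt is correct, so P(4 distinct) = (1-p)**4 + 4*p*(1-p)**3.
# At p = 0.68 this is roughly 0.10, matching the 10% observed.
def p_four_distinct(p):
    return (1 - p) ** 4 + 4 * p * (1 - p) ** 3

print(round(p_two, 2), round(p_four_distinct(0.68), 2))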
Looking through the errors many of the problems are related
to cultural references, popular names, and businesses that
are not obvious to everyone. The cultural and geographic requirements
of the voice search task may be unusually difficult.
It combines short utterances and wide open semantic contexts
to generate surprisingly unfamiliar sounding speech. Finding
ways to bring the correct cultural context to the transcriber is
another obvious path to pursue.
3. Generalizing unsupervised
discriminative training
While some published results considered unsupervised maximum
likelihood estimation of model parameters [5], many systems
use unsupervised discriminative optimization, directly using
recognizer output as input [6]. Cynically we might ask what
we are learning if we are using the recognition result as truth for
discriminatively optimizing its parameters. It is hard to imagine
that we can fix the errors it makes, when we use the model to
generate truth.
But when we look into the details of commonly used discriminative
training techniques based on maximum mutual information,
we see that the LM used to generate competing hypotheses
is not the same LM used to generate truth. To improve
the generalization of discriminative training, we use a unigram
to describe the space of potential errors [7], but a trigram or
higher to give us transcription truth with unsupervised training.
One interpretation of unsupervised discriminative training
for acoustic models is that we are using the difference between
a weak unigram and a relatively stronger trigram to give us a
known improvement in relative truth. We do not know that the
strong-LM (trigram) result is absolutely correct, we only know
that it is better than the result with the weak LM (unigram).
When there is a difference, if we can move toward the results of
the strong-LM system by changing acoustic model parameters,
then we are building a more accurate AM, that also helps with
the final system using a stronger LM. With this interpretation,
the AM learns from the ‘truth gradient’ between the strong and
weak LMs.
3.1. Unsupervised AM testing
Extending unsupervised discriminative AM training to unsupervised
AM testing involves retesting the criterion used during
training in a new test context. More prescriptively, we sample a
new set of live data from production logs, and take the recognition
result from the fielded system using a strong AM and LM
as assumed truth. Then we re-recognize the same data using
multiple strong acoustic models and a weak LM. If one of the
systems using a weak LM can better approximate the system
using a strong LM, then at a minimum, we can say that it is doing
a better job of generalizing our training criteria to new data.
More directly, we have evidence that one of the strong acoustic
models could be more accurate than the rest.
For scoring we are assuming truth from the fielded system,
not a human transcriber. Therefore, when reporting unsupervised
testing results, we count traditional word error rates, but
because there is no human transcription, we report it as a word
difference rate (WDR), to highlight that, in the case of unsupervised
AM tests for example, it measures the word differences between
the systems with the strong and the weak LM.
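Computationally, WDR is just WER with the fielded strong-AM/strong-LM hypothesis standing in for the reference transcript. The following is a minimal sketch of that scoring, not the production scoring pipeline; the function names are illustrative:

def word_errors(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    d = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp_words, 1):
            cur = min(d[j] + 1,         # deletion
                      d[j - 1] + 1,     # insertion
                      prev + (r != h))  # substitution or match
            prev, d[j] = d[j], cur
    return d[-1]

def wdr(pseudo_truth, hypotheses):
    """Aggregate word difference rate in %, scoring weak-LM rerecognitions
    against the fielded strong-AM/strong-LM results used as pseudo-truth."""
    errors = sum(word_errors(t.split(), h.split())
                 for t, h in zip(pseudo_truth, hypotheses))
    words = sum(len(t.split()) for t in pseudo_truth)
    return 100.0 * errors / words

The same computation serves the LM tests of Section 3.2; only which system supplies the pseudo-truth and which supplies the hypotheses changes.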
3.2. Unsupervised LM testing
To use the same strategy for LM testing we reverse the roles
of the AM and the LM. For better generalization of discriminative
AM testing, we used a weak LM to generate more competing
alternates. That establishes a truth gradient that generally
changes around 1/3 of the words. The dual for LM testing is to
use a weak AM instead. To get a truth gradient of a similar magnitude
with our systems, we backed off to a context-dependent
acoustic model that uses around 1/10th the number of parameters
of our strong models, and only uses maximum likelihood
training.
Then as above, we test with multiple strong LMs and assume
that the LM that can move the results of the system using
the weak AM closest to the results of the production system
(with the strong AM), is the most accurate LM. With unsupervised
LM testing we again report WDR and not WER, where
the magnitude of the difference is now from the difference between
the strong AM and the weak AM.
3.3. Relative measures
In this paper we are ignoring the harder problem of measuring
absolute accuracy. Instead we focus on relative differences between
different acoustic or language models. Others have predicted
absolute error measures using statistics from the training
set as represented in the final acoustic models [8], without looking
at testing data. But here we are interested in estimating relative
performance across production data that was unseen during
training. Our goal is to assess whether new models or new approaches
are helping on new data, and whether the data might
be changing from the distributions used during training.
4. Correlating supervised and unsupervised
measures
First we show that the performance on unsupervised offline tests
for the AM and for the LM correlate with more traditional supervised
tests. Our production data started with primarily Voice
Search queries intended for google.com, but over time has included
increasing amounts of general Voice Input traffic which
includes a large fraction of short person-to-person messages. To
start the analyses, we consider these data streams separately.
For Voice Search, our traditional supervised test is built
from the 200K utterance set that we sent for multiple transcriptions.
For this test we exclude the 10% of the utterances where
we got 4 distinct human transcriptions and sample a test set randomly
from the remaining 90%. Similarly for the supervised
Voice Input test, we sent utterances twice and selected from the
utterances with at least 80% agreement between human transcriptions.
On the utterances where not all the words agreed,
we randomly chose one of the human transcriptions as truth.
This led to a test that excluded about 28% of the utterances.
Both of these supervised tests are biased in that they only
include the utterances that we could reliably transcribe. The
Voice Search test has 27K utterances and 87K words. The Voice
Input test has 49K utterances and 320K words.
For the first unsupervised tests here, we sampled production
logs for a single day of traffic. We found the median recognizer
confidence for each task and then randomly selected a few hundred
thousand utterances that were above median confidence for
each task. For all unsupervised experiments we used the recognition
results from the field as truth.
Our recognition configuration for both systems is fairly
standard and described in the literature. Specifically we use
a PLP front-end [11] together with LDA and STC [12], and
optimize our acoustic models using BMMI [13] on mostly unsupervised
data mixed from both tasks. Our language models
are n-grams, with Katz interpolation and entropy pruning, and
the fielded Voice Input system also includes dynamic interpolation
[14]. The Voice Search system used trigrams and the Voice
Input system included 4-grams.
4.1. AM experiments
The AM experiments use a weak LM (in this case a unigram)
for each task estimated from the few hundred thousand high
confidence utterances sampled for that day’s test. All the utterances
in the test were also used to train the LM, so there is no
OOV. This step is consistent with the matched unigram we train
for discriminative acoustic model training. For Voice Search,
the resulting unigram had 17K words, and for Voice Input there
were 18K unique words.
The acoustic models we tested here were trained using 11M
(mostly unsupervised) utterances from a mix of both tasks. The
parameter we vary for these experiments is the size of the acoustic
models. We use the same decision tree and context state definitions
for all models, but we vary the number of Gaussians
assigned to each state. Each model is trained with the same
number of iterations through all the data. The final model sizes
range from 100K to 1M Gaussians. Decoder parameters are
set in production mode, which generally means we lose around
0.5% absolute from the best possible accuracy to have faster
than real-time search.
# Gauss   Sup VS   Unsup VS   Sup VI   Unsup VI
100K       16.0      36.0      14.5      24.8
200K       15.3      34.4      13.6      22.8
340K       14.6      33.9      13.4      22.7
500K       14.3      33.3      13.2      22.3
1M         13.9      33.0      12.9      21.8

Table 1: WER in % on supervised (Sup) and WDR in % on unsupervised (Unsup) AM tests for Voice Search (VS) and Voice Input (VI).
4.2. LM experiments
For the LM experiments we vary the number of n-grams used
for the Voice Input task from around 2M to 30M by varying our
final entropy pruning threshold. Unlike the production system
used to generate truth for the unsupervised tests, for these tests
the LM is a static n-gram.
We show results with two different weak acoustic models
(A/B). Condition A is a context-dependent model estimated using
maximum likelihood criteria with 2 Gaussians per state for
a total of 16K Gaussians. Condition B uses a similar model with
a variable number of Gaussians across model states, and a total
of 40K Gaussians. On supervised tests, these weak acoustic
models have around two to three times the error rates of final
strong production models.
n-grams   Sup PPL   Sup WER   Unsup A/B WDR
1.9M        109       15.2      38.1 / 25.9
3.8M         98       14.4      36.8 / 24.5
7.6M         92       14.1      36.0 / 23.8
15M          87       13.9      35.5 / 23.2
30M          85       13.7      35.1 / 22.8

Table 2: Comparing supervised (Sup) and unsupervised (Unsup) LM tests for Voice Input. WER/WDR are in %, PPL is perplexity. Unsup A and B are for different sized AMs.
The relative improvement in both AM and LM experiments
is consistently around 10% for a 10x increase in model size.
Correlations between supervised and unsupervised tests range
between 0.98 and 0.99.
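These correlations can be reproduced directly from the table columns; for example, using the Voice Search numbers from Table 1 (a quick check, not the authors' analysis script):

import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

sup_vs   = [16.0, 15.3, 14.6, 14.3, 13.9]   # supervised WER, Table 1
unsup_vs = [36.0, 34.4, 33.9, 33.3, 33.0]   # unsupervised WDR, Table 1
print(round(pearson(sup_vs, unsup_vs), 2))  # ~0.98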
5. Additional experiments
Varying model size is a controlled way to generate accuracy
differences. Here we include additional unsupervised measurements
that show expected differences in the context of other AM
and LM modeling efforts.
5.1. CMLLR
To evaluate an implementation of constrained maximum likelihood
linear regression [9] for adaptation, we started by testing
with read speech corpora from several data collections [10] used
to initialize acoustic models in a new context. With large and
regular amounts of acoustic data per speaker, we see typical
improvements of 6-10% relative over a matched discriminative
baseline.
To estimate the accuracy impact of CMLLR on the production
system (where the actual distribution of the amount of data
per user is not imposed by the strict specifications of a data collection),
we used unsupervised testing. Here we sampled all personalized
users over a 30 day period, and measured the change
in WDR with a weak LM and either the production AM or the
production AM with CMLLR. Further we break the differences
in WDR down by the amount of data available for each speaker.
# Utts    No Adapt   Adapt
1-20        25.7      25.4
20-50       26.6      25.6
50-100      25.8      24.6
100-200     23.5      22.5

Table 3: WDR in % on adaptation tests. Input is binned by the number of utterances for a given user.
From the table, it is clear that we are seeing a similar relative
difference as we saw with more traditional read speech
tests, and we are further able to characterize the expected saturation
of the relatively small number of parameters in CMLLR
after around 20 voice input utterances.
5.2. LM update
At one point we updated our language model to include a rescoring
pass more explicitly matched to recent Voice Search queries.
By testing this update with recent unsupervised tests we are able
to show the expected win on new voice search type utterances.
Model Config   Sup VS   Unsup VS
Original        14.6      30.0
Updated         14.6      28.6

Table 4: WER in % on supervised (Sup) and WDR in % on unsupervised (Unsup) LM tests for Voice Search.
One interpretation of these results is that we are updating
the LM to better represent the recent query data which itself is
better matched to the recent unsupervised test. It also suggests
that the distribution of our data might be moving.
5.3. Estimating non-stationary distributions
Finally we ran two sweeps of AM tests to estimate how stationary
the acoustics for our system have been over the last 14
months. The first system is trained using the Voice Search supervised
data available at the beginning of the 14 months, and
the second uses only unsupervised data sampled from the last
3 months. Therefore, one model represents our initial estimate
of the distribution, and the other approximates a most recent
distribution. Both systems use around 350K Gaussians. To evaluate
the AM performance, we use a weak LM estimated from a
year’s worth of production data.
Figure 1: Change in WDR over time with two different AMs.
Both lines show that the distribution of the data has shifted
away from the original supervised data, and toward the recent
unsupervised data. Additional unsupervised tests will illuminate
the causes of this change in more detail. We currently suspect
an increase in the fraction of voice input recognition, but it
is already obvious that the distribution of the acoustics for this
data is changing. The plot also suggests that with a single AM
the change of WDR across conditions may also be informative.
Note that since we are generalizing from the same criteria
we used for AM training, and we are getting rid of some of
the necessity of human transcription, we are concerned about
converging away from reality. The ground is a little firmer for
the LM side, since our current LM processes are in fact not
yet learning from AM truth gradients the same way our unsupervised
AM training learns from LM truth gradients. From
the AM side, our current unsupervised tests are simply checking
whether the training optimizations extend to unseen data.
Pragmatically, because it is unsupervised we also have the opportunity
to test that generalization with a range of weak LMs
and with a range of input data, and thereby to increase our con-
fidence in the generalization. Moreover, reducing the accuracy
improvement provided by a strong LM seems like a safe requirement
to impose on AM training. But from an experiment
perspective, we have to remember what gradient we are exploiting
and not cheat. In other words, augmenting the AM with
features directly related to the strong LM would not lead to
improvements. We also monitor coarse signals related to application
use (counts of user actions in response to recognition
results) to give us additional complementary evidence of successful
generalization.
6. Conclusions
This paper extends unsupervised discriminative training to an
unsupervised testing strategy suitable for evaluating AM and
LM changes. We show strong correlations with traditional testing
strategies when we change AM or LM model size. We also
show expected gains on unsupervised measures with other types
of AM and LM changes, and use the unsupervised measures to
begin to characterize the stationarity of the input data to Google
mobile. Together with unsupervised training, unsupervised testing
enables development paths that no longer impose human
performance as the upper bound for accuracy.
7. References
[1] R. Lippmann, “Speech recognition by machines and humans,”
Speech Communication, July 1997.
[2] T. Kristjansson, J. Hershey, P. Olsen, S. Rennie, R. Gopinath,
“Super-Human Multi-Talker Speech Recognition: The IBM 2006
Speech Separation Challenge System,” Proc. ICSLP, 2006.
[3] S. Novotney, C. Callison-Burch, “Cheap, Fast and Good Enough:
Automatic Speech Recognition with Non-Expert Transcription,”
Proc. NAACL, 2010.
[4] A. Gruenstein, I. McGraw, A. Sutherland, “A Self-Transcribing
Speech Corpus: Collecting Continuous Speech with an Online
Educational Game,” Proc. SLaTE, 2009.
[5] J. Ma, R. Schwartz, “Unsupervised versus supervised training of
acoustic models,” Proc. ICSLP, 2008.
[6] L. Wang, M. Gales, P. Woodland, “Unsupervised Training for
Mandarin Broadcast News Conversation Transcription,” Proc.
ICASSP, 2007.
[7] P.C. Woodland, D. Povey, “Large scale discriminative training of
hidden Markov models for speech recognition,” Comp. Speech &
Lang., Jan. 2002.
[8] Y. Deng, M. Mahajan, A. Acero, “Estimating Speech Recognition
Error Rate without Acoustic Test Data,” Proc. Eurospeech, 2003.
[9] M. J. F. Gales, “Maximum likelihood linear transformations for
HMM-based speech recognition,” Comp. Speech & Lang., vol. 12,
no. 2, 1998.
[10] T. Hughes, K. Nakajima, L. Ha, A. Vasu, P. Moreno, M. LeBeau,
“Building transcribed speech corpora quickly and cheaply for
many languages,” Proc. ICSLP, 2010.
[11] H. Hermansky, “Perceptual linear predictive (PLP) analysis of
speech,” JASA, vol. 87, no. 4, 1990.
[12] M. Gales, “Semi-Tied Covariance Matrices for Hidden Markov
Models,” IEEE Trans. SAP, May 2000.
[13] D. Povey, D. Kanevsky, B. Kingsbury, B. Ramabhadran, G. Saon,
K. Visweswariah, “Boosted MMI for model and feature-space discriminative
training,” Proc. ICASSP, 2008.
[14] B. Ballinger, C. Allauzen, A. Gruenstein, J. Schalkwyk, “On-Demand
Language Model Interpolation for Mobile Speech Input,”
Proc. ICSLP, 2010.
Parallel Algorithms for Unsupervised Tagging
Sujith Ravi
Google
Mountain View, CA 94043
sravi@google.com
Sergei Vassilivitskii
Google
Mountain View, CA 94043
sergeiv@google.com
Vibhor Rastogi∗
Twitter
San Francisco, CA
vibhor.rastogi@gmail.com
Abstract
We propose a new method for unsupervised
tagging that finds minimal models which are
then further improved by Expectation Maximization
training. In contrast to previous
approaches that rely on manually specified
and multi-step heuristics for model minimization,
our approach is a simple greedy approximation
algorithm DMLC (DISTRIBUTED MINIMUM LABEL COVER)
that solves this
objective in a single step.
We extend the method and show how to efficiently
parallelize the algorithm on modern
parallel computing platforms while preserving
approximation guarantees. The new method
easily scales to large data and grammar sizes,
overcoming the memory bottleneck in previous
approaches. We demonstrate the power
of the new algorithm by evaluating on various
sequence labeling tasks: Part-of-Speech tagging
for multiple languages (including low-resource
languages), with complete and incomplete
dictionaries, and supertagging, a
complex sequence labeling task, where the
grammar size alone can grow to millions of
entries. Our results show that for all of these
settings, our method achieves state-of-the-art
scalable performance that yields high quality
tagging outputs.
1 Introduction
∗The research described herein was conducted while the author was working at Google.
Supervised sequence labeling with large labeled training datasets is considered a solved problem. For instance, state of the art systems obtain tagging accuracies
over 97% for part-of-speech (POS) tagging
on the English Penn Treebank. However, learning
accurate taggers without labeled data remains a challenge.
The accuracies quickly drop when faced with
data from a different domain, language, or when
there is very little labeled information available for
training (Banko and Moore, 2004).
Recently, there has been an increasing amount
of research tackling this problem using unsupervised
methods. A popular approach is to learn from
POS-tag dictionaries (Merialdo, 1994), where we
are given a raw word sequence and a dictionary of
legal tags for each word type. Learning from POS-tag
dictionaries is still challenging. Complete word-tag
dictionaries may not always be available for use
and in every setting. When they are available, the
dictionaries are often noisy, resulting in high tagging
ambiguity. Furthermore, when applying taggers
in new domains or different datasets, we may
encounter new words that are missing from the dictionary.
There have been some efforts to learn POS
taggers from incomplete dictionaries by extending
the dictionary to include these words using some
heuristics (Toutanova and Johnson, 2008) or using
other methods such as type-supervision (Garrette
and Baldridge, 2012).
In this work, we tackle the problem of unsupervised
sequence labeling using tag dictionaries. The
first reported work on this problem was on POS tagging
from Merialdo (1994). The approach involved
training a standard Hidden Markov Model (HMM)
using the Expectation Maximization (EM) algorithm
(Dempster et al., 1977), though EM does not perform well on this task (Johnson, 2007). More recent
methods have yielded better performance than
EM (see (Ravi and Knight, 2009) for an overview).
One interesting line of research introduced by
Ravi and Knight (2009) explores the idea of performing
model minimization followed by EM training
to learn taggers. Their idea is closely related
to the classic Minimum Description Length principle
for model selection (Barron et al., 1998). They
(1) formulate an objective function to find the smallest
model that explains the text (model minimization
step), and then, (2) fit the minimized model to the
data (EM step). For POS tagging, this method (Ravi
and Knight, 2009) yields the best performance to
date; 91.6% tagging accuracy on a standard test
dataset from the English Penn Treebank. The original
work from (Ravi and Knight, 2009) uses an integer
linear programming (ILP) formulation to find
minimal models, an approach which does not scale
to large datasets. Ravi et al. (2010b) introduced a
two-step greedy approximation to the original objective
function (called the MIN-GREEDY algorithm)
that runs much faster while maintaining the
high tagging performance. Garrette and Baldridge
(2012) showed how to use several heuristics to further
improve this algorithm (for instance, better
choice of tag bigrams when breaking ties) and stack
other techniques on top, such as careful initialization
of HMM emission models which results in further
performance gains. Their method also works under
incomplete dictionary scenarios and can be applied
to certain low-resource scenarios (Garrette and
Baldridge, 2013) by combining model minimization
with supervised training.
In this work, we propose a new scalable algorithm
for performing model minimization for this task. By
making an assumption on the structure of the solution,
we prove that a variant of the greedy set cover
algorithm always finds an approximately optimal label
set. This is in contrast to previous methods that
employ heuristic approaches with no guarantee on
the quality of the solution. In addition, we do not
have to rely on ad hoc tie-breaking procedures or
careful initializations for unknown words. Finally,
not only is the proposed method approximately optimal,
it is also easy to distribute, allowing it to easily
scale to very large datasets. We show empirically
that our method, combined with an EM training step
outperforms existing state of the art systems.
1.1 Our Contributions
• We present a new method, DISTRIBUTED
MINIMUM LABEL COVER, DMLC, for model
minimization that uses a fast, greedy algorithm
with formal approximation guarantees to the
quality of the solution.
• We show how to efficiently parallelize the algorithm
while preserving approximation guarantees.
In contrast, existing minimization approaches
cannot match the new distributed algorithm
when scaling from thousands to millions
or even billions of tokens.
• We show that our method easily scales to both
large data and grammar sizes, and does not require
the corpus or label set to fit into memory.
This allows us to tackle complex tagging tasks,
where the tagset consists of several thousand
labels, which results in more than one million
entries in the grammar.
• We demonstrate the power of the new
method by evaluating under several different
scenarios—POS tagging for multiple languages
(including low-resource languages),
with complete and incomplete dictionaries, as
well as a complex sequence labeling task of supertagging.
Our results show that for all these
settings, our method achieves state-of-the-art
performance, yielding high-quality tagging output.
2 Related Work
Recently, there has been an increasing amount of
research tackling this problem from multiple directions.
Some efforts have focused on inducing
POS tag clusters without any tags (Christodoulopoulos
et al., 2010; Reichart et al., 2010; Moon et
al., 2010), but evaluating such systems proves difficult
since it is not straightforward to map the cluster
labels onto gold standard tags. A more popular
approach is to learn from POS-tag dictionaries
(Merialdo, 1994; Ravi and Knight, 2009), incomplete
dictionaries (Hasan and Ng, 2009; Garrette and
Baldridge, 2012) and human-constructed dictionaries
(Goldberg et al., 2008).
Another direction that has been explored in the
past includes bootstrapping taggers for a new language
based on information acquired from other languages
(Das and Petrov, 2011) or limited annotation
resources (Garrette and Baldridge, 2013). Additional
work focused on building supervised taggers
for noisy domains such as Twitter (Gimpel et
al., 2011). While most of the relevant work in this
area centers on POS tagging, there has been some
work done for building taggers for more complex
sequence labeling tasks such as supertagging (Ravi
et al., 2010a).
Other related work includes alternative methods
for learning sparse models via priors in Bayesian inference
(Goldwater and Griffiths, 2007) and posterior
regularization (Ganchev et al., 2010). But these
methods only encourage sparsity and do not explicitly
seek to minimize the model size, which is the objective
function used in this work. Moreover, taggers
learned using model minimization have been shown
to produce state-of-the-art results for the problems
discussed here.
3 Model
Following Ravi and Knight (2009), we formulate the
problem as that of label selection on the sentence
graph. Formally, we are given a set of sequences,
$S = \{S_1, S_2, \ldots, S_n\}$, where each $S_i$ is a sequence of words, $S_i = w_{i1}, w_{i2}, \ldots, w_{i,|S_i|}$. With each word $w_{ij}$ we associate a set of possible tags $T_{ij}$. We will denote by $m$ the total number of (possibly duplicate) words (tokens) in the corpus.
Additionally, we define two special words $w_0$ and $w_\infty$ with special tags start and end, and consider the modified sequences $S'_i = w_0, S_i, w_\infty$. To simplify notation, we will refer to $w_\infty = w_{|S_i|+1}$. The sequence label problem asks us to select a valid tag $t_{ij} \in T_{ij}$ for each word $w_{ij}$ in the input to minimize a specific objective function.
We will refer to a tag pair $(t_{i,j-1}, t_{ij})$ as a label. Our aim is to minimize the number of distinct labels used to cover the full input. Formally, given a sequence $S'_i$ and a tag $t_{ij}$ for each word $w_{ij}$ in $S'_i$, let the induced set of labels for sequence $S'_i$ be
$$L_i = \bigcup_{j=1}^{|S_i|+1} \{(t_{i,j-1}, t_{ij})\}.$$
The total number of distinct labels used over all sequences is then
$$\phi = \Big|\bigcup_i L_i\Big| = \Big|\bigcup_i \bigcup_{j=1}^{|S_i|+1} \{(t_{i,j-1}, t_{ij})\}\Big|.$$
Note that the order of the tokens in the label makes a difference, as $\{(\text{NN}, \text{VP})\}$ and $\{(\text{VP}, \text{NN})\}$ are two distinct labels.
Now we can define the problem formally, following
(Ravi and Knight, 2009).
Problem 1 (Minimum Label Cover). Given a set S
of sequences of words, where each word wij has a
set of valid tags Tij , the problem is to find a valid tag
assignment tij ∈ Tij for each word that minimizes
the number of distinct labels or tag pairs over all
sequences, $\phi = \big|\bigcup_i \bigcup_{j=1}^{|S_i|+1} \{(t_{i,j-1}, t_{ij})\}\big|$.
The problem is closely related to the classical Set
Cover problem and is also NP-complete. To reduce
Set Cover to the label selection problem, map each element $i$ of the Set Cover instance to a single-word sentence $S_i = w_{i1}$, and let the valid tags $T_{i1}$ contain the names of the sets that contain element $i$. Consider a solution to the label selection problem; every sentence $S_i$ is covered by two labels $(w_0, k_i)$ and $(k_i, w_\infty)$, for some $k_i \in T_{i1}$, which corresponds to an element $i$ being covered by set $k_i$ in the Set Cover instance. Thus any valid solution to the label selection problem leads to a feasible solution to the Set Cover problem ($\{k_1, k_2, \ldots\}$) of exactly half the size.
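As a concrete illustration of this reduction, here is a toy instance written in Python (the universe, set names, and values are invented for illustration and are not from the paper):

# Toy Set Cover instance: universe {1, 2, 3}, sets named "A", "B", "C".
sets = {"A": {1, 2}, "B": {2, 3}, "C": {3}}

# Each element becomes a one-word sentence whose candidate tag set is the
# collection of set names containing that element.
sentences = [[{name for name, members in sets.items() if e in members}]
             for e in (1, 2, 3)]
# sentences == [[{'A'}], [{'A', 'B'}], [{'B', 'C'}]]

Any tag choice for the single word of sentence $S_i$ picks a set covering element $i$, so a minimum label cover over these sentences corresponds to a minimum set cover, up to the factor-of-two bookkeeping above.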
Finally, we will use {{. . .}} notation to denote a
multiset of elements, i.e. a set where an element may
appear multiple times.
4 Algorithm
In this Section, we describe the DISTRIBUTED MINIMUM LABEL COVER (DMLC) algorithm for
approximately solving the minimum label cover
problem. We describe the algorithm in a centralized
setting, and defer the distributed implementation
to Section 5. Before describing the algorithm,
we briefly explain the relationship of the minimum
label cover problem to set cover.
4.1 Modification of Set Cover
As we pointed out earlier, the minimum label cover
problem is at least as hard as the Set Cover problem.

1: Input: A set of sequences $S$ with each word $w_{ij}$ having possible tags $T_{ij}$.
2: Output: A tag assignment $t_{ij} \in T_{ij}$ for each word $w_{ij}$ approximately minimizing labels.
3: Let $M$ be the multiset of all possible labels generated by choosing each possible tag $t \in T_{ij}$:
      $M = \bigcup_i \bigcup_{j=1}^{|S_i|+1} \bigcup_{t' \in T_{i,j-1},\, t \in T_{ij}} \{\{(t', t)\}\}$    (1)
4: Let $L = \emptyset$ be the set of selected labels.
5: repeat
6:    Select the most frequent label not yet selected: $(t', t) = \arg\max_{(s', s) \notin L} |M \cap (s', s)|$.
7:    For each bigram $(w_{i,j-1}, w_{ij})$ where $t' \in T_{i,j-1}$ and $t \in T_{ij}$, tentatively assign $t'$ to $w_{i,j-1}$ and $t$ to $w_{ij}$. Add $(t', t)$ to $L$.
8:    If a word gets two assignments, select one at random with equal probability.
9:    If a bigram $(w_{ij}, w_{i,j+1})$ is consistent with the assignments in $(t, t')$, fix the tentative assignments, and set $T_{i,j-1} = \{t'\}$ and $T_{ij} = \{t\}$. Recompute $M$, the multiset of possible labels, with the updated $T_{i,j-1}$ and $T_{ij}$.
10: until there are no unassigned words
Algorithm 1: MLC Algorithm

Input: A set of sequences $S$ with each word $w_{ij}$ having possible tags $T_{ij}$.
Output: A tag assignment $t_{ij} \in T_{ij}$ for each word $w_{ij}$ approximately minimizing labels.
1: (Graph Creation) Initialize each vertex $v_{ij}$ with the set of possible tags $T_{ij}$ and its neighbors $v_{i,j+1}$ and $v_{i,j-1}$.
2: repeat
3:    (Message Passing) Each vertex $v_{ij}$ sends its possible tags $T_{ij}$ to its forward neighbor $v_{i,j+1}$.
4:    (Counter Update) Each vertex receives the tags $T_{i,j-1}$ and adds all possible labels $\{(s, s') \mid s \in T_{i,j-1}, s' \in T_{ij}\}$ to a global counter ($M$).
5:    (MaxLabel Selection) Each vertex queries the global counter $M$ to find the maximum label $(t, t')$.
6:    (Tentative Assignment) Each vertex $v_{ij}$ selects a tag tentatively as follows: if one of the tags $t$, $t'$ is in the feasible set $T_{ij}$, it tentatively selects that tag.
7:    (Random Assignment) If both are feasible it selects one at random. The vertex communicates its assignment to its neighbors.
8:    (Confirmed Assignment) Each vertex receives the tentative assignments from its neighbors. If together with its neighbors it can match the selected label, the assignment is finalized. If the assigned tag is $t$, then the vertex $v_{ij}$ sets the valid tag set $T_{ij}$ to $\{t\}$.
9: until no unassigned vertices exist.
Algorithm 2: DMLC Implementation

An additional challenge comes from the fact
that labels are tags for a pair of words, and hence
are related. For example, if we label a word pair
(wi,j−1, wij ) as (NN, VP), then the label for the next
word pair (wij , wi,j+1) has to be of the form (VP, *),
i.e., it has to start with VP.
Previous work (Ravi et al., 2010a; Ravi et al.,
2010b) recognized this challenge and employed two-phase
heuristic approaches. Eschewing heuristics,
we will show that with one natural assumption, even
with this extra set of constraints, the standard greedy
algorithm for this problem results in a solution with
a provable approximation ratio of O(log m). In
practice, however, the algorithm performs far better
than the worst case ratio, and similar to the work
of (Gomes et al., 2006), we find that the greedy
approach selects a cover approximately 11% worse
than the optimum solution.
4.2 MLC Algorithm
We present in Algorithm 1 our MINIMUM LABEL
COVER algorithm to approximately solve the minimum
label cover problem. The algorithm is simple,
efficient, and easy to distribute.
The algorithm chooses labels one at a time, selecting
a label that covers as many words as possible in every iteration. For this, it generates and maintains
a multi-set of all possible labels M (Step 3). The
multi-set contains an occurrence of each valid label,
for example, if wi,j−1 has two possible valid tags
NN and VP, and wij has one possible valid tag VP,
then M will contain two labels, namely (NN, VP)
and (VP, VP). Since M is a multi-set it will contain
duplicates, e.g. the label (NN, VP) will appear for
each adjacent pair of words that have NN and VP as
valid tags, respectively.
In each iteration, the algorithm picks a label with
the most number of occurrences in M and adds it to
the set of chosen labels (Step 6). Intuitively, this is
a greedy step to select a label that covers the most
number of word pairs.
Once the algorithm picks a label $(t', t)$, it tries to assign as many words to tags $t$ or $t'$ as possible (Step 7). A word can be assigned $t'$ if $t'$ is a valid tag for it and $t$ is a valid tag for the next word in the sequence. Similarly, a word can be assigned $t$ if $t$ is a valid tag for it and $t'$ is a valid tag for the previous word. Some words can get both assignments, in which case we choose one tentatively at random (Step 8). If a word's tentative random tag, say $t$, is consistent with the choices of its adjacent words (say $t'$ from the previous word), then the tentative choice is fixed as a permanent one. Whenever a tag is selected, the set of valid tags $T_{ij}$ for the word is reduced to a singleton $\{t\}$. Once the set of valid tags $T_{ij}$ changes, the multiset $M$ of all possible labels also changes, as seen from Eq. 1. The multiset is then recomputed (Step 9) and the iterations repeated until all of the words have been tagged.
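To make the loop concrete, here is a minimal, self-contained Python sketch of the greedy procedure just described. It is not the authors' implementation: sentences are assumed to arrive as per-token sets of candidate tags, boundary tokens and their start/end tags are added internally, conflict handling is simplified, and the label counts are restricted to bigrams that still contain an unresolved position so that the loop terminates.

import random
from collections import Counter

def greedy_label_cover(sentences, start="<s>", end="</s>"):
    """Greedy minimum-label-cover sketch: `sentences` is a list of lists of
    candidate-tag sets, one set per token; returns one tag per token."""
    seqs = [[{start}] + [set(tags) for tags in sent] + [{end}] for sent in sentences]

    def unresolved(seq, j):
        return len(seq[j]) > 1

    while True:
        # Multiset of feasible labels (Eq. 1), restricted to bigrams that
        # still contain an unresolved position.
        counts = Counter()
        for seq in seqs:
            for j in range(1, len(seq)):
                if unresolved(seq, j - 1) or unresolved(seq, j):
                    for t0 in seq[j - 1]:
                        for t1 in seq[j]:
                            counts[(t0, t1)] += 1
        if not counts:                                # every position resolved
            break
        (t0, t1), _ = counts.most_common(1)[0]        # greedy choice (Step 6)

        for seq in seqs:
            # Tentative assignments (Step 7); conflicts resolved at random (Step 8).
            tentative = {}
            for j in range(1, len(seq)):
                if t0 in seq[j - 1] and t1 in seq[j]:
                    for pos, tag in ((j - 1, t0), (j, t1)):
                        old = tentative.get(pos)
                        tentative[pos] = tag if old in (None, tag) else random.choice([old, tag])
            # Fix tentative tags that complete a consistent bigram (Step 9).
            for j in range(1, len(seq)):
                if tentative.get(j - 1) == t0 and tentative.get(j) == t1:
                    seq[j - 1], seq[j] = {t0}, {t1}

    return [[next(iter(tags)) for tags in seq[1:-1]] for seq in seqs]

This sketch recounts the labels from scratch each round and does not include the distributed counter, the singleton-set shortcut, or the graph shrinking described in Section 5.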
We can show that under a natural assumption this
simple algorithm is approximately optimal.
Assumption 1 (c-feasibility). Let c ≥ 1 be any number,
and k be the size of the optimal solution to the
original problem. In each iteration, the MLC algorithm
fixes the tags for some words. We say that the
algorithm is c-feasible, if after each iteration there
exists some solution to the remaining problem, consistent
with the chosen tags, with size at most ck .
The assumption encodes the fact that a single bad
greedy choice is not going to destroy the overall
structure of the solution, and a nearly optimal solution
remains. We note that this assumption of c-feasibility
is not only sufficient, as we will formally
show, but is also necessary. Indeed, without any assumptions,
once the algorithm fixes the tag for some
words, an optimal label may no longer be consistent
with the chosen tags, and it is not hard to find
contrived examples where the size of the optimal solution
doubles after each iteration of MLC.
Since the underlying problem is NP-complete, it
is computationally hard to give direct evidence verifying
the assumption on natural language inputs.
However, on small examples we are able to show
that the greedy algorithm is within a small constant
factor of the optimum, specifically it is within 11%
of the optimum model size for the POS tagging
problem using the standard 24k dataset (Ravi and
Knight, 2009). Combined with the fact that the final
method outperforms state of the art approaches, this
leads us to conclude that the structural assumption is
well justified.
Lemma 1. Under the assumption of c-feasibility,
the MLC algorithm achieves a O(c log m) approximation
to the minimum label cover problem, where
$m = \sum_i |S_i|$ is the total number of tokens.
Proof. To prove the Lemma we will define an objective function $\bar\phi$, counting the number of unlabeled word pairs, as a function of possible labels, and show that $\bar\phi$ decreases by a factor of $(1 - O(1/ck))$ at every iteration.
To define $\bar\phi$, we first define $\phi$, the number of labeled word pairs. Consider a particular set of labels, $L = \{L_1, L_2, \ldots, L_k\}$, where each label is a pair $(t_i, t_j)$. Call $\{t_{ij}\}$ a valid assignment of tokens if for each $w_{ij}$ we have $t_{ij} \in T_{ij}$. Then the score of $L$ under an assignment $t$, which we denote by $\phi_t$, is the number of bigram labels that appear in $L$. Formally, $\phi_t(L) = \big|\bigcup_{i,j} \{\{(t_{i,j-1}, t_{ij})\}\} \cap L\big|$. Finally, we define $\phi(L)$ to be the best such assignment, $\phi(L) = \max_t \phi_t(L)$, and $\bar\phi(L) = m - \phi(L)$ the number of uncovered labels.
Consider the label selected by the algorithm in every step. By the c-feasibility assumption, there exists some solution having $ck$ labels. Thus, some label from that solution covers at least a $1/ck$ fraction of the remaining words. The selected label $(t, t')$ maximizes the intersection with the remaining feasible labels. The conflict resolution step ensures that in expectation the realized benefit is at least half of the maximum, thereby reducing $\bar\phi$ to at most a $(1 - 1/(2ck))$ fraction of its previous value. Therefore, after $O(ck \log m)$ iterations all of the labels are covered.
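To spell out the final counting step (a brief added clarification, in the notation of the proof): since $\bar\phi_0 \le m$ and, in expectation, each iteration multiplies $\bar\phi$ by at most $(1 - 1/(2ck))$,
$$\bar\phi_T \;\le\; \Big(1 - \frac{1}{2ck}\Big)^{T} m \;\le\; e^{-T/(2ck)}\, m \;<\; 1 \quad\text{once } T > 2ck \ln m,$$
so after $T = O(ck \log m)$ iterations no word pair remains uncovered; as each iteration adds one label to the cover, the resulting cover has size $O(ck \log m)$, i.e., an $O(c \log m)$ approximation to the optimal size $k$.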
4.3 Fitting the Model Using EM
Once the greedy algorithm terminates and returns a
minimized grammar of tag bigrams, we follow the
approach of Ravi and Knight (2009) and fit the minimized
model to the data using the alternating EM
strategy.
In this step, we run an alternating optimization
procedure iteratively in phases. In each phase,
we initialize (and prune away) parameters within
the two HMM components (transition or emission
model) using the output from the previous phase.
We initialize this procedure by restricting the transition
parameters to only those tag bigrams selected
in the model minimization step. We train in conjunction
with the original emission model using EM
algorithm which prunes away some of the emission
parameters. In the next phase, we alternate the initialization
by choosing the pruned emission model
along with the original transition model (with full
set of tag bigrams) and retrain using EM. The alternating
EM iterations are terminated when the change
in the size of the observed grammar (i.e., the number
of unique bigrams in the tagging output) is ≤ 5%.¹
We refer to our entire approach using greedy minimization
followed by EM training as DMLC + EM.
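The alternating phases can be summarized with a schematic Python sketch (structure only, not the authors' code): `train_hmm_em` and `grammar_size` are hypothetical helpers standing in for standard EM training of the bigram HMM and for counting the unique tag bigrams in the tagging it produces.

def alternating_em(corpus, full_trans, full_emit, minimized_bigrams,
                   train_hmm_em, grammar_size, tol=0.05):
    # Phase 1 initialization: transitions restricted to the tag bigrams kept
    # by model minimization, together with the original emission model.
    trans = {bg: p for bg, p in full_trans.items() if bg in minimized_bigrams}
    emit = dict(full_emit)
    prev_size, phase = None, 1
    while True:
        trans, emit = train_hmm_em(corpus, trans, emit)   # EM also prunes parameters
        size = grammar_size(corpus, trans, emit)
        if prev_size is not None and abs(size - prev_size) <= tol * prev_size:
            return trans, emit                            # grammar changed by <= 5%
        prev_size, phase = size, phase + 1
        if phase % 2 == 0:
            # Even phase: keep the pruned emissions, reset to the full
            # (original) transition model and retrain.
            trans = dict(full_trans)
        else:
            # Odd phase: keep the pruned transitions, reset to the original
            # emission model and retrain.
            emit = dict(full_emit)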
5 Distributed Implementation
The DMLC algorithm is directly suited towards
parallelization across many machines. We turn to
Pregel (Malewicz et al., 2010), and its open source
version Giraph (Apa, 2013). In these systems the
computation proceeds in rounds. In every round, every
machine does some local processing and then
sends arbitrary messages to other machines. Semantically,
we think of the communication graph as
fixed, and in each round each vertex performs some
local computation and then sends messages to its
neighbors. This mode of parallel programming directs
the programmers to “Think like a vertex.”
The specific systems like Pregel and Giraph build infrastructure that ensures that the overall system is fault tolerant, efficient, and fast. In addition, they provide implementations of commonly used distributed data structures such as, for example, global counters. The programmer's job is simply to specify the code that each vertex will run at every round.
¹For more details on the alternating EM strategy and how initialization with minimized models improves EM performance in alternating iterations, refer to (Ravi and Knight, 2009).
We implemented the DMLC algorithm in Pregel.
The implementation is straightforward and given in
Algorithm 2. The multi-set M of Algorithm 1 is
represented as a global counter in Algorithm 2. The
message passing (Step 3) and counter update (Step
4) steps update this global counter and hence perform
the role of Step 3 of Algorithm 1. Step 5 selects
the label with largest count, which is equivalent
to the greedy label picking step 6 of Algorithm 1. Finally
steps 6, 7, and 8 update the tag assignment of
each vertex performing the roles of steps 7, 8, and 9,
respectively, of Algorithm 1.
5.1 Speeding up the Algorithm
The implementation described above directly copies
the sequential algorithm. Here we describe additional
steps we took to further improve the parallel
running times.
Singleton Sets: As the parallel algorithm proceeds,
the set of feasible tags associated with a node
slowly decreases. At some point there is only one
tag that a node can take on, however this tag is rare,
and so it takes a while for it to be selected using the
greedy strategy. Nevertheless, if a node and one of
its neighbors have only a single tag left, then it is safe to assign the unique label.² (²We must judiciously initialize the global counter to take care of this assignment, but this is easily accomplished.)
Modifying the Graph: As is often the case, the
bottleneck in parallel computations is the communication.
To reduce the amount of communication
we reduce the graph on the fly, removing nodes and
edges once they no longer play a role in the computation.
This simple modification decreases the communication
time in later rounds as the total size of
the problem shrinks.
6 Experiments and Results
In this Section, we describe the experimental setup
for various tasks and settings, and compare the empirical
performance of our method against several existing
baselines. The performance results for all systems
(on all tasks) are measured in terms of tagging accuracy,
i.e. % of tokens from the test corpus that were
labeled correctly by the system.
6.1 Part-of-Speech Tagging Task
6.1.1 Tagging Using a Complete Dictionary
Data: We use a standard test set (consisting of
24,115 word tokens from the Penn Treebank) for
the POS tagging task. The tagset consists of 45 distinct
tag labels and the dictionary contains 57,388
word/tag pairs derived from the entire Penn Treebank.
Per-token ambiguity for the test data is about
1.5 tags/token. In addition to the standard 24k
dataset, we also train and test on larger data sets—
973k tokens from the Penn Treebank, 3M tokens
from PTB+Europarl (Koehn, 2005) data.
Methods: We evaluate and compare performance
for POS tagging using four different methods that
employ the model minimization idea combined with
EM training:
• EM: Training a bigram HMM model using EM
algorithm (Merialdo, 1994).
• ILP + EM: Minimizing grammar size using
integer linear programming, followed by EM
training (Ravi and Knight, 2009).
• MIN-GREEDY + EM: Minimizing grammar
size using the two-step greedy method (Ravi et
al., 2010b).
• DMLC + EM: This work.
Results: Table 1 shows the results for POS tagging
on English Penn Treebank data. On the smaller
test datasets, all of the model minimization strategies
(methods 2, 3, 4) tend to perform equally well,
yielding state-of-the-art results and large improvement
over standard EM. When training (and testing)
on larger corpora sizes, DMLC yields the best reported
performance on this task to date. A major
advantage of the new method is that it can easily
scale to large corpora sizes and the distributed nature
of the algorithm still permits fast, efficient optimization
of the global objective function. So, unlike
the earlier methods (such as MIN-GREEDY) it
is fast enough to run on several millions of tokens
to yield additional performance gains (shown in last
column).
Speedups: We also observe a significant speedup
when using the parallelized version of the DMLC
algorithm. Performing model minimization on the
24k tokens dataset takes 55 seconds on a single machine,
whereas parallelization permits model minimization
to be feasible even on large datasets. Fig 1
shows the running time for DMLC when run on a
cluster of 100 machines. We vary the input data
size from 1M word tokens to about 8M word tokens,
while holding the resources constant. Both the algorithm
and its distributed implementation in DMLC
are linear time operations as evident by the plot.
In fact, for comparison, we also plot a straight line
passing through the first two runtimes. The straight
line essentially plots runtimes corresponding to a
linear speedup. DMLC clearly achieves better runtimes
showing even better than linear speedup. The
reason for this is that the distributed version has a constant
overhead for initialization, independent of the data size,
while the running time for the rest of the implementation
is linear in the data size. Thus, as the data
size becomes larger, the constant overhead becomes
less significant, and the distributed implementation
appears to complete slightly faster as data size increases.
Figure 1: Runtime vs. data size (measured in # of word
tokens) on 100 machines. For comparison, we also plot a
straight line passing through the first two runtimes. The
straight line essentially plots runtimes corresponding to a
linear speedup. DMLC clearly achieves better runtimes
showing a better than linear speedup.
6.1.2 Tagging Using Incomplete Dictionaries
We also evaluate our approach for POS tagging
under other resource-constrained scenarios.

                                                 Tagging accuracy (%)
Method                                       te=24k           te=973k
                                             tr=24k    tr=973k    tr=3.7M
1. EM                                         81.7       82.3
2. ILP + EM (Ravi and Knight, 2009)           91.6
3. MIN-GREEDY + EM (Ravi et al., 2010b)       91.6       87.1
4. DMLC + EM (this work)                      91.4       87.5       87.8

Table 1: Results for unsupervised part-of-speech tagging on English Penn Treebank dataset. Tagging accuracies for different methods are shown on multiple datasets. te shows the size (number of tokens) in the test data, tr represents the size of the raw text used to perform model minimization.

Obtaining a complete dictionary is often difficult, especially
for new domains. To verify the utility of our
method when the input dictionary is incomplete, we
evaluate against standard datasets used in previous
work (Garrette and Baldridge, 2012) and compare
against the previous best reported performance for
the same task. In all the experiments (described
here and in subsequent sections), we use the following
terminology—raw data refers to unlabeled
text used by different methods (for model minimization
or other unsupervised training procedures such
as EM), dictionary consists of word/tag entries that
are legal, and test refers to data over which tagging
evaluation is performed.
English Data: For English POS tagging with incomplete
dictionary, we evaluate on the Penn Treebank
(Marcus et al., 1993) data. Following (Garrette
and Baldridge, 2012), we extracted a word-tag dictionary
from sections 00-15 (751,059 tokens) consisting
of 39,087 word types, 45,331 word/tag entries,
a per-type ambiguity of 1.16 yielding a pertoken
ambiguity of 2.21 on the raw corpus (treating
unknown words as having all 45 possible tags). As
in their setup, we then use the first 47,996 tokens
of section 16 as raw data and perform final evaluation
on the sections 22-24. We use the raw corpus
along with the unlabeled test data to perform model
minimization and EM training. Unknown words are
allowed to have all possible tags in both these procedures.
Italian Data: The minimization strategy presented
here is a general-purpose method that does
not require any specific tuning and works for other
languages as well. To demonstrate this, we also perform
evaluation on a different language (Italian) using
the TUT corpus (Bosco et al., 2000). Following
(Garrette and Baldridge, 2012), we use the same
data splits as their setting. We take the first half of
each of the five sections to build the word-tag dictionary,
the next quarter as raw data and the last
quarter as test data. The dictionary was constructed
from 41,000 tokens comprising 7,814 word types,
8,370 word/tag pairs, a per-type ambiguity of 1.07, and
a per-token ambiguity of 1.41 on the raw data. The
raw data consisted of 18,574 tokens and the test contained
18,763 tokens. We use the unlabeled corpus
from the raw and test data to perform model minimization
followed by unsupervised EM training.
Other Languages: In order to test the effectiveness
of our method in other non-English settings, we
also report the performance of our method on several
other Indo-European languages using treebank
data from CoNLL-X and CoNLL-2007 shared tasks
on dependency parsing (Buchholz and Marsi, 2006;
Nivre et al., 2007). The corpus statistics for the five
languages (Danish, Greek, Italian, Portuguese and
Spanish) are listed below. For each language, we
construct a dictionary from the raw training data.
The unlabeled corpus from the raw training and test
data is used to perform model minimization followed
by unsupervised EM training. As before, unknown
words are allowed to have all possible tags.
We report the final tagging performance on the test
data and compare it to baseline EM.
Garrette and Baldridge (2012) treat unknown
words (words that appear in the raw text but are
missing from the dictionary) in a special manner and
use several heuristics to perform better initialization
for such words (for example, the probability that an
unknown word is associated with a particular tag is conditioned on the openness of the tag). They also
use an auto-supervision technique to smooth counts
learnt from EM onto new words encountered during
testing. In contrast, we do not apply any such
technique for unknown words and allow them to be
mapped uniformly to all possible tags in the dictionary.
For this particular set of experiments, the only
difference from the Garrette and Baldridge (2012)
setup is that we include unlabeled text from the test
data (but without any dictionary tag labels or special
heuristics) along with our existing word tokens from raw text
for performing model minimization. This is a standard
practice used in unsupervised training scenarios
(for example, Bayesian inference methods) and
in general for scalable techniques where the goal is
to perform inference on the same data for which one
wishes to produce some structured prediction.
Language      Train (tokens)   Dict (entries)   Test (tokens)
DANISH        94386            18797            5852
GREEK         65419            12894            4804
ITALIAN       71199            14934            5096
PORTUGUESE    206678           30053            5867
SPANISH       89334            17176            5694
Results: Table 2 (column 2) compares previously
reported results against our approach for English.
We observe that our method obtains a substantial improvement
over standard EM and achieves results comparable
to the previous best reported scores for the same task
from Garrette and Baldridge (2012). It is encouraging
to note that the new system achieves this performance
without using any of the carefully-chosen
heuristics employed by the previous method. However,
we do note that some of these techniques can
be easily combined with our method to produce further
improvements.
Table 2 (column 3) also shows results on Italian
POS tagging. We observe that our method
achieves significant improvements in tagging accuracy
over all the baseline systems including the previous
best system (+2.9%). This demonstrates that
the method generalizes well to other languages and
produces consistent tagging improvements over existing
methods for the same task.
Results for POS tagging on CoNLL data in five different languages are displayed in Figure 2. Note that the proportion of raw data in test versus train (from the standard CoNLL shared tasks) is much smaller compared to the earlier experimental settings.

Figure 2: Part-of-Speech tagging accuracy for different languages on CoNLL data using incomplete dictionaries (EM vs. DMLC+EM: Danish 77.8 vs. 79.4, Greek 65.6 vs. 66.3, Italian 82.0 vs. 84.6, Portuguese 78.5 vs. 80.1, Spanish 81.3 vs. 83.1).
In general, we observe that adding more raw
data for EM training improves the tagging quality
(same trend observed earlier in Table 1: column 2
versus column 3). Despite this, DMLC + EM still
achieves significant improvements over the baseline
EM system on multiple languages (as shown in Figure
2). An additional advantage of the new method
is that it can easily scale to larger corpora and it produces
a much more compact grammar that can be
efficiently incorporated for EM training.
6.1.3 Tagging for Low-Resource Languages
Learning part-of-speech taggers for severely low-resource
languages (e.g., Malagasy) is very challenging.
In addition to scarce (token-supervised)
labeled resources, the tag dictionaries available
for training taggers are tiny compared to
other languages such as English. Garrette and
Baldridge (2013) combine various supervised and
semi-supervised learning algorithms into a common
POS tagger training pipeline to address some of
these challenges. They also report tagging accuracy
improvements on low-resource languages when using
the combined system over any single algorithm.
Their system has four main parts, in order: (1) Tag
dictionary expansion using label propagation algorithm,
(2) Weighted model minimization, (3) Expectation
maximization (EM) training of HMMs using
auto-supervision, (4) MaxEnt Markov Model
(MEMM) training. The entire procedure results in
a trained tagger model that can then be applied to
tag any raw data.3 Step 2 in this procedure involves
3 For more details, refer to Garrette and Baldridge (2013).
"We consider a model of repeated online auctions in which an ad with an uncertain click-through rate faces a random distribution of competing bids in each auction and there is discounting of payoffs. We formulate the optimal solution to this explore/exploit problem as a dynamic programming problem and show that efficiency is maximized by making a bid for each advertiser equal to the advertiser's expected value for the advertising opportunity plus a term proportional to the variance in this value divided by the number of impressions the advertiser has received thus far. We then use this result to illustrate that the value of incorporating active exploration into a machine learning system in an auction environment is exceedingly small."
Accepted for publication in the Annals of Applied Statistics (in press), 09/2014
INFERRING CAUSAL IMPACT USING BAYESIAN
STRUCTURAL TIME-SERIES MODELS
By Kay H. Brodersen, Fabian Gallusser, Jim Koehler,
Nicolas Remy, and Steven L. Scott
Google, Inc.
E-mail: kbrodersen@google.com
Abstract An important problem in econometrics and marketing
is to infer the causal impact that a designed market intervention has
exerted on an outcome metric over time. This paper proposes to infer
causal impact on the basis of a diffusion-regression state-space
model that predicts the counterfactual market response in a synthetic
control that would have occurred had no intervention taken
place. In contrast to classical difference-in-differences schemes, state-space
models make it possible to (i) infer the temporal evolution of
attributable impact, (ii) incorporate empirical priors on the parameters
in a fully Bayesian treatment, and (iii) flexibly accommodate
multiple sources of variation, including local trends, seasonality, and
the time-varying influence of contemporaneous covariates. Using a
Markov chain Monte Carlo algorithm for posterior inference, we illustrate
the statistical properties of our approach on simulated data.
We then demonstrate its practical utility by estimating the causal
effect of an online advertising campaign on search-related site visits.
We discuss the strengths and limitations of state-space models
in enabling causal attribution in those settings where a randomised
experiment is unavailable. The CausalImpact R package provides an
implementation of our approach.
1. Introduction. This article proposes an approach to inferring the
causal impact of a market intervention, such as a new product launch or the
onset of an advertising campaign. Our method generalizes the widely used
‘difference-in-differences’ approach to the time-series setting by explicitly
modelling the counterfactual of a time series observed both before and after
the intervention. It improves on existing methods in two respects: it provides
a fully Bayesian time-series estimate for the effect; and it uses model averaging
to construct the most appropriate synthetic control for modelling the
counterfactual. The CausalImpact R package provides an implementation
of our approach (http://google.github.io/CausalImpact/).
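For readers who want to experiment with the method directly, a minimal usage sketch of the CausalImpact R package is shown below; the simulated data, column layout, and period indices are illustrative choices, not values from the analyses reported in this paper.

```r
# Minimal illustration of the CausalImpact R package referenced above
# (assumes the package is installed; data and period indices are made up).
library(CausalImpact)

set.seed(1)
x1 <- 100 + arima.sim(model = list(ar = 0.999), n = 100)  # one control series
y  <- 1.2 * x1 + rnorm(100)                               # response that tracks the control
y[71:100] <- y[71:100] + 10                               # inject a lift after time point 70
data <- cbind(y, x1)                                      # response first, controls afterwards

pre.period  <- c(1, 70)     # pre-intervention (training) period
post.period <- c(71, 100)   # post-intervention (counterfactual forecasting) period

impact <- CausalImpact(data, pre.period, post.period)
summary(impact)             # pointwise and cumulative effect estimates with intervals
plot(impact)                # panels analogous to Figure 1: fit, pointwise, cumulative impact
```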
Inferring the impact of market interventions is an important and timely
problem. Partly because of recent interest in ‘big data,’ many firms have
begun to understand that a competitive advantage can be had by systematically
using impact measures to inform strategic decision making. An example
is the use of ‘A/B experiments’ to identify the most effective market
treatments for the purpose of allocating resources (Danaher and Rust, 1996;
Seggie, Cavusgil and Phelan, 2007; Leeflang et al., 2009; Stewart, 2009).
Here, we focus on measuring the impact of a discrete marketing event,
such as the release of a new product, the introduction of a new feature, or
the beginning or end of an advertising campaign, with the aim of measuring
the event’s impact on a response metric of interest (e.g., sales). The causal
impact of a treatment is the difference between the observed value of the
response and the (unobserved) value that would have been obtained under
the alternative treatment, i.e., the effect of treatment on the treated (Rubin,
1974; Hitchcock, 2004; Morgan and Winship, 2007; Rubin, 2007; Cox
and Wermuth, 2001; Heckman and Vytlacil, 2007; Antonakis et al., 2010;
Kleinberg and Hripcsak, 2011; Hoover, 2012; Claveau, 2012). In the present
setting the response variable is a time series, so the causal effect of interest
is the difference between the observed series and the series that would have
been observed had the intervention not taken place.
A powerful approach to constructing the counterfactual is based on the
idea of combining a set of candidate predictor variables into a single ‘synthetic
control’ (Abadie and Gardeazabal, 2003; Abadie, Diamond and Hainmueller,
2010). Broadly speaking, there are three sources of information
available for constructing an adequate synthetic control. The first is the
time-series behaviour of the response itself, prior to the intervention. The
second is the behaviour of other time series that were predictive of the target
series prior to the intervention. Such control series can be based, for
example, on the same product in a different region that did not receive the
intervention, or on a metric that reflects activity in the industry as a whole.
In practice, there are often many such series available, and the challenge is to
pick the relevant subset to use as contemporaneous controls. This selection
is done on the pre-treatment portion of potential controls; but their value
for predicting the counterfactual lies in their post-treatment behaviour. As
long as the control series received no intervention themselves, it is often reasonable
to assume that the relationship between the treatment and the control
series that existed prior to the intervention continues afterwards. Thus,
a plausible estimate of the counterfactual time series can be computed up
to the point in time where the relationship between treatment and controls
can no longer be assumed to be stationary, e.g., because one of the controls
received treatment itself. In a Bayesian framework, a third source of
information for inferring the counterfactual is the available prior knowledge
about the model parameters, as elicited, for example, by previous studies.
We combine the three preceding sources of information using a state-space
time-series model, where one component of state is a linear regression
on the contemporaneous predictors. The framework of our model allows us
to choose from among a large set of potential controls by placing a spikeand-slab
prior on the set of regression coefficients, and by allowing the model
to average over the set of controls (George and McCulloch, 1997). We then
compute the posterior distribution of the counterfactual time series given
the value of the target series in the pre-intervention period, along with the
values of the controls in the post-intervention period. Subtracting the predicted
from the observed response during the post-intervention period gives
a semiparametric Bayesian posterior distribution for the causal effect (Figure
1).
Related work. As with other domains, causal inference in marketing requires
subtlety. Marketing data are often observational and rarely follow the
ideal of a randomised design. They typically exhibit a low signal-to-noise
ratio. They are subject to multiple seasonal variations, and they are often
confounded by the effects of unobserved variables and their interactions
(for recent examples, see Seggie, Cavusgil and Phelan, 2007; Stewart, 2009;
Leeflang et al., 2009; Takada and Bass, 1998; Chan et al., 2010; Lewis and
Reiley, 2011; Lewis, Rao and Reiley, 2011; Vaver and Koehler, 2011, 2012).
Rigorous causal inferences can be obtained through randomised experiments,
which are often implemented in the form of geo experiments (Vaver
and Koehler, 2011, 2012). Many market interventions, however, fail to satisfy
the requirements of such approaches. For instance, advertising campaigns
are frequently launched across multiple channels, online and offline,
which precludes measurement of individual exposure. Campaigns are often
targeted at an entire country, and one country only, which prohibits the use
of geographic controls within that country. Likewise, a campaign might be
launched in several countries but at different points in time. Thus, while a
large control group may be available, the treatment group often consists of
no more than one region, or a few regions with considerable heterogeneity
among them.
A standard approach to causal inference in such settings is based on a
linear model of the observed outcomes in the treatment and control group
before and after the intervention. One can then estimate the difference between
(i) the pre-post difference in the treatment group and (ii) the pre-post
difference in the control group. The assumption underlying such difference-in-differences
(DD) designs is that the level of the control group provides
an adequate proxy for the level that would have been observed in the treatment
group in the absence of treatment (see Lester, 1946; Campbell, Stanley
and Gage, 1963; Ashenfelter and Card, 1985; Card and Krueger, 1993; Angrist
and Krueger, 1999; Athey and Imbens, 2002; Abadie, 2005; Meyer,
1995; Shadish, Cook and Campbell, 2002; Donald and Lang, 2007; Angrist
and Pischke, 2008; Robinson, McNulty and Krasno, 2009; Antonakis et al.,
2010).

Figure 1. Inferring causal impact through counterfactual predictions. (a) Simulated trajectory
of a treated market (Y) with an intervention beginning in January 2014. Two other
markets (X1, X2) were not subject to the intervention and allow us to construct a synthetic
control (cf. Abadie and Gardeazabal, 2003; Abadie, Diamond and Hainmueller, 2010). Inverting
the state-space model described in the main text yields a prediction of what would
have happened in Y had the intervention not taken place (posterior predictive expectation
of the counterfactual with pointwise 95% posterior probability intervals). (b) The difference
between observed data and counterfactual predictions is the inferred causal impact of
the intervention. Here, predictions accurately reflect the true (Gamma-shaped) impact. A
key characteristic of the inferred impact series is the progressive widening of the posterior
intervals (shaded area). This effect emerges naturally from the model structure and agrees
with the intuition that predictions should become increasingly uncertain as we look further
and further into the (retrospective) future. (c) Another way of visualizing posterior inferences
is by means of a cumulative impact plot. It shows, for each day, the summed effect
up to that day. Here, the 95% credible interval of the cumulative impact crosses the zero line
about five months after the intervention, at which point we would no longer declare a
significant overall effect.
DD designs have been limited in three ways. First, DD is traditionally
based on a static regression model that assumes i.i.d. data despite the
fact that the design has a temporal component. When fit to serially correlated
data, static models yield overoptimistic inferences with too narrow
uncertainty intervals (see also Solon, 1984; Hansen, 2007a,b; Bertrand, Duflo
and Mullainathan, 2002). Second, most DD analyses only consider two time
points: before and after the intervention. In practice, the manner in which
an effect evolves over time, especially its onset and decay structure, is often
a key question.
Third, when DD analyses are based on time series, previous studies have
imposed restrictions on the way in which a synthetic control is constructed
from a set of predictor variables, which is something we wish to avoid.
For example, one strategy (Abadie and Gardeazabal, 2003; Abadie, Diamond
and Hainmueller, 2010) has been to choose a convex combination
$(w_1, \ldots, w_J)$, $w_j \ge 0$, $\sum_j w_j = 1$, of $J$ predictor time series in such a way that
a vector of pre-treatment variables (not time series) X1 characterising the
treated unit before the intervention is matched most closely by the combination
of pre-treatment variables X0 of the control units w.r.t. a vector of
importance weights (v1, . . . , vJ ). These weights are themselves determined in
such a way that the combination of pre-treatment outcome time series of the
control units most closely matches the pre-treatment outcome time series of
the treated unit. Such a scheme relies on the availability of interpretable
characteristics (e.g., growth predictors), and it precludes non-convex combinations
of controls when constructing the weight vector W. We prefer to
select a combination of control series without reference to external characteristics
and purely in terms of how well they explain the pre-treatment outcome
time series of the treated unit (while automatically balancing goodness
of fit and model complexity through the use of regularizing priors). Another
idea (Belloni et al., 2013) has been to use classical variable-selection methods
(such as the Lasso) to find a sparse set of predictors. This approach,
however, ignores posterior uncertainty about both which predictors to use
and their coefficients.
The limitations of DD schemes can be addressed by using state-space
models, coupled with highly flexible regression components, to explain the
temporal evolution of an observed outcome. State-space models distinguish
between a state equation that describes the transition of a set of latent variables
from one time point to the next and an observation equation that
specifies how a given system state translates into measurements. This distinction
makes them extremely flexible and powerful (see Leeflang et al.,
2009, for a discussion in the context of marketing research).
The approach described in this paper inherits three main characteristics
from the state-space paradigm. First, it allows us to flexibly accommodate
different kinds of assumptions about the latent state and emission processes
underlying the observed data, including local trends and seasonality. Second,
we use a fully Bayesian approach to inferring the temporal evolution of
counterfactual activity and incremental impact. One advantage of this is the
flexibility with which posterior inferences can be summarised. Third, we use
a regression component that precludes a rigid commitment to a particular set
of controls by integrating out our posterior uncertainty about the influence
of each predictor as well as our uncertainty about which predictors to include
in the first place, which avoids overfitting.
The remainder of this paper is organised as follows. Section 2 describes
the proposed model, its design variations, the choice of diffuse empirical
priors on hyperparameters, and a stochastic algorithm for posterior inference
based on Markov chain Monte Carlo (MCMC). Section 3 demonstrates
important features of the model using simulated data, followed by an application
in Section 4 to an advertising campaign run by one of Google’s
advertisers. Section 5 puts our approach into context and discusses its scope
of application.
2. Bayesian structural time-series models. Structural time-series
models are state-space models for time-series data. They can be defined in
terms of a pair of equations
$$y_t = Z_t^\top \alpha_t + \epsilon_t, \qquad\qquad (2.1)$$
$$\alpha_{t+1} = T_t \alpha_t + R_t \eta_t, \qquad\qquad (2.2)$$
where $\epsilon_t \sim \mathcal{N}(0, \sigma_t^2)$ and $\eta_t \sim \mathcal{N}(0, Q_t)$ are independent of all other unknowns.
Equation (2.1) is the observation equation; it links the observed
data $y_t$ to a latent $d$-dimensional state vector $\alpha_t$. Equation (2.2) is the state
equation; it governs the evolution of the state vector $\alpha_t$ through time.
In the present paper, $y_t$ is a scalar observation, $Z_t$ is a $d$-dimensional output
vector, $T_t$ is a $d \times d$ transition matrix, $R_t$ is a $d \times q$ control matrix, $\epsilon_t$ is
a scalar observation error with noise variance $\sigma_t^2$, and $\eta_t$ is a $q$-dimensional
system error with a $q \times q$ state-diffusion matrix $Q_t$, where $q \le d$. Writing
the error structure of equation (2.2) as $R_t \eta_t$ allows us to incorporate state
components of less than full rank; a model for seasonality will be the most
important example.
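To make the notation concrete, the following R sketch forward-simulates data from the pair of equations (2.1)-(2.2) for user-supplied system matrices; the particular matrices shown (a local linear trend observed through its level) are an illustrative choice rather than a model prescribed by the paper.

```r
# Forward simulation of y_t = Z' a_t + eps_t and a_{t+1} = T a_t + R eta_t (eqs 2.1-2.2).
# The system matrices below encode a local linear trend observed through its level;
# they are an illustrative choice, not a recommendation.
simulate_sts <- function(m, Z, Tt, R, Q, sigma_obs, a0) {
  d <- length(a0); q <- ncol(R)
  alpha <- matrix(NA, m, d); y <- numeric(m)
  a <- a0
  for (t in 1:m) {
    alpha[t, ] <- a
    y[t] <- sum(Z * a) + rnorm(1, sd = sigma_obs)   # observation equation (2.1)
    eta  <- rnorm(q, sd = sqrt(diag(Q)))            # independent system errors
    a    <- as.vector(Tt %*% a + R %*% eta)         # state equation (2.2)
  }
  list(y = y, alpha = alpha)
}

Z  <- c(1, 0)                        # observe the level component only
Tt <- matrix(c(1, 0, 1, 1), 2, 2)    # level_{t+1} = level_t + slope_t; slope follows a random walk
R  <- diag(2)
Q  <- diag(c(0.1, 0.01)^2)           # state-diffusion variances
sim <- simulate_sts(m = 200, Z, Tt, R, Q, sigma_obs = 0.5, a0 = c(0, 0.1))
```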
Structural time-series models are useful in practice because they are flexible
and modular. They are flexible in the sense that a very large class of
models, including all ARIMA models, can be written in the state-space form
given by (2.1) and (2.2). They are modular in the sense that the latent state
as well as the associated model matrices $Z_t$, $T_t$, $R_t$, and $Q_t$ can be assembled
from a library of component sub-models to capture important features of the
data. There are several widely used state-component models for capturing
the trend, seasonality, or effects of holidays.
A common approach is to assume the errors of different state-component
models to be independent (i.e., $Q_t$ is block-diagonal). The vector $\alpha_t$ can then
be formed by concatenating the individual state components, while $T_t$ and
$R_t$ become block-diagonal matrices.
The most important state component for the applications considered in
this paper is a regression component that allows us to obtain counterfactual
predictions by constructing a synthetic control based on a combination of
markets that were not treated. Observed responses from such markets are
important because they allow us to explain variance components in the
treated market that are not readily captured by more generic seasonal submodels.
This approach assumes that covariates are unaffected by the effects of
treatment. For example, an advertising campaign run in the United States
might spill over to Canada or the United Kingdom. When assuming the
absence of spill-over effects, the use of such indirectly affected markets as
controls would lead to pessimistic inferences; that is, the effect of the campaign
would be underestimated (cf. Meyer, 1995).
2.1. Components of state.
Local linear trend. The first component of our model is a local linear trend,
defined by the pair of equations
$$\mu_{t+1} = \mu_t + \delta_t + \eta_{\mu,t}, \qquad \delta_{t+1} = \delta_t + \eta_{\delta,t}, \qquad\qquad (2.3)$$
where $\eta_{\mu,t} \sim \mathcal{N}(0, \sigma_\mu^2)$ and $\eta_{\delta,t} \sim \mathcal{N}(0, \sigma_\delta^2)$. The $\mu_t$ component is the value of
the trend at time $t$. The $\delta_t$ component is the expected increase in $\mu$ between
times $t$ and $t+1$, so it can be thought of as the slope at time $t$.
The local linear trend model is a popular choice for modelling trends because
it quickly adapts to local variation, which is desirable when making
short-term predictions. This degree of flexibility may not be desired when
making longer-term predictions, as such predictions often come with implausibly
wide uncertainty intervals.
There is a generalization of the local linear trend model where the slope
exhibits stationarity instead of obeying a random walk. This model can be
written as
$$\mu_{t+1} = \mu_t + \delta_t + \eta_{\mu,t}, \qquad \delta_{t+1} = D + \rho(\delta_t - D) + \eta_{\delta,t}, \qquad\qquad (2.4)$$
where the two components of η are independent. In this model, the slope of
the time trend exhibits AR(1) variation around a long-term slope of D. The
parameter |ρ| < 1 represents the learning rate at which the local trend is
updated. Thus, the model balances short-term information with information
from the distant past.
Seasonality. There are several commonly used state-component models to
capture seasonality. The most frequently used model in the time domain is
$$\gamma_{t+1} = -\sum_{s=0}^{S-2} \gamma_{t-s} + \eta_{\gamma,t}, \qquad\qquad (2.5)$$
where $S$ represents the number of seasons, and $\gamma_t$ denotes their joint contribution
to the observed response $y_t$. The state in this model consists of the $S-1$ most
recent seasonal effects, but the error term is a scalar, so the evolution equation
for this state model is less than full rank. The mean of $\gamma_{t+1}$ is such that the
total seasonal effect is zero when summed over $S$ seasons. For example, if we set
$S = 4$ to capture four seasons per year, the mean of the winter coefficient will be
$-1 \times (\text{spring} + \text{summer} + \text{autumn})$. The part of the transition matrix $T_t$
representing the seasonal model is an $(S-1) \times (S-1)$ matrix with $-1$'s along the
top row, $1$'s along the subdiagonal, and $0$'s elsewhere.
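As a concrete illustration, a small R helper (our own sketch, not part of the paper's implementation) that builds this seasonal transition block might look as follows.

```r
# Transition block for the seasonal component (2.5): an (S-1) x (S-1) matrix with
# -1's along the top row, 1's along the subdiagonal, and 0's elsewhere.
seasonal_transition <- function(S) {
  d <- S - 1
  Tt <- matrix(0, d, d)
  Tt[1, ] <- -1                                 # new effect = minus the sum of the last S-1 effects
  if (d > 1) Tt[cbind(2:d, 1:(d - 1))] <- 1     # shift the remaining seasonal effects down by one
  Tt
}

seasonal_transition(4)   # S = 4 seasons per year gives a 3 x 3 block
#      [,1] [,2] [,3]
# [1,]   -1   -1   -1
# [2,]    1    0    0
# [3,]    0    1    0
```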
The preceding seasonal model can be generalized to allow for multiple
seasonal components with different periods. When modelling daily data, for
example, we might wish to allow for an S = 7 day-of-week effect, as well
as an S = 52 weekly annual cycle. The latter can be handled by setting
Tt = IS−1, with zero variance on the error term, when t is not the start
of a new week, and setting Tt to the usual seasonal transition matrix, with
nonzero error variance, when t is the start of a new week.
Contemporaneous covariates with static coefficients. Control time series
that received no treatment are critical to our method for obtaining accurate
counterfactual predictions since they account for variance components
that are shared by the series, including in particular the effects of other
unobserved causes otherwise unaccounted for by the model. A natural way
of including control series in the model is through a linear regression. Its
coefficients can be static or time-varying.
A static regression can be written in state-space form by setting $Z_t = \beta^\top x_t$
and $\alpha_t = 1$. One advantage of working in a fully Bayesian treatment is
that we do not need to commit to a fixed set of covariates. The spike-and-slab
prior described in Section 2.2 allows us to integrate out our posterior
uncertainty about which covariates to include and how strongly they should
influence our predictions, which avoids overfitting.
All covariates are assumed to be contemporaneous; the present model does
not infer a potential lag between treated and untreated time series. A
known lag, however, can be easily incorporated by shifting the corresponding
regressor in time.
Contemporaneous covariates with dynamic coefficients. An alternative to
the above is a regression component with dynamic regression coefficients to
account for time-varying relationships (e.g., Banerjee, Kauffman and Wang,
2007; West and Harrison, 1997). Given covariates $j = 1, \ldots, J$, this introduces
the dynamic regression component
$$x_t^\top \beta_t = \sum_{j=1}^{J} x_{j,t} \beta_{j,t}, \qquad \beta_{j,t+1} = \beta_{j,t} + \eta_{\beta,j,t}, \qquad\qquad (2.6)$$
where $\eta_{\beta,j,t} \sim \mathcal{N}(0, \sigma_{\beta_j}^2)$. Here, $\beta_{j,t}$ is the coefficient for the $j$th control series
and $\sigma_{\beta_j}$ is the standard deviation of its associated random walk. We can write the
dynamic regression component in state-space form by setting $Z_t = x_t$ and $\alpha_t = \beta_t$
and by setting the corresponding part of the transition matrix to $T_t = I_{J \times J}$,
with $Q_t = \operatorname{diag}(\sigma_{\beta_j}^2)$.
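For intuition, the following R sketch simulates the dynamic regression component in (2.6) for a handful of hypothetical control series; the dimensions and random-walk standard deviations are made up.

```r
# Dynamic regression component (2.6): coefficients follow independent random walks.
# Dimensions and standard deviations below are invented for illustration.
set.seed(1)
m <- 100; J <- 3
x        <- matrix(rnorm(m * J), m, J)                              # hypothetical control series
sigma_bj <- c(0.05, 0.02, 0.01)                                     # random-walk sd per coefficient
beta     <- sapply(sigma_bj, function(s) cumsum(rnorm(m, sd = s)))  # m x J matrix of random walks
contribution <- rowSums(x * beta)                                   # x_t' beta_t, the term entering (2.1)
```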
Assembling the state-space model. Structural time-series models allow us to
examine the time series at hand and flexibly choose appropriate components
for trend, seasonality, and either static or dynamic regression for the controls.
The presence or absence of seasonality, for example, will usually be obvious
by inspection. A more subtle question is whether to choose static or dynamic
regression coefficients.
When the relationship between controls and treated unit has been stable
in the past, static coefficients are an attractive option. This is because
a spike-and-slab prior can be implemented efficiently within a forward-
filtering, backward-sampling framework. This makes it possible to quickly
identify a sparse set of covariates even from tens or hundreds of potential
variables (Scott and Varian, 2013). Local variability in the treated time series
is captured by the dynamic local level or dynamic linear trend component.
Covariate stability is typically high when the available covariates are close in
nature to the treated metric. The empirical analyses presented in this paper,
for example, will be based on a static regression component (Section 4). This
choice provides a reasonable compromise between capturing local behaviour
and accounting for regression effects.
An alternative would be to use dynamic regression coefficients, as we
do, for instance, in our analyses of simulated data (Section 3). Dynamic
coefficients are useful when the linear relationship between treated metrics
and controls is believed to change over time. There are a number of ways
of reducing the computational burden of dealing with a potentially large
number of dynamic coefficients. One option is to resort to dynamic latent
factors, where one uses $x_t = B u_t + \nu_t$ with $\dim(u_t) \ll J$ and uses $u_t$ instead
of $x_t$ as part of $Z_t$ in (2.1), coupled with an AR-type model for $u_t$ itself.
Another option is latent thresholding regression, where one uses a dynamic
version of the spike-and-slab prior as in Nakajima and West (2013).
The state-component models are assembled independently, with each component
providing an additive contribution to $y_t$. Figure 2 illustrates this process
assuming a local linear trend paired with a static regression component.
2.2. Prior distributions and prior elicitation. Let θ generically denote
the set of all model parameters and let α = (α1, . . . , αm) denote the full
state sequence. We adopt a Bayesian approach to inference by specifying
a prior distribution p(θ) on the model parameters as well as a distribution
p(α0|θ) on the initial state values. We may then sample from p(α, θ|y) using
MCMC.
Most of the models in Section 2.1 depend solely on a small set of variance
parameters that govern the diffusion of the individual state components. A
typical prior distribution for such a variance is
$$\frac{1}{\sigma^2} \sim \mathcal{G}\left(\frac{\nu}{2}, \frac{s}{2}\right), \qquad\qquad (2.7)$$
where $\mathcal{G}(a, b)$ is the Gamma distribution with expectation $a/b$. The prior
parameters can be interpreted as a prior sum of squares $s$, so that $s/\nu$ is
a prior estimate of $\sigma^2$, and $\nu$ is the weight, in units of prior sample size,
assigned to the prior estimate.
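As a quick check on this parameterisation, the R snippet below (an illustrative sketch) draws sigma-squared from the prior in equation (2.7) and shows how s/nu acts as the prior guess of the variance, with nu playing the role of a prior sample size.

```r
# Draws of sigma^2 under the prior 1/sigma^2 ~ Gamma(nu/2, s/2) of equation (2.7).
draw_sigma2 <- function(n, nu, s) {
  1 / rgamma(n, shape = nu / 2, rate = s / 2)
}

nu <- 1; s <- 0.01 * nu              # weak prior whose prior guess for sigma^2 is s/nu = 0.01
draws <- draw_sigma2(10000, nu, s)
median(draws)                        # on the order of s/nu (the mean need not exist for small nu)
```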
Figure 2. Graphical model for the static-regression variant of the proposed state-space
model. Observed market activity $y_{1:n} = (y_1, \ldots, y_n)$ is modelled as the result of a latent
state plus Gaussian observation noise with error standard deviation $\sigma_y$. The state $\alpha_t$ includes
a local level $\mu_t$, a local linear trend $\delta_t$, and a set of contemporaneous covariates $x_t$,
scaled by regression coefficients $\beta_\varrho$. State components are assumed to evolve according to
independent Gaussian random walks with fixed standard deviations $\sigma_\mu$ and $\sigma_\delta$ (conditional-dependence
arrows shown for the first time point only). The model includes empirical priors
on these parameters and the initial states. In an alternative formulation, the regression
coefficients $\beta$ are themselves subject to random-walk diffusion (see main text). Of principal
interest is the posterior predictive density over the unobserved counterfactual responses
$\tilde{y}_{n+1}, \ldots, \tilde{y}_m$. Subtracting these from the actual observed data $y_{n+1}, \ldots, y_m$ yields a probability
density over the temporal evolution of causal impact.
We often have a weak default prior belief that the incremental errors in
the state process are small, which we can formalize by choosing small values
of ν (e.g., 1) and small values of s/ν. The notion of ‘small’ means different
things in different models; for the seasonal and local linear trend models our
default priors are $1/\sigma^2 \sim \mathcal{G}(10^{-2}, 10^{-2} s_y^2)$, where $s_y^2 = \sum_t (y_t - \bar{y})^2 / (n-1)$
is the sample variance of the target series. Scaling by the sample variance
is a minor violation of the Bayesian paradigm, but it is an effective means
of choosing a reasonable scale for the prior. It is similar to the popular
technique of scaling the data prior to analysis, but we prefer to do the
scaling in the prior so we can model the data on its original scale.
When faced with many potential controls, we prefer letting the model
choose an appropriate set. This can be achieved by placing a spike-andslab
prior over coefficients (George and McCulloch, 1993, 1997; Polson and
Scott, 2011; Scott and Varian, 2013). A spike-and-slab prior combines point
mass at zero (the ‘spike’), for an unknown subset of zero coefficients, with
a weakly informative distribution on the complementary set of non-zero
coefficients (the ‘slab’). Contrary to what its name might suggest, the ‘slab’
is usually not completely flat, but rather a Gaussian with a large variance.
Let $\varrho = (\varrho_1, \ldots, \varrho_J)$, where $\varrho_j = 1$ if $\beta_j \neq 0$ and $\varrho_j = 0$ otherwise. Let $\beta_\varrho$
denote the non-zero elements of the vector $\beta$ and let $\Sigma_\varrho^{-1}$ denote the rows
and columns of $\Sigma^{-1}$ corresponding to non-zero entries in $\varrho$. We can then
factorize the spike-and-slab prior as
$$p(\varrho, \beta, 1/\sigma^2) = p(\varrho)\, p(\sigma^2 \mid \varrho)\, p(\beta_\varrho \mid \varrho, \sigma^2). \qquad\qquad (2.8)$$
The spike portion of (2.8) can be an arbitrary distribution over $\{0, 1\}^J$ in
principle; the most common choice in practice is a product of independent
Bernoulli distributions,
$$p(\varrho) = \prod_{j=1}^{J} \pi_j^{\varrho_j} (1 - \pi_j)^{1 - \varrho_j}, \qquad\qquad (2.9)$$
where πj is the prior probability of regressor j being included in the model.
Values for πj can be elicited by asking about the expected model size M,
and then setting all πj = M/J. An alternative is to use a more specific set
of values πj . In particular, one might choose to set certain πj to either 1
or 0 to force the corresponding variables into or out of the model. Generally,
framing the prior in terms of expected model size has the advantage that the
model can adapt to growing numbers of predictor variables without having
to switch to a hierarchical prior (Scott and Berger, 2010).
For the ‘slab’ portion of the prior we use a conjugate normal-inverse
Gamma distribution,
$$\beta_\varrho \mid \sigma^2 \sim \mathcal{N}\left(b_\varrho, \sigma^2 (\Sigma_\varrho^{-1})^{-1}\right), \qquad\qquad (2.10)$$
$$\frac{1}{\sigma^2} \sim \mathcal{G}\left(\frac{\nu}{2}, \frac{s}{2}\right). \qquad\qquad (2.11)$$
The vector $b$ in equation (2.10) encodes our prior expectation about the
value of each element of $\beta$. In practice, we usually set $b = 0$. The prior
parameters in equation (2.11) can be elicited by asking about the expected
$R^2 \in [0, 1]$ as well as the number of observations worth of weight $\nu$ the prior
estimate should be given. Then $s = \nu (1 - R^2) s_y^2$.
The final prior parameter in (2.10) is $\Sigma^{-1}$ which, up to a scaling factor, is
the prior precision over $\beta$ in the full model, with all variables included. The
total information in the covariates is $X^\top X$, and so $\frac{1}{n} X^\top X$ is the average
information in a single observation. Zellner's g-prior (Zellner, 1986; Chipman
et al., 2001; Liang et al., 2008) sets $\Sigma^{-1} = \frac{g}{n} X^\top X$, so that $g$ can be
interpreted as $g$ observations worth of information. Zellner's prior becomes
improper when $X^\top X$ is not positive definite; we therefore ensure propriety
by averaging $X^\top X$ with its diagonal,
$$\Sigma^{-1} = \frac{g}{n} \left\{ w\, X^\top X + (1 - w)\, \operatorname{diag}\left(X^\top X\right) \right\} \qquad\qquad (2.12)$$
with default values of g = 1 and w = 1/2. Overall, this prior specification
provides a broadly useful default while providing considerable flexibility in
those cases where more specific prior information is available.
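The following R sketch (illustrative, with simulated covariates) computes the prior precision of equation (2.12) by averaging X'X with its diagonal, using the default values g = 1 and w = 1/2.

```r
# Prior full-model precision from equation (2.12):
#   Sigma^{-1} = (g/n) * ( w * X'X + (1 - w) * diag(X'X) )
prior_precision <- function(X, g = 1, w = 0.5) {
  n   <- nrow(X)
  XtX <- crossprod(X)                           # X'X
  (g / n) * (w * XtX + (1 - w) * diag(diag(XtX)))
}

set.seed(1)
X <- matrix(rnorm(100 * 5), 100, 5)             # n = 100 observations, J = 5 candidate controls
Sigma_inv <- prior_precision(X)                 # proper even when X'X is not positive definite
```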
2.3. Inference. Posterior inference in our model can be broken down into
three pieces. First, we simulate draws of the model parameters θ and the
state vector α given the observed data y1:n in the training period. Second,
we use the posterior simulations to simulate from the posterior predictive
distribution p(y˜n+1:m|y1:n) over the counterfactual time series y˜n+1:m given
the observed pre-intervention activity y1:n. Third, we use the posterior predictive
samples to compute the posterior distribution of the pointwise impact
yt−y˜t for each t = 1, . . . , m. We use the same samples to obtain the posterior
distribution of cumulative impact.
Posterior simulation. We use a Gibbs sampler to simulate a sequence
$(\theta, \alpha)^{(1)}, (\theta, \alpha)^{(2)}, \ldots$ from a Markov chain whose stationary distribution is
$p(\theta, \alpha \mid y_{1:n})$. The sampler alternates between a data-augmentation step that
simulates from $p(\alpha \mid y_{1:n}, \theta)$ and a parameter-simulation step that simulates
from $p(\theta \mid y_{1:n}, \alpha)$.
The data-augmentation step uses the posterior simulation algorithm from
Durbin and Koopman (2002), providing an improvement over the earlier
forward-filtering, backward-sampling algorithms by Carter and Kohn (1994),
Frühwirth-Schnatter (1994), and de Jong and Shephard (1995). In brief, because
$p(y_{1:n}, \alpha \mid \theta)$ is jointly multivariate normal, the variance of $p(\alpha \mid y_{1:n}, \theta)$
does not depend on $y_{1:n}$. We can therefore simulate $(y_{1:n}^*, \alpha^*) \sim p(y_{1:n}, \alpha \mid \theta)$
and subtract $\mathrm{E}(\alpha^* \mid y_{1:n}^*, \theta)$ to obtain zero-mean noise with the correct variance.
Adding $\mathrm{E}(\alpha \mid y_{1:n}, \theta)$ restores the correct mean, which completes the
draw. The required expectations can be computed using the Kalman filter
and a fast mean smoother described in detail by Durbin and Koopman
(2002). The result is a direct simulation from p(α|y1:n, θ) in an algorithm
that is linear in the total (pre- and post-intervention) number of time points
(m), and quadratic in the dimension of the state space (d).
Given the draw of the state, the parameter draw is straightforward for
all state components other than the static regression coefficients β. All state
components that exclusively depend on variance parameters can translate
their draws back to error terms $\eta_t$, accumulate sums of squares of $\eta$, and because
of conjugacy with equation (2.7) the posterior distribution will remain
Gamma distributed.
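Concretely, for a single state component with recovered errors eta_1, ..., eta_n and prior (2.7), the conjugate update has the standard Gamma form sketched below in R (our illustration, not the paper's code).

```r
# Conjugate update for one state-component variance: with prior 1/sigma^2 ~ Gamma(nu/2, s/2)
# and recovered errors eta_1, ..., eta_n, the posterior is Gamma((nu + n)/2, (s + sum(eta^2))/2).
draw_component_variance <- function(eta, nu, s) {
  shape <- (nu + length(eta)) / 2
  rate  <- (s + sum(eta^2)) / 2
  1 / rgamma(1, shape = shape, rate = rate)     # one posterior draw of sigma^2
}

eta <- rnorm(200, sd = 0.3)                     # error terms recovered from the state draw
draw_component_variance(eta, nu = 1, s = 0.01)
```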
The draw of the static regression coefficients β proceeds as follows. For
each $t = 1, \ldots, n$ in the pre-intervention period, let $\dot{y}_t$ denote $y_t$ with the
contributions from the other state components subtracted away, and let
$\dot{y}_{1:n} = (\dot{y}_1, \ldots, \dot{y}_n)$. The challenge is to simulate from $p(\varrho, \beta, \sigma^2 \mid \dot{y}_{1:n})$, which
we can factor into $p(\varrho \mid \dot{y}_{1:n})\, p(1/\sigma^2 \mid \varrho, \dot{y}_{1:n})\, p(\beta \mid \varrho, \sigma, \dot{y}_{1:n})$. Because of conjugacy,
we can integrate out $\beta$ and $1/\sigma^2$ and be left with
$$\varrho \mid \dot{y}_{1:n} \sim C(\dot{y}_{1:n})\, \frac{|\Sigma_\varrho^{-1}|^{1/2}}{|V_\varrho^{-1}|^{1/2}}\, \frac{p(\varrho)}{S_\varrho^{N/2 - 1}}, \qquad\qquad (2.13)$$
where $C(\dot{y}_{1:n})$ is an unknown normalizing constant. The sufficient statistics
in equation (2.13) are
$$V_\varrho^{-1} = \left(X^\top X\right)_\varrho + \Sigma_\varrho^{-1}, \qquad \tilde{\beta}_\varrho = (V_\varrho^{-1})^{-1} \left(X_\varrho^\top \dot{y}_{1:n} + \Sigma_\varrho^{-1} b_\varrho\right),$$
$$N = \nu + n, \qquad S_\varrho = s + \dot{y}_{1:n}^\top \dot{y}_{1:n} + b_\varrho^\top \Sigma_\varrho^{-1} b_\varrho - \tilde{\beta}_\varrho^\top V_\varrho^{-1} \tilde{\beta}_\varrho.$$
To sample from (2.13), we use a Gibbs sampler that draws each $\varrho_j$ given
all other $\varrho_{-j}$. Each full conditional is easy to evaluate because $\varrho_j$ can only
assume two possible values. It should be noted that the dimension of all
matrices in (2.13) is $\sum_j \varrho_j$, which is small if the model is truly sparse. There
are many matrices to manipulate, but because each is small the overall
algorithm is fast. Once the draw of $\varrho$ is complete, we sample directly from
$p(\beta, 1/\sigma^2 \mid \varrho, \dot{y}_{1:n})$ using standard conjugate formulae. For an alternative that
may be even more computationally efficient, see Ghosh and Clyde (2011).
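These sufficient statistics translate directly into code; the R sketch below (illustrative, with made-up data and a fixed inclusion vector) evaluates the quantities of equation (2.13) for one configuration of included coefficients.

```r
# Sufficient statistics behind the collapsed draw (2.13), evaluated for one inclusion
# vector rho. All inputs are invented for illustration.
suff_stats <- function(rho, X, ydot, Sigma_inv, b, nu, s) {
  idx    <- which(rho == 1)
  Xg     <- X[, idx, drop = FALSE]
  Sg_inv <- Sigma_inv[idx, idx, drop = FALSE]
  bg     <- b[idx]
  Vinv   <- crossprod(Xg) + Sg_inv                        # V^{-1} = (X'X)_rho + Sigma_rho^{-1}
  beta_tilde <- solve(Vinv, crossprod(Xg, ydot) + Sg_inv %*% bg)
  N <- nu + length(ydot)
  S <- s + sum(ydot^2) + t(bg) %*% Sg_inv %*% bg - t(beta_tilde) %*% Vinv %*% beta_tilde
  list(Vinv = Vinv, beta_tilde = drop(beta_tilde), N = N, S = drop(S))
}

set.seed(1)
n <- 50; J <- 4
X    <- matrix(rnorm(n * J), n, J)
ydot <- rnorm(n)                                          # response with other state components removed
suff_stats(rho = c(1, 0, 1, 0), X, ydot,
           Sigma_inv = diag(J), b = rep(0, J), nu = 1, s = 0.01)
```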
Posterior predictive simulation. While the posterior over model parameters
and states p(θ, α|y1:n) can be of interest in its own right, causal impact
analyses are primarily concerned with the posterior incremental effect,
$$p\left(\tilde{y}_{n+1:m} \mid y_{1:n}, x_{1:m}\right). \qquad\qquad (2.14)$$
As shown by its indices, the density in equation (2.14) is defined precisely
for that portion of the time series which is unobserved: the counterfactual
market response $\tilde{y}_{n+1}, \ldots, \tilde{y}_m$ that would have been observed in the treated
market, after the intervention, in the absence of treatment.
It is also worth emphasizing that the density is conditional on the observed
data (as well as the priors) and only on these, i.e., on activity in
the treatment market before the beginning of the intervention as well as
activity in all control markets both before and during the intervention. The
density is not conditioned on parameter estimates or the inclusion or exclusion
of covariates with static regression coefficients, all of which have been
integrated out. Thus, through Bayesian model averaging, we neither commit
to any particular set of covariates, which helps avoid an arbitrary selection;
nor to point estimates of their coefficients, which prevents overfitting.
The posterior predictive density in (2.14) is defined as a coherent (joint)
distribution over all counterfactual data points, rather than as a collection of
pointwise univariate distributions. This ensures that we correctly propagate
the serial structure determined on pre-intervention data to the trajectory
of counterfactuals. This is crucial, in particular, when forming summary
statistics, such as the cumulative effect of the intervention on the treatment
market.
Posterior inference was implemented in C++ with an R interface. Given
a typically-sized dataset with m = 500 time points, J = 10 covariates, and
10,000 iterations (see Section 4 for an example), this implementation takes
less than 30 seconds to complete on a standard computer, enabling near-interactive
analyses.
2.4. Evaluating impact. Samples from the posterior predictive distribution
over counterfactual activity can be readily used to obtain samples from
the posterior causal effect, i.e., the quantity we are typically interested in.
For each draw τ and for each time point t = n + 1, . . . , m, we set
$$\phi_t^{(\tau)} := y_t - \tilde{y}_t^{(\tau)}, \qquad\qquad (2.15)$$
yielding samples from the approximate posterior predictive density of the
effect attributed to the intervention.
In addition to its pointwise impact, we often wish to understand the
cumulative effect of an intervention over time. One of the main advantages
of a sampling approach to posterior inference is the flexibility and ease with
which such derived inferences can be obtained. Reusing the impact samples
obtained in (2.15), we compute for each draw τ
$$\sum_{t'=n+1}^{t} \phi_{t'}^{(\tau)} \qquad \forall t = n+1, \ldots, m. \qquad\qquad (2.16)$$
The preceding cumulative sum of causal increments is a useful quantity
when y represents a flow quantity, measured over an interval of time (e.g., a
day), such as the number of searches, sign-ups, sales, additional installs, or
new users. It becomes uninterpretable when y represents a stock quantity,
usefully defined only for a point in time, such as the total number of clients,
users, or subscribers. In this case we might instead choose, for each τ , to draw
a sample of the posterior running average effect following the intervention,
$$\frac{1}{t-n} \sum_{t'=n+1}^{t} \phi_{t'}^{(\tau)} \qquad \forall t = n+1, \ldots, m. \qquad\qquad (2.17)$$
Unlike the cumulative effect in (2.16), the running average is always interpretable,
regardless of whether it refers to a flow or a stock. However,
it depends more strongly on the length of the post-intervention period
under consideration. In particular, under the assumption of a true impact
that grows quickly at first and then declines to zero, the cumulative impact
approaches its true total value (in expectation) as we increase the counterfactual
forecasting period, whereas the average impact will eventually
approach zero (while, in contrast, the probability intervals diverge in both
cases, leading to more and more uncertain inferences as the forecasting period
increases).
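Given a matrix of posterior predictive draws of the counterfactual, the pointwise effect (2.15), the cumulative effect (2.16), and the running average (2.17) are simple transformations; the R sketch below (with simulated draws standing in for real MCMC output) also applies the decision rule used later in Section 3, namely checking whether the central 95% interval of the cumulative effect excludes zero.

```r
# Impact summaries from posterior predictive draws of the counterfactual.
# ytilde: draws x time matrix for t = n+1, ..., m; y_post: observed post-period values.
impact_summaries <- function(y_post, ytilde) {
  phi <- sweep(-ytilde, 2, y_post, "+")          # phi_t^(tau) = y_t - ytilde_t^(tau)    (2.15)
  cum <- t(apply(phi, 1, cumsum))                # cumulative effect per draw            (2.16)
  avg <- sweep(cum, 2, seq_len(ncol(phi)), "/")  # running average effect per draw       (2.17)
  ci  <- quantile(cum[, ncol(cum)], c(0.025, 0.975))
  list(pointwise = phi, cumulative = cum, running_average = avg,
       effect_detected = (ci[1] > 0 || ci[2] < 0))  # central 95% interval excludes zero?
}

set.seed(1)
draws <- 1000; horizon <- 30
ytilde <- matrix(rnorm(draws * horizon, mean = 100), draws, horizon)  # fake counterfactual draws
y_post <- rnorm(horizon, mean = 103)                                  # observed series with a small lift
out <- impact_summaries(y_post, ytilde)
```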
3. Application to simulated data. To study the characteristics of
our approach, we analysed simulated (i.e., computer-generated) data across a
series of independent simulations. Generated time series started on 1 January
2013 and ended on 30 June 2014, with a perturbation beginning on 1 January
2014. The data were simulated using a dynamic regression component with
two covariates whose coefficients evolved according to independent random
walks, $\beta_t \sim \mathcal{N}(\beta_{t-1}, 0.01^2)$, initialized at $\beta_0 = 1$. The covariates themselves
were simple sinusoids with wavelengths of 90 days and 360 days, respectively.
Figure 3. Adequacy of posterior uncertainty. (a) Example of one of the 256 datasets
created to assess estimation accuracy. Simulated observations (black) are based on two
contemporaneous covariates, scaled by time-varying coefficients plus a time-varying local
level (not shown). During the campaign period, where the data are lifted by an effect size of
10%, the plot shows the posterior expectation of counterfactual activity (blue), along with
its pointwise central 95% credible intervals (blue shaded area), and, for comparison, the
true counterfactual (green). (b) Power curve. Following repeated application of the model
to simulated data, the plot shows the empirical frequency of concluding that a causal effect
was present, as a function of true effect size, given a post-intervention period of 6 months.
The curve represents sensitivity in those parts of the graph where the true effect size is
positive, and 1 − specificity where the true effect size is zero. Error bars represent 95%
credible intervals for the true sensitivity, using a uniform Beta(1, 1) prior. (c) Interval
coverage. Using an effect size of 10%, the plot shows the proportion of simulations in which
the pointwise central 95% credible interval contained the true impact, as a function of
campaign duration. Intervals should contain ground truth in 95% of simulations, regardless
of how much uncertainty is associated with the model's predictions. Error bars represent 95% credible
intervals.
The latent state underlying the observed data was generated using a local
level that evolved according to a random walk, $\mu_t \sim \mathcal{N}(\mu_{t-1}, 0.1^2)$, initialized
at $\mu_0 = 0$. Independent observation noise was sampled using $\epsilon_t \sim \mathcal{N}(0, 0.1^2)$.
In summary, observations $y_t$ were generated using
$$y_t = \beta_{t,1} z_{t,1} + \beta_{t,2} z_{t,2} + \mu_t + \epsilon_t.$$
To simulate the effect of advertising, the post-intervention portion of the
preceding series was multiplied by $1 + e$, where $e$ (not to be confused with $\epsilon$)
represented the true effect size specifying the (uniform) relative lift during
the campaign period. An example is shown in Figure 3a.
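The data-generating process just described can be reproduced in a few lines of R; the sketch below follows the stated settings (sinusoidal covariates, random-walk coefficients and local level, observation noise, and a uniform relative lift), but it is our own reconstruction for illustration rather than the authors' simulation code.

```r
# Reconstruction of the simulated data-generating process described above (illustrative).
set.seed(1)
dates <- seq(as.Date("2013-01-01"), as.Date("2014-06-30"), by = "day")
m     <- length(dates)
post  <- dates >= as.Date("2014-01-01")             # campaign (post-intervention) period

z1 <- sin(2 * pi * seq_len(m) / 90)                 # sinusoidal covariates with 90- and
z2 <- sin(2 * pi * seq_len(m) / 360)                # 360-day wavelengths
beta1 <- cumsum(c(1, rnorm(m - 1, sd = 0.01)))      # random-walk coefficients, beta_0 = 1
beta2 <- cumsum(c(1, rnorm(m - 1, sd = 0.01)))
mu    <- cumsum(c(0, rnorm(m - 1, sd = 0.1)))       # local level random walk, mu_0 = 0
eps   <- rnorm(m, sd = 0.1)                         # observation noise

y <- beta1 * z1 + beta2 * z2 + mu + eps
e <- 0.10                                           # true relative effect size
y[post] <- y[post] * (1 + e)                        # uniform relative lift during the campaign
```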
Sensitivity and specificity. To study the properties of our model, we began
by considering under what circumstances we successfully detected a causal
effect, i.e., the statistical power or sensitivity of our approach. A related
property is the probability of not detecting an absent impact, i.e., specificity.
We repeatedly generated data, as described above, under different true effect
sizes. We then computed the posterior predictive distribution over the
counterfactuals, and recorded whether or not we would have concluded a
causal effect.
For each of the effect sizes 0%, 0.1%, 1%, 10%, and 100%, a total of
$2^8 = 256$ simulations were run. This number was chosen simply on the
grounds that it provided reasonably tight intervals around the reported summary
statistics without requiring excessive amounts of computation. In each
simulation, we concluded that a causal effect was present if and only if the
central 95% posterior probability interval of the cumulative effect excluded
zero.
The model used throughout this section comprised two structural blocks.
The first one was a local level component. We placed an inverse-Gamma
prior on its diffusion variance with a prior estimate of s/ν = 0.1σy and a
prior sample size ν = 32. The second structural block was a dynamic regression
component. We placed a Gamma prior with prior expectation 0.1σy on
the diffusion variance of both regression coefficients. By construction, the
outcome variable did not exhibit any local trends or seasonality other than
the variation conveyed through the covariates. This obviated the need to
include an explicit local linear trend or seasonality component in the model.
In a first analysis, we considered the empirical proportion of simulations
in which a causal effect had been detected. When taking into account only
those simulations where the true effect size was greater than zero, these
empirical proportions provide estimates of the sensitivity of the model w.r.t.
the process by which the data were generated. Conversely, those simulations
where the campaign had had no effect yield an estimate of the model’s
specificity. In this way, we obtained the power curve shown in Figure 3b.
The curve shows that, in data such as these, a market perturbation leading
to a lift no larger than 1% is missed in about 90% of cases. By contrast, a
perturbation that lifts market activity by 25% is correctly detected as such
in most cases.
In a second analysis, we assessed the coverage properties of the posterior
probability intervals obtained through our model. It is desirable to use
a diffuse prior on the local level component such that central 95% intervals
contain ground truth in about 95% of the simulations. This coverage
frequency should hold regardless of the length of the campaign period. In
other words, a longer campaign should lead to posterior intervals that are
appropriately widened to retain the same coverage probability as the narrower
intervals obtained for shorter campaigns. This was approximately the
case throughout the simulated campaign (Figure 3c).
Estimation accuracy. To study the accuracy of the point estimates supported
by our approach, we repeated the preceding simulations with a fixed
effect size of 10% while varying the length of the campaign. When given a
quadratic loss function, the loss-minimizing point estimate is the posterior
expectation of the predictive density over counterfactuals. Thus, for each
generated dataset i, we computed the expected causal effect for each time
point,
$$\hat{\phi}_{i,t} := \left\langle \phi_t \mid y_1, \ldots, y_m, x_1, \ldots, x_m \right\rangle \qquad \forall t = n+1, \ldots, m;\ i = 1, \ldots, 256. \qquad\qquad (3.1)$$
To quantify the discrepancy between estimated and true impact, we calculated
the absolute percentage estimation error,
$$a_{i,t} := \left| \frac{\hat{\phi}_{i,t} - \phi_t}{\phi_t} \right|. \qquad\qquad (3.2)$$
This yielded an empirical distribution of absolute percentage estimation errors
(Figure 4a; blue), showing that impact estimates become less and less
accurate as the forecasting period increases. This is because, under the local
linear trend model in (2.3), the true counterfactual activity becomes more
and more likely to deviate from its expected trajectory.
It is worth emphasizing that all preceding results are based on the assumption
that the model structure remains intact throughout the modelling
period. In other words, even though the model is built around the idea of
multiple (non-stationary) components (i.e., a time-varying local trend and,
potentially, time-varying regression coefficients), this structure itself remains
unchanged. If the model structure does change, estimation accuracy may
suffer.
We studied the impact of a changing model structure in a second simulation
in which we repeated the procedure above in such a way that 90 days
after the beginning of the campaign the standard deviation of the random
walk governing the evolution of the regression coefficient was tripled (now
0.03 instead of 0.01). As a result, the observed data began to diverge much
more quickly than before. Accordingly, estimations became considerably less
reliable (Figure 4a, red). An example of the underlying data is shown in Figure
4b.
The preceding simulations highlight the importance of a model that is
sufficiently flexible to account for phenomena typically encountered in seasonal
empirical data. This rules out entirely static models in particular (such
as multiple linear regression).

Figure 4. Estimation accuracy. (a) Time series of absolute percentage discrepancy between
inferred effect and true effect. The plot shows the rate (mean ± 2 s.e.m.) at which
predictions become less accurate as the length of the counterfactual forecasting period increases
(blue). The well-behaved decrease in estimation accuracy breaks down when the
data are subject to a sudden structural change (red), as simulated for 1 April 2014. (b) Illustration
of a structural break. The plot shows one example of the time series underlying
the red curve in (a). On 1 April 2014, the standard deviation of the generating random
walk of the local level was tripled, causing the rapid decline in estimation accuracy seen in
the red curve in (a).
4. Application to empirical data. To illustrate the practical utility
of our approach, we analysed an advertising campaign run by one of Google’s
advertisers in the United States. In particular, we inferred the campaign’s
causal effect on the number of times a user was directed to the advertiser’s
website from the Google search results page. We provide a brief overview
of the underlying data below (see Vaver and Koehler, 2011, for additional
details).
The campaign analysed here was based on product-related ads to be displayed
alongside Google’s search results for specific keywords. Ads went live
for a period of 6 consecutive weeks and were geo-targeted to a randomised
set of 95 out of 190 designated market areas (DMAs). The most salient
observable characteristic of DMAs is offline sales. To produce balance in
this characteristic, DMAs were first rank-ordered by sales volume. Pairs of
regions were then randomly assigned to treatment/control. DMAs provide
units that can be easily supplied with distinct offerings, although this fine-grained
split was not a requirement for the model. In fact, we carried out
the analysis as if only one treatment region had been available (formed by
summing all treated DMAs). This allowed us to evaluate whether our approach
would yield the same results as more conventional treatment-control
comparisons would have done.
The outcome variable analysed here was search-related visits to the advertiser’s
website, consisting of organic clicks (i.e., clicks on a search result)
and paid clicks (i.e., clicks on an ad next to the search results, for which the
advertiser was charged). Since paid clicks were zero before the campaign,
one might wonder why we could not simply count the number of paid clicks
after the campaign had started. The reason is that paid clicks tend to cannibalize
some organic clicks. Since we were interested in the net effect, we
worked with the total number of clicks.
The first building block of the model used for the analyses in this section
was a local level component. For the inverse-Gamma prior on its diffusion
variance we used a prior estimate of s/ν = 0.1σy and a prior sample size
ν = 32. The second structural block was a static regression component. We
used a spike-and-slab prior with an expected model size of M = 3, an expected
explained variance of R2 = 0.8 and 50 prior df. We deliberately kept
the model as simple as this. Since the covariates came from a randomised
experiment, we expected them to already account for any additional local
linear trends and seasonal variation in the response variable. If one suspects
that a more complex model might be more appropriate, one could optimize
model design through Bayesian model selection. Here, we focus instead on
comparing different sets of covariates, which is critical in counterfactual analyses
regardless of the particular model structure used. Model estimation was
carried out using 10 000 MCMC samples.
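For readers who want to reproduce this kind of setup, a minimal sketch using the CausalImpact R package mentioned in the Discussion is shown below. The data, dates and injected effect are simulated for illustration, and the model.args interface exposes only part of the prior specification above (the local-level prior standard deviation and the number of MCMC draws); the spike-and-slab settings would require passing a custom bsts model.

library(CausalImpact)

# Simulated example (hypothetical data): one control series and a response with
# an injected lift during the post-period.
set.seed(1)
n  <- 119
x1 <- 100 + arima.sim(model = list(ar = 0.9), n = n)    # control series
y  <- 1.2 * x1 + rnorm(n)                               # response (e.g. clicks)
y[71:n] <- y[71:n] + 10                                 # campaign effect in the post-period
dates <- seq(as.Date("2014-01-06"), by = "day", length.out = n)
data  <- zoo::zoo(cbind(y, x1), order.by = dates)

pre.period  <- as.Date(c("2014-01-06", "2014-03-16"))
post.period <- as.Date(c("2014-03-17", "2014-05-04"))

impact <- CausalImpact(data, pre.period, post.period,
                       model.args = list(niter = 10000,         # MCMC draws, as in Section 4
                                         prior.level.sd = 0.1)) # local-level prior s.d.
summary(impact)   # posterior absolute and relative effect with credible intervals
plot(impact)      # panels analogous to Figures 5-7: fit, pointwise and cumulative impact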
Analysis 1: Effect on the treated, using a randomised control. We began by
applying the above model to infer the causal effect of the campaign on the
time series of clicks in the treated regions. Given that a set of unaffected
regions was available in this analysis, the best possible set of controls was
given by the untreated DMAs themselves (see below for a comparison with
a purely observational alternative).
As shown in Figure 5a, the model provided an excellent fit on the pre-campaign
trajectory of clicks (including a spike in ‘week −2’ and a dip at the
end of ‘week −1’). Following the onset of the campaign, observations quickly
began to diverge from counterfactual predictions: the actual number of clicks
was consistently higher than what would have been expected in the absence
of the campaign. The curves did not reconvene until one week after the end
of the campaign. Subtracting predicted from observed data, as we did in
Figure 5b, resulted in a posterior estimate of the incremental lift caused by
the campaign. It peaked about three weeks into the campaign and faded
away about one week after the end of the campaign. Thus, as
shown in Figure 5c, the campaign led to a sustained cumulative increase
in total clicks (as opposed to a mere shift of future clicks into the present,
or a pure cannibalization of organic clicks by paid clicks). Specifically, the
overall effect amounted to 88 400 additional clicks in the targeted regions
(posterior expectation; rounded to three significant digits), i.e., an increase
of 22%, with a central 95% credible interval of [13%, 30%].
To validate this estimate, we returned to the original experimental data,
on which a conventional treatment-control comparison had been carried out
using a two-stage linear model (Vaver and Koehler, 2011). This analysis had
led to an estimated lift of 84 700 clicks, with a 95% confidence interval for
the relative expected lift of [19%, 22%]. Thus, with a deviation of less than
5%, the counterfactual approach had led to almost precisely the same estimate
as the randomised evaluation, except for its wider intervals. The latter
is expected, given that our intervals represent prediction intervals, not confidence
intervals. Moreover, in addition to an interval for the sum over all time
points, our approach yields a full time series of pointwise intervals, which
allows analysts to examine the characteristics of the temporal evolution of
attributable impact.
Figure 5. Causal effect of online advertising on clicks in treated regions. (a) Time series
of search-related visits to the advertiser's website (including both organic and paid clicks).
(b) Pointwise (daily) incremental impact of the campaign on clicks. Shaded vertical bars
indicate weekends. (c) Cumulative impact of the campaign on clicks.

The posterior predictive intervals in Figure 5b widen more slowly than in
the illustrative example in Figure 1. This is because the large number of controls
available in this data set offers a much higher pre-campaign predictive
strength than in the simulated data in Figure 1. This is not unexpected,
given that controls came from a randomised experiment, and we will see
that this also holds for a subsequent analysis (see below) that is based on
yet another data source for predictors. A consequence of this is that there
is little variation left to be captured by the random-walk component of the
model. A reassuring finding is that the estimated counterfactual time series
in Figure 5a eventually almost exactly rejoins the observed series, only a few
days after the end of the intervention.
Analysis 2: Effect on the treated, using observational controls. An important
characteristic of counterfactual-forecasting approaches is that they do
not require a setting in which a set of controls, selected at random, was exempt
from the campaign. We therefore repeated the preceding analysis in the
following way: we discarded the data from all control regions and, instead,
used searches for keywords related to the advertiser’s industry, grouped into
a handful of verticals, as covariates. In the absence of a dedicated set of
control regions, such industry-related time series can be very powerful controls
as they capture not only seasonal variations but also market-specific
trends and events (though not necessarily advertiser-specific trends). A major
strength of the controls chosen here is that time series on web searches are
publicly available through Google Trends (http://www.google.com/trends/).
This makes the approach applicable to virtually any kind of intervention. At
the same time, the industry as a whole is unlikely to be moved by a single
actor’s activities. This precludes a positive bias in estimating the effect of
the campaign that would arise if a covariate was negatively affected by the
campaign.
As shown in Figure 6, we found a cumulative lift of 85 900 clicks (posterior
expectation), or 21%, with a [12%, 30%] interval. In other words, the
analysis replicated almost perfectly the original analysis that had access to
a randomised set of controls. One feature in the response variable which
this second analysis failed to account for was a spike in clicks in the second
week before the campaign onset; this spike appeared both in treated and
untreated regions and appears to be specific to this advertiser. In addition,
the series of point-wise impact (Figure 6b) is slightly more volatile than
in the original analysis (Figure 5). On the other hand, the overall point
estimate of 85 900, in this case, was even closer to the randomised-design
baseline (84 700; deviation ca. 1%) than in our first analysis (88 400; deviation
ca. 4%). In summary, the counterfactual approach effectively obviated
the need for the original randomised experiment. Using purely observational
variables led to the same substantive conclusions.

Figure 6. Causal effect of online advertising on clicks, using only searches for keywords
related to the advertiser's industry as controls, discarding the original control regions as
would be the case in studies where a randomised experiment was not carried out. (a) Time
series of clicks to the advertiser's website. (b) Pointwise (daily) incremental impact
of the campaign on clicks. (c) Cumulative impact of the campaign on clicks. The plots
show that this analysis, which was based on observational covariates only, provided almost
exactly the same inferences as the first analysis (Figure 5) that had been based on a
randomised design.
Analysis 3: Absence of an effect on the controls. To go one step further
still, we analysed clicks in those regions that had been exempt from the
advertising campaign. If the effect of the campaign was truly specific to
treated regions, there should be no effect in the controls. To test this, we
inferred the causal effect of the campaign on unaffected regions, which should
not lead to a significant finding. In analogy with our second analysis, we
discarded clicks in the treated regions and used searches for keywords related
to the advertiser’s industry as controls.
As summarized in Figure 7, no significant effect was found in unaffected
regions, as expected. Specifically, we obtained an overall non-significant lift
of 2% in clicks with a central 95% credible interval of [−6%, 10%].
In summary, the empirical data considered in this section showed: (i) a
clear effect of advertising on treated regions when using randomised control
regions to form the regression component, replicating previous treatment-control
comparisons (Figure 5); (ii) notably, an equivalent finding when discarding
control regions and instead using observational searches for keywords
related to the advertiser’s industry as covariates (Figure 6); (iii) reassuringly,
the absence of an effect of advertising on regions that were not targeted (Figure
7).
5. Discussion. The increasing interest in evaluating the incremental
impact of market interventions has been reflected by a growing literature on
applied causal inference. With the present paper we are hoping to contribute
to this literature by proposing a Bayesian state-space model for obtaining a
counterfactual prediction of market activity. We discuss the main features
of this model below.
In contrast to most previous schemes, the approach described here is fully
Bayesian, with regularizing or empirical priors for all hyperparameters. Posterior
inference gives rise to complete-data (smoothing) predictions that are
only conditioned on past data in the treatment market and both past and
present data in the control markets. Thus, our model embraces a dynamic
evolution of states and, optionally, coefficients (departing from classical linear
regression models with a fixed number of static regressors) and enables
us to flexibly summarize posterior inferences.
Because closed-form posteriors for our model do not exist, we suggest a
stochastic approximation to inference using MCMC. One convenient consequence
of this is that we can reuse the samples from the posterior to obtain
credible intervals for all summary statistics of interest. Such statistics include,
for example, the average absolute and relative effect caused by the
intervention as well as its cumulative effect.

Figure 7. Causal effect of online advertising on clicks in non-treated regions, which should
not show an effect. Searches for keywords related to the advertiser's industry are used as
controls. Plots show inferences in analogy with Figure 5. (a) Time series of clicks to the
advertiser's website. (b) Pointwise (daily) incremental impact of the campaign on clicks.
(c) Cumulative impact of the campaign on clicks.
Posterior inference was implemented in C++ and R and, for all empirical
datasets presented in Section 4, took less than 30 seconds on a standard
Linux machine. If the computational burden of sampling-based inference
ever became prohibitive, one option would be to replace it by a variational
Bayesian approximation (see Mathys et al., 2011; Brodersen et al., 2013, for
examples).
Another way of using the proposed model is for power analyses. In particular,
given past time series of market activity, we can define a point in
the past to represent a hypothetical intervention and apply the model in
the usual fashion. As a result, we obtain a measure of uncertainty about the
response in the treated market after the beginning of the hypothetical intervention.
This provides an estimate of what incremental effect would have
been required to be outside of the 95% central interval of what would have
happened in the absence of treatment.
The model presented here subsumes several simpler models which, in consequence,
lack important characteristics, but which may serve as alternatives
should the full model appear too complex for the data at hand. One example
is classical multiple linear regression. In principle, classical regression
models go beyond difference-in-differences schemes in that they account for
the full counterfactual trajectory. However, they are not suited for predicting
stochastic processes beyond a few steps. This is because ordinary least-squares
estimators disregard serial autocorrelation; the static model structure
does not allow for temporal variation in the coefficients; and predictions
ignore our posterior uncertainty about the parameters. Put differently: classical
multiple linear regression is a special case of the state-space model
described here in which (i) the Gaussian random walk of the local level has
zero variance; (ii) there is no local linear trend; (iii) regression coefficients
are static rather than time-varying; (iv) ordinary least squares estimators
are used which disregard posterior uncertainty about the parameters and
may easily overfit the data.
Another special case of the counterfactual approach discussed in this paper
is given by synthetic control estimators that are restricted to the class
of convex combinations of predictor variables and do not include time-series
effects such as trends and seasonality (Abadie, Diamond and Hainmueller,
2010; Abadie, 2005). Relaxing this restriction means we can utilize predictors
regardless of their scale, even if they are negatively correlated with the
outcome series of the treated unit.
Other special cases include autoregressive (AR) and moving-average (MA)
models. These models define autocorrelation among observations rather than
latent states, thus precluding the ability to distinguish between state noise
and observation noise (Ataman, Mela and Van Heerde, 2008; Leeflang et al.,
2009).
In the scenarios we consider, advertising is a planned perturbation of
the market. This generally makes it easier to obtain plausible causal inferences
than in genuinely observational studies in which the experimenter had
no control over treatment (see discussions in Berndt, 1991; Brady, 2002;
Hitchcock, 2004; Robinson, McNulty and Krasno, 2009; Winship and Morgan,
1999; Camillo and d’Attoma, 2010; Antonakis et al., 2010; Lewis and
Reiley, 2011; Lewis, Rao and Reiley, 2011; Kleinberg and Hripcsak, 2011;
Vaver and Koehler, 2011). The principal problem in observational studies
is endogeneity: the possibility that the observed outcome might not be the
result of the treatment but of other omitted, endogenous variables. In principle,
propensity scores can be used to correct for the selection bias that arises
when the treatment effect is correlated with the likelihood of being treated
(Rubin and Waterman, 2006; Chan et al., 2010). However, the propensity-score
approach requires that exposure can be measured at the individual
level, and it, too, does not guarantee valid inferences, for example in the
presence of a specific type of selection bias recently termed ‘activity bias’
(Lewis, Rao and Reiley, 2011). Counterfactual modelling approaches avoid
these issues when it can be assumed that the treatment market was chosen
at random.
Overall, we expect inferences on the causal impact of designed market interventions
to play an increasingly prominent role in providing quantitative
accounts of return on investment (Danaher and Rust, 1996; Seggie, Cavusgil
and Phelan, 2007; Leeflang et al., 2009; Stewart, 2009). This is because
marketing resources, specifically, can only be allocated to whichever campaign
elements jointly provide the greatest return on ad spend (ROAS) if we
understand the causal effects of spend on sales, product adoption, or user
engagement. At the same time, our approach could be used for many other
applications involving causal inference. Examples include problems found in
economics, epidemiology, biology, or the political and social sciences. With
the release of the CausalImpact R package we hope to provide a simple
framework serving all of these areas. Structural time-series models are being
used in an increasing number of applications at Google, and we anticipate
that they will prove equally useful in many analysis efforts elsewhere.
Acknowledgements. The authors wish to thank Jon Vaver for sharing
the empirical data analysed in this paper.
References.
Abadie, A. (2005). Semiparametric Difference-in-Differences Estimators. The Review of
Economic Studies 72 1–19.
Abadie, A., Diamond, A. and Hainmueller, J. (2010). Synthetic control methods for
comparative case studies: Estimating the effect of California's tobacco control program.
Journal of the American Statistical Association 105.
Abadie, A. and Gardeazabal, J. (2003). The economic costs of conflict: A case study
of the Basque Country. American economic review 113–132.
Angrist, J. D. and Krueger, A. B. (1999). Empirical strategies in labor economics.
Handbook of labor economics 3 1277–1366.
Angrist, J. D. and Pischke, J.-S. (2008). Mostly Harmless Econometrics: An Empiricist’s
Companion. Princeton University Press.
Antonakis, J., Bendahan, S., Jacquart, P. and Lalive, R. (2010). On making causal
claims: A review and recommendations. The Leadership Quarterly 21 1086–1120.
Ashenfelter, O. and Card, D. (1985). Using the longitudinal structure of earnings
to estimate the effect of training programs. The Review of Economics and Statistics
648–660.
Ataman, M. B., Mela, C. F. and Van Heerde, H. J. (2008). Building brands. Marketing
Science 27 1036–1054.
Athey, S. and Imbens, G. W. (2002). Identification and Inference in Nonlinear Difference-in-Differences
Models. Working Paper No. 280, National Bureau of Economic Research.
Banerjee, S., Kauffman, R. J. and Wang, B. (2007). Modeling Internet firm survival
using Bayesian dynamic models with time-varying coefficients. Electronic Commerce
Research and Applications 6 332–342.
Belloni, A., Chernozhukov, V., Fernandez-Val, I. and Hansen, C. (2013). Program
evaluation with high-dimensional data CeMMAP working papers No. CWP77/13, Centre
for Microdata Methods and Practice, Institute for Fiscal Studies.
Berndt, E. R. (1991). The practice of econometrics: classic and contemporary. Addison-Wesley,
Reading, MA.
Bertrand, M., Duflo, E. and Mullainathan, S. (2002). How Much Should We Trust
Differences-in-Differences Estimates? Working Paper No. 8841, National Bureau of Economic
Research.
Brady, H. E. (2002). Models of causal inference: Going beyond the Neyman-Rubin-Holland
theory. In Annual Meetings of the Political Methodology Group.
Brodersen, K. H., Daunizeau, J., Mathys, C., Chumbley, J. R., Buhmann, J. M.
and Stephan, K. E. (2013). Variational Bayesian mixed-effects inference for classification
studies. NeuroImage 76 345–361.
Camillo, F. and d’Attoma, I. (2010). A new data mining approach to estimate causal
effects of policy interventions. Expert Systems with Applications 37 171–181.
Campbell, D. T., Stanley, J. C. and Gage, N. L. (1963). Experimental and quasi-experimental
designs for research. Houghton Mifflin, Boston.
Card, D. and Krueger, A. B. (1993). Minimum wages and employment: A case study
of the fast food industry in New Jersey and Pennsylvania Technical Report, National
Bureau of Economic Research.
Carter, C. K. and Kohn, R. (1994). On Gibbs sampling for state space models.
Biometrika 81 541-553.
Chan, D., Ge, R., Gershony, O., Hesterberg, T. and Lambert, D. (2010). Evaluating
Online Ad Campaigns in a Pipeline: Causal Models at Scale. In Proceedings of ACM
SIGKDD 2010 7–15.
Chipman, H., George, E. I., McCulloch, R. E., Clyde, M., Foster, D. P. and
Stine, R. A. (2001). The practical implementation of Bayesian model selection. Lecture
Notes-Monograph Series 65–134.
Claveau, F. (2012). The Russo-Williamson Theses in the social sciences: Causal inference
drawing on two types of evidence. Studies in History and Philosophy of Science Part
C: Studies in History and Philosophy of Biological and Biomedical Sciences 0.
Cox, D. and Wermuth, N. (2001). Causal Inference and Statistical Fallacies. In International
Encyclopedia of the Social & Behavioral Sciences (N. J. Smelser
and P. B. Baltes, eds.) 1554–1561. Pergamon, Oxford.
Danaher, P. J. and Rust, R. T. (1996). Determining the optimal return on investment
for an advertising campaign. European Journal of Operational Research 95 511–521.
de Jong, P. and Shephard, N. (1995). The simulation smoother for time series models.
Biometrika 82 339–350.
Donald, S. G. and Lang, K. (2007). Inference with Difference-in-Differences and Other
Panel Data. Review of Economics and Statistics 89 221–233.
Durbin, J. and Koopman, S. J. (2002). A Simple and Efficient Simulation Smoother for
State Space Time Series Analysis. Biometrika 89 603–616.
Frühwirth-Schnatter, S. (1994). Data augmentation and dynamic linear models. Journal
of Time Series Analysis 15 183–202.
George, E. I. and McCulloch, R. E. (1993). Variable selection via Gibbs sampling.
Journal of the American Statistical Association 88 881–889.
George, E. I. and McCulloch, R. E. (1997). Approaches for Bayesian variable selection.
Statistica Sinica 7 339–374.
Ghosh, J. and Clyde, M. A. (2011). Rao-Blackwellization for Bayesian Variable Selection
and Model Averaging in Linear and Binary Regression: A Novel Data Augmentation
Approach. Journal of the American Statistical Association 106 1041–1052.
Hansen, C. B. (2007a). Asymptotic properties of a robust variance matrix estimator for
panel data when T is large. Journal of Econometrics 141 597–620.
Hansen, C. B. (2007b). Generalized least squares inference in panel and multilevel models
with serial correlation and fixed effects. Journal of Econometrics 140 670–694.
Heckman, J. J. and Vytlacil, E. J. (2007). Econometric Evaluation of Social Programs,
Part I: Causal Models, Structural Models and Econometric Policy Evaluation. In Handbook
of Econometrics, (J. J. Heckman and E. E. Leamer, eds.) 6, Part B 4779–4874.
Elsevier.
Hitchcock, C. (2004). Do All and Only Causes Raise the Probabilities of Effects? In
Causation and Counterfactuals MIT Press.
Hoover, K. D. (2012). Economic Theory and Causal Inference. In Philosophy of Economics,
(U. Mäki, ed.) 13 89–113. Elsevier.
Kleinberg, S. and Hripcsak, G. (2011). A review of causal inference for biomedical
informatics. Journal of Biomedical Informatics 44 1102–1112.
Leeflang, P. S., Bijmolt, T. H., van Doorn, J., Hanssens, D. M., van
Heerde, H. J., Verhoef, P. C. and Wieringa, J. E. (2009). Creating lift versus
building the base: Current trends in marketing dynamics. International Journal of Research
in Marketing 26 13–20.
Lester, R. A. (1946). Shortcomings of marginal analysis for wage-employment problems.
The American Economic Review 36 63–82.
Lewis, R. A., Rao, J. M. and Reiley, D. H. (2011). Here, there, and everywhere:
correlated online behaviors can lead to overestimates of the effects of advertising. In
Proceedings of the 20th international conference on World wide web. WWW ’11 157–
166. ACM, New York, NY, USA.
Lewis, R. A. and Reiley, D. H. (2011). Does Retail Advertising Work? Technical Report.
Liang, F., Paulo, R., Molina, G., Clyde, M. A. and Berger, J. O. (2008). Mixtures of
g-priors for Bayesian variable selection. Journal of the American Statistical Association
103 410-423.
Mathys, C., Daunizeau, J., Friston, K. J. and Stephan, K. E. (2011). A Bayesian
Foundation for Individual Learning Under Uncertainty. Frontiers in Human Neuroscience
5.
Meyer, B. D. (1995). Natural and Quasi-Experiments in Economics. Journal of Business
& Economic Statistics 13 151.
Morgan, S. L. and Winship, C. (2007). Counterfactuals and causal inference: Methods
and principles for social research. Cambridge University Press.
Nakajima, J. and West, M. (2013). Bayesian analysis of latent threshold dynamic
models. Journal of Business & Economic Statistics 31 151–164.
Polson, N. G. and Scott, S. L. (2011). Data augmentation for support vector machines.
Bayesian Analysis 6 1–23.
Robinson, G., McNulty, J. E. and Krasno, J. S. (2009). Observing the Counterfactual?
The Search for Political Experiments in Nature. Political Analysis 17 341–357.
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized
studies. Journal of Educational Psychology; Journal of Educational Psychology
66 688.
Rubin, D. B. (2007). Statistical Inference for Causal Effects, With Emphasis on Applications
in Epidemiology and Medical Statistics. In Handbook of Statistics, (J. M. C. R.
Rao and D. Rao, eds.) 27 28–63. Elsevier.
Rubin, D. B. and Waterman, R. P. (2006). Estimating the Causal Effects of Marketing
Interventions Using Propensity Score Methodology. Statistical Science 21 206-222.
Scott, J. G. and Berger, J. O. (2010). Bayes and empirical-Bayes multiplicity adjustment
in the variable-selection problem. Annals of Statistics 38 2587-2619.
Scott, S. L. and Varian, H. R. (2013). Predicting the Present with Bayesian Structural
Time Series. International Journal of Mathematical Modeling and Optimization.
(forthcoming).
Seggie, S. H., Cavusgil, E. and Phelan, S. E. (2007). Measurement of return on
marketing investment: A conceptual framework and the future of marketing metrics.
Industrial Marketing Management 36 834–841.
Shadish, W. R., Cook, T. D. and Campbell, D. T. (2002). Experimental and quasi-experimental
designs for generalized causal inference. Wadsworth Cengage Learning.
Solon, G. (1984). Estimating autocorrelations in fixed-effects models. National Bureau of
Economic Research Cambridge, Mass., USA.
Stewart, D. W. (2009). Marketing accountability: Linking marketing actions to financial
results. Journal of Business Research 62 636–643.
Takada, H. and Bass, F. M. (1998). Multiple Time Series Analysis of Competitive
Marketing Behavior. Journal of Business Research 43 97–107.
Vaver, J. and Koehler, J. (2011). Measuring Ad Effectiveness Using Geo Experiments
Technical Report, Google Inc.
Vaver, J. and Koehler, J. (2012). Periodic Measurement of Advertising Effectiveness
Using Multiple-Test-Period Geo Experiments Technical Report, Google Inc.
West, M. and Harrison, J. (1997). Bayesian Forecasting and Dynamic Models.
Springer.
Winship, C. and Morgan, S. L. (1999). The estimation of causal effects from observational
data. Annual Review of Sociology 659–706.
Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with
g-prior distributions. In Bayesian Inference and Decision Techniques: Essays in Honor
of Bruno de Finetti (P. K. Goel and A. Zellner, eds.) 233–243. North-Holland/Elsevier.
Google, Inc.
1600 Amphitheatre Parkway
Mountain View
CA 94043, U.S.A.
Estimating reach curves from one data point
Georg M. Goerg
Google Inc.
Last update: November 21, 2014
Abstract
Reach curves arise in advertising and media analysis
as they relate the number of content impressions to
the number of people who have seen it. This is especially
important for measuring the effectiveness of
an ad on TV or websites (Nielsen, 2009; PricewaterhouseCoopers,
2010). For a mathematical and datadriven
analysis, it would be very useful to know the
entire reach curve; advertisers, however, often only
know its last data point, i.e., the total number of impressions
and the total reach. In this work I present
a new method to estimate the entire curve using only
this last data point.
Furthermore, analytic derivations reveal a surprisingly
simple, yet insightful relationship between
marginal cost per reach, average cost per impression,
and frequency. Thus, advertisers can estimate the
cost of an additional reach point by just knowing their
total number of impressions, reach, and cost.
A comparison of the proposed one-data-point method
to two competing regression models on TV reach
curve data shows that the proposed methodology
performs only slightly worse than regression fits to
a collection of several points along the curve.
1 Introduction
Let k+ reach, rk, be the percentage of the population
that is exposed to a campaign at least k times. As
usual, we measure impressions in gross rating points
(GRPs), which are calculated as the number of impressions
divided by the total (target) population, multiplied by 100
(measured in percent).
Equipped with a functional form of the reach curve, a
variety of quantities of interest can be computed, e.g.,
marginal cost per reach or maximum possible reach.
Advertisers, however, often only have two points of
the reach curve rk(g): rk(0) = 0 and
rk(G) = R ∈ [0, 100], (1)
where G ≥ 0 is the total GRPs and R is total reach.
With this information alone one is tempted to use a
linear approximation r_k^(1)(g) = (R/G) · g. However, reach
curves are not linear and in particular, the marginal
reach per GRP would equal average reach per GRP
(= 1/frequency); thus (1) alone is not helpful to get
a better estimate of marginal GRP (and thus cost)
per reach at g = G.
While the behavior of rk(g) around g = G is in general
unknown, the tangent at g = 0 can be approximated
quite well: starting with no exposure, adding
an infinitesimally small unit of GRPs (say ε) one
reaches ε · ι % of the population, where ι = ι(k) is
the reciprocal of the expected number of impressions
needed for the first person to see k impressions. One
can lower bound ι by 1/k. For k = 1, the bound is
tight, ι = 1; getting an exact expression of ι for k > 1
is ongoing research.¹ That is, for small g the reach
curve can be approximated with a line through (0, 0)
with slope ι:

    rk(g) ≈ g · ι   for small g.    (2)

Thus, approximately,

    lim_{G→0} ∂/∂g rk(g)|_{g=G} = ι.    (3)

Combining (1) with (3) allows us to estimate a
two-parameter model.
Section 2 reviews parametric models for reach curves.
Section 3 derives the parameter estimates based on
the total GRP and reach. Simulations and comparisons
to full least squares estimates are presented in
Section 4. Finally, Section 5 summarizes the main
findings and discusses future work. Details on the
TV reach curve data and analytical derivations can
be found in the Appendix.

¹ In practice we found that ι = (k + log₂ k)⁻¹ gives good
fits for several k ≥ 1.
2 Reach curve models
Let X ≥ 0 be the number of content impressions, e.g.,
TV shows, websites, or commercials. For a probabilistic
view of reach curves, it is useful to decompose
k+ reach as

    P(X ≥ k, reachable)
      = P(X ≥ k | reachable) · P(reachable)    (4)
    ⇔ rk = pk · ρ,    (5)

where ρ is the maximum possible reach, and pk is the
probability of being reached at least k times, given that an individual
is indeed reachable. This distinction allows
us to model ρ and pk with separate probabilistic models.
Since reach is usually denoted in percent, we also
use percent for maximum possible reach ρ ∈ [0, 100],
while we use proportions for pk ∈ [0, 1].
For further analytical derivations it is necessary to
parametrize pk(g). Below we review two functional
forms which are parsimonious (2 + 1 parameters),
have excellent empirical fits, and lend themselves to
simple analytical derivations.
2.1 Gamma-Mixture
Jin et al. (2012) propose a Poisson distribution for the
impressions g, with an exponential prior distribution
with rate β on the Poisson rate λ. This yields a model
of the form

    pk(g) = 1 − β / (g + β).    (6)

The exponential prior can be generalized to a Γ(α, β)
distribution, which yields

    rk(g) = ρ [1 − (β / (β + g))^α].    (7)

By construction, (6) is nested in (7), which can be
tested using a hypothesis test for H0 : α = 1.
2.1.1 Marginal reach
The derivative of (7) with respect to g equals²

    ∂/∂g pk(g) = (α/β) · (β / (g + β))^(α+1),    (8)

with

    lim_{g→0} ∂/∂g rk(g) = ρ α / β.    (9)

Eq. (9) has three degrees of freedom; since only two
data points are available, one parameter has to be
fixed. Given the nested structure of the exponential
model, it is natural to set α ≡ 1.
2.2 Conditional Logit
As an alternative we propose a logistic regression

    logit(pk(g)) = β0 + β1 · log g,    (10)

where logit(p) = log(p / (1 − p)), and β0 and β1 are intercept
and slope.³ Using the logit inverse expit(x) = e^x / (1 + e^x) =
1 / (1 + e^(−x)), Eq. (10) can be rewritten as

    pk = expit(β0 + β1 log g) = e^(β0 + β1 log g) / (1 + e^(β0 + β1 log g))    (11)
       = 1 − 1 / (1 + e^(β0) · g^(β1))    (12)
       = 1 − e^(−β0) / (e^(−β0) + g^(β1)),    (13)

which shows similarity to (7). In fact, identifying
β ≡ e^(−β0), both models coincide if α = 1 and β1 = 1,
respectively. Again, this can be tested using a
two-sided hypothesis test for H0 : β1 = 1.
The conditional logit model can also be interpreted
as the baseline Gamma-mixture model with α ≡ 1,
but with transformed GRPs, g̃ = g^(β1), in (7). Here
β1 can be interpreted as a parameter that measures
the efficiency of GRPs: for β1 > 1 GRPs are more
efficient than baseline; for β1 = 1 GRPs are spent
according to the baseline model; and for β1 < 1 GRPs
are not spent as efficiently as expected. For empirical
estimates see Section 4.
² See Section B.1 for details.
³ We deliberately do not use α and β to parametrize intercept
and slope, as it is prone to confusion with the (reversed)
roles of α and β in (8).
Thus for the logit model one has to assume β1 = 1 to
use the linear approximation of R(g) at g = 0 for 1+
reach.⁴
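As a quick numerical check (my own, not from the paper) that the Gamma-mixture model with α = 1 and the conditional logit model with β1 = 1 describe the same curve once β ≡ e^(−β0), the following R lines compare the two parameterizations on a grid of GRPs; the value of β0 is arbitrary.

beta0 <- -1.5                            # illustrative intercept
beta  <- exp(-beta0)                     # implied saturation parameter of the Gamma-mixture
g     <- seq(1, 500, by = 1)             # grid of GRPs

p_gamma <- 1 - beta / (g + beta)         # eq. (6), i.e. (7) with alpha = 1 and rho factored out
p_logit <- plogis(beta0 + 1 * log(g))    # eq. (10) with beta1 = 1; plogis() is expit()

max(abs(p_gamma - p_logit))              # on the order of machine precision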
3 Methodology
Equipped with the two-parameter model

    r(g; ρ, β) = ρ [1 − β/(β + g)] = ρ g / (β + g) ∈ [0, ρ],    (16)

we can use the tangent approximation in (3) and total
GRP and reach to estimate ρ and β. Note that
β ≥ 0 is a saturation parameter and controls how
efficient GRPs are: for small β reach grows quickly
with GRPs, for large β it grows slowly.

Its derivative equals

    r′(g; ρ, β) = ρ β / (β + g)²,    (17)

which at g = 0 evaluates to r′(0) = ρ/β.
This gives a system of two equations (maximum GRP
and reach & marginal reach at 0) with two unknowns,
ρ ∈ [0, 100] and β > 0:

    ρ/β = ι  ⇔  ρ = β · ι,    (18)
    ρ G / (β + G) = R  ⇔  ρ = R (G + β) / G.    (19)

First note that for 1+ reach, ρ ≡ β since ι(k = 1) = 1.
Moreover, ρ in (19) satisfies ρ ≥ 0 for all β, but it
satisfies ρ ≤ 100 only for β ≤ G · (100 − R)/R.
⁴ For k > 1, the logit model with β1 > 1 might become
useful as the marginal k+ reach for the very first impression is
0. However, one then has to estimate three parameters again,
which is not possible without any further assumptions or more
than one data point.
Solving for β and plugging in to ρ = ρ(β) gives

    ρ̂ = min( G·R / (G − R/ι), 100 ),    (20)

and

    β̂ = ρ̂/ι = (G·R/ι) / (G − R/ι)   if ρ̂ < 100,
    β̂ = G · (100 − R) / R            if ρ̂ = 100.    (21)
The capped case ρ̂ = 100 in (21) applies exactly when G ≤ (100/ι) · R/(100 − R);
thus the cap binds when GRPs are less than or equal to a constant times
the odds ratio of reach.
Plugging them back into (16) yields expressions for
reach solely as a function of R and G (details see
Appendix B). According to (21) we consider the two
scenarios separately.
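To make the estimator concrete, the following R sketch (my own illustration, not code from the paper) maps a single observed point (G, R) to (ρ̂, β̂) via (20)–(21) and returns the fitted curve (16); here ι defaults to the lower bound 1/k from Section 1, which is exact for k = 1.

# One-data-point reach curve fit: G = total GRPs, R = total k+ reach in percent.
fit_reach_curve <- function(G, R, k = 1, iota = 1 / k) {
  rho_unc <- G * R / (G - R / iota)          # unconstrained solution of (18)-(19)
  if (G - R / iota > 0 && rho_unc <= 100) {  # case rho_hat < 100 in (21)
    rho  <- rho_unc
    beta <- rho / iota
  } else {                                   # capped case rho_hat = 100
    rho  <- 100
    beta <- G * (100 - R) / R
  }
  list(rho = rho, beta = beta,
       reach = function(g) rho * g / (beta + g))   # eq. (16)
}

# Example: a campaign that delivered G = 200 GRPs with 1+ reach R = 55%.
fit <- fit_reach_curve(G = 200, R = 55)
fit$rho                      # estimated maximum possible reach (about 76%)
fit$reach(c(50, 200, 400))   # fitted reach; at g = 200 this reproduces R = 55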
3.1 Gain from the IDA alone

    ṽar(θ̂_I) / var(θ̂_S) ≥ 1 − F θ/pz > 1 − F.
The first inequality will be close to an equality when pz and hence θ is small. For
our applications 1 − F θ/pz is a reasonable approximation to the variance ratio.
The second inequality reflects the fact that pooling the data cannot possibly be
better than what we would get with an SSP of size n + N.
From ṽar(θ̂_I)/var(θ̂_S) ≈ 1 − F θ/pz we see that using the BRP is effectively
like multiplying the SSP sample size n by 1/(1 − F θ/pz). Our greatest precision
gains come when a high fraction of online reaches are incremental, that is, when
θ/pz is largest. In our application this proportion ranges from 20% to 50% when
aggregated to the campaign level. See Table 2.1 in Section 2.
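As a small illustration of this effective-sample-size interpretation (not taken from the paper), the R lines below evaluate the multiplier 1/(1 − F θ/pz) over the reported 20%–50% range of θ/pz for a few values of F, which is treated here simply as a given constant in [0, 1].

# Effective sample-size multiplier implied by the IDA-based estimator:
# using the BRP acts like inflating the SSP size n by 1 / (1 - F * theta / pz).
multiplier <- function(Fval, theta_over_pz) 1 / (1 - Fval * theta_over_pz)

ratios <- c(0.2, 0.35, 0.5)                    # theta / pz, campaign-level range
sapply(c(0.5, 0.7, 0.9), function(Fval)        # illustrative values of F
  round(multiplier(Fval, ratios), 2))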
3.2 Gain from the CIA alone
Here we evaluate the variance reduction that would follow from the CIA. In
that case, we could take advantage of the Z–Y independence, and estimate θ
by

    θ̂_C = Z̄_S (1 − Ȳ_S).

It is shown in the Appendix that the delta method variance of θ̂_C satisfies

    ṽar(θ̂_C) / var(θ̂_S) = 1 − py (1 − pz) / (1 − θ) > 1 − py,    (2)
when the CIA holds. This can represent a dramatic improvement, when the
online reach pz and incremental reach θ are both small while the TV reach py is
large. If the CIA holds, our application data suggest the variance reduction can
be from 50% to 80%. The reverse setting with tiny TV reach and large online
reach would not be favorable to θ̂_C, but our data are not of that type.
3.3 Gain from the CIA and IDA
Finally, suppose that both the CIA and IDA hold. If we apply both assumptions,
we can get the estimator θ̂_I,C = (f Z̄_S + F Z̄_B)(1 − Ȳ_S). We already gain a lot
from the CIA, so it is interesting to see how much more the IDA adds when the
CIA holds. We show in the Appendix that under both assumptions,

    ṽar(θ̂_I,C) / ṽar(θ̂_C) = [f (1 − py)(1 − pz) + py pz] / [(1 − py)(1 − pz) + py pz].

If both reaches are high then we gain little, but if both reaches are small then
we reduce the variance by almost a factor of f when adding the IDA to the
CIA. In our case we expect that the television reach is large but the online reach
is small, fitting neither of these extremes. Consider a campaign with f = 1/3,
py = 2/3 and pz = 1/100, similar to the soap campaigns. For such a campaign,

    ṽar(θ̂_I,C) / ṽar(θ̂_C) = [(1/9) × 0.99 + (2/3) × 0.01] / [(1/3) × 0.99 + (2/3) × 0.01] ≈ 0.34,

so the combined assumptions then allow a nearly three-fold variance reduction
compared to the CIA alone.
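The arithmetic is easy to check in R, assuming the variance-ratio formula displayed above:

# Variance of theta_hat_{I,C} relative to theta_hat_C under CIA + IDA.
ratio_IC_vs_C <- function(f, py, pz)
  (f * (1 - py) * (1 - pz) + py * pz) / ((1 - py) * (1 - pz) + py * pz)

ratio_IC_vs_C(f = 1/3, py = 2/3, pz = 1/100)   # approx 0.347, matching the 0.34 above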
4 Example campaigns
Our data enrichment scheme is described in Section 5. Here we illustrate the
results from that scheme on six marketing campaigns and discuss the differences
among different algorithms.
In addition to data enrichment, we also show results from tree structured
models. Those split the data into groups and recursively split the groups. More
about tree fitting is in Section 5. One model fits a tree to the SSP data alone
and another one works with the pooled SSP and BRP data.
For all three of those methods we have aggregated the predictions over the
age variable, which takes six levels. In addition, we show the empirical results
for age, which amount to recording the percentage of incremental reaches, that
is, data with Z(1 − Y ) = 1, at each unique level of age in the SSP. There is no
corresponding empirical prediction fully disaggregated by age, gender, income
and education, because of the great many empty cells that would cause.
We found the age related patterns of incremental reach particularly interesting.
Figure 4.1 shows estimated incremental reach for all three models and
the empirical counts, on all six campaigns, averaged over age groups. The beer
campaign is particularly telling. The empirical data show a decreasing trend of
incremental reach with age. The tree fit to SSP-only data yields a fit that is
constant in age. The tree model had to explore splitting the data on all four
variables without a prior focus on age. There were only 23 incremental reach
events for beer in the SSP data set. With such a small number of events and
four predictors, there is considerable possibility of overfitting. Cross-validation
led to a model that grouped the entire SSP into one set, that is, the tree had
no splits. Both pooling and data enrichment were able to borrow strength from
the BRP as well as take advantage of approximate independence of television
and web exposure. They then recover the trend with age.5 Data enrichment for incremental reach 8
Fig. 4.1: Estimated incremental reach (%) by age level, for six campaigns (Beer, Chrome,
Salt, Soap 1, Soap 2 and Soap 3) and three models: SSP, Pooling and DEIR as described
in the text. Empirical counts are marked by Emp.
The Salt campaign had a similarly small number of incremental reaches and
once again the SSP only tree was constant. Fitting a tree to the SSP data
always gave a flatter fit versus age than did DEIR which in turn was flatter
than what we would get simply pooling the data. Section 6 gives simulations in
which DEIR has greater accuracy than using pooling or SSP only.
5 Data enrichment for incremental reach
For a given sample we would like to combine incremental reach estimates θ̂_S,
θ̂_I, θ̂_C and θ̂_I,C whose assumptions are: none, IDA, CIA and IDA+CIA, respectively.
The latter three add some value if their corresponding assumptions are
nearly true, but our information about how well those assumptions hold comes
from the same data we are using to form the estimates.
The circumstances are similar to those in data enriched linear regression (Chen
et al., 2013). In that problem there is a regression model Y_i = X_i^T β + ε_i which
holds in the SSP and a biased regression model Y_i = X_i^T (β + γ) + ε_i holds in
the BRP. The estimates are found by minimizing

    S(λ) = Σ_{i∈S} (Y_i − X_i^T β)² + Σ_{i∈B} (Y_i − X_i^T (β + γ))² + λ Σ_{i∈S} (X_i^T γ)²,    (3)

over β and γ for a nonnegative penalty factor λ. The ε_i are independent with
mean 0 and variance σ²_S in the SSP and σ²_B in the BRP.
Taking λ = 0 amounts to fitting regressions separately in the two samples
yielding an estimate βˆ that does not use the BRP at all. The limit λ → ∞
corresponds to pooling the two data sets, which would be optimal if there were
no bias, i.e., if γ = 0. The specific penalty in (3) discourages the estimated γ
from making large changes to the SSP; it is one of several penalties considered
in that paper.
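For concreteness, here is a small R sketch of minimizing the criterion (3) for a fixed λ (my own illustration, not code from Chen et al., 2013): the SSP rows, the BRP rows and the penalty term are stacked into a single least-squares problem in (β, γ).

# Minimize S(lambda) in (3) by ordinary least squares on an augmented system.
fit_enriched <- function(XS, yS, XB, yB, lambda) {
  p    <- ncol(XS)
  zero <- matrix(0, nrow(XS), p)
  X <- rbind(cbind(XS, zero),                   # SSP rows:     y ~ X beta
             cbind(XB, XB),                     # BRP rows:     y ~ X (beta + gamma)
             cbind(zero, sqrt(lambda) * XS))    # penalty rows: 0 ~ sqrt(lambda) X gamma
  y <- c(yS, yB, rep(0, nrow(XS)))
  coefs <- qr.coef(qr(X), y)
  list(beta = coefs[1:p], gamma = coefs[(p + 1):(2 * p)])
}

# Tiny simulated example: a shared intercept/slope plus a slope bias in the BRP.
set.seed(2)
XS <- cbind(1, rnorm(50));  yS <- XS %*% c(1, 2)   + rnorm(50)
XB <- cbind(1, rnorm(200)); yB <- XB %*% c(1, 2.5) + rnorm(200)
fit_enriched(XS, yS, XB, yB, lambda = 1)$beta
# lambda = 0 recovers the SSP-only fit; letting lambda grow approaches the pooled fit.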
Varying λ from 0 to ∞ gives a family of estimators that weight the SSP
to varying degrees. The optimal λ is unknown. An oracle that knew γ and
the error variance in the two data sets would be able to compute the optimal
λ under a mean squared error loss. Chen et al. (2013) get a formula for the
oracle’s λ and then plug estimates of γ and the variances into that formula.
They show, under conditions, that the resulting plugin estimate gives better
estimates of β than using the SSP only would. The conditions are that the Y
values are normally distributed, and that the model have at least 5 regression
parameters and 10 error degrees of freedom. The normality assumption allows
a technical lemma due to Stein (1981) to be used and we believe that gains from
using the BRP do not require normality.
In principle we might multiply the sum of squared errors in the BRP by
τ = σ²_S/σ²_B if that ratio is known. If σ²_B > σ²_S then we should put less
weight on the BRP sample relative to the SSP sample. However the same effect
is gained by increasing λ. Since the algorithm searches for optimal λ over a wide
range it is less important to precisely specify τ. Chen et al. (2013) took τ = 1,
simply summing all squared errors, and we will generalize that approach.
For the present setting we must modify the method. First our responses
are binary, not Gaussian. Second we have four estimators to combine, not two.
Third, those estimators are dependent, being fit to overlapping data sets.
5.1 Modification for binary response
To address the binary response there are two reasonable choices. One is to
employ logistic regression. The other is to use tree-structured regression and
then pool the estimators at the leaves of the tree. Regarding prediction accuracy,
there is no unique best algorithm. There will be data sets for which simple
logistic regression outperforms tree based classifiers and vice versa.
For this paper we have adopted trees. Tree structured models have two
practical advantages. First, the resulting cells that they select correspond to
empirically determined market segments, which are then interpretable. Second,
within any of those cells, the model is intercept-only. Then both logistic
regression and least squares reduce to a simple average.

    Data set   Source   Imputed V                            Assumptions
    D0         SSP      Z_S (1 − Y_S)                        none
    D1         BRP      Z_B (1 − Ŷ_SSP(X_B, Z_B))            IDA
    D2         SSP      Ẑ_SSP(X_S) (1 − Ŷ_SSP(X_S))          CIA
    D3         SSP      Ẑ_SSP+BRP(X_S) (1 − Ŷ_SSP(X_S))      CIA & IDA

Tab. 5.1: Four incremental reach data sets and their imputed incremental
reaches. The hats denote model-imputed values. For example,
Ŷ_SSP(X_B, Z_B) is a predictive model for Y based on values X and Z,
fit using data from the SSP and evaluated at X = X_B and Z = Z_B (from
the BRP).
Each leaf of the regression tree defines a subset of the data that we call a
cell. There are cells 1, …, C. The SSP has n_c observations in cell c and the
BRP has N_c observations there.
For each cell and each set of assumptions we use a linear regression model
relating an incremental reach quantity like Ṽ_i to an intercept. When there are
no assumptions then Ṽ_i is the observed incremental reach for i ∈ S. Otherwise
we may take advantage of the assumptions to impute values Ṽ_i using more of
the data. The incremental reach values for each set of assumptions are given in
Table 5.1. The predictive models shown there are all fit using rpart.
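A rough R sketch of how the imputations in Table 5.1 could be produced with rpart is shown below. The data frames ssp and brp, the covariate names x1–x4 and the use of regression trees on the 0/1 responses (so that predictions are leaf means, i.e. probabilities) are my own illustrative assumptions, not the authors' exact implementation.

library(rpart)

# ssp: covariates x1..x4, online exposure Z and TV exposure Y (all coded 0/1);
# brp: the same covariates plus Z only.
impute_incremental_reach <- function(ssp, brp) {
  fit_Y_xz <- rpart(Y ~ x1 + x2 + x3 + x4 + Z, data = ssp, method = "anova")  # Yhat_SSP(X, Z)
  fit_Y_x  <- rpart(Y ~ x1 + x2 + x3 + x4,     data = ssp, method = "anova")  # Yhat_SSP(X)
  fit_Z_x  <- rpart(Z ~ x1 + x2 + x3 + x4,     data = ssp, method = "anova")  # Zhat_SSP(X)
  pooled   <- rbind(ssp[c("x1", "x2", "x3", "x4", "Z")],
                    brp[c("x1", "x2", "x3", "x4", "Z")])
  fit_Z_xp <- rpart(Z ~ x1 + x2 + x3 + x4, data = pooled, method = "anova")   # Zhat_SSP+BRP(X)

  list(D0 = ssp$Z * (1 - ssp$Y),                                   # none
       D1 = brp$Z * (1 - predict(fit_Y_xz, brp)),                  # IDA
       D2 = predict(fit_Z_x,  ssp) * (1 - predict(fit_Y_x, ssp)),  # CIA
       D3 = predict(fit_Z_xp, ssp) * (1 - predict(fit_Y_x, ssp)))  # CIA & IDA
}

# Illustrative call on simulated data.
set.seed(3)
sim <- function(n, with_Y = TRUE) {
  d <- data.frame(x1 = rbinom(n, 1, .5), x2 = sample(1:6, n, TRUE),
                  x3 = rbinom(n, 1, .5), x4 = sample(1:4, n, TRUE),
                  Z  = rbinom(n, 1, .1))
  if (with_Y) d$Y <- rbinom(n, 1, .6)
  d
}
str(impute_incremental_reach(ssp = sim(6000), brp = sim(13000, with_Y = FALSE)))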
For k = 0, 1, 2, 3 let Ṽ_k be the vector of imputed responses under any of the assumptions
from Table 5.1 and X̃_k their corresponding predictors. The regression
framework minimizes

    ||Ṽ_0 − X̃_0 β||² + Σ_{k=1}^{3} ||Ṽ_k − X̃_k (β + γ_k)||² + Σ_{k=1}^{3} λ_k ||X̃_0 γ_k||²    (4)

over β and γ_k for penalties λ_k. In our setting each X̃_k is a column vector of
ones of length m_k. For cell c, m_1 = N_c and m_0 = m_2 = m_3 = n_c.
5.2 Search for λk
It is very convenient to search for suitable weights in the simplex

    Δ^(K) = {(ω_0, ω_1, …, ω_K) | ω_k > 0, Σ_{k=0}^{K} ω_k = 1}

because it is a bounded set, unlike the set [0, ∞]^K of usable vectors λ =
(λ_1, …, λ_K). Chen et al. (2013) remark that it is more reasonable to use a
common set of λ_k over all cells, stemming from unequal sample sizes. The
search we use combines the advantages of both approaches.
Our search strategy for the simplex is to choose a grid of weight vectors

    ω_g = (ω_g0, ω_g1, …, ω_gK) ∈ Δ^(K),   g = 1, …, G.

For each vector ω_g we find a vector λ_g = (λ_g1, …, λ_gK) such that

    Σ_{c=1}^{C} p_c ω_{k,c} = ω_gk,   k = 0, 1, …, K,

where p_c is the proportion of our target population in cell c. That is, the
population average weight of ω_{k,c} matches ω_gk. These weights give us the vector
λ_g = (λ_g1, …, λ_gK). Using λ_g in the penalty criterion (4) specifies the weights
we use within each cell.
Our algorithm chooses the tree and the vector ω jointly using cross-validation.
It is computationally expensive to make high dimensional searches. With K factors
there is a K − 1 dimensional space of weights to search. Adding in the tree
size gives a K’th dimension. As a result, combining all of our estimators requires
us to search a 4 dimensional grid of values.
We have chosen to set one of the ωk to 0 to reduce the search space from 4
dimensions to 3. We always retained the unbiased estimate θ̂_S along with two
others. In some computations reported in section A.4 of the Appendix we find
only small differences among setting ω1 = 0, or ω2 = 0 or ω3 = 0. The best
outcome was setting ω1 = 0. That has the effect of removing the estimate based
on IDA only. As we saw in section 3, the IDA-only model had the least potential
to improve our estimate. As a bonus, all three of the retained submodels have
the same sample sizes and then common λ over cells coincides with common ω
over cells.
In the special case with ω_1 = 0 we find after some calculus that the minimizer
of (4) has

    β̂_c = [ V̄_0c + Σ_{k∈{2,3}} (λ_k/(1 + λ_k)) V̄_kc ] / [ 1 + Σ_{k∈{2,3}} λ_k/(1 + λ_k) ]
        ≡ Σ_{k∈{0,2,3}} ω_kc(λ) V̄_kc,    (5)

where V̄_kc is the simple average of Ṽ_k over i ∈ S for cell c.
Our default grid takes all values of ω whose coefficients are integer multiples
of 10%. Populations D0, D2 and D3 all have the sample size n and of these
only D0 is surely unbiased. An observation in D0 is worth at least as much as
an observation in D2 or D3 and so we require ω0 > max{ω2, ω3}. Figure 5.1
shows this region and the set of 24 weight combinations that we use.
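The weight grid itself is easy to enumerate; the R lines below treat the constraint as ω0 ≥ max{ω2, ω3} with the boundary included (an assumption on my part about how ties and zero weights are handled), which yields 24 combinations as in Figure 5.1.

# All weight triples on the 10% grid of the simplex in which the unbiased
# data set D0 receives the (weakly) largest weight.
grid <- expand.grid(w0 = seq(0, 1, 0.1), w2 = seq(0, 1, 0.1), w3 = seq(0, 1, 0.1))
grid <- grid[abs(grid$w0 + grid$w2 + grid$w3 - 1) < 1e-9 &
             grid$w0 >= pmax(grid$w2, grid$w3), ]
nrow(grid)   # 24 weight combinations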
5.3 Search for tree size
Here we give a brief review of regression trees in order to define our algorithm.
For a full description see the monograph by Breiman et al. (1985). The version
we use is the function rpart (Therneau and Atkinson, 1997) in the R programming
language (R Core Team, 2012).
Fig. 5.1: The left panel shows the simplex of weights applied to data sets D0,
D2 and D3 with the unbiased data set D0 in the lower left. The shaded
region has the valid weights. The right panel shows that region with
points for the 24 weights we use in our algorithm.
Regression trees are built from splits of the set of subjects. A split uses
one of the features in X and creates two subsets based on the values of that
feature. For example it might split males from females or it might split those
with the two smallest education levels from the others. Such a split defines two
subpopulations of our target population and it equally defines two subsamples
of our sample.
A regression tree is a recursively defined set of splits. After the subjects are
split into two groups based on one variable, each of those two groups may then
be split again, using the same or different variables. Recursive splitting of splits
yields a tree structure with subsets of subjects in the leaf nodes. Given a tree,
we predict for subjects by a rule based on the leaf to which they belong. That
rule uses the average within the subject’s leaf node.
The tree is found by a greedy search that minimizes a measure of prediction
error. In our case, the measure R(T) is the sum of squared prediction errors.
By construction any tree with more splits than T has lower error and this brings
a risk of overfitting. To counter overfitting, rpart adds a penalty proportional
to the number |T| of leaves in tree T. The penalized criterion is R(T) + α|T|
where the parameter α > 0 is chosen by M-fold cross-validation. This reduces
the potentially complicated problem of choosing a tree to the simpler problem
of selecting a scalar penalty parameter α.
The rpart function has one option that we have changed from the default.
That parameter is cp, the complexity parameter. The default is 10⁻². The cp
parameter stops tree growing early if a proposed split improves R(T) by less
than a factor of cp. We set cp = 10⁻⁴. Our choice creates somewhat larger trees
to get more choices to use in cross-validation.
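In code, the relevant rpart settings look roughly as follows (an illustrative call on simulated data with placeholder variable names, not the authors' script); note that in the DEIR algorithm of Figure 5.2 the pruning level is ultimately chosen jointly with the weights ω by our own cross-validation rather than by rpart's internal one.

library(rpart)

# Illustrative SSP-like data: incremental reach indicator V and four predictors.
set.seed(4)
ssp <- data.frame(age = sample(1:6, 500, TRUE), gender = rbinom(500, 1, .5),
                  income = sample(1:5, 500, TRUE), education = sample(1:4, 500, TRUE))
ssp$V <- rbinom(500, 1, 0.005 + 0.02 * (6 - ssp$age) / 5)

# Grow a deliberately large tree by lowering cp, with 10-fold internal CV.
fit <- rpart(V ~ age + gender + income + education, data = ssp,
             method = "anova",
             control = rpart.control(cp = 1e-4, xval = 10))   # cp = 10^-4

printcp(fit)                                           # candidate cp values and CV error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)                    # subtree selected by cross-validation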
5.4 The algorithm
Here is a summary of the entire algorithm. First we make the following preprocessing
steps.
1) Fit a large tree T by rpart relating observed incremental reaches V_i to
predictor variables X_i in the SSP data. This tree returns a nested sequence
of subtrees T_0 ⊂ T_1 ⊂ · · · ⊂ T_L ⊂ T. Each T_ℓ corresponds to a critical value
α_ℓ of the penalty. Choosing α_ℓ from this list selects the tree T_ℓ. The value
L is data-dependent, and chosen by rpart.
2) Specify a grid of values ω_g for g = 1, …, G. Here ω_g = (ω_g0, ω_g1, …, ω_gK)
with ω_gk > 0 and Σ_{k=0}^{K} ω_gk = 1.
3) Randomly partition the SSP data (X_i, Y_i, Z_i) into M folds S_m for m = 1, …, M,
each of roughly equal size n/M. For fold m the SSP will contain ∪_{m′≠m} S_{m′}.
We call this S_{−m}. The BRP for fold m is the entire BRP. We
also considered using a bootstrap sample for the fold m BRP, but that was
more expensive and less accurate in our numerical investigation as described
in section A.4 of the Appendix.
After this precomputation, our algorithm proceeds to the cross-validation
shown in Figure 5.2 to make a joint selection of the tree penalty parameter α_ℓ
and the simplex grid point ω_g. Let the chosen values be α* and ω*. We select
the tree T* from step 1 above, corresponding to penalty parameter α*. We treat
each leaf node of T* as a cell c. We translate ω* into the corresponding λ_c in
every cell c of tree T*. Then we minimize (4) using this λ_c and the resulting β̂_c
is our estimate V̂_c of incremental reach in cell c.
After choosing the tuning parameters ωg and α` by cross-validation, we use
these parameters on the whole data set to make our final prediction.
6 Numerical investigation
In order to measure the effect of data enriched estimates on incremental reach,
we conducted a simulation where we knew the ground truth. Our goal is to predict
for ensembles, not for individuals, so we constructed two large populations
in which ground truth was known to us, simulated our process of subsampling
them, and scored predictions against the ground truth incremental reach probabilities.
To make our large samples realistic, we built them from our real data. We
created S- and B-populations by replicating our SSP (respectively BRP) records
100 times each. Then in each simulation, we form an SSP by drawing 6000
observations at random from the S-population, and a BRP by drawing 13,000
observations at random from the B-population.6 Numerical investigation 14
for ℓ = 1, …, L do                              // initialize error sum of squares
  for g = 1, …, G do
    SSE_{ℓ,g} ← 0
for m = 1, …, M do                              // folds
  construct Table 5.1 for fold m, using S_{−m} and B
  fit tree T_m for fold m by rpart
  prune tree T_m to T_{1,m}, …, T_{L,m}, where tree T_{ℓ,m} uses α_ℓ
  for ℓ = 1, …, L do                            // tree sizes
    define cells S_{−m,c} and B_c, c = 1, …, C, from the leaves of T_{ℓ,m}
    for g = 1, …, G do                          // simplex weights
      convert ω_g into λ_g
      for c = 1, …, C do                        // cells
        compute Ṽ_k for k = 0, 2, 3 in cell c
        get V̂_c = β̂_c from the weighted average (5)
        V_c ← (1/|S_{m,c}|) Σ_{i∈S_{m,c}} V_i   // held-out incremental reach
        p_c ← fraction of the true S population in cell c
        SSE_{ℓ,g} ← SSE_{ℓ,g} + p_c (V̂_c − V_c)²

Fig. 5.2: Data enrichment for incremental reach (DEIR) algorithm. After the precomputation
described in Section 5.4 we run this cross-validation algorithm
to choose the complexity parameter α_ℓ and the weights ω_g, as the joint
minimizers ℓ* and g* of SSE_{ℓ,g}. The values p_c come from a census or
from the SSP if the census does not have the variables we need. We
use M = 10.
For each campaign, we apply DEIR with this sample data to estimate the
incremental reach V̂(x). We used 10-fold cross-validation. The mean square
estimation error (MSE) is Σ_x p(x) (V̂(x) − V(x))². This sum is taken over all
x values in the SSP.
The simulation above was repeated 1000 times. The root mean square error
was divided by the true incremental reach to get a relative RMSE.
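In code, the error summary for a single simulated replicate might look like the sketch below (my own illustration); v_hat and v_true are the estimated and ground-truth incremental reach over the distinct x values, p their population weights, and the denominator takes "the true incremental reach" to be the weighted average Σ_x p(x) V(x), which is an assumption on my part.

# Population-weighted mean squared estimation error and the relative RMSE.
relative_rmse <- function(v_hat, v_true, p) {
  mse <- sum(p * (v_hat - v_true)^2)     # MSE = sum_x p(x) (Vhat(x) - V(x))^2
  sqrt(mse) / sum(p * v_true)            # RMSE divided by the true incremental reach
}

relative_rmse(v_hat = c(0.010, 0.016), v_true = c(0.012, 0.015), p = c(0.5, 0.5))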
We consider two comparison methods. The first is to use the SSP only. That
method computes ˆθS within the leaves of a tree. The tree is found by rpart.
The second comparison is a tree fit by rpart to the pooled SSP and BRP data
and using both CIA and IDA. We do not compare to the empirical fractions
because many of them are from empty cells.
Figure 6.1 compares the relative errors in the SSP-only method to data enrichment.
[Figure 6.1 scatterplots: SSP RMSE (%) versus DEIR RMSE (%), one panel per campaign: Beer, Chrome, Salt, Soap 1, Soap 2, Soap 3.]
Fig. 6.1: Performance comparison, SSP only versus data enrichment, predictive
relative mean square errors. There is one panel for each of 6 campaigns
with one point for each of 1000 replicates. The reference line is the
forty-five degree line.
Data enrichment is better than the SSP-only method in the great majority of
replications, in all 6 campaigns we simulated. It is clear that the populations
are similar enough that using the larger data set improves estimation of
incremental reach.
Under the IDA we can pool the SSP and BRP together using rpart on the
combined data to estimate Pr(Z = 1 | X). Under the CIA we can multiply
this estimate by Pr(Y = 0 | X) fit by rpart to the SSP; see Table 5.1 under
the assumption CIA & IDA. This method, as an implementation of statistical
matching, uses two separate applications of rpart, each with its own built-in
cross-validation.
Figure 6.2 compares the relative errors of statistical matching to data enrichment.
Data enrichment is better than statistical matching in the great majority of
replications, in all 6 campaigns we simulated.
We also investigate, for each estimator, how much of the predictive error is
contributed by bias. It is well known that predictive mean square error can
be decomposed as the sum of variance and squared bias. These quantities
are typically unknown in practice, but can be evaluated in simulation studies.
[Figure 6.2 scatterplots: Pool RMSE (%) versus DEIR RMSE (%), one panel per campaign: Beer, Chrome, Salt, Soap 1, Soap 2, Soap 3.]
Fig. 6.2: Performance comparison, statistical matching (data pooling) versus
data enrichment, predictive relative mean square errors. There is one
panel for each of 6 campaigns with one point for each of 1000 replicates.
The reference line is the forty-five degree line.
Table 6.1 reports the fractions of squared bias in predictive mean square errors
for each method in all six studies. We see there that the error for statistical
matching (data pooling) is dominated by bias while the error for SSP only is
dominated by variance. These results are not surprising because the SSP only
method has no sampling bias (only algorithmic bias) while the pooled data set
has maximal sampling bias. The proportion of bias for DEIR is in between these
extremes. Here we have less population bias than a typical data fusion situation
because the TV and online-only panels were recruited in the same way. The
bottom of Table 6.1 shows that DEIR is able to trade off bias and variance more
effectively than SSP only or data pooling, because DEIR attains the smallest
predictive mean squared error.
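The decomposition reported in Table 6.1 can be recovered directly from the simulation replicates. A minimal sketch for a single x value (aggregating over cells with the p(x) weights is analogous); the function name is ours, not the paper's:

```python
import numpy as np

def bias_variance(estimates, truth):
    """Split the replicate-level MSE of an estimator into squared bias and
    variance; `estimates` holds the V-hat values over simulation replicates."""
    estimates = np.asarray(estimates, dtype=float)
    bias2 = (estimates.mean() - truth) ** 2
    var = estimates.var()
    mse = bias2 + var            # equals np.mean((estimates - truth) ** 2)
    return bias2 / mse, mse      # fraction of MSE due to bias, and total MSE
```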
bias²/mse   Beer   Chrome   Salt   Soap 1   Soap 2   Soap 3
SSP         0.35   0.42     0.26   0.12     0.28     0.12
Pool        0.88   0.82     0.88   0.88     0.88     0.93
DEIR        0.49   0.59     0.47   0.33     0.47     0.39

mse         Beer   Chrome   Salt   Soap 1   Soap 2   Soap 3
SSP         1.02   7.76     0.89   0.84     1.26     0.66
Pool        0.82   7.39     0.80   0.86     1.12     0.78
DEIR        0.61   5.42     0.48   0.52     0.68     0.42

Tab. 6.1: The upper rows show the fraction bias²/mse of the mean squared
prediction error due to bias for 3 methods to estimate incremental
reach in 6 campaigns. The lower rows show the total mse, that is
bias² + var.

Conclusions
Predictions of incremental reach can be improved by making use of additional
data. That improvement comes only if certain strong assumptions are true or at
least approximately true. Our only guide to the accuracy of those assumptions
may come from the data themselves. Our data enriched incremental reach
estimate uses a shrinkage strategy to pool estimates using different assumptions.
Cross-validating the level of pooling gave us an algorithm that worked better
than either ignoring the additional data or treating it the same as the unbiased
data.
Acknowledgment
This project was not part of Art Owen's Stanford responsibilities; he participated
as a consultant at Google. The authors would like to thank Penny
Chu, Tony Fagan, Yijia Feng, Jerome Friedman, Yuxue Jin, Daniel Meyer, Jeffrey
Oldham and Hal Varian for support and constructive comments.
References
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1985). Classification
and Regression Trees. Chapman & Hall/CRC, Boca Raton, FL.
Chen, A., Owen, A. B., and Shi, M. (2013). Data enriched linear regression.
Technical report, Google. http://arxiv.org/abs/1304.1837.
Collins, J. and Doe, P. (2009). Developing an integrated television, print and
consumer behavior database from national media and purchasing currency
data sources. In Worldwide Readership Symposium, Valencia.
Doe, P. and Kudon, D. (2010). Data integration in practice: connecting currency
and proprietary data to understand media use. ARF Audience Measurement
5.0.
D’Orazio, M., Di Zio, M., and Scanu, M. (2006). Statistical Matching: Theory
and Practice. Wiley, Chichester, UK.
Gilula, Z., McCulloch, R. E., and Rossi, P. E. (2006). A direct approach to data
fusion. Journal of Marketing Research, XLIII:73–83.
Jin, Y., Shobowale, S., Koehler, J., and Case, H. (2012). The incremental reach
and cost efficiency of online video ads over TV ads. Technical report, Google.
Lehmann, E. L. and Romano, J. P. (2005). Testing statistical hypotheses.
Springer, New York, Third edition.
Little, R. J. A. and Rubin, D. B. (2009). Statistical Analysis with Missing Data.
John Wiley & Sons Inc., Hoboken, NJ, 2nd edition.
R Core Team (2012). R: A Language and Environment for Statistical Computing.
R Foundation for Statistical Computing, Vienna, Austria. ISBN
3-900051-07-0.
Rässler, S. (2004). Data fusion: identification problems, validity, and multiple
imputation. Austrian Journal of Statistics, 33(1&2):153–171.
Singh, A. C., Mantel, H., Kinack, M., and Rowe, G. (1993). Statistical matching:
Use of auxiliary information as an alternative to the conditional independence
assumption. Survey Methodology, 19:59–79.
Stein, C. M. (1956). Inadmissibility of the usual estimator for the mean of a multivariate
normal distribution. In Proceedings of the Third Berkeley symposium
on mathematical statistics and probability, volume 1, pages 197–206.
Stein, C. M. (1981). Estimation of the mean of a multivariate normal distribution.
The Annals of Statistics, 9(6):1135–1151.
The Nielsen Company (2011). The cross-platform report. Quarter 2, U.S.
Therneau, T. M. and Atkinson, E. J. (1997). An introduction to recursive
partitioning using the RPART routines. Technical Report 61, Mayo Clinic.
A Appendix
A.1 Variance reduction by IDA
Recall that f = n/(n + N) and F = N/(n + N) are sample size proportions of
the two data sets. Under the IDA we may estimate incremental reach by
$$\hat\theta_I = (f\bar Z_S + F\bar Z_B)\,\frac{\bar V_S}{\bar Z_S} = \bar V_S\Bigl(f + F\,\frac{\bar Z_B}{\bar Z_S}\Bigr).$$
By the delta method (Lehmann and Romano, 2005), $\mathrm{var}(\hat\theta_I)$ is approximately

$$\mathrm{var}(\hat\theta_I) \approx \mathrm{var}(\bar V_S)\Bigl(\frac{\partial\hat\theta_I}{\partial\bar V_S}\Bigr)^2 + \mathrm{var}(\bar Z_B)\Bigl(\frac{\partial\hat\theta_I}{\partial\bar Z_B}\Bigr)^2 + \mathrm{var}(\bar Z_S)\Bigl(\frac{\partial\hat\theta_I}{\partial\bar Z_S}\Bigr)^2 + 2\,\mathrm{cov}(\bar V_S,\bar Z_S)\,\frac{\partial\hat\theta_I}{\partial\bar V_S}\,\frac{\partial\hat\theta_I}{\partial\bar Z_S},$$

with partial derivatives evaluated with expectations $E(\bar V_S)$, $E(\bar Z_S)$, and $E(\bar Z_B)$
replacing the corresponding random quantities. The other two covariances are
zero because the S and B samples are independent.

From the binomial distribution we have $\mathrm{var}(\bar V_S) = \theta(1-\theta)/n$, $\mathrm{var}(\bar Z_B) = p_z(1-p_z)/N$ and $\mathrm{var}(\bar Z_S) = p_z(1-p_z)/n$. Also

$$\mathrm{cov}(\bar V_S,\bar Z_S) = \frac{1}{n}\bigl(E(V_iZ_i) - E(V_i)E(Z_i)\bigr) = \theta(1-p_z)/n.$$

After some calculus,

$$\mathrm{var}(\hat\theta_I) \approx \frac{\theta(1-\theta)}{n} + \frac{p_z(1-p_z)}{N}\,\frac{\theta^2F^2}{p_z^2} + \frac{p_z(1-p_z)}{n}\,\frac{\theta^2F^2}{p_z^2} - 2\,\frac{\theta(1-p_z)}{n}\,\frac{\theta F}{p_z}$$
$$= \mathrm{var}(\hat\theta_S) + \frac{\theta^2F(1-p_z)}{p_z}\Bigl(\frac{F}{N} + \frac{F}{n} - \frac{2}{n}\Bigr) = \mathrm{var}(\hat\theta_S) - \frac{\theta^2F(1-p_z)}{p_z}\,\frac{1}{n} = \mathrm{var}(\hat\theta_S)\Bigl(1 - F\,\frac{1-p_z}{p_z}\,\frac{\theta}{1-\theta}\Bigr).$$
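The variance-reduction factor above is easy to check by simulation; the parameter values below are illustrative, not the paper's campaign values:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 2000, 10000                     # SSP and BRP sizes (illustrative)
f, F = n / (n + N), N / (n + N)
pz, theta = 0.4, 0.25                  # P(Z=1) and incremental reach, theta <= pz
R = 20000                              # simulation replicates

theta_S, theta_I = np.empty(R), np.empty(R)
for r in range(R):
    Z_S = rng.random(n) < pz
    V_S = Z_S & (rng.random(n) < theta / pz)    # V = 1 only when Z = 1
    Z_B = rng.random(N) < pz
    theta_S[r] = V_S.mean()
    theta_I[r] = (f * Z_S.mean() + F * Z_B.mean()) * V_S.mean() / Z_S.mean()

print(theta_I.var() / theta_S.var(),
      1 - F * (1 - pz) / pz * theta / (1 - theta))   # the two should agree closely
```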
A.2 Variance reduction by CIA
Applying the delta method to $\hat\theta_C = \bar Z_S(1-\bar Y_S)$, we find that

$$\mathrm{var}(\hat\theta_C) \approx \mathrm{var}(\bar Z_S)\Bigl(\frac{\partial\hat\theta_C}{\partial\bar Z_S}\Bigr)^2 + \mathrm{var}(\bar Y_S)\Bigl(\frac{\partial\hat\theta_C}{\partial\bar Y_S}\Bigr)^2 + 2\,\mathrm{cov}(\bar Y_S,\bar Z_S)\,\frac{\partial\hat\theta_C}{\partial\bar Y_S}\,\frac{\partial\hat\theta_C}{\partial\bar Z_S} = \mathrm{var}(\bar Z_S)(1-p_y)^2 + \mathrm{var}(\bar Y_S)\,p_z^2 + 2\,\mathrm{cov}(\bar Y_S,\bar Z_S)(1-p_y)p_z.$$

Here $\mathrm{var}(\bar Z_S) = p_z(1-p_z)/n$, $\mathrm{var}(\bar Y_S) = p_y(1-p_y)/n$, and under conditional
independence $\mathrm{cov}(\bar Y_S,\bar Z_S) = 0$. Thus

$$\mathrm{var}(\hat\theta_C) = \frac{1}{n}\bigl(p_z(1-p_z)(1-p_y)^2 + p_y(1-p_y)p_z^2\bigr) = \frac{p_z(1-p_y)}{n}\bigl((1-p_z)(1-p_y) + p_yp_z\bigr).$$

When the CIA holds, $\theta = p_z(1-p_y)$. Note that $\mathrm{var}(\hat\theta_S) = \theta(1-\theta)/n$. After
some algebraic simplification we find that

$$\frac{\mathrm{var}(\hat\theta_C)}{\mathrm{var}(\hat\theta_S)} = 1 - \frac{p_y(1-p_z)}{1-\theta}.$$
A.3 Variance reduction by CIA and IDA
When both assumptions hold we can estimate θ by

$$\hat\theta_{I,C} = (f\bar Z_S + F\bar Z_B)(1-\bar Y_S).$$

Under these assumptions, $\bar Z_S$, $\bar Z_B$ and $\bar Y_S$ are all independent, and $\mathrm{var}(\hat\theta_{I,C})$
equals

$$\mathrm{var}(\bar Z_S)\Bigl(\frac{\partial\hat\theta_{I,C}}{\partial\bar Z_S}\Bigr)^2 + \mathrm{var}(\bar Z_B)\Bigl(\frac{\partial\hat\theta_{I,C}}{\partial\bar Z_B}\Bigr)^2 + \mathrm{var}(\bar Y_S)\Bigl(\frac{\partial\hat\theta_{I,C}}{\partial\bar Y_S}\Bigr)^2 = \frac{p_z(1-p_z)}{n}f^2(1-p_y)^2 + \frac{p_z(1-p_z)}{N}F^2(1-p_y)^2 + \frac{p_y(1-p_y)}{n}p_z^2 = \frac{p_z(1-p_y)}{n}\bigl(f(1-p_y)(1-p_z) + p_yp_z\bigr)$$

after some simplification. As a result

$$\frac{\mathrm{var}(\hat\theta_{I,C})}{\mathrm{var}(\hat\theta_C)} = \frac{f(1-p_y)(1-p_z) + p_yp_z}{(1-p_y)(1-p_z) + p_yp_z}.$$
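Under the CIA, the three reductions can be compared directly; below is a small helper evaluating the closed forms above (the parameter values are illustrative only):

```python
def variance_ratios(pz, py, n, N):
    """Return var(theta_I)/var(theta_S), var(theta_C)/var(theta_S), and
    var(theta_{I,C})/var(theta_C) from the delta-method formulas above."""
    f, F = n / (n + N), N / (n + N)
    theta = pz * (1 - py)                       # incremental reach under the CIA
    r_I = 1 - F * (1 - pz) / pz * theta / (1 - theta)
    r_C = 1 - py * (1 - pz) / (1 - theta)
    r_IC = (f * (1 - py) * (1 - pz) + py * pz) / ((1 - py) * (1 - pz) + py * pz)
    return r_I, r_C, r_IC

print(variance_ratios(pz=0.4, py=0.375, n=2000, N=10000))
```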
A.4 Alternative algorithms
We faced some design choices in our algorithm. First, we had to decide which
estimators to include. We always include the unbiased choice
$\hat\theta_S$ as well as two others. Second, we had to decide whether to use the entire
BRP or to bootstrap sample it. We ran all six choices on simulations of all six
data sets where we knew the correct answer. Table A.1 shows the mean squared
errors for the six possible estimators on each of the six data sets. In every case
we divided the mean squared error by that for the estimator combining $\hat\theta_S$, $\hat\theta_C$,
and $\hat\theta_{I,C}$ without the bootstrap. We only see small differences, but the evidence
favors choosing λI = 0 as well as not bootstrapping.
Our default method is consistently the best in this table, although only by a
small amount. We saw that data enrichment is consistently better than either
pooling the data or ignoring the large sample, and by much larger amounts than
we see in Table A.1. As a result, any of the data enrichment methods in this
table would make a big improvement over either pooling the samples or ignoring
the BRP.
Estimators   $\hat\theta_S,\hat\theta_I,\hat\theta_C$   $\hat\theta_S,\hat\theta_I,\hat\theta_{I,C}$   $\hat\theta_S,\hat\theta_C,\hat\theta_{I,C}$
BRP          All     Boot          All     Boot            All     Boot
Beer         1.02    1.02          1.00    1.01            1       1.01
Chrome       1.04    1.04          1.01    1.01            1       1.00
Salt         1.04    1.04          1.01    1.01            1       1.01
Soap 1       1.04    1.05          1.01    1.02            1       1.00
Soap 2       1.05    1.05          1.01    1.03            1       1.01
Soap 3       1.02    1.02          1.01    1.00            1       1.00

Tab. A.1: Relative performance of our estimators on six problems. The relative
errors are mean squared prediction errors normalized to the case that
uses $\hat\theta_S$, $\hat\theta_C$, $\hat\theta_{I,C}$ without bootstrapping. The relative error for that
case is 1 by definition.
Coupled and k-Sided Placements: Generalizing
Generalized Assignment
Madhukar Korupolu1
, Adam Meyerson1
, Rajmohan Rajaraman2
, and Brian
Tagiku1
1 Google, 1600 Amphitheater Parkway, Mountain View, CA. Email:
{mkar,awmeyerson,btagiku}@google.com
2 Northeastern University, Boston, MA 02115. Email: rraj@ccs.neu.edu
Abstract. In modern data centers and cloud computing systems, jobs
often require resources distributed across nodes providing a wide variety
of services. Motivated by this, we study the Coupled Placement problem,
in which we place jobs into computation and storage nodes with capacity
constraints, so as to optimize some costs or profits associated with
the placement. The coupled placement problem is a natural generalization
of the widely-studied generalized assignment problem (GAP), which
concerns the placement of jobs into single nodes providing one kind of
service. We also study a further generalization, the k-Sided Placement
problem, in which we place jobs into k-tuples of nodes, each node in a
tuple offering one of k services.
For both the coupled and k-sided placement problems, we consider minimization
and maximization versions. In the minimization versions (MinCP
and MinkSP), the goal is to achieve minimum placement cost, while incurring
a minimum blowup in the capacity of the individual nodes. Our
first main result is an algorithm for MinkSP that achieves optimal cost
while increasing capacities by at most a factor of k + 1, also yielding the
first constant-factor approximation for MinCP. In the maximization versions
(MaxCP and MaxkSP), the goal is to maximize the total weight
of the jobs that are placed under hard capacity constraints. MaxkSP
can be expressed as a k-column sparse integer program, and can be approximated
to within an O(k) factor using randomized rounding
of a linear program relaxation. We consider alternative combinatorial
algorithms that are much more efficient in practice. Our second main
result is a local search based approximation algorithm that yields a 15-approximation
and an $O(k^3)$-approximation for MaxCP and MaxkSP, respectively.
Finally, we consider an online version of MaxkSP and present
algorithms that achieve logarithmic competitive ratio under certain necessary
technical assumptions.
1 Introduction
The data center has become one of the most important assets of a modern business.
Whether it is a private data center for exclusive use or a shared public cloud
data center, the size and scale of the data center continues to rise. As a company grows, so too must its data center to accommodate growing computational, storage
and networking demand. However, the new components purchased for this
expansion need not be the same as the components already in place. Over time,
the data center becomes quite heterogeneous [1]. This complicates the problem
of placing jobs within the data center so as to maximize performance.
Jobs often require resources of more than one type: for example, compute and
storage. Modern data centers typically separate computation from storage and
interconnect the two using a network of switches. As such, when placing a job
within a data center, we must decide which computation node and which storage
node will serve the job. If we pick nodes that are far apart, then communication
latency may become too prohibitive. On the other hand, nodes are capacitated,
so picking nodes close together may not always be possible.
Most prior work in data center resource management is focussed on placing
one type of resource at a time: e.g., placing storage requirements assuming job
compute location is fixed [2, 3] or placing compute requirements assuming job
storage location is fixed [4, 5]. One sided placement methods cannot suitably
take advantage of the proximities and heterogeneities that exist in modern data
centers. For example, a database analytics application requiring high throughput
between its compute and storage elements can benefit by being placed on a
storage node that has a nearby available compute node.
In this paper, we study Coupled Placement (CP), which is the problem of
placing jobs into computation and storage nodes with capacity constraints, so as
to optimize costs or profits associated with the placement. Coupled placement
was first addressed in [6] in a setting where we are required to place all jobs
and we wish to minimize the communication latency over all jobs. They show
that this problem, which we call MinCP, is NP-hard and investigate the performance
of heuristic solutions. Another natural formulation is where the goal
is to maximize the total number of jobs or revenue generated by the placement,
subject to capacity constraints. We refer to this problem as MaxCP. We also
study a generalization of Coupled Placement, the k-Sided Placement Problem
(kSP), which considers k ≥ 2 kinds of resources.
1.1 Problem definition
In the coupled placement problem, we are given a bipartite graph G = (U, V, E)
where U is a set of compute nodes and V is a set of storage nodes. We have
capacity functions C : U → R and S : V → R for the compute and storage
nodes, respectively. We are also given a set T of jobs, each of which needs to be
allocated to one compute node and one storage node. Each job may prefer some
compute-storage node pairs more than others, and may also consume different
resources at different nodes. To capture these heterogeneities, we have for each
job j a function fj : E → R, a processing requirement pj : E → R and a
storage requirement sj : E → R. We note that without loss of generality, we
can assume that the capacities are unit, since we can scale the processing and
storage requirements of individual nodes accordingly.

We consider two versions of the coupled placement problem. For the maximization
version MaxCP, we view $f_j$ as a payment function. Our goal is to
select a subset $A \subseteq T$ of jobs and an assignment $\sigma : A \to E$ such that all capacities
are observed and our total profit $\sum_{j\in A} f_j(\sigma(j))$ is maximized. For the
minimization version MinCP, we view $f_j$ as a cost function. Our goal is to find
an assignment $\sigma : T \to E$ such that all capacities are observed and our total
cost $\sum_{j\in T} f_j(\sigma(j))$ is minimized.
A generalization of the coupled placement problem is k-sided placement
(kSP), in which we have k different sets of nodes, S1, . . . , Sk, each set of nodes
providing a distinct service. For each i, we have a capacity function $C_i : S_i \to \mathbb{R}$
that gives the capacity of a node in $S_i$ to provide the $i$th service. We are given
a set T of jobs, each of which needs each kind of service; the exact resource
needs may depend on the particular k-tuple of nodes from $\prod_i S_i$ to which it is
assigned. That is, for each job j, we have a demand function $d_j : \prod_i S_i \to \mathbb{R}^k$.
We also have another function $f_j : \prod_i S_i \to \mathbb{R}$. As for coupled placement, we can
assume that the capacities are unit, since we can scale the demands of individual
nodes accordingly. Similar to coupled placement, we consider two versions of
kSP, MinkSP and MaxkSP.
1.2 Our Results
All of the variants of CP and kSP are NP-hard, so our focus is on approximation
algorithms. Our first set of results consists of the first non-trivial approximation
algorithms for MinCP and MinkSP. Under hard capacity constraints, it is easy
to see that it is NP-hard to achieve any bounded approximation ratio to cost
minimization. So we consider approximation algorithms that incur a blowup in
capacity. We say that an algorithm is α-approximate for the minimization version
if its cost is at most that of an optimal solution, while incurring a blowup factor
of at most α in the capacity of any node.
– We present a (k + 1)-approximation algorithm for MinkSP using iterative
rounding, yielding a 3-approximation for MinCP.
We next consider the maximization version. MaxkSP can be expressed as a
k-column sparse integer packing program (k-CSP). From this, it is immediate
that MaxkSP can be approximated to within an O(k) approximation factor
by applying randomized rounding to a linear programming relaxation [7]. An
Ω(k/ log k)-inapproximability result for k-set packing due to [16] implies the
same hardness result for MaxkSP. Our second main result is a simpler approximation
algorithm for MaxCP and MaxkSP based on local search.
– We present a local search based 15-approximation algorithm for MaxCP.
We extend it to MaxkSP and obtain an $O(k^3)$-approximation.
The local search result applies directly to a version where we can assign tasks
fractionally but only to a single pair of machines (this is like assigning a task
with lower priority and may have additional applications). We then describe a simple rounding scheme to obtain an integral version. The rounding technique
involves establishing a one-to-one correspondence between fractional assignments
and machines. This is much like the cycle-removing rounding for GAP; there is
a crucial difference, however, since coupled and k-sided placements assign jobs
to tuples of machines.
Finally, we study the online version of MaxCP, in which tasks arrive online
and must be irrevocably assigned or rejected immediately upon arrival.
– We extend the techniques of [8] to the case where the capacity requirement
for a job is arbitrarily machine-dependent. This enables us to achieve
competitive ratio logarithmic in the ratio of best to worst value-per-capacity
density, under necessary technical assumptions about the maximum job size.
1.3 Related Work
The coupled and k-sided placement problems are natural generalizations of the
Generalized Assignment Problem (GAP), which can be viewed as a 1-sided placement
problem. In GAP, which was first introduced by Shmoys and Tardos [9],
the goal is to assign items of various sizes to bins of various capacities. A subset of
items is feasible for a bin if their total size is no more than the bin’s capacity.
If we are required to assign all items and minimize our cost (MinGAP), Shmoys
and Tardos [9] give an algorithm for computing an assignment that achieves optimal
cost while doubling the capacities of each bin. A previous result by Lenstra
et al. [10] for scheduling on unrelated machines shows it is NP-hard to achieve
optimal cost without incurring a capacity blowup of at least 3/2. On the other
hand, if we wish to maximize our profit and are allowed to leave items unassigned
(MaxGAP), Chekuri and Khanna [11] observe that the (1, 2)-approximation for
MinGAP implies a 2-approximation for MaxGAP. This can be improved to an
$\frac{e}{e-1}$-approximation using LP-based techniques [12]. It is known that MaxGAP
is APX-hard [11], though no specific constant of hardness is shown.
On the experimental side, most prior work in data center resource management
focusses on placing one type of resource at a time: for example, placing
storage requirements assuming job compute location is fixed (file allocation problem
[2], [13, 14, 3]) or placing compute requirements assuming job storage location
is fixed [4, 5]. These in a sense are variants of GAP. The only prior work
on Coupled Placement is [6], where they show that MinCP is NP-hard and experimentally
evaluate heuristics: in particular, a fast approach based on stable
marriage and knapsacks is shown to do well in practice, close to the LP optimal.
The MaxkSP problem is related to the recently studied hypermatching assignment
problem (HAP) [15], and its special cases, including k-set packing and a
uniform version of the problem. A (k + 1 + ε)-approximation is given for HAP
in [15], where other variants of HAP are also studied. While the MaxkSP problem
can be viewed as a variant of HAP, there are critical differences. For instance,
in MaxkSP, each task is assigned at most one tuple, while in the hypermatching
problem each client (or task) is assigned a subset of the hyperedges. Hence, the
MaxkSP and HAP problems are not directly comparable. The k-set packing can be captured as a special case of MaxkSP, and hence the Ω(k/ log k)-hardness
due to [16] applies to MaxkSP as well.
2 The minimization version
Next, we consider the minimization version of the Coupled Placement problem,
MinCP. We write the following integer linear program for MinCP, where xtuv
is the indicator variable for the assignment of t to pair (u, v), u ∈ U, v ∈ V .
Minimize: $\sum_{t,u,v} x_{tuv}\, f_t(u,v)$
Subject to: $\sum_{u,v} x_{tuv} \ge 1$, $\forall t \in T$,
$\sum_{t,v} p_t(u,v)\, x_{tuv} \le c_u$, $\forall u \in U$,
$\sum_{t,u} s_t(u,v)\, x_{tuv} \le d_v$, $\forall v \in V$,
$x_{tuv} \in \{0,1\}$, $\forall t \in T,\ u \in U,\ v \in V$.
We refer to the first set of constraints as satisfaction constraints, the second and
third set as capacity constraints (processing and storage). We consider the linear
relaxation of this program which replaces the integrality constraints above with
0 ≤ xtuv ≤ 1, ∀t ∈ T, u ∈ U, v ∈ V .
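The relaxation is a small dense LP and can be assembled directly. Below is a sketch (not the authors' implementation) using SciPy, with jobs, compute nodes and storage nodes indexed 0..T-1, 0..U-1 and 0..V-1:

```python
import numpy as np
from scipy.optimize import linprog

def solve_mincp_relaxation(f, p, s, c, d):
    """LP relaxation of MinCP.

    f, p, s: arrays of shape (T, U, V) with cost, processing and storage
    requirements of job t on pair (u, v); c, d: node capacity vectors.
    """
    T, U, V = f.shape
    nvar = T * U * V
    idx = lambda t, u, v: (t * U + u) * V + v

    A_ub, b_ub = [], []
    for t in range(T):                       # satisfaction: sum_{u,v} x_{tuv} >= 1
        row = np.zeros(nvar)
        for u in range(U):
            for v in range(V):
                row[idx(t, u, v)] = -1.0
        A_ub.append(row); b_ub.append(-1.0)
    for u in range(U):                       # processing capacity of node u
        row = np.zeros(nvar)
        for t in range(T):
            for v in range(V):
                row[idx(t, u, v)] = p[t, u, v]
        A_ub.append(row); b_ub.append(c[u])
    for v in range(V):                       # storage capacity of node v
        row = np.zeros(nvar)
        for t in range(T):
            for u in range(U):
                row[idx(t, u, v)] = s[t, u, v]
        A_ub.append(row); b_ub.append(d[v])

    res = linprog(f.reshape(-1), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=(0, 1), method="highs")
    return res.x.reshape(T, U, V)
```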
2.1 A 3-approximation algorithm for MinCP
We now present algorithm IterRound, based on iterative rounding [21], which
achieves a 3-approximation for MinCP. We start with a basic algorithm that
achieves a 5-approximation by identifying tight constraints with a small number
of variables. The algorithm repeats the following round until all variables have
been rounded.
1 Extreme point: Compute an extreme point solution x to the current LP.
2 Eliminate variable or constraint: Execute one of these two steps. By
Lemma 3, one of these steps can always be executed if the LP is nonempty.
a Remove from the LP all variables xtuv that take the value 0 or 1 in x. If xtuv
is 1, then assign job t to the pair (u, v), remove the job t and its associated
variables from the LP, and reduce cu by pt(u, v) and dv by st(u, v).
b Remove from the LP any tight capacity constraint with at most 4 variables.
Fix an iteration of the algorithm, and an extreme point x. Let nt, nc, and ns denote
the number of tight task satisfaction constraints, computation constraints,
and storage constraints, respectively, in x. Note that every task satisfaction constraint
can be assumed to be tight, without loss of generality. Let N denote the
number of variables in the LP. Since x is an extreme point, if all variables in x
take values in (0, 1), then we have $N = n_t + n_c + n_s$.

Lemma 1. If all variables in x take values in (0, 1), then $n_t \le N/2$.
Proof. Since a variable only occurs once over all satisfaction constraints, if nt >
N/2, there exists a satisfaction constraint that has exactly one variable. But
then, this variable needs to take value 1, a contradiction.
Lemma 2. If nt ≤ N/2, then there exists a tight capacity constraint that has at
most 4 variables.
Proof. If nt ≤ N/2, then ns + nc = N − nt ≥ N/2. Since each variable occurs
in at most one computation constraint and at most one storage constraint, the
total number of variable occurrences over all tight storage and computation
constraints is at most 2N, which is at most 4(ns +nc). This implies that at least
one of these tight capacity constraints has at most 4 variables.
Using Lemmas 1 and 2, we can argue that the above algorithm yields a 5-
approximation. Step 2a does not cause any increase in cost or capacity. Step 2b
removes a constraint, hence cannot increase cost; since the removed constraint
has at most 4 variables, the total demand allocated on the relevant node is at
most the demand of four tasks plus the capacity already used in earlier iterations.
Since each task demand is at most the capacity of the node, we obtain a 5-
approximation with respect to capacity.
Studying the proof of Lemma 2 more closely, one can separate the case nt <
N/2 from the nt = N/2; in the former case, one can, in fact, show that there
exists a tight capacity constraint with at most 3 variables. Together with a
careful consideration of the nt = N/2 case, one can improve the approximation
factor to 4. We now present an alternative selection of tight capacity constraint
that leads to a 3-approximation. One interesting aspect of this step is that the
constraint being selected may not have a small number of variables. We replace
step 2b by the following.
2b Remove from the LP any tight capacity constraint in which the number of
variables is at most two more than the sum of the values of the variables.
Lemma 3. If all variables in x take values in (0, 1), then there exists a tight
capacity constraint in which the number of variables is at most two more than
the sum of the values of the variables.
Proof. Since each variable occurs in at most two tight capacity constraints, the
total number of occurrences of all variables across the tight capacity constraints
is 2N − s for some nonnegative integer s. Since each satisfaction constraint is
tight, each variable appears in 2 capacity constraints, and each variable takes on
value less than 1, the sum of all the variables over the tight capacity constraints
is at least 2nt − s. Therefore, the sum, over all tight capacity constraints, of
the difference between the number of variables and their sum is at most 2(N −
nt). Since there are N − nt tight capacity constraints, for at least one of these
constraints, the difference between the number of variables and their sum is at
most 2.

Lemma 4. Let u be a node with a tight capacity constraint, in which the number
of variables is at most 2 more than the sum of the variables. Then, the sum of the
capacity requirements of the tasks partially assigned to u is at most the current
available capacity of u plus twice the capacity of u.
Proof. Let $\ell$ be the number of variables in the constraint for u, and let the
associated tasks be numbered 1 through $\ell$. Let the demand of task j for the
capacity of node u be $d_j$. Then, the capacity constraint for u is $\sum_j d_j x_j = \hat c(u)$,
where $\hat c(u)$ is the available capacity of u in the current LP.
We know that $\ell - \sum_j x_j \le 2$. Since $d_j \le C(u)$, the capacity of u,

$$\sum_j d_j = \hat c(u) + \sum_{j=1}^{\ell}(1 - x_j)\,d_j \le \hat c(u) + \Bigl(\ell - \sum_{j=1}^{\ell} x_j\Bigr)C(u) \le \hat c(u) + 2\,C(u).$$
Theorem 1. IterRound is a polynomial-time 3-approximation algorithm for
MinCP.
Proof. By Lemma 3, each iteration of the algorithm removes either a variable or a
constraint from the LP. Hence the algorithm is polynomial time. The elimination
of a variable that takes value 0 or 1 does not change the cost. The elimination
of a constraint can only decrease cost, so the final solution has cost no more
than the value achieved by the original LP. Finally, when a capacity constraint
is eliminated, by Lemma 4, we incur a blowup of at most 3 in capacity.
2.2 A (k + 1)-approximation algorithm for MinkSP
It is straightforward to generalize the algorithm of the preceding section to
obtain a (k + 1)-approximation to MinkSP. We first set up the integer LP for
MinkSP. For a given element $e \in \prod_i S_i$, we use $e_i$ to denote the $i$th coordinate
of $e$. Let $x_{te}$ be the indicator variable that $t$ is assigned to $e \in \prod_i S_i$.
Minimize: $\sum_{t,e} x_{te}\, f_t(e)$
Subject to: $\sum_{e} x_{te} \ge 1$, $\forall t \in T$,
$\sum_{t,e:\, e_i = u} (d_t(e))_i\, x_{te} \le C_i(u)$, $\forall\, 1 \le i \le k,\ u \in S_i$,
$x_{te} \in \{0,1\}$, $\forall t \in T,\ e \in \prod_i S_i$.
The algorithm, which we call IterRound(k), is identical to IterRound of
Section 2.1 except that step 2b is replaced by the following.
2b Remove from the LP any tight capacity constraint in which the number of
variables is at most k more than the sum of the values of the variables.
The claims and proofs are almost identical to the k = 2 case and are moved to
Appendix A. A natural question to ask is whether a linear approximation factor
of MinkSP is unavoidable for polynomial-time algorithms. Unfortunately, we do not have any non-trivial results in this direction. We have been able to show that
the MinkSP linear program has an integrality gap that grows as Ω(log k/ log log k)
(see Appendix A).
3 The maximization problems
We present approximation algorithms for the maximization versions of coupled
placement and k-sided placement problems. We first observe, in Section 3.1,
that these problems reduce to column sparse integer packing. We next present,
in Section 3.2, an alternative combinatorial approach based on local search.
3.1 An LP-based approximation algorithm
One can write a positive integer linear program for MaxCP. Let $x_{tuv}$ denote
the indicator variable for the assignment of job t to the pair (u, v), $u \in U$,
$v \in V$. The goal is then to
Maximize: $\sum_{t,u,v} x_{tuv}\, f_t(u,v)$
Subject to: $\sum_{u,v} x_{tuv} \le 1$, $\forall t \in T$,
$\sum_{t,v} p_t(u,v)\, x_{tuv} \le c_u$, $\forall u \in U$,
$\sum_{t,u} s_t(u,v)\, x_{tuv} \le d_v$, $\forall v \in V$,
$x_{tuv} \in \{0,1\}$, $\forall t \in T,\ u \in U,\ v \in V$.
Note that we can deal with capacities on u, v by scaling the pt(u, v) and st(u, v)
values appropriately. The above LP can be easily extended to MaxkSP (see
Appendix B). These linear programs are 3- and k-column sparse packing programs,
respectively, and can be approximated to within a factor of 15.74 and
ek + o(k), respectively using a clever randomized rounding approach. We next
give a combinatorial approach based on local search which is likely to be much
more efficient in practice.
3.2 Approximation algorithms based on local search
Before giving the details, we start with a few helpful definitions. For any u ∈ U,
$F_u = \sum_{t,v} x_{tuv} f_t(u,v)$. Similarly, for any $v \in V$, $F_v = \sum_{t,u} x_{tuv} f_t(u,v)$. We set
$\mu = \frac{1}{n}\max_{t,u,v} f_t(u,v)$. It follows that the optimum solution is at least $n\mu$ and
at most $n^2\mu$.
The local search algorithm will maintain the following two invariants: (1)
For each t, there is at most one pair (u, v) for which xtuv > 0; (2) All the linear
program inequalities hold. It’s easy to set an initial state where the invariant
holds (all xtuv = 0). The local search algorithm proceeds in the following steps:
While $\exists\, t,u,v:\ f_t(u,v) > F_u\,\frac{p_t(u,v)}{c_u} + F_v\,\frac{s_t(u,v)}{d_v} + \sum_{u',v'} x_{tu'v'}\, f_t(u',v') + \epsilon\mu$:
1. Set $x_{tuv} = 1$ and set $x_{tu'v'} = 0$ for all $(u',v') \ne (u,v)$.
2. While $\sum_{t,v} p_t(u,v)\,x_{tuv} > c_u$, reduce $x_{tuv}$ for the job with minimum $c_u f_t(u,v)/p_t(u,v)$
such that $x_{tuv} > 0$.
3. While $\sum_{t,u} s_t(u,v)\,x_{tuv} > d_v$, reduce $x_{tuv}$ for the job with minimum $d_v f_t(u,v)/s_t(u,v)$
such that $x_{tuv} > 0$.
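As a sanity check of the procedure above, here is a compact, unoptimized Python sketch of the fractional local search with capacities scaled to 1; the eps parameter (the additive slack) and the assumption of strictly positive requirements are ours, not the paper's:

```python
import numpy as np

def local_search_maxcp(f, p, s, eps=0.1):
    """Fractional local search sketch for MaxCP (unit capacities, p, s > 0)."""
    T, U, V = f.shape
    mu = f.max() / T
    x = {}                                   # t -> [u, v, fraction in (0, 1]]

    def F_node(axis, node):                  # objective mass placed on a node
        return sum(fr * f[t, uu, vv] for t, (uu, vv, fr) in x.items()
                   if (uu, vv)[axis] == node)

    def restore(axis, node, req):            # steps 2-3: shrink lowest-density jobs
        def load():
            return sum(fr * req[t, uu, vv] for t, (uu, vv, fr) in x.items()
                       if (uu, vv)[axis] == node)
        while load() > 1.0 + 1e-9:
            t = min((t for t, (uu, vv, fr) in x.items()
                     if (uu, vv)[axis] == node and fr > 0),
                    key=lambda t: f[t, x[t][0], x[t][1]] / req[t, x[t][0], x[t][1]])
            uu, vv, fr = x[t]
            cut = min(fr, (load() - 1.0) / req[t, uu, vv])
            x[t][2] = fr - cut
            if x[t][2] <= 1e-12:
                del x[t]

    improved = True
    while improved:
        improved = False
        for t, u, v in np.ndindex(T, U, V):
            cur = x[t][2] * f[t, x[t][0], x[t][1]] if t in x else 0.0
            if f[t, u, v] > F_node(0, u) * p[t, u, v] + F_node(1, v) * s[t, u, v] + cur + eps * mu:
                x[t] = [u, v, 1.0]           # step 1: give t the pair (u, v)
                restore(0, u, p)             # step 2: processing capacity at u
                restore(1, v, s)             # step 3: storage capacity at v
                improved = True
    return x                                 # t -> [u, v, fraction]
```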
Theorem 2. The local search algorithm maintains the two stated invariants.
Proof. The first invariant is straightforward, because the only time we increase
an xtuv value we simultaneously set all other values for the same t to zero. The
only time the linear program inequalities can be violated is immediately after
setting xtuv = 1. However, the two steps immediately after this operation will
reduce the values of other jobs so as to satisfy the inequalities (and this is done
without increasing any xtuv so no new constraint can be violated).
Theorem 3. The local search algorithm produces a $(3+\epsilon)$-approximate fractional
solution satisfying the invariants.
Proof. When the algorithm terminates, we have for all $t,u,v$: $f_t(u,v) \le F_u\,\frac{p_t(u,v)}{c_u} + F_v\,\frac{s_t(u,v)}{d_v} + \sum_{u',v'} x_{tu'v'}\, f_t(u',v') + \epsilon\mu$. We sum this over the $(t,u,v)$ representing the optimum
integer assignments: $OPT \le \sum_u F_u + \sum_v F_v + \sum_{t,u,v} x_{tuv} f_t(u,v) + \epsilon\,OPT$.
Each of the three summations is at most the algorithm's objective value, giving the result.
Theorem 4. The local search algorithm runs in polynomial time.
Proof. Setting $x_{tuv} = 1$ and setting all other $x_{tu'v'} = 0$ adds $f_t(u,v) - \sum_{u',v'} x_{tu'v'}\, f_t(u',v')$
to the algorithm's objective. The next two steps of the algorithm (making sure
the LP inequalities hold) reduce the objective by at most $F_u\,\frac{p_t(u,v)}{c_u} + F_v\,\frac{s_t(u,v)}{d_v}$.
It follows that each iteration of the main loop increases the solution value by at
least $\epsilon\mu$. By definition of $\mu$, this can happen at most $n^2/\epsilon$ times. Each selection
of $(t,u,v)$ can be done in polynomial time (at worst, by simply trying all tuples).
Rounding Phase: When the local search algorithm terminates, we have a
fractional solution with the additional guarantee from the first invariant. Note
that we can extend this to the k-sided version if we increase the approximation
factor to $k+1+\epsilon$. Below, we give two different rounding schemes. The first works
for general values of k and loses an $O(k^2)$ factor, for an overall approximation
factor of $O(k^3)$. The second is specific to the k = 2 case and obtains a better
approximation.
1. We randomly make each assignment with probability p times the fractional
value (so pxtuv for Coupled Placement), for some p to be defined later.
2. For each assigned job t, if the other jobs $t' \ne t$ assigned to any one of its
assigned machines violate the corresponding linear program constraint, we immediately
drop job t. For Coupled Placement this means that if $\sum_{t'\ne t,\,v} p_{t'}(u,v)\, x_{t'uv} > 1$ for any t, u, we set $x_{tuv} = 0$.
3. Note that we may still violate linear program constraints, but for any particular
machine the constraint would be satisfied if we dropped any one of its
assigned jobs. We divide the assigned jobs into k + 1 groups. These groups
should guarantee that for any machine with at least two assigned jobs, not all
its jobs are members of the same group. We then select the group with largest
total objective value as our final solution.
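A compact sketch of this randomized rounding and the (k+1)-group selection described in the proof of Theorem 5 below; the data layout (dicts keyed by job and by (job, node)) is an assumption for illustration:

```python
import random

def round_and_group(placed_fraction, nodes_of, demand, profit, k, p, seed=0):
    """Sketch of the three-step rounding for the k-sided problem.

    placed_fraction[j]: fractional value x_j of job j's single assigned tuple;
    nodes_of[j]: the k nodes of that tuple; demand[(j, u)]: load of j on node u
    (capacities scaled to 1); profit[j]: value f_j of the assigned tuple.
    """
    rng = random.Random(seed)
    # Step 1: keep each assignment independently with probability p * x_j.
    placed = {j for j, xj in placed_fraction.items() if rng.random() < p * xj}
    # Step 2: drop j if the *other* kept jobs already overflow one of its nodes.
    def others_load(u, skip):
        return sum(demand[(j, u)] for j in placed if j != skip and u in nodes_of[j])
    placed = {j for j in placed if all(others_load(u, j) <= 1.0 for u in nodes_of[j])}
    # Step 3: build k+1 groups; within a group no two jobs share an unmarked node,
    # then mark every node touched by the group.  Keep the most profitable group.
    marked, groups, remaining = set(), [], set(placed)
    for _ in range(k + 1):
        group, used = [], set()
        for j in sorted(remaining):
            free = [u for u in nodes_of[j] if u not in marked]
            if all(u not in used for u in free):
                group.append(j)
                used.update(free)
        remaining -= set(group)
        marked.update(u for j in group for u in nodes_of[j])
        groups.append(group)
    best = max(groups, key=lambda g: sum(profit[j] for j in g))
    return best
```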
Theorem 5. For the k-sided version, the rounding scheme runs in polynomial time
and achieves an O(k²)-approximation over the fractional approximation factor
(so an overall factor of O(k³) using local search) for an appropriate choice of p.
Proof. The first two steps finish with a solution of value at least p(1 − p)^k times
the optimum in expectation. This is because for any job t, the probability of
placing this job in step one is exactly p times its fractional value. Consider any
machine m where the job is assigned; the expected total size of the other jobs
t' ≠ t assigned to this machine is at most p·cm, and thus the probability that these
other jobs exceed cm is at most p. The probability that none of the k machines
where t is assigned exceeds capacity from other jobs is therefore at least (1 − p)^k.
We may still violate constraints. Dividing into k + 1 groups and picking the
best gives a result which is at least (1/(k+1))·p(1−p)^k times optimum without violating
constraints. Selecting p = 1/k gives the desired approximation factor.
It remains to show that the division into groups can be performed in polytime.
We start with all machines unmarked. For each group, we select a maximal
set of jobs no two of which are assigned the same unmarked machine. We then
mark all machines to which one of our current group of jobs is assigned. Note
that immediately before we select group i, each remaining job is assigned to at
most k−i+1 unmarked machines. For i = 1 this is obvious. Inductively, suppose
that job j is assigned to more than k −i unmarked machines immediately before
selecting group i + 1. Before selecting group i, job j was assigned to at most
k −i+ 1 unmarked machines, and since we never “unmark” a machine it follows
that job j was assigned to exactly k − i + 1 unmarked machines both before and
after the selection of group i. But then none of the jobs selected in group i are
assigned to any of the unmarked machines assigned to job j (else they would
have become marked after selection of group i). So we can augment group i with
job j without violating the constraint that no two jobs of group i are on the
same unmarked machine. This contradicts the maximality of group i.
We thus conclude that immediately before we select group k + 1, each remaining
job is assigned only to marked machines. Thus group k + 1 selects all
remaining jobs (maximality) and the jobs are divided into k+1 groups. Consider
any machine m with at least two assigned jobs. Let group i be the first group to
contain a job from m. Thus prior to selection of group i, we had not selected any
job which was assigned to m and m was unmarked. So group i cannot include
more than one job from machine m without violating the condition that no two
jobs share an unmarked machine. It follows that there are at least two distinct
groups which contain jobs from machine m (group i and also some later group).
For MaxCP, we can improve the approximation factor. We refer the reader
to Appendix B for details.
Theorem 6. There exists a polynomial-time algorithm based on local search
that achieves a (15 + ε)-approximation for MaxCP.
4 Online MaxCP and MaxkSP
We now study the online version of MaxCP, in which jobs arrive in an online
fashion. When a job arrives we must irrevocably assign it or reject it. Our goal is
to maximize our total value at the end of the instance. We apply the techniques
of [8] to obtain a logarithmic competitive online algorithm under certain assumptions.
We first note that online MaxCP differs from the model considered in [8]
in that a job’s computation/storage requirements need not be the same.
As demonstrated in [8], certain assumptions have to be made to achieve competitive
ratios of any interest. We extend these assumptions for the MaxCP
model as follows:
Assumption 1 There exists F such that for all t, u, v, either ft(u, v) = 0 or
1 ≤ ft(u, v) ≤ F·min(pt(u,v)/cu, st(u,v)/dv).
Assumption 2 For ε = min(1/2, 1/ln(2F + 1)), for all t, u, v: pt(u, v) ≤ ε·cu and
st(u, v) ≤ ε·dv.
It is not hard to show that these assumptions (or some similar flavor of them)
are in fact necessary to obtain any interesting competitive ratios (proof
in Appendix C).
Theorem 7. No deterministic online algorithm can be competitive over classes
of instances where either one of the following is true: (i) job size is allowed to be
arbitrarily large relative to capacities, or (ii) job values and resource requirements
are completely uncorrelated.
A small modification to the algorithm of [8] gives an O(log F)-competitive algorithm.
Moreover, the lower bound of Ω(log F) shown in [8] applies to online
MaxCP as well. (See Appendix D for proof.)
Theorem 8. There exists a deterministic O(log F)-competitive algorithm for
online MaxCP under Assumptions 1 and 2. For MaxkSP, this can be extended
to an O(log kF)-competitive algorithm. Moreover, any online deterministic
algorithm for online MaxCP has competitive ratio Ω(log F), and for online
MaxkSP has competitive ratio Ω(log kF).
Theorem 9. There exists a randomized O(log F)-competitive algorithm (in expectation)
for online MaxCP under Assumption 1, even if we weaken Assumption
2 to require only that ε = 1/2. No deterministic online algorithm for the
problem can accomplish such a result.
Acknowledgments
We would like to thank Aravind Srinivasan for helpful discussions, and for pointing
us to the Ω(k/ log k)-hardness result for k-set packing, in particular. We
thank anonymous referees for helpful comments on an earlier version of the
paper, and are especially grateful to a referee who generously offered the key
insights leading to improved results for MinCP and MinkSP.
References
1. Patterson, D.A.: Technical perspective: the data center is the computer. Communications
of the ACM 51 (January 2008) 105–105
2. Dowdy, L.W., Foster, D.V.: Comparative models of the file assignment problem.
ACM Surveys 14 (1982)
3. Anderson, E., Kallahalla, M., Spence, S., Swaminathan, R., Wang, Q.: Quickly
finding near-optimal storage designs. ACM Transactions on Computer Systems 23
(2005) 337–374
4. Appleby, K., Fakhouri, S., Fong, L., Goldszmidt, G., Kalantar, M., Krishnakumar,
S., Pazel, D., Pershing, J., Rochwerger, B.: Oceano-SLA based management of a
computing utility. In: Proceedings of the International Symposium on Integrated
Network Management. (2001) 855–868
5. Chase, J.S., Anderson, D.C., Thakar, P.N., Vahdat, A.M., Doyle, R.P.: Managing
energy and server resources in hosting centers. In: Proceedings of the Symposium
on Operating Systems Principles. (2001) 103–116
6. Korupolu, M., Singh, A., Bamba, B.: Coupled placement in modern data centers.
In: Proceedings of the International Parallel and Distributed Processing Symposium.
(2009) 1–12
7. Bansal, N., Korula, N., Nagarajan, V., Srinivasan, A.: On k-column sparse packing
programs. In: Proceedings of the Conference on Integer Programming and
Combinatorial Optimization. (2010) 369–382
8. Awerbuch, B., Azar, Y., Plotkin, S.: Throughput-competitive on-line routing. In:
Proceedings of the Symposium on Foundations of Computer Science. (1993) 32–40
9. Shmoys, D.B., Tardos, É.: An approximation algorithm for the generalized assignment
problem. Mathematical Programming 62(3) (1993) 461–474
10. Lenstra, J.K., Shmoys, D.B., Tardos, É.: Approximation algorithms for scheduling
unrelated parallel machines. Mathematical Programming 46(3) (1990) 259–271
11. Chekuri, C., Khanna, S.: A PTAS for the multiple knapsack problem. In: Proceedings
of the Symposium on Discrete Algorithms. (2000) 213–222
12. Fleischer, L., Goemans, M.X., Mirrokni, V.S., Sviridenko, M.: Tight approximation
algorithms for maximum general assignment problems. In: SODA. (2006) 611–620
13. Alvarez, G.A., Borowsky, E., Go, S., Romer, T.H., Becker-Szendy, R., Golding, R.,
Merchant, A., Spasojevic, M., Veitch, A., Wilkes, J.: Minerva: An automated resource
provisioning tool for large-scale storage systems. Transactions on Computer
Systems 19 (November 2001) 483–518
14. Anderson, E., Hobbs, M., Keeton, K., Spence, S., Uysal, M., Veitch, A.: Hippodrome:
Running circles around storage administration. In: Proceedings of the
Conference on File and Storage Technologies. (2002) 175–188
15. Cygan, M., Grandoni, F., Mastrolilli, M.: How to sell hyperedges: The hypermatching
assignment problem. In: SODA. (2013) 342–351
16. Hazan, E., Safra, S., Schwartz, O.: On the complexity of approximating k-set
packing. Computational Complexity 15(1) (2006) 20–39
17. Vazirani, V.V.: Approximation Algorithms. Springer-Verlag (2001)
18. Frieze, A.M., Clarke, M.: Approximation algorithms for the m-dimensional 0-1
knapsack problem: Worst-case and probabilistic analyses. European Journal of
Operational Research 15(1) (1984) 100–109
19. Chekuri, C., Khanna, S.: On multi-dimensional packing problems. In: Proceedings
of the Symposium on Discrete Algorithms. (1999) 185–194
20. Srinivasan, A.: Improved approximations of packing and covering problems. In:
Proceedings of the Symposium on Theory of Computing. (1995) 268–276
21. Lau, L., Ravi, R., Singh, M.: Iterative Methods in Combinatorial Optimization.
Cambridge Texts in Applied Mathematics. Cambridge University Press (2011)
A Proofs for MinkSP
Fix an iteration of the algorithm, and an extreme point x. Let nt denote the
number of tight satisfaction constraints, and ni denote the number of tight capacity
constraints on the ith side. Since x is an extreme point, if all variables in
x take values in (0, 1), then we have N = nt + Σi ni.
Lemma 5. If all variables in x take values in (0, 1), then there exists a tight
capacity constraint in which the number of variables is at most k more than the
sum of the variables.
Proof. Since each variable occurs in at most k tight capacity constraints, the
total number of occurrences of all variables across the tight capacity constraints
is kN − s for some nonnegative integer s. Since each satisfaction constraint is
tight, each variable appears in k capacity constraints, and each variable takes on
value at most 1, the sum of all the variables over the tight capacity constraints
is at least knt − s. Therefore, the sum, over all tight capacity constraints, of the
difference between the number of variables and their sum is at most k(N − nt).
Since the number of tight capacity constraints is N −nt, for at least one of these
constraints, the difference between the number of variables and their sum is at
most k.
Lemma 6. Let u be a side-i node with a tight capacity constraint, in which the
number of variables is at most k more than the sum of the variables. Then, the
sum of the capacity requirements of the tasks partially assigned to u is at most
the available capacity of u plus kCi(u).
Proof. Let ℓ be the number of variables in the constraint for u, and let the
associated tasks be numbered 1 through ℓ. Let the demand of task j for the
capacity of node u be dj. Letting ĉ(u) denote the current available capacity of u,
the capacity constraint for u is Σj dj·xj = ĉ(u). We know that ℓ − Σj xj ≤ k. We
also have dj ≤ Ci(u). We now derive
Σj dj = ĉ(u) + Σj (1 − xj)·dj
      ≤ ĉ(u) + (ℓ − Σj xj)·Ci(u)
      ≤ ĉ(u) + k·Ci(u).
Theorem 10. IterRound(k) is a polynomial-time (k + 1)-approximation algorithm
for MinkSP.
Proof. By Lemma 5, each iteration of the algorithm removes either a variable or a
constraint from the LP. Hence the algorithm is polynomial time. The elimination
of a variable that takes value 0 or 1 neither changes cost nor incurs capacity
blowup. The elimination of a constraint can only decrease cost, so the final
solution has cost no more than the value achieved by the original LP. Finally,
by Lemma 6, we incur a blowup of at most 1 + k in capacity.
We now show that the MinkSP LP has an integrality gap of Ω(log k/ log log k).
We recursively construct an integrality gap instance with ℓ^t sides, for parameters
ℓ and t, with two nodes per side, one with infinite capacity and the other
with unit capacity, such that any integral solution has at least t tasks on the
unit-capacity node on some side, while there is a fractional solution with load
of at most t/ℓ on the unit-capacity node of each side. Setting t = ℓ and k = ℓ^ℓ,
we obtain an instance in which the capacity used by the fractional solution is 1,
while any integral solution has load ℓ = Θ(log k/ log log k).
Each task can be placed on one tuple from a subset of tuples; for a given
tuple, the demand of the task on each side of the tuple is one. We start with the
construction for t = 1. We introduce a task that has ℓ choices, the ith choice
consisting of the unit-capacity node from side i and infinite-capacity nodes on
all other sides. Clearly, any integral solution uses up unit capacity of one unit-capacity
node, while there is a fractional solution (1/ℓ for each choice) that uses
only a 1/ℓ fraction of each unit-capacity node.
Given a construction for ℓ^t sides, we show how to extend to ℓ^(t+1) sides. We
take ℓ identical copies of the instance with ℓ^t sides and combine the tuples for
each task in such a way that for any i, any integral placement places exactly the
same task on side i of each copy. Now we add task t + 1, which can be placed
in one of ℓ tuples: the unit-capacity node on all sides of copy i and the infinite-capacity
node on all other sides, for each i. Clearly, any integral solution will have to add
one more task to a unit-capacity node of a side that already has load t, yielding
a load of t + 1, while a fractional solution can assign load of at most 1/ℓ to the
unit-capacity nodes of each side.
B Proofs for MaxkSP and MaxCP
We first present the linear program for MaxkSP (recall the definition in Section
1.1). Let xte denote the indicator variable for the assignment of job t to the
k-tuple e.
Maximize:   Σt,e xte·ft(e)
Subject to: Σe xte ≤ 1,   ∀t ∈ T,
            Σt,e (dt(e))i·xte ≤ Ci(ei),   ∀i ∈ {1, . . . , k},
            xte ∈ {0, 1},   ∀t ∈ T, e ∈ Πi Si.
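As an illustration, the sketch below builds the LP relaxation of this program (with xte relaxed to [0, 1], and the capacity constraint read per node of each side) for a small instance using the PuLP modeling library. The instance layout, dictionary keys and function name are assumptions made for the example rather than anything prescribed by the paper.

import itertools
import pulp

def maxksp_lp(tasks, sides, f, d, C):
    # tasks: list of task ids; sides: list of k lists of nodes.
    # f[(t, e)]: value of assigning task t to tuple e; d[(t, e)][i]: demand of t on
    # side i under tuple e; C[i][node]: capacity of that node on side i.
    tuples = list(itertools.product(*sides))
    prob = pulp.LpProblem("MaxkSP_relaxation", pulp.LpMaximize)
    x = pulp.LpVariable.dicts("x", [(t, e) for t in tasks for e in tuples],
                              lowBound=0, upBound=1)

    # Objective: total value of (fractionally) assigned tasks.
    prob += pulp.lpSum(f[(t, e)] * x[(t, e)] for t in tasks for e in tuples)
    # Each task is assigned at most once.
    for t in tasks:
        prob += pulp.lpSum(x[(t, e)] for e in tuples) <= 1
    # Capacity of every node on every side.
    for i, side in enumerate(sides):
        for node in side:
            prob += pulp.lpSum(d[(t, e)][i] * x[(t, e)]
                               for t in tasks for e in tuples if e[i] == node) <= C[i][node]

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return {key: var.value() for key, var in x.items()}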
We now present the improved approximation algorithm for MaxCP. The
idea is to obtain a one-to-one correspondence between fractional assignments
and machines. Essentially we view the machines as nodes of a graph where
the edges are the fractional assignments (this is similar to the rounding for
generalized assignment). If we have a cycle, the idea is to shift the fractions
around the cycle (i.e., increase one xtuv, then decrease some xt'vw, increase
some xt''wx, and so forth). Applying this directly on a single cycle may violate
some constraints; while we try to increase and decrease the fractions in such a
way that constraints hold, since each job has different “size” on its two endpoints
we may wind up violating the constraint Σt,v pt(u, v)xtuv ≤ cu at a single node u. This
prevents us from doing a simple cycle elimination as in generalized assignment.
However, if we have two adjoining (or connected) cycles the process can be made
to work. The remaining case is a single cycle, where we can assign each edge
to one of its endpoints. Generalized assignment rounding would now proceed to
integrally assign each job to its corresponding machine; we cannot do this because
each job requires two machines, and each machine thus has multiple fractional
assignments (all but one of which “correspond” to some other machine).
Lemma 7. Given any fractional solution which satisfies the local search invariants,
we can produce an alternative fractional solution (also satisfying the local
search invariants and with equal or greater value). This new fractional solution
labels each job t with 0 < xtuv < 1 with either u or v, guaranteeing that each
machine is labeled with at most one job.
Proof. Consider a graph where the nodes are machines, and we have an edge
(u, v) for any fractional assignment 0 < xtuv < 1. If any node has degree zero or
one, we remove that node and its assigned edge (if any), labeling the removed
edge with the node that removed it. We continue this process until all remaining
nodes have degree at least two. If there is a node of degree at least three, then there must
exist two (distinct but not necessarily edge-disjoint) cycles with a path between
them (possibly a path of length zero); since the graph is bipartite, all cycles are
even in length. We can alternately increase and decrease the fractional assignments
of edges along a cycle such that the total load Σt,v pt(u, v)xtuv changes
only on the single node u where the path between the cycles intersects this cycle. We
can do the same along the other cycle. We can then do the same thing along the
path, and equalize the changes (multiplicatively) such that there is no overall
change in load, but at least one edge has its fractional value changing. If this
process decreases the value, we can reverse it to increase the value. This allows us
to modify the fractional solution in a way that increases the number of integral
assignments without decreasing the value. After applying this repeatedly (and
repeating the node/edge removal process above where necessary), we are left
with a graph that consists only of node-disjoint cycles. Each of the remaining
edges will be labeled with one of its two endpoints (one to each). The overall
effect is that we have a one-to-one labeling correspondence between fractional
assignments and machines (each fractional edge to one of its two assigned machines).
Note however that since each job is assigned to two machines and labeled
with only one of the two, this does not imply that each machine has only one
fractional assignment.
Once this is done, we consider three possible solutions. One consists of all
the integral assignments. The second considers only those assignments which are
fractional and labeled with nodes u. For each node v, we select a subset of its
fractional assignments to make integrally, so as to maximize the value without
violating capacity of v. We cannot violate capacity of u because we select at most
one job for each such machine. The result has at least 1/2 the value of assignments
labeled with nodes u. For the third solution, we do the same but with the roles
of u, v reversed. We select the best of these three solutions; our choice obtains
at least 1/5 of the overall value.
Proof of Theorem 6: The algorithm sketch contains most of the proof. We
need to establish that we can get at least 1/2 the fractional value on a single
machine integrally. This can be done by selecting jobs in decreasing order of
density (ft(u, v)/pt(u, v)) until we overflow the capacity. Including the job that
overflows capacity, this must be better than the fractional solution. Thus we can
select either everything but the job that overflows capacity, or that job by itself.
We also need to establish the 1/5 value claim. If we were to select the integral
assignments with probability 1/5 and each of the other two solutions with probability
2/5, we would get an expected 1/5 of the fractional solution. Deterministically
selecting the best of the three solutions can only be better than this.
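The single-machine step used in this proof, selecting by density and keeping either the greedy prefix or the single overflowing job, can be written as a short sketch; the input format is assumed for illustration.

def best_half_integral(jobs, value, size, capacity):
    # Order by density and take the greedy prefix; the better of the prefix and the
    # first overflowing job is worth at least half of any fractional packing.
    order = sorted(jobs, key=lambda j: value[j] / size[j], reverse=True)
    prefix, used, overflow = [], 0.0, None
    for j in order:
        if used + size[j] <= capacity:
            prefix.append(j)
            used += size[j]
        else:
            overflow = j
            break
    if overflow is not None and value[overflow] > sum(value[j] for j in prefix):
        return [overflow]
    return prefix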
C Proof of Theorem 7
We first show that if resource requirements are large compared to capacities,
payment functions ft are exactly equal to the total amount of resources and
each job requires the same amount over all resources/dimensions (but different
jobs can require different amounts), then no deterministic online algorithm can
be competitive.
Consider a graph G with a single compute node and a single data storage
node. Each node has one-dimensional compute/storage capacity of L. A job arrives requesting 1 unit of computing and storage and will pay 2. Clearly, any
competitive deterministic algorithm must accept this job, in case this is the only
job. However, a second job arrives requesting L units of computing and storage
and will pay 2L. In this case, the algorithm is L-competitive, and L can be
arbitrarily large.
Next, we show that if resource requirements are small relative to capacities,
payment functions ft are arbitrary and resource requirements are identical, then
no deterministic online algorithm can be competitive. This instance satisfies
Assumption 2 but not Assumption 1.
Consider again a graph G with a single compute node and single data storage
node each with one-dimensional, unit capacities. We will use up to k + 1 jobs,
each requiring 1/k units of computing and storage. The i-th job, 0 ≤ i ≤ k, will
pay M^i for some large value M. Now, consider any deterministic algorithm. If
it fails to accept any job j < k, then if job j is the last job, it will be Ω(M)-
competitive. If the algorithm accepts jobs 0 up through k − 1 then it will not be
able to accept job k and will be Ω(M)-competitive. In all cases it has competitive
ratio at least Ω(M) and M and k can be arbitrarily large.
Similarly, if resource requirements are small relative to capacities, payment
functions ft are exactly equal to the total amount of resources requested and
resource requirements are arbitrary, then no deterministic online algorithm can
be competitive.
Consider once more a graph G with a single compute node and single data
store node with one-dimensional compute/storage capacities. However, this time
the compute capacity will be 1 and the storage capacity will be some very large L.
We will use up to k+1 jobs, each requiring 1/k units of computing. The i-th job,
0 ≤ i ≤ k, will require the appropriate amount of storage so that its value is M^i
for very large M. Assuming L = O(kM^k), all these storage requirements are at
most 1/k of L. Note that storage can accommodate all jobs, but computing can
accommodate at most k jobs. Any deterministic algorithm will have competitive
ratio Ω(M) and k, M and L can be suitably large.
Thus, it follows that some flavor of Assumptions 1 and 2 is necessary to
achieve any interesting competitive result.
D Proof of Theorem 8
We adapt the framework of [8] to solve the online MaxCP problem. This framework
uses an exponential cost function to place a price on remaining capacity of
a node. If the value obtained from a task can cover the cost of the capacity it
consumes, we admit the task. In the algorithm below, e is the base of the natural
logarithm.
We first show that our algorithm will not exceed capacities. Essentially, this
occurs because the cost will always be sufficiently high.
Lemma 8. Capacity constraints are not violated at any time during this algorithm.
Algorithm 1 Online algorithm for MaxCP.
1: λu(1) ← 0, λv(1) ← 0 for all u ∈ U, v ∈ V
2: for each new task j do
3:   costu(j) ← (1/2)·(e^{λu(j)·ln(2F+1)/(1−ε)} − 1)
4:   costv(j) ← (1/2)·(e^{λv(j)·ln(2F+1)/(1−ε)} − 1)
5:   For all u, v let Zjuv = (pj(u,v)/cu)·costu(j) + (sj(u,v)/dv)·costv(j)
6:   Let u, v maximize fj(u, v) subject to Zjuv < fj(u, v)
7:   if such u, v exist with fj(u, v) > 0 then
8:     Assign j to u, v
9:     λu(j + 1) ← λu(j) + pj(u,v)/cu
10:    λv(j + 1) ← λv(j) + sj(u,v)/dv
11:    For all other u' ≠ u let λu'(j + 1) ← λu'(j)
12:    For all other v' ≠ v let λv'(j + 1) ← λv'(j)
13:  else
14:    Reject task j
15:    For all u let λu(j + 1) ← λu(j)
16:    For all v let λv(j + 1) ← λv(j)
17:  end if
18: end for
Proof. Note that λu(n + 1) will be (1/cu)·Σt,v pt(u, v)xtuv, since any time we assign
a job j to u, v we immediately increase λu(j + 1) by the appropriate amount. Thus
if we can prove λu(n + 1) ≤ 1 we will not violate the capacity of u.
Initially we had λu(1) = 0 < 1, so suppose that the first time we exceed
capacity is after the placement of job j. Thus we have λu(j) ≤ 1 < λu(j + 1).
By applying Assumption 2 we have λu(j) > 1 − ε. From this it follows that
costu(j) > (1/2)·(e^{ln(2F+1)} − 1) = F, and since these costs are always non-negative
we must have had Zjuv > (pj(u,v)/cu)·F ≥ fj(u, v) by applying Assumption 1. But
then we must have rejected job j and would have λu(j + 1) = λu(j), a contradiction.
Identical reasoning applies to v ∈ V.
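Before continuing to the revenue bound, here is a minimal Python sketch of Algorithm 1 under assumed input conventions (f, p and s passed as functions of (j, u, v), capacities as dictionaries); it is an illustration of the pseudocode above, not a reference implementation.

import math

def online_maxcp(tasks, U, V, f, p, s, c, d, F, eps):
    # lam_* play the role of λ in Algorithm 1; cost() is the exponential price.
    lam_u = {u: 0.0 for u in U}
    lam_v = {v: 0.0 for v in V}
    rate = math.log(2 * F + 1) / (1 - eps)
    assignment = {}

    def cost(lam):
        return 0.5 * (math.exp(lam * rate) - 1.0)

    for j in tasks:                      # tasks arrive online, in this order
        best = None
        for u in U:
            for v in V:
                z = (p(j, u, v) / c[u]) * cost(lam_u[u]) \
                    + (s(j, u, v) / d[v]) * cost(lam_v[v])
                if f(j, u, v) > 0 and z < f(j, u, v):
                    if best is None or f(j, u, v) > f(j, best[0], best[1]):
                        best = (u, v)
        if best is not None:             # admit j and price up the consumed capacity
            u, v = best
            assignment[j] = (u, v)
            lam_u[u] += p(j, u, v) / c[u]
            lam_v[v] += s(j, u, v) / d[v]
        # otherwise reject j; all prices stay unchanged
    return assignment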
Next, we bound the algorithm's revenue from below using the sum of the node
costs.
Lemma 9. Let A(j) be the total objective value Σt,u,v xtuv·ft(u, v) obtained by
the algorithm immediately before job j arrives. Then (3e ln(2F + 1))·A(j) ≥
Σu∈U costu(j) + Σv∈V costv(j).
Proof. The proof will be by induction on j; the base case where j = 1 is immediate
since no jobs have yet arrived or been scheduled and costu(1) = costv(1) = 0
for all u and v.
Consider what happens when job j arrives. If this job is rejected, neither side
of the inequality changes and the induction holds. Otherwise, suppose job j is
assigned to uv. We have:
A(j + 1) = A(j) + fj(u, v).
We can bound the new value of the right-hand side by observing that, since
costu has derivative increasing in the value of λu, the new value will be at most
the old value plus the new derivative times the increase in λu. It follows that:
costu(j + 1) ≤ costu(j) + (λu(j + 1) − λu(j)) · (1/2) · (ln(2F + 1)/(1 − ε)) · e^{λu(j+1)·ln(2F+1)/(1−ε)}
costu(j + 1) ≤ costu(j) + (pj(u, v)/cu) · (ln(2F + 1)/(1 − ε)) · ((1/2)·e^{λu(j)·ln(2F+1)/(1−ε)}) · e^{ε·ln(2F+1)/(1−ε)}
costu(j + 1) ≤ costu(j) + (pj(u, v)/cu) · (ln(2F + 1)/(1 − ε)) · (costu(j) + 1/2) · e^{ε·ln(2F+1)/(1−ε)}
Applying Assumption 2 gives:
costu(j + 1) ≤ costu(j) + (2e ln(2F + 1)) · ((pj(u, v)/cu)·costu(j) + 1/4)
Identical reasoning can be applied to costv, allowing us to show that the
increase in the right-hand side is at most:
(2e ln(2F + 1)) · ((pj(u, v)/cu)·costu(j) + (sj(u, v)/dv)·costv(j) + 1/2)
Since j was assigned to u, v, we must have fj(u, v) > (pj(u,v)/cu)·costu(j) + (sj(u,v)/dv)·costv(j);
from Assumption 1 we also have fj(u, v) ≥ 1, so we can conclude that the increase
in the right-hand side is at most:
(3e ln(2F + 1))·fj(u, v) ≤ (3e ln(2F + 1))·(A(j + 1) − A(j))
Now, we can bound the profit the optimum solution gets from tasks which
we either fail to assign, or assign with a lower value of ft(u, v). The reason we
did not assign these tasks was because the node costs were suitably high. Thus,
we can bound the profit of tasks using the node costs.
Lemma 10. Suppose the optimum solution assigned j to u, v, but the online
algorithm either rejected j or assigned it to some u', v' with fj(u', v') < fj(u, v).
Then (pj(u,v)/cu)·costu(n + 1) + (sj(u,v)/dv)·costv(n + 1) ≥ fj(u, v).
Proof. When the algorithm considered j, it would find the u, v with maximum
fj(u, v) satisfying Zjuv < fj(u, v). Since the algorithm either could not find
such u, v or else selected u', v' with fj(u', v') < fj(u, v), it must be that Zjuv ≥
fj(u, v). The lemma then follows by inserting the definition of Zjuv and then
observing that costu and costv only increase as the algorithm continues.
Lemma 11. Let Q be the total value of tasks which the optimum offline algorithm
assigns, but which Algorithm 1 either rejects or assigns to a uv with lower
value of ft(u, v). Then Q ≤ Σu∈U costu(n + 1) + Σv∈V costv(n + 1).
Proof. Consider any task q as described above. Suppose the offline optimum assigns
q to uq, vq. By applying Lemma 10 we have:
Q = Σq fq(uq, vq) ≤ Σq [ (pq(uq, vq)/cuq)·costuq(n + 1) + (sq(uq, vq)/dvq)·costvq(n + 1) ]
The lemma then follows from the fact that the offline algorithm must obey
the capacity constraints.
Finally, we can combine Lemmas 9 and 11 to bound our total profit. In
particular, this shows that we are within a factor 3e ln(2F + 1) of the optimum
offline solution, for an O(log F)-competitive algorithm.
Theorem 11. Algorithm 1 never violates capacity constraints and is O(log F)-
competitive.
We can extend the result to k-sided placement, and can get a slight improvement
in the required assumptions if we are willing to randomize. The results are
given below:
Theorem 12. For the k-sided placement problem, we can adapt Algorithm 1
to be O(log kF)-competitive provided that Assumption 2 is tightened to
ε = min(1/2, 1/ln(kF + 1)).
Proof. We must modify the definition of cost to:
costu(j) = (1/k)·(e^{λu(j)·ln(kF+1)/(1−ε)} − 1)
The rest of the proof will then go through. The intuition for the increase in
competitive ratio is that we need to assign the first task to arrive (otherwise
after this task our competitive ratio would be unbounded). This task potentially
uses up space on k machines while obtaining a value of only 1. So as the value
of k increases, the ratio of “best” to “worst” task effectively increases as well.
Theorem 13. If we select a random power of two z ∈ [1, F] and then reject all
placements with ft(u, v) < z or ft(u, v) > 2z, then we can obtain a competitive
ratio of O(log F log k) while weakening Assumption 2 to ε = min(1/2, 1/ln(2k + 1)).
Note that in the specific case of two-sided placement this is O(log F)-competitive,
requiring only that no single job consumes more than a constant fraction of any
machine.
Proof. Once we make our random selection of z, we effectively have F = 2 and
can apply the algorithm and analysis above. The selection of z causes us to lose
(in expectation) all but a 1/log F fraction of the possible profit, so we have to multiply this
into our competitive ratio.
Collaboration in the Cloud at Google
Yunting Sun, Diane Lambert, Makoto Uchida, Nicolas Remy
Google Inc.
January 8, 2014
Abstract
Through a detailed analysis of logs of activity for
all Google employees¹, this paper shows how the
Google Docs suite (documents, spreadsheets and
slides) enables and increases collaboration within
Google. In particular, visualization and analysis
of the evolution of Google’s collaboration network
show that new employees² have started
collaborating more quickly and with more people
as usage of Docs has grown. Over the last two
years, the percentage of new employees who collaborate
on Docs per month has risen from 70%
to 90% and the percentage who collaborate with
more than two people has doubled from 35% to
70%. Moreover, the culture of collaboration has
become more open, with public sharing within
Google overtaking private sharing.
1 Introduction
Google Docs is a cloud productivity suite and it
is designed to make collaboration easy and natural,
regardless of whether users are in the same
or different locations, working at the same or different
times, or working on desktops or mobile
devices. Edits and comments on the document
are displayed as they are made, even if many people
are simultaneously writing and commenting
on or viewing the document. Comments enable
real-time discussion and feedback on the document,
without changing the document itself. Authors
are notified when a new comment is made
or replied to, and authors can continue a conversation
by replying to the comment, or end
the discussion by resolving it, or re-start the discussion
by re-opening a closed discussion stream.
Because documents are stored in the cloud, users
can access any document they own or that has
been shared with them anywhere, any time and
on any device. The question is whether this enriched
model of collaboration matters?
There have been a few previous qualitative analyses
of the effects of Google Docs on collaboration.
For example, the review of Google Docs in
[1] suggested that its features should improve collaboration
and productivity among college students.
A technical report [2] from the University
of Southern Queensland, Australia argued that
Google Docs can overcome barriers to usability
such as difficulty of installation and document
version control and help resolve conflicts among
co-authors of research papers. There has also
been at least one rigorous study of the effect of
Google Docs on collaboration. Blau and Caspi
[3] ran a small experiment that was designed to
compare collaboration on writing documents to
merely sharing documents. In their experiment,
118 undergraduate students of the Open University
of Israel were randomized to one of five
groups in which they shared their written assignments
and received feedback from other students
to varying degrees, ranging from keeping texts
¹Full-time Google employees, excluding interns, part-time employees, vendors, etc.
²Full-time employees who joined Google less than 90 days earlier.
private to allowing in-text suggestions or allowing
in-text edits. None of the students had used
Google Docs previously. The authors found that
only students in the collaboration group perceived
the quality of their final document to be
higher after receiving feedback, and students in
all groups thought that collaboration improves
documents.
This paper takes a different approach, and looks
for the effects of collaboration on a large, diverse
organization with thousands of users over a much
longer period of time. The first part of the paper
describes some of the contexts in which Google
Docs is used for collaboration, and the second
part analyzes how collaboration has evolved over
the last two years.
2 Collaboration Visualization
2.1 The Data
This section introduces a way to visualize the
events during a collaboration and some simple
statistics that summarize how widespread collaboration
using Google Docs is at Google. The
graphics and metrics are based on the view, edit
and comment actions of all full-time employees
on tens of thousands of documents created in
April 2013.
2.2 A Simple Example
To start, a document with three collaborators
Adam (A), Bryant (B) and Catherine (C) is
shown in Figure 1. The horizontal axis represents
time during the collaboration. The vertical
axis is broken into three regions representing
viewing, editing and commenting. Each contributor
is assigned a color. A box with the contributor’s
color is drawn in any time interval in
which the contributor was active, at a vertical
position that indicates what the user was doing
in that time interval. This allows us to see when
contributors were active and how often they contributed
to the document. Stacking the boxes allows
us to show when contributors were acting at
the same time. Only time intervals in which at
least one contributor was active are shown, and
gaps in time that are shorter than a threshold
are ignored. Gray vertical bars of fixed width
are used to represent periods of no activity that
are longer than the threshold. In this paper, the
threshold is set to be 12 hours in all examples.
In Figure 1, an interval represents an hour.
Adam and Bryant edited the document together
during the hour of 10 AM May 4 and Bryant
edited alone in the following hour. The collaboration
paused for 8 days and resumed during
the hour of 2 pm on May 12. Adam, Bryant and
Catherine all viewed the document during that
hour. Catherine commented on the document
in the next hour. Altogether, the collaboration
had two active sessions, with a pause of 8 days
between them.
Figure 1: This figure shows an example of the
collaboration visualization technique. Each colored
block except the gray one represents an hour and the
gray one represents a period of no activity. The Y
axis is the number of users for each action type. This
document has three contributors, each assigned a different
color.
Although we have used color to represent collaborators
here, we could instead use color to
represent the locations of the collaborators, their
organizations, or other variables. Examples with
different colorings are given in Sections 2.5 and
2.6.
2.3 Collaboration Metrics
To estimate the percentage of users who concurrently
edit a document and the percentage of
documents which had concurrent editing, we discretize
the timestamps of editing actions into 15
minute intervals and consider editing actions by
different contributors in the same 15 minute interval
to be concurrent. Two users who edit the
same document but always more than 15 minutes
apart would not be considered as concurrent, although
they would still be considered collaborators.
Edge cases in which two collaborators edit
the same document within 15 minutes of each
other but in two adjacent 15 minute intervals
would not be counted as concurrent events.
The choice of 15 minutes is arbitrary; however,
metrics based on a 15 minute discretization and
a 5 minute discretization are little different. The
choice of 15 minute intervals makes computation
faster. A more accurate approach would be to
look for sequences of editing actions by different
users with gaps below 15 minutes, but that
requires considerably more computing.
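As a sketch of how such a concurrency metric might be computed from an activity log, the following assumes a simple (document, user, timestamp) event format; the field layout is an assumption for illustration, not Google's actual logging schema.

from collections import defaultdict

def concurrent_edit_stats(edit_events, bin_minutes=15):
    # edit_events: iterable of (doc_id, user_id, timestamp_seconds). Two different
    # users editing the same document in the same time bin count as concurrent;
    # edits in adjacent bins do not, mirroring the edge case described above.
    bins = defaultdict(set)
    for doc, user, ts in edit_events:
        bins[(doc, int(ts // (bin_minutes * 60)))].add(user)

    concurrent_docs, concurrent_users = set(), set()
    for (doc, _), users in bins.items():
        if len(users) > 1:
            concurrent_docs.add(doc)
            concurrent_users.update(users)
    return concurrent_docs, concurrent_users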
2.4 Collaborative Editing
Collaborative editing is common at Google. 53%
of the documents that were created and shared
in April 2013 were edited by more than one employee,
and half of those had at least one concurrent
editing session in the following six months.
Looking at employees instead of documents, 80%
of the employees who edited any document contributed
content to a document owned by others
and 65% participated in at least one 15 minute
concurrent editing session in April 2013. Concurrent
editing is sticky, in the sense that 76% of the
employees who participate in a 15 minute concurrent
editing session in April will do so again
the following month.
There are many use cases for collaborative editing,
including weekly reports, design documents,
and coding interviews. The following three plots
show an example of each of these use cases.
Figure 2: Collaboration activity on a design document. The X axis is time in hours and the Y axis is the
number of users for each action type. The document was mainly edited by 3 employees, commented on by
18 and viewed by 50+.
Figure 2 shows the life of a design document created
by engineers. The X axis is time in hours
and the Y axis is the number of employees working
on the document for each action type. The
document was mainly edited by three employees,
commented on by 18 employees and viewed
by more than 50 employees from three major locations.
This document was completed within
two weeks and viewed many times in the subsequent
month. Design documents are common at
Google, and they typically have many contributors.
Figure 3 shows the life of a weekly report document.
Each bar represents a day and the Y
axis is the number of employees who edited and
viewed the document in a day. This document
has the following submission rules:
• Wednesday, AM: Reminder for submissions
• Wednesday, PM: All teams submit updates
• Thursday, AM: Document is locked
The activities on the document exhibit a pronounced
weekly pattern that mirrors the submission
rules. Weekly reports and meeting notes
that are updated regularly are often used by employees
to keep everyone up-to-date as projects
progress.
Figure 3: Collaboration on a weekly report. The
X axis is time in days and the Y axis is the number
of users for each action type. The activities exhibit
a pronounced weekly pattern and reflect the submission
rules of the document.
Finally, Figure 4 shows the life of a document
used in an interview. The X axis represents time
in minutes. The document was prepared by a recruiter
and then viewed by an engineer. At the
beginning of the interview, the engineer edited
the document and the candidate then wrote code
in the document. The engineer was able to watch
the candidate typing. At the end of the interview,
the candidate’s access to the document was
revoked so no further change could be made, and
the document was reviewed by the engineer. Collaborative
editing allows the coding interview to
take place remotely, and it is an integral part of
interviews for software engineers at Google.
Figure 4: The activity on a phone interview document.
The X axis is time in minutes and the Y axis
is the number of users for each action type. The engineer
was able to watch the candidate typing on the
document during a remote interview.
2.5 Commenting
Commenting is common at Google. 30% of the
documents created in April 2013 that are shared
received comments within six months of creation.
57% of the employees who used Google Docs in
April commented at least once in April, and 80%
of the users who commented in April commented
again in the following month.
Figure 5: Commenting and editing on a design document. The X axis is time in hours and the Y axis
is the number of user actions for each user location. There are four user actions, each assigned a different
color. Timestamps are in Pacific time.
Figure 5 shows the life of a design document.
Here color represents the type of user action (create
a comment, reply to a comment, resolve a
comment and edit the document), and the Y axis
is split into two locations. The document was
written by one engineering team and reviewed
by another. The review team used commenting
to raise many questions, which the engineering
team resolved over the next few days. Collaborators
were located in London, UK and Mountain
View, California, with a nine hour time zone difference,
so the two teams were almost "taking
turns" working on the document (timestamps
are in Pacific time). There are many similar
communication patterns between engineers via
commenting to ask questions, have discussions
and suggest modifications.
2.6 Collaboration Across Sites
Employees use the Docs suite to collaborate with
colleagues across the world, as Figure 6 shows.
In that figure, employees working from nine locations
in eight countries across the globe contributed
to a document that was written within a
week. The document was either viewed or edited
with gaps of less than 12 hours (the threshold for
suppressing gaps in the plot) in the first seven
days as people worked in their local timezones.
After final changes were made to the document,
it was reviewed by people in Dublin, Mountain
View, and New York.
Figure 7 shows one month of global collaborations
for full-time employees using Google Docs.
The blue dots show the locations of the employees
and a line connects two locations if a document
is created in one location and viewed in the
other. The warmer the color of the line, moving
from green to red, the more documents shared
between the two locations.
Figure 6: Activity on a document. Each user location is assigned a different color. The X axis is time in
hours and the Y axis is the number of locations for each action type. Users from nine different locations
contributed to the document.
Figure 7: Global collaboration on Docs. The blue dots are locations and the dots are connected if there is
collaboration on Google Docs between the two locations.
2.7 Cross Device Work
The advantage of cloud-based software and storage
is that a document can be accessed from any
device. Figure 8 shows one employee’s visits to
a document from multiple devices and locations.
When the employee was in Paris, a desktop or
laptop was used during working hours and a mobile
device during non-working hours. Apparently,
the employee traveled to Aix-En-Provence
on August 18. On August 18 and the first part of
August 19, the employee continued working on
the same document from a mobile device while
on the move.
Figure 8: Visits to a document by one user working
on multiple devices and from multiple locations.
Not surprisingly, the pattern of working on desktops
or laptops during working hours and on mobile
devices out of business hours holds generally
at Google, as Figure 9 shows. The day of week
is shown on the X axis and hour of day in local
time on the Y axis. Each pixel is colored
according to the average number of employees
working in Google Docs in a day of week and
time of day slot, with brighter colors representing
higher numbers. Pixel values are normalized
within each plot separately. Desktop and laptop
usage of Google Docs peaks during conventional
working hours (9:00 AM to 11:00 AM and
1:00 PM to 5:00 PM), while mobile device usage
peaks during conventional commuting and other
out-of-office hours (7:00 AM to 9:00 AM and 6:00
PM to 8:00 PM).
Figure 9: The average number of active users working
in Google Docs in each day of week and time of
day slot. The X axis is day of the week and the Y
axis is time of the day in local time. Desktop/Laptop
usage peaks during working hours while mobile usage
peaks during out-of-office hours.
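A sketch of the aggregation behind this kind of day-of-week by hour-of-day heat map might look as follows; the event format, device labels and per-device normalization are assumptions made for illustration.

from collections import defaultdict

def activity_heatmap(events):
    # events: iterable of (user_id, local_datetime, device) with device in
    # {"desktop", "mobile"}. Returns, per device, a 7 x 24 grid of the average number
    # of distinct active users, normalized to [0, 1] within each device separately.
    per_slot = defaultdict(set)                 # (device, weekday, hour, date) -> users
    for user, dt, device in events:
        per_slot[(device, dt.weekday(), dt.hour, dt.date())].add(user)

    grids = {}
    for device in {key[0] for key in per_slot}:
        samples = defaultdict(list)
        for (dev, wd, hr, _), users in per_slot.items():
            if dev == device:
                samples[(wd, hr)].append(len(users))
        grid = [[0.0] * 24 for _ in range(7)]
        for (wd, hr), counts in samples.items():
            grid[wd][hr] = sum(counts) / len(counts)
        peak = max(max(row) for row in grid) or 1.0
        grids[device] = [[value / peak for value in row] for row in grid]
    return grids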
3 The Evolution of Collaboration
3.1 The Data
This section explores changes in the usage of
Google Docs over time. Section 2 defined collaborators
as users who edited or commented on the
same document and used logs of employee editing,
viewing and commenting actions to describe
collaboration within Google. This section defines
collaborators differently using metadata on documents.
Metadata is much less rich than the
event history logs used in Section 2, but metadata
is retained for a much longer period of time.
Document metadata includes the document creation
time and the last time that the document
was accessed, but no other information about its
revision history. However, the metadata does include
the identification numbers for employees
who have subscribed to the document, where a
subscriber is anyone who has permission to view,
edit or comment on a document and who has
viewed the document at least once. Here we use
metadata on documents, slides and spreadsheets.
We call two employees collaborators (or subscription
collaborators to be clear) if one is a subscriber
to a document owned by the other and
has viewed the document at least once and the
document has fewer than 20 subscribers. The
owner of the document is said to have shared
the document with the subscriber. The number
of subscribers is capped at 20 to avoid overcounting
collaborators. The more subscribers
the document has, the less likely it is that all
the subscribers contributed to the document.
There is no timestamp for when the employee
subscribed to the document in the metadata, so
the exact time of the collaboration is not known.
Instead, the document creation time, which is
known, is taken to be the time of the collaboration.
An analysis (not shown here) of the event
history data discussed in Section 2 showed that
most collaborators join a collaboration soon after
a document is created, so taking collaboration
time to be document creation time is not
unreasonable. To make this assumption even
more tenable, we exclude documents for which
the time of the last view, comment or edit is more
than six months after the document was created.
This section uses metadata on documents created
between January 1, 2011 and March 31,
2013. We say that two employees had a subscription
collaboration in July if they collaborated on
a document that was created in July.
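A sketch of how subscription collaborations could be extracted from such metadata is given below; the field names and record layout are assumptions for illustration, while the thresholds follow the definition above (fewer than 20 subscribers, and documents last touched more than six months after creation excluded).

from collections import defaultdict

def subscription_collaborations(doc_metadata, max_subscribers=20, max_tail_days=180):
    # doc_metadata: iterable of dicts with keys "owner", "subscribers" (employees who
    # can access the document and have viewed it at least once), "created" and
    # "last_accessed" (dates). Collaboration time is taken to be the creation month.
    by_month = defaultdict(set)
    for doc in doc_metadata:
        if len(doc["subscribers"]) >= max_subscribers:
            continue
        if (doc["last_accessed"] - doc["created"]).days > max_tail_days:
            continue
        month = (doc["created"].year, doc["created"].month)
        for subscriber in doc["subscribers"]:
            if subscriber != doc["owner"]:
                by_month[month].add(tuple(sorted((doc["owner"], subscriber))))
    return by_month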
3.2 Collaboration for New Employees
Here we define the new employees for a given
month to be all the employees who joined Google
no more than 90 days before the beginning of
the month and started using Google Docs in
the given month. For example, employees called
new in the month of January 2011 must have
joined Google no more than 90 days before January
1, 2011 and used Google Docs in January
2011. Each month can include different employees.
New employees are said to share a document
if they own a document that someone else subscribed
to, whether or not the person subscribed
to the document is a new employee. Similarly, a
new employee is counted as a subscriber, regardless
of the tenure of the document creator.
Figure 10 shows that collaboration among new
employees has increased since 2011. Over the
last two years, subscribing has risen from 55% to
85%, sharing has risen from 30% to 50%, and the
fraction of users who either share or subscribe
has risen from 70% to 90%. In other words, new
employees are collaborating earlier in their career,
so there is a faster ramp-up and easier access
to collective knowledge.
Figure 10: This figure shows the percentage of new
employees who share, subscribe to others’ documents
and either share or subscribe in each one-month period
over the last two years.
Not only do new employees start collaborating
more often (as measured by subscription and
sharing), they also collaborate with more people.
Figure 11 shows the percentage of new employees
with at least a given number of collaborators
by month. For example, the percentage of
new employees with at least three subscription
collaborators was 35% in January 2011 (the bottom
red curve) and 70% in March 2013 (the top
blue curve), a doubling over two years. It is interesting
that the curves hardly cross each other
and the curves for the farthest back months lie
below those for recent months, suggesting that
there has been steady growth in the number of
subscription collaborators per new employee over
this period.
Figure 11: This figure shows the proportion of new
employees who have at least a given number of collaborators
in each one-month period. Each period is
assigned a different color. The cooler the color of the
curve, moving from red to blue, the more recent the
month. The legend only shows the labels for a subset
of curves. The percentage of new employees who have
at least three collaborators has doubled from 35% to
70%.
To present the data in Figure 11 in another way,
Table 1 shows percentiles of the distribution of
the number of subscription collaborators per new
employee using Google Docs in January 2011 and
in January 2013. For example, the lowest 25% of
new employees using Google Docs had no such
collaborators in January 2011 and two such collaborators
in January 2013.
25% 50% 75% 90% 95%
January 2011 0 1 4 7 11
January 2013 2 5 10 17 22
Table 1: This table shows percentiles of the number of collaborators a new employee
has in January 2011 and January 2013. The entire distribution shifts to
the right.
3.3 Collaboration in Sales and Marketing
Section 3.2 compared new employees who joined
Google in different months. This section follows
current employees in Sales and Marketing who
joined Google before January 1, 2011. That is,
the previous section considered changes in new
employee behavior over time and this section
considers changes in behavior for a fixed set of
employees over time. We only analyze subscription
collaborations among this fixed set of employees
and collaborations with employees not
in this set are excluded.
Figure 12: This figure shows the percentage of current
employees in Sales and Marketing who have at
least a given number of collaborators in each onemonth
period.
Figure 12 shows the percentage of current employees
in Sales and Marketing who have at least
a given number of collaborators at several times
in the past. There we see that more employees
are sharing and subscribing over time because
the fraction of the group with at least one subscription
collaborator has increased from 80%
to 95%. And the fraction of the group with
at least three subscription collaborators has increased
from 50% to 80%. It shows that many of
the employees who used to have no or very few
subscription collaborators have migrated to having
multiple subscription collaborators. In other
words, the distribution of number of subscription
collaborators for employees who have been
in Sales and Marketing since January 1, 2011 has
shifted right over time, which implies that collaboration
in that group of employees has increased
over time.
Finally, the number of documents shared by the
employees who have been in Sales and Marketing
at Google since January 1, 2011 has nearly doubled
over the last two years. Figure 13 shows the
number of shared documents normalized by the
number of shared documents in January, 2011.
Figure 13: This figure shows the number of shared
documents created by employees in Sales and Marketing
each month normalized by the number of shared
documents in January 2011. The number has almost
doubled over the last two years.
3.4 Collaboration Between Organizations
Collaboration between organizations has increased
over time. To show that, we consider
hundreds of employees in nine teams within the
Sales and Marketing group and the Engineering
and Product Management group who joined
Google before January 1, 2011, were still active
on March 31, 2013, and used Google Docs in that
period. Figure 14 represents the Engineering and
Product Management employees as red dots and
the Sales and Marketing employees as blue dots.
The same dots are included in all three plots
in Figure 14 because the employees included in
this analysis do not change. A line connects two
dots if the two employees had at least one subscription
collaboration in the month shown. The
denser the lines in the graph, the more collaboration,
and the more lines connecting red and blue
dots, the more collaboration between organizations.
Clearly, subscription collaboration has increased
both within and across organizations in
the past two years. Moreover, the network shows
more pronounced communities (groups of connected
dots) over time. Although there are nine
individual teams, there seem to be only three
major communities in the network. Figure 14
indicates that teams can work closely with each
other even though they belong to separate departments.
We also sampled 187 teams within the Sales and
Marketing group and the Engineering and Product
Management group. Figure 15 represents
teams in Engineering and Product Management
as red dots and teams in Sales and Marketing
as blue dots. Two dots are connected if the two
teams had at least one subscription collaboration
between their members in the month. Figure
15 shows that the collaboration between those
teams has increased and the interaction between
the two organizations has become stronger over
the past two years.
Figure 14: An example of collaboration across organizations.
Red dots represent employees in Engineering
and Product Management and blue dots represent
employees in Sales and Marketing
Figure 15: An example of collaboration between
teams. Red dots represent teams in Engineering and
Product Management and blue dots represent teams
in Sales and Marketing
3.5 Cultural Changes in Collaboration
Google Docs allows users to specify the access
level (visibility) of their documents. The default
access level in Google Docs is private, which
means that only the user who created the document
or the current owner of the document can
view it. Employees can change the access level on
a document they own and allow more people to
access it. For example, the document owner can
specify particular employees who are allowed to
access the document, or the owner can mark the
document as public within Google, in which case
any employee can access the document. Clearly,
not all documents created in Google can be visible
to everyone at Google, but the more documents
are widely shared, the more open the environment
is to collaboration.
Figure 16: This figure shows the percentage of
shared documents that are "public within Google"
created in each month. Public sharing is overtaking
private sharing at Google.
Figure 16 shows the percentage of shared documents
in Google created each month between
January 1, 2012 and March 31, 2013 that are
public within Google. The red line, which is a
curve fit to the data to smooth out variability,
shows that the percentage has increased about
12%, from 48% to 54%, in the last year alone. In
that sense, the culture of sharing is changing in
Google from private sharing to public sharing.
4 Conclusions
We have examined how Google employees collaborate
with Docs and how that collaboration has
evolved using logs of user activity and document
metadata. To show the current usage of Docs in
Google, we have developed a visualization technique
for the revision history of a document and
analyzed key features in Docs such as collaborative
editing, commenting, access from anywhere
and on any device. To show the evolution
of collaboration in the cloud, we have analyzed
new employees and a fixed group of employees
in Sales and Marketing, and computed collaboration
network statistics each month. We find
that employees are engaged in using the Docs
suite, and collaboration has grown rapidly over
the last two years.
It would also be interesting to conduct a similar
analysis for other enterprises and see how long it
would take them to reach the benchmark Google
has set for collaboration on Docs. Not only has
the collaboration on Docs changed at Google,
the number of emails, comments on G+, and calendar
meetings between people who work together
has also changed significantly over the past few
years. How those changes reinforce each other
over time would also be an interesting topic to
study.
Acknowledgements
We would like to thank Ariel Kern for her
insights about collaboration on Google Docs,
Penny Chu and Tony Fagan for their encouragement
and support and many thanks to Jim
Koehler for his constructive feedback.
References
[1] Dan R. Herrick (2009). Google this!: using
Google apps for collaboration and productivity.
Proceedings of the ACM SIGUCCS fall
conference (pp. 55-64).
[2] Stijn Dekeyser, Richard Watson (2009). Extending
Google Docs to Collaborate on Research
Papers. Technical Report, The University
of Southern Queensland, Australia.
[3] Ina Blau, Avner Caspi (2009). What Type
of Collaboration Helps? Psychological Ownership,
Perceived Learning and Outcome
Quality of Collaboration Using Google Docs.
Learning in the technological era: Proceedings
of the Chais conference on instructional
technologies research (pp. 48-55).
Google Inc. 13
Circulant Binary Embedding
Felix X. Yu¹  YUXINNAN@EE.COLUMBIA.EDU
Sanjiv Kumar²  SANJIVK@GOOGLE.COM
Yunchao Gong³  YUNCHAO@CS.UNC.EDU
Shih-Fu Chang¹  SFCHANG@EE.COLUMBIA.EDU
¹Columbia University, New York, NY 10027
²Google Research, New York, NY 10011
³University of North Carolina at Chapel Hill, Chapel Hill, NC 27599
Abstract
Binary embedding of high-dimensional data requires
long codes to preserve the discriminative
power of the input space. Traditional binary coding
methods often suffer from very high computation
and storage costs in such a scenario. To
address this problem, we propose Circulant Binary
Embedding (CBE) which generates binary
codes by projecting the data with a circulant matrix.
The circulant structure enables the use of
Fast Fourier Transformation to speed up the computation.
Compared to methods that use unstructured matrices, the proposed method improves the time complexity from O(d^2) to O(d log d), and the space complexity from O(d^2) to O(d), where d is the input dimensionality. We also
propose a novel time-frequency alternating optimization
to learn data-dependent circulant projections,
which alternately minimizes the objective in the original and Fourier domains. We show
by extensive experiments that the proposed approach
gives much better performance than the
state-of-the-art approaches for fixed time, and
provides much faster computation with no performance
degradation for fixed number of bits.
1. Introduction
Embedding input data in binary spaces is becoming popular
for efficient retrieval and learning on massive data sets
(Li et al., 2011; Gong et al., 2013a; Raginsky & Lazebnik,
2009; Gong et al., 2012; Liu et al., 2011). Moreover,
in a large number of application domains such as computer vision, biology, and finance, data is typically high-dimensional. When representing such high-dimensional data by binary codes, it has been shown that long codes are required in order to achieve good performance. In fact, the required number of bits is O(d), where d is the input dimensionality (Li et al., 2011; Gong et al., 2013a; Sánchez
& Perronnin, 2011). The goal of binary embedding is to
well approximate the input distance as Hamming distance
so that efficient learning and retrieval can happen directly in
the binary space. Note that the related area of hashing is a special case with a slightly different goal: creating hash tables such that similar points fall in the same (or a nearby) bucket with high probability. In
fact, even in hashing, if high accuracy is desired, one typically
needs to use hundreds of hash tables involving tens of
thousands of bits.
Most of the existing linear binary coding approaches generate
the binary code by applying a projection matrix, followed
by a binarization step. Formally, given a data point,
x ∈ R^d, the k-bit binary code h(x) ∈ {+1, −1}^k is generated simply as

h(x) = sign(Rx),   (1)

where R ∈ R^{k×d}, and sign(·) is a binary map which returns the element-wise sign.¹ Several techniques have been
proposed to generate the projection matrix randomly without
taking into account the input data (Charikar, 2002; Raginsky
& Lazebnik, 2009). These methods are very popular
due to their simplicity but often fail to give the best performance
due to their inability to adapt the codes with respect
to the input data. Thus, a number of data-dependent techniques
have been proposed with different optimization criteria
such as reconstruction error (Kulis & Darrell, 2009),
data dissimilarity (Norouzi & Fleet, 2012; Weiss et al., 2008), ranking loss (Norouzi et al., 2012), quantization error after PCA (Gong et al., 2013b), and pairwise misclassification (Wang et al., 2010). These methods are shown to be effective for learning compact codes for relatively low-dimensional
data. However, the O(d^2) computational and space costs prohibit them from being applied to learning long codes for high-dimensional data. For instance, to generate O(d)-bit binary codes for data with d ∼ 1M, a huge projection matrix will be required, needing TBs of memory, which is not practical.²
In order to overcome these computational challenges, Gong
et al. (2013a) proposed a bilinear projection based coding
method for high-dimensional data. It reshapes the input
vector x into a matrix Z, and applies a bilinear projection
to get the binary code:

h(x) = sign(R_1^T Z R_2).   (2)

When the shapes of Z, R_1, R_2 are chosen appropriately, the method has time and space complexity of O(d^{1.5}) and O(d), respectively. Bilinear codes make it feasible to
work with datasets with very high dimensionality and have
shown good results in a variety of tasks.
In this work, we propose a novel Circulant Binary Embedding
(CBE) technique which is even faster than the bilinear
coding. It is achieved by imposing a circulant structure
on the projection matrix R in (1). This special structure
allows us to use Fast Fourier Transformation (FFT) based
techniques, which have been extensively used in signal processing.
The proposed method further reduces the time
complexity to O(d log d), enabling efficient binary embedding
for very high-dimensional data.³ Table 1 compares
the time and space complexity for different methods. This
work makes the following contributions:
• We propose the circulant binary embedding method,
which has space complexity O(d) and time complexity
O(d log d) (Section 2, 3).
• We propose to learn the data-dependent circulant projection
matrix by a novel and efficient time-frequency
alternating optimization, which alternately optimizes
the objective in the original and frequency domains
(Section 4).
• Extensive experiments show that, compared to the
state-of-the-art, the proposed method improves the result
dramatically for a fixed time cost, and provides
much faster computation with no performance degradation
for a fixed number of bits (Section 5).
¹A few methods transform the linear projection via a nonlinear map before taking the sign (Weiss et al., 2008; Raginsky & Lazebnik, 2009).
²In principle, one can generate the random entries of the matrix on-the-fly (with fixed seeds) without needing to store the matrix. But this will increase the computational time even further.
³One could in principle use other structured matrices such as the Hadamard matrix along with a sparse random Gaussian matrix to achieve fast projection, as was done in the fast Johnson-Lindenstrauss transform (Ailon & Chazelle, 2006; Dasgupta et al., 2011), but it is still slower than circulant projection and needs more space.
Method             Time          Space     Time (Learning)
Full projection    O(d^2)        O(d^2)    O(nd^3)
Bilinear proj.     O(d^{1.5})    O(d)      O(nd^{1.5})
Circulant proj.    O(d log d)    O(d)      O(nd log d)

Table 1. Comparison of the proposed method (Circulant proj.) with other methods for generating long codes (code dimension k comparable to input dimension d). n is the number of instances used for learning data-dependent projection matrices.
2. Circulant Binary Embedding (CBE)
A circulant matrix R ∈ R^{d×d} is a matrix defined by a vector r = (r_0, r_1, ..., r_{d−1})^T (Gray, 2006).⁴

R = circ(r) :=
  [ r_0      r_{d−1}   ...      r_2      r_1
    r_1      r_0       r_{d−1}           r_2
    ...      r_1       r_0      ...      ...
    r_{d−2}            ...      ...      r_{d−1}
    r_{d−1}  r_{d−2}   ...      r_1      r_0 ].   (3)
Let D be a diagonal matrix with each diagonal entry being
a Bernoulli variable (±1 with probability 1/2). For x ∈ R^d, its d-bit Circulant Binary Embedding (CBE) with r ∈ R^d is defined as:

h(x) = sign(RDx),   (4)
where R = circ(r). The k-bit (k < d) CBE is defined
as the first k elements of h(x). The need for such a D is
discussed in Section 3. Note that applying D to x is equivalent
to applying random sign flipping to each dimension of
x. Since sign flipping can be carried out as a preprocessing
step for each input x, here onwards for simplicity we will
drop explicit mention of D. Hence the binary code is given
as h(x) = sign(Rx).
The main advantage of circulant binary embedding is its
ability to use Fast Fourier Transformation (FFT) to speed
up the computation.
Proposition 1. For d-dimensional data, CBE has space
complexity O(d), and time complexity O(d log d).
Since a circulant matrix is defined by a single column/row,
clearly the storage needed is O(d). Given a data point x,
the d-bit CBE can be efficiently computed as follows. Denote ⊛ as the operator of circular convolution. Based on the definition of the circulant matrix,

Rx = r ⊛ x.   (5)
The above can be computed based on the Discrete Fourier Transformation (DFT), for which a fast algorithm (FFT) is available. The DFT of a vector t ∈ C^d is a d-dimensional vector with each element defined as

F(t)_l = Σ_{m=0}^{d−1} t_m · e^{−i 2π l m / d},   l = 0, ..., d − 1.   (6)

The above can be expressed equivalently in matrix form as

F(t) = F_d t,   (7)

where F_d is the d-dimensional DFT matrix. Let F_d^H be the conjugate transpose of F_d. It is easy to show that F_d^{−1} = (1/d) F_d^H. Similarly, for any t ∈ C^d, the Inverse Discrete Fourier Transformation (IDFT) is defined as

F^{−1}(t) = (1/d) F_d^H t.   (8)

⁴The circulant matrix is sometimes equivalently defined by “circulating” the rows instead of the columns.
Since the convolution of two signals in their original domain is equivalent to the Hadamard product in their frequency domain (Oppenheim et al., 1999),

F(Rx) = F(r) ◦ F(x).   (9)

Therefore,

h(x) = sign( F^{−1}( F(r) ◦ F(x) ) ).   (10)
For k-bit CBE, k < d, we only need to pick the first k bits
of h(x). As DFT and IDFT can be efficiently computed in
O(d log d) with FFT (Oppenheim et al., 1999), generating
CBE has time complexity O(d log d).
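As a concrete illustration of (4) and (10), the following NumPy sketch (ours, not the authors' code; all variable and function names are invented for illustration) computes a randomized d-bit CBE code using the identity Rx = r ⊛ x, and checks it against the explicit circulant matrix:

import numpy as np

def cbe_rand(x, r, d_signs):
    # Randomized CBE, h(x) = sign(circ(r) D x), computed in O(d log d):
    # circ(r) applied to (D x) equals the circular convolution of r with D x.
    xs = d_signs * x
    proj = np.fft.ifft(np.fft.fft(r) * np.fft.fft(xs)).real
    return np.where(proj >= 0, 1, -1)

d = 8
rng = np.random.default_rng(0)
r = rng.standard_normal(d)               # r drawn i.i.d. from N(0, 1), i.e., CBE-rand
signs = rng.choice([-1.0, 1.0], size=d)  # Bernoulli +/-1 diagonal of D
x = rng.standard_normal(d)

code = cbe_rand(x, r, signs)             # d-bit code; the first k entries give the k-bit CBE

# Sanity check against the explicit circulant matrix R = circ(r).
R = np.stack([np.roll(r, j) for j in range(d)], axis=1)
assert np.array_equal(code, np.where(R @ (signs * x) >= 0, 1, -1))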
3. Randomized Circulant Binary Embedding
A simple way to obtain CBE is by generating the elements
of r in (3) independently from the standard normal
distribution N(0, 1). We call this method randomized
CBE (CBE-rand). A desirable property of any embedding
method is its ability to approximate input distances in the
embedded space. Suppose Hk(x1, x2) is the normalized
Hamming distance between k-bit codes of a pair of points
x_1, x_2 ∈ R^d:

H_k(x_1, x_2) = (1/k) Σ_{i=0}^{k−1} | sign(R_{i·} x_1) − sign(R_{i·} x_2) | / 2,   (11)

and R_{i·} is the i-th row of R, R = circ(r). If r is sampled from N(0, 1), from (Charikar, 2002),
Pr( sign(r^T x_1) ≠ sign(r^T x_2) ) = θ/π,   (12)

where θ is the angle between x_1 and x_2. Since all the vectors that are circulant variants of r also follow the same distribution, it is easy to see that

E( H_k(x_1, x_2) ) = θ/π.   (13)
For the sake of discussion, if k projections, i.e., first k rows
of R, were generated independently, it is easy to show that
the variance of Hk(x1, x2) will be
Var( H_k(x_1, x_2) ) = θ(π − θ)/(kπ^2).   (14)
Figure 1. The analytical variance of the normalized Hamming distance of independent bits as in (14), and the sample variance of the normalized Hamming distance of circulant bits, as a function of the angle between points (θ) and the number of bits (k). Each panel plots variance against log k for (a) θ = π/12, (b) θ = π/6, (c) θ = π/3, (d) θ = π/2. The two curves overlap.
Thus, with more bits (larger k), the normalized Hamming distance will be close to the expected value, with lower variance. In other words, the normalized Hamming distance approximately preserves the angle.⁵ Unfortunately, in CBE,
the projections are the rows of R = circ(r), which are
not independent. This makes it hard to derive the variance
analytically. To better understand CBE-rand, we run simulations
to compare the analytical variance of normalized
hamming distance of independent projections (14), and the
sample variance of normalized hamming distance of circulant
projections in Figure 1. For each θ and k, we randomly
generate x1, x2 ∈ R
d
such that their angle is θ
6
.
We then generate k-dimensional code with CBE-rand, and
compute the hamming distance. The variance is estimated
by applying CBE-rand 1,000 times. We repeat the whole
process 1,000 times, and compute the averaged variance.
Surprisingly, the curves of “Independent” and “Circulant”
variances are almost indistinguishable. This means that bits
generated by CBE-rand are generally as good as the independent
bits for angle preservation. An intuitive explanation is that for the circulant matrix, though all the rows are dependent, circularly shifting by one or more elements will in fact result in very different projections in most cases. We will later show in experiments on real-world data that CBE-rand and Locality Sensitive Hashing (LSH)⁷ have almost identical performance (yet CBE-rand is significantly faster) (Section 5).
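The comparison in Figure 1 can be reproduced with a short Monte Carlo experiment. The sketch below (ours, with illustrative parameter choices) estimates the sample variance of the circulant Hamming distance for one (θ, k) pair and prints it next to the analytical value θ(π − θ)/(kπ²) from (14):

import numpy as np

d, k, theta, trials = 256, 32, np.pi / 6, 1000
rng = np.random.default_rng(0)

# Two unit vectors with angle theta: embed (1, 0) and (cos theta, sin theta)
# in R^d via a random orthonormal basis (as in footnote 6).
basis, _ = np.linalg.qr(rng.standard_normal((d, 2)))
x1 = basis[:, 0]
x2 = np.cos(theta) * basis[:, 0] + np.sin(theta) * basis[:, 1]

def hamming(r, signs):
    # Normalized Hamming distance between the first k bits of the two CBE codes.
    R = np.stack([np.roll(r, j) for j in range(d)], axis=1)   # R = circ(r)
    c1 = np.sign(R[:k] @ (signs * x1))
    c2 = np.sign(R[:k] @ (signs * x2))
    return np.mean(c1 != c2)

samples = []
for _ in range(trials):
    r = rng.standard_normal(d)                # CBE-rand: r ~ N(0, I)
    signs = rng.choice([-1.0, 1.0], size=d)   # diagonal of D
    samples.append(hamming(r, signs))

print("sample mean    :", np.mean(samples), "  expected:", theta / np.pi)
print("sample variance:", np.var(samples),
      "  analytical (independent bits):", theta * (np.pi - theta) / (k * np.pi**2))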
⁵In this paper, we consider the case that the data points are ℓ2 normalized. Therefore the cosine distance, i.e., 1 − cos(θ), is equivalent to the ℓ2 distance.
⁶This can be achieved by extending the 2D points (1, 0), (cos θ, sin θ) to d dimensions, and performing a random orthonormal rotation, which can be formed by the Gram-Schmidt process on random vectors.
⁷Here, by LSH we mean the binary embedding using R such that all the rows of R are sampled i.i.d. With a slight abuse of notation, we still call it “hashing” following (Charikar, 2002).
Note that the distortion in input distances after circulant
binary embedding comes from two sources: circulant projection,
and binarization. For the circulant projection step, recent works have shown that a Johnson-Lindenstrauss-type lemma holds with a slightly worse bound on the number of projections needed to preserve the input distances with high probability (Hinrichs & Vybíral, 2011; Zhang & Cheng, 2013; Vybíral, 2011; Krahmer & Ward, 2011). These works also show that before applying the circulant projection, an additional step of randomly flipping the signs of the input dimensions is necessary.⁸ To see why such a step is required, consider the special case when x is an all-one vector, 1. The circulant projection with R = circ(r) will result in a vector whose elements all equal r^T 1. When r is independently drawn from N(0, 1), this will be close to 0, and the norm cannot be preserved. Unfortunately, the
Johnson-Lindenstrauss-type results do not generalize to the
distortion caused by the binarization step.
One problem with the randomized CBE method is that it
does not utilize the underlying data distribution while generating
the matrix R. In the next section, we propose to
learn R in a data-dependent fashion, to minimize the distortions
due to circulant projection and binarization.
4. Learning Circulant Binary Embedding
We propose data-dependent CBE (CBE-opt), by optimizing
the projection matrix with a novel time-frequency alternating
optimization. We consider the following objective
function in learning the d-bit CBE. The extension of
learning k < d bits will be shown in Section 4.2.
argmin_{B, r}  ||B − X R^T||_F^2 + λ ||R R^T − I||_F^2   (15)
s.t.  R = circ(r),

where X ∈ R^{n×d} is the data matrix containing n training points, X = [x_0, ..., x_{n−1}]^T, and B ∈ {−1, 1}^{n×d} is the corresponding binary code matrix.⁹
In the above optimization, the first term minimizes distortion
due to binarization. The second term tries to make the
projections (rows of R, and hence the corresponding bits)
as uncorrelated as possible. In other words, this helps to
reduce the redundancy in the learned code. If R were to be
an orthogonal matrix, the second term would vanish and the
optimization would find the best rotation such that the distortion
due to binarization is minimized. However, when
R is a circulant matrix, R, in general, will not be orthogonal.
A similar objective has been used in previous works, including (Gong et al., 2013b;a) and (Wang et al., 2010).
⁸For each dimension, whether the sign needs to be flipped is predetermined by a (p = 0.5) Bernoulli variable.
⁹If the data is ℓ2 normalized, we can set B ∈ {−1/√d, 1/√d}^{n×d} to make B and X R^T more comparable. This does not empirically influence the performance.
4.1. The Time-Frequency Alternating Optimization
The above is a combinatorial optimization problem, for
which an optimal solution is hard to find. In this section
we propose a novel approach to efficiently find a local solution.
The idea is to alternatively optimize the objective
by fixing r, and B, respectively. For a fixed r, optimizing
B can be easily performed in the input domain (“time” as
opposed to “frequency”). For a fixed B, the circulant structure
of R makes it difficult to optimize the objective in the
input domain. Hence we propose a novel method, by optimizing
r in the frequency domain based on DFT. This leads
to a very efficient procedure.
For a fixed r. The objective is independent over each element of B. Denote B_{ij} as the element of the i-th row and j-th column of B. It is easy to show that B can be updated as:

B_{ij} = 1 if R_{j·} x_i ≥ 0, and B_{ij} = −1 if R_{j·} x_i < 0,   (16)

for i = 0, ..., n − 1 and j = 0, ..., d − 1.
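In matrix form, update (16) is simply an element-wise sign of the projected data. A minimal NumPy sketch (ours, assuming X is the n × d data matrix) is:

import numpy as np

def update_B(X, r):
    # B-step of the alternating optimization: B = sign(X R^T) with R = circ(r),
    # i.e., B_ij = sign(R_j. x_i) as in (16).
    d = r.shape[0]
    R = np.stack([np.roll(r, j) for j in range(d)], axis=1)  # R = circ(r)
    return np.where(X @ R.T >= 0, 1, -1)

For large d, the product X R^T can itself be formed row by row with the FFT, exactly as in (10).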
For a fixed B. Define r̃ as the DFT of the circulant vector, r̃ := F(r). Instead of solving for r directly, we propose to solve for r̃, from which r can be recovered by the IDFT. Key to our derivation is the fact that the DFT projects the signal onto a set of orthogonal bases, so the ℓ2 norm is preserved. Formally, according to Parseval's theorem, for any t ∈ C^d (Oppenheim et al., 1999),

||t||_2^2 = (1/d) ||F(t)||_2^2.
Denote diag(·) as the diagonal matrix formed by a vector, and ℜ(·) and ℑ(·) as the real and imaginary parts, respectively. We use B_{i·} to denote the i-th row of B. With complex arithmetic, the first term in (15) can be expressed in the frequency domain as:

||B − X R^T||_F^2 = (1/d) Σ_{i=0}^{n−1} ||F(B_{i·}^T − R x_i)||_2^2   (17)
  = (1/d) Σ_{i=0}^{n−1} ||F(B_{i·}^T) − r̃ ◦ F(x_i)||_2^2 = (1/d) Σ_{i=0}^{n−1} ||F(B_{i·}^T) − diag(F(x_i)) r̃||_2^2
  = (1/d) Σ_{i=0}^{n−1} ( F(B_{i·}^T) − diag(F(x_i)) r̃ )^H ( F(B_{i·}^T) − diag(F(x_i)) r̃ )
  = (1/d) [ ℜ(r̃)^T M ℜ(r̃) + ℑ(r̃)^T M ℑ(r̃) + ℜ(r̃)^T h + ℑ(r̃)^T g ] + ||B||_F^2,

where

M = diag( Σ_{i=0}^{n−1} ℜ(F(x_i)) ◦ ℜ(F(x_i)) + ℑ(F(x_i)) ◦ ℑ(F(x_i)) ),
h = −2 Σ_{i=0}^{n−1} ℜ(F(x_i)) ◦ ℜ(F(B_{i·}^T)) + ℑ(F(x_i)) ◦ ℑ(F(B_{i·}^T)),
g = 2 Σ_{i=0}^{n−1} ℑ(F(x_i)) ◦ ℜ(F(B_{i·}^T)) − ℜ(F(x_i)) ◦ ℑ(F(B_{i·}^T)).
For the second term in (15), we note that a circulant matrix can be diagonalized by the DFT matrix F_d and its conjugate transpose F_d^H. Formally, for R = circ(r), r ∈ R^d,

R = (1/d) F_d^H diag(F(r)) F_d.   (18)

Let Tr(·) be the trace of a matrix. Therefore,

||R R^T − I||_F^2 = || (1/d) F_d^H ( diag(r̃)^H diag(r̃) − I ) F_d ||_F^2
  = Tr( (1/d) F_d^H ( diag(r̃)^H diag(r̃) − I )^H ( diag(r̃)^H diag(r̃) − I ) F_d )
  = Tr( ( diag(r̃)^H diag(r̃) − I )^H ( diag(r̃)^H diag(r̃) − I ) )
  = || conj(r̃) ◦ r̃ − 1 ||_2^2 = || ℜ(r̃)^2 + ℑ(r̃)^2 − 1 ||_2^2.   (19)
Furthermore, as r is real-valued, additional constraints on r̃ are needed. For any u ∈ C, denote conj(u) as the complex conjugate of u. We have the following result (Oppenheim et al., 1999): for any real-valued vector t ∈ C^d, F(t)_0 is real-valued, and

F(t)_{d−i} = conj( F(t)_i ),   i = 1, ..., ⌊d/2⌋.
From (17)–(19), the problem of optimizing r̃ becomes

argmin_{r̃}  ℜ(r̃)^T M ℜ(r̃) + ℑ(r̃)^T M ℑ(r̃) + ℜ(r̃)^T h + ℑ(r̃)^T g + λd || ℜ(r̃)^2 + ℑ(r̃)^2 − 1 ||_2^2   (20)
s.t.  ℑ(r̃_0) = 0,
      ℜ(r̃_i) = ℜ(r̃_{d−i}),   i = 1, ..., ⌊d/2⌋,
      ℑ(r̃_i) = −ℑ(r̃_{d−i}),   i = 1, ..., ⌊d/2⌋.
The above is non-convex. Fortunately, the objective function
can be decomposed, such that we can solve two variables
at a time. Denote the diagonal vector of the diagonal
matrix M as m. The above optimization can then be decomposed
to the following sets of optimizations.
argmin_{r̃_0}  m_0 r̃_0^2 + h_0 r̃_0 + λd ( r̃_0^2 − 1 )^2,   s.t.  r̃_0 = conj(r̃_0).   (21)

argmin_{r̃_i}  (m_i + m_{d−i}) ( ℜ(r̃_i)^2 + ℑ(r̃_i)^2 ) + 2λd ( ℜ(r̃_i)^2 + ℑ(r̃_i)^2 − 1 )^2   (22)
              + (h_i + h_{d−i}) ℜ(r̃_i) + (g_i − g_{d−i}) ℑ(r̃_i),   i = 1, ..., ⌊d/2⌋.
In (21), we need to minimize a 4th-order polynomial in one variable, for which a closed-form solution is readily available. In (22), we need to minimize a 4th-order polynomial in two variables. Though the closed-form solution is hard (requiring the solution of a cubic bivariate system), we can find local minima by gradient descent, which can be considered as having constant running time for such small-scale problems. The overall objective is guaranteed to be non-increasing in each step. In practice, we can get a good solution
with just 5-10 iterations. In summary, the proposed
time-frequency alternating optimization procedure has running
time O(nd log d).
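To illustrate the closed-form step, the snippet below (ours, not the authors' code) minimizes the single-variable quartic in (21), i.e., m_0 r̃_0^2 + h_0 r̃_0 + λd (r̃_0^2 − 1)^2 over real r̃_0, by taking the real roots of its cubic derivative:

import numpy as np

def solve_r0(m0, h0, lam_d):
    # Closed-form solution of (21): minimize m0*r^2 + h0*r + lam_d*(r^2 - 1)^2 over real r.
    f = lambda r: m0 * r**2 + h0 * r + lam_d * (r**2 - 1)**2
    # Stationary points satisfy 4*lam_d*r^3 + (2*m0 - 4*lam_d)*r + h0 = 0.
    roots = np.roots([4 * lam_d, 0.0, 2 * m0 - 4 * lam_d, h0])
    real_roots = roots[np.abs(roots.imag) < 1e-9].real
    return min(real_roots, key=f)

print(solve_r0(m0=1.0, h0=-0.5, lam_d=2.0))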
4.2. Learning k < d Bits
In the case of learning k < d bits, we need to solve the
following optimization problem:
argmin_{B, r}  ||B P_k − X P_k^T R^T||_F^2 + λ ||R P_k P_k^T R^T − I||_F^2   (23)
s.t.  R = circ(r),

in which P_k = [ I_k, O; O, O_{d−k} ], I_k is a k × k identity matrix, and O_{d−k} is a (d − k) × (d − k) all-zero matrix.
In fact, the right multiplication of Pk can be understood as
a “temporal cut-off”, which is equivalent to a frequency domain
convolution. This makes the optimization difficult, as
the objective in frequency domain can no longer be decomposed.
To address this issue, we propose a simple solution in which B_{ij} = 0 for i = 0, ..., n − 1, j = k, ..., d − 1 in (15). Thus, the optimization procedure remains the same,
and the cost is also O(nd log d). We will show in experiments
that this heuristic provides good performance in
practice.
5. Experiments
To compare the performance of the proposed circulant
binary embedding technique, we conducted experiments
on three real-world high-dimensional datasets used by the
current state-of-the-art method for generating long binary
codes (Gong et al., 2013a). The Flickr-25600 dataset contains
100K images sampled from a noisy Internet image
collection. Each image is represented by a 25,600-dimensional vector. The ImageNet-51200 dataset contains 100K images sampled from 100 random classes of ImageNet (Deng et al., 2009), each represented by a 51,200-dimensional vector. The third dataset (ImageNet-25600) is another random subset of ImageNet containing 100K images in 25,600-dimensional space. All the vectors are normalized
to be of unit norm.
We compared the performance of the randomized (CBErand)
and learned (CBE-opt) versions of our circulant
embeddings with the current state-of-the-art for highdimensional
data, i.e., bilinear embeddings. We use both
the randomized (bilinear-rand) and learned (bilinear-opt)
versions. Bilinear embeddings have been shown to perform
similar or better than another promising technique
called Product Quantization (Jegou et al., 2011). Finally,
we also compare against the binary codes produced by the
baseline LSH method (Charikar, 2002), which is still applicable
to 25,600- and 51,200-dimensional features, but with much longer running time and much more space. We also show an experiment with relatively low-dimensional data in 2048-dimensional space, using Flickr data, to compare against techniques that perform well for low-dimensional data but do not scale to the high-dimensional scenario. Exam-
C/C++ Thread Safety Analysis
DeLesley Hutchins
Google Inc.
Email: delesley@google.com
Aaron Ballman
CERT/SEI
Email: aballman@cert.org
Dean Sutherland
Email: dfsuther@cs.cmu.edu
Abstract—Writing multithreaded programs is hard. Static
analysis tools can help developers by allowing threading policies
to be formally specified and mechanically checked. They essentially
provide a static type system for threads, and can detect
potential race conditions and deadlocks.
This paper describes Clang Thread Safety Analysis, a tool
which uses annotations to declare and enforce thread safety
policies in C and C++ programs. Clang is a production-quality
C++ compiler which is available on most platforms, and the
analysis can be enabled for any build with a simple warning
flag: -Wthread-safety.
The analysis is deployed on a large scale at Google, where
it has provided sufficient value in practice to drive widespread
voluntary adoption. Contrary to popular belief, the need for
annotations has not been a liability, and even confers some
benefits with respect to software evolution and maintenance.
I. INTRODUCTION
Writing multithreaded programs is hard, because developers
must consider the potential interactions between concurrently
executing threads. Experience has shown that developers need
help using concurrency correctly [1]. Many frameworks and
libraries impose thread-related policies on their clients, but
they often lack explicit documentation of those policies. Where
such policies are clearly documented, that documentation
frequently takes the form of explanatory prose rather than a
checkable specification.
Static analysis tools can help developers by allowing
threading policies to be formally specified and mechanically
checked. Examples of threading policies are: “the mutex mu
should always be locked before reading or writing variable
accountBalance” and “the draw() method should only be
invoked from the GUI thread.”
Formal specification of policies provides two main benefits.
First, the compiler can issue warnings on policy violations.
Finding potential bugs at compile time is much less expensive
in terms of engineer time than debugging failed unit tests, or
worse, having subtle threading bugs hit production.
Second, specifications serve as a form of machine-checked
documentation. Such documentation is especially important
for software libraries and APIs, because engineers need to
know the threading policy to correctly use them. Although
documentation can be put in comments, our experience shows
that comments quickly “rot” because they are not updated
when variables are renamed or code is refactored.
This paper describes thread safety analysis for Clang. The
analysis was originally implemented in GCC [2], but the GCC
version is no longer being maintained. Clang is a productionquality
C++ compiler, which is available on most platforms,
including MacOS, Linux, and Windows. The analysis is
currently implemented as a compiler warning. It has been
deployed on a large scale at Google; all C++ code at Google is
now compiled with thread safety analysis enabled by default.
II. OVERVIEW OF THE ANALYSIS
Thread safety analysis works very much like a type system
for multithreaded programs. It is based on theoretical work
on race-free type systems [3]. In addition to declaring the
type of data (int , float , etc.), the programmer may optionally
declare how access to that data is controlled in a multithreaded
environment.
Clang thread safety analysis uses annotations to declare
threading policies. The annotations can be written using either
GNU-style attributes (e.g., attribute ((...))) or C++11-
style attributes (e.g., [[...]] ). For portability, the attributes are
typically hidden behind macros that are disabled when not
compiling with Clang. Examples in this paper assume the use
of macros; actual attribute names, along with a complete list
of all attributes, can be found in the Clang documentation [4].
Figure 1 demonstrates the basic concepts behind the
analysis, using the classic bank account example. The
GUARDED BY attribute declares that a thread must lock mu
before it can read or write to balance, thus ensuring that
the increment and decrement operations are atomic. Similarly,
REQUIRES declares that the calling thread must lock mu
before calling withdrawImpl. Because the caller is assumed to
have locked mu, it is safe to modify balance within the body
of the method.
In the example, the depositImpl() method lacks a REQUIRES
clause, so the analysis issues a warning. Thread safety analysis
is not interprocedural, so caller requirements must be explicitly
declared. There is also a warning in transferFrom(), because it
fails to lock b.mu even though it correctly locks this−>mu.
The analysis understands that these are two separate mutexes
in two different objects. Finally, there is a warning in the
withdraw() method, because it fails to unlock mu. Every lock
must have a corresponding unlock; the analysis detects both
double locks and double unlocks. A function may acquire
a lock without releasing it (or vice versa), but it must be
annotated to specify this behavior.
A. Running the Analysis
To run the analysis, first download and install Clang [5].
Then, compile with the −Wthread−safety flag:
clang -c -Wthread-safety example.cpp
#include "mutex.h"

class BankAcct {
  Mutex mu;
  int balance GUARDED_BY(mu);

  void depositImpl(int amount) {
    // WARNING!  Must lock mu.
    balance += amount;
  }

  void withdrawImpl(int amount) REQUIRES(mu) {
    // OK.  Caller must have locked mu.
    balance -= amount;
  }

 public:
  void withdraw(int amount) {
    mu.lock();
    // OK.  We've locked mu.
    withdrawImpl(amount);
    // WARNING!  Failed to unlock mu.
  }

  void transferFrom(BankAcct& b, int amount) {
    mu.lock();
    // WARNING!  Must lock b.mu.
    b.withdrawImpl(amount);
    // OK.  depositImpl() has no requirements.
    depositImpl(amount);
    mu.unlock();
  }
};

Fig. 1. Thread Safety Annotations
Note that this example assumes the presence of a suitably
annotated mutex.h [4] that declares which methods perform
locking and unlocking.
B. Thread Roles
Thread safety analysis was originally designed to enforce
locking policies such as the one previously described, but locks
are not the only way to ensure safety. Another common pattern
in many systems is to assign different roles to different threads,
such as “worker thread” or “GUI thread” [6].
The same concepts used for mutexes and locking can also
be used for thread roles, as shown in Figure 2. Here, a
widget library has two threads, one to handle user input, like
mouse clicks, and one to handle rendering. It also enforces a
constraint: the draw() method should be invoked only by
the GUI thread. The analysis will warn if draw() is invoked
directly from onClick().
The rest of this paper will focus discussion on mutexes in
the interest of brevity, but there are analogous examples for
thread roles.
III. BASIC CONCEPTS
Clang thread safety analysis is based on a calculus of
capabilities [7] [8]. To read or write to a particular location in
memory, a thread must have the capability, or permission, to
do so. A capability can be thought of as an unforgeable key,
#include "ThreadRole.h"

ThreadRole Input_Thread;
ThreadRole GUI_Thread;

class Widget {
 public:
  virtual void onClick() REQUIRES(Input_Thread);
  virtual void draw() REQUIRES(GUI_Thread);
};

class Button : public Widget {
 public:
  void onClick() override {
    depressed = true;
    draw();  // WARNING!
  }
};

Fig. 2. Thread Roles
or token, which the thread must present to perform the read
or write.
Capabilities can be either unique or shared. A unique
capability cannot be copied, so only one thread can hold
the capability at any one time. A shared capability may
have multiple copies that are shared among multiple threads.
Uniqueness is enforced by a linear type system [9].
The analysis enforces a single-writer/multiple-reader discipline.
Writing to a guarded location requires a unique capability,
and reading from a guarded location requires either a
unique or a shared capability. In other words, many threads can
read from a location at the same time because they can share
the capability, but only one thread can write to it. Moreover, a
thread cannot write to a memory location at the same time that
another thread is reading from it, because a capability cannot
be both shared and unique at the same time.
This discipline ensures that programs are free of data
races, where a data race is defined as a situation that occurs
when multiple threads attempt to access the same location in
memory at the same time, and at least one of the accesses
is a write [10]. Because write operations require a unique
capability, no other thread can access the memory location
at that time.
A. Background: Uniqueness and Linear Logic
Linear logic is a formal theory for reasoning about resources;
it can be used to express logical statements like: “You
cannot have your cake and eat it too” [9]. A unique, or linear,
variable must be used exactly once; it cannot be duplicated
(used multiple times) or forgotten (not used).
A unique object is produced at one point in the program,
and then later consumed. Functions that use the object without
consuming it must be written using a hand-off protocol. The
caller hands the object to the function, thus relinquishing
control of it; the function hands the object back to the caller
when it returns.
For example, if std::stringstream were a linear type, stream programs would be written as follows:

std::stringstream ss;          // produce ss
auto& ss2 = ss << "Hello ";    // consume ss
auto& ss3 = ss2 << "World.";   // consume ss2
return ss3.str();              // consume ss3
Notice that each stream variable is used exactly once. A
linear type system is unaware that ss and ss2 refer to the same
stream; the calls to << conceptually consume one stream and
produce another with a different name. Attempting to use ss
a second time would be flagged as a use-after-consume error.
Failure to call ss3. str () before returning would also be an
error because ss3 would then be unused.
B. Naming of Capabilities
Passing unique capabilities explicitly, following the pattern
described previously, would be needlessly tedious, because
every read or write operation would introduce a new name.
Instead, Clang thread safety analysis tracks capabilities as
unnamed objects that are passed implicitly. The resulting type
system is formally equivalent to linear logic but is easier to
use in practical programming.
Each capability is associated with a named C++ object,
which identifies the capability and provides operations to
produce and consume it. The C++ object itself is not unique.
For example, if mu is a mutex, mu.lock() produces a unique, but unnamed, capability of type Cap<mu> (a dependent type). Similarly, mu.unlock() consumes an implicit parameter of type Cap<mu>. Operations that read or write to data that is guarded by mu follow a hand-off protocol: they consume an implicit parameter of type Cap<mu> and produce an implicit result of type Cap<mu>.
C. Erasure Semantics
Because capabilities are implicit and are used only for typechecking
purposes, they have no run time effect. As a result,
capabilities can be fully erased from an annotated program,
yielding an unannotated program with identical behavior.
In Clang, this erasure property is expressed in two ways.
First, recommended practice is to hide the annotations behind
macros, where they can be literally erased by redefining the
macros to be empty. However, literal erasure is unnecessary.
The analysis is entirely static and is implemented as a compile
time warning; it cannot affect Clang code generation in any
way.
IV. THREAD SAFETY ANNOTATIONS
This section provides a brief overview of the main annotations
that are supported by the analysis. The full list can be
found in the Clang documentation [4].
GUARDED_BY(...) and PT_GUARDED_BY(...)
GUARDED_BY is an attribute on a data member; it declares
that the data is protected by the given capability. Read operations
on the data require at least a shared capability; write
operations require a unique capability.
PT_GUARDED_BY is similar but is intended for use on
pointers and smart pointers. There is no constraint on the data
member itself; rather, the data it points to is protected by the
given capability.
Mutex mu;
int *p2 PT_GUARDED_BY(mu);

void test() {
  *p2 = 42;      // Warning!
  p2 = new int;  // OK (no GUARDED_BY).
}
REQUIRES(...) and REQUIRES_SHARED(...)
REQUIRES is an attribute on functions; it declares that
the calling thread must have unique possession of the
given capability. More than one capability may be specified,
and a function may have multiple REQUIRES attributes.
REQUIRES_SHARED is similar, but the specified capabilities
may be either shared or unique.
Formally, the REQUIRES clause states that a function takes
the given capability as an implicit argument and hands it back
to the caller when it returns, as an implicit result. Thus, the
caller must hold the capability on entry to the function and
will still hold it on exit.
Mutex mu;
int a GUARDED_BY(mu);

void foo() REQUIRES(mu) {
  a = 0;  // OK.
}

void test() {
  foo();  // Warning!  Requires mu.
}
ACQUIRE(...) and RELEASE(...)
The ACQUIRE attribute annotates a function that produces
a unique capability (or capabilities), for example, by acquiring
it from some other thread. The caller must not-hold the given
capability on entry, and will hold the capability on exit.
RELEASE annotates a function that consumes a unique
capability, (e.g., by handing it off to some other thread). The
caller must hold the given capability on entry, and will nothold
it on exit.
ACQUIRE SHARED and RELEASE SHARED are similar,
but produce and consume shared capabilities.
Formally, the ACQUIRE clause states that the function
produces and returns a unique capability as an implicit result;
RELEASE states that the function takes the capability as an
implicit argument and consumes it.
Attempts to acquire a capability that is already held or
to release a capability that is not held are diagnosed with a
compile time warning.
CAPABILITY(...)
The CAPABILITY attribute is placed on a struct, class or a
typedef; it specifies that objects of that type can be used to
identify a capability. For example, the threading libraries at
Google define the Mutex class as follows:

class CAPABILITY("mutex") Mutex {
 public:
  void lock()          ACQUIRE(this);
  void readerLock()    ACQUIRE_SHARED(this);
  void unlock()        RELEASE(this);
  void readerUnlock()  RELEASE_SHARED(this);
};
Mutexes are ordinary C++ objects. However, each mutex
object has a capability associated with it; the lock () and
unlock() methods acquire and release that capability.
Note that Clang thread safety analysis makes no attempt
to verify the correctness of the underlying Mutex implementation.
Rather, the annotations allow the interface of Mutex
to be expressed in terms of capabilities. We assume that the
underlying code implements that interface correctly, e.g., by
ensuring that only one thread can hold the mutex at any one
time.
TRY_ACQUIRE(b, ...) and TRY_ACQUIRE_SHARED(b, ...)
These are attributes on a function or method that attempts
to acquire the given capability and returns a boolean value
indicating success or failure. The argument b must be true or
false, to specify which return value indicates success.
NO_THREAD_SAFETY_ANALYSIS
NO_THREAD_SAFETY_ANALYSIS is an attribute on functions
that turns off thread safety checking for the annotated
function. It provides a means for programmers to opt out
of the analysis for functions that either (a) are deliberately
thread-unsafe, or (b) are thread-safe, but too complicated for
the analysis to understand.
Negative Requirements
All of the previously described requirements are
positive requirements, where a function requires that certain
capabilities be held on entry. However, the analysis can also
track negative requirements, where a function requires that a
capability be not-held on entry.
Positive requirements are used to prevent race conditions.
Negative requirements are used to prevent deadlock. Many mutex
implementations are not reentrant, because making them
reentrant entails a significant performance cost. Attempting
to acquire a non-reentrant mutex that is already held will
deadlock the program.
To avoid deadlock, acquiring a capability requires a proof
that the capability is not currently held. The analysis represents
this proof as a negative capability, which is expressed using
the ! negation operator:
Mutex mu;
int a GUARDED_BY(mu);

void clear() REQUIRES(!mu) {
  mu.lock();
  a = 0;
  mu.unlock();
}

void reset() {
  mu.lock();
  // Warning!  Caller cannot hold 'mu'.
  clear();
  mu.unlock();
}
Negative capabilities are tracked in much the same way as
positive capabilities, but there is a bit of extra subtlety.
Positive requirements are typically confined within the class
or the module in which they are declared. For example, if a
thread-safe class declares a private mutex, and does all locking
and unlocking of that mutex internally, then there is no reason
clients of the class need to know that the mutex exists.
Negative requirements lack this property. If a class declares
a private mutex mu, and locks mu internally, then clients
should theoretically have to provide proof that they have not
locked mu before calling any methods of the class. Moreover,
there is no way for a client function to prove that it does not
hold mu, except by adding REQUIRES(!mu) to the function
definition. As a result, negative requirements tend to propagate
throughout the code base, which breaks encapsulation.
To avoid such propagation, the analysis restricts the visibility
of negative capabilities. The analyzer assumes that it holds
a negative capability for any object that is not defined within
the current lexical scope. The scope of a class member is
assumed to be its enclosing class, while the scope of a global
variable is the translation unit in which it is defined.
Unfortunately, this visibility-based assumption is unsound.
For example, a class with a private mutex may lock the mutex,
and then call some external routine, which calls a method in
the original class that attempts to lock the mutex a second
time. The analysis will generate a false negative in this case.
Based on our experience in deploying thread safety analysis
at Google, we believe this to be a minor problem. It is
relatively easy to avoid this situation by following good
software design principles and maintaining proper separation
of concerns. Moreover, when compiled in debug mode, the
Google mutex implementation does a run time check to see
if the mutex is already held, so this particular error can be
caught by unit tests at run time.
V. IMPLEMENTATION
The Clang C++ compiler provides a sophisticated infrastructure
for implementing warnings and static analysis. Clang
initially parses a C++ input file to an abstract syntax tree
(AST), which is an accurate representation of the original
source code, down to the location of parentheses. In contrast,
many compilers, including GCC, lower to an intermediate
language during parsing. The accuracy of the AST makes it
easier to emit quality diagnostics, but complicates the analysis
in other respects.
The Clang semantic analyzer (Sema) decorates the AST
with semantic information. Name lookup, function overloading,
operator overloading, template instantiation, and type
checking are all performed by Sema when constructing the
AST. Clang inserts special AST nodes for implicit C++ operations,
such as automatic casts, LValue-to-RValue conversions, implicit destructor calls, and so on, so the AST provides an
accurate model of C++ program semantics.
Finally, the Clang analysis infrastructure constructs a control
flow graph (CFG) for each function in the AST. This is not a
lowering step; each statement in the CFG points back to the
AST node that created it. The CFG is shared infrastructure;
the thread safety analysis is only one of its many clients.
A. Analysis Algorithm
The thread safety analysis algorithm is flow-sensitive, but
not path-sensitive. It starts by performing a topological sort of
the CFG, and identifying back edges. It then walks the CFG in
topological order, and computes the set of capabilities that are
known to be held, or known not to be held, at every program
point.
When the analyzer encounters a call to a function that
is annotated with ACQUIRE, it adds a capability to the set;
when it encounters a call to a function that is annotated with
RELEASE, it removes it from the set. Similarly, it looks for
REQUIRES attributes on function calls, and GUARDED BY
on loads or stores to variables. It checks that the appropriate
capability is in the current set, and issues a warning if it is
not.
When the analyzer encounters a join point in the CFG,
it checks to confirm that every predecessor basic block has
the same set of capabilities on exit. Back edges are handled
similarly: a loop must have the same set of capabilities on
entry to and exit from the loop.
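To make the dataflow concrete, the following Python sketch (ours, not part of Clang; the CFG encoding and the event names are invented for illustration) propagates capability sets in topological order and performs the join-point check just described:

from graphlib import TopologicalSorter

# Hypothetical CFG: blocks, directed edges, and per-block events.
# Events: ("acquire", cap), ("release", cap), ("require", cap).
def analyze(blocks, edges, events):
    preds = {b: [] for b in blocks}
    for src, dst in edges:
        preds[dst].append(src)
    exit_sets, warnings = {}, []

    # Walk the CFG in topological order (back edges are not modeled here).
    for b in TopologicalSorter({b: preds[b] for b in blocks}).static_order():
        inputs = [exit_sets[p] for p in preds[b]]
        held = set(inputs[0]) if inputs else set()
        # Join point: every predecessor must agree on the capability set.
        if any(s != held for s in inputs[1:]):
            warnings.append(f"{b}: capability sets do not match at join point")
        for kind, cap in events.get(b, []):
            if kind == "acquire":
                held.add(cap)
            elif kind == "release":
                held.discard(cap)
            elif kind == "require" and cap not in held:
                warnings.append(f"{b}: requires '{cap}' which is not held")
        exit_sets[b] = held
    return warnings

# A conditional-lock situation: mu is acquired on only one branch.
blocks = ["entry", "then", "join"]
edges = [("entry", "then"), ("entry", "join"), ("then", "join")]
events = {"then": [("acquire", "mu")], "join": [("require", "mu")]}
print(analyze(blocks, edges, events))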
Because the analysis is not path-sensitive, it cannot handle
control-flow situations in which a mutex might or might not
be held, depending on which branch was taken. For example:
void foo() {
  if (b) mutex.lock();
  // Warning: mutex may or may not be held here.
  doSomething();
  if (b) mutex.unlock();
}

void lockAll() {
  // Warning: capability sets do not match
  // at start and end of loop.
  for (unsigned i = 0; i < n; ++i)
    mutexArray[i].lock();
}
Although this seems like a serious limitation, we have found
that conditionally held locks are relatively unimportant in practical
programming. Reading or writing to a guarded location
in memory requires that the mutex be held unconditionally, so
attempting to track locks that might be held has little benefit in
practice, and usually indicates overly complex or poor-quality
code.
Requiring that capability sets be the same at join points
also speeds up the algorithm considerably. The analyzer need
not iterate to a fixpoint; thus it traverses every statement in
the program exactly once. Consequently, the computational
complexity of the analysis is O(n) with respect to code size.
The compile time overhead of the warning is minimal.
B. Intermediate Representation
Each capability is associated with a C++ object. C++ objects
are run-time entities that are identified by C++ expressions.
The same object may be identified by different expressions in
different scopes. For example:
class Foo {
  Mutex mu;
  bool compare(const Foo& other)
      REQUIRES(this->mu, other.mu);
};

void bar() {
  Foo a;
  Foo *b;
  ...
  a.mu.lock();
  b->mu.lock();
  // REQUIRES (&a)->mu, (*b).mu
  a.compare(*b);
  ...
}
Clang thread safety analysis is dependently typed: note that
the REQUIRES clause depends on both this and other, which
are parameters to compare. The analyzer performs variable
substitution to obtain the appropriate expressions within bar();
it substitutes &a for this and ∗b for other.
Recall, however, that the Clang AST does not lower C++
expressions to an intermediate language; rather, it stores them
in a format that accurately represents the original source code.
Consequently, (&a)->mu and a.mu are different expressions.
A dependent type system must be able to compare expressions
for semantic (not syntactic) equality. The analyzer implements
a simple compiler intermediate representation (IR),
and lowers Clang expressions to the IR for comparison. It
also converts the Clang CFG into static single assignment (SSA) form so that the analyzer will not be confused by local
variables that take on different values in different places.
C. Limitations
Clang thread safety analysis has a number of limitations.
The three major ones are:
No attributes on types. Thread safety attributes are attached
to declarations rather than types. For example, it
is not possible to write vector<int GUARDED_BY(mu)>, or (int GUARDED_BY(mu))[10]. If attributes could be attached to types, PT_GUARDED_BY would be unnecessary.
Attaching attributes to types would result in a better and
more accurate analysis. However, it was deemed infeasible
for C++ because it would require invasive changes to the C++
type system that could potentially affect core C++ semantics
in subtle ways, such as template instantiation and function
overloading.
No dependent type parameters. Race-free type systems as
described in the literature often allow classes to be parameterized
by the objects that are responsible for controlling access [11], [3]. For example, assume a Graph class has a list of nodes, and a single mutex that protects all of them. In this case, the Node class should technically be parameterized by the graph
object that guards it (similar to inner classes in Java), but that
relationship cannot be easily expressed with attributes.
No alias analysis. C++ programs typically make heavy use
of pointer aliasing; we currently lack an alias analysis. This
can occasionally cause false positives, such as when a program
locks a mutex using one alias, but the GUARDED BY attribute
refers to the same mutex using a different alias.
VI. EXPERIMENTAL RESULTS AND CONCLUSION
Clang thread safety analysis is currently deployed on a wide
scale at Google. The analysis is turned on by default, across
the company, for every C++ build. Over 20,000 C++ files
are currently annotated, with more than 140,000 annotations,
and those numbers are increasing every day. The annotated
code spans a wide range of projects, including many of
Google’s core services. Use of the annotations at Google is
entirely voluntary, so the high level of adoption suggests that
engineering teams at Google have found the annotations to be
useful.
Because race conditions are insidious, Google uses both
static analysis and dynamic analysis tools such as Thread
Sanitizer [12]. We have found that these tools complement
each other. Dynamic analysis operates without annotations and
thus can be applied more widely. However, dynamic analysis
can only detect race conditions in the subset of program
executions that occur in test code. As a result, effective
dynamic analysis requires good test coverage, and cannot
report potential bugs until test time. Static analysis is less
flexible, but covers all possible program executions; it also
reports errors earlier, at compile time.
Although the need for handwritten annotations may appear
to be a disadvantage, we have found that the annotations
confer significant benefits with respect to software evolution
and maintenance. Thread safety annotations are widely used
in Google’s core libraries and APIs. Annotating libraries has
proven to be particularly important, because the annotations
serve as a form of machine-checked documentation. The
developers of a library and the clients of that library are usually
different engineering teams. As a result, the client teams often
do not fully understand the locking protocol employed by the
library. Other documentation is usually out of date or nonexistent,
so it is easy to make mistakes. By using annotations,
the locking protocol becomes part of the published API, and
the compiler will warn about incorrect usage.
Annotations have also proven useful for enforcing internal
design constraints as software evolves over time. For example,
the initial design of a thread-safe class must establish certain
constraints: locks are used in a particular way to protect
private data. Over time, however, that class will be read
and modified by many different engineers. Not only may
the initial constraints be forgotten, they may change when
code is refactored. When examining change logs, we found
several cases in which an engineer added a new method to a
class, forgot to acquire the appropriate locks, and consequently
had to debug the resulting race condition by hand. When
the constraints are explicitly specified with annotations, the
compiler can prevent such bugs by mechanically checking new
code for consistency with existing annotations.
The use of annotations does entail costs beyond the effort
required to write the annotations. In particular, we have found
that about 50% of the warnings produced by the analysis are
caused not by incorrect code but rather by incorrect or missing
annotations, such as failure to put a REQUIRES attribute
on getter and setter methods. Thread safety annotations are
roughly analogous to the C++ const qualifier in this regard.
Whether such warnings are false positives depends on
your point of view. Google’s philosophy is that incorrect
annotations are “bugs in the documentation.” Because APIs
are read many times by many engineers, it is important that
the public interfaces be accurately documented.
Excluding cases in which the annotations were clearly
wrong, the false positive rate is otherwise quite low: less than
5%. Most false positives are caused by either (a) pointer aliasing,
(b) conditionally acquired mutexes, or (c) initialization
code that does not need to acquire a mutex.
Conclusion
Type systems for thread safety have previously been implemented
for other languages, most notably Java [3] [11].
Clang thread safety analysis brings the benefit of such systems
to C++. The analysis has been implemented in a production
C++ compiler, tested in a production environment, and adopted
internally by one of the world’s largest software companies.
REFERENCES
[1] K. Asanovic et al., "A view of the parallel computing landscape," Communications of the ACM, vol. 52, no. 10, 2009.
[2] L.-C. Wu, "C/C++ thread safety annotations," 2008. [Online]. Available: https://docs.google.com/a/google.com/document/d/1 d9MvYX3LpjTk 3nlubM5LE4dFmU91SDabVdWp9-VDxc
[3] M. Abadi, C. Flanagan, and S. N. Freund, "Types for safe locking: Static race detection for Java," ACM Transactions on Programming Languages and Systems, vol. 28, no. 2, 2006.
[4] "Clang thread safety analysis documentation." [Online]. Available: http://clang.llvm.org/docs/ThreadSafetyAnalysis.html
[5] "Clang: A C-language front-end for LLVM." [Online]. Available: http://clang.llvm.org
[6] D. F. Sutherland and W. L. Scherlis, "Composable thread coloring," PPoPP '10: Proceedings of the ACM Symposium on Principles and Practice of Parallel Programming, 2010.
[7] K. Crary, D. Walker, and G. Morrisett, "Typed memory management in a calculus of capabilities," Proceedings of POPL, 1999.
[8] J. Boyland, J. Noble, and W. Retert, "Capabilities for sharing," Proceedings of ECOOP, 2001.
[9] J.-Y. Girard, "Linear logic," Theoretical Computer Science, vol. 50, no. 1, pp. 1–101, 1987.
[10] S. Savage, M. Burrows, G. Nelson, P. Sobalvarro, and T. Anderson, "Eraser: A dynamic data race detector for multithreaded programs," ACM Transactions on Computer Systems (TOCS), vol. 15, no. 4, 1997.
[11] C. Boyapati and M. Rinard, "A parameterized type system for race-free Java programs," Proceedings of OOPSLA, 2001.
[12] K. Serebryany and T. Iskhodzhanov, "ThreadSanitizer: Data race detection in practice," Workshop on Binary Instrumentation and Applications, 2009.
An Optimized Template Matching Approach to Intra Coding
in Video/Image Compression
Hui Su, Jingning Han, and Yaowu Xu
Chrome Media, Google Inc., 1950 Charleston Road, Mountain View, CA 94043
ABSTRACT
The template matching prediction is an established approach to intra-frame coding that makes use of previously
coded pixels in the same frame for reference. It compares the previously reconstructed upper and left boundaries in searching the reference area for the best-matched block for prediction, and hence eliminates the need to send additional information to reproduce the same prediction at the decoder. In viewing the image signal as an
auto-regressive model, this work is premised on the fact that pixels closer to the known block boundary are better
predicted than those far apart. It significantly extends the scope of the template matching approach, which is
typically followed by a conventional discrete cosine transform (DCT) for the prediction residuals, by employing an
asymmetric discrete sine transform (ADST), whose basis functions vanish at the prediction boundary and reach maximum magnitude at the far end, to fully exploit the statistics of the residual signals. It was experimentally shown
that the proposed scheme provides substantial coding performance gains on top of the conventional template
matching method over the baseline.
Keywords: Template matching, Intra prediction, Transform coding, Asymmetric discrete sine transform
1. INTRODUCTION
Intra-frame coding is a key component of video/image compression systems. It predicts from previously reconstructed neighboring pixels to largely remove spatial redundancies. A codec typically allows various prediction directions,1–3 and the encoder selects the one that best describes the texture patterns (and hence renders minimal rate-distortion cost) for block coding. Such boundary-extrapolation-based prediction is efficient when
the image signals are well modeled by a first-order Markovian process. In practice, however, image signals might
contain certain complicated patterns repeatedly appearing, which the boundary prediction approach can not
effectively capture. This motivates the initial block matching prediction, which searches the previously reconstructed frame area for reference, as an additional mode.4 A displacement vector per block is hence needed to inform the decoder how to reproduce the prediction, akin to the motion vector for inter-frame motion compensation. To overcome such overhead cost, which diminishes the performance gains, a template matching prediction (TMP) approach was developed5 that employs the available neighboring pixels of a block as a template, measures the
template similarity between the block of interest and the candidate references, and chooses the most “similar” one
as the prediction. Clearly the decoder is able to repeat the same process without recourse to further information,
which further allows the TMP to operate in smaller block size for more precise prediction at no additional cost.
A conventional 2D-DCT is then applied to the prediction residuals, followed by quantization and entropy coding,
to encode the block. Certain coding performance gains were obtained by integrating the TMP in a regular intra
coder.
Viewing the image signals as an auto-regressive process, pixels close to the block boundaries are more
correlated to the template pixels, and hence are better predicted by the matched reference, than those sitting
at the far end. Therefore, the residuals ought to exhibit smaller variance at the known boundaries and gradually
increasing energy toward the opposite end, which calls the efficacy of the DCT into question: its basis functions
reach maximal magnitude at both ends and are agnostic to the statistical characteristics of the residuals. This
work addresses the issue by incorporating the ADST,6, 7 whose basis functions possess the desired asymmetric
properties, as an alternative transform for the TMP residuals for optimal coding performance. A
complementary similarity measurement based on weighted template matching, in recognition of the statistical
variations across the block, was also proposed to improve the search quality. The scheme was implemented in
the VP9 framework, in conjunction with other boundary-prediction-based intra coding modes. Experiments
demonstrated remarkable performance advantages over the conventional TMP as well as the baseline codec.
E-mails: {huisu, jingning, yaowu}@google.com.
Figure 1. Template matching intra prediction.
The rest of the paper is organized as follows: Sec. 2 presents a brief review on the template matching
approach. In Sec. 3, we describe the proposed techniques in details. Experimenting results are presented in Sec.
4, and Sec. 5 concludes the paper.
2. REVISITING TEMPLATE MATCHING PREDICTION
We provide a brief review of the TMP approach5 in this section. As shown in Fig. 1, the TMP employs the
pixels in the adjacent upper rows and left columns of a block as its template. Every template in the reconstructed
area of the frame is considered as a reference template, and the template of the block to be encoded is the target
template. The similarity between the target template and the reference templates is then evaluated in terms
of sum of absolute/squared difference. The encoder selects amongst the reference templates the one that best
resembles the target template as the candidate template, and the block corresponding to this candidate template
is used as the prediction for the target block. Since it only involves comparing reconstructed pixels, the same
operations can be repeated at the decoder side without sending any additional side information, resulting in higher
compression efficiency than the direct block matching approach.4 As a consequence, however, the decoding process
becomes more computationally demanding.
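For concreteness, the search can be sketched in a few lines of Python/NumPy. This is an illustrative reconstruction rather than the VP9 code: the array layout and function names are ours, and for simplicity only candidate blocks lying strictly above the target block are searched.

import numpy as np

def tmp_predict(recon, row, col, bs=4, trows=1, tcols=1):
    # recon: 2D array of already-reconstructed pixels of the current frame.
    # (row, col): top-left corner of the target block; bs: block size.
    # trows/tcols: template thickness (rows above, columns to the left).
    recon = recon.astype(np.int64)  # avoid unsigned wrap-around when differencing 8-bit pixels
    target_top = recon[row - trows:row, col - tcols:col + bs]
    target_left = recon[row:row + bs, col - tcols:col]

    best_sad, best_pred = np.inf, None
    # Candidate blocks strictly above the target row are fully reconstructed.
    for r in range(trows, row - bs + 1):
        for c in range(tcols, recon.shape[1] - bs + 1):
            cand_top = recon[r - trows:r, c - tcols:c + bs]
            cand_left = recon[r:r + bs, c - tcols:c]
            sad = (np.abs(cand_top - target_top).sum()
                   + np.abs(cand_left - target_left).sum())
            if sad < best_sad:
                best_sad = sad
                # The block attached to the best-matched template is the prediction.
                best_pred = recon[r:r + bs, c:c + bs]
    return best_pred  # None if no valid candidate exists (e.g., blocks near the top edge)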
The TMP approach was shown to be particularly efficient in scenarios where certain complicated texture
patterns, which cannot be captured by the conventional directional intra prediction modes, appear repeatedly in the
image/frame. Recent research efforts have been devoted to further improving the TMP scheme, including combining
multiple candidates with top similarity scores8 and using hybrid TMP and block matching (with the displacement
vector sent explicitly),9 among others. This work focuses on optimizing the original TMP approach by observing and
exploiting the statistical properties of the TMP residual signals. It is noteworthy that the proposed principles are
generally applicable to other advanced variants as well.
3. PROPOSED TECHNIQUES
We view the image signals as an auto-regressive model, which implies that two nearby pixels are more correlated
than those far apart. Since the template of a matched reference block closely resembles that of the block of
interest, the pixels sitting close to the known boundaries of the two blocks are element-wise more correlated
than those at the opposite end. Hence the pixels near the top/left boundaries are better predicted by the matched
reference block, which translates into a key observation: the variance of the prediction residuals tends to vanish
at the known boundaries and gradually increase towards the far end. This suggests that, unlike the discrete
cosine transform (DCT) whose basis functions reach maximum magnitude at both ends, a (near) optimal spatial
transform for the TMP residuals should possess such asymmetric properties. We hence propose to employ the
asymmetric discrete sine transform (ADST)6, 7 for transform coding of the TMP residuals. A complementary
matching approach that expands the template to multiple boundary rows and columns, and uses a weighted
sum-of-difference measurement, is first developed for more precise referencing. A statistical study of the TMP
residuals, followed by a detailed discussion of the ADST, is provided next.
Figure 2. The proposed weighted template matching scheme. The pixels that are closer to the target block are assigned larger weights.
3.1 Weighted Template Matching
In order to obtain reliable template matching, it is reasonable to define multiple layers of boundary pixels
as the template of a block. In our study, we have observed that the prediction accuracy can be improved
when the number of rows and columns in the template increases. However, it is not wise to adopt too many
layers, as the gain in matching accuracy becomes saturated and the computational complexity explodes. In our
implementation, the template consists of the pixels in the 2 rows and 2 columns above and to the left of the
given block, which gives a good tradeoff between accuracy and computational complexity.
The similarity between the target template and reference templates can be measured by the sum of absolute
differences (SAD). Along the line of recognizing the variations in statistics, the template pixels closer to the block
are highly correlated to the block content, and hence should be weighted more heavily in the SAD calculation than
the distant ones. This idea is illustrated in Fig. 2. A weight ratio of 3:2 for the inner row/column versus the outer
row/column is used in this work.
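A minimal sketch of this weighted cost, assuming the 2-row/2-column template and the 3:2 inner/outer weight ratio described above (the data layout and names are ours):

import numpy as np

W_INNER, W_OUTER = 3, 2  # 3:2 weight ratio, inner layer versus outer layer

def weighted_template_sad(target, candidate):
    # Each template is a pair (top, left) of signed integer arrays: `top` has shape
    # (2, width) with row 0 the outer layer and row 1 the inner layer next to the
    # block; `left` has shape (height, 2) with column 0 outer and column 1 inner.
    t_top, t_left = target
    c_top, c_left = candidate
    sad = W_OUTER * np.abs(t_top[0] - c_top[0]).sum()
    sad += W_INNER * np.abs(t_top[1] - c_top[1]).sum()
    sad += W_OUTER * np.abs(t_left[:, 0] - c_left[:, 0]).sum()
    sad += W_INNER * np.abs(t_left[:, 1] - c_left[:, 1]).sum()
    return sad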
3.2 Spatial Transformation
In video/image compression, the prediction residuals are typically processed via transformation to further remove
the remaining spatial redundancy, before the quantization and entropy coding modules. The Karhunen-Loève
transform (KLT) is considered the optimal spatial transform in terms of energy compaction. However, the KLT
is rarely used in practical coding systems due to its high computational complexity. The DCT has long been a
popular substitute due to its good tradeoff between energy compaction and complexity. The basis functions of
the DCT are as follows:
[T_C]_{j,i} = \alpha \cos\frac{\pi (j-1)(2i-1)}{2N},    (1)
where N is the block size, i, j ∈ {1, 2, · · · , N} denote the space and frequency indexes, respectively, and
\alpha = \begin{cases} \sqrt{1/N}, & \text{if } j = 1, \\ \sqrt{2/N}, & \text{otherwise}. \end{cases}
It is easy to see that the basis functions of the DCT achieve their maximum energy at both ends (i.e., i = 1 or
i = N).
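As a quick numerical check of Eq. (1), the short sketch below (ours, not codec code) builds the N × N DCT basis and confirms that every basis function has the same magnitude at i = 1 and i = N:

import numpy as np

def dct_basis(N):
    # Row j-1 holds the DCT basis function [T_C]_{j,i} of Eq. (1), with 1-based j, i.
    T = np.empty((N, N))
    for j in range(1, N + 1):
        alpha = np.sqrt(1.0 / N) if j == 1 else np.sqrt(2.0 / N)
        for i in range(1, N + 1):
            T[j - 1, i - 1] = alpha * np.cos(np.pi * (j - 1) * (2 * i - 1) / (2 * N))
    return T

T_C = dct_basis(4)
# Magnitude at the first sample (i = 1) equals the magnitude at the last sample (i = N).
assert np.allclose(np.abs(T_C[:, 0]), np.abs(T_C[:, -1]))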
Assuming the template of a matched reference block closely approximates that of the block of interest, it is
highly likely that pixels close to these known boundaries are also well predicted, while those distant pixels are
less correlated, which results in a relatively higher residual variance. This postulation is verified by the following
experimental study. We collected the absolute values of the TMP prediction residuals element-wise over 8000
blocks (of dimension 4 × 4) from the Foreman sequence, and the average of the residual magnitude at each pixel
location was calculated, as shown below:
4.05 4.37 4.51 5.13
4.72 5.48 5.50 6.60
5.04 5.95 6.12 7.32
5.50 6.28 6.97 8.20
As can be seen from the matrix, the average magnitude of the prediction residual signal indeed increases along both the
horizontal and vertical directions.
As mentioned above, the basis functions of the conventional DCT achieve their maximum energy at both ends
and are therefore agnostic to the statistical patterns of the prediction residuals. As an alternative, the ADST6, 7
has basis functions of the form:
[T_S]_{j,i} = \frac{2}{\sqrt{2N+1}} \sin\frac{(2j-1)\, i\, \pi}{2N+1},    (2)
where N is the block size, i, j ∈ {1, 2, · · · , N} denote the space and frequency indexes, respectively. It is
shown6, 7 that the ADST is a better approximation of the optimal KLT than the DCT when partial boundary
information is available. Clearly, the basis functions of the ADST vanish at the known prediction boundary
(i = 1) and reach their maximum at the far end (i = N), and therefore match well with the statistical patterns of the
TMP residuals. We hence propose to employ the ADST as the spatial transform for the TMP residuals. It is
experimentally shown in the next section that the use of ADST provides substantial performance improvement
over the TMP followed by the conventional DCT.
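A companion sketch for Eq. (2) builds the ADST basis and applies a separable 2D transform to a residual block; the random residual and the helper names are placeholders used only to illustrate the boundary behavior discussed above.

import numpy as np

def adst_basis(N):
    # Row j-1 holds the ADST basis function [T_S]_{j,i} of Eq. (2), with 1-based j, i.
    j, i = np.meshgrid(np.arange(1, N + 1), np.arange(1, N + 1), indexing="ij")
    return 2.0 / np.sqrt(2 * N + 1) * np.sin((2 * j - 1) * i * np.pi / (2 * N + 1))

def transform_2d(block, T):
    # Separable 2D transform: apply T along columns, then along rows.
    return T @ block @ T.T

N = 4
T_S = adst_basis(N)
# First basis function: small near the known boundary (i = 1), largest at i = N.
print(np.round(T_S[0], 3))

rng = np.random.default_rng(0)
# Toy residual whose standard deviation grows toward the far end, as observed for TMP.
residual = rng.standard_normal((N, N)) * np.linspace(1.0, 2.0, N)
coeffs = transform_2d(residual, T_S)  # 2D-ADST coefficients to be quantized and entropy coded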
4. EXPERIMENT RESULTS
The proposed scheme was tested in the VP9 framework.1 We verified its efficacy in a relatively simplified setting,
where the block size was fixed at 8 × 8. There are 10 intra prediction modes in VP9, including vertical prediction,
horizontal prediction, 8 angular prediction modes, and a “true motion” mode that utilizes the left, above and
corner pixels simultaneously. The TMP scheme was implemented as an additional mode to the 10 existing ones.
The selection among the 11 modes is based on rate-distortion optimization. In the TMP mode, the 8 × 8 block
is further partitioned into four 4 × 4 blocks, each of which is predicted via template matching, followed by the
2D-ADST transform, quantization, and reconstruction, in a raster scan order. The template consists of pixels
from 2 rows and 2 columns above and to the left of the given block. For the weighted template matching, we
use a weight ratio 3:2 for the inner row/column versus the outer row/column, as shown in Fig. 2.
Figure 3. Rate-distortion curves (PSNR vs. bits per frame) of the Ice (upper) and Foreman (lower) test sequences for the VP9 baseline, conventional template matching, and the proposed scheme.
Several test video clips were used to compare the coding efficiency, including the Ice, Foreman, and Carphone
sequences. For every test sequence, the first 75 frames were coded as key-frames (i.e., all blocks were coded in intra
modes) at various bit-rates. The coding performance gains of the conventional TMP and the proposed method
over the reference codec, measured by the Bjontegaard metric, are shown in Table 1. Clearly, the proposed
approach that optimizes the transformation for the prediction residual significantly improves the performance of
TMP, and both outperform the reference VP9 baseline. The rate-distortion curves of the Ice and Foreman
sequences are also provided in Fig. 3. It can be seen from the figure that the proposed techniques boost the
coding efficiency of the conventional TMP consistently.
Table 1. Coding performance gains over the VP9 baseline in terms of bit-rate reduction percentage.
Sequences    Conventional TMP    Proposed Method
Ice          2.89                3.78
Foreman      2.88                3.33
Carphone     1.05                1.35
5. CONCLUSIONS AND FUTURE WORK
This work proposed a novel approach that incorporates the ADST for TMP prediction residuals as an additional
mode for intra-frame coding. A complementary template matching method, along the lines of recognizing the
statistical variations across the block, was also provided for more precise reference search. The scheme implemented
in the VP9 framework demonstrated substantial performance improvements over the conventional TMP as well
as the reference codec.
The TMP approach can also be applied to inter-frame prediction.10 The template of a block is defined as
the pixels in the adjacent upper rows and left columns, in the same way as in the case of intra prediction.
The optimal template which is best matched to that of the block of interest is found in a previously encoded
reference frame, and the block to be encoded is filled in by copying the block corresponding to the optimal template.
By the same principles as in this work, the residual signal of template matching inter prediction should also
present asymmetric statistical properties across the block. We thus expect the ADST to be more efficient than the
conventional DCT for the transform coding of the template matching inter prediction, and are currently working
along this direction.
REFERENCES
[1] VP9 Video Codec, http://www.webmproject.org/vp9/.
[2] Wiegand, T., Sullivan, G. J., Bjontegaard, G., and Luthra, A., “Overview of the H.264/AVC video coding
standard,” IEEE Trans. Circuits and Systems for Video Technology 13, 560–576 (July 2003).
[3] Sullivan, G. J., Ohm, J.-R., Han, W.-J., and Wiegand, T., “Overview of the high efficiency video coding (HEVC)
standard,” IEEE Trans. Circuits and Systems for Video Technology 22, 1649–1668 (Dec. 2012).
[4] Yu, S. and Chrysafis, C., “New intra prediction using intra-macroblock motion compensation,” Tech. Rep.
JVT-C151 (2002).
[5] Tan, T., Boon, C., and Suzuki, Y., “Intra prediction by template matching,” IEEE Proc. ICIP , 1693–1696
(2006).
[6] Han, J., Saxena, A., and Rose, K., “Towards jointly optimal spatial prediction and adaptive transform in
video/image coding,” IEEE Proc. ICASSP , 726–729 (2010).
[7] Han, J., Saxena, A., Melkote, V., and Rose, K., “Jointly optimized spatial prediction and block transform
for video and image coding,” IEEE Trans. on Image Processing 21, 1874–1884 (2012).
[8] Tan, T., Boon, C., and Suzuki, Y., “Intra prediction by averaged template matching predictors,” IEEE
Proc. CCNC (2007).
[9] Cherigui, S., Thoreau, D., Guillotel, P., and Perez, P., “Hybrid template and block matching algorithm for
image intra prediction,” IEEE Proc. ICASSP , 781–784 (2012).
[10] Sugimoto, K., Kobayashi, M., Suzuki, Y., Kato, S., and Boon, C. S., “Inter frame coding with template
matching spatio-temporal prediction,” IEEE Proc. ICIP (2004).
How Many People Visit YouTube? Imputing
Missing Events in Panels With Excess Zeros
Georg M. Goerg, Yuxue Jin, Nicolas Remy, Jim Koehler1
1 Google, Inc.; United States
E-mail for correspondence: gmg@google.com
Abstract: Media-metering panels track TV and online usage of people to analyze
viewing behavior. However, panel data is often incomplete due to nonregistered
devices, non-compliant panelists, or work usage. We thus propose a
probabilistic model to impute missing events in data with excess zeros using
a negative-binomial hurdle model for the unobserved events and beta-binomial
sub-sampling to account for missingness. We then use the presented models to
estimate the number of people in Germany who visit YouTube.
Keywords: imputation; missing data; zero inflation; panel data.
1 Introduction
Media panels (GfK Consumer Panels, 2013) are used by advertisers to
estimate reach and frequency of a campaign: reach is the fraction of the
population that has seen an ad, frequency tells us how often they have seen
it (on average). It is important to get good estimates from panel data, as
they largely determine the cost of an ad spot on TV or a website.
Naïvely, one would use the sample fraction of the number of non-zero events
(website visits, TV spots watched, etc.) per unit time to estimate reach;
similarly for frequency. This, however, suffers from underestimation, as panels
often record only a fraction of all events due to, e.g., non-compliance or
studied previously (Fader and Hardie, 2000; Yang et al., 2010).
In this work we i) extend the beta-binomial negative-binomial (BBNB)
model (Hofler and Scrogin, 2008) with a hurdle component to improve
modeling excess zeros in panel data (§2); ii) present the maximum likelihood
estimator (MLE) and also add prior information on missingness (§3); and
iii) use the methodology to estimate – from online media panels and internal
YouTube log files – how many people in Germany visit YouTube (§4).
The proposed methodology can be applied to a great variety of situations
where events have been counted – but some are known to be missing.
2 Hierarchical Event Imputation
Let Ni ∈ {0, 1, 2, . . .} count the true (but unobserved) number of visits by
panelist i. The population consists of people who do not visit YouTube at
all (with probability q0 ∈ [0, 1]), and those who visit at least once. If she
visits (overcoming the “hurdle” with probability 1 − q0), we assume that
Ni is distributed according to a shifted Poisson distribution (starting at
n = 1) with rate λi. For model heterogeneity among the population we use
a Gamma(r, q1/(1 − q1)) prior for λi, with r > 0 and q1 ∈ (0, 1).
Overall, this yields a shifted negative binomial hurdle (NBH) distribution
P(N = n; q_0, q_1, r) =
  \begin{cases}
    q_0, & \text{if } n = 0, \\
    (1 - q_0) \cdot \frac{\Gamma(n + r - 1)}{\Gamma(r)\,\Gamma(n)} \cdot (1 - q_1)^r \, q_1^{\,n-1}, & \text{if } n \ge 1.
  \end{cases}
  \qquad (1)
We choose a hurdle, rather than a mixture, model for the excess zeros (Hu
et al., 2011), since 1 − q0 can be directly interpreted as the true – but
unobserved – 1+ reach: if an advertiser shows an ad on YouTube they can
expect that a fraction of 1 − q0 of the population sees it at least once.
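For reference, a direct numerical transcription of (1), using log-gamma terms for stability (the helper is ours, not part of the paper):

import numpy as np
from scipy.special import gammaln

def nbh_pmf(n, q0, q1, r):
    # P(N = n; q0, q1, r) from Eq. (1): zero hurdle plus shifted negative binomial.
    n = np.asarray(n)
    shifted_nb = ((1 - q0)
                  * np.exp(gammaln(n + r - 1) - gammaln(r) - gammaln(np.maximum(n, 1)))
                  * (1 - q1) ** r * q1 ** (n - 1))
    return np.where(n == 0, q0, shifted_nb)

# Illustrative evaluation with the estimates reported later in Table 1.
print(nbh_pmf([0, 1, 2, 5, 10], q0=0.641, q1=0.982, r=0.252))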
Let pi be the probability a visit of user i is recorded in the panel. Assuming
independence across visits the total number of recorded panel events,
Ki ∈ {0, 1, 2, . . .}, thus follows a binomial distribution, Ki ∼ Bin(Ni, pi). To
account for heterogeneity across the population we assume pi ∼ Beta(µ, φ),
with mean µ and precision φ (Ferrari and Cribari-Neto, 2004). Here µ represents
the expected non-missing rate and φ the (inverse) variation across
the population. Integrating out pi gives a Beta-Binomial (BB) distribution,
Ki | Ni ∼ BB(Ni; µ, φ).    (2)
Combining (1) and (2) yields a hierarchical beta-binomial negative-binomial
hurdle (BBNBH) imputation model with parameter vector θ = (µ, φ, q0, r, q1):
Ni ∼ NBH(N; q0, r, q1)  and  Ki | Ni ∼ BB(K | Ni; µ, φ).    (3)
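The hierarchy (3) is straightforward to simulate, which is useful for sanity checks; the sketch below uses NumPy's beta and negative-binomial samplers, with parameter values (taken from Table 1 further below) serving purely as placeholders:

import numpy as np

rng = np.random.default_rng(1)

def simulate_bbnbh(size, mu, phi, q0, r, q1):
    # N_i ~ NBH: zero with probability q0, otherwise 1 + NegBin(r, 1 - q1).
    at_least_once = rng.random(size) >= q0
    n = np.where(at_least_once, 1 + rng.negative_binomial(r, 1 - q1, size), 0)
    # p_i ~ Beta with mean mu and precision phi, i.e. shape parameters (mu*phi, (1-mu)*phi).
    p = rng.beta(mu * phi, (1 - mu) * phi, size)
    # K_i | N_i ~ Binomial(N_i, p_i): the events actually recorded by the panel.
    k = rng.binomial(n, p)
    return n, k

n, k = simulate_bbnbh(100_000, mu=0.272, phi=2.32, q0=0.641, r=0.252, q1=0.982)
print("true 1+ reach:", (n > 0).mean(), " observed 1+ reach:", (k > 0).mean())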
2.1 Joint Distribution
The pdf of (2) can be written as
g(k \mid n; \mu, \phi) = \binom{n}{k} \frac{\Gamma(k + \phi\mu)\,\Gamma(n - k + (1 - \mu)\phi)}{\Gamma(n + \phi)} \cdot \frac{\Gamma(\phi)}{\Gamma(\mu\phi)\,\Gamma(\phi(1 - \mu))}.
For k = 0 this reduces to
P(K = 0 \mid N = n, \mu, \phi) = \frac{\Gamma(n + (1 - \mu)\phi)}{\Gamma(n + \phi)} \times \frac{\Gamma(\phi)}{\Gamma(\phi(1 - \mu))}.    (4)
Due to the zero hurdle it is useful to treat N = 0 and N > 0 separately:
P (N, K) = P (K | N) · P (N) = BB(k | n; µ, φ) · NBH(n; q0, q1, r) (5)
For n = 0, (5) is non-zero only for k = 0, P (N = 0, K = 0) = q0, since
P (K > N) = 0. For n > 0,
P(N = n, K = k) = (1 - q_0) \frac{1}{B(\phi\mu, \phi(1 - \mu))} \frac{(1 - q_1)^r}{\Gamma(r)}
                  \times \frac{\Gamma(k + \phi\mu)}{\Gamma(k + 1)}
                  \times \frac{\Gamma(n - k + \phi(1 - \mu))}{\Gamma(n - k + 1)} \frac{\Gamma(n + r - 1)}{\Gamma(n + \phi)} q_1^{\,n-1}
                  \times \frac{\Gamma(n + 1)}{\Gamma(n)}.    (6)
2.2 Conditional Predictive Distribution For Imputation
The panel records ki events for panelist i, but we want to know how many
events truly occurred. That is, we are interested in (dropping subscript i)
P(N = n \mid K = k) = \frac{P(K = k \mid N = n)\, P(N = n)}{P(K = k)},    (7)
To obtain analytical expressions we consider k = 0 and k > 0 separately:
k = 0: Either none truly happened (n = 0) or a panelist visited at least
once (n > 0), but none were recorded.
n = 0:
P(N = 0 \mid K = 0) = \frac{q_0}{P(K = 0)}.    (8)
n > 0:
P(N = n \mid K = 0) = \frac{1}{P(K = 0)} \times \frac{\Gamma(n + \phi(1 - \mu))}{\Gamma(n + \phi)} \frac{\Gamma(\phi)}{\Gamma(\phi(1 - \mu))}
                      \times (1 - q_0) \frac{\Gamma(n + r - 1)}{\Gamma(n)} \frac{(1 - q_1)^r}{\Gamma(r)} q_1^{\,n-1},
where the second term comes from (4).
k > 0: The zero “hurdle” for N has been surpassed for sure.
n < k : By construction of Binomial subsampling
P (N = n | K = k) = 0 for all n < k. (9)
n ≥ k: Here
P(N = n \mid K = k) = n \, q_1^{\,n-1} \frac{\Gamma(n - k + (1 - \mu)\phi)}{\Gamma(n - k + 1)\,\Gamma(n + \phi)} \Gamma(n + r - 1)
                      \times \left( \sum_{m=0}^{\infty} (m + k) \frac{\Gamma(m + \phi(1 - \mu))}{\Gamma(m + 1)} \frac{\Gamma(m + k + r - 1)}{\Gamma(m + k + \phi)} q_1^{\,m + k - 1} \right)^{-1}.
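A sketch of this conditional predictive pmf for k > 0, with the infinite normalizing sum truncated at a fixed number of terms (the truncation limit, parameter values, and names are ours):

import numpy as np
from scipy.special import gammaln

def cond_pred_pmf(n, k, mu, phi, r, q1, m_max=5000):
    # P(N = n | K = k) for k > 0; the normalizer truncates the infinite sum at m_max terms.
    def log_term(nn):
        m = nn - k
        return (np.log(nn) + (nn - 1) * np.log(q1)
                + gammaln(m + (1 - mu) * phi) - gammaln(m + 1)
                + gammaln(nn + r - 1) - gammaln(nn + phi))
    if n < k:
        return 0.0  # binomial subsampling implies K <= N
    ns = np.arange(k, k + m_max, dtype=float)
    log_z = np.logaddexp.reduce(log_term(ns))
    return float(np.exp(log_term(float(n)) - log_z))

theta = dict(mu=0.272, phi=2.32, r=0.252, q1=0.982)  # placeholder values (cf. Table 1)
print([round(cond_pred_pmf(n, 2, **theta), 4) for n in (1, 2, 5, 10, 20)])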
TABLE 1: MLE for θ for panel data on YouTube visits in Germany (µ is fixed a priori; see §3.1).
     Estimate   Std. Err.   t value   Pr(>|t|)
µ    0.272      –           –         –
q0   0.641      0.016       38.858    0.000
q1   0.982      0.002       494.105   0.000
r    0.252      0.021       11.811    0.000
φ    2.320      0.594       3.907     0.000
3 Parameter Estimation
Let k = {k1, . . . , kP } be the number of observed events for all P panelists.
Each panelist also has socio-economic indicators such as gender, age, and
income. These attributes determine their demographic weight w̃i, which
equals the number of people in the entire population that panelist i represents.
Finally, let wi = w̃i · P / Σ_{i=1}^P w̃i be the re-scaled weight of panelist i,
such that Σ_{i=1}^P wi equals the sample size P.
We estimate θ using maximum likelihood (MLE), θ̂ = arg max_{θ∈Θ} ℓ(θ; x),
where the log-likelihood
\ell(\theta; x) = \sum_{\{k \,:\, x_k > 0\}} x_k \cdot \log P(K = k; \theta),    (10)
and x = {x_k | k = 0, 1, . . . , max(k)}, where x_k = Σ_{i : k_i = k} w_i is the total
weight of all panelists with k visits.
For deriving closed form expressions of P(K = k) = Σ_{n=0}^∞ P(N = n, K = k)
it is simpler to consider k = 0 and k > 0 separately:
P(K = 0) = q_0 + (1 - q_0) \times \frac{\Gamma(\phi)}{\Gamma(\phi(1 - \mu))} \frac{(1 - q_1)^r}{\Gamma(r)}
           \times \sum_{n=0}^{\infty} \frac{\Gamma(n + 1 + \phi(1 - \mu))}{\Gamma(n + 1)} \frac{\Gamma(n + r)}{\Gamma(n + 1 + \phi)} q_1^{\,n},    (11)
and for k > 0,
P(K = k) = (1 - q_0)(1 - q_1)^r \frac{\Gamma(\phi)}{\Gamma(\mu\phi)\,\Gamma(\phi(1 - \mu))} \frac{1}{\Gamma(r)}
           \times \frac{\Gamma(k + \mu\phi)}{\Gamma(k + 1)}
           \times \sum_{m=0}^{\infty} (m + k) \frac{\Gamma(m + \phi(1 - \mu))}{\Gamma(m + 1)} \frac{\Gamma(m + k + r - 1)}{\Gamma(m + k + \phi)} q_1^{\,m + k - 1}.    (12)
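Putting (10)–(12) together, maximum likelihood estimation can be sketched as follows. The truncation of the infinite sums, the box constraints, and the toy weighted count vector x are assumptions of this sketch, not values from the paper:

import numpy as np
from scipy.special import gammaln
from scipy.optimize import minimize

M = 20_000  # truncation of the infinite sums in (11)-(12) (assumption)

def log_p_k(k, mu, phi, q0, r, q1):
    m = np.arange(M, dtype=float)
    if k == 0:
        s = np.exp(gammaln(m + 1 + phi * (1 - mu)) - gammaln(m + 1)
                   + gammaln(m + r) - gammaln(m + 1 + phi) + m * np.log(q1))
        const = np.exp(gammaln(phi) - gammaln(phi * (1 - mu))
                       + r * np.log(1 - q1) - gammaln(r))
        return np.log(q0 + (1 - q0) * const * s.sum())          # Eq. (11)
    s = np.exp(np.log(m + k) + gammaln(m + phi * (1 - mu)) - gammaln(m + 1)
               + gammaln(m + k + r - 1) - gammaln(m + k + phi)
               + (m + k - 1) * np.log(q1))
    log_const = (np.log1p(-q0) + r * np.log(1 - q1) + gammaln(phi)
                 - gammaln(mu * phi) - gammaln(phi * (1 - mu)) - gammaln(r)
                 + gammaln(k + mu * phi) - gammaln(k + 1))
    return log_const + np.log(s.sum())                           # Eq. (12)

def neg_log_lik(params, x):
    mu, phi, q0, r, q1 = params
    # Weighted log-likelihood of Eq. (10); x[k] is the total panelist weight at K = k.
    return -sum(xk * log_p_k(k, mu, phi, q0, r, q1) for k, xk in enumerate(x) if xk > 0)

x = np.array([5300.0, 420.0, 260.0, 180.0, 120.0, 90.0, 175.0])  # hypothetical weights
res = minimize(neg_log_lik, x0=[0.3, 2.0, 0.6, 0.3, 0.95], args=(x,),
               bounds=[(0.01, 0.99), (0.1, 50.0), (0.01, 0.99), (0.01, 10.0), (0.5, 0.999)],
               method="L-BFGS-B")
print(dict(zip(["mu", "phi", "q0", "r", "q1"], res.x)))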
FIGURE 1: Model estimates for: (top left) true counts Ni; (top right) non-missing rate pi; (bottom left) empirical
count frequency and model fit; (bottom right) conditional predictive distributions and expectations
(E(N | K = 0) = 1.02 and E(N | K = 2) = 13.12).
3.1 Fix expected non-missing rate µ
Usually, researchers must estimate all 5 parameters from panel data. For
our application, though, we can estimate (and fix) the non-missing rate µ
a-priori as we have access to internal YouTube log files.
Let k̄_W̃ = Σ_{i=1}^P w̃i·ki be the observed panel visits projected to the entire
population. Analogously, let N̄_W̃ = Σ_{i=1}^P w̃i·Ni be the panel projection of
the number of true YouTube visits. While any single Ni is unobservable,
we can estimate N̄_W̃ by simply counting all YouTube homepage views in
Germany from our YouTube log files, yielding the estimate N̂̄_W̃. We herewith obtain a
plug-in estimate of the non-missing rate, µ̂_Logs = k̄_W̃ / N̂̄_W̃. The remaining
4 parameters, θ_(−µ) = (φ, q0, r, q1), can be obtained by MLE, θ̂_(−µ) =
arg max_{θ_(−µ)} ℓ((µ̂_Logs, θ_(−µ)); x). The overall estimate is θ̂ = (µ̂_Logs, θ̂_(−µ)).
4 Estimating YouTube Audience in Germany
Here we use data from a German online panel (GfK Consumer Panels,
2013), which monitors web usage of P = 6, 545 individuals in October, 2013
(31 days). In particular, we are interested in the probability that an adult
in Germany visited the YouTube homepage www.youtube.de. Empirically, P̂(K = 0) = 0.81, yielding 19% observed
1+ reach. However, we know by comparison to YouTube log files that the panel only recorded 27.2% of all
impressions. We fix the expected non-missing rate at µ̂ = 0.272 and obtain
the remaining parameters via MLE (Table 1). Figure 1 shows the model fit
for the true, observed, and predictive distributions. In particular, the true
1+ reach is 36% (q̂0 = 0.64), not 19% as the naïve estimate suggests.
5 Discussion
We introduce a probabilistic framework to impute missing events in count
data, including a hurdle component for more flexibility to model lots of zeros.
Researchers can use our models to obtain accurate probabilistic predictions
of the number of true, unobserved events. We apply our methodology
to accurately estimate how many people in Germany visit YouTube.
Acknowledgments: We want to thank Christoph Best, Penny Chu, Tony
Fagan, Yijia Feng, Oli Gaymond, Simon Morris, Raimundo Mirisola, Andras
Orban, Simon Rowe, Sheethal Shobowale, Yunting Sun, Wiesner Vos,
Xiaojing Wang, and Fan Zhang for constructive discussions and feedback.
References
Fader, P. and Hardie, B. (2000). A note on modelling underreported Poisson
counts. Journal of Applied Statistics, 27(8):953–964.
Ferrari, S. and Cribari-Neto, F. (2004). Beta Regression for Modelling
Rates and Proportions. Journal of Applied Statistics, 31(7):799–815.
GfK Consumer Panels (2013). Media Efficiency Panel.
Hofler, R. A. and Scrogin, D. (2008). A count data frontier model. Technical
report, University of Central Florida.
Hu, M., Pavlicova, M., and Nunes, E. (2011). Zero-inflated and hurdle models
of count data with extra zeros: examples from an HIV-risk reduction
intervention trial. Am J Drug Alcohol Abuse, 37(5):367–75.
Rose, C., Martin, S., Wannemuehler, K., and Plikaytis, B. (2006). On the
use of zero-inflated and hurdle models for modeling vaccine adverse event
count data. J Biopharm Stat, 16(4):463–81.
Schmittlein, D. C., Bemmaor, A. C., and Morrison, D. G. (1985). Why
Does the NBD Model Work? Robustness in Representing Product Purchases,
Brand Purchases and Imperfectly Recorded Purchases. Marketing
Science, 4(3):255–266.
Yang, S., Zhao, Y., and Dhar, R. (2010). Modeling the underreporting bias
in panel survey data. Marketing Science, 29(3):525–539.
Securing the Tangled Web
DOI:10.1145/2643134
Article development led by queue.acm.org
Preventing script injection vulnerabilities through software design.
BY CHRISTOPH KERN
SCRIPT INJECTION VULNERABILITIES are a bane of
Web application development: deceptively simple in
cause and remedy, they are nevertheless surprisingly
difficult to prevent in large-scale Web development.
Cross-site scripting (XSS)2,7,8 arises when insufficient
data validation, sanitization, or escaping within a Web
application allows an attacker to cause browser-side
execution of malicious JavaScript in
the application’s context. This injected
code can then do whatever the attacker
wants, using the privileges of the victim.
Exploitation of XSS bugs results
in complete (though not necessarily
persistent) compromise of the victim’s
session with the vulnerable application.
This article provides an overview
of how XSS vulnerabilities arise and
why it is so difficult to avoid them in
real-world Web application software
development. Software design patterns
developed at Google to address
the problem are then described.
A key goal of these design patterns
is to confine the potential for XSS
bugs to a small fraction of an application’s
code base, significantly improving
one’s ability to reason about the
absence of this class of security bugs.
In several software projects within
Google, this approach has resulted in a
substantial reduction in the incidence
of XSS vulnerabilities.
Most commonly, XSS vulnerabilities
result from insufficiently validating,
sanitizing, or escaping strings that
are derived from an untrusted source
and passed along to a sink that interprets
them in a way that may result in
script execution.
Common sources of untrustworthy
data include HTTP request parameters,
as well as user-controlled data located
in persistent data stores. Strings
are often concatenated with or interpolated
into larger strings before assignment
to a sink. The most frequently
encountered sinks relevant to XSS
vulnerabilities are those that interpret
the assigned value as HTML markup,
which includes server-side HTTP responses
of MIME-type text/html, and
the Element.prototype.innerHTML
Document Object Model (DOM)8
property
in browser-side JavaScript code.
Figure 1a shows a slice of vulnerable
code from a hypothetical photo-sharing
application. Like many modern
Web applications, much of its
user-interface logic is implemented in
browser-side JavaScript code, but the
observations made in this article transfer
readily to applications whose UI is
implemented via traditional server-side
HTML rendering.
In code snippet (1) in the figure,
the application generates HTML
markup for a notification to be shown
to a user when another user invites
the former to view a photo album.
The generated markup is assigned to
the innerHTML property of a DOM
main page. If the login resulted from
a session time-out, however, the app
navigates back to the URL the user
had visited before the time-out. Using
a common technique for short-term
state storage in Web applications,
this URL is encoded in a parameter of
the current URL.
The page navigation is implemented
via assignment to the window.location.href
DOM property, which
browsers interpret as instruction to
navigate the current window to the
provided URL. Unfortunately, navigating
a browser to a URL of the form
javascript:attackScript causes
execution of the URL’s body as Java
Script. In this scenario, the target
URL is extracted from a parameter of
the current URL, which is generally
under attacker control (a malicious
page visited by a victim can instruct
the browser to navigate to an attacker-chosen
URL).
Thus, this code is also vulnerable
to XSS. To fix the bug, it is necessary to
validate that the URL will not result in
script execution when dereferenced, by
ensuring that its scheme is benign—
for example, https.
Why Is XSS So Difficult to Avoid?
Avoiding the introduction of XSS into
nontrivial applications is a difficult
problem in practice: XSS remains
among the top vulnerabilities in Web
applications, according to the Open
Web Application Security Project
(OWASP);4
within Google it is the most
common class of Web application vulnerabilities
among those reported under
Google’s Vulnerability Reward Program
(https://goo.gl/82zcPK).
Traditionally, advice (including my
own) on how to prevent XSS has largely
focused on:
˲ Training developers how to treat
(by sanitization, validation, and/or escaping)
untrustworthy values interpolated
into HTML markup.2,5
˲ Security-reviewing and/or testing
code for adherence to such guidance.
In our experience at Google, this approach
certainly helps reduce the incidence
of XSS, but for even moderately
complex Web applications, it does not
prevent introduction of XSS to a reasonably
high degree of confidence. We
see a combination of factors leading to
this situation.
element (a node in the hierarchical
object representation of UI elements
in a browser window), resulting in its
evaluation and rendering.
The notification contains the album’s
title, chosen by the second user. A malicious
user can create an album titled:
Since no escaping or validation is
applied, this attacker-chosen HTML is
interpolated as-is into the markup generated
in code snippet (1). This markup
is assigned to the innerHTML sink,
and hence evaluated in the context of
the victim’s session, executing the attacker-chosen
JavaScript code.
To fix this bug, the album’s title
must be HTML-escaped before use in
markup, ensuring that it is interpreted
as plain text, not markup. HTMLescaping
replaces HTML metacharacters
such as <, >, ", ', and & with corresponding
character entity references
or numeric character references: &lt;, &gt;, &quot;, &#39;, and &amp;. The
result will then be parsed as a substring
in a text node or attribute value
and will not introduce element or attribute
boundaries.
As noted, most data flows with a
potential for XSS are into sinks that
interpret data as HTML markup. But
other types of sinks can result in XSS
bugs as well: Figure 1b shows another
slice of the previously mentioned
photo-sharing application, responsible
for navigating the user interface
after a login operation. After a fresh
login, the app navigates to a preconfigured
URL for the application’s
A Subtle XSS Bug
The following code snippet intends to populate a DOM element with markup for a
hyperlink (an HTML anchor element):
var escapedCat = goog.string.htmlEscape(category);
var jsEscapedCat = goog.string.escapeString(escapedCat);
catElem.innerHTML = '<a onclick=\'createCategoryList(\'' + jsEscapedCat + '\')\'>' + escapedCat + '</a>';
The anchor element’s click-event handler, which is invoked by the browser when
a user clicks on this UI element, is set up to call a JavaScript function with the value of
category as an argument. Before interpolation into the HTML markup, the value of
category is HTML-escaped using an escaping function from the JavaScript Closure
Library. Furthermore, it is JavaScript-string-literal-escaped (replacing ' with \' and
so forth) before interpolation into the string literal within the onclick handler’s
JavaScript expression. As intended, for a value of Flowers & Plants for variable
category, the resulting HTML markup is:
<a onclick='createCategoryList('Flowers &amp; Plants')'>Flowers &amp; Plants</a>
So where’s the bug? Consider a value for category of:
');attackScript();//
Passing this value through htmlEscape results in:
&#39;);attackScript();//
because htmlEscape escapes the single quote into an HTML character reference.
After this, JavaScript-string-literal escaping is a no-op, since the single quote at the
beginning of the value is already HTML-escaped. As such, the resulting markup becomes:
<a onclick='createCategoryList('&#39;);attackScript();//')'>&#39;);attackScript();//</a>
When evaluating this markup, a browser will first HTML-unescape the value of the
onclick attribute before evaluation as a JavaScript expression. Hence, the JavaScript
expression that is evaluated results in execution of the attacker’s script:
createCategoryList('');attackScript();//')
Thus, the underlying bug is quite subtle: the programmer invoked the appropriate
escaping functions, but in the wrong order.
Subtle security considerations.
As seen, the requirements for secure
handling of an untrustworthy value
depend on the context in which the
value is used. The most commonly
encountered context is string interpolation
within the content of HTML
markup elements; here, simple
HTML-escaping suffices to prevent
XSS bugs. Several special contexts,
however, apply to various DOM elements
and within certain kinds of
markup, where embedded strings are
interpreted as URLs, Cascading Style
Sheets (CSS) expressions, or JavaScript
code. To avoid XSS bugs, each of
these contexts requires specific validation
or escaping, or a combination
of the two.2,5 The accompanying sidebar,
“A Subtle XSS Bug,” shows this
can be quite tricky to get right.
Complex, difficult-to-reason-about
data flows. Recall that XSS arises from
flows of untrustworthy, unvalidated/escaped
data into injection-prone sinks.
To assert the absence of XSS bugs in
an application, a security reviewer
must first find all such data sinks, and
then inspect the surrounding code for
context-appropriate validation and escaping
of data transferred to the sink.
When encountering an assignment
that lacks validation and escaping, the
reviewer must backward-trace this data
flow until one of the following situations
can be determined:
˲ The value is entirely under application
control and hence cannot result in
attacker-controlled injection.
˲ The value is validated, escaped,
or otherwise safely constructed somewhere
along the way.
˲ The value is in fact not correctly
validated and escaped, and an XSS vulnerability
is likely present.
Let’s inspect the data flow into
the innerHTML sink in code snippet
(1) in Figure 1a. For illustration purposes,
code snippets and data flows
that require investigation are shown
in red. Since no escaping is applied
to sharedAlbum.title, we trace its
origin to the albums entity (4) in persistent
storage, via Web front-end code
(2). This is, however, not the data’s ultimate
origin—the album name was previously
entered by a different user (that
is, originated in a different time context).
Since no escaping was applied to
this value anywhere along its flow from
an ultimately untrusted source, an XSS
vulnerability arises.
Similar considerations apply to the
data flows in Figure 1b: no validation
occurs immediately prior to the assignment
to window.location.href
in (5), so back-tracing is necessary. In
code snippet (6), the code exploration
branches: in the true branch, the value
originates in a configuration entity in
the data store (3) via the Web front end
(8); this value can be assumed application-controlled
and trustworthy and is
safe to use without further validation.
It is noteworthy that the persistent
storage contains both trustworthy and
untrustworthy data in different entities
of the same schema—no blanket
assumptions can be made about the
provenance of stored data.
In the else-branch, the URL originates
from a parameter of the current
URL, obtained from window.location.href,
which is an attacker-controlled
source (7). Since there is no validation,
this code path results in an XSS
vulnerability.
Many opportunities for mistakes.
Figures 1a and 1b show only two small
slices of a hypothetical Web application.
In reality, a large, nontrivial Web
application will have hundreds if not
thousands of branching and merging
data flows into injection-prone sinks.
Each such flow can potentially result in
an XSS bug if a developer makes a mistake
related to validation or escaping.
Exploring all these data flows and
asserting absence of XSS is a monumental
task for a security reviewer, especially
considering an ever-changing
code base of a project under active
development. Automated tools that
employ heuristics to statically analyze
data flows in a code base can help. In
our experience at Google, however,
they do not substantially increase confidence
in review-based assessments,
since they are necessarily incomplete
in their reasoning and subject to both
false positives and false negatives. Furthermore,
they have similar difficulties
as human reviewers with reasoning
about whole-system data flows across
multiple system components, using
a variety of programming languages,
RPC (remote procedure call) mechanisms,
and so forth, and involving
flows traversing multiple time contexts
across data stores.
user-profile field).
Unfortunately, there is an XSS bug:
the markup in profile.aboutHtml
ultimately originates in a rich-text editor
implemented in browser-side code,
but there is no server-side enforcement
preventing an attacker from injecting
malicious markup using a tampered-with
client. This bug could arise
in practice from a misunderstanding
between front-end and back-end developers
regarding responsibilities for
data validation and sanitization.
Reliably Preventing the
Introduction of XSS Bugs
In our experience in Google’s security
team, code inspection and testing do
not ensure, to a reasonably high degree
of confidence, the absence of XSS bugs
in large Web applications. Of course,
both inspection and testing provide
tremendous value and will typically
find some bugs in an application (perhaps
even most of the bugs), but it is
difficult to be sure whether or not they
discovered all the bugs (or even almost
all of them).
The primary goal of this approach is
to limit code that could potentially give
rise to XSS vulnerabilities to a very small
fraction of an application’s code base.
A key goal of this approach is to
drastically reduce the fraction of code
that could potentially give rise to
XSS bugs. In particular, with this approach,
an application is structured
such that most of its code cannot be
responsible for XSS bugs. The potential
for vulnerabilities is therefore
confined to infrastructure code such
as Web application frameworks and
HTML templating engines, as well
as small, self-contained application-specific
utility modules.
A second, equally important goal is
to provide a developer experience that
does not add an unacceptable degree
of friction as compared with existing
developer workflows.
Key components of this approach
are:
˲ Inherently safe APIs. Injection-prone
Web-platform and HTML-rendering
APIs are encapsulated in wrapper APIs
designed to be inherently safe against
XSS in the sense that no use of such
APIs can result in XSS vulnerabilities.
˲ Security type contracts. Special
types are defined with contracts stipu-
Similar
limitations apply to dynamic
testing approaches: it is difficult to
ascertain whether test suites provide
adequate coverage for whole-system
data flows.
Templates to the rescue? In practice,
HTML markup, and interpolation
points therein, are often specified using
HTML templates. Template systems
expose domain-specific languages for
rendering HTML markup. An HTML
markup template induces a function
from template variables into strings of
HTML markup.
Figure 1c illustrates the use of an
HTML markup template (9): this example
renders a user profile in the
photo-sharing application, including
the user’s name, a hyperlink to a personal
blog site, as well as free-form
text allowing the user to express any
special interests.
Some template engines support
automatic escaping, where escaping
operations are automatically inserted
around each interpolation point into
the template. Most template engines’
auto-escape facilities are noncontextual
and indiscriminately apply HTML
escaping operations, but do not account
for special HTML contexts such
as URLs, CSS, and JavaScript.
Contextually auto-escaping template
engines6
infer the necessary
validation and escaping operations required
for the context of each template
substitution, and therefore account for
such special contexts.
Use of contextually auto-escaping
template systems dramatically reduces
the potential for XSS vulnerabilities: in
(9), the substitution of untrustworthy
values profile.name and profile.
blogUrl into the resulting markup
cannot result in XSS—the template system
automatically infers the required
HTML-escaping and URL-validation.
XSS bugs can still arise, however,
in code that does not make use of templates,
as in Figure 1a (1), or that involves
non-HTML sinks, as in Figure 1b (5).
Furthermore, developers occasionally
need to exempt certain substitutions
from automatic escaping: in Figure 1c
(9), escaping of profile.aboutHtml
is explicitly suppressed because that
field is assumed to contain a user-supplied
message with simple, safe HTML
markup (to support use of fonts, colors,
and hyperlinks in the “about myself”
lating that their values are safe to use
in specific contexts without further escaping
and validation.
˲ Coding guidelines. Coding guidelines
restrict direct use of injection-prone
APIs, and ensure security review
of certain security-sensitive APIs. Adherence
to these guidelines can be enforced
through simple static checks.
Inherently safe APIs. Our goal is
to provide inherently safe wrapper
APIs for injection-prone browser-side
Web platform API sinks, as well as for
server- and client-side HTML markup
rendering.
For some APIs, this is straightforward.
For example, the vulnerable assignment
in Figure 1b (5) can be replaced
with the use of an inherently
safe wrapper API, provided by the JavaScript
Closure Library, as shown in
Figure 2b (5’). The wrapper API validates
at runtime that the supplied URL
represents either a scheme-less URL or
one with a known benign scheme.
Using the safe wrapper API ensures
this code will not result in an XSS
vulnerability, regardless of the provenance
of the assigned URL. Crucially,
none of the code in (5’) nor its fan-in
in (6-8) needs to be inspected for XSS
bugs. This benefit comes at the very
small cost of a runtime validation that
is technically unnecessary if (and only
if) the first branch is taken—the URL
obtained from the configuration store
is validated even though it is actually a
trustworthy value.
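The pattern is language-independent; the following Python sketch (not the Closure Library API — the scheme whitelist, names, and rejection behavior are all illustrative) shows the shape of such an inherently safe navigation wrapper:

from urllib.parse import urlparse

SAFE_SCHEMES = {"http", "https", "mailto", "ftp"}  # illustrative whitelist

def validate_url(url: str) -> str:
    # Accept scheme-less URLs or URLs with a known benign scheme; reject the rest,
    # so a javascript: URL can never reach the navigation sink below.
    scheme = urlparse(url.strip()).scheme.lower()
    if scheme and scheme not in SAFE_SCHEMES:
        raise ValueError(f"refusing to navigate to URL with scheme {scheme!r}")
    return url

def safe_set_location_href(window, url: str) -> None:
    # Inherently safe wrapper: every caller goes through validate_url, so no
    # individual data flow into this sink needs to be reviewed for XSS.
    window.location_href = validate_url(url)

Because the check happens inside the wrapper, the fan-in of callers never needs individual inspection, mirroring the argument made above for code snippets (5')–(8).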
In some special scenarios, the runtime
validation imposed by an inherently
safe API may be too strict. Such
cases are accommodated via variants
of inherently safe APIs that accept
types with a security contract appropriate
for the desired use context. Based
on their contract, such values are exempt
from runtime validation. This
approach is discussed in more detail in
the next section.
Strictly contextually auto-escaping
template engines. Designing an inherently
safe API for HTML rendering is
more challenging. The goal is to devise
APIs that guarantee that at each substitution
point of data into a particular
context within trusted HTML markup,
data is appropriately validated, sanitized,
and/or escaped, unless it can be
demonstrated that a specific data item
is safe to use in that context based on
Figure 1. XSS vulnerabilities in a hypothetical Web application. (a) Vulnerable code of a hypothetical photo-sharing
application. (b) Another slice of the photo-sharing application. (c) Using an HTML markup template.
sanitizer to remove any markup that
may result in script execution renders
it safe to use in HTML context and
thus produces a value that satisfies the
SafeHtml type contract.
To actually create values of these
types, unchecked conversion factory
methods are provided that consume
an arbitrary string and return an instance
of a given wrapper type (for example,
SafeHtml or SafeUrl) without
applying any runtime sanitization
or escaping.
Every use of such unchecked conversions
must be carefully security reviewed
to ensure that in all possible
program states, strings passed to the
conversion satisfy the resulting type’s
contract, based on context-specific
processing or construction. As such,
unchecked conversions should be used
as rarely as possible, and only in scenarios
where their use is readily reasoned
about for security-review purposes.
For example, in Figure 2c, the unchecked
conversion is encapsulated
in a library (12’’) along with the HTML
sanitizer implementation on whose
correctness its use depends, permitting
security review and testing in isolation.
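To make the division of labor concrete, here is a minimal Python sketch of a wrapper type with a security contract, contract-preserving builders, and an unchecked conversion; it illustrates the pattern only and is not Google's implementation (all names are ours):

import html
from dataclasses import dataclass

@dataclass(frozen=True)
class SafeHtml:
    # Contract: rendering this string as HTML will not cause attacker-controlled script execution.
    content: str

def html_escape(text: str) -> SafeHtml:
    # Escaped plain text is markup-inert, so the contract holds by construction.
    return SafeHtml(html.escape(text, quote=True))

def safe_concat(*parts: SafeHtml) -> SafeHtml:
    # Concatenating contract-satisfying values preserves the contract.
    return SafeHtml("".join(p.content for p in parts))

def unchecked_html_conversion(markup: str) -> SafeHtml:
    # For security-reviewed callers only (e.g., the output of an HTML sanitizer);
    # uses of this function are whitelisted and reviewed, never ordinary app code.
    return SafeHtml(markup)

def set_inner_html(element, markup: SafeHtml) -> None:
    # The sink wrapper accepts only SafeHtml, not raw strings.
    element.inner_html = markup.content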
Coding guidelines. For this approach
to be effective, it must ensure
developers never write application
code that directly calls potentially injection-prone
sinks, and that they instead
use the corresponding safe wrapper
API. Furthermore, it must ensure
uses of unchecked conversions are designed
with reviewability in mind, and
are in fact security reviewed. Both constraints
represent coding guidelines
with which all of an application’s code
base must comply.
In our experience, automated enforcement
of coding guidelines is
necessary even in moderate-size projects—otherwise,
violations are bound
to creep in over time.
At Google we use the open source
error-prone static checker1
(https://
goo.gl/SQXCvw), which is integrated
into Google’s Java tool chain, and a feature
of Google’s open source Closure
Compiler (https://goo.gl/UyMVzp) to
whitelist uses of specific methods
and properties in JavaScript. Errors
arising from use of a “banned” API
include references to documentation
for the corresponding safe API, advising
developers on how to address
its provenance or prior validation, sanitization,
or escaping.
These inherently safe APIs are created
by strengthening the concept of
contextually auto-escaping template
engines6
into SCAETEs (strictly contextually
auto-escaping template engines).
Essentially, a SCAETE places two additional
constraints on template code:
˲ Directives that disable or modify the
automatically inferred contextual escaping
and validation are not permitted.
˲ A template may use only sub-templates
that recursively adhere to the
same constraint.
Security type contracts. In the form
just described, SCAETEs do not account
for scenarios where template
parameters are intended to be used
without validation or escaping, such
as aboutHtml in Figure 1c—the
SCAETE unconditionally validates
and escapes all template parameters,
and disallows directives to disable the
auto-escaping mechanism.
Such use cases are accommodated
through types whose contracts stipulate
their values are safe to use in corresponding
HTML contexts, such as
“inner HTML,” hyperlink URLs, executable
resource URLs, and so forth.
Type contracts are informal: a value
satisfies a given type contract if it is
known that it has been validated, sanitized,
escaped, or constructed in a way
that guarantees its use in the type’s target
context will not result in attackercontrolled
script execution. Whether
or not this is indeed the case is established
by expert reasoning about code
that creates values of such types, based
on expert knowledge of the relevant
behaviors of the Web platform.8
As will
be seen, such security-sensitive code
is encapsulated in a small number of
special-purpose libraries; application
code uses those libraries but is itself
not relied upon to correctly create instances
of such types and hence does
not need to be security-reviewed.
The following are examples of types
and type contracts in use:
˲ SafeHtml. A value of type
SafeHtml, converted to string, will not
result in attacker-controlled script execution
when used as HTML markup.
˲ SafeUrl. Values of this type will
not result in attacker-controlled script
execution when dereferenced as hyperlink
URLs.
˲ TrustedResourceUrl. Values
of this type are safe to use as the
URL of an executable or “control” resource,
such as the src attribute of a